Best of Crawling 2024

  1.
    Article
    Hacker News · 2y

    Using GPT-4o for web scraping

    A developer experimented with using GPT-4o's structured outputs for web scraping, creating an AI-assisted web scraper. While the model performed well with simple and complex tables, it struggled with combined rows and with generating XPaths. Cost is a concern because of the volume of HTML the model has to process. Future improvements could include better UX through capturing browser events and further refining HTML data cleanup.

  2.
    Article
    This is Learning · 2y

    7 Open Source Projects You Should Know - Python Edition ✔️

    Explore seven noteworthy open source projects written in Python, including pandas for data analysis, Apache Airflow for workflow management, G4F for decentralized AI technologies, Scrapy for web scraping, Ultroid as a Telegram UserBot, Zulip for team collaboration, and Freqtrade for crypto trading. Discover their features, installation guides, and more to enhance your coding endeavors.

  3.
    Video
    Be A Better Dev · 2y

    The BEST Project Idea to Learn AWS

    This post shares an engaging project idea to learn AWS hands-on, focusing on essential services like EC2, RDS, S3, IAM, VPC, and CloudWatch. The project involves building a web scraper to extract data from websites, providing practical skills that are valuable for a career in cloud engineering. The author emphasizes the importance of hands-on experience over certifications for solidifying AWS concepts.

  4.
    Article
    UX Planet · 2y

    Automate 90% of Your Design Job Search with This AI Workflow

    Discover an AI-driven workflow to automate your job search process. This system uses tools like Bardeen for scraping job listings and Google Sheets with Apps Script to filter and organize them. It then leverages GPT-4o mini to analyze job descriptions based on specific criteria, helping you focus on the most relevant opportunities quickly.

  5.
    Article
    Community Picks · 2y

    Building a Netflix show recommender using Crawlee and React

    Learn how to build a Netflix show recommender by scraping Netflix content with Crawlee and visualizing it in a React app built with Vite. The guide covers prerequisites, installation steps, writing scraping code using Cheerio, and integrating the scraped data into a React application.

  6.
    Video
    YouTube · 2y

    This is how I scrape 99% websites via LLM

    Explore how advancements in AI, particularly large language models (LLMs), are revolutionizing web scraping in 2024. Learn the best practices for scraping internet data at scale, building autonomous web scrapers, and handling complex web interactions. The post demonstrates various kinds of web scraping tasks, including scraping public websites, handling complex web manipulations, and more sophisticated tasks that require reasoning. It also includes details about services like OpenAI, AgentQL, and SpiderCloud that facilitate optimized web content extraction.

  7.
    Video
    Tech With Tim · 2y

    Python AI Web Scraper Tutorial - Use AI To Scrape ANYTHING

    This post guides you through building an AI-powered web scraper using Python. The scraper can extract information from any website by passing a URL and a prompt to the AI. Essential tools include Streamlit for the front end, Selenium for web scraping, and LangChain for integrating with AI models. Detailed steps cover setting up the environment, handling dependencies, and developing the UI and backend functions necessary for scraping and parsing web content. The tutorial also explores overcoming common challenges like CAPTCHAs and IP bans using Bright Data's scraping browser.

  8.
    Article
    Community Picks · 1y

    How to scrape Google Maps data using Python

    Learn how to build a Google Maps scraper using Crawlee and Python to extract hotel data including names, ratings, reviews, prices, and amenities. The guide covers setting up the environment, connecting to Google Maps, handling dynamic content, and managing infinite scrolling. It also explains how to use proxies for large-scale scraping and create an interactive analysis dashboard with the exported data.

  9.
    Article
    Hacker News · 2y

    Tracking supermarket prices with Playwright

    In Dec 2022, a website was created to track price changes in Greece's largest supermarkets using Playwright for web scraping. The main challenges included handling JavaScript-based sites, automating the scraping process, and avoiding IP restrictions. After initial attempts with an old laptop failed, a decision was made to use Hetzner for its cost-efficiency. The setup integrated Tailscale to tackle IP restrictions and used a CI server to manage daily scraping tasks. Optimizations focused on improving scrape speed and cost-efficiency, like upgrading server specs and reducing data fetched.

  10.
    Article
    DEV · 2y

    I built my first SaaS - NotiFast

    NotiFast is a versatile notification bot designed to alert users about changes on websites they follow, such as new items or content updates. Built on the Discord platform, it offers seamless notifications without requiring user authentication and integrates easily with Discord's payment system. Initially derived from the open-source project webscraper-bot, NotiFast aims to simplify webpage monitoring with an easy-to-use visual creator. A free beta is currently available for the first 100 users.

  11.
    Video
    Community Picks · 2y

    Coding project IDEAS for portfolio

    Learn how to find real-life coding projects for your portfolio. Get inspiration for web design, web scraping, data analytics, Chrome extensions, and full-stack applications.

  12.
    Article
    Community Picks · 2y

    raznem/parsera: Lightweight library for scraping web-sites with LLMs

    Parsera is a lightweight Python library designed for scraping websites using large language models (LLMs). It is easy to set up with minimal token use, boosting speed and reducing costs. Users can configure it to use models from OpenAI or Azure, and it includes asynchronous support. The library can extract specified elements from web pages and return the results in JSON format.

  13.
    Article
    Community Picks · 2y

    Vercel + Puppeteer

    Learn how to use Puppeteer with Vercel to generate PDFs of websites. Discover best practices for setting up Puppeteer for Vercel and deploying your Puppeteer code on Vercel.

  14.
    Article
    Machine Learning News · 2y

    Firecrawl: A Powerful Web Scraping Tool for Turning Websites into Large Language Model (LLM) Ready Markdown or Structured Data

    Firecrawl, developed by Mendable AI, is a state-of-the-art web scraping tool designed to efficiently extract data from websites, including those with dynamic JavaScript-rendered content. It outputs clean, well-formatted Markdown suitable for Large Language Model (LLM) applications, while incorporating caching mechanisms and generative feedback loops to enhance data quality and extraction efficiency. Users can access Firecrawl via an intuitive API and multiple SDKs for different programming languages.

  15.
    Article
    DEV · 2y

    Link Eater: The WhatsApp Bot That Digests Content for You

    Link Eater is an AI-powered bot that summarizes web content and YouTube videos directly through WhatsApp, allowing users to quickly grasp key points without navigating through lengthy content. It uses technologies like Node.js, Express.js, Twilio WhatsApp API, OpenAI GPT-3.5, YouTube Data API, and Jina AI. Users can send a URL to the bot, which then responds with a concise summary, making it useful for quick information digestion while engaging in chats and discussions.

  16.
    Article
    freeCodeCamp · 2y

    How to Use the Python SDK to Build Your Own Web Scraper

    Learn how to use Python's Requests and Beautiful Soup libraries to build your own web scraper. This guide walks through scraping data from the UC Irvine Machine Learning Repository, covering the necessary libraries, defining functions to scrape and parse data, and saving the data to a CSV file. Important considerations include legal guidelines, ethical practices, and website compliance.

  17.
    Article
    Community Picks · 1y

    One Million Screenshots

    Explore over a million rendered homepages from the web in an interactive manner, allowing you to zoom, pan, and click similar to Google Maps. This visual dataset could help you find websites you've been looking for or discover new ones. Check out the FAQ for more details and learn about the Screenshot API if you're interested in the data.

  18.
    Article
    KDnuggets · 2y

    7 Python Libraries Every Data Engineer Should Know

    Discover some essential Python libraries for data engineers, including Requests for API data extraction, BeautifulSoup for web scraping, Pandas for data manipulation, SQLAlchemy for database work, Airflow for workflow orchestration, PySpark for big data processing, and Kafka-Python for real-time data processing.

  19.
    Article
    Machine Learning News · 1y

    Meet Steel.dev: An Open Source Browser API for AI Agents and Apps

    Steel.dev is an open-source tool that simplifies web automation for AI applications by abstracting complex browser interactions through a RESTful API. It reduces the need for detailed scripts and expertise in frameworks like Puppeteer, Selenium, and Playwright. The tool features a modular architecture that allows easy management and interaction with headless browsers, facilitating tasks such as data extraction and form completion while ensuring scalability for large-scale projects.

  20.
    Article
    Hacker News · 2y

    Web Scraping in Python - The Complete Guide

    This post is a comprehensive guide to web scraping in Python, covering the advantages of Python for the task, the best libraries to use, and tips and best practices for handling common challenges. It also includes code examples and recommendations for alternative libraries and tools.

  21.
    Article
    Hacker News · 2y

    finic-ai/finic: Create Playwright-based browser agents to scrape websites and automate tasks.

    Finic is a cloud platform that simplifies deploying and managing browser-based automation agents, focusing on fault-tolerant execution. It supports Playwright for DOM interaction and BeautifulSoup for HTML parsing. Features include cloud deployment, secure credential management, monitoring, and advanced functionalities like self-healing selectors and session impersonation.

  22.
    Article
    Real Python · 2y

    Introduction to Web Scraping With Python – Real Python

    Web scraping is the process of collecting and parsing raw data from the web using powerful Python tools. This video course offers 12 lessons covering techniques such as string methods, regular expressions, and HTML parsing. It includes downloadable resources, subtitles, transcripts, an interactive quiz, and a certificate of completion to help you effectively scrape data from websites.

  23.
    Article
    Planet Python · 2y

    Web scraping as an API service

    This post discusses the use of web scraping as an API service in system-to-system integrations. It highlights why web scraping should be avoided in backend integrations and introduces Playwright as a tool for generating Python code for web scraping.

  24.
    Article
    Machine Learning News · 2y

    ScrapeGraphAI: A Web Scraping Python Library that Uses LLMs to Create Scraping Pipelines for Websites, Documents, and XML Files

    ScrapeGraphAI is an advanced web scraping library that simplifies data collection using large language models (LLMs) and a unique direct graph logic. It minimizes the time and technical skills required for web scraping projects, allowing users to focus more on analyzing the extracted data.

  25.
    Article
    Community Picks · 1y

    FlareSolverr/FlareSolverr: Proxy server to bypass Cloudflare protection

    FlareSolverr is a proxy server designed to bypass Cloudflare and DDoS-GUARD protection using Selenium with undetected-chromedriver. It opens URLs with user parameters, solves Cloudflare challenges, and returns HTML code and cookies. Installation using Docker is recommended due to its dependencies on an external browser. Memory consumption can be high, so it should be used cautiously on low-RAM machines. It supports multiple architectures and provides examples for making requests via Bash, Python, and PowerShell. Users can also create permanent sessions to avoid repeatedly solving challenges.