Best of Crawling: August 2024

  1. Article
    This is Learning·2y

    7 Open Source Projects You Should Know - Python Edition ✔️

    Covers seven open source projects written in Python: pandas for data analysis, Apache Airflow for workflow management, G4F for decentralized AI technologies, Scrapy for web scraping, Ultroid as a Telegram UserBot, Zulip for team collaboration, and Freqtrade for crypto trading. Each entry outlines the project's features and installation guide.

  2. Video
    Tech With Tim·2y

    Python AI Web Scraper Tutorial - Use AI To Scrape ANYTHING

    This video walks through building an AI-powered web scraper in Python. The scraper extracts information from a website by passing a URL and a prompt to the AI. The stack is Streamlit for the front end, Selenium for web scraping, and LangChain for integrating with AI models. The steps cover setting up the environment, handling dependencies, and developing the UI and backend functions needed for scraping and parsing web content. The tutorial also shows how to get past common obstacles such as captchas and IP bans using Bright Data's scraping browser.

  3. Article
    Hacker News·2y

    Tracking supermarket prices with playwright

    In December 2022, a website was created to track price changes in Greece's largest supermarkets, using Playwright for web scraping. The main challenges were handling JavaScript-based sites, automating the scraping process, and avoiding IP restrictions. After initial attempts on an old laptop failed, the scraper was moved to Hetzner for its cost-efficiency. The setup uses Tailscale to work around IP restrictions and a CI server to run the daily scraping tasks. Later optimizations targeted scrape speed and cost, such as upgrading the server specs and reducing the amount of data fetched.

  4. Article
    Community Picks·2y

    raznem/parsera: Lightweight library for scraping web-sites with LLMs

    Parsera is a lightweight Python library for scraping websites with large language models (LLMs). It is easy to set up and keeps token use low, which improves speed and reduces cost. It can be configured to use models from OpenAI or Azure, and it supports asynchronous use. The library extracts specified elements from web pages and returns the results as JSON.

  5. Article
    Community Picks·2y

    How to scrape infinite scrolling webpages with Python

    Learn how to scrape infinite-scrolling websites using Crawlee for Python. The tutorial covers setting up the project, handling cookie dialogs, adding requests for all shoe links, extracting product details, and managing infinite scroll on listing pages. The scraped data is then exported to a CSV file. The complete working code is available in a GitHub repository, and additional support can be found in the Crawlee community Discord.
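
The infinite-scroll pattern described in the last item can be sketched without a browser. This is not Crawlee's API: it is a minimal, standard-library illustration of the general idea, which is to keep loading batches until a round yields nothing new, deduplicate by product URL, then write a CSV. The names `collect_until_exhausted`, `to_csv`, and `fake_page`, as well as the sample shoe data, are hypothetical; in a real crawler the batch loader would scroll the page and read the newly rendered items.

```python
import csv
import io
from typing import Callable, Dict, List

Product = Dict[str, str]


def collect_until_exhausted(
    load_batch: Callable[[int], List[Product]],
    max_rounds: int = 50,
) -> List[Product]:
    """Call load_batch(round_no) until a round adds no new products.

    load_batch stands in for "scroll down and read the visible items".
    Products are deduplicated by their "url" field, because a scrolled
    page usually still contains the items from earlier rounds.
    """
    seen: Dict[str, Product] = {}
    for round_no in range(max_rounds):
        count_before = len(seen)
        for product in load_batch(round_no):
            seen.setdefault(product["url"], product)
        if len(seen) == count_before:
            # Nothing new appeared, so assume the end of the listing.
            break
    return list(seen.values())


def to_csv(products: List[Product]) -> str:
    """Serialize products to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["url", "name", "price"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(products)
    return buffer.getvalue()


def fake_page(round_no: int) -> List[Product]:
    """Hypothetical stand-in for a listing page that grows on scroll."""
    pages = [
        [
            {"url": "/shoe/1", "name": "Runner", "price": "90"},
            {"url": "/shoe/2", "name": "Trail", "price": "120"},
        ],
        [
            {"url": "/shoe/2", "name": "Trail", "price": "120"},
            {"url": "/shoe/3", "name": "Court", "price": "75"},
        ],
    ]
    return pages[round_no] if round_no < len(pages) else []


products = collect_until_exhausted(fake_page)
csv_text = to_csv(products)
print(csv_text)
```

The `max_rounds` cap is a design choice of this sketch: some listings never stop producing items, so an upper bound keeps the loop from running indefinitely.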