Best of Crawling — 2024

1
Article
Hacker News·2y
Using GPT-4o for web scraping
A developer experimented with GPT-4o's structured outputs to build an AI-assisted web scraper. The model handled both simple and complex tables well but struggled with combined rows and with generating XPaths. Cost is a concern because of the large volume of characters the model must process. Possible future improvements include better UX through capturing browser events and further refinement of the HTML cleanup step.
565
15
2
Article
This is Learning·2y
7 Open Source Projects You Should Know - Python Edition ✔️
Explore seven noteworthy open source projects written in Python, including pandas for data analysis, Apache Airflow for workflow management, G4F for decentralized AI technologies, Scrapy for web scraping, Ultroid as a Telegram UserBot, Zulip for team collaboration, and Freqtrade for crypto trading. Discover their features, installation guides, and more to enhance your coding endeavors.
366
7
3
Video
Be A Better Dev·2y
The BEST Project Idea to Learn AWS
This post shares an engaging project idea to learn AWS hands-on, focusing on essential services like EC2, RDS, S3, IAM, VPC, and CloudWatch. The project involves building a web scraper to extract data from websites, providing practical skills that are valuable for a career in cloud engineering. The author emphasizes the importance of hands-on experience over certifications for solidifying AWS concepts.
184
1
4
Article
UX Planet·2y
Automate 90% of Your Design Job Search with This AI Workflow
Discover an AI-driven workflow to automate your job search process. This system uses tools like Bardeen for scraping job listings and Google Sheets with Apps Script to filter and organize them. It then leverages GPT-4o mini to analyze job descriptions based on specific criteria, helping you focus on the most relevant opportunities quickly.
161
4
5
Article
Community Picks·2y
Building a Netflix show recommender using Crawlee and React
Learn how to build a Netflix show recommender using Crawlee and React, guided through scraping Netflix content with Crawlee and visualizing it with a React app built with Vite. The guide covers prerequisites, installation steps, writing scraping code using Cheerio, and integrating the scraped data into a React application.
153
6
Video
YouTube·2y
This is how I scrape 99% websites via LLM
Explore how advancements in AI, particularly large language models (LLMs), are revolutionizing web scraping in 2024. Learn the best practices for scraping internet data at a large scale, building autonomous web scrapers, and handling complex web interactions. The video demonstrates several kinds of web scraping tasks, from scraping public websites and handling complex web manipulations to more sophisticated tasks that require reasoning. It also covers services like OpenAI, AgentQL, and SpiderCloud that facilitate optimized web content extraction.
152
7
Video
Tech With Tim·2y
Python AI Web Scraper Tutorial - Use AI To Scrape ANYTHING
This video guides you through building an AI-powered web scraper using Python. The scraper can extract information from any website by passing a URL and a prompt to the AI. Essential tools include Streamlit for the front end, Selenium for web scraping, and LangChain for integrating with AI models. Detailed steps cover setting up the environment, handling dependencies, and developing the UI and backend functions necessary for scraping and parsing web content. The tutorial also explores overcoming common challenges like CAPTCHAs and IP bans using Bright Data's scraping browser.
119
3
8
Article
Community Picks·1y
How to scrape Google Maps data using Python
Learn how to build a Google Maps scraper using Crawlee and Python to extract hotel data including names, ratings, reviews, prices, and amenities. The guide covers setting up the environment, connecting to Google Maps, handling dynamic content, and managing infinite scrolling. It also explains how to use proxies for large-scale scraping and create an interactive analysis dashboard with the exported data.
90
4
9
Article
Hacker News·2y
Tracking supermarket prices with playwright
In December 2022, the author created a website to track price changes in Greece's largest supermarkets, using Playwright for web scraping. The main challenges included handling JavaScript-based sites, automating the scraping process, and avoiding IP restrictions. After initial attempts with an old laptop failed, the author moved to Hetzner for its cost-efficiency. The setup integrated Tailscale to tackle IP restrictions and used a CI server to manage daily scraping tasks. Optimizations focused on improving scrape speed and cost-efficiency, such as upgrading server specs and reducing the amount of data fetched.
89
2
10
Article
DEV·2y
I built my first SaaS - NotiFast
NotiFast is a versatile notification bot designed to alert users about changes on websites they follow, such as new items or content updates. Built on the Discord platform, it offers seamless notifications without requiring user authentication and integrates easily with Discord's payment system. Initially derived from the open-source project webscraper-bot, NotiFast aims to simplify webpage monitoring with an easy-to-use visual creator. A free beta is currently available for the first 100 users.
87
2
11
Video
Community Picks·2y
Coding project IDEAS for portfolio
Learn how to find real-life coding projects for your portfolio. Get inspiration for web design, web scraping, data analytics, Chrome extensions, and full-stack applications.
85
4
12
Article
Community Picks·2y
raznem/parsera: Lightweight library for scraping web-sites with LLMs
Parsera is a lightweight Python library designed for scraping websites using large language models (LLMs). It is easy to set up and keeps token use minimal, which boosts speed and reduces costs. Users can configure it to use models from OpenAI or Azure, and it includes asynchronous support. The library can extract specified elements from web pages and return the results in JSON format.
84
13
Article
Community Picks·2y
Vercel + Puppeteer
Learn how to use Puppeteer with Vercel to generate PDFs of websites. Discover best practices for setting up Puppeteer for Vercel and deploying your Puppeteer code on Vercel.
84
5
14
Article
Machine Learning News·2y
Firecrawl: A Powerful Web Scraping Tool for Turning Websites into Large Language Model (LLM) Ready Markdown or Structured Data
Firecrawl, developed by Mendable AI, is a state-of-the-art web scraping tool designed to efficiently extract data from websites, including those with dynamic JavaScript-rendered content. It outputs clean, well-formatted Markdown suitable for Large Language Model (LLM) applications, while incorporating caching mechanisms and generative feedback loops to enhance data quality and extraction efficiency. Users can access Firecrawl via an intuitive API and multiple SDKs for different programming languages.
67
3
15
Article
DEV·2y
Link Eater: The WhatsApp Bot That Digests Content for You
Link Eater is an AI-powered bot that summarizes web content and YouTube videos directly through WhatsApp, allowing users to quickly grasp key points without navigating through lengthy content. It uses technologies like Node.js, Express.js, Twilio WhatsApp API, OpenAI GPT-3.5, YouTube Data API, and Jina AI. Users can send a URL to the bot, which then responds with a concise summary, making it useful for quick information digestion while engaging in chats and discussions.
65
3
16
Article
freeCodeCamp·2y
How to Use the Python SDK to Build Your Own Web Scraper
Learn how to use Python's Requests and Beautiful Soup libraries to build your own web scraper. This guide walks through scraping data from the UC Irvine Machine Learning Repository, covering the necessary libraries, defining functions to scrape and parse data, and saving the data to a CSV file. Important considerations include legal guidelines, ethical practices, and website compliance.
64
4
17
Article
Community Picks·1y
One Million Screenshots
Explore over a million rendered homepages from the web in an interactive view that lets you zoom, pan, and click, much like Google Maps. This visual dataset could help you find websites you've been looking for or discover new ones. Check out the FAQ for more details and learn about the Screenshot API if you're interested in the data.
57
10
18
Article
KDnuggets·2y
7 Python Libraries Every Data Engineer Should Know
Discover some essential Python libraries for data engineers, including Requests for API data extraction, BeautifulSoup for web scraping, Pandas for data manipulation, SQLAlchemy for database work, Airflow for workflow orchestration, PySpark for big data processing, and Kafka-Python for real-time data processing.
52
19
Article
Machine Learning News·1y
Meet Steel.dev: An Open Source Browser API for AI Agents and Apps
Steel.dev is an open-source tool that simplifies web automation for AI applications by abstracting complex browser interactions through a RESTful API. It reduces the need for detailed scripts and expertise in frameworks like Puppeteer, Selenium, and Playwright. The tool features a modular architecture that allows easy management and interaction with headless browsers, facilitating tasks such as data extraction and form completion while ensuring scalability for large-scale projects.
51
20
Article
Hacker News·2y
Web Scraping in Python - The Complete Guide
This post provides a comprehensive guide to web scraping in Python, including the advantages of using Python for web scraping, the best Python libraries for web scraping, and some tips and best practices for handling challenges in web scraping. It also includes code examples and recommendations for alternative libraries and tools for web scraping.
51
2
21
Article
Hacker News·2y
finic-ai/finic: Create Playwright-based browser agents to scrape websites and automate tasks.
Finic is a cloud platform that simplifies deploying and managing browser-based automation agents, focusing on fault-tolerant execution. It supports Playwright for DOM interaction and BeautifulSoup for HTML parsing. Features include cloud deployment, secure credential management, monitoring, and advanced functionalities like self-healing selectors and session impersonation.
49
1
22
Article
Real Python·2y
Introduction to Web Scraping With Python – Real Python
Web scraping is the process of collecting and parsing raw data from the web, and Python offers powerful tools for it. This video course offers 12 lessons covering techniques such as string methods, regular expressions, and HTML parsing. It includes downloadable resources, subtitles, transcripts, an interactive quiz, and a certificate of completion to help you effectively scrape data from websites.
47
23
Article
Planet Python·2y
Web scraping as an API service
This post discusses the use of web scraping as an API service in system-to-system integrations. It highlights why web scraping should be avoided in backend integrations and introduces Playwright as a tool for generating Python code for web scraping.
46
24
Article
Machine Learning News·2y
ScrapeGraphAI: A Web Scraping Python Library that Uses LLMs to Create Scraping Pipelines for Websites, Documents, and XML Files
ScrapeGraphAI is an advanced web scraping library that simplifies data collection using large language models (LLMs) and a unique direct graph logic. It minimizes the time and technical skills required for web scraping projects, allowing users to focus more on analyzing the extracted data.
45
2
25
Article
Community Picks·1y
FlareSolverr/FlareSolverr: Proxy server to bypass Cloudflare protection
FlareSolverr is a proxy server designed to bypass Cloudflare and DDoS-GUARD protection using Selenium with an undetected Chrome driver. It opens URLs with user parameters, solves Cloudflare challenges, and returns HTML code and cookies. Installation using Docker is recommended due to its dependencies on an external browser. Memory consumption can be high, so it should be used cautiously on low-RAM machines. It supports multiple architectures and provides examples for making requests via Bash, Python, and PowerShell. Users can also create permanent sessions to avoid repeatedly solving challenges.
43
1
