Best of Crawling — 2025

1
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
5 Agentic AI Design Patterns
Explore five agentic AI design patterns that enhance the effectiveness of AI agents through reflection, tool use, reason and act, planning, and multi-agent approaches. Learn how Firecrawl Extract facilitates web scraping by using simple English prompts to extract clean, structured data. Discover additional resources on machine learning techniques and data science provided by Daily Dose of Data Science.
294
2
Video
Tech With Tim·1y
Web Scraping 101: A Million Dollar SaaS Idea
The post explores a web scraping SaaS idea with high potential, targeting influencer marketing inefficiencies. It outlines a project to build a system that identifies video sponsorships on YouTube, including detailed steps for data collection and analysis using Bright Data's web scraping API. The project aims to help companies find suitable influencers and track competitors, while addressing challenges like scaling data collection and handling API token limits.
263
4
3
Video
Tech With Tim·1y
I Built a Web Scraping AI Agent From Scratch - It's Insane...
Building powerful AI applications requires the integration of large language models (LLMs) with real-time data and useful tools. In this post, the author demonstrates the development of an AI travel agent using Python. This agent uses Bright Data APIs for real-time travel data, Google Flights, and hotel information to provide relevant and current recommendations. The post covers the project's architecture, details the steps of web scraping with automated browsers, and explains how the AI processes and combines data to generate personalized travel plans.
233
2
4
Article
Addy Osmani·34w
Give your AI eyes: Introducing Chrome DevTools MCP
Chrome DevTools MCP is a new tool that connects AI coding assistants to Chrome's DevTools through the Model Context Protocol, allowing AI agents to see, interact with, and debug live web applications in real browsers. The tool enables AI to perform tasks like running performance traces, inspecting DOM elements, monitoring network requests, simulating user interactions, and automatically fixing issues based on actual browser feedback rather than guessing.
198
4
5
Video
ByteGrad·1y
AI-Scraping Is Getting Crazy Easy Now
Traditional web scraping required manually pinpointing data within HTML structures and managing infrastructure for requests. AI-based solutions like Scraperless simplify this process by allowing users to describe the desired data without specifying HTML details. Scraperless utilizes free-form prompts to determine scraping targets and offers integration via API keys, allowing developers to incorporate it into their applications seamlessly. Results are available in formats like CSV, making data handling straightforward with minimal user effort.
135
3
6
Article
Community Picks·1y
lightpanda-io/browser: The open-source browser made for headless usage
Lightpanda is an open-source headless browser designed for efficient web automation, AI agents, LLM training, scraping, and testing. It features a significantly lower memory footprint and faster execution times compared to Chrome. The browser supports JavaScript execution and web APIs, is compatible with tools like Playwright and Puppeteer, and is built using the Zig programming language. Installation and configuration instructions are provided for both Linux and macOS.
86
1
7
Article
Arcjet·41w
How long until we need to block Google?
Google's AI Overviews are reducing website traffic as search results increasingly provide direct answers instead of linking out. Public companies report traffic declines from 52% to 28%, though Google claims overall click volume remains stable. Unlike OpenAI's granular bot controls, Google offers limited options for site owners to control how their content is used in AI features. Site owners can only block all Google crawling or allow everything, creating a dilemma as the traditional web traffic contract may be breaking down.
71
5
8
Article
Awesome Go·41w
How I Made Europe Searchable From a Single Server - The Story of HydrAIDE
A developer built HydrAIDE, a custom data engine that indexes millions of European websites from a single server using only 3% CPU load. Instead of traditional databases, the system uses thousands of small binary files for O(1) data access, leveraging modern SSD performance. The engine powers precise B2B partner searches across Europe and is now open-source with Go SDK support and Python/Node.js SDKs in development.
63
2
9
Article
Neon·1y
Building RagRabbit, An Open Source RAG Search with Postgres as the Vector Store
RagRabbit is an open-source tool designed to simplify Retrieval-Augmented Generation (RAG) workflows by using Postgres with pgVector for handling vector embeddings. It can crawl websites, convert pages to Markdown, generate embeddings, and use MCP servers to integrate with development tools. RagRabbit supports secure user authentication and can be deployed effortlessly using Vercel.
58
10
Article
freeCodeCamp·51w
How To Build A Simple Portfolio Blog With Next.js
A comprehensive guide to building a portfolio blog with Next.js that automatically aggregates articles from multiple platforms. The tutorial covers creating server and client components, implementing web scraping with Cheerio to extract metadata from article URLs, building search and filtering functionality, and structuring a JSON-based content management system without requiring a database.
54
1
11
Article
freeCodeCamp·35w
Build an Enterprise-Grade AI Project
A comprehensive course teaches how to build production-grade AI systems beyond simple toy projects. The curriculum covers creating robust data pipelines with web scraping APIs, document processing, quality control for toxicity and bias detection, and exporting datasets in standard formats. Key engineering practices include modular architecture with managers and clients, asynchronous processing, fallback mechanisms, logging, and API cost tracking for scalable enterprise applications.
53
12
Article
Product Hunt·24w
BrowserBook: The Browser Automation IDE
BrowserBook is an AI-powered IDE that combines a Jupyter-style notebook interface with an inline browser and context-aware coding assistant for building Playwright-based browser automations. It addresses common issues with browser agents (cost, speed, reliability, debugging) by shifting AI assistance to the coding phase rather than execution. Key features include interactive browser testing, notebook-style cell execution, DOM-aware code suggestions, built-in authentication management, screenshot tools, data extraction helpers, and API deployment capabilities for production use.
51
13
Video
Tech With Tim·49w
Python Advanced AI Agent Tutorial - LangGraph, LangChain, Tools & More!
A comprehensive tutorial on building advanced AI agents using LangGraph, LangChain, and Firecrawl. The guide demonstrates creating a coding research assistant that follows structured multi-step workflows to research developer tools and frameworks. It covers both simple agent creation using MCP servers and advanced implementations with custom workflows, structured outputs using Pydantic models, and controlled agent flow through graph-based state management.
51
14
Video
Oxylabs·1y
Building a Real Estate Monitoring System
Alex discusses building a real estate monitoring system, focusing on the types of data that can be extracted from real estate websites, the use cases for the extracted data including price comparisons and market trends, and the challenges faced such as getting fresh data, overcoming anti-bot measures, and scaling the system. He then advises using Oxylabs' Real Estate Scraper API to handle these challenges efficiently.
50
15
Article
swizec.com·1y
Server-side React that renders as png, pdf, or interactive webapp
React can be rendered as PNG, PDF, static HTML, or an interactive webapp by simply changing the URL. This process involves server-side rendering (SSR) with components supporting CSS-in-JS and data loading via useQuery. The output format is controlled by query parameters in the URL, using TanStack Start and TanStack Router alongside Puppeteer. The approach aims for sophisticated rendering with minimal effort from product engineers.
44
3
16
Article
AI·1y
Maven AI: Building an AI-Powered Product Research Assistant
Maven AI is an open-source project created to streamline electronic product research with AI. It offers features like personalized recommendations, fast product searches, deep insights, and side-by-side comparisons to enhance user decision-making. The project utilizes a modular architecture with an orchestrator and agent tools, leveraging TypeScript and React for type-safety and a responsive UI. To handle data, it uses web scraping, APIs, and various technologies like Google Gemini, Firecrawl, and Upstash Redis. The code is open to community collaboration and future scalability improvements.
39
9
17
Article
gitconnected·1y
I Tried 20+ No-Code Web Scraping Tools to Make Money — These 3 Are the Absolute Best
The post explores three effective no-code web scraping tools - Octoparse, Magical Chrome Extension, and Browse AI. It provides detailed instructions on how to use these tools and highlights their unique features, such as ease of use, data extraction capabilities, and cloud execution options. Furthermore, it offers insights on how to monetize web scraping by providing services like lead generation, competitor analysis, and product data scraping for clients on freelance platforms or through direct outreach.
38
18
Video
Oxylabs·23w
n8n Web Scraping: Complete Automation Guide
n8n enables no-code web scraping through its visual workflow builder. The platform offers multiple approaches: basic HTTP Request and HTML nodes for static sites, Markdown conversion for AI processing, and third-party tools like Oxylabs AI Studio for JavaScript-heavy pages. Workflows can be configured with error handling, retry logic, and rate limiting. Scraped data integrates directly with databases, spreadsheets, and LLMs. Both cloud-hosted and self-hosted deployment options are available, with self-hosted being free but requiring infrastructure management.
37
19
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
MCP-powered Agentic RAG
A demonstration of an MCP-powered Agentic RAG system shows how to use an MCP-driven workflow for searching a vector database and falling back to web search when necessary. The system employs tools such as Qdrant for the vector database, Bright Data for web scraping, and Cursor as the MCP client. The post walks through setting up the MCP server, integrating it with Cursor, and using Bright Data to address common challenges like IP blocks and bot detection.
37
20
Article
Daily Dose of Data Science | Avi Chawla | Substack·45w
Build a Multi-agent Content Creation System
A demonstration of building a multi-agent content creation system using Motia, an open-source backend framework that unifies multi-agent orchestration, APIs, and background jobs. The system scrapes web content using Firecrawl, processes it with a locally served DeepSeek-R1 LLM through Ollama, and generates social media content for Twitter and LinkedIn in parallel. The workflow includes automatic scheduling via Typefully and exposes functionality through APIs. Motia supports multiple programming languages, one-click deployment, built-in observability, automatic retries, and streaming responses.
36
21
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
[Hands-on] Build a Multi-agent Brand Monitoring System
Learn how to build a brand monitoring app using Bright Data for web scraping, CrewAI for orchestration, and Ollama for serving DeepSeek-R1 locally. The app scrapes data from various platforms (e.g., X, Instagram, YouTube), analyzes it, and generates insights through platform-specific Crews, ultimately producing a comprehensive report.
36
22
Video
YouTube·1y
Scrape Any Website for FREE Using DeepSeek & Crawl4AI
Learn how to create a powerful AI web scraper using DeepSeek, Groq, and Crawl4AI. This guide walks through setting up a browser, configuring LLM strategies, and scraping data from websites. All source code is provided for free, allowing you to easily customize and extend the project for your own web scraping needs.
28
23
Article
TechCrunch·29w
Amazon sends legal threats to Perplexity over agentic browsing
Amazon sent a cease-and-desist letter to Perplexity demanding its AI shopping assistant Comet identify itself as an agent when browsing Amazon's site. Perplexity argues agents acting on behalf of users should have the same permissions as human users, while Amazon insists third-party agents must identify themselves and respect service provider decisions. The dispute echoes previous controversies around Perplexity's web scraping practices and raises broader questions about how websites will handle autonomous AI agents in e-commerce, travel booking, and other online services.
26
5
24
Article
The React Community·1y
From HTML Templates to Well-Formatted PDFs: Using Puppeteer and NestJs
Learn how to generate well-formatted PDFs using Puppeteer and NestJS in your projects. This guide covers dynamic and versatile document creation with these powerful tools.
24
25
Article
Planet Python·1y
Create Project-Less Python Utilities with uv and Inline Script Metadata
Learn how to create and run Python utility scripts with inline metadata using uv. This method avoids the need for a full Python project and simplifies dependency management by embedding metadata directly into the script. The post provides an example script for searching and fetching details from the Google Books API, along with additional examples for summarizing YouTube videos and scraping articles.
24
1
