Best of Crawling — July 2024

1
Article
freeCodeCamp·2y
How to Use the Python SDK to Build Your Own Web Scraper
Learn how to use Python's Requests and Beautiful Soup libraries to build your own web scraper. This guide walks through scraping data from the UC Irvine Machine Learning Repository, covering the necessary libraries, defining functions to scrape and parse data, and saving the data to a CSV file. Important considerations include legal guidelines, ethical practices, and website compliance.
64
4
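The Requests and Beautiful Soup workflow summarized in the first item (fetch a page, parse it, save the rows to CSV) might look roughly like the sketch below. It is not the article's code: the function names, the `dataset-link` class, and the sample HTML are all hypothetical stand-ins, and a real page would need its own selectors.

```python
import csv
import io

from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page. requests is imported here so parsing works offline."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_datasets(html: str) -> list[dict[str, str]]:
    """Extract name/URL pairs from links marked with a hypothetical CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.find_all("a", class_="dataset-link"):
        rows.append(
            {"name": link.get_text(strip=True), "url": link.get("href", "")}
        )
    return rows


def write_csv(rows: list[dict[str, str]], fileobj) -> None:
    """Write the parsed rows to any text file object as CSV with a header."""
    writer = csv.DictWriter(fileobj, fieldnames=["name", "url"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical markup standing in for a real listing page.
SAMPLE_HTML = """
<ul>
  <li><a class="dataset-link" href="/dataset/1">Iris</a></li>
  <li><a class="dataset-link" href="/dataset/2">Wine</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(parse_datasets(SAMPLE_HTML), buffer)
    print(buffer.getvalue())
```

Keeping the download step separate from parsing means the parser can be exercised on saved HTML without hitting the site, which also helps when respecting a site's rate limits and terms.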
2
Article
Machine Learning News·2y
Meet Reworkd: An AI Startup that Automates End-to-end Data Extraction
Reworkd is an AI startup that automates end-to-end web data extraction. Its no-code platform generates scraping code and updates it automatically when a website changes, and it manages the rest of the pipeline as well: website scans, code generation, data validation, and export. Additional features include self-healing scrapers, scheduling, deduplication, and automated proxy handling. The stated goal is scalable data extraction that is accessible to businesses of all sizes.
43
2
3
Article
Real Python·2y
Exercises Course: Introduction to Web Scraping With Python – Real Python
Web scraping involves collecting and parsing raw data from the Web. This course covers parsing website data using string methods, regular expressions, and an HTML parser. The course includes 23 lessons, downloadable resources, and a certificate of completion.
42
2
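The three parsing approaches named in the course summary above (string methods, regular expressions, and an HTML parser) can be compared on one small document. This is only an illustrative sketch: the sample HTML and function names are made up, and the standard library's `html.parser` is used as a stand-in for whichever parser the course itself teaches.

```python
import re
from html.parser import HTMLParser

# Hypothetical page content used for all three approaches.
SAMPLE_HTML = (
    "<html><head><title> Example Page </title></head>"
    '<body><a href="/first">First</a> <a href="/second">Second</a></body></html>'
)


def title_with_string_methods(html):
    """Locate the title by searching for the literal opening and closing tags."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return None
    return html[start + len("<title>"):end].strip()


def title_with_regex(html):
    """Tolerate attributes and letter case that a literal search would miss."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_with_parser(html):
    """Run the parser over the document and return the collected links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(title_with_string_methods(SAMPLE_HTML))
    print(title_with_regex(SAMPLE_HTML))
    print(links_with_parser(SAMPLE_HTML))
```

String methods are the simplest but break on any variation in the markup; a regular expression absorbs some variation; a real parser is usually the more robust choice once the structure of the page matters.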
4
Article
Community Picks·2y
felipeall/resumeio-to-pdf: Download your resume from resume.io as PDF
The post provides a guide on how to download resumes from resume.io as PDF files by entering a renderingToken, merging image files, injecting hyperlinks, converting to PDF, and running OCR to extract text. The guide includes steps to clone the repository, build a Docker image, and run a container. It emphasizes the importance of adhering to applicable laws and supporting Resume.io by subscribing to their services.
27
1
5
Article
Community Picks·2y
apify/crawlee-python: Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other file
Crawlee is a comprehensive web scraping and browser automation library for Python, designed to build reliable crawlers. It provides tools for efficient data extraction and persistent storage, with configurations to fly under the radar of modern bot protections. Available on PyPI, it supports BeautifulSoupCrawler for fast HTML parsing and PlaywrightCrawler for handling JavaScript-heavy pages. The library features type hints, proxy rotation, automatic retries, and more. Explore its documentation for detailed guides and examples. Contributions and bug reports are welcome on GitHub.
22
