Best of Crawling — October 2024

1
Article
Community Picks·2y
How to create a LinkedIn job scraper in Python with Crawlee
Learn to build a LinkedIn job scraper in Python using Crawlee and Streamlit. The guide includes setting up a Crawlee project, crafting a script to scrape job postings, and creating a Streamlit web application to run the scraper. It covers URL encoding, routing the crawler, extracting job details, and saving the output to a CSV file.
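Two of the steps named in the summary, URL encoding and writing the results to CSV, can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the linked guide: the base URL, the query parameter names, and the record fields are hypothetical placeholders, and the Crawlee and Streamlit parts are left out.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical search endpoint and parameter names, for illustration only.
BASE_URL = "https://example.com/jobs/search"


def build_search_url(title: str, location: str) -> str:
    """Percent-encode user-supplied search terms into a query string."""
    query = urlencode({"keywords": title, "location": location})
    return f"{BASE_URL}?{query}"


def jobs_to_csv(jobs: list[dict[str, str]]) -> str:
    """Serialize scraped job records (assumed fields) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "company", "location"])
    writer.writeheader()
    writer.writerows(jobs)
    return buffer.getvalue()


url = build_search_url("data engineer", "New York, NY")
csv_text = jobs_to_csv(
    [{"title": "Data Engineer", "company": "Acme, Inc.", "location": "Remote"}]
)
print(url)
print(csv_text)
```

Encoding matters because spaces and commas in a job title or location would otherwise produce a malformed query string; the `csv` module likewise quotes fields such as company names that contain commas.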
41
2
Article
Community Picks·2y
Python Web Scraping Made Easy: Explore These 8 Libraries 🔍
Python is well suited to web scraping thanks to its simplicity and robust ecosystem. This post highlights eight popular Python libraries for web scraping: Beautiful Soup for HTML/XML parsing, Scrapy for creating spiders, Selenium for browser automation, Requests for handling HTTP requests, lxml for efficient XML/HTML processing, PyQuery for jQuery-like querying, MechanicalSoup for browser emulation, and Playwright for cross-browser automation. Each library has its own features, suited to different scraping needs.
41
2
3
Article
Community Picks·2y
Optimizing web scraping: Scraping auth data using JSDOM
Web scraping can be resource-intensive, especially when extracting authentication data from sites like the TikTok Ads Creative Center. Traditional approaches using browser automation tools like Puppeteer or Playwright are effective but expensive. A new method uses JSDOM to run the page's JavaScript without a full browser, improving performance and reducing RAM usage. It obtains the required headers and data and then makes the API calls directly, which is significantly faster than browser-based scraping. Unlike parsers such as Cheerio, which do not execute JavaScript, it also works on JavaScript-rendered websites.
40
4
Article
Community Picks·2y
A Guide To Robots.txt: Best Practices For SEO
Understanding and correctly using robots.txt is critical for SEO, as mistakes can impact how search engines crawl your site. Key principles include blocking unimportant URLs, managing conflicting rules with order of precedence, and specifying sitemaps. Proper usage helps save crawl budget and computing resources while ensuring important pages are prioritized. Validate your robots.txt file to prevent blocking essential content inadvertently.
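One way to validate a robots.txt file before deploying it is to parse it and test representative URLs. The sketch below uses Python's standard `urllib.robotparser`; the rules, paths, and sitemap URL are made-up examples. Note one assumption about precedence: the standard-library parser applies the first matching rule in file order, whereas search engines such as Google pick the most specific matching rule, so the example is arranged so that both approaches give the same answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block internal search pages, allow everything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check that unimportant URLs are blocked and important pages stay crawlable.
search_allowed = parser.can_fetch("*", "https://example.com/search/?q=shoes")
product_allowed = parser.can_fetch("*", "https://example.com/products/1")
sitemaps = parser.site_maps()  # available in Python 3.8+

print(search_allowed, product_allowed, sitemaps)
```

Running checks like these against a list of essential URLs is a cheap way to catch a rule that blocks important content by accident.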
22
1
5
Video
Oxylabs·2y
How To Bypass IP Bans
Learn what causes IP bans and effective methods to bypass them. Use tools such as VPNs and proxies to mask your IP address. Clearing your cache and changing your MAC address are also viable options. For data scraping, residential proxies and AI-powered proxy management solutions can help avoid detection. Follow website terms of service to minimize the risk of future IP bans.
21
2
