What to Avoid When Web Scraping in 2023

Don't Use Headless Browsers for Everything

Selenium, Puppeteer, and Playwright are great tools, no doubt, but they are not a silver bullet: they add resource overhead and slow down the scraping process. So when should you use them? They are 100% needed for JavaScript-rendered content and helpful in many other circumstances, but ask yourself whether that is really your case.

Most sites serve the data, one way or another, in the first HTML response. Because of that, we advocate going the other way around: test the plain HTML first with your favorite tool and language (cURL, requests in Python, Axios in JavaScript, whatever). Check for the content you need: text, IDs, prices. Be careful here, since the data you see in the browser is sometimes encoded differently in the raw HTML (special characters served as HTML entities, for example), so copy & paste might not work. 😅
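As a rough sketch of that first check, assuming requests and BeautifulSoup are installed (the URL and CSS selector below are placeholders, not from any real target):

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical target page and selector; replace with the site and element you care about
    url = "https://example.com/product/123"

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    price = soup.select_one(".product-price")  # placeholder CSS selector
    if price:
        print("Found in plain HTML:", price.get_text(strip=True))
    else:
        print("Not found: the page may need JavaScript rendering, or the data is encoded elsewhere.")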

If you find the info, try to write the extractors. A quick hack might be good enough for a test. Once you have identified all the content you want, the next step is to separate the generic crawling code from the code that is specific to the target site.

We did a small-scale benchmark with 10 URLs using three different methods to obtain the HTML.
  1. Using Python's "requests": 2.41 seconds
  2. Playwright with Chromium, opening a new browser per request: 11.33 seconds
  3. Playwright with Chromium, sharing the browser and context for all the URLs: 7.13 seconds

It is neither 100% conclusive nor statistically rigorous, but it shows the difference: in the best case, Playwright is roughly 3x slower, and sharing the context is not always a good idea. And we are not even counting CPU and memory consumption.
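For context, here is a minimal sketch of how such a timing comparison could be run (not the exact benchmark we used; the URLs are placeholders, and Playwright must be installed along with its browsers):

    import time
    import requests
    from playwright.sync_api import sync_playwright

    urls = ["https://example.com/page/%d" % i for i in range(10)]  # placeholder URLs

    # Method 1: plain HTTP requests
    start = time.perf_counter()
    for url in urls:
        requests.get(url, timeout=10)
    print("requests:", time.perf_counter() - start)

    # Method 3: Playwright with Chromium, sharing one browser and context for all URLs
    start = time.perf_counter()
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        for url in urls:
            page.goto(url)
            page.content()
        browser.close()
    print("playwright (shared context):", time.perf_counter() - start)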

Don't Couple Code to the Target

Some actions are independent of the website you are scraping: getting the HTML, parsing it, queueing new links to crawl, storing content, and more. In an ideal scenario, we would separate those from the actions that depend on the target site: CSS selectors, URL structure, database structure.

The first script is usually entangled, and there is no problem with that. But as it grows and new pages are added, separating responsibilities becomes crucial. We know, easier said than done, but pausing to think is what it takes to build a maintainable and scalable scraper.

We published a repository and blog post about distributed crawling in Python. It is a bit more complicated than what we've seen so far. It uses external software (Celery for the asynchronous task queue and Redis for the database).

Long story short, separate and abstract the parts related to target sites. In our example, we simplified by creating a single file per domain. In there, we specify four things:
  • How to get the HTML (requests vs. a headless browser)
  • Which URLs to filter and queue for crawling
  • What content to extract (CSS selectors)
  • Where to store the data (a list in Redis)
    # one file per target domain; the generic crawler calls these four functions

    def extract_content(url, soup):
        # what content to extract: this site's CSS selectors live here
        # ...

    def store_content(url, content):
        # where to store the data (e.g., a list in Redis)
        # ...

    def allow_url_filter(url):
        # which discovered URLs should be queued for crawling
        # ...

    def get_html(url):
        # how to get the HTML: this domain needs a headless browser
        return headless_chromium.get_html(url, headers=random_headers(), proxies=random_proxies())

It is still far from massive-scale, production-ready code. But code reuse is easy, and so is adding new domains. And when updated browsers or headers become available, it is easy to modify the old scrapers to use them.
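To illustrate the idea with hypothetical names (this is not the exact code from the repository), the generic crawler can load the per-domain module and never touch selectors or storage itself:

    import importlib
    from urllib.parse import urlparse
    from bs4 import BeautifulSoup

    def get_parser_for(url):
        # Hypothetical layout: one module per domain, e.g. parsers/example_com.py
        domain = urlparse(url).netloc.replace(".", "_").replace("-", "_")
        return importlib.import_module(f"parsers.{domain}")

    def crawl(url, queue):
        parser = get_parser_for(url)
        html = parser.get_html(url)                  # requests or headless browser, the domain module decides
        soup = BeautifulSoup(html, "html.parser")
        parser.store_content(url, parser.extract_content(url, soup))
        for link in soup.find_all("a", href=True):   # queue only the URLs this domain allows
            if parser.allow_url_filter(link["href"]):
                queue.append(link["href"])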

Don't Take Down Your Target Site

Your extra load might be a drop in the ocean for Amazon but a burden for a small independent store. Be mindful of the scale of your scraping and the size of your targets.

You can probably crawl hundreds of pages on Amazon concurrently and they won't even notice (be careful nonetheless). But many websites run on a single shared machine with poor specs, and they deserve our consideration. Throttle your script for those sites. It might complicate the code, but backing off when response times increase is the right thing to do.
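A simple way to do that, sketched here with made-up thresholds: keep a base delay between requests and back off whenever response times start climbing.

    import time
    import requests

    base_delay = 1.0        # seconds between requests; illustrative value
    slow_threshold = 2.0    # if responses take longer than this, the site may be struggling

    delay = base_delay
    for url in urls:        # `urls` defined elsewhere
        start = time.perf_counter()
        response = requests.get(url, timeout=15)
        elapsed = time.perf_counter() - start

        if elapsed > slow_threshold:
            delay = min(delay * 2, 30)   # back off, up to 30 s between requests
        else:
            delay = base_delay           # recover once the site responds quickly again

        time.sleep(delay)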

Another point is to inspect and comply with their robots.txt. There are mainly two rules: do not scrape disallowed pages, and obey Crawl-Delay. That directive is not common, but when present it indicates the number of seconds crawlers should wait between requests. There is a Python module that can help us comply with robots.txt.
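For instance, Python's built-in urllib.robotparser can check both rules; a minimal sketch with a placeholder site and bot name:

    import time
    from urllib import robotparser

    robots = robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")  # placeholder site
    robots.read()

    user_agent = "my-crawler"  # hypothetical bot name
    delay = robots.crawl_delay(user_agent) or 1  # fall back to 1 s if no Crawl-Delay is set

    for url in urls:  # `urls` defined elsewhere
        if not robots.can_fetch(user_agent, url):
            continue  # skip disallowed pages
        # ... fetch and parse the page ...
        time.sleep(delay)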

We will not go into detail, but do not perform malicious activities (there should be no need to say it, but just in case). We are always talking about extracting data without breaking the law or causing damage to the target site.

Don't Mix Headers from Different Browsers

This last tip is aimed at higher-level anti-bot solutions. Browsers send several headers in a set format that varies from version to version, and advanced solutions check those headers against a database of real-world header sets. That means you will raise red flags by sending the wrong ones, or, even harder to notice, by not sending the right ones at all!

There is no easy way out of this other than having actual, complete header sets. And plenty of them: one for each User-Agent you use. Not one for Chrome and another for iPhone, nope. One. Per. User-Agent. 🤯

Some people try to avoid this by using headless browsers, but we have already shown why it is better to avoid them when possible. And even then, you are not in the clear: they send the whole set of headers that works for that browser on that version, and if you modify any of it, the rest might no longer be valid. If you use Chrome with Puppeteer and overwrite the User-Agent with an iPhone one... you might be in for a surprise. A real iPhone does not send "Sec-Ch-Ua", but Puppeteer still will, since you overwrote the User-Agent without deleting that header.

Some sites offer lists of User-Agents. But it is hard to get complete header sets for hundreds of them, which is the scale needed when scraping complex sites.

    import random
    import requests

    # ... (`urls` and `proxies` come from the elided setup code)

    header_sets = [
        {
            "Accept-Encoding": "gzip, deflate, br",
            "Cache-Control": "no-cache",
            "User-Agent": "Mozilla/5.0 (iPhone ...",
            # ...
        }, {
            "User-Agent": "Mozilla/5.0 (Windows ...",
            # ...
        },
        # ... more header sets
    ]

    for url in urls:
        # ...
        # pick one complete, coherent header set at random for each request
        headers = random.choice(header_sets)
        response = requests.get(url, proxies=proxies, headers=headers)
        print(response.text)

We know this last one was a bit nitpicky. But some anti-scraping solutions are even pickier and go beyond headers: some check browser or even connection fingerprinting, which is high-level stuff.

Conclusion

Rotating IPs and sending good headers will allow you to crawl and scrape most websites. Use headless browsers only when necessary, and apply good software engineering practices.

Build small and grow from there, adding functionality and use cases. But always try to keep scale and maintainability in mind while keeping success rates high. Don't despair if you get blocked from time to time, and learn from every case.

Web scraping at scale is a challenging, long journey, but you might not need the best system ever, nor 100% accuracy. If it works on the domains you want, good enough! Don't freeze trying to reach perfection, because you probably don't need it.

Source of the Article: https://www.zenrows.com/blog/
