The Web Scraping Club’s recent series on device fingerprinting has been making the rounds here at Salad, and it got the team and me very excited!
Pierluigi Vinciguerra, co-founder of databoutique.com, outlined how device fingerprinting is used for cookie-free device tracking and bot detection, both of which are key inputs for blocking web scraping operations at scale. To collect a large volume of data effectively and consistently, you need to spoof device fingerprints to bypass these systems…that is, unless you’re collecting this data from real-world consumer devices!
At Salad, we’ve already built a global network of tens of thousands of residential PCs to lower the cost of cloud computing. Could our infrastructure help solve the problem of device fingerprinting for web scraping companies, simply by making spoofing unnecessary? After all, PCs on the Salad network already have the exact hardware that web scraping developers go to great lengths to emulate. Better yet, our compute costs are a fraction of those of major cloud providers, and each node comes with a high-quality residential IP built in!
When we decided to put it to the test, our hunch proved correct: 100% of the 1000+ unique Salad nodes we tested passed both the fingerprinting test and ‘Harrods’ test. Read on to learn how we did it, and how to use our network to power your own web scraping at scale.
Skipping the Spoof
Instead of spoofing device hardware, IP address, and other features that are used to detect a bot, we decided to use what we already have: our compute network. The Salad Network consists of tens of thousands of unique consumer PCs (or “nodes”), concurrently connected to our centralized servers and receiving containerized workloads to run based on characteristics of the nodes themselves. We compensate the owners of these nodes for sharing their compute resources.
Because these nodes typically have powerful GPUs attached, most of the workloads currently running on the network are for generative AI companies. However, we suspected these nodes would work well for the web scraping use case because they typically have:
- Fast broadband connections – prevents bottlenecking on throughput
- Residential IPs – prevents the need for a separate residential proxy service
- Consumer hardware – prevents the need for spoofing device characteristics
- Powerful hardware – allows multiple headless or headful browser instances to run concurrently
We already know what hardware is on each machine, what its average internet speed is, and the quality of its IP. With that in mind, we set up and conducted two tests to verify the third point above: that the consumer hardware on these nodes would obviate the need for device hardware spoofing.
Step 1. Target selection
To better understand how our network would perform for a web scraping use case, we wanted these tests to answer two questions:
- What percent of randomly selected nodes pass a stringent fingerprinting test?
- What percent of randomly selected nodes pass a real-world test?
For the first test, we decided to use GoLogin’s browser fingerprinting test; for the second, real-world test, we ran a live test against harrods.com. GoLogin provides a product for customizing the hardware and software fingerprint that you want to send to a target, but it also offers a free online fingerprinting test that looks at a wide range of browser- and device-level characteristics across the following categories:
- Browser (e.g. Browser version, headers, PDF viewers)
- Location (e.g. Country, City, Lat/Long, Zip code)
- IP Address (e.g. IP Address, ISP, ASN, and various red flags)
- Hardware (e.g. WebGL, Canvas, Microphone)
- Software (e.g. Time Zone, installed fonts, JS enabled)
Step 2. Create Playwright script
Here, we’re using Firefox in headless mode, targeting the first destination (GoLogin) and taking a screenshot of the result. The script is saved to tests_pl_firefox.py, and we extended it to upload the images to a central location so they could be reviewed.
Here’s the script we used for the gologin test:
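The listing itself is missing here, so what follows is only a minimal sketch of a script along these lines, not the exact code from the test. It assumes Playwright's synchronous Python API. The target URL, the NODE_ID environment variable, the screenshots directory, and the helper names are all assumptions made for illustration, and the upload step is left out because the post does not say where the images were sent.

```python
import os
import socket
from pathlib import Path

# Assumption: address of the fingerprint checker to test against.
# Override it with the TARGET_URL environment variable.
TARGET_URL = os.environ.get("TARGET_URL", "https://gologin.com/check-browser")


def screenshot_path(node_id: str, out_dir: str = "screenshots") -> Path:
    """Build a per-node file name so results from many nodes don't collide."""
    safe_id = "".join(c if c.isalnum() or c in "-_" else "_" for c in node_id)
    return Path(out_dir) / f"fingerprint_{safe_id}.png"


def capture(url: str, out_path: Path) -> None:
    """Open the URL in headless Firefox and save a full-page screenshot."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    out_path.parent.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for network activity to settle so the checks can finish.
            page.goto(url, wait_until="networkidle", timeout=60_000)
            page.screenshot(path=str(out_path), full_page=True)
        finally:
            browser.close()


def main() -> None:
    # NODE_ID is a hypothetical variable; fall back to the container hostname.
    node_id = os.environ.get("NODE_ID", socket.gethostname())
    capture(TARGET_URL, screenshot_path(node_id))
```

Adding a call to main() as the last line of the file would make it runnable with python tests_pl_firefox.py inside the container. Naming each screenshot after the node keeps results from 1000 machines distinguishable once they are collected in one place.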
Step 3. Create Dockerfile
We used the official Playwright Docker image hosted on the Microsoft Container Registry at http://mcr.microsoft.com/playwright:v1.34.3.
Step 4. Deploy
We created an organization in the Salad Container Engine and a container group that pointed at the source image detailed in Step 3. Then we configured the container group to be deployed to 1000 nodes, each with 1 vCPU and 1 GB of RAM. The total cost of the test was a few cents.
The most time-consuming part of the test, by far, was reviewing all of the screenshots that we collected! If we ran this test again, we’d update the script to extract certain page components and return a score – that’s probably what a real web scraping expert would do. Despite all the extra work, the results were encouraging: 100% of nodes were detected as ‘clean’ by GoLogin across all measured device features, and 100% of nodes returned complete results pages from the live test of harrods.com.
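As a rough illustration of the "return a score" idea, here is a minimal sketch of how extracted check results could be turned into a per-node score and a network-wide pass rate. It is not something we ran: the check names, the passing labels, and the function names are hypothetical, and it assumes the script has already pulled each check's status label out of the results page.

```python
from typing import Dict, Iterable

# Assumption: labels a checker page might show for a passing check.
PASSING_LABELS = {"ok", "clean", "pass"}


def node_score(checks: Dict[str, str]) -> float:
    """Return the fraction of checks with a passing label (0.0 to 1.0)."""
    if not checks:
        return 0.0
    passed = sum(
        1 for label in checks.values()
        if label.strip().lower() in PASSING_LABELS
    )
    return passed / len(checks)


def pass_rate(scores: Iterable[float], threshold: float = 1.0) -> float:
    """Return the share of nodes whose score meets the threshold."""
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Hypothetical results from two nodes: one fully clean, one with a flagged check.
node_a = {"webgl": "OK", "canvas": "ok", "ip": "Clean"}
node_b = {"webgl": "OK", "canvas": "Spoofed", "ip": "Clean"}
print(pass_rate([node_score(node_a), node_score(node_b)]))  # prints 0.5
```

With the default threshold of 1.0, a node only counts as passing if every check is clean, which matches how the screenshots were judged by hand.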
Web scraping results from GoLogin using Salad
This means that we achieved a 100% scraping success rate without any attempt to spoof device fingerprints. Granted, the test covered only two sites at a single point in time – but it bodes well for future investigation by ourselves and our web scraping partners.
While encouraging, these results were unsurprising – it just makes sense that using consumer PCs with a residential internet connection for scraping would avoid the need for spoofing device hardware and IP, because these are exactly the devices that web scraping developers attempt to emulate.
If you’d like to learn more about Salad’s global network of computing nodes, want to replicate our tests, experiment with your own scraping workload, or just chat about web scraping, check us out here! We’d love to hear from you.