The Idea
The Web Scraping Club’s recent series on device fingerprinting has been making the rounds here at Salad, and it got the whole team excited!
Pierluigi Vinciguerra, co-founder of databoutique.com, outlined how device fingerprinting is used for cookie-free device tracking and bot detection, both of which are key inputs for blocking web scraping operations at scale. To effectively and consistently collect a large volume of data, you need to spoof device fingerprints to bypass these systems. That is, unless you’re collecting this data from real-world consumer devices!
With SaladCloud, we’ve already built a global network of tens of thousands of residential PCs to lower the cost of cloud computing. Could our infrastructure help solve the problem of device fingerprinting for web scraping companies simply by making spoofing unnecessary? After all, PCs on the SaladCloud network already have the exact hardware that web scraping developers are going to great lengths to emulate. Better yet, our compute costs are a fraction of those of major cloud providers, and each node comes with high-quality residential IPs built into it!
When we decided to put it to the test, our hunch proved correct: 100% of the 1000+ unique Salad nodes we tested passed both the fingerprinting test and the ‘Harrods’ test. Read on to learn how we did it and how to use our network to power your own web scraping at scale.
Skipping the Spoof
Instead of spoofing device hardware, IP address, and other features that are used to detect a bot, we decided to use what we already have: our compute network. The SaladCloud Network consists of tens of thousands of unique consumer PCs (or “nodes”), concurrently connected to our centralized servers and receiving containerized workloads to run based on characteristics of the nodes themselves. We compensate the owners of these nodes for sharing their compute resources.
Because these nodes typically have powerful GPUs attached, most of the workloads currently running on the network are for generative AI companies. However, we suspected these nodes would work well for the web scraping use case because they typically have:
- Fast broadband connections – prevent bottlenecking on throughput
- Residential IPs – remove the need for a separate residential proxy service
- Consumer hardware – removes the need to spoof device characteristics
- Powerful hardware – can run multiple headless or headful browser instances concurrently
We already know what hardware is on each machine, what its average internet speed is, and the quality of its IP. With that in mind, we set up and conducted two tests to verify the third point: that the hardware on these nodes would obviate the need for device hardware spoofing.
Testing
Step 1. Target selection
To better understand how our network would perform for a web scraping use case, we wanted these tests to answer two questions:
- What percent of randomly selected nodes pass a stringent fingerprinting test?
- What percent of randomly selected nodes pass a real-world test?
For the first test, we decided to use GoLogin’s browser fingerprinting test. GoLogin provides a product for customizing the hardware and software fingerprint that you want to send to a target, but they also offer a free online fingerprinting test that looks at a wide range of browser- and device-level characteristics across the following categories:
- Browser (e.g. Browser version, headers, PDF viewers)
- Location (e.g. Country, City, Lat/Long, Zip code)
- IP Address (e.g. IP Address, ISP, ASN, and various red flags)
- Hardware (e.g. WebGL, Canvas, Microphone)
- Software (e.g. Time Zone, installed fonts, JS enabled)
For the real-world test, we crawled the Italian homepage of the Harrods e-commerce site and the first 8 pages of results in the ‘Womens’ section.
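A crawl of this shape can be sketched in a few lines of Python. The snippet below is a minimal illustration rather than the script we ran: the URL template is a placeholder on example.com (not Harrods’ real URL scheme), and `listing_urls` and `crawl_listing` are hypothetical helper names.

```python
from typing import Dict, List

# Placeholder template: NOT the real Harrods URL scheme (assumption for illustration).
LISTING_TEMPLATE = "https://example.com/it/women?page={page}"


def listing_urls(template: str, pages: int = 8) -> List[str]:
    """Return the URLs for result pages 1..pages."""
    return [template.format(page=n) for n in range(1, pages + 1)]


def crawl_listing(urls: List[str]) -> Dict[str, int]:
    """Visit each URL in headless Firefox and record its HTTP status (0 if none)."""
    # Imported here so the URL helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    statuses: Dict[str, int] = {}
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        page = browser.new_page()
        for url in urls:
            response = page.goto(url, wait_until="domcontentloaded")
            statuses[url] = response.status if response else 0
        browser.close()
    return statuses
```

A status code alone does not prove that a page rendered its full list of results, so a real check would also inspect the page content.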
Step 2. Create Playwright script
Here, we’re using Firefox in headless mode, targeting the first destination (GoLogin) and screenshotting the result. We saved the script to tests_pl_firefox.py and extended it to upload the images to a central location so they could be reviewed.
Here’s the script we used for the GoLogin test:
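In outline, a script of this kind looks like the sketch below. It is a minimal illustration under stated assumptions rather than a verbatim copy of tests_pl_firefox.py: the target URL and the `screenshots` output directory are assumptions, `screenshot_path` and `capture` are hypothetical helper names, and the upload step is left out.

```python
import re
from pathlib import Path

# Assumed address of GoLogin's fingerprint checker; confirm it before running.
TARGET_URL = "https://gologin.com/check-browser"


def screenshot_path(node_id: str, out_dir: str = "screenshots") -> Path:
    """Build a filesystem-safe screenshot path for one node."""
    safe_id = re.sub(r"[^A-Za-z0-9_.-]", "_", node_id)
    return Path(out_dir) / f"{safe_id}.png"


def capture(url: str, destination: Path) -> Path:
    """Open url in headless Firefox and save a full-page screenshot."""
    # Imported lazily so the path helper can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    destination.parent.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        page = browser.new_page()
        # Wait for network activity to settle so the checks have time to render.
        page.goto(url, wait_until="networkidle", timeout=60_000)
        page.screenshot(path=str(destination), full_page=True)
        browser.close()
    return destination
```

Calling `capture(TARGET_URL, screenshot_path("node-1"))` would save the screenshot to `screenshots/node-1.png`, ready to be uploaded for review.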
Step 3. Create Dockerfile
We used the official Playwright Docker image hosted on the Microsoft Container Registry at http://mcr.microsoft.com/playwright:v1.34.3.
Step 4. Deploy
We created an organization in the Salad Container Engine and a container group that pointed at the source image detailed in Step 3. Then, we configured the container group to be deployed to 1000 nodes, each with 1 vCPU and 1 GB of RAM. The total cost of the test was a few cents.
Results
The most time-consuming part of the test, by far, was reviewing all of the screenshots that we collected! If we ran this test again, we’d update the script to extract specific page components and return a score, which is probably what a real web scraping expert would do. Despite all the extra work, the results were encouraging: 100% of nodes were detected as ‘clean’ by GoLogin across all measured device features, and 100% of nodes returned complete results pages from the live test of harrods.com.
Web scraping results from GoLogin using SaladCloud
This means that we achieved a 100% scraping success rate without any attempt to spoof the device’s fingerprint. Granted, we only tested against two sites at a single point in time, but it bodes well for future investigation by our web scraping partners and us.
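The automated scoring mentioned above could be as simple as the sketch below. It assumes each node reports a dictionary mapping check names to pass/fail booleans. That record format, the check names, and the sample data are all hypothetical, not something GoLogin’s page provides directly and not our measured results.

```python
from typing import Dict, Iterable


def node_is_clean(checks: Dict[str, bool]) -> bool:
    """A node counts as clean only if it reported checks and every one passed."""
    return bool(checks) and all(checks.values())


def pass_rate(nodes: Iterable[Dict[str, bool]]) -> float:
    """Return the percentage of nodes whose checks all passed (0.0 if no nodes)."""
    results = [node_is_clean(checks) for checks in nodes]
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


# Hypothetical example data: two clean nodes and one with a flagged IP.
sample = [
    {"browser": True, "ip": True, "hardware": True},
    {"browser": True, "ip": True, "hardware": True},
    {"browser": True, "ip": False, "hardware": True},
]
print(f"{pass_rate(sample):.1f}% of nodes passed")  # prints "66.7% of nodes passed"
```

Treating an empty report as a failure is a deliberate choice here: a node that returned no checks should not be counted as clean.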
While encouraging, these results were unsurprising. Consumer PCs on residential internet connections are exactly the devices that web scraping developers attempt to emulate, so it makes sense that scraping from them avoids the need to spoof device hardware and IP addresses.
If you’d like to learn more about Salad’s global network of computing nodes, want to replicate our tests, experiment with your own scraping workload, or just chat about web scraping, check us out here! We’d love to hear from you.