Web scraping is a largely self-explanatory term: it means automatically extracting data from websites using bots or specialized software. The bot sends requests to a specific website's server, gets a response, and extracts the required data. You can then download the data as a spreadsheet or database for future reference.
Web scraping has many use cases, such as:
- Price comparison websites. These websites frequently scrape pricing information from multiple sources to help users find the best deals.
- Search engine indexing. Search engines like Google and Bing scrape data from websites to give helpful responses to user queries.
- Lead generation. Businesses can scrape (publicly available) contact information on industry professionals, such as names, phone numbers, and email addresses. Then, they can contact these professionals to pitch products.
- News aggregation apps. These apps scrape information from external news websites and summarize them for users.
- Market research. Companies can scrape reviews about their products to evaluate customer sentiment.
You can scrape many data types, including text, images, videos, customer reviews, user comments, and product descriptions. The data can be downloaded in several formats, including CSV, Excel (XLSX), and JSON. These standard formats make it easy to export scraped files to other devices or upload them online.
However, web scraping can be a legal grey area. Many website owners discourage scraping because malicious actors can abuse it to steal data or copyrighted content. You should avoid scraping unauthorized information the owner explicitly asks not to scrape.
How does web scraping work?
Web scraping relies on bots to automatically retrieve data from websites. You’ll first give the bot a target URL to scrape. Then, the bot loads that page’s HTML source code; some sophisticated bots render the entire website, including its visual elements.
After loading the URL, the scraper bot extracts all the data or specific sections, depending on your instructions. Most web scrapers let you download the extracted data in CSV or Excel, while more sophisticated scrapers support the JSON format.
Web scraping bots are technically complex, but their developers have simplified the process for end users. Most offer visual interfaces that let you type in the URL you want to scrape and select the specific data to retrieve. Some render the complete website so you can simply click on the data you want.
Let’s dive deeper into the process a web scraper undergoes to retrieve data from a website.
Step 1: An HTTP request to a target website’s server
The web scraper sends an HTTP request to the URL you want to scrape. This request is akin to knocking on the target website's door. The web scraper asks for permission to access the target website.
This part can get tricky, as some websites use sophisticated tools to block scrapers. For example, some website owners monitor IP addresses used by scrapers and block them en masse. That’s why a scraper is often used alongside a proxy server to counter such blocks.
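For illustration, here’s a minimal sketch of this step in Python using the popular requests library. The target URL and proxy address are placeholders, not real endpoints:

```python
import requests

# Hypothetical target URL and proxy address -- replace with real values.
TARGET_URL = "https://example.com/products"
PROXIES = {"https": "http://user:pass@proxy.example.com:8080"}

# A browser-like User-Agent header makes the request look less bot-like.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(TARGET_URL, headers=HEADERS, proxies=PROXIES, timeout=10)
response.raise_for_status()  # fail loudly if the site blocks the request
html = response.text         # the raw HTML source, ready for parsing
```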
Step 2: Parsing the target website’s code
Once the scraper gains access to the target URL, it reads the page’s HTML code. A website’s code determines its content structure, so the scraper must parse it before extracting anything.
Suppose you instruct a scraper to retrieve images from a website. It’ll first parse the website’s code to understand where each element is placed. Then, it’ll focus on extracting images and ignore the other elements.
Some sophisticated scrapers render the target website visually, and you can click on the particular elements to scrape. In this case, the scraper still parses the website’s code, but the process is visual, so you can understand it better.
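As a rough illustration of parsing, the sketch below uses Python’s BeautifulSoup library on a tiny stand-in HTML snippet, pulling out image URLs while ignoring every other element (matching the image example above):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the HTML fetched in Step 1.
html = '<html><body><img src="/a.png"><img src="/b.png"><p>text</p></body></html>'

soup = BeautifulSoup(html, "html.parser")

# Keep only image URLs and ignore every other element.
image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]
print(image_urls)  # ['/a.png', '/b.png']
```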
Step 3: Extracting the data
After parsing the code, the web scraper begins extracting the required data. Suppose you instruct the bot to extract product prices; it will execute the instruction by downloading pricing data from the target URL.
Data extracted by web scrapers is downloaded on your local device in standard formats like CSV and JSON. Then, you can seamlessly send these files to external devices or upload them online.
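A minimal sketch of this step, using Python’s standard csv and json modules on a couple of made-up pricing rows, might look like this:

```python
import csv
import json

# Hypothetical rows of extracted pricing data (product name, price).
rows = [{"product": "Widget A", "price": "19.99"},
        {"product": "Widget B", "price": "24.50"}]

# Save as CSV -- the format most scrapers support.
with open("prices.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "price"])
    writer.writeheader()
    writer.writerows(rows)

# Save as JSON for tools that prefer structured data.
with open("prices.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)
```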
The above steps seem simple, but there’s a catch: scraping is not a one-and-done process. The web scraper repeats the cycle until it collects all the required data, and it sometimes hits obstacles like IP bans or usage restrictions. A scraper can also send too many HTTP requests and crash a site, making it impossible to extract data. Scraping involves constant trial and error, and large datasets can take a long time to collect.
Types of web scrapers
1. Browser extension scrapers
Some scrapers are available as extensions for web browsers like Chrome and Firefox. You can easily install the scraping extension on your browser and scrape data without hassle. The data downloads in CSV and other standard file formats.
Browser extension scrapers are ideal if you don’t want to download standalone software to extract data from websites. However, they have significant limitations. Most browser extension scrapers are built to scrape one page at a time. Hence, they aren't suitable for extracting large datasets.
Still, browser extension scrapers have user-friendly interfaces that the average layperson can quickly understand. They work directly with the browsers you're used to, so you shouldn't have problems setting them up.
2. Software-based scrapers
These scrapers are standalone apps you download to a desktop. They’re designed for scraping large datasets at high speed. Software-based scrapers have more advanced features: you can define the target data more precisely, and they support more file formats than browser extension scrapers.
The drawback is that software-based scrapers can be technically complicated. The average layperson might need external help to set up and run scraping tasks, and some large-volume jobs require programming knowledge. These scrapers have a steep learning curve, so you’ll likely need time to familiarize yourself with their interfaces.
3. Cloud-based scrapers
Cloud-based scrapers run on remote servers you control via the internet. First, you sign up with a cloud-based scraping provider. Then, you can rent servers to perform scraping tasks.
We mentioned earlier that proxy servers and web scrapers work hand in hand. Unsurprisingly, many proxy platforms offer scraping servers for individuals and businesses. These servers are suitable for massive scraping tasks that your desktop or laptop can’t run locally.
Suppose you want to scrape all product descriptions on Amazon, the world’s biggest e-commerce marketplace. You’re looking at thousands of pages with voluminous text, images, and videos. Your PC will take unbelievably long to scrape all the information, so this task is best handed to a powerful remote server.
A remote server works 24/7 while you pay attention to other activities. You’ll get notified once the scraping task is complete, which can take a few hours to a few days, depending on the volume. Remote servers are convenient, but the drawback is that they’re often expensive.
How to scrape a website
Step 1: Identify your target URLs
The first step is identifying the URLs you want to scrape. In our earlier example of scraping Amazon product descriptions, the target URLs would be the product pages under "Amazon.com."
Step 2: Inspect the webpage
Inspect your target webpage to understand its structure. Right-click anywhere on the webpage and select Inspect or View Page Source. These options reveal the page's underlying HTML code, which is what the scraper works with.
Inspecting this code helps you understand how the site is structured and how to instruct the scraper to extract data.
Step 3: Identify the data type and its legality
You should clearly understand the data types you want to extract. For example, text is your desired data type if your goal is scraping pricing information.
Consider the legality of the data you want to scrape. Generally, web scraping is legal if the data is publicly available. Product descriptions, pricing information, and news articles are examples of public data you can scrape freely. However, scraping non-public data can be illegal and invite lawsuits.
Scraping public personal data can get tricky because some jurisdictions consider it illegal despite the data being public. For example, the Dutch Data Protection Authority (DPA) fined Clearview, an American company, 30 million euros ($32 million) for scraping billions of social media users' public photos without consent.
The photos were publicly posted, yet the Dutch DPA considered it a privacy violation to scrape them. Several European countries have similar laws that consider scraping public personal data illegal. Hence, confirm relevant laws before scraping public personal data like email addresses and phone numbers, especially for commercial use.
Data scraping can also invite action from private parties, such as when LinkedIn, the popular business social network, sued hiQ Labs, a data analytics company, for scraping public LinkedIn profiles. LinkedIn prevailed in a six-year legal battle, helped by its user agreement that explicitly barred scraping profiles.
Before scraping any website where you’re registered, check if its terms of service explicitly disallow public scraping. If so, it’s advisable not to scrape data from the site.
Step 4: Write the necessary code/instructions
After deciding on the data type and affirming the legality of scraping it, the next step is writing the instructions for your web scraper. Most scraping bots use the Python programming language. You can write a brief script telling the bot the target URL and the type of data to scrape.
If you use a browser extension scraper, you might not need to get technical; these scrapers usually receive instructions via a graphical interface. However, software-based and cloud-based scrapers usually receive instructions via code.
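As a hedged end-to-end example, the sketch below ties the earlier steps together: fetch a page, select prices, and save them to CSV. The target URL and CSS selector are hypothetical; a real site’s HTML structure determines the selector you’d actually use:

```python
import csv

import requests
from bs4 import BeautifulSoup

TARGET_URL = "https://example.com/products"  # hypothetical target site
PRICE_SELECTOR = ".product .price"           # hypothetical; depends on the site's HTML

response = requests.get(TARGET_URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
prices = [tag.get_text(strip=True) for tag in soup.select(PRICE_SELECTOR)]

# Save the scraped prices to a CSV file.
with open("prices.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["price"])
    writer.writerows([p] for p in prices)
```

Real scraping jobs add error handling, pacing, and pagination, but the overall shape stays the same.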
Step 5: Execute the code/instructions
Execute the instructions and wait for the scraping job to complete. The timing depends on the volume of data and the performance of the device or remote server used to scrape the website. Scraping jobs can take anywhere from a few minutes to a few days or weeks.
Step 6: Download and store the data
Once the scraping job is complete, download the data in your desired format and store it on a local device or the cloud. Raw scraped data often has errors, so it’s advisable to double-check and correct them. You can play around with the data and learn how to tweak it to get the desired results.
Use cases of web scraping
1. Price comparison websites
Many companies have built formidable businesses by helping shoppers find the best deals online. They aggregate discounted deals and promote them to users, earning money via commissions and ads.
Price comparison sites need to scrape frequently updated pricing data from e-commerce sites like Amazon, Walmart.com, and JD.com. They use scraping bots to perform this task, usually running on high-performance servers built to extract massive data volumes quickly.
Without web scraping bots, price comparison sites would have to extract pricing data manually, which is cumbersome. Web scrapers enable them to scrape accurate pricing data and show customers the best deals. These scrapers power many successful price comparison sites like PriceRunner and Getprice.
2. Search engine indexing
Search engines scrape URLs, website descriptions, images, videos, and other information from numerous websites. Google is the most popular search engine, but many other small players exist. They all depend on web scrapers to extract data from websites, process it, and display relevant information to end users.
Website owners also depend on scraping to build internal search tools. Suppose you run a social media website and want to add a box where users can search for specific information. You’d first use a scraping bot to gather data from your own site, including descriptions, URLs, and images. Then, you'd refine this data and upload it to a searchable database.
Search engines usually use remote, high-performance servers for scraping because they deal with massive data volumes. Yet, you can use your local device for small search engine scraping tasks, e.g., an internal search tool for a blog.
3. Real estate listing platforms
Many real estate listing sites use bots to scrape data on properties available for rent or sale. Most listings on U.S.-focused real estate websites were likely sourced from a multiple listing service (MLS), the databases real estate professionals use to share listings.
Suppose you run a real estate site focused on the Florida market. You can import data from an MLS database into your site using web scrapers. You’d instruct the scraping bot to scrape listings from the MLS’s Florida section and download them to a CSV file. Then, you can refine the data before uploading it to your website.
4. Lead generation
Many companies use scraping bots to gather information about potential clients. This practice is widely adopted by companies that sell business-to-business (B2B) services.
Suppose you’re a sales representative of a banking software provider. You can scrape (publicly available) contact information of local banking executives.
The aim is to contact these executives to pitch your company’s banking software. A single database with all relevant contact information helps your sales team run efficiently, and scraping bots can help build this database.
Some companies also scrape public contact information of business executives and offer them for sale. For example, ZoomInfo has built a multi-billion dollar business by selling a commercial search engine of contact information. It curates this information from numerous public sources and charges monthly or annual fees to access its vast database.
However, curating and monetizing public contact information can get tricky. We earlier mentioned a LinkedIn lawsuit against hiQ Labs for scraping public LinkedIn data. Many platforms prohibit users from scraping data even when it’s public. Thus, confirm a platform’s user agreement before scraping information from it.
5. Sentiment analysis
Many companies use web scraping bots to gather data that indicates customer sentiment. For example, an apparel brand can scrape reviews from various sites and import them into a single database. Then, the brand’s executives analyze this database to evaluate customer sentiment and know how to improve their products.
Scraping lets companies collate reviews en masse instead of checking them manually, which takes too much time and effort. A single scraping server or local device can extract voluminous reviews from different websites and output them in CSV format. The owner can then refine or export this file for further evaluation.
The legality of web scraping
The legality of web scraping is complicated and depends on many factors, such as the origin of the data (public or non-public), the type of data, and the method of scraping it.
As a rule of thumb, web scraping is considered legal when it doesn’t:
- Violate the Computer Fraud and Abuse Act (CFAA) or similar laws in other jurisdictions. This law prohibits accessing a computer without authorization, e.g., scraping data from behind login walls or other access controls without permission.
- Violate the Digital Millennium Copyright Act (DMCA) or similar laws. This law prohibits downloading and distributing copyrighted content without permission, i.e., scraping copyrighted data without consent.
- Violate the terms & conditions of the platform being scraped from. The LinkedIn-hiQ Labs lawsuit was based on scraping being explicitly prohibited in LinkedIn’s user terms. A U.S. federal judge sided with LinkedIn, precedenting many companies to take legal action against data scraping violators.
- Compromise the security of the information being scraped. In other words, data scrapers should avoid scraping confidential information that could cause harm in the wrong hands.
Most jurisdictions don’t have laws explicitly governing data scraping. However, legal experts often cite existing data privacy laws in web scraping cases. Below are the relevant laws.
General Data Protection Regulation (GDPR)
GDPR is a landmark data privacy law passed by the European Union (EU) in 2016 and implemented in 2018. This law defines data privacy rights for all EU residents, giving them control over their personal data and restricting organizations that collect this data.
Scraping public personal data of EU residents requires complying with the GDPR, with potential fines for violators. Compliance includes:
- Transparency. Be transparent about how you intend to use the scraped data.
- Owners’ rights. Respect requests from users who want to access, restrict, or delete their scraped data from your database.
- Security. Implement strong security measures to prevent unauthorized access to scraped data, establish firm data protection policies, and train employees on them. Swiftly report any data breaches to the local data supervisory authority.
- Lawfulness. Don’t use scraped data for illegal purposes, and have a lawful basis for scraping it in the first place.
GDPR applies to all European Union (EU) countries. The U.K. left the EU in 2020 but implemented a near-identical version (the UK GDPR) shortly after, so the rules also apply to U.K. residents.
Violators can face significant fines; in September 2024, the Dutch Data Protection Authority (DPA) fined Clearview 30 million euros ($32 million) for scraping public social media pictures to build facial recognition software. Despite the data being publicly available, the Dutch DPA considered its use unlawful, demonstrating the need for all businesses to have a lawful basis for scraping data.
California Consumer Privacy Act (CCPA)
California is the most populous U.S. state, has the biggest state economy, and is home to some of the world’s largest tech companies, including Meta, Google, and Snap Inc. In June 2018, the state signed the California Consumer Privacy Act (CCPA) into law, enacting data privacy rights for all California residents.
The CCPA differs from the GDPR in several ways, including what it considers personal information: the CCPA covers personal data that the consumer provides, while the GDPR covers all personal information regardless of its source.
Complying with the CCPA includes:
- Terms of use. Adhere to a website’s terms & conditions regarding scraping. If a site’s terms bar data scraping, obey them to prevent potential legal trouble. Many websites also have robots.txt files indicating the sections to avoid scraping; follow these directions (see the sketch after this list).
- Security and governance. Implement robust policies to protect scraped data from malicious actors. Scrapers can be held responsible for misuse of data stolen from them.
- Copyright. Avoid scraping copyrighted material without the owner’s permission.
- Transparency. Have clear purposes for scraping data and be transparent about them if asked by the owner.
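As mentioned in the list above, respecting robots.txt is part of staying compliant. A minimal Python sketch using the standard library’s urllib.robotparser, against a hypothetical site and bot name, could look like this:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site -- always check its robots.txt before scraping.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

url = "https://example.com/products/page-1"
if rp.can_fetch("MyScraperBot", url):
    print("Allowed to scrape:", url)
else:
    print("robots.txt disallows scraping:", url)
```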
Violations of the CCPA can invite fines and lawsuits from data owners. For example, OpenAI has faced numerous lawsuits alleging unauthorized data scraping to train its ChatGPT artificial intelligence (AI) tool. It has won some of these suits but continues to fight many more expensive cases over its scraping practices. OpenAI can afford such litigation, but similar cases can sink small companies.
Computer Fraud and Abuse Act (CFAA)
The Computer Fraud and Abuse Act (CFAA) is a U.S. federal law prohibiting unauthorized access to protected computers and servers. Many companies have invoked the CFAA in civil suits against web scrapers but with mixed results.
In 2022, in the famous LinkedIn vs. hiQ Labs lawsuit, the U.S. Ninth Circuit Court of Appeals ruled that scraping public data doesn't violate the CFAA. The Appeals Court cited a Supreme Court precedent (Van Buren v. United States) that limited the CFAA's scope to cases where people breach technical barriers they aren't authorized to pass, not cases involving freely accessible public data.
The Ninth Circuit’s ruling set a precedent discouraging companies from filing CFAA-based lawsuits against scrapers. However, the same court found hiQ Labs liable for violating LinkedIn’s user agreement, ordering the company to pay $500,000 in damages, delete all scraped data, and cease further scraping.
Why do many websites limit web scraping?
Web scraping bots often encounter a common obstacle: many websites implement sophisticated techniques to prevent scrapers from accessing their data. These include:
- Using cookies and CAPTCHAs to differentiate real visitors from scraping bots.
- Setting limits on the number of requests to a page, which hinders scrapers from making repeated HTTP requests to extract data.
- Obfuscating data to feed web scraping bots confusing information.
- Detecting and blocking IP addresses tied to web scraping bots and proxy servers.
- Publishing fictitious content or hidden links to trap web scraping bots.
Websites discourage scraping for many reasons. For one, web scrapers consume bandwidth, which increases hosting costs and can lead to slower website loading times. Websites with copyrighted data don’t want scrapers to extract it, and website owners can be held liable for misuse of data scraped from their sites.
For instance, Ireland's Data Protection Commission fined Meta (Facebook’s owner) 265 million euros over a 2021 data breach in which scrapers harvested public data from Facebook profiles, including full names, birthdates, IDs, and phone numbers. The regulator ruled that Meta failed to properly protect personal data from unauthorized scraping and imposed the hefty fine as a deterrent.
Since the fine, Meta has become more aggressive in tackling web scraping, including filing lawsuits against several companies and individuals. Many other platforms have implemented stricter anti-scraping measures due to fears of being held liable.
Website owners also dislike scraping because of privacy fears, so you should adhere to their terms regarding copyrighted or personal data. Yet, websites often block third parties from scraping even mundane data like pricing, product descriptions, and customer reviews. Every website has the right to implement anti-scraping measures, but you can bypass some of them to get the data you need.
How to bypass anti-scraping blocks
1. IP rotation
Rotating IP addresses is the easiest way to bypass anti-scraping measures because most of them rely on detecting and blocking IP addresses. Rotation means changing IP addresses frequently to avoid detection.
Many proxy server providers offer seamless IP rotation. With their servers, you can alternate IP addresses with each scraping request. This constant rotation mimics the behavior of real users visiting a website from multiple locations, so no red flag gets raised.
Yet, a few high-traffic sites use sophisticated proxy blocklists. These blocklists contain known IP addresses of proxy server providers, preventing you from accessing their sites even with proxy rotation. Your best bet against these advanced blocklists is to switch to another proxy provider or use residential or mobile proxies, which are harder to detect because their IPs are indistinguishable from real users.
Besides, some proxy providers offer a virtually unlimited pool of IP addresses. If one gets blocked, another one simply takes over, and this process continues until the server successfully accesses the target URL.
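A simple Python sketch of IP rotation might cycle through a small pool of hypothetical proxy addresses, moving on whenever one fails:

```python
import itertools

import requests

# Hypothetical proxy pool -- a real provider would supply these addresses.
proxy_pool = itertools.cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])

def fetch(url: str) -> requests.Response:
    """Try each proxy in turn until one request gets through."""
    for _ in range(3):  # at most one full pass through the pool
        proxy = next(proxy_pool)
        try:
            return requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
        except requests.RequestException:
            continue  # proxy blocked or unreachable -- rotate to the next one
    raise RuntimeError("All proxies in the pool failed")
```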
2. Keep random intervals between scraping requests
A web scraper works like a robot. It sends scraping requests at regular intervals, a common pattern website owners use to detect scrapers. You can program your scraping server to send requests randomly, which looks more human-like and avoids easy detection.
Also, avoid sending too many requests within a short period. Excessive requests are a scraping bot's hallmark and can overload the target website's servers, causing slow speeds or a complete crash.
A website's robots.txt file often includes a crawl-delay directive telling crawlers how long to wait between requests to avoid overloading the server. Adhere to this directive where applicable.
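A minimal Python sketch of randomized request pacing, with hypothetical URLs, could look like this:

```python
import random
import time

import requests

urls = [f"https://example.com/page-{n}" for n in range(1, 6)]  # hypothetical

for url in urls:
    response = requests.get(url, timeout=10)
    # ... extract data from response.text here ...

    # Pause a random 2-8 seconds so requests don't arrive on a fixed beat.
    # If the site's robots.txt sets a crawl-delay, wait at least that long.
    time.sleep(random.uniform(2, 8))
```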
3. Watch out for honeypot traps
Some websites use complex techniques to detect web scrapers, including placing invisible links that scraping bots follow. Any IP address that visits these hidden links is marked as a bot and banned from the site.
Website owners hide honeypot links with JavaScript and CSS tricks, such as setting a link’s style to display: none or visibility: hidden, or changing the link text color to match the background so the link is invisible. Scraping bots often fall for these traps because they can’t differentiate between visible and hidden text unless explicitly instructed to.
You can avoid honeypot traps by creating well-defined rules for your scraping bots to avoid hidden links. For example, you can view the source code for a page and check for its hidden links. Then, you’ll instruct the scraping bot not to visit these links, preventing detection.
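A rough Python sketch of such a rule, using BeautifulSoup, skips links whose inline styles mark them as invisible. Note that it only checks inline styles; links hidden via external stylesheets would need extra handling:

```python
from bs4 import BeautifulSoup

# Inline-style values that usually mark a link as invisible to real visitors.
HIDDEN_MARKERS = ("display:none", "display: none",
                  "visibility:hidden", "visibility: hidden")

def visible_links(html: str) -> list[str]:
    """Return link URLs, skipping anchors styled to be invisible."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").lower()
        if any(marker in style for marker in HIDDEN_MARKERS):
            continue  # likely a honeypot trap -- don't follow it
        links.append(a["href"])
    return links
```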
Final words
Web scraping has many legitimate use cases, such as market research and price comparison. Yet, many websites discourage it, and legal battles often arise from the practice. It doesn’t help that few laws explicitly govern scraping, so court precedents tend to serve as the de facto law in this sector.
This guide has explained what you need to know about web scraping, how it works, its legality, types, and the steps to scrape a website. At this point, web scraping should no longer be a hassle, and you should know how to bypass anti-scraping measures to get data for legitimate purposes.