Web scraping with Python

This post is for those who are interested in learning about common design patterns, tricks, and rules related to web scraping.

We will write the scraper in Python. It is a strong tool for this job if you know how to use the language and its libraries correctly: requests, bs4 (Beautiful Soup), and lxml, plus the standard-library modules json and re.

Here we work with selectors to get the elements we want. To do this, we first import the requests library and make a request. Pay special attention to the headers: the server analyzes them and may return a different result depending on what they contain. I highly recommend reading up on the standard HTTP headers and their values.

import requests

# Browser-like headers; the server may vary its response based on these values.
headers = {
    'authority': 'www.walmart.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Linux; U; Android 2.3.5; ru-ru; Philips W632 Build/GRJ90) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'sec-fetch-dest': 'document',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'accept-language': 'en,en-US;q=0.9',
}

# A session reuses the connection and keeps cookies between requests.
session = requests.session()

response = session.get("https://www.walmart.com/ip/Apple-10-2-inch-iPad-8th-Gen-Wi-Fi-32GB-Space-Gray/989344107", headers=headers)

# Status code 200 means the server returned the page successfully.
if response.status_code == 200:
    print("success")
else:
    print("failure")
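
To see which headers the server will actually receive, it can help to build the request without sending it. The sketch below is only an illustration: the shortened header dictionary, the user-agent string, and the example.com URL are placeholders, not part of the Walmart example. It relies on `requests.Request` and `Session.prepare_request`, which merge your headers with the session defaults.

```python
import requests

# Hypothetical, shortened header set used only for this illustration.
custom_headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1",
    "accept-language": "en,en-US;q=0.9",
}

session = requests.Session()

# Build the request without sending it; example.com is a placeholder URL.
request = requests.Request("GET", "https://example.com/", headers=custom_headers)
prepared = session.prepare_request(request)

# prepared.headers is the merged result: our values override the session
# defaults, and header names are matched case-insensitively.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Nothing is sent over the network here, so this is a cheap way to check a header set before pointing the scraper at a real site.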

Now we have the HTML page of the product, and we need to extract data from this soup of tags and text. There are two ways to get the data. The first is to use regular expressions. I found step-by-step instructions on how to get the price and item description from walmart.com here. But we will use a different method.
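
For comparison, the regular-expression approach might look roughly like the sketch below. The HTML snippet, class names, and patterns are invented for illustration and do not reflect Walmart's real markup, which is why this method tends to break whenever the page layout changes.

```python
import re

# Hypothetical markup, not the real structure of any product page.
sample_html = """
<div class="product">
  <h1 class="title">Example Tablet 32GB</h1>
  <span class="price">$299.00</span>
</div>
"""

# Each pattern captures the text between a specific pair of tags.
price_pattern = re.compile(r'<span class="price">\$([\d.,]+)</span>')
title_pattern = re.compile(r'<h1 class="title">(.*?)</h1>', re.DOTALL)

price_match = price_pattern.search(sample_html)
title_match = title_pattern.search(sample_html)

# search() returns None when nothing matches, so guard before reading groups.
price = price_match.group(1) if price_match else None
title = title_match.group(1).strip() if title_match else None

print(title, price)
```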

We will use BeautifulSoup

BeautifulSoup is a library that allows you to parse HTML code [see documentation]. You will also need the requests library, which fetches the content of the URL. Beyond that, you have to take care of a number of other issues yourself, such as error handling, data export, and parallelization.
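
A minimal sketch of parsing with BeautifulSoup follows. It works on a small inline HTML string so that it runs without a network connection; the markup and class names are hypothetical, and in the real scraper you would pass `response.text` instead. The built-in "html.parser" is used here, though lxml can be swapped in if it is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for response.text.
sample_html = """
<div class="product">
  <h1 class="title">Example Tablet 32GB</h1>
  <span class="price">$299.00</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select_one() takes a CSS selector; find() takes a tag name and attributes.
title_tag = soup.select_one("h1.title")
price_tag = soup.find("span", class_="price")

# Both calls return None when the element is missing, so guard before use.
title = title_tag.get_text(strip=True) if title_tag else None
price = price_tag.get_text(strip=True) if price_tag else None

print(title, price)
```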

BeautifulSoup is powerful because the Python objects it gives us correspond to the nested structure of the HTML document we are scraping.
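
The nesting can be seen in a short sketch like the one below, where each attribute access steps one level deeper into the tree. The document and its `id` and class names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only to show tree navigation.
nested_html = """
<html><body>
<div id="product">
  <h1>Example Tablet</h1>
  <ul class="specs">
    <li>32GB</li>
    <li>Wi-Fi</li>
  </ul>
</div>
</body></html>
"""

soup = BeautifulSoup(nested_html, "html.parser")

# Attribute access returns the first matching tag nested inside the current one.
product = soup.body.div
heading = product.h1

# find_all() collects every matching descendant of the tag it is called on.
specs = [li.get_text() for li in product.ul.find_all("li")]

# .parent walks back up the tree to the enclosing tag.
parent_id = heading.parent["id"]

print(heading.get_text(), specs, parent_id)
```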
