The essence of the amazing technique of automatically extracting data from the Web with Python: what should you do when handed a list of URLs?

Even unreasonable requests can be handled smartly (Photo: Funtap / FIFO)

This is Part 2 of a three-part series on improving work efficiency with Python. Drawing on "Ultra-fast Python Work Techniques" by Takeshi Izawa, it introduces a program that automatically extracts information from web pages and writes it to Excel. Last time: The essence of the trick of connecting inconsistent Excel files with Python (delivered on January 18)

■ Automatically extract information from web pages

When there is information you want to collect, the usual routine is to search for the relevant page in a web browser, open it, hunt for the information you need, copy it, and paste it into another document, which is tedious. With Python, you can collect all the items you want to extract in one go simply by gathering the URLs in an Excel file, as shown below.

For example, suppose you want to collect the "Title," "Author," "Price," and "Number of pages" of multiple books. ① First, prepare an Excel file called "Book List.xlsx" as shown below. In this file, the first row is the header row: enter "URL", "Title", "Author", "Price", and "Number of pages" in order starting from column A. In the second and subsequent rows of column A, enter the URLs of the web pages you want to collect from. ② Then run the Python program.
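For reference, here is a minimal sketch of how such a file could be prepared with pandas rather than by hand. The URLs are placeholders, not pages from the book's example.

```python
import pandas as pd

# Hypothetical sketch: build "Book List.xlsx" with the header row described above.
# The URLs are placeholders; replace them with the pages you actually want to scrape.
columns = ["URL", "Title", "Author", "Price", "Number of pages"]
rows = [
    ["https://example.com/books/1", "", "", "", ""],
    ["https://example.com/books/2", "", "", "", ""],
]
pd.DataFrame(rows, columns=columns).to_excel("Book List.xlsx", index=False)
```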

③ When you run the program and open "Book List.xlsx" again, the required data has been filled in. Note, however, that an error will occur if "Book List.xlsx" is open while the program is running. Automatically collecting data from the Web in this way is called scraping. Let's take a look at the program. It assumes Anaconda as the Python execution environment; Anaconda can be installed from https://www.anaconda.com/products/individual.

The program combines Excel file manipulation with pandas (one of the Python libraries) and scraping with the requests and BeautifulSoup libraries. The read_excel function on line 07 reads "Book List.xlsx", and a for statement with the iterrows method iterates over it row by row. In lines 10 to 20, inside that loop, the HTML is obtained with requests.get().text from the URL taken from the Excel file and passed to BeautifulSoup. The "Title", "Author", "Price", and "Number of pages" are then located with the find method.
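The sample program itself is not reproduced here, so the following is only a minimal sketch of that part. The tag and class names (such as "book-title") are assumptions based on the explanation later in this article, not the actual listing; adjust them to the site you are scraping.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Read "Book List.xlsx"; column A ("URL") holds the pages to visit.
df = pd.read_excel("Book List.xlsx")

# Iterate over the table row by row.
for index, row in df.iterrows():
    # Fetch the page's HTML and hand it to BeautifulSoup for parsing.
    html = requests.get(row["URL"]).text
    soup = BeautifulSoup(html, "html.parser")

    # Placeholder tag/class names -- adjust them to the structure of the target site.
    title = soup.find("h1", {"class": "book-title"}).text
    author = soup.find("span", {"class": "book-author"}).text
    price = soup.find("span", {"class": "book-price"}).text
    pages = soup.find("span", {"class": "book-pages"}).text
    # (Writing these values back into df is covered in the sections below.)
```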


■ Extracting only numerical values from the acquired text

Lines 22 to 26 insert the acquired data into a DataFrame object (tabular data). However, the retrieved text comes in a form like "Price: 2,178 yen (tax included)". Since only the number is needed, the re.sub function is used to delete everything except the digits, leaving just "2178". re.sub is a function that performs replacement using a regular expression (a pattern that represents many strings with a single string). The first argument, r"\D", matches any non-digit character, and the second argument, "", is an empty string. Because every non-digit is replaced with an empty string, only the digits remain.
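A short illustration of this replacement, using the example string quoted above:

```python
import re

price_text = "Price: 2,178 yen (tax included)"
# r"\D" matches any character that is not a digit;
# replacing every match with "" leaves only the digits.
print(re.sub(r"\D", "", price_text))  # -> 2178
```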

■ Insert data by specifying rows and columns

In lines 22 to 26, the iloc attribute is used to insert the data obtained with the find method into the DataFrame object. The iloc attribute gets or sets a specific cell by specifying its row and column index numbers. Here, the row index number is the value held in the variable index, and the column index number is looked up from the column name with the columns.get_loc method.
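A minimal, self-contained sketch of that pattern (the empty table and the fixed row number are stand-ins for the values the real loop would supply):

```python
import pandas as pd

# An empty table with the same columns as "Book List.xlsx".
df = pd.DataFrame(columns=["URL", "Title", "Author", "Price", "Number of pages"],
                  index=range(2))

index = 0  # in the full program this row number comes from iterrows()
# Set the "Price" cell of that row by row/column index number.
df.iloc[index, df.columns.get_loc("Price")] = "2178"
print(df)
```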

Finally, the to_excel method overwrites and saves the Excel file.

■ Leave 1 second between consecutive fetches

You may be wondering about the "time.sleep(1)" written at the end of the for loop on line 27. This function pauses the program for 1 second. Web servers (the computers that publish web pages) are designed around the speed at which humans view pages. If a large amount of data is fetched continuously by scraping, the load on the web server increases and, in the worst case, the server may stop. For that reason, it is considered good practice to leave about 1 second between fetches when acquiring data in a loop.
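A sketch of how this closing part might look, assuming the file and loop from the earlier sketches (the scraping itself is elided here):

```python
import time

import pandas as pd

df = pd.read_excel("Book List.xlsx")

for index, row in df.iterrows():
    # ... fetch the page and fill in df here, as in the sketches above ...
    time.sleep(1)  # pause 1 second so consecutive requests do not overload the server

# Overwrite "Book List.xlsx" with the updated DataFrame; index=False drops the row numbers.
df.to_excel("Book List.xlsx", index=False)
```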

Please note that if you put too much load on a web server by scraping, your access may be blocked. This time we collected the book data "Title", "Author", "Price", and "Number of pages", but in actual use you would modify "Book List.xlsx" or the corresponding parts of the sample program to suit your purpose. Scraping like this does, however, require knowledge of HTML (HyperText Markup Language). HTML is a language for describing web pages: by embedding markers called tags at various points in the text, you can describe tables and lists and embed images, video, audio, and so on in a page.
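For instance, with BeautifulSoup you can pull out an element by its tag name and class attribute, as in the following example; the HTML fragment is invented purely for illustration.

```python
from bs4 import BeautifulSoup

# Invented HTML fragment, for illustration only.
html = '<h1 class="book-title">Ultra-fast Python Work Techniques</h1>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first h1 tag whose class attribute is "book-title".
print(soup.find("h1", {"class": "book-title"}).text)
# -> Ultra-fast Python Work Techniques
```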

For example, line 14 of the sample program contains the description '"h1", {"class": "book-title"}'. This is because, on the website used this time, the class attribute (which specifies the class name) of the h1 tag that holds the book title is class="book-title". Next time, in the final installment, I will introduce a program that helps aggregate data sent from inquiry forms. Last time: The essence of connecting inconsistent Excel files with Python (delivered on January 18)

Toyo Keizai Online

Last updated: 1/25 (Tue) 15:01
