How Web Scraping helped us collect data on high-profile collections like Belgazprombank's

Web Scraping is one of the most popular methods for reading data from web pages so that it can be systematized and analyzed. In essence, it is "website parsing": information is collected and exported in a more convenient format, be it a table or an API.

Web Scraping tools let you retrieve new or updated data not only manually but also automatically, which makes it much easier to reach your goals.

What is Web Scraping used for?

  • Data collection for marketing research: it lets you quickly prepare information for strategically important business decisions.
  • Extracting specific information (phone numbers, emails, addresses) from various sites to build your own lists.
  • Collecting product data for competitor analysis.
  • Cleaning up site data before a migration.
  • Collecting financial data.
  • HR tasks, such as tracking résumés and vacancies.

The Lansoft team has successfully mastered this method, so we want to share one of our data collection cases: analyzing datasets of art objects for the New York company Pryph.

Pryph analyzes famous auction houses such as Christie's, Sotheby's, and Phillips, and summarizes its findings about the popularity of various artists.

By the way, several paintings in the sensational case of Belgazprombank and Victor Babariko were bought at these very auctions. In our opinion, there was nothing illegal about those transactions (see news.tut.by/culture/349226.html).

For the job we chose Puppeteer, a JavaScript library for Node.js that controls Chrome without a user interface (headless mode).

Using this library, it is quite easy to read data from websites automatically, or to build so-called web scrapers that mimic user actions.
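For readers unfamiliar with the library, a minimal launch script looks roughly like this (example.com stands in for a real target):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch Chrome with no visible window (headless mode).
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Open a page and wait for network activity to settle.
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Read something out of the page, e.g. its title.
  console.log(await page.title());

  await browser.close();
})();
```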

Admittedly, there are more efficient ways to scrape sites with Node.js tools
(described here: habr.com/en/post/301426).

The reasons for choosing Puppeteer in our case were:

  • we had to analyze only 3 sites, each with clear sections and structure;
  • Google actively promotes and maintains the tool;
  • it emulates a real user working through the UI, reducing the risk of being banned as a potential DDoS attack.

So, our task was to visit the auction houses' sites and, for each type of auction, collect sales data on all lots for each year from 2006 to 2019.

For example, here is a piece of Puppeteer code we used to extract links to the lot images from the Phillips auction house:

[image: Puppeteer snippet collecting lot links from Phillips]
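Since the original screenshot is not reproduced here, a minimal sketch of what such a scraper could look like follows; the listing URL and the a.lot-link selector are illustrative assumptions, not the exact values from our project:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Hypothetical auction listing page; the real URL differs.
  await page.goto('https://www.phillips.com/auctions/past', {
    waitUntil: 'networkidle2',
  });

  // Collect the href of every lot link on the page.
  // page.$$eval runs the callback inside the browser context.
  const lotLinks = await page.$$eval('a.lot-link', (anchors) =>
    anchors.map((a) => a.href)
  );

  console.log(lotLinks);
  await browser.close();
})();
```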

In a similar vein, for each lot the Lansoft team needed to find the artist's name, the title of the work, the price, the sale details, and a link to the art object.

[image: a lot page showing the fields we extracted]
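Again, the original screenshot isn't reproduced, but a hedged sketch of a per-lot extractor, reusing the page object from the previous snippet, might look like this; the selectors (.artist-name, .lot-title, etc.) are placeholders, as the real markup differs per site:

```javascript
// Visit a single lot page and pull out the fields we need.
async function scrapeLot(page, url) {
  await page.goto(url, { waitUntil: 'networkidle2' });

  // page.evaluate runs inside the browser, so it can use the DOM directly.
  return page.evaluate(() => {
    // Helper: trimmed text of the first element matching a selector.
    const text = (selector) =>
      document.querySelector(selector)?.textContent.trim() ?? '';

    return {
      artist: text('.artist-name'),
      title: text('.lot-title'),
      estimate: text('.lot-estimate'),
      saleDetails: text('.sale-details'),
      url: window.location.href,
    };
  });
}
```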

Examples of links to lots:

www.phillips.com/detail/takashi-murakami/HK010120/110

www.sothebys.com/en/buy/auction/2020/contemporary-art-evening-auction/lynette-yiadom-boayke-cloister?locale=en

For example, in the picture above we see the artist's name, TAKASHI MURAKAMI, the title of the painting, "Blue Flower Painting B", and the estimate of $231,000–359,000. We collected all the necessary fields and wrote them to CSV files, broken down by year.

It looked like this:

[image: the resulting CSV file]
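The screenshot of the file isn't reproduced here; a simple helper along these lines (the writeLotToCsv name and the column order are our illustrative assumptions) would produce such per-year files:

```javascript
const fs = require('fs');

// Append one lot record to a per-year CSV file, e.g. sales-2019.csv.
// Quoting every field guards against commas inside titles.
function writeLotToCsv(year, lot) {
  const quote = (value) => `"${String(value).replace(/"/g, '""')}"`;
  const row =
    [lot.artist, lot.title, lot.estimate, lot.saleDetails, lot.url]
      .map(quote)
      .join(',') + '\n';
  fs.appendFileSync(`sales-${year}.csv`, row);
}
```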

As a result, we received sets of CSV files covering sales in different years, each file about 6,000 lines long. The client then ran his own trend-analysis algorithms over the various artists.

The results can be found at pryph.org/insights

But there are some nuances to working with Puppeteer:

  1. some sites may block access when they detect suspicious activity;
  2. Puppeteer's performance is modest out of the box; it can be improved by throttling animations, limiting network calls, etc.;
  3. you must close the browser instance to end the session (see the sketch below);
  4. the page/browser context is separate from the Node.js context the application runs in, so data must be passed between them explicitly (see the sketch below);
  5. driving a browser, even in headless mode, is not very fast or efficient for large-scale data analysis.
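Points 3 and 4 deserve a short illustration. Below is a minimal sketch (example.com stands in for a real target) showing how a value from the Node.js context has to be passed into page.evaluate() explicitly, and how closing the browser in a finally block guarantees the session ends:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    const year = 2019; // lives in the Node.js context

    // `year` is not visible inside the page context by default;
    // it has to be passed in as an argument to evaluate().
    const heading = await page.evaluate((y) => {
      return `${document.querySelector('h1')?.textContent ?? ''} (${y})`;
    }, year);

    console.log(heading);
  } finally {
    // Always end the session, or headless Chrome keeps running.
    await browser.close();
  }
})();
```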
