There is a huge amount of open data on the Internet. With the right collection and analysis of information, important business problems can be solved. For example, is it worth starting your own business?

It was with this question that I was approached by clients who wanted analytics on the photo studio services market. It was important for them to understand: is it worth opening a photo studio, where to open it, what floor area the premises should have, how many halls to open at the start, in which month it is best to launch, and many other questions.

As a result of the project, I wrote a series of articles with a detailed step-by-step description of the tasks performed, the tools used and the results obtained.

In this article, the first of three, I'll cover planning and writing the parser in Python.

In the second article, I will describe how the parser interacts with the database and how the data is updated.

In the third article, I will discuss the process of analyzing the collected data and answering questions from a client who wants to open a photo studio.


While studying the sites of photographic studios, I formulated a general scheme of their work:

  • describe their services on the site;
  • provide room booking through the website or by phone;
  • publish their contact details;
  • receive visitors at the booked time.

Problem statement


The main objective of this project is to analyze the market of services for photo studios in Moscow.
First you need to understand what companies are on the market and how they work.

The most popular photo studio aggregator site is Studiorent. There we can see the approximate number of companies, their websites, specializations, prices and other general information.

On the Studiorent website, we also saw data that is crucial for analyzing the market situation: the booking calendar. By collecting and analyzing this data across different photo studios, we can answer a huge number of important questions:

  • what is the seasonality of the business;
  • what is the workload of the available photo studios;
  • which days of the week are booked more often and by how much;
  • what is the income of photo studios;
  • what is the hourly rate for renting a photo studio;
  • how many rooms the photo studios have and how many there were at the opening;
  • how the number of halls affects the amount of income from one hall;
  • what is the area of the halls;
  • and many others.

So the first task is set: write a parser that collects data from the booking calendars of different photo studios.

What are we going to parse?


Upon closer examination of the sites of different photo studios, we see the following main booking services:

  1. Google Calendar;
  2. the AppEvent application;
  3. the Ugoloc application;
  4. self-written booking calendars;
  5. booking by phone, with no booking information on the website.

Parsing the self-written calendars would require separate code for each one, so we will not spend time on those.

AppEvent is inconvenient for parsing: it holds no information about past bookings, so it would have to be parsed daily. That creates real inconveniences: long-term data collection (more than a year) would be required to get even an approximate picture of the market, and a separate server (a cloud one, for example) would be needed to launch the parser daily.

With Google Calendar, there was initially the fear that Google itself would block the parser. Most likely, special paid proxy services would be required, which complicates the parser and makes it more expensive.

The Ugoloc service seemed ideal for parsing, thanks to the project's relatively short lifespan (it has been running only since 2015) and the significant number of photo studios registered on it (84 at the time of writing).

The easiest option is therefore to parse the Ugoloc service, because:

  • we can access the entire booking history, so the data can be parsed as needed rather than collected daily;
  • no proxy is needed, so simple libraries (urllib) are enough;
  • about a third of all Moscow photo studios are registered on the service, so we will get reliable data on the state of the market.

Site structure


A complete list of photo studios with links to their pages is presented on the site's studio listing page.

On the page of each photo studio we can find data on the studio's area, ceiling height, number of halls and special equipment, plus links to each hall.

On each hall's page there is data on the hall's size, ceiling height and minimum booking period, as well as a link to the booking calendar.

A link to a studio's page has the following form: ugoloc.ru/studios/show/studio_id
Link to a hall: ugoloc.ru/studios/hall/hall_id
Link to a booking calendar: ugoloc.ru/studios/booking/hall_id
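
These patterns can be wrapped in small helper functions so the rest of the code never builds URLs by hand (a minimal sketch; the helper names are mine):

import urllib.request

BASE = 'https://ugoloc.ru/studios'

def studio_url(studio_id):
    # page of one photo studio
    return f'{BASE}/show/{studio_id}'

def hall_url(hall_id):
    # page of one hall
    return f'{BASE}/hall/{hall_id}'

def booking_page_url(hall_id):
    # booking calendar page of one hall
    return f'{BASE}/booking/{hall_id}'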

Stages of parsing


  1. downloading the list of photo studios;
  2. downloading the list of halls;
  3. downloading booking data for the selected week;
  4. downloading historical booking data;
  5. downloading future booking data;
  6. decoding the downloaded json data.

1. Downloading the list of photo studios


How do we download the list of photo studios?

Six months ago, the list of photo studios on the page was loaded as a json file. If this is the first time you analyze a site this way, the contents of json files can be viewed as follows: F12 -> "Network" -> "JS" -> select the desired file -> "Response".


Thus, I found a link to the list of photo studios: https://ugoloc.ru/studios/list.json

For the request, we use the urllib library.

Requesting complete data for the list of photo studios
import urllib.request

url = 'https://ugoloc.ru/studios/list.json'
json_data = urllib.request.urlopen(url).read().decode()


We received the json data as a string. To decode it, we will use the json library.

The list of photo studios is located under the 'features' key
import json

json.loads(json_data)['features']


Iterating over the list of photo studios and saving the necessary fields, we get the following procedure:
import pandas as pd

def studio_list():
    url = 'https://ugoloc.ru/studios/list.json'
    json_data = urllib.request.urlopen(url).read().decode()
    features = json.loads(json_data)['features']   # parse the json once
    ids, name, metro, address, phone, email = [], [], [], [], [], []
    for f in features:
        studio = f['studio']
        ids.append(studio['id'])
        name.append(studio['name'])
        metro.append(studio['metro'])
        address.append(studio['address'])
        phone.append(studio['phone'])
        email.append(studio['email'])
    return pd.DataFrame.from_dict({
        'studio_id': ids,
        'name': name,
        'metro': metro,
        'address': address,
        'phone': phone,
        'email': email,
    }).set_index('studio_id')


At the end of the procedure, we received a table with the studio id, name, nearest metro, address, phone number and e-mail.

If necessary, you can also pull out the website, geolocation data, a text description of the studio and other fields, as sketched below.
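
The sketch assumes key names such as 'site' and 'description'; the real names must be checked in the json response first, so treat them as placeholders:

import json
import urllib.request

json_data = urllib.request.urlopen('https://ugoloc.ru/studios/list.json').read().decode()
features = json.loads(json_data)['features']

# 'site' and 'description' are assumed key names -- verify them in DevTools first;
# .get() simply returns None if a key is missing
extra = [{'studio_id': f['studio']['id'],
          'site': f['studio'].get('site'),
          'description': f['studio'].get('description')}
         for f in features]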

2. Downloading the list of halls


A detailed description of a photo studio, including its halls, is located at "ugoloc.ru/studios/show/" + the photo studio id.

On the page of the photo studio, we find the list of links to its halls. The list of links is obtained using the BeautifulSoup library and the re regular expressions module:

  1. first we make a GET request (urllib.request.urlopen) to the studio page;
  2. then we turn the resulting string into a BeautifulSoup object parsed as html code ("html.parser");
  3. then we find all links that contain the "studios/hall/" path.

Request code:
from bs4 import BeautifulSoup
import re

url_studio = 'https://ugoloc.ru/studios/show/' + str(584)
html = urllib.request.urlopen(url_studio).read()
soup = BeautifulSoup(html, "html.parser")
halls_html = soup.find_all('a', href=re.compile('studios/hall/'))


We get a list of BeautifulSoup objects containing links to the hall pages. Further steps:

  1. extract the hall link using the .get('href') method or by indexing with ['href'];
  2. check whether we have already followed this link (necessary when the loop is running);
  3. download the hall's name, link, area and ceiling height;
  4. check that the hall is not a make-up room (its name contains "грим", the Russian word for make-up).

Code of procedure for requesting a list of halls
import numpy as np

def hall_list(studio_ids):
    st_id, hall_ids, name, is_hall, square, ceiling = [], [], [], [], [], []
    for sid in studio_ids:
        url_studio = 'https://ugoloc.ru/studios/show/' + str(sid)
        html = urllib.request.urlopen(url_studio).read()
        soup = BeautifulSoup(html, "html.parser")
        halls_html = soup.find_all('a', href=re.compile('studios/hall/'))
        for hall in halls_html:
            h_id = int(hall.get('href').replace('/studios/hall/', ''))
            if h_id in hall_ids:      # the same hall may be linked more than once
                continue
            st_id.append(sid)
            name.append(hall['title'])
            hall_ids.append(h_id)
            # a make-up room is not a hall: its name contains "грим"
            is_hall.append(0 if 'грим' in hall['title'].lower() else 1)
            url_hall = 'https://ugoloc.ru/studios/hall/' + str(h_id)
            soup_hall = BeautifulSoup(urllib.request.urlopen(url_hall).read(), "html.parser")
            try:
                square.append(int(soup_hall.find_all('div', class_='param-value')[0].contents[0]))
            except Exception:
                square.append(np.nan)
            try:
                ceiling.append(float(soup_hall.find_all('div', class_='param-value')[1].contents[0]))
            except Exception:
                ceiling.append(np.nan)
    return pd.DataFrame.from_dict({
        'studio_id': st_id,
        'hall_id': hall_ids,
        'name': name,
        'is_hall': is_hall,
        'square': square,
        'ceiling': ceiling,
    }).set_index('hall_id')


3. Downloading booking data for the selected week


To download booking data for a week, you need to find the link to the json data. In this case it can be found in the page source by searching for the word "json": the first match (line 27 of the page source) contains a variable holding the calendar link.

Let's check this link: https://ugoloc.ru/studios/calendar/975.json?week=
It works! We see data on booking hours, days and prices.

The default value of the week parameter is 0. To view previous weeks, pass negative values: -1 (last week), -2 (two weeks ago), and so on; to view future weeks, pass positive values accordingly.

Downloading the json booking data
url_booking = 'https://ugoloc.ru/studios/calendar/' + str(id) + '.json?week=' + str(week)
json_booking = json.loads(urllib.request.urlopen(url_booking).read().decode())


In the downloaded json:

  • dates are under the 'days' key;
  • working hours under the 'hours' key;
  • prices under the 'prices' key;
  • the minimum booking period under the 'min_hours' key;
  • bookings under the 'bookings' key.

Collecting reservation data for the selected week
import time

def get_week_booking(id, week=0):
    url_booking = 'https://ugoloc.ru/studios/calendar/' + str(id) + '.json?week=' + str(week)
    json_booking = json.loads(urllib.request.urlopen(url_booking).read().decode())
    booking = {
        'hall_id': id,
        'week': week,
        'monday_date': json_booking['days']['1']['date'],
        'days': json_booking['days'],
        'hours': json_booking['hours'],
        'bookings': json_booking['bookings'],
        'prices': json_booking['prices'],
        'min_hours': json_booking['min_hours'],
        # open if at least one booking exists during the week
        'is_opened': 1 if np.sum([len(json_booking['bookings'][str(x)])
                                  for x in range(1, 8)]) > 0 else 0,
    }
    time.sleep(.1)
    return booking


A separate line in the procedure checks whether at least one booking exists during the week; if there is one, we consider the hall open (is_opened = 1).

4. Downloading historical booking data


To download the historical data, we request the booking data for last week, the week before last, three weeks ago, and so on.

The main problem is determining when the hall opened: even if we request data from 1000 weeks ago, the API still returns a well-formed response.
We therefore formulate a criterion for deciding that the hall was closed: if the hall had no bookings for 2 months in a row (9 weeks), we assume it was not yet operating.

Using this criterion, we write the procedure:
def get_past_booking(id, weeks_ago=500):
    week = -1
    null_period = 9
    flag = 0
    d = dict()
    while flag != 1:
        d[week] = get_week_booking(id, week)
        # stop once the last 9 collected weeks contain no bookings
        if (len(d) > null_period
                and 1 not in [d[-1 * x]['is_opened'] for x in range(len(d) - 9, len(d))]):
            flag = 1
            # drop the trailing empty weeks
            for x in range(0, null_period + 1):
                del d[week + x]
        if week < weeks_ago * -1:
            return d
        week += -1
        time.sleep(1)
    return d


A separate parameter (weeks_ago=500) limits the look-back period to about 10 years (500 weeks), meaning that historical data older than that is of no interest to us (even though the aggregator has only existed since 2015).

In addition, we pause between requests: 0.1 seconds inside get_week_booking and a further second between weeks.
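
This politeness logic can also be centralized. Below is a minimal sketch (the fetch helper is my own, not part of the original code) that pauses before every request and retries transient network errors using only the standard library:

import time
import urllib.error
import urllib.request

def fetch(url, retries=3, delay=0.1, backoff=5):
    # Pause before each request and retry transient network failures
    for attempt in range(retries):
        try:
            time.sleep(delay)
            return urllib.request.urlopen(url).read().decode()
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # back off harder after each failure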

5. Downloading future data


When downloading future booking data, it is important to find the week where the bulk of the bookings ends.

We use a criterion similar to the previous section: if no bookings are found for 2 months in a row (9 weeks), we stop looking at further weeks.

We get the following procedure:
def get_future_booking(id):
    week = 0
    null_period = 9
    flag = 0
    d = dict()
    while flag != 1 and week <= 30:
        d[week] = get_week_booking(id, week)
        # stop once the last 9 collected weeks contain no bookings
        if (len(d) > null_period
                and 1 not in [d[x]['is_opened'] for x in range(len(d) - 9, len(d))]):
            flag = 1
            # drop the trailing empty weeks
            for x in range(0, null_period):
                del d[week - x]
        week += 1
        time.sleep(1)
    return d


6. Decoding the downloaded json data


To decode the json data, I often used try/except blocks: if a value does not decode as the expected type, for example a dictionary (try), we decode it as another expected type, for example a list (except). In hindsight, it is better to branch on the actual data type directly (the type() function) and process each case explicitly.
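
For illustration, here is a sketch of what such type-driven decoding could look like for the price field (decode_prices is my own name, and isinstance is the more idiomatic form of the type check):

def decode_prices(day_prices):
    # Normalize the price field to a plain list, whatever shape it arrives in
    if isinstance(day_prices, dict):
        # prices keyed by hour (assuming numeric or numeric-string keys)
        return [day_prices[k] for k in sorted(day_prices, key=int)]
    if isinstance(day_prices, (list, tuple)):
        return list(day_prices)
    # a single number or string: one price for every hour
    return [day_prices]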

We have written procedures for downloading the booking data for a selected hall. The next task is to convert the json data into a tabular DataFrame for convenient further processing and writing to the database.

To decode the data, we iterate over every day of every week.

Converting the date from text format to date format
import datetime

cur_date = pd.Timestamp(datetime.datetime.strptime(
    d[week]['days'][str(weeks_day)]['date'], '%d.%m.%Y').isoformat())


Booking hours can be presented as text ("12:00") or as numbers (12), and in the case of round-the-clock operation can be indicated by a single closing number (24).

I used try/except to decode the working hours
try:
    try:
        # hours stored as numeric dict keys
        working_hour = [int(x) for x in d[week]['prices'][str(weeks_day)].keys()]
        working_hour_is_text = 1
    except:
        # keys that do not convert to int (e.g. "12:00"): keep them as text
        working_hour = list(d[week]['prices'][str(weeks_day)].keys())
except:
    # not a dict at all: a plain list of prices, one per hour
    working_hour = list(range(len(d[week]['prices'][str(weeks_day)])))


If the booking price is the same regardless of the day of the week and the time, it can be stored as a number or as a string. In addition, the price can be stored as a list or as a dictionary.

To decode the booking price, I used try/except
try:
    # prices stored in a dict: take its values
    price = list(
        d[week]['prices'][str(weeks_day)].values()
        if type(d[week]['prices'][str(weeks_day)].values()) != type(dict())
        else d[week]['prices'][str(weeks_day)]
    )
except:
    # prices stored as a list (or a single value)
    price = list(
        d[week]['prices'][str(weeks_day)]
        if type(d[week]['prices'][str(weeks_day)]) != type(dict())
        else d[week]['prices'][str(weeks_day)]
    )


The booking period can be specified as a list of hours available for booking, or as a single number indicating round-the-clock booking.

Decoding of available time for booking:
try:
    booking_hours = sorted([int(x) for x in d[week]['bookings'][str(weeks_day)]])
    duration = [d[week]['bookings'][str(weeks_day)][str(h)]['duration']
                for h in booking_hours]
except:
    # a single number means round-the-clock booking
    booking_hours = 0
    duration = 24


Booked hours are working hours that are not available for booking. That is, if the hall is open from 10:00 to 22:00 and only the time from 10:00 to 18:00 is available for booking, then the time from 18:00 to 22:00 is considered booked. We use this logic to calculate the booked time.
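
A minimal illustration of this rule on hypothetical inputs (the hall works from 10:00 to 22:00, and only 10:00 to 18:00 is still offered for booking):

# Hypothetical example: derive booked hours from working and available hours
working_hours = set(range(10, 22))   # the hall is open 10:00-22:00
available = set(range(10, 18))       # hours still offered for booking

# Working hours that are no longer offered for booking count as booked
booked = sorted(working_hours - available)
print(booked)  # [18, 19, 20, 21]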

General procedure for decoding the json data:
def hall_booking(d):
    hour = list(range(24))
    df = pd.DataFrame(columns=['hour', 'date', 'is_working_hour', 'price',
                               'duration', 'week', 'min_hours'])
    for week in d.keys():
        for weeks_day in range(1, 8):
            working_hour_is_text = 0
            cur_date = pd.Timestamp(datetime.datetime.strptime(
                d[week]['days'][str(weeks_day)]['date'], '%d.%m.%Y').isoformat())
            # working hours: dict keys (numeric or text) or a plain list
            try:
                try:
                    working_hour = [int(x) for x in d[week]['prices'][str(weeks_day)].keys()]
                    working_hour_is_text = 1
                except:
                    working_hour = list(d[week]['prices'][str(weeks_day)].keys())
            except:
                working_hour = list(range(len(d[week]['prices'][str(weeks_day)])))
            # prices: dict values or a plain list
            try:
                price = list(
                    d[week]['prices'][str(weeks_day)].values()
                    if type(d[week]['prices'][str(weeks_day)].values()) != type(dict())
                    else d[week]['prices'][str(weeks_day)]
                )
            except:
                price = list(
                    d[week]['prices'][str(weeks_day)]
                    if type(d[week]['prices'][str(weeks_day)]) != type(dict())
                    else d[week]['prices'][str(weeks_day)]
                )
            # available booking hours and their durations
            try:
                booking_hours = sorted([int(x) for x in d[week]['bookings'][str(weeks_day)]])
                duration = [d[week]['bookings'][str(weeks_day)][str(h)]['duration']
                            for h in booking_hours]
            except:
                booking_hours = 0
                duration = 24
            min_hours = d[week]['min_hours']
            df_temp = pd.DataFrame(hour, columns=['hour'])
            df_temp['date'] = cur_date
            df_temp['is_working_hour'] = [1 if x in working_hour else 0 for x in hour]
            df_temp['price'] = 0
            if len(working_hour) == 24:
                df_temp['price'] = price
            else:
                df_temp.loc[working_hour, 'price'] = price
            df_temp['duration'] = 0
            if duration != 24:
                # the two original branches here were identical, merged into one
                df_temp.loc[[x in booking_hours for x in df_temp['hour']], 'duration'] = duration
            else:
                df_temp.loc[0, 'duration'] = 24
            df_temp['week'] = week
            df_temp['min_hours'] = min_hours
            df = pd.concat([df, df_temp])
    df = df.sort_values(by=['week', 'date', 'hour'])
    df.index = list(range(len(df)))
    # an hour is booked if a booking that started earlier still covers it
    df['is_booked'] = 0
    for i in df.index:
        if df.loc[i, 'duration'] != 0:
            if i + df.loc[i, 'duration'] < df.index[-1]:
                df.loc[i:(i + int(df.loc[i, 'duration'])) - 1, 'is_booked'] = 1
            else:
                df.loc[i:, 'is_booked'] = 1
    df['hall_id'] = d[np.min(list(d.keys()))]['hall_id']
    return df
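
Pulling the procedures together, a minimal end-to-end run might look like this (a sketch only: error handling, logging and persistence are left out, and make-up rooms are skipped):

studios = studio_list()                               # step 1: all studios
halls = hall_list(studios.index)                      # step 2: all halls
frames = []
for h_id in halls[halls['is_hall'] == 1].index:
    past = get_past_booking(h_id)                     # step 4: look back week by week
    future = get_future_booking(h_id)                 # step 5: look ahead
    frames.append(hall_booking({**past, **future}))   # step 6: json -> DataFrame
bookings = pd.concat(frames)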


Summary


We examined the work of a parser that collects data on hall bookings in Moscow photo studios from the site ugoloc.ru. As a result, the list of photo studios, the list of halls and the list of booked hours were downloaded and converted to DataFrame format. We can already work with this data, but parsing takes a long time, and the downloaded data has to be stored somewhere.

Therefore, in the next article I will describe how to save the collected information to a simple database and retrieve it when needed.

You can find the finished project on my GitHub page.
