Hello, Habr! As you know, datasets are the fuel for machine learning. The sources people usually turn to, and that everyone has heard of, are sites like Kaggle, ImageNet, Google Dataset Search and Visual Genome, but I rarely see anyone collect data from Bing Image Search or Instagram. So in this article I will show how easy it is to get data from these two sources by writing two small Python programs.

Bing Image Search

The first thing to do is follow the link, click the Get API Key button and register using any of the offered accounts (Microsoft, Facebook, LinkedIn or GitHub). Once registration is complete, you will be redirected to the Your APIs page, which should look something like this (the covered-up areas are your API keys):


Let's move on to writing code. We import the necessary libraries:

from requests import exceptions
import requests
import cv2
import os

Next, specify a few parameters: the API key (choose either of the two keys offered), the search terms, the maximum number of images per request, and the endpoint URL:

subscription_key = "YOUR_API_KEY"
search_terms = ['girl', 'man']
number_of_images_per_request = 100
search_url = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

Now we will write three small functions that:
1) Create a separate folder for each search term:

def create_folder(name_folder):
    # Create the folder for a search term if it does not exist yet.
    path = os.path.join(name_folder)
    if not os.path.exists(path):
        os.makedirs(path)
        print('------------------------------')
        print("create folder with path {0}".format(path))
        print('------------------------------')
    else:
        print('------------------------------')
        print("folder exists {0}".format(path))
        print('------------------------------')
    return path

2) Return the contents of the server response as JSON:

def get_results():
    # Uses the module-level search_url, headers and params set in the main loop.
    search = requests.get(search_url, headers=headers, params=params)
    search.raise_for_status()
    return search.json()

3) Write an image to disk:

def write_image(photo):
    # v is the current search result taken from the main loop.
    r = requests.get(v["contentUrl"], timeout=25)
    with open(photo, "wb") as f:
        f.write(r.content)
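To make the data flow concrete, here is a minimal sketch of how the JSON returned by get_results() is consumed. The sample dictionary is hand-made for illustration and contains only the fields used in this article (totalEstimatedMatches, value, contentUrl); a real response carries more fields. The helper name content_urls is hypothetical and is not part of the original script.

```python
def content_urls(results):
    """Return the direct image URLs from one page of search results."""
    # "value" holds the list of image entries; skip entries without "contentUrl".
    return [
        item["contentUrl"]
        for item in results.get("value", [])
        if "contentUrl" in item
    ]


# Hand-made sample that mimics only the fields used in this article.
sample_results = {
    "totalEstimatedMatches": 2,
    "value": [
        {"contentUrl": "https://example.com/images/a.jpg"},
        {"contentUrl": "https://example.com/images/b.png"},
    ],
}

print(content_urls(sample_results))
```

Each of these URLs is what write_image() downloads for the current result.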

Next, we iterate over the search results and try to download each image into the output folder:

for category in search_terms:
    folder = create_folder(category)
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    params = {"q": category, "offset": 0, "count": number_of_images_per_request}
    # The first request is only needed to learn the estimated number of matches.
    results = get_results()
    total = 0

    for offset in range(0, results["totalEstimatedMatches"], number_of_images_per_request):
        params["offset"] = offset
        results = get_results()

        for v in results["value"]:
            try:
                # Take the extension from the end of the image URL.
                ext = v["contentUrl"][v["contentUrl"].rfind("."):]
                photo = os.path.join(folder, "{}{}{}".format(category, str(total).zfill(6), ext))
                write_image(photo)
                print("saving: {}".format(photo))

                # Delete files that OpenCV cannot read as images.
                image = cv2.imread(photo)
                if image is None:
                    print("deleting: {}".format(photo))
                    os.remove(photo)
                    continue
                total += 1
            except Exception as e:
                # Skip images that fail to download or save.
                print("skipping: {}".format(e))
                continue
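Two details of this loop are easy to get wrong: the list of offsets can become very long when totalEstimatedMatches is large, and taking the extension with rfind(".") also picks up any query string at the end of the URL. The sketch below shows one possible way to handle both. The helper names page_offsets and build_filename, the max_images cap and the ".jpg" fallback are assumptions made for illustration, not part of the original script.

```python
import os
from urllib.parse import urlparse


def page_offsets(total_matches, page_size, max_images=None):
    """Return the offsets to request, optionally capped at max_images results."""
    if max_images is not None:
        total_matches = min(total_matches, max_images)
    return list(range(0, total_matches, page_size))


def build_filename(category, index, url, default_ext=".jpg"):
    """Build a name like 'girl/girl000007.jpg' using only the path part of the URL."""
    # urlparse separates the query string, so 'a.jpg?w=100' still yields '.jpg'.
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if not ext:
        ext = default_ext  # assumption: fall back to .jpg when the URL has no extension
    return os.path.join(category, "{}{}{}".format(category, str(index).zfill(6), ext))


print(page_offsets(250, 100))
print(build_filename("girl", 7, "https://example.com/pic.PNG?width=640"))
```

In the loop above, page_offsets(results["totalEstimatedMatches"], number_of_images_per_request) could replace the range() call, and build_filename(category, total, v["contentUrl"]) could replace the two lines that compute ext and photo.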


Instagram

Import the libraries:

from selenium import webdriver
from time import sleep
import pyautogui
from bs4 import BeautifulSoup
import requests
import shutil

As you can see, I use the selenium library, so you need to download geckodriver. On Instagram we will search for images by hashtag; let's take #bird as an example. This hashtag returns about 26 million posts. Copy the path to geckodriver and the link produced by the search, and paste them into the two lines below, respectively:

browser = webdriver.Firefox(executable_path='/path/to/geckodriver')
browser.get('https://www.instagram.com/explore/tags/bird/')

Next, we write six functions that:
1) Log in to the Instagram account. In the lines login.send_keys('') and password.send_keys('') you need to insert your username and password, respectively:

def enter_in_account():
    # The class names in the XPath expressions are generated by Instagram
    # and may differ on your page.
    button_enter = browser.find_element_by_xpath("//*[@class='sqdOP L3NKy y3zKF ']")
    button_enter.click()
    sleep(2)

    login = browser.find_element_by_xpath("//*[@class='_2hvTZ pexuQ zyHYP']")
    login.send_keys('')  # your username
    sleep(1)

    password = browser.find_element_by_xpath("//*[@class='_2hvTZ pexuQ zyHYP']")
    password.send_keys('')  # your password

    enter = browser.find_element_by_xpath("//*[@class=' Igw0E IwRSH eGOV_ _4EzTm ']")
    enter.click()
    sleep(4)

    not_now_button = browser.find_element_by_xpath("//*[@class='sqdOP yWX7d y3zKF ']")
    not_now_button.click()
    sleep(2)

2) Find the first post and click on it:

def find_first_post():
    sleep(3)
    # Screen coordinates of the first post; adjust them for your screen.
    pyautogui.moveTo(450, 800, duration=0.5)
    pyautogui.click()

Note that because screen resolutions differ, the first post may be at different coordinates on your screen, so you may need to change the first two arguments of the moveTo() method.

3) Click the Next button and return the link to the post that opens:

def get_url():
    sleep(0.5)
    # Screen coordinates of the Next button; adjust them for your screen.
    pyautogui.moveTo(1740, 640, duration=0.5)
    pyautogui.click()
    return browser.current_url

A similar problem may arise here as in the previous function: the Next button may be located at different coordinates.
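One possible way to reduce the manual adjustment is to scale the recorded coordinates to the current screen size. The sketch below is only an illustration: the helper scale_point is hypothetical, and the 1920x1080 reference resolution is an assumption, since the article does not state which screen the coordinates were recorded on.

```python
def scale_point(x, y, reference_size, actual_size):
    """Scale a point recorded on one screen to the same relative spot on another."""
    ref_w, ref_h = reference_size
    act_w, act_h = actual_size
    return round(x * act_w / ref_w), round(y * act_h / ref_h)


# Assumption: the coordinates in the article were recorded on a 1920x1080 screen.
REFERENCE_SIZE = (1920, 1080)

# On the same resolution the point is unchanged; on 1280x720 it shrinks proportionally.
print(scale_point(450, 800, REFERENCE_SIZE, (1920, 1080)))
print(scale_point(450, 800, REFERENCE_SIZE, (1280, 720)))
```

With pyautogui, the actual size can be read with pyautogui.size() and the result passed on, for example pyautogui.moveTo(*scale_point(450, 800, REFERENCE_SIZE, pyautogui.size()), duration=0.5). Proportional scaling is only an approximation, because the page layout does not necessarily scale with the screen, so the result should still be checked by eye.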

4) Get the HTML code of the post page:

def get_html(url):
    r = requests.get(url)
    return r.text

5) Get the image URL:

def get_src(html):
    soup = BeautifulSoup(html, 'lxml')
    # The image URL is stored in the <meta property="og:image"> tag.
    src = soup.find('meta', property="og:image")
    return src['content']
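Note that get_src() raises a TypeError when the page has no og:image tag, because soup.find() returns None in that case. The sketch below shows the same lookup on a hand-made HTML sample, returning None instead of failing. To keep it free of third-party dependencies it uses the standard library html.parser module rather than BeautifulSoup; the names OgImageParser and get_src_stdlib are hypothetical.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Remember the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image" and self.image_url is None:
            self.image_url = attrs.get("content")


def get_src_stdlib(html):
    """Return the og:image URL found in html, or None if the tag is missing."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.image_url


# Hand-made sample page, not real Instagram markup.
sample_html = """
<html><head>
<meta property="og:title" content="Bird photo" />
<meta property="og:image" content="https://example.com/bird.jpg" />
</head><body></body></html>
"""

print(get_src_stdlib(sample_html))
```

A caller can then skip posts for which the function returns None instead of stopping the whole run.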

6) Download and save the current image. In the filename variable, specify the path where the image will be saved:

def download_image(image_name, image_url):
    filename = 'bird/bird{}.jpg'.format(image_name)
    r = requests.get(image_url, stream=True)
    if r.status_code == 200:
        # Let requests decode gzip/deflate content before copying the raw stream.
        r.raw.decode_content = True
        with open(filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
        print('Image successfully downloaded')
    else:
        print('Image couldn\'t be retrieved')
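The article does not show the loop that ties these six functions together, so the following is only a sketch of one way it might look. The helper collect_post_urls and its stopping rule (stop when a URL repeats) are assumptions; it takes the URL-producing function as an argument, so it can be tried here with a stand-in instead of a real browser.

```python
def collect_post_urls(get_next_url, limit):
    """Call get_next_url() until `limit` distinct URLs are collected or a URL repeats."""
    urls = []
    seen = set()
    while len(urls) < limit:
        url = get_next_url()
        if url in seen:
            # The page did not change after the click, so we probably reached the end.
            break
        seen.add(url)
        urls.append(url)
    return urls


# Stand-in for get_url(): three distinct fake post URLs, then the last one repeats.
fake_urls = iter([
    "https://www.instagram.com/p/post1/",
    "https://www.instagram.com/p/post2/",
    "https://www.instagram.com/p/post3/",
    "https://www.instagram.com/p/post3/",
])

print(collect_post_urls(lambda: next(fake_urls), limit=10))
```

In the real script this would presumably be preceded by enter_in_account() and find_first_post(), called as collect_post_urls(get_url, limit), and followed by download_image(i, get_src(get_html(url))) for each collected URL.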


In conclusion, I would like to mention the shortcomings of the sources and of the implementation. As for the sources themselves, a large number of images can be collected from them, but the data will have to be sorted, because the images do not always match the search criteria you specified. As for the implementation, the Instagram script uses the pyautogui library, which emulates user actions, so while the program is running you will not be able to use your computer for other tasks. If you have suggestions on how to improve the code, please write them in the comments.
As for the environment, everything was written and run on Ubuntu 18.04. The source code is available on GitHub.