Suppose you have more than 100,000 images that need to be sorted and conveniently published on the web for mass viewing. It can be anything: a gallery of all the art created by mankind (the task I worked on), a historical photo archive of the city of Moscow, stills from movies, a shared archive of holiday photos from a major travel agency, a website for stock illustrations and photos, or the image archive a large media outlet has accumulated over many years, in which you need to restore order and organize navigation and access for employees on the internal network.

Here is how I recommend programming it.

Keywords and their inheritance


The modern approach used by all photo banks and galleries is to give each illustration a set of tags (keywords). I extended this approach in two directions: (1) tags can have inheritance (the user searches for berries and finds a picture tagged “cherry”), and (2) tags can be attached at the level of folders rather than single illustrations.

The disadvantage of tags is that you search by keywords and ignore the plot of the picture. A dragon killed by a girl and a dragon that killed a girl are two different plots, but they produce the same list of words: Dragon, Girl, Death, and Winner (if there was a battle). A keyword-based approach cannot answer the query “Dead Dragon” without also returning the image of the dragon standing victorious over its slain enemy.

Main tags are those visible to the user in the alphabetical index. Additional tags are those the user can reach only by typing their names into the search box. I consider the optimal number of tags to be about 1/75 of the number of images for main tags and 1/195 for additional tags.

Plural tags (riders, mountains, etc.) are written in file names as <tag name>! (i.e. with an exclamation mark). You will also need a dictionary of the ways a tag can be named: plural forms, feminine/masculine forms, synonyms, misspellings.

Keep the tag dictionary in 4 files: Marks.csv for main tags, Other.csv for additional tags, Wrong.csv for misspellings, synonyms, and plural tag names, and Artists.csv for authors. In Marks.csv and Other.csv, each line holds the tag identifier, then the main name in Russian, then the list of parent tags (i.e. inheritance).

Marks.csv

Arwen;Арвен (Властелин Колец);Person,Girl,Elf,LordOfTheRings
ThorinOakenshield;Торин Дубощит;Person,Male,Beard,LordOfTheRings

This says that Arwen is a person, a girl, an elf, and a character of "Lord of the Rings"; Thorin Oakenshield is a person, a man, wears a beard, and is a character of "Lord of the Rings". Accordingly, when a user searches for “Lord of the Rings”, all images of Arwen and Thorin will be found. A search for “beard” will return Thorin among others. A search for “Thorin” will also find him, since this abbreviated spelling is listed in Wrong.csv.
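The inheritance lookup described above can be sketched roughly as follows. This is a minimal illustration, not the author's Delphi code: it assumes the `id;name;parents` layout of the two Marks.csv rows shown, and the image names and the `load_parents`, `ancestors`, and `search` helpers are hypothetical.

```python
import csv
import io

# Two rows in the Marks.csv layout shown above: id;display name;parent ids
MARKS_CSV = """Arwen;Арвен (Властелин Колец);Person,Girl,Elf,LordOfTheRings
ThorinOakenshield;Торин Дубощит;Person,Male,Beard,LordOfTheRings
"""


def load_parents(text):
    """Map each tag id to the set of its direct parent tag ids."""
    parents = {}
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        tag_id, parent_list = row[0], row[2]
        parents[tag_id] = {p for p in parent_list.split(",") if p}
    return parents


def ancestors(tag_id, parents):
    """Return the tag itself plus every tag it inherits from, transitively."""
    seen = set()
    stack = [tag_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def search(query_tag, image_tags, parents):
    """Return images whose tags include query_tag directly or by inheritance."""
    return sorted(
        image
        for image, tags in image_tags.items()
        if any(query_tag in ancestors(tag, parents) for tag in tags)
    )


PARENTS = load_parents(MARKS_CSV)
# Hypothetical images and their tags, for illustration only.
IMAGES = {"arwen_01.jpg": ["Arwen"], "thorin_01.jpg": ["ThorinOakenshield"]}
```

With this data, `search("Beard", IMAGES, PARENTS)` returns only the Thorin image, while `search("LordOfTheRings", IMAGES, PARENTS)` returns both.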

Folder Structure


If you select “show girls” or “show sun” across 100,000 images, the number of results will be excessively large. This does not happen if the images are divided into folders. For example, the root directory contains a Dragons folder, inside it is a Yellow folder, inside that is a Girls folder (i.e. images that have girls), and it holds 200 images (counting all its subfolders). In this case the search results show not these 200 images but the folder containing them. This is better for the user.

Here, however, there is the problem of closely related but not identical tags. Kings almost always wear crowns in images, but not in every case. Suppose there is a Kings folder with 3000 images, 2500 of which show crowns. For the “crown” tag, the simple approach of showing the folder does not work.
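One way to picture the folder-instead-of-images rule, and why the crown case breaks it, is the sketch below. It is an assumption on my part, not the author's algorithm: a folder is shown only when every image under it matches the query, and otherwise the matching images are listed one by one. The gallery data and the `collapse_results` name are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def collapse_results(image_tags, query):
    """Return (folders, images): fully matching folders plus leftover matches."""
    total, matched = Counter(), Counter()
    for path, tags in image_tags.items():
        for folder in PurePosixPath(path).parents:
            if str(folder) == ".":
                continue  # skip the gallery root itself
            total[str(folder)] += 1
            if query in tags:
                matched[str(folder)] += 1

    # Folders in which every image (including subfolders) matches the query.
    full = {f for f in total if matched[f] == total[f]}
    # Keep only the topmost fully matching folders.
    folders = sorted(
        f for f in full
        if not any(str(p) in full for p in PurePosixPath(f).parents)
    )
    # Matching images that are not already covered by a returned folder.
    images = sorted(
        path for path, tags in image_tags.items()
        if query in tags
        and not any(str(p) in full for p in PurePosixPath(path).parents)
    )
    return folders, images


# Hypothetical gallery: image path -> tags that apply to it.
GALLERY = {
    "Dragons/Yellow/Girls/a.jpg": {"Dragon", "Girl"},
    "Dragons/Yellow/Girls/b.jpg": {"Dragon", "Girl"},
    "Dragons/Yellow/Boys/c.jpg": {"Dragon", "Boy"},
    "Kings/d.jpg": {"King", "Crown"},
    "Kings/e.jpg": {"King"},
}
```

Here a search for `Girl` yields the single folder `Dragons/Yellow/Girls`, while a search for `Crown` cannot return the `Kings` folder and falls back to the individual image.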

I consider the optimal number of folders to be about 1/28 of the number of images.

If a file already lies in the Dragons/Yellow/Girls folder, these tags do not need to be repeated in the file name. Add to the file name only the tag identifiers that do not follow from its storage location.
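The rule about not repeating folder tags in file names could be checked along these lines. The sketch rests on two assumptions that the article does not confirm: that tag identifiers in a file name are separated by `-`, and that the tags each folder stands for are available as a lookup table (`FOLDER_TAGS` here is hypothetical, as is `PARENTS`).

```python
from pathlib import PurePosixPath

# Hypothetical data: tags each folder stands for, and parent tags (inheritance).
FOLDER_TAGS = {
    "Dragons": {"Dragon"},
    "Dragons/Yellow": {"Yellow"},
    "Dragons/Yellow/Girls": {"Girl"},
}
PARENTS = {"Girl": {"Person"}, "Cherry": {"Berry"}}


def expand(tags, parents):
    """Add every inherited (ancestor) tag to the given set of tags."""
    result, stack = set(), list(tags)
    while stack:
        tag = stack.pop()
        if tag not in result:
            result.add(tag)
            stack.extend(parents.get(tag, ()))
    return result


def folder_implied_tags(path):
    """Tags that follow from where the file is stored."""
    implied = set()
    for folder in PurePosixPath(path).parents:
        implied |= FOLDER_TAGS.get(str(folder), set())
    return expand(implied, PARENTS)


def file_name_tags(path):
    """Assumed convention: tag ids in the file stem, separated by '-'."""
    return set(PurePosixPath(path).stem.split("-"))


def effective_tags(path):
    """All tags of a file: from its folders, its name, and inheritance."""
    return folder_implied_tags(path) | expand(file_name_tags(path), PARENTS)


def redundant_tags(path):
    """File-name tags that already follow from the folder and can be dropped."""
    return file_name_tags(path) & folder_implied_tags(path)
```

For `Dragons/Yellow/Girls/Cherry-Girl.jpg` the check reports `Girl` as redundant, because the folder already implies it.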

[Illustration: screenshot of a gallery folder view]

Multilingualism, icons, texts, virtual subfolders


Inside each folder there is a _.jpg file, 200 (width) x 280 (height) pixels. It is the folder's icon (the folder name is drawn on top of it), shown both when the user is in the parent folder and when the folder appears in search results. Keyword icons have the same resolution.

Many folders also contain a _.txt file consisting of lines like these:

Artefact\_.txt (snippet)

Миелофон=Mielofon
Мьёльнир=Mjolnir
Палантир=Palantir
Перчатка Таноса=ThanosGlove
Склянки=Glass-Potion
by-DavisonCarvalho=*
TheWitcher/Wolf-Head-Logo|Амулет Ведьмака
DisneyPrincess/Moana/HeartOfTeFiti|Сердце Те Фити
SuperHeroes/Hellraiser/HellraiserBox|Шкатулка Лемаршана
-m|Artefact

Here we see the types of entries:

  1. Склянки=Glass-Potion (“Flasks”) - an alias for a subfolder. In the illustration above, no alias was recorded for the Japan folder, so its name is not translated into Russian when the folder is viewed. Here two tags, Glass and Potion, are translated with a single word.
  2. by-DavisonCarvalho=* - no alias is required.
  3. SuperHeroes/Hellraiser/HellraiserBox|Шкатулка Лемаршана (“Lemarchand's Box”) - a virtual subfolder. A subfolder located in another directory is also displayed here under the given name.
  4. -m|Artefact - the folder represents the tag “Artefact”. If a text is attached to this tag, it is shown under the illustrations.
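The four entry types listed above could be parsed roughly like this. It is a sketch based only on my reading of the snippet (display name on the left of `=`, folder name on the right, `*` meaning "no alias"); the `FolderMeta` structure and `parse_folder_txt` function are hypothetical names, not part of the author's engine.

```python
from dataclasses import dataclass


@dataclass
class FolderMeta:
    aliases: dict   # subfolder name -> display name
    no_alias: set   # subfolders shown under their own name
    virtual: dict   # path of a folder elsewhere -> display name here
    tag: str = ""   # tag this folder represents, if any


def parse_folder_txt(text):
    """Parse the contents of a _.txt file into a FolderMeta record."""
    meta = FolderMeta(aliases={}, no_alias=set(), virtual={})
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "|" in line:
            left, right = (part.strip() for part in line.split("|", 1))
            if left == "-m":
                meta.tag = right            # "-m|Artefact": folder is a tag
            else:
                meta.virtual[left] = right  # virtual subfolder
        elif "=" in line:
            left, right = (part.strip() for part in line.split("=", 1))
            if right == "*":
                meta.no_alias.add(left)     # "name=*": no alias required
            else:
                meta.aliases[right] = left  # "display name=folder name"
    return meta


SAMPLE = """Миелофон=Mielofon
Склянки=Glass-Potion
by-DavisonCarvalho=*
SuperHeroes/Hellraiser/HellraiserBox|Шкатулка Лемаршана
-m|Artefact
"""
META = parse_folder_txt(SAMPLE)
```

After parsing the sample, `META.aliases["Glass-Potion"]` holds the Russian display name and `META.tag` is `"Artefact"`.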

Disk size


The 111,000 images currently occupy 65GB of disk space, even though in many cases they had to be re-saved in the heavier PNG format:

  • If the image has margins (and no stroke or part of the subject extends into them), the margins must be cropped in Paint.
  • If the image carries a foreign watermark from a photo site such as Pikabu, the watermark is cleaned off in Photoshop.
  • If it is in .webp format, it simply has to be re-saved as .png, otherwise my program cannot make thumbnails (yes, I know, I could have added code for that).
  • If the format is anything other than .png, .jpg, or .gif. I am against an excessive variety of formats.

Site structure - files and folders


index.php - launched without parameters, it displays the gallery root folder, the alphabet, and the search box. Clicking a subfolder in the root folder opens it. Clicking a letter of the alphabet opens the main tags starting with that letter. Entering text in the search box opens the tag identified by that text.

i.php - a tool for viewing one selected image. It lets you jump to any of the tags that the image matches.
img - web gallery root folder
m - folder with generated thumbnails of all images. Thumbnails have a height of 200, with the width following the proportions of the image. The structure of the m folder mirrors that of the img folder. The m folder is created programmatically before each version of the gallery is uploaded.
Tags - for each keyword, contains a file with the result of searching the folders for it.
Marks - file types:

  1. For each keyword, contains its thumbnail file
  2. For most keywords, contains a file with their text description or thematic history, anecdote
  3. For some keywords, contains one or more html-text subject stories
  4. Files named <letter code>.txt are also stored in this folder: alphabetically ordered lists of keywords for each letter of the Russian alphabet
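The thumbnail geometry and the mirrored m folder described above come down to a little arithmetic. The sketch below covers only that calculation, not the actual resizing (which the author does with Graphics32); the function names are hypothetical.

```python
from pathlib import PurePosixPath

THUMB_HEIGHT = 200  # thumbnail height stated in the article


def thumb_size(width, height, target_height=THUMB_HEIGHT):
    """Scale (width, height) to the target height, keeping proportions."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    new_width = max(1, round(width * target_height / height))
    return new_width, target_height


def thumb_path(image_path, src_root="img", dst_root="m"):
    """Map img/<sub>/<file> to m/<sub>/<file>, mirroring the folder structure."""
    relative = PurePosixPath(image_path).relative_to(src_root)
    return str(PurePosixPath(dst_root) / relative)
```

For example, a 1600 x 800 image gets a 400 x 200 thumbnail, and `img/Dragons/Yellow/a.jpg` maps to `m/Dragons/Yellow/a.jpg`.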

The procedure for uploading a new version of the gallery to the site


A specially written program (using Delphi and the Graphics32 library) does the following:

  1. Gallery folder check: no stray characters; file tags are correct and not redundant (taking their hierarchy into account); no synonyms or misspelled keywords in file names (checked against Wrong.csv); _.txt files are well formed; every folder has a thumbnail; no files have invalid names.
  2. Recreate thumbnails for all images. At this stage it often turns out that some files have the wrong extension: .jpg instead of .png, etc.
  3. Generate search results for each keyword. Check that the main keywords have thumbnails. Some keywords are exceptions handled separately: they get a specially prescribed selection.
  4. Generate keyword lists for the letters of the Russian alphabet.

Then, both the gallery folder and these materials are uploaded to the server.

The web gallery engine does not use a DBMS.
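Working without a DBMS is possible because the search results are computed ahead of time and stored as files. A minimal sketch of that idea follows; the one-path-per-line file format and the function names are my assumptions, since the article does not describe the actual layout of the files in Tags.

```python
from collections import defaultdict
from pathlib import Path


def build_index(image_tags):
    """Invert {image path: tags} into {tag: sorted list of image paths}."""
    index = defaultdict(list)
    for path, tags in image_tags.items():
        for tag in tags:
            index[tag].append(path)
    return {tag: sorted(paths) for tag, paths in index.items()}


def write_index(index, out_dir):
    """Write one plain-text file per tag, one result path per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for tag, paths in index.items():
        (out / f"{tag}.txt").write_text("\n".join(paths) + "\n", encoding="utf-8")


def lookup(tag, out_dir):
    """What a server-side script would do per request: read one small file."""
    result_file = Path(out_dir) / f"{tag}.txt"
    if not result_file.exists():
        return []
    return result_file.read_text(encoding="utf-8").splitlines()
```

The trade-off is that every change to the gallery requires regenerating these files, which fits the monthly full-upload cycle described here.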

Hosting


I use Avahost hosting; 100GB of disk space costs 500 rubles a month. With a 65GB collection plus thumbnails and so on, and only 100GB of hosting, updates are not seamless: there is not enough space to upload a complete new version first and then switch over to it, so each update inevitably takes the website offline for several hours. I currently update once a month.

Files are sent to the hosting as archives. The cPanel system, now used by practically all hosting services, can only unpack zip archives. It is advisable to keep each archive under 2.5GB; otherwise, after the file is uploaded through the cPanel web interface, the progress bar (initially blue) may turn red rather than green. I never understood what the difference is (the file seems to upload normally even then), but when it happens I re-upload the file. For some folders this means splitting the folder across several separate zip archives.

Earlier I tried hosting at home. I bought a used netbook on Avito for 2000 rubles and configured it; everything worked. A couple of days later it stopped working. Rebooting did not help. Then it worked again, then it did not. I replaced the netbook (I bought another, more powerful one, also on Avito, for 3000 rubles) and switched to different software: the same thing. I changed providers three times (Seven Sky > Akado > MGTS): the same thing. In short, the providers' equipment apparently cuts off home hosting, and the providers themselves do not know about it. Or there are other reasons. Go to a hosting company; do not host at home. Indie hosting sucks. Even a primitive relay server for network games is better written in PHP and put on hosting than kept at home or in the office, waiting for it to break for no apparent reason.
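Splitting a folder into zip archives that stay under the 2.5GB mark can be automated. The sketch below is one possible approach, not the author's tool: it groups files by their on-disk size and stores them uncompressed, so the archive size is only approximately the sum of the file sizes. The function names and the archive naming scheme are hypothetical.

```python
import os
import zipfile

MAX_ARCHIVE_BYTES = int(2.5 * 1024**3)  # the 2.5GB limit suggested above


def plan_archives(file_sizes, limit=MAX_ARCHIVE_BYTES):
    """Group (path, size) pairs, in order, into batches that stay within limit.

    A single file larger than the limit still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in file_sizes:
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_archives(folder, prefix, limit=MAX_ARCHIVE_BYTES):
    """Zip every file under folder into prefix-1.zip, prefix-2.zip, ..."""
    base = os.path.dirname(os.path.abspath(folder))
    sizes = []
    for root, dirs, files in os.walk(folder):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            full = os.path.join(root, name)
            sizes.append((full, os.path.getsize(full)))
    names = []
    for number, batch in enumerate(plan_archives(sizes, limit), start=1):
        archive = f"{prefix}-{number}.zip"
        # Images are already compressed, so store them without recompression.
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_STORED) as zf:
            for full in batch:
                zf.write(full, os.path.relpath(full, base))
        names.append(archive)
    return names
```

Paths inside each archive are kept relative to the folder's parent, so unpacking the parts one after another in the same place rebuilds the original folder.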

A handy tip (about hosting)


Besides technical characteristics (of which only one really matters, the number of gigabytes; every hoster measures the rest on its own scale, and I concluded that Avahost's are better), there is one more parameter: abuse resistance. An “abuse” is a complaint. A reason for a complaint can arise out of the blue, for example from Artemy Lebedev's studio. So a normal hosting service has this kind of fault tolerance, a resistance to complaints. (Not to be confused with special hosting services where you can place anything at all, even a phishing page imitating Sberbank's personal account login; those are separate outfits, and I know nothing about them.)

Monetization


Let's say you are a large media outlet and have decided to make public a significant part of the photographs you have accumulated over decades, for example using the technology described above. How can you make money from this (other than branding the photos with an overprinted logo, or selling them)? If you are a media outlet, you already know; I will explain for everyone else.

Most monetization schemes pay about 10 kopecks per average daily visitor (counting both those who visited once and those who came several times a day). YAN (the Yandex Advertising Network) pays the site owner about the same. To earn more, you would have to draw people into religious sects or sell miracle talismans; I do not do this. Aggregators of such ads are easy to find online, and they pay per result (a person bought a Kirby vacuum cleaner or joined a sect). What is galling is that I refuse to do this, yet Yandex pushes the same kind of ads through my site anyway. As a result, junk still gets sold to people at a high price (via Yandex), but I receive 6-10 times less for it.

Many of my friends have an ad blocker or something similar enabled by default, so Yandex ads are not visible to them. They did not install it themselves. Why this happens, I do not know.

Yandex lets you withdraw money once the balance reaches 3000 rubles.

The site owner can also register at miralinks.ru and post articles. The article's address and the links to it must remain in place forever, so make sure in advance that hosting them is not too toxic. It is acceptable for new articles to push earlier ones onto later pages of the archive.

You can also sell banner placement and other things that fit the theme of the resource.

Where can I see this technology in action (what project am I doing)?


I run the site corchaosis.ru, a kind of graphical analogue of a wiki.

Why I have not yet managed to promote it (as I see it):

- People only need a means of accomplishing achievements.

Even when people go to an art gallery to look at paintings, what matters to them is still a tangible achievement: “I visited the Tretyakov Gallery.” “I saw Swan Lake.”

If a web resource does not bring a person closer to such achievements, people do not come to it.
People themselves may think otherwise, that they simply like pictures. It does not matter. If we build something for people, we must be “more complex” than people: understand and be aware of more. If a fox eats chickens and mice, the fox has to be more sophisticated than the chicken. Fox-level results cannot be achieved from a chicken's level of perception.

- People need interactivity.

Web 1.0 is dead.

If you cannot offer interactivity, then nobody needs you.

People do not come just to look. This is again about achieving results. A cowboy does not go into the jungle as a tourist; he goes there to establish his own ranch. As long as the site has no tools for building one's own ranch (a portfolio, etc.), cowboys are not interested in the jungle.

Where to get the finished engine


In principle, I have described everything needed to build it. You can also write to me.

The local .exe program is written in Delphi + Graphics32; the server side is two .php files.
