
The more closely you can examine a potential purchase before paying, the less likely you are to run into unpleasant surprises afterwards, despite unscrupulous sellers and insufficiently detailed descriptions. To make users' expectations match reality more often, more and more online stores are adding 3D photos to their product cards: clothing stores, electronics stores, and even marketplaces. Demand for panoramic photos of cars was only a matter of time and technology, because unlike shoes or a phone, shooting a car requires a lot more space and effort.

Hi! This is Anton Timofeev, product manager at Auto.ru, and Alexander Sapatov, a developer on the Yandex computer vision team. Below the cut, we'll talk about what happens under the hood of our application after you tap the Panorama button, and why an ordinary smartphone is now enough to get a good picture.

What came before


Large car dealers began using panoramas widely around 2017. Strictly speaking, these were not really panoramas: entire shooting pavilions or closed-loop "rails" were built around the car, and a camera was pushed along them until it returned to the starting point. The output was a video in which the car was visible from all sides. In 2018, we added support for this format at Auto.ru and watched the preparation process: dealers bought equipment costing hundreds of thousands of rubles and built auxiliary rigs with a camera and a motor! Beautiful photos and videos are nice to look at even if you are not going to buy a car. It is not surprising that private sellers began to wonder whether this could be repeated at home.

Technology advanced and the process became simpler, but creating a high-quality panorama still required, if not a professional camera, then at least a top-end smartphone. By 2019, we had set ourselves an ambitious goal: to make panoramic car shooting accessible to almost every smartphone owner, simple and fast, with no additional costs, special equipment, or professional skills required.

And we did it.

How it works


In a nutshell: the user shoots a video in our application, and the resulting frames are processed using computer vision and machine learning: noise is removed, the image is aligned, and the panorama is stitched together. It sounds simple, but once you imagine a person walking around a car with a phone, trying to keep the car in the frame, the details start to pile up. For example, the phone wobbles from side to side despite optical stabilization. Obstacles such as pillars, trees, and fences may get in the way, and the subject will disappear from the frame. Gait or walking speed can make the footage uneven. Below we will return to how we deal with these and similar kinds of noise.

So, you opened the application, tapped the “Panorama” button in the ad editing form, and walked around the car with the camera on, carefully following all the instructions. What's next?


Step 0 . The recorded video is sent to our server. First, why gamble on whether your device can cope with fairly heavy calculations when we can take them on ourselves? Second, the more computing power there is, the faster the panorama is generated.

The classic video processing algorithm

Step 1 . You could try to determine the position of the camera using the gyroscope and other sensors built into the phone, but they accumulate measurement error over time: the longer the video, the greater the error. Therefore, we reconstruct points in three-dimensional space instead: by comparing neighboring frames, we assemble a common three-dimensional model of the camera's trajectory. Experienced readers may ask: why not simply segment the car on each frame? Because the mask can come out differently from frame to frame, and the object will jitter.
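
To illustrate why sensor-only tracking degrades with video length, here is a minimal sketch of integrating a gyroscope reading that carries a small constant bias. The walking rate (6 degrees per second, a full circle in a minute) and the bias of 0.5 degrees per second are hypothetical values chosen for illustration, not measurements from the article.

```python
def integrate_heading(rate_dps, bias_dps, duration_s, dt=0.01):
    """Integrate angular rate over time; return (true, estimated) heading in degrees."""
    true_heading = 0.0
    estimated_heading = 0.0
    for _ in range(round(duration_s / dt)):
        true_heading += rate_dps * dt
        # The sensor reports the true rate plus a small constant bias (hypothetical value).
        estimated_heading += (rate_dps + bias_dps) * dt
    return true_heading, estimated_heading


def heading_error(duration_s, bias_dps=0.5):
    """Heading error in degrees after walking for duration_s seconds at 6 deg/s."""
    true_heading, estimated_heading = integrate_heading(6.0, bias_dps, duration_s)
    return estimated_heading - true_heading


for seconds in (10, 30, 60):
    print(seconds, round(heading_error(seconds), 2))
```

The error grows in proportion to the duration, which is why matching points between frames is a more reliable anchor than dead reckoning from sensors alone.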

Step 2 . For the reference frames, the algorithm picks neighboring frames, choosing them so that they cover the panorama evenly around the circle. That is, points that exist in the real 3D world are not computed from scratch for every new frame; they are looked up in neighboring frames.
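
Even coverage around the circle can be sketched as follows. This is an illustration under assumptions, not the production algorithm: it supposes that an approximate camera azimuth (angle around the car, in degrees) is already known for every frame, and the function name `pick_reference_frames` is hypothetical.

```python
def circular_distance(a_deg, b_deg):
    """Smallest angle between two directions, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def pick_reference_frames(azimuths_deg, n_refs):
    """Return indices of the frames closest to n_refs evenly spaced azimuths."""
    chosen = []
    for k in range(n_refs):
        target = k * 360.0 / n_refs
        best = min(
            range(len(azimuths_deg)),
            key=lambda i: circular_distance(azimuths_deg[i], target),
        )
        if best not in chosen:  # avoid picking the same frame twice
            chosen.append(best)
    return chosen


# Uneven walking speed: frames are bunched up in some places and sparse in others.
azimuths = [0, 10, 85, 95, 170, 200, 265, 300]
print(pick_reference_frames(azimuths, 4))
```

Choosing by angle rather than by frame index means that a slow or fast stretch of the walk does not produce gaps or clusters in the panorama.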

In parallel, the algorithm* estimates the camera parameters: after finding the reference points, it optimizes the system of equations (minimizing the position error) by which these points are projected onto the frames where they were found. When solving this system, we require that the camera parameters be the same on every frame, that the camera position differ from frame to frame, and that each point in 3D space be the same on all frames. For frames without reference points, accordingly, only the camera positions are determined.

* We already referred to this article above: it has more details, including the mathematics.
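
The requirements listed above (shared camera parameters, a separate pose per frame, one 3D point for all of its observations) match what is commonly called bundle adjustment. As a sketch, assuming a standard pinhole camera model rather than the exact formulation used in production, the objective can be written like this:

```latex
\min_{K,\ \{R_i, t_i\},\ \{X_j\}} \;
\sum_{(i,j) \in \mathcal{V}}
\bigl\| x_{ij} - \pi\bigl( K (R_i X_j + t_i) \bigr) \bigr\|^2
```

All symbols here are notation introduced for illustration: $K$ holds the camera parameters shared by all frames, $(R_i, t_i)$ is the camera pose on frame $i$, $X_j$ is the $j$-th 3D point, $x_{ij}$ is where that point was observed on frame $i$, $\mathcal{V}$ is the set of (frame, point) pairs where the point was actually found, and $\pi(u, v, w) = (u/w, v/w)$ is the perspective division.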

Step 3 . We have a point cloud: how do we tell which points belong to the car? The technology that builds a point cloud frame by frame and finds the camera position is quite well known; we use OpenSfM for this. What comes next, however, is our own technology, which filters the point cloud and aligns the camera position. The heuristics are as follows:

  • The object must be elongated: in any cross-section, a car is longer than it is tall.
  • The point cloud is dense: a car is hardly see-through.
  • Everything outside the camera’s path (and it’s now known) is noise.
  • The car usually stands on the ground, and the dense cluster of points at the bottom of the frame also usually has a characteristic shape.

In total: we look for a box-shaped (parallelepiped) object, a bounding box (bbox), that contains as many points from the cloud as possible. If there is more than one such object (a pillar or something else got into the frame), the segmentation keeps only the central one.
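
Two of these heuristics can be sketched in a few lines. This is a simplified illustration, not the authors' filter: it assumes an axis-aligned box, a roughly circular camera path, points given as (x, y, z) with z pointing up, and hypothetical function names and a hypothetical `margin` parameter.

```python
import math


def filter_inside_path(points, cameras, margin=0.9):
    """Keep points that lie horizontally inside the camera path (everything else is noise)."""
    cx = sum(c[0] for c in cameras) / len(cameras)
    cy = sum(c[1] for c in cameras) / len(cameras)
    # Use the closest camera position as a conservative radius of the path.
    radius = min(math.hypot(c[0] - cx, c[1] - cy) for c in cameras)
    return [p for p in points if math.hypot(p[0] - cx, p[1] - cy) < margin * radius]


def bounding_box(points):
    """Axis-aligned bounding box as ((min_x, min_y, min_z), (max_x, max_y, max_z))."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def is_elongated(bbox):
    """Check the 'longer than tall' heuristic for a bounding box."""
    (x0, y0, z0), (x1, y1, z1) = bbox
    return max(x1 - x0, y1 - y0) > (z1 - z0)


cameras = [(5, 0), (0, 5), (-5, 0), (0, -5)]
cloud = [(2, 0.8, 1.4), (-2, -0.8, 0), (0, 0, 0.7), (8, 0, 1), (0, -7, 3)]
car_points = filter_inside_path(cloud, cameras)
box = bounding_box(car_points)
print(box, is_elongated(box))
```

A real implementation would also need to handle an arbitrarily oriented box and score candidate boxes by how many points they contain, as the text describes.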

By the way, let us dispel a popular myth. A heterogeneous background is best:

  • For us, because it is easier for the algorithm to tell which points belong to the car and which to the background. Brick walls, chain-link fences, and other repeating textures are the exception: on them the algorithm may match points incorrectly.
  • For you, because according to statistics, cars shot against a lively background get a better response from buyers than showroom interiors with proper lighting.

Step 4 . Get rid of the noise. The main sources are the phone's camera sensor, defocus, and motion blur. Because of them, the same point cannot be located precisely in different frames: for example, with a blur of ±3 pixels, the error in locating the point in three-dimensional space accumulates in different directions.

When the exact position of a point cannot be found in the photographs, the three-dimensional model starts to resemble a hedgehog: pixel jitter in the frames turns into the same kind of jitter in 3D, but over much larger distances. We have to smooth the surface and fight outliers: if there are distant points with no cloud around them, we throw them away as probable noise.
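
Throwing away isolated points can be sketched as a neighbor-count filter. This is one common approach, shown here as an assumption about how such a step might look; the function name, `radius`, and `min_neighbors` are hypothetical, and the brute-force search is only suitable for small clouds.

```python
import math


def remove_isolated_points(points, radius, min_neighbors):
    """Keep only points that have at least min_neighbors other points within radius."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


cloud = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.0, 0.1, 0.0),
    (0.1, 0.1, 0.0),
    (10.0, 10.0, 10.0),  # a lone "hedgehog spine" far from the rest
]
print(remove_isolated_points(cloud, radius=0.5, min_neighbors=2))
```

For clouds of realistic size, a spatial index such as a k-d tree would replace the quadratic loop.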

It is better to see it once:

ITKarma picture

Step 5 . From the point cloud, we return to a picture that the human eye can understand: the panorama.

Once we have approximated the car's bounding box, we can estimate how well it fits into different frames and reject the unsuccessful ones. If the location of the object is known, all the necessary points can easily be projected onto each frame, which is then cropped to the desired size. In parallel, the algorithm has restored the camera parameters, so it knows when the image needs to be scaled (zoomed in or out) or rotated.
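
The frame check can be sketched with a simple pinhole projection. The assumptions here are that the bounding box has already been transformed into the camera's coordinate system for the frame in question (z pointing forward), that the principal point is the image center, and that the focal length is given in pixels; the function names are hypothetical.

```python
from itertools import product


def project(point_cam, focal_px, width, height):
    """Project a 3D point in camera coordinates to pixel coordinates, or None if behind the camera."""
    x, y, z = point_cam
    if z <= 0:
        return None
    return focal_px * x / z + width / 2, focal_px * y / z + height / 2


def visible_fraction(bbox_cam, focal_px, width, height):
    """Fraction of the 8 bounding-box corners that land inside the image."""
    (x0, y0, z0), (x1, y1, z1) = bbox_cam
    inside = 0
    for corner in product((x0, x1), (y0, y1), (z0, z1)):
        uv = project(corner, focal_px, width, height)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            inside += 1
    return inside / 8


centered = ((-1, -0.5, 4), (1, 0.5, 6))
shifted = ((4, -0.5, 4), (6, 0.5, 6))
print(visible_fraction(centered, 1000, 1920, 1080))
print(visible_fraction(shifted, 1000, 1920, 1080))
```

A frame whose fraction falls below some threshold would be a candidate for rejection; the projected corners also give the crop rectangle for the frames that are kept.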

For the seller's safety, we do not forget to hide the license plate: we were among the first in Russia to implement this option. There is a nuance here too: if you look for the plate on each raw frame, the result will jitter, and there is a chance we will miss it somewhere. Therefore, our algorithm looks for the plate on the finished 3D model and projects it onto the panorama.

Of course, if there are not enough suitable frames, the video has to be re-shot, but we try very hard to get the most out of the source material.

Step 6 . The finished panorama is sent to the ad.


A few numbers for scale


  • Currently, about 60 reference frames are used to create a panorama, and about 120 more are "attached" to them. To do this, 15 neighboring frames are selected, and the camera positions and a cloud of matching points are calculated for them. The number of frames is no accident: with it, neither panorama quality nor assembly speed suffers. The first panoramas, made before we had found this balance, took 40 minutes to assemble, which is of course unacceptable.
  • Shooting a panorama of a car usually takes about a minute. On smartphones, a video of that length takes up 100 to 300 MB, which would be a serious blocker for the user, given that it has to be uploaded to our servers. Therefore, as part of the project, we started changing the bitrate and recording format on the fly, reducing the video file to an acceptable 20-40 MB without any loss in image size or in quality for computer vision.
  • On the other hand, we had to adapt how panoramas are served in ads, because users are not ready to wait for 5-6 MB to download each time, given that they look through dozens or even hundreds of ads during a car search session. To solve this, we encode the result into different resolutions and file formats, from a simple set of .jpeg files to the rather rare .webm, in some cases reducing the size of the downloaded panorama to 150 KB.
  • At the moment, you can find more than 3000 car panoramas on our site, created both by dealers (we have not forgotten about them either) and by ordinary sellers.
  • In search results, prospective buyers see "live" ads with panoramas and "stick" to them up to 30% longer. Call conversion for sellers whose cars have a panorama is growing as well.
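
To put the upload sizes from the list above in perspective, the average bitrate of a clip follows directly from its size and duration. The sketch below assumes a 60-second video and 1 MB = 10^6 bytes; the function name is hypothetical.

```python
def average_bitrate_mbps(size_mb, duration_s):
    """Average bitrate in Mbit/s for a file of size_mb megabytes lasting duration_s seconds."""
    return size_mb * 8 / duration_s


for size_mb in (300, 100, 40, 20):
    print(size_mb, "MB ->", round(average_bitrate_mbps(size_mb, 60), 2), "Mbit/s")
```

For a one-minute clip, 100 to 300 MB corresponds to roughly 13 to 40 Mbit/s, while 20 to 40 MB corresponds to roughly 2.7 to 5.3 Mbit/s.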

A panorama lets you see the goods from all sides, so it is almost impossible to hide defects. In addition, cheating is made harder by the fact that the picture is assembled on our servers: you cannot intervene and, for example, retouch defects. In the future, by the way, we plan to give sellers the ability to mark problem areas on panoramas and attach detailed photos to them.

Product != Car. We hope that our experience will be useful in other areas of retail: simple, quick, and free creation of a 3D model for an ad will make life easier for sellers, and the opportunity to examine the item carefully will help buyers.
