Neural networks on guard for traffic rules
Traffic rule violations by drivers carry operational, reputational and legal risks for organizations.
Previously, video recordings from company vehicles were reviewed manually to identify violations. This is routine, time-consuming work, because very large volumes of video have to be watched. We decided to automate the process and build a model that detects traffic violations, so that videos can be selected for review on a risk basis.
To begin with, we decided to look for gross violations such as crossing a double solid line and running a red light.
For image segmentation and road marking detection we used a convolutional neural network with the U-Net architecture. It consists of convolution and pooling layers that first reduce the spatial resolution of the image and then increase it again, at each step concatenating the upsampled feature maps with the feature maps of the same resolution from the contracting part and passing the result through further convolution layers.
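The contracting and expanding structure can be illustrated by tracing feature-map shapes through the network. This is a minimal sketch, not the model used in the project: the function name `unet_shape_trace`, the depth of 4, the 64 base channels and the use of "same" padding (so that convolutions do not shrink the image) are all assumptions chosen for illustration.

```python
def unet_shape_trace(height, width, base_channels=64, depth=4):
    """Trace (stage, channels, height, width) through a U-Net-like network.

    Hypothetical helper: assumes 'same'-padded convolutions, 2x2 pooling
    and channel doubling at every level of the contracting path.
    """
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    trace = []
    skips = []  # encoder feature maps kept for the skip connections
    channels, h, w = base_channels, height, width

    # Contracting path: convolve, remember the map, then halve the resolution.
    for _ in range(depth):
        trace.append(("encoder", channels, h, w))
        skips.append((channels, h, w))
        h, w = h // 2, w // 2
        channels *= 2

    trace.append(("bottleneck", channels, h, w))

    # Expanding path: upsample, concatenate with the stored encoder map,
    # then convolve back down to the encoder's channel count.
    for _ in range(depth):
        skip_channels, skip_h, skip_w = skips.pop()
        h, w = h * 2, w * 2
        assert (h, w) == (skip_h, skip_w), "skip connection shapes must match"
        channels = skip_channels
        trace.append(("decoder", channels, h, w))

    return trace


if __name__ == "__main__":
    for stage in unet_shape_trace(256, 256):
        print(stage)
```

The last decoder stage has the same height and width as the input, which is what allows the network to output a per-pixel mask.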
To train the model we needed a training dataset. Unfortunately, all the openly available datasets we found consisted of photographs of roads outside Russia. Training on foreign roads gave disappointing results: the model often failed to recognize our domestic road markings as markings at all. We therefore decided to build a training set ourselves. About 1,500 frames were extracted from the dashcam videos, and the roadway in them was annotated using the Supervise.ly service (Fig. 1).
Trained on this dataset, the model was able to recognize road markings in our dashcam videos. The network finds solid lines in the video and, if a line contains at least a predetermined number of pixels (so that stray, broken or dashed lines are ignored), approximates it with a straight line that our car must not cross.
Figure 2 shows U-Net at work: at the top is the original recording through the windshield; at the bottom is the network's output, where the green areas are the road marking mask and the thin red lines are the straight-line approximation of the markings.
The model performed very well on most of the dashcam videos, although snowy roads and footage shot in the dark caused difficulties: in some cases the markings are simply not visible.
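The pixel-count filter and straight-line approximation can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names, the threshold of 200 pixels and the choice of an ordinary least-squares fit of x as a function of y (convenient because lane lines run mostly bottom-to-top in a dashcam frame) are all hypothetical.

```python
def fit_marking_line(pixels, min_pixels=200):
    """Fit x = a*y + b to mask pixels given as (x, y) pairs.

    Returns (a, b), or None when the region has too few pixels to be
    treated as a solid line. min_pixels is an illustrative threshold.
    """
    n = len(pixels)
    if n < min_pixels:
        return None

    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pixels)
    if var_y == 0:
        # All pixels lie on one image row: x is not a function of y.
        return None

    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pixels)
    a = cov_xy / var_y
    b = mean_x - a * mean_y
    return a, b


def side_of_line(line, point):
    """Signed horizontal offset of a point from the line.

    Positive means the point is to the right of the line, negative to the left.
    """
    a, b = line
    x, y = point
    return x - (a * y + b)
```

One possible use: take a fixed reference point, such as the bottom centre of the frame, as the position of the car, and flag a frame when the sign of `side_of_line` for that point differs from the sign in the previous frame.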
To detect traffic lights and cars we used a pre-trained YOLO v3 network running on the Darknet framework. It is an improved version of the YOLO architecture, whose name stands for You Only Look Once. The main feature of YOLO v3 is that it has three output layers, each of which detects objects at a different scale.
What sets this architecture apart is that most other detection systems apply a neural network several times to different parts of the image, whereas YOLO applies the network to the whole image once. The network divides the image into a grid and, for each cell, predicts bounding boxes (rectangles enclosing the detected objects) and the probability that an object is present.
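The three output scales can be illustrated with a little arithmetic. The sketch below assumes the standard YOLO v3 configuration, which the article does not state: a 416×416 input, strides of 32, 16 and 8, three anchor boxes per scale and the 80 classes of the COCO dataset. The function name is hypothetical.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80, anchors_per_scale=3):
    """Return (grid, grid, channels) for each of the three YOLO v3 outputs.

    Each anchor predicts 4 box offsets, 1 objectness score and one score
    per class, hence anchors_per_scale * (5 + num_classes) channels.
    """
    if input_size % 32:
        raise ValueError("input_size must be a multiple of 32")

    channels = anchors_per_scale * (5 + num_classes)
    shapes = []
    for stride in (32, 16, 8):  # coarse grid for large objects, fine for small
        grid = input_size // stride
        shapes.append((grid, grid, channels))
    return shapes


if __name__ == "__main__":
    for shape in yolo_v3_output_shapes():
        print(shape)
```

Under these assumptions the coarsest output is a 13×13 grid and the finest a 52×52 grid, so a distant traffic light occupying a few pixels is handled by the finest scale.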
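How a grid cell's raw prediction becomes a bounding box in pixels can be sketched as below. The formulas follow the YOLO v3 parameterization (sigmoid offsets inside the cell, exponential scaling of an anchor box); the function names and the example anchor size are illustrative assumptions, not values taken from the project.

```python
import math


def sigmoid(value):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Convert raw network outputs for one anchor in one grid cell to pixels.

    (cell_x, cell_y) is the cell's column and row in the grid, stride is the
    cell size in pixels, and the anchor size is given in pixels.
    Returns the box as (left, top, right, bottom).
    """
    # The sigmoid keeps the predicted centre inside the responsible cell.
    center_x = (sigmoid(tx) + cell_x) * stride
    center_y = (sigmoid(ty) + cell_y) * stride
    # Width and height rescale the anchor box.
    width = anchor_w * math.exp(tw)
    height = anchor_h * math.exp(th)
    return (
        center_x - width / 2,
        center_y - height / 2,
        center_x + width / 2,
        center_y + height / 2,
    )


if __name__ == "__main__":
    # All-zero outputs place the box at the centre of cell (6, 6) on the
    # 13x13 grid, with exactly the anchor's size.
    print(decode_box(0, 0, 0, 0, cell_x=6, cell_y=6,
                     anchor_w=116, anchor_h=90, stride=32))
```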
The advantage of this approach is that, because it sees the entire image, YOLO takes the image context into account when detecting and recognizing an object. YOLO also has clear speed advantages: according to its authors, it is more than a thousand times faster than R-CNN and about a hundred times faster than Fast R-CNN.
An example of YOLO at work is shown in Figure 3. The video is analyzed frame by frame, and the network correctly detects the red traffic lights that appear in it.
Training two neural networks requires a sufficiently powerful computer, in particular a good graphics card, because the computation runs on the GPU. We used an eighth-generation Intel Core i7 processor, an NVIDIA GeForce GTX 1080 graphics card and 32 GB of RAM. This configuration was sufficient for the project.
Judging by the results of using the violation detection models, the project was a success. The script was given one month of dashcam video with a total duration of 7 hours 11 minutes; inference (processing the incoming video) took 25 minutes. After all the video files had been processed, 112 fragments of 8 seconds each were cut out (about 15 minutes in total), so almost 7 hours of viewing were saved, and violations in the fragments were easy to identify.
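The time saving can be checked with simple arithmetic. The sketch below uses only the figures quoted above; the function names are hypothetical, and the "real-time factor" (video duration divided by processing time) is an extra derived quantity rather than something reported by the project.

```python
def format_hms(seconds):
    """Format a duration in seconds as h:mm:ss."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def review_savings(total_video_s, clip_count, clip_len_s, inference_s):
    """Compare watching all the video with watching only the cut fragments."""
    review_s = clip_count * clip_len_s
    return {
        "review": format_hms(review_s),
        "saved": format_hms(total_video_s - review_s),
        "real_time_factor": total_video_s / inference_s,
    }


if __name__ == "__main__":
    total = 7 * 3600 + 11 * 60  # 7 hours 11 minutes of dashcam video
    print(review_savings(total, clip_count=112, clip_len_s=8,
                         inference_s=25 * 60))
```

With these inputs the fragments add up to 896 seconds (14 minutes 56 seconds), leaving 6 hours 56 minutes of video that no longer has to be watched, and the models process video roughly 17 times faster than real time.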