Calculating factors in antifraud. A Yandex talk
Andrey Popov (Nox_andry), head of antifraud at Yandex, gave a talk on how we managed to meet all of these conflicting requirements. The central topic is a model for computing complex factors over a data stream while keeping the system fault-tolerant. Andrey also briefly described the next, even faster iteration of antifraud, which we are currently developing.
The antifraud team essentially solves a binary classification problem, so the talk may interest not only anti-fraud specialists but also anyone building systems that need fast, reliable and flexible factors over large volumes of data.
— Hi! My name is Andrey. I work at Yandex, where I lead the development of antifraud. I was told that people prefer the word “features,” so I will use it throughout the talk, though the title and introduction keep the old word, “factors.”
What is antifraud?
What is antifraud in general? It is a system that protects users from a negative impact on the service. By negative impact I mean targeted actions that can degrade the quality of the service and, accordingly, worsen the user experience. These can be fairly simple parsers and robots that skew our statistics, or targeted, complex fraudulent activity. The second is, of course, harder and more interesting to detect.
What does antifraud fight against? A couple of examples.
For example, imitating user actions. This is what the people we call “black SEO” do, those who don't want to improve the quality of their site and its content. Instead, they write robots that go to Yandex search and click on their site, expecting it to rise higher in the ranking. Just in case, let me remind you that such actions violate the user agreement and can lead to serious sanctions from Yandex.
Or, for example, fake reviews. Such a review can be seen on Maps for an organization that installs plastic windows, and that paid for the review itself.
The top-level antifraud architecture looks like this: a set of raw events enters the antifraud system as a black box, and labeled events come out.
Yandex has many services. All of them, especially large ones, in one way or another encounter different types of fraud. Search, Market, Maps and dozens of others.
Where were we two or three years ago? Each team survived the onslaught of fraud as best it could: each spun up its own antifraud team and its own systems, which didn't always work well and weren't very convenient for analysts to work with. And most importantly, they were poorly integrated with each other.
I want to tell you how we decided to create a single platform.
Why do we need a single platform? Reuse of experience and data. Centralizing experience and data in one place lets you respond to large attacks faster and better; they are usually cross-service.
A single toolkit: people get the tools they are used to. And, obviously, connection speed: if a new service launches and is under active attack, we must quickly connect a high-quality antifraud to it.
We can say that we are not unique in this regard. All large companies face similar problems. And everyone with whom we communicate comes to the creation of their single platform.
Let me tell you a little about how we classify antifraud systems.
There can be an offline part that runs for hours or days and hosts heavy offline processes: for example, complex clustering or complex retraining. I will hardly touch on that part in this talk. There is a near-real-time part that works on the order of minutes. It is a kind of golden mean: it reacts quickly and can still afford heavy methods. I will focus on it first of all. But it is equally important to say that at this stage we use data from the stage one level higher.
There is an online part, needed where a quick reaction is required and fraud must be weeded out before we accept the event and pass it on to the user. Here we reuse data and machine-learning models computed at the higher levels.
I'll talk about how this single platform works, about the language for describing features and interacting with the system, and about our way of increasing speed, that is, about the transition from the second stage to the third.
I will hardly touch on the ML methods themselves. I will mainly talk about the platform that computes the features we then use in training.
Who could find this interesting? Obviously, those who write antifraud or fight scammers. But also those who simply process a data stream and compute features for ML. Since we built a fairly general system, maybe some of this will be useful to you.
What are the system requirements? There are a lot of them, here are some of them:
- Big data stream. We process hundreds of millions of events in five minutes.
- Fully configurable features.
- Declarative language for describing factors.
- Of course, cross-DC and exactly-once data processing, which some services need.
- Convenient infrastructure, both for analysts who select the final features, train models and so on, and for developers who maintain the system.
- And, of course, speed.
I will talk about each of these points separately.
Since for security reasons I can't talk about real services, let me introduce a new Yandex service. Actually no, forget that: it's a fictional service I came up with for the examples. Let it be a service where people have a database of all existing books. They come in, give ratings from one to ten, and attackers want to influence the final rating so that their books get bought.
All resemblance to real services is, naturally, coincidental. Let's first consider the near-real-time version, since the online one is not specifically needed here at a first approximation.
Yandex has a classic way to solve big-data problems: MapReduce. We use our own MapReduce implementation called YT. By the way, Maxim Akhmedov is giving a talk about it tonight. You could use your own implementation or an open-source one like Hadoop.
Why don't we use the online version right away? It is not always needed, and it complicates recomputing the past. If we add a new algorithm or new features, we often want to recompute data in the past to change verdicts on it. It is harder to use heavy methods online, and I think it's clear why. And the online version can, for several reasons, be more demanding on resources.
If we use MapReduce, we get something like this. We use minibatching, that is, we divide the stream into the smallest batches possible; in this case, one minute. Those who work with MapReduce know that with batches smaller than this, the overhead of the system itself becomes too large: roughly speaking, it won't be able to keep up with one-minute processing.
Then we run a set of Reduces over this set of batches and get labeled batches.
In our tasks it is often necessary to know the exact value of a feature. For example, if we want to compute the exact number of books a user has read in the last month, we compute this value for each batch and must store all the collected statistics in a single place, then remove old values from it and add new ones.
Why not use approximate methods? Short answer: we use them too, but sometimes in antifraud tasks it is important to have the exact value over certain intervals. For example, the difference between two and three books read can be very significant for certain methods.
As a result, we need a large data history in which we will store these statistics.
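To make this concrete, here is a minimal Python sketch (not our production code) of an exact sliding-window statistic: the history keeps one bucket per batch, so old values can be removed exactly instead of being approximated.

```python
from collections import deque

class SlidingWindowCounter:
    """Exact count over the last `window_batches` batches: keep one bucket
    per batch in the history, evict expired buckets, sum the rest."""

    def __init__(self, window_batches):
        self.window = window_batches
        self.buckets = deque()  # (batch_id, count), oldest first

    def add_batch(self, batch_id, count):
        # Append the freshly computed per-batch value...
        self.buckets.append((batch_id, count))
        # ...and drop buckets that fell out of the window.
        while self.buckets and self.buckets[0][0] <= batch_id - self.window:
            self.buckets.popleft()

    def value(self):
        # The exact statistic is the sum of all live buckets.
        return sum(count for _, count in self.buckets)

# Books read per one-minute batch for a single user.
counter = SlidingWindowCounter(window_batches=3)
counter.add_batch(1, 2)
counter.add_batch(2, 0)
counter.add_batch(3, 5)
assert counter.value() == 7  # batches 1..3
counter.add_batch(4, 1)      # batch 1 falls out of the window
assert counter.value() == 6  # batches 2..4
```

The same shape works for any additive statistic; non-additive ones just keep richer per-batch state in each bucket.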
Let's try the naive approach first. We have a one-minute batch and a big old history. We feed them into a Reduce and get back an updated history and a labeled log of data.
Those of you who work with MapReduce probably know that this can work quite poorly: if the history can be hundreds, thousands, or even tens of thousands of times larger than the batch itself, such processing runs in time proportional to the size of the history, not the size of the batch.
Let's replace it with a key-value store. This is again our own implementation: a key-value store that keeps data in memory. Probably the closest analogue is something like Redis. But we get a slight advantage here: our key-value store implementation is tightly integrated with MapReduce and the MapReduce cluster it runs on. This gives convenient transactionality and convenient data transfer between them.
The general scheme is that in each job of this Reduce we go to the key-value store, update the data and, after forming a verdict, write it back.
We get a history that lets us process only the necessary keys and scales easily.
A little about how we configure features. Simple counters are often not enough. To find scammers you need quite diverse features, and you need a smart, convenient system for configuring them.
Let's break it into three stages:
- Extract, where we extract the data for a given key from the log.
- Merge, where we merge this data with statistics that are in history.
- Build, where we form the final value of the feature.
Let's, for example, calculate the percentage of detective stories read by the user.
If a user reads too many detective stories, he is suspicious: you never know what to expect from him. Then Extract is taking the number of detective stories the user read in this batch. Merge is combining this with the data from all the batches over the month. And Build is some sum.
Then we do the same for all the books he has read, and in the end we divide one by the other.
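The three stages can be sketched in plain Python roughly like this; the function names and the (detectives, total) state layout are invented for illustration.

```python
# One feature is described by three stages: Extract, Merge, Build.
# The history stores per-key intermediate statistics merged across batches.

def extract_detective_share(batch_events):
    """Extract: per-user (detectives, total) counts from one batch of
    (user, genre) events."""
    stats = {}
    for user, genre in batch_events:
        det, total = stats.get(user, (0, 0))
        stats[user] = (det + (genre == "detective"), total + 1)
    return stats

def merge(history, batch_stats):
    """Merge: fold the batch statistics into the stored history."""
    for user, (det, total) in batch_stats.items():
        h_det, h_total = history.get(user, (0, 0))
        history[user] = (h_det + det, h_total + total)
    return history

def build(history, user):
    """Build: form the final feature value, the share of detectives read."""
    det, total = history.get(user, (0, 0))
    return det / total if total else 0.0

history = {}
merge(history, extract_detective_share([("alice", "detective"), ("alice", "romance")]))
merge(history, extract_detective_share([("alice", "detective")]))
assert build(history, "alice") == 2 / 3
```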
What if we want to count distinct values, for example, the number of different authors a user reads?
Then we can take the number of different authors the user read in this batch. Next, we keep a structure that maps each author to the last time the user read them. If we meet this author again for the user, we update this time; and if we need to delete old events, we know exactly what to delete. To compute the final feature, we simply count the number of keys in this structure.
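A minimal sketch of that structure, with invented names: a per-user map from author to the last time the user read them, with eviction of stale entries.

```python
def update_authors(last_seen, batch, now, ttl):
    """last_seen maps author -> last time this user read them.
    Fold in a batch of (author, timestamp) events, evict entries older
    than `ttl`, and return the feature: the number of distinct authors."""
    for author, ts in batch:
        last_seen[author] = max(ts, last_seen.get(author, ts))
    for author in [a for a, ts in last_seen.items() if ts < now - ttl]:
        del last_seen[author]  # old event: we know exactly what to delete
    return len(last_seen)      # the feature is just the number of keys

last_seen = {}
n = update_authors(last_seen, [("tolkien", 100), ("pratchett", 150)], now=150, ttl=100)
assert n == 2
n = update_authors(last_seen, [("tolkien", 300)], now=300, ttl=100)
assert n == 1  # pratchett's last read (150) is older than 300 - 100
```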
But in a noisy signal, features over a single slice are not enough; we need a system for joining them, gluing together features computed over different slices.
Let's, for example, introduce these slices: user, author and genre.
Let's compute something more complicated, for example, the average loyalty of an author. By loyalty I mean that the users who read the author read almost only him. That is, among the users who read this author, the average number of different authors they read is quite low.
This may be a potential signal. Of course, it may just mean that the author is like that: he is surrounded by fans, and everyone who reads him reads only him. But it may also mean that the author himself is trying to cheat the system and creates fake users who supposedly read him.
Let's try to compute it. First we count a feature: the number of different authors per user over a large interval. For example, here the second and third values seem suspicious to us; there are too few of them.
Then, for each author, we compute the average of this value over the users connected to him over a large interval. And here the average is again quite low: 3. For some reason, this author seems suspicious to us.
And we can propagate it back to the user, to see that this particular user has a connection with an author who looks suspicious to us.
Clearly, this alone cannot be an explicit criterion that the user must be filtered out, but it may be one of the signals we can use.
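Under those definitions, the whole two-slice computation can be sketched in a few lines of plain Python (invented names; the real system does this as staged Reduces over the history, not in memory):

```python
def author_loyalty(reads):
    """reads: list of (user, author) events.
    Stage 1 (user slice): the number of distinct authors per user.
    Stage 2 (author slice): average that count over the author's readers.
    A low average means the author's audience reads almost only him."""
    authors_of = {}
    for user, author in reads:
        authors_of.setdefault(user, set()).add(author)
    loyalty = {}
    for author in {a for _, a in reads}:
        readers = [u for u, s in authors_of.items() if author in s]
        loyalty[author] = sum(len(authors_of[u]) for u in readers) / len(readers)
    return loyalty

reads = [("u1", "A"), ("u2", "A"), ("u3", "A"),   # fans who read only A
         ("u4", "A"), ("u4", "B"), ("u4", "C")]   # one diverse reader
loyalty = author_loyalty(reads)
assert loyalty["A"] == (1 + 1 + 1 + 3) / 4  # 1.5, suspiciously low
```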
How do we do this in the MapReduce paradigm? Let's make a set of consecutive Reduces with dependencies between them.
We get a graph of updates. It determines which slices we compute features over, which joins are possible at all, the amount of resources consumed (obviously, the more Reduces, the more resources), and the latency and throughput.
Let's construct, for example, this graph.
That is, we split our Reduces into two stages. At the first stage, we simultaneously compute Reduces for the different slices: our users, authors and genres. And we need a second stage, where we collect the features from these different Reduces and take the final verdict.
For the next batch we do the same. Moreover, the first stage of each batch depends on the first stage of the previous batch, and the second stage on the second stage of the previous one.
It is important that we do not have a dependency like this:
That is, we actually get a pipeline: the first stage of the next batch can run in parallel with the second stage of the previous batch.
And how do we compute the three-stage statistic I described above if we have only two stages? Very simply. We can compute the first value in the first stage of batch N.
The second value in the first stage of batch N+1, and the final value in the second stage of batch N+1. Thus, during the transition between the first and second stages, the statistics for batch N+1 may be not quite exact, but usually this is enough for such calculations.
Having all this, you can build more complex features from these building blocks. For example, the deviation of the current book rating from the user's average rating. Or the share of users who give very positive or very negative ratings to a book; also suspicious. Or the average rating of books from users who have rated more than N different books, which from some point of view is a more accurate and fair estimate.
To this we add what we call relationships between events. Duplicates often appear in the logs or in the data sent to us, either technical events or robotic behavior. We find such duplicates too. Another case is related events. Say your system shows book recommendations and users click on them. So that the final statistics affecting the ranking are not spoiled, we need to make sure that if we filter out a show, we also filter the click on that recommendation.
But since the stream may arrive out of order, the click first, we must hold the click until we see the show and have accepted a verdict based on it.
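A sketch of that buffering, with invented event shapes: clicks that arrive before their show are parked and released with the show's verdict.

```python
def link_events(stream):
    """Related events arriving out of order: a click on a recommendation
    must share the show's verdict, so clicks that arrive before their
    show are parked until the show is seen."""
    verdicts = {}   # rec_id -> verdict taken on the show
    pending = {}    # rec_id -> number of clicks waiting for their show
    out = []
    for kind, rec_id, verdict in stream:
        if kind == "show":
            verdicts[rec_id] = verdict
            out.append(("show", rec_id, verdict))
            # Release the clicks that were waiting for this show.
            for _ in range(pending.pop(rec_id, 0)):
                out.append(("click", rec_id, verdict))
        elif rec_id in verdicts:
            out.append(("click", rec_id, verdicts[rec_id]))
        else:
            pending[rec_id] = pending.get(rec_id, 0) + 1
    return out

# The click arrives first and is held back until the show's verdict is known.
events = [("click", 7, None), ("show", 7, "filtered")]
assert link_events(events) == [("show", 7, "filtered"), ("click", 7, "filtered")]
```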
Feature Description Language
Let me tell you a little about the language for describing all this.
You don't need to read it closely; this is just an example. We started with three main components. The first is the description of data units in the history, generally speaking, of arbitrary type.
This is some feature, a nullable number.
And some rule. What do we call a rule? It is a set of conditions on these features and something else. We had three separate files.
The problem is that here one chain of actions is spread across different files. A large number of analysts need to work with our system, and this was inconvenient for them.
The language turns out imperative: we describe how to calculate the data, rather than declaratively describing what we want to calculate. This is also inconvenient: it is easy to make a mistake, and the entry threshold is high. New people come and don't quite understand how to work with it at all.
The solution: let's make our own DSL. It describes our scenarios more clearly, it is easier for newcomers, and it is higher-level. We took inspiration from SQLAlchemy, C# LINQ and the like.
I'll give a couple of examples similar to the ones above.
The percentage of detective stories read. We count the number of books read, grouping by user. We add a filter on this condition and, if we want the final percentage, we simply compute the ratio. Everything is simple, clear and intuitive.
If we count the number of different authors, we group by user and take distinct authors. We can add conditions to this, for example, a calculation window or a limit on the number of values we store, due to memory constraints. Finally, we take count, the number of keys.
Or the average loyalty I talked about. Again, we have some expression computed above; we group by author and take the average of this expression among its values, then narrow it back down to the user.
To this we can add a filter condition, for example: loyalty is not too high, and the percentage of detective stories is between 80 and 100.
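Our actual DSL is not public, so here is a hypothetical Python embedding in the same SQLAlchemy/LINQ spirit. Every method name is invented, and the builder only records the pipeline rather than executing it; the point is the declarative shape, not the implementation.

```python
class Feature:
    """Hypothetical fluent builder mimicking the shape of the DSL; all
    method names here are invented for illustration. The builder only
    records the pipeline, it does not execute anything."""

    def __init__(self, source):
        self.source = source
        self.steps = []

    def _add(self, *step):
        self.steps.append(step)
        return self

    def group_by(self, key):    return self._add("group_by", key)
    def where(self, cond):      return self._add("where", cond)
    def distinct(self, field):  return self._add("distinct", field)
    def window(self, days):     return self._add("window", days)
    def count(self):            return self._add("count")

    def describe(self):
        parts = []
        for step in self.steps:
            args = ", ".join(str(a) for a in step[1:])
            parts.append("%s(%s)" % (step[0], args))
        return " | ".join(parts)

# "Number of different authors per user over a 30-day window."
distinct_authors = (Feature("reads")
                    .group_by("user")
                    .distinct("author")
                    .window(30)
                    .count())
assert distinct_authors.describe() == "group_by(user) | distinct(author) | window(30) | count()"
```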
What do we use for this under the hood?
Under the hood we use the latest technologies, straight from the 70s: Flex and Bison. You may have heard of them. They generate code: a source file passes through our lexer, generated by Flex, and through the parser, generated by Bison. The lexer produces terminal symbols, the words of the language; the parser produces syntactic expressions.
From this we get an abstract syntax tree, on which we can already perform transformations. In the end, we turn it into low-level files that the system understands.
What's the result? It is more complicated than it might seem at first glance. You need to spend a lot of resources and think through details like operator precedence, edge cases and the like. You have to learn rare technologies that are unlikely to be useful to you in real life, unless, of course, you write compilers. But in the end it's worth it: if, like us, you have a large number of analysts who often come from other teams, this gives a significant advantage, because it becomes easier for them to work.
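This is not our Flex/Bison grammar, but a toy Python version of the same lexer to parser to AST pipeline, for conditions like the loyalty filter above; the grammar is invented for illustration.

```python
import re

# Lexer (Flex's job in our pipeline): raw text -> terminal tokens.
TOKEN = re.compile(r"\s*(\d+|\w+|[<>])")

def lex(text):
    return TOKEN.findall(text)

def parse(tokens):
    """Parser (Bison's job): tokens -> an AST for the toy grammar
    `name (< | >) number [and ...]`."""
    def comparison(i):
        name, op, num = tokens[i], tokens[i + 1], int(tokens[i + 2])
        return (op, name, num), i + 3
    node, i = comparison(0)
    while i < len(tokens) and tokens[i] == "and":
        rhs, i = comparison(i + 1)
        node = ("and", node, rhs)
    return node

# The AST is what we would then transform into low-level files.
ast = parse(lex("loyalty < 3 and detectives > 80"))
assert ast == ("and", ("<", "loyalty", 3), (">", "detectives", 80))
```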
Some services require fault tolerance: cross-DC and exactly-once processing. A violation can cause discrepancies in statistics and losses, including monetary ones. Our solution for MapReduce is to read data at each moment in time on only one cluster and synchronize it to the second.
How would we behave here? There is a leader, a follower and a message broker. You can think of the broker as a conditional Kafka, although here, of course, it is our own implementation.
We deliver our batches to both clusters, run the set of updates on the leader, accept the final verdicts, update the history, and send the results back to the message broker.
Every once in a while, we naturally need to do replication: we collect snapshots and changelogs, the changes for each batch, and synchronize both to the second, follower cluster. We also keep the history there in a hot state; let me remind you that the history is kept in memory.
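The recovery path can be sketched like this (record shapes invented): the follower starts from the latest snapshot of the in-memory history and replays the per-batch changelogs on top of it.

```python
def restore(snapshot, changelogs):
    """Follower-side recovery: start from the last snapshot of the
    in-memory history and replay per-batch changelogs on top of it."""
    history = dict(snapshot)
    for changelog in changelogs:
        for key, value in changelog:
            if value is None:
                history.pop(key, None)   # deletion record
            else:
                history[key] = value     # upsert record
    return history

snapshot = {"user:alice": 3}
changelogs = [
    [("user:alice", 4), ("user:bob", 1)],   # batch N
    [("user:bob", None)],                   # batch N+1: bob's entry expired
]
assert restore(snapshot, changelogs) == {"user:alice": 4}
```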
Thus, if one DC becomes unavailable for some reason, we can switch to the second cluster quickly, with minimal lag.
Why not simply compute on two clusters in parallel? External data may differ between the two clusters: it is supplied by external services. What is external data? It is something that comes down from the higher level: complex clustering and the like, or simply auxiliary data for calculations.
We need a consistent solution: if we simultaneously compute verdicts using different data and periodically switch between the results of two different clusters, the consistency between them drops sharply. And, of course, this saves resources, since we use CPU on only one cluster at a time.
What about the second cluster? When we are running normally, it is almost idle. Let's use its resources for a full-fledged preprod. By a full-fledged preprod I mean a full installation that accepts the same data stream, works with the same amount of data, and so on.
If a cluster becomes unavailable, we swap the installations: prod becomes preprod and vice versa. The preprod is then down for a while, but that's okay.
The advantage is that we can compute more features on the preprod. Why is this needed at all? Because if we want to compute a large set of features, we often don't need all of them on prod. There we compute only what is needed to obtain the final verdicts.
But at the same time, on the preprod we have a kind of hot cache, large and with a wide variety of features. In case of an attack, we can use it to close the problem and move those features to prod.
Added to this are the benefits of B2B testing: we always roll out to the preprod first and completely compare any differences, which minimizes the likelihood of making a mistake when rolling out to prod.
A little about the scheduler. We have machines that run tasks on MapReduce, the workers. They regularly synchronize their state to a cross-DC database: simply a record of what they have managed to compute so far.
If a worker becomes unavailable, another worker tries to capture its log and pick up its state.
It restores from that state and continues working, continuing to schedule tasks on MapReduce.
Clearly, when tasks are picked up like this, some of them may be restarted. Therefore, one property is very important for us: idempotency, the ability to restart every operation without consequences.
That is, all the code must be written so that restarts are safe.
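A minimal illustration of that property, with invented names: every operation carries an id, so replaying it after a restart changes nothing.

```python
def apply_update(history, applied_ops, op_id, key, delta):
    """Idempotent update: a restarted task may resend the same operation,
    so each one carries an id and is applied at most once."""
    if op_id in applied_ops:
        return history[key]               # duplicate delivery: no-op
    applied_ops.add(op_id)
    history[key] = history.get(key, 0) + delta
    return history[key]

history, applied = {}, set()
apply_update(history, applied, "batch42:alice", "alice", 2)
apply_update(history, applied, "batch42:alice", "alice", 2)  # restarted task
assert history["alice"] == 2  # the duplicate did not double-count
```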
A bit about exactly-once. We must reach verdicts consistently; this is very important. We use technologies that give us such guarantees, and, of course, we monitor all discrepancies and drive them to zero. Even when it seems they have already been driven to zero, a very tricky problem that we had not taken into account periodically surfaces.
Very briefly about our tooling. Supporting multiple antifrauds for different systems is a difficult task: we have literally dozens of different services, and you need a single place where you can see the current status of their operation.
Here is our command post, where you can see the status of the clusters we are working with right now. You can switch between them, roll out a release, and so on.
Or, for example, a problem dashboard, where we immediately see on one page all the problems of all the antifrauds of the services connected to us. Here you can see that something is clearly wrong with the preprod of our Books service right now. But monitoring will fire, and the on-duty engineer will look into it.
What do we monitor at all? Obviously, system lag is extremely important, as is the running time of each individual stage and, of course, the filtering rate of individual rules. The latter is a business requirement.
There are hundreds of charts and dashboards. For example, on this dashboard you can see that the pipeline was unhealthy for a while and accumulated a substantial lag.
Now about the transition to the online part. The problem here is that the lag of the full pipeline can reach several minutes; that is the MapReduce-based design. In some cases we need to ban and detect fraudsters faster.
What could that be? For example, our service now lets users buy books, and a new type of payment fraud has appeared. We need to react to it faster. The question arises: how do we transfer this whole scheme while ideally preserving, as much as possible, the interaction language familiar to analysts? Let's try the naive approach first.
Suppose we have a balancer receiving data from the service and a number of workers across which we shard the data from the balancer. There is also the external data we use here, which is very important, and the set of histories. Recall that each history serves different updates, because each has different keys.
In this scheme, the following problem can occur.
Say two events arrive at the workers. With any sharding of the workers, we can end up in a situation where one key lands on different workers. In this case it is the author Tolkien, who landed on two workers.
Then both workers read this key's data from the key-value history, update it differently, and a race arises when they try to write it back.
The solution: let's assume that reading and writing can be separated, and that writing can happen with a slight delay. This is usually not critical. By a slight delay I mean a few seconds. This matters, in particular, because our key-value store implementation takes longer to write data than to read it.
We will update statistics with a lag. On average this works more or less well, given that we keep cached state on the machines.
And one more thing: for simplicity, let's merge these histories into one and shard it by the type and key of the slice. We get a single history.
Then we add the balancer back and add reader machines, which can be sharded however you like, for example, simply by load. They just read this data, take the final verdicts and return them to the balancer.
In this case, we need a set of writer machines to which this data is sent directly. The writers update the history accordingly. But there is still the problem I described above. So let's change the writer's structure a bit.
Let's shard it the same way as the history: by the type and value of the key. Since its sharding matches the history's, we no longer have the problem I mentioned above.
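The key point, that writer sharding must coincide with history sharding, can be sketched as a deterministic shard function over (slice type, key); the hashing scheme here is invented for illustration.

```python
import hashlib

def shard_of(slice_type, key, n_shards):
    """Deterministic sharding by (slice type, key): the history shard and
    the writer responsible for it are chosen by the same function, so all
    updates to one key land on one writer and the write race disappears."""
    digest = hashlib.md5(f"{slice_type}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

# Wherever the event was read, updates for one key pick the same writer.
assert shard_of("author", "tolkien", 16) == shard_of("author", "tolkien", 16)
assert 0 <= shard_of("user", "alice", 16) < 16
```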
Here the writer's mission changes: it no longer takes verdicts. Instead, it simply takes updates from the readers, merges them, and applies them correctly to the history.
Clearly, we also need a component, a coordinator, that distributes these updates between readers and writers.
To this, of course, we add that the workers must maintain a current cache. As a result, we respond in hundreds of milliseconds, sometimes less, and update statistics within about a second. On the whole this works well; for our services it is enough.
What did we get in the end? Analysts began to work faster, and in the same way for all services. This improved the quality and connectivity of all the systems. You can reuse data between the antifrauds of different services, and new services get high-quality antifraud quickly.
A couple of thoughts in closing. If you write something similar, think right away about the convenience of analysts and about the supportability and extensibility of these systems. Make everything you can configurable; you will need it. Sometimes the cross-DC and exactly-once properties are difficult to achieve, but possible. If it seems to you that you have already achieved them, double-check. Thanks for your attention.