Colleagues, good afternoon! My name is Misha, I work as a programmer.

In this article, I want to talk about how our team decided to apply CQRS & Event Sourcing in a project, a platform for online auctions. I will also cover what came of it, what conclusions can be drawn from our experience, and which pitfalls those who go the CQRS & ES route should avoid.


Prelude


First, a bit of history and business background. A customer came to us with a platform for so-called timed auctions, which was already in production and had accumulated a certain amount of feedback. The customer wanted us to build a platform for live auctions.

Now a little terminology. An auction is a sale of items (lots) in which buyers (bidders) place bids. The buyer who places the highest bid becomes the owner of the lot. A timed auction is one in which each lot has a predetermined closing time. Buyers bid, and at some point the lot closes. This is how eBay works.

The timed platform was built in the classic way, using CRUD. Lots were closed by a separate application running on a schedule. None of this worked very reliably: some bids were lost, some were recorded as if placed by the wrong buyer, and lots either failed to close or closed several times.

A live auction is a way to take part in a real offline auction remotely over the Internet. There is a room (in our internal terminology, a “room”) with an auctioneer holding a hammer and an audience. Next to them sits the so-called clerk with a laptop, who relays the course of the auction to the Internet by clicking buttons in his interface. Buyers connected to the auction see the bids made offline and can place their own.

Both platforms, in principle, work in real time. But whereas in a timed auction all buyers are on an equal footing, in a live auction it is extremely important that online buyers can compete successfully with those in the room. That is, the system must be very fast and reliable. The sad experience of the timed platform made it clear that classic CRUD would not suit us.

We had no experience of our own with CQRS & ES, so we consulted colleagues who did (we are a large company), described our business realities to them, and jointly concluded that CQRS & ES should suit us.

What else is specific to online auctions:

  • Many users simultaneously try to act on the same object in the system, the current lot. Buyers place their bids; the clerk enters bids “from the room” into the system, closes the lot, and opens the next one. At any given moment only a bid of one specific amount can be placed, for example 5 rubles, and only one user can place it.
  • The entire history of actions on system objects must be kept so that, if necessary, you can see who placed which bid.
  • The response time of the system must be very short: the online version of the auction must not lag behind the offline one, and users must understand what their attempt to bid led to, whether it succeeded or not.
  • Users must promptly learn about all changes during the auction, not just about the results of their own actions.
  • The solution must be scalable, since several auctions can take place simultaneously.

CQRS & ES


I will not dwell on CQRS & ES in detail; there is plenty of material about it on the Internet and on Habr in particular (for example, here: Introduction to CQRS + Event Sourcing). However, I will briefly recall the main points:

  • The most important thing in event sourcing: the system stores not the data itself but the history of its changes, that is, events. The current state of the system is obtained by applying the events in sequence.
  • The domain model is divided into entities called aggregates. An aggregate has a version. Events are applied to aggregates. Applying an event to an aggregate increments its version.
  • Events are stored in the write database. The same table stores the events of all aggregates in the system, in the order in which they occurred.
  • Changes in the system are initiated by commands. A command applies to a single aggregate, and to its latest, that is, current, version. To get that version, the aggregate is built up by applying all of “its” events in sequence. This process is called rehydration.
  • To avoid rehydrating from the very beginning every time, some versions of the aggregate (usually every Nth version) can be stored in ready-made form. These saved copies of the aggregate are called snapshots. Then, to obtain the latest version of the aggregate during rehydration, only the events that occurred after the freshest snapshot was taken are applied to that snapshot.
  • A command is processed by the business logic of the system. In the general case this produces several events, which are stored in the write database.
  • In addition to the write database, the system may also have a read database that stores data in a form convenient for clients of the system. Read-database entities do not have to correspond one to one to the aggregates of the system. The read database is updated by event handlers.
  • Thus we get a separation of commands and queries, Command Query Responsibility Segregation (CQRS): commands that change the state of the system are processed by the write part; queries that do not change state go to the read part.
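The points about events, versions, rehydration, and snapshots can be illustrated with a minimal sketch. It is not taken from our code base or from any framework: the event types, `LotAggregate`, and `LotSnapshot` are hypothetical names invented for this illustration, and persistence is left out entirely.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical events for a lot aggregate; the names are illustrative only.
public abstract record LotEvent;
public sealed record LotOpened : LotEvent;
public sealed record BidAccepted(decimal Amount, Guid BidderId) : LotEvent;
public sealed record LotClosed : LotEvent;

// A snapshot is a saved copy of the aggregate state at a given version.
public sealed record LotSnapshot(int Version, bool IsOpen, decimal HighestBid, Guid? LeaderId);

public sealed class LotAggregate
{
    public int Version { get; private set; }
    public bool IsOpen { get; private set; }
    public decimal HighestBid { get; private set; }
    public Guid? LeaderId { get; private set; }

    // Applying an event changes the state and increments the version.
    public void Apply(LotEvent e)
    {
        switch (e)
        {
            case LotOpened _:
                IsOpen = true;
                break;
            case BidAccepted bid:
                HighestBid = bid.Amount;
                LeaderId = bid.BidderId;
                break;
            case LotClosed _:
                IsOpen = false;
                break;
            default:
                throw new ArgumentException($"Unknown event type: {e.GetType().Name}", nameof(e));
        }
        Version++;
    }

    public LotSnapshot TakeSnapshot() => new(Version, IsOpen, HighestBid, LeaderId);

    // Rehydration: start from a snapshot (or from an empty aggregate when
    // there is none) and replay only the events recorded after it.
    public static LotAggregate Rehydrate(LotSnapshot? snapshot, IEnumerable<LotEvent> eventsAfterSnapshot)
    {
        var lot = new LotAggregate();
        if (snapshot is not null)
        {
            lot.Version = snapshot.Version;
            lot.IsOpen = snapshot.IsOpen;
            lot.HighestBid = snapshot.HighestBid;
            lot.LeaderId = snapshot.LeaderId;
        }
        foreach (var e in eventsAfterSnapshot)
        {
            lot.Apply(e);
        }
        return lot;
    }
}
```

Replaying the full event list from scratch and replaying only the tail on top of a snapshot should produce the same state and the same version; the snapshot only shortens the replay.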


Implementation. Subtleties and complexity.


Framework selection


To save time, and because we lacked specific experience, we decided to use a framework for CQRS & ES.

In general, our technology stack is Microsoft, that is, .NET and C#. The database is Microsoft SQL Server. Everything is hosted in Azure. The timed platform was built on this stack, and it was logical to build the live platform on it too.

At that time, as far as I remember, Chinchilla was almost the only option that fit our technology stack. So we took it.

Why do we need a CQRS & ES framework at all? Out of the box it can solve problems and support implementation aspects such as:

  • Aggregates, commands, events, aggregate versioning, rehydration, and the snapshot mechanism.
  • Interfaces for working with different DBMSs. Saving and loading events and aggregate snapshots to and from the write database (event store).
  • Interfaces for working with queues: sending commands and events to the appropriate queues, and reading commands and events from the queue.
  • An interface for working with web sockets.

Thus, with Chinchilla in use, we added the following to our stack:

  • Azure Service Bus as the command and event bus; Chinchilla supports it out of the box.
  • Microsoft SQL Server for both the write and the read database, that is, both are SQL databases. I cannot say this was an informed choice; it was rather for historical reasons.

And yes, the frontend is built with Angular.

As I said, one of the requirements for the system is that users learn about the results of their own actions and the actions of other users as quickly as possible. This applies both to buyers and to the clerk. Therefore we use SignalR and web sockets to update data on the frontend quickly. Chinchilla supports integration with SignalR.

Aggregate Selection


One of the first things to do when implementing CQRS & ES is to decide how the domain model will be divided into aggregates.

In our case, the domain model consists of several basic entities, approximately the following:

    public class Auction
    {
        public AuctionState State { get; private set; }
        public Guid? CurrentLotId { get; private set; }
        public List<Guid> Lots { get; }
    }

    public class Lot
    {
        public Guid? AuctionId { get; private set; }
        public LotState State { get; private set; }
        public decimal NextBid { get; private set; }
        public Stack<Bid> Bids { get; }
    }

    public class Bid
    {
        public decimal Amount { get; set; }
        public Guid? BidderId { get; set; }
    }


We ended up with two aggregates: Auction and Lot (with its bids). This is logical on the whole, but we overlooked one thing: with this division, the state of the system is spread over two aggregates, and in some cases, to maintain consistency, we have to change both aggregates rather than one. For example, an auction can be paused. If the auction is paused, you cannot bid on the lot. It would be possible to pause the lot itself, but a paused auction must also reject every command except “un-pause”.

As an alternative, we could have made a single aggregate, Auction, with all the lots and bids inside. But such an object would be quite heavy, because an auction can contain up to several thousand lots, and each lot can have several dozen bids. Over the lifetime of the auction such an aggregate accumulates a great many versions, and without snapshots its rehydration (applying all events to the aggregate in sequence) takes quite a long time, which is unacceptable in our situation. If you do use snapshots (we do), the snapshots themselves become very large.

On the other hand, to guarantee that changes are applied to two aggregates while processing a single user action, you have to either change both aggregates within one command using a transaction, or execute two commands within one transaction. Either way, it is, by and large, a violation of the architecture.

Such circumstances must be taken into account when breaking the domain model into aggregates.

At this stage of the project we live with two aggregates, Auction and Lot, and violate the architecture by changing both aggregates within some commands.

Applying a command to a specific version of an aggregate


If several buyers simultaneously bid on the same lot, that is, send a “place bid” command to the system, only one of the bids will succeed. A lot is an aggregate and has a version. When a command is processed, events are created, each of which increments the aggregate version. There are two ways to go:

  • Send the command with the version of the aggregate it should be applied to. The command handler can then immediately compare the version in the command with the current version of the aggregate and stop if they do not match.
  • Do not specify the aggregate version in the command. The aggregate is then rehydrated at whatever version is current, the business logic runs, and events are created. Only when they are saved can an exception be thrown saying that this version of the aggregate already exists, because someone else got there first.

We use the second option. This way commands have a better chance of being executed, because in the part of the application that sends commands (in our case, the frontend) the known version of the aggregate is likely to lag behind the real version on the backend, especially when many commands are being sent and the aggregate version changes often.
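The second option amounts to an optimistic concurrency check at save time. Below is a minimal sketch of the idea using an in-memory store; `InMemoryEventStore` and `ConcurrencyException` are hypothetical names for this illustration and are not Chinchilla's API. A real write database would typically enforce the same rule with a unique constraint on the aggregate ID and version.

```csharp
using System;
using System.Collections.Generic;

public sealed class ConcurrencyException : Exception
{
    public ConcurrencyException(string message) : base(message) { }
}

// Hypothetical in-memory event store, for illustration only.
public sealed class InMemoryEventStore
{
    private readonly Dictionary<Guid, List<string>> _streams = new();
    private readonly object _gate = new();

    // The version of an aggregate equals the number of events stored for it.
    public int CurrentVersion(Guid aggregateId)
    {
        lock (_gate)
        {
            return _streams.TryGetValue(aggregateId, out var stream) ? stream.Count : 0;
        }
    }

    // expectedVersion is the version the command handler rehydrated,
    // not a version supplied by the client.
    public void Append(Guid aggregateId, int expectedVersion, IReadOnlyList<string> newEvents)
    {
        lock (_gate)
        {
            if (!_streams.TryGetValue(aggregateId, out var stream))
            {
                stream = new List<string>();
                _streams[aggregateId] = stream;
            }

            // Someone else saved events after we rehydrated: reject the write.
            if (stream.Count != expectedVersion)
            {
                throw new ConcurrencyException(
                    $"Expected version {expectedVersion}, but the store is at {stream.Count}.");
            }

            stream.AddRange(newEvents);
        }
    }
}
```

If two handlers rehydrate the same lot at version 1 and both try to append a bid, the first `Append` succeeds and the second throws, which corresponds to the “someone else got there first” case described above.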

Errors when executing a command using a queue


In our implementation, which is heavily shaped by the use of Chinchilla, the command handler reads commands from a queue (Microsoft Azure Service Bus). We clearly distinguish between a command that failed for technical reasons (timeouts, errors connecting to the queue or the database) and one that failed for business reasons (an attempt to place a bid of an amount that has already been accepted on the lot, and so on). In the first case, the attempt to execute the command is repeated until the number of retries specified in the queue settings is exhausted, after which the command is sent to the Dead Letter Queue (a separate sub-queue in Azure Service Bus for messages that could not be processed). In the case of a business exception, the command is sent to the Dead Letter Queue immediately.
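The decision described above can be sketched as a small policy function. This is an illustration under assumptions, not our production code and not the Azure Service Bus SDK: `BusinessRuleException`, `Outcome`, and `CommandFailurePolicy` are hypothetical names, and `deliveryCount` and `maxDeliveryCount` stand for the attempt number and the retry limit configured on the queue.

```csharp
using System;

// Hypothetical exception type for business-rule violations,
// for example a bid of an amount that was already accepted.
public sealed class BusinessRuleException : Exception
{
    public BusinessRuleException(string message) : base(message) { }
}

public enum Outcome
{
    Completed,  // command handled, remove it from the queue
    Retry,      // technical failure, let the queue redeliver it
    DeadLetter  // give up and move it to the dead-letter queue
}

public static class CommandFailurePolicy
{
    // deliveryCount is 1 on the first attempt.
    public static Outcome Process(Action handler, int deliveryCount, int maxDeliveryCount)
    {
        try
        {
            handler();
            return Outcome.Completed;
        }
        catch (BusinessRuleException)
        {
            // Retrying cannot help: the command itself is invalid.
            return Outcome.DeadLetter;
        }
        catch (Exception)
        {
            // Timeouts, connection errors and the like: retry until the limit.
            return deliveryCount >= maxDeliveryCount ? Outcome.DeadLetter : Outcome.Retry;
        }
    }
}
```

The order of the `catch` clauses matters: the business exception is matched first, so it never falls into the retry branch.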


Errors when processing events using the queue


Depending on the implementation, the events produced by executing a command can also be sent to a queue and picked up from it by event handlers. Errors occur when processing events, too.

However, unlike the situation with a failed command, things are worse here: it may happen that the command has been executed and its events written to the write database, but their processing by the handlers fails. If one of those handlers updates the read database, the read database will not be updated, that is, it will be left in an inconsistent state. Thanks to the retry mechanism for event processing, the read database is almost always updated eventually, but there is still a chance that it remains broken after all attempts.


We ran into this problem. To a large extent, though, the reason was that our event handlers contained some business logic, which has a good chance of failing from time to time under an intensive flow of bids. Unfortunately, we realized this too late, and we were not able to rework the implementation quickly and simply.

As a result, as a temporary measure, we stopped using Azure Service Bus to deliver events from the write part of the application to the read part. Instead we use the so-called In-Memory Bus, which lets us process the command and its events in one transaction and, in case of failure, roll the whole thing back.


This solution does not help scalability, but it rules out situations in which our read database breaks, which in turn breaks the frontends and makes it impossible to continue the auction without rebuilding the read database by replaying all the events.

Sending a command in response to an event


This is acceptable in principle, but only if a failure to execute this second command does not break the state of the system.

Handling multiple events of one command


In the general case, executing one command produces several events. Sometimes each of those events requires some change in the read database. Sometimes the order of the events also matters, and processing them in the wrong order will not work as it should. All this means that we cannot read the events of one command from the queue and process them independently, for example with different instances of the code that reads messages from the queue. In addition, we need a guarantee that events will be read from the queue in the same order in which they were sent there. Otherwise we must be prepared for the fact that not all events of a command will be processed successfully on the first try.
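One possible way to be prepared for out-of-order delivery is to have the read side buffer events and apply them strictly in aggregate-version order. The sketch below is only an illustration of that idea, not something we implemented: `VersionedEvent` and `OrderedProjector` are hypothetical names, the buffer lives in memory, and the class is meant for a single consumer (it is not thread-safe).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event envelope carrying the aggregate version it produced.
public sealed record VersionedEvent(Guid AggregateId, int Version, string Payload);

public sealed class OrderedProjector
{
    private readonly Dictionary<Guid, int> _lastApplied = new();
    private readonly Dictionary<Guid, SortedDictionary<int, VersionedEvent>> _pending = new();
    private readonly Action<VersionedEvent> _apply;

    public OrderedProjector(Action<VersionedEvent> apply) => _apply = apply;

    public void Receive(VersionedEvent e)
    {
        // last stays 0 when nothing has been applied for this aggregate yet.
        _lastApplied.TryGetValue(e.AggregateId, out var last);

        // Already applied: a redelivery, ignore it.
        if (e.Version <= last)
        {
            return;
        }

        if (!_pending.TryGetValue(e.AggregateId, out var buffer))
        {
            buffer = new SortedDictionary<int, VersionedEvent>();
            _pending[e.AggregateId] = buffer;
        }
        buffer[e.Version] = e;

        // Apply buffered events for as long as the next expected version is present.
        while (buffer.TryGetValue(last + 1, out var next))
        {
            _apply(next);
            buffer.Remove(last + 1);
            last++;
            _lastApplied[e.AggregateId] = last;
        }
    }
}
```

With this approach an event that arrives early simply waits in the buffer until the gap before it is filled, at the cost of keeping per-aggregate state in the consumer.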


Handling a single event by multiple handlers


If the system needs to perform several different actions in response to a single event, several handlers are usually written for that event. They can run in parallel or sequentially. With sequential execution, if one of the handlers fails, the entire sequence is restarted (this is how it works in Chinchilla). With such an implementation it is important that the handlers are idempotent, so that running an already successful handler a second time does not fail. Otherwise, once the second handler in the chain fails, the chain will never complete, because the first handler will fail on the second (and every subsequent) attempt.

For example, an event handler adds a 5-ruble bid on a lot to the read database. The first attempt succeeds, but the second is rejected by a constraint in the database.
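The difference between a non-idempotent and an idempotent handler can be shown with a small sketch. Everything here is hypothetical: `BidReadModel` is an in-memory stand-in for a read-database table with a unique constraint on lot and amount, and `BidPlaced` is an illustrative event.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event, for illustration only.
public sealed record BidPlaced(Guid LotId, decimal Amount, Guid BidderId);

// Stand-in for a read-database table with a unique (LotId, Amount) constraint.
public sealed class BidReadModel
{
    private readonly Dictionary<(Guid LotId, decimal Amount), Guid> _rows = new();

    public int Count => _rows.Count;

    // Behaves like a plain INSERT: a second call with the same key throws
    // ArgumentException, much as a unique constraint would reject the row.
    public void Insert(BidPlaced e) => _rows.Add((e.LotId, e.Amount), e.BidderId);

    // Idempotent variant: a repeated call with the same key is a no-op.
    public void InsertIfAbsent(BidPlaced e) => _rows.TryAdd((e.LotId, e.Amount), e.BidderId);
}
```

A handler built on `InsertIfAbsent` can be safely re-run when a later handler in the chain fails, whereas one built on `Insert` fails on every retry and blocks the chain.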


Conclusions


Our project is now at a stage where, it seems to us, we have already hit most of the pitfalls relevant to our business specifics. Overall we consider our experience quite successful: CQRS & ES suits our domain well. We see the further development of the project in dropping Chinchilla in favor of another framework that gives more flexibility. Not using a framework at all is also an option. There will probably also be some changes aimed at finding a balance between reliability on the one hand and the speed and scalability of the solution on the other.

As for the business side, some questions remain open, for example how to divide the domain model into aggregates.

I hope our experience will be useful to someone and will help save time and avoid some pitfalls. Thank you for your attention.
