Can you create a deadlock in Camunda BPM? I can
Some time ago, I wrote about the successful migration from IBM BPM to Camunda, and now our life is full of happiness and pleasant impressions. Camunda did not disappoint, and we continue to be friends with this BPM engine.
But, alas, Camunda can spring unpleasant surprises that sometimes lead to far from obvious results. This article examines one such case, which, despite its apparent simplicity, turned out to be interesting and somewhat more complicated than it seemed at first glance.
We train on cats
To describe the problem, consider a synthetic example. Suppose we decide to expand our customer base and need to serve cats, both male and female. Each potential customer should be vetted and, perhaps, offered something right away.
We will check the reliability of the candidate and the possible services that we can offer him. The reliability check and the search for services are not connected in any way, so these actions can be performed in parallel. Schematically, in a BPMN diagram, it looks like this:
Chart 1. Schematic fluffy maintenance process
The diagram shows the basic steps, with forking and joining gateways.
This icon depicts a parallel gateway. The Parallel Gateway is the simplest gateway for building a parallel part of a process.
There are two types of parallel gateways:
- fork - creates a separate execution for each outgoing branch;
- join - waits for executions to arrive on all incoming branches.
An execution represents a 'path of execution' in a process instance (from the documentation). In other words, it is one flow of the process.
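To make the fork and join semantics concrete, here is a minimal Python sketch of a parallel gateway. It is a toy model and not Camunda code: the `parallel_fork` function, the `ParallelJoin` class and the flow names are all invented for illustration.

```python
# Toy model of a parallel gateway; not Camunda's implementation.
# parallel_fork, ParallelJoin and the flow names are hypothetical.

def parallel_fork(outgoing_flows):
    """A fork creates one execution per outgoing flow, unconditionally."""
    return [{"flow": flow, "active": True} for flow in outgoing_flows]


class ParallelJoin:
    """A join fires only when every incoming flow has delivered an execution."""

    def __init__(self, incoming_flows):
        self.expected = set(incoming_flows)
        self.arrived = set()

    def arrive(self, flow):
        """Register an arriving execution; return True once the join can fire."""
        if flow not in self.expected:
            raise ValueError(f"unknown incoming flow: {flow}")
        self.arrived.add(flow)
        return self.arrived == self.expected


executions = parallel_fork(["check_reliability", "find_services"])
join = ParallelJoin(["check_reliability", "find_services"])

# The first arrival has to wait; the second one lets the process continue.
results = [join.arrive(e["flow"]) for e in executions]
print(results)  # [False, True]
```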
Now let's complicate the task a little. We will check and search for services as follows: first we check the status of the client, then we look at what services may suit him, and we do some pre-processing. In addition, several services may be suitable for the client at once, so we should be able to offer all of them to the client.
Since we work with furry customers, the services will be appropriate: valerian, a scratching post, a pillow for the house and other useful things.
Chart 2. Updated fluffy customer service chart
The new version of the process works as follows. The process runs the reliability check and the search for possible offers in parallel. The search itself is also parallelized: only those branches whose conditions are fulfilled will be executed.
To parallelize the conditions, the Inclusive Gateway is used, which is indicated by the following icon:
Inclusive Gateway - a parallel gateway with conditions on the branches. Only the branches whose conditions are true are executed.
It also comes in two types:
- fork - creates an execution for each branch whose condition is fulfilled; these executions run in parallel, just as with the Parallel Gateway;
- join - unlike the Parallel Gateway, does not wait for executions on all branches, but only on those whose condition was true.
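The fork side of an Inclusive Gateway can be sketched in a few lines of Python. This is only an illustration of the idea, not Camunda's implementation: the `inclusive_fork` function, the condition lambdas and the variable names are assumptions, and the service names are loosely based on the example above.

```python
# Toy model of an inclusive fork; not Camunda's implementation.
# inclusive_fork, the conditions and the variable names are hypothetical.

def inclusive_fork(branches, variables, default=None):
    """Return the branches whose condition holds for the given variables.

    Falls back to the default branch when nothing matches, and raises
    an error if there is no default branch either.
    """
    taken = [name for name, condition in branches.items() if condition(variables)]
    if taken:
        return taken
    if default is not None:
        return [default]
    raise RuntimeError("no condition is true and there is no default branch")


branches = {
    "valerian": lambda v: v["stressed"],
    "scratching_post": lambda v: v["has_claws"],
    "pillow": lambda v: v["sleeps_a_lot"],
}

cat = {"stressed": False, "has_claws": True, "sleeps_a_lot": True}
taken = inclusive_fork(branches, cat)
print(taken)  # ['scratching_post', 'pillow']
```

In this model one execution would be started for each name in `taken`, so the number of parallel executions depends on the data rather than on the shape of the diagram.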
It may happen that the checks performed are not enough and the client has to be checked again. To handle this, we add a condition after all the checks that can send the client back to the very beginning for re-verification:
Diagram 3. The final version of the process that should work
It turned out to be cumbersome, but the process solves the problem.
What is it? What happened?
Here strange things begin to happen. The reliability check branch completes and reaches the joining Parallel Gateway. So far, everything is going fine.
The second branch checks the client's financial condition, and depending on the results, the corresponding tasks are performed. Then the process stops at the joining Inclusive Gateway and does not move any further. If you look at Cockpit (the admin panel), you will see executions hanging at the joining Inclusive Gateway and at the joining Parallel Gateway.
Chart 4. The hung service process
There it is. We can say that we got a deadlock in a Camunda process. In this case, though, it is not directly related to deadlocks in the sense used in the theory of parallel programming.
In search of ̶a̶d̶v̶e̶n̶t̶u̶r̶e̶ an answer
Since I did not have enough understanding of what happened and why the process stopped, the problem had to be solved empirically.
Perhaps the Inclusive Gateway needs a default branch, and without it the process cannot run normally?
It sounds strange, of course, but let's try adding a default branch. Having a default branch is good practice anyway: otherwise it may happen that no condition is fulfilled, and then we get an error.
Chart 5. Service process with default branch
We start and get the same result - the process remains hanging on the Inclusive Gateway.
Next came trying all sorts of parameters and reading the documentation, and this dragged on for half a day. On yet another attempt, the process unexpectedly passed the ill-fated gateway. The lower branch with the Inclusive Gateway worked once the upper branch, the one checking the client's reliability, had been removed in the course of searching and debugging. That is, when the process degenerated into just the lower branch with the Inclusive Gateway, it ran to completion.
Chart 6. Degenerate process
It turns out that Parallel Gateway somehow influences the Inclusive Gateway. It's weird, illogical, and it shouldn't be that way.
How is this possible? Perhaps it is worth re-reading the theory of how the Parallel and Inclusive Gateways work. What needs to happen for the join gateway to gather everyone and let the process go on? On the Internet, they write that a joining Inclusive Gateway waits for as many executions to enter it as left the fork. Then one more question suddenly arises: how does this counter work at all?
What are you? How do you work?
This problem is worthy of puzzle games and TV quiz shows. Only on quiz shows are you allowed to phone a friend. On the other hand, I can ask for help too. Let's call our business process architect, Denis.
- Denis, hello! Can you tell me how the joining gateway determines when it is time for the process to move on? Everywhere they write: "As many as went out must come in." But how exactly does it count them?
- Very simple. Camunda counts the number of active executions.
- Thank you so much. Bye
Let's consider what happened. To do this, recall once again the scheme we ended up with:
Chart 7. Hanging process with default branch
For simplicity, consider the case when all conditions are met. What do we have at the moment when the three tasks that follow these conditions have completed?
How many active executions are there? Three on the lower branch and one on the upper, where we checked the reliability of the client. Camunda does not care that these belong to different parallel branches. Only the number of active executions matters: there are four of them, while the joining Inclusive Gateway has received only three.
To remedy the situation, a single joining gateway must collect all executions at once, and then, in theory, the process moves on. Let's try leaving one join gateway instead of two:
Chart 8. Corrected version of the process
Alas, after this edit the process began to look, in my opinion, less obvious. But it worked as originally planned. With that, the quest ended successfully: I was able to push the changes and go home.
The fun is just beginning
When I sat down to write this article and came up with an example of a process on which I could describe this case, I was disappointed: the process worked as it should and there was no deadlock.
At first I assumed that the version of Camunda in the example was newer than the one in the project, and that the problem had already been fixed in the newer version. But downgrading Camunda changed nothing. By the way, all examples use version 7.8.0 - far from the latest, but it does not matter. The problem was also checked and reproduced on the latest version at the time of writing - 7.13.
By trial and error, the cause was established. The initial artificial example did not have a reverse branch, unlike the process I had developed at work.
It turns out that in the presence of a reverse branch, the problem is reproduced and we find ourselves in a kind of deadlock, and without a reverse branch, everything works as it should.
The case called for understanding and analysis, so I had to look at the Camunda BPM sources. Since the problem was with the Inclusive Gateway, it seemed logical to look for an answer in the class responsible for the behavior of this element - InclusiveGatewayActivityBehavior. After running both versions of the process under the debugger a couple of times, I understood how it works.
If it’s not clear, see sources!
To keep the story from getting dull, the description of how the Inclusive Gateway works, based on the source code, will be sketchy. The logic we are interested in is concentrated in the execute method, where the most valuable part for this case is the activatesGateway method. As I understand it, it checks whether the Inclusive Gateway can be passed. The execute method is called for each execution (for each branch that is running). In our case, there are three such branches, which means this method will be called three times.
Let's see how the activatesGateway method works. For a better understanding, let's give names to all the branches that are running.
Chart 9. Process diagram with executions
As I understand it, the logic of the method is as follows: it compares the number of executions that have arrived at this gateway with the number of arrows entering it. This check covers the simplest situation, when all Inclusive Gateway branches are executed and the joining gateway just has to wait until the number of arrived executions equals the number of incoming arrows. That is, in the simplest case, the execute method is called as many times as there are branches entering the joining gateway, and then the process goes further.
In our case, this method is called three times, as the number of arrived executions grows from 1 to 3. At the last call, the numbers of arrived and expected executions will be 3 and 4, respectively, so we take the false branch.
If this condition is not met, the remaining executions are checked for a relation to the Inclusive Gateway. Namely, the method checks whether the other active executions are able to reach the joining Inclusive Gateway.
Here you need to be a little patient, exhale and read on. The denouement is near!
In the false branch of the activatesGateway method, every execution that has not yet arrived at the Inclusive join is checked for the possibility of reaching this join. If at least one execution can reach the Inclusive Gateway, it has to be taken into account, and the join waits for it to arrive too. If there are no executions that can reach the join, the method returns true.
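The decision described above can be summarized in a simplified Python model. It follows the article's description rather than the actual Java source: the function name `activates_gateway`, its parameters, and the execution labels other than "execution 1" are assumptions made for illustration.

```python
# Simplified model of the join decision described above; not Camunda's code.
# activates_gateway and its parameters are hypothetical names.

def activates_gateway(incoming_flow_count, executions_at_gateway,
                      all_executions, can_reach_gateway):
    """Decide whether an inclusive join may fire.

    incoming_flow_count   -- number of arrows entering the join
    executions_at_gateway -- executions already waiting at the join
    all_executions        -- every active execution in the process
    can_reach_gateway     -- callable: could this execution still get to the join?
    """
    # Simplest case: an execution has arrived for every incoming arrow.
    if len(executions_at_gateway) >= incoming_flow_count:
        return True

    # Otherwise keep waiting while any other execution might still arrive.
    for execution in all_executions:
        if execution in executions_at_gateway:
            continue
        if can_reach_gateway(execution):
            return False

    # Nobody else can arrive, so the join fires.
    return True


at_join = ["execution 2", "execution 3", "execution 4"]
everyone = ["execution 1"] + at_join

# Four incoming arrows (three services plus the default branch), three arrivals.
print(activates_gateway(4, at_join, everyone, lambda e: False))  # True
print(activates_gateway(4, at_join, everyone, lambda e: True))   # False
```

Everything therefore hinges on the reachability check: as soon as it answers "yes" for an execution that will in fact never arrive, the join waits forever.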
The most interesting part is coming. At first glance, the last execution (execution 1 in the diagram) cannot reach the Inclusive Gateway. But it is worth looking at the implementation of the canReachActivity method, which performs this check, to understand why the element behaves this way.
If we leave out the details of the code, this method recursively calls the isReachable method, which checks step by step whether the execution can get to the joining Inclusive Gateway. The reverse branch provides exactly such a path, and, alas, it is taken into account, although it should not be, since we can go back only after passing all the joins.
As a result, the Inclusive Gateway is waiting for another execution that will never come. Thus, we get a kind of deadlock. In principle, if we drop the conventions, we get a classic deadlock: the Parallel join waits for the branch with the Inclusive Gateway to complete, and vice versa, the branch with the Inclusive Gateway waits for the execution at the Parallel join.
The diagram below shows the approximate direction in which the reachability of the Inclusive Gateway join is checked for the execution that arrived at the Parallel Gateway join via the parallel branch.
Diagram 10. Possible path from Parallel Join to Inclusive Join
The diagram shows that the Inclusive Gateway join is indeed reachable from the Parallel Gateway join, and according to Camunda BPM logic it does not matter that this path means going around the loop again.
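The effect of the reverse branch is easy to reproduce with a toy reachability check over a graph. The sketch below is not Camunda's code: the `is_reachable` and `build_process` functions, the node names and the exact graph layout are assumptions reconstructed from the description of the process.

```python
# Toy reachability check over a process graph; not Camunda's code.
# Node names and graph layout are assumptions based on the description above.

def is_reachable(graph, source, target, visited=None):
    """Depth-first search: can `target` be reached from `source`?"""
    if visited is None:
        visited = set()
    if source == target:
        return True
    if source in visited:
        return False  # already explored; prevents endless looping on cycles
    visited.add(source)
    return any(is_reachable(graph, nxt, target, visited)
               for nxt in graph.get(source, []))


def build_process(with_reverse_branch):
    """Build the example process as an adjacency list."""
    graph = {
        "parallel_fork": ["check_reliability", "check_status"],
        "check_reliability": ["parallel_join"],
        "check_status": ["inclusive_fork"],
        "inclusive_fork": ["valerian", "scratching_post", "pillow"],
        "valerian": ["inclusive_join"],
        "scratching_post": ["inclusive_join"],
        "pillow": ["inclusive_join"],
        "inclusive_join": ["parallel_join"],
        "parallel_join": ["recheck_needed"],
        "recheck_needed": ["end"],
        "end": [],
    }
    if with_reverse_branch:
        # The re-verification arrow that leads back to the very beginning.
        graph["recheck_needed"].append("parallel_fork")
    return graph


# The execution waiting at the parallel join, seen from the inclusive join:
print(is_reachable(build_process(False), "parallel_join", "inclusive_join"))  # False
print(is_reachable(build_process(True), "parallel_join", "inclusive_join"))   # True
```

Without the reverse branch the waiting execution cannot reach the Inclusive join, so the join fires. With the reverse branch a path exists through the loop, and a purely structural check like this one cannot tell that the path only becomes available after both joins have already been passed.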
After finding out the reasons, the question involuntarily arose: is this a bug or a feature? In my opinion, this is a bug. I am now collecting information and cases to send a report to the Camunda team.
It’s good that the problem is localized. But what about now?
Actually, now - conclusions:
- Forewarned is forearmed. We must build our processes taking this behavior of Camunda into account.
- There is the workaround described above: you can use a common parallel join.
- You can move the logic with the Inclusive Gateway into a subprocess. Then there will be no such problem, since executions are checked within a specific process.
- Do not use nested gateway chains; try to get by with simpler constructions whenever possible. For example, use a Parallel Gateway and check the conditions on the parallel branches.
Apparent simplicity and clarity are sometimes misleading. This can only be countered by accumulating and sharing knowledge. Alas, at the time of solving this problem, I did not have deep knowledge of the Inclusive join logic, so I had to tinker. I gained this knowledge through trial and error, phoning a friend and debugging the source.
From all this follows an obvious and far from new conclusion: you need to understand how the tool you use works. The better you understand it, the fewer such problems you will have.
The second conclusion is also quite obvious: you need to decompose not only the code, but also the processes.
Links that were useful in parsing this case and writing an article: