Casey Rosenthal, CEO and co-founder of Verica.io, spoke at a Test in Production meetup. Casey debunked some of the myths about reliability and explained that many intuitive steps taken to increase system reliability are actually counterproductive. He also explained how the concept of Continuous Verification helps developers avoid such pitfalls.

Casey's full talk:



Host: It's April 30, 2020. My name is Yoz Grahame, and I'm a developer advocate at LaunchDarkly. Our guest today is Casey Rosenthal, CEO and co-founder of Verica.io. Hi Casey, thank you for joining us.

Casey: I'm glad to be with you today.

Host: So Casey, you are the person who recently wrote, or rather co-wrote, the book on chaos engineering published by O'Reilly.

Casey: Yeah. I wrote it with Nora Jones, and many more people helped us; it was wonderful. We gathered perspectives from people at companies such as Google, Slack, Microsoft, LinkedIn, Capital One, and others.

Host: Cool. I flipped through it quickly, and there are interesting things in it that I hope we'll talk about now. Then our viewers will ask questions. Casey will talk for about 20 minutes. Let's get started.

Casey: Great. So, I want to talk about two different mental models and their impact on what we do in production; about the evolution we are seeing in DevOps; and about best practices in this area.

I work for Verica, but I won't talk about that; I'll come at this from another angle. I want to start with a few myths about reliable systems, about what makes systems reliable. This is partly to stimulate discussion, and partly to see how people react to what I say.

Reliability Myths


The first myth says that you can make a system more reliable by removing the people who cause incidents. That is, it assumes there are people who are more likely to create incidents, or who are less careful. This is sometimes tied to the idea of managing out the "bad apples." However, the statistics convincingly show that this is a fallacy.

Unfortunately, the healthcare system in the USA sometimes experiments with such ideas. For example, about 80% of medical malpractice claims are associated with 5% of American doctors. At first glance the conclusion seems obvious: these are bad doctors, get rid of them and the number of lawsuits will drop. Unfortunately, it isn't so: that 5% includes the doctors who take on high-risk cases and handle them well, which is why they are associated with a large number of malpractice claims. If we get rid of these doctors, we lose the people with the most experience and knowledge of such cases. The same applies to your systems. If you have a team or an employee whom you believe reliability problems are associated with, getting rid of that person is unlikely to make the system more reliable.

The second myth says that you can increase reliability by documenting best practices and standard procedures (runbooks). This is a fallacy. I'm not saying you shouldn't document anything, especially if documentation helps you communicate effectively within the organization. However, most critical failures are unique cases; there are no best practices for them. Let's be honest: in our field there are no best practices at all, only popular ones. We have no idea what the best practice for a complex system could even be.

Back to standard procedures. They may be an effective way for some people to communicate or to capture their personal knowledge, but to really make a system reliable you need experienced people. Runbooks are not enough. People cannot transfer their experience to one another that way, especially the kind of experience needed for improvisation and adaptation.

The third myth says that if you identify the root causes of failures and protect yourself against them, the system will become more reliable. It sounds very convincing. Again, all of these myths are intuitive, which is why they are so easy to believe. But this claim has several weaknesses:

  • If you find vulnerabilities in a complex system and protect yourself from them, no one can guarantee that you will prevent the emergence of new unique vulnerabilities.
  • In complex systems, there are no root causes, and analyzing them is, at best, a waste of time. Root cause analysis in a complex system is a bureaucratic exercise in finding someone to blame for specific situations. It does not benefit the system and does not increase its reliability. But we'll talk more about this when I answer questions.

The fourth myth says that you can enforce procedures: that someone else knows better what should be done and how, while the people actually doing the work don't really follow the rules anyway. This is almost always a mistake for systems whose reliability needs to increase, especially complex systems. Usually it just shows how far the picture held by people at higher levels diverges from how the work is actually carried out.

The fifth myth says that if you avoid risk, the system becomes safer. It sounds very reasonable: "don't do anything risky." Today we hear about guardrails everywhere: put up guardrails, avoid risk, and the system will be safer. In practice, this gets in the way of increasing reliability. First, people don't get the experience they need to adapt, improvise, and innovate in order to make the system more reliable or reduce the impact of incidents. And second, as a rule, the guardrails become an obstacle for the very people who are best able to cope with incidents when they arise. Again, no matter how rational the myth sounds, in practice it is counterproductive.

The sixth myth says that if you simplify the system, it will become more available or more reliable. This is about complex systems: just remove the complexity and the system will get better. There are pitfalls here, too. From experience we know that more complex systems are often safer or more available than simpler ones, and the data bears this out. Complexity is also tied to business success: your customers pay for complexity. You could say a developer's job is essentially to add complexity to the product. And if you remove the complexity, you cap the potential success of your product.

You can look at it another way. Accidental complexity always grows as you work on software, and so does essential complexity, the kind that is added intentionally and shows up as features. Since both kinds of complexity are always growing, and we have not developed methods for sustainably reducing them, simplification will not make your system more reliable.

And the last myth: redundancy means reliability. This is an interesting claim. Much research remains to be done here, because we don't understand the relationship between redundancy and reliability very well. But we can already point to many cases where redundancy was an important factor in a failure or incident. This is true both in software and in engineering more broadly, aerospace in particular; one example is the Challenger disaster. So at best, redundancy is neutral with respect to reliability.

Mental models in software development


Now let's talk about two mental models. We will associate them with different functions, as well as with continuous verification and some other concepts.

The first model describes how developers make everyday design decisions, often without even talking about them. Within this model, developers effectively sit between three poles: economics, workload, and safety. You can imagine developers tied to these three poles with rubber bands. If you move too far from one pole, the band snaps and the game is over.

Subconsciously, developers feel how far they are from the economics pole. Your company probably has no rule prohibiting developers from spinning up a million EC2 instances on AWS. Why not? Because it's common sense. Developers understand that everything costs money, that they and their teams cost money, and that they shouldn't spend more than they have. So developers balance against this pole in the decisions they make throughout the day.

The same goes for workload, whether of people or of systems. Developers understand how much work their servers can do. Like everyone else, developers have limits to their own capacity and productivity, and they feel their relationship to this pole. They make sure not to drift too far from it, or their servers will be overloaded or they themselves will burn out.

The same goes for safety. The difference is that developers have no intuitive sense of how far they are from this pole. And I'm comfortable generalizing that to the entire industry, because incidents keep happening. Leaks and hacks happen because there are surprises. If we, as developers, knew in advance what was about to happen, we would stop and do something different in order to prevent it. We would obviously change our behavior.

This is one of the models that lays the groundwork for understanding chaos engineering. The principlesofchaos.org site gives my favorite definition: "conducting experiments to identify systemic vulnerabilities." Chaos engineering teaches us about safety: it helps us build an intuition for how far we have drifted from the safety pole. And that knowledge implicitly changes the decisions developers make, which is quite important.

There are a number of principles in chaos engineering. One of them says that experiments should be run in production. But that doesn't mean there's no point in experimenting to find systemic vulnerabilities in a staging environment or other non-production environments. There is. I can give many examples where that was genuinely useful.

But in general, when it comes to complex systems, you should study the specific system you actually care about. So the gold standard of chaos engineering is running experiments in real production. That was the first mental model.

The second model is called the "economic pillars of complexity." It is simply a framework for thinking about the evolution of technology. There are four pillars in this model:

  • State.
  • Relationships.
  • Environment.
  • Reversibility.

The clearest example of this model is Ford in the 1910s, when the company controlled the amount of state in its manufacturing by telling customers they could have any car they wanted as long as it was a black Ford Model T. The ability to control one of the four pillars helped the company navigate the complexity of its manufacturing process.

They could also manage relationships. Other automakers had teams that built the whole car. Ford moved to assembly-line production, scientific management, and Taylorism to limit the number of relationships between components and between people. This helped them navigate the complexity of the market.

The company also managed, or influenced, its environment, first by building public trust in automobiles in the USA, which allowed it to do business at all. Eventually it became a near-monopoly and continued to control its environment.

And that brings us to the fourth pillar, reversibility. Here Ford couldn't do much: you can put a car into reverse gear, but you can't reverse the production process. Ford had no way to take advantage of reversibility.

How is this model applicable to software development?

State. Application state and functionality typically grow. As a developer, you are generally paid to add features, so you don't have much control over this pillar.

Relationships. Unfortunately, our industry keeps adding layers of abstraction. Whether we like it or not, interconnections arise between moving parts. Today many of us work remotely, which adds complexity to human relationships, and architectural shifts such as microservices, which decouple teams' feature delivery cycles, add relationships between services. In software development we have little control over the growing number of relationships.

Environment. Most software companies are not monopolies, so we cannot influence our environment much.

Reversibility. This is where software development shines. The shift from the waterfall model to extreme programming and Agile illustrates it clearly.

The waterfall model says: "We plan the entire product, build it, and deliver it to the customer." A year later the customer says: "This is not what I wanted." To put it bluntly, everyone just has to live with that.

Extreme programming says: "In about a week we will show the customer something." And the customer says: "This is not what I wanted." Fine; that decision is easy to reverse, because it's easy to throw away a week's worth of work. The second week begins, we again show something to the customer, and we hope that by the third week we will understand what the customer really wants.

This process-level embrace of reversibility was a way of navigating a new and complex production process. But we can also make explicit architectural decisions that improve reversibility, and there are few better examples than feature flags.

Feature flags help us navigate a complex software production system: they let us make explicitly reversible architectural decisions, and that is important. Add source control, automated canaries, blue-green deployments, and chaos engineering as part of the feedback loop, and observability as part of the feedback loop. All of this improves reversibility and makes it easier to navigate complex systems.
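To make the reversibility point concrete, here is a minimal sketch of a feature-flag guard; the flag store and function names are hypothetical, not any particular vendor's SDK:

```python
# A minimal sketch of how a feature flag makes an architectural decision reversible:
# the new code path ships dark and can be switched off in seconds, without a redeploy.
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """In-memory stand-in for a real flag service (hosted or homegrown)."""
    flags: dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, key: str, default: bool = False) -> bool:
        return self.flags.get(key, default)

    def set(self, key: str, value: bool) -> None:
        self.flags[key] = value


flags = FlagStore()


def new_recommendation_service(user_id: str) -> list[str]:
    return [f"new-item-for-{user_id}"]          # the risky new path


def legacy_recommendations(user_id: str) -> list[str]:
    return [f"legacy-item-for-{user_id}"]       # the proven fallback


def fetch_recommendations(user_id: str) -> list[str]:
    # The old, known-good path stays in place; the flag decides at runtime.
    if flags.is_enabled("use-new-recommendation-service", default=False):
        return new_recommendation_service(user_id)
    return legacy_recommendations(user_id)


if __name__ == "__main__":
    print(fetch_recommendations("42"))           # flag off by default: legacy path
    flags.set("use-new-recommendation-service", True)
    print(fetch_recommendations("42"))           # new path; one call reverses the decision
```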

And the last thing I want to say regarding software and mental models:

The chief merit of development is its technical efficiency, with a premium placed on precision, speed, expert control, continuity, discretion, and optimal returns on input.
Merton.

A verbose but good definition of development. Except that what I actually quoted is Merton's definition of the chief merit of bureaucracy.

And I want to emphasize this. In some circles, development is considered a bureaucratic profession, and that carries a negative connotation. Perhaps it is one. And why would anyone complain about that? In no other industry is there such a clean division between those who decide what needs to be done, those who decide how to do it, and those who actually do it. That is an idealized bureaucracy.

Chief architects, product experts, project managers, managers, tech leads: all these roles exist to take decision-making responsibility away from the people carrying out the work itself. All of this brings us back to the Taylorism of the 1910s. But you need to recognize that this is the wrong model for development if you believe that knowledge is the foundation of this profession.

It is a right, or at least acceptable, model for making widgets. But most of us do not make widgets. So if you believe that knowledge is at the heart of software production, bureaucracy is counterproductive. My point is that we are not trying to reduce complexity; we are trying to navigate it.

The evolution of complex systems


Continuous integration is bounded by the number of bugs you can tolerate while speeding up code merging: when code is integrated continuously, errors surface sooner and don't compound into new ones. And if we get this right, everything is fine and engineers can build features faster.

That later evolved into continuous delivery: we can build software faster, so we need to ship it faster, roll it out faster, and change our thinking accordingly. This is an important part of the reversibility that grew out of that evolution.

And now we are watching a new concept emerge, continuous verification: "Under the hood we have a pile of very complex, fast-moving software. How do we keep it on the road? How do we make sure a complex system behaves the way the business needs it to, given the very high speed at which features are created and shipped to production? In other words, how do we move fast without breaking things?"

You can look at it from the other side. Continuous verification is proactive experimentation that confirms system behavior, whereas most of the well-known tools in the industry lean toward reactive testing methodologies that validate known properties. I'm not saying that's bad; on the contrary, it's good and useful. But those techniques and disciplines arose in the era of simpler systems and were adapted to them. With complex systems, you have to scale all of this to get the properties your business needs.
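As a rough illustration of what "proactive verification" might look like in practice (the metrics source, thresholds, and function names below are placeholders, not any specific product's API):

```python
# A sketch of continuous verification: instead of a one-off test of a known property,
# a small loop keeps verifying a business-level steady-state hypothesis against the
# live system. The metrics source here is a stub; a real setup would query monitoring.
import random
import time


def current_error_rate() -> float:
    """Stub for a query to a metrics backend."""
    return random.uniform(0.0, 0.02)


def verify_steady_state(max_error_rate: float = 0.01) -> bool:
    """The hypothesis: under normal conditions, the error rate stays below the SLO."""
    rate = current_error_rate()
    ok = rate <= max_error_rate
    print(f"error_rate={rate:.4f} slo={max_error_rate} ok={ok}")
    return ok


def run_continuous_verification(iterations: int = 5, interval_s: float = 1.0) -> None:
    # In production this would run indefinitely and alert someone (or halt an
    # experiment or rollout) whenever the hypothesis is falsified.
    for _ in range(iterations):
        if not verify_steady_state():
            print("steady-state hypothesis falsified: investigate before moving faster")
        time.sleep(interval_s)


if __name__ == "__main__":
    run_continuous_verification()
```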

Reliability is created by people, not by tools, but tools can help.

Now we can go to the questions.

Questions and Answers


Host: You warned us at the beginning that you would be saying things that go against generally accepted ideas. I mean, we have tested in production and created chaos in the most proper way, where you look calm on the surface and scramble around madly behind the scenes.

But there is much more to do. Some of what you said about the myths threw me a bit. It has always seemed to me that documentation and a set of standard procedures are extremely important. So I'd like to hear more about the value of personal experience and how that played out at the companies you've worked with.

Casey: First, let's talk about resilience. Outside the software industry, there is a whole engineering discipline devoted to it. I think the most useful definition of resilience is "the adaptive capacity to handle incidents." And adaptation requires some kind of improvisation, right? It requires people whose skills and knowledge allow them to improvise.

Host: True.

Casey: A set of procedures doesn't have that and can't. However well you document something, you cannot package the necessary knowledge, skills, or experience into a runbook, not even enough for another person to recognize that this particular runbook is the one to follow.

If a person has to consult a runbook, then applying it is, in effect, a guess on their part. And because it's a guess, you have put a ceiling on the reliability your system can achieve. You cannot increase reliability by investing more in writing runbooks.

Now about documentation. If that's how you like to communicate, great: get better at it. Communication is the hardest part of most professions, so use whatever methods suit you best and improve at them. But that in itself will not improve reliability.

Host: True. I find this important because one of the most popular books of the last 20 years has been Atul Gawande's The Checklist Manifesto. Some of its ideas came from medicine, for example the idea of strictly following checklists in crisis situations. Is that the same thing as standard procedures? Or is there a difference I'm missing?

Casey: I think there is a difference. It's great if a tool you already know helps you get the job done. That's different from standard procedures, which try to be sufficient on their own, to supplant human improvisation or adaptation. Do you see what I mean?

Host: Yes.

Casey: Standard procedures are remediation tactics. A checklist isn't used as that kind of tactic; it's a tool that helps you apply your experience.

Host: Maybe it's about diagnosis? Making sure your assumptions are checked, or that you have a complete set of facts, before you try to remediate?

Casey: More often than not, a checklist has nothing to do with remediation. A pilot, for example, runs through a checklist before takeoff so as not to forget anything. You're borrowing a tool from another profession to support your own expertise.

Host: Well, yes.

Casey: Again, the main difference from standard procedures...

Host: This is not a recovery.

Casey: Yes.

Host: Yeah, fine. And this is also fascinating, given that we've wandered into the territory of old sayings and terrible jokes about automation from the 1960s through the 80s. Like the one about the system run by a man and a dog: the man is there to watch the control panel, and the dog is there to bite the man if he tries to touch the controls.

The idea being that the automated system is the reliable part and the human is the unreliable organic part, yet it's that organic nature and the ability to improvise that provide reliability.

Casey: Resilience, rather.

Host: Yes, resilience.

Casey: You can have a reliable system with lots of automation. But by definition you cannot use automation to make a system more resilient, because we have not yet learned how to automate improvisation.

Host: Exactly. And lately there's been more work on supposedly self-healing systems. I know some companies, especially in the performance, alerting, and monitoring space, are building self-healing technology, trying to use AI for things like anomaly detection or establishing cause-and-effect relationships. Do you know of any successes in this area?

Casey: Of course. You can raise the baseline level of reliability by implementing genuinely self-healing, AI-based technology. What we call self-healing today will simply be called an algorithm tomorrow. Look at the bulkhead pattern, circuit breakers, or failure-prevention algorithms: remediation technology like this appeared 10 to 20 years ago. A system that detects anomalous deviations and even just sends a notification could be called a primitive form of self-healing, right?
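For readers unfamiliar with those patterns, here is a bare-bones sketch of one of them, a circuit breaker; the thresholds and class shape are illustrative, not taken from any specific library:

```python
# A bare-bones circuit breaker, the kind of decades-old "self-healing" algorithm
# mentioned above: after enough consecutive failures it stops calling the flaky
# dependency for a cool-down period instead of hammering it.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failure_count = 0
        self.opened_at = None  # monotonic time when the breaker tripped; None means closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast instead of calling dependency")
            self.opened_at = None              # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failure_count = 0                 # success resets the count
        return result
```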

Host: Yes.

Casey: It doesn't do everything by itself; it uses a human for the rest. That is a perfectly valid algorithm for improving the reliability of a system. But again, there is no improvisation and no context in it. Adopting a fashionable new term like self-healing will not make your system resilient, because it doesn't make the system smarter than the engineer who had to think through the failure modes ahead of time.

We know that the defining property of a complex system, as opposed to a simple one, is that it does not fit in any one person's head.

Host: True.

Casey: Therefore it makes no sense to assume that a software engineer can anticipate every failure that might occur in that system.

Host: Yes.

Casey: And if that's so, then your self-healing algorithms can be as smart and detailed as you like, but by definition they will not cover every situation, because they depend on a person having thought through the necessary conditions in advance. And we've already admitted that that is impossible.

Host: True.

Casey: And chaos engineering helps people get the experience they need to adapt better. That can lead to better automation; it can lead to self-healing or failure-prevention algorithms that make the system more reliable. But above all, chaos engineering tells people something about the system that they didn't know before, and that lets them implicitly change their behavior to make the system more reliable.

Host: That clarifies the picture; I'm new to this. But it sounds like you're saying reliability and resilience are two different layers, and that resilience is a higher level of decision-making that kicks in when the automated systems, which are bound to fail eventually, do fail.

Casey: Yeah, I think that's the right way to look at it. I treat the two as different properties. Your system is reliable in different ways and to different degrees, and there is always a boundary beyond which reliability is lost. The simplest example: you put a physical system under load testing, and at some point it eventually fails.

Resilience, on the other hand, has no known point at which it fails; there is no way to demonstrate such a limit. You can always say, well, let's do something else instead. There are always options; you can't enumerate all the alternatives.

Host: Right, right. That fits the pattern we mentioned earlier, that the level of complexity exceeds our ability to manage it. In fact, that's what has been happening for decades: an arms race between complexity and the tools for managing it. And complexity always wins.

Casey: I don't know about winning. But one of the reasons I love LaunchDarkly is that feature flags are an explicit architectural decision that helps you navigate a complex system.

In software development, we pretend that complex systems are the enemy, even though, compared to the rest of our lives, software is probably the simplest kind of system people deal with. Think about the complexity of human relationships. Or about driving a car and mentally modeling what other drivers are doing. That's why autonomous driving hasn't taken off yet: developers can't model other people on the road and their intentions. Computing the physics is the comparatively easy part.

So we as people, even as software engineers, deal with complex systems constantly. They are the foundation of our lives. But for a number of reasons we don't like to see them in our work. Truly complex systems are not our enemy; they are what allow us to be successful. But we need things like feature flags, because they let us work comfortably with the complex systems we create. We need to know that we can make an architectural decision and then quickly change our minds if necessary.

Host: I like to think about reversibility at different timescales. A project has its timeline. With Agile you can change direction within an iteration of a couple of weeks. And for quick remediation you can use reversibility over much shorter periods of time.

Casey: If you hire a consultant to increase the speed of feature development, they will focus on process. And that's wonderful: you can adopt Agile, or Scrum, or something else. But what we really want is to increase the speed of feature development through architectural decisions. That's more convenient, gives more options, and is easier to measure. Real benefits, right?

Host: That's right. It's like having an actual tool in hand. The benefits of a process are theoretical and much harder to grasp. But when we have tools in front of us and use them in everyday work, when we can flip a flag and see the change in milliseconds, that's a tactile pleasure.

Casey: Yeah.

Host: So, we already have some great questions. You're often asked how to take the first steps in chaos engineering. But let's step back first: in your experience, and especially given those myths, where is it most important to invest in incident-handling practices?

Casey: I would first look at what the resilience engineering community says about post-mortems, or better, about learning-oriented incident reviews. For example, there is a wonderful article by John Allspaw, "Blameless PostMortems." It's many years old now, and I think John has much better material today, because blameless post-mortems alone are no longer enough.

The point is not just that we avoid blaming people, but that incidents don't have culprits at all. In complex systems there are no root causes. So looking for them, or settling for "this is the problem, but nobody is to blame for it," is not the best approach.

So blameless learning reviews are a more effective way to handle incident follow-up, treating incidents as lessons. As for time to remediate, there are many models and I don't have a favorite. Different models weigh factors beyond the software itself and may work better for some organizations.

On one hand there are coordination models with an incident-commander role, where you assemble a group according to certain rules. However, there is research showing that having an incident coordinator can sometimes cost more than simply letting a group of people do their normal work. Knowledge workers have the tools and the authority to make decisions that let them recover the system from an incident on their own.

There has been a lot of interesting research in this area. But I won't give specific recommendations for incident remediation. Incident response offers many compelling examples of how to do this well, and also many examples of bad post-mortems and root cause analysis, which at least show you which processes will just waste your time. I think I've answered the question in a roundabout way.

Host: No, that's fine. The blameless learning reviews I've come across have worked well. And people like Allspaw explain why the traditional kinds of root cause analysis don't apply here.

Casey: Yes, his company, Adaptive Capacity Labs, has developed metrics to help you understand whether you're getting what you need. For example, if people voluntarily read the write-ups after incident reviews, that's a sign you're doing it right.

Host: We have a question about feature flags: what's the best way to manage dependencies when there are a lot of them? I think I can broaden the question from dependencies to relationships as one of the pillars of complexity. I've been at companies working with large legacy systems that carry an incredible number of dependencies, not only in software but also organizational and even event dependencies. It's simply impossible to track it all.

What is your advice on how to manage dependencies in such cases?

Casey: I'm no expert at dependency management. I know of cases where people get overwhelmed by an excess of unmanaged feature flags. I think there are many interesting approaches in this area.

I think that as new layers of abstraction keep appearing above us, it will be more useful not to try to solve this problem but to navigate it as a problem of complexity.

Host: I agree. You're saying it can feel murky, but the number of relationships also grows for a reason.

Casey: Yes. And sometimes the reason may not be obvious to you, or may not matter for your personal goals or the goals of the organization or business. For example, at Verica we did a lot of research last year and noticed a certain disappointment in the enterprise market with Kubernetes as an additional layer in the deployment process. Many people don't like it. Just supporting Kubernetes as a platform, let alone looking for new ways to integrate with it, is seen as extra work that doesn't help them reach their more immediate goals.

It's like an explosion of dependencies. If your focus is on the developer experience of managing dependencies, rather than on the language, libraries, or security, you will want to limit the number of dependencies. But in different circumstances, if you're trying to speed up feature delivery, you'll want the opposite: "Let me pull in as many dependencies as possible, because I want to take the results of someone else's work and use them."

Depending on your circumstances, you may have a completely different view of how dependencies should spread. I'm an inbox-zero person myself, but sometimes there are so many important emails that all you can do is give up. I know this attitude to dependency management will disappoint a lot of people, and I understand them. But I'm not an expert here, so I'll stop talking.

Host: Well, that strikes me as an interesting philosophy. I think it applies to a lot of people who are good at coping with messy situations. You don't need to fight it; the situation exists for a reason. The best thing is to learn to navigate it, to diagnose it.

That resonates with me because I've given talks about debugging. It turns out many people don't realize how capable and easy to use their own debuggers are. They don't appreciate how important it is to tailor tools to a specific situation. In my experience, a tool built for a particular system has rarely failed to pay for itself within a couple of months.

Casey: That's a way for you, as an engineer, to navigate a complex system better: you build a purpose-built tool. Yes, that's reasonable.

Host: A lot of people ask questions about chaos engineering: "How do I start applying this in my organization? And how do I convince leadership?" I guess it all depends on how you present it. What advice do you give in such situations?

Casey: I hear a good book on the subject was recently released [he's hinting at his own book, Chaos Engineering]. The general advice is to start with a game day. There are plenty of ways to organize one. Essentially, you gather the people with the relevant experience and start a discussion of possible outcomes, or of abnormal system behavior. Just that discussion often exposes gaps in knowledge about how the system actually behaves.

So on a game day, you gather in a room and decide to turn off some piece of infrastructure for a few hours, on the assumption that this won't affect customers. The hypothesis of an availability-focused chaos experiment almost always follows the pattern: "Under such-and-such conditions, customers still have a good experience." If you believe customers will have a bad time under those conditions, don't run that experiment. It would just be awful.
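To make that hypothesis pattern concrete, here is a small sketch of how a game-day hypothesis could be written as a runnable experiment; every function here is a placeholder for your own tooling:

```python
# A sketch of an executable game-day hypothesis: measure the steady state, inject
# the failure you believe is tolerable, and verify that customers "still feel good."
def measure_steady_state() -> dict:
    """Placeholder: pull the business metrics that define a good customer experience."""
    return {"checkout_success_rate": 0.995, "p99_latency_ms": 420}


def within_tolerance(metrics: dict) -> bool:
    """The hypothesis: under the injected conditions, customers still feel good."""
    return metrics["checkout_success_rate"] >= 0.99 and metrics["p99_latency_ms"] <= 500


def run_experiment(inject_failure, restore) -> bool:
    # Refuse to experiment on an already-unhealthy system.
    assert within_tolerance(measure_steady_state()), "system unhealthy: do not experiment"
    inject_failure()                    # e.g. stop one cache node for a few hours
    try:
        return within_tolerance(measure_steady_state())
    finally:
        restore()                       # always put the infrastructure back


if __name__ == "__main__":
    ok = run_experiment(inject_failure=lambda: None, restore=lambda: None)
    print("hypothesis held" if ok else "weakness uncovered: stop and learn")
```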

Host: Yeah.

Casey: Some people say chaos engineering is bold or too risky. No, it should not be risky. If you think something will break, just don't do it; fix that first before doing anything else.

Host: True.

Casey: Chaos engineering is not about summoning chaos engineers or creating chaos in your system. The assumption is that the chaos is already there. If you work to surface it, that helps you avoid pitfalls. And that's it. It's a great way to start: plan a game day or a scenario, and then keep digging in that direction.

Running experiments in production is the gold standard of chaos engineering. But I recommend starting with a staging environment if you have one. Many companies are doing away with staging, which is understandable, and feature flags help make that possible. But if you have a staging environment, or hidden features, or different user groups tied to different features, start experimenting there. There's no reason not to.

Hopefully you'll learn something useful about your staging environment and your system that helps you make them more reliable. Or you'll build resilience skills. Or you'll become more confident navigating the system, and that confidence will let you run the same experiments with the same tools in production.

Host: True.

Casey: We are witnessing this evolution.

Host: Well said. We do have a staging environment at LaunchDarkly. We haven't dropped it in favor of doing everything in production, because if you can reduce the risk of an experiment, you should. And you do need to test in production, because you're already doing it anyway; you just don't call it testing.

Finally, I want to ask something. You talked about the "bad apples" problem, about how the people who trigger incidents are not actually bad employees. They just happened to be the ones doing the work...

Casey: Yes, that's usually how it goes. We avoid the word "cause," because it implies that these people are the root cause.

Host: That's right, sorry, you're absolutely right. The people who happened to be passing by, who were nearby: the usual suspects.

So, I worked at Linden Lab, and that's where I learned most of what I know about complex legacy systems. If a mistake you made took down production, you had to wear [inaudible] for the day. I know other companies do this too. But as a colleague explained to me, it's actually a badge of honor: you tried something, did something important, fixed something important, and learned. I know Etsy uses a three-armed sweater for this. What interesting symbols do you know of?

Casey: At Netflix, they either give you badges, or take them away, for taking down production. Google has its own ritual. I understand and accept this, but I can't say I fully support it, because it symbolizes that a person was the cause of the problem. Usually these things happen by accident; sabotage is extremely rare.

If there was no malicious intent, then in a complex system there is no simple chain of cause and effect. That's a correction to how we think about what an incident is, as opposed to "one person shipped one line of code that took down production." Framed that way, you're saying the minimal remediation effort is rolling back that line of code.

But there are other ways to look at it. You could blame the person who wrote the continuous delivery tool for not implementing automatic staged rollouts. Or blame managers for poor communication. Or blame the director for not budgeting for another environment. Or blame the vice president for not explaining to the board why that budget was needed, or for not choosing a different delivery tool. Or blame the CTO for not explaining the importance of the tool.

Which of these is more likely to produce a more reliable outcome: putting a three-armed sweater on the person who shipped a line of code to production, or changing the CTO's behavior? The latter is more likely to increase the reliability of the system. All the root cause analysis and focus on one person won't affect the system's reliability at all. The three-armed sweater at least keeps things lighthearted so the shame doesn't sting, and that matters. But you're still stuck with the idea that this person is to blame rather than the CTO.

Host: Interesting. That suggests there's a big difference between that kind of learning review and the way I might walk management through the "five whys": why did it happen, one, two, three, and so on.

Casey: As Richard Cook says, "I have a six-year-old who asks why all the time; maybe we should send him to run incident reviews." The problem with the five whys is that the questions are completely arbitrary, and the person asking them may be pursuing their own interests. If I'm a director, I'll ask about reporting and poor communication. If someone lower in the hierarchy asks, the questions may end up being about the C language. So the five whys don't help much in an investigation; I'd rather not use them. But I understand that in some organizations, in the hands of a well-informed person with good intentions, the tool can make an investigation more palatable.

Host: Yes, it all depends on your point of view. The paradox is that from lower down you can often see the situation better, but you have less influence over it.

Casey: That's a separate question, right? Who says one point of view is more accurate than another? But with a hierarchy, intuition suggests that to change the whole system it is far more effective to influence the people at the top.

And when it comes to reliability, you need to look at the system as a whole, and not at the front-line employees.

Host: Right. If you're paid a big salary to sit at the top, then you have the ability to fix things. Let's hope so.

It was very interesting, but it's time for us to finish. Thank you for joining us today.

Casey: Thanks for inviting me.

My personal opinion as a CTO, or the distilled takeaways


In my opinion, this was a deep, philosophical conversation. Its main point is that reliability needs to be approached professionally and adjusted to the architecture and capabilities of your organization. Also, testing in production, or continuous verification, is a necessary part of software development, because the number of relationships keeps growing and the systems themselves keep getting more complicated. It is impossible to test everything in QA environments, so you must have feature flags to quickly switch functionality off, and you must build the development process so that you can safely roll back the latest changes.

I definitely disagree with Casey that the benefits of redundancy have not been proven. If they hadn't been, a person would have one eye in the middle of the forehead and one ear somewhere on the chin, and planes would have only one engine. And you can't do without standard procedures: they are above all a matter of team turnover and scaling. You can't staff a team with seasoned professionals all at once; you grow them from junior and mid-level engineers, partly on the basis of standard procedures. They are a foundation for growth and scaling, and for keeping knowledge from being locked in one person's head. In IT terms, they address the bus factor.

Source