If your IT infrastructure is growing too fast, sooner or later you will face a choice: increase headcount linearly to support it, or start automating. For a while we lived in the first paradigm, and then the long road to Infrastructure-as-Code began.


Of course, NSPK is not a startup, but the company had that kind of atmosphere in its first years, and those were very interesting years. My name is Dmitry Kornyakov, and I have been supporting Linux infrastructure with high availability requirements for more than 10 years. I joined the NSPK team in January 2016 and, unfortunately, missed the very beginning of the company, but I arrived at a stage of major changes.

Broadly speaking, our team delivers two products to the company. The first is infrastructure: mail should flow, DNS should work, and domain controllers should let you onto servers that should not go down. The company's IT landscape is huge! These are business- and mission-critical systems, some with availability requirements of 99.999%. The second product is the servers themselves, physical and virtual. Existing ones need to be monitored, and new ones regularly delivered to customers from many departments. In this article, I want to focus on how we developed the infrastructure responsible for the server life cycle.

Getting Started

At the beginning of the journey, our technology stack looked like this:

  • CentOS 7
  • FreeIPA domain controllers
  • Automation: Ansible (+ Tower), Cobbler

All of this was spread across 3 domains in several data centers: one data center hosted office systems and test environments, the rest hosted production.

At one point, creating a server looked like this:

[Diagram: server creation workflow]

The VM template contained CentOS minimal plus the bare essentials, such as a correct /etc/resolv.conf; everything else was applied through Ansible.

The CMDB was Excel.

If the server was physical, then instead of copying a virtual machine, the OS was installed with Cobbler: the MAC addresses of the target server were added to the Cobbler config, the server received an IP address via DHCP, and then the OS was installed.
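The registration step can also be scripted instead of done by hand. Below is a minimal sketch that uses Cobbler's XML-RPC API; it is an illustration, not necessarily how our setup worked. The endpoint URL, credentials, system name, profile name and MAC address are all hypothetical placeholders.

```python
import xmlrpc.client


def register_system(server, token, name, profile, mac, interface="eth0"):
    """Create a Cobbler system record so the host with this MAC can be installed."""
    system_id = server.new_system(token)
    server.modify_system(system_id, "name", name, token)
    server.modify_system(system_id, "profile", profile, token)
    # Interface fields are passed as "<field>-<interface>" keys.
    server.modify_system(
        system_id, "modify_interface", {f"macaddress-{interface}": mac}, token
    )
    server.save_system(system_id, token)
    return system_id


# Example usage (hypothetical endpoint and credentials, not executed here):
#   cobbler = xmlrpc.client.ServerProxy("http://cobbler.example.org/cobbler_api")
#   session = cobbler.login("cobbler-user", "cobbler-password")
#   register_system(cobbler, session, "srv-001", "centos7-x86_64", "52:54:00:12:34:56")
#   cobbler.sync(session)  # regenerate the DHCP/PXE configuration
```

Passing the server proxy in as an argument keeps the function easy to exercise against a stub before pointing it at a real Cobbler instance.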

At first, we even tried to do some configuration management in Cobbler. Over time, though, this started causing problems with porting configurations, both to other data centers and to the Ansible code that prepared VMs.

At that time, many of us saw Ansible as a convenient extension of Bash and made liberal use of constructs built on shell and sed. In short, Bashsible. As a result, if a playbook failed on a server for some reason, it was easier to delete the server, fix the playbook and run it again. In effect, there was neither versioning of scripts nor portability of configurations.

For example, suppose we wanted to change some config on all servers:

  1. Change the configuration on existing servers in a logical segment/data center. Sometimes this takes more than a day: availability requirements and the law of large numbers do not allow all changes to be applied at once. Some changes are also potentially destructive and require restarting something, from services to the OS itself.
  2. Fix it in Ansible.
  3. Fix it in Cobbler.
  4. Repeat N times, once for each logical segment/data center.

For every change to go smoothly, many factors had to be taken into account, and changes happen constantly:

  • Refactoring of Ansible code and configuration files
  • Changes to internal best practices
  • Changes resulting from the analysis of incidents and outages
  • Changes to security standards, both internal and external. For example, PCI DSS is updated every year with new requirements

Infrastructure Growth and the Start of the Journey

The number of servers, logical domains and data centers grew, and the number of configuration errors grew with them. At some point, we settled on three directions in which our configuration management had to develop:

  1. Automation. As far as possible, the human factor should be removed from repetitive operations.
  2. Repeatability. Managing infrastructure is much easier when it is predictable. The configuration of the servers and the tools for preparing them should be the same everywhere. This also matters to product teams: after testing, an application must be guaranteed to land in a production environment that is configured the same way as the test one.
  3. Ease and transparency of making changes to configuration management.

All that remained was to add a couple of tools.

We chose GitLab CE as the code repository, not least because of its built-in CI/CD.

For storing secrets we chose HashiCorp Vault, in part for its great API.
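To give a feel for that API, here is a minimal sketch of reading a secret over Vault's HTTP interface with nothing but the Python standard library. It assumes a KV version 2 secrets engine mounted at `secret`; the Vault address, token and secret path are hypothetical placeholders, and real code would more likely use a dedicated client library.

```python
import json
import urllib.request


def build_kv2_read_request(vault_addr, token, secret_path, mount="secret"):
    """Build a GET request for a secret stored in a KV version 2 engine."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/data/{secret_path.strip('/')}"
    return urllib.request.Request(url, headers={"X-Vault-Token": token}, method="GET")


def extract_kv2_secret(response_body):
    """KV v2 wraps the secret's key/value pairs in data.data."""
    return json.loads(response_body)["data"]["data"]


def read_secret(vault_addr, token, secret_path, mount="secret"):
    """Fetch a secret and return its key/value pairs as a dict."""
    request = build_kv2_read_request(vault_addr, token, secret_path, mount)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_kv2_secret(response.read())


# Example usage (hypothetical address, token and path, not executed here):
#   creds = read_secret("https://vault.example.org:8200", "<token>", "linux/vcenter")
```

Because every secret is one authenticated HTTP call away, the same lookup works equally well from a CI job, a playbook or an orchestrator.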

For testing configurations and Ansible roles we chose Molecule + Testinfra. Tests run much faster if you plug Mitogen into Ansible. In parallel, we started writing our own CMDB and an orchestrator for automatic deployment (in the picture, above Cobbler), but that is a completely different story, which my colleague, the lead developer of these systems, will tell in the future.

Our choice:

  • Molecule + Testinfra
  • Ansible + Tower + AWX
  • Server World + DITNET (in-house development)
  • GitLab + GitLab Runner
  • HashiCorp Vault

[Image]

Speaking of Ansible roles: at first there was one, and after several rounds of refactoring there were 17. I strongly recommend breaking the monolith into idempotent roles that can be run separately; you can also add tags. We split the roles by function: network, logging, packages, hardware, molecule, etc. Overall, we followed the strategy below. I do not claim it is the ultimate truth, but it worked for us.

  • Copying servers from a Golden Image is evil!

    The main drawbacks: you never know exactly what state the images are in right now, or whether every change has reached every image on every virtualization farm.
  • Keep the use of default configuration files to a minimum, and agree with other departments that you are responsible for the main system files, for example:

    1. Leave /etc/sysctl.conf empty; settings should live only in /etc/sysctl.d/. Your defaults go in one file, application-specific settings in another.
    2. Use override files to edit systemd units.
  • Template all configs and deploy them in full; where possible, no sed or its analogues in playbooks.
  • Refactor the configuration management code:

    1. Break tasks into logical entities and rewrite the monolith into roles.
    2. Use linters! ansible-lint, yamllint, etc.
    3. Change the approach! No Bashsible. Describe the state of the system instead.
  • Write Molecule tests for all Ansible roles and generate reports once a day.
  • In our case, once the tests were ready (there are more than 100), they found about 70,000 errors. Fixing them took several months.


Our implementation

So, the Ansible roles were ready, templated and checked by linters. Git servers were even up everywhere. But the question of reliably delivering code to the different segments remained open. We decided to synchronize with scripts. It looks like this:

[Diagram: code synchronization between segments]

When a change arrives, CI starts: a test server is created, the roles are applied and then tested with Molecule. If everything is OK, the code is merged into the branch. But we do not apply the new code to existing servers automatically. This is a kind of safety catch, needed for the high availability of our systems. And when the infrastructure becomes huge, the law of large numbers kicks in: even if you are sure that a change is harmless, it can lead to sad consequences.
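The testing step of such a pipeline can be sketched as a small script that runs `molecule test` for every role and reports which ones failed. This is an illustrative sketch, not our CI code: the directory layout (one directory per role, each with a `molecule/` subdirectory) and the function names are assumptions.

```python
import subprocess
from pathlib import Path


def find_roles_with_tests(roles_root):
    """Return the role directories that contain a molecule/ directory."""
    root = Path(roles_root)
    return sorted(p for p in root.iterdir() if (p / "molecule").is_dir())


def run_molecule(role_dir):
    """Run the default Molecule scenario for one role; True means it passed."""
    result = subprocess.run(["molecule", "test"], cwd=role_dir)
    return result.returncode == 0


def failed_roles(roles_root, runner=run_molecule):
    """Names of the roles whose tests failed; an empty list means all passed."""
    return [role.name for role in find_roles_with_tests(roles_root) if not runner(role)]
```

A CI job would let the change through only when `failed_roles` returns an empty list. Applying the change to existing servers stays a separate, deliberate step.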

There are also many ways to create servers. We ended up choosing custom Python scripts. And for CI, Ansible:

- name: create1.yml - Create a VM from a template
  vmware_guest:
    hostname: "{{ datacenter }}.domain.ru"
    username: "{{ username_vc }}"
    password: "{{ password_vc }}"
    validate_certs: no
    cluster: "{{ cluster }}"
    datacenter: "{{ datacenter }}"
    name: "{{ name }}"
    state: poweredon
    folder: "/{{ folder }}"
    template: "{{ template }}"
    customization:
      hostname: "{{ name }}"
      domain: domain.ru
      dns_servers:
        - "{{ ipa1_dns }}"
        - "{{ ipa2_dns }}"
    networks:
      - name: "{{ network }}"
        type: static
        ip: "{{ ip }}"
        netmask: "{{ netmask }}"
        gateway: "{{ gateway }}"
        wake_on_lan: True
        start_connected: True
        allow_guest_control: True
    wait_for_ip_address: yes
    disk:
      - size_gb: 1
        type: thin
        datastore: "{{ datastore }}"
      - size_gb: 20
        type: thin
        datastore: "{{ datastore }}"

This is where we have arrived; the system continues to live and evolve.

  • 17 Ansible roles for configuring a server. Each role solves a separate logical task (logging, auditing, user authorization, monitoring, etc.).
  • Role testing: Molecule + Testinfra.
  • In-house development: CMDB + orchestrator.
  • Server creation time of ~30 minutes, automated and almost independent of the task queue.
  • The same state/naming of infrastructure in all segments: playbooks, repositories, virtualization elements.
  • Daily check of server state, with reports on discrepancies from the standard.
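The idea behind the daily check in the last item can be sketched in a few lines: compare what a server actually has with what the standard says it should have, and list every mismatch. This is a simplified illustration with made-up settings, not the code of our reporting system.

```python
def find_discrepancies(standard, actual):
    """Compare a host's actual settings with the standard and list every mismatch."""
    report = []
    for key, expected in sorted(standard.items()):
        if key not in actual:
            report.append(f"{key}: missing (expected {expected!r})")
        elif actual[key] != expected:
            report.append(f"{key}: {actual[key]!r} (expected {expected!r})")
    return report


# Hypothetical standard and host state, for illustration only.
standard = {"vm.swappiness": "10", "net.ipv4.ip_forward": "0"}
actual = {"vm.swappiness": "60"}
for line in find_discrepancies(standard, actual):
    print(line)
```

An empty report means the server matches the standard; anything else goes into the daily report for a human to review.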

I hope my story will be useful to those who are at the beginning of the journey. What automation stack do you use?