
Hello everyone! I am a CV developer at CROC. We have been delivering computer vision projects for 3 years. In that time there is hardly anything we have not done: we monitored drivers to make sure they did not drink, smoke, or talk on the phone while driving, and that they watched the road instead of daydreaming; caught drivers who like to use dedicated lanes or occupy several parking spaces; checked that workers wore helmets, gloves, and so on; identified employees who want to enter a facility; and counted everything that can be counted.

Why am I telling you all this?

While delivering these projects we took our lumps, plenty of them. Some of these problems you already know, and others you will run into sooner or later.

Modeling a situation

Imagine that we got a job at a young company “N” whose business is related to ML. We work on an ML (DL, CV) project, then for some reason switch to another task or take a break, and later return to our own or someone else's neural network.

  1. The moment of truth arrives: you need to somehow remember where you left off, which hyperparameters you tried and, most importantly, what results they produced. Information about the runs can be kept in many places: in your head, in configs, in a notepad, in a working environment in the cloud. I have even seen hyperparameters stored as commented-out lines in the code, a real flight of fancy. Now imagine that you came back not to your own project but to the project of someone who left the company, and you inherited the code and a model called model_1.pb. To complete the picture and convey all the pain, imagine that you are also a beginner.
  2. Let's move on. To run the code, we and everyone else who works with it need to set up an environment. Often the environment is yet another thing we inherit, and recreating it can be a non-trivial task. You don't want to waste time on this step, do you?
  3. We train a model (for example, a car detector). Eventually it is time to save the result. We call it car_detection_v1.pb. Then we train another one, car_detection_v2.pb. Some time later, we or our colleagues train more and more models using various architectures. The result is a pile of artifacts, and the information about them has to be painstakingly collected (but we will do that later, because for now we have higher-priority tasks).
  4. Well, that's it! We have a model! Can we start training the next model, design an architecture for a new problem, or go have a cup of tea? And who is going to deploy it?

Identify problems

Working on a project or product involves many people. Over time people come and go, the number of projects grows, and the projects themselves get more complicated. One way or another, the situations described above (and others) will recur in various combinations from iteration to iteration. The result is wasted time, confusion, frayed nerves, possibly an unhappy customer, and ultimately lost money. Although we all tend to step on the same old rake, I believe nobody wants to relive these moments over and over.


So, we have gone through one development cycle and see that there are problems to address. To do this, we need to:

  • store work results conveniently;
  • make onboarding new employees simple;
  • simplify deployment of the development environment;
  • set up model versioning;
  • have a convenient way to validate models;
  • find a tool for managing model state;
  • find a way to deliver models to production.

Apparently we need a workflow that makes it easy and convenient to manage this life cycle. This practice is called MLOps.

MLOps, or DevOps for machine learning, lets data science teams and IT teams collaborate, and it increases the pace of model development and deployment through monitoring, validation, and management of machine learning models.

You can read what the people at Google think about all this. The article makes it clear that MLOps is a pretty big topic.


In the rest of this article I will describe only part of the process. For the implementation I will use MLflow, because it is an open-source project, it takes only a small amount of code to connect, and it integrates with popular ML frameworks. You can look up other tools, such as Kubeflow, SageMaker, Trains, etc., and perhaps choose one that suits your needs better.

Building MLOps with MLflow as an example

MLflow is an open-source platform for managing the life cycle of ML models.

MLflow has four components:

  • MLflow Tracking - records the results and the parameters that led to them;
  • MLflow Projects - lets you package code and reproduce it on any platform;
  • MLflow Models - handles deploying models to production;
  • MLflow Model Registry - lets you store models and manage their state in a centralized repository.

MLflow operates on two entities:

  • a run (also called a launch below) is one complete training cycle whose parameters and metrics we want to record;
  • an experiment is a “topic” that groups runs together.

All steps in the example were performed on Ubuntu 18.04.

1. Deploy the server

So that we can easily manage our project and get all the necessary information, we'll deploy a server. The MLflow tracking server has two main components:

  • backend store - stores metadata about experiments, runs, and registered models (supports 4 DBMSs: MySQL, MS SQL, SQLite, and PostgreSQL);
  • artifact store - stores artifacts (supports 7 storage options: Amazon S3, Azure Blob Storage, Google Cloud Storage, FTP server, SFTP server, NFS, HDFS).

For the artifact store we'll use an SFTP server for simplicity.

  • create a group

    $ sudo groupadd sftpg 
  • add a user and set a password for it

    $ sudo useradd -g sftpg mlflowsftp
    $ sudo passwd mlflowsftp
  • adjust a couple of access settings

    $ sudo mkdir -p /data/mlflowsftp/upload
    $ sudo chown -R root:sftpg /data/mlflowsftp
    $ sudo chown -R mlflowsftp:sftpg /data/mlflowsftp/upload
  • add a few lines to /etc/ssh/sshd_config

    Match Group sftpg
        ChrootDirectory /data/%u
        ForceCommand internal-sftp
  • restart the service

    $ sudo systemctl restart sshd 

For the backend store we'll use PostgreSQL.

    $ sudo apt update
    $ sudo apt-get install -y postgresql postgresql-contrib postgresql-server-dev-all
    $ sudo apt install gcc
    $ pip install psycopg2
    $ sudo -u postgres -i
    # Create new user: mlflow_user
    [postgres@user_name~]$ createuser --interactive -P
    Enter name of role to add: mlflow_user
    Enter password for new role: mlflow
    Enter it again: mlflow
    Shall the new role be a superuser? (y/n) n
    Shall the new role be allowed to create databases? (y/n) n
    Shall the new role be allowed to create more new roles? (y/n) n
    # Create database mlflow_db owned by mlflow_user
    $ createdb -O mlflow_user mlflow_db

To start the server, you need to install the following Python packages (I advise you to create a separate virtual environment):

    pip install mlflow
    pip install pysftp

Start the server:

    $ mlflow server \
        --backend-store-uri postgresql://mlflow_user:mlflow@localhost/mlflow_db \
        --default-artifact-root sftp://mlflowsftp:mlflow@sftp_host/upload \
        --host server_host \
        --port server_port
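The credentials in both URIs are plain text, so a password containing characters such as @ or / has to be percent-encoded. Here is a minimal sketch of how the same command could be assembled from its parts in Python. The helper names build_store_uri and build_server_command, the host placeholder, and port 5000 are assumptions made for illustration; only the four flags come from the command above.

```python
import shlex
from urllib.parse import quote


def build_store_uri(scheme, user, password, host, path):
    """Assemble scheme://user:password@host/path, percent-encoding the credentials."""
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"{scheme}://{credentials}@{host}/{path.lstrip('/')}"


def build_server_command(backend_uri, artifact_root, host, port):
    """Return the `mlflow server` invocation as an argument list."""
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_uri,
        "--default-artifact-root", artifact_root,
        "--host", host,
        "--port", str(port),
    ]


# Same users, passwords, and database as in the setup steps above.
backend_uri = build_store_uri("postgresql", "mlflow_user", "mlflow", "localhost", "mlflow_db")
artifact_root = build_store_uri("sftp", "mlflowsftp", "mlflow", "sftp_host", "upload")

# "server_host" and 5000 are placeholder values, not taken from the article.
command = build_server_command(backend_uri, artifact_root, host="server_host", port=5000)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where MLflow is installed.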

2. Adding Tracking

So that the results of our training runs do not disappear, future generations of developers can understand what happened, and you and your senior colleagues can calmly analyze the training process, we need to add tracking. Tracking means saving the parameters, metrics, artifacts, and any additional information about a training run, in our case on the server.

As an example, I created a small GitHub project in Keras that segments everything in the COCO dataset. To add tracking, I created a dedicated file in the project.

Here are the lines where the fun part happens:

    def run(self, epochs, lr, experiment_name):
        # get the id of the experiment, creating the experiment if it does not exist
        remote_experiment_id = self.remote_server.get_experiment_id(name=experiment_name)
        # create a "run" and get its id
        remote_run_id = self.remote_server.get_run_id(remote_experiment_id)

        # indicate that we want to save the results on a remote server
        mlflow.set_tracking_uri(self.tracking_uri)
        mlflow.set_experiment(experiment_name)

        with mlflow.start_run(run_id=remote_run_id, nested=False):
            mlflow.keras.autolog()
            self.train_pipeline.train(lr=lr, epochs=epochs)

        try:
            self.log_tags_and_params(remote_run_id)
        except mlflow.exceptions.RestException as e:
            print(e)

Here self.remote_server is a small wrapper around the mlflow.tracking.MlflowClient methods (I made it for convenience), which I use to create an experiment and a run on the server. Next, I specify where the run results should be sent (mlflow.set_tracking_uri(self.tracking_uri)) and enable automatic logging with mlflow.keras.autolog(). At the time of writing, MLflow Tracking supports automatic logging for TensorFlow, Keras, Gluon, XGBoost, LightGBM, and Spark. If your framework or library is not on the list, you can always log explicitly. Then we start training and register the tags and input parameters on the remote server.

A couple of lines, and you and everyone else have access to information about all runs. Cool?
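For a framework without automatic logging, explicit logging might look roughly like the sketch below. It relies only on mlflow.start_run, mlflow.log_params, and mlflow.log_metric. The function name log_training_run, the layout of the history dictionary, and the InMemoryTracker stand-in are assumptions made for illustration and are not taken from the project.

```python
import contextlib


def log_training_run(params, history, tracker=None):
    """Log hyperparameters and per-epoch metrics of one training run.

    `tracker` defaults to the `mlflow` module. Any object exposing
    start_run / log_params / log_metric with the same signatures works,
    which allows a dry run without a tracking server.
    """
    if tracker is None:
        import mlflow as tracker  # third-party: pip install mlflow

    with tracker.start_run():
        tracker.log_params(params)
        # `history` maps a metric name to its values, one value per epoch
        for name, values in history.items():
            for epoch, value in enumerate(values):
                tracker.log_metric(name, value, step=epoch)


class InMemoryTracker:
    """Hypothetical stand-in that records calls instead of sending them to MLflow."""

    def __init__(self):
        self.params = {}
        self.metrics = []

    def start_run(self):
        return contextlib.nullcontext()

    def log_params(self, params):
        self.params.update(params)

    def log_metric(self, key, value, step=None):
        self.metrics.append((key, value, step))


tracker = InMemoryTracker()
log_training_run({"lr": 0.001, "epochs": 2}, {"loss": [0.9, 0.5]}, tracker=tracker)
print(tracker.params, tracker.metrics)
```

With a real server, one would call mlflow.set_tracking_uri and mlflow.set_experiment first, as in the excerpt above, and then call log_training_run without the tracker argument.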

3. Packaging the project

Now let's make starting the project as easy as possible. To do this, add the MLproject and conda.yaml files to the project root.

    name: flow_segmentation
    conda_env: conda.yaml

    entry_points:
      main:
        parameters:
          categories: {help: 'list of categories from coco dataset'}
          epochs: {type: int, help: 'number of epochs in training'}
          lr: {type: float, default: 0.001, help: 'learning rate'}
          batch_size: {type: int, default: 8}
          model_name: {type: str, default: 'Unet', help: 'Unet, PSPNet, Linknet, FPN'}
          backbone_name: {type: str, default: 'resnet18', help: 'example resnet18, resnet50, mobilenetv2...'}
          tracking_uri: {type: str, help: 'the server address'}
          experiment_name: {type: str, default: 'My_experiment', help: 'remote and local experiment name'}
        command: "python \
            --epochs={epochs}
            --categories={categories}
            --lr={lr}
            --tracking_uri={tracking_uri}
            --model_name={model_name}
            --backbone_name={backbone_name}
            --batch_size={batch_size}
            --experiment_name={experiment_name}"

An MLflow Project has several properties:

  • Name - the name of your project;
  • Environment - in my case conda_env indicates that Anaconda is used to run the project and that the dependencies are described in conda.yaml;
  • Entry Points - specifies which files can be run and with which parameters (all parameters are logged automatically when training starts).


    name: flow_segmentation
    channels:
      - defaults
      - anaconda
    dependencies:
      - python==3.7
      - pip:
        - mlflow==1.8.0
        - pysftp==0.2.9
        - Cython==0.29.19
        - numpy==1.18.4
        - pycocotools==2.0.0
        - requests==2.23.0
        - matplotlib==3.2.1
        - segmentation-models==1.0.1
        - Keras==2.3.1
        - imgaug==0.4.0
        - tqdm==4.46.0
        - tensorflow-gpu==1.14.0

You can also use Docker as the runtime environment; see the documentation for details.
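On the receiving end, the training script has to accept the flags that the entry point's command passes to it. A minimal sketch of such a parser is shown below, with defaults copied from the MLproject file. The function name parse_args and the choice to split categories on commas are assumptions; the project's actual script may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the MLproject `command` passes to the training script."""
    parser = argparse.ArgumentParser(description="Train a segmentation model")
    # Assumption: categories arrive as one comma-separated string, e.g. cat,dog
    parser.add_argument("--categories", type=lambda s: s.split(","), required=True,
                        help="list of categories from coco dataset")
    parser.add_argument("--epochs", type=int, required=True,
                        help="number of epochs in training")
    parser.add_argument("--lr", type=float, default=0.001, help="learning rate")
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--model_name", default="Unet",
                        help="Unet, PSPNet, Linknet, FPN")
    parser.add_argument("--backbone_name", default="resnet18")
    parser.add_argument("--tracking_uri", required=True, help="the server address")
    parser.add_argument("--experiment_name", default="My_experiment")
    return parser.parse_args(argv)


# Example invocation with the same style of values used in this article.
args = parse_args(["--epochs=10", "--categories=cat,dog",
                   "--tracking_uri=http://server_host:server_port"])
print(args.categories, args.epochs, args.lr, args.model_name)
```

Parameters without a default in MLproject (categories, epochs, tracking_uri) are marked as required here, so a missing value fails early instead of halfway through training.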

4. Starting the training

We clone the project and go to the project directory:

    git clone
    cd mlflow_example/

To run it, install the required libraries:

    pip install mlflow
    pip install pysftp

Since the example uses conda_env, Anaconda must be installed on your computer (you can get around this by installing all the required packages yourself and adjusting the launch options).

All the preparatory steps are done, and we can start training. From the project root:

    $ mlflow run . -P epochs=10 -P categories=cat,dog -P tracking_uri=http://server_host:server_port

After the command is entered, a conda environment is created automatically and training starts.
In the example above, I passed the number of training epochs, the categories we want to segment (chosen from the COCO category list), and the address of our remote server.
A complete list of possible parameters can be found in the MLproject file.
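When several runs with different hyperparameters are needed, the same command line can be generated for every combination. The sketch below is only an illustration: the function name sweep_commands and the grid values are assumptions, while the parameter names come from the MLproject file.

```python
import itertools
import shlex


def sweep_commands(project_uri, fixed, grid):
    """Yield one `mlflow run` argument list per combination of grid values."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        params = {**fixed, **dict(zip(keys, values))}
        command = ["mlflow", "run", project_uri]
        for key, value in params.items():
            command += ["-P", f"{key}={value}"]
        yield command


fixed = {
    "epochs": 10,
    "categories": "cat,dog",
    "tracking_uri": "http://server_host:server_port",
}
# Hypothetical search space: two learning rates and two backbones.
grid = {"lr": [0.001, 0.0001], "backbone_name": ["resnet18", "resnet50"]}

commands = list(sweep_commands(".", fixed, grid))
for command in commands:
    print(shlex.join(command))
```

Each list can then be executed with subprocess.run(command, check=True); every run ends up on the tracking server under the same experiment, which makes the results easy to compare.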

5. Evaluating the training results

After training is complete, we can open our server address in the browser: http://server_host:server_port


The MLflow UI shows a list of all experiments (top left) and information about the runs (in the middle). For each run we can view more details: parameters, metrics, artifacts, and some additional information.


For each metric we can view its history of changes. So at this point we can analyze the results "manually"; automatic validation can also be set up using the MLflow API.
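An automatic check could be as simple as picking the best run by one metric and comparing it with a threshold. The sketch below works on plain dictionaries so that it stays self-contained. In a real setup the run records would be fetched from the tracking server, for example through MlflowClient. The function names, the record layout, the metric name val_iou_score, and the threshold are all assumptions made for illustration.

```python
def best_run(runs, metric):
    """Return the run with the highest value of `metric`, or None if no run logged it."""
    candidates = [run for run in runs if metric in run["metrics"]]
    return max(candidates, key=lambda run: run["metrics"][metric], default=None)


def passes_validation(metrics, thresholds):
    """Return True if every required metric is present and meets its minimum value."""
    return all(
        name in metrics and metrics[name] >= minimum
        for name, minimum in thresholds.items()
    )


# Hypothetical run records; real ones would come from the tracking server.
runs = [
    {"run_id": "a", "metrics": {"val_iou_score": 0.61}},
    {"run_id": "b", "metrics": {"val_iou_score": 0.74}},
    {"run_id": "c", "metrics": {}},
]

best = best_run(runs, "val_iou_score")
ready = passes_validation(best["metrics"], {"val_iou_score": 0.7})
print(best["run_id"], ready)
```

A gate like this can decide whether a run is worth registering, which is the step described in the next section.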

6. Register the model

Once we have analyzed our model and decided that it is battle-ready, we register it: select the run we need (as shown in the previous section) and scroll down.


Once we give the model a name, it gets a version. If you save another model under the same name, the version is incremented automatically.


For each model we can add a description and choose one of three stages (Staging, Production, Archived). Later we can query these stages through the API, which, together with versioning, gives extra flexibility.


We also have convenient access to all models and their versions.

As in the previous section, all of these operations can also be done through the API.

7. Deploying the model

At this stage we already have a trained Keras model. Here is an example of how to use it:

    import cv2
    import mlflow.keras
    import numpy as np

    # RemoteRegistry is the author's wrapper around MlflowClient, defined elsewhere in the project


    class SegmentationModel:
        def __init__(self, tracking_uri, model_name):
            self.registry = RemoteRegistry(tracking_uri=tracking_uri)
            self.model_name = model_name
            self.model = self.build_model(model_name)

        def get_latest_model(self, model_name):
            # find the registered model and download the artifact of its latest version
            registered_models = self.registry.get_registered_model(model_name)
            last_model = self.registry.get_last_model(registered_models)
            local_path = self.registry.download_artifact(last_model.run_id, 'model', './')
            return local_path

        def build_model(self, model_name):
            local_path = self.get_latest_model(model_name)
            return mlflow.keras.load_model(local_path)

        def predict(self, image):
            image = self.preprocess(image)
            result = self.model.predict(image)
            return self.postprocess(result)

        def preprocess(self, image):
            # resize, scale to [0, 1], and add the batch dimension
            image = cv2.resize(image, (256, 256))
            image = image / 255.
            image = np.expand_dims(image, 0)
            return image

        def postprocess(self, result):
            return result

Here self.registry is again a small wrapper around mlflow.tracking.MlflowClient, made for convenience. The idea is that I connect to the remote server and look up the model with the given name, taking its latest production version. Then I download the artifact locally to the ./model folder and build the model from that directory with mlflow.keras.load_model(local_path). That's it: now we can use our model, and CV (ML) developers can safely improve it and publish new versions.
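The get_last_model helper itself is not shown in the article. One possible way to pick "the latest production version" is sketched below, assuming each version record exposes version, current_stage, and run_id fields. The VersionInfo class is a hypothetical stand-in for such records, and latest_in_stage is an illustrative name, not the project's code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class VersionInfo:
    """Hypothetical stand-in for a registered model version record."""
    version: str        # kept as a string, so it is compared numerically below
    current_stage: str  # e.g. "Staging", "Production", "Archived"
    run_id: str


def latest_in_stage(versions: Iterable[VersionInfo],
                    stage: str = "Production") -> Optional[VersionInfo]:
    """Return the highest-numbered version in the given stage, or None if there is none."""
    in_stage = [v for v in versions if v.current_stage == stage]
    # int() matters: as strings, "9" would sort after "10"
    return max(in_stage, key=lambda v: int(v.version), default=None)


versions = [
    VersionInfo("1", "Archived", "run_1"),
    VersionInfo("9", "Production", "run_9"),
    VersionInfo("10", "Production", "run_10"),
    VersionInfo("11", "Staging", "run_11"),
]

latest = latest_in_stage(versions)
print(latest.version, latest.run_id)
```

The run_id of the selected version is what the class above passes to download_artifact to fetch the model files.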

In conclusion

I presented a system that lets you:

  • centrally store information about ML models, training progress, and results;
  • quickly deploy the development environment;
  • monitor and analyze the progress of work on models;
  • conveniently version models and manage their state;
  • easily deploy the resulting models.

This example is a toy and serves as a starting point for building your own system, which might include automated evaluation of results and model registration (steps 5 and 6, respectively), dataset versioning, or something else entirely. I tried to convey the idea that what you need is MLOps as a whole; MLflow is just a means to an end.

Tell me: which problems have you run into that I did not cover?
What would you add to the system to cover your needs?
What tools and approaches do you use to solve all or some of these problems?

P.S. I’ll leave a couple of links:
github project -
MLflow -
My work mail, for questions -

Our company periodically holds events for IT specialists. For example, on July 8 at 19:00 Moscow time a CV event will be held online; if you are interested, you can register and take part.