Checklist for a machine learning project
Here are 8 steps you should follow in almost every project. Some of the steps can be performed in a different order.
1. Identify the problem from a high-level perspective
This step is about understanding and formulating the business logic of the problem. It should tell you:
- the nature of the problem (supervised/unsupervised, classification/regression),
- the types of solutions you can develop,
- which metrics you should use to measure performance,
- whether machine learning is the right approach to this problem,
- how the problem would be solved manually,
- the inherent assumptions of the problem.
2. Identify data sources and get data
In many cases, this step can come before the first one: if you already have data, you can look for problems worth solving around it to make better use of the incoming data.
Based on your problem definition, determine the data sources, which can be a database, a data warehouse, sensors, etc. For an application deployed in production, this step should be automated with data pipelines that keep data flowing into the system.
- List the sources and the amount of data you need.
- Check whether storage space will be an issue.
- Check whether you are allowed to use the data for your purposes.
- Get the data and convert it to a workable format.
- Check the data types (textual, categorical, numeric, time series, images).
- Set aside a sample for final testing.
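Setting aside the final test sample can be sketched as follows. This is a minimal illustration that assumes scikit-learn is available; the synthetic arrays `X` and `y`, the 20% split size, and the fixed random seed are placeholders, not part of the checklist.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# Hold out 20% as the final test sample; stratify keeps the class ratio
# the same in both parts. The test part is not touched again until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so the same rows stay hidden across experiments.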
3. Initial data exploration
At this stage, you study all the features that affect your result/forecast/goal. If you have a huge data set, take a sample of it for this step to make the analysis more manageable.
- Use Jupyter Notebook, as it provides a simple and intuitive interface for exploring data.
- define the target variable
- define the types of features (categorical, numerical, textual, etc.)
- analyze the relationship between features.
- add multiple data visualizations to easily interpret the effect of each feature on the target variable.
- document your research results.
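A first pass over these points might look like the sketch below. It assumes pandas and NumPy; the data frame, its column names (`area`, `rooms`, `district`, `price`), and the generated values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data set: two numeric features, one categorical, one target.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "area": rng.uniform(30, 120, size=n),
    "rooms": rng.integers(1, 6, size=n),
    "district": rng.choice(["north", "south", "east"], size=n),
})
df["price"] = 1000 * df["area"] + rng.normal(0, 5000, size=n)

# Define the target variable and split features by type.
target = "price"
numeric_cols = df.select_dtypes(include="number").columns.drop(target).tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Relationship between each numeric feature and the target.
corr_with_target = df[numeric_cols + [target]].corr()[target].drop(target)

print(df.dtypes)
print(corr_with_target.sort_values(ascending=False))
# Effect of a categorical feature on the target: mean target per category.
print(df.groupby("district")[target].mean())
```

In a notebook, the same tables are a natural starting point for plots such as histograms per feature or scatter plots against the target.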
4. Exploratory data analysis for data preparation
It's time to act on the findings of the previous step by writing functions for data transformation, cleaning, feature selection/engineering, and scaling.
- Write functions that transform the data and automate the process for upcoming batches of data.
- Write functions to clean the data (imputing missing values and handling outliers).
- Write functions for feature selection and engineering: remove redundant features, convert feature formats, and apply other mathematical transformations.
- Feature scaling: standardize the features.
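The cleaning and scaling functions above could be organized roughly as in this sketch. It assumes scikit-learn and pandas; the helper names `clip_outliers` and `build_preprocessor`, the quantile limits, and the tiny `raw` data frame are hypothetical choices, and quantile clipping is only one of several ways to handle outliers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def clip_outliers(df, columns, lower=0.01, upper=0.99):
    """Return a copy with each column clipped to its quantile range."""
    out = df.copy()
    for col in columns:
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lo, hi)
    return out


def build_preprocessor(numeric_cols, categorical_cols):
    """Impute and scale numeric columns; impute and one-hot encode categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])


# Hypothetical raw data with a missing value in each column and one outlier.
raw = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 29.0, 300.0],
    "city": ["A", "B", "A", np.nan, "B", "A"],
})
clean = clip_outliers(raw, ["age"], lower=0.0, upper=0.8)
features = build_preprocessor(["age"], ["city"]).fit_transform(clean)
print(features.shape)
```

Keeping the steps inside functions and a fitted transformer means the same preparation can be applied unchanged to new batches of data.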
5. Develop a baseline model and then explore other models to select the best
Create a very basic model that serves as the baseline for all other, more complex machine learning models. Checklist:
- Train some commonly used models, such as naive Bayes, linear regression, SVM, etc., using the default options.
- Measure and compare the performance of each model against the baseline and against all the others.
- Use N-fold cross-validation for each model and calculate the mean and standard deviation of the performance metrics across the N folds.
- Study the features that have the greatest impact on the target.
- Analyze the types of errors that the models make when predicting.
- Engineer the features differently.
- Repeat the above steps several times (trial and error) to make sure that you use the right features in the right format.
- Make a shortlist of the best models based on their performance metrics.
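A comparison of a few default models against a trivial baseline could be sketched like this. It assumes scikit-learn and a classification task; the synthetic data from `make_classification`, the choice of accuracy as the metric, and the use of `DummyClassifier` as the baseline are illustrative assumptions. `max_iter=1000` is the only non-default option, set to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the prepared training data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

# 5-fold cross-validation: keep the mean and standard deviation per model.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Shortlist: models ordered by mean score, best first.
shortlist = sorted(results, key=lambda name: results[name][0], reverse=True)
print(shortlist)
```

A model that does not clearly beat the baseline, taking the standard deviation into account, is usually not worth shortlisting.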
6. Fine-tune your models from the shortlist and check for ensemble methods
This should be one of the decisive steps as you approach your final solution. Key points include:
- Tune hyperparameters using cross-validation.
- Use automated tuning methods such as random search or grid search to find the best configuration for your best models.
- Test ensemble methods, such as a voting classifier.
- Test the models with as much data as possible.
- When you are done, use the test sample you set aside at the beginning to check whether the model overfits or underfits.
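The tuning, ensembling, and final check can be sketched together as below. This assumes scikit-learn; the synthetic data, the small parameter grid, and the three estimators in the voting classifier are illustrative choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data, with a test sample set aside up front.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Grid search with 5-fold cross-validation over a small, hypothetical grid.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))

# Hard-voting ensemble of the tuned SVM and two other models.
ensemble = VotingClassifier(
    estimators=[
        ("svm", search.best_estimator_),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

# Final check on the held-out sample: a test score far below the
# training score suggests overfitting.
train_accuracy = ensemble.score(X_train, y_train)
test_accuracy = ensemble.score(X_test, y_test)
print(round(train_accuracy, 3), round(test_accuracy, 3))
```

For larger search spaces, `RandomizedSearchCV` from the same module samples configurations instead of trying every combination.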
7. Document the code and communicate your solution
The communication process is diverse. You must keep in mind all existing and potential stakeholders. Therefore, the main points include:
- Document the code, as well as your approach to the entire project.
- Create a dashboard (for example, with Voilà) or an insightful presentation with visualizations that need no explanation.
- Write a blog post or report on how you analyzed the features, tested various transformations, etc. Describe what you learned (the failures and the methods that worked).
- Finish with the main result and the future scope (if any).
8. Deploy your model to production and monitor it
If your project requires testing the deployment on real data, create a web application or REST API that can be used from all platforms (web, Android, iOS). Key points (they will vary by project) include:
- Save your final trained model to an h5 or pickle file.
- Serve your model through web services; you can use Flask to develop them.
- Connect input sources and configure ETL pipelines.
- Manage dependencies using pipenv or Docker/Kubernetes (depending on scaling requirements).
- You can use AWS, Azure, or the Google Cloud Platform to deploy your service.
- Monitor your model's performance on real data, or simply let people use your model with their own data.
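Saving and restoring the trained model with pickle can be sketched as follows. This assumes scikit-learn and the standard library only; the toy model, the temporary file path, and the helper `predict_from_payload` (a stand-in for what a web handler would do with a request body) are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the final trained model.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Save the model to a pickle file (a temporary directory is used here).
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# In the serving process: load the model once at startup.
with open(model_path, "rb") as f:
    restored = pickle.load(f)


def predict_from_payload(payload, model=restored):
    """Turn a JSON-like dict {"features": [[...], ...]} into predictions."""
    features = np.asarray(payload["features"], dtype=float)
    return model.predict(features).tolist()


sample = np.array([[0.5], [2.5]])
print(predict_from_payload({"features": sample.tolist()}))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code, and keep the library versions the same between training and serving.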
Note: this checklist can be adapted to the complexity of your project.