Machine learning is increasingly moving from hand-designed models to automatically optimized pipelines built with tools such as H2O, TPOT, and auto-sklearn. These libraries, along with methods such as random search, aim to simplify model selection and hyperparameter tuning by finding the best model for a dataset with little or no manual intervention. However, feature engineering, arguably the more valuable part of a machine learning pipeline, remains almost entirely manual.

Feature engineering, also known as feature creation, is the process of constructing new features from existing data to train a machine learning model. This step can matter more than the choice of model itself, because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to the task is essential (see the excellent paper "A Few Useful Things to Know About Machine Learning").

Feature engineering is typically a lengthy manual process that relies on domain knowledge, intuition, and data manipulation. It can be extremely tedious, and the resulting features are limited both by human subjectivity and by time. Automated feature engineering aims to help the data scientist by automatically generating many candidate features from a dataset, from which the best can be selected and used for training. In this article, we will walk through an example of automated feature engineering with the featuretools library for Python. We will use an example dataset to show the basics (stay tuned for future posts using real data). The full code for this article is available on GitHub.

Feature Engineering Basics

Feature engineering means building additional features from existing data, which is often spread across several related tables. It requires extracting the relevant information from that data and placing it into a single table, which can then be used to train a machine learning model.

Creating features is very time-consuming, because each new feature usually takes several steps to build, especially when it draws on information from more than one table. We can group feature-creation operations into two categories: transformations and aggregations. Let's look at a few examples to see these concepts in action.

A transformation acts on a single table (in Python terms, a table is just a Pandas DataFrame), creating new features from one or more of its existing columns. For example, if we have the clients table below,


we can create features by extracting the month from the joined column or by taking the natural logarithm of the income column. These are both transformations because they use information from only one table.
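As a quick sketch, here is how these two transformations look in plain Pandas. The tiny clients table below is made up for illustration; only the joined and income columns are assumed from the text:

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the clients table (values are illustrative)
clients = pd.DataFrame({
    'client_id': [1, 2, 3],
    'joined': pd.to_datetime(['2017-01-15', '2018-06-01', '2016-11-20']),
    'income': [52000, 73000, 61000],
})

# Transformation 1: extract the month from the joined column
clients['join_month'] = clients['joined'].dt.month

# Transformation 2: natural logarithm of the income column
clients['log_income'] = np.log(clients['income'])

print(clients[['client_id', 'join_month', 'log_income']])
```

Both new columns are computed row by row from a single table, which is what makes them transformations rather than aggregations.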


Aggregations, on the other hand, are performed across tables and use a one-to-many relationship to group observations and then compute statistics. For example, if we have another table with information about client loans, where each client can have several loans, we can compute statistics such as the average, maximum, and minimum loan amount for each client.

This process involves grouping the loans table by client, computing the aggregations, and then joining the resulting data back into the client data. Here is how we could do it in Python with Pandas.

import pandas as pd

# Group loans by client id and calculate mean, max, min of loans
stats = loans.groupby('client_id')['loan_amount'].agg(['mean', 'max', 'min'])
stats.columns = ['mean_loan_amount', 'max_loan_amount', 'min_loan_amount']

# Merge with the clients dataframe
stats = clients.merge(stats, left_on='client_id', right_index=True, how='left')

stats.head(10)


These operations are not difficult by themselves, but with hundreds of variables spread across dozens of tables, the process cannot be carried out by hand. Ideally, we want a solution that can automatically perform transformations and aggregations across multiple tables and combine the results into a single table. Although Pandas is a great resource, there is only so much data manipulation we want to do manually! (For more on manual feature engineering, see the excellent Python Data Science Handbook.)


Fortunately, featuretools is exactly the solution we are looking for. This open-source Python library automatically creates many features from a set of related tables. Featuretools is based on a method known as "Deep Feature Synthesis", which sounds much more imposing than it actually is (the name comes from stacking multiple features, not from using deep learning!).

Deep feature synthesis stacks multiple transformation and aggregation operations (called feature primitives in the vocabulary of featuretools) to create features from data spread across many tables. Like most ideas in machine learning, it is a complex method built on simple concepts. By studying one building block at a time, we can develop a good understanding of this powerful method.

First, let's look at our example data. We have already seen part of the dataset above; the full collection of tables is as follows:

  • clients: basic information about clients of a credit union. Each client has only one row in this dataframe.


  • loans: loans made to clients. Each loan has only one row in this dataframe, but clients may have multiple loans.

  • payments: payments made on loans. Each payment has only one row, but each loan will have multiple payments.


If we have a machine learning task, such as predicting whether a client will repay a future loan, we will want to combine all the information about each client into a single table. The tables are related (through the client_id and loan_id variables), and we could use a series of transformations and aggregations to do this by hand. However, as we will soon see, we can instead use featuretools to automate the process.

Entities and EntitySets

The first two concepts in featuretools are entities and entitysets. An entity is simply a table (or a DataFrame, if you think in Pandas). An EntitySet is a collection of tables and the relationships between them. Think of an entityset as just another Python data structure with its own methods and attributes.

We can create an empty entityset in featuretools as follows:

import featuretools as ft

# Create new entityset
es = ft.EntitySet(id='clients')

Now we have to add entities. Each entity must have an index, a column whose values are all unique. That is, each value in the index may appear in the table only once. The index for the clients dataframe is client_id, because each client has only one row in this dataframe. We add an entity with an existing index to the entityset using the following syntax:
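Before adding a table as an entity, it can be worth verifying the uniqueness requirement in plain Pandas. The small frames below are hypothetical stand-ins for the article's tables:

```python
import pandas as pd

# Hypothetical rows from the clients table
clients = pd.DataFrame({
    'client_id': [25707, 26326, 46109],
    'joined': pd.to_datetime(['2002-04-16', '2000-04-14', '2001-04-06']),
})

# An entity index must contain only unique values:
# each client_id should appear exactly once
print(clients['client_id'].is_unique)   # True - usable as an index

# A table like payments has no such column, which is why
# featuretools can generate one via make_index=True
payments = pd.DataFrame({'loan_id': [100, 100, 101],
                         'payment_amount': [500, 500, 300]})
print(payments['loan_id'].is_unique)    # False - a new index must be created
```

This check makes explicit why clients can be added with its existing index while payments needs a generated one.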

# Create an entity from the client dataframe
# This dataframe already has an index and a time index
es = es.entity_from_dataframe(entity_id='clients', dataframe=clients,
                              index='client_id', time_index='joined')

The loans dataframe also has a unique index, loan_id, and the syntax for adding it to the entityset is the same as for clients. The payments dataframe, however, has no unique index. When we add this entity to the entityset, we need to pass make_index=True and specify a name for the index. Also, although featuretools automatically infers the data type of each column in an entity, we can override this by passing a dictionary of column types to variable_types.

# Create an entity from the payments dataframe
# This does not yet have a unique index
es = es.entity_from_dataframe(entity_id='payments',
                              dataframe=payments,
                              variable_types={'missed': ft.variable_types.Categorical},
                              make_index=True,
                              index='payment_id',
                              time_index='payment_date')

For this dataframe, even though missed is an integer, it is not a numeric variable, since it can take only 2 discrete values, so we tell featuretools to treat it as a categorical variable. After adding the dataframes to the entityset, we can inspect any of them:


The column types have been correctly inferred, with the modification we specified. Next, we need to specify how the tables in the entityset are related.

Relationships between tables

The best way to think about a relationship between two tables is the analogy of parent and child. This is a one-to-many relationship: each parent can have multiple children. In the realm of tables, the parent table has one row for each parent, while the child table may have multiple rows corresponding to multiple children of the same parent.

For example, in our dataset, the clients dataframe is the parent of the loans dataframe. Each client has only one row in clients but may have multiple rows in loans. Likewise, loans is the parent of payments, because each loan will have multiple payments. Parents are linked to their children by a shared variable. When we perform an aggregation, we group the child table by the parent variable and compute statistics over each parent's children.
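The parent-child structure can be checked directly in Pandas. The toy tables below are illustrative, not the article's actual data:

```python
import pandas as pd

# Toy parent (clients) and child (loans) tables
clients = pd.DataFrame({'client_id': [1, 2, 3]})
loans = pd.DataFrame({
    'loan_id': [10, 11, 12, 13],
    'client_id': [1, 1, 2, 3],      # client 1 has two loans
    'loan_amount': [5000, 1200, 8000, 3000],
})

# One-to-many: every child row points at exactly one existing parent...
assert loans['client_id'].isin(clients['client_id']).all()

# ...while a parent may have several children
loans_per_client = loans.groupby('client_id').size()
print(loans_per_client.to_dict())   # {1: 2, 2: 1, 3: 1}
```

Grouping the child table by the shared variable is exactly the step an aggregation primitive performs for us.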

To formalize a relationship in featuretools, we only need to specify the variable that links two tables together. The clients and loans tables are linked by the client_id variable, and loans and payments are linked by loan_id. The syntax for creating a relationship and adding it to the entityset is shown below:

# Relationship between clients and previous loans
r_client_previous = ft.Relationship(es['clients']['client_id'],
                                    es['loans']['client_id'])

# Add the relationship to the entity set
es = es.add_relationship(r_client_previous)

# Relationship between previous loans and previous payments
r_payments = ft.Relationship(es['loans']['loan_id'],
                             es['payments']['loan_id'])

# Add the relationship to the entity set
es = es.add_relationship(r_payments)

es


The entityset now contains the three entities (tables) and the relationships that link them together. With the entities added and the relationships formalized, our entityset is complete and we are ready to create features.

Feature primitives

Before we can fully dive into deep feature synthesis, we need to understand feature primitives. We already know what these are; we have just been calling them by different names! They are simply the basic operations we use to build new features:

  • Aggregations: operations performed across a parent-to-child (one-to-many) relationship that group by the parent and compute statistics over the children. An example is grouping the loans table by client_id and finding the maximum loan amount for each client.
  • Transformations: operations applied to one or more columns of a single table. Examples are taking the difference between two columns in the same table or taking the absolute value of a column.
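To make the two categories concrete, here is a hand-rolled sketch of one transformation (a subtraction) and one aggregation (the share of True values per parent, which featuretools calls percent_true) on a toy loans table; the column names are invented for illustration:

```python
import pandas as pd

# Toy loans table with a boolean column
loans = pd.DataFrame({
    'client_id':   [1, 1, 2, 2],
    'loan_amount': [5000, 1200, 8000, 3000],
    'rate':        [3.5, 1.2, 4.0, 2.0],
    'repaid':      [True, False, True, True],
})

# A transformation primitive (subtract): new column from two existing ones
loans['amount_minus_rate'] = loans['loan_amount'] - loans['rate']

# An aggregation primitive (percent_true): share of True values per parent
percent_repaid = loans.groupby('client_id')['repaid'].mean()
print(percent_repaid.to_dict())   # {1: 0.5, 2: 1.0}
```

The transformation stays inside one table, while the aggregation collapses the child rows into one value per parent.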

New features are created in featuretools from these primitives, either used on their own or stacked on top of one another. Below is a list of some of the feature primitives in featuretools (we can also define custom primitives):


These primitives can be used by themselves or combined to create features. To create features with specified primitives, we use the ft.dfs function (which stands for deep feature synthesis). We pass in the entityset, the target_entity, which is the table we want to add the features to, and the chosen trans_primitives (transformations) and agg_primitives (aggregations):

# Create new features using specified primitives
features, feature_names = ft.dfs(entityset=es, target_entity='clients',
                                 agg_primitives=['mean', 'max', 'percent_true', 'last'],
                                 trans_primitives=['years', 'month', 'subtract', 'divide'])

The result is a dataframe of new features for each client (because we made clients the target_entity). For example, we have the month each client joined, which is a transformation primitive:


We also have a number of aggregation primitives, such as the average payment amount for each client:


Even though we specified only a handful of primitives, featuretools created many new features by combining and stacking them.


The full data frame contains 793 columns of new features!

Deep Feature Synthesis

Now we have everything we need to understand deep feature synthesis (dfs). In fact, we already performed dfs in the previous function call! A deep feature is simply a feature built from a combination of multiple primitives, and dfs is the name of the process that creates these features. The depth of a deep feature is the number of primitives required to make it.

For example, the MEAN(payments.payment_amount) column is a deep feature with a depth of 1 because it was created using a single aggregation. A feature with a depth of two is LAST(loans.(MEAN(payments.payment_amount))). It is made by stacking two aggregations: LAST (most recent) on top of MEAN. This represents the average payment size on each client's most recent loan.
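To see what depth 2 means mechanically, we can reproduce that stacked feature by hand in Pandas on toy data (featuretools does all of this automatically; the tables and values below are invented):

```python
import pandas as pd

# Toy tables: each loan has several payments, each client several loans
loans = pd.DataFrame({
    'loan_id': [10, 11, 12],
    'client_id': [1, 1, 2],
    'loan_start': pd.to_datetime(['2013-01-01', '2014-06-01', '2013-03-01']),
})
payments = pd.DataFrame({
    'loan_id': [10, 10, 11, 11, 12],
    'payment_amount': [100, 200, 300, 500, 250],
})

# Depth 1: MEAN of payments for each loan
mean_per_loan = payments.groupby('loan_id')['payment_amount'].mean()

# Depth 2: LAST (most recent loan) of those means, per client
loans = loans.merge(mean_per_loan.rename('mean_payment'),
                    left_on='loan_id', right_index=True)
last_mean = (loans.sort_values('loan_start')
                  .groupby('client_id')['mean_payment']
                  .last())
print(last_mean.to_dict())   # {1: 400.0, 2: 250.0}
```

Each primitive consumes the output of the one below it, which is exactly how dfs stacks aggregations across the table relationships.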


We can stack features to any depth we want, but in practice I have never gone beyond a depth of 2. Past this point, the features become difficult to interpret, but I encourage anyone who is interested to try going deeper.

We do not have to specify the primitives manually; instead, we can let featuretools automatically choose features for us. To do so, we make the same call to ft.dfs but do not pass any primitives:

# Perform deep feature synthesis without specifying primitives
features, feature_names = ft.dfs(entityset=es, target_entity='clients',
                                 max_depth=2)

features.head()


Featuretools has built many new features for us. Although this process creates the features automatically, it will not replace the data scientist, because we still have to figure out what to do with all of them. For example, if our goal is to predict whether a client will repay a loan, we could look for the features most correlated with that outcome. Moreover, if we have domain knowledge, we can use it to choose specific feature primitives or to seed deep feature synthesis with candidate features.

Next Steps

Automated feature engineering has solved one problem only to create another: too many features. Although it is hard to say which features will matter before fitting a model, it is likely that not all of them will be relevant to the task we want to train our model on. Moreover, having too many features can degrade model performance, because less useful features crowd out those that are more important.

The problem of too many features is known as the curse of dimensionality. As the number of features (the dimensionality of the data) grows, it becomes harder and harder for a model to learn the mapping between features and target. In fact, the amount of data needed for a model to perform well scales exponentially with the number of features.

The curse of dimensionality is countered with feature reduction (also known as feature selection): the process of removing irrelevant features. This can take many forms: principal component analysis (PCA), SelectKBest, using feature importances from a model, or autoencoding with deep neural networks. However, feature reduction is a separate topic for another article. For now, we know that we can use featuretools to create many features from many tables with minimal effort!
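As one minimal sketch of feature reduction (not the article's method, just a common first pass), we can drop zero-variance columns and one of each pair of near-duplicate columns from a hypothetical dfs output; the feature names and values below are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix as dfs might return (names are illustrative)
features = pd.DataFrame({
    'MEAN(payments.payment_amount)': [150.0, 90.0, 250.0, 310.0],
    'MAX(loans.loan_amount)':        [5000, 8000, 1200, 9000],
    'SUM(loans.loan_amount)':        [10000, 16000, 2400, 18000],  # exactly 2x MAX here
    'MONTH(joined)':                 [6, 6, 6, 6],                 # zero variance
})

# Step 1: drop features with a single unique value (no information)
nunique = features.nunique()
features = features.drop(columns=nunique[nunique <= 1].index)

# Step 2: drop one of each pair of highly correlated features
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
features = features.drop(columns=to_drop)

print(list(features.columns))
```

The zero-variance MONTH feature and the redundant SUM feature are removed, leaving a smaller matrix for the model.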


Like many topics in machine learning, automated feature engineering with featuretools is a complex concept built on simple ideas. Using the notions of entitysets, entities, and relationships, featuretools can perform deep feature synthesis to create new features. Deep feature synthesis, in turn, stacks feature primitives: aggregations, which act across the one-to-many relationships between tables, and transformations, which are functions applied to one or more columns of a single table, to build new features from multiple tables.

