Some time ago we faced the question of choosing an ETL tool for working with Big Data. The Informatica BDM solution we had been using did not suit us because of its limited functionality: in practice, its use boiled down to a wrapper around the spark-submit command. There were not many tools on the market capable, in principle, of handling the volumes of data we deal with every day. In the end we chose Ab Initio. During pilot demonstrations the product showed very high data processing speed. There is almost no information about Ab Initio in Russian, so we decided to share our experience on Habr.

Ab Initio has many classic and unusual transformations whose code can be extended with its own PDL language. For a small business such a powerful tool is likely to be overkill, and most of its capabilities may turn out to be expensive and unused. But if your scale is close to Sberbank's, Ab Initio may be of interest to you.

It helps the business accumulate knowledge and develop its ecosystem, and it helps the developer improve their ETL skills, brush up on shell scripting, learn the PDL language, get a visual picture of the loading processes, and develop faster thanks to the abundance of functional components.

In this post I will talk about the capabilities of Ab Initio and give comparative figures for its work with Hive and GreenPlum:

  • Description of the MDW framework and adapting it for GreenPlum
  • Comparison of Ab Initio performance with Hive and with GreenPlum
  • Ab Initio working with GreenPlum in Near Real Time

The functionality of the product is very broad and takes a lot of time to learn. However, with the right skills and the right performance settings, the data processing results are very impressive. For a developer, using Ab Initio can be an interesting experience: it is a fresh take on ETL development, a hybrid between a visual environment and writing loads in a script-like language.

Businesses are developing their ecosystems, and this tool turns out to be more useful to them than ever. With Ab Initio you can accumulate knowledge about your current business and use this knowledge to expand old businesses and open new ones. Among the alternatives to Ab Initio one can name the visual development environment Informatica BDM and, among non-visual environments, Apache Spark.

Description of Ab Initio


Ab Initio, like other ETL tools, is a collection of products.


Ab Initio GDE (Graphical Development Environment) is the environment in which a developer configures data transformations and connects them with data streams drawn as arrows. Such a set of transformations is called a graph.


The input and output connections of the functional components are ports, and they carry the fields calculated inside the transformations. Several graphs connected by streams (arrows) in the order of their execution are called a plan.

There are several hundred functional components, which is a lot. Many of them are highly specialized. The capabilities of the classic transformations in Ab Initio are broader than in other ETL tools. For example, Join has several outputs: besides the result of joining the datasets, you can get output records of the input datasets whose keys could not be joined. You can also get rejects, errors and the transformation log, which can be read in the same graph as a text file and processed by other transformations.


Or, for example, you can materialize a data sink as a table and read data from it within the same graph.

There are also original transformations. For example, the Scan transformation offers functionality similar to analytical (window) functions. There are transformations with self-explanatory names: Create Data, Read Excel, Normalize, Sort within Groups, Run Program, Run SQL, Join with DB, etc.

Graphs can use run-time parameters, including parameters passed from or to the operating system. Files with a ready-made set of parameter values passed to a graph are called parameter sets (psets).
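
To give a feel for what the Scan transformation mentioned above does, here is a rough SQL analogue using a window function: a running aggregate computed record by record within a key group. This is only an illustrative sketch with hypothetical table and column names (payments, client_id, op_dt, amount), not Ab Initio code.

-- running total of payments per client, ordered by operation date:
-- roughly what a Scan keyed by client_id would produce record by record
select client_id,
       op_dt,
       amount,
       sum(amount) over (partition by client_id
                         order by op_dt
                         rows between unbounded preceding and current row) as running_total
  from payments;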

As you would expect, Ab Initio GDE has its own repository called EME (Enterprise Meta Environment). Developers can work with local versions of the code and check their work into the central repository.

At runtime, or after a graph has finished, you can click on any stream connecting transformations and look at the data that passed between them.


You can also click on any stream and see the tracking details: how many parallel instances the transformation ran in, and how many rows and bytes were loaded in each of them.


Graph execution can be split into phases, specifying that some transformations must run first (in phase zero), the next ones in phase one, the next in phase two, and so on.

For each transformation you can choose the so-called layout (where it will be executed): without parallelism or in parallel threads, the number of which can be configured. The temporary files that Ab Initio creates while the transformations are running can be placed either in the server file system or in HDFS.

In each transformation, starting from the default template, you can write your own script in the PDL language, which looks a bit like shell.

Using PDL, you can extend the functionality of transformations and, in particular, dynamically (at runtime) generate arbitrary pieces of code depending on the runtime parameters.

Integration with the OS through the shell is also well developed in Ab Initio (Sberbank specifically uses ksh on Linux). You can exchange variables with the shell and use them as graph parameters, invoke Ab Initio graphs from the shell, and administer Ab Initio from there.

Many other products ship alongside Ab Initio GDE. There is the Co>Operating System, which claims the title of an operating system. There is a Control>Center where loading flows can be scheduled and monitored. There are also products for development at a more primitive level than Ab Initio GDE allows.

Description of the MDW framework and adapting it for GreenPlum


Along with its other products, the vendor supplies MDW (Metadata Driven Warehouse), a graph configurator designed to help with the typical tasks of populating data warehouses or data vaults.

It contains custom (project-specific) metadata parsers and out-of-the-box code generators.

As input, MDW receives a data model, a configuration file describing the database connection (Oracle, Teradata or Hive) and some other settings. The project-specific part, for example, deploys the model in the database. The out-of-the-box part of the product generates the graphs, and the configuration files (psets) for them, that load data into the model's tables. Graphs (and psets) are created for several modes of initializing and incrementally updating the entities.

Both for Hive and for RDBMSs, separate graphs are generated for the initial load and for incremental data updates.

In the case of Hive, the incoming delta is joined, using an Ab Initio Join, with the data that was in the table before the update. The data loaders in MDW (both for Hive and for RDBMSs) not only insert new data from the delta, but also close the validity periods of the data whose primary keys arrived in the delta. In addition, the unchanged part of the data has to be rewritten as well. This cannot be avoided, because Hive has no delete or update operations.
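
To make the Hive case concrete, here is a minimal sketch of such a full-rewrite merge in HiveQL. This is not the code that MDW generates: the table and column names (target, delta, pk, attr, valid_from_ts, valid_to_ts) are hypothetical, and in practice the result would typically be written into a new table or partition and then swapped in.

-- rebuild the target completely: close the validity periods of rows whose keys
-- arrived in the delta, keep everything else as is, and append the fresh versions
insert overwrite table target_rebuilt
select * from (
    select t.pk,
           t.attr,
           t.valid_from_ts,
           case when d.pk is not null and t.valid_to_ts = cast('9999-12-31' as timestamp)
                then current_timestamp
                else t.valid_to_ts
           end as valid_to_ts
      from target t
      left join delta d on d.pk = t.pk
    union all
    select d.pk,
           d.attr,
           current_timestamp as valid_from_ts,
           cast('9999-12-31' as timestamp) as valid_to_ts
      from delta d
) merged;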


In the case of an RDBMS, the incremental update graphs look more optimal, because RDBMSs have real update capabilities.


The incoming delta is loaded into a staging table in the database. After that, the delta is joined with the data that was in the table before the update, and this is done in SQL using a generated query. Then, using delete + insert SQL commands, the new data from the delta is inserted into the target table and the validity periods of the data whose primary keys arrived in the delta are closed. There is no need to rewrite unchanged data.
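
For comparison, here is a minimal sketch of the delete + insert pattern on the RDBMS side, in GreenPlum syntax. Again, this is not the SQL that MDW actually generates; the target table t, the staging table stg and the columns are hypothetical.

-- 1. prepare the recalculated versions: old versions with closed validity periods
--    plus the new current versions coming from the delta
create temp table work_delta as
select t.pk, t.attr, t.valid_from_ts, now() as valid_to_ts
  from t
  join stg on stg.pk = t.pk
 where t.valid_to_ts = '9999-12-31'::timestamp
union all
select stg.pk, stg.attr, now() as valid_from_ts, '9999-12-31'::timestamp as valid_to_ts
  from stg;

-- 2. delete the affected current versions from the target
delete from t
 where t.valid_to_ts = '9999-12-31'::timestamp
   and exists (select 1 from stg where stg.pk = t.pk);

-- 3. insert the recalculated versions; rows that did not change are not touched at all
insert into t
select pk, attr, valid_from_ts, valid_to_ts from work_delta;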

So we came to the conclusion that in the case of Hive, MDW rewrites the whole table, because Hive has no update operation, and nothing better than a complete rewrite of the data has been invented for updates there. In the case of an RDBMS, on the contrary, the creators of the product considered it reasonable to entrust joining and updating the tables to SQL.

For the project at Sberbank we created a new, reusable implementation of the database loader for GreenPlum. We based it on the version that MDW generates for Teradata. It was Teradata, not Oracle, that was the best and closest fit for this, because it is also an MPP system, and the working methods and syntax of Teradata and GreenPlum turned out to be close.

Here are some examples of the differences between RDBMSs that matter for MDW. In GreenPlum, unlike Teradata, when creating tables you have to write the clause

distributed by 
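
For example, a table definition in GreenPlum might look like this (a hypothetical table, shown only to illustrate the distribution clause):

create table subj_org_finval (
    sk            bigint,
    attr          varchar(200),
    valid_from_ts timestamp,
    valid_to_ts   timestamp
) distributed by (sk);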

In Teradata one writes

delete <table> all 

while in GreenPlum one writes

delete from <table> 

Oracle, for the sake of optimization, writes

delete from t where rowid in (<join of t with the delta>)

whereas Teradata and GreenPlum write

delete from t where exists (select * from delta where delta.pk=t.pk) 

We also note that for Ab Initio to work with GreenPlum we had to install the GreenPlum client on all nodes of the Ab Initio cluster, because we connected to GreenPlum simultaneously from all nodes of our cluster. And to make reading from GreenPlum parallel, so that each parallel Ab Initio stream reads its own portion of data from GreenPlum, we had to put into the where clause of the SQL queries a construct that Ab Initio understands

where ABLOCAL() 

and determine the value of this construct by specifying the ablocal_expr parameter of the transformation that reads from the database:
ablocal_expr=«string_concat("mod(t.", string_filter_out("{$TABLE_KEY}","{}"), ",", (decimal(3))(number_of_partitions()),")=", (decimal(3))(this_partition()))» 

that compiles into something like

mod(sk,10)=3 

In other words, we have to give GreenPlum an explicit filter for each partition. For other databases (Teradata, Oracle), Ab Initio performs this parallelization automatically.
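
So each of the, say, 10 parallel Ab Initio streams ends up sending GreenPlum its own query of roughly the following form (a hypothetical illustration for a table t with a surrogate key sk; this is what stream number 3 would read):

select * from t where mod(sk, 10) = 3;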

Comparison of Ab Initio performance with Hive and with GreenPlum


An experiment was conducted at Sberbank to compare the performance of the MDW-generated graphs for Hive and for GreenPlum. In the experiment, Hive had 5 nodes on the same cluster as Ab Initio, and GreenPlum had 4 nodes on a separate cluster. In other words, Hive had a slight hardware advantage over GreenPlum.

We took two pairs of graphs performing the same task of updating data, one in Hive and one in GreenPlum. The graphs generated by the MDW configurator were launched:

  • initial load + incremental load of randomly generated data into a Hive table
  • initial load + incremental load of the same randomly generated data into an identical GreenPlum table

In both cases (Hive and GreenPlum) the loads ran in 10 parallel threads on the same Ab Initio cluster. Ab Initio stored the intermediate data for the calculations in HDFS (in Ab Initio terms, an MFS layout on HDFS was used). One row of randomly generated data occupied 200 bytes in both cases.

The results were as follows:

Hive:

Initial load into Hive
Inserted rows | 6,000,000 | 60,000,000 | 600,000,000
Initial load duration, seconds | 41 | 203 | 1,601

Incremental load into Hive
Rows in the target table at the start of the experiment | 6,000,000 | 60,000,000 | 600,000,000
Delta rows applied to the target table during the experiment | 6,000,000 | 6,000,000 | 6,000,000
Incremental load duration, seconds | 88 | 299 | 2,541

GreenPlum:

Initial load into GreenPlum
Inserted rows | 6,000,000 | 60,000,000 | 600,000,000
Initial load duration, seconds | 72 | 360 | 3,631

Incremental load into GreenPlum
Rows in the target table at the start of the experiment | 6,000,000 | 60,000,000 | 600,000,000
Delta rows applied to the target table during the experiment | 6,000,000 | 6,000,000 | 6,000,000
Incremental load duration, seconds | 159 | 199 | 321

We can see that the duration of the initial load, both in Hive and in GreenPlum, depends linearly on the amount of data and, thanks to the better hardware, is somewhat shorter for Hive than for GreenPlum.

Incremental loading in Hive also depends linearly on the volume of data already present in the target table and becomes rather slow as that volume grows. The reason is the need to rewrite the target table completely, which means that applying small changes to huge tables is not a good use case for Hive.

Incremental loading in GreenPlum depends only weakly on the volume of data already present in the target table and runs quite fast. This is achieved thanks to SQL joins and the GreenPlum architecture, which allows the delete operation.

So GreenPlum applies the delta with the delete + insert method, whereas Hive has no delete or update operations, so during an incremental update the whole data array has to be rewritten. The most telling comparison is the last column of the incremental load rows (600,000,000 rows already in the target table), since it corresponds to the most common way resource-intensive loads are run. We can see that in this test GreenPlum beat Hive by a factor of 8.

Ab Initio working with GreenPlum in Near Real Time


In this experiment we will check how well Ab Initio can update a GreenPlum table with randomly generated portions of data in near-real-time mode. We will work with the GreenPlum table dev42_1_db_usl.TESTING_SUBJ_org_finval.

We will use three Ab Initio graphs to work with it:

1) The graph Create_test_data.mp creates data files with 6,000,000 rows in HDFS in 10 parallel streams. The data is random, and its structure is organized for insertion into our table.


2) The graph mdw_load.day_one.current.dev42_1_db_usl_testing_subj_org_finval.pset is the MDW-generated graph that performs the initial insertion of data into our table in 10 parallel streams (using the test data generated by graph (1)).


3) The graph mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset is the MDW-generated graph that incrementally updates our table in 10 parallel streams, using a portion of freshly received data (the delta) generated by graph (1).


We will run the following scenario in NRT mode:

  • generate 6,000,000 test rows
  • perform the initial load, inserting 6,000,000 test rows into the empty table
  • repeat the incremental load 5 times:

    • generate 6,000,000 test rows
    • incrementally insert 6,000,000 test rows into the table (the old data is stamped with the valid_to_ts expiration timestamp, and more recent data with the same primary key is inserted)

This scenario emulates the real operating mode of some business system: a fairly large portion of new data appears in real time and is immediately poured into GreenPlum.

Now let's look at the log of this scenario:

Start Create_test_data.input.pset at 2020-06-04 11:49:11
Finish Create_test_data.input.pset at 2020-06-04 11:49:37
Start mdw_load.day_one.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:49:37
Finish mdw_load.day_one.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:50:42
Start Create_test_data.input.pset at 2020-06-04 11:50:42
Finish Create_test_data.input.pset at 2020-06-04 11:51:06
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:51:06
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:53:41
Start Create_test_data.input.pset at 2020-06-04 11:53:41
Finish Create_test_data.input.pset at 2020-06-04 11:54:04
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:54:04
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:56:51
Start Create_test_data.input.pset at 2020-06-04 11:56:51
Finish Create_test_data.input.pset at 2020-06-04 11:57:14
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:57:14
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:59:55
Start Create_test_data.input.pset at 2020-06-04 11:59:55
Finish Create_test_data.input.pset at 2020-06-04 12:00:23
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 12:00:23
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 12:03:23
Start Create_test_data.input.pset at 2020-06-04 12:03:23
Finish Create_test_data.input.pset at 2020-06-04 12:03:49
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 12:03:49
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 12:06:46


This gives the following picture:
Graph | Start time | Finish time | Duration
Create_test_data.input.pset | 2020-06-04 11:49:11 | 2020-06-04 11:49:37 | 00:00:26
mdw_load.day_one.current.dev42_1_db_usl_testing_subj_org_finval.pset | 2020-06-04 11:49:37 | 2020-06-04 11:50:42 | 00:01:05
Create_test_data.input.pset | 2020-06-04 11:50:42 | 2020-06-04 11:51:06 | 00:00:24
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset | 2020-06-04 11:51:06 | 2020-06-04 11:53:41 | 00:02:35
Create_test_data.input.pset | 2020-06-04 11:53:41 | 2020-06-04 11:54:04 | 00:00:23
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset | 2020-06-04 11:54:04 | 2020-06-04 11:56:51 | 00:02:47
Create_test_data.input.pset | 2020-06-04 11:56:51 | 2020-06-04 11:57:14 | 00:00:23
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset | 2020-06-04 11:57:14 | 2020-06-04 11:59:55 | 00:02:41
Create_test_data.input.pset | 2020-06-04 11:59:55 | 2020-06-04 12:00:23 | 00:00:28
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset | 2020-06-04 12:00:23 | 2020-06-04 12:03:23 | 00:03:00
Create_test_data.input.pset | 2020-06-04 12:03:23 | 2020-06-04 12:03:49 | 00:00:26
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset | 2020-06-04 12:03:49 | 2020-06-04 12:06:46 | 00:02:57

We can see that 6,000,000 rows of increment are processed in 3 minutes, which is quite fast.
The data in the target table ended up distributed as follows:
select valid_from_ts, valid_to_ts, count(1), min(sk), max(sk) from dev42_1_db_usl.TESTING_SUBJ_org_finval group by valid_from_ts, valid_to_ts order by 1,2; 

The output shows that the inserted data corresponds to the moments when the graphs were launched.
This means that incremental loading of data into GreenPlum can be run from Ab Initio at a very high frequency, and a high speed of inserting this data into GreenPlum can be observed. Of course, it will not be possible to launch it once a second, since Ab Initio, like any ETL tool, needs time to "spin up" when it starts.

Conclusion


Ab Initio is currently used at Sberbank to build the Unified Semantic Data Layer (ESS). This project involves building a single version of the state of various banking business entities. Information comes from a variety of sources, replicas of which are prepared on Hadoop. Based on business needs, a data model is prepared and the data transformations are described. Ab Initio loads the information into the ESS, and the loaded data is not only of interest to the business in itself but also serves as a source for building data marts. At the same time, the product's functionality allows various systems (Hive, Greenplum, Teradata, Oracle) to be used as the receiver, which makes it possible to prepare data for the business in all the formats it requires without much effort.

The capabilities of Ab Initio are broad; for example, the bundled MDW framework makes it possible to build technical and business data historization out of the box. For developers, Ab Initio makes it possible not to reinvent the wheel, but to use its many existing functional components, which are essentially libraries needed when working with data.

The author is an expert in Sberbank's professional community SberProfi DWH/BigData. The SberProfi DWH/BigData professional community is responsible for developing competencies in areas such as the Hadoop ecosystem, Teradata, Oracle DB, GreenPlum, as well as the BI tools Qlik, SAP BO, Tableau and others.
