Documentation
< All Topics
Print

2.4 Transformer

Momentum provides SQL-based interface for data transformation. Data cleaning, null removal, datatype conversion, column renaming, mathematical transformation, blending, merging, joining with multiple data sources are some of the transformation tasks that can be performed over data created within Momentum. Anything that is supported by ANSI-standard SQL can be performed using Momentum’s transformation engine. A multi-level SQL based transformation is a powerful way of performing complex transformation tasks involving single or multiple data sources. 

In this section we will explore how to perform a single SQL-based as well as multi-step transformations. 

Single Step Transformer 

In the following example, binary formatted date is transformed into human readable date using standard SQL. To do this: 

  1. Form the left hand side navigation panel, expand Transformer” and then click “Transformer Home”. 
  1. Click “New Transformer” located at the top menu option. The transformer window opens. 
  1. The transformer window is divided into two sections: 
  1. The left side section containing a form is used to write transformation SQL, and 
  1. the right-side section shows all data created within Momentum by various components. Clicking on any of the data sources will display 100 rows of data by default.  When the data is displayed, it shows a query block at the top where you can write any SQL statements on top of this datasource to test out any potential transformation query. This view is provided to look at the data while writing the transformation query on the left side section.  See Figure 2.5 below for an example. 

Figure 2.5: View showing 100 records of data to help in writing transformation SQL on the left section 

  1. Notice that the DATE column in the above screenshot is in binary format. We will write SQL to transform that column to a human readable date format. Fill out the form on the left section to configure the transformer as described below (and shown in Figure 2.6 below) 

Figure 2.5: Transformer example showing SQL statement 

  1. Name: give a unique and meaningful name to this transformer 
  1. Transformer Query: Write SQL statement to transform your data. Notice that the table name contains the first two letters of the components that generated data, followed by a dot, followed by the username and another dot, and the name of the component. For example, our datasource table name is “io.sansari.machine_data_ingester” where “io” stands for ingester output”. 
  1. In our example, the SQL statements to convert binary DATE into human readable date is:  

SELECT *, from_unixtime(to_unix_timestamp(`DATE`, ‘yyyy-MM-dd’),’yyyy-MM-dd HH:mm:ss’)  as TRANSFORMED_DATE from  io.sansari.machine_data_ingester 

  1. Select output format, parquet being the default. 
  1. Max core: specify how many CPU core of the cluster your transformer should run on concurrently to perform parallel operations. For example, 4 cores of CPU is good enough for small to mid-size data. For larger dataset, the more that core, the faster the processing will be. 
  1. RAM: specify how much RAM each CPU core should occupy. 4GB per core is a good default for most cases and should be larger for very large dataset. 
  1. Submit to save the transformer configuration. If everything goes well, the page will transition to the Transformer Home page. 
  1. From the Transformer Home page, check the transformer just created, and click the “Run” button located at the top menu bar. 
  1. It might take a few seconds to provision and start the transformation process, and depending on the data size, it make take some time to complete the transformation. 
  1. Click on the “Transformer” menu option from the top menu to refresh the transformer status, shown below in Figure 2.6. 

Figure 2.6: Screen showing the transformation run status. 

  1. Click on the Logs link to monitor the logs and watch for any errors. 
  1. After the transformer is completed successfully, checkbox, and click on “View Data” to take a quick look of the data created by the transformation (see Figure 2.7 below and notice the TRANFORMED_DATE column has human readable dates). See more on data exploration, “Exploring Data”, below.  

Figure 2.7: A sample data generated by the transformer 

Multi Step Transformer 

To demonstrate multi-step transformation process, we will work on an example that creates training and test sets needed for machine learning model training. The steps are as follows. 

Creating Training and Test Sets Using Transformer 

There are many ways to create training and test sets from the dataset we ingested. This section demonstrates how to use SQL compliant transformer to create the two sets for machine learning training and test. 

Assuming we want to create 80% training and 20% test sets from the original 10,000 records. We also want to randomize the data so that the training and test sets are not biased. We will create two transformers: 

  • The training set transformer will have a randomized 8000 records 
  • The test set transformer will have 2000 records that are not in the training set. 

Creating Training Set 

Here are the steps to create the training set: 

  1. Expand “Transformer” and click “Transformer Home” to launch the transformer home page. 
  1. Click “New Transformer” from the top menu to open the form to write transformation SQL. Provide a meaningful name to this transformer, e.g. machine_data_training_set 
  1. In the Transformer Query block, write SQL statement as follows. Notice that the SQL statement contains the ingester name or a previously transformed dataset name. 
  1. Save the Transformer. 
  1. Select the transformer and click “Run” located at the top menu bar. 
  1. After the transformer is successfully done, use the data exploration and interactive query tools to check the data. 

SELECT * FROM tr.sansari.machine_data_transformer order by RAND() limit 8000 

Listing 2: Transformer SQL to generate training set 

Figure 2.8: Transformer to create training set 

Notice that the transformer page shows a list of data sources on the right-side panel. You can click on any data item to see the data in tabular form. This view is provided to give an aid to the user while writing the SQL statements for the transformer. 

Creating Test Set 

This section demonstrates a multi-step transformation process, a powerful way of transforming complex data into meaningful forms. Here are the steps to create the test set: 

  1. Click New Transformer at the top menu bar. Give a meaningful name to the transformer, such as machine_data_test_set 
  1. Write a simple “Select” statement to load the data from the original data source, e.g. machine_data_transformer. Logically, the output of this statement will be stored in a temp table called “S1”. 

SELECT * FROM tr.sansari.machine_data_transformer 

Listing 3: SQL statement to load all the data from a transformer 

  1. Click “Add SQL Query” to open another query block. We will write a simple SQL statement to load the data from the training set transformer that we created previously. The output of this SQL is logically stored in another temp table called “S2”. 

SELECT * FROM tr.sansari.machine_data_training_set 

Listing 4: SQL statement to load transformer data that created training set 

  1. The final SQL statement will simply perform a MINUS operation between the two SQL statements above.  

select * from s1 minus select * from s2 

Listing 5: SQL statement to perform MINUS operation 

The following Figure 2.9 shows the three SQL statements as you will see in Momentum Transformer page. 

Figure 2.9: Multi-step transformer to create test set 

  1. Save and run this transformer. Check the result from the Data Upload and Exploration utilities. 

Notes: 

  1. If the column name is a SQL reserved keyword, such as DATE, backquote that column, e.g. `DATE` within the SQL statements. 
  1. The transformer stores data of the last run only. In other words, if you run the transformer multiple times, only the last run output will be stored in transformer.  
  1. Add the transformer in a data pipeline if you want incremental data or full data as a result of the transformation. See “Setting Data Pipeline” section. 
Table of Contents

Lester Firstenberger

Lester is recognized nationally as a regulatory attorney and expert in consumer finance, securitization, mortgage, and banking law.

Lester is recognized nationally as a regulatory attorney and expert in consumer finance, securitization, mortgage, and banking law. In a variety of capacities, over the past 30 years as an attorney, Mr. Firstenberger has represented the interests of numerous financial institutions in transactions valued in excess of one trillion dollars. He was appointed to and served a three-year term as a member of the Consumer Advisory Council of the Board of Governors of the Federal Reserve System. He has extensive governmental relations experience in the US and Canada at both the federal and state and provincial levels.

Shamshad (Sam) Ansari

Shamshad (Sam) Ansari is the founder, president and CEO of Accure. He drives technology innovations and works with a great team of engineers, data scientists, and business drivers at Accure.

Shamshad (Sam) Ansari is the founder, president, and CEO of Accure. He drives technology innovations and works with a great team of engineers, data scientists, and business drivers at Accure. He takes great pride in working with customers and putting together teams for solving their business problems. Sam is the product architect of Momentum, an AI and automation platform for data engineers, scientists, and business analysts.

Sam brings more than 20 years of technology development and management expertise. He developed, deployed and managed several large scale AI projects. He is a domain expert in healthcare systems, protocols, standards and compliances. Sam is a serial entrepreneur and worked with 4 startups. Prior to starting Accure, he worked with Apixio as the principal architect and director of engineering. He had another successful startup Orbit Solutions where he developed healthcare systems that went through an acquisition. He worked with IBM and the US Government at various capacities.

Sam is a distinguished data scientist, inventor and author. He has several technology publications in his name. He has co-authored 4 US Patents in healthcare AI. He is a well respected authority in computer vision and AI and has authored a book, “Building Computer Vision Applications Using Artificial Neural Networks” that is also translated into other languages including Chinese. Sam contributes to academia as well. He mentors graduate students and sponsors Capstone projects. He is also a member of the Advisory Board, Data Analytics Engineering Department at George Mason University.

Sam has a Master’s degree from Indian Institute of Information Technology & Management, Kerala (IIITM-K) and Bachelor’s degree in engineering from Bihar Institute of Technology Sindri (BIT Sindri).

Moghisuddin Raza

Mogishuddin Raza is a technology leader. As the COO of Accure he is having global product delivery responsibility along with overall strategic and operational responsibility.

Mogishuddin Raza is a technology leader. As the COO of Accure he is having global product delivery responsibility along with overall strategic and operational responsibility.

Having extensive background in technology product development and integration, in particular to Enterprise storage, virtualization, cloud computing, high availability & business continuity technology/solutions, and Big Data & related technologies. Has been passionate and evangelizing the usage of Big data technologies using Momentum to implement advanced analytics (descriptive and predictive) to directly impact the business via an intuitive set of use cases.

Having approximately two decades of experience in high-tech industries which includes big MNCs corporate like EMC Corp and Hewlett-Packard to mid-size organization such as Netkraft, Trados Inc driving transformation in strategizing, planning and architecting product engineering, execution and delivery of high quality products releases within budget & time.

Skilled in all aspects of big MNCs as well as company startups and growth including: strategizing, business planning, market research, finance, product development and profit margins & revenue management. Excellent leadership and people motivation skills. Expert in managing cross-functional, cross cultural global team and building strategic partnership in the global virtual matrix team environment.

Overall, a senior software business professional, skilled in the management of people, resources and partnerships which enables building an eco system for a winning organization.