
2.7 Setting Up a Data Pipeline

The data pipeline automates data ingestion, transformation, ML prediction, and export. You can chain components in a logical sequence to automate data processing. A data pipeline runs in two modes: On Demand and Scheduled (described below). A pipeline can also include a custom processor that is built using the Processor interface and packaged as a JAR file (more on custom processors below). 
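As a rough illustration of what a custom processor might look like, here is a hypothetical sketch. The interface name, method signature, and record type below are assumptions for illustration; the actual Momentum Processor interface may differ.

```java
import java.util.List;
import java.util.stream.Collectors;

public class UpperCaseProcessor {
    // Stand-in for the Momentum Processor interface; the real interface
    // name and signature may differ -- this is illustrative only.
    interface Processor {
        List<String> process(List<String> records);
    }

    // A custom processing unit that upper-cases every incoming record.
    static final Processor UPPER = records -> records.stream()
            .map(String::toUpperCase)
            .collect(Collectors.toList());
}
```

A class along these lines would be compiled, packaged as a JAR file, and attached to the pipeline as a processor.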

A data pipeline is a sequence of one or more data processing units executed in order. For example, a pipeline may contain one or more ingesters, a transformer, custom processors, ML models, and an emitter. 

To create a data pipeline: 

  1. Create one or more ingesters. See instructions here. 
  2. Create a transformer, which may contain one or more SQL statements. Only one transformer per pipeline is allowed, so all relevant SQL statements must be included in a single transformer. See instructions on how to create a transformer containing multiple SQL statements. 
  3. If your data processing needs a custom processor, create one to include in the pipeline. See instructions on how to create a processor. 
  4. Create an emitter if the processed data needs to be stored outside of Momentum storage (for example, indexed in Impulse EDW, MongoDB, MySQL, Oracle, etc.). See instructions on how to create an emitter. 
  5. Create a data pipeline and add all required components to it. See below for more details. 
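Because a pipeline allows only one transformer, multiple transformation steps are written as a multi-step SQL script inside that single transformer. The sketch below illustrates the idea; the table and column names are made up, and the splitting logic is a simplification of what a transformer might do.

```java
public class TransformerSql {
    // One transformer holding a two-step SQL script: step 1 filters the
    // raw data, step 2 aggregates the filtered result.
    // Table and column names are hypothetical.
    static final String SQL =
            "CREATE TABLE clean AS SELECT * FROM raw WHERE amount > 0;\n" +
            "SELECT customer_id, SUM(amount) AS total FROM clean GROUP BY customer_id;";

    // Splits the script into individual statements for sequential execution.
    static String[] statements() {
        return SQL.split(";\\s*");
    }
}
```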

A few example pipelines: 

  • one or more ingesters –> one transformer –> one or more processors –> one emitter 
  • one or more ingesters –> emitter 
  • one transformer –> emitter 
  • one transformer –> one or more processors –> emitter 
  • a single ingester –> emitter 
  • one or more ingesters –> one transformer –> one or more ML models –> one emitter 
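Conceptually, each of the pipelines above is an ordered chain in which every unit consumes the records produced by the unit before it. A minimal sketch of that chaining idea (not Momentum's actual API; the record type and unit signature are assumptions):

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class PipelineSketch {
    // Runs the units in order, feeding each one the previous unit's output.
    @SafeVarargs
    static List<String> run(List<String> input, UnaryOperator<List<String>>... units) {
        List<String> data = input;
        for (UnaryOperator<List<String>> unit : units) {
            data = unit.apply(data);
        }
        return data;
    }
}
```

A transformer, a custom processor, or an ML model each plays the role of one such unit; the emitter (or HDFS, if no emitter is attached) receives the final output.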

If the emitter is omitted, the pipeline's processed data is stored in the distributed data lake based on HDFS, the main storage system Momentum uses for storing files. 

Creating a Data Pipeline 

  1. Expand the “Data Pipeline” menu (under the ETL section) from the main menu options –> click “Pipeline Home”. 
  2. Click “Create New Pipeline” from the top menu options. 
  3. Fill out the form fields: 
    • Name: a user-defined unique name to identify the pipeline. 
    • Core: the number of cluster cores used to execute the pipeline job in distributed, parallel mode. For a big dataset and complex pipeline execution, allocate as many cores as are available to speed up execution. 
    • Memory: RAM per core. The 4GB default works for most cases; tune it if required. 
    • Output Format: if no emitter is attached to this pipeline, the data is stored within Momentum’s distributed file system (HDFS). Specify the output file format. 
    • Run Mode: 
      • On Demand: the pipeline must be executed manually by clicking the “Run” button. 
      • Scheduled: specify a Linux-style cron expression to schedule automated execution of the pipeline. An online tool can help create cron expressions. 
    • Storage Mode: used only if no emitter is attached to this pipeline. 
    • Log Input and Output Count: if set to Yes, the pipeline generates counts of processed records for auditing and inventory purposes. This is an expensive operation and should be avoided if the counts are not needed. 
  4. Submit the form to save it. 
  5. Once the pipeline form is submitted, add processing units to it. Here are the steps: 
    • Add one or more ingesters: expand the ingester menu –> click the ingester you want to add –> a rectangular widget is added to the main canvas. 
    • Add a transformer: expand the transformer menu –> click the transformer you want to add –> a rectangular widget is added to the main canvas. 
    • Add one or more ML models: expand the ML models menu in the left panel –> click the models you want to add to the pipeline. 
    • Add a new processor (not already created): click the “Add Processor” button at the top of the pipeline canvas –> fill out the form to add it to the canvas. 
    • Add an existing processor: expand the processor menu –> click the processor you want to add –> a rectangular widget is added to the main canvas. 
    • Add a new emitter (not already created): click the “Add Emitter” button at the top of the pipeline canvas –> fill out the form to add it to the canvas. For details on the form fields, see the Emitter section of this wiki. 
    • Add an existing emitter: expand the emitter menu –> click the emitter you want to add –> a rectangular widget is added to the main canvas. 
  6. If needed, move the widgets around to organize them. Widgets may overlap if the canvas is small; drag overlapped widgets apart. 
  7. Once all widgets are laid out on the canvas, connect them: click the “out” tip of one widget and drag the arrow to the “in” tip of the next. See Figure 2.11 below for an example pipeline with connected units. 
  8. Save the pipeline by clicking the “Save” button. You may need to scroll down to see it. 
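For the Scheduled run mode, a Linux-style cron expression has five fields: minute, hour, day of month, month, and day of week. For example, "0 2 * * *" runs the pipeline every day at 02:00. The sketch below splits such an expression into its fields; it is illustrative only, and Momentum performs its own validation.

```java
public class CronFields {
    // Splits a five-field Linux cron expression
    // (minute hour day-of-month month day-of-week) into its parts.
    static String[] fields(String cron) {
        String[] parts = cron.trim().split("\\s+");
        if (parts.length != 5) {
            throw new IllegalArgumentException("expected 5 fields, got " + parts.length);
        }
        return parts;
    }
}
```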

Running a Data Pipeline 

To run the data pipeline: 

  1. From the pipeline home page, click the checkbox corresponding to the pipeline you want to run. 
  2. Click the “Run” button located in the top menu bar. 
  3. When the pipeline starts running, it shows the execution status of each unit included in the pipeline. When all units complete execution, the pipeline status shows “complete” and the result shows “success”. 

Figure 2.10: Screen showing pipeline home and menu options 

Figure 2.11: Example pipeline with connected units (three ingesters connected to one transformer, whose output feeds a semantic model that in turn feeds an ANN regression model; the final output is exported to an Impulse emitter). 

Important Notes: 

1. A pipeline can contain only one transformer. If you need multiple transformation steps, write multi-step SQL statements within that single transformer (see the Transformer section for details). 

2. Only models that are deployed to MLOps can be included in a pipeline. If multiple versions of the same model are deployed, the pipeline uses the latest version for prediction. 
