
2.4.2 Uploading File Using Impulse UI

Impulse provides a convenient way to create a table and upload data to it. Data uploaded to Impulse is partitioned and indexed for efficient querying. This section describes how to upload data into a table using Impulse’s file upload mechanism.

Step 1: Upload Data

  1. From the main navigation menu, click “Load Data”.
  2. Drag and drop the files you want to upload to a table. You can also browse for files and upload them. See Figure 2.4.2a below.
  3. Fill out the form (See Figure 2.4.2b below):
    1. Warehouse: from the drop down, select the data warehouse in which you wish to create the table.
    2. Datasource: Give your datasource a meaningful name. A datasource is analogous to a table in the RDBMS paradigm. If a table with this name already exists in the selected warehouse, the data is uploaded to the existing table; otherwise, a new table is created.
    3. Input Source: Since we are uploading files, leave the default selection as “Browse & Upload”. For other types of input sources, see the appropriate sections of this document.
    4. File Format: Select the appropriate file format of the data file you are uploading. The supported file formats are:
      1. Parquet
      2. CSV or comma separated values
      3. TSV or tab separated values
      4. PSV or pipe separated values
      5. JSON (line delimited), meaning each line in the file represents a single row
    5. Input Header: This field is optional. If your input file is delimited (CSV, PSV, TSV) and does not contain a header as the first line of each uploaded file, provide a comma-separated list of column names. Leave this field empty if your input file already contains the header; otherwise, the ingestion engine will treat the first line of the file as data rather than as a header.
    6. Click the “Next” button to configure the indexing and partitioning of data for efficient query execution.
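
The file format and Input Header options above can be illustrated with a minimal Python sketch that prepares the same rows as line-delimited JSON and as a headerless CSV. The sample rows, column names, and helper functions are hypothetical and are not part of Impulse; the sketch only shows what "one row per line" and "comma-separated list of column names" look like in practice.

```python
import csv
import io
import json

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"event_time": "2023-01-15 10:30:00", "user_id": 101, "amount": 12.5},
    {"event_time": "2023-01-15 11:00:00", "user_id": 102, "amount": 7.25},
]


def to_json_lines(records):
    """Serialize records as line-delimited JSON: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


def to_headerless_csv(records, columns):
    """Write data rows only and return (csv_text, header_string).

    header_string is the comma-separated column list that would be
    typed into the Input Header field for a file without a header line.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        writer.writerow([record[column] for column in columns])
    return buffer.getvalue(), ",".join(columns)


columns = ["event_time", "user_id", "amount"]
jsonl_text = to_json_lines(rows)
csv_text, input_header = to_headerless_csv(rows, columns)
```

If the CSV were written with a header line instead, the Input Header field would be left empty.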

Step 2: Column Mapping and Partition Parameters

After you click the “Next” button in step 1, the next page shows the parameters for step 2 (see Figure 2.4.2c for an example). These parameters control how the data is indexed and how partitions are created. Each field in this step is described below:

**For the best results, use a date or time based column as the primary partition column. If none of the columns can be parsed as a date/time, do not use any partition.**

  1. Datasource: the table name (as set in step 1 above)
  2. Secondary Partition Strategy: This defines the column or columns that will be used to create the secondary partition. Impulse supports two types of secondary partition strategies:
    1. Dynamic: This is the best partition strategy and partitions the data most efficiently based on its contents. In most cases, you will leave this as the default secondary partition strategy.
    2. Single Column: If your queries will use only one column in the GROUP BY or WHERE clause, this single-column strategy is likely to work best. However, single-column partitioning is highly discouraged.
  3. Primary Partition Granularity: If you have a date/time based primary column, this parameter specifies how your data is split into partitions. For example, if you select “day” as the granularity level, the entire dataset is grouped by day and split into partitions.
  4. Missing Datetime Placeholder: If you select a date/time based column as the primary partition column and any row contains an invalid, missing, or null value in that column, the system fills in the missing value with this placeholder datetime.
  5. Max Parallel Upload Tasks: This parameter defines how many threads the system creates to upload the data in parallel. For a single-node deployment, set this to at most 60% of the number of CPU cores available on your server. For example, if you have a 32-core CPU, set the max parallel tasks to 20 or less (60% of 32 is roughly 19). For a distributed cluster, this value should be about 60% of the total number of cores across all worker nodes.
  6. Upload Mode: Specify whether you want to append rows to an existing table or overwrite existing partitions.
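
The 60% guideline for Max Parallel Upload Tasks can be worked out with a small Python sketch. The function name and the sample cluster sizes are hypothetical; the only input taken from this section is the 60% rule. Integer arithmetic is used so the result always rounds down.

```python
import os


def max_parallel_upload_tasks(total_cores, percent=60):
    """Return the largest whole number of tasks within percent of total_cores.

    Always returns at least 1 so that a small machine can still upload.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return max(1, total_cores * percent // 100)


# Single node with a 32-core CPU: 32 * 60 // 100 = 19
single_node_tasks = max_parallel_upload_tasks(32)

# Hypothetical cluster with three worker nodes (16, 16, and 32 cores):
# the sum is 64 cores, and 64 * 60 // 100 = 38
cluster_tasks = max_parallel_upload_tasks(sum([16, 16, 32]))

# Value for the machine running this script (os.cpu_count() may return None)
local_tasks = max_parallel_upload_tasks(os.cpu_count() or 1)
```

Rounding down gives 19 for a 32-core server, which stays within the "20 or less" guidance above.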

Field Mapping: The system tries to guess the datatype of each column. If a column is interpreted incorrectly, edit its datatype. Only the “STRING”, “LONG”, and “DOUBLE” datatypes are supported. Dates are represented as the STRING datatype.
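
To check a guessed datatype against your own data before loading, a rough Python sketch like the one below can help. It is not Impulse's actual inference logic, which this section does not describe; it is an assumed, simplified rule that classifies a column as LONG if every sample parses as an integer, DOUBLE if every sample parses as a floating-point number, and STRING otherwise.

```python
def infer_type(values):
    """Classify raw string samples as 'LONG', 'DOUBLE', or 'STRING'.

    Empty and None samples are ignored; a column with no usable
    samples falls back to 'STRING'. This is an illustrative rule only.
    """
    samples = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not samples:
        return "STRING"

    def all_parse(cast):
        # True only if every sample converts without error.
        try:
            for sample in samples:
                cast(sample)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "LONG"
    if all_parse(float):
        return "DOUBLE"
    return "STRING"


# Hypothetical sample columns
user_id_type = infer_type(["101", "102", "103"])
amount_type = infer_type(["12.5", "7", ""])
event_time_type = infer_type(["2023-01-15", "2023-01-16"])
```

Date values fall through to STRING here, which matches the note above that dates are represented as the STRING datatype.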

From the field mapping section, select the primary partition column, preferably a datetime column.
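
To make the effect of the primary partition column, its granularity, and the missing datetime placeholder concrete, here is a conceptual Python sketch. It does not reproduce how Impulse stores partitions internally. The placeholder value, the column name, and the "hour" and "month" granularities are assumptions made for illustration; only "day" is named in this section.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical placeholder for rows with missing or invalid datetimes.
PLACEHOLDER = datetime(1970, 1, 1)

# Assumed granularity names mapped to strftime patterns for the partition key.
KEY_FORMATS = {"hour": "%Y-%m-%d %H", "day": "%Y-%m-%d", "month": "%Y-%m"}


def group_rows(rows, column, fmt, granularity="day", placeholder=PLACEHOLDER):
    """Group rows into partitions keyed by the truncated datetime value."""
    key_format = KEY_FORMATS[granularity]
    partitions = defaultdict(list)
    for row in rows:
        try:
            timestamp = datetime.strptime(row.get(column), fmt)
        except (TypeError, ValueError):
            # None raises TypeError; empty or malformed text raises ValueError.
            timestamp = placeholder
        partitions[timestamp.strftime(key_format)].append(row)
    return dict(partitions)


sample_rows = [
    {"event_time": "2023-01-15 10:30:00", "amount": 12.5},
    {"event_time": "2023-01-15 23:59:59", "amount": 3.0},
    {"event_time": "2023-01-16 00:00:01", "amount": 7.25},
    {"event_time": None, "amount": 1.0},
]
by_day = group_rows(sample_rows, "event_time", "%Y-%m-%d %H:%M:%S")
```

With “day” granularity, the two rows from 15 January share one partition, and the row with the missing value lands in the partition for the placeholder date.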

You must specify the datetime format of the primary partition column. ISO date format and Joda-Time datetime format (https://www.joda.org/joda-time/key_format.html) are supported.
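
To work out which format string to enter, you can test sample values against a few candidate patterns. The Python sketch below pairs some common Joda-Time patterns with the equivalent Python `strptime` directives and returns the first Joda-Time pattern that parses every sample. The candidate list and the function name are assumptions, and Python's parser is only a stand-in: its leniency may differ from Joda-Time's in edge cases.

```python
from datetime import datetime

# (Joda-Time pattern, equivalent Python strptime pattern)
CANDIDATES = [
    ("yyyy-MM-dd HH:mm:ss", "%Y-%m-%d %H:%M:%S"),
    ("yyyy-MM-dd", "%Y-%m-%d"),
    ("MM/dd/yyyy", "%m/%d/%Y"),
]


def guess_joda_pattern(samples):
    """Return the first candidate Joda-Time pattern matching all samples.

    Returns None if there are no samples or no candidate fits.
    """
    if not samples:
        return None
    for joda_pattern, python_pattern in CANDIDATES:
        try:
            for sample in samples:
                datetime.strptime(sample, python_pattern)
        except ValueError:
            continue  # at least one sample did not match; try the next pattern
        return joda_pattern
    return None


timestamp_pattern = guess_joda_pattern(["2023-01-15 10:30:00", "2023-01-16 08:00:00"])
date_pattern = guess_joda_pattern(["2023-01-15", "2023-02-01"])
us_date_pattern = guess_joda_pattern(["01/15/2023"])
```

If no candidate matches, the column is probably not a reliable datetime column, in which case the guidance above is to use no partition.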

If your secondary partition strategy is “Single Column” based, you must select the secondary partition column.

At the bottom of the page, the system displays a few lines of actual data so that you can check the datatype, format, and sample values of the dataset.

To start indexing, click the “Load and Index” button.

This will open the “Tasks” page that shows a list of all active or completed indexing tasks. See Figure 2.4.2d as an example.

Figure 2.4.2a: Browse or drag-and-drop to upload files to a datasource
Figure 2.4.2b: Screen showing file upload options, data warehouse name, and datasource name
Figure 2.4.2c: Screen showing data ingestion, field mapping, and partitioning parameters
Figure 2.4.2d: Screen showing task status after indexing is triggered

