Import Sqoop column names issue - apache-nifi

I have a question on Kylo and NiFi.
The version of Kylo used is 0.10.1.
The version of NiFi used is 1.6.0.
When we create a feed for database ingest (using a database as the source), there is no provision in the Additional Options step to enter the source table column names.
However, on the NiFi side we use an Import Sqoop processor which has a mandatory property called Source Fields, and it requires the column names to be entered, separated by commas. If this is not done, we get an error:
ERROR tool.ImportTool: Imported Failed: We found column without column name. Please verify that you've entered all column names in your query if using free form query import (consider adding clause AS if you're using column transformation)
For our requirement, we want Import Sqoop to pick up all the columns of the table automatically for this property, without manual intervention at the NiFi level. Is there any option to include all columns of a database table in the background automatically? Or is there any other way of supplying this value via an UpdateAttribute processor?

As mentioned in the comments, ImportSqoop is not a normal NiFi processor. This does not have to be a problem, but it means it is probably not possible to troubleshoot the issue without involving its creator.
Also, though I am still debating whether driving Sqoop from NiFi is an antipattern, it is certainly not necessary.
Please look into the standard options first:
The standard way to get data into NiFi from tables is with standard processors such as ExecuteSQL.
If that doesn't suffice, the standard way to run Sqoop (a batch tool) is with a batch scheduler, such as Oozie or Airflow (a minimal sketch follows the link below).
This thread may clear up further doubts on point 1: http://apache-nifi.1125220.n5.nabble.com/Sqoop-Support-in-NIFI-td5653.html
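For point 2, here is a minimal sketch of what scheduling a Sqoop import from Airflow could look like, assuming Airflow 2.x with a plain BashOperator; the connection string, credentials, table name, and paths are placeholders, not values from the question.
```python
# Minimal Airflow DAG sketch: schedule a Sqoop import from a batch scheduler
# instead of driving it from NiFi. All connection details, paths, and the
# table name are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sqoop_daily_import",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    sqoop_import = BashOperator(
        task_id="sqoop_import_orders",
        bash_command=(
            "sqoop import "
            "--connect jdbc:mysql://source-db:3306/sales "
            "--username etl_user --password-file /user/etl/.pw "
            "--table orders "
            "--target-dir /data/raw/orders/{{ ds }} "
            "--num-mappers 4"
        ),
    )
```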

Yes, the Teradata Kylo ImportSqoop is not a standard NiFi processor, but it is there for us to use. Looking deeper at the processor's properties, we can see that SOURCE_TABLE_FIELDS is indeed required there. You then have the option to manually hard-code the list of columns or to set up a method to generate the list dynamically.
The typical way to provide the list of fields is to query the table's metadata. The particular solution depends on where the source and target tables are set up and how the mapping is defined between source and target columns. For example, one could use the databases' INFORMATION_SCHEMA tables and match columns by name. Because Sqoop's output should match the source, one has to find a way to generate the column list and provide it to the ImportSqoop processor. An even better approach could involve a separate metadata store that holds the source and target information along with mappings and possible transforms (many tools are available for that purpose, for example WhereScape).
More specifically, I would use LookupAttribute paired with a database or scripted lookup service to retrieve the column list from some metadata provider.
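As a sketch of the INFORMATION_SCHEMA route, the snippet below builds the comma-separated list that the Source Fields property expects. It assumes a MySQL-compatible source and the pymysql driver; the host, credentials, and schema/table names are placeholders. The same logic could back a scripted lookup service used by LookupAttribute, or run from an ExecuteStreamCommand step upstream of ImportSqoop.
```python
# Sketch only: build the comma-separated column list for ImportSqoop's
# "Source Fields" property by querying the source database's metadata.
# Connection details and names below are placeholders.
import pymysql  # assumes a MySQL-compatible source; swap the driver as needed


def source_fields(schema: str, table: str) -> str:
    conn = pymysql.connect(host="source-db", user="etl_user",
                           password="secret", database=schema)
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT COLUMN_NAME
                FROM INFORMATION_SCHEMA.COLUMNS
                WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s
                ORDER BY ORDINAL_POSITION
                """,
                (schema, table),
            )
            return ",".join(row[0] for row in cur.fetchall())
    finally:
        conn.close()


# e.g. source_fields("sales", "orders") -> "order_id,customer_id,order_date,..."
```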

Related

Providing user defined column names in AWS glue

I have a lot of Parquet files. I need to read them through Amazon Glue and then provide column names to the table that is being read.
The problem is that the Parquet files already have column names, which are read by the crawler and shown in the table. Is it possible to provide my own column names for these Parquet files in Glue?
To replace the detected column names with names of your own, you could either:
Use one of the following built-in transformations on DynamicFrame (a short sketch of both follows below):
ApplyMapping - Applies a declarative mapping to this DynamicFrame and returns a new DynamicFrame with those mappings applied. (source column, source type, target column, target type)
RenameField - Renames a field in this DynamicFrame and returns a new DynamicFrame with the field renamed. (oldName -> newName)
See the Scala or Python ETL programming guides for more detail.
Or try updating the data catalog field names manually if you don't need to continuously re-crawl the data (or if you do, it is possible to prevent a Glue crawler from updating existing data catalog tables via the crawler configuration).
Alternatively, if your requirements are more discrete, the map transform is available to convert each DynamicRecord in the DynamicFrame to a new DynamicRecord of your choosing.
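For illustration, here is a hedged PySpark sketch of the ApplyMapping and RenameField options mentioned above, as they might appear in a Glue ETL script; the catalog database, table, and column names are placeholders.
```python
# Sketch: rename crawler-detected columns in a Glue job. Database, table and
# column names below are placeholders.
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping, RenameField
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# DynamicFrame read from the crawled catalog table (names assumed).
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="parquet_events"
)

# ApplyMapping: (source column, source type, target column, target type)
renamed = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("col0", "string", "event_id", "string"),
        ("col1", "long", "event_ts", "long"),
    ],
)

# RenameField: oldName -> newName, one field at a time
renamed = RenameField.apply(frame=renamed, old_name="col2", new_name="user_id")
```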

What is the best way to ingest data from Teradata into Hadoop with Informatica?

What is the best way to ingest data from a Teradata database into Hadoop with parallel data movement?
If we create a job which simply opens one session to the Teradata database, it will take a long time to load a huge table.
If we create a set of sessions to load data in parallel, with a SELECT in each of the sessions, then Teradata will perform a set of full table scans to produce the data.
What is the recommended best practice to load data in parallel streams without putting unnecessary workload on Teradata?
If Teradata supports table partitioning like Oracle, you could try reading the table based on partition boundaries, which will enable parallelism in the read.
Another option is to split the table into multiple partitions, for example by adding a WHERE clause on an indexed column. This ensures an index scan and avoids a full table scan.
The most scalable way I have found to ingest data into Hadoop from Teradata is to use the Teradata Connector for Hadoop. It is included in the Cloudera and Hortonworks distributions. I will show an example based on the Cloudera documentation, but the same works with Hortonworks as well.
Informatica Big Data Edition uses a standard Sqoop invocation via the command line, submitting a set of parameters to it. So the main question is which driver to use to make parallel connections between the two MPP systems.
Here is the link to the Cloudera documentation:
Using the Cloudera Connector Powered by Teradata
And here is the digest from this documentation (you can see that this connector supports different kinds of load balancing between connections):
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
split.by.amp
split.by.value
split.by.partition
split.by.hash
split.by.amp Method
This optimal method retrieves data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
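To make this concrete, here is a hedged sketch of invoking Sqoop with the split.by.amp method from a Python wrapper (similar in spirit to how Informatica drives Sqoop from the command line). The host, credentials, database, table, and target directory are placeholders, and the exact -D property name should be verified against the Cloudera documentation linked above for your connector version.
```python
# Sketch only: run a Sqoop import through the Cloudera Connector Powered by
# Teradata using the split.by.amp input method. All values are placeholders.
import subprocess

cmd = [
    "sqoop", "import",
    "-Dteradata.db.input.method=split.by.amp",  # assumed property name; check the docs
    "--connect", "jdbc:teradata://td-host/DATABASE=sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.td_pw",
    "--table", "ORDERS",
    "--target-dir", "/data/raw/orders",
]

# One mapper is created per available Teradata AMP, so no staging table is needed.
subprocess.run(cmd, check=True)
```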
If you use partition names in the SELECT clause, PowerCenter will select only the rows within that partition, so there won't be duplicate reads (don't forget to choose Database partitioning at the Informatica session level). However, if you use key-range partitioning, you have to choose the ranges in the settings, as you mentioned. Usually we use the NTILE analytical function (Oracle syntax) to split the table into multiple portions so that each SELECT reads a unique slice. If you have a range/auto-generated/surrogate key column in the table, use it in the WHERE clause: write a sub-query to divide the table into multiple portions.
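To make the NTILE idea concrete, here is a small sketch that generates one SELECT per parallel session, each reading a distinct bucket of the table; the table name, key column, and bucket count are placeholders.
```python
# Sketch: generate N non-overlapping SELECTs using NTILE over a key column,
# one per parallel reader session. Table/column names are placeholders.
def ntile_queries(table: str, key_col: str, buckets: int) -> list[str]:
    template = (
        "SELECT * FROM ("
        "  SELECT t.*, NTILE({n}) OVER (ORDER BY {key}) AS bucket_id"
        "  FROM {table} t"
        ") sub WHERE bucket_id = {b}"
    )
    return [
        template.format(n=buckets, key=key_col, table=table, b=b)
        for b in range(1, buckets + 1)
    ]


# Each query feeds one session/reader, so no row is read twice.
for q in ntile_queries("orders", "order_id", 4):
    print(q)
```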

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem and I need some suggestions from Big Data experts on achieving schema verification/validation before loading huge data into HDFS.
The scenario is:
I have a huge dataset with a given schema (having around 200 column headers in it). This dataset is going to be stored in Hive tables/HDFS. Before loading the data into the Hive table/HDFS, I want to perform schema-level verification/validation on the supplied data to avoid any unwanted errors/exceptions while loading the data into HDFS. For example, if somebody tries to pass a data file having fewer or more columns in it, then the load should fail at this first level of verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them into HDFS and run MapReduce on top of them. There you have a hold of each row, so you can verify the number of columns, their types, and any other validations.
Regarding JSON/XML, there is a slight overhead in making MapReduce identify the records in that format. However, with respect to validation, there is schema validation which you can enforce, and you can also restrict a field to specific values using the schema. So once the schema is ready, do your parsing (XML to Java) and then store the records at another, final HDFS location for further use (like HBase). When you are sure the data is validated, you can create Hive tables on top of it.
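As a concrete example of the fail-fast check described above, here is a minimal sketch that validates a delimited file's header against the expected schema before the file is pushed to HDFS; the expected column list and the file path are placeholders (in practice the ~200 names would be loaded from a schema file).
```python
# Sketch: reject a delimited file whose header does not match the expected
# schema before it is pushed to HDFS. Paths and schema are placeholders.
import csv
import sys

EXPECTED_COLUMNS = ["id", "name", "created_at"]  # placeholder; ~200 names in practice


def validate_header(path: str, delimiter: str = ",") -> None:
    with open(path, newline="") as f:
        header = next(csv.reader(f, delimiter=delimiter))
    if len(header) != len(EXPECTED_COLUMNS):
        raise ValueError(
            f"{path}: expected {len(EXPECTED_COLUMNS)} columns, got {len(header)}"
        )
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"{path}: column names do not match the expected schema")


if __name__ == "__main__":
    validate_header(sys.argv[1])  # fail fast before 'hdfs dfs -put'
    print("schema OK")
```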
Use the utility below to create temp tables each time, based on the schema you receive in CSV file format in the staging directory, and then apply some conditions to identify whether you have valid columns or not. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive

How to import only some columns from XLS with ETL?

I want to do something like "Read only certain columns from xls" in Jaspersoft ETL Express 5.4.1, but I don't know the schema of the files. However, from what I read here, it looks like I can only do this with the Enterprise version's Dynamic Schema feature.
Is there no other way?
You can do it using the tMap component. Design the job like below:
tFileInputExcel --main--> tMap --main--> your output
Create metadata for your input file (the Excel file).
Then use this metadata in your input component.
In tMap, select only the required columns in the output.
See the image of tMap where I am selecting only two columns from the input flow.
The Enterprise version has many features, and dynamic schema is one of the most important. But for your concern it is not required; it is needed when you have a variable schema, where you don't know how many columns you will receive in your feed.

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment is heavy into storing data in Hive. I find myself currently working on something that is outside that scope, though. I have a MapReduce job written, but it requires a lot of direct user input for information that could easily be scraped from Hive. That said, when I query Hive for extended table data, all of the extended information is thrown out in 1 or 2 columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information or, better yet, a way to get it more directly?
Alternatively, if I could get pointed to documentation on manually using the CombinedHiveInputFormat, that would simplify my code a lot. But it seems like that InputFormat is solely used inside of Hive, using its custom structs.
Ultimately, what I want is to know the table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.
