How to import only some columns from XLS with ETL? - etl

I want to do something like Read only certain columns from xls in Jaspersoft ETL Express 5.4.1, but I don't know the schema of the files. However, from what I read here, it looks like I can only do this with the Enterprise Version's Dynamic Schema thing.
Is there no other way?

you can do it using tMap component. design job like below.
tFileInputExcel--main--TMap---main--youroutput
create metadata for your input file that is excel
then used this metadata in your input component
in Tmap select only required columns in output.
See the image of tMap wherein i am selecting only two columns from input flow.
Enterprise version has many features and dynamic schema is the most important one. But as far as your concern that is not required. it is required if you have variable of schema wherein you don`t know how many columns you will received in your feed.

Related

Read data from multiple tables at a time and combine the data based where clause using Nifi

I have scenario where I need to extract multiple database table data including schema and combine(combination data) them and then write to xl file?
In NiFi the general strategy to read in from a something like a fact table with ExecuteSQL or some other SQL processor, then using LookupRecord to enrich the data with a lookup table. The thing in NiFi is that you can only do a table at a time, so you'd need one LookupRecord for each enrichment table. You could then write to a CSV file that you could open in Excel. There might be some extensions elsewhere that can write directly to Excel but I'm not aware of any in the standard NiFi distro.

Import Sqoop column names issue

I have a question on Kylo and Nifi.
The version of Kylo used is 0.10.1
The version of Nifi used is 1.6.0
When we create a feed for database ingest (using database as source), in the Additional Options step there is no provision to enter the source table column names.
However, in Nifi side, we use an Import Sqoop processor which has a mandatory field called Source Fields and it requires that the columns be entered, separated by commas. If it is not done, we get an error:
ERROR tool.ImportTool: Imported Failed: We found column without column name. Please verify that you've entered all column names in your query if using free form query import (consider adding clause AS if you're using column transformation)
For our requirement, we want Import Sqoop to take all the columns from the table automatically into this property without manual intervention at Nifi level. Is there any option to include all columns of a database table in the background automatically? Or is there any other possibility of giving this value in UpdateAttribute processor?
As mentioned in the Comments, ImportSqoop is not a not a normal Nifi processor. This does not have to be problem, but will mean it is probably not possible to troubleshoot the problem without involving the creator.
Also, though I am still debating whether Nifi on Sqoop is an antipattern, it is certainly not necessary.
Please look into the standard options first:
Standard way to get data into Nifi from tables is with standard processors such as ExecuteSQL
If that doesn't suffice, the standard way to use Sqoop (a batch tool) is with a batch scheduler, such as Oozie or Airflow
This thread may take away further doubts on point 1: http://apache-nifi.1125220.n5.nabble.com/Sqoop-Support-in-NIFI-td5653.html
Yes, Teradata Kylo Import Sqoop is not standard NiFi processor, but it's there for us to use. Looking deeper at processor's properties, we can see that indeed, SOURCE_TABLE_FIELDS is required there. Then you have an option to manually hard-code the list of columns or set up a method to generate the list dynamically.
Typical solution is to provide the list of fields is by querying table's metadata. A particular solution depends on where source and target tables are set up and how mapping is defined between source and target columns. For example, one could use databases' INFORMATION_SCHEMA tables and match columns by name. Because SQOOP's output should match the source, one has to find a way to generate the column list and provide it to ImportSqoop processor. A better yet approach could involve a separate metadata that would store the source and target information along with mappings and possible transforms (many tools are available there for that purpose, for example, Wherescape).
More specifically, I would use LookupAttribute paired with database or scripted lookup service to retrieve the column list from some metadata provider.

Can I keep data of different file formats in same hive table?

I am receiving data of formats like csv, xml, json and I want to keep all the files in same hive table.Is it achievable?
Hive expects all the files for one table to use the same delimiter, same compression applied etc. So, you cannot use a Hive table on top of files with multiple formats.
The solution you may want to use is
Create a separate table (json/xml/csv) for each of the file formats
Create a view for the UNION of the 3 tables created above.
This way the consumer of the data has to query only one view/object, if that's what you are looking for.
Yes, you can achieve this through a combination of different external tables.
Because different SerDes with different specifications for how to read columns in the different files will be needed, you will need to create one external table per type of file (and table). The data from each of these external tables can then be combined into a view with UNION, as suggested by Ramesh. The view can could then be used for reading from these, and you could e.g. insert the data into a managed table.

Using orientDB ETL json transform

does anyone have examples of orientDB etl transformers with multiple transforms or something that can create class identifiers on the fly, so for example, if you want to create organization entities and the id could be a hash from the organization name , essentially if the json we are importing is not exactly the schema we want in the destination
What about using block code in your ETL configuration file? You can use it in the begin phase, so you could transform id column in your .csv input file. It is not an ideal solution I agree.
see the Block documentation

How to load multiple excel files into different tables based on xls metadata using SSIS?

I have multiple excel files with two types of metadata, Now i have to push the data into two different tables based on metadata of excel files using SSIS.
There are many, many different ways to do this. You'd need to share a lot more information on how your data is structured to really give a great answer, but here's the general strategy I'd suggest.
In the control flow tab, have a separate data flow for each Excel file. The data flows will all work the same, with the exception of having a different Excel source in each data flow, so it will be enough to get the first version working and then copy and paste for the other files.
In the data flow, use a conditional split transformation to read the metadata coming from Excel and send the row to the correct table.
If you really want to be fancy, however, you could create a child package that includes all your data flow logic. Using the Execute Package Task you can pass the Excel file name to the child package for each Excel file you need to import. This way you consolidate your logic in one package and can still import from multiple Excel files in parallel.

Resources