NiFi: how to merge multiple columns in a CSV file? - apache-nifi

NiFi version: 1.5
Input file:
col1,col2,col3,col4,col5,col6
a,hr,nat,REF,6,2481
a,hr,nat,TDB,6,1845
b,IT,raj,NAV,6,2678
I want to merge the last three columns using : as the delimiter, separating the merged groups with /, grouped by col1.
Expected output:
col1,col2,col3,col4
a,hr,nat,REF:6:2481/TDB:6:1845
b,IT,raj,NAV:6:2678
I am not able to find a solution because most of the answers I found were about merging two files.
Is there a better way to do this?
TIA.

I think you'll want a PartitionRecord processor first, with the partition field set to col1; this will split the flow file into multiple flow files, where each distinct value of col1 will be in its own flow file. If the first 3 columns are to be used for partitioning, you can add all three columns as user-defined properties for partitioning.
Then, whether you use a scripted solution or perhaps QueryRecord (if Calcite supports "group by" concatenation), the memory usage should be lower, as you are only dealing with one flow file at a time whose rows already belong to the specified group.
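For reference, here is a minimal standalone Python sketch of the merge itself (the file names are hypothetical and it is not tied to any NiFi API); inside NiFi the same grouping logic could live in a scripted processor or a record-oriented equivalent:

import csv
from collections import OrderedDict

merged = OrderedDict()
with open("input.csv", newline="") as f:            # hypothetical input path
    reader = csv.reader(f)
    next(reader)                                    # skip the header row
    for row in reader:
        key = tuple(row[:3])                        # col1, col2, col3 identify a group
        merged.setdefault(key, []).append(":".join(row[3:]))

with open("output.csv", "w", newline="") as f:      # hypothetical output path
    writer = csv.writer(f)
    writer.writerow(["col1", "col2", "col3", "col4"])
    for key, parts in merged.items():
        writer.writerow(list(key) + ["/".join(parts)])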

Related

How to find that all the files created by GenerateTableFetch have been processed

We have a flow where GenerateTableFetch takes input from SplitJson, which gives TableName and ColumnName as arguments. Multiple tables are passed as input to GenerateTableFetch at once, and then ExecuteSQL executes the query.
Now I want to trigger a new process when all the files for a table have been processed by the downstream processors (at the end there is PutFile).
How do I find that all the files created for a table have been processed?
You may need NIFI-5601 to accomplish this; there is a patch currently under review at the time of this writing, which I hope to get into NiFi 1.9.0.
EDIT: Adding potential workarounds in the meantime
If you can use ListDatabaseTables instead of getting your table names from a JSON file, then you can set Include Count to true. Then you will get attributes for the table name and the count of its rows. Divide the count by the value of the Partition Size in GTF (rounding up) and that will give you the number of fetches (let's call it X). Then add an attribute via UpdateAttribute called "parent" or something, and set it to ${UUID()}. Keep these attributes in the flow files going into GTF and ExecuteSQL; then you can use Wait/Notify to wait until X flow files are received (setting Target Signal Count to ${X} and using ${parent} as the Release Signal Identifier).
If you can't use ListDatabaseTables, then you may be able to put ExecuteSQLRecord after your SplitJSON and execute something like SELECT COUNT(*) FROM ${table.name}. If using ExecuteSQL, you may need a ConvertAvroToJSON; if using ExecuteSQLRecord, use a JSONRecordSetWriter. Then you can extract the count from the flow file contents using EvaluateJsonPath.
Once you have the table name and the row count in attributes, you can continue with the flow I outlined above (i.e. determine the number of flow files that GTF will generate, etc.).
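As a small illustration of the arithmetic above (all values are hypothetical), the number of fetches X is the row count divided by the Partition Size, rounded up:

import math

row_count = 2678        # hypothetical result of SELECT COUNT(*) for the table
partition_size = 1000   # hypothetical GTF Partition Size property

x = math.ceil(row_count / partition_size)   # number of flow files GTF will emit
print(x)                                    # 3 -> use as the Target Signal Count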

How to coalesce large partitioned data into a single directory in Spark/Hive

I have a requirement:
Huge data is partitioned and inserted into Hive. To bind this data, I am using DF.Coalesce(10). Now I want to bind this partitioned data into a single directory. If I use DF.Coalesce(1), will the performance decrease, or is there another way to do so?
From what I understand, you are trying to ensure that there are fewer files per partition. By using coalesce(10), you will get at most 10 files per partition. I would suggest using repartition($"COL"), where COL is the column used to partition the data. This will ensure that your "huge" data is split based on the partition column used in Hive: df.repartition($"COL")
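For illustration, a minimal PySpark sketch of that suggestion (the table names and the partition column COL are hypothetical; the Scala equivalent is the one-liner above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("staging_db.events")            # hypothetical source table

# Repartition on the Hive partition column so each partition's rows are
# written by a single task, keeping the number of files per partition low.
(df.repartition("COL")
   .write
   .mode("append")
   .partitionBy("COL")
   .saveAsTable("warehouse_db.events_by_col"))   # hypothetical target table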

Time based directory structure Apache Drill

I have CSV files organized by date and time as follows
logs/YYYY/MM/DD/CSV files...
I have set up Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files, the organization of the files can be utilized to optimize performance. For example,
SELECT * from data where trans>='20170101' AND trans<'20170102';
In this SQL, the directory logs/2017/01/01 should be scanned for data. Is there a way to let Apache Drill do optimization based on this directory structure? Is it possible to do this in Hive, Impala or any other tool?
Please note:
SQL queries will almost always contain the time frame.
The number of CSV files in a given directory is not huge; combined across all years' worth of data, it will be huge.
There is a field called 'trans' in every CSV file, which contains the date and time.
The CSV file is put under appropriate directory based on the value of 'trans' field.
CSV files do not follow any schema. Columns may or may not be different.
Querying using column inside the data file would not help in partition pruning.
You can use dir* variables in Drill to refer to partitions in a table.
create view trans_logs_view as
select
`dir0` as `tran_year`,
`dir1` as `trans_month`,
`dir2` as `tran_date`, * from dfs.`/data/logs`;
You can query using the tran_year, trans_month and tran_date columns for partition pruning.
Also see whether the query below helps with pruning.
select count(1) from dfs.`/data/logs`
where concat(`dir0`,`dir1`,`dir2`) between '20170101' AND '20170102';
If so, you can define the view by aliasing concat(`dir0`,`dir1`,`dir2`) to the trans column name and query against that.
See below for more details.
https://drill.apache.org/docs/how-to-partition-data/

How can Informatica do the job of a link collector in Datastage?

Is it possible to replicate the functionality of the link collector in DataStage in Informatica, using its pre-built transformations?
I have 4 different streams in the same mapping. I want a union of all the streams. They may or may not relate to one another, hence I do not have a common column. I just want to dump the values from those 4 streams into a single column in Informatica. Is it possible to do so?
The Union transformation allows you to define a number of input groups and input ports (4 groups with a single port each, in your case) and merges the source rows together.

how to perform ETL in map/reduce

How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this: the first row contains a comma-separated list of column names; the second and subsequent rows specify values against those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns then values might be specified only for the first 10 columns.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information like the file name to the mapper/reducer program?
Transformations are specific to columns, so how do I remember the columns mentioned in the first row and then correlate and transform the values in subsequent rows?
Split the file into lines, transform (map) each line in parallel, and join (reduce) the resulting lines into one file?
You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will see the first row with the column info; the rest won't.
I would suggest passing file-specific metadata in a separate file and distributing it as side data. Your mapper or reducer tasks could read the metadata file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. Between all of these, I think you have all the context information you are referring to, and you can do file-specific transformations. Even though the transformation logic is different for different files, the mapper output needs to have the same format.
If you are using a reducer, you could set the number of reducers to one to force all output to aggregate into one file.
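To make the file-name/side-data idea concrete, here is a minimal Hadoop Streaming sketch in Python (the columns.json side file and its layout are assumptions; in a Java mapper the file name would come from the InputSplit via the Context object instead):

#!/usr/bin/env python
# mapper.py -- pad missing trailing columns and emit a common output format.
# The per-file column metadata is assumed to be shipped as side data,
# e.g. with -files columns.json; all names here are hypothetical.
import json
import os
import sys

# Hadoop Streaming exposes job configuration to the script as environment
# variables with dots replaced by underscores, so the path of the split
# being processed is available as mapreduce_map_input_file.
input_path = os.environ.get("mapreduce_map_input_file", "")
file_name = os.path.basename(input_path)

with open("columns.json") as f:          # side-data file: {file name: [column names]}
    columns = json.load(f).get(file_name, [])

for line in sys.stdin:
    values = line.rstrip("\n").split(",")
    if values == columns:
        continue                         # skip the header row if this split contains it
    values += [""] * max(0, len(columns) - len(values))  # pad missing trailing columns
    # File-specific transformation, keyed on file_name, would go here.
    print(",".join(values))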
