Loading a Flat File in Informatica 9.5 - informatica-powercenter

I have three sources, i.e. flat files (4 records), Oracle (3 records) and Netezza (5 records). I want 12 records in my target table, which is a flat file. How can I achieve this in Informatica 9.5? I know we can use a Joiner transformation for joining heterogeneous records, but I am not sure about the join conditions in the Joiner. Please advise.

Use a Union transformation. First drag the required ports from any one of the sources to the Union transformation. Then create 3 groups in the Union transformation; this creates 3 sets of input ports. Now connect the ports from the other two sources to the remaining input groups.
There will be one set of output ports, which you can connect to the target definition (flat file).

You can use a Union transformation if the structure of the three sources is the same.
Otherwise you need to go for a Joiner transformation, using a matching column present in all three sources, and do a full outer join.

Related

Input files with different columns to be loaded in a single NiFi flow

I have some input files with different column names. Can I create a common NiFi flow which processes all the types of files? Each file has different columns and a different target table to be loaded. For example, File 1 has column A and column B, to be loaded to Table AB; File 2 has column C, column D and column E, to be loaded to Table CDE. Can I achieve this in a single flow, or should I create different flows for the different types of files? I am new to NiFi, please suggest.
You should be able to do this with a single flow, perhaps with RouteOnContent looking for the header so you know which type of file it is. Each outgoing connection would correspond to a different type of file / output table, so you could put an UpdateAttribute at the other end of each outgoing connection to set attributes such as the table name and possibly the record schema (if using record-based processors, which I recommend). Then you can use a funnel to merge the sub-flows, or just connect all the outgoing connections to whatever the next downstream processor is (PutDatabaseRecord, for example).
If you don't want to split the flow at all, you'd probably need to do the same work of identifying the file type and setting the attributes, but from a single script (using ExecuteScript, for example). In either case, the downstream processors should be able to make use of the attributes via NiFi Expression Language, so that the same processor can handle the different file types appropriately.
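As a rough illustration of the single-script approach, here is a minimal ExecuteScript (Jython) sketch; the header values it checks for and the 'target.table' attribute name are made up for this example, so adapt both to your actual files:

# Minimal ExecuteScript (Jython) sketch: read the CSV header and set a routing attribute.
# The header prefixes and the 'target.table' attribute name are assumptions for this example.
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

class ReadHeader(InputStreamCallback):
    def __init__(self):
        self.header = ''
    def process(self, inputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        self.header = text.splitlines()[0] if text else ''

flowFile = session.get()
if flowFile is not None:
    reader = ReadHeader()
    session.read(flowFile, reader)
    # Route on the header: files whose header starts with columns A,B go to table AB, otherwise CDE.
    table = 'AB' if reader.header.upper().startswith('A,B') else 'CDE'
    flowFile = session.putAttribute(flowFile, 'target.table', table)
    session.transfer(flowFile, REL_SUCCESS)

Downstream processors can then reference the attribute with Expression Language, e.g. as the table name property.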

NiFi: how to merge multiple columns in a CSV file?

NiFi version: 1.5
input file:
col1,col2,col3,col4,col5,col6
a,hr,nat,REF,6,2481
a,hr,nat,TDB,6,1845
b,IT,raj,NAV,6,2678
I want to merge the last three columns using ':' as the delimiter, with '/' as the separator between merged rows, grouped by col1.
expected output:
col1,col2,col3,col4
a,hr,nat,REF:6:2481/TDB:6:1845
b,IT,raj,NAV:6:2678
I am not able to find a solution because most of the responses are about merging two files.
Is there a better way to do it?
Thanks in advance.
I think you'll want a PartitionRecord processor first, with col1 as the partition field. This will split the flow file into multiple flow files, where each distinct value of col1 ends up in its own flow file. If the first 3 columns are to be used for grouping, you can add all three columns as user-defined partitioning properties.
Then, whether you use a scripted solution or perhaps QueryRecord (if Calcite supports "group by" concatenation), the memory usage should be lower, since you are only dealing with one flow file at a time whose rows already belong to the specified group.
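If you go the scripted route, the transformation itself is straightforward. Here is a minimal plain-Python sketch of the grouping logic, shown outside NiFi just to illustrate the intended output; the file names are placeholders, and an ExecuteScript version would follow the same steps:

# Sketch of the desired transformation: join col4..col6 with ':' per row,
# then join rows that share col1..col3 with '/'. File names are placeholders.
import csv
from collections import OrderedDict

groups = OrderedDict()
with open('input.csv') as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row
    for row in reader:
        key = tuple(row[:3])          # col1..col3 identify the group
        groups.setdefault(key, []).append(':'.join(row[3:]))

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['col1', 'col2', 'col3', 'col4'])
    for key, values in groups.items():
        writer.writerow(list(key) + ['/'.join(values)])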

Processing semi-inhomogeneous structured files with Spark (CSV, Parquet)

I have several inhomogeneously structured files stored in a Hadoop cluster. The files contain a header line, but not all files contain the same columns.
file1.csv:
a,b,c
1,2,1
file2.csv:
a,b,d
2,2,2
What I need to do is look for all data in column a or column c and process it further (possibly with Spark SQL). So I expect something like:
a,b,c,d
1,2,1,
2,2,,2
Just doing
spark.read.format("csv").option("header", "true").load(CSV_PATH)
will miss all columns not present in the "first" file read.
How can I do this? Is a conversion to Parquet and its dataset feature a better approach?
Read the two files separately and create two DataFrames, then do a full outer join between them with a and b as the join keys, so that rows present in only one file are kept (as in the expected output above).
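A minimal PySpark sketch of that approach, assuming the two example files above (the paths and the final select list are placeholders):

# Sketch: read each file into its own DataFrame, then merge them on the shared keys.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.option("header", "true").csv("file1.csv")  # columns a, b, c
df2 = spark.read.option("header", "true").csv("file2.csv")  # columns a, b, d

# A full outer join on a and b keeps rows that exist in only one of the files,
# leaving the column that is missing for that row (c or d) as null.
merged = df1.join(df2, on=["a", "b"], how="full_outer")
merged.select("a", "b", "c", "d").show()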

Time based directory structure Apache Drill

I have CSV files organized by date and time as follows
logs/YYYY/MM/DD/CSV files...
I have set up Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files, the organization of the files can be utilized to optimize performance. For example,
SELECT * from data where trans>='20170101' AND trans<'20170102';
In this SQL, only the directory logs/2017/01/01 should be scanned for data. Is there a way to let Apache Drill optimize based on this directory structure? Is it possible to do this in Hive, Impala or any other tool?
Please note:
SQL queries will almost always contain the time frame.
The number of CSV files in a given directory is not huge, but combined across all years of data it will be huge.
There is a field called 'trans' in every CSV file, which contains the date and time.
Each CSV file is put under the appropriate directory based on the value of the 'trans' field.
CSV files do not follow any schema. Columns may or may not be different.
Querying using a column inside the data file would not help with partition pruning.
You can use the dir* variables in Drill to refer to the partition directories of a table.
create view trans_logs_view as
select
`dir0` as `tran_year`,
`dir1` as `tran_month`,
`dir2` as `tran_date`,
* from dfs.`/data/logs`;
You can query using the tran_year, tran_month and tran_date columns to get partition pruning.
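For example, a query that filters only on those view columns (the values below are illustrative) lets Drill prune down to the matching directories:
select count(1) from trans_logs_view
where tran_year = '2017' and tran_month = '01' and tran_date = '01';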
Also see whether the query below helps with pruning.
select count(1) from dfs.`/data/logs`
where concat(`dir0`,`dir1`,`dir2`) between '20170101' AND '20170102';
If so, you can define a view that aliases concat(`dir0`,`dir1`,`dir2`) as the trans column and query against that.
See below for more details.
https://drill.apache.org/docs/how-to-partition-data/

How can Informatica do the job of a link collector in DataStage?

Is it possible to replicate the functionality of the link collector in DataStage in Informatica, using its pre-built transformations?
I have 4 different streams in the same mapping and I want a union of all the streams. They may or may not relate to one another, hence I do not have a common column. I just want to dump the values from those 4 streams into a single column in Informatica. Is it possible to do so?
The Union transformation allows you to define a number of input groups and input ports (4 groups with a single port each, in your case) and merges the source rows together.
