How can Informatica do the job of a link collector in DataStage? - informatica-powercenter

Is it possible to replicate the functionality of DataStage's Link Collector in Informatica using its pre-built transformations?
I have 4 different streams in the same mapping and want a union of all of them. The streams may or may not relate to one another, so I do not have a common column. I just want to dump the values from those 4 streams into a single column in Informatica. Is it possible to do so?

The Union transformation allows you to define a number of input groups and input ports (4 groups with a single port each, in your case) and merges the rows from all sources together.

Related

Looking for an Equivalent of GenerateTableFetch

I use ExecuteSQLRecord to run a query and write to CSV format. The table has 10M rows. Although I can split the output into multiple flow files, the query is executed by only a single thread and is very slow.
Is there a way to partition the query into multiple queries so that the next processor can run multiple concurrent tasks, each one processing one partition? It would be like:
GenerateTableFetch -> ExecuteSQLRecord (with concurrent tasks)
The problem is that GenerateTableFetch only accepts a table name as input; it does not accept custom queries.
Please advise if you have solutions. Thank you in advance.
You can increase the concurrency on NiFi processors (by increasing the number of Concurrent Tasks), and you can also increase the throughput; sometimes that is enough.
Also, if you are running a cluster, you can apply load balancing on the queue before the processor so that the workload is distributed among the nodes of your cluster (set the load balance strategy to Round Robin).
Check this YouTube channel for NiFi anti-patterns (there is a video on concurrency): Nifi Notes
Please clarify your question if I didn't answer it.
I figured out an alternative. I developed an Oracle PL/SQL function that takes a table name as an argument and produces a series of queries like "SELECT * FROM T1 OFFSET x ROWS FETCH NEXT 10000 ROWS ONLY". The number of queries is based on the table's row count, which is available as a statistic in the catalog. If the table has 1M rows and I want 100k rows in each batch, it produces 10 queries. I call this function from ExecuteSQLRecord, which effectively does the job of the GenerateTableFetch processor. My next processor (e.g. another ExecuteSQLRecord) can then run 10 concurrent tasks working in parallel.
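This is not NiFi configuration, but as a minimal sketch of the batching arithmetic the PL/SQL function performs (written in Java purely for illustration; the class and method names, batch size and row count are made up):

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the approach above: emit one OFFSET/FETCH query per batch.
public class FetchQueryGenerator {
    public static List<String> buildQueries(String tableName, long rowCount, long batchSize) {
        List<String> queries = new ArrayList<>();
        for (long offset = 0; offset < rowCount; offset += batchSize) {
            queries.add(String.format(
                "SELECT * FROM %s OFFSET %d ROWS FETCH NEXT %d ROWS ONLY",
                tableName, offset, batchSize));
        }
        return queries;
    }

    public static void main(String[] args) {
        // A 1M-row table split into 100k-row batches yields 10 queries.
        buildQueries("T1", 1_000_000, 100_000).forEach(System.out::println);
    }
}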

Input files with different columns to be loaded in single Nifi flow

I have some input files with different column names. Can I create a common NiFi flow that processes all of the file types? Each file has a different set of columns and a different output table to be loaded. For example, File 1 has Column A and Column B to be loaded into Table AB, and File 2 has Column C, Column D and Column E to be loaded into Table CDE. Can I achieve this in a single flow, or should I create different flows for the different file types? I am new to NiFi; please advise.
You should be able to do this with a single flow, perhaps with RouteOnContent to look for the header so you know which type of file it is. Each outgoing connection would correspond to a different file type / output table, so you could have an UpdateAttribute on the other end of each outgoing connection to set attributes for things like the table name and possibly the record schema (if using record-based processors, which I recommend). Then you can use a funnel to merge the sub-flows, or just connect all the outgoing connections to whatever the next downstream processor is (PutDatabaseRecord, for example).
If you don't want to split the flow at all, you'd probably need to do the same work of identifying the file type and setting the attributes, but from a single script (using ExecuteScript, for example). In either case, the downstream processors should be able to make use of the attributes via NiFi Expression Language so that the same processor can handle the different file types appropriately.

NiFi: how to merge multiple columns in a CSV file?

NiFi version: 1.5
input file:
col1,col2,col3,col4,col5,col6
a,hr,nat,REF,6,2481
a,hr,nat,TDB,6,1845
b,IT,raj,NAV,6,2678
I want to merge the last three columns, using ":" as the delimiter within a record and "/" as the separator between records, grouped by col1.
expected output:
col1,col2,col3,col4
a,hr,nat,REF:6:2481/TDB:6:1845
b,IT,raj,NAV:6:2678
I am not able to find a solution because most of the answers I found are about merging two files.
Is there a better way to do it?
Thanks in advance.
I think you'll want a PartitionRecord processor first, with partition field col1; this will split the flow file into multiple flow files, where each distinct value of col1 ends up in its own flow file. If the first 3 columns are all used for grouping, you can add all three columns as user-defined properties for partitioning.
Then, whether you use a scripted solution or perhaps QueryRecord (if Calcite supports "group by" concatenation), the memory usage should be lower because you are only dealing with one flow file at a time, whose rows already belong to the specified group.
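For reference, here is a plain-Java sketch (independent of NiFi) of the grouping and concatenation the flow needs to produce; the hard-coded rows mirror the sample input above, and the column positions are assumptions based on that sample:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Group rows by the first three columns and join the col4:col5:col6 triples with "/".
public class MergeColumns {
    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[]{"a", "hr", "nat", "REF", "6", "2481"},
            new String[]{"a", "hr", "nat", "TDB", "6", "1845"},
            new String[]{"b", "IT", "raj", "NAV", "6", "2678"});

        Map<String, StringBuilder> merged = new LinkedHashMap<>();
        for (String[] r : rows) {
            String key = r[0] + "," + r[1] + "," + r[2];       // grouping key
            String triple = r[3] + ":" + r[4] + ":" + r[5];    // ":"-delimited values
            merged.merge(key, new StringBuilder(triple),
                (acc, next) -> acc.append('/').append(next));  // "/"-separated within a group
        }

        System.out.println("col1,col2,col3,col4");
        merged.forEach((k, v) -> System.out.println(k + "," + v)); // a,hr,nat,REF:6:2481/TDB:6:1845 ...
    }
}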

Loading a Flat File in Informatica 9.5

I have three sources, i.e. a flat file (4 records), Oracle (3 records) and Netezza (5 records). I want all 12 records in my target table, which is a flat file. How can I achieve this in Informatica 9.5? I know we can use a Joiner transformation for joining heterogeneous sources, but I am not sure about the join conditions in the Joiner. Please advise.
Use a Union transformation. First drag the required ports from any one of the sources to the Union transformation. Then create 3 groups in the Union transformation; this creates 3 sets of input ports. Now connect the ports from the other two sources to the remaining input groups of the Union transformation.
There will be one set of output ports, which you can connect to the target definition (the flat file).
You can use a Union transformation if the structure of the three sources is the same.
Otherwise, you need to go for a Joiner transformation with one of the matching columns in all three sources and do a full outer join.

Using Hadoop to process data from multiple datasources

Do MapReduce and the other Hadoop technologies (HBase, Hive, Pig, etc.) lend themselves well to situations where you have multiple input files and data needs to be compared between the different data sources?
In the past I've written a few MapReduce jobs using Hadoop and Pig. However, those tasks were quite simple since they involved manipulating only a single dataset. The requirements we have now dictate that we read data from multiple sources, compare various data elements against another data source, and then report on the differences. The datasets we are working with are in the region of 10 million to 60 million records, and so far we haven't managed to make these jobs fast enough.
Is there a case for using MapReduce to solve such problems, or am I going down the wrong route?
Any suggestions are much appreciated.
I guess I'd preprocess the different datasets into a common format (being sure to include a "data source" id column with a single unique value for each row coming from the same dataset). Then move the files into the same directory, load the whole directory, and treat it as a single data source in which you compare the properties of rows based on their dataset id.
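As a rough illustration of that tagging step, here is a hedged MapReduce mapper sketch; the class name, the choice of the input file name as the dataset id, and the assumption that the comparison key is the first comma-separated field are all made up for the example:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tags every record with the dataset it came from so rows from different
// sources can later be grouped and compared by key in a single job.
public class TagWithSourceMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the input file name as the "data source" id (illustrative assumption).
        String sourceId = ((FileSplit) context.getInputSplit()).getPath().getName();
        // Assume the record's comparison key is the first comma-separated field.
        String joinKey = value.toString().split(",", 2)[0];
        context.write(new Text(joinKey), new Text(sourceId + "\t" + value));
    }
}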
Yes, you can join multiple datasets in a MapReduce job. I would recommend getting a copy of the book/ebook Hadoop in Action, which addresses joining data from multiple sources.
When you have multiple input files, you can use the MapReduce API method FileInputFormat.addInputPaths(), which takes a comma-separated list of files, as below:
FileInputFormat.addInputPaths(job, "dir1/file1,dir2/file2,dir3/file3");
You can also pass multiple inputs into a mapper in Hadoop using the DistributedCache; more info is described here: multiple input into a Mapper in hadoop
If I am not misunderstanding, you are trying to normalize structured data in records coming in from several inputs and then process it. Based on that, I think you really need to look at this article, which helped me in the past. It describes how to normalize data using Hadoop/MapReduce as follows:
Step 1: Extract the column-value pairs from the original data.
Step 2: Extract the column-value pairs not in the master ID file.
Step 3: Calculate the maximum ID for each column in the master file.
Step 4: Calculate a new ID for the unmatched values.
Step 5: Merge the new IDs with the existing master IDs.
Step 6: Replace the values in the original data with IDs.
This can be done using MultipleInputs:
MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Mapper2.class);
job.setReducerClass(Reducer1.class);
// FileOutputFormat.setOutputPath(job, outputPath); // set the output path here
If both mappers emit a common key, the records can be joined in the reducer, where the necessary logic can be applied.
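As a hedged sketch of that reduce-side join, assuming each mapper tags its output values with a made-up source marker ("A" for Mapper1, "B" for Mapper2):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Pairs up the values that arrived from the two inputs for the same key.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if ("A".equals(parts[0])) {
                left.add(parts[1]);   // record emitted by Mapper1
            } else {
                right.add(parts[1]);  // record emitted by Mapper2
            }
        }
        // Emit every matching pair; replace this with whatever join/comparison logic is needed.
        for (String l : left) {
            for (String r : right) {
                context.write(key, new Text(l + "\t" + r));
            }
        }
    }
}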
