I am new to NiFi. Currently I have a use case where I have to merge three CSV files that I extracted from XLSX files. I need to merge them on the basis of the NPI column, which is present in all three CSV files.
files1.xlsx looks like-
NPI FULL_NAME LAST_NAME FIRST_NAME
1003002627 Arora Himanshu Arora Himanshu
1003007204 Arora Vishal Arora Vishal
files2.xlsx looks like-
NPI No Employee Number CHI/SL Acct-Unit
1003002627 147536 5812207
1003007204 185793 5854207
files3.xlsx looks like -
Individual NPI Group NPI Market
1003002627 1396935714 Houston
1003007204 1396935714 Houston
I want to left join my first CSV with specific columns from the other two CSV files, so that my desired output is:
NPI Full Name Employee Number Market
1003002627 Arora Himanshu 147536 Houston
1003007204 Arora Vishal 185793 Houston
I tried the QueryRecord processor, but I don't know how I should combine three different schemas from three different CSV files. Any help on this will be highly appreciated.
This is what I have tried.
NiFi doesn't have the ability to do what amounts to a SQL join on three different record sets like this out of the box, but there is a workaround here if you know Python:
Zip them up.
Ingest the zip file.
Use PutFile to put the zip to a standard location.
Write a Python script that will unzip the zip for you and do the merging.
Call the Python script immediately after PutFile with the ExecuteStreamCommand processor.
The ExecuteStreamCommand processor will read stdout, so just write your combined output to sys.stdout and it will get back into NiFi.
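A minimal sketch of such a script, assuming pandas is available, that the zip contains the three sheets saved as file1.csv, file2.csv and file3.csv with the headers shown in the question, and that PutFile drops the zip at an illustrative /data/staging path (adjust all names and paths to your flow):

import sys
import zipfile
import pandas as pd

# Illustrative location where PutFile placed the zip; adjust to your flow.
with zipfile.ZipFile("/data/staging/input.zip") as zf:
    zf.extractall("/data/staging/unzipped")

df1 = pd.read_csv("/data/staging/unzipped/file1.csv")
df2 = pd.read_csv("/data/staging/unzipped/file2.csv").rename(columns={"NPI No": "NPI"})
df3 = pd.read_csv("/data/staging/unzipped/file3.csv").rename(columns={"Individual NPI": "NPI"})

# Left-join the first file to the other two on NPI, keeping only the requested columns.
merged = (
    df1[["NPI", "FULL_NAME"]]
    .merge(df2[["NPI", "Employee Number"]], on="NPI", how="left")
    .merge(df3[["NPI", "Market"]], on="NPI", how="left")
)

# ExecuteStreamCommand captures stdout, so the combined CSV becomes the outgoing FlowFile content.
merged.to_csv(sys.stdout, index=False)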
Related
I have a requirement where I need to JOIN a tweets table with person names, i.e. filter the tweets if they contain any person name. I have the following data:
Tweets Table: (70 million records stored as a HIVE Table)
id  tweet
1   Cristiano Ronaldo greatest of all time
2   Brad Pitt movies
3   Random tweet without any person name
Person Names: (1.6 million names stored on HDFS as .tsv file)
id  person_name
1   Cristiano Ronaldo
2   Brad Pitt
3   Angelina Jolie
Expected Result:
id  tweet                                   person_name
1   Cristiano Ronaldo greatest of all time  Cristiano Ronaldo
2   Brad Pitt movies                        Brad Pitt
What I've tried so far:
I have converted the person names .tsv file to a Hive table as well and then tried to join the two tables with the following Hive query:
SELECT * FROM tweets t INNER JOIN people p WHERE instr(t.tweet, p.person_name) > 0;
I tried it with some sample data and it works fine. But when I run it on the entire data (70m tweets JOINed with 1.6m person names), it takes forever. It definitely doesn't look very efficient.
I wanted to try the JOIN with Pig as well (as it is considered a little more efficient than a Hive JOIN), where I can directly JOIN the person names .tsv file with the tweets Hive table, but I'm not sure how to JOIN based on a substring in Pig.
Can someone please share the Pig JOIN syntax for this problem, if you have any idea? Also, please suggest any alternatives that I can use.
The idea is to create buckets so that we don't have to compare a lot of records. We are going to increase the number of records/joins so that multiple nodes can do the work, instead of one large cross join filtered by WHERE instr(t.tweet, p.person_name) > 0:
Split the tweets into individual words (yes, multiplying your record count way up).
Filter out 'stopwords', or some other list of words that fits in memory.
Split the names into first name(s) and last name.
Join tweets and names on the last name and instr(t.tweet, p.person_name) > 0. This should significantly reduce the amount of data that you compare via a function, so it will run faster; see the sketch after this list.
If you are going to do this regularly, consider creating the tables with sort/bucket to really make things sizzle (make it faster, as it can hopefully be Sort Merge Join ready).
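This is not the Hive/Pig implementation itself, just a small pandas sketch of the same idea on the sample rows above (column and variable names are illustrative), to show how bucketing on the last name limits the expensive substring check to a few candidate pairs instead of a full cross join; in practice each step maps onto a word split/explode, a stopword filter and a join in HiveQL or Pig:

import pandas as pd

tweets = pd.DataFrame({"id": [1, 2, 3],
                       "tweet": ["Cristiano Ronaldo greatest of all time",
                                 "Brad Pitt movies",
                                 "Random tweet without any person name"]})
people = pd.DataFrame({"id": [1, 2, 3],
                       "person_name": ["Cristiano Ronaldo", "Brad Pitt", "Angelina Jolie"]})
stopwords = {"of", "all", "the", "any", "without"}   # whatever list fits in memory

# 1. Split tweets into individual words and drop stopwords.
words = tweets.assign(word=tweets["tweet"].str.split()).explode("word")
words = words[~words["word"].str.lower().isin(stopwords)]

# 2. Split person names so the last name can act as the join bucket.
people = people.assign(last_name=people["person_name"].str.split().str[-1])

# 3. Join on the small bucket (the last name) first, then apply the substring check
#    only to the candidate pairs instead of the full 70m x 1.6m cross join.
candidates = words.merge(people, left_on="word", right_on="last_name")
matches = candidates[candidates.apply(lambda r: r["person_name"] in r["tweet"], axis=1)]
print(matches[["id_x", "tweet", "person_name"]].drop_duplicates())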
It is worth trying Map-Join.
The Person table is the small one, and the join with it can be converted to a Map-Join operator if the table fits into memory; the table will be loaded into each mapper's memory.
Check the EXPLAIN output. If it says that the Common Join operator is on the Reducer vertex, then try to increase the mapper container memory and adjust the map-join settings to convert it to a Map Join.
These are the settings responsible for Map Join (suppose the People table is < 2.5 GB). Try to bump the map-join table size limit to 2.5 GB (check the actual size) and run EXPLAIN again:
set hive.auto.convert.join=true; --this enables map-join
set hive.auto.convert.join.noconditionaltask = true;
set hive.mapjoin.smalltable.filesize=2500000000; --size of table to fit in memory
set hive.auto.convert.join.noconditionaltask.size=2500000000;
Also container size should be increased to avoid OOM (if you are on Tez):
set hive.tez.container.size=8192; --container size in megabytes
set hive.tez.java.opts=-Xmx6144m; --set this 80% of hive.tez.container.size
The figures are just an example; try to adjust them and check the EXPLAIN again. If it shows the Map-Join operator, then run the query again, it should be much faster.
Suppose I have a table xyz:
id name add city act_flg start_dtm end_dtm
1 amit abc,z pune Y 21012018 null
and this table is loaded from a file using Informatica with SCD2.
Suppose there is one file that contains two records with id=2, i.e.
2 vipul abc,z mumbai
2 vipul asdf bangalore
So how will this be loaded into the DB?
It depends on how you're doing the SCD type 2. If you are using a lookup with a static cache, both records will be added with the end date as null.
The best approach in this scenario is to use a dynamic lookup cache and read your source data in such a way that the latest record is read last. This will ensure one record is expired with an end date and only one active record (i.e. end date is null) exists per id, as sketched below.
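Purely to illustrate that outcome (this is not Informatica, and the load date is a made-up value), here is a minimal sketch that applies the two incoming id=2 rows in file order, expiring the previously active row each time:

from datetime import date

target = []          # rows already loaded into xyz
incoming = [
    (2, "vipul", "abc,z", "mumbai"),
    (2, "vipul", "asdf", "bangalore"),   # latest record read last
]
load_date = date(2018, 1, 21)            # hypothetical load date

for rec_id, name, addr, city in incoming:
    # Expire the currently active row for this id, if any
    # (this is what the dynamic lookup cache gives you).
    for row in target:
        if row["id"] == rec_id and row["end_dtm"] is None:
            row["end_dtm"] = load_date
            row["act_flg"] = "N"
    target.append({"id": rec_id, "name": name, "add": addr, "city": city,
                   "act_flg": "Y", "start_dtm": load_date, "end_dtm": None})

for row in target:
    print(row)
# The mumbai row ends up expired (end_dtm set, act_flg 'N') and only the
# bangalore row remains active (end_dtm is null): one active record per id.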
Hmm, one of two possibilities depending on what you mean... If you mean that you're pulling data from different source systems which sometimes have the same ids, then it's easy: just stamp both the natural key (i.e. the id) and a source-system value on the dimension record, along with the arbitrary surrogate key which is unique to your target table (this is a data warehousing basic, so read Kimball).
If you mean that you are somehow tracking real-time changes to a single record in the source system and writing these changes to the input files of your ETL job, then you need to agree with your client whether they're happy for you to aggregate the changes based on the timestamp of the change and just pick the most recent one, or to create two records, one with its expiry datetime set and the other still open (which is the standard SCD approach... again, read Kimball).
I have written a script that successfully counts the total number of steps taken by pedestrians, and their highest step count. What I can't figure out is how to produce headers in the Pig output, so that the output looks neat and clean. Is there any way to produce headers while writing output? Following is my code:
register 'piggybank-0.15.0.jar';
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
part1 = LOAD '/home/cloudera/Pedestrian_Counts.csv' using CSVLoader(',') as (date_time, sensor_id: int, sensor_name: chararray, hourly_counts: int);
part2 = GROUP part1 BY (sensor_id, sensor_name);
part3 = FOREACH part2 GENERATE FLATTEN(group) AS (sensor_id, sensor_name), SUM(part1.hourly_counts), MAX(part1.hourly_counts);
STORE part3 into '/home/cloudera/pedestrian_result' using PigStorage('\t');
The first 5 lines of my output are as follows:
1 Bourke Street Mall (North) 49591633 5573
2 Bourke Street Mall (South) 67759939 7035
3 Melbourne Central 70973929 5890
4 Town Hall (West) 90274498 8052
5 Princes Bridge 58752043 7391
Can we place headers while writing output? Thanks in advance.
Either merge all the part files' data into a file in the local file system that has the header information in it, or use a Hive table to store the output of this Pig script.
A Hive table used to store the output will have its own schema.
You should be using HCatalog for accessing Hive from Pig. A minimal sketch of the first option is below.
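For the first option, here is a rough sketch, assuming the part-* files have first been copied from HDFS to a local directory (e.g. with hadoop fs -copyToLocal), and using an illustrative header and output path:

import glob

# Header text matching the columns produced by the Pig script above.
header = "sensor_id\tsensor_name\ttotal_counts\tmax_hourly_count\n"

with open("/home/cloudera/pedestrian_result_with_header.tsv", "w") as out:
    out.write(header)
    for part in sorted(glob.glob("/home/cloudera/pedestrian_result/part-*")):
        with open(part) as f:
            out.write(f.read())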
I have a set of CSV files in a folder and I'd like to merge them into one "super-CSV". Some of the columns are available in all files, some are not.
Fields in the output should just be empty if they are not available in the source. If a column name is the same across multiple CSVs, it should fill the existing column (Name in the example).
File1.CSV
ID Name ContactNo
53 Vikas 9874563210
File2.CSV
ID Name Designation
23 MyShore Software Engineer
Output Expected
ID Name ContactNo Designation
53 Vikas 9874563210
23 MyShore Software Engineer
I've already tried other solutions, but they cannot handle empty fields, e.g. "merge csv files with different column order remove duplicates".
Thanks in advance
Michael
In Python, you can use the pandas module, which allows you to fill a DataFrame from a CSV, merge DataFrames, and then save the merged DataFrame into a new CSV file.
For example:
import pandas as pd
# DataFrame.from_csv has been removed from recent pandas versions;
# read_csv with index_col=0 is the equivalent call.
df1 = pd.read_csv("file1.csv", sep=",", index_col=0)
df2 = pd.read_csv("file2.csv", sep=",", index_col=0)
final_df = df1.reset_index().merge(df2.reset_index(), how="outer").set_index('ID')
final_df.to_csv("result.csv", sep=",")
which would produce
ID,Name,ContactNo,Designation
53,Vikas,9874563210.0,
23,MyShore,,Software Engineer
You would have to play with the sep argument to adapt it to your files' format.
I would like to bulk load a bunch of data into an Oracle database. I have written a program to easily format my data however I want. I see many examples of loading a CSV file into Oracle, but they all require a control file for each table, linking it to one file.
It would be simple for me to create a script to generate all of the control files; however, I would first like to know whether it is possible to have all the data in one file, with table names designated in the data file.
For example:
onefile.csv:
------------
details
1, John, john#gmail.com
2, Steve, steve#gmail.com
3, Sally, sally#gmail.com
account
1, John, johntheman, johnh43
2, Steve, password, steve.12
3, Sally, letmein, slllya2
Disclaimer: This is a completely fictional database design and is not at all reflective of how I might store user data in the real world.
You could use UTL_FILE to read a CSV. This would give you complete control over how you process its contents. But it does mean you would waste a lot of time and effort hand-rolling a meagre and slow implementation of SQL*Loader. Why would you want to do that?
One file per data source and/or target is the accepted convention for CSV generation. It is a necessary part of the contract. If we want to do anything else, then we need to use a more suitable protocol such as XML or JSON, something which can support programmatic interrogation.
Why don't you use SQL*Loader (sqlldr)? See the SQL*Loader docs, in particular "Distinguishing Different Input Record Formats".
Benefits of Using Multiple INTO TABLE Clauses
Multiple INTO TABLE clauses enable you to:
Load data into different tables
Extract multiple logical records from a single input record
Distinguish different input record formats
Distinguish different input row object subtypes
But you would have to specify the table name (or its ID) on every line.
sqlldr works at the line level; I'm afraid it does not support stanzas.
So you might need to change it into:
onefile.csv:
------------
details,1, John, john#gmail.com
details,2, Steve, steve#gmail.com
details,3, Sally, sally#gmail.com
account,1, John, johntheman, johnh43
account,2, Steve, password, steve.12
account,3, Sally, letmein, slllya2
Which looks reminiscent of old COBOL formats.
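With that layout, a single control file with one INTO TABLE ... WHEN clause per table-name value can route each line to the right table, per the doc section quoted above. Alternatively, since you mentioned that scripting the control files would be simple, here is a minimal sketch (file and table names are just the ones from the example, and it assumes the file contains only the table-name lines and data rows shown) that splits the original stanza-style file into one CSV per table, each of which can then be loaded with its own conventional control file:

import csv
from collections import defaultdict

tables = defaultdict(list)   # table name -> list of data rows
current = None

with open("onefile.csv", newline="") as src:
    for row in csv.reader(src):
        if len(row) == 1 and row[0].strip():       # a lone value marks the start of a table's stanza
            current = row[0].strip()
        elif current is not None and row:
            tables[current].append([field.strip() for field in row])

# Write one CSV per table, e.g. details.csv and account.csv,
# each ready for its own simple SQL*Loader control file.
for name, rows in tables.items():
    with open(f"{name}.csv", "w", newline="") as out:
        csv.writer(out).writerows(rows)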