How to load multiple JSON files into multiple Hive tables with the correct mapping using Apache NiFi?

I need to ingest multiple CSV files, based on table names, into their respective Hive tables using Apache NiFi.
The data for table_address present in the source JSON file should go to table_address in Hive, and similarly for the other tables.
In short, records from the source JSON file need to be segregated into multiple CSV files in tablename.csv format and loaded into their respective Hive tables.
Processors I am using:
ConsumeKafka ---> SplitJson ---> EvaluateJsonPath ---> UpdateAttribute ---> ReplaceText ---> PutFile
Records from the source JSON file, consumed from Kafka (GoldenGate trail data), need to be segregated into multiple CSV files in tablename.csv format and loaded into their respective Hive tables using an Apache NiFi flow.

You can use the PartitionRecord processor in NiFi.
Configure the Record Reader (JSON) and Record Writer (CSV) controller services.
The output flowfiles will be in CSV format, and based on the partition field value you can store the data into the corresponding Hive tables dynamically.
Flow:
ConsumeKafka -->
PartitionRecord (specify the partition field) -->
PutFile or PutHiveStreaming or PutHDFS (based on the value of the partition field)
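A minimal sketch of that configuration, assuming the incoming GoldenGate JSON carries the target table name in a field called table_name; the field name, directory path and the optional filename step are placeholders to adapt to your own schema:

    PartitionRecord
      Record Reader   : JsonTreeReader
      Record Writer   : CSVRecordSetWriter
      table_name      : /table_name          (user-defined property; the RecordPath value becomes a flowfile attribute)

    UpdateAttribute (optional, to get the tablename.csv naming)
      filename        : ${table_name}.csv

    PutHDFS (or PutFile)
      Directory       : /data/landing/${table_name}

Each partition comes out as its own CSV flowfile, so a Hive table pointed at the per-table directory only sees that table's records.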

Related

SpoolDirectory to HBase using Flume

I am able to transfer my data from the spool directory to HBase, but my data is in JSON format and I want the fields to be in separate columns. I am using a Kafka channel.
Please find attached a photo of my HBase column.
If you notice in the picture, status, category and sub_category should be different columns.
I created the HBase table like this: create 'table_name','details'. So they are all in the same column family, but how do I segregate the JSON data into different columns? Any thoughts?

Read data from multiple tables at a time and combine the data based on a where clause using NiFi

I have a scenario where I need to extract data from multiple database tables, including the schema, combine the data, and then write it to an Excel file.
In NiFi the general strategy is to read from something like a fact table with ExecuteSQL or another SQL processor, then use LookupRecord to enrich the data with a lookup table. The catch in NiFi is that LookupRecord can only handle one lookup table at a time, so you would need one LookupRecord for each enrichment table. You could then write to a CSV file that you could open in Excel. There might be extensions elsewhere that can write directly to Excel, but I am not aware of any in the standard NiFi distribution.
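A minimal sketch of one such enrichment step, assuming a hypothetical CUSTOMERS lookup table keyed on customer_id; the processor and controller-service names are standard NiFi, everything schema-specific is a placeholder:

    LookupRecord
      Record Reader     : JsonTreeReader (or CSVReader, matching your source)
      Record Writer     : CSVRecordSetWriter            (so the result opens in Excel)
      Lookup Service    : DatabaseRecordLookupService
      Result RecordPath : /customer                     (where the looked-up record is inserted)
      key               : /customer_id                  (user-defined property; the join column in the fact data)

    DatabaseRecordLookupService
      Database Connection Pooling Service : DBCPConnectionPool
      Table Name                          : CUSTOMERS
      Lookup Key Column                   : CUSTOMER_ID

One LookupRecord plus lookup service pair per enrichment table, chained one after another.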

I want to ingest data using NiFi in two directions, one to HDFS and one to an Oracle database. Is it possible?

We are using NiFi to ingest data into HDFS. Can the same data be ingested into Oracle or any other database at the same time using NiFi?
I need to publish the same data to two places (HDFS and an Oracle database) and do not want to write two subscriber programs.
NiFi has processors to get data from an RDBMS (e.g., Oracle), such as QueryDatabaseTable and ExecuteSQL, and also from HDFS (ListHDFS, FetchHDFS, etc.). It also has processors to put data into an RDBMS (PutDatabaseRecord, PutSQL, etc.) or into HDFS (e.g., PutHDFS). So you can get your data from multiple sources and send it to multiple targets with NiFi.
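A minimal flow sketch for that fan-out, assuming the data arrives from Kafka: in NiFi the same success relationship can be drawn to more than one downstream processor, so a single flow feeds both targets. The directory and table names below are placeholders:

    ConsumeKafka
      ├─ success ──> PutHDFS
      │                Directory : /data/landing
      └─ success ──> PutDatabaseRecord
                       Record Reader                       : JsonTreeReader
                       Database Connection Pooling Service : DBCPConnectionPool (Oracle JDBC driver)
                       Statement Type                      : INSERT
                       Table Name                          : TARGET_TABLE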

How to keep a Hive table in the NiFi DistributedMapCache

I want to keep my Hive/MySQL table in the NiFi DistributedMapCache. Can someone please help me with an example?
Or please correct me if there is no way to cache a Hive table in the NiFi cache.
Thanks
You can use the SelectHiveQL processor to pull data from the Hive table, with the output format as CSV and the header option set to false.
Then use a SplitText processor to split each line into an individual flowfile.
Note:
if your flowfile is large, you will need to use a series of SplitText processors to split the flowfile down to individual lines.
Use an ExtractText processor to extract the key attribute from the flowfile content.
PutDistributedMapCache processor:
Configure and enable the DistributedMapCacheClientService and DistributedMapCacheServer controller services.
Set the Cache Entry Identifier property to the attribute you extracted with ExtractText.
You need to adjust the Max cache entry size depending on the flowfile size.
To fetch the cached data, use the FetchDistributedMapCache processor with exactly the same identifier value that was cached by PutDistributedMapCache.
In the same way, if you want to load data from external sources that arrive in Avro format, use a ConvertRecord processor to convert Avro to CSV and then load the data into the distributed cache.
However, this is not a best practice for huge datasets; rather than loading everything into the DistributedMapCache, you can use the LookupRecord processor instead.
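A minimal property sketch of the cache steps, assuming ExtractText has captured the key into an attribute named cache.key; the regex, attribute names and size are placeholders:

    ExtractText
      cache.key                     : ^([^,]+)         (user-defined property; captures the first CSV column as the key)

    PutDistributedMapCache
      Cache Entry Identifier        : ${cache.key}
      Distributed Cache Service     : DistributedMapCacheClientService
      Max cache entry size          : 1 MB             (raise this if a single line is larger)

    FetchDistributedMapCache
      Cache Entry Identifier        : ${cache.key}
      Distributed Cache Service     : DistributedMapCacheClientService
      Put Cache Value In Attribute  : cached.value     (optional; otherwise the flowfile content is replaced)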

Apache NiFi/Hive - store merged tweets in HDFS, create table in Hive

I want to create the following workflow:
1. Fetch tweets using the GetTwitter processor.
2. Merge tweets into a bigger file using the MergeContent processor.
3. Store the merged files in HDFS.
4. On the Hadoop/Hive side, create an external table based on these tweets.
There are examples of how to do this, but what I am missing is how to configure the MergeContent processor: what to set as the header, footer and demarcator.
And what to use as the separator on the Hive side so that it will split the merged tweets into rows.
Hope I described myself clearly.
Thanks in advance.
The MergeContent processor in binary mode does the job fine. There is no need for a header, footer or demarcator.
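A minimal MergeContent sketch matching that answer; the bin-sizing numbers are placeholders to tune for your tweet volume:

    MergeContent
      Merge Strategy               : Bin-Packing Algorithm
      Merge Format                 : Binary Concatenation
      Minimum Number of Entries    : 1000         (placeholder)
      Maximum Number of Entries    : 10000        (placeholder)
      Header / Footer / Demarcator : left empty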
