Apache NiFi/Hive - store merged tweets in HDFS, create table in hive - hadoop

I want to create the following workflow:
1. Fetch tweets using the GetTwitter processor.
2. Merge the tweets into a bigger file using the MergeContent processor.
3. Store the merged files in HDFS.
On the Hadoop/Hive side I want to create an external table based on these tweets.
There are examples of how to do this, but what I am missing is how to configure the MergeContent processor: what to set as the header, footer, and demarcator. And what to use as the separator on the Hive side so that it will split the merged tweets into rows.
I hope I have described this clearly.
Thanks in advance.

The MergeContent processor in binary mode does the job fine; there is no need for a header, footer, or demarcator.
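If you do want one tweet per line so that Hive's default text input format splits the merged file into rows, a newline demarcator is the usual choice. Here is a hedged sketch in Python of what such a merge produces; the tweet JSON contents are made up, and `merge_content` is only a hypothetical stand-in for what the processor concatenates:

```python
# Sketch of what MergeContent produces with a newline demarcator
# (assumption: one tweet JSON object per incoming FlowFile, as
# GetTwitter emits them).
tweets = [
    '{"id": 1, "text": "first tweet"}',
    '{"id": 2, "text": "second tweet"}',
]

def merge_content(flowfiles, header="", footer="", demarcator="\n"):
    """Mimic the merge: header + contents joined by the demarcator + footer."""
    return header + demarcator.join(flowfiles) + footer

merged = merge_content(tweets)

# Hive's default TEXTFILE format splits on '\n', so each tweet becomes
# one row of the external table.
rows = merged.split("\n")
print(rows)  # one JSON document per row
```

Splitting the merged output on the demarcator recovers exactly the original tweets, which is what makes a newline-delimited layout convenient for a line-oriented external table.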

Related

How do I partition data from a txt/csv file by year and month using Flume? Is it possible to make the HDFS path dynamic?

I want to configure a flume flow so that it takes in a CSV file as a source, checks the data, and dynamically separates each row of data into folders by year/month in HDFS. Is this possible?
I might suggest you look at using Nifi instead. I feel like it's the natural replacement for Flume.
Having said that, you might want to consider using the spooling directory source and a Hive sink (instead of HDFS). Hive partitions (on year/month) would enable you to land the data in the manner you are suggesting.
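The core of that partitioning idea is deriving a year/month directory from each row's date field. A minimal sketch, assuming a CSV whose first column is an ISO date (the column position and `/mydata` base path are assumptions, not anything from the question):

```python
import csv
import io
from datetime import datetime

# Hypothetical CSV where the first column is an ISO date; adjust to your schema.
data = "2013-07-26,alice,42\n2013-08-02,bob,7\n"

def partition_path(row, base="/mydata"):
    """Derive a year/month partition directory (as a Hive sink would) from the date column."""
    d = datetime.strptime(row[0], "%Y-%m-%d")
    return f"{base}/year={d.year}/month={d.month:02d}"

paths = [partition_path(r) for r in csv.reader(io.StringIO(data))]
print(paths)
```

Each row lands in a `year=/month=` directory, which is exactly the layout a partitioned Hive table expects.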

To Replace Name with Another Name in a file

I am very new to Hadoop, and I have a requirement to scrub a file that contains account number, name, and address details: I need to replace the name and address details with other names and addresses that exist in another file.
I am good with either MapReduce or Hive.
Need help on this.
Thank you.
You can write a simple mapper-only job (with the number of reducers set to zero), update the information, and store the output in some other location. Verify the output of your job, and if it is as you expected, remove the old files. Remember, HDFS does not support in-place editing or overwriting of files.
Hadoop - MapReduce Tutorial.
You can also use Hive to accomplish this task:
1. Write a Hive UDF based on your scrubbing logic.
2. Use the above UDF on each column of the Hive table you want to scrub, and store the data in a new Hive table.
3. Remove the old Hive table.
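The mapper-only approach boils down to loading the replacement file into memory and swapping values by key. A sketch of that logic in plain Python, assuming (this is an assumption, not stated in the question) that both files are CSV keyed by account number:

```python
# Map-only scrubbing sketch. Assumed format for both files:
# account_no,name,address; the lookup file holds the replacement values.
lookup_lines = ["1001,Jane Roe,12 Oak St", "1002,John Doe,9 Elm St"]
input_lines  = ["1001,Real Name,Real Addr", "1002,Other Name,Other Addr"]

# Load the replacement file into memory, as a mapper's setup() would
# (e.g. from the distributed cache).
replacements = {}
for line in lookup_lines:
    acct, name, addr = line.split(",")
    replacements[acct] = (name, addr)

def map_record(line):
    """The map() step: swap in the replacement name/address by account number."""
    acct, _name, _addr = line.split(",")
    name, addr = replacements[acct]
    return f"{acct},{name},{addr}"

scrubbed = [map_record(l) for l in input_lines]
print(scrubbed)
```

In a real MapReduce job the lookup file would typically be shipped via the distributed cache and read in `setup()`, with `map()` emitting the scrubbed record.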

How to skip file headers in impala external table?

I have a 78 GB file on HDFS.
I need to create an Impala external table over it to perform some grouping and aggregation on the available data.
Problem
The file contains headers.
Question
Is there any way to skip the headers while reading the file and query the rest of the data?
I do have a way to solve the problem by copying the file to local, removing the headers, and then copying the updated file back to HDFS, but that is not feasible as the file size is too large.
Any suggestions will be appreciated. Thanks in advance.
UPDATE or DELETE row operations are not available in Hive/Impala, so you should simulate DELETE as follows:
1. Load the data file into a temporary Hive/Impala table.
2. Use INSERT INTO or CREATE TABLE AS on the temporary table to create the required table.
A straightforward approach would be to run the HDFS data through Pig to filter out the headers and generate a new HDFS dataset formatted so that Impala could read it cleanly.
A more arcane approach would depend on the format of the HDFS data. For example, if both header and data lines are tab-delimited, then you could read everything using a schema with all STRING fields and then filter or partition out the headers before doing aggregations.
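Either way, the filter step reduces to dropping the lines that match the header text. A sketch of that logic in Python, where the header string and tab delimiter are assumptions about the file's format:

```python
# Sketch of the filter step: drop header lines so Impala only sees data rows.
# Assumption: the header line is the literal tab-separated column names, and
# it may repeat if many files were concatenated into one HDFS dataset.
lines = [
    "id\tname\tamount",   # header
    "1\talice\t10",
    "id\tname\tamount",   # repeated header from a concatenated file
    "2\tbob\t20",
]

HEADER = "id\tname\tamount"
data_rows = [l for l in lines if l != HEADER]
print(data_rows)
```

The same one-line predicate is what a Pig FILTER statement or a WHERE clause over an all-STRING staging table would express.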

Is there a way to access avro data stored in hbase using hive to do analysis

My HBase table has rows that contain both serialized Avro (put there using havrobase) and string data. I know that a Hive table can be mapped to Avro data stored in HDFS for data analysis, but I was wondering if anyone has tried to map Hive to HBase tables that contain Avro data. Basically, I need to be able to query both Avro and non-Avro data stored in HBase, do some analysis, and store the result in a different HBase table. I need the capability to run this as a batch job as well. I don't want to write a Java MapReduce job to do this because we have constantly changing configurations and we need to use a scripted approach. Any suggestions? Thanks in advance!
You can write an HBase coprocessor to expose the Avro record as regular HBase qualifiers. You can see an implementation of that in Intel's panthera-dot.

How to add data in same file in Apache PIG?

I am new to Pig.
I have a use case in which I have to store data in the same file again and again, after every regular interval. As I went through some tutorials and links, I didn't see anything related to this.
How should I store the data in the same file?
It's impossible. Pig uses Hadoop, and right now there is no "recommended" solution for appending to files.
The other point is that Pig would produce one file only if one mapper or one reducer has been used at the end of the whole data flow.
You can:
1. Give more info about the problem you are trying to solve.
2. Bad solution:
2.1. Process the data in your Pig script.
2.2. Load the data from the existing file.
2.3. Union the relations, where the first relation keeps the new data and the second keeps the data from the existing file.
2.4. Store the union result to a new output.
2.5. Replace the old file with the new one.
3. Good solution:
3.1. Create a folder /mydata.
3.2. Create partitions inside the folder; they can be /yyyy/MM/dd/HH if you process data each hour.
3.3. Use globs to read the data:
/mydata/*/*/*/*/*
All files from the hour partitions will be read by Pig/Hive/MR or whatever Hadoop tool you use.
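To see how the glob covers every hour partition, here is a small sketch that builds a miniature yyyy/MM/dd/HH layout in a temporary directory and matches it with a five-level wildcard (the directory names and file name are made-up examples):

```python
import glob
import os
import tempfile

# Build a miniature /mydata/yyyy/MM/dd/HH layout and show that a 5-level
# glob picks up every file in every hour partition.
base = tempfile.mkdtemp()
for hour_dir in ["2013/07/26/00", "2013/07/26/01"]:
    path = os.path.join(base, hour_dir)
    os.makedirs(path)
    # An empty stand-in for a part file written by a job.
    open(os.path.join(path, "part-00000"), "w").close()

files = sorted(glob.glob(os.path.join(base, "*", "*", "*", "*", "*")))
print(len(files))  # one part file per hour partition
```

The same `/*/*/*/*/*` pattern is what you would hand to Pig's LOAD or to `hadoop fs -ls` to sweep up all the hourly files at once.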
Make a date folder like /abc/hadoop/20130726/, within which you generate output based on a timestamp, like /abc/hadoop/20130726/201307265465.gz.
Then use the getmerge command to merge all the data into a single file:
Usage: hadoop fs -getmerge <src> <localdst> [addnl]
Hope this helps.
