Flume to stream gz files - hadoop

I have a folder that contains a lot of gzip files. Each gzip file contains an XML file. I used Flume to stream the files into HDFS. Below is my configuration file:
agent1.sources = src
agent1.channels = ch
agent1.sinks = sink
agent1.sources.src.type = spooldir
agent1.sources.src.spoolDir = /home/tester/datafiles
agent1.sources.src.channels = ch
agent1.sources.src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent1.channels.ch.type = memory
agent1.channels.ch.capacity = 1000
agent1.channels.ch.transactionCapacity = 1000
agent1.sinks.sink.type = hdfs
agent1.sinks.sink.channel = ch
agent1.sinks.sink.hdfs.path = /user/tester/datafiles
agent1.sinks.sink.hdfs.fileType = CompressedStream
agent1.sinks.sink.hdfs.codeC = gzip
agent1.sinks.sink.hdfs.fileSuffix = .gz
agent1.sinks.sink.hdfs.rollInterval = 0
agent1.sinks.sink.hdfs.rollSize = 122000000
agent1.sinks.sink.hdfs.rollCount = 0
agent1.sinks.sink.hdfs.idleTimeout = 1
agent1.sinks.sink.hdfs.batchSize = 1000
After streaming the files into HDFS, I use Spark to read them with the following code:
df = sparkSession.read.format('com.databricks.spark.xml').options(rowTag='Panel', compression='gzip').load('/user/tester/datafiles')
But Spark fails to read them. If I manually upload one gzip file into the HDFS folder and re-run the above Spark code, it reads the file without any issue. I am not sure whether the problem is caused by Flume.
I tried downloading a file streamed by Flume and unzipping it. When I viewed the contents, they were no longer XML, just unreadable characters. Could anyone shed some light on this? Thanks.

I think the approach itself is the problem. Why?
Your source files are gzip archives, which are not splittable: they cannot be read partially, record by record. Because the source does not decompress them, each event that Flume reads carries the raw gzip bytes of a file.
The sink then writes those already-compressed bytes into another gzip stream, because you selected CompressedStream as the file type.
So what ends up in HDFS is a gzip stream wrapped inside another gzip file. :)
I suggest scheduling a script in cron that copies the files from the local directory to HDFS; that will solve your problem.
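The double compression described above can be reproduced without Flume. This is only a minimal sketch using Python's standard gzip module: the XML snippet and the variable names are made up for illustration, and the two compress calls merely stand in for "file already gzipped on disk" and "sink compressing again".

```python
import gzip

# Hypothetical stand-in for one source file: XML that is already gzip-compressed.
xml = b"<Panel><id>1</id></Panel>"
source_file = gzip.compress(xml)

# A source that does not decompress passes the compressed bytes through unchanged.
event_body = source_file

# A gzip-compressing sink then compresses the event body a second time.
hdfs_file = gzip.compress(event_body)

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

# One round of decompression yields another gzip stream, not XML.
once = gzip.decompress(hdfs_file)
print(once[:2] == GZIP_MAGIC)

# Only a second round of decompression recovers the original XML.
print(gzip.decompress(once) == xml)
```

A reader that decompresses only once, as Spark does for a .gz file, therefore sees compressed bytes where it expects XML.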

Related

How to extract all the collected tweets in a single file

I'm using Flume to collect tweets and store them on HDFS.
The collecting part is working fine, and I can find all my tweets in my file system.
Now I would like to extract all these tweets in one single file.
The problem is that the different tweets are stored as follows:
As we can see, the tweets are stored inside blocks of 128 MB but only use a few KB, which is normal behaviour for HDFS (correct me if I'm wrong).
However, how can I get all the different tweets into one file?
Here is my conf file, which I run with the following command:
flume-ng agent -n TwitterAgent -f ./my-flume-files/twitter-stream-tvseries.conf
twitter-stream-tvseries.conf :
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey=hidden
TwitterAgent.sources.Twitter.consumerSecret=hidden
TwitterAgent.sources.Twitter.accessToken=hidden
TwitterAgent.sources.Twitter.accessTokenSecret=hidden
TwitterAgent.sources.Twitter.keywords=GoT, GameofThrones
TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://ip-addressl:8020/user/root/data/twitter/tvseries/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeformat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=1000
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600
TwitterAgent.channels.MemChannel.type=memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=1000
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
You can configure the HDFS sink to roll to a new file by time, by event count, or by size. So, if you want to keep writing messages to the same file until a 120 MB limit is reached, set
# 0 disables rolling based on time
hdfs.rollInterval = 0
# roll to a new file once the current one reaches 120 MB (125829120 bytes)
hdfs.rollSize = 125829120
# 0 disables rolling based on event count (tweets, in your case)
hdfs.rollCount = 0
You can use the following command on a local copy of the files to concatenate them into a single file:
find . -type f -name 'FlumeData*' -exec cat {} + >> output.file
Or, if you want to store the data in Hive tables for later analysis, create an external table over the directory and query it from Hive.
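The same concatenation could be scripted. The sketch below assumes the Flume output has already been copied to a local directory; the function name concat_flume_files and its parameters are hypothetical. Unlike the >> redirect in the find command, it overwrites the destination instead of appending to it.

```python
import shutil
from pathlib import Path


def concat_flume_files(src_dir, dest_file, prefix="FlumeData"):
    """Write every file under src_dir whose name starts with prefix into dest_file.

    Files are concatenated in sorted path order. Returns the number of files merged.
    """
    dest = Path(dest_file)
    parts = sorted(
        p for p in Path(src_dir).rglob(prefix + "*")
        # Skip directories, and skip the destination itself if it matches the prefix.
        if p.is_file() and p.resolve() != dest.resolve()
    )
    with dest.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, so large files are fine
    return len(parts)
```

For example, concat_flume_files(".", "output.file") would merge every FlumeData* file below the current directory.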

How can I get the raw content of a gzip-compressed file stored on HDFS?

Is there any way to read the raw content of a file stored on Hadoop HDFS, byte by byte?
Typically I submit a streaming job with an -input param that points to a .gz file (like -input hdfs://host:port/path/to/gzipped/file.gz).
My task then receives the decompressed input line by line, which is NOT what I want.
You can initialize the FileSystem with respective Hadoop configuration:
FileSystem.get(conf);
It has a method open which should in principle allow you to read raw data.
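To illustrate what "raw" means here, the sketch below reads a gzip file in binary mode and leaves decompression as a separate, explicit step. It is not the Hadoop API: a local temporary file stands in for the file on HDFS, and read_raw is a hypothetical helper name.

```python
import gzip
import os
import tempfile


def read_raw(path, chunk_size=4096):
    """Yield the file's bytes exactly as stored, without decompressing them."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Build a small gzip file to read back (local stand-in for the file on HDFS).
path = os.path.join(tempfile.mkdtemp(), "file.gz")
with gzip.open(path, "wb") as f:
    f.write(b"line one\nline two\n")

raw = b"".join(read_raw(path))

# The raw bytes begin with the gzip magic number rather than the text content.
print(raw[:2] == b"\x1f\x8b")

# Decompression happens only when it is requested explicitly.
print(gzip.decompress(raw))
```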

Flume-ng: source path and type for copying log file from local to HDFS

I am trying to copy some log files from the local filesystem to HDFS using flume-ng. The source is /home/cloudera/flume/weblogs/ and the sink is hdfs://localhost:8020/flume/dump/. A cron job will copy the logs from the Tomcat server to /home/cloudera/flume/weblogs/, and I want flume-ng to copy the log files to HDFS as they become available in that directory. Below is the conf file I created:
agent1.sources= local
agent1.channels= MemChannel
agent1.sinks=HDFS
agent1.sources.local.type = ???
agent1.sources.local.channels=MemChannel
agent1.sinks.HDFS.channel=MemChannel
agent1.sinks.HDFS.type=hdfs
agent1.sinks.HDFS.hdfs.path=hdfs://localhost:8020/flume/dump/
agent1.sinks.HDFS.hdfs.fileType=DataStream
agent1.sinks.HDFS.hdfs.writeformat=Text
agent1.sinks.HDFS.hdfs.batchSize=1000
agent1.sinks.HDFS.hdfs.rollSize=0
agent1.sinks.HDFS.hdfs.rollCount=10000
agent1.sinks.HDFS.hdfs.rollInterval=600
agent1.channels.MemChannel.type=memory
agent1.channels.MemChannel.capacity=10000
agent1.channels.MemChannel.transactionCapacity=100
I am not able to understand:
1) what will be the value of agent1.sources.local.type = ???
2) where to mention the source path /home/cloudera/flume/weblogs/ in the above conf file ?
3) Is there anything I am missing in the above conf file?
Please let me know on these.
You can use either:
An Exec Source that runs a command (e.g. cat or tail on GNU/Linux) on your files,
or a Spooling Directory Source (type = spooldir), which reads all files placed in the directory named by its spoolDir property.

Hadoop MapReduce: with fixed number of input files?

This is for a MapReduce job.
My input directory has around 1000 files, and each file contains several GB of data.
For example, /MyFolder/MyResults/in_data/20140710/ contains 1000 files.
When I give the input path as /MyFolder/MyResults/in_data/20140710, it takes all 1000 files to process.
I would like to run the job taking only 200 files at a time. How can we do this?
Here is my command to execute it:
hadoop jar wholefile.jar com.form1.WholeFileInputDriver -libjars myref.jar -D mapred.reduce.tasks=15 /MyFolder/MyResults/in_data/20140710/ <<Output>>
Can anyone help me run the job with something like a batch size for the input files?
Thanks in advance
-Vim
A simple way would be to modify your driver to take only 200 files as input out of all the files in that directory. Something like this:
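// Needs Configuration, FileSystem, FileStatus and Path from Hadoop,
// plus the FileInputFormat of the MapReduce API your driver uses.
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] files = fs.globStatus(new Path("/MyFolder/MyResults/in_data/20140710/*"));
// Add at most 200 files; Math.min avoids an out-of-bounds index when fewer exist.
for (int i = 0; i < Math.min(200, files.length); i++) {
    FileInputFormat.addInputPath(job, files[i].getPath());
}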
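The loop above picks only the first 200 files. To work through all 1000 files in batches of 200, one option is to split the directory listing into chunks and launch one job per chunk. The sketch below shows only the chunking; the batches helper and the part-NNNN file names are made up for illustration.

```python
def batches(paths, size=200):
    """Yield consecutive lists of at most `size` paths, in sorted order."""
    paths = sorted(paths)
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


# Hypothetical listing of the 1000 input files.
files = ["part-%04d" % i for i in range(1000)]

for n, batch in enumerate(batches(files)):
    # Each batch would become the input path list of one job run.
    print(n, len(batch), batch[0], batch[-1])
```

The last batch is simply shorter when the file count is not a multiple of the batch size.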

How to load data from local machine to hdfs using flume

I am new to Flume, so please tell me how to store log files from my local machine in my local HDFS using Flume.
I have issues setting the classpath and the flume.conf file.
Thank you,
ajay
agent.sources = weblog
agent.channels = memoryChannel
agent.sinks = mycluster
## Sources #########################################################
agent.sources.weblog.type = exec
agent.sources.weblog.command = tail -F REPLACE-WITH-PATH2-your.log-FILE
agent.sources.weblog.batchSize = 1
agent.sources.weblog.channels = REPLACE-WITH-CHANNEL-NAME
## Channels ########################################################
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100
agent.channels.memoryChannel.transactionCapacity = 100
## Sinks ###########################################################
agent.sinks.mycluster.type =REPLACE-WITH-CLUSTER-TYPE
agent.sinks.mycluster.hdfs.path=/user/root/flumedata
agent.sinks.mycluster.channel =REPLACE-WITH-CHANNEL-NAME
Save this file as logagent.conf and run it with the command below:
# flume-ng agent -n agent -f logagent.conf &
We need more information to know why things aren't working for you.
The short answer is that you need a Source to read your data from (maybe the spooling directory source), a Channel (memory channel if you don't need reliable storage) and the HDFS sink.
Update
The OP reports receiving the error message, "you must include conf file in flume class path".
You need to provide the conf file as an argument. You do so with the --conf-file parameter. For example, the command line I use in development is:
bin/flume-ng agent --conf-file /etc/flume-ng/conf/flume.conf --name castellan-indexer --conf /etc/flume-ng/conf
The error message reads that way because the bin/flume-ng script adds the contents of the --conf-file argument to the classpath before running Flume.
If data is being appended to your local file, you can use an exec source with the "tail -F" command. If the file is static, use the cat command to transfer the data to Hadoop.
The overall architecture would be:
Source: Exec source reading data from your file
Channel: Either memory channel or file channel
Sink: HDFS sink where the data is dumped
Use the user guide to create your conf file (https://flume.apache.org/FlumeUserGuide.html)
Once you have your conf file ready, you can run it like this:
bin/flume-ng agent -n $agent_name -c conf -f conf/your-flume-conf.conf

Resources