Hadoop read from standard input stream - hadoop

I want my MapReduce program to read from the standard input stream (System.in).
For example, in the run() method, how can I make my program read from System.in instead of from a file, as in FileInputFormat.addInputPath(job, new Path("dummy.txt"));?
Also, which class should I pass to job.setInputFormat(...)?

Use Hadoop Streaming for this:
http://wiki.apache.org/hadoop/HadoopStreaming
With Streaming, the mapper and reducer are executables that read their input from stdin and write their output to stdout.
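As a minimal sketch of what a Streaming mapper could look like, assuming Python 3 is available on the task nodes (the word-count logic and the name map_lines are illustrative, not taken from the question):

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming mapper sketch: word count over lines from stdin."""
import sys


def map_lines(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def main():
    # Hadoop Streaming feeds the records of each input split to this
    # process on stdin and collects whatever it writes to stdout.
    for record in map_lines(sys.stdin):
        print(record)


if __name__ == "__main__":
    main()
```

Locally this can be tried with cat dummy.txt | python3 mapper.py. On a cluster the script would be handed to the streaming jar through its -mapper option, together with -input and -output paths, so the job input itself still comes from files in HDFS.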

I have not seen such an InputFormat in Hadoop. You will probably have to write the contents of System.in to a file from time to time and run the Hadoop job over the saved content each time new data arrives.
This situation is common when using Hadoop to process log files that are generated continuously. In that case it's wise to collect the log file(s) on a daily or weekly basis and run the Hadoop job over them once you have them.

Related

Write time series data into hdfs partitioned by month and day?

I'm writing a program that saves time series data from Kafka into Hadoop, and I designed the directory structure like this:
event_data
|-2016
|  |-01
|     |-data01
|     |-data02
|     |-data03
|-2017
   |-01
      |-data01
Because this is a daemon task, I wrote an LRU-based manager to track the open files and close inactive ones promptly to avoid resource leaks. However, the incoming data stream is not sorted by time, so it's very common to reopen an existing file to append new data.
I tried using the FileSystem#append() method to open an OutputStream when the file already exists, but it failed on my HDFS cluster (sorry, I can't give the specific error here because that was several months ago and I have since tried another solution).
I then took another approach:
adding a sequence suffix to the file name when a file with the same name already exists. Now I have a lot of files in my HDFS, and it looks very messy.
My question is: what's the best practice in this situation?
Sorry that this is not a direct answer to your programming problem, but if you're open to options other than implementing it yourself, I'd like to share our experience with fluentd and its HDFS (WebHDFS) Output Plugin.
Fluentd is an open source, pluggable data collector with which you can build a data pipeline easily: it reads data from inputs, processes it, and then writes it to the specified outputs. In your scenario the input is Kafka and the output is HDFS. What you need to do is:
Configure the fluentd input following the fluentd kafka plugin; you'll configure the source section with your Kafka/topic info.
Enable WebHDFS and the append operation on your HDFS cluster; the HDFS (WebHDFS) Output Plugin docs explain how.
Configure your match section to write your data to HDFS; there's an example on the plugin docs page. To partition your data by month and day, you can configure the path parameter with time slice placeholders, something like:
path "/event_data/%Y/%m/data%d"
With your data collected this way, you can then write your MapReduce job to do ETL or whatever you like.
I don't know whether this suits your problem; I'm just offering one more option.
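To see how such a path template spreads unsorted events across files, here is a rough Python sketch. It assumes the placeholders behave like strftime directives and that each event carries an epoch timestamp in UTC; partition_path and group_by_partition are hypothetical helpers for illustration, not part of fluentd.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Same template as the path example above; the placeholders are
# assumed to follow strftime semantics.
PATH_TEMPLATE = "/event_data/%Y/%m/data%d"


def partition_path(epoch_seconds, template=PATH_TEMPLATE):
    """Return the target file path for an event timestamp (UTC)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime(template)


def group_by_partition(events):
    """Group (epoch_seconds, payload) pairs by their target path.

    Arrival order does not matter: late events simply land in the
    bucket of the day they belong to.
    """
    buckets = defaultdict(list)
    for epoch_seconds, payload in events:
        buckets[partition_path(epoch_seconds)].append(payload)
    return dict(buckets)
```

Buffering events per target path like this and appending each bucket in one go is one way to keep the number of open files small when the stream is out of order.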

Processing logs in Amazon EMR with or without using Hive

I have a lot of log files in my EMR cluster at the path 'hdfs:///logs'. Each log entry spans multiple lines but has a starting and an ending marker that separate it from neighboring entries.
Now,
a. Not all entries in a log file are useful.
b. The entries that are useful need to be transformed, and the output needs to be stored in an output file so that I can efficiently query the output logs (using Hive) later.
I have a Python script that can take a log file and do parts a. and b. above, but I have not written any mappers or reducers.
Hive takes care of mappers and reducers for its queries. Please tell me whether and how I can run the Python script over all the logs and save the output in 'hdfs:///outputlogs'.
I am new to MapReduce and have seen some word count examples, but all of them have a single input file. Where can I find examples that have multiple input files?
Here I see that you have a two-fold issue:
Having more than one file as input
The same word count example will work if you pass in more than one file as input. In fact, you can very easily pass a folder name as input instead of a file name, in your case hdfs:///logs.
You may even pass a comma-separated list of paths as input. For this, instead of using the following:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
you may use the following:
FileInputFormat.setInputPaths(conf, args[0]);
Note that passing a comma-separated list of paths as args[0] is then sufficient.
How to convert your logic to MapReduce
This does have a steep learning curve, as you will need to think in terms of keys and values. But I feel that you can just put all the logic in the mapper itself and use an IdentityReducer, like this:
conf.setReducerClass(IdentityReducer.class);
If you spend some time reading examples from the following locations, you should be in a better position to make these decisions:
hadoop-map-reduce-examples ( http://hadoop-map-reduce-examples.googlecode.com/svn/trunk/hadoop-examples/src/ )
http://developer.yahoo.com/hadoop/tutorial/module4.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
The correct long-term way to do this is, as Amar stated, to write a MapReduce job for it.
However, if this is a one-time thing and the data isn't too large, it might be simplest to do it with a short bash script, since you already have the Python script:
hadoop fs -text /logs/* > input.log
python myscript.py input.log output.log
hadoop fs -copyFromLocal output.log /outputlogs
rm -f input.log output.log
If this is a repeated process - something you want to be reliable and efficient - or if you just want to learn to use MapReduce better, then stick with Amar's answer.
If you already have the logic written and want to do parallel processing using EMR and/or vanilla Hadoop, you can use Hadoop Streaming: http://hadoop.apache.org/docs/r0.15.2/streaming.html. In a nutshell, a script that takes data on stdin and writes output to stdout can become a mapper.
This way you process the data in HDFS using the cluster, without needing to repackage your code.
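A rough sketch of how the existing script's logic might be wrapped as a Streaming mapper for multi-line entries follows. The marker strings, the "ERROR" filter, and the joining transform are all placeholders invented for illustration; the real markers and the real a./b. logic would come from the existing Python script.

```python
#!/usr/bin/env python3
"""Streaming mapper sketch for multi-line log entries read from stdin."""
import sys

START_MARKER = "BEGIN"  # hypothetical start-of-entry marker
END_MARKER = "END"      # hypothetical end-of-entry marker


def iter_entries(lines):
    """Yield each entry as a list of the lines between the two markers."""
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line == START_MARKER:
            current = []
        elif line == END_MARKER:
            if current is not None:
                yield current
            current = None
        elif current is not None:
            current.append(line)
        # Lines outside any entry are ignored.


def is_useful(entry):
    """Placeholder filter (step a.): keep entries that mention ERROR."""
    return any("ERROR" in line for line in entry)


def transform(entry):
    """Placeholder transform (step b.): flatten an entry to one line."""
    return " | ".join(line.replace("\t", " ") for line in entry)


def main():
    for entry in iter_entries(sys.stdin):
        if is_useful(entry):
            print(transform(entry))


if __name__ == "__main__":
    main()
```

One caveat with this approach: Streaming hands each mapper a split of the input, so an entry that straddles a split boundary could be cut in two. Keeping each log file within a single split, or accepting the loss of a few boundary entries, is a decision the sketch leaves open.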

Restrict Hadoop MapReduce to Specific File Extension

I am trying to run a MapReduce job on my cluster that only processes files with a specific extension. We have a bunch of heterogeneous data sitting on the cluster, and for this particular job I only want to process .jpg files. Is there a way to do this without filtering in the mapper? It seems like this should be easy to do when you launch the job. I'm thinking of something like hadoop fs JobName /users/myuser/data/*.jpg /users/myuser/output.
Your example should work as written, but check that you're calling the input format's setInputPaths(Job, String) method, as this resolves the glob string "/users/myuser/data/*.jpg" into the individual jpg files in /users/myuser/data.
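To illustrate which files a pattern like this selects, here is a small Python sketch that mimics single-directory glob matching. It is only an approximation for reasoning about the pattern: Hadoop performs its own glob expansion, and select_inputs is a hypothetical helper, not a Hadoop API.

```python
import posixpath
from fnmatch import fnmatchcase


def select_inputs(paths, pattern):
    """Return the paths whose directory equals the pattern's directory
    and whose file name matches the pattern's file-name glob."""
    pattern_dir, pattern_name = posixpath.split(pattern)
    selected = []
    for path in paths:
        directory, name = posixpath.split(path)
        if directory == pattern_dir and fnmatchcase(name, pattern_name):
            selected.append(path)
    return selected
```

With the pattern from the question, files such as /users/myuser/data/a.jpg are kept, while /users/myuser/data/b.png and anything in subdirectories are left out.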

Can Apache Pig load data from STDIN instead of a file?

I want to use Apache Pig to transform/join data in two files, but I want to build the script step by step, testing it on real data at a small size (10 lines, for example). Is it possible to have Pig read from STDIN and write to STDOUT?
Hadoop supports streaming in various ways, but Pig originally lacked support for loading data through streaming. However, there are some solutions.
You can check out HStreaming:
A = LOAD 'http://myurl.com:1234/index.html' USING HStream('\n') AS (f1, f2);
The answer is no. The data needs to be on the cluster's data nodes before any MR job can run over it.
However, if you are using a small sample of data and just want to do something simple, you could use Pig in local mode: write stdin to a local file and run your script over it.
But the bigger question is why you want to use MR/Pig on a stream of data. It was not, and is not, intended for that type of use.

How do I control output files name and content of an Hadoop streaming job?

Is there a way to control the output filenames of a Hadoop Streaming job?
Specifically, I would like my job's output files to be organized, in both content and name, by the key the reducer outputs: each file would contain only the values for one key, and its name would be that key.
Update:
Just found the answer: using a Java class that derives from MultipleOutputFormat as the job's output format allows control of the output file names.
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
I haven't seen any samples for this out there...
Can anyone point me to a Hadoop Streaming sample that makes use of a custom output format Java class?
Using a Java class that derives from MultipleOutputFormat as the job's output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
When using Hadoop Streaming, since only one JAR is supported, you actually have to fork the streaming jar and put your new output format classes in it so that streaming jobs can reference them...
EDIT:
As of Hadoop version 0.20.2 this class has been deprecated, and you should now use:
http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
In general, Hadoop would have you consider the entire directory to be the output, and not an individual file. There's no way to directly control the filename, whether using Streaming or regular Java jobs.
However, nothing is stopping you from doing this splitting and renaming yourself, after the job has finished. You can $HADOOP dfs -cat path/to/your/output/directory/part-*, and pipe that to a script of yours that splits content up by keys and writes it to new files.
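Such a splitting script could look roughly like the following Python sketch. It assumes the reducer output is one key<TAB>value record per line and that keys are safe to use as file names; split_by_key and its default separator are illustrative choices, not something Hadoop provides.

```python
#!/usr/bin/env python3
"""Split 'key<TAB>value' lines from stdin into one local file per key."""
import os
import sys


def split_by_key(lines, out_dir, sep="\t"):
    """Write each value to out_dir/<key>; return the set of keys seen.

    Assumes keys are valid file names (no '/' and so on) and that the
    number of distinct keys is small enough to keep one open file each.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        for raw in lines:
            line = raw.rstrip("\n")
            if not line:
                continue
            key, _, value = line.partition(sep)
            if key not in handles:
                handles[key] = open(os.path.join(out_dir, key), "w")
            handles[key].write(value + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return set(handles)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Intended use: pipe the concatenated part-* files into this script
    # and pass the local output directory as the only argument.
    split_by_key(sys.stdin, sys.argv[1])
```

The resulting per-key files are local, so they would still need to be copied back into HDFS if later jobs should read them from there.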

Resources