Short version of my question: We need to call the command line from within a Spark job. Is this feasible? The cluster support group indicated this could cause problems with memory.
Long Version: I have a job that I need to run on a Hadoop/MapR cluster processing packet data captured with tshark/wireshark. The data is binary packet data, one file per minute of capture. We need to extract certain fields from this packet data, such as IP addresses. We have investigated options such as jNetPcap, but this library is a bit limited. So it looks like we need to call the tshark command line from within the job and process the response. We can't do this directly during capture, as we need the capture to be as efficient as possible to avoid dropping packets. Converting the binary data to text outside the cluster is possible, but this is 95% of the work, so we may as well run the entire job as a non-distributed job on a single server. This limits the number of cores we can use.
Command line to decode is:
tshark -V -r somefile.pcap
or
tshark -T pdml -r somefile.pcap
Well, it is not impossible. Spark provides a pipe method which can be used to pipe data to an external process and read the output. The general structure could be, for example, something like this:
val files: RDD[String] = ??? // List of paths
val processed: RDD[String] = files.pipe("some Unix pipe")
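On the other end of the pipe you would need a small wrapper that reads one path per line from stdin, shells out to tshark, and writes text lines back to stdout for Spark to collect. A minimal sketch in Python (call it pcap_to_pdml.py; the wrapper, its name, and its output layout are assumptions, only the tshark options come from your question):

#!/usr/bin/env python
# Hypothetical pipe target: Spark sends each element of `files` (a path) on stdin,
# one per line; we shell out to tshark and forward its PDML output to stdout.
import subprocess
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    pdml = subprocess.check_output(["tshark", "-T", "pdml", "-r", path],
                                   universal_newlines=True)
    for xml_line in pdml.splitlines():
        # Prefix each line with the source file so records stay traceable downstream.
        print(path + "\t" + xml_line)

The Scala side would then pass the wrapper's invocation to pipe, once the wrapper is available on every executor.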
Still, from your description it looks like GNU Parallel could be a much better choice than Spark here.
Related
What is the procedure to define new external collectors in Bosun using scollector?
Can we write Python or shell scripts to collect data?
The documentation around this is not quite up to date. You can do it as described in http://godoc.org/bosun.org/cmd/scollector#hdr-External_Collectors, but we also support JSON output, which is better.
Either way, you write something and put it in the external collectors directory, followed by a frequency directory, and then an executable script or binary. Something like:
<external_collectors_dir>/<freq_sec>/foo.sh.
If the frequency directory is 0, then the script is expected to be continuously running, and you put a sleep inside the code (this is my preferred method for external collectors). The script outputs the telnet format, or the undocumented JSON format, to stdout. Scollector picks it up and queues that information for sending.
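For illustration, a minimal continuously-running collector in Python might look like this; the metric name, the tag, and reading /proc/loadavg are just made-up examples, not anything scollector requires:

#!/usr/bin/env python
# Hypothetical external collector for the 0-frequency directory: emits one data
# point in the telnet format (metricname timestamp value tag1=foo ...) every 15s.
import sys
import time

while True:
    with open("/proc/loadavg") as f:
        load_1min = f.read().split()[0]
    print("example.os.load_1min %d %s host=web01" % (int(time.time()), load_1min))
    sys.stdout.flush()  # scollector reads the lines from stdout as they arrive
    time.sleep(15)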
I created an issue to get this documented not long ago https://github.com/bosun-monitor/bosun/issues/1225. Until one of us gets around to that, here is the PR that added JSON https://github.com/bosun-monitor/bosun/commit/fced1642fd260bf6afa8cba169d84c60f2e23e92
Adding to what Kyle said, you can take a look at some existing external collectors to see what they output. Here is one written in Java that one of our colleagues wrote to monitor JVM stuff. It uses the text format, which is simply:
metricname timestamp value tag1=foo tag2=bar
If you want to use the JSON format, here is an example from one of our collectors:
{"metric":"exceptional.exceptions.count","timestamp":1438788720,"value":0,"tags":{"application":"AdServer","machine":"ny-web03","source":"NY_Status"}}
And you can also send metadata:
{"Metric":"exceptional.exceptions.count","Name":"rate","Value":"counter"}
{"Metric":"exceptional.exceptions.count","Name":"unit","Value":"errors"}
{"Metric":"exceptional.exceptions.count","Name":"desc","Value":"The number of exceptions thrown per second by applications and machines. Data is queried from multiple sources. See status instances for details on exceptions."}`
Or send error messages to stderr:
2015/08/05 15:32:00 lookup OR-SQL03: no such host
I have a problem that would be solved by Hadoop Streaming in "typedbytes" or "rawbytes" mode, which allow one to analyze binary data in a language other than Java. (Without this, Streaming interprets some characters, usually \t and \n, as delimiters and complains about non-utf-8 characters. Converting all my binary data to Base64 would slow down the workflow, defeating the purpose.)
These binary modes were added by HADOOP-1722. On the command line that invokes a Hadoop Streaming job, "-io rawbytes" lets you define your data as a 32-bit integer size followed by raw data of that size, and "-io typedbytes" lets you define your data as a one-byte type code of zero (which means raw bytes), followed by a 32-bit integer size, followed by raw data of that size. I have created files with these formats (with one or many records) and verified that they are in the right format by checking them against the output of typedbytes.py. I've also tried all conceivable variations (big-endian, little-endian, different byte offsets, etc.). I'm using Hadoop 0.20 from CDH4, which has the classes that implement the typedbytes handling, and it is entering those classes when the "-io" switch is set.
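To make the layouts concrete, this is roughly how I build a single record in each mode with Python's struct module, using big-endian lengths as typedbytes.py does (treat it as a sketch of my understanding, not anything official):

import struct

data = b"hey"

# "rawbytes": a 32-bit length followed by that many raw bytes.
rawbytes_record = struct.pack(">i", len(data)) + data

# "typedbytes": a one-byte type code of 0 (raw bytes), then a 32-bit length, then the bytes.
typedbytes_record = struct.pack(">b", 0) + struct.pack(">i", len(data)) + data

# e.g. write one typedbytes record to a local file for loading into HDFS
with open("local/typedbytes.tb", "wb") as f:
    f.write(typedbytes_record)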
I copied the binary file to HDFS with "hadoop fs -copyFromLocal". When I try to use it as input to a map-reduce job, it fails with an OutOfMemoryError on the line where it tries to make a byte array of the length I specify (e.g. 3 bytes). It must be reading the number incorrectly and trying to allocate a huge block instead. Despite this, it does manage to get a record to the mapper (the previous record? not sure), which writes it to standard error so that I can see it. There are always too many bytes at the beginning of the record: for instance, if the file is "\x00\x00\x00\x00\x03hey", the mapper would see "\x04\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\x00\x03hey" (reproducible bits, though no pattern that I can see).
From page 5 of this talk, I learned that there are "loadtb" and "dumptb" subcommands of streaming, which copy to/from HDFS and wrap/unwrap the typed bytes in a SequenceFile, in one step. When used with "-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat", Hadoop correctly unpacks the SequenceFile, but then misinterprets the typedbytes contained within, in exactly the same way.
Moreover, I can find no documentation of this feature. On Feb 7 (I e-mailed it to myself), it was briefly mentioned in the streaming.html page on Apache, but this r0.21.0 webpage has since been taken down and the equivalent page for r1.1.1 has no mention of rawbytes or typedbytes.
So my question is: what is the correct way to use rawbytes or typedbytes in Hadoop Streaming? Has anyone ever gotten it to work? If so, could someone post a recipe? It seems like this would be a problem for anyone who wants to use binary data in Hadoop Streaming, which ought to be a fairly broad group.
P.S. I noticed that Dumbo, Hadoopy, and rmr all use this feature, but there ought to be a way to use it directly, without being mediated by a Python-based or R-based framework.
Okay, I've found a combination that works, but it's weird.
Prepare a valid typedbytes file in your local filesystem, following the documentation or by imitating typedbytes.py.
Use
hadoop jar path/to/streaming.jar loadtb path/on/HDFS.sequencefile < local/typedbytes.tb
to wrap the typedbytes in a SequenceFile and put it in HDFS, in one step.
Use
hadoop jar path/to/streaming.jar -inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat ...
to run a map-reduce job in which the mapper gets its input from the SequenceFile. Note that -io typedbytes or -D stream.map.input=typedbytes should not be used: explicitly asking for typedbytes leads to the misinterpretation I described in my question. But fear not: Hadoop Streaming splits the input on its binary record boundaries and not on its '\n' characters. The data arrive in the mapper as "rawdata" separated by '\t' and '\n', like this (a parsing sketch in Python follows the list):
32-bit signed integer, representing length (note: no type character)
block of raw binary with that length: this is the key
'\t' (tab character... why?)
32-bit signed integer, representing length
block of raw binary with that length: this is the value
'\n' (newline character... ?)
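Here is a sketch of reading that stream in a Python mapper, with big-endian lengths as elsewhere in typedbytes; the separator handling simply mirrors the list above, so treat it as illustrative rather than authoritative:

import struct
import sys

stream = sys.stdin.buffer  # binary stdin (Python 3)

def read_block():
    """Read one 4-byte big-endian length, then that many raw bytes, then skip the separator."""
    header = stream.read(4)
    if len(header) < 4:
        return None                # end of input
    length = struct.unpack(">i", header)[0]
    data = stream.read(length)
    stream.read(1)                 # consume the '\t' after a key or the '\n' after a value
    return data

while True:
    key = read_block()
    if key is None:
        break
    value = read_block()
    # ... process the raw key and value bytes here ...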
If you want to additionally send raw data from mapper to reducer, add
-D stream.map.output=typedbytes -D stream.reduce.input=typedbytes
to your Hadoop command line and format the mapper's output and reducer's expected input as valid typedbytes. Keys and values still alternate in pairs, but this time with type characters and without '\t' and '\n'. Hadoop Streaming correctly splits these pairs on their binary record boundaries and groups by keys.
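For the mapper-to-reducer direction, emitting one key/value pair as typedbytes from a Python mapper could look like this; again it just mirrors the format described above (type code 0 for raw bytes, big-endian lengths), so treat it as a sketch:

import struct
import sys

out = sys.stdout.buffer  # binary stdout (Python 3)

def write_typedbytes(raw):
    """Write one typedbytes value: type byte 0 (raw bytes), a 4-byte length, then the bytes."""
    out.write(struct.pack(">b", 0))
    out.write(struct.pack(">i", len(raw)))
    out.write(raw)

def emit(key, value):
    # Keys and values alternate with no '\t' or '\n' between them in this mode.
    write_typedbytes(key)
    write_typedbytes(value)

emit(b"some-key", b"some-binary-value")
out.flush()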
The only documentation on stream.map.output and stream.reduce.input that I could find was in the HADOOP-1722 exchange, starting 6 Feb 09. (Earlier discussion considered a different way to parameterize the formats.)
This recipe does not provide strong typing for the input: the type characters are lost somewhere in the process of creating a SequenceFile and interpreting it with the -inputformat. It does, however, provide splitting at the binary record boundaries, rather than '\n', which is the really important thing, and strong typing between the mapper and the reducer.
We solved the binary data issue by hex-encoding the data at the split level when streaming it down to the mapper. This preserves the parallel efficiency of your operation, instead of transforming all of your data before processing on a node.
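A rough sketch of that idea with Python's binascii, assuming one record per line as the framing (how records are framed is up to you; one per line is just the simplest choice):

import binascii

# Producer side: turn each binary record into one delimiter-safe ASCII line.
def encode_record(raw_bytes):
    return binascii.hexlify(raw_bytes).decode("ascii") + "\n"

# Mapper side: recover the original bytes from each line read on stdin.
def decode_record(line):
    return binascii.unhexlify(line.strip())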
Apparently there is a patch for a JustBytes IO mode for streaming that feeds a whole input file to the mapper command:
https://issues.apache.org/jira/browse/MAPREDUCE-5018
I have a lot of log files in my EMR cluster at path 'hdfs:///logs'. Each log entry is multiple lines but have a starting and ending marker to demarcate between two entries.
Now,
Not all entries in a log file are useful
The entries which are useful need to be transformed, and the output needs to be stored in an output file so that I can efficiently query (using Hive) the output logs later.
I have a Python script which can simply take a log file and do both of the steps mentioned above, but I have not written any mappers or reducers.
Hive takes care of mappers and reducers for its queries. Please tell me if and how it is possible to use the Python script to run over all the logs and save the output in 'hdfs:///outputlogs'?
I am new to MapReduce and have seen some examples of word count, but all of them have a single input file. Where can I find examples which have multiple input files?
I see that you have a two-fold issue here:
Having more than one file as input
The same word count example will work if you pass in more than one file as input. In fact, you can very easily pass a folder name as input instead of a file name; in your case, hdfs:///logs.
You may even pass a comma-separated list of paths as input. To do this, instead of using the following:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
You may use the following:
FileInputFormat.setInputPaths(job, args[0]);
Note that passing a comma-separated list of paths as args[0] will be sufficient.
How to convert your logic to mapreduce
This does have a steep learning curve, as you will need to think in terms of keys and values. But I feel that you can just have all the logic in the mapper itself and use an IdentityReducer, like this:
conf.setReducerClass(IdentityReducer.class);
If you spend some time reading examples from the following locations, you should be in a better position to make these decisions:
hadoop-map-reduce-examples ( http://hadoop-map-reduce-examples.googlecode.com/svn/trunk/hadoop-examples/src/ )
http://developer.yahoo.com/hadoop/tutorial/module4.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
The long-term correct way to do this is, as Amar stated, to write a MapReduce job to do it.
However, if this is a one-time thing, and the data isn't too enormous, it might be simplest/easiest to do this with a simple bash script since you already have the python script:
hadoop fs -text /logs/* > input.log              # pull the logs out of HDFS into one local file
python myscript.py input.log output.log          # run your existing script locally
hadoop fs -copyFromLocal output.log /outputlogs  # push the result back into HDFS
rm -f input.log output.log                       # clean up the local copies
If this is a repeated process - something you want to be reliable and efficient - or if you just want to learn to use MapReduce better, then stick with Amar's answer.
If you have the logic already written, and you want to do parallel processing using EMR and/or vanilla Hadoop, you can use Hadoop Streaming: http://hadoop.apache.org/docs/r0.15.2/streaming.html. In a nutshell, your script that takes data on stdin and writes output to stdout can become a mapper.
Thus you will run the processing of the data in HDFS using the cluster, without the need to repackage your code.
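For example, the streaming mapper can be a thin wrapper around the logic you already have; the marker strings and the is_useful/transform placeholders below stand in for whatever your script does and are purely assumptions:

#!/usr/bin/env python
# Hypothetical streaming mapper: stdin is raw log text, stdout is the transformed
# output that will land in the job's output directory (e.g. hdfs:///outputlogs).
import sys

START_MARKER = "BEGIN_ENTRY"   # placeholder for your real start-of-entry marker
END_MARKER = "END_ENTRY"       # placeholder for your real end-of-entry marker

def is_useful(entry_lines):
    """Placeholder for the 'is this entry worth keeping' check."""
    return True

def transform(entry_lines):
    """Placeholder for the per-entry transformation already in your script."""
    return "\t".join(entry_lines)

entry = []
for line in sys.stdin:
    line = line.rstrip("\n")
    if line.startswith(START_MARKER):
        entry = [line]
    elif line.startswith(END_MARKER):
        entry.append(line)
        if is_useful(entry):
            print(transform(entry))
        entry = []
    elif entry:
        entry.append(line)

You would then point Hadoop Streaming at it with its -input, -output and -mapper options; one caveat is that multi-line entries can be split across input splits, so entries straddling a split boundary may need extra care.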
I'm streaming data in a Pig script through an executable that returns an XML fragment for each line of input I stream to it. That XML fragment happens to span multiple lines, and I have no control whatsoever over the output of the executable I stream to.
In relation to Use Hadoop Pig to load data from text file w/ each record on multiple lines?, the answer suggested writing a custom record reader. The problem is, this works fine if you want to implement a LoadFunc that reads from a file, but to be able to use streaming, it has to implement StreamToPig. As far as I understood, StreamToPig only allows you to read one line at a time.
Does anyone know how to handle such a situation?
If you are absolutely sure of your record boundaries, then one option is to manage it internally to the streaming solution. That is to say, you build up the tuple yourself, and when you hit whatever your desired size is, you do the processing and return a value. In general, eval funcs in Pig have this issue.
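Concretely, the buffering can live in a wrapper around the executable rather than in Pig itself. A sketch of that wrapper (the executable name and the closing tag are assumptions):

#!/usr/bin/env python
# Hypothetical wrapper: run the real executable, buffer its multi-line XML output,
# and emit each complete fragment as a single line so StreamToPig sees whole records.
import subprocess
import sys

proc = subprocess.Popen(["the_executable"], stdin=sys.stdin,
                        stdout=subprocess.PIPE, universal_newlines=True)

fragment = []
for line in proc.stdout:
    fragment.append(line.rstrip("\n"))
    if line.strip() == "</record>":   # assumed closing tag of one XML fragment
        print(" ".join(fragment))     # one complete fragment -> one output line
        sys.stdout.flush()
        fragment = []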
I want to use Apache Pig to transform/join data in two files, but I want to implement it step by step; that is, test it with real data, but only a small amount (10 lines, for example). Is it possible to use Pig to read from STDIN and output to STDOUT?
Basically, Hadoop supports streaming in various ways, but Pig originally lacked support for loading data through streaming. However, there are some solutions.
You can check out HStreaming:
A = LOAD 'http://myurl.com:1234/index.html' USING HStream('\n') AS (f1, f2);
The answer is no. The data needs to be out on the cluster's data nodes before any MR job can even run over it.
However, if you are using a small sample of data and just want to do something simple, you could use Pig in local mode: write stdin to a local file and run it through your script.
But the bigger question becomes: why do you want to use MR/Pig on a stream of data? It was not, and is not, intended for that type of use.