Getting byte offset with MRJob - hadoop

According to "The Definitive Guide to Hadoop", the input format TextInputFormat gives key value pairs (k, v) = (byte offset, line). However, in MRJob, the key in the mapper input is always None. It should be easy to get the byte offset as key, since that's what TextInputFormat does. How do I get this?
I know that you can use the environment variable 'map_input_start' and calculate byte offsets yourself, but this has caused problems and I would like to do it the much simpler way of just getting the offset as key.

The TextInputFormat is a Java class ... I do not see how that would work in the streaming world.

Doesn't defining the map method in your mapper class with the following signature give you the byte offset as the key?
public void map(LongWritable key, Text value, OutputCollector<K, V> output, Reporter reporter)
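Completing that signature into a full (hedged) old-API example may make the point clearer: with the default TextInputFormat, the key passed to map() is the byte offset of the line. Class and output types below are made up for illustration, and, as noted above, this is plain Java MapReduce rather than something MRJob or Hadoop Streaming exposes directly.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OffsetMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
        // Emit the byte offset of the line together with the line itself.
        output.collect(key, value);
    }
}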

Related

h2o Steam Prediction Servlet not accepting character values from python script

I am using Steam to build a prediction service with a Python preprocessing script. When Python passes the cleaned data to the prediction service in the
variable:value var2:value2 var3:value3
format (as seen in the Spam Detection example), I get a
ERROR PredictPythonServlet - Failed to parse
error from the service. When I look at the PredictPythonServlet.java file, it seems to use only the strMapToRowData function, which assumes every value in the input string is a number:
for (String p : pairs) {
    String[] a = p.split(":");
    String term = a[0];
    double value = Float.parseFloat(a[1]);
    row.put(term, value);
}
Are character values not allowed to be sent in this format? If so, is there a way to get the PredictPythonServlet file to use the csvToRowData function that is defined but never used? I'd like to avoid one-hot encoding for my models, so being able to pass the actual character string representation would be ideal.
Additionally, I passed the numeric representation found in the model pojo file for the categorical variables and received the error:
hex.genmodel.easy.exception.PredictUnknownTypeException: Unexpected object type java.lang.Double for categorical column home_team
So it looks like the service expects a character string but I can't figure out how to pass it along to the actual model. Any help would be greatly appreciated!
The prediction service is using EasyPredictModelWrapper, and it can only accept what the underlying model expects. It's not clear which model you are using here, but most features are numeric float values, and in the for-loop snippet above you can see that the servlet parses every value as a float.
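For illustration, here is a hedged sketch of the h2o-genmodel easy API called directly, with a categorical column passed as a String, which is what the PredictUnknownTypeException above is pointing at. The POJO class name, the column names, and the assumption of a binomial model are all made up; this is not the servlet's code.
import hex.genmodel.GenModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;

public class PredictSketch {
    public static void main(String[] args) throws Exception {
        // Load the exported POJO by its (hypothetical) class name.
        GenModel rawModel = (GenModel) Class.forName("my_gbm_pojo").newInstance();
        EasyPredictModelWrapper model = new EasyPredictModelWrapper(rawModel);

        RowData row = new RowData();
        row.put("home_team", "SEA");   // categorical column: pass the String level, not a Double
        row.put("score_diff", 3.0);    // numeric column: pass a Double
        BinomialModelPrediction p = model.predictBinomial(row);
        System.out.println(p.label);
    }
}
The servlet's strMapToRowData path never produces such String values, which is consistent with both errors you are seeing.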

Mapper input Key-Value pair in Hadoop

Normally, we write the mapper in the form:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
Here the input key-value pair for the mapper is <LongWritable, Text>. As far as I know, when the mapper gets the input data it goes through it line by line, so the key for the mapper signifies the line number - please correct me if I am wrong.
My question is: if I give the input key-value pair for the mapper as <Text, Text>, then it gives the error
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
Is it mandatory to give the input key-value pair of the mapper as <LongWritable, Text>? If yes, then why? If no, then what is the reason for the error? Can you please help me understand the proper reasoning behind the error?
Thanks in advance.
The input to the mapper depends on what InputFormat is used. The InputFormat is responsible for reading the incoming data and shaping it into whatever format the Mapper expects. The default InputFormat is TextInputFormat, which extends FileInputFormat<LongWritable, Text>.
If you do not change the InputFormat, using a Mapper with a key-value type signature other than <LongWritable, Text> will cause this error. If you expect <Text, Text> input, you will have to choose an appropriate InputFormat. You can set the InputFormat in the Job setup:
job.setInputFormatClass(MyInputFormat.class);
And like I said, by default this is set to TextInputFormat.
Now, let's say your input data is a bunch of newline-separated records delimited by a comma:
"A,value1"
"B,value2"
If you want the input key-value pairs for the mapper to be ("A", "value1"), ("B", "value2"), you will have to implement a custom InputFormat and RecordReader with the <Text, Text> signature. Fortunately, this is pretty easy. There is an example here, and probably a few examples floating around StackOverflow as well.
In short, add a class which extends FileInputFormat<Text, Text> and a class which extends RecordReader<Text, Text>. Override the FileInputFormat#createRecordReader method and have it return an instance of your custom RecordReader.
Then you will have to implement the required RecordReader logic. The simplest way to do this is to create an instance of LineRecordReader in your custom RecordReader and delegate all basic responsibilities to that instance. In the getCurrentKey and getCurrentValue methods, you implement the logic for extracting the comma-delimited Text contents by calling LineRecordReader#getCurrentValue and splitting on the comma.
Finally, set your new InputFormat as the Job's InputFormat, as shown after the second paragraph above. A minimal sketch of the whole thing follows.
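Here is that minimal sketch, under the assumptions above: the new org.apache.hadoop.mapreduce API and comma-delimited lines. The class names are made up, and error handling is kept to a bare minimum.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CommaInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new CommaRecordReader();
    }

    public static class CommaRecordReader extends RecordReader<Text, Text> {
        // Delegate the actual file reading to the stock LineRecordReader.
        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text key = new Text();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!lineReader.nextKeyValue()) {
                return false;
            }
            // Split the current line on the first comma: the left part becomes
            // the key, the rest of the line becomes the value.
            String[] parts = lineReader.getCurrentValue().toString().split(",", 2);
            key.set(parts[0]);
            value.set(parts.length > 1 ? parts[1] : "");
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }
}
With this in place, job.setInputFormatClass(CommaInputFormat.class) lets you declare your Mapper as Mapper<Text, Text, ...>.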
In the book "Hadoop: The Difinitive Guide" by Tom White I think he has an appropriate answer to this(pg. 197):
"TextInputFormat’s
keys, being simply the offset within the file, are not normally very
useful. It is common for each line in a file to be a key-value pair, separated by a delimiter
such as a tab character. For example, this is the output produced by
TextOutputFormat, Hadoop’s default
OutputFormat. To interpret such files correctly,
KeyValueTextInputFormat
is appropriate.
You can specify the separator via the
key.value.separator.in.input.line
property. It
is a tab character by default."
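For concreteness, a hedged sketch of that setup with the new-API KeyValueTextInputFormat; the class and job names are made up, and note that the property quoted in the book is the older name, while newer releases use the mapreduce.* name shown below.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Newer name of the separator property (the book quotes the older
        // key.value.separator.in.input.line); tab is the default anyway.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
        Job job = Job.getInstance(conf, "key-value text input example");
        // The mapper now receives the text before the separator as a Text key
        // and the rest of the line as a Text value.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... set mapper/reducer classes, input/output paths, output types,
        // then job.waitForCompletion(true).
    }
}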
With the default input format, the key for the mapper input is a LongWritable: it holds the byte offset of the line within the file, and the value holds the whole line. The record reader reads a single line in each cycle. The output of the mapper can be whatever you want (it can be (Text, Text) or (Text, IntWritable) or ...).

How to just output value in context.write(k,v)

In my mapreduce job, I just want to output some lines.
But if I code like this:
context.write(data, null);
the program will throw java.lang.NullPointerException.
I don't want to write it like this:
context.write(data, new Text(""));
because then I have to trim the blank space from every line in the output files.
Is there any good ways to solve it?
Thanks in advance.
Sorry, it's my mistake. I checked the program carefully and found that the reason is that I had set the Reducer as the combiner.
If I do not use the combiner, the statement
context.write(data, null);
in the reducer works fine. In the output data file, there is just the data line.
Sharing the NullWritable explanation from the Hadoop definitive guide:
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don’t need to use that position—it effectively stores a constant empty value. NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().
You should use NullWritable for this purpose.
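For concreteness, a hedged sketch of a reducer that emits only the data line; the class name and value type are illustrative, not taken from the question's code.
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ValueOnlyReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Write the data as the key and NullWritable as the value:
        // TextOutputFormat then prints only the key, with no trailing separator.
        context.write(key, NullWritable.get());
    }
}
Remember to declare the output value class in the driver as well, e.g. job.setOutputValueClass(NullWritable.class).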

"Map" and "Reduce" functions in Hadoop's MapReduce

I've been looking at this word count example by hadoop:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code
And I'm a little confused about the Map function. In the map function shown, it takes in a "key" of type LongWritable, but this parameter is never used in the body of the Map function. What does the application programmer expect Hadoop to pass in for this key? Why does a map function require a key if it simply parses values from a line of text or something? Can someone give me an example where both a key and a value are required for input? I only see map as V1 -> (K2, V2).
Another question: in the real implementation of Hadoop, are there multiple reduction steps? If so, how does Hadoop apply the same reduction function multiple times if the function is (K2, V2) -> (K3, V3)? If another reduction is performed, it needs to take in type (K3, V3)...
Thank you!
There's a key there because the map() method is always passed a key and a value (and a context). It's up to you as to whether you actually use the key and/or value. In this case, with the default TextInputFormat, the key is the byte offset of the line within the file being read. The word count logic doesn't need that. The map() method just uses the value, which in the case of a text file is a line of the file.
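For reference, a hedged sketch of a word-count mapper, written against the newer org.apache.hadoop.mapreduce API rather than the older one in the linked tutorial: the key parameter is accepted because the signature requires it, then simply ignored.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 'key' (the byte offset of this line) is never used.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}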
As to your second question (which really should be its own Stack Overflow question), you may have any number of map/reduce jobs in a Hadoop workflow. Some of those jobs will read pre-existing files as input, and others will read the output of other jobs. Each job has a single map phase and a single reduce phase (the reduce phase may run as many parallel reducer tasks); if you need to reduce again, you chain a second job whose input types match the (K3, V3) output of the first.

Convert a string of 0-F into a byte array in Ruby

I am attempting to decrypt a number encrypted by another program that uses the BouncyCastle library for Java.
In Java, I can set the key like this: key = Hex.decode("5F3B603AFCE22359");
I am trying to figure out how to represent that same step in Ruby.
To get an Integer, just use str.hex. You can get a byte array in several ways:
str.scan(/../).map(&:hex)
[str].pack('H*').unpack('C*')
[str].pack('H*').bytes.to_a
See other options for pack/unpack and examples (by codeweblog).
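For example, with str = "5F3B603AFCE22359" (the key from the question), each of these should return [95, 59, 96, 58, 252, 226, 35, 137].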
For a string str:
"".tap {|binary| str.scan(/../) {|hn| binary << hn.to_i(16).chr}}
