Pig UDF: handling a multi-line tuple split across different mappers - Hadoop

I have a file where each tuple spans multiple lines, for example:
START
name: Jim
phone: 2128789283
address: 56 2nd street, New York, USA
END
START
name: Tom
phone: 6308789283
address: 56 5th street, Chicago, 13611, USA
END
.
.
.
So the above are 2 tuples in my file. I wrote a UDF with a getNext() function that checks for START and initializes the tuple, returns the tuple (built from a string buffer) when it sees END, and otherwise just appends the line to the string buffer.
It works well when the file is smaller than the HDFS block size, which is 64 MB (on Amazon EMR), but it fails for anything larger. I tried to google around and found this blog post. Raja's explanation is easy to understand and he provided sample code, but the code implements the RecordReader part rather than getNext() for a Pig LoadFunc. Just wondering if anyone has experience handling this multi-line Pig tuple split problem? Should I go ahead and implement a RecordReader in Pig? If so, how?
Thanks.

You may preprocess your input as Guy mentioned, or apply the other tricks described here.
I think the cleanest solution would be to implement a custom InputFormat (along with its RecordReader) which creates one record per START-END block. Pig's LoadFunc sits on top of Hadoop's InputFormat, so you can define which InputFormat your LoadFunc will use.
A raw, skeleton implementation of a custom LoadFunc would look like:
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class CustomLoader extends LoadFunc {

    private RecordReader reader;
    private TupleFactory tupleFactory;

    public CustomLoader() {
        tupleFactory = TupleFactory.getInstance();
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new MyInputFormat(); // custom InputFormat
    }

    @Override
    public Tuple getNext() {
        Tuple result = null;
        try {
            if (!reader.nextKeyValue()) {
                return null;
            }
            // value can be a custom Writable containing your name/value
            // field pairs for a given record
            Object value = reader.getCurrentValue();
            result = tupleFactory.newTuple();
            // ...
            // append fields to tuple
        } catch (Exception e) {
            // ...
        }
        return result;
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit pigSplit)
            throws IOException {
        this.reader = reader;
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }
}
After the LoadFunc initializes the InputFormat and its RecordReader, it locates your input data and begins obtaining records from the RecordReader, creating the resulting tuples (getNext()) until the input has been fully read.
Some remarks on the custom InputFormat:
I'd create a custom InputFormat in which the RecordReader is a modified version of org.apache.hadoop.mapreduce.lib.input.LineRecordReader: most of the methods would remain the same, except initialize(), which would call a custom LineReader (based on org.apache.hadoop.util.LineReader). The InputFormat's key would be the line offset (a Long), and the value would be a custom Writable. This Writable would hold the fields of a record (i.e. the data between START and END) as a list of key-value pairs. Each time your RecordReader's nextKeyValue() is called, the record is written to the custom Writable by the LineReader. The gist of the whole thing is how you implement LineReader.readLine(). A rough sketch of this idea is shown below.
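To make that concrete, here is a rough, hypothetical sketch of such a RecordReader (the class name StartEndRecordReader is made up). For brevity it departs from the description above in one way: it returns the raw START-END block as Text rather than a custom Writable of key-value pairs, and it ignores compression and some corner cases. The crucial part is the split handling: a reader only claims records whose START line begins inside its own split, but it may read past the split end to reach the matching END. The MyInputFormat used in the skeleton above would then just be a FileInputFormat subclass whose createRecordReader() returns this reader.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Hypothetical sketch only: groups the lines between START and END into one record.
// A reader claims a record only if its START line begins inside this split,
// but is allowed to read past the split end to find the matching END.
public class StartEndRecordReader extends RecordReader<LongWritable, Text> {

    private LineReader in;
    private long start, end, pos;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private final Text line = new Text();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration conf = context.getConfiguration();
        start = split.getStart();
        end = start + split.getLength();
        FSDataInputStream fsin = split.getPath().getFileSystem(conf).open(split.getPath());
        fsin.seek(start);
        in = new LineReader(fsin, conf);
        pos = start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        while (pos < end) {                        // only claim records starting in this split
            long lineStart = pos;
            int read = in.readLine(line);
            if (read == 0) {
                return false;                      // end of file
            }
            pos += read;
            if (line.toString().trim().equals("START")) {
                key.set(lineStart);                // key = offset of the START line
                StringBuilder record = new StringBuilder();
                while (true) {                     // read until END, even past the split end
                    read = in.readLine(line);
                    if (read == 0) {
                        return false;              // truncated record at end of file
                    }
                    pos += read;
                    if (line.toString().trim().equals("END")) {
                        value.set(record.toString());
                        return true;
                    }
                    record.append(line.toString()).append('\n');
                }
            }
        }
        return false;                              // later START lines belong to the next split
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() {
        return end == start ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}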
Another, probably easier, approach would be to change the delimiter of TextInputFormat (it is configurable in Hadoop 0.23, see textinputformat.record.delimiter) to one that is appropriate for your data structure (if that is possible). In this case you'll end up with your data in a Text object, from which you need to split out and extract the key-value pairs into tuples.
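As a rough, hypothetical illustration of that last step, a helper like the one below (the name toTuple is made up) could sit inside a LoadFunc such as the skeleton above, where Tuple and TupleFactory are already available:
// Hypothetical helper: turn one delimited chunk ("name: Jim\nphone: ...\n...\nEND")
// into a Pig tuple of the field values. Assumes simple "key: value" lines.
private Tuple toTuple(String chunk) {
    Tuple tuple = tupleFactory.newTuple();
    for (String line : chunk.split("\n")) {
        line = line.trim();
        if (line.isEmpty() || line.equals("END")) {
            continue; // skip blank lines and the trailing END marker
        }
        int colon = line.indexOf(':');
        // keep the value part; fall back to the whole line if there is no "key:" prefix
        tuple.append(colon >= 0 ? line.substring(colon + 1).trim() : line);
    }
    return tuple;
}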

If you can take START as your delimiter, the code below will probably work without a UDF:
SET textinputformat.record.delimiter 'START';
a = load '<input path>' as (data:chararray);
dump a;
the output would look like:
(
name: Jim
phone: 2128789283
address: 56 2nd street, New York, USA
END
)
(
name: Tom
phone: 6308789283
address: 56 5th street, Chicago, 13611, USA
END
)
Now the two records end up in two separate tuples.


Flip key and value in a HashMap using a Java 8 Stream

I have a Map with the following structure and I want to flip the keys and values.
Map<String, List<String>> dataMap
Sample data :
acct01: [aa, ab, ad],
acct02: [ac, ad]
acct03: [ax, ab]
I want this data to be converted to:
aa: [acct01],
ab: [acct01, acct03],
ac: [acct02],
ad: [acct01, acct02],
ax: [acct03]
I want to know if there is a Java 8 stream way to transform the Map.
My current implementation (without streams):
Map<String, List<String>> originalData = new HashMap<String, List<String>>();
originalData.put("Acct01", Arrays.asList("aa", "ab", "ad"));
originalData.put("Acct02", Arrays.asList("ac", "ad"));
originalData.put("Acct03", Arrays.asList("ax", "ab"));
System.out.println(originalData);

Map<String, List<String>> newData = new HashMap<String, List<String>>();
originalData.entrySet().forEach(entry -> {
    entry.getValue().forEach(v -> {
        if (newData.get(v) == null) {
            List<String> t = new ArrayList<String>();
            t.add(entry.getKey());
            newData.put(v, t);
        } else {
            newData.get(v).add(entry.getKey());
        }
    });
});
System.out.println(newData);
Input and output:
{Acct01=[aa, ab, ad], Acct02=[ac, ad], Acct03=[ax, ab]}
{aa=[Acct01], ab=[Acct01, Acct03], ac=[Acct02], ad=[Acct01, Acct02], ax=[Acct03]}
I am looking for a way to implement this using streams.
Get the stream for the entry set, flatten it out into one entry per key-value pair, group by value, collect associated keys into a list.
import static java.util.Arrays.asList;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.toList;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
<K, V> Map<V, List<K>> invert(Map<K, List<V>> map) {
    return map.entrySet().stream().flatMap(
        entry -> entry.getValue().stream().map(
            value -> new SimpleImmutableEntry<>(
                entry.getKey(),
                value
            )
        )
    ).collect(
        groupingBy(
            Entry::getValue,
            mapping(
                Entry::getKey,
                toList()
            )
        )
    );
}
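Called with the question's sample data (with java.util.HashMap additionally imported), it produces the requested mapping:
Map<String, List<String>> originalData = new HashMap<>();
originalData.put("Acct01", asList("aa", "ab", "ad"));
originalData.put("Acct02", asList("ac", "ad"));
originalData.put("Acct03", asList("ax", "ab"));

Map<String, List<String>> newData = invert(originalData);
System.out.println(newData);
// e.g. {aa=[Acct01], ab=[Acct01, Acct03], ac=[Acct02], ad=[Acct01, Acct02], ax=[Acct03]}
// (HashMap iteration order is not guaranteed)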
Here is a solution using StreamEx, a third-party library that extends the Java 8 stream API:
newData = EntryStream.of(originalData).invert().flatMapKeys(k -> k.stream()).grouping();
If you are open to using a third-party library like Eclipse Collections, you can use a ListMultimap (each key can have a List of values). Multimap has flip(), so this will work:
MutableListMultimap<String, String> originalData = Multimaps.mutable.list.empty();
originalData.putAll("Acct01", Arrays.asList("aa", "ab", "ad"));
originalData.putAll("Acct02", Arrays.asList("ac", "ad"));
originalData.putAll("Acct03", Arrays.asList("ax", "ab"));
System.out.println(originalData);
MutableBagMultimap<String, String> newData = originalData.flip();
System.out.println(newData);
input: {Acct03=[ax, ab], Acct02=[ac, ad], Acct01=[aa, ab, ad]}
output: {ac=[Acct02], ad=[Acct02, Acct01], aa=[Acct01], ab=[Acct03, Acct01], ax=[Acct03]}
Please note that flip() returns a BagMultimap, where each key can have a Bag of values. A Bag is a special data structure which is unordered and allows duplicates.
Note: I am a committer for Eclipse Collections.
Your current implementation already relies on Java 8 features. The forEach method was added to a number of data structures in Java 8, and while you could use streams, there would be little point: much of the advantage of streams comes from being able to execute filters, sorting, and other operations lazily, which does not apply to this key remapping.
If you really wanted to, you could sprinkle in a couple of streams by changing the .forEach calls to .stream().forEach in your example.
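For illustration, the outer loop would then read (same logic as the question's version, just routed through explicit streams):
originalData.entrySet().stream().forEach(entry ->
    entry.getValue().stream().forEach(v -> {
        if (newData.get(v) == null) {
            List<String> t = new ArrayList<>();
            t.add(entry.getKey());
            newData.put(v, t);
        } else {
            newData.get(v).add(entry.getKey());
        }
    }));
Separately, newData.computeIfAbsent(v, k -> new ArrayList<>()).add(entry.getKey()) would shorten the body, but that is a Map method rather than a stream operation.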

What is the use of Tuple.getStringByField("ABC") in Storm?

I am not able to understand the use of Tuple.getStringByField("ABC") in Apache Storm.
The following is the code:
public void execute(Tuple input) {
    try {
        if (input.getSourceStreamId().equals("signals")) {
            str = input.getStringByField("action");
            if ("refresh".equals(str)) {
                // ....
            }
        }
    } // ...
}
What exactly is input.getStringByField("action") doing here?
Thank you.
In Storm, both spouts and bolts emit tuples. The question is what is contained in each tuple. Each spout and bolt can use the method below to define the tuple schema:
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
    // tell Storm the schema of the output tuple:
    // the tuple consists of columns called 'mycolumn1' and 'mycolumn2'
    outputFieldsDeclarer.declare(new Fields("mycolumn1", "mycolumn2"));
}
A subsequent bolt can then use getStringByField("mycolumn1") to retrieve the value based on the column name.
getStringByField() is like getString(), except it looks the field up by its name instead of its position.
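As a small, hypothetical end-to-end illustration (the stream name "signals" and field name "action" mirror the question; Fields and Values are org.apache.storm.tuple.* classes, or backtype.storm.tuple.* on older Storm versions):
// Upstream component: declare the schema of the tuples it emits on the "signals" stream.
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("signals", new Fields("action"));
}

// ...and somewhere in its execute()/nextTuple(), emit a value for each declared field:
collector.emit("signals", new Values("refresh"));

// Downstream bolt: look the value up by field name instead of by position.
@Override
public void execute(Tuple input) {
    if (input.getSourceStreamId().equals("signals")) {
        String action = input.getStringByField("action"); // equivalent to input.getString(0) here
        if ("refresh".equals(action)) {
            // handle the refresh signal
        }
    }
}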

Including YAML files with SnakeYAML

I would like to have YAML files with an include, similar to this question, but with SnakeYAML:
How can I include an YAML file inside another?
For example:
%YAML 1.2
---
!include "load.yml"
!include "load2.yml"
I am having a lot of trouble with it. I have the Constructor defined, and I can make it import one document, but not two. The error I get is:
Exception in thread "main" expected '<document start>', but found Tag
in 'reader', line 5, column 1:
!include "load2.yml"
^
With one include, SnakeYAML is happy: it finds an EOF and processes the import. With two, it is not happy (see the error above).
My Java source is:
package yaml;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.constructor.AbstractConstruct;
import org.yaml.snakeyaml.constructor.Constructor;
import org.yaml.snakeyaml.nodes.Node;
import org.yaml.snakeyaml.nodes.ScalarNode;
import org.yaml.snakeyaml.nodes.Tag;

public class Main {

    final static Constructor constructor = new MyConstructor();

    private static class ImportConstruct extends AbstractConstruct {
        @Override
        public Object construct(Node node) {
            if (!(node instanceof ScalarNode)) {
                throw new IllegalArgumentException("Non-scalar !import: " + node.toString());
            }
            final ScalarNode scalarNode = (ScalarNode) node;
            final String value = scalarNode.getValue();
            File file = new File("src/imports/" + value);
            if (!file.exists()) {
                return null;
            }
            try {
                final InputStream input = new FileInputStream(new File("src/imports/" + value));
                final Yaml yaml = new Yaml(constructor);
                return yaml.loadAll(input);
            } catch (FileNotFoundException ex) {
                ex.printStackTrace();
            }
            return null;
        }
    }

    private static class MyConstructor extends Constructor {
        public MyConstructor() {
            yamlConstructors.put(new Tag("!include"), new ImportConstruct());
        }
    }

    public static void main(String[] args) {
        try {
            final InputStream input = new FileInputStream(new File("src/imports/example.yml"));
            final Yaml yaml = new Yaml(constructor);
            Object object = yaml.load(input);
            System.out.println("Loaded");
        } catch (FileNotFoundException ex) {
            ex.printStackTrace();
        } finally {
        }
    }
}
The question is: has anybody done a similar thing with SnakeYAML? Any thoughts as to what I might be doing wrong?
I see two issues:
final InputStream input = new FileInputStream(new File("src/imports/" + value));
final Yaml yaml = new Yaml(constructor);
return yaml.loadAll(input);
You should be using yaml.load(input), not yaml.loadAll(input). The loadAll() method returns multiple objects, but the construct() method expects to return a single object.
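With that change, the try block in construct() would read (only the load call differs from the question's code):
try {
    final InputStream input = new FileInputStream(new File("src/imports/" + value));
    final Yaml yaml = new Yaml(constructor);
    return yaml.load(input); // load() returns the single object that construct() is expected to return
} catch (FileNotFoundException ex) {
    ex.printStackTrace();
}
return null;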
The other issue is that you may have some inconsistent expectations with the way that the YAML processing pipeline works:
If you think that your !include works like in C where the preprocessor sticks in the contents of the included file, the way to implement it would be to handle it in the Presentation stage (parsing) or Serialization stage (composing). But you have implemented it in the Representation stage (constructing), so !include returns an object, and the structure of your YAML file must be consistent with this.
Let's say that you have the following files:
test1a.yaml
activity: "herding cats"
test1b.yaml
33
test1.yaml
favorites: !include test1a.yaml
age: !include test1b.yaml
This would work ok, and would be equivalent to
favorites:
activity: "herding cats"
age: 33
But the following file would not work:
!include test1a.yaml
!include test1b.yaml
because there is nothing to say how to combine the two values in a larger hierarchy. You'd need to do this if you want an array:
- !include test1a.yaml
- !include test1b.yaml
or, again, handle this custom logic in an earlier stage such as parsing or composing.
Alternatively, you need to tell the YAML library that you are starting a 2nd document (which is what the error is complaining about: expected '<document start>') since YAML supports multiple "documents" (top-level values) in a single .yaml file.
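If you really do want several top-level documents in one file (separated by --- lines), the loading side also has to change: iterate over yaml.loadAll() instead of calling yaml.load(). A minimal sketch based on the question's main() (adding java.io.IOException to the imports):
final Yaml yaml = new Yaml(constructor);
try (InputStream input = new FileInputStream(new File("src/imports/example.yml"))) {
    // loadAll() lazily yields one object per "---"-separated document
    for (Object document : yaml.loadAll(input)) {
        System.out.println("Loaded: " + document);
    }
} catch (IOException ex) {
    ex.printStackTrace();
}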

Hadoop MultipleOutputFormat.generateFileNameForKeyValue with many keys

I am trying to play with MultipleOutputFormat.generateFileNameForKeyValue().
The idea is to create a directory for each of my keys.
This is the code:
static class MyMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        String[] arr = key.toString().split("_");
        return arr[0] + "/" + name;
    }
}
This code works only if the emitted records are few. If I run it against my real input, it just hangs at around 70% of the reduce phase.
What might be the problem here: why does it work with a small number of keys but not with many?

Custom Partitioner: N keys to N different files

My requirement is to write a custom partitioner. I have N keys coming from the mapper, for example ('jsa', 'msa', 'jbac'). The length is not fixed; it can be any word, in fact. My requirement is to write the custom partitioner in such a way that it collects all the data for the same key into the same file. The number of keys is not fixed. Thank you in advance.
Thanks,
Sathish.
So you have multiple keys that the mapper is outputting, and you want a different reducer for each key and a separate file for each key.
First, writing a Partitioner is one way to achieve that. By default Hadoop applies its own internal logic to the keys and, depending on that, calls the reducers. So if you want to write a custom partitioner, you have to override that default behaviour with your own logic/algorithm. Unless you know exactly how your keys will vary, this logic won't be generic; you have to work out the logic based on those variations.
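For reference, Hadoop's default Partitioner (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner) is essentially just a hash of the key modulo the number of reducers, roughly:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the result is always in [0, numReduceTasks)
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}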
I am providing a sample example below that you can refer to, but it is not generic.
public class CustomPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        if (key.toString().contains("Key1")) {
            return 1;
        } else if (key.toString().contains("Key2")) {
            return 2;
        } else if (key.toString().contains("Key3")) {
            return 3;
        } else if (key.toString().contains("Key4")) {
            return 4;
        } else if (key.toString().contains("Key5")) {
            return 5;
        } else {
            return 7;
        }
    }
}
This should solve your problem. Just replace Key1, Key2, etc. with your key names.
In case you don't know the key names, you can write your own logic along the following lines:
public class CustomPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        return (key.toString().charAt(0)) % numReduceTasks;
    }
}
The partitioner above is just to illustrate how you can write your own logic: if you derive a number from the key (here, its first character) and take it modulo the number of reducers, you get a number between 0 and the number of reducers minus one, so different reducers get called and write their output to different files. With this approach, though, you have to make sure that two different keys are not assigned the same partition value.
That covers the customized partitioner.
Another solution is to override the MultipleOutputFormat class methods, which lets you do the job in a generic way. Using this approach you will also be able to generate customized file names for the reducer output files in HDFS; a sketch follows after the note below.
NOTE: Make sure you use the same set of libraries. Don't mix the mapred and mapreduce libraries: org.apache.hadoop.mapred is the older API and org.apache.hadoop.mapreduce is the new one.
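As a rough sketch of that MultipleOutputFormat approach (the class name KeyBasedOutput is made up; it uses the older org.apache.hadoop.mapred API, which is where MultipleTextOutputFormat lives):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sketch only: routes each record to <key>/<original part file name> under the
// job's output directory. Register it on an old-API JobConf with
// jobConf.setOutputFormat(KeyBasedOutput.class).
public class KeyBasedOutput extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
    }
}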
Hope this will help.
I imagine the best way to do this, since it will give a more even split, would be:
public class CustomPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // mask the sign bit so the partition number is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
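To wire a partitioner like this into a job (using the new org.apache.hadoop.mapreduce API), the driver would contain something like:
Job job = Job.getInstance(new Configuration(), "partition by key");
job.setPartitionerClass(CustomPartitioner.class);
job.setNumReduceTasks(10); // one output file per reducer; pick a count that suits your key space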
