Distributed Cache Files in Hadoop

I want to attach different files to different reducers. Is this possible using the distributed cache in Hadoop?
I am able to attach the same file (or files) to all the reducers, but due to memory constraints I want to know whether I can attach different files to different reducers.
Forgive me if this is an ignorant question.
Please help!
Thanks in advance!

It may also be worth trying an in-memory compute/data grid technology like GridGain, Infinispan, etc. That way you can load your data into memory and there are no limits on how you map your computational jobs (map/reduce) onto the data, using data affinity.

It is a strange desire, since a reducer is not bound to a particular node; during execution a reducer can run on any node, or even on several nodes (if there is a failure or speculative execution). Therefore all reducers should be homogeneous; the only thing that distinguishes them is the data they process.
So I suppose that when you say you want to put different files on different reducers, you actually want to put different files on each reducer, where those files correspond to the data (keys) that reducer will be processing.
The only way I know to do this is to put your data on HDFS and read it from the reducer when it starts processing data.

package com.a;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class PrefixNew4Reduce4 extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    // @SuppressWarnings("unchecked")
    // lines read from the side file on HDFS
    ArrayList<String> al = new ArrayList<String>();

    public void configure(JobConf conf4) {
        /* src (hdfs file), something like hdfs://127.0.0.1:8020/home/users/mlakshm/haship */
        String from = "home/users/mlakshm/haship";
        FileSystem fs = null;
        try {
            fs = FileSystem.get(new URI(from), conf4);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (URISyntaxException e) {
            e.printStackTrace();
        }
        try {
            FSDataInputStream src = fs.open(new Path(from));
            // read every line of the side file into memory
            String val;
            while ((val = src.readLine()) != null) {
                al.add(val);
                System.out.println("val:----------------->" + val);
            }
            src.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // the key is expected to carry three whitespace-separated tokens: t, i and j
        StringTokenizer stk = new StringTokenizer(key.toString());
        String t = stk.nextToken();
        String i = stk.nextToken();
        String j = stk.nextToken();
        // emit every cached line that matches either the second or the third token
        for (int k = 0; k < al.size(); k++) {
            if (al.get(k).equals(i) || al.get(k).equals(j)) {
                output.collect(key, new Text(al.get(k)));
            }
        }
        // then copy the incoming values and emit them as well
        ArrayList<String> al1 = new ArrayList<String>();
        while (values.hasNext()) {
            al1.add(values.next().toString());
        }
        for (int k = 0; k < al1.size(); k++) {
            output.collect(key, new Text(al1.get(k)));
        }
    }
}

Related

Extracting information from XML-Files by URL-Source takes a lot of time

I would like to extract specific information from 108 XML files. The general source is also an XML file, with further URLs as resources.
XML-Source
The static method getURL() extracts the URLs in order to set them as URL paths within a for loop in the main method. The program works, but it takes approximately 5 minutes to get the data from all files. Any ideas on how to increase the performance?
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.Namespace;
import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;
public class XmlReader2 {
public static void main(String[] args) throws IOException {
for (int i = 0; i < getURL().size(); i++) {
URL url = new URL(getURL().get(i));
try {
Document doc = new SAXBuilder().build(url);
final String getDeath = String
.format("//ns:teiHeader/ns:profileDesc/ns:particDesc/ns:listPerson/ns:person/ns:death");
XPathExpression<Element> xpath = XPathFactory.instance().compile(getDeath, Filters.element(), null,
Namespace.getNamespace("ns", "http://www.tei-c.org/ns/1.0"));
String test;
for (Element elem : xpath.evaluate(doc)) {
test = elem.getValue();
if (elem.getAttributes().size() != 0) {
test = elem.getAttributes().get(0).getValue();
}
System.out.println(elem.getName() + ": " + test);
}
} catch (org.jdom2.JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public static List<String> getURL() throws IOException {
List<String> urlList = new ArrayList<>();
URL urlSource = new URL("http://www.steinheim-institut.de:80/cgi-bin/epidat?info=resources-mz1");
try {
Document doc = new SAXBuilder().build(urlSource);
final String getURL = String.format("/collection");
XPathExpression<Element> xpath = XPathFactory.instance().compile(getURL, Filters.element());
int i = 0;
for (Element elem : xpath.evaluate(doc)) {
while (i != elem.getChildren().size()) {
String url = elem.getChildren().get(i).getAttributes().get(1).getValue();
// System.out.println(url);
urlList.add(url);
i++;
}
}
} catch (org.jdom2.JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return urlList;
}
}
A delay of this order may be caused by fetching files from the web. Find a tool for monitoring HTTP requests issued from your machine to see what is going on. Look in particular for requests for common W3C files such as the XHTML DTD: because these files are requested so often, W3C deliberately injects a delay into the process to encourage people to use local copies of the files. If it turns out that this is the problem, there are various techniques you can use to access cached local copies.
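If the W3C delay turns out to be the problem, one such technique is to plug an EntityResolver into the SAXBuilder so that well-known DTDs and schemas are served from local copies instead of being fetched over HTTP on every parse. The following is only a minimal sketch, assuming the files have been bundled on the classpath under /dtds/ (that path and the resolver class name are illustrative, not part of your code):
import org.jdom2.input.SAXBuilder;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
public class LocalDtdResolver implements EntityResolver {
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // serve frequently requested W3C files from a bundled copy instead of www.w3.org
        if (systemId != null && systemId.startsWith("http://www.w3.org/")) {
            String name = systemId.substring(systemId.lastIndexOf('/') + 1);
            return new InputSource(LocalDtdResolver.class.getResourceAsStream("/dtds/" + name));
        }
        return null; // everything else falls back to normal resolution
    }
    public static SAXBuilder newBuilder() {
        SAXBuilder builder = new SAXBuilder();
        builder.setEntityResolver(new LocalDtdResolver());
        return builder;
    }
}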
Having said that, I'm puzzled by the logic of your code. The method getURL() appears to fetch and parse the document at http://www.steinheim-institut.de:80/cgi-bin/epidat?info=resources-mz1 every time it is called, and yet you are calling it within a loop, even using getURL().size() as your terminating condition.
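Independently of the DTD question, hoisting that call out of the loop avoids re-downloading and re-parsing the resource list on every iteration. A minimal sketch of the restructured main method, reusing the posted getURL() and a single SAXBuilder (the XPath evaluation would stay exactly as in your code):
import java.io.IOException;
import java.net.URL;
import java.util.List;
import org.jdom2.Document;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;
public class XmlReaderOnce {
    public static void main(String[] args) throws IOException, JDOMException {
        List<String> urls = XmlReader2.getURL(); // one HTTP request, one parse
        SAXBuilder builder = new SAXBuilder();   // reuse the builder as well
        for (String u : urls) {
            Document doc = builder.build(new URL(u));
            // ... evaluate the XPath expression against doc as before ...
        }
    }
}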

Mapreduce with HCATALOG integration with oozie in MAPR

I have written a MapReduce program that reads data from a Hive table using HCatalog and writes it into HBase. This is a map-only job with no reducers. I have run the program from the command line and it works as expected (I created a fat jar to avoid jar issues). I wanted to integrate it with Oozie (with the help of Hue). I have two options to run it:
Use Mapreduce Action
Use Java Action
My MapReduce program has a driver method that holds the code below:
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.schema.HCatSchema;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
public class HBaseValdiateInsertDriver {
public static void main(String[] args) throws Exception {
String dbName = "Test";
String tableName = "emp";
Configuration conf = new Configuration();
args = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "HBase Get Put Demo");
job.setInputFormatClass(HCatInputFormat.class);
HCatInputFormat.setInput(job, dbName, tableName, null);
job.setJarByClass(HBaseValdiateInsertDriver.class);
job.setMapperClass(HBaseValdiateInsert.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path("maprfs:///user/input"));
FileOutputFormat.setOutputPath(job, new Path("maprfs:///user/output"));
job.waitForCompletion(true);
}
}
How do I specify the driver method in Oozie? All I can see is how to specify the mapper and reducer classes. Can someone guide me on how to set the properties?
Using a Java action I can specify my driver class as the main class and get it executed, but I face errors like "table not found", "HCatalog jars not found", etc. I have included hive-site.xml in the workflow (using Hue), but I feel the system is not able to pick up the properties. Can someone advise me on what I have to take care of? Are there any other configuration properties that I need to include?
Also, the sample program I referred to on the Cloudera website uses
HCatInputFormat.setInput(job, InputJobInfo.create(dbName,
inputTableName, null));
whereas I use the below (I don't see a method that accepts the above input):
HCatInputFormat.setInput(job, dbName, tableName, null);
Below is my mapper code:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hive.hcatalog.data.HCatRecord;
public class HBaseValdiateInsert extends Mapper<WritableComparable, HCatRecord, Text, Text> {
static HTableInterface table;
static HTableInterface inserted;
private String hbaseDate = null;
String existigValue=null;
List<Put> putList = new ArrayList<Put>();
@Override
public void setup(Context context) throws IOException {
Configuration conf = context.getConfiguration();
String tablename = "dev_arch186";
Utils.getHBConnection();
table = Utils.getTable(tablename);
table.setAutoFlushTo(false);
}
@Override
public void cleanup(Context context) {
try {
table.put(putList);
table.flushCommits();
table.close();
} catch (IOException e) {
e.printStackTrace();
}
Utils.closeConnection();
}
@Override
public void map(WritableComparable key, HCatRecord value, Context context) throws IOException, InterruptedException {
String name_hive = (String) value.get(0);
String id_hive = (String) value.get(1);
String rec[] = {name_hive, id_hive}; // was split from an undefined variable; use the fields read from the HCatRecord above
Get g = new Get(Bytes.toBytes(name_hive));
existigValue=getOneRecord(Bytes.toBytes("Info"),Bytes.toBytes("name"),name_hive);
if (existigValue.equalsIgnoreCase("NA") || !existigValue.equalsIgnoreCase(id_hive)) {
Put put = new Put(Bytes.toBytes(rec[0]));
put.add(Bytes.toBytes("Info"),
Bytes.toBytes("name"),
Bytes.toBytes(rec[1]));
put.setDurability(Durability.SKIP_WAL);
putList.add(put);
if(putList.size()>25000){
table.put(putList);
table.flushCommits();
}
}
}
public String getOneRecord(byte[] columnFamily, byte[] columnQualifier, String rowKey)
throws IOException {
Get get = new Get(rowKey.getBytes());
get.setMaxVersions(1);
Result rs = table.get(get);
rs.getColumn(columnFamily, columnQualifier);
System.out.println(rs.containsColumn(columnFamily, columnQualifier));
KeyValue result = rs.getColumnLatest(columnFamily,columnQualifier);
if (rs.containsColumn(columnFamily, columnQualifier))
return (Bytes.toString(result.getValue()));
else
return "NA";
}
public boolean columnQualifierExists(String tableName, String ColumnFamily,
String ColumnQualifier, String rowKey) throws IOException {
Get get = new Get(rowKey.getBytes());
Result rs = table.get(get);
return(rs.containsColumn(ColumnFamily.getBytes(),ColumnQualifier.getBytes()));
}
}
Note:
I use a MapR (M3) cluster with Hue as the interface for Oozie.
Hive Version : 1-0
HCAT Version: 1-0
I couldn't find any way to initialize HCatInputFormat from Oozie mapreduce action.
But I have a workaround as below.
Created LazyHCatInputFormat by extending HCatInputFormat.
Override the getJobInfo method to handle the initialization. This will be called as part of the getSplits(..) call.
private static void lazyInit(Configuration conf){
try{
if(conf==null){
conf = new Configuration(false);
}
conf.addResource(new Path(System.getProperty("oozie.action.conf.xml")));
conf.addResource(new org.apache.hadoop.fs.Path("hive-config.xml"));
String databaseName = conf.get("LazyHCatInputFormat.databaseName");
String tableName = conf.get("LazyHCatInputFormat.tableName");
String partitionFilter = conf.get("LazyHCatInputFormat.partitionFilter");
setInput(conf, databaseName, tableName);
//setFilter(partitionFilter);
//System.out.println("After lazyinit : "+conf.get("mapreduce.lib.hcat.job.info"));
}catch(Exception e){
System.out.println("*** LAZY INIT FAILED ***");
//e.printStackTrace();
}
}
public static InputJobInfo getJobInfo(Configuration conf)
throws IOException {
String jobString = conf.get("mapreduce.lib.hcat.job.info");
if (jobString == null) {
lazyInit(conf);
jobString = conf.get("mapreduce.lib.hcat.job.info");
if(jobString == null){
throw new IOException("job information not found in JobContext. HCatInputFormat.setInput() not called?");
}
}
return (InputJobInfo) HCatUtil.deserialize(jobString);
}
In the Oozie map-reduce action, configure it as below:
<property>
<name>mapreduce.job.inputformat.class</name>
<value>com.xyz.LazyHCatInputFormat</value>
</property>
<property>
<name>LazyHCatInputFormat.databaseName</name>
<value>HCAT DatabaseNameHere</value>
</property>
<property>
<name>LazyHCatInputFormat.tableName</name>
<value>HCAT TableNameHere</value>
</property>
This might not be the best implementation, but a quick hack to make it work.

Hadoop Distributed Cache via Generic Options -files

While I was going through the book Hadoop in Action, there was an option which states that rather than adding small files to the distributed cache programmatically, this can be done using the -files generic option.
When I tried this in the setup() of my code I get a FileNotFoundException at fs.open(), and it shows me a path which I am not sure about.
The question is:
If I use the -files generic option, where in HDFS does the file get copied to by default?
The code I am trying to execute is below.
public class JoinMapSide2 extends Configured implements Tool{
/* Program : JoinMapSide2.java
Description : Passing the small file via GenericOptionsParser
hadoop jar JoinMapSide2.jar -files orders.txt .........
Input : /data/patent/orders.txt(local file system), /data/patent/customers.txt
Output : /MROut/JoinMapSide2
Date : 23/03/2015
*/
protected static class MapClass extends Mapper <Text,Text,NullWritable,Text>{
// hash table to store the key+value from the distributed file or the background data
private Hashtable <String, String> joinData = new Hashtable <String, String>();
// setup function for filling up the joinData for each each map() call
protected void setup(Context context) throws IOException, InterruptedException {
String line;
String[] tokens;
FileSystem fs;
FSDataInputStream fdis;
LineReader joinReader;
Configuration conf;
Text buffer = new Text();
// get configuration
conf = context.getConfiguration();
// get file system related to the configuration
fs = FileSystem.get(conf);
// get all the local cache files distributed as part of the job
URI[] localFiles = context.getCacheFiles();
System.out.println("Cache File Path:"+localFiles[0].toString());
// check if there are any distributed files
// in our case we are sure we will always one so use that only
if (localFiles.length > 0){
// since the file is now on HDFS FSDataInputStream to read through the file
fdis = fs.open(new Path(localFiles[0].toString()));
joinReader = new LineReader(fdis);
// read local file until EOF
try {
while (joinReader.readLine(buffer) > 0) {
line = buffer.toString();
// apply the split pattern only once
tokens = line.split(",",2);
// add key+value into the Hashtable
joinData.put(tokens[0], tokens[1]);
}
} finally {
joinReader.close();
fdis.close();
}
}
else{
System.err.println("No Cache Files are distributed");
}
}
// map function
protected void map(Text key,Text value, Context context) throws IOException, InterruptedException{
NullWritable kNull = null;
String joinValue = joinData.get(key.toString());
if (joinValue != null){
context.write(kNull, new Text(key.toString() + "," + value.toString() + "," + joinValue));
}
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length < 2){
System.err.println("Usage JoinMapSide -files <smallFile> <inputFile> <outputFile>");
}
Path inFile = new Path(args[0]); // input file(customers.txt)
Path outFile = new Path(args[1]); // output file file
Configuration conf = getConf();
// delimiter for the input file
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = Job.getInstance(conf, "Map Side Join2");
// this is not used as the small file is distributed to all the nodes in the cluster using
// generic options parser
// job.addCacheFile(disFile.toUri());
FileInputFormat.addInputPath(job, inFile);
FileOutputFormat.setOutputPath(job, outFile);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setJarByClass(JoinMapSide2.class);
job.setMapperClass(MapClass.class);
job.setNumReduceTasks(0);
job.waitForCompletion(true);
return 0;
}
public static void main(String args[]) throws Exception {
int ret = ToolRunner.run(new Configuration(), new JoinMapSide2(), args);
System.exit(ret);
}
}
This is the exception I see in the trace:
Error: java.io.FileNotFoundException: File does not exist: /tmp/hadoop-yarn/staging/shiva/.staging/job_1427126201553_0003/files/orders.txt#orders.txt
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:64)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
I start the job like this:
hadoop jar JoinMapSide2.jar -files orders.txt /data/patent/join/customers.txt /MROut/JoinMapSide2
Any directions would be really helpful. Thanks
First you need to move your orders.txt to HDFS, and then you have to use -files.
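If you would rather do this from the driver instead of the command line, the commented-out addCacheFile call in the question is the programmatic equivalent. A minimal sketch (the HDFS path here is illustrative, not your actual path); the part after # is the symlink name created in each task's working directory, so the mapper can simply open ./orders.txt:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
public class CacheFileSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed cache sketch");
        // add an HDFS file to the distributed cache with a "#name" fragment for the local symlink
        job.addCacheFile(new URI("hdfs:///user/example/orders.txt#orders.txt"));
        // ... set mapper, input/output formats and paths, then submit the job ...
    }
}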
Okay, after some searching around I did find out that there are 2 errors in my code above.
I should not be using FSDataInputStream to read the distributed file, since it is local to the node running the mapper; I should be using File.
I should not be using URI.toString(); instead I should use the symbolic link added for my file, which is just orders.txt.
I have the corrected code listed below; hope it helps.
package org.samples.hina.training;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.Hashtable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class JoinMapSide2 extends Configured implements Tool{
/* Program : JoinMapSide2.java
Description : To learn Replicated Join using Distributed Cache via Generic Options -files
Input : file:/patent/join/orders1.txt(distributed to all nodes), /data/patent/customers.txt
Output : /MROut/JoinMapSide2
Date : 24/03/2015
*/
protected static class MapClass extends Mapper <Text,Text,NullWritable,Text>{
// hash table to store the key+value from the distributed file or the background data
private Hashtable <String, String> joinData = new Hashtable <String, String>();
// setup function for filling up the joinData for each each map() call
protected void setup(Context context) throws IOException, InterruptedException {
String line;
String[] tokens;
// get all the cache files set in the configuration set in addCacheFile()
URI[] localFiles = context.getCacheFiles();
System.out.println("File1:"+localFiles[0].toString());
// check if there are any distributed files
// in our case we are sure we will always one so use that only
if (localFiles.length > 0){
// read from LOCAL copy
File localFile1 = new File("./orders1.txt");
// created reader to localFile1
BufferedReader joinReader = new BufferedReader(new FileReader(localFile1));
// read local file until EOF
try {
while ((line = joinReader.readLine()) != null){
// apply the split pattern only once
tokens = line.split(",",2);
// add key+value into the Hashtable
joinData.put(tokens[0], tokens[1]);
}
} finally {
joinReader.close();
}
} else{
System.err.println("Local Cache File does not exist");
}
}
// map function
protected void map(Text key,Text value, Context context) throws IOException, InterruptedException{
NullWritable kNull = null;
String joinValue = joinData.get(key.toString());
if (joinValue != null){
context.write(kNull, new Text(key.toString() + "," + value.toString() + "," + joinValue));
}
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length < 2){
System.err.println("Usage JoinMapSide2 <inputFile> <outputFile>");
}
Path inFile = new Path(args[0]); // input file(customers.txt)
Path outFile = new Path(args[1]); // output file file
Configuration conf = getConf();
// delimiter for the input file
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = Job.getInstance(conf, "Map Side Join2");
// add the files orders1.txt, orders2.txt to distributed cache
// the files added by the Generic Options -files
//job.addCacheFile(disFile1);
FileInputFormat.addInputPath(job, inFile);
FileOutputFormat.setOutputPath(job, outFile);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setJarByClass(JoinMapSide2.class);
job.setMapperClass(MapClass.class);
job.setNumReduceTasks(0);
job.waitForCompletion(true);
return 0;
}
public static void main(String args[]) throws Exception {
int ret = ToolRunner.run(new Configuration(), new JoinMapSide2(), args);
System.exit(ret);
}
}

How do you use ORCFile Input/Output format in MapReduce?

I need to implement a custom I/O format based on ORCFile I/O format. How do I go about it?
Specifically I would need a way to include the ORCFile library in my source code (which is a custom Pig implementation) and use the ORCFile Output format to write data, and later use the ORCFile Input format to read back the data.
You need to create your own subclass of InputFormat (or of FileInputFormat, depending on the nature of your files).
Just google for "Hadoop InputFormat" and you will find plenty of articles and tutorials on how to create your own InputFormat class.
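For illustration, here is a minimal, hypothetical skeleton of such a subclass (class name and key/value types are placeholders, and the record reader below emits no records at all); a real implementation for ORC would return a reader that wraps the ORC/Hive reading machinery instead:
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class MyCustomInputFormat extends FileInputFormat<NullWritable, Text> {
    @Override
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // a real format would decode your file layout here; this placeholder reports no records
        return new RecordReader<NullWritable, Text>() {
            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) { }
            @Override
            public boolean nextKeyValue() { return false; }
            @Override
            public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override
            public Text getCurrentValue() { return new Text(); }
            @Override
            public float getProgress() { return 0f; }
            @Override
            public void close() { }
        };
    }
}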
You can use the HCatalog library to read and write ORC files in MapReduce.
I just wrote some sample code here. Hope it helps.
Sample mapper code:
public static class MyMapper<K extends WritableComparable, V extends Writable>
extends MapReduceBase implements Mapper<K, OrcStruct, Text, IntWritable> {
private StructObjectInspector oip;
private final OrcSerde serde = new OrcSerde();
public void configure(JobConf job) {
Properties table = new Properties();
table.setProperty("columns", "a,b,c");
table.setProperty("columns.types", "int,string,struct<d:int,e:string>");
serde.initialize(job, table);
try {
oip = (StructObjectInspector) serde.getObjectInspector();
} catch (SerDeException e) {
e.printStackTrace();
}
}
public void map(K key, OrcStruct val,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
System.out.println(val);
List<? extends StructField> fields =oip.getAllStructFieldRefs();
StringObjectInspector bInspector =
(StringObjectInspector) fields.get(B_ID).getFieldObjectInspector();
String b = "garbage";
try {
b = bInspector.getPrimitiveJavaObject(oip.getStructFieldData(serde.deserialize(val), fields.get(B_ID)));
} catch (SerDeException e1) {
e1.printStackTrace();
}
OrcStruct struct = null;
try {
struct = (OrcStruct) oip.getStructFieldData(serde.deserialize(val),fields.get(C_ID));
} catch (SerDeException e1) {
e1.printStackTrace();
}
StructObjectInspector cInspector = (StructObjectInspector) fields.get(C_ID).getFieldObjectInspector();
int d = ((IntWritable) cInspector.getStructFieldData(struct, fields.get(D_ID))).get();
String e = cInspector.getStructFieldData(struct, fields.get(E_ID)).toString();
output.collect(new Text(b), new IntWritable(1));
output.collect(new Text(e), new IntWritable(1));
}
}
Launcher code:
JobConf job = new JobConf(new Configuration(), OrcReader.class);
// Specify various job-specific parameters
job.setJobName("myjob");
job.set("mapreduce.framework.name","local");
job.set("fs.default.name","file:///");
job.set("log4j.logger.org.apache.hadoop","INFO");
job.set("log4j.logger.org.apache.hadoop","INFO");
//push down projection columns
job.set("hive.io.file.readcolumn.ids","1,2");
job.set("hive.io.file.read.all.columns","false");
job.set("hive.io.file.readcolumn.names","b,c");
FileInputFormat.setInputPaths(job, new Path("./src/main/resources/000000_0.orc"));
FileOutputFormat.setOutputPath(job, new Path("./target/out1"));
job.setMapperClass(OrcReader.MyMapper.class);
job.setCombinerClass(OrcReader.MyReducer.class);
job.setReducerClass(OrcReader.MyReducer.class);
job.setInputFormat(OrcInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
JobClient.runJob(job);

Broadcasting using the protocol Zab in ZooKeeper

Good morning,
I am new to ZooKeeper and its protocols, and I am interested in its broadcast protocol Zab.
Could you provide me with simple Java code that uses the Zab protocol of ZooKeeper? I have been searching for that, but I did not succeed in finding code that shows how I can use Zab.
In fact what I need is simple: I have MapReduce code and I want all the mappers to update a variable (let's say X) whenever they succeed in finding a better value of X (i.e. a bigger value). In this case, the leader has to compare the old value and the new value and then broadcast the actual best value to all mappers. How can I do such a thing in Java?
Thanks in advance,
Regards
You don't need to use the Zab protocol. Instead you may follow the steps below:
You have a znode, say /bigvalue, on ZooKeeper. All the mappers, when they start, read the value stored in it. They also set a watch for data changes on the znode. Whenever a mapper gets a better value, it updates the znode with that value. All the mappers get a notification for the data change event, read the new best value, and re-establish the watch for data changes again. That way they stay in sync with the latest best value and can update it whenever they find a better one.
zkclient is actually a very good library for working with ZooKeeper, and it hides a lot of the complexity ( https://github.com/sgroschupf/zkclient ). Below is an example that demonstrates how you may watch a znode "/bigvalue" for any data change.
package geet.org;
import java.io.UnsupportedEncodingException;
import org.I0Itec.zkclient.IZkDataListener;
import org.I0Itec.zkclient.ZkClient;
import org.I0Itec.zkclient.exception.ZkMarshallingError;
import org.I0Itec.zkclient.exception.ZkNodeExistsException;
import org.I0Itec.zkclient.serialize.ZkSerializer;
import org.apache.zookeeper.data.Stat;
public class ZkExample implements IZkDataListener, ZkSerializer {
public static void main(String[] args) {
String znode = "/bigvalue";
ZkExample ins = new ZkExample();
ZkClient cl = new ZkClient("127.0.0.1", 30000, 30000,
ins);
try {
cl.createPersistent(znode);
} catch (ZkNodeExistsException e) {
System.out.println(e.getMessage());
}
// Change the data for fun
Stat stat = new Stat();
String data = cl.readData(znode, stat);
System.out.println("Current data " + data + "version = " + stat.getVersion());
cl.writeData(znode, "My new data ", stat.getVersion());
cl.subscribeDataChanges(znode, ins);
try {
Thread.sleep(36000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
@Override
public void handleDataChange(String dataPath, Object data) throws Exception {
System.out.println("Detected data change");
System.out.println("New data for " + dataPath + " " + (String)data);
}
@Override
public void handleDataDeleted(String dataPath) throws Exception {
System.out.println("Data deleted " + dataPath);
}
@Override
public byte[] serialize(Object data) throws ZkMarshallingError {
if (data instanceof String){
try {
return ((String) data).getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
}
return null;
}
@Override
public Object deserialize(byte[] bytes) throws ZkMarshallingError {
try {
return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
return null;
}
}
