Hadoop Distributed Cache via Generic Options -files - hadoop

While I was going through the book Hadoop in Action, I came across an option which states that rather than adding small files to the distributed cache programmatically, this can be done using the -files generic option.
When I tried this in the setup() of my code I get a FileNotFoundException at fs.open(), and it shows me a path I am not sure about.
The question is:
If I use the -files generic option, where in HDFS does the file get copied to by default?
The code I am trying to execute is below.
public class JoinMapSide2 extends Configured implements Tool{
/* Program : JoinMapSide2.java
Description : Passing the small file via GenericOptionsParser
hadoop jar JoinMapSide2.jar -files orders.txt .........
Input : /data/patent/orders.txt(local file system), /data/patent/customers.txt
Output : /MROut/JoinMapSide2
Date : 23/03/2015
*/
protected static class MapClass extends Mapper <Text,Text,NullWritable,Text>{
// hash table to store the key+value from the distributed file or the background data
private Hashtable <String, String> joinData = new Hashtable <String, String>();
// setup function for filling up the joinData used for each map() call
protected void setup(Context context) throws IOException, InterruptedException {
String line;
String[] tokens;
FileSystem fs;
FSDataInputStream fdis;
LineReader joinReader;
Configuration conf;
Text buffer = new Text();
// get configuration
conf = context.getConfiguration();
// get file system related to the configuration
fs = FileSystem.get(conf);
// get all the local cache files distributed as part of the job
URI[] localFiles = context.getCacheFiles();
System.out.println("Cache File Path:"+localFiles[0].toString());
// check if there are any distributed files
// in our case we are sure we will always have one, so use only that
if (localFiles.length > 0){
// since the file is now on HDFS, use FSDataInputStream to read through the file
fdis = fs.open(new Path(localFiles[0].toString()));
joinReader = new LineReader(fdis);
// read local file until EOF
try {
while (joinReader.readLine(buffer) > 0) {
line = buffer.toString();
// apply the split pattern only once
tokens = line.split(",",2);
// add key+value into the Hashtable
joinData.put(tokens[0], tokens[1]);
}
} finally {
joinReader.close();
fdis.close();
}
}
else{
System.err.println("No Cache Files are distributed");
}
}
// map function
protected void map(Text key,Text value, Context context) throws IOException, InterruptedException{
NullWritable kNull = null;
String joinValue = joinData.get(key.toString());
if (joinValue != null){
context.write(kNull, new Text(key.toString() + "," + value.toString() + "," + joinValue));
}
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length < 2){
System.err.println("Usage JoinMapSide -files <smallFile> <inputFile> <outputFile>");
}
Path inFile = new Path(args[0]); // input file(customers.txt)
Path outFile = new Path(args[1]); // output file
Configuration conf = getConf();
// delimiter for the input file
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = Job.getInstance(conf, "Map Side Join2");
// this is not used as the small file is distributed to all the nodes in the cluster using
// generic options parser
// job.addCacheFile(disFile.toUri());
FileInputFormat.addInputPath(job, inFile);
FileOutputFormat.setOutputPath(job, outFile);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setJarByClass(JoinMapSide2.class);
job.setMapperClass(MapClass.class);
job.setNumReduceTasks(0);
job.waitForCompletion(true);
return 0;
}
public static void main(String args[]) throws Exception {
int ret = ToolRunner.run(new Configuration(), new JoinMapSide2(), args);
System.exit(ret);
}
}
This is the exception I see in the trace:
Error: java.io.FileNotFoundException: File does not exist: /tmp/hadoop-yarn/staging/shiva/.staging/job_1427126201553_0003/files/orders.txt#orders.txt
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:64)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
I start the job like this:
hadoop jar JoinMapSide2.jar -files orders.txt /data/patent/join/customers.txt /MROut/JoinMapSide2
Any directions would be really helpful. Thanks

First you need to move your orders.txt to HDFS and then you have to use -files.
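For example, a sketch of what this answer suggests (the paths mirror the ones in the question, and the NameNode address is a placeholder for your own):
hdfs dfs -put /data/patent/orders.txt /data/patent/orders.txt
hadoop jar JoinMapSide2.jar -files hdfs://namenode:8020/data/patent/orders.txt /data/patent/join/customers.txt /MROut/JoinMapSide2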

Okay, after some searching around I found out there are two errors in my code above.
I should not be using FSDataInputStream to read the distributed file, since it is local to the node running the mapper; I should be using a plain File instead.
I should not be using URI.toString(); instead I should use the symbolic link added for my file, which is just orders.txt.
The corrected code is listed below; I hope it helps.
package org.samples.hina.training;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.Hashtable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class JoinMapSide2 extends Configured implements Tool{
/* Program : JoinMapSide2.java
Description : To learn Replicated Join using Distributed Cache via Generic Options -files
Input : file:/patent/join/orders1.txt(distributed to all nodes), /data/patent/customers.txt
Output : /MROut/JoinMapSide2
Date : 24/03/2015
*/
protected static class MapClass extends Mapper <Text,Text,NullWritable,Text>{
// hash table to store the key+value from the distributed file or the background data
private Hashtable <String, String> joinData = new Hashtable <String, String>();
// setup function for filling up the joinData used for each map() call
protected void setup(Context context) throws IOException, InterruptedException {
String line;
String[] tokens;
// get all the cache files set in the configuration via addCacheFile() or -files
URI[] localFiles = context.getCacheFiles();
System.out.println("File1:"+localFiles[0].toString());
// check if there are any distributed files
// in our case we are sure we will always have one, so use only that
if (localFiles.length > 0){
// read from LOCAL copy
File localFile1 = new File("./orders1.txt");
// created reader to localFile1
BufferedReader joinReader = new BufferedReader(new FileReader(localFile1));
// read local file until EOF
try {
while ((line = joinReader.readLine()) != null){
// apply the split pattern only once
tokens = line.split(",",2);
// add key+value into the Hashtable
joinData.put(tokens[0], tokens[1]);
}
} finally {
joinReader.close();
}
} else{
System.err.println("Local Cache File does not exist");
}
}
// map function
protected void map(Text key,Text value, Context context) throws IOException, InterruptedException{
NullWritable kNull = null;
String joinValue = joinData.get(key.toString());
if (joinValue != null){
context.write(kNull, new Text(key.toString() + "," + value.toString() + "," + joinValue));
}
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length < 2){
System.err.println("Usage JoinMapSide2 <inputFile> <outputFile>");
}
Path inFile = new Path(args[0]); // input file(customers.txt)
Path outFile = new Path(args[1]); // output file
Configuration conf = getConf();
// delimiter for the input file
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = Job.getInstance(conf, "Map Side Join2");
// the files orders1.txt, orders2.txt are added to the distributed cache
// by the generic option -files, so addCacheFile() is not needed here
//job.addCacheFile(disFile1);
FileInputFormat.addInputPath(job, inFile);
FileOutputFormat.setOutputPath(job, outFile);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setJarByClass(JoinMapSide2.class);
job.setMapperClass(MapClass.class);
job.setNumReduceTasks(0);
job.waitForCompletion(true);
return 0;
}
public static void main(String args[]) throws Exception {
int ret = ToolRunner.run(new Configuration(), new JoinMapSide2(), args);
System.exit(ret);
}
}

Related

Not understanding the path in distributed cache

From the below code I didn't understand 2 things:
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration())
I didn't understand whether the URI path has to be present in HDFS. Correct me if I am wrong.
And what is p.getName().equals() in the below code:
public class MyDC {
public static class MyMapper extends Mapper < LongWritable, Text, Text, Text > {
private Map < String, String > abMap = new HashMap < String, String > ();
private Text outputKey = new Text();
private Text outputValue = new Text();
protected void setup(Context context) throws
java.io.IOException, InterruptedException {
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
for (Path p: files) {
if (p.getName().equals("abc.dat")) {
BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
String line = reader.readLine();
while (line != null) {
String[] tokens = line.split("\t");
String ab = tokens[0];
String state = tokens[1];
abMap.put(ab, state);
line = reader.readLine();
}
}
}
if (abMap.isEmpty()) {
throw new IOException("Unable to load Abbreviation data.");
}
}
protected void map(LongWritable key, Text value, Context context)
throws java.io.IOException, InterruptedException {
String row = value.toString();
String[] tokens = row.split("\t");
String inab = tokens[0];
String state = abMap.get(inab);
outputKey.set(state);
outputValue.set(row);
context.write(outputKey, outputValue);
}
}
public static void main(String[] args)
throws IOException, ClassNotFoundException, InterruptedException {
Job job = new Job();
job.setJarByClass(MyDC.class);
job.setJobName("DCTest");
job.setNumReduceTasks(0);
try {
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration());
} catch (Exception e) {
System.out.println(e);
}
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
The idea of Distributed Cache is to make some static data available to the task node before it starts its execution.
The file has to be present in HDFS, so that it can then be added to the Distributed Cache (on each task node).
DistributedCache.getLocalCacheFiles basically gets all the cache files present on that task node. With if (p.getName().equals("abc.dat")) you are picking the appropriate cache file to be processed by your application.
Please refer to the docs below:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html#getLocalCacheFiles(org.apache.hadoop.conf.Configuration)
DistributedCache is an API which is used to add a file or a group of files to a cache that is available on every node where the map-reduce tasks run. One example of using DistributedCache is map-side joins.
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration()) will add the abc.dat file to the cache area. There can be any number of files in the cache, and p.getName().equals("abc.dat") will pick out the file you require. Every path in HDFS is handled via Path[] for map-reduce processing. For example:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
The first Path(args[0]) is the first argument (the input file location) you pass when executing the jar, and Path(args[1]) is the second argument, which is the output file location. Everything is handled as a Path array.
In the same way, when you add any file to the cache it goes into a Path array, which you should retrieve using the code below.
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
It will return all the files present in the cache, and you can find your file by name with the p.getName().equals() method.
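For comparison, the newer mapreduce API used in the first question above exposes the same information through context.getCacheFiles(), and the cached file is typically also symlinked into the task's working directory under its base name (abc.dat here), which is what the corrected code in the first question relies on. A minimal sketch of such a setup(), reusing the abMap from the code above (imports are the usual java.io classes plus java.net.URI):
protected void setup(Context context) throws IOException, InterruptedException {
    // entries registered via job.addCacheFile() or the -files generic option
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0) {
        // read the local symlinked copy created in the task's working directory
        BufferedReader reader = new BufferedReader(new FileReader(new File("./abc.dat")));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.split("\t");
                abMap.put(tokens[0], tokens[1]);
            }
        } finally {
            reader.close();
        }
    }
}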

Map-reduce job giving ClassNotFound exception even though mapper is present when running with yarn?

I am running a Hadoop job which works fine when I run it without YARN in pseudo-distributed mode, but it gives me a ClassNotFoundException when running with YARN:
16/03/24 01:43:40 INFO mapreduce.Job: Task Id : attempt_1458775953882_0002_m_000003_1, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.hadoop.keyword.count.ItemMapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.keyword.count.ItemMapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more
Here is the source code for the job:
Configuration conf = new Configuration();
conf.set("keywords", args[2]);
Job job = Job.getInstance(conf, "item count");
job.setJarByClass(ItemImpl.class);
job.setMapperClass(ItemMapper.class);
job.setReducerClass(ItemReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Here is the command I am running
hadoop jar ~/itemcount.jar /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky
Edit: code after the suggestions
package com.hadoop.keyword.count;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
public class ItemImpl {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("keywords", args[2]);
Job job = Job.getInstance(conf, "item count");
job.setJarByClass(ItemImpl.class);
job.setMapperClass(ItemMapper.class);
job.setReducerClass(ItemReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static class ItemMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
JSONParser parser = new JSONParser();
@Override
public void map(Object key, Text value, Context output) throws IOException,
InterruptedException {
JSONObject tweetObject = null;
String[] keywords = this.getKeyWords(output);
try {
tweetObject = (JSONObject) parser.parse(value.toString());
} catch (ParseException e) {
e.printStackTrace();
}
if (tweetObject != null) {
String tweetText = (String) tweetObject.get("text");
if(tweetText == null){
return;
}
tweetText = tweetText.toLowerCase();
/* StringTokenizer st = new StringTokenizer(tweetText);
ArrayList<String> tokens = new ArrayList<String>();
while (st.hasMoreTokens()) {
tokens.add(st.nextToken());
}*/
for (String keyword : keywords) {
keyword = keyword.toLowerCase();
if (tweetText.contains(keyword)) {
output.write(new Text(keyword), one);
}
}
output.write(new Text("count"), one);
}
}
String[] getKeyWords(Mapper<Object, Text, Text, IntWritable>.Context context) {
Configuration conf = (Configuration) context.getConfiguration();
String param = conf.get("keywords");
return param.split(",");
}
}
public static class ItemReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context output)
throws IOException, InterruptedException {
int wordCount = 0;
for (IntWritable value : values) {
wordCount += value.get();
}
output.write(key, new IntWritable(wordCount));
}
}
}
Running in fully distributed mode, your TaskTracker/NodeManager (the thing running your mapper) runs in a separate JVM, and it sounds like your class is not making it onto that JVM's classpath.
Try using the -libjars <csv,list,of,jars> command line arg on job invocation. This will have Hadoop distribute the jar to the TaskTracker JVM and load your classes from that jar. (Note, this copies the jar out to each node in your cluster and makes it available only for that specific job. If you have common libraries that would need to be invoked for a lot of jobs, you'd want to look into using the Hadoop distributed cache.)
You may also want to try yarn jar ... when launching your job instead of hadoop jar ..., since that's the newer/preferred way to launch YARN jobs.
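As an illustration of the -libjars syntax only (the dependency jar path is a placeholder; also note that generic options are parsed only when the driver goes through ToolRunner/GenericOptionsParser, as the Tool-based answer further down sets up), the invocation would look roughly like:
hadoop jar ~/itemcount.jar com.hadoop.keyword.count.ItemImpl -libjars /path/to/json-simple.jar /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky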
Can you check the content of your itemcount.jar (jar -tvf itemcount.jar)? I faced this issue once, only to find that the .class was missing from the jar.
I had the same error a few days ago.
Changing map and reduce classes to static fixed my problem.
Make your map and reduce classes inner classes.
Control constructors of map and reduce classes (i/o values and override statement)
Check your jar command
old one
hadoop jar ~/itemcount.jar /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky
new one
hadoop jar ~/itemcount.jar com.hadoop.keyword.count.ItemImpl /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky
Add packageName.MainClass after you specify the .jar file.
Try-catch
try {
tweetObject = (JSONObject) parser.parse(value.toString());
} catch (Exception e) { // Change ParseException to Exception if you don't only expect Parse error
e.printStackTrace();
return; // return from function in case of any error
}
}
Extend Configured and implement Tool
public class ItemImpl extends Configured implements Tool{
public static void main (String[] args) throws Exception{
int res =ToolRunner.run(new ItemImpl(), args);
System.exit(res);
}
@Override
public int run(String[] args) throws Exception {
Job job=Job.getInstance(getConf(),"ItemImpl ");
job.setJarByClass(this.getClass());
job.setJarByClass(ItemImpl.class);
job.setMapperClass(ItemMapper.class);
job.setReducerClass(ItemReducer.class);
job.setMapOutputKeyClass(Text.class);//probably not essential but make it certain and clear
job.setMapOutputValueClass(IntWritable.class); //probably not essential but make it certain and clear
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
add public static map
add public static reduce
I'm not an expert on this topic, but this implementation is from one of my working projects. Try this; if it doesn't work for you, I would suggest you check the libraries you added to your project.
Probably the first step will solve it, but if these steps don't work, share the code with us.

Mapreduce with HCATALOG integration with oozie in MAPR

I have written a mapreduce program that reads data from a Hive table using HCatalog and writes into HBase. This is a map-only job with no reducers. I have run the program from the command line and it works as expected (I created a fat jar to avoid jar issues). I wanted to integrate it with Oozie (with the help of HUE). I have two options to run it:
Use Mapreduce Action
Use Java Action
Since my Mapreduce program has a driver method that holds the below code
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.schema.HCatSchema;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
public class HBaseValdiateInsertDriver {
public static void main(String[] args) throws Exception {
String dbName = "Test";
String tableName = "emp";
Configuration conf = new Configuration();
args = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "HBase Get Put Demo");
job.setInputFormatClass(HCatInputFormat.class);
HCatInputFormat.setInput(job, dbName, tableName, null);
job.setJarByClass(HBaseValdiateInsertDriver.class);
job.setMapperClass(HBaseValdiateInsert.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path("maprfs:///user/input"));
FileOutputFormat.setOutputPath(job, new Path("maprfs:///user/output"));
job.waitForCompletion(true);
}
}
How do I specify the driver method in Oozie? All I can see is how to specify the mapper and reducer classes. Can someone guide me on how to set the properties?
Using the Java action I can specify my driver class as the main class and get this executed, but I face errors like table not found, HCatalog jars not found, etc. I have included hive-site.xml in the workflow (using HUE), but I feel the system is not able to pick up the properties. Can someone advise me on what all I have to take care of? Are there any other configuration properties that I need to include?
Also, the sample program I referred to on the Cloudera website uses
HCatInputFormat.setInput(job, InputJobInfo.create(dbName,
inputTableName, null));
whereas I use the below (I don't see a method that accepts the above input):
HCatInputFormat.setInput(job, dbName, tableName, null);
Below is my mapper code
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hive.hcatalog.data.HCatRecord;
public class HBaseValdiateInsert extends Mapper<WritableComparable, HCatRecord, Text, Text> {
static HTableInterface table;
static HTableInterface inserted;
private String hbaseDate = null;
String existigValue=null;
List<Put> putList = new ArrayList<Put>();
@Override
public void setup(Context context) throws IOException {
Configuration conf = context.getConfiguration();
String tablename = "dev_arch186";
Utils.getHBConnection();
table = Utils.getTable(tablename);
table.setAutoFlushTo(false);
}
@Override
public void cleanup(Context context) {
try {
table.put(putList);
table.flushCommits();
table.close();
} catch (IOException e) {
e.printStackTrace();
}
Utils.closeConnection();
}
@Override
public void map(WritableComparable key, HCatRecord value, Context context) throws IOException, InterruptedException {
String name_hive = (String) value.get(0);
String id_hive = (String) value.get(1);
String rec[] = test.toString().split(",");
Get g = new Get(Bytes.toBytes(name_hive));
existigValue=getOneRecord(Bytes.toBytes("Info"),Bytes.toBytes("name"),name_hive);
if (existigValue.equalsIgnoreCase("NA") || !existigValue.equalsIgnoreCase(id_hive)) {
Put put = new Put(Bytes.toBytes(rec[0]));
put.add(Bytes.toBytes("Info"),
Bytes.toBytes("name"),
Bytes.toBytes(rec[1]));
put.setDurability(Durability.SKIP_WAL);
putList.add(put);
if(putList.size()>25000){
table.put(putList);
table.flushCommits();
}
}
}
public String getOneRecord(byte[] columnFamily, byte[] columnQualifier, String rowKey)
throws IOException {
Get get = new Get(rowKey.getBytes());
get.setMaxVersions(1);
Result rs = table.get(get);
rs.getColumn(columnFamily, columnQualifier);
System.out.println(rs.containsColumn(columnFamily, columnQualifier));
KeyValue result = rs.getColumnLatest(columnFamily,columnQualifier);
if (rs.containsColumn(columnFamily, columnQualifier))
return (Bytes.toString(result.getValue()));
else
return "NA";
}
public boolean columnQualifierExists(String tableName, String ColumnFamily,
String ColumnQualifier, String rowKey) throws IOException {
Get get = new Get(rowKey.getBytes());
Result rs = table.get(get);
return(rs.containsColumn(ColumnFamily.getBytes(),ColumnQualifier.getBytes()));
}
}
Note:
I use a MapR (M3) cluster with HUE as the interface for Oozie.
Hive version: 1.0
HCatalog version: 1.0
I couldn't find any way to initialize HCatInputFormat from the Oozie mapreduce action, but I have a workaround, described below.
I created a LazyHCatInputFormat by extending HCatInputFormat, and overrode the getJobInfo method to handle the initialization. This gets called as part of the getSplits(..) call.
private static void lazyInit(Configuration conf){
try{
if(conf==null){
conf = new Configuration(false);
}
conf.addResource(new Path(System.getProperty("oozie.action.conf.xml")));
conf.addResource(new org.apache.hadoop.fs.Path("hive-config.xml"));
String databaseName = conf.get("LazyHCatInputFormat.databaseName");
String tableName = conf.get("LazyHCatInputFormat.tableName");
String partitionFilter = conf.get("LazyHCatInputFormat.partitionFilter");
setInput(conf, databaseName, tableName);
//setFilter(partitionFilter);
//System.out.println("After lazyinit : "+conf.get("mapreduce.lib.hcat.job.info"));
}catch(Exception e){
System.out.println("*** LAZY INIT FAILED ***");
//e.printStackTrace();
}
}
public static InputJobInfo getJobInfo(Configuration conf)
throws IOException {
String jobString = conf.get("mapreduce.lib.hcat.job.info");
if (jobString == null) {
lazyInit(conf);
jobString = conf.get("mapreduce.lib.hcat.job.info");
if(jobString == null){
throw new IOException("job information not found in JobContext. HCatInputFormat.setInput() not called?");
}
}
return (InputJobInfo) HCatUtil.deserialize(jobString);
}
In the Oozie map-reduce action, it is configured as below:
<property>
<name>mapreduce.job.inputformat.class</name>
<value>com.xyz.LazyHCatInputFormat</value>
</property>
<property>
<name>LazyHCatInputFormat.databaseName</name>
<value>HCAT DatabaseNameHere</value>
</property>
<property>
<name>LazyHCatInputFormat.tableName</name>
<value>HCAT TableNameHere</value>
</property>
This might not be the best implementation, but a quick hack to make it work.

Is the generic option -D not supported in Hadoop 0.20.2?

I am trying to set a configuration property from the console using the -D generic option.
Here is my console input:
$ hadoop jar hadoop-0.20.2/gtee.jar dd.MaxTemperature -D file.pattern=2007.* /inputdata /outputdata
but when I cross-verified it from the code with
Configuration conf;
System.out.println(conf.get("file.pattern"));
the result is null. What would be the problem here? Why is the value of the property "file.pattern" not displaying? Can anyone please help me.
Thanks
EDITED SECTION:
Driver Code:
public int run(String[] args) throws Exception {
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("MaxTemperature");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
// System.out.println(conf.get("file.pattern"));
if (fs.exists(new Path(args[1]))) {
fs.delete(new Path(args[1]), true);
}
System.out.println(conf.get("file.pattern"));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileInputFormat.setInputPathFilter(job, RegexExcludePathFilter.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(MapMapper.class);
job.setCombinerClass(Mapreducers.class);
job.setReducerClass(Mapreducers.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int xx = ToolRunner.run(new Configuration(),new MaxTemperature(), args);
System.exit(xx);
}
path filter implementation:
public static class RegexExcludePathFilter extends Configured implements
PathFilter {
//String pattern = "2007.[0-1]?[0-2].[0-9][0-9].txt" ;
Configuration conf;
Pattern pattern;
@Override
public boolean accept(Path path) {
Matcher m = pattern.matcher(path.toString());
return m.matches();
}
@Override
public void setConf(Configuration conf) {
this.conf = conf;
pattern = Pattern.compile(conf.get("file.pattern"));
System.out.println(pattern);
}
}
To confirm: yes, the -D option is supported in version 0.20.2; however, it requires you to implement the Tool interface to read variables from the command line.
Configuration conf = new Configuration(); //this is the issue
// When implementing tool use this
Configuration conf = this.getConf();
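A minimal sketch of how the driver's run() might use getConf() (only the parts relevant to picking up -D are shown; the rest of the job setup stays as in the question):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperature extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration already populated by
        // GenericOptionsParser, so -Dfile.pattern=... is visible here
        Configuration conf = getConf();
        System.out.println(conf.get("file.pattern"));
        Job job = new Job(conf, "MaxTemperature"); // pass the same conf to the job
        job.setJarByClass(MaxTemperature.class);
        // ... input/output paths, mapper, reducer, path filter as in the question ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MaxTemperature(), args));
    }
}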
You are passing it with a space in between; that's not how you should do it. Instead try:
-Dfile.pattern=2007.*
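With that change, the invocation from the question becomes:
hadoop jar hadoop-0.20.2/gtee.jar dd.MaxTemperature -Dfile.pattern=2007.* /inputdata /outputdata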

Distributed Cache Files Hadoop

I want to attach different files to different reducers. Is it possible using distributed cache technology in hadoop?
I am able to attach the same file (or files) to all the reducers. But due to memory constraints, I want to know if I can attach different files to different reducers.
Forgive me if it's an ignorant question.
Please help!
Thanks in advance!
Also, it may be worth trying an in-memory compute/data grid technology like GridGain, Infinispan, etc. This way you can load your data in memory and you would not have any limits on how to map your computational jobs (map/reduce) to any data using data affinity.
It is a strange requirement, since a reducer is not bound to a particular node, and during execution a reducer can run on any node or even several nodes (if there is a failure or speculative execution). Therefore all reducers should be homogeneous; the only thing that differs between them is the data they process.
So I suppose that when you say you want to put different files on different reducers, you actually want each reducer to read only the files that correspond to the data (keys) it will be processing.
The only way I know to do that is to put your data on HDFS and read it from the reducer when it starts processing data.
package com.a;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class PrefixNew4Reduce4 extends MapReduceBase implements Reducer<Text, Text, Text, Text>{
// @SuppressWarnings("unchecked")
ArrayList<String> al = new ArrayList<String>();
public void configure(JobConf conf4)
{
String from = "home/users/mlakshm/haship";
/* src (hdfs file) something like hdfs://127.0.0.1:8020/home/rystsov/hi */
FileSystem fs = null;
try {
fs = FileSystem.get(new URI(from), conf4);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (URISyntaxException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
FSDataInputStream src;
try {
src = fs.open(new Path(from));
String val = src.readLine();
StringTokenizer st = new StringTokenizer(val);
al.add(val);
System.out.println("val:----------------->"+val);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void reduce (Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
StringTokenizer stk = new StringTokenizer(key.toString());
String t = stk.nextToken();
String i = stk.nextToken();
String j = stk.nextToken();
ArrayList<String> al1 = new ArrayList<String>();
for(int k = 0; k<al.size(); k++)
{
boolean a = (al.get(k).equals(i)) || (al.get(k).equals(j));
if(a==true)
{
output.collect(key, new Text(al.get(k)));
}
while(values.hasNext())
{
String val = values.next().toString();
al1.add(val);
}
for(int m = 0; m<al1.size(); m++)
{
output.collect(key, new Text(al1.get(m)));
}
} // end of the loop over al
} // end of reduce()
} // end of class PrefixNew4Reduce4
