HBase map/reduce dependency issue - hadoop

Overview
I developed a REST API service based on the RESTEasy framework. The service stores data in an HBase database and then executes a map/reduce job triggered by some condition (e.g. inserting one record).
Requirement
In the Mapper class I import some third-party libraries. I do not want to package those libraries into the war file.
TableMapReduceUtil.initTableMapperJob(
        HBaseInitializer.TABLE_DATA,  // input HBase table name
        scan,                         // Scan instance to control CF and attribute selection
        LuceneMapper.class,           // mapper
        null,                         // mapper output key
        null,                         // mapper output value
        job);
FileOutputFormat.setOutputPath(job, new Path("hdfs://master:9000/qin/luceneFile"));
job.submit();
Problem
If I package all the libraries into the war file that is deployed to the Jetty container, it works well. If I do not package the third-party libraries into the war, but instead upload them to HDFS and add them to the classpath, it does not work, as shown below:
conf.set("fs.defaultFS","hdfs://master:9000");
FileSystem hdfs = FileSystem.get(conf);
Path classpathFilesDir = new Path("bjlibs");
FileStatus[] jarFiles = hdfs.listStatus(classpathFilesDir);
for (FileStatus fs : jarFiles) {
Path disqualified = new Path(fs.getPath().toUri().getPath());
DistributedCache.addFileToClassPath(disqualified, conf);
}
hdfs.close();

try TableMapReduceUtil.addHBaseDependencyJars()
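A minimal sketch of what that could look like alongside the existing job setup (the configuration and job names here are assumptions; both TableMapReduceUtil calls throw IOException):
// Hedged sketch: ship the HBase dependency jars with the submitted job instead of bundling them into the war.
Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "lucene-index");               // hypothetical job name
TableMapReduceUtil.addHBaseDependencyJars(job.getConfiguration());
TableMapReduceUtil.addDependencyJars(job);                     // also ships the jars containing the job's own classes
// jars already uploaded to HDFS can still be added with DistributedCache.addFileToClassPath(path, job.getConfiguration());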

Related

How to test hadoop mapreduce with hdfs?

I am using MRUnit to write unit tests for my mapreduce jobs.
However, I am having trouble including HDFS in that mix. My MR job needs a file from HDFS. How do I mock out the HDFS part in an MRUnit test case?
Edit:
I know that I can specify inputs/expected outputs for my MR code in the test infrastructure. However, that is not what I want. My MR job needs to read another file that contains domain data to do its job. This file is in HDFS. How do I mock out this file?
I tried using Mockito but it didn't work. The reason was that FileSystem.open() returns an FSDataInputStream, which extends java.io.DataInputStream and implements several other interfaces. It was too painful to mock them all out. So I hacked it in my code by doing the following:
if (System.getProperty("junit_running") != null) {
    inputStream = this.getClass().getClassLoader().getResourceAsStream("domain_data.txt");
    br = new BufferedReader(new InputStreamReader(inputStream));
} else {
    Path pathToRegionData = new Path("/domain_data.txt");
    LOG.info("checking for existence of region assignment file at path: " + pathToRegionData.toString());
    if (!fileSystem.exists(pathToRegionData)) {
        LOG.error("domain file does not exist at path: " + pathToRegionData.toString());
        throw new IllegalArgumentException("region assignments file does not exist at path: " + pathToRegionData.toString());
    }
    inputStream = fileSystem.open(pathToRegionData);
    br = new BufferedReader(new InputStreamReader(inputStream));
}
This solution is not ideal because I had to put test-specific code in my production code. I am still waiting to see if there is an elegant solution out there.
Please follow this small tutorial for MRUnit.
https://github.com/malli3131/HadoopTutorial/blob/master/MRUnit/Tutorial
In an MRUnit test case, we supply the data inside the testMapper() and testReducer() methods, so there is no need for input from HDFS in an MRUnit test. Only real MapReduce jobs require data inputs from HDFS.
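For illustration, a minimal MRUnit mapper test might look like this (WordCountMapper is a hypothetical Mapper<LongWritable, Text, Text, IntWritable>; the inputs and expected outputs are placeholders):
// Hedged sketch using org.apache.hadoop.mrunit.mapreduce.MapDriver.
MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
        MapDriver.newMapDriver(new WordCountMapper());
mapDriver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
         .withOutput(new Text("hadoop"), new IntWritable(1))
         .withOutput(new Text("hadoop"), new IntWritable(1))
         .runTest();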

How to incorporate Storm component-specific configuration data?

I have a Storm topology containing spouts/bolts.
There is some configuration data specific to a particular spout and also to a particular bolt that I would like to read from a config file so that it is not hard coded. Examples of such config data are the filename that the spout is to read from and the filename that a bolt is to write to.
I think config data is passed into the open and prepare methods.
How can I incorporate the component-specific data from a configuration file?
There are at least two ways to do this:
1) Include application-specific configuration in the Storm config, which will be available during the IBolt.prepare() and ISpout.open() method calls. One strategy you could use is to have an application prefix for the configuration keys, avoiding potential conflicts. (A sketch of reading such a value back in prepare() follows the example below.)
Config conf = new backtype.storm.Config();
// Storm-specific configuration
// ...
conf.put("my.application.configuration.foo", "foo");
conf.put("my.application.configuration.bar", "bar");
StormSubmitter.submitTopology(topologyName, conf, topology);
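For example (a hedged sketch; the key name matches the one above, the cast and field handling are assumptions), the bolt can read the value back out of the topology configuration in prepare():
@Override
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    String foo = (String) stormConf.get("my.application.configuration.foo");
    // keep it in a field for use in execute() ...
}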
2) Pass component-specific configuration in through the Spout/Bolt constructor.
Properties properties = new java.util.Properties();
properties.load(new FileReader("config-file"));
BaseComponent bolt = new MyBoltImpl(properties);
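A sketch of what MyBoltImpl could look like (the class body is an assumption; note that whatever is passed to the constructor must be serializable, because Storm serializes spout/bolt instances when the topology is submitted, and java.util.Properties is Serializable):
// Hedged sketch using backtype.storm.topology.base.BaseRichBolt.
public class MyBoltImpl extends BaseRichBolt {
    private final Properties config;       // constructor-supplied, serialized with the bolt
    private OutputCollector collector;

    public MyBoltImpl(Properties config) {
        this.config = config;
    }

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String outputFile = config.getProperty("bolt.output.file");   // hypothetical key
        // ... write to outputFile, then ack the tuple
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams declared in this sketch
    }
}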

Hive setup()-like functionality similar to Mapper setup()?

I want to replace a Hadoop job with Hive. My challenge is that in Hadoop I use setup() to build a kd-tree by reading in reference data (points of interest) from the distributed cache. I then use the kd-tree in map() to evaluate the distance of the target data against it.
In Hive, I want to use a UDF with an evaluate() method to determine the distance, but I don't know how to set up the kd-tree with the reference data. Is this possible?
I probably don't have the entire answer, so I'm just going to throw out some ideas that might be of help.
You can add files to the distributed cache in hive using ADD FILE ...
Hive 0.11+ (I think) should give you access to the distributed cache in GenericUDF.initialize:
https://issues.apache.org/jira/browse/HIVE-1016 which references...
https://issues.apache.org/jira/browse/HIVE-3628
So when you initialize the UDF, you might be able to build your kdtree by accessing the file you added in the distributed cache.
As climbage says, the ADD FILE command adds the file to the distributed cache.
You can access the distributed cache in your UDF simply by opening a file in the current working directory, i.e. something like:
new FileInputStream(new File(System.getProperty("user.dir") + "/myfile"));
You can use a ConstantObjectInspector to access the filename in the initialize method of GenericUDF, where you can open the file and read it into your in-memory data structure.
The distributed_map UDF of Brickhouse does something similar ( https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java )
Something like
public ObjectInspector initialize(ObjectInspector[] inspArr) throws UDFArgumentException {
    ConstantObjectInspector fileNameInsp = (ConstantObjectInspector) inspArr[0];
    String fileName = fileNameInsp.getWritableConstantValue().toString();
    try {
        FileInputStream inFile = new FileInputStream("./" + fileName);
        doStuff(inFile);
    } catch (IOException e) {
        throw new UDFArgumentException("could not read " + fileName + ": " + e.getMessage());
    }
    // ... return the ObjectInspector for the UDF's return type
}
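As a usage note, with this pattern the reference-data filename is passed to the UDF as a constant string argument in the query itself; that is what lets initialize() resolve it via the ConstantObjectInspector and build the in-memory structure (e.g. the kd-tree) once, before evaluate() is called for each row.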

Hadoop Map-Reduce, need to combine two mappers with one common Reducer

I need to implement the functionality below using Hadoop MapReduce:
1) I read input for one mapper from one source and input for a second mapper from a different source.
2) I need to pass the output of both mappers into a single reducer for further processing.
Is there any way to do this in Hadoop MapReduce?
MultipleInputs.addInputPath is what you are looking for. This is what your configuration would look like. Make sure both AnyMapper1 and AnyMapper2 write the same output key/value types expected by MergeReducer:
JobConf conf = new JobConf(Merge.class);
conf.setJobName("merge");
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);
conf.setReducerClass(MergeReducer.class);
conf.setOutputFormat(TextOutputFormat.class);
MultipleInputs.addInputPath(conf, inputDir1, SequenceFileInputFormat.class, AnyMapper1.class);
MultipleInputs.addInputPath(conf, inputDir2, TextInputFormat.class, AnyMapper2.class);
FileOutputFormat.setOutputPath(conf, outputPath);
You can create a custom Writable, populate it in each Mapper, and then read the custom Writable objects in the Reducer and perform the necessary business operations.
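A minimal sketch of such a custom Writable (the field names are hypothetical; both mappers would emit this type as their map output value so the common reducer can consume it):
// Hedged sketch of a custom value type implementing org.apache.hadoop.io.Writable.
public class MergedValueWritable implements Writable {
    private Text source = new Text();      // which input the record came from (hypothetical)
    private Text payload = new Text();     // the record itself (hypothetical)

    public MergedValueWritable() {}        // no-arg constructor required by Hadoop

    public void set(String source, String payload) {
        this.source.set(source);
        this.payload.set(payload);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        source.write(out);
        payload.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        source.readFields(in);
        payload.readFields(in);
    }
}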

Files not stored in Distributed Cache

I am using DistributedCache, but there are no files in the cache after executing the code.
I have referred to other similar questions, but the answers do not solve my issue.
Please find the code below:
Configuration conf = new Configuration();
Job job1 = new Job(conf, "distributed cache");
Configuration conf1 = job1.getConfiguration();
DistributedCache.addCacheFile(new Path("File").toUri(), conf1);
System.out.println("distributed cache file "+DistributedCache.getLocalCacheFiles(conf1));
This gives null.
The same call inside the mapper also gives null. Please let me know your suggestions.
Thanks
Try getCacheFiles() instead of getLocalCacheFiles().
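For context, getCacheFiles() returns the URIs registered with addCacheFile(), while getLocalCacheFiles() is only populated on the task nodes after the files have been localized, so printing it on the client before the job runs will typically show null. A hedged sketch (the fully-qualified HDFS URI is hypothetical):
DistributedCache.addCacheFile(new URI("hdfs://master:9000/cache/File"), conf1);
System.out.println(Arrays.toString(DistributedCache.getCacheFiles(conf1)));   // registered URIs are visible here
// getLocalCacheFiles(conf1) only returns local paths inside a running task.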
I believe this is (at least partly) due to what Chris White wrote here:
After you create your Job object, you need to pull back the
Configuration object as Job makes a copy of it, and configuring values
in conf2 after you create the job will have no effect on the job
itself. Try this:
job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
I guess if it still does not work, there is another problem somewhere, but that doesn't mean that Chris White's point is not correct.
When distributing, don't forget the local link name, preferably using a relative path:
URI is of the form hdfs://host:port/absolute-path#local-link-name
When reading:
if you don't use the distributed cache, you are supposed to use HDFS's FileSystem to access hdfs://host:port/absolute-path
if you use the distributed cache, then you have to use standard Java file utilities to access the local-link-name
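For illustration, a hedged sketch of both sides (the HDFS path and link name are hypothetical; this assumes symlinks are created in the task working directory, which is the default on YARN, while classic MRv1 needs DistributedCache.createSymlink(conf)):
// On the client: register the cache file with a local link name.
DistributedCache.addCacheFile(new URI("hdfs://master:9000/cache/part-r-00000#lookup.txt"), conf);

// In the mapper/reducer: read it through the symlink in the task's working directory.
BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));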
The cache file needs to be in the Hadoop FileSystem. You can do this:
void copyFileToHDFS(JobConf jobConf, String from, String to) {
    try {
        FileSystem aFS = FileSystem.get(jobConf);
        aFS.copyFromLocalFile(false, true, new Path(from), new Path(to));
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
Once the files are copied you can add them to the cache, like so:
void fillCache(JobConf jobConf) {
    copyFileToHDFS(jobConf, fromLocation, toLocation);
    try {
        Job job = Job.getInstance(jobConf);
        job.addCacheFile(new URI(toLocation));
        JobConf newJobConf = new JobConf(job.getConfiguration());  // use this configuration from here on
    } catch (IOException | URISyntaxException e) {
        throw new RuntimeException(e);
    }
}
