DistributedCache unable to access archives - hadoop

I am able to access individual files using DistributedCache but unable to access archives.
In the main method I am adding the archive as
DistributedCache.addCacheArchive(new Path("/stocks.gz").toUri(), job.getConfiguration());
where /stocks.gz is in hdfs. In the mapper I use,
Path[] paths = DistributedCache.getLocalCacheArchives(context.getConfiguration());
File localFile = new File(paths[0].toString());
which throws the exception,
java.io.FileNotFoundException: /tmp/hadoop-user/mapred/local/taskTracker/distcache/-8696401910194823450_622739733_1347031628/localhost/stocks.gz (No such file or directory)
I am expecting the DistributedCache to unzip /stocks.gz and the mapper to use the underlying file, but it throws a FileNotFound exception.
DistributedCache.addCacheFile and DistributedCache.getLocalCacheFiles work correctly when passing a single file; however, passing an archive does not work. What am I doing wrong here?

Can you try passing stocks.gz with its absolute path?
DistributedCache.addCacheArchive(new Path("<Absolute Path To>/stocks.gz").toUri(), job.getConfiguration());
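A separate point that may be worth checking: as far as the DistributedCache documentation describes, archives are unpacked on the worker nodes for formats such as zip, tar and tgz/tar.gz. A plain .gz file is a single compressed stream rather than an archive, so it would likely be copied as-is instead of being unzipped. Whether that explains the FileNotFoundException above is not established. The sketch below only shows how a mapper could read a plain gzip file with JDK classes once the localized file is reachable; GzipCacheReader and readLines are hypothetical names, not part of Hadoop.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper; the class and method names are not from the original post. */
class GzipCacheReader {

    /** Reads every line of a gzip-compressed text file (assumed to be UTF-8). */
    static List<String> readLines(File gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: GzipCacheReader <path to .gz file>");
            return;
        }
        // In a mapper, this path would come from the localized cache paths instead.
        for (String line : readLines(new File(args[0]))) {
            System.out.println(line);
        }
    }
}
```

In a mapper, readLines could be called from setup() with the local path returned by the cache lookup, assuming that path points at the compressed file itself.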

Related

Could not find or load main class hdfs problem

I am trying to use Apache Rya for some tests (https://rya.apache.org/).
For those who are familiar with Rya and RDF stores, I am trying to do a bulk loading which is explained here: https://github.com/apache/rya/blob/master/extras/rya.manual/src/site/markdown/loaddata.md.
Briefly, I should copy a JAR file 'mapreduce/target/rya.mapreduce-<version>-shaded.jar' into an HDFS volume and then run the following command:
hadoop hdfs://volume/rya.mapreduce-<version>-shaded.jar org.apache.rya.accumulo.mr.tools.RdfFileInputTool -Dac.zk=localhost:2181 -Dac.instance=accumulo -Dac.username=root -Dac.pwd=secret -Drdf.tablePrefix=rya_ -Drdf.format=N-Triples hdfs://volume/dir1,hdfs://volume/dir2,hdfs://volume/file1.nt
Well, I copied the needed JAR and the input files into HDFS and verified that they are really there using the bin/hadoop fs -put command. My problem is that when I run the command from the official example, I get the following errors, which I can neither understand nor resolve.
/project/hadoop/libexec/hadoop-functions.sh: line 2393: HADOOP_HDFS://LOCALHOST:9000/USER/RYA.MAPREDUCE-4.0.0-INCUBATING-SHADED.JAR_USER: invalid variable name
/project/hadoop/libexec/hadoop-functions.sh: line 2358: HADOOP_HDFS://LOCALHOST:9000/USER/RYA.MAPREDUCE-4.0.0-INCUBATING-SHADED.JAR_USER: invalid variable name
/project/hadoop/libexec/hadoop-functions.sh: line 2453: HADOOP_HDFS://LOCALHOST:9000/USER/RYA.MAPREDUCE-4.0.0-INCUBATING-SHADED.JAR_OPTS: invalid variable name
Error: Could not find or load main class hdfs:..localhost:9000.user.rya.mapreduce-4.0.0-incubating-shaded.jar
For information: all environment variables (HADOOP_HOME and HADOOP_PREFIX) are properly set.

Pig register jar, file does not exist error

I'm using the Hortonworks sandbox and trying to run a simple Pig script. I keep getting an annoying error related to "file does not exist".
Below is the script:
REGISTER '/piggybank.jar';
inp = load '/my.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage..
ERROR 2997: Encountered IOException. File does not exist:
hdfs://sandbox.hortonworks.com:8020/tmp/udfs/ '/piggybank.jar'
However, my jar is present at the root (/) and I have given it proper permissions as well. I don't know why the path points to /tmp/udfs....
Can anyone provide some suggestion?
Do not place the path within quotes. Also provide the full URI of the jar file location.
REGISTER hdfs://sandbox.hortonworks.com:8020/piggybank.jar;
Refer to REGISTER (a jar/script) in the Pig documentation.

FileStream on cluster gives me an exception

I am writing a Spark Streaming application using a file stream...
val probeFileLines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP", filterF, false) //.persist(StorageLevel.MEMORY_AND_DISK_SER)
But I get a file/IO exception:
16/09/07 10:20:30 WARN FileInputDStream: Error finding new files
java.io.FileNotFoundException: /mapr/cellos-mapr/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP
at com.mapr.fs.MapRFileSystem.listMapRStatus(MapRFileSystem.java:1486)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:1523)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:86)
However, the directory exists in my cluster.
I am running my job using spark-submit:
spark-submit --class "StreamingEngineSt" target/scala-2.11/sprkhbase_2.11-1.0.2.jar
This could be related to file permissions or ownership (you may need to run as the hdfs user).

Hadoop jar command error

While executing the hadoop jar command on HDFS, I get the error below:
#hadoop jar WordCountNew.jar WordCountNew /MRInput57/Input-Big.txt /MROutput57
15/11/06 19:46:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/11/06 19:46:32 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:8020/var/lib/hadoop-0.20/cache/mapred/mapred/staging/root/.staging/job_201511061734_0003
15/11/06 19:46:32 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /MRInput57/Input-Big.txt already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /MRInput57/Input-Big.txt already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:921)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:526)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:556)
at MapReduce.WordCountNew.main(WordCountNew.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
My driver class is as follows:
public static void main(String[] args) throws Exception {
    // Configuration details for the job and its jar file
    Configuration conf = new Configuration();
    Job job = new Job(conf, "WORDCOUNTJOB");
    // Setting the driver class
    job.setJarByClass(MapReduceWordCount.class);
    // Setting the Mapper class
    job.setMapperClass(TokenizerMapper.class);
    // Setting the Combiner class
    job.setCombinerClass(IntSumReducer.class);
    // Setting the Reducer class
    job.setReducerClass(IntSumReducer.class);
    // Setting the output key class
    job.setOutputKeyClass(Text.class);
    // Setting the output value class
    job.setOutputValueClass(IntWritable.class);
    // Adding the input path
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Setting the output path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Exit with 0 on success, 1 on failure
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Can someone please rectify the issue in my code?
Regards
Pranav
You need to check whether the output directory already exists and delete it if it does. MapReduce will not write to an output directory that already exists; it insists on creating the directory itself so that earlier results are not overwritten.
Add this:
Path outPath = new Path(args[1]);
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
if (dfs.exists(outPath)) {
    dfs.delete(outPath, true);
}
The output directory in which you are trying to store the output is already present. Either delete the existing directory of the same name or change the name of the output directory.
The output directory should not exist before the program runs. Either delete the existing directory, provide a new directory, or remove the output directory in your program.
I prefer deleting the output directory from the command prompt before running the program.
From command prompt:
hdfs dfs -rm -r <your_output_directory_HDFS_URL>
From java:
Chris Gerken's code is good enough.
As others have noted, you are getting the error because the output directory already exists, most likely because you have tried executing this job before.
You can remove the existing output directory right before you run the job, i.e.:
#hadoop fs -rm -r /MROutput57 && \
hadoop jar WordCountNew.jar WordCountNew /MRInput57/Input-Big.txt /MROutput57
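One detail in the log above deserves a second look: the directory reported as already existing is /MRInput57/Input-Big.txt, which is the input file, not /MROutput57. That would be consistent with main() receiving the class name as args[0], so that args[1] is the input path. This can happen when the jar's manifest already names a Main-Class, in which case hadoop jar passes everything after the jar name to the program; whether that is the case here is an assumption, since the manifest is not shown. If it is, deleting new Path(args[1]) would remove the input file, so it is worth validating the arguments first. The sketch below uses a hypothetical JobPaths helper (not part of Hadoop or of the original code) that takes the last two arguments as input and output and ignores any leading extras.

```java
/** Hypothetical helper; names are illustrative and not from the original post. */
class JobPaths {
    final String input;
    final String output;

    JobPaths(String input, String output) {
        this.input = input;
        this.output = output;
    }

    /**
     * Treats the last two arguments as input and output paths, so a leading
     * class name passed through by the launcher does not shift them.
     */
    static JobPaths fromArgs(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("Usage: <input path> <output path>");
        }
        return new JobPaths(args[args.length - 2], args[args.length - 1]);
    }

    public static void main(String[] args) {
        // Simulates the argument list main() would see if the class name were passed through.
        String[] simulated = {"WordCountNew", "/MRInput57/Input-Big.txt", "/MROutput57"};
        JobPaths paths = fromArgs(simulated);
        // prints: input=/MRInput57/Input-Big.txt output=/MROutput57
        System.out.println("input=" + paths.input + " output=" + paths.output);
    }
}
```

In the driver, FileInputFormat.addInputPath and FileOutputFormat.setOutputPath could then use paths.input and paths.output instead of args[0] and args[1]. Dropping the class name from the command line would be the other option if the manifest does declare a main class.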

Load shared library from distributed cache

I have a shared library that I copied to HDFS at
/user/uokuyucu/lib/libxxx.so
I also have a WordCount.java with code identical to the tutorial, plus my own FileInputFormat class called MyFileInputFormat, which contains nothing except the following constructor:
public MyFileInputFormat() {
    System.loadLibrary("xxx");
}
and I'm also adding my shared library to distributed cache like this in job setup (main):
DistributedCache.addCacheFile(new URI("/user/uokuyucu/lib/libxxx.so"),
    job.getConfiguration());
I run it as:
hadoop jar mywordcount.jar mywordcount.WordCount input output
and get a java.lang.UnsatisfiedLinkError: no far_jni_interface in java.library.path exception.
How can I load a shared library in my hadoop job?
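System.loadLibrary("xxx") looks for the platform-specific file name (libxxx.so on Linux) on java.library.path, and adding a file to the DistributedCache does not by itself change that path. The sketch below is one possible workaround using only JDK calls: it looks for the library in a given directory and loads it by absolute path with System.load. It assumes the cached file has been made visible in the task's working directory under its original name (for example through a cache symlink), which the post does not confirm; NativeLibLoader and its methods are hypothetical names. Because the constructor of an input format may also run on the machine that submits the job, where no cache files have been localized, the helper returns false instead of throwing when the file is missing.

```java
import java.io.File;

/** Hypothetical helper; names are illustrative and not from the original post. */
class NativeLibLoader {

    /** Maps a short library name to its platform file name inside a directory. */
    static File expectedLibraryFile(File dir, String shortName) {
        // System.mapLibraryName turns "xxx" into e.g. "libxxx.so" on Linux.
        return new File(dir, System.mapLibraryName(shortName));
    }

    /** Loads the library from dir if the file is present; returns false when it is absent. */
    static boolean loadIfPresent(File dir, String shortName) {
        File lib = expectedLibraryFile(dir, shortName);
        if (!lib.isFile()) {
            return false;
        }
        // System.load requires an absolute path, unlike System.loadLibrary.
        System.load(lib.getAbsolutePath());
        return true;
    }

    public static void main(String[] args) {
        // Assumption: the library, if localized, sits in the current working directory.
        File workDir = new File(System.getProperty("user.dir"));
        System.out.println("Looking for " + expectedLibraryFile(workDir, "xxx"));
        System.out.println("Loaded: " + loadIfPresent(workDir, "xxx"));
    }
}
```

Calling loadIfPresent from code that runs only inside the task (for example a record reader or mapper setup) rather than from the input format constructor would avoid attempting the load on the submitting machine.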
