I recently came across an issue when using Hadoop FileSystem API and GlobStatus while writing Mapreduce application.
Here's snippet of the driver program.
FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
Path path = new Path(args[0] + args[1]);
FileStatus[] status = fs.globStatus(path);
Path[] paths = FileUtil.stat2Paths(status);
and here is how I invoke the program
yarn jar MyMapReduceTest01.jar com.abc.test.MyMapRedTestDriver /user/root/raw_data/ abc*
This results in the following exception
Exception in thread "main" java.lang.NullPointerException
at com.abc.test.MyMapRedTestDriver.run(MyMapRedTestDriver.java:42)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.abc.test.MyMapRedTestDriver.main(MyMapRedTestDriver.java:77)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The files in the /user/root/raw_data/ directory are named as
abc_01.txt
abc_02.txt
...
there seems to be a problem in the way the globstatus is handled in the Hadoop filesystem API.
To make the above program work without any code change I just needed to change the command to
yarn jar MyMapReduceTest01.jar com.abc.test.MyMapRedTestDriver /user/root/raw_data/ abc_*
Related
I'm making my first steps in Hadoop's Pig Latin,but I'm really blocked since i can't load any input data even if it exists
R = LOAD '/home/cloudera/Desktop/vol.csv' USING PigStorage(';')
AS
(AnneeVol:int,MoisVol:int,JourVol:int,NumVol:int,AeroDep:chararray,AeroArriv:chararray,DistVol:int);
File Location
upon running :
DUMP R;
I get this error
> Failed Jobs:
JobId Alias Feature Message Outputs
job_1590934825774_0010 R MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/pig_lab/input/vol.csv
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:305)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1304)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1304)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:191)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:270)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/pig_lab/input/vol.csv
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
... 18 more
hdfs://quickstart.cloudera:8020/tmp/temp14790577/tmp132115611,
Input(s): Failed to read data from "/home/cloudera/pig_lab/input/vol.csv"
Output(s): Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp14790577/tmp132115611"
How do I fix this issue ?
Thank you !
The program try to find the vol.csv in the HDFS not on your local file system
Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/pig_lab/input/vol.csv
Please check core-site.xml for your default fileSystem. Currently it would have value hdfs://quickstart.cloudera:8020. Thats why it searches the file HDFS. You DON'T have to change anything there.
Just add file:// tag before the path to tell the program to find the vol.csv from your local fileSystem.
R = LOAD 'file:///home/cloudera/Desktop/vol.csv' USING PigStorage(';') AS (AnneeVol:int,MoisVol:int,JourVol:int,NumVol:int,AeroDep:chararray,AeroArriv:chararray,DistVol:int);
Ref: Cloudera blog
If that doesn’t help either put the file in HDFS and then refer that location in your code.
hdfs dfs -put /home/cloudera/Desktop/vol.csv hdfs://quickstart.cloudera:8020/user/<hdfs-user>/
Then in your code
R = LOAD '/user/<hdfs-user>/vol.csv' USING PigStorage(';') AS (AnneeVol:int,MoisVol:int,JourVol:int,NumVol:int,AeroDep:chararray,AeroArriv:chararray,DistVol:int);
Hadoop 2.9.1, standalone installation.
The hdfs directory is organized by time (yyyyMMdd/HH/mm), like, hdfs://server1:9000/foo/20190410/10/00. And there're several files in each minute.
What I need to do is, process data for each hour, for example, process all data under hdfs://server1:9000/foo/20190410/10. So the mapreduce input setting is something like,
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat.class);
Path inputPath = new Path("hdfs://server1:9000/foo/20190410/10");
org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat.addInputPath(job, inputPath);
But I keep getting this,
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://server01:9000/foo/20190410/10/00/data
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1533)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1526)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1526)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:67)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:393)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:314)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:331)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:202)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
at com.misc.mr.TestJob.main(TestJob.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
I have no idea why it try to access path hdfs://server01:9000/foo/20190410/10/00/data
If the input is a file instead of a folder (for example hdfs://server1:9000/foo/20190410/10/00/part1), it works fine.
Can anyone please help to give some explanation? Many thanks.
Figured out.
Set mapreduce.input.fileinputformat.input.dir.recursive to true.
Or
In code, call, FileInputFormat.setInputDirRecursive(job, true)
I have created a input directory and put sample file in it.I have created an output directory also.but at the time of mapreduce program execution i got the below error.Here is my command to execute mapreduce
bin/hdfs dfs -mkdir /input
bin/hdfs dfs -put /home/biswajit/sample.txt /input/
bin/hadoop jar /usr/local/hadoop/hadoop-2.9.0/share/hadoop/mapreduce/units.jar com.hadoop.ProcessUnits /input/sample.txt /output
Error is
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: **hdfs://localhost:54310/home/biswajit/input/sample.txt**
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:341)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:333)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:202)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:576)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:571)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:571)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:871)
at com.hadoop.ProcessUnits.main(ProcessUnits.java:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
$HADOOP_HOME/input doesn't exist on HDFS.
$HADOOP_HOME is a bash variable on your local filesystem.
You only created a directory for /input, so you can either mkdir the full path with the variable, if you want that command to run as-is, or you need to remove the variable when running the JAR file
As long as hdfs dfs -ls /input/* shows some files, then that command looks fine otherwise, but I'm not sure what that Java class is actually expecting as input
Note: there is a difference between
hdfs://localhost:54310/home/biswajit/input
And
hdfs://localhost:54310/input
More specifically, HDFS doesn't have /home folders, so it looks like you're either not running pseudo distributed cluster, or you made that directory yourself.
I have installed hadoop 2.4.1 and hbase 0.98.8 in 2 machines. When I run an hbase mapreduce job I get the below error:
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://pc1/opt/hbase-0.98.8-hadoop2/lib/hbase-server-0.98.8-hadoop2.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:265)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:389)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at thesis.test2.run(test2.java:93)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at thesis.test2.main(test2.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
I can run hadoop mapreduce jobs and simple hbase jobs without any problems. The code I m trying to run is an example that is supposed to run.
Please provide "jps" output.
Because it seems like your hbase is not working , hopefully the problem will be with zookeeper
I faced the exact problem. You have to add the hbase library path to the .bashrc file. Add the lib folder in hbase to the CLASSPATH.
Also, add the classpath of hbase to HADOOP_CLASSPATH.
Your .bashrc file should contain the following:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`${HBASE_HOME}/bin/hbase classpath`
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`${HBASE_HOME}/bin/hbase mapredcp`
export CLASSPATH=${HBASE_HOME}/lib/*
Note: The CLASSPATH should point to the lib folder of your hbase installation folder. Use the following to compile and run your java code.
javac Example.java
java -classpath $CLASSPATH:. Example
I am trying to copy some files from S3 bucket to HDFS of my EMR cluster. But I am getting the following error:
Exception in thread "main" java.lang.RuntimeException: Error running job
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:771)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:580)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://10.87.26.26:9000/tmp/33e4f3b9-d29a-49e8-9706-ea70e07e3ff2/files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:491)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:508)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:392)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:751)
... 9 more
The command I am using is :
./elastic-mapreduce --jobflow j-12345678 --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3n://my-bucket/data/,--dest,hdfs:///data/in,--srcPattern,xyz01-1-1*ped*' --step-name "Copy input files to HDFS" --wait-for-steps
I tried to run the sample word-count job, to check if there is any issue with HDFS, but it ran fine.
Can anyone please help me with this? If any more info is needed, please let me know and I will update the description.
Usually its the --srcPattern '<regex>' argument. You can also use hadoop fs -cp s3://src/file1.something /my/output/path/ to test for 1 file and modify your regex. Also starting with .* any char-0 or more times, should relax the matching.
It would be great to know if regex non-matches get logged and where.