Hadoop Pig Latin always fails to Load Data - hadoop

I'm making my first steps in Hadoop's Pig Latin,but I'm really blocked since i can't load any input data even if it exists
R = LOAD '/home/cloudera/Desktop/vol.csv' USING PigStorage(';')
AS
(AnneeVol:int,MoisVol:int,JourVol:int,NumVol:int,AeroDep:chararray,AeroArriv:chararray,DistVol:int);
File Location
upon running :
DUMP R;
I get this error
> Failed Jobs:
JobId Alias Feature Message Outputs
job_1590934825774_0010 R MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/pig_lab/input/vol.csv
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:305)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1304)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1304)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:191)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:270)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/pig_lab/input/vol.csv
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
... 18 more
hdfs://quickstart.cloudera:8020/tmp/temp14790577/tmp132115611,
Input(s): Failed to read data from "/home/cloudera/pig_lab/input/vol.csv"
Output(s): Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp14790577/tmp132115611"
How do I fix this issue ?
Thank you !

The program try to find the vol.csv in the HDFS not on your local file system
Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/pig_lab/input/vol.csv
Please check core-site.xml for your default fileSystem. Currently it would have value hdfs://quickstart.cloudera:8020. Thats why it searches the file HDFS. You DON'T have to change anything there.
Just add file:// tag before the path to tell the program to find the vol.csv from your local fileSystem.
R = LOAD 'file:///home/cloudera/Desktop/vol.csv' USING PigStorage(';') AS (AnneeVol:int,MoisVol:int,JourVol:int,NumVol:int,AeroDep:chararray,AeroArriv:chararray,DistVol:int);
Ref: Cloudera blog
If that doesn’t help either put the file in HDFS and then refer that location in your code.
hdfs dfs -put /home/cloudera/Desktop/vol.csv hdfs://quickstart.cloudera:8020/user/<hdfs-user>/
Then in your code
R = LOAD '/user/<hdfs-user>/vol.csv' USING PigStorage(';') AS (AnneeVol:int,MoisVol:int,JourVol:int,NumVol:int,AeroDep:chararray,AeroArriv:chararray,DistVol:int);

Related

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

I have created a input directory and put sample file in it.I have created an output directory also.but at the time of mapreduce program execution i got the below error.Here is my command to execute mapreduce
bin/hdfs dfs -mkdir /input
bin/hdfs dfs -put /home/biswajit/sample.txt /input/
bin/hadoop jar /usr/local/hadoop/hadoop-2.9.0/share/hadoop/mapreduce/units.jar com.hadoop.ProcessUnits /input/sample.txt /output
Error is
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: **hdfs://localhost:54310/home/biswajit/input/sample.txt**
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:341)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:333)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:202)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:576)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:571)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:571)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:871)
at com.hadoop.ProcessUnits.main(ProcessUnits.java:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
$HADOOP_HOME/input doesn't exist on HDFS.
$HADOOP_HOME is a bash variable on your local filesystem.
You only created a directory for /input, so you can either mkdir the full path with the variable, if you want that command to run as-is, or you need to remove the variable when running the JAR file
As long as hdfs dfs -ls /input/* shows some files, then that command looks fine otherwise, but I'm not sure what that Java class is actually expecting as input
Note: there is a difference between
hdfs://localhost:54310/home/biswajit/input
And
hdfs://localhost:54310/input
More specifically, HDFS doesn't have /home folders, so it looks like you're either not running pseudo distributed cluster, or you made that directory yourself.

Pig Error trying Edureka's Tutorial

I've been trying to run hadoop and other components like PIG.
I'm trying this tutorial:
https://www.edureka.co/blog/pig-programming-create-your-first-apache-pig-script/
Everything is right, but when I run the script at step 2 it throws this error:
2018-01-09 13:47:20,682 [JobControl] INFO org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob - PigLatin:output.pig got an error while submitting
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://localhost:9000/carlos/information.txt
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
at java.lang.Thread.run(Thread.java:748)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Prior to step 1, you needed this command
hadoop fs –copyFromLocal /home/carlos/information.txt /carlos
However, this will copy your file to a file named /carlos on HDFS, if that directory doesn't already exist.
If you want /carlos to be a directory, you need to delete the file and make it
hadoop fs -rm /carlos
hadoop fs -mkdir /carlos
Also, trailing slashes should generally be used when copying files into a directory, like so
hadoop fs –copyFromLocal /home/carlos/information.txt /carlos/
You could also just make your Pig code load /carlos as the file. Even if it was a directory, this would still work to read all files in there

Input path does not exist: hdfs://localhost:9000/user/rab/input

Following is the output of my job ... on single mode it runs well but on pseudo distribute mode it throws the following error all the time... I have tried a lot but could not meet a possible solution from anyone yet. I need a quick fix of the problem.
Highly obliged upon success...
rab#rab-VirtualBox:~/hadoop/hadoop-1.2.1$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
17/03/17 22:13:12 INFO util.NativeCodeLoader: Loaded the native-hadoop library
17/03/17 22:13:12 WARN snappy.LoadSnappy: Snappy native library not loaded
17/03/17 22:13:12 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/tmp/hadoop-rab/mapred/staging/rab/.staging/job_201703172201_0004
17/03/17 22:13:12 ERROR security.UserGroupInformation: PriviledgedActionException as:rab cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/rab/input
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/rab/input
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.hadoop.examples.Grep.run(Grep.java:69)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.examples.Grep.main(Grep.java:93)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
The log output shows error:
"org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/rab/input"
Do you have a directory called "input" under /user/rab on the HDFS file system? The error indicates that you don't ! The error also shows that it is looking at hdfs and not the local file system. You can check with the following command
"hdfs dfs -ls /user/rab"
The full command that you used "bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'" expects the command to be of the form "hadoop jar jarfilename classname hdfsinputdirectory hdfsoutputdirectory" where classname is the name of the class in the jar file that you want to run.

Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_

I am writing a spark streaming application which reads input from HDFS. I submit spark application to yarn and then run a script which copies data from local fs to HDFS.
But Spark application starts throwing fileNotFoundException.
I believe this is happening because spark is picking up files before it is being copied fully onto HDFS.
Following is the some part of exception trace:
java.io.FileNotFoundException: File does not exist: <filename>._COPYING_
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1932)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1853)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1825)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:559)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:87)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:363)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
Any suggestion how to resolve this?
Thanks
You need to have your data producer filename the current file being copied different from the completely copied file. Then you need to add filter on file dstream getting the completely copied file only. E.g.
Ongoing copy file prefix _copying*
Completely copies file prefix data*

Error while copying from S3 to HDFS

I am trying to copy some files from S3 bucket to HDFS of my EMR cluster. But I am getting the following error:
Exception in thread "main" java.lang.RuntimeException: Error running job
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:771)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:580)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://10.87.26.26:9000/tmp/33e4f3b9-d29a-49e8-9706-ea70e07e3ff2/files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:491)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:508)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:392)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:751)
... 9 more
The command I am using is :
./elastic-mapreduce --jobflow j-12345678 --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3n://my-bucket/data/,--dest,hdfs:///data/in,--srcPattern,xyz01-1-1*ped*' --step-name "Copy input files to HDFS" --wait-for-steps
I tried to run the sample word-count job, to check if there is any issue with HDFS, but it ran fine.
Can anyone please help me with this? If any more info is needed, please let me know and I will update the description.
Usually its the --srcPattern '<regex>' argument. You can also use hadoop fs -cp s3://src/file1.something /my/output/path/ to test for 1 file and modify your regex. Also starting with .* any char-0 or more times, should relax the matching.
It would be great to know if regex non-matches get logged and where.

Resources