PIG Unable to Read Local CSV Leading to Job Failure - hadoop

Relatively new to the pig/hadoop ecosystem and encountering a frustrating issue when trying to execute a simple DUMP. I am trying to call the below pig script (the file is local, not HFDS, so I am opening the pig shell using pig -x local).
REGISTER utils.py USING jython AS utils;
events = LOAD '../test/events.csv' USING PigStorage(',') AS (patientid:int, eventid:chararray, eventdesc:chararray, timestamp:chararray, value:float);
events = FOREACH events GENERATE patientid, eventid, ToDate(timestamp, 'yyyy-MM-dd') AS etimestamp, value;
DUMP events;
However, when doing this, I receive the following error messages (failed job summary below, full PIG stack trace at bottom):
Input(s): Failed to read data from "file:///bootcamp/test/events.csv"
Output(s): Failed to produce result in "file/tmp/temp/305054006/tmp-908064458"
Pig Stack Trace:
ERROR 1066: Unable to open iterator for alias events. Backend error : java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias events. Backend error : java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
at org.apache.pig.PigServer.openIterator(PigServer.java:925)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:746)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:558)
at org.apache.pig.Main.main(Main.java:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.getStats(MapReduceLauncher.java:822)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:452)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:280)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
at org.apache.pig.PigServer.store(PigServer.java:997)
at org.apache.pig.PigServer.openIterator(PigServer.java:910)
... 13 more
Caused by: java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:294)
at org.apache.hadoop.mapreduce.Job.getTaskReports(Job.java:540)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getTaskReports(HadoopShims.java:235)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.getStats(MapReduceLauncher.java:801)
...20 more
I have seen similar issues in regards to failed jobs, but sadly I haven't managed to hunt down a resolution as of yet.
EDIT: I should mention that when following the PIG tutorial at the below link, I was encountering the same issue.
http://www.sunlab.org/teaching/cse8803/fall2016/lab/hadoop-pig/

So, I found I was able to "DUMP" the file by doing the following:
tmp = events 100000; --any int larger than number of rows
dump tmp;
I had seen a similar issue on here, and was able to resolve by running as root.

Related

Getting error as Failed to create data storage when trying to load the data from HDFS with MovieLens data

I am trying to load data from HDFS to Pig but I am getting error as Failed to create Data Storage.
The command that I executed was:
movies = LOAD 'hdfs://localhost:9000/Movie_Lens/ratings' USING PigStorage(':') AS (user_id, dummy1, movie_id, dummy2, movie_rating, dummy3, timestamp);
I tried to find the mentioned problem in stack overflow but the link that I got are not related to HDFS and Pig, they are related to HDFS and HBase or Pig and HBase.
The detail of the log file is mentioned below.
Somewhere in the log file I found this mentioned:
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
Pig Stack Trace
ERROR 1200: Failed to create DataStorage
Failed to parse: Failed to create DataStorage
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:201)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1707)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1680)
at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:565)
at org.apache.pig.Main.main(Main.java:177)
Caused by: java.lang.RuntimeException: Failed to create DataStorage
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:53)
at org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:109)
at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:189)
at org.apache.pig.builtin.PigStorage.getSchema(PigStorage.java:538)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:901)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
... 10 more
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy4.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:70)
... 23 more
To solve this problem I tried doing 'ant'
so when I run the command
bash ant -version
in ant bin folder it is working
but when I am running the command
bash ant clean jar-all -Dhadoopversion=23
in bin folder it is not working. In some of the links I found that new version of pig does not have jar-all command so I tried the following command
bash ant clean jar -Dhadoopversion=23
and this command is also not working.

Reading data from csv file using pig

I'm trying to read a csv file on pig shell on mac. All I'm doing is load a file into a variable and dump the variable. Here is how I'm doing it:
movies = LOAD '/user/myhome/movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
DUMP movies;
The data I'm using is downloaded from github provided here
This file is available in locally installed hdfs on my mac. When I do dump I get an error:
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias movies
at org.apache.pig.PigServer.openIterator(PigServer.java:935) at
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66) at
org.apache.pig.Main.run(Main.java:565) at
org.apache.pig.Main.main(Main.java:177) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497) at
org.apache.hadoop.util.RunJar.run(RunJar.java:221) at
org.apache.hadoop.util.RunJar.main(RunJar.java:136) Caused by:
java.io.IOException: Job terminated with anomalous status FAILED at
org.apache.pig.PigServer.openIterator(PigServer.java:927) ... 13 more
When I hit the app cluster link when this job is run, I get the following exception:
Diagnostics: Exception from container-launch.
Container id: container_1443887668938_0007_02_000001 Exit code: 127
Stack trace: ExitCodeException exitCode=127: at
org.apache.hadoop.util.Shell.runCommand(Shell.java:538) at
org.apache.hadoop.util.Shell.run(Shell.java:455) at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745) Container exited with a
non-zero exit code 127 Failing this attempt. Failing the application.
Pig version is 0.15.0 and hadoop is 2.6.1. Am I missing something here?
You could use CSVLoader from piggybank. Get the piggybank jar if not available and register it and use the CSVLoader. Something like this.
register '/your/path/to/piggybank/jar' ;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
movies = LOAD '/user/myhome/movies_data.csv' USING CSVLoader as (id,name,year,rating,duration);

Job fails to read from one ORC file and write a subset to another

Working in the Apache Pig interactive shell in HDP 2.3 for Windows, I've got an existing ORC file in /path/to/file. If I load and then save that using:
a = LOAD '/path/to/file' USING OrcStorage('');
STORE a INTO '/path/to/second_file' USING OrcStorage('');
Then everything works. However, if I try:
a = LOAD '/path/to/file' USING OrcStorage('');
b = LIMIT a 10;
STORE b INTO '/path/to/third_file' USING OrcStorage('');
Then I get the following error traceback in the logs for the second job (out of two that it schedules):
2015-08-25 16:03:42,161 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/orc/OrcNewOutputFormat
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:657)
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:726)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc(POStore.java:251)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:88)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:71)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:289)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:476)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:458)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1560)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:458)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:377)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1518)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1515)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1448)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
I suspect that the classpath for the two jobs is different, causing a ClassNotFound. Is that likely to be the case? If so, how can I fix it? (Bonus question: Why has this happened?)
Check the dependent library for OrcStorage is placed in all nodes.
The first option only spawn single job
The second option will spawn multiple jobs which maybe run in different machine
which doesnt have the dependent library in its classpath.

Unable to open iterator for alias -Pig

I am trying to load an XML file using XMLLoader(Piggybank) in Pig, but I get an error saying "unable to open iterator for alias B".
I have written the following code:
REGISTER /home/hdfs/spig/trunk/contrib/piggybank/java/piggybank.jar
A = LOAD '/core-site.xml'using org.apache.pig.piggybank.storage.XMLLoader('property') as (x:chararray);
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<property>\\s*<name>(.*) </name>\\s*<value>(.*)</value>\\s*<description>(.*)</description>\\s*</property>'));
dump B;
The following is the log file:
Pig Stack Trace
ERROR 1066: Unable to open iterator for alias A
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable
to open iterator for alias A at
org.apache.pig.PigServer.openIterator(PigServer.java:935) at
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81) at
org.apache.pig.Main.run(Main.java:631) at
org.apache.pig.Main.main(Main.java:177) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
org.apache.hadoop.util.RunJar.run(RunJar.java:221) at
org.apache.hadoop.util.RunJar.main(RunJar.java:136) Caused by:
java.io.IOException: Job terminated with anomalous status FAILED at
org.apache.pig.PigServer.openIterator(PigServer.java:927) ... 13 more
Looks like one of your jobs might have failed.
Caused by: java.io.IOException: Job terminated with anomalous status FAILED

EMR Hadoop Pig job error "Internal error creating job configuration"

I have a PIG job running on Amazon EMR and suddnly it has stopped working giving the following error:
Pig Stack Trace
---------------
ERROR 2017: Internal error creating job configuration.
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:855)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:294)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1264)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1249)
at org.apache.pig.PigServer.execute(PigServer.java:1239)
at org.apache.pig.PigServer.executeBatch(PigServer.java:333)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:137)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:479)
at org.apache.pig.Main.main(Main.java:159)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: java.lang.NullPointerException
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.adjustNumReducers(JobControlCompiler.java:875)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:480)
... 17 more
================================================================================
Does anyone know why or what might be the problem? this is one of the most vague errors I have ever seen.
The problem actually turned out to be that PIG was unable to locate one of the input files to be processed, yet the error doesn't even remotely suggest a missing file issue.

Resources