Pig script for frequency of books published each year - hadoop

I am trying to run the Pig script by following the steps given in this link: http://www.orzota.com/pig-tutorialfor-beginners/
But I am getting the error below. Pig is not able to read the file I loaded into HDFS.
Can you please help? The error is as follows:
Failed Jobs:
JobId Alias Feature Message Outputs
N/A BookXRecords,CountByYear,GroupByYear GROUP_BY,COMBINER Message: Unexpected System Error Occured: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setupUdfEnvAndStores(PigOutputFormat.java:225)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:186)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:458)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:240)
at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:121)
at java.lang.Thread.run(Thread.java:662)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:271)
/user/hduser/output/pig_output_bookx,
Input(s):
Failed to read data from "/user/hduser/input/BX-BooksCorrected1.txt"
Output(s):
Failed to produce result in "/user/hduser/output/pig_output_bookx"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
null
2015-02-19 22:19:45,852 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!

As I understand it, you have just downloaded the script and run it. That is why the script cannot find the file you want to run it on. Please ensure that:
You have run the sed command to filter the books.csv file on your local system.
The filtered file produced by sed has the name referenced in the script (I am not sure what it is in your case, but it should be BX-BooksCorrected or BX-BooksCorrected1; please check).
You have then moved that file into HDFS. Running the script after that should work without this error.
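A minimal sketch of those steps, assuming the script loads /user/hduser/input/BX-BooksCorrected1.txt as shown in the error log (the sed filter below is only a placeholder; use the exact expression from the tutorial):
# Placeholder filter -- replace with the tutorial's sed command
sed 's/"//g' BX-Books.csv > BX-BooksCorrected1.txt
# Create the input directory in HDFS (drop -p on Hadoop 1.x) and upload the filtered file
hadoop fs -mkdir -p /user/hduser/input
hadoop fs -put BX-BooksCorrected1.txt /user/hduser/input/
# Re-run the Pig script; its LOAD path must match the HDFS location above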
P.S.: You can learn the nature of the error by carefully reading the error log. Happy Hadooping!

Related

Unable to initialize any output collector in CDH5.3

15/05/24 06:11:40 INFO mapreduce.Job: Task Id : attempt_1432456238397_0004_m_000000_0, Status : FAILED
Error: java.io.IOException: Unable to initialize any output collector
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:412)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:439)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
I am using the CDH 5.3 Cloudera QuickStart VM and wrote a MapReduce program. When I run it from the shell, I get the above exception.
Can anyone please help me resolve this?
The error "Unable to initialize any output collector" indicates that the job failed to start the container's, there can be multiple reasons for the same. However, one must review the container logs at hdfs to identify the cause the error.
In this specific instance, the value of mapreduce.task.io.sort.mb value was entered greater than 2047 MB, however the maximum value which it allows is 2047 MB, thus anything above its causes the jobs to fail marking the value provided as Invalid.
Solution:
Set the value of mapreduce.task.io.sort.mb < 2048MB
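One way to apply this is sketched below, assuming your driver uses ToolRunner/GenericOptionsParser so that -D options are picked up (the jar name, class, and paths are hypothetical); the property can also be set cluster-wide in mapred-site.xml:
# Any value below 2048 MB is accepted; 512 MB is shown here
hadoop jar my-job.jar com.example.MyDriver \
    -D mapreduce.task.io.sort.mb=512 \
    /input/path /output/path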
Reference:
https://support.pivotal.io/hc/en-us/articles/205649987-Map-Reduce-job-failed-with-Unable-to-initialize-any-output-collector-
CDH5.2: MR, Unable to initialize any output collector
https://community.cloudera.com/t5/Storage-Random-Access-HDFS/HBase-MapReduce-Job-Error-java-io-IOException-Unable-to/td-p/23786

Spark Streaming UpdateStateByKey

I am running a Spark Streaming job 24x7 and using the updateStateByKey function to save the computed historical data, as in the NetworkWordCount example.
I am trying to stream a file with 300,000 (3 lakh) records, sleeping 1 second after every 1,500 records.
I am using 3 workers.
Over time the state kept by updateStateByKey grows, and then the program throws the following exception:
ERROR Executor: Exception in task ID 1635
java.lang.ArrayIndexOutOfBoundsException: 3
14/10/23 21:20:43 ERROR TaskSetManager: Task 29170.0:2 failed 1 times; aborting job
14/10/23 21:20:43 ERROR DiskBlockManager: Exception while deleting local spark dir: /var/folders/3j/9hjkw0890sx_qg9yvzlvg64cf5626b/T/spark-local-20141023204346-b232
java.io.IOException: Failed to delete: /var/folders/3j/9hjkw0890sx_qg9yvzlvg64cf5626b/T/spark-local-20141023204346-b232/24
14/10/23 21:20:43 ERROR Executor: Exception in task ID 8037
java.io.FileNotFoundException: /var/folders/3j/9hjkw0890sx_qg9yvzlvg64cf5626b/T/spark-local-20141023204346-b232/22/shuffle_81_0_1 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
How do I handle this?
I guess updateStateByKey should be periodically reset since it is growing at a rapid rate. Please share an example of when and how to reset updateStateByKey, or is there some other problem? Please shed some light.
Any help is much appreciated. Thanks for your time.
Did you set a checkpoint directory? updateStateByKey requires checkpointing to be enabled:
ssc.checkpoint("path to checkpoint")

Loading data from HDFS to HBASE

I'm using Apache Hadoop 1.1.1 and Apache HBase 0.94.3. I want to load data from HDFS into HBase.
I wrote a Pig script to serve the purpose. First I created the HBase table, and then I wrote the Pig script to load the data from HDFS into HBase. But it is not loading the data into the HBase table, and I don't know where it is going wrong.
Below is the command used to create the HBase table:
create 'mydata','mycf'
Below is the pig script to load data from hdfs to hbase:
A = LOAD '/user/hduser/Dataparse/goodrec1.txt' USING PigStorage(',') as (c1:int, c2:chararray,c3:chararray,c4:int,c5:chararray);
STORE A INTO 'hbase://mydata'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'mycf:c1,mycf:c2,mycf:c3,mycf:c4,mycf:c5');
After executing the script it says:
2014-04-29 16:01:06,367 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-04-29 16:01:06,376 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2014-04-29 16:01:06,382 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.1.1 0.12.0 hduser 2014-04-29 15:58:07 2014-04-29 16:01:06 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201403142119_0084 A MAP_ONLY Message: Job failed! Error - JobCleanup Task Failure, Task: task_201403142119_0084_m_000001 hbase://mydata,
Input(s):
Failed to read data from "/user/hduser/Dataparse/goodrec1.txt"
Output(s):
Failed to produce result in "hbase://mydata"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201403142119_0084
2014-04-29 16:01:06,382 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
Please help; where am I going wrong?
You have specified too many columns in the output to HBase. You have 5 input columns and 5 output columns, but HBaseStorage requires the first column to be the row key, so there should only be 4 in the output.
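A minimal sketch of the corrected script, assuming c1 should serve as the HBase row key (only the remaining four fields are listed as column mappings):
A = LOAD '/user/hduser/Dataparse/goodrec1.txt' USING PigStorage(',')
    AS (c1:int, c2:chararray, c3:chararray, c4:int, c5:chararray);
-- c1 becomes the row key; only c2..c5 are mapped to columns in mycf
STORE A INTO 'hbase://mydata'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'mycf:c2,mycf:c3,mycf:c4,mycf:c5');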

Transfer data to HBASE using Pig

I have a pseudo-distributed setup. The versions are as follows:
kandabap#prakashl:~$ hadoop version
Hadoop 1.2.0
kandabap#prakashl:~$ pig version
Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
kandabap#prakashl:~$ hbase version
HBase 0.90.1-cdh3u0
I am trying to transfer data into HBase using a Pig script as follows:
PIG_CLASSPATH=/usr/lib/hbase/hbase-0.94.8/hbase-0.94.8.jar:/usr/lib/hbase/hbase-0.94.8/lib/zookeeper-3.4.5.jar /usr/lib/pig/pig-0.12.0/bin/pig /home/kandabap/Documents/H/HBASE/scripts.pig
But there seem to be some errors:
Failed Jobs:
JobId Alias Feature Message Outputs
job_201405201031_0005 raw_data MAP_ONLY Message: Job failed! Error - JobCleanup Task Failure, Task: task_201405201031_0005_m_000001 hbase://sample_names,
Input(s):
Failed to read data from "/input.csv"
Output(s):
Failed to produce result in "hbase://sample_names"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201405201031_0005
2014-05-20 11:14:59,604 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2014-05-20 11:14:59,608 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message
Details at logfile: /usr/lib/hbase/hbase-0.94.8/pig_1400548430579.log

Running a Pig script gives the error: job has failed! Stop running all dependent jobs

I need help understanding why I am getting an error while running a Pig script. When I try the same script on smaller data, it executes successfully.
There are a few questions with similar issues, but none of them have the solution.
My script looks like this:
A = LOAD 'test.txt' USING TextLoader();
B = FOREACH A GENERATE STRSPLIT($0, '","') AS t;
C = FILTER B BY (t.$1 == 2 AND t.$2 MATCHES '.*xxx.*');
STORE C INTO 'temp';
The error is:
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete
2013-07-15 14:21:41,914 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201307111759_7495 has failed! Stop running all dependent jobs
2013-07-15 14:21:41,914 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-07-15 14:21:42,754 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /xxx/ temp/_temporary/_attempt_201307111759_7495_m_000527_0/part-m-00527 File does not exist. Holder DFSClient_attempt_201307111759_7495_m_000527_0 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1606)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1597)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1652)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1640)
at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:689)
at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.c
2013-07-15 14:21:42,754 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
Any help will be appreciated.
Thanks.
After some research, I found that the problem here is the LeaseExpiredException. This can happen when the output of a mapper is removed; one reason for that can be the quota allocated to the user. In my case, I was running this script on very large data, and my quota was insufficient to process/store it.
We can check the quota by the following command:
hadoop fs -count -q /user/username
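For reference, a sketch of how to read the -count -q output and one quick way to reclaim space when the remaining space quota is nearly exhausted (the expunge step only empties your own HDFS trash):
# Column order of the -count -q output, one line per path:
#   QUOTA  REMAINING_QUOTA  SPACE_QUOTA  REMAINING_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
# If REMAINING_SPACE_QUOTA is near zero, the job's temporary output cannot be written.
hadoop fs -expunge   # empty the trash to free quota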
Thank you.
