How to divert the output of pig command to a text file in order to print it out? - hadoop

2015-09-24 01:59:28,436 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2015-09-24 01:59:28,539 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-09-24 01:59:28,556 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-09-24 01:59:28,560 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-09-24 01:59:28,561 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2015-09-24 01:59:28,620 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-09-24 01:59:28,624 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2015-09-24 01:59:28,638 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2015-09-24 01:59:28,640 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-09-24 01:59:28,641 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2015-09-24 01:59:29,268 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/pig/pig-0.14.0-core-h2.jar to DistributedCache through /tmp/temp-1176581946/tmp-2078805221/pig-0.14.0-core-h2.jar
2015-09-24 01:59:29,452 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1176581946/tmp-1750967439/automaton-1.11-8.jar
2015-09-24 01:59:29,538 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1176581946/tmp1997290065/antlr-runtime-3.4.jar
2015-09-24 01:59:29,843 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/hadoop/share/hadoop/common/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1176581946/tmp-256046780/guava-11.0.2.jar
2015-09-24 01:59:29,990 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/pig/lib/joda-time-2.1.jar to DistributedCache through /tmp/temp-1176581946/tmp955728106/joda-time-2.1.jar
2015-09-24 01:59:30,129 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2015-09-24 01:59:30,131 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2015-09-24 01:59:30,131 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2015-09-24 01:59:30,132 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2015-09-24 01:59:30,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-09-24 01:59:30,283 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2015-09-24 01:59:30,568 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2015-09-24 01:59:30,868 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-09-24 01:59:30,871 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-09-24 01:59:30,874 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2015-09-24 01:59:31,190 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2015-09-24 01:59:31,499 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1443082231600_0003
2015-09-24 01:59:31,516 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2015-09-24 01:59:31,704 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1443082231600_0003
2015-09-24 01:59:31,738 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://ubuntu:8088/proxy/application_1443082231600_0003/
2015-09-24 01:59:31,742 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1443082231600_0003
2015-09-24 01:59:31,745 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases highsal,salaries
2015-09-24 01:59:31,745 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: salaries[3,10],salaries[-1,-1],highsal[13,9] C: R:
2015-09-24 01:59:31,781 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-09-24 01:59:31,782 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1443082231600_0003]
2015-09-24 02:00:48,025 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2015-09-24 02:00:48,025 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1443082231600_0003]
2015-09-24 02:00:53,055 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2015-09-24 02:00:53,104 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-09-24 02:00:58,180 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-09-24 02:00:59,182 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-09-24 02:01:00,185 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
(F,96,86000.0,95105)
(M,24,80000.0,95050)
(F,84,89000.0,94040)
(M,36,85000.0,95101)
(F,69,91000.0,95050)
(F,96,80000.0,95051)
(M,78,87000.0,95105)
(M,25,96000.0,95103)
(M,89,90000.0,95102)
(F,82,77000.0,95051)
(M,97,96000.0,95102)
(F,39,82000.0,95051)
(M,36,79000.0,95101)
(M,75,84000.0,95103)
(F,78,91000.0,95102)
(M,59,77000.0,95051)
(F,52,76000.0,95050)
(M,52,97000.0,95102)
(F,28,98000.0,95105)
(M,91,96000.0,94041)
(F,47,85000.0,95051)
(M,79,85000.0,95101)
(F,93,93000.0,95102)
(F,33,82000.0,95101)
(F,77,96000.0,95103)
(F,93,84000.0,95051)
(M,23,83000.0,95050)
(M,54,97000.0,95101)
(F,25,93000.0,94040)
(M,52,85000.0,95102)
(M,60,78000.0,94040)
(F,74,89000.0,94040)
(F,23,76000.0,95101)
(M,46,93000.0,95051)
(F,63,92000.0,95105)
(F,86,93000.0,95101)
(F,37,95000.0,95101)
(M,41,89000.0,95050)
(F,89,77000.0,94041)
(F,82,84000.0,95050)
(M,66,96000.0,95051)
(F,75,79000.0,95051)
(M,91,90000.0,95105)
(M,27,98000.0,95051)
(M,24,85000.0,94041)
(M,82,96000.0,95050)
(F,75,88000.0,95101)
(F,80,77000.0,95051)
(M,63,80000.0,95101)
(M,29,86000.0,95103)
(F,44,91000.0,95101)
(M,40,78000.0,95103)
(F,46,83000.0,95051)
(F,42,85000.0,95105)
(M,44,90000.0,95102)
(F,26,90000.0,94041)
(F,31,87000.0,95051)
(F,88,76000.0,95050)
(M,67,87000.0,95102)
(F,58,86000.0,94041)
(F,57,85000.0,95051)
(M,97,85000.0,95101)
(M,73,90000.0,95103)
(M,47,95000.0,95105)
(F,83,98000.0,94040)
(F,56,78000.0,95101)
(M,72,89000.0,94041)
(M,90,99000.0,95101)
(F,59,79000.0,95105)
(F,32,84000.0,95051)
(F,60,93000.0,95103)
(M,47,87000.0,94041)
(M,52,87000.0,95103)
(M,82,92000.0,95051)
(M,39,87000.0,95102)
(F,93,89000.0,95103)
(M,31,88000.0,95050)
(M,21,92000.0,94040)
(F,65,84000.0,95050)
(M,68,89000.0,94041)
(F,63,92000.0,94041)
(F,95,77000.0,95050)
(F,34,98000.0,95102)
(F,44,94000.0,94040)
(M,69,81000.0,95103)
(F,30,85000.0,95051)
(F,85,82000.0,95050)
(M,75,78000.0,94040)
(F,91,94000.0,95105)
(F,71,91000.0,94041)
(M,39,91000.0,95051)
(M,43,90000.0,95105)
(F,35,94000.0,94040)
(F,41,83000.0,95051)
(M,62,94000.0,94041)
(F,38,77000.0,94041)
(F,63,89000.0,95051)
(M,78,90000.0,95050)
(M,65,92000.0,95101)
(F,42,94000.0,95103)
(M,65,80000.0,95103)
(F,38,91000.0,95102)
(M,58,93000.0,94040)
(F,63,83000.0,95103)
(F,23,96000.0,95103)
(F,43,96000.0,95102)
(F,27,86000.0,94041)
(M,94,76000.0,94041)
(F,53,79000.0,94041)
(M,78,79000.0,95102)
(F,62,82000.0,95101)
(M,86,83000.0,95051)
(F,91,98000.0,95105)
(M,61,99000.0,95103)
(M,58,94000.0,95050)
(F,47,99000.0,95102)
(F,24,89000.0,95101)
(M,80,92000.0,95051)
(F,30,83000.0,95102)
(F,35,86000.0,95051)
(M,69,82000.0,95102)
(F,49,83000.0,95105)
(M,59,82000.0,95103)
(F,74,84000.0,95103)
(F,82,83000.0,95051)
(M,32,85000.0,95102)
(M,39,91000.0,95103)
(M,50,95000.0,95051)
(M,98,89000.0,95105)
(M,84,96000.0,95050)
(M,61,90000.0,95103)
(F,69,83000.0,95102)
(F,59,91000.0,95101)
(M,79,90000.0,95050)
(F,98,83000.0,95050)
(F,65,78000.0,94040)
(F,74,81000.0,95103)
(M,83,97000.0,95101)
(M,42,92000.0,95102)
(M,82,92000.0,95105)
(F,41,91000.0,94041)
(F,35,97000.0,94040)
(F,46,85000.0,95050)
(M,34,86000.0,94041)
(F,37,85000.0,94041)
(M,64,91000.0,94040)
(M,92,84000.0,95051)
(M,56,83000.0,95103)
(F,68,98000.0,95101)
(M,28,81000.0,95050)
(F,81,93000.0,95050)
(M,71,87000.0,95051)
(M,90,86000.0,95050)
(F,92,78000.0,94041)
(M,42,97000.0,95101)
(F,97,83000.0,94041)
(M,41,86000.0,95051)
(F,96,99000.0,95102)
(F,56,96000.0,95051)
(F,63,99000.0,95105)
(F,69,89000.0,95050)
(M,67,85000.0,95105)
(M,61,83000.0,95051)
(M,86,96000.0,95103)
(F,84,82000.0,94041)
(M,91,90000.0,95050)
(F,36,99000.0,94041)
(M,75,97000.0,95105)
(M,39,93000.0,95050)
(M,56,90000.0,95050)
(M,61,91000.0,95105)
(M,29,93000.0,94041)
(M,79,99000.0,95102)
(M,48,91000.0,95101)
(F,95,76000.0,95101)
(M,47,98000.0,95050)
(M,61,88000.0,95101)
(M,74,77000.0,95101)
(M,75,83000.0,94040)
(M,34,82000.0,95103)
(M,70,85000.0,95103)
(F,43,94000.0,94041)
(F,64,91000.0,95105)
(F,21,95000.0,95051)
(M,55,91000.0,95051)
(M,27,85000.0,95105)
(F,40,84000.0,94040)
(F,41,84000.0,94041)
(F,50,87000.0,95051)
(M,72,82000.0,95103)
(F,50,87000.0,95105)
(F,31,93000.0,95102)
(F,45,80000.0,95050)
(F,62,77000.0,94040)
(M,93,91000.0,95101)
(M,77,94000.0,95051)
(F,33,82000.0,95051)
(M,95,87000.0,95105)
(M,40,79000.0,95102)
(M,82,87000.0,95050)
(M,55,85000.0,95051)
(M,52,96000.0,95102)
(F,52,96000.0,95050)
(F,78,82000.0,95102)
(F,31,82000.0,94041)
(F,60,97000.0,95101)
(M,77,81000.0,95102)
(F,78,93000.0,95101)
(M,74,82000.0,94040)
(M,62,77000.0,95050)
(F,72,77000.0,95102)
(M,96,87000.0,94041)
(F,89,93000.0,95051)
(M,59,87000.0,95050)
(F,26,81000.0,95105)
(F,84,77000.0,95051)
(F,42,84000.0,94040)
(F,59,96000.0,94041)
(F,31,78000.0,95050)
(F,91,85000.0,95105)
(F,87,79000.0,95102)
(M,39,88000.0,95105)
(F,47,86000.0,95051)
(F,24,92000.0,95101)
(F,76,85000.0,95103)
(F,48,83000.0,95105)
(M,50,88000.0,95105)
(F,61,93000.0,94041)
(F,59,98000.0,95050)
(F,57,95000.0,95050)
(M,77,76000.0,95105)
(M,34,90000.0,95105)
(M,23,91000.0,95050)
(M,38,88000.0,95051)
(F,35,86000.0,95102)
(M,27,91000.0,95103)
(F,99,78000.0,95051)
(F,77,94000.0,94041)
(M,23,83000.0,95103)
(M,93,91000.0,95051)
(F,94,89000.0,95103)
(M,99,99000.0,95105)
(M,75,84000.0,94040)
(M,32,89000.0,94041)
(F,57,76000.0,94040)
(F,94,95000.0,95103)
(M,66,82000.0,94041)
(F,56,98000.0,94041)
(M,37,88000.0,95105)
(M,89,82000.0,95050)
(M,91,79000.0,95103)
(F,72,90000.0,95102)
(F,53,85000.0,95050)
(F,87,91000.0,95105)
(M,74,91000.0,95050)
(F,62,99000.0,95102)
(M,46,95000.0,95105)
(F,73,78000.0,95050)
(F,35,94000.0,95102)
(F,60,77000.0,95105)
(M,83,93000.0,95105)
(F,55,76000.0,95051)
(F,36,90000.0,95101)
(F,75,87000.0,95103)
(F,91,98000.0,95103)
(F,66,87000.0,95101)
(M,83,91000.0,95103)
(M,52,77000.0,94040)
(F,76,85000.0,95103)
(F,98,78000.0,95102)
(F,60,89000.0,95050)
(F,30,76000.0,95101)
(F,53,95000.0,95050)
(M,63,85000.0,95105)
(F,25,94000.0,95050)
(M,29,98000.0,95103)
(M,53,82000.0,95050)
(F,70,89000.0,95101)
(F,76,83000.0,95105)
(M,85,98000.0,95050)
(F,81,97000.0,95103)
(M,30,77000.0,94041)
(F,73,85000.0,95102)
(M,94,93000.0,95103)
(F,83,80000.0,95101)
(F,44,88000.0,94040)
(F,35,83000.0,95051)
(F,25,82000.0,94040)
(M,26,92000.0,95101)
(F,60,81000.0,95105)
(F,47,78000.0,94040)
(F,53,87000.0,94040)
(F,44,88000.0,95051)
(M,73,96000.0,95103)
(F,77,95000.0,95103)
(M,24,93000.0,95050)
(F,21,76000.0,95050)
(F,82,90000.0,95103)
(M,71,97000.0,95051)
(M,53,79000.0,95105)
(M,28,84000.0,94040)
(M,35,97000.0,95101)
(F,75,76000.0,94040)
(M,87,94000.0,94041)
(F,89,79000.0,95102)
(F,80,92000.0,95102)
(M,24,77000.0,95102)
(F,40,94000.0,95105)
(M,43,80000.0,94041)
(M,23,80000.0,94041)
(F,51,83000.0,94041)
(F,90,78000.0,94040)
(F,41,79000.0,95102)
(M,48,93000.0,94041)
(M,69,94000.0,94040)
(F,36,81000.0,95101)
(M,35,91000.0,95051)
(F,26,88000.0,95050)
(M,35,83000.0,94041)
(F,36,77000.0,95103)
(M,57,91000.0,95103)
(F,57,89000.0,95101)
(F,38,86000.0,94041)
(F,31,83000.0,95050)
(M,47,96000.0,94041)
(F,91,83000.0,95101)
(F,21,78000.0,95103)
(M,32,84000.0,95051)
(F,41,93000.0,94041)
(M,81,93000.0,95102)
(F,59,78000.0,95105)
(M,71,90000.0,95050)
(F,51,77000.0,95051)
(M,29,88000.0,95102)
(F,40,93000.0,95102)
(F,89,99000.0,95105)
(F,64,77000.0,95103)
(F,53,87000.0,94041)
(M,53,97000.0,94040)
(M,45,78000.0,94040)
(F,76,89000.0,94041)
(M,59,81000.0,95050)
(F,24,76000.0,94041)
(M,72,95000.0,95051)
(M,63,83000.0,94040)
(F,39,76000.0,94041)
(F,26,85000.0,95101)
(M,90,99000.0,95102)
(F,47,76000.0,95103)
(M,72,86000.0,95105)
(M,38,92000.0,95050)
(M,54,78000.0,95101)
(F,48,86000.0,95102)
(F,37,78000.0,94040)
(F,75,88000.0,95103)
(F,66,78000.0,95050)
(M,58,80000.0,94040)
(M,84,88000.0,95050)
(F,35,94000.0,95050)
(M,57,88000.0,95102)
(M,68,83000.0,95050)
(M,37,91000.0,95103)
(M,65,79000.0,95101)
(M,65,85000.0,95101)
(F,97,83000.0,95102)
(M,43,83000.0,95051)
(F,73,82000.0,95103)
(M,89,87000.0,95050)
(F,74,84000.0,95103)
(M,73,90000.0,94041)
(F,46,97000.0,95103)
(M,36,82000.0,94041)
(M,80,82000.0,95105)
(F,78,79000.0,95102)
(M,67,96000.0,94040)
(F,48,98000.0,95102)
(F,82,86000.0,95050)
(M,79,80000.0,95050)
(M,96,84000.0,95103)
(M,51,87000.0,94040)
(F,29,84000.0,95051)
(M,47,86000.0,94040)
(M,54,96000.0,94041)
(F,80,94000.0,94041)
(F,92,93000.0,95103)
(F,59,79000.0,95050)
(M,95,80000.0,95050)
(M,67,92000.0,94040)
(F,23,98000.0,95103)
(M,91,82000.0,95051)
(M,27,89000.0,95105)
(M,43,77000.0,94041)
(F,65,83000.0,94040)
(F,65,82000.0,95051)
(M,43,98000.0,95105)
(F,51,86000.0,95102)
(M,76,83000.0,95051)
(F,25,92000.0,94040)
(M,48,76000.0,95102)
(F,43,86000.0,95050)
(F,57,83000.0,95101)
(F,48,84000.0,95051)
(M,37,98000.0,95102)
(F,98,81000.0,95105)
(M,78,86000.0,94041)
(F,34,93000.0,95102)
(M,53,94000.0,95102)
(M,69,98000.0,94040)
(F,70,84000.0,94041)
(F,89,87000.0,94040)
(F,52,89000.0,95102)
(F,84,79000.0,95102)
(M,44,86000.0,94041)
(M,51,93000.0,94041)
(M,98,81000.0,95102)
(F,82,77000.0,95101)
(M,50,82000.0,95103)
(F,59,76000.0,95051)
(M,29,76000.0,94041)
(F,30,81000.0,95051)
(F,22,96000.0,95105)
(M,64,88000.0,94040)
(M,80,78000.0,95102)
(F,94,85000.0,95051)
(M,63,95000.0,95103)
(F,51,78000.0,95050)
(M,39,94000.0,95105)
(M,80,85000.0,95101)
(M,92,89000.0,95102)
(M,44,88000.0,95103)
(M,57,92000.0,95050)
(F,64,94000.0,95051)
(F,88,91000.0,95102)
(F,43,83000.0,95101)
(F,33,93000.0,95050)
(M,64,92000.0,95102)
(M,91,92000.0,95050)
(F,32,88000.0,95105)
(M,78,87000.0,94041)
(F,64,85000.0,94040)
(M,93,96000.0,95102)
(F,72,98000.0,95103)
(M,68,76000.0,95051)
(M,52,95000.0,95050)
(F,75,93000.0,95103)
(M,45,85000.0,94041)
(F,70,98000.0,95051)
(F,74,96000.0,95101)
(F,81,85000.0,95102)
(M,83,91000.0,95105)
(M,32,89000.0,95101)
(F,58,90000.0,94041)
(M,55,80000.0,95050)
(F,23,79000.0,95051)
(M,91,79000.0,95103)
(F,21,98000.0,95102)
(F,57,91000.0,95101)
(M,58,91000.0,95051)
(F,41,94000.0,95101)
(M,67,95000.0,94041)
(M,69,80000.0,95101)
(M,23,77000.0,94041)
(F,94,92000.0,95105)
(F,60,92000.0,95051)
(F,53,84000.0,94041)
(F,48,98000.0,95103)
(M,70,88000.0,95051)
(M,76,94000.0,95103)
(F,22,88000.0,94040)
(F,80,81000.0,95102)
(F,57,80000.0,95051)
(F,57,99000.0,95103)
(M,50,78000.0,95050)
(M,40,81000.0,95050)
(F,93,97000.0,95050)
(M,40,80000.0,94041)
(M,35,91000.0,95101)
(F,50,96000.0,94041)
(F,27,90000.0,95105)
(F,23,91000.0,95105)
(M,49,80000.0,94041)
(M,90,98000.0,95105)
(M,29,91000.0,95050)
(F,99,83000.0,95103)
(F,43,83000.0,94040)
(F,30,90000.0,94041)
(F,96,97000.0,95102)
(M,83,77000.0,95103)
(F,77,97000.0,94040)
(F,74,98000.0,95105)
(F,96,96000.0,95103)
(F,37,81000.0,94041)
(M,82,91000.0,94040)
(F,33,90000.0,95101)
(F,35,86000.0,95102)
(F,67,87000.0,95105)
(M,95,95000.0,95051)
(M,82,95000.0,95101)
(F,26,76000.0,95050)
(F,65,84000.0,95103)
(F,34,91000.0,95102)
(F,48,81000.0,94040)
(F,93,84000.0,94041)
(F,37,79000.0,95105)
(M,77,84000.0,95102)
(M,94,78000.0,94040)
(M,28,79000.0,95051)
(F,30,80000.0,94041)
(F,54,80000.0,95103)
(F,93,96000.0,95105)
(F,45,78000.0,94041)
Right now I'm just executing the Pig commands in the Grunt shell. I want to redirect, or make a copy of, the output at execution time, because it is really difficult to take a snapshot of it.
Please suggest a solution for this.
The code below illustrates how the output shown above was produced:
grunt>salaries= load 'salaries' using PigStorage(',') As (gender, age,salary,zip);
grunt> salaries= load 'salaries' using PigStorage(',') As (gender:chararray,age:int,salary:double,zip:long);
grunt>highsal= filter salaries by salary > 75000;
grunt>dump highsal;
When the above command is executed, the output listing shown above is displayed.
I have just copied salaries.txt from the local FS to HDFS.
grunt> store highsal into 'file';
2015-09-24 02:59:15,981 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 6> Undefined alias: highsal
Details at logfile: /home/vivek/pig_1443088724224.log
grunt>
I'm still getting an error with the suggested query.

You had not defined the "highsal" alias when trying to run the STORE command.
Pig does not keep aliases from a previous session; you have to execute all of your commands in one session, or write a Pig script and invoke it.
Try it like this:
grunt>salaries= load 'salaries' using PigStorage(',') As (gender, age,salary,zip);
grunt> salaries= load 'salaries' using PigStorage(',') As (gender:chararray,age:int,salary:double,zip:long);
grunt>highsal= filter salaries by salary > 75000;
grunt>STORE highsal INTO 'file';
This will store the "highsal" content in files named 'file/part-x-xxxxx' under the user's HDFS home directory. You can also provide an absolute HDFS directory path instead of 'file' if you wish to store the data in a directory other than the user's home directory.
Hope this helps.

store highsal into 'file';
Have a look at the Apache Pig documentation for all commands.
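If the goal is just to capture what DUMP prints so it can be saved or printed, another option (a minimal sketch, assuming the three statements above are saved as highsal.pig) is to run Pig in batch mode and redirect the console streams:

# run the script non-interactively; keep the DUMP output and the logging in
# separate files (whether the INFO logs go to stdout or stderr depends on your
# log4j configuration, so redirecting both streams is the safe choice)
pig -f highsal.pig 1> highsal_dump.txt 2> pig_run.log

Alternatively, once STORE highsal INTO 'file'; has succeeded, the HDFS part files can be merged into a single local text file:

# copy and concatenate file/part-* from HDFS to the local filesystem
hdfs dfs -getmerge file highsal_local.txt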

Related

PIG: count of each product in distinctive Locations

I am trying to do the following Step 1 to Step 4 in Pig:
STEP 1: Create a users table and take data from /tmp/users.txt:
|Column 1 | USER ID |int|
|Column 2 |EMAIL|chararray|
|Column 3 |LANGUAGE |chararray|
|Column 4 |LOCATION |chararray|
STEP 2: Create a transaction table and take data from /tmp/transaction.txt:
|Column 1 | ID |int|
|Column 2 |PRODUCT|int|
|Column 3 |USER ID |int|
|Column 4 |PURCHASE AMOUNT |double|
|Column 5 |DESCRIPTION |chararray|
Step 3: Find the count of each product in distinct locations.
Step 4: Display the results.
To achieve the above, I did the following:
users = LOAD '/tmp/users.txt' USING PigStorage(',') AS (USERID:int, EMAIL:chararray, LANGUAGE:chararray, LOCATION: chararray);
trans = LOAD '/tmp/transaction.txt' USING PigStorage(',') AS (ID:int, PRODUCT:int, USERID:int, PURCHASEAMOUNT: double, DESCRIPTION: chararray);
users_trans = JOIN users BY USERID RIGHT, trans BY USERID;
B = GROUP users_trans BY (DESCRIPTION,LOCATION);
C = FOREACH B GENERATE group as comb, COUNT(users_trans) AS Total;
DUMP C;
But I am getting errors. It would be helpful if you could assist, as I am new to Pig.
##########################################
Dataset
user.txt
1 creator#gmail.com EN US
2 creator#gmail.com EN GB
3 creator#gmail.com FR FR
4 creator#gmail.com IN HN
5 creator#gmail.com PAK IS
transaction.txt
1 1 1 300 a jumper
2 1 2 300 a jumper
3 1 5 300 a jumper
4 2 3 100 a rubber chicken
5 1 3 300 a jumper
6 5 4 500 a soapbox
7 3 3 200 a adhesive
8 4 1 300 a lotion
9 4 4 500 a sweater
10 5 4 600 a jeans
Error Log:
2019-12-27 06:17:22,180 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/tmp/temp2029752934/tmp-883821114/part-r-00000:0+130
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - (EQUATOR) 0 kvi 26214396(104857584)
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - mapreduce.task.io.sort.mb: 100
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - soft limit at 83886080
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - bufstart = 0; bufvoid = 104857600
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - kvstart = 26214396; length = 6553600
2019-12-27 06:17:22,244 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2019-12-27 06:17:22,248 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2019-12-27 06:17:22,248 [LocalJobRunner Map Task Executor #0] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2019-12-27 06:17:22,250 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Starting flush of map output
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Spilling map output
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - bufstart = 0; bufend = 100; bufvoid = 104857600
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - kvstart = 26214396(104857584); kvend = 26214360(104857440); length = 37/6553600
2019-12-27 06:17:22,262 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,264 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Finished spill 0
2019-12-27 06:17:22,265 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task:attempt_local1424814286_0002_m_000000_0 is done. And is in the process of committing
2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -map
2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local1424814286_0002_m_000000_0' done.
2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -Finishing task: attempt_local1424814286_0002_m_000000_0
2019-12-27 06:17:22,266 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2019-12-27 06:17:22,266 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - Waiting for reduce tasks
2019-12-27 06:17:22,267 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local1424814286_0002_r_000000_0
2019-12-27 06:17:22,272 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2019-12-27 06:17:22,272 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-12-27 06:17:22,274 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorProcessTree : [ ]
2019-12-27 06:17:22,274 [pool-9-thread-1] INFO org.apache.hadoop.mapred.ReduceTask - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#2582aa54
2019-12-27 06:17:22,275 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2019-12-27 06:17:22,275 [EventFetcher for fetching Map Completion Events] INFO org.apache.hadoop.mapreduce.task.reduce.EventFetcher - attempt_local1424814286_0002_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2019-12-27 06:17:22,276 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#2 about to shuffle output of map attempt_local1424814286_0002_m_000000_0 decomp: 14 len: 18 to MEMORY
2019-12-27 06:17:22,277 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 14 bytes from map-output for attempt_local1424814286_0002_m_000000_0
2019-12-27 06:17:22,277 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - closeInMemoryFile -> map-output of size: 14, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->14
2019-12-27 06:17:22,277 [EventFetcher for fetching Map Completion Events] INFO org.apache.hadoop.mapreduce.task.reduce.EventFetcher - EventFetcher is interrupted.. Returning
2019-12-27 06:17:22,278 [Readahead Thread #3] WARN org.apache.hadoop.io.ReadaheadPool - Failed readahead on ifile
EBADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:208)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-12-27 06:17:22,278 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 7 bytes
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merged 1 segments, 14 bytes to disk to satisfy reduce memory limit
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 1 files, 18 bytes from disk
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 0 segments, 0 bytes from memory into reduce
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 7 bytes
2019-12-27 06:17:22,282 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2019-12-27 06:17:22,283 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2019-12-27 06:17:22,283 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-12-27 06:17:22,284 [pool-9-thread-1] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2019-12-27 06:17:22,285 [pool-9-thread-1] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2019-12-27 06:17:22,286 [pool-9-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,287 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task:attempt_local1424814286_0002_r_000000_0 is done. And is in the process of committing
2019-12-27 06:17:22,289 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2019-12-27 06:17:22,289 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task attempt_local1424814286_0002_r_000000_0 is allowed to commit now
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local1424814286_0002_r_000000_0' to file:/tmp/temp2029752934/tmp726323435/_temporary/0/task_local1424814286_0002_r_000000
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local1424814286_0002_r_000000_0' done.
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - Finishing task: attempt_local1424814286_0002_r_000000_0
2019-12-27 06:17:22,292 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1424814286_0002
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases B,C
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,463 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,464 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,465 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2019-12-27 06:17:22,474 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.9.2 0.16.0 root 2019-12-27 06:17:20 2019-12-27 06:17:22 HASH_JOIN,GROUP_BY
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1289071959_0001 2 1 n/a n/a n/a n/a n/a n/a n/a n/a trans,users,users_trans HASH_JOIN
job_local1424814286_0002 1 1 n/a n/a n/a n/a n/a n/a n/a n/a B,C GROUP_BY,COMBINER file:/tmp/temp2029752934/tmp726323435,
Input(s):
Successfully read 5 records from: "/tmp/users.txt"
Successfully read 10 records from: "/tmp/transaction.txt"
Output(s):
Successfully stored 1 records in: "file:/tmp/temp2029752934/tmp726323435"
Counters:
Total records written : 1
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local1289071959_0001 -> job_local1424814286_0002,
job_local1424814286_0002
2019-12-27 06:17:22,475 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,476 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,477 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,485 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,486 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,487 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,492 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 15 time(s).
2019-12-27 06:17:22,493 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 55 time(s).
2019-12-27 06:17:22,493 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-12-27 06:17:22,496 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-12-27 06:17:22,496 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2019-12-27 06:17:22,503 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2019-12-27 06:17:22,503 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2019-12-27 06:17:22,541 [main] INFO org.apache.pig.Main - Pig script completed in 2 seconds and 965 milliseconds (2965 ms)
Advice
First of all: it seems that you are starting out with Pig. It may be valuable to know that Cloudera recently decided to deprecate Pig. It will of course not cease to exist, but think twice if you are planning to pick up a new skill or implement new use cases. I would recommend looking into Hive/Spark/Impala as more future-proof alternatives.
Answer
Your job succeeds, but presumably not with the output you want. There are several hints as to what may be wrong (data types/field names); however, this does not point at a specific problem in the code.
My recommendation would be to find out where exactly the problem occurs. Simply cut off the end of your code and print an intermediate result to see if you are still on track.
In the (likely) event you have a problem in your load statement already, it is worth noting that you can still narrow it down further. First load, and then apply the schema.
Given the data you have, the first problem is that you have no commas, so you must load each line as a whole and then split it later. I used two or more spaces as the split pattern for the transactions file because your last column appears to be one string containing spaces. For accuracy, I suggest using a better delimiter than spaces/tabs.
Then the group by needs to reference the relations that the data comes from.
Everything else is fine, I think, though I'm not sure about the COUNT(X).
A = LOAD '/tmp/users.txt' USING PigStorage() as (line:chararray);
USERS = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) AS (userid:int,email:chararray,language:chararray,location:chararray);
B = LOAD '/tmp/transactions.txt' USING PigStorage() as (line:chararray);
TRANS = FOREACH B GENERATE FLATTEN(STRSPLIT(line, '\\s\\s+')) AS (id:int,product:int,userid:int,purchase:double,desc:chararray);
X = JOIN USERS BY userid RIGHT, TRANS BY userid;
X_grouped = GROUP X BY (TRANS::desc, USERS::location);
RES = FOREACH X_grouped GENERATE group as comb, COUNT(X) AS Total;
\d RES;
Output
((a jeans,HN),1)
((a jumper,FR),1)
((a jumper,GB),1)
((a jumper,IS),1)
((a jumper,US),1)
((a lotion,US),1)
((a soapbox,HN),1)
((a sweater,HN),1)
((a adhesive,FR),1)
((a rubber chicken,FR),1)
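If the output still isn't what you expect, the same divide-and-conquer idea applies to this script: inspect the intermediate aliases before the join. A small sketch, using the aliases defined above:

-- confirm the STRSPLIT produced the declared schema and sensible values
DESCRIBE USERS;
DESCRIBE TRANS;
DUMP USERS;    -- e.g. (1,creator#gmail.com,EN,US)
DUMP TRANS;    -- e.g. (1,1,1,300.0,a jumper)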

Does sqoop spill temporary data to disk

As I understand Sqoop, it launches a few mappers on different data nodes that make JDBC connections to the RDBMS. Once the connection is established, the data is transferred to HDFS.
Just trying to understand: does a Sqoop mapper spill data temporarily to disk (on the data node)? I know spilling happens in MapReduce, but I'm not sure about a Sqoop job.
It seems sqoop-import runs as a map-only job and doesn't spill, while sqoop-merge runs as a full map-reduce job and does spill. You can check this on the job tracker while the Sqoop job runs.
Have a look at this part of a sqoop-import log; it does not spill, it just fetches and writes to HDFS:
INFO [main] ... mapreduce.db.DataDrivenDBRecordReader: Using query: SELECT...
[main] mapreduce.db.DBRecordReader: Executing query: SELECT...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
INFO [Thread-16] ...mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1489705733959_2462784_m_000000_0 is done. And is in the process of committing
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_1489705733959_2462784_m_000000_0' to hdfs://
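For context, an import like the one that produced this log runs with map tasks only. A typical invocation looks like this (just a sketch; the JDBC URL, credentials, table and paths are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username etl -P \
  --table orders \
  --split-by id \
  --num-mappers 4 \
  --target-dir /data/orders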
Have a look at this sqoop-merge log (some rows skipped); it spills to disk (note "Spilling map output" in the log):
INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://bla-bla/part-m-00000:0+48322717
...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
...
INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 1024
INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 751619264
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452; length = 67108864
INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO [main] com.pepperdata.supervisor.agent.resource.r: Datanode bla-bla is LOCAL.
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
...
INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 184775274; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452(1073741808); kvend = 267347800(1069391200); length = 1087653/67108864
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
[main] org.apache.hadoop.mapred.MapTask: Finished spill 0
...Task:attempt_1489705733959_2479291_m_000000_0 is done. And is in the process of committing
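The merge step that produces the spill above is a full MapReduce job. A typical invocation looks like this (again only a sketch; the directories, record class name and codegen jar are placeholders that come from a previous import/codegen run):

sqoop merge \
  --new-data /data/orders_incremental \
  --onto /data/orders \
  --target-dir /data/orders_merged \
  --jar-file orders.jar \
  --class-name orders \
  --merge-key id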

Apache PIG, ElephantBird JSON Loader

I'm trying to parse the input below (there are 2 records in this input) using the Elephant Bird JSON loader:
[{"node_disk_lnum_1":36,"node_disk_xfers_in_rate_sum":136.40000000000001,"node_disk_bytes_in_rate_22":
187392.0, "node_disk_lnum_7": 13}]
[{"node_disk_lnum_1": 36, "node_disk_xfers_in_rate_sum":
105.2,"node_disk_bytes_in_rate_22": 123084.8, "node_disk_lnum_7":13}]
Here is my syntax:
register '/home/data/Desktop/elephant-bird-pig-4.1.jar';
a = LOAD '/pig/tc1.log' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
b = FOREACH a GENERATE flatten(json#'node_disk_lnum_1') AS node_disk_lnum_1, flatten(json#'node_disk_xfers_in_rate_sum') AS node_disk_xfers_in_rate_sum, flatten(json#'node_disk_bytes_in_rate_22') AS node_disk_bytes_in_rate_22, flatten(json#'node_disk_lnum_7') AS node_disk_lnum_7;
DESCRIBE b;
Result of DESCRIBE b:
b: {node_disk_lnum_1: bytearray,node_disk_xfers_in_rate_sum: bytearray,node_disk_bytes_in_rate_22: bytearray,node_disk_lnum_7: bytearray}
c = FOREACH b GENERATE node_disk_lnum_1;
DESCRIBE c;
c: {node_disk_lnum_1: bytearray}
DUMP c;
Expected Result:
36, 136.40000000000001, 187392.0, 13
36, 105.2, 123084.8, 13
It throws the error below:
2017-02-06 01:05:49,337 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2017-02-06 01:05:49,386 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2017-02-06 01:05:49,387 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-02-06 01:05:49,390 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Map key required for a: $0->[node_disk_lnum_1, node_disk_xfers_in_rate_sum, node_disk_bytes_in_rate_22, node_disk_lnum_7]
2017-02-06 01:05:49,395 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2017-02-06 01:05:49,398 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2017-02-06 01:05:49,398 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2017-02-06 01:05:49,425 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2017-02-06 01:05:49,426 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2017-02-06 01:05:49,428 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. com/twitter/elephantbird/util/HadoopCompat
Please help; what am I missing?
You do not have any nested data in your JSON, so remove '-nestedLoad':
a = LOAD '/pig/tc1.log' USING com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
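If you also want typed columns instead of the bytearray fields that DESCRIBE b showed, a sketch of the rest of the script with explicit casts (based on the script above) could look like this:

register '/home/data/Desktop/elephant-bird-pig-4.1.jar';
a = LOAD '/pig/tc1.log' USING com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
-- cast each map value to the type you expect rather than leaving it as bytearray
b = FOREACH a GENERATE
        (int)    json#'node_disk_lnum_1'           AS node_disk_lnum_1,
        (double) json#'node_disk_xfers_in_rate_sum' AS node_disk_xfers_in_rate_sum,
        (double) json#'node_disk_bytes_in_rate_22'  AS node_disk_bytes_in_rate_22,
        (int)    json#'node_disk_lnum_7'            AS node_disk_lnum_7;
DUMP b;   -- expected: (36,136.40000000000001,187392.0,13) and (36,105.2,123084.8,13)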

From Hive to Elasticsearch

I'm working with Cloudera CDH 5.3 with 1 NameNode (ip: ...169) and 3 slaves.
I have Elasticsearch 1.4.4 installed on my master machine (ip: ...169).
I have downloaded the ES-Hadoop jar and added it to the path.
With that being said, I now want to load data from Hive to ES.
1) First of all, I created a table from a CSV file in the metastore (with Hue).
2) I defined an external table on top of ES in Hive, in order to write and load data into it later:
ADD JAR
/usr/elasticsearch-hadoop-2.0.2/dist/elasticsearch-hadoop-hive-2.0.2.jar;
CREATE EXTERNAL TABLE es_cdr(
id bigint,
calling int,
called int,
duration int,
location string,
date string)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes'='10.44.162.169',
'es.resource' = 'indexOmar/typeOmar');
I've also manually added the SerDe snapshot jar via Parameters => Add file => jar.
Now I want to load data from my table into the new ES table:
INSERT OVERWRITE TABLE es_cdr
select NULL, h.appelant, h.called_number,
h.call_duration, h.location_number, h.date_heure_appel from hive_cdr h;
But an error appears, saying:
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
And this is what's written in the log:
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO parse.ParseDriver: Parsing command: INSERT OVERWRITE TABLE hive_es_cdr_10
SELECT NULL,h.appelant,h.called_number,h.call_dur,h.loc_number,h.h_appel FROM hive_cdr h limit 2
15/03/05 14:36:34 INFO parse.ParseDriver: Parse Completed
15/03/05 14:36:34 INFO log.PerfLogger: </PERFLOG method=parse start=1425562594378 end=1425562594381 duration=3 from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
15/03/05 14:36:34 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
15/03/05 14:36:34 INFO parse.SemanticAnalyzer: Get metadata for source tables
15/03/05 14:36:34 INFO parse.SemanticAnalyzer: Get metadata for subqueries
15/03/05 14:36:34 INFO parse.SemanticAnalyzer: Get metadata for destination tables
15/03/05 14:36:34 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis
15/03/05 14:36:34 INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://master:8020/user/hive/warehouse/hive_es_cdr_10/.hive-staging_hive_2015-03-05_14-36-34_378_4527939627221909415-1
15/03/05 14:36:34 INFO parse.SemanticAnalyzer: Set stats collection dir : hdfs://master:8020/user/hive/warehouse/hive_es_cdr_10/.hive-staging_hive_2015-03-05_14-36-34_378_4527939627221909415-1/-ext-10000
15/03/05 14:36:34 INFO ppd.OpProcFactory: Processing for FS(109)
15/03/05 14:36:34 INFO ppd.OpProcFactory: Processing for SEL(108)
15/03/05 14:36:34 INFO ppd.OpProcFactory: Processing for LIM(107)
15/03/05 14:36:34 INFO ppd.OpProcFactory: Processing for EX(106)
15/03/05 14:36:34 INFO ppd.OpProcFactory: Processing for RS(105)
15/03/05 14:36:34 INFO ppd.OpProcFactory: Processing for LIM(104)
15/03/05 14:36:34 INFO ppd.OpProcFactory: Processing for SEL(103)
15/03/05 14:36:34 INFO ppd.OpProcFactory: Processing for TS(102)
15/03/05 14:36:34 INFO optimizer.ColumnPrunerProcFactory: RS 105 oldColExprMap: {_col5=Column[_col5], _col4=Column[_col4], _col3=Column[_col3], _col2=Column[_col2], _col1=Column[_col1], _col0=Column[_col0]}
15/03/05 14:36:34 INFO optimizer.ColumnPrunerProcFactory: RS 105 newColExprMap: {_col5=Column[_col5], _col4=Column[_col4], _col3=Column[_col3], _col2=Column[_col2], _col1=Column[_col1], _col0=Column[_col0]}
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=partition-retrieving from=org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner>
15/03/05 14:36:34 INFO log.PerfLogger: </PERFLOG method=partition-retrieving start=1425562594461 end=1425562594461 duration=0 from=org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner>
15/03/05 14:36:34 INFO physical.MetadataOnlyOptimizer: Looking for table scans where optimization is applicable
15/03/05 14:36:34 INFO physical.MetadataOnlyOptimizer: Found 0 metadata only table scans
15/03/05 14:36:34 INFO parse.SemanticAnalyzer: Completed plan generation
15/03/05 14:36:34 INFO ql.Driver: Semantic Analysis Completed
15/03/05 14:36:34 INFO log.PerfLogger: </PERFLOG method=semanticAnalyze start=1425562594381 end=1425562594463 duration=82 from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_col0, type:bigint, comment:null), FieldSchema(name:_col1, type:int, comment:null), FieldSchema(name:_col2, type:int, comment:null), FieldSchema(name:_col3, type:int, comment:null), FieldSchema(name:_col4, type:string, comment:null), FieldSchema(name:_col5, type:string, comment:null)], properties:null)
15/03/05 14:36:34 INFO ql.Driver: EXPLAIN output for queryid hive_20150305143636_528f97d4-b670-40e2-ba80-7d7a7bd441ff : ABSTRACT SYNTAX TREE:
TOK_QUERY
TOK_FROM
TOK_TABREF
TOK_TABNAME
hive_cdr
h
TOK_INSERT
TOK_DESTINATION
TOK_TAB
TOK_TABNAME
hive_es_cdr_10
TOK_SELECT
TOK_SELEXPR
TOK_NULL
TOK_SELEXPR
.
TOK_TABLE_OR_COL
h
appelant
TOK_SELEXPR
.
TOK_TABLE_OR_COL
h
called_number
TOK_SELEXPR
.
TOK_TABLE_OR_COL
h
call_dur
TOK_SELEXPR
.
TOK_TABLE_OR_COL
h
loc_number
TOK_SELEXPR
.
TOK_TABLE_OR_COL
h
h_appel
TOK_LIMIT
2
STAGE DEPENDENCIES:
Stage-0 is a root stage [MAPRED]
STAGE PLANS:
Stage: Stage-0
Map Reduce
Map Operator Tree:
TableScan
alias: h
GatherStats: false
Select Operator
expressions: null (type: string), appelant (type: int), called_number (type: int), call_dur (type: int), loc_number (type: string), h_appel (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Limit
Number of rows: 2
Reduce Output Operator
sort order:
tag: -1
value expressions: _col0 (type: void), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: string), _col5 (type: string)
Path -> Alias:
hdfs://master:8020/user/hive/warehouse/hive_cdr [h]
Path -> Partition:
hdfs://master:8020/user/hive/warehouse/hive_cdr
Partition
base file name: hive_cdr
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns traffic_type_id,appelant,called_number,call_dur,loc_number,h_appel
columns.comments
columns.types int:int:int:int:string:string
field.delim ;
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://master:8020/user/hive/warehouse/hive_cdr
name default.hive_cdr
numFiles 1
numRows 0
rawDataSize 0
serialization.ddl struct hive_cdr { i32 traffic_type_id, i32 appelant, i32 called_number, i32 call_dur, string loc_number, string h_appel}
serialization.format ;
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 56373362
transient_lastDdlTime 1425459002
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns traffic_type_id,appelant,called_number,call_dur,loc_number,h_appel
columns.comments
columns.types int:int:int:int:string:string
field.delim ;
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://master:8020/user/hive/warehouse/hive_cdr
name default.hive_cdr
numFiles 1
numRows 0
rawDataSize 0
serialization.ddl struct hive_cdr { i32 traffic_type_id, i32 appelant, i32 called_number, i32 call_dur, string loc_number, string h_appel}
serialization.format ;
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 56373362
transient_lastDdlTime 1425459002
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.hive_cdr
name: default.hive_cdr
Truncated Path -> Alias:
/hive_cdr [h]
Needs Tagging: false
Reduce Operator Tree:
Extract
Limit
Number of rows: 2
Select Operator
expressions: UDFToLong(_col0) (type: bigint), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: string), _col5 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
File Output Operator
compressed: false
GlobalTableId: 1
directory: hdfs://master:8020/user/hive/warehouse/hive_es_cdr_10
NumFilesPerFileSink: 1
Stats Publishing Key Prefix: hdfs://master:8020/user/hive/warehouse/hive_es_cdr_10/
table:
input format: org.elasticsearch.hadoop.hive.EsHiveInputFormat
jobProperties:
EXTERNAL TRUE
bucket_count -1
columns id_traffic,caller,called,call_dur,caller_location,call_date
columns.comments
columns.types bigint:int:int:int:string:string
es.nodes 10.44.162.169
es.port 9200
es.resource myindex/mytype
file.inputformat org.apache.hadoop.mapred.SequenceFileInputFormat
file.outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
location hdfs://master:8020/user/hive/warehouse/hive_es_cdr_10
name default.hive_es_cdr_10
serialization.ddl struct hive_es_cdr_10 { i64 id_traffic, i32 caller, i32 called, i32 call_dur, string caller_location, string call_date}
serialization.format 1
serialization.lib org.elasticsearch.hadoop.hive.EsSerDe
storage_handler org.elasticsearch.hadoop.hive.EsStorageHandler
transient_lastDdlTime 1425561441
output format: org.elasticsearch.hadoop.hive.EsHiveOutputFormat
properties:
EXTERNAL TRUE
bucket_count -1
columns id_traffic,caller,called,call_dur,caller_location,call_date
columns.comments
columns.types bigint:int:int:int:string:string
es.nodes 10.44.162.169
es.port 9200
es.resource myindex/mytype
file.inputformat org.apache.hadoop.mapred.SequenceFileInputFormat
file.outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
location hdfs://master:8020/user/hive/warehouse/hive_es_cdr_10
name default.hive_es_cdr_10
serialization.ddl struct hive_es_cdr_10 { i64 id_traffic, i32 caller, i32 called, i32 call_dur, string caller_location, string call_date}
serialization.format 1
serialization.lib org.elasticsearch.hadoop.hive.EsSerDe
storage_handler org.elasticsearch.hadoop.hive.EsStorageHandler
transient_lastDdlTime 1425561441
serde: org.elasticsearch.hadoop.hive.EsSerDe
name: default.hive_es_cdr_10
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
15/03/05 14:36:34 INFO log.PerfLogger: </PERFLOG method=compile start=1425562594378 end=1425562594484 duration=106 from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=acquireReadWriteLocks from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO lockmgr.DummyTxnManager: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
15/03/05 14:36:34 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181 sessionTimeout=600000 watcher=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager$DummyWatcher#70e69669
15/03/05 14:36:34 INFO log.PerfLogger: </PERFLOG method=acquireReadWriteLocks start=1425562594502 end=1425562594523 duration=21 from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO ql.Driver: Starting command: INSERT OVERWRITE TABLE hive_es_cdr_10
SELECT NULL,h.appelant,h.called_number,h.call_dur,h.loc_number,h.h_appel FROM hive_cdr h limit 2
15/03/05 14:36:34 INFO ql.Driver: Total jobs = 1
15/03/05 14:36:34 INFO log.PerfLogger: </PERFLOG method=TimeToSubmit start=1425562594500 end=1425562594526 duration=26 from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=task.MAPRED.Stage-0 from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:34 INFO ql.Driver: Launching Job 1 out of 1
15/03/05 14:36:34 INFO exec.Task: Number of reduce tasks determined at compile time: 1
15/03/05 14:36:34 INFO exec.Task: In order to change the average load for a reducer (in bytes):
15/03/05 14:36:34 INFO exec.Task: set hive.exec.reducers.bytes.per.reducer=<number>
15/03/05 14:36:34 INFO exec.Task: In order to limit the maximum number of reducers:
15/03/05 14:36:34 INFO exec.Task: set hive.exec.reducers.max=<number>
15/03/05 14:36:34 INFO exec.Task: In order to set a constant number of reducers:
15/03/05 14:36:34 INFO exec.Task: set mapreduce.job.reduces=<number>
15/03/05 14:36:34 INFO ql.Context: New scratch dir is hdfs://master:8020/tmp/hive-hive/hive_2015-03-05_14-36-34_378_4527939627221909415-7
15/03/05 14:36:34 INFO mr.ExecDriver: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
15/03/05 14:36:34 INFO mr.ExecDriver: adding libjars: file:///tmp/d39b23a8-98d2-4bc3-9008-3eff080dd20c_resources/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/elasticsearch-hadoop-2.0.2/dist/elasticsearch-hadoop-hive-2.0.2.jar,file:///opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/hive-hbase-handler-0.13.1-cdh5.3.1.jar,file:///opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hbase/hbase-server.jar,file:///opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hbase/lib/htrace-core.jar,file:///opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hbase/lib/htrace-core-2.04.jar,file:///opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hbase/hbase-common.jar,file:///opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hbase/hbase-client.jar,file:///opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hbase/hbase-protocol.jar,file:///opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hbase/hbase-hadoop2-compat.jar,file:///opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hbase/hbase-hadoop-compat.jar
15/03/05 14:36:34 INFO exec.Utilities: Processing alias h
15/03/05 14:36:34 INFO exec.Utilities: Adding input file hdfs://master:8020/user/hive/warehouse/hive_cdr
15/03/05 14:36:34 INFO exec.Utilities: Content Summary not cached for hdfs://master:8020/user/hive/warehouse/hive_cdr
15/03/05 14:36:34 INFO ql.Context: New scratch dir is hdfs://master:8020/tmp/hive-hive/hive_2015-03-05_14-36-34_378_4527939627221909415-7
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=serializePlan from=org.apache.hadoop.hive.ql.exec.Utilities>
15/03/05 14:36:34 INFO exec.Utilities: Serializing MapWork via kryo
15/03/05 14:36:34 INFO log.PerfLogger: </PERFLOG method=serializePlan start=1425562594554 end=1425562594638 duration=84 from=org.apache.hadoop.hive.ql.exec.Utilities>
15/03/05 14:36:34 INFO log.PerfLogger: <PERFLOG method=serializePlan from=org.apache.hadoop.hive.ql.exec.Utilities>
15/03/05 14:36:34 INFO exec.Utilities: Serializing ReduceWork via kryo
15/03/05 14:36:34 INFO log.PerfLogger: </PERFLOG method=serializePlan start=1425562594653 end=1425562594708 duration=55 from=org.apache.hadoop.hive.ql.exec.Utilities>
15/03/05 14:36:34 INFO client.RMProxy: Connecting to ResourceManager at master/10.44.162.169:8032
15/03/05 14:36:34 INFO client.RMProxy: Connecting to ResourceManager at master/10.44.162.169:8032
15/03/05 14:36:34 WARN mr.EsOutputFormat: Speculative execution enabled for reducer - consider disabling it to prevent data corruption
15/03/05 14:36:34 INFO mr.EsOutputFormat: Writing to [myindex/mytype]
15/03/05 14:36:34 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/03/05 14:36:35 INFO log.PerfLogger: <PERFLOG method=getSplits from=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat>
15/03/05 14:36:35 INFO io.CombineHiveInputFormat: CombineHiveInputSplit creating pool for hdfs://master:8020/user/hive/warehouse/hive_cdr; using filter path hdfs://master:8020/user/hive/warehouse/hive_cdr
15/03/05 14:36:35 INFO input.FileInputFormat: Total input paths to process : 1
15/03/05 14:36:35 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 3, size left: 0
15/03/05 14:36:35 INFO io.CombineHiveInputFormat: number of splits 1
15/03/05 14:36:35 INFO log.PerfLogger: </PERFLOG method=getSplits start=1425562595867 end=1425562595896 duration=29 from=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat>
15/03/05 14:36:35 INFO mapreduce.JobSubmitter: number of splits:1
15/03/05 14:36:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1425457357655_0006
15/03/05 14:36:36 INFO impl.YarnClientImpl: Submitted application application_1425457357655_0006
15/03/05 14:36:36 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1425457357655_0006/
15/03/05 14:36:36 INFO exec.Task: Starting Job = job_1425457357655_0006, Tracking URL = http://master:8088/proxy/application_1425457357655_0006/
15/03/05 14:36:36 INFO exec.Task: Kill Command = /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hadoop/bin/hadoop job -kill job_1425457357655_0006
15/03/05 14:36:58 INFO exec.Task: Hadoop job information for Stage-0: number of mappers: 0; number of reducers: 0
15/03/05 14:36:58 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
15/03/05 14:36:58 INFO exec.Task: 2015-03-05 14:36:58,687 Stage-0 map = 0%, reduce = 0%
15/03/05 14:36:58 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
15/03/05 14:36:58 ERROR exec.Task: Ended Job = job_1425457357655_0006 with errors
15/03/05 14:36:58 INFO impl.YarnClientImpl: Killed application application_1425457357655_0006
15/03/05 14:36:58 ERROR ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
15/03/05 14:36:58 INFO log.PerfLogger: </PERFLOG method=Driver.execute start=1425562594523 end=1425562618754 duration=24231 from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:58 INFO ql.Driver: MapReduce Jobs Launched:
15/03/05 14:36:58 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
15/03/05 14:36:58 INFO ql.Driver: Stage-Stage-0: HDFS Read: 0 HDFS Write: 0 FAIL
15/03/05 14:36:58 INFO ql.Driver: Total MapReduce CPU Time Spent: 0 msec
15/03/05 14:36:58 INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:58 INFO ZooKeeperHiveLockManager: about to release lock for default/hive_es_cdr_10
15/03/05 14:36:58 INFO ZooKeeperHiveLockManager: about to release lock for default/hive_cdr
15/03/05 14:36:58 INFO ZooKeeperHiveLockManager: about to release lock for default
15/03/05 14:36:58 INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1425562618768 end=1425562618780 duration=12 from=org.apache.hadoop.hive.ql.Driver>
15/03/05 14:36:58 ERROR operation.Operation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:147)
at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69)
at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run
It seems the failure is caused by a type issue. You can use the es.mapping properties in TBLPROPERTIES to control how the Hive columns and types map onto the Elasticsearch fields.
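As an illustration only (this sketch is not the original poster's DDL, and the Elasticsearch field names after each colon are hypothetical), the external table can be declared with an explicit es.mapping.names entry so each Hive column is tied to a named Elasticsearch field:

-- hedged sketch: everything except the column list, node address and index/type
-- comes from the question; the names after the colons are made up for the example
CREATE EXTERNAL TABLE hive_es_cdr_10 (
  id_traffic BIGINT, caller INT, called INT, call_dur INT,
  caller_location STRING, call_date STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes' = '10.44.162.169',
  'es.port' = '9200',
  'es.resource' = 'myindex/mytype',
  'es.mapping.names' = 'id_traffic:idTraffic, call_date:callDate');

If the mismatch is in the value types themselves rather than the field names, recreating the Elasticsearch index with an explicit mapping before running the INSERT is another option.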

Pig "Max" command for pig-0.12.1 and pig-0.13.0 with Hadoop-2.4.0

I have a Pig script I got from Hortonworks that works fine with pig-0.9.2.15 on Hadoop-1.0.3.16. But when I run it with pig-0.12.1 (recompiled with -Dhadoopversion=23) or pig-0.13.0 on Hadoop-2.4.0, it fails.
It seems the following line is where the problem is.
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
Here's the whole script.
batting = load 'pig_data/Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
STORE join_data INTO './join_data';
And here's the hadoop error info:
2014-07-29 18:03:02,957 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-34 Operator Key: scope-34): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function
2014-07-29 18:03:02,958 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
How can I fix this if I still want to use the "MAX" function? Thank you!
Here's the complete information:
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Logging error messages to: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
2014-07-29 17:50:13,050 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:13,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://namenode.cmda.hadoop.com:8020
2014-07-29 17:50:14,302 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: namenode.cmda.hadoop.com:8021
2014-07-29 17:50:14,990 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:15,570 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:15,665 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-07-29 17:50:15,705 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2014-07-29 17:50:15,791 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY
2014-07-29 17:50:15,873 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-07-29 17:50:16,319 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-07-29 17:50:16,377 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2014-07-29 17:50:16,410 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)
2014-07-29 17:50:16,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2014-07-29 17:50:16,493 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:16,575 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050
2014-07-29 17:50:16,973 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2014-07-29 17:50:17,007 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2014-07-29 17:50:17,064 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990
2014-07-29 17:50:17,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2014-07-29 17:50:17,067 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job2337803902169382273.jar
2014-07-29 17:50:20,957 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job2337803902169382273.jar created
2014-07-29 17:50:20,957 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar
2014-07-29 17:50:21,001 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2014-07-29 17:50:21,046 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2014-07-29 17:50:21,310 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-07-29 17:50:21,311 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
2014-07-29 17:50:21,332 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050
2014-07-29 17:50:21,366 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:22,606 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-07-29 17:50:22,606 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-07-29 17:50:22,629 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-07-29 17:50:22,729 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-07-29 17:50:22,745 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:23,026 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1406677482986_0003
2014-07-29 17:50:23,258 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1406677482986_0003
2014-07-29 17:50:23,340 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://namenode.cmda.hadoop.com:8088/proxy/application_1406677482986_0003/
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1406677482986_0003
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[3,10],runs[5,7],max_runs[7,11],grp_data[6,11] C: max_runs[7,11],grp_data[6,11] R: max_runs[7,11]
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://namenode.cmda.hadoop.com:50030/jobdetails.jsp?jobid=job_1406677482986_0003
2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003]
2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003]
2014-07-29 17:51:18,582 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1406677482986_0003 has failed! Stop running all dependent jobs
2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-73 Operator Key: scope-73): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function
2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2014-07-29 17:51:18,826 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.4.0 0.13.0 root 2014-07-29 17:50:16 2014-07-29 17:51:18 HASH_JOIN,GROUP_BY
Failed!
Failed Jobs: JobId Alias Feature Message Outputs
job_1406677482986_0003 batting,grp_data,max_runs,runs MULTI_QUERY,COMBINER Message: Job failed!
Input(s): Failed to read data from "hdfs://namenode.cmda.hadoop.com:8020/user/root/pig_data/Batting.csv"
Output(s):
Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0
Job DAG: job_1406677482986_0003 -> null, null
2014-07-29 17:51:18,826 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2014-07-29 17:51:18,827 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2106: Error executing an algebraic function Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
2014-07-29 17:51:18,828 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job scope-58 failed, hadoop does not return any error message Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
Try casting the result of the MAX function:
max_runs = FOREACH grp_data GENERATE group as grp, (int)MAX(runs.runs) as max_runs;
Hope it will work.
You should declare data types for the fields you load, for example:
runs = FOREACH batting GENERATE $0 as playerID:chararray, $1 as year:int, $8 as runs:int;
If this doesn't help for some reason, try explicit casting.
max_runs = FOREACH grp_data GENERATE group as grp, MAX((int)runs.runs) as max_runs;
Thanks to both @BigData and @Mikko Kupsu for the hint. The issue does indeed have something to do with datatype casting: without declared types the fields are loaded as bytearray, which is likely why MAX fails with ERROR 2106 here.
After specifying the data type of each column as follows, everything runs great.
batting =
LOAD '/user/root/pig_data/Batting.csv' USING PigStorage(',')
AS (playerID: CHARARRAY, yearID: INT, stint: INT, teamID: CHARARRAY, lgID: CHARARRAY,
G: INT, G_batting: INT, AB: INT, R: INT, H: INT, two_B: INT, three_B: INT, HR: INT, RBI: INT,
SB: INT, CS: INT, BB:INT, SO: INT, IBB: INT, HBP: INT, SH: INT, SF: INT, GIDP: INT, G_old: INT);

Resources