Does sqoop spill temporary data to disk - hadoop
As I understand it, Sqoop launches a few mappers on different data nodes, and each mapper opens a JDBC connection to the RDBMS. Once the connection is established, the data is transferred to HDFS.
I am trying to understand whether a Sqoop mapper temporarily spills data to local disk on the data node. I know that spilling happens in MapReduce, but I am not sure about a Sqoop job.
It seems that sqoop-import runs as a map-only job and does not spill, while sqoop-merge runs as a full map-reduce job and does spill. You can check this in the JobTracker while a Sqoop import is running.
Have a look at this part of a sqoop import log. It does not spill; it fetches the rows and writes them to HDFS:
INFO [main] ... mapreduce.db.DataDrivenDBRecordReader: Using query: SELECT...
[main] mapreduce.db.DBRecordReader: Executing query: SELECT...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
INFO [Thread-16] ...mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1489705733959_2462784_m_000000_0 is done. And is in the process of committing
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_1489705733959_2462784_m_000000_0' to hdfs://
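One way to check a task log for spill activity is to search it for the messages that MapTask prints when it writes its sort buffer to local disk. The following is a minimal sketch in Python, assuming the log has already been read into a list of lines; the function name `count_spills` and the sample lists are hypothetical, and the sample lines are abridged from typical sqoop-import and sqoop-merge task logs.

```python
import re

# MapTask logs "Finished spill N" each time a spill file has been written.
SPILL_DONE = re.compile(r"Finished spill (\d+)")


def count_spills(log_lines):
    """Return how many completed spills a task log reports."""
    return sum(1 for line in log_lines if SPILL_DONE.search(line))


# Abridged sample lines; a real check would read the task's full log instead.
import_log = [
    "INFO [main] mapreduce.db.DBRecordReader: Executing query: SELECT...",
    "INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1489705733959_2462784_m_000000_0 is done. And is in the process of committing",
]
merge_log = [
    "INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output",
    "INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output",
    "INFO [main] org.apache.hadoop.mapred.MapTask: Finished spill 0",
]

print("import spills:", count_spills(import_log))
print("merge spills:", count_spills(merge_log))
```

A count of zero in a map-only job is what you would expect, because there is no sort buffer to flush when the mapper writes straight to the output format.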
Now have a look at this sqoop-merge log (some rows skipped). It spills to disk (note "Spilling map output" in the log):
INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://bla-bla/part-m-00000:0+48322717
...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
...
INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 1024
INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 751619264
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452; length = 67108864
INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO [main] com.pepperdata.supervisor.agent.resource.r: Datanode bla-bla is LOCAL.
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
...
INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 184775274; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452(1073741808); kvend = 267347800(1069391200); length = 1087653/67108864
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
[main] org.apache.hadoop.mapred.MapTask: Finished spill 0
...Task:attempt_1489705733959_2479291_m_000000_0 is done. And is in the process of committing
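The numbers in the merge log can be related to each other with a little arithmetic. The sketch below only uses values copied from the log above; the variable names are made up for illustration, and the reading of the soft limit as "buffer size times spill percentage" assumes the usual meaning of mapreduce.map.sort.spill.percent.

```python
MIB = 1 << 20

# Values copied from the sqoop-merge log above.
sort_mb = 1024            # mapreduce.task.io.sort.mb
bufvoid = 1073741824      # size of the in-memory sort buffer in bytes
soft_limit = 751619264    # fill level at which a background spill starts
bufend = 184775274        # bytes in the buffer when the spill happened

# The buffer size is simply io.sort.mb expressed in bytes.
buffer_matches_sort_mb = (sort_mb * MIB == bufvoid)

# Ratio of soft limit to buffer size, i.e. the effective spill fraction.
spill_fraction = soft_limit / bufvoid

# How full the buffer was when "Spilling map output" was logged.
fill_at_spill = bufend / bufvoid

# If the buffer was below the soft limit, the spill came from the final
# flush at the end of the map task, not from the buffer filling up.
spilled_by_final_flush = bufend < soft_limit

print(f"buffer matches io.sort.mb: {buffer_matches_sort_mb}")
print(f"spill fraction: {spill_fraction:.2f}")
print(f"fill at spill: {fill_at_spill:.1%}")
print(f"spilled by final flush: {spilled_by_final_flush}")
```

In this log the buffer was only about 17% full, so the single spill is the end-of-task flush that follows "Starting flush of map output". A map task with a reducer behind it writes its output to local disk this way even when the data would fit in memory.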
Related
PIG : count of each product in distinctive Locations
I am trying to do the following Step 1 to Step 4 in Pig:

Step 1: Create a user table and take data from /tmp/users.txt:
| Column 1 | USER ID  | int       |
| Column 2 | EMAIL    | chararray |
| Column 3 | LANGUAGE | chararray |
| Column 4 | LOCATION | chararray |

Step 2: Create a transaction table and take data from /tmp/transaction.txt:
| Column 1 | ID              | int       |
| Column 2 | PRODUCT         | int       |
| Column 3 | USER ID         | int       |
| Column 4 | PURCHASE AMOUNT | double    |
| Column 5 | DESCRIPTION     | chararray |

Step 3: Find the count of each product in distinct locations.

Step 4: Display the results.

To achieve the above I did the following:

users = LOAD '/tmp/users.txt' USING PigStorage(',') AS (USERID:int, EMAIL:chararray, LANGUAGE:chararray, LOCATION:chararray);
trans = LOAD '/tmp/transaction.txt' USING PigStorage(',') AS (ID:int, PRODUCT:int, USERID:int, PURCHASEAMOUNT:double, DESCRIPTION:chararray);
users_trans = JOIN users BY USERID RIGHT, trans BY USERID;
B = GROUP users_trans BY (DESCRIPTION,LOCATION);
C = FOREACH B GENERATE group as comb, COUNT(users_trans) AS Total;
DUMP C;

But I am getting errors. It would be helpful if you could assist, as I am new to Pig.
##########################################
Dataset

user.txt
1 creator#gmail.com EN US
2 creator#gmail.com EN GB
3 creator#gmail.com FR FR
4 creator#gmail.com IN HN
5 creator#gmail.com PAK IS

transaction.txt
1 1 1 300 a jumper
2 1 2 300 a jumper
3 1 5 300 a jumper
4 2 3 100 a rubber chicken
5 1 3 300 a jumper
6 5 4 500 a soapbox
7 3 3 200 a adhesive
8 4 1 300 a lotion
9 4 4 500 a sweater
10 5 4 600 a jeans

Error Log:
2019-12-27 06:17:22,180 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/tmp/temp2029752934/tmp-883821114/part-r-00000:0+130
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - (EQUATOR) 0 kvi 26214396(104857584)
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - mapreduce.task.io.sort.mb: 100
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - soft limit at 83886080
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - bufstart = 0; bufvoid = 104857600
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - kvstart = 26214396; length = 6553600
2019-12-27 06:17:22,244 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2019-12-27 06:17:22,248 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor.
collectionUsageThreshold = 489580128, usageThreshold = 489580128 2019-12-27 06:17:22,248 [LocalJobRunner Map Task Executor #0] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized 2019-12-27 06:17:22,250 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4] 2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner - 2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Starting flush of map output 2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Spilling map output 2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - bufstart = 0; bufend = 100; bufvoid = 104857600 2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - kvstart = 26214396(104857584); kvend = 26214360(104857440); length = 37/6553600 2019-12-27 06:17:22,262 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4] 2019-12-27 06:17:22,264 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Finished spill 0 2019-12-27 06:17:22,265 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task:attempt_local1424814286_0002_m_000000_0 is done. 
And is in the process of committing 2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -map 2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local1424814286_0002_m_000000_0' done. 2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -Finishing task: attempt_local1424814286_0002_m_000000_0 2019-12-27 06:17:22,266 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - map task executor complete. 2019-12-27 06:17:22,266 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - Waiting for reduce tasks 2019-12-27 06:17:22,267 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local1424814286_0002_r_000000_0 2019-12-27 06:17:22,272 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1 2019-12-27 06:17:22,272 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2019-12-27 06:17:22,274 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorProcessTree : [ ] 2019-12-27 06:17:22,274 [pool-9-thread-1] INFO org.apache.hadoop.mapred.ReduceTask - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#2582aa54 2019-12-27 06:17:22,275 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10 2019-12-27 06:17:22,275 [EventFetcher for fetching Map Completion Events] INFO org.apache.hadoop.mapreduce.task.reduce.EventFetcher - attempt_local1424814286_0002_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events 2019-12-27 
06:17:22,276 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#2 about to shuffle output of map attempt_local1424814286_0002_m_000000_0 decomp: 14 len: 18 to MEMORY 2019-12-27 06:17:22,277 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 14 bytes from map-output for attempt_local1424814286_0002_m_000000_0 2019-12-27 06:17:22,277 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - closeInMemoryFile -> map-output of size: 14, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->14 2019-12-27 06:17:22,277 [EventFetcher for fetching Map Completion Events] INFO org.apache.hadoop.mapreduce.task.reduce.EventFetcher - EventFetcher is interrupted.. Returning 2019-12-27 06:17:22,278 [Readahead Thread #3] WARN org.apache.hadoop.io.ReadaheadPool - Failed readahead on ifile EBADF: Bad file descriptor at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method) at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267) at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146) at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:208) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2019-12-27 06:17:22,278 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied. 
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs 2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Merging 1 sorted segments 2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 7 bytes 2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merged 1 segments, 14 bytes to disk to satisfy reduce memory limit 2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 1 files, 18 bytes from disk 2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 0 segments, 0 bytes from memory into reduce 2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Merging 1 sorted segments 2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 7 bytes 2019-12-27 06:17:22,282 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied. 2019-12-27 06:17:22,283 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1 2019-12-27 06:17:22,283 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2019-12-27 06:17:22,284 [pool-9-thread-1] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. 
collectionUsageThreshold = 489580128, usageThreshold = 489580128 2019-12-27 06:17:22,285 [pool-9-thread-1] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized 2019-12-27 06:17:22,286 [pool-9-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4] 2019-12-27 06:17:22,287 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task:attempt_local1424814286_0002_r_000000_0 is done. And is in the process of committing 2019-12-27 06:17:22,289 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied. 2019-12-27 06:17:22,289 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task attempt_local1424814286_0002_r_000000_0 is allowed to commit now 2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local1424814286_0002_r_000000_0' to file:/tmp/temp2029752934/tmp726323435/_temporary/0/task_local1424814286_0002_r_000000 2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce 2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local1424814286_0002_r_000000_0' done. 2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - Finishing task: attempt_local1424814286_0002_r_000000_0 2019-12-27 06:17:22,292 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete. 
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1424814286_0002 2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases B,C 2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4] 2019-12-27 06:17:22,463 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized 2019-12-27 06:17:22,464 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized 2019-12-27 06:17:22,465 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized 2019-12-27 06:17:22,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2019-12-27 06:17:22,474 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.9.2 0.16.0 root 2019-12-27 06:17:20 2019-12-27 06:17:22 HASH_JOIN,GROUP_BY Success! 
Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs job_local1289071959_0001 2 1 n/a n/a n/a n/a n/a n/a n/a n/a trans,users,users_trans HASH_JOIN job_local1424814286_0002 1 1 n/a n/a n/a n/a n/a n/a n/a n/a B,C GROUP_BY,COMBINER file:/tmp/temp2029752934/tmp726323435, Input(s): Successfully read 5 records from: "/tmp/users.txt" Successfully read 10 records from: "/tmp/transaction.txt" Output(s): Successfully stored 1 records in: "file:/tmp/temp2029752934/tmp726323435" Counters: Total records written : 1 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_local1289071959_0001 -> job_local1424814286_0002, job_local1424814286_0002 2019-12-27 06:17:22,475 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized 2019-12-27 06:17:22,476 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized 2019-12-27 06:17:22,477 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized 2019-12-27 06:17:22,485 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized 2019-12-27 06:17:22,486 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized 2019-12-27 06:17:22,487 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized 2019-12-27 06:17:22,492 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 
Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 15 time(s). 2019-12-27 06:17:22,493 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 55 time(s). 2019-12-27 06:17:22,493 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 2019-12-27 06:17:22,496 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2019-12-27 06:17:22,496 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized 2019-12-27 06:17:22,503 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1 2019-12-27 06:17:22,503 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2019-12-27 06:17:22,541 [main] INFO org.apache.pig.Main - Pig script completed in 2 seconds and 965 milliseconds (2965 ms)
Advice

First of all: it seems that you are just starting out with Pig. It may be valuable to know that Cloudera recently decided to deprecate Pig. It will of course not cease to exist, but think twice if you are planning to pick up a new skill or implement new use cases. I would recommend looking into Hive/Spark/Impala as more future-proof alternatives.

Answer

Your job succeeds, but presumably not with the output you want. There are several hints as to what may be wrong (data types/field names); however, they do not point at a specific problem in the code. My recommendation would be to find out exactly where the problem occurs. Simply cut off the end of your code and print an intermediate result to see if you are still on track. In the (likely) event that you already have a problem in your load statement, note that you can narrow it down further: first load without a schema, and then apply the schema.
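The two warning counts at the end of the log (FIELD_DISCARDED_TYPE_CONVERSION_FAILED 15 times, ACCESSING_NON_EXISTENT_FIELD 55 times) are one such hint. The sketch below is a simplified model in Python, not Pig's actual implementation: it assumes the input files contain no commas, so that splitting on ',' leaves every line as a single field. The function `simulate_load` and its counting rules are assumptions made for illustration.

```python
users = [
    "1 creator#gmail.com EN US",
    "2 creator#gmail.com EN GB",
    "3 creator#gmail.com FR FR",
    "4 creator#gmail.com IN HN",
    "5 creator#gmail.com PAK IS",
]
trans = [
    "1 1 1 300 a jumper",
    "2 1 2 300 a jumper",
    "3 1 5 300 a jumper",
    "4 2 3 100 a rubber chicken",
    "5 1 3 300 a jumper",
    "6 5 4 500 a soapbox",
    "7 3 3 200 a adhesive",
    "8 4 1 300 a lotion",
    "9 4 4 500 a sweater",
    "10 5 4 600 a jeans",
]


def simulate_load(lines, declared_fields, delimiter=","):
    """Model a load with a delimiter and a schema whose first field is an int.

    Returns (failed int conversions of field 0, declared fields that are absent).
    """
    conversion_failed = 0
    missing_fields = 0
    for line in lines:
        fields = line.split(delimiter)
        try:
            int(fields[0])          # first declared column is an int
        except ValueError:
            conversion_failed += 1  # whole line cannot be cast to int
        missing_fields += max(0, declared_fields - len(fields))
    return conversion_failed, missing_fields


user_failed, user_missing = simulate_load(users, declared_fields=4)
trans_failed, trans_missing = simulate_load(trans, declared_fields=5)

print("conversion failures:", user_failed + trans_failed)
print("missing fields:", user_missing + trans_missing)
```

Under these assumptions the totals come out to 15 and 55, the same as the counts in the log, which is consistent with the delimiter in the LOAD statements not matching the data.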
Given the data you have, the first problem is that it contains no commas, so you must load each line as a whole and split it later. I used two or more spaces as the delimiter in the transactions file because your last column appears to be a single string containing spaces. For accuracy, I suggest using a better delimiter than spaces or tabs.

Then the GROUP BY needs to reference the relations that the fields come from. Everything else is fine, I think, though I'm not sure about the COUNT(X).

A = LOAD '/tmp/users.txt' USING PigStorage() as (line:chararray);
USERS = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) AS (userid:int,email:chararray,language:chararray,location:chararray);
B = LOAD '/tmp/transactions.txt' USING PigStorage() as (line:chararray);
TRANS = FOREACH B GENERATE FLATTEN(STRSPLIT(line, '\\s\\s+')) AS (id:int,product:int,userid:int,purchase:double,desc:chararray);
X = JOIN USERS BY userid RIGHT, TRANS BY userid;
X_grouped = GROUP X BY (TRANS::desc, USERS::location);
RES = FOREACH X_grouped GENERATE group as comb, COUNT(X) AS Total;
\d RES;

Output

((a jeans,HN),1)
((a jumper,FR),1)
((a jumper,GB),1)
((a jumper,IS),1)
((a jumper,US),1)
((a lotion,US),1)
((a soapbox,HN),1)
((a sweater,HN),1)
((a adhesive,FR),1)
((a rubber chicken,FR),1)
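To cross-check the expected result independently of Pig, the same right join, group and count can be sketched in a few lines of Python over the sample data from the question. This is only an illustration of the logic, with the data typed in by hand; the variable names are made up.

```python
from collections import Counter

# userid -> location, from the users sample in the question
user_location = {1: "US", 2: "GB", 3: "FR", 4: "HN", 5: "IS"}

# (id, product, userid, purchase, description), from the transactions sample
transactions = [
    (1, 1, 1, 300.0, "a jumper"),
    (2, 1, 2, 300.0, "a jumper"),
    (3, 1, 5, 300.0, "a jumper"),
    (4, 2, 3, 100.0, "a rubber chicken"),
    (5, 1, 3, 300.0, "a jumper"),
    (6, 5, 4, 500.0, "a soapbox"),
    (7, 3, 3, 200.0, "a adhesive"),
    (8, 4, 1, 300.0, "a lotion"),
    (9, 4, 4, 500.0, "a sweater"),
    (10, 5, 4, 600.0, "a jeans"),
]

# Right join: every transaction is kept; the location is None when no user
# matches. Then group by (description, location) and count the rows.
counts = Counter(
    (desc, user_location.get(userid))
    for _id, _product, userid, _purchase, desc in transactions
)

for (desc, location), total in sorted(counts.items()):
    print(f"(({desc},{location}),{total})")
```

With this sample every (description, location) pair occurs exactly once, which agrees with the ten rows of output shown above.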
How to divert the output of pig command to a text file in order to print it out?
2015-09-24 01:59:28,436 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]} 2015-09-24 01:59:28,539 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2015-09-24 01:59:28,556 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 2015-09-24 01:59:28,560 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 2015-09-24 01:59:28,561 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2015-09-24 01:59:28,620 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. 
Instead, use fs.defaultFS 2015-09-24 01:59:28,624 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032 2015-09-24 01:59:28,638 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job 2015-09-24 01:59:28,640 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2015-09-24 01:59:28,641 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process 2015-09-24 01:59:29,268 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/pig/pig-0.14.0-core-h2.jar to DistributedCache through /tmp/temp-1176581946/tmp-2078805221/pig-0.14.0-core-h2.jar 2015-09-24 01:59:29,452 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1176581946/tmp-1750967439/automaton-1.11-8.jar 2015-09-24 01:59:29,538 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1176581946/tmp1997290065/antlr-runtime-3.4.jar 2015-09-24 01:59:29,843 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/hadoop/share/hadoop/common/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1176581946/tmp-256046780/guava-11.0.2.jar 2015-09-24 01:59:29,990 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/home/vivek/Applications/pig/lib/joda-time-2.1.jar to DistributedCache through 
/tmp/temp-1176581946/tmp955728106/joda-time-2.1.jar 2015-09-24 01:59:30,129 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2015-09-24 01:59:30,131 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code. 2015-09-24 01:59:30,131 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche 2015-09-24 01:59:30,132 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize [] 2015-09-24 01:59:30,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission. 2015-09-24 01:59:30,283 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032 2015-09-24 01:59:30,568 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String). 2015-09-24 01:59:30,868 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-09-24 01:59:30,871 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-09-24 01:59:30,874 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1 2015-09-24 01:59:31,190 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1 2015-09-24 01:59:31,499 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1443082231600_0003 2015-09-24 01:59:31,516 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources. 
2015-09-24 01:59:31,704 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1443082231600_0003 2015-09-24 01:59:31,738 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://ubuntu:8088/proxy/application_1443082231600_0003/ 2015-09-24 01:59:31,742 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1443082231600_0003 2015-09-24 01:59:31,745 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases highsal,salaries 2015-09-24 01:59:31,745 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: salaries[3,10],salaries[-1,-1],highsal[13,9] C: R: 2015-09-24 01:59:31,781 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2015-09-24 01:59:31,782 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1443082231600_0003] 2015-09-24 02:00:48,025 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete 2015-09-24 02:00:48,025 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1443082231600_0003] 2015-09-24 02:00:53,055 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032 2015-09-24 02:00:53,104 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-09-24 02:00:58,180 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. 
Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-09-24 02:00:59,182 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-09-24 02:01:00,185 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) (F,96,86000.0,95105) (M,24,80000.0,95050) (F,84,89000.0,94040) (M,36,85000.0,95101) (F,69,91000.0,95050) (F,96,80000.0,95051) (M,78,87000.0,95105) (M,25,96000.0,95103) (M,89,90000.0,95102) (F,82,77000.0,95051) (M,97,96000.0,95102) (F,39,82000.0,95051) (M,36,79000.0,95101) (M,75,84000.0,95103) (F,78,91000.0,95102) (M,59,77000.0,95051) (F,52,76000.0,95050) (M,52,97000.0,95102) (F,28,98000.0,95105) (M,91,96000.0,94041) (F,47,85000.0,95051) (M,79,85000.0,95101) (F,93,93000.0,95102) (F,33,82000.0,95101) (F,77,96000.0,95103) (F,93,84000.0,95051) (M,23,83000.0,95050) (M,54,97000.0,95101) (F,25,93000.0,94040) (M,52,85000.0,95102) (M,60,78000.0,94040) (F,74,89000.0,94040) (F,23,76000.0,95101) (M,46,93000.0,95051) (F,63,92000.0,95105) (F,86,93000.0,95101) (F,37,95000.0,95101) (M,41,89000.0,95050) (F,89,77000.0,94041) (F,82,84000.0,95050) (M,66,96000.0,95051) (F,75,79000.0,95051) (M,91,90000.0,95105) (M,27,98000.0,95051) (M,24,85000.0,94041) (M,82,96000.0,95050) (F,75,88000.0,95101) (F,80,77000.0,95051) (M,63,80000.0,95101) (M,29,86000.0,95103) (F,44,91000.0,95101) (M,40,78000.0,95103) (F,46,83000.0,95051) (F,42,85000.0,95105) (M,44,90000.0,95102) (F,26,90000.0,94041) (F,31,87000.0,95051) (F,88,76000.0,95050) (M,67,87000.0,95102) (F,58,86000.0,94041) (F,57,85000.0,95051) (M,97,85000.0,95101) (M,73,90000.0,95103) (M,47,95000.0,95105) (F,83,98000.0,94040) (F,56,78000.0,95101) 
(M,72,89000.0,94041) (M,90,99000.0,95101) (F,59,79000.0,95105) (F,32,84000.0,95051) (F,60,93000.0,95103) (M,47,87000.0,94041) (M,52,87000.0,95103) (M,82,92000.0,95051) (M,39,87000.0,95102) (F,93,89000.0,95103) (M,31,88000.0,95050) (M,21,92000.0,94040) (F,65,84000.0,95050) (M,68,89000.0,94041) (F,63,92000.0,94041) (F,95,77000.0,95050) (F,34,98000.0,95102) (F,44,94000.0,94040) (M,69,81000.0,95103) (F,30,85000.0,95051) (F,85,82000.0,95050) (M,75,78000.0,94040) (F,91,94000.0,95105) (F,71,91000.0,94041) (M,39,91000.0,95051) (M,43,90000.0,95105) (F,35,94000.0,94040) (F,41,83000.0,95051) (M,62,94000.0,94041) (F,38,77000.0,94041) (F,63,89000.0,95051) (M,78,90000.0,95050) (M,65,92000.0,95101) (F,42,94000.0,95103) (M,65,80000.0,95103) (F,38,91000.0,95102) (M,58,93000.0,94040) (F,63,83000.0,95103) (F,23,96000.0,95103) (F,43,96000.0,95102) (F,27,86000.0,94041) (M,94,76000.0,94041) (F,53,79000.0,94041) (M,78,79000.0,95102) (F,62,82000.0,95101) (M,86,83000.0,95051) (F,91,98000.0,95105) (M,61,99000.0,95103) (M,58,94000.0,95050) (F,47,99000.0,95102) (F,24,89000.0,95101) (M,80,92000.0,95051) (F,30,83000.0,95102) (F,35,86000.0,95051) (M,69,82000.0,95102) (F,49,83000.0,95105) (M,59,82000.0,95103) (F,74,84000.0,95103) (F,82,83000.0,95051) (M,32,85000.0,95102) (M,39,91000.0,95103) (M,50,95000.0,95051) (M,98,89000.0,95105) (M,84,96000.0,95050) (M,61,90000.0,95103) (F,69,83000.0,95102) (F,59,91000.0,95101) (M,79,90000.0,95050) (F,98,83000.0,95050) (F,65,78000.0,94040) (F,74,81000.0,95103) (M,83,97000.0,95101) (M,42,92000.0,95102) (M,82,92000.0,95105) (F,41,91000.0,94041) (F,35,97000.0,94040) (F,46,85000.0,95050) (M,34,86000.0,94041) (F,37,85000.0,94041) (M,64,91000.0,94040) (M,92,84000.0,95051) (M,56,83000.0,95103) (F,68,98000.0,95101) (M,28,81000.0,95050) (F,81,93000.0,95050) (M,71,87000.0,95051) (M,90,86000.0,95050) (F,92,78000.0,94041) (M,42,97000.0,95101) (F,97,83000.0,94041) (M,41,86000.0,95051) (F,96,99000.0,95102) (F,56,96000.0,95051) (F,63,99000.0,95105) (F,69,89000.0,95050) 
(M,67,85000.0,95105) (M,61,83000.0,95051) (M,86,96000.0,95103) (F,84,82000.0,94041) (M,91,90000.0,95050) (F,36,99000.0,94041) (M,75,97000.0,95105) (M,39,93000.0,95050) (M,56,90000.0,95050) (M,61,91000.0,95105) (M,29,93000.0,94041) (M,79,99000.0,95102) (M,48,91000.0,95101) (F,95,76000.0,95101) (M,47,98000.0,95050) (M,61,88000.0,95101) (M,74,77000.0,95101) (M,75,83000.0,94040) (M,34,82000.0,95103) (M,70,85000.0,95103) (F,43,94000.0,94041) (F,64,91000.0,95105) (F,21,95000.0,95051) (M,55,91000.0,95051) (M,27,85000.0,95105) (F,40,84000.0,94040) (F,41,84000.0,94041) (F,50,87000.0,95051) (M,72,82000.0,95103) (F,50,87000.0,95105) (F,31,93000.0,95102) (F,45,80000.0,95050) (F,62,77000.0,94040) (M,93,91000.0,95101) (M,77,94000.0,95051) (F,33,82000.0,95051) (M,95,87000.0,95105) (M,40,79000.0,95102) (M,82,87000.0,95050) (M,55,85000.0,95051) (M,52,96000.0,95102) (F,52,96000.0,95050) (F,78,82000.0,95102) (F,31,82000.0,94041) (F,60,97000.0,95101) (M,77,81000.0,95102) (F,78,93000.0,95101) (M,74,82000.0,94040) (M,62,77000.0,95050) (F,72,77000.0,95102) (M,96,87000.0,94041) (F,89,93000.0,95051) (M,59,87000.0,95050) (F,26,81000.0,95105) (F,84,77000.0,95051) (F,42,84000.0,94040) (F,59,96000.0,94041) (F,31,78000.0,95050) (F,91,85000.0,95105) (F,87,79000.0,95102) (M,39,88000.0,95105) (F,47,86000.0,95051) (F,24,92000.0,95101) (F,76,85000.0,95103) (F,48,83000.0,95105) (M,50,88000.0,95105) (F,61,93000.0,94041) (F,59,98000.0,95050) (F,57,95000.0,95050) (M,77,76000.0,95105) (M,34,90000.0,95105) (M,23,91000.0,95050) (M,38,88000.0,95051) (F,35,86000.0,95102) (M,27,91000.0,95103) (F,99,78000.0,95051) (F,77,94000.0,94041) (M,23,83000.0,95103) (M,93,91000.0,95051) (F,94,89000.0,95103) (M,99,99000.0,95105) (M,75,84000.0,94040) (M,32,89000.0,94041) (F,57,76000.0,94040) (F,94,95000.0,95103) (M,66,82000.0,94041) (F,56,98000.0,94041) (M,37,88000.0,95105) (M,89,82000.0,95050) (M,91,79000.0,95103) (F,72,90000.0,95102) (F,53,85000.0,95050) (F,87,91000.0,95105) (M,74,91000.0,95050) (F,62,99000.0,95102) 
(M,46,95000.0,95105) (F,73,78000.0,95050) (F,35,94000.0,95102) (F,60,77000.0,95105) (M,83,93000.0,95105) (F,55,76000.0,95051) (F,36,90000.0,95101) (F,75,87000.0,95103) (F,91,98000.0,95103) (F,66,87000.0,95101) (M,83,91000.0,95103) (M,52,77000.0,94040) (F,76,85000.0,95103) (F,98,78000.0,95102) (F,60,89000.0,95050) (F,30,76000.0,95101) (F,53,95000.0,95050) (M,63,85000.0,95105) (F,25,94000.0,95050) (M,29,98000.0,95103) (M,53,82000.0,95050) (F,70,89000.0,95101) (F,76,83000.0,95105) (M,85,98000.0,95050) (F,81,97000.0,95103) (M,30,77000.0,94041) (F,73,85000.0,95102) (M,94,93000.0,95103) (F,83,80000.0,95101) (F,44,88000.0,94040) (F,35,83000.0,95051) (F,25,82000.0,94040) (M,26,92000.0,95101) (F,60,81000.0,95105) (F,47,78000.0,94040) (F,53,87000.0,94040) (F,44,88000.0,95051) (M,73,96000.0,95103) (F,77,95000.0,95103) (M,24,93000.0,95050) (F,21,76000.0,95050) (F,82,90000.0,95103) (M,71,97000.0,95051) (M,53,79000.0,95105) (M,28,84000.0,94040) (M,35,97000.0,95101) (F,75,76000.0,94040) (M,87,94000.0,94041) (F,89,79000.0,95102) (F,80,92000.0,95102) (M,24,77000.0,95102) (F,40,94000.0,95105) (M,43,80000.0,94041) (M,23,80000.0,94041) (F,51,83000.0,94041) (F,90,78000.0,94040) (F,41,79000.0,95102) (M,48,93000.0,94041) (M,69,94000.0,94040) (F,36,81000.0,95101) (M,35,91000.0,95051) (F,26,88000.0,95050) (M,35,83000.0,94041) (F,36,77000.0,95103) (M,57,91000.0,95103) (F,57,89000.0,95101) (F,38,86000.0,94041) (F,31,83000.0,95050) (M,47,96000.0,94041) (F,91,83000.0,95101) (F,21,78000.0,95103) (M,32,84000.0,95051) (F,41,93000.0,94041) (M,81,93000.0,95102) (F,59,78000.0,95105) (M,71,90000.0,95050) (F,51,77000.0,95051) (M,29,88000.0,95102) (F,40,93000.0,95102) (F,89,99000.0,95105) (F,64,77000.0,95103) (F,53,87000.0,94041) (M,53,97000.0,94040) (M,45,78000.0,94040) (F,76,89000.0,94041) (M,59,81000.0,95050) (F,24,76000.0,94041) (M,72,95000.0,95051) (M,63,83000.0,94040) (F,39,76000.0,94041) (F,26,85000.0,95101) (M,90,99000.0,95102) (F,47,76000.0,95103) (M,72,86000.0,95105) (M,38,92000.0,95050) 
(M,54,78000.0,95101) (F,48,86000.0,95102) (F,37,78000.0,94040) (F,75,88000.0,95103) (F,66,78000.0,95050) (M,58,80000.0,94040) (M,84,88000.0,95050) (F,35,94000.0,95050) (M,57,88000.0,95102) (M,68,83000.0,95050) (M,37,91000.0,95103) (M,65,79000.0,95101) (M,65,85000.0,95101) (F,97,83000.0,95102) (M,43,83000.0,95051) (F,73,82000.0,95103) (M,89,87000.0,95050) (F,74,84000.0,95103) (M,73,90000.0,94041) (F,46,97000.0,95103) (M,36,82000.0,94041) (M,80,82000.0,95105) (F,78,79000.0,95102) (M,67,96000.0,94040) (F,48,98000.0,95102) (F,82,86000.0,95050) (M,79,80000.0,95050) (M,96,84000.0,95103) (M,51,87000.0,94040) (F,29,84000.0,95051) (M,47,86000.0,94040) (M,54,96000.0,94041) (F,80,94000.0,94041) (F,92,93000.0,95103) (F,59,79000.0,95050) (M,95,80000.0,95050) (M,67,92000.0,94040) (F,23,98000.0,95103) (M,91,82000.0,95051) (M,27,89000.0,95105) (M,43,77000.0,94041) (F,65,83000.0,94040) (F,65,82000.0,95051) (M,43,98000.0,95105) (F,51,86000.0,95102) (M,76,83000.0,95051) (F,25,92000.0,94040) (M,48,76000.0,95102) (F,43,86000.0,95050) (F,57,83000.0,95101) (F,48,84000.0,95051) (M,37,98000.0,95102) (F,98,81000.0,95105) (M,78,86000.0,94041) (F,34,93000.0,95102) (M,53,94000.0,95102) (M,69,98000.0,94040) (F,70,84000.0,94041) (F,89,87000.0,94040) (F,52,89000.0,95102) (F,84,79000.0,95102) (M,44,86000.0,94041) (M,51,93000.0,94041) (M,98,81000.0,95102) (F,82,77000.0,95101) (M,50,82000.0,95103) (F,59,76000.0,95051) (M,29,76000.0,94041) (F,30,81000.0,95051) (F,22,96000.0,95105) (M,64,88000.0,94040) (M,80,78000.0,95102) (F,94,85000.0,95051) (M,63,95000.0,95103) (F,51,78000.0,95050) (M,39,94000.0,95105) (M,80,85000.0,95101) (M,92,89000.0,95102) (M,44,88000.0,95103) (M,57,92000.0,95050) (F,64,94000.0,95051) (F,88,91000.0,95102) (F,43,83000.0,95101) (F,33,93000.0,95050) (M,64,92000.0,95102) (M,91,92000.0,95050) (F,32,88000.0,95105) (M,78,87000.0,94041) (F,64,85000.0,94040) (M,93,96000.0,95102) (F,72,98000.0,95103) (M,68,76000.0,95051) (M,52,95000.0,95050) (F,75,93000.0,95103) (M,45,85000.0,94041) 
(F,70,98000.0,95051) (F,74,96000.0,95101) (F,81,85000.0,95102) (M,83,91000.0,95105) (M,32,89000.0,95101) (F,58,90000.0,94041) (M,55,80000.0,95050) (F,23,79000.0,95051) (M,91,79000.0,95103) (F,21,98000.0,95102) (F,57,91000.0,95101) (M,58,91000.0,95051) (F,41,94000.0,95101) (M,67,95000.0,94041) (M,69,80000.0,95101) (M,23,77000.0,94041) (F,94,92000.0,95105) (F,60,92000.0,95051) (F,53,84000.0,94041) (F,48,98000.0,95103) (M,70,88000.0,95051) (M,76,94000.0,95103) (F,22,88000.0,94040) (F,80,81000.0,95102) (F,57,80000.0,95051) (F,57,99000.0,95103) (M,50,78000.0,95050) (M,40,81000.0,95050) (F,93,97000.0,95050) (M,40,80000.0,94041) (M,35,91000.0,95101) (F,50,96000.0,94041) (F,27,90000.0,95105) (F,23,91000.0,95105) (M,49,80000.0,94041) (M,90,98000.0,95105) (M,29,91000.0,95050) (F,99,83000.0,95103) (F,43,83000.0,94040) (F,30,90000.0,94041) (F,96,97000.0,95102) (M,83,77000.0,95103) (F,77,97000.0,94040) (F,74,98000.0,95105) (F,96,96000.0,95103) (F,37,81000.0,94041) (M,82,91000.0,94040) (F,33,90000.0,95101) (F,35,86000.0,95102) (F,67,87000.0,95105) (M,95,95000.0,95051) (M,82,95000.0,95101) (F,26,76000.0,95050) (F,65,84000.0,95103) (F,34,91000.0,95102) (F,48,81000.0,94040) (F,93,84000.0,94041) (F,37,79000.0,95105) (M,77,84000.0,95102) (M,94,78000.0,94040) (M,28,79000.0,95051) (F,30,80000.0,94041) (F,54,80000.0,95103) (F,93,96000.0,95105) (F,45,78000.0,94041)
Right now I'm just executing Pig commands. I want to redirect the output, or make a copy of it, at execution time, because it is really difficult to take a snapshot of it. Please suggest a way to do this.
The following commands produce the output listed above:
grunt>salaries= load 'salaries' using PigStorage(',') As (gender, age,salary,zip);
grunt> salaries= load 'salaries' using PigStorage(',') As (gender:chararray,age:int,salary:double,zip:long);
grunt>highsal= filter salaries by salary > 75000;
grunt>dump highsal;
When the above commands are executed, the output listed above is displayed.
I have just copied salaries.txt from the local FS to HDFS.
grunt> store highsal into 'file';
2015-09-24 02:59:15,981 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <line 1, column 6> Undefined alias: highsal Details at logfile: /home/vivek/pig_1443088724224.log
grunt>
I'm still getting an error with the suggested query.
You have not defined the "highsal" alias in the session where you run the STORE command. Pig does not keep any alias from a previous session. You have to execute all your commands in one session, or write a Pig script and invoke it. Try this:
grunt>salaries= load 'salaries' using PigStorage(',') As (gender, age,salary,zip);
grunt> salaries= load 'salaries' using PigStorage(',') As (gender:chararray,age:int,salary:double,zip:long);
grunt>highsal= filter salaries by salary > 75000;
grunt>STORE highsal INTO 'file';
This will store the content of "highsal" in files named 'file/part-x-xxxxx' under the user's HDFS home directory. You can also provide an absolute HDFS directory path instead of 'file' if you wish to store the data in a directory other than the user's home directory. Hope this helps.
store highsal into 'file';
Have a look at the Apache Pig documentation for all commands.
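The advice above (keep every statement in one session, or put them in a script) can be sketched as follows. This is only a minimal illustration in Python: the file name highsal.pig and the helper pig_command are made up for the example, and the command is built but not executed. It assumes the pig launcher is on the PATH and accepts -f to run a script file.

```python
# Pig Latin statements from the thread, kept together so that every alias
# is defined in the same session that runs STORE.
PIG_SCRIPT = """\
salaries = LOAD 'salaries' USING PigStorage(',')
           AS (gender:chararray, age:int, salary:double, zip:long);
highsal = FILTER salaries BY salary > 75000;
STORE highsal INTO 'file';
"""


def pig_command(script_path):
    """Return the command line that would run the script (hypothetical helper).

    The command is only constructed here; nothing is launched.
    """
    return ["pig", "-f", script_path]


# Example: save PIG_SCRIPT as highsal.pig (an assumed file name), then run
# the returned command from a shell or with subprocess.run.
print(" ".join(pig_command("highsal.pig")))
```

Because the script ends in STORE rather than DUMP, the result lands in HDFS instead of scrolling past on the console, which is what the question was asking for.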
ArrayIndexOutOfBoundsException at MapOutputBuffer$Buffer.write in MapTask (Hadoop 2.7.1)
Very odd case of ArrayIndexOutOfBounds in a Scalding-driven job running on Hadoop 2.7.1. Mapper log dump below. It looks like Equator somehow gets set to a negative number in spill 2. Is this normal? 2015-08-12 23:39:19,649 INFO [main] org.apache.hadoop.mapred.MapTask: numReduceTasks: 1 2015-08-12 23:39:20,174 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 0 kvi 469762044(1879048176) 2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 1792 2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 187904816 2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1879048192 2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 469762044; length = 117440512 2015-08-12 23:39:20,214 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer 2015-08-12 23:39:20,216 INFO [main] cascading.flow.hadoop.FlowMapper: cascading version: 2.6.1 2015-08-12 23:39:20,216 INFO [main] cascading.flow.hadoop.FlowMapper: child jvm opts: -Xmx1024m -Djava.io.tmpdir=./tmp 2015-08-12 23:39:20,516 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition 2015-08-12 23:39:20,552 INFO [main] cascading.flow.hadoop.FlowMapper: sourcing from: TempHfs["SequenceFile[['docId', 'otherDocId', 'score']]"][9909013673/_pipe_11__pipe_12/] 2015-08-12 23:39:20,552 INFO [main] cascading.flow.hadoop.FlowMapper: sinking to: GroupBy(_pipe_11+_pipe_12)[by:[ {1} :'docId']] 2015-08-12 23:39:29,424 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output 2015-08-12 23:39:29,424 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 108647886; bufvoid = 1879048192 2015-08-12 23:39:29,424 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 469762044(1879048176); kvend = 449947816(1799791264); length = 19814229/117440512 2015-08-12 23:39:29,425 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 839953118 kvi 209988272(839953088) 2015-08-12 23:39:43,985 INFO [SpillThread] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.gz] 2015-08-12 23:39:46,767 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 0 2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 839953118 kv 209988272(839953088) kvi 178264648(713058592) 2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output 2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 839953118; bufend = 1014433072; bufvoid = 1879048192 2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 209988272(839953088); kvend = 178264648(713058592); length = 31723625/117440512 2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 1696670336 kvi 424167580(1696670320) 2015-08-12 23:40:22,641 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 1 2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 1696670336 kv 424167580(1696670320) kvi 392768808(1571075232) 2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: 
Spilling map output 2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 1696670336; bufend = 1869363604; bufvoid = 1879048192 2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 424167580(1696670320); kvend = 392768808(1571075232); length = 31398773/117440512 2015-08-12 23:40:22,642 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) -1742031900 kvi 34254072(137016288) 2015-08-12 23:40:47,329 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 2 2015-08-12 23:40:47,330 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator -1742031900 kv 34254072(137016288) kvi 34254072(137016288) 2015-08-12 23:40:47,331 ERROR [main] cascading.flow.stream.TrapHandler: caught Throwable, no trap available, rethrowing cascading.flow.stream.DuctException: internal error: ['7541904654925238223', '2.812180059539485'] at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:81) at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:37) at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:80) at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145) at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:133) at cascading.operation.Identity$2.operate(Identity.java:137) at cascading.operation.Identity.operate(Identity.java:150) at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99) at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39) at cascading.flow.stream.SourceStage.map(SourceStage.java:102) at cascading.flow.stream.SourceStage.run(SourceStage.java:58) at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164) at java.security.AccessController.doPrivileged(Native 
Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: java.lang.ArrayIndexOutOfBoundsException at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349) at java.io.DataOutputStream.write(DataOutputStream.java:88) at java.io.DataOutputStream.writeByte(DataOutputStream.java:153) at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:273) at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:253) at cascading.tuple.hadoop.io.HadoopTupleOutputStream.writeIntInternal(HadoopTupleOutputStream.java:155) at cascading.tuple.io.TupleOutputStream.write(TupleOutputStream.java:86) at cascading.tuple.io.TupleOutputStream.writeTuple(TupleOutputStream.java:64) at cascading.tuple.hadoop.io.TupleSerializer.serialize(TupleSerializer.java:37) at cascading.tuple.hadoop.io.TupleSerializer.serialize(TupleSerializer.java:28) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1149) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:610) at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69) at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:68) ... 18 more
It is mapreduce.task.io.sort.mb that made the difference. When it is set to 2 GB or larger, the job consistently runs into the problem. It is suggested to set it to the value below or smaller:
-Dmapreduce.task.io.sort.mb=1792
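The negative EQUATOR in the question's log is at least consistent with a signed 32-bit overflow when a buffer position close to 2^31 is advanced. The sketch below reproduces the logged number with plain integer arithmetic. It is an assumption about the cause, not a reading of the Hadoop source: the OFFSET value is back-computed from the log and does not appear in it, and to_int32, java_mod and next_equator are names invented for this example.

```python
INT_MAX = 2**31 - 1


def to_int32(value):
    """Wrap an integer to a signed 32-bit value, as Java int arithmetic does."""
    return (value + 2**31) % 2**32 - 2**31


def java_mod(a, b):
    """Java's % truncates toward zero, so the result keeps the dividend's sign."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


# Values read from the mapper log in the question.
BUF_END = 1869363604          # bufend when the third spill starts
BUF_VOID = 1879048192         # bufvoid, i.e. 1792 MB in bytes
LOGGED_EQUATOR = -1742031900  # the (EQUATOR) value logged right after

# Hypothetical offset, back-computed so that the sum matches the log.
OFFSET = 683571792


def next_equator(position, offset, buffer_length):
    """Advance a position with 32-bit wrap-around, then reduce modulo the buffer."""
    return java_mod(to_int32(position + offset), buffer_length)


print(BUF_END + OFFSET > INT_MAX)               # the sum no longer fits in an int
print(next_equator(BUF_END, OFFSET, BUF_VOID))  # negative, as in the log
```

Under this reading, a larger sort buffer puts positions closer to the 32-bit limit, which would fit the observation that the setting makes the difference.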
I suspect a threading issue, so I tried the below and it worked. Not sure if the cure will stick.
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.8</value>
</property>
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>10</value>
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
</property>
<property>
  <name>mapred.map.multithreadedrunner.threads</name>
  <value>1</value>
</property>
<property>
  <name>mapreduce.mapper.multithreadedmapper.threads</name>
  <value>1</value>
</property>
Pig "Max" command for pig-0.12.1 and pig-0.13.0 with Hadoop-2.4.0
I have a Pig script I got from Hortonworks that works fine with pig-0.9.2.15 on Hadoop-1.0.3.16. But when I run it with pig-0.12.1 (recompiled with -Dhadoopversion=23) or pig-0.13.0 on Hadoop-2.4.0, it doesn't work. The following line seems to be where the problem is:
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
Here's the whole script:
batting = load 'pig_data/Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
STORE join_data INTO './join_data';
And here's the Hadoop error info:
2014-07-29 18:03:02,957 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-34 Operator Key: scope-34): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function
2014-07-29 18:03:02,958 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
How can I fix this if I still want to use the "MAX" function? Thank you!
Here's the complete information: 14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL 14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE 14/07/29 17:50:11 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType 2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58 2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Logging error messages to: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log 2014-07-29 17:50:13,050 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found 2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address 2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:13,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://namenode.cmda.hadoop.com:8020 2014-07-29 17:50:14,302 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: namenode.cmda.hadoop.com:8021 2014-07-29 17:50:14,990 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:15,570 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:15,665 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s). 2014-07-29 17:50:15,705 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. 
Instead, use mapreduce.output.textoutputformat.separator 2014-07-29 17:50:15,791 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY 2014-07-29 17:50:15,873 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]} 2014-07-29 17:50:16,319 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 2014-07-29 17:50:16,377 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner 2014-07-29 17:50:16,410 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager) 2014-07-29 17:50:16,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3 2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees. 2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators. 2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2014-07-29 17:50:16,493 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. 
Instead, use fs.defaultFS 2014-07-29 17:50:16,575 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050 2014-07-29 17:50:16,973 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job 2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent 2014-07-29 17:50:17,007 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress 2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers. 2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator 2014-07-29 17:50:17,064 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990 2014-07-29 17:50:17,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1 2014-07-29 17:50:17,067 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. 
Instead, use mapreduce.job.reduces 2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process 2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job2337803902169382273.jar 2014-07-29 17:50:20,957 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job2337803902169382273.jar created 2014-07-29 17:50:20,957 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar 2014-07-29 17:50:21,001 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job 2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code. 2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche 2014-07-29 17:50:21,046 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize [] 2014-07-29 17:50:21,310 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission. 2014-07-29 17:50:21,311 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address 2014-07-29 17:50:21,332 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050 2014-07-29 17:50:21,366 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. 
Instead, use fs.defaultFS 2014-07-29 17:50:22,606 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2014-07-29 17:50:22,606 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2014-07-29 17:50:22,629 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1 2014-07-29 17:50:22,729 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1 2014-07-29 17:50:22,745 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:23,026 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1406677482986_0003 2014-07-29 17:50:23,258 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1406677482986_0003 2014-07-29 17:50:23,340 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://namenode.cmda.hadoop.com:8088/proxy/application_1406677482986_0003/ 2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1406677482986_0003 2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs 2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[3,10],runs[5,7],max_runs[7,11],grp_data[6,11] C: max_runs[7,11],grp_data[6,11] R: max_runs[7,11] 2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://namenode.cmda.hadoop.com:50030/jobdetails.jsp?jobid=job_1406677482986_0003 2014-07-29 17:50:23,357 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003] 2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete 2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003] 2014-07-29 17:51:18,582 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure. 2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1406677482986_0003 has failed! Stop running all dependent jobs 2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-73 Operator Key: scope-73): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function 2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed! 2014-07-29 17:51:18,826 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.4.0 0.13.0 root 2014-07-29 17:50:16 2014-07-29 17:51:18 HASH_JOIN,GROUP_BY Failed! Failed Jobs: JobId Alias Feature Message Outputs job_1406677482986_0003 batting,grp_data,max_runs,runs MULTI_QUERY,COMBINER Message: Job failed! 
Input(s): Failed to read data from "hdfs://namenode.cmda.hadoop.com:8020/user/root/pig_data/Batting.csv" Output(s): Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_1406677482986_0003 -> null, null 2014-07-29 17:51:18,826 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed! 2014-07-29 17:51:18,827 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2106: Error executing an algebraic function Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log 2014-07-29 17:51:18,828 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job scope-58 failed, hadoop does not return any error message Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
Try casting the result of the MAX function:
max_runs = FOREACH grp_data GENERATE group as grp, (int)MAX(runs.runs) as max_runs;
Hope it will work.
You should use data types in your load statement.
runs = FOREACH batting GENERATE $0 as playerID:chararray, $1 as year:int, $8 as runs:int;
If this doesn't help for some reason, try explicit casting.
max_runs = FOREACH grp_data GENERATE group as grp, MAX((int)runs.runs) as max_runs;
Thanks to both #BigData and #Mikko Kupsu for the hint. The issue does indeed have something to do with data type casting. After specifying the data type of each column as follows, everything runs great.
batting = LOAD '/user/root/pig_data/Batting.csv' USING PigStorage(',') AS (playerID: CHARARRAY, yearID: INT, stint: INT, teamID: CHARARRAY, lgID: CHARARRAY, G: INT, G_batting: INT, AB: INT, R: INT, H: INT, two_B: INT, three_B: INT, HR: INT, RBI: INT, SB: INT, CS: INT, BB:INT, SO: INT, IBB: INT, HBP: INT, SH: INT, SF: INT, GIDP: INT, G_old: INT);
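For reference, the computation the script is aiming at (the maximum runs per year, with the fields explicitly typed) can be sketched outside Pig. This is a plain Python illustration of the intended result, not of how Pig evaluates MAX. The sample rows are invented for the example and only follow the column positions used in the script ($0 playerID, $1 year, $8 runs).

```python
import csv
import io

# Invented sample rows; only columns 0, 1 and 8 matter for this sketch.
SAMPLE_CSV = """\
playerA,2001,1,TM,NL,10,10,30,9,8
playerB,2001,1,TM,NL,10,10,30,10,8
playerC,2002,1,TM,NL,10,10,30,7,8
"""


def max_runs_by_year(text):
    """Return {year: max runs}, converting both fields to int before comparing."""
    best = {}
    for row in csv.reader(io.StringIO(text)):
        year, runs = int(row[1]), int(row[8])
        if year not in best or runs > best[year]:
            best[year] = runs
    return best


print(max_runs_by_year(SAMPLE_CSV))
```

The explicit int conversions play the role that the typed AS clause plays in the LOAD statement above: the values are numbers before any aggregate sees them.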
Cannot run the job on hadoop cluster. only runs using LocalJobRunner
I have submitted an MR job with the following hadoop jar command on CDH5 Beta 2:
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
I've also tried providing the fs name and job tracker URL explicitly, as below, without any success:
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver -Dfs.default.name=hdfs://abc.com:8020 -Dmapreduce.job.tracker=x.x.x.x:8021 tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
The job runs successfully but uses the LocalJobRunner instead of submitting to the cluster. The output is written to HDFS and is correct. I'm not sure what I am doing wrong here, so I'd appreciate your input. With the fs and job tracker specified explicitly, the result is the same:
14/04/16 20:35:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/04/16 20:35:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/04/16 20:35:45 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/04/16 20:35:45 INFO input.FileInputFormat: Total input paths to process : 2
14/04/16 20:35:45 INFO mapreduce.JobSubmitter: number of splits:2
14/04/16 20:35:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1427968352_0001
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/04/16 20:35:46 INFO mapreduce.Job: Running job: job_local1427968352_0001
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:46 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:46 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/ratings.csv:0+4388258
14/04/16 20:35:46 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:46 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:46 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:46 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:46 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:46 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:47 INFO mapreduce.Job: Job job_local1427968352_0001 running in uber mode : false
14/04/16 20:35:47 INFO mapreduce.Job: map 0% reduce 0%
14/04/16 20:35:48 INFO mapred.LocalJobRunner:
14/04/16 20:35:48 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:48 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:48 INFO mapred.MapTask: bufstart = 0; bufend = 6485388; bufvoid = 104857600
14/04/16 20:35:48 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 24860980(99443920); length = 1353417/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000000_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000000_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/users.csv:0+186304
14/04/16 20:35:49 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:49 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:49 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:49 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:49 INFO mapred.LocalJobRunner:
14/04/16 20:35:49 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:49 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufend = 209667; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26192144(104768576); length = 22253/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000001_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000001_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map task executor complete.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Waiting for reduce tasks
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@5116331d
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
14/04/16 20:35:49 INFO reduce.EventFetcher: attempt_local1427968352_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000001_0 decomp: 220797 len: 220801 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 220797 bytes from map-output for attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 220797, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->220797
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000000_0 decomp: 7162100 len: 7162104 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 7162100 bytes from map-output for attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 7162100, inMemoryMapOutputs.size() -> 2, commitMemory -> 220797, usedMemory ->7382897
14/04/16 20:35:49 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
14/04/16 20:35:49 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
14/04/16 20:35:49 INFO mapred.Merger: Merging 2 sorted segments
14/04/16 20:35:49 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 7382885 bytes
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merged 2 segments, 7382897 bytes to disk to satisfy reduce memory limit
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 1 files, 7382899 bytes from disk
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
14/04/16 20:35:50 INFO mapred.Merger: Merging 1 sorted segments
14/04/16 20:35:50 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7382889 bytes
14/04/16 20:35:50 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:50 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
14/04/16 20:35:50 INFO mapreduce.Job: map 100% reduce 0%
14/04/16 20:35:51 INFO mapred.Task: Task:attempt_local1427968352_0001_r_000000_0 is done. And is in the process of committing
14/04/16 20:35:51 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:51 INFO mapred.Task: Task attempt_local1427968352_0001_r_000000_0 is allowed to commit now
14/04/16 20:35:51 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1427968352_0001_r_000000_0' to hdfs://...:8020/user/ird2/tech_talks/output/ReduceSideJoinDriver/_temporary/0/task_local1427968352_0001_r_000000
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce > reduce
14/04/16 20:35:51 INFO mapred.Task: Task 'attempt_local1427968352_0001_r_000000_0' done.
14/04/16 20:35:51 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce task executor complete.
14/04/16 20:35:52 INFO mapreduce.Job: map 100% reduce 100%
14/04/16 20:35:52 INFO mapreduce.Job: Job job_local1427968352_0001 completed successfully
14/04/16 20:35:52 INFO mapreduce.Job: Counters: 38
    File System Counters
        FILE: Number of bytes read=14767932
        FILE: Number of bytes written=29952985
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=13537382
        HDFS: Number of bytes written=2949787
        HDFS: Number of read operations=28
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=5
    Map-Reduce Framework
        Map input records=343919
        Map output records=343919
        Map output bytes=6695055
        Map output materialized bytes=7382905
        Input split bytes=272
        Combine input records=0
        Combine output records=0
        Reduce input groups=5564
        Reduce shuffle bytes=7382905
        Reduce input records=343919
        Reduce output records=5564
        Spilled Records=687838
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=92
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=1416101888
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=4574562
    File Output Format Counters
        Bytes Written=2949787
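Driver code:
    package com.aravind.learning.hadoop.mapred.join;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ReduceSideJoinDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // Expect exactly three arguments: users file, ratings file, output directory.
            if (args.length != 3) {
                System.err.printf("Usage: %s [generic options] <users> <ratings> <output>\n", getClass().getSimpleName());
                ToolRunner.printGenericCommandUsage(System.err);
                return -1;
            }
            Path usersFile = new Path(args[0]);
            Path ratingsFile = new Path(args[1]);

            Job job = Job.getInstance(getConf(), "Aravind - Reduce Side Join");
            // Map each input file name to a tag so the mapper can tell the two sources apart.
            job.getConfiguration().setStrings(usersFile.getName(), "user");
            job.getConfiguration().setStrings(ratingsFile.getName(), "rating");

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setMapOutputKeyClass(IntWritable.class);
            job.setMapOutputValueClass(TagAndRecord.class);

            TextInputFormat.addInputPath(job, usersFile);
            TextInputFormat.addInputPath(job, ratingsFile);
            TextOutputFormat.setOutputPath(job, new Path(args[2]));

            job.setMapperClass(ReduceSideJoinMapper.class);
            job.setReducerClass(ReduceSideJoinReducer.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner parses generic options such as -D before calling run().
            int exitCode = ToolRunner.run(new Configuration(), new ReduceSideJoinDriver(), args);
            System.exit(exitCode);
        }
    }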
Make sure the following configuration files are present and valid on the hadoop classpath. By default, configuration files are taken from the directory /etc/hadoop/conf. This should be done as part of the hadoop client node setup.
mapred-site.xml
yarn-site.xml
core-site.xml
If these configuration files are empty, you have to populate them with the right properties. This can be done in two ways:
1. In Cloudera Manager, click on the yarn service. In the actions section there is an option Deploy client configuration, alongside start, stop, etc. Use that option to deploy the client configuration.
2. Sometimes the option above may not work, if the node is not managed by CM and a yarn gateway is not configured on the node. In that case, use the option Download client configuration instead of Deploy client configuration. Extract the downloaded zip file, which contains the files listed above, and copy those files to /etc/hadoop/conf manually.
To execute the jar, either the hadoop or the yarn command can be used.
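To see why an empty client configuration leads to the LocalJobRunner, here is a minimal sketch in plain Java (no Hadoop dependency) that reads a mapred-site.xml-style document and reports whether a job would fall back to local execution. It assumes a YARN (MRv2) setup, where the property mapreduce.framework.name selects the runtime and defaults to local when unset. The answer above does not name any property, so treat that as my assumption about the relevant setting. The class and method names (ClientConfigCheck, readProperty, wouldRunLocally) are hypothetical, and this is only a diagnostic illustration, not how Hadoop itself loads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop.
class ClientConfigCheck {
    // Assumption: the documented default of mapreduce.framework.name is "local".
    static final String DEFAULT_FRAMEWORK = "local";

    /** Returns the value of the first matching property in a *-site.xml document, or null if absent. */
    static String readProperty(String siteXml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(siteXml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    /** True when the configuration would leave the job on the local runner. */
    static boolean wouldRunLocally(String mapredSiteXml) {
        String framework = readProperty(mapredSiteXml, "mapreduce.framework.name");
        return (framework == null ? DEFAULT_FRAMEWORK : framework).equals("local");
    }

    public static void main(String[] args) {
        String empty = "<configuration></configuration>";
        String yarn = "<configuration><property><name>mapreduce.framework.name</name>"
                + "<value>yarn</value></property></configuration>";
        System.out.println("empty mapred-site.xml runs locally: " + wouldRunLocally(empty));
        System.out.println("yarn mapred-site.xml runs locally: " + wouldRunLocally(yarn));
    }
}
```

Running the same check against the contents of /etc/hadoop/conf/mapred-site.xml on the client node would show whether the deployed client configuration actually points at the cluster.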
Apparently, you can only submit a hadoop job from the node designated as the gateway node. Everything worked once I submitted the job from the gateway node.
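A quick way to confirm where a job actually ran is to look at the job ID in the client log. In the log from the question, the ID is job_local1427968352_0001, and the job_local prefix goes together with the mapred.LocalJobRunner lines. The sketch below pulls the first job ID out of a log text and checks for that prefix. The class and method names (JobIdCheck, firstJobId, isLocalJob) are hypothetical, and the cluster-style ID used in the example is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting client-side job logs.
class JobIdCheck {
    // Matches IDs such as job_local1427968352_0001 or job_1489705733959_2462784.
    private static final Pattern JOB_ID = Pattern.compile("\\bjob_(?:local)?\\d+_\\d+\\b");

    /** Returns the first job ID found in the log text, or null if there is none. */
    static String firstJobId(String log) {
        Matcher m = JOB_ID.matcher(log);
        return m.find() ? m.group() : null;
    }

    /** True when the job ID carries the prefix used by the local runner. */
    static boolean isLocalJob(String jobId) {
        return jobId.startsWith("job_local");
    }

    public static void main(String[] args) {
        String localLine = "14/04/16 20:35:46 INFO mapreduce.Job: Running job: job_local1427968352_0001";
        // Made-up cluster-style ID, for illustration only.
        String clusterLine = "INFO mapreduce.Job: Running job: job_1489705733959_2462784";
        for (String line : new String[] {localLine, clusterLine}) {
            String id = firstJobId(line);
            System.out.println(id + " ran locally: " + isLocalJob(id));
        }
    }
}
```

If the ID printed after resubmitting from the gateway node no longer starts with job_local, the job went to the cluster.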