Pig: count of each product in distinct locations - hadoop

I am trying to do the following four steps in Pig:
Step 1: Create a user table from the data in /tmp/users.txt:
|Column 1 |USER ID |int|
|Column 2 |EMAIL |chararray|
|Column 3 |LANGUAGE |chararray|
|Column 4 |LOCATION |chararray|
Step 2: Create a transaction table from the data in /tmp/transaction.txt:
|Column 1 |ID |int|
|Column 2 |PRODUCT |int|
|Column 3 |USER ID |int|
|Column 4 |PURCHASE AMOUNT |double|
|Column 5 |DESCRIPTION |chararray|
Step 3: Find the count of each product in each distinct location.
Step 4: Display the results.
To achieve this, I did the following:
users = LOAD '/tmp/users.txt' USING PigStorage(',') AS (USERID:int, EMAIL:chararray, LANGUAGE:chararray, LOCATION: chararray);
trans = LOAD '/tmp/transaction.txt' USING PigStorage(',') AS (ID:int, PRODUCT:int, USERID:int, PURCHASEAMOUNT: double, DESCRIPTION: chararray);
users_trans = JOIN users BY USERID RIGHT, trans BY USERID;
B = GROUP users_trans BY (DESCRIPTION,LOCATION);
C = FOREACH B GENERATE group as comb, COUNT(users_trans) AS Total;
DUMP C;
But I am getting errors. It would be helpful if you could assist, as I am new to Pig.
##########################################
Dataset
users.txt
1 creator#gmail.com EN US
2 creator#gmail.com EN GB
3 creator#gmail.com FR FR
4 creator#gmail.com IN HN
5 creator#gmail.com PAK IS
transaction.txt
1 1 1 300 a jumper
2 1 2 300 a jumper
3 1 5 300 a jumper
4 2 3 100 a rubber chicken
5 1 3 300 a jumper
6 5 4 500 a soapbox
7 3 3 200 a adhesive
8 4 1 300 a lotion
9 4 4 500 a sweater
10 5 4 600 a jeans
Error Log:
2019-12-27 06:17:22,180 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/tmp/temp2029752934/tmp-883821114/part-r-00000:0+130
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - (EQUATOR) 0 kvi 26214396(104857584)
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - mapreduce.task.io.sort.mb: 100
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - soft limit at 83886080
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - bufstart = 0; bufvoid = 104857600
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - kvstart = 26214396; length = 6553600
2019-12-27 06:17:22,244 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2019-12-27 06:17:22,248 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2019-12-27 06:17:22,248 [LocalJobRunner Map Task Executor #0] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2019-12-27 06:17:22,250 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Starting flush of map output
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Spilling map output
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - bufstart = 0; bufend = 100; bufvoid = 104857600
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - kvstart = 26214396(104857584); kvend = 26214360(104857440); length = 37/6553600
2019-12-27 06:17:22,262 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,264 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Finished spill 0
2019-12-27 06:17:22,265 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task:attempt_local1424814286_0002_m_000000_0 is done. And is in the process of committing
2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -map
2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local1424814286_0002_m_000000_0' done.
2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -Finishing task: attempt_local1424814286_0002_m_000000_0
2019-12-27 06:17:22,266 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2019-12-27 06:17:22,266 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - Waiting for reduce tasks
2019-12-27 06:17:22,267 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local1424814286_0002_r_000000_0
2019-12-27 06:17:22,272 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2019-12-27 06:17:22,272 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-12-27 06:17:22,274 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorProcessTree : [ ]
2019-12-27 06:17:22,274 [pool-9-thread-1] INFO org.apache.hadoop.mapred.ReduceTask - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@2582aa54
2019-12-27 06:17:22,275 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2019-12-27 06:17:22,275 [EventFetcher for fetching Map Completion Events] INFO org.apache.hadoop.mapreduce.task.reduce.EventFetcher - attempt_local1424814286_0002_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2019-12-27 06:17:22,276 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#2 about to shuffle output of map attempt_local1424814286_0002_m_000000_0 decomp: 14 len: 18 to MEMORY
2019-12-27 06:17:22,277 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 14 bytes from map-output for attempt_local1424814286_0002_m_000000_0
2019-12-27 06:17:22,277 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - closeInMemoryFile -> map-output of size: 14, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->14
2019-12-27 06:17:22,277 [EventFetcher for fetching Map Completion Events] INFO org.apache.hadoop.mapreduce.task.reduce.EventFetcher - EventFetcher is interrupted.. Returning
2019-12-27 06:17:22,278 [Readahead Thread #3] WARN org.apache.hadoop.io.ReadaheadPool - Failed readahead on ifile
EBADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:208)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-12-27 06:17:22,278 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 7 bytes
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merged 1 segments, 14 bytes to disk to satisfy reduce memory limit
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 1 files, 18 bytes from disk
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 0 segments, 0 bytes from memory into reduce
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 7 bytes
2019-12-27 06:17:22,282 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2019-12-27 06:17:22,283 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2019-12-27 06:17:22,283 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-12-27 06:17:22,284 [pool-9-thread-1] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2019-12-27 06:17:22,285 [pool-9-thread-1] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2019-12-27 06:17:22,286 [pool-9-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,287 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task:attempt_local1424814286_0002_r_000000_0 is done. And is in the process of committing
2019-12-27 06:17:22,289 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2019-12-27 06:17:22,289 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task attempt_local1424814286_0002_r_000000_0 is allowed to commit now
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local1424814286_0002_r_000000_0' to file:/tmp/temp2029752934/tmp726323435/_temporary/0/task_local1424814286_0002_r_000000
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local1424814286_0002_r_000000_0' done.
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - Finishing task: attempt_local1424814286_0002_r_000000_0
2019-12-27 06:17:22,292 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1424814286_0002
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases B,C
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,463 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,464 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,465 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2019-12-27 06:17:22,474 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.9.2 0.16.0 root 2019-12-27 06:17:20 2019-12-27 06:17:22 HASH_JOIN,GROUP_BY
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1289071959_0001 2 1 n/a n/a n/a n/a n/a n/a n/a n/a trans,users,users_trans HASH_JOIN
job_local1424814286_0002 1 1 n/a n/a n/a n/a n/a n/a n/a n/a B,C GROUP_BY,COMBINER file:/tmp/temp2029752934/tmp726323435,
Input(s):
Successfully read 5 records from: "/tmp/users.txt"
Successfully read 10 records from: "/tmp/transaction.txt"
Output(s):
Successfully stored 1 records in: "file:/tmp/temp2029752934/tmp726323435"
Counters:
Total records written : 1
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local1289071959_0001 -> job_local1424814286_0002,
job_local1424814286_0002
2019-12-27 06:17:22,475 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,476 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,477 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,485 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,486 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,487 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,492 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 15 time(s).
2019-12-27 06:17:22,493 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 55 time(s).
2019-12-27 06:17:22,493 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-12-27 06:17:22,496 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-12-27 06:17:22,496 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2019-12-27 06:17:22,503 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2019-12-27 06:17:22,503 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2019-12-27 06:17:22,541 [main] INFO org.apache.pig.Main - Pig script completed in 2 seconds and 965 milliseconds (2965 ms)

Advice
First of all: you seem to be just starting out with Pig. It may be valuable to know that Cloudera recently decided to deprecate Pig. It will of course not cease to exist, but think twice if you are planning to pick up a new skill or implement new use cases. I would recommend looking into Hive, Spark, or Impala as more future-proof alternatives.
Answer
Your job succeeds, but presumably not with the output you want. The log contains hints about what may be wrong: the warnings FIELD_DISCARDED_TYPE_CONVERSION_FAILED (15 times) and ACCESSING_NON_EXISTENT_FIELD (55 times) suggest data type and field problems, and only 1 record was stored. However, they do not point at a specific line of the code.
My recommendation would be to find out exactly where the problem occurs. Cut off the end of your code and DUMP an intermediate relation (users, then trans, then users_trans) to see whether you are still on track.
In the (likely) event that the problem is already in a LOAD statement, you can narrow it down further: first load without a schema and inspect the raw fields, then apply the schema.
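To illustrate why the warnings in the log fit a problem at load time, here is a minimal Python sketch (not Pig). It is only a simplified model of how PigStorage(',') and the schema casts behave, which is an assumption on my part, and the function name load and its counters are made up for illustration. It splits the sample lines on commas, casts each field to the declared type, and counts failed casts and missing fields.

```python
# Simplified model (an assumption, not Pig's actual code) of loading
# space-separated lines with PigStorage(',') and a typed schema.

USERS_SCHEMA = [("USERID", int), ("EMAIL", str), ("LANGUAGE", str), ("LOCATION", str)]
TRANS_SCHEMA = [("ID", int), ("PRODUCT", int), ("USERID", int),
                ("PURCHASEAMOUNT", float), ("DESCRIPTION", str)]


def load(lines, schema, delimiter=","):
    """Split each line on `delimiter` and cast the parts to `schema`.

    Returns (rows, failed_casts, missing_fields). A field that is absent or
    cannot be cast becomes None, similar to the nulls Pig fills in.
    """
    rows, failed_casts, missing_fields = [], 0, 0
    for line in lines:
        parts = line.split(delimiter)
        row = []
        for position, (_name, cast) in enumerate(schema):
            if position >= len(parts):
                missing_fields += 1   # compare: ACCESSING_NON_EXISTENT_FIELD
                row.append(None)
                continue
            try:
                row.append(cast(parts[position]))
            except ValueError:
                failed_casts += 1     # compare: FIELD_DISCARDED_TYPE_CONVERSION_FAILED
                row.append(None)
        rows.append(tuple(row))
    return rows, failed_casts, missing_fields


# Sample data exactly as posted in the question (no commas anywhere).
users_lines = [
    "1 creator#gmail.com EN US",
    "2 creator#gmail.com EN GB",
    "3 creator#gmail.com FR FR",
    "4 creator#gmail.com IN HN",
    "5 creator#gmail.com PAK IS",
]
trans_lines = [
    "1 1 1 300 a jumper",
    "2 1 2 300 a jumper",
    "3 1 5 300 a jumper",
    "4 2 3 100 a rubber chicken",
    "5 1 3 300 a jumper",
    "6 5 4 500 a soapbox",
    "7 3 3 200 a adhesive",
    "8 4 1 300 a lotion",
    "9 4 4 500 a sweater",
    "10 5 4 600 a jeans",
]

users, u_failed, u_missing = load(users_lines, USERS_SCHEMA)
trans, t_failed, t_missing = load(trans_lines, TRANS_SCHEMA)

print(users[0])                                     # (None, None, None, None)
print(u_failed + t_failed, u_missing + t_missing)   # 15 55
```

Under this model the totals are 15 failed casts and 55 missing fields, the same numbers the log reports for the two warnings. That is consistent with every line landing whole in the first field, so all join and group keys end up null.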

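Given the data you have, the first problem is that it contains no commas, so PigStorage(',') cannot separate the columns. You must load each line as a whole and split it afterwards. I used two or more spaces as the column separator in the transactions file because your last column appears to be one string that itself contains spaces. For accuracy, I suggest using a better delimiter than spaces or tabs.
Then the GROUP BY needs to reference the relations that the fields come from: after the JOIN, fields are qualified with their source relation using the :: operator (for example USERS::location).
Everything else is fine, I think, though I'm not sure about the COUNT(X). COUNT skips tuples whose first field is null, which after a RIGHT join would be transactions without a matching user; COUNT_STAR counts those too.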
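-- Load each line whole (assumes the data contains no tabs, the default delimiter).
A = LOAD '/tmp/users.txt' USING PigStorage() AS (line:chararray);
-- Split on whitespace and apply the schema afterwards.
USERS = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) AS (userid:int, email:chararray, language:chararray, location:chararray);
B = LOAD '/tmp/transaction.txt' USING PigStorage() AS (line:chararray);
-- Two or more spaces separate the columns, so the description stays in one field.
-- The field is named description rather than desc to avoid any clash with the DESC keyword.
TRANS = FOREACH B GENERATE FLATTEN(STRSPLIT(line, '\\s\\s+')) AS (id:int, product:int, userid:int, purchase:double, description:chararray);
-- Keep every transaction, even one without a matching user.
X = JOIN USERS BY userid RIGHT, TRANS BY userid;
-- After the join, qualify each field with its source relation.
X_grouped = GROUP X BY (TRANS::description, USERS::location);
RES = FOREACH X_grouped GENERATE group AS comb, COUNT(X) AS Total;
DUMP RES;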
Output
((a jeans,HN),1)
((a jumper,FR),1)
((a jumper,GB),1)
((a jumper,IS),1)
((a jumper,US),1)
((a lotion,US),1)
((a soapbox,HN),1)
((a sweater,HN),1)
((a adhesive,FR),1)
((a rubber chicken,FR),1)
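As a cross-check of the counts above, here is a minimal Python sketch of the same split, join, and group logic. It is not Pig and is not how Pig executes the script. The two-space column separator in the sample lines is an assumption that matches the '\\s\\s+' pattern, and the helper name split_fields is made up for illustration.

```python
import re
from collections import Counter

# Sample rows from the question. Two spaces between columns is an assumption,
# so that descriptions containing single spaces stay in one field.
users_lines = [
    "1  creator#gmail.com  EN  US",
    "2  creator#gmail.com  EN  GB",
    "3  creator#gmail.com  FR  FR",
    "4  creator#gmail.com  IN  HN",
    "5  creator#gmail.com  PAK  IS",
]
trans_lines = [
    "1  1  1  300  a jumper",
    "2  1  2  300  a jumper",
    "3  1  5  300  a jumper",
    "4  2  3  100  a rubber chicken",
    "5  1  3  300  a jumper",
    "6  5  4  500  a soapbox",
    "7  3  3  200  a adhesive",
    "8  4  1  300  a lotion",
    "9  4  4  500  a sweater",
    "10  5  4  600  a jeans",
]


def split_fields(line):
    """Split a line on runs of two or more whitespace characters."""
    return re.split(r"\s\s+", line.strip())


# Build a lookup table: userid -> location.
location_by_user = {}
for line in users_lines:
    userid, _email, _language, location = split_fields(line)
    location_by_user[int(userid)] = location

# Right outer join plus group: every transaction is kept, and an unknown
# user yields a None location. Unlike Pig's COUNT, this tally also counts
# such unmatched rows (there are none in the sample data).
counts = Counter()
for line in trans_lines:
    _id, _product, userid, _amount, description = split_fields(line)
    location = location_by_user.get(int(userid))
    counts[(description, location)] += 1

for key, total in sorted(counts.items(), key=lambda kv: (kv[0][0], kv[0][1] or "")):
    print(key, total)
```

With the sample data this gives ten (description, location) pairs, each with a count of 1, which agrees with the Pig output shown above.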

Related

Does sqoop spill temporary data to disk

As I understand Sqoop, it launches a few mappers on different data nodes, each of which makes a JDBC connection to the RDBMS. Once a connection is established, data is transferred to HDFS.
I am trying to understand: does a Sqoop mapper spill data temporarily to disk on the data node? I know spilling happens in MapReduce, but I am not sure about a Sqoop job.
It seems that sqoop-import runs as a map-only job and does not spill, whereas sqoop-merge runs as a full MapReduce job and does spill. You can check this in the job tracker while a Sqoop import is running.
Have a look at this part of a sqoop import log. It does not spill; it fetches rows and writes them to HDFS:
INFO [main] ... mapreduce.db.DataDrivenDBRecordReader: Using query: SELECT...
[main] mapreduce.db.DBRecordReader: Executing query: SELECT...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
INFO [Thread-16] ...mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1489705733959_2462784_m_000000_0 is done. And is in the process of committing
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_1489705733959_2462784_m_000000_0' to hdfs://
Have a look at this sqoop-merge log (some rows skipped). It spills to disk (note "Spilling map output" in the log):
INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://bla-bla/part-m-00000:0+48322717
...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
...
INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 1024
INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 751619264
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452; length = 67108864
INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO [main] com.pepperdata.supervisor.agent.resource.r: Datanode bla-bla is LOCAL.
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
...
INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 184775274; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452(1073741808); kvend = 267347800(1069391200); length = 1087653/67108864
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
[main] org.apache.hadoop.mapred.MapTask: Finished spill 0
...Task:attempt_1489705733959_2479291_m_000000_0 is done. And is in the process of committing
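If you have the task logs as text, the same check can be automated. This is a minimal Python sketch, assuming the two marker strings that appear in the MapTask lines quoted above are a reliable sign of a spill. The function name spill_summary is made up for illustration, and the sample lists are just a few lines copied from the two logs.

```python
import re

# Marker text taken from the MapTask log lines quoted above.
SPILL_MARKERS = ("Spilling map output", "Finished spill")


def spill_summary(log_lines):
    """Return (spilled, finished_spills) for the log lines of one task."""
    spilled = any(marker in line for line in log_lines for marker in SPILL_MARKERS)
    finished = sum(1 for line in log_lines if re.search(r"Finished spill \d+", line))
    return spilled, finished


import_log = [
    "INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]",
    "INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1489705733959_2462784_m_000000_0 is done. And is in the process of committing",
]
merge_log = [
    "INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output",
    "INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output",
    "[main] org.apache.hadoop.mapred.MapTask: Finished spill 0",
]

print(spill_summary(import_log))  # (False, 0)
print(spill_summary(merge_log))   # (True, 1)
```

The number of "Finished spill N" lines gives a rough count of how many spill files a map task wrote.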

How do I know if my Hadoop MapReduce application is running in distributed mode

I'm very new to Hadoop MapReduce. I installed a multi-node cluster, but I still get sequential execution.
How can I work out whether my program is running on the other machines in the cluster or not?
This is the result of the execution:
Picked up _JAVA_OPTIONS: -Xmx1g
16/06/07 14:49:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/07 14:49:19 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/06/07 14:49:19 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/06/07 14:49:21 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/06/07 14:49:21 INFO input.FileInputFormat: Total input paths to process : 3
16/06/07 14:49:22 INFO mapreduce.JobSubmitter: number of splits:3
16/06/07 14:49:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1881318657_0001
16/06/07 14:49:24 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/06/07 14:49:24 INFO mapreduce.Job: Running job: job_local1881318657_0001
16/06/07 14:49:24 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/06/07 14:49:24 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/06/07 14:49:24 INFO mapred.LocalJobRunner: Waiting for map tasks
16/06/07 14:49:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000000_0
16/06/07 14:49:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 14:49:24 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia.txt:0+1172207
16/06/07 14:49:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 14:49:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 14:49:24 INFO mapred.MapTask: soft limit at 83886080
16/06/07 14:49:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 14:49:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 14:49:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 14:49:25 INFO mapreduce.Job: Job job_local1881318657_0001 running in uber mode : false
16/06/07 14:49:25 INFO mapreduce.Job: map 0% reduce 0%
16/06/07 14:49:31 INFO mapred.LocalJobRunner: map > map
16/06/07 14:49:31 INFO mapreduce.Job: map 22% reduce 0%
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-2.9889415942690763E-9
-2.9889415942690763E-9
-2.9889415942690763E-9
-2.9287384547432996E-9
-2.898469757139896E-9
-2.898469757139896E-9
-2.880377562441664E-9
-2.880377562441664E-9
-2.880377562441664E-9
-2.8430632294667886E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
931
16/06/07 15:00:44 INFO mapred.LocalJobRunner: map > map
16/06/07 15:00:44 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:00:44 INFO mapred.MapTask: Spilling map output
16/06/07 15:00:44 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:00:44 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:00:46 INFO mapred.MapTask: Finished spill 0
16/06/07 15:00:46 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000000_0 is done. And is in the process of committing
16/06/07 15:00:47 INFO mapred.LocalJobRunner: map
16/06/07 15:00:47 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000000_0' done.
16/06/07 15:00:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000000_0
16/06/07 15:00:47 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000001_0
16/06/07 15:00:48 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:00:48 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia1.txt:0+1172207
16/06/07 15:00:48 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 15:00:48 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 15:00:48 INFO mapred.MapTask: soft limit at 83886080
16/06/07 15:00:48 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 15:00:48 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 15:00:48 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 15:00:48 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:01:47 INFO mapred.LocalJobRunner: map > map
16/06/07 15:01:48 INFO mapreduce.Job: map 56% reduce 0%
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.001716001136338E-9
-2.997252637652067E-9
-2.997252637652067E-9
-2.9593407930592893E-9
-2.9178102507568847E-9
-2.9178102507568847E-9
-2.9178102507568847E-9
-2.8542232742481287E-9
-2.8542232742481287E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8222418341121026E-9
-2.8222418341121026E-9
907
16/06/07 15:11:30 INFO mapred.LocalJobRunner: map > map
16/06/07 15:11:30 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:11:30 INFO mapred.MapTask: Spilling map output
16/06/07 15:11:30 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:11:30 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:11:30 INFO mapred.MapTask: Finished spill 0
16/06/07 15:11:30 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000001_0 is done. And is in the process of committing
16/06/07 15:11:30 INFO mapred.LocalJobRunner: map
16/06/07 15:11:30 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000001_0' done.
16/06/07 15:11:30 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000001_0
16/06/07 15:11:30 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000002_0
16/06/07 15:11:30 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:11:30 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia2.txt:0+1172207
16/06/07 15:11:30 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:11:31 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 15:11:31 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 15:11:31 INFO mapred.MapTask: soft limit at 83886080
16/06/07 15:11:31 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 15:11:31 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 15:11:31 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 15:11:37 INFO mapred.LocalJobRunner: map > map
16/06/07 15:11:38 INFO mapreduce.Job: map 89% reduce 0%
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.0090989883906007E-9
-2.9474075636124447E-9
-2.9474075636124447E-9
-2.9474075636124447E-9
-2.9388849943338927E-9
-2.9388849943338927E-9
-2.8915704649620403E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
925
16/06/07 15:20:19 INFO mapred.LocalJobRunner: map > map
16/06/07 15:20:19 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:20:19 INFO mapred.MapTask: Spilling map output
16/06/07 15:20:19 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:20:19 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:20:20 INFO mapred.MapTask: Finished spill 0
16/06/07 15:20:20 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000002_0 is done. And is in the process of committing
16/06/07 15:20:22 INFO mapred.LocalJobRunner: map
16/06/07 15:20:22 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000002_0' done.
16/06/07 15:20:22 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000002_0
16/06/07 15:20:22 INFO mapred.LocalJobRunner: map task executor complete.
16/06/07 15:20:22 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:20:23 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/06/07 15:20:23 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_r_000000_0
16/06/07 15:20:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:20:24 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@7f5be2d5
16/06/07 15:20:25 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=668309888, maxSingleShuffleLimit=167077472, mergeThreshold=441084544, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/06/07 15:20:25 INFO reduce.EventFetcher: attempt_local1881318657_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/06/07 15:20:28 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000002_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:29 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000002_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->14157
16/06/07 15:20:30 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000001_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:30 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000001_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 2, commitMemory -> 14157, usedMemory ->28314
16/06/07 15:20:30 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000000_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:30 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000000_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 3, commitMemory -> 28314, usedMemory ->42471
16/06/07 15:20:30 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/06/07 15:20:30 INFO mapred.LocalJobRunner: 3 / 3 copied.
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: finalMerge called with 3 in-memory map-outputs and 0 on-disk map-outputs
16/06/07 15:20:30 INFO mapred.Merger: Merging 3 sorted segments
16/06/07 15:20:30 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 42435 bytes
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merged 3 segments, 42471 bytes to disk to satisfy reduce memory limit
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merging 1 files, 42471 bytes from disk
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/06/07 15:20:30 INFO mapred.Merger: Merging 1 sorted segments
16/06/07 15:20:30 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 42455 bytes
16/06/07 15:20:30 INFO mapred.LocalJobRunner: 3 / 3 copied.
16/06/07 15:20:33 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:33 INFO mapreduce.Job: map 100% reduce 67%
16/06/07 15:20:36 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:38 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/06/07 15:20:42 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:42 INFO mapreduce.Job: map 100% reduce 100%
16/06/07 15:20:44 INFO mapred.Task: Task:attempt_local1881318657_0001_r_000000_0 is done. And is in the process of committing
16/06/07 15:20:44 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:44 INFO mapred.Task: Task attempt_local1881318657_0001_r_000000_0 is allowed to commit now
16/06/07 15:20:45 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1881318657_0001_r_000000_0' to hdfs://master:9000/output2/_temporary/0/task_local1881318657_0001_r_000000
16/06/07 15:20:45 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:45 INFO mapred.Task: Task 'attempt_local1881318657_0001_r_000000_0' done.
16/06/07 15:20:45 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_r_000000_0
16/06/07 15:20:45 INFO mapred.LocalJobRunner: reduce task executor complete.
16/06/07 15:20:45 INFO mapreduce.Job: Job job_local1881318657_0001 completed successfully
16/06/07 15:20:46 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=177067554
FILE: Number of bytes written=179551452
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=10549863
HDFS: Number of bytes written=42438
HDFS: Number of read operations=37
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=42453
Map output materialized bytes=42483
Input split bytes=557
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=42483
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=227283
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=2477260800
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=42438
Look at the job ID. Yours says "job_local1881318657_0001 running in uber mode : false", which is a local job. If you had run it on a cluster, the ID would carry only the job number plus the identifiers of the app master and attempts, with no "local" in it.
You need to check the JobTracker (default port 50030) and look up the details of the job ID mentioned in the logs above.
You can monitor the jobs at:
localhost:8088
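To make the job ID check above concrete, here is a small Python sketch. It is not part of any Hadoop tooling, and the function names is_local_job and find_job_ids are made up for illustration. It assumes that jobs run by the LocalJobRunner carry IDs starting with job_local, as in the log above, while jobs submitted to a cluster carry IDs of the form job_<timestamp>_<sequence>.

```python
import re
from typing import List

# Assumption: LocalJobRunner IDs look like job_local<digits>_<seq>
# (or job_local_<seq> on older releases); cluster IDs look like
# job_<timestamp>_<seq>.
LOCAL_JOB_ID = re.compile(r"^job_local\d*_\d+$")
ANY_JOB_ID = re.compile(r"\bjob_(?:local\d*|\d+)_\d+\b")


def is_local_job(job_id: str) -> bool:
    """Return True if the job ID looks like one issued by the LocalJobRunner."""
    return LOCAL_JOB_ID.match(job_id) is not None


def find_job_ids(log_text: str) -> List[str]:
    """Collect the distinct job IDs mentioned in a chunk of job log output."""
    seen: List[str] = []
    for job_id in ANY_JOB_ID.findall(log_text):
        if job_id not in seen:
            seen.append(job_id)
    return seen


if __name__ == "__main__":
    line = "INFO mapreduce.Job: Job job_local1881318657_0001 completed successfully"
    for job_id in find_job_ids(line):
        print(job_id, "local" if is_local_job(job_id) else "cluster")
```

Feeding it the full console output of a run gives a quick answer without opening any web UI.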

Pig "Max" command for pig-0.12.1 and pig-0.13.0 with Hadoop-2.4.0

I have a Pig script I got from Hortonworks that works fine with pig-0.9.2.15 on Hadoop-1.0.3.16. But when I run it with pig-0.12.1 (recompiled with -Dhadoopversion=23) or pig-0.13.0 on Hadoop-2.4.0, it fails.
It seems the following line is where the problem is.
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
Here's the whole script.
batting = load 'pig_data/Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
STORE join_data INTO './join_data';
And here's the hadoop error info:
2014-07-29 18:03:02,957 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-34 Operator Key: scope-34): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function
2014-07-29 18:03:02,958 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
How can I fix this if I still want to use the "MAX" function? Thank you!
Here's the complete information:
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Logging error messages to: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
2014-07-29 17:50:13,050 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:13,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://namenode.cmda.hadoop.com:8020
2014-07-29 17:50:14,302 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: namenode.cmda.hadoop.com:8021
2014-07-29 17:50:14,990 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:15,570 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:15,665 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-07-29 17:50:15,705 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2014-07-29 17:50:15,791 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY
2014-07-29 17:50:15,873 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-07-29 17:50:16,319 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-07-29 17:50:16,377 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2014-07-29 17:50:16,410 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)
2014-07-29 17:50:16,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2014-07-29 17:50:16,493 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:16,575 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050
2014-07-29 17:50:16,973 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2014-07-29 17:50:17,007 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2014-07-29 17:50:17,064 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990
2014-07-29 17:50:17,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2014-07-29 17:50:17,067 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job2337803902169382273.jar
2014-07-29 17:50:20,957 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job2337803902169382273.jar created
2014-07-29 17:50:20,957 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar
2014-07-29 17:50:21,001 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2014-07-29 17:50:21,046 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2014-07-29 17:50:21,310 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-07-29 17:50:21,311 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
2014-07-29 17:50:21,332 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050
2014-07-29 17:50:21,366 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:22,606 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-07-29 17:50:22,606 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-07-29 17:50:22,629 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-07-29 17:50:22,729 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-07-29 17:50:22,745 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:23,026 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1406677482986_0003
2014-07-29 17:50:23,258 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1406677482986_0003
2014-07-29 17:50:23,340 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://namenode.cmda.hadoop.com:8088/proxy/application_1406677482986_0003/
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1406677482986_0003
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[3,10],runs[5,7],max_runs[7,11],grp_data[6,11] C: max_runs[7,11],grp_data[6,11] R: max_runs[7,11]
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://namenode.cmda.hadoop.com:50030/jobdetails.jsp?jobid=job_1406677482986_0003
2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003]
2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003]
2014-07-29 17:51:18,582 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1406677482986_0003 has failed! Stop running all dependent jobs
2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-73 Operator Key: scope-73): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function
2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2014-07-29 17:51:18,826 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.4.0 0.13.0 root 2014-07-29 17:50:16 2014-07-29 17:51:18 HASH_JOIN,GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1406677482986_0003 batting,grp_data,max_runs,runs MULTI_QUERY,COMBINER Message: Job failed!
Input(s):
Failed to read data from "hdfs://namenode.cmda.hadoop.com:8020/user/root/pig_data/Batting.csv"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_1406677482986_0003 -> null, null
2014-07-29 17:51:18,826 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2014-07-29 17:51:18,827 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2106: Error executing an algebraic function Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
2014-07-29 17:51:18,828 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job scope-58 failed, hadoop does not return any error message Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
Try casting the result of the MAX function:
max_runs = FOREACH grp_data GENERATE group as grp, (int)MAX(runs.runs) as max_runs;
Hopefully that will work.
You should use data types in your load statement.
runs = FOREACH batting GENERATE $0 as playerID:chararray, $1 as year:int, $8 as runs:int;
If this doesn't help for some reason, try explicit casting.
max_runs = FOREACH grp_data GENERATE group as grp, MAX((int)runs.runs) as max_runs;
Thanks to both @BigData and @Mikko Kupsu for the hint. The issue does indeed have something to do with datatype casting.
After specifying the data type of each column as follows, everything runs fine.
batting =
LOAD '/user/root/pig_data/Batting.csv' USING PigStorage(',')
AS (playerID: CHARARRAY, yearID: INT, stint: INT, teamID: CHARARRAY, lgID: CHARARRAY,
G: INT, G_batting: INT, AB: INT, R: INT, H: INT, two_B: INT, three_B: INT, HR: INT, RBI: INT,
SB: INT, CS: INT, BB:INT, SO: INT, IBB: INT, HBP: INT, SH: INT, SF: INT, GIDP: INT, G_old: INT);
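For readers who want to see the intent of the script outside Pig, here is a rough Python sketch of the same computation: the highest run total per year, with every field parsed to a declared type before it is compared. It does not reproduce Pig's internals or the ERROR 2106 failure. The sample rows and the helper names to_int and top_scorers are made up for illustration, and unlike the Pig join it keeps only the first player that reaches the maximum in a year.

```python
# Made-up rows in (playerID, year, runs) form. Every field arrives as text,
# much like a field loaded without a schema.
rows = [
    ("playerA", "2000", "9"),
    ("playerB", "2000", "102"),
    ("playerC", "2001", "87"),
    ("playerD", "2001", ""),  # missing value
]


def to_int(text):
    """Parse an integer, returning None for empty or malformed input."""
    try:
        return int(text)
    except ValueError:
        return None


def top_scorers(rows):
    """Return {year: (playerID, runs)} for the highest run total per year."""
    best = {}
    for player, year, runs in rows:
        y, r = to_int(year), to_int(runs)
        if y is None or r is None:
            continue  # skip rows whose fields cannot be typed
        if y not in best or r > best[y][1]:
            best[y] = (player, r)
    return best


if __name__ == "__main__":
    print(top_scorers(rows))  # {2000: ('playerB', 102), 2001: ('playerC', 87)}
```

The point of the sketch is only that typing happens once, up front, so the aggregation never has to guess what it is comparing.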

Cannot run the job on hadoop cluster. only runs using LocalJobRunner

I have submitted an MR job on CDH5 Beta 2 using the hadoop jar command below:
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
I've also tried providing the fs name and job tracker URL explicitly, as below, without any success:
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver -Dfs.default.name=hdfs://abc.com:8020 -Dmapreduce.job.tracker=x.x.x.x:8021 tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
The job runs successfully but uses the LocalJobRunner instead of being submitted to the cluster. The output is written to HDFS and is correct. I'm not sure what I am doing wrong here, so I'd appreciate your input. The log output:
14/04/16 20:35:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/04/16 20:35:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/04/16 20:35:45 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/04/16 20:35:45 INFO input.FileInputFormat: Total input paths to process : 2
14/04/16 20:35:45 INFO mapreduce.JobSubmitter: number of splits:2
14/04/16 20:35:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1427968352_0001
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/04/16 20:35:46 INFO mapreduce.Job: Running job: job_local1427968352_0001
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:46 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:46 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/ratings.csv:0+4388258
14/04/16 20:35:46 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:46 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:46 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:46 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:46 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:46 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:47 INFO mapreduce.Job: Job job_local1427968352_0001 running in uber mode : false
14/04/16 20:35:47 INFO mapreduce.Job: map 0% reduce 0%
14/04/16 20:35:48 INFO mapred.LocalJobRunner:
14/04/16 20:35:48 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:48 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:48 INFO mapred.MapTask: bufstart = 0; bufend = 6485388; bufvoid = 104857600
14/04/16 20:35:48 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 24860980(99443920); length = 1353417/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000000_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000000_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/users.csv:0+186304
14/04/16 20:35:49 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:49 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:49 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:49 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:49 INFO mapred.LocalJobRunner:
14/04/16 20:35:49 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:49 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufend = 209667; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26192144(104768576); length = 22253/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000001_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000001_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map task executor complete.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Waiting for reduce tasks
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@5116331d
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
14/04/16 20:35:49 INFO reduce.EventFetcher: attempt_local1427968352_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000001_0 decomp: 220797 len: 220801 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 220797 bytes from map-output for attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 220797, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->220797
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000000_0 decomp: 7162100 len: 7162104 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 7162100 bytes from map-output for attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 7162100, inMemoryMapOutputs.size() -> 2, commitMemory -> 220797, usedMemory ->7382897
14/04/16 20:35:49 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
14/04/16 20:35:49 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
14/04/16 20:35:49 INFO mapred.Merger: Merging 2 sorted segments
14/04/16 20:35:49 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 7382885 bytes
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merged 2 segments, 7382897 bytes to disk to satisfy reduce memory limit
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 1 files, 7382899 bytes from disk
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
14/04/16 20:35:50 INFO mapred.Merger: Merging 1 sorted segments
14/04/16 20:35:50 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7382889 bytes
14/04/16 20:35:50 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:50 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
14/04/16 20:35:50 INFO mapreduce.Job: map 100% reduce 0%
14/04/16 20:35:51 INFO mapred.Task: Task:attempt_local1427968352_0001_r_000000_0 is done. And is in the process of committing
14/04/16 20:35:51 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:51 INFO mapred.Task: Task attempt_local1427968352_0001_r_000000_0 is allowed to commit now
14/04/16 20:35:51 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1427968352_0001_r_000000_0' to hdfs://...:8020/user/ird2/tech_talks/output/ReduceSideJoinDriver/_temporary/0/task_local1427968352_0001_r_000000
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce > reduce
14/04/16 20:35:51 INFO mapred.Task: Task 'attempt_local1427968352_0001_r_000000_0' done.
14/04/16 20:35:51 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce task executor complete.
14/04/16 20:35:52 INFO mapreduce.Job: map 100% reduce 100%
14/04/16 20:35:52 INFO mapreduce.Job: Job job_local1427968352_0001 completed successfully
14/04/16 20:35:52 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=14767932
FILE: Number of bytes written=29952985
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=13537382
HDFS: Number of bytes written=2949787
HDFS: Number of read operations=28
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=343919
Map output records=343919
Map output bytes=6695055
Map output materialized bytes=7382905
Input split bytes=272
Combine input records=0
Combine output records=0
Reduce input groups=5564
Reduce shuffle bytes=7382905
Reduce input records=343919
Reduce output records=5564
Spilled Records=687838
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=92
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1416101888
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=4574562
File Output Format Counters
Bytes Written=2949787
Driver code
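import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ReduceSideJoinDriver extends Configured implements Tool
{
    @Override
    public int run(String[] args) throws Exception
    {
        // Expect exactly three arguments: users file, ratings file, output directory.
        if (args.length != 3)
        {
            System.err.printf("Usage: %s [generic options] <users> <ratings> <output>\n", getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Path usersFile = new Path(args[0]);
        Path ratingsFile = new Path(args[1]);
        Job job = Job.getInstance(getConf(), "Aravind - Reduce Side Join");
        // Store a tag for each input file name in the job configuration.
        job.getConfiguration().setStrings(usersFile.getName(), "user");
        job.getConfiguration().setStrings(ratingsFile.getName(), "rating");
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(TagAndRecord.class);
        TextInputFormat.addInputPath(job, usersFile);
        TextInputFormat.addInputPath(job, ratingsFile);
        TextOutputFormat.setOutputPath(job, new Path(args[2]));
        job.setMapperClass(ReduceSideJoinMapper.class);
        job.setReducerClass(ReduceSideJoinReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        // ToolRunner parses the generic options (-D, -fs, ...) before calling run().
        int exitCode = ToolRunner.run(new Configuration(), new ReduceSideJoinDriver(), args);
        System.exit(exitCode);
    }
}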
Make sure valid copies of the following configuration files are on the Hadoop classpath. By default, configuration files are read from the directory /etc/hadoop/conf. This should be done as part of setting up the Hadoop client node.
mapred-site.xml
yarn-site.xml
core-site.xml
If the configuration files mentioned above are empty, you need to populate them with the right properties. This can be done in two ways:
In Cloudera Manager, click on the YARN service. In the Actions menu there is an option "Deploy Client Configuration" alongside Start, Stop, etc. Use that option to deploy the client configuration.
Sometimes that option may not work, namely if the node is not managed by CM and a YARN gateway is not configured on the node. In that case, use the option "Download Client Configuration" instead. Extract the downloaded zip file (it contains the files above) and copy those files to /etc/hadoop/conf manually.
To execute the jar, either the hadoop or the yarn command can be used.
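As a rough way to check whether the client configuration described above is actually populated, the following Python sketch parses the text of a mapred-site.xml and reports whether jobs would fall back to the LocalJobRunner. It relies on the property mapreduce.framework.name, which defaults to local when unset and must be yarn for jobs to go to a YARN cluster. The helper names read_properties and runs_locally and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def runs_locally(mapred_site_xml):
    """True if this client config would fall back to the LocalJobRunner."""
    props = read_properties(mapred_site_xml)
    # mapreduce.framework.name defaults to "local" when it is not set.
    return props.get("mapreduce.framework.name", "local") == "local"


# Hypothetical file contents: an empty config and one that targets YARN.
EMPTY = "<configuration></configuration>"
YARN = """<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(runs_locally(EMPTY))  # True
    print(runs_locally(YARN))   # False
```

In practice you would pass in the contents of /etc/hadoop/conf/mapred-site.xml from the node where the job is submitted.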
Apparently, you can only submit a Hadoop job from the node designated as the gateway node. Everything worked once I submitted the job from the gateway node.

Hive not enforcing bucketing

I am going through the Hive tutorial in the O'Reilly Hadoop book by Tom White. I am trying to make a bucketed table, but I can't get Hive to create the buckets. I can create the table and load the data into it, but all of the data is then stored in one file.
I am running a pseudo-distributed Hadoop cluster. I'm using Hadoop 1.2.1 and Hive 0.10.0 with a MySql metastore.
The data (shown below) are initially in the table 'users'. They are to be put in a table with 4 buckets, i.e. one user per bucket.
select * from users;
OK
id name
0 Nat
2 Joe
3 Kay
4 Ann
I set the properties below in an attempt to enforce bucketing (I don't think that setting mapred.reduce.tasks explicitly is necessary, but I included it just in case).
set hive.enforce.bucketing=true;
set mapred.reduce.tasks=4;
Then I create the table 'bucketed_users' and load the data into it.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id)
SORTED BY (id ASC) INTO 4 BUCKETS;
INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;
The output:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/katrina/katrina_20131003204949_a56048f5-ab2f-421b-af45-9ec3ff85731c.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-10-03 20:49:34,011 null map = 0%, reduce = 0%
2013-10-03 20:49:35,026 null map = 0%, reduce = 100%
Ended Job = job_local1250355097_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table records.bucketed_users
Deleted hdfs://localhost/user/hive/warehouse/records/bucketed_users
Table records.bucketed_users stats: [num_partitions: 0, num_files: 1, num_rows: 4, total_size: 24, raw_data_size: 20]
OK
id name
Time taken: 8.527 seconds
The data have been loaded into 'bucketed_users' correctly (SELECT * FROM bucketed_users shows all users) but the number of files created is just 1 (num_files: 1 above) rather than the desired 4. Looking at the bucketed_users directory in HDFS (dfs -ls /user/hive/warehouse/records/bucketed_users;) shows just one file, 000000_0. How can I enforce bucketing?
The full log is below:
2013-10-03 20:49:30,769 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Execution log at: /tmp/katrina/katrina_20131003204949_a56048f5-ab2f-421b-af45-9ec3ff85731c.log
2013-10-03 20:49:31,139 INFO exec.ExecDriver (ExecDriver.java:execute(328)) - Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
2013-10-03 20:49:31,144 INFO exec.ExecDriver (ExecDriver.java:execute(350)) - adding libjars: file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:31,144 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(852)) - Processing alias users
2013-10-03 20:49:31,145 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(870)) - Adding input file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:31,145 INFO exec.Utilities (Utilities.java:isEmptyPath(1900)) - Content Summary not cached for hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:31,365 WARN util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(52)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-03 20:49:32,410 INFO exec.ExecDriver (ExecDriver.java:createTmpDirs(219)) - Making Temp Directory: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000
2013-10-03 20:49:32,420 WARN mapred.JobClient (JobClient.java:copyAndConfigureFiles(746)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-03 20:49:32,648 WARN snappy.LoadSnappy (LoadSnappy.java:<clinit>(46)) - Snappy native library not loaded
2013-10-03 20:49:32,655 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(370)) - CombineHiveInputSplit creating pool for hdfs://localhost/user/hive/warehouse/records/users; using filter path hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:32,661 INFO mapred.FileInputFormat (FileInputFormat.java:listStatus(199)) - Total input paths to process : 1
2013-10-03 20:49:32,716 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(411)) - number of splits 1
2013-10-03 20:49:32,847 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:downloadCacheObject(423)) - Creating hive-builtins-0.10.0.jar in /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632 with rwxr-xr-x
2013-10-03 20:49:32,850 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:downloadCacheObject(435)) - Extracting /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632/hive-builtins-0.10.0.jar to /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632
2013-10-03 20:49:32,870 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:downloadCacheObject(463)) - Cached file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar as /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:32,880 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:localizePublicCacheObject(486)) - Cached file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar as /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:32,987 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Job running in-process (local Hadoop)
2013-10-03 20:49:33,034 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(340)) - Waiting for map tasks
2013-10-03 20:49:33,037 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(204)) - Starting task: attempt_local1250355097_0001_m_000000_0
2013-10-03 20:49:33,073 INFO mapred.Task (Task.java:initialize(534)) - Using ResourceCalculatorPlugin : null
2013-10-03 20:49:33,077 INFO mapred.MapTask (MapTask.java:updateJobWithSplit(455)) - Processing split: Paths:/user/hive/warehouse/records/users/users.txt:0+24InputFormatClass: org.apache.hadoop.mapred.TextInputFormat
2013-10-03 20:49:33,093 INFO io.HiveContextAwareRecordReader (HiveContextAwareRecordReader.java:initIOContext(154)) - Processing file hdfs://localhost/user/hive/warehouse/records/users/users.txt
2013-10-03 20:49:33,093 INFO mapred.MapTask (MapTask.java:runOldMapper(419)) - numReduceTasks: 1
2013-10-03 20:49:33,099 INFO mapred.MapTask (MapTask.java:<init>(949)) - io.sort.mb = 100
2013-10-03 20:49:33,541 INFO mapred.MapTask (MapTask.java:<init>(961)) - data buffer = 79691776/99614720
2013-10-03 20:49:33,542 INFO mapred.MapTask (MapTask.java:<init>(962)) - record buffer = 262144/327680
2013-10-03 20:49:33,550 INFO ExecMapper (ExecMapper.java:configure(69)) - maximum memory = 2088435712
2013-10-03 20:49:33,551 INFO ExecMapper (ExecMapper.java:configure(74)) - conf classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,551 INFO ExecMapper (ExecMapper.java:configure(76)) - thread classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,585 INFO exec.MapOperator (MapOperator.java:setChildren(387)) - Adding alias users to work list for file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:33,587 INFO exec.MapOperator (MapOperator.java:setChildren(402)) - dump TS struct<id:int,name:string>
2013-10-03 20:49:33,588 INFO ExecMapper (ExecMapper.java:configure(91)) -
<MAP>Id =10
<Children>
<TS>Id =0
<Children>
<SEL>Id =1
<Children>
<RS>Id =2
<Parent>Id = 1 null<\Parent>
<\RS>
<\Children>
<Parent>Id = 0 null<\Parent>
<\SEL>
<\Children>
<Parent>Id = 10 null<\Parent>
<\TS>
<\Children>
<\MAP>
2013-10-03 20:49:33,588 INFO exec.MapOperator (Operator.java:initialize(321)) - Initializing Self 10 MAP
2013-10-03 20:49:33,588 INFO exec.TableScanOperator (Operator.java:initialize(321)) - Initializing Self 0 TS
2013-10-03 20:49:33,588 INFO exec.TableScanOperator (Operator.java:initializeChildren(386)) - Operator 0 TS initialized
2013-10-03 20:49:33,589 INFO exec.TableScanOperator (Operator.java:initializeChildren(390)) - Initializing children of 0 TS
2013-10-03 20:49:33,589 INFO exec.SelectOperator (Operator.java:initialize(425)) - Initializing child 1 SEL
2013-10-03 20:49:33,589 INFO exec.SelectOperator (Operator.java:initialize(321)) - Initializing Self 1 SEL
2013-10-03 20:49:33,592 INFO exec.SelectOperator (SelectOperator.java:initializeOp(58)) - SELECT struct<id:int,name:string>
2013-10-03 20:49:33,594 INFO exec.SelectOperator (Operator.java:initializeChildren(386)) - Operator 1 SEL initialized
2013-10-03 20:49:33,595 INFO exec.SelectOperator (Operator.java:initializeChildren(390)) - Initializing children of 1 SEL
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator (Operator.java:initialize(425)) - Initializing child 2 RS
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator (Operator.java:initialize(321)) - Initializing Self 2 RS
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator (ReduceSinkOperator.java:initializeOp(112)) - Using tag = -1
2013-10-03 20:49:33,606 INFO exec.ReduceSinkOperator (Operator.java:initializeChildren(386)) - Operator 2 RS initialized
2013-10-03 20:49:33,606 INFO exec.ReduceSinkOperator (Operator.java:initialize(361)) - Initialization Done 2 RS
2013-10-03 20:49:33,606 INFO exec.SelectOperator (Operator.java:initialize(361)) - Initialization Done 1 SEL
2013-10-03 20:49:33,606 INFO exec.TableScanOperator (Operator.java:initialize(361)) - Initialization Done 0 TS
2013-10-03 20:49:33,607 INFO exec.MapOperator (Operator.java:initialize(361)) - Initialization Done 10 MAP
2013-10-03 20:49:33,637 INFO exec.MapOperator (MapOperator.java:cleanUpInputFileChangedOp(494)) - Processing alias users for file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:33,638 INFO exec.MapOperator (Operator.java:forward(774)) - 10 forwarding 1 rows
2013-10-03 20:49:33,638 INFO exec.TableScanOperator (Operator.java:forward(774)) - 0 forwarding 1 rows
2013-10-03 20:49:33,639 INFO exec.SelectOperator (Operator.java:forward(774)) - 1 forwarding 1 rows
2013-10-03 20:49:33,641 INFO ExecMapper (ExecMapper.java:map(148)) - ExecMapper: processing 1 rows: used memory = 114294872
2013-10-03 20:49:33,642 INFO exec.MapOperator (Operator.java:close(549)) - 10 finished. closing...
2013-10-03 20:49:33,643 INFO exec.MapOperator (Operator.java:close(555)) - 10 forwarded 4 rows
2013-10-03 20:49:33,643 INFO exec.MapOperator (Operator.java:logStats(845)) - DESERIALIZE_ERRORS:0
2013-10-03 20:49:33,643 INFO exec.TableScanOperator (Operator.java:close(549)) - 0 finished. closing...
2013-10-03 20:49:33,643 INFO exec.TableScanOperator (Operator.java:close(555)) - 0 forwarded 4 rows
2013-10-03 20:49:33,643 INFO exec.SelectOperator (Operator.java:close(549)) - 1 finished. closing...
2013-10-03 20:49:33,644 INFO exec.SelectOperator (Operator.java:close(555)) - 1 forwarded 4 rows
2013-10-03 20:49:33,644 INFO exec.ReduceSinkOperator (Operator.java:close(549)) - 2 finished. closing...
2013-10-03 20:49:33,644 INFO exec.ReduceSinkOperator (Operator.java:close(555)) - 2 forwarded 0 rows
2013-10-03 20:49:33,644 INFO exec.SelectOperator (Operator.java:close(570)) - 1 Close done
2013-10-03 20:49:33,644 INFO exec.TableScanOperator (Operator.java:close(570)) - 0 Close done
2013-10-03 20:49:33,644 INFO exec.MapOperator (Operator.java:close(570)) - 10 Close done
2013-10-03 20:49:33,645 INFO ExecMapper (ExecMapper.java:close(215)) - ExecMapper: processed 4 rows: used memory = 114767288
2013-10-03 20:49:33,647 INFO mapred.MapTask (MapTask.java:flush(1289)) - Starting flush of map output
2013-10-03 20:49:33,659 INFO mapred.MapTask (MapTask.java:sortAndSpill(1471)) - Finished spill 0
2013-10-03 20:49:33,661 INFO mapred.Task (Task.java:done(858)) - Task:attempt_local1250355097_0001_m_000000_0 is done. And is in the process of commiting
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) - hdfs://localhost/user/hive/warehouse/records/users/users.txt:0+24
2013-10-03 20:49:33,668 INFO mapred.Task (Task.java:sendDone(970)) - Task 'attempt_local1250355097_0001_m_000000_0' done.
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(229)) - Finishing task: attempt_local1250355097_0001_m_000000_0
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(348)) - Map task executor complete.
2013-10-03 20:49:33,680 INFO mapred.Task (Task.java:initialize(534)) - Using ResourceCalculatorPlugin : null
2013-10-03 20:49:33,680 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) -
2013-10-03 20:49:33,690 INFO mapred.Merger (Merger.java:merge(408)) - Merging 1 sorted segments
2013-10-03 20:49:33,695 INFO mapred.Merger (Merger.java:merge(491)) - Down to the last merge-pass, with 1 segments left of total size: 70 bytes
2013-10-03 20:49:33,695 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) -
2013-10-03 20:49:33,697 INFO ExecReducer (ExecReducer.java:configure(100)) - maximum memory = 2088435712
2013-10-03 20:49:33,697 INFO ExecReducer (ExecReducer.java:configure(105)) - conf classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,697 INFO ExecReducer (ExecReducer.java:configure(107)) - thread classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,698 INFO ExecReducer (ExecReducer.java:configure(149)) -
<OP>Id =3
<Children>
<FS>Id =4
<Parent>Id = 3 null<\Parent>
<\FS>
<\Children>
<\OP>
2013-10-03 20:49:33,698 INFO exec.ExtractOperator (Operator.java:initialize(321)) - Initializing Self 3 OP
2013-10-03 20:49:33,698 INFO exec.ExtractOperator (Operator.java:initializeChildren(386)) - Operator 3 OP initialized
2013-10-03 20:49:33,698 INFO exec.ExtractOperator (Operator.java:initializeChildren(390)) - Initializing children of 3 OP
2013-10-03 20:49:33,698 INFO exec.FileSinkOperator (Operator.java:initialize(425)) - Initializing child 4 FS
2013-10-03 20:49:33,699 INFO exec.FileSinkOperator (Operator.java:initialize(321)) - Initializing Self 4 FS
2013-10-03 20:49:33,701 INFO exec.FileSinkOperator (Operator.java:initializeChildren(386)) - Operator 4 FS initialized
2013-10-03 20:49:33,701 INFO exec.FileSinkOperator (Operator.java:initialize(361)) - Initialization Done 4 FS
2013-10-03 20:49:33,701 INFO exec.ExtractOperator (Operator.java:initialize(361)) - Initialization Done 3 OP
2013-10-03 20:49:33,706 INFO ExecReducer (ExecReducer.java:reduce(243)) - ExecReducer: processing 1 rows: used memory = 117749816
2013-10-03 20:49:33,707 INFO exec.ExtractOperator (Operator.java:forward(774)) - 3 forwarding 1 rows
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator (FileSinkOperator.java:createBucketFiles(458)) - Final Path: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000/000000_0
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator (FileSinkOperator.java:createBucketFiles(460)) - Writing to temp file: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_task_tmp.-ext-10000/_tmp.000000_0
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator (FileSinkOperator.java:createBucketFiles(481)) - New Final Path: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000/000000_0
2013-10-03 20:49:33,737 INFO ExecReducer (ExecReducer.java:close(301)) - ExecReducer: processed 4 rows: used memory = 118477400
2013-10-03 20:49:33,737 INFO exec.ExtractOperator (Operator.java:close(549)) - 3 finished. closing...
2013-10-03 20:49:33,737 INFO exec.ExtractOperator (Operator.java:close(555)) - 3 forwarded 4 rows
2013-10-03 20:49:33,737 INFO exec.FileSinkOperator (Operator.java:close(549)) - 4 finished. closing...
2013-10-03 20:49:33,737 INFO exec.FileSinkOperator (Operator.java:close(555)) - 4 forwarded 0 rows
2013-10-03 20:49:33,990 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-10-03 20:49:34,011 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - 2013-10-03 20:49:34,011 null map = 0%, reduce = 0%
2013-10-03 20:49:34,111 INFO jdbc.JDBCStatsPublisher (JDBCStatsPublisher.java:publishStat(137)) - Stats publishing for key hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000/000000
2013-10-03 20:49:34,143 INFO exec.FileSinkOperator (Operator.java:logStats(845)) - TABLE_ID_1_ROWCOUNT:4
2013-10-03 20:49:34,143 INFO exec.ExtractOperator (Operator.java:close(570)) - 3 Close done
2013-10-03 20:49:34,145 INFO mapred.Task (Task.java:done(858)) - Task:attempt_local1250355097_0001_r_000000_0 is done. And is in the process of commiting
2013-10-03 20:49:34,146 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) - reduce > reduce
2013-10-03 20:49:34,147 INFO mapred.Task (Task.java:sendDone(970)) - Task 'attempt_local1250355097_0001_r_000000_0' done.
2013-10-03 20:49:35,026 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - 2013-10-03 20:49:35,026 null map = 0%, reduce = 100%
2013-10-03 20:49:35,030 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Ended Job = job_local1250355097_0001
2013-10-03 20:49:35,033 INFO exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1361)) - Moving tmp dir: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000 to: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000.intermediate
2013-10-03 20:49:35,036 INFO exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1372)) - Moving tmp dir: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000.intermediate to: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000
I can't reproduce that:
hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM unbucketed_users;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_1384565454792_0070, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1384565454792_0070/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1384565454792_0070
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2013-11-16 05:04:12,290 Stage-1 map = 0%, reduce = 0%
2013-11-16 05:04:33,868 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.16 sec
MapReduce Total cumulative CPU time: 7 seconds 160 msec
Ended Job = job_1384565454792_0070
Loading data to table default.bucketed_users
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/bucketed_users' to trash at: hdfs://sandbox.hortonworks.com:8020/user/hue/.Trash/Current
Table default.bucketed_users stats: [num_partitions: 0, num_files: 4, num_rows: 0, total_size: 24, raw_data_size: 0]
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 4 Cumulative CPU: 7.16 sec HDFS Read: 259 HDFS Write: 24 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 160 msec
OK
Time taken: 19.291 seconds
hive> dfs -ls /apps/hive/warehouse/bucketed_users;
Found 4 items
-rw-r--r-- 3 hue hdfs 12 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000000_0
-rw-r--r-- 3 hue hdfs 0 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000001_0
-rw-r--r-- 3 hue hdfs 6 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000002_0
-rw-r--r-- 3 hue hdfs 6 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000003_0
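The file sizes in this listing (12, 0, 6 and 6 bytes) are consistent with two rows landing in bucket 0 and none in bucket 1, so "one user per bucket" should not be expected for the ids 0, 2, 3 and 4. A minimal sketch of why, assuming the bucketing rule used by older Hive releases for an INT column (the hash of an int is the value itself, and the bucket is the non-negative hash modulo the number of buckets). The class and method names are hypothetical, and newer Hive versions may hash differently:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: mimics the assumed bucket assignment for an INT
// column, bucket = (hash & Integer.MAX_VALUE) % numBuckets.
public class BucketSketch {
    static int bucketFor(int id, int numBuckets) {
        // Integer.hashCode(id) is the value itself; the mask keeps it non-negative.
        return (Integer.hashCode(id) & Integer.MAX_VALUE) % numBuckets;
    }

    // Groups the ids by bucket number, including buckets that stay empty.
    static Map<Integer, List<Integer>> assign(int[] ids, int numBuckets) {
        Map<Integer, List<Integer>> buckets = new TreeMap<>();
        for (int b = 0; b < numBuckets; b++) {
            buckets.put(b, new ArrayList<>());
        }
        for (int id : ids) {
            buckets.get(bucketFor(id, numBuckets)).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        int[] ids = {0, 2, 3, 4};  // the ids from the users table in the question
        System.out.println(assign(ids, 4));
        // prints {0=[0, 4], 1=[], 2=[2], 3=[3]}
    }
}
```

Under this assumption, ids 0 and 4 share bucket 0 and bucket 1 stays empty, which matches the four files listed above.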
It is very odd that you see a conversion to MapJoin. You should not see that, since your query has no joins in it. Is that really the query you're running? If you are seeing that, I would suggest setting:
set hive.auto.convert.join=false;
If that fixes it, you should file a bug.
Odd, this works for me. However, since you specify that your table is sorted, you also need to set
set hive.enforce.sorting=true;
in addition to
set hive.enforce.bucketing = true;
I wonder whether combining a bucketed, sorted table with only one of the two enforce settings breaks it somehow.

Resources