I am going through the Hive tutorial in the O'Reilly Hadoop book by Tom White. I am trying to make a bucketed table, but I can't get Hive to create the buckets. I can create the table and load the data into it, but all of the data is then stored in one file.
I am running a pseudo-distributed Hadoop cluster. I'm using Hadoop 1.2.1 and Hive 0.10.0 with a MySQL metastore.
The data (shown below) are initially in the table 'users'. They are to be put in a table with 4 buckets, i.e. one user per bucket.
select * from users;
id name
0 Nat
2 Joe
3 Kay
4 Ann
I set the properties below in an attempt to enforce bucketing (I don't think that setting mapred.reduce.tasks explicitly is necessary, but I included it just in case).
set hive.enforce.bucketing=true;
set mapred.reduce.tasks=4;
Then I create the table 'bucketed_users' and load the data into it.
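CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;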
The output:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the files.
Execution log at: /tmp/katrina/katrina_20131003204949_a56048f5-ab2f-421b-af45-9ec3ff85731c.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-10-03 20:49:34,011 null map = 0%, reduce = 0%
2013-10-03 20:49:35,026 null map = 0%, reduce = 100%
Ended Job = job_local1250355097_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table records.bucketed_users
Deleted hdfs://localhost/user/hive/warehouse/records/bucketed_users
Table records.bucketed_users stats: [num_partitions: 0, num_files: 1, num_rows: 4, total_size: 24, raw_data_size: 20]
id name
Time taken: 8.527 seconds
The data have been loaded into 'bucketed_users' correctly (SELECT * FROM bucketed_users shows all users) but the number of files created is just 1 (num_files: 1 above) rather than the desired 4. Looking at the bucketed_users directory in HDFS (dfs -ls /user/hive/warehouse/records/bucketed_users;) shows just one file, 000000_0. How can I enforce bucketing?
The full log is below:
2013-10-03 20:49:30,769 INFO exec.ExecDriver ( - Execution log at: /tmp/katrina/katrina_20131003204949_a56048f5-ab2f-421b-af45-9ec3ff85731c.log
2013-10-03 20:49:31,139 INFO exec.ExecDriver ( - Using
2013-10-03 20:49:31,144 INFO exec.ExecDriver ( - adding libjars: file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:31,144 INFO exec.ExecDriver ( - Processing alias users
2013-10-03 20:49:31,145 INFO exec.ExecDriver ( - Adding input file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:31,145 INFO exec.Utilities ( - Content Summary not cached for hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:31,365 WARN util.NativeCodeLoader (<clinit>(52)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-03 20:49:32,410 INFO exec.ExecDriver ( - Making Temp Directory: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000
2013-10-03 20:49:32,420 WARN mapred.JobClient ( - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-03 20:49:32,648 WARN snappy.LoadSnappy (<clinit>(46)) - Snappy native library not loaded
2013-10-03 20:49:32,655 INFO io.CombineHiveInputFormat ( - CombineHiveInputSplit creating pool for hdfs://localhost/user/hive/warehouse/records/users; using filter path hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:32,661 INFO mapred.FileInputFormat ( - Total input paths to process : 1
2013-10-03 20:49:32,716 INFO io.CombineHiveInputFormat ( - number of splits 1
2013-10-03 20:49:32,847 INFO filecache.TrackerDistributedCacheManager ( - Creating hive-builtins-0.10.0.jar in /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632 with rwxr-xr-x
2013-10-03 20:49:32,850 INFO filecache.TrackerDistributedCacheManager ( - Extracting /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632/hive-builtins-0.10.0.jar to /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632
2013-10-03 20:49:32,870 INFO filecache.TrackerDistributedCacheManager ( - Cached file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar as /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:32,880 INFO filecache.TrackerDistributedCacheManager ( - Cached file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar as /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:32,987 INFO exec.ExecDriver ( - Job running in-process (local Hadoop)
2013-10-03 20:49:33,034 INFO mapred.LocalJobRunner ( - Waiting for map tasks
2013-10-03 20:49:33,037 INFO mapred.LocalJobRunner ( - Starting task: attempt_local1250355097_0001_m_000000_0
2013-10-03 20:49:33,073 INFO mapred.Task ( - Using ResourceCalculatorPlugin : null
2013-10-03 20:49:33,077 INFO mapred.MapTask ( - Processing split: Paths:/user/hive/warehouse/records/users/users.txt:0+24InputFormatClass: org.apache.hadoop.mapred.TextInputFormat
2013-10-03 20:49:33,093 INFO io.HiveContextAwareRecordReader ( - Processing file hdfs://localhost/user/hive/warehouse/records/users/users.txt
2013-10-03 20:49:33,093 INFO mapred.MapTask ( - numReduceTasks: 1
2013-10-03 20:49:33,099 INFO mapred.MapTask (<init>(949)) - io.sort.mb = 100
2013-10-03 20:49:33,541 INFO mapred.MapTask (<init>(961)) - data buffer = 79691776/99614720
2013-10-03 20:49:33,542 INFO mapred.MapTask (<init>(962)) - record buffer = 262144/327680
2013-10-03 20:49:33,550 INFO ExecMapper ( - maximum memory = 2088435712
2013-10-03 20:49:33,551 INFO ExecMapper ( - conf classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,551 INFO ExecMapper ( - thread classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,585 INFO exec.MapOperator ( - Adding alias users to work list for file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:33,587 INFO exec.MapOperator ( - dump TS struct<id:int,name:string>
2013-10-03 20:49:33,588 INFO ExecMapper ( -
<MAP>Id =10
<TS>Id =0
<SEL>Id =1
<RS>Id =2
<Parent>Id = 1 null<\Parent>
<Parent>Id = 0 null<\Parent>
<Parent>Id = 10 null<\Parent>
2013-10-03 20:49:33,588 INFO exec.MapOperator ( - Initializing Self 10 MAP
2013-10-03 20:49:33,588 INFO exec.TableScanOperator ( - Initializing Self 0 TS
2013-10-03 20:49:33,588 INFO exec.TableScanOperator ( - Operator 0 TS initialized
2013-10-03 20:49:33,589 INFO exec.TableScanOperator ( - Initializing children of 0 TS
2013-10-03 20:49:33,589 INFO exec.SelectOperator ( - Initializing child 1 SEL
2013-10-03 20:49:33,589 INFO exec.SelectOperator ( - Initializing Self 1 SEL
2013-10-03 20:49:33,592 INFO exec.SelectOperator ( - SELECT struct<id:int,name:string>
2013-10-03 20:49:33,594 INFO exec.SelectOperator ( - Operator 1 SEL initialized
2013-10-03 20:49:33,595 INFO exec.SelectOperator ( - Initializing children of 1 SEL
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator ( - Initializing child 2 RS
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator ( - Initializing Self 2 RS
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator ( - Using tag = -1
2013-10-03 20:49:33,606 INFO exec.ReduceSinkOperator ( - Operator 2 RS initialized
2013-10-03 20:49:33,606 INFO exec.ReduceSinkOperator ( - Initialization Done 2 RS
2013-10-03 20:49:33,606 INFO exec.SelectOperator ( - Initialization Done 1 SEL
2013-10-03 20:49:33,606 INFO exec.TableScanOperator ( - Initialization Done 0 TS
2013-10-03 20:49:33,607 INFO exec.MapOperator ( - Initialization Done 10 MAP
2013-10-03 20:49:33,637 INFO exec.MapOperator ( - Processing alias users for file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:33,638 INFO exec.MapOperator ( - 10 forwarding 1 rows
2013-10-03 20:49:33,638 INFO exec.TableScanOperator ( - 0 forwarding 1 rows
2013-10-03 20:49:33,639 INFO exec.SelectOperator ( - 1 forwarding 1 rows
2013-10-03 20:49:33,641 INFO ExecMapper ( - ExecMapper: processing 1 rows: used memory = 114294872
2013-10-03 20:49:33,642 INFO exec.MapOperator ( - 10 finished. closing...
2013-10-03 20:49:33,643 INFO exec.MapOperator ( - 10 forwarded 4 rows
2013-10-03 20:49:33,643 INFO exec.MapOperator ( - DESERIALIZE_ERRORS:0
2013-10-03 20:49:33,643 INFO exec.TableScanOperator ( - 0 finished. closing...
2013-10-03 20:49:33,643 INFO exec.TableScanOperator ( - 0 forwarded 4 rows
2013-10-03 20:49:33,643 INFO exec.SelectOperator ( - 1 finished. closing...
2013-10-03 20:49:33,644 INFO exec.SelectOperator ( - 1 forwarded 4 rows
2013-10-03 20:49:33,644 INFO exec.ReduceSinkOperator ( - 2 finished. closing...
2013-10-03 20:49:33,644 INFO exec.ReduceSinkOperator ( - 2 forwarded 0 rows
2013-10-03 20:49:33,644 INFO exec.SelectOperator ( - 1 Close done
2013-10-03 20:49:33,644 INFO exec.TableScanOperator ( - 0 Close done
2013-10-03 20:49:33,644 INFO exec.MapOperator ( - 10 Close done
2013-10-03 20:49:33,645 INFO ExecMapper ( - ExecMapper: processed 4 rows: used memory = 114767288
2013-10-03 20:49:33,647 INFO mapred.MapTask ( - Starting flush of map output
2013-10-03 20:49:33,659 INFO mapred.MapTask ( - Finished spill 0
2013-10-03 20:49:33,661 INFO mapred.Task ( - Task:attempt_local1250355097_0001_m_000000_0 is done. And is in the process of commiting
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner ( - hdfs://localhost/user/hive/warehouse/records/users/users.txt:0+24
2013-10-03 20:49:33,668 INFO mapred.Task ( - Task 'attempt_local1250355097_0001_m_000000_0' done.
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner ( - Finishing task: attempt_local1250355097_0001_m_000000_0
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner ( - Map task executor complete.
2013-10-03 20:49:33,680 INFO mapred.Task ( - Using ResourceCalculatorPlugin : null
2013-10-03 20:49:33,680 INFO mapred.LocalJobRunner ( -
2013-10-03 20:49:33,690 INFO mapred.Merger ( - Merging 1 sorted segments
2013-10-03 20:49:33,695 INFO mapred.Merger ( - Down to the last merge-pass, with 1 segments left of total size: 70 bytes
2013-10-03 20:49:33,695 INFO mapred.LocalJobRunner ( -
2013-10-03 20:49:33,697 INFO ExecReducer ( - maximum memory = 2088435712
2013-10-03 20:49:33,697 INFO ExecReducer ( - conf classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,697 INFO ExecReducer ( - thread classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,698 INFO ExecReducer ( -
<OP>Id =3
<FS>Id =4
<Parent>Id = 3 null<\Parent>
2013-10-03 20:49:33,698 INFO exec.ExtractOperator ( - Initializing Self 3 OP
2013-10-03 20:49:33,698 INFO exec.ExtractOperator ( - Operator 3 OP initialized
2013-10-03 20:49:33,698 INFO exec.ExtractOperator ( - Initializing children of 3 OP
2013-10-03 20:49:33,698 INFO exec.FileSinkOperator ( - Initializing child 4 FS
2013-10-03 20:49:33,699 INFO exec.FileSinkOperator ( - Initializing Self 4 FS
2013-10-03 20:49:33,701 INFO exec.FileSinkOperator ( - Operator 4 FS initialized
2013-10-03 20:49:33,701 INFO exec.FileSinkOperator ( - Initialization Done 4 FS
2013-10-03 20:49:33,701 INFO exec.ExtractOperator ( - Initialization Done 3 OP
2013-10-03 20:49:33,706 INFO ExecReducer ( - ExecReducer: processing 1 rows: used memory = 117749816
2013-10-03 20:49:33,707 INFO exec.ExtractOperator ( - 3 forwarding 1 rows
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator ( - Final Path: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000/000000_0
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator ( - Writing to temp file: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_task_tmp.-ext-10000/_tmp.000000_0
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator ( - New Final Path: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000/000000_0
2013-10-03 20:49:33,737 INFO ExecReducer ( - ExecReducer: processed 4 rows: used memory = 118477400
2013-10-03 20:49:33,737 INFO exec.ExtractOperator ( - 3 finished. closing...
2013-10-03 20:49:33,737 INFO exec.ExtractOperator ( - 3 forwarded 4 rows
2013-10-03 20:49:33,737 INFO exec.FileSinkOperator ( - 4 finished. closing...
2013-10-03 20:49:33,737 INFO exec.FileSinkOperator ( - 4 forwarded 0 rows
2013-10-03 20:49:33,990 INFO exec.ExecDriver ( - Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-10-03 20:49:34,011 INFO exec.ExecDriver ( - 2013-10-03 20:49:34,011 null map = 0%, reduce = 0%
2013-10-03 20:49:34,111 INFO jdbc.JDBCStatsPublisher ( - Stats publishing for key hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000/000000
2013-10-03 20:49:34,143 INFO exec.FileSinkOperator ( - TABLE_ID_1_ROWCOUNT:4
2013-10-03 20:49:34,143 INFO exec.ExtractOperator ( - 3 Close done
2013-10-03 20:49:34,145 INFO mapred.Task ( - Task:attempt_local1250355097_0001_r_000000_0 is done. And is in the process of commiting
2013-10-03 20:49:34,146 INFO mapred.LocalJobRunner ( - reduce > reduce
2013-10-03 20:49:34,147 INFO mapred.Task ( - Task 'attempt_local1250355097_0001_r_000000_0' done.
2013-10-03 20:49:35,026 INFO exec.ExecDriver ( - 2013-10-03 20:49:35,026 null map = 0%, reduce = 100%
2013-10-03 20:49:35,030 INFO exec.ExecDriver ( - Ended Job = job_local1250355097_0001
2013-10-03 20:49:35,033 INFO exec.FileSinkOperator ( - Moving tmp dir: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000 to: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000.intermediate
2013-10-03 20:49:35,036 INFO exec.FileSinkOperator ( - Moving tmp dir: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000.intermediate to: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000

I can't reproduce that:
hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM unbucketed_users;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_1384565454792_0070, Tracking URL =
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1384565454792_0070
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2013-11-16 05:04:12,290 Stage-1 map = 0%, reduce = 0%
2013-11-16 05:04:33,868 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.16 sec
MapReduce Total cumulative CPU time: 7 seconds 160 msec
Ended Job = job_1384565454792_0070
Loading data to table default.bucketed_users
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://' to trash at: hdfs://
Table default.bucketed_users stats: [num_partitions: 0, num_files: 4, num_rows: 0, total_size: 24, raw_data_size: 0]
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 4 Cumulative CPU: 7.16 sec HDFS Read: 259 HDFS Write: 24 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 160 msec
Time taken: 19.291 seconds
hive> dfs -ls /apps/hive/warehouse/bucketed_users;
Found 4 items
-rw-r--r-- 3 hue hdfs 12 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000000_0
-rw-r--r-- 3 hue hdfs 0 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000001_0
-rw-r--r-- 3 hue hdfs 6 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000002_0
-rw-r--r-- 3 hue hdfs 6 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000003_0
It is very odd that you see a conversion to MapJoin; you should not see that, since your query has no joins in it. Is that really the query you're running? If you're seeing that, I would suggest to:
If that fixes it you should file a bug.
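As a side note on the listing above: four buckets do not necessarily mean one user per file. The sketch below models the bucket assignment in Python, assuming (as the Hive tutorial describes for an INT column) that the hash of the value is the integer itself and the bucket is that hash modulo the number of buckets. The function names are made up for illustration; this is not Hive code.

```python
# Rows copied from the 'users' table in the question.
USERS = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]

JAVA_INT_MAX = 0x7FFFFFFF  # mask that keeps the hash non-negative


def bucket_for(user_id, num_buckets):
    """Assumed rule: hash of an INT is the value itself, then modulo the bucket count."""
    return (user_id & JAVA_INT_MAX) % num_buckets


def assign_buckets(users, num_buckets):
    """Group user names by the bucket their id falls into."""
    buckets = {b: [] for b in range(num_buckets)}
    for user_id, name in users:
        buckets[bucket_for(user_id, num_buckets)].append(name)
    return buckets


if __name__ == "__main__":
    for bucket, names in assign_buckets(USERS, 4).items():
        print(bucket, names)
```

Under that assumption, ids 0 and 4 share bucket 0 and bucket 1 stays empty, which is consistent with the 12, 0, 6 and 6 byte files in the dfs -ls output.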

Odd, this works for me. However, since you specify that your table is sorted, you also need to set
set hive.enforce.sorting=true;
in addition to
set hive.enforce.bucketing = true;
I'm wondering if the combination of a bucketed and sorted table with only one of the enforce settings messes it up somehow.
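One more thing that may be worth checking in the output from the question: the console reports "Job running in-process (local Hadoop)", and although 4 reduce tasks were determined at compile time, the map task logs numReduceTasks: 1. As far as I know, the local job runner in Hadoop 1.x runs at most one reduce task, which would explain a single output file. That is my assumption about the cause, not something confirmed in this thread. A small Python sketch (the function name is hypothetical) that pulls these three clues out of a log:

```python
import re

LOCAL_MODE_MARKER = "Job running in-process (local Hadoop)"


def reducer_clues(log_text):
    """Extract the planned and actual reducer counts and the local-mode flag."""
    planned = re.search(
        r"Number of reduce tasks determined at compile time: (\d+)", log_text
    )
    actual = re.search(r"numReduceTasks: (\d+)", log_text)
    return {
        "local_mode": LOCAL_MODE_MARKER in log_text,
        "planned_reducers": int(planned.group(1)) if planned else None,
        "actual_reducers": int(actual.group(1)) if actual else None,
    }


# Three lines copied from the question's console output and log.
SAMPLE_LOG = "\n".join([
    "Number of reduce tasks determined at compile time: 4",
    "Job running in-process (local Hadoop)",
    "2013-10-03 20:49:33,093 INFO mapred.MapTask ( - numReduceTasks: 1",
])

if __name__ == "__main__":
    print(reducer_clues(SAMPLE_LOG))
```

If the planned and actual counts differ while the job runs locally, it may be worth making sure the job is submitted to the cluster, for example by checking that mapred.job.tracker is not left at its default of local.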


Pig: count of each product in distinct locations

I am trying to do the following Steps 1 to 4 in Pig:
Step 1: Create a users table and load its data from /tmp/users.txt:
|Column 1 |USER ID |int|
|Column 2 |EMAIL |chararray|
|Column 3 |LANGUAGE |chararray|
|Column 4 |LOCATION |chararray|
Step 2: Create a transaction table and load its data from /tmp/transaction.txt:
|Column 1 |ID |int|
|Column 2 |PRODUCT |int|
|Column 3 |USER ID |int|
|Column 4 |PURCHASE AMOUNT |double|
|Column 5 |DESCRIPTION |chararray|
Step 3: Find the count of each product in distinct locations.
Step 4: Display the results.
To achieve the above I did the following:
users = LOAD '/tmp/users.txt' USING PigStorage(',') AS (USERID:int, EMAIL:chararray, LANGUAGE:chararray, LOCATION: chararray);
trans = LOAD '/tmp/transaction.txt' USING PigStorage(',') AS (ID:int, PRODUCT:int, USERID:int, PURCHASEAMOUNT: double, DESCRIPTION: chararray);
users_trans = JOIN users BY USERID RIGHT, trans BY USERID;
C = FOREACH B GENERATE group as comb, COUNT(users_trans) AS Total;
But I am getting errors. It would be helpful if you could assist, as I am new to Pig. Sample data from /tmp/transaction.txt:
1 1 1 300 a jumper
2 1 2 300 a jumper
3 1 5 300 a jumper
4 2 3 100 a rubber chicken
5 1 3 300 a jumper
6 5 4 500 a soapbox
7 3 3 200 a adhesive
8 4 1 300 a lotion
9 4 4 500 a sweater
10 5 4 600 a jeans
Error Log:
2019-12-27 06:17:22,180 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/tmp/temp2029752934/tmp-883821114/part-r-00000:0+130
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - (EQUATOR) 0 kvi 26214396(104857584)
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - 100
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - soft limit at 83886080
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - bufstart = 0; bufvoid = 104857600
2019-12-27 06:17:22,242 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - kvstart = 26214396; length = 6553600
2019-12-27 06:17:22,244 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2019-12-27 06:17:22,248 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2019-12-27 06:17:22,248 [LocalJobRunner Map Task Executor #0] WARN - SchemaTupleBackend has already been initialized
2019-12-27 06:17:22,250 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Starting flush of map output
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Spilling map output
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - bufstart = 0; bufend = 100; bufvoid = 104857600
2019-12-27 06:17:22,254 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - kvstart = 26214396(104857584); kvend = 26214360(104857440); length = 37/6553600
2019-12-27 06:17:22,262 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,264 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Finished spill 0
2019-12-27 06:17:22,265 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task:attempt_local1424814286_0002_m_000000_0 is done. And is in the process of committing
2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -map
2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local1424814286_0002_m_000000_0' done.
2019-12-27 06:17:22,266 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -Finishing task: attempt_local1424814286_0002_m_000000_0
2019-12-27 06:17:22,266 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2019-12-27 06:17:22,266 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - Waiting for reduce tasks
2019-12-27 06:17:22,267 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local1424814286_0002_r_000000_0
2019-12-27 06:17:22,272 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2019-12-27 06:17:22,272 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-12-27 06:17:22,274 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorProcessTree : [ ]
2019-12-27 06:17:22,274 [pool-9-thread-1] INFO org.apache.hadoop.mapred.ReduceTask - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#2582aa54
2019-12-27 06:17:22,275 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2019-12-27 06:17:22,275 [EventFetcher for fetching Map Completion Events] INFO org.apache.hadoop.mapreduce.task.reduce.EventFetcher - attempt_local1424814286_0002_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2019-12-27 06:17:22,276 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#2 about to shuffle output of map attempt_local1424814286_0002_m_000000_0 decomp: 14 len: 18 to MEMORY
2019-12-27 06:17:22,277 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 14 bytes from map-output for attempt_local1424814286_0002_m_000000_0
2019-12-27 06:17:22,277 [localfetcher#2] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - closeInMemoryFile -> map-output of size: 14, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->14
2019-12-27 06:17:22,277 [EventFetcher for fetching Map Completion Events] INFO org.apache.hadoop.mapreduce.task.reduce.EventFetcher - EventFetcher is interrupted.. Returning
2019-12-27 06:17:22,278 [Readahead Thread #3] WARN - Failed readahead on ifile
EBADF: Bad file descriptor
at$POSIX.posix_fadvise(Native Method)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
2019-12-27 06:17:22,278 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2019-12-27 06:17:22,280 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 7 bytes
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merged 1 segments, 14 bytes to disk to satisfy reduce memory limit
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 1 files, 18 bytes from disk
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 0 segments, 0 bytes from memory into reduce
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2019-12-27 06:17:22,281 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 7 bytes
2019-12-27 06:17:22,282 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2019-12-27 06:17:22,283 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2019-12-27 06:17:22,283 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-12-27 06:17:22,284 [pool-9-thread-1] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2019-12-27 06:17:22,285 [pool-9-thread-1] WARN - SchemaTupleBackend has already been initialized
2019-12-27 06:17:22,286 [pool-9-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce - Aliases being processed per job phase (AliasName[line,offset]): M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,287 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task:attempt_local1424814286_0002_r_000000_0 is done. And is in the process of committing
2019-12-27 06:17:22,289 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2019-12-27 06:17:22,289 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task attempt_local1424814286_0002_r_000000_0 is allowed to commit now
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local1424814286_0002_r_000000_0' to file:/tmp/temp2029752934/tmp726323435/_temporary/0/task_local1424814286_0002_r_000000
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local1424814286_0002_r_000000_0' done.
2019-12-27 06:17:22,292 [pool-9-thread-1] INFO org.apache.hadoop.mapred.LocalJobRunner - Finishing task: attempt_local1424814286_0002_r_000000_0
2019-12-27 06:17:22,292 [Thread-18] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1424814286_0002
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases B,C
2019-12-27 06:17:22,460 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: C[7,4],B[6,4] C: C[7,4],B[6,4] R: C[7,4]
2019-12-27 06:17:22,463 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,464 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,465 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2019-12-27 06:17:22,474 [main] INFO - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.9.2 0.16.0 root 2019-12-27 06:17:20 2019-12-27 06:17:22 HASH_JOIN,GROUP_BY
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1289071959_0001 2 1 n/a n/a n/a n/a n/a n/a n/a n/a trans,users,users_trans HASH_JOIN
job_local1424814286_0002 1 1 n/a n/a n/a n/a n/a n/a n/a n/a B,C GROUP_BY,COMBINER file:/tmp/temp2029752934/tmp726323435,
Successfully read 5 records from: "/tmp/users.txt"
Successfully read 10 records from: "/tmp/transaction.txt"
Successfully stored 1 records in: "file:/tmp/temp2029752934/tmp726323435"
Total records written : 1
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local1289071959_0001 -> job_local1424814286_0002,
2019-12-27 06:17:22,475 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,476 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,477 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,485 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,486 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,487 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metricswith processName=JobTracker, sessionId= - already initialized
2019-12-27 06:17:22,492 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 15 time(s).
2019-12-27 06:17:22,493 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 55 time(s).
2019-12-27 06:17:22,493 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-12-27 06:17:22,496 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - is deprecated. Instead, use fs.defaultFS
2019-12-27 06:17:22,496 [main] WARN - SchemaTupleBackend has already been initialized
2019-12-27 06:17:22,503 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2019-12-27 06:17:22,503 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2019-12-27 06:17:22,541 [main] INFO org.apache.pig.Main - Pig script completed in 2 seconds and 965 milliseconds (2965 ms)
First of all: you appear to be just starting out with Pig. It may be worth knowing that Cloudera recently decided to deprecate Pig. It will of course not cease to exist, but think twice before picking it up as a new skill or building new use cases on it. I would recommend looking into Hive, Spark, or Impala as more future-proof alternatives.
Your job succeeds, but presumably not with the output you want. The log contains hints about what may be wrong (the FIELD_DISCARDED_TYPE_CONVERSION_FAILED and ACCESSING_NON_EXISTENT_FIELD warnings suggest data type and field problems), but they do not point at a specific problem in the code.
My recommendation is to find out exactly where the problem occurs. Simply cut off the end of your script and print an intermediate result to see whether you are still on track.
If the problem is already in your load statement, which is likely, you can narrow it down further: first load the data without a schema, and then apply the schema.
Given the data you have, the first problem is that it contains no commas, so you must load each line as a whole and split it afterwards. I used two or more spaces as the separator in the transactions file because your last column appears to be a single string containing spaces. For accuracy, I suggest using a better delimiter than spaces or tabs.
Then the GROUP BY needs to reference the relations that the fields come from.
Everything else is fine, I think, though I'm not sure about the COUNT(X).
-- Load each line as a single chararray, then split it on whitespace.
A = LOAD '/tmp/users.txt' USING PigStorage() as (line:chararray);
USERS = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) AS (userid:int,email:chararray,language:chararray,location:chararray);
-- The description column contains single spaces, so split on two or more.
B = LOAD '/tmp/transactions.txt' USING PigStorage() as (line:chararray);
TRANS = FOREACH B GENERATE FLATTEN(STRSPLIT(line, '\\s\\s+')) AS (id:int,product:int,userid:int,purchase:double,desc:chararray);
X = JOIN USERS BY userid RIGHT, TRANS BY userid;
-- Qualify the grouping fields with the relation they come from.
X_grouped = GROUP X BY (TRANS::desc, USERS::location);
RES = FOREACH X_grouped GENERATE group as comb, COUNT(X) AS Total;
-- \d is the Grunt shorthand for DUMP.
\d RES;
((a jeans,HN),1)
((a jumper,FR),1)
((a jumper,GB),1)
((a jumper,IS),1)
((a jumper,US),1)
((a lotion,US),1)
((a soapbox,HN),1)
((a sweater,HN),1)
((a adhesive,FR),1)
((a rubber chicken,FR),1)
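The difference between the two split patterns used in the script above ('\\s+' for the users file, '\\s\\s+' for the transactions file) can be checked outside Pig. The following is a minimal sketch in plain Java; the class name SplitDemo and the sample line are made up for illustration and assume columns separated by two spaces.

```java
import java.util.Arrays;

public class SplitDemo {
    // Split on any run of whitespace, as the '\\s+' pattern does for USERS.
    static String[] splitOnWhitespace(String line) {
        return line.trim().split("\\s+");
    }

    // Split only on runs of two or more whitespace characters,
    // as the '\\s\\s+' pattern does for TRANS.
    static String[] splitOnWideGaps(String line) {
        return line.trim().split("\\s\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical transaction line; the last column contains a single space.
        String line = "1  1  1  300  a jumper";
        // The single space inside the description also splits here.
        System.out.println(Arrays.toString(splitOnWhitespace(line)));
        // The description stays together as one field here.
        System.out.println(Arrays.toString(splitOnWideGaps(line)));
    }
}
```

With the first pattern the description is broken into two fields, which shifts the column count; with the second it stays a single field.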

MapReduce program to find the maximum temperature

I have written a MapReduce program, but the reducer is not working. The code I have written is below. I am not getting any error; please let me know what the mistake in the program is.
Below is the data:
1993 23
1991 25
1992 56
1991 78
1991 11
1993 24
1992 35
package p1;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class mymaaper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
        // Each input line is "<year> <temperature>".
        String arr1[] = value.toString().split("\\s");
        String year = arr1[0];
        int temp = Integer.parseInt(arr1[1]);
        con.write(new Text(year), new IntWritable(temp));
        //con.write(new Text(year), new Text(year));
    }
}
package p1;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class myreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public myreducer() {
        System.out.println("myreducer().hashcode=" + hashCode());
    }
    public void reduce(Text key, Iterable<IntWritable> value, Context con) throws IOException, InterruptedException {
        System.out.print("All values=");
        int maxvalue = Integer.MIN_VALUE;
        // Keep the largest temperature seen for this year.
        for (IntWritable sw : value) {
            maxvalue = Math.max(maxvalue, sw.get());
        }
        con.write(key, new IntWritable(maxvalue));
    }
}
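package p1;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class mydriver {
    public static void main(String args[]) throws ClassNotFoundException, IOException, InterruptedException {
        Path input = new Path("hdfs://localhost:9000/input_temp/");
        Path output = new Path("hdfs://localhost:9000/output_temp/");
        Configuration conf = new Configuration();
        Job j1 = Job.getInstance(conf, "maxtemp");
        j1.setMapperClass(mymaaper.class);
        j1.setReducerClass(myreducer.class);
        j1.setOutputKeyClass(Text.class);
        j1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j1, input);
        FileOutputFormat.setOutputPath(j1, output);
        // Remove any previous output so the job can be rerun.
        output.getFileSystem(conf).delete(output, true);
        System.exit(j1.waitForCompletion(true) ? 0 : 1);
    }
}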
2018-09-19 09:42:13,222 WARN util.NativeCodeLoader (<clinit>(60)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-09-19 09:42:22,319 INFO beanutils.FluentPropertyBeanIntrospector ( - Error when creating PropertyDescriptor for public final void,java.lang.Object)! Ignoring this property.
2018-09-19 09:42:22,864 INFO impl.MetricsConfig ( - loaded properties from
2018-09-19 09:42:23,829 INFO impl.MetricsSystemImpl ( - Scheduled Metric snapshot period at 0 second(s).
2018-09-19 09:42:23,834 INFO impl.MetricsSystemImpl ( - JobTracker metrics system started
2018-09-19 09:42:26,003 WARN mapreduce.JobResourceUploader ( - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-09-19 09:42:26,053 WARN mapreduce.JobResourceUploader ( - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2018-09-19 09:42:27,001 INFO input.FileInputFormat ( - Total input files to process : 2
2018-09-19 09:42:27,512 INFO mapreduce.JobSubmitter ( - number of splits:2
2018-09-19 09:42:29,048 INFO mapreduce.JobSubmitter ( - Submitting tokens for job: job_local342787376_0001
2018-09-19 09:42:29,068 INFO mapreduce.JobSubmitter ( - Executing with tokens: []
2018-09-19 09:42:30,382 INFO mapreduce.Job ( - The url to track the job: http://localhost:8080/
2018-09-19 09:42:30,387 INFO mapreduce.Job ( - Running job: job_local342787376_0001
2018-09-19 09:42:30,408 INFO mapred.LocalJobRunner ( - OutputCommitter set in config null
2018-09-19 09:42:30,469 INFO output.FileOutputCommitter (<init>(140)) - File Output Committer Algorithm version is 2
2018-09-19 09:42:30,478 INFO output.FileOutputCommitter (<init>(155)) - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-09-19 09:42:30,539 INFO mapred.LocalJobRunner ( - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2018-09-19 09:42:31,402 INFO mapred.LocalJobRunner ( - Waiting for map tasks
2018-09-19 09:42:31,416 INFO mapred.LocalJobRunner ( - Starting task: attempt_local342787376_0001_m_000000_0
2018-09-19 09:42:31,444 INFO mapreduce.Job ( - Job job_local342787376_0001 running in uber mode : false
2018-09-19 09:42:31,447 INFO mapreduce.Job ( - map 0% reduce 0%
2018-09-19 09:42:31,768 INFO output.FileOutputCommitter (<init>(140)) - File Output Committer Algorithm version is 2
2018-09-19 09:42:31,778 INFO output.FileOutputCommitter (<init>(155)) - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-09-19 09:42:32,028 INFO mapred.Task ( - Using ResourceCalculatorProcessTree : [ ]
2018-09-19 09:42:32,085 INFO mapred.MapTask ( - Processing split: hdfs://localhost:9000/input_temp/temp1:0+41
2018-09-19 09:42:33,881 INFO mapred.MapTask ( - (EQUATOR) 0 kvi 26214396(104857584)
2018-09-19 09:42:33,888 INFO mapred.MapTask ( - 100
2018-09-19 09:42:33,888 INFO mapred.MapTask ( - soft limit at 83886080
2018-09-19 09:42:33,889 INFO mapred.MapTask ( - bufstart = 0; bufvoid = 104857600
2018-09-19 09:42:33,890 INFO mapred.MapTask ( - kvstart = 26214396; length = 6553600
2018-09-19 09:42:33,964 INFO mapred.MapTask ( - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2018-09-19 09:42:35,960 INFO mapred.LocalJobRunner ( -
2018-09-19 09:42:35,992 INFO mapred.MapTask ( - Starting flush of map output
2018-09-19 09:42:36,001 INFO mapred.MapTask ( - Spilling map output
2018-09-19 09:42:36,001 INFO mapred.MapTask ( - bufstart = 0; bufend = 45; bufvoid = 104857600
2018-09-19 09:42:36,007 INFO mapred.MapTask ( - kvstart = 26214396(104857584); kvend = 26214380(104857520); length = 17/6553600
2018-09-19 09:42:36,175 INFO mapred.MapTask ( - Finished spill 0
2018-09-19 09:42:36,337 INFO mapred.Task ( - Task:attempt_local342787376_0001_m_000000_0 is done. And is in the process of committing
2018-09-19 09:42:36,419 INFO mapred.LocalJobRunner ( - map
2018-09-19 09:42:36,426 INFO mapred.Task ( - Task 'attempt_local342787376_0001_m_000000_0' done.
2018-09-19 09:42:36,571 INFO mapred.Task ( - Final Counters for attempt_local342787376_0001_m_000000_0: Counters: 22
File System Counters
FILE: Number of bytes read=267
FILE: Number of bytes written=495006
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=41
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=45
Map output materialized bytes=61
Input split bytes=103
Combine input records=0
Spilled Records=5
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=339
Total committed heap usage (bytes)=167841792
File Input Format Counters
Bytes Read=41
2018-09-19 09:42:36,578 INFO mapred.LocalJobRunner ( - Finishing task: attempt_local342787376_0001_m_000000_0
2018-09-19 09:42:36,581 INFO mapred.LocalJobRunner ( - Starting task: attempt_local342787376_0001_m_000001_0
2018-09-19 09:42:36,606 INFO output.FileOutputCommitter (<init>(140)) - File Output Committer Algorithm version is 2
2018-09-19 09:42:36,607 INFO output.FileOutputCommitter (<init>(155)) - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-09-19 09:42:36,609 INFO mapred.Task ( - Using ResourceCalculatorProcessTree : [ ]
2018-09-19 09:42:36,644 INFO mapreduce.Job ( - map 100% reduce 0%
2018-09-19 09:42:36,668 INFO mapred.MapTask ( - Processing split: hdfs://localhost:9000/input_temp/temp2:0+33
2018-09-19 09:42:37,175 INFO mapred.MapTask ( - (EQUATOR) 0 kvi 26214396(104857584)
2018-09-19 09:42:37,180 INFO mapred.MapTask ( - 100
2018-09-19 09:42:37,183 INFO mapred.MapTask ( - soft limit at 83886080
2018-09-19 09:42:37,187 INFO mapred.MapTask ( - bufstart = 0; bufvoid = 104857600
2018-09-19 09:42:37,187 INFO mapred.MapTask ( - kvstart = 26214396; length = 6553600
2018-09-19 09:42:37,199 INFO mapred.MapTask ( - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2018-09-19 09:42:37,354 INFO mapred.MapTask ( - Starting flush of map output
2018-09-19 09:42:37,355 INFO mapred.MapTask ( - Spilling map output
2018-09-19 09:42:37,355 INFO mapred.MapTask ( - bufstart = 0; bufend = 36; bufvoid = 104857600
2018-09-19 09:42:37,355 INFO mapred.MapTask ( - kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
2018-09-19 09:42:37,419 INFO mapred.MapTask ( - Finished spill 0
2018-09-19 09:42:37,480 INFO mapred.LocalJobRunner ( - map task executor complete.
2018-09-19 09:42:37,498 WARN mapred.LocalJobRunner ( - job_local342787376_0001
java.lang.Exception: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(
at org.apache.hadoop.mapred.LocalJobRunner$
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.hadoop.mapred.MapTask.runNewMapper(
at org.apache.hadoop.mapred.LocalJobRunner$Job$
at java.util.concurrent.Executors$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
2018-09-19 09:42:37,648 INFO mapreduce.Job ( - Job job_local342787376_0001 failed with state FAILED due to: NA
2018-09-19 09:42:37,786 INFO mapreduce.Job ( - Counters: 22
File System Counters
FILE: Number of bytes read=267
FILE: Number of bytes written=495006
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=41
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=45
Map output materialized bytes=61
Input split bytes=103
Combine input records=0
Spilled Records=5
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=339
Total committed heap usage (bytes)=167841792
File Input Format Counters
Bytes Read=41
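You are getting an ArrayIndexOutOfBoundsException in the mapper. I think there is a bad record in your input files (for example a blank line, for which arr1 has no second element). Check the size of arr1 in the mapper before using it.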
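A defensive version of the parsing step might look like the sketch below. It is plain Java without the Hadoop classes so that it can be run on its own; the class name RecordParser and the method parse are made up for illustration. In the mapper, the idea would be to call such a check and simply return without writing anything when the record is bad.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Optional;

public class RecordParser {
    // Returns (year, temperature) for a well-formed line, or empty for a bad record.
    static Optional<Map.Entry<String, Integer>> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            // Blank line or missing temperature: indexing parts[1] would throw.
            return Optional.empty();
        }
        try {
            Map.Entry<String, Integer> record =
                    new AbstractMap.SimpleEntry<>(parts[0], Integer.valueOf(parts[1]));
            return Optional.of(record);
        } catch (NumberFormatException e) {
            // The temperature column is not an integer.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1993 23").isPresent()); // well-formed record
        System.out.println(parse("").isPresent());        // blank line
        System.out.println(parse("1991").isPresent());    // temperature missing
    }
}
```

Splitting on "\\s+" after trimming, rather than on a single "\\s", also avoids empty fields when a line contains more than one space between the columns.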

Does Sqoop spill temporary data to disk?

As I understand Sqoop, it launches a few mappers on different data nodes, each making a JDBC connection to the RDBMS. Once the connection is established, the data is transferred to HDFS.
I am trying to understand whether a Sqoop mapper temporarily spills data to disk on the data node. I know spilling happens in MapReduce, but I am not sure about a Sqoop job.
It seems sqoop-import runs as a map-only job and doesn't spill, whereas sqoop-merge runs as a full map-reduce job and does spill. You can check this in the Job Tracker while a Sqoop import is running.
Have a look at this part of a sqoop import log: it does not spill, it fetches the data and writes it to HDFS:
INFO [main] ... mapreduce.db.DataDrivenDBRecordReader: Using query: SELECT...
[main] mapreduce.db.DBRecordReader: Executing query: SELECT...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
INFO [main] Got brand-new compressor [.snappy]
INFO [Thread-16] ...mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1489705733959_2462784_m_000000_0 is done. And is in the process of committing
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_1489705733959_2462784_m_000000_0' to hdfs://
Have a look at this sqoop-merge log (some rows skipped): it spills to disk (note "Spilling map output" in the log):
INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://bla-bla/part-m-00000:0+48322717
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
INFO [main] org.apache.hadoop.mapred.MapTask: 1024
INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 751619264
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452; length = 67108864
INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO [main] com.pepperdata.supervisor.agent.resource.r: Datanode bla-bla is LOCAL.
INFO [main] Got brand-new decompressor [.snappy]
INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 184775274; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452(1073741808); kvend = 267347800(1069391200); length = 1087653/67108864
INFO [main] Got brand-new compressor [.snappy]
[main] org.apache.hadoop.mapred.MapTask: Finished spill 0
...Task:attempt_1489705733959_2479291_m_000000_0 is done. And is in the process of committing
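To compare the two kinds of task logs without reading them line by line, one could count the spill messages. The following is a rough sketch in plain Java; the class name SpillCheck is made up, and it assumes the task log has been read into a list of lines and that MapTask logs at INFO level, as in the excerpts above.

```java
import java.util.Arrays;
import java.util.List;

public class SpillCheck {
    // Counts completed spills by looking for MapTask's "Finished spill" messages.
    static long countSpills(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains("Finished spill"))
                .count();
    }

    public static void main(String[] args) {
        // Lines taken from the sqoop-merge excerpt above.
        List<String> mergeLog = Arrays.asList(
                "INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output",
                "INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output",
                "[main] org.apache.hadoop.mapred.MapTask: Finished spill 0");
        // Lines taken from the sqoop import excerpt above.
        List<String> importLog = Arrays.asList(
                "[main] mapreduce.db.DBRecordReader: Executing query: SELECT...",
                "INFO [main] Got brand-new compressor [.snappy]");
        System.out.println("merge spills:  " + countSpills(mergeLog));
        System.out.println("import spills: " + countSpills(importLog));
    }
}
```

A count of zero only shows that no spill was logged, so it is worth confirming the log level before drawing conclusions from it.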

Use PIG to count the number of records in an avro file

I can open an Avro file in HUE, and HUE shows me that it has 10 records. I can browse through all 10 records in HUE.
Now I write the following code in PIG:
data = LOAD '/user/admin/2015/10/04/02/file1.avro' USING AvroStorage();
data_group = GROUP data ALL;
row_count = FOREACH data_group GENERATE COUNT(data);
dump row_count;
The output of the job is
Successfully read 4 records (58507 bytes) from: "/user/admin/2015/10/04/02/file1.avro"
Successfully stored 1 records (6 bytes) in: "hdfs://nn1/tmp/temp-268177355/tmp915757783"
Total records written : 1
Total bytes written : 6
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
2015-10-29 19:08:55,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-29 19:08:55,252 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - is deprecated. Instead, use fs.defaultFS
2015-10-29 19:08:55,253 [main] INFO - Key [pig.schematuple] was not set... will not generate code.
2015-10-29 19:08:55,261 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-10-29 19:08:55,261 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
How did 10 become 4? Is there a different way to count the number of records in an Avro file using PIG?

ORDER BY job failed in the Pig script while running EmbeddedPig using Java

I have the following Pig script, which works perfectly in the Grunt shell (it stores the results to HDFS without any issues); however, the last job (ORDER BY) fails if I run the same script using Java EmbeddedPig. If I replace the ORDER BY with another operation, such as GROUP or FOREACH GENERATE, the whole script succeeds in Java EmbeddedPig. So I think it is the ORDER BY that causes the issue. Does anyone have any experience with this? Any help would be appreciated!
The Pig script:
REGISTER pig-udf-0.0.1-SNAPSHOT.jar;
user_similarity = LOAD '/tmp/sample-sim-score-results-31/part-r-00000' USING PigStorage('\t') AS (user_id: chararray, sim_user_id: chararray, basic_sim_score: float, alt_sim_score: float);
simplified_user_similarity = FOREACH user_similarity GENERATE $0 AS user_id, $1 AS sim_user_id, $2 AS sim_score;
grouped_user_similarity = GROUP simplified_user_similarity BY user_id;
ordered_user_similarity = FOREACH grouped_user_similarity {
    sorted = ORDER simplified_user_similarity BY sim_score DESC;
    top = LIMIT sorted 10;
    GENERATE group, top;
};
top_influencers = FOREACH ordered_user_similarity GENERATE$1, 10);
all_influence_scores = FOREACH top_influencers GENERATE FLATTEN($0);
grouped_influence_scores = GROUP all_influence_scores BY bag_of_topSimUserTuples::user_id;
influence_scores = FOREACH grouped_influence_scores GENERATE group AS user_id, SUM(all_influence_scores.bag_of_topSimUserTuples::points) AS influence_score;
ordered_influence_scores = ORDER influence_scores BY influence_score DESC;
STORE ordered_influence_scores INTO '/tmp/cc-test-results-1' USING PigStorage();
The error log from Pig:
12/04/05 10:00:56 INFO pigstats.ScriptState: Pig script settings are added to the job
12/04/05 10:00:56 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/04/05 10:00:58 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/04/05 10:00:58 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/04/05 10:00:58 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
12/04/05 10:00:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/04/05 10:00:58 INFO input.FileInputFormat: Total input paths to process : 1
12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths to process : 1
12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths (combined) to process : 1
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating tmp-1546565755 in /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134-work-6955502337234509704 with rwxr-xr-x
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:58 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
12/04/05 10:00:58 INFO mapred.TaskRunner: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/pigsample_854728855_1333645258470
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.jar.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.jar.crc
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.split.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.split.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.splitmetainfo.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.splitmetainfo.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.xml.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.xml.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.jar <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.jar
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.split <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.split
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.splitmetainfo <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.splitmetainfo
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.xml <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.xml
12/04/05 10:00:59 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/04/05 10:00:59 INFO mapred.MapTask: io.sort.mb = 100
12/04/05 10:00:59 INFO mapred.MapTask: data buffer = 79691776/99614720
12/04/05 10:00:59 INFO mapred.MapTask: record buffer = 262144/327680
12/04/05 10:00:59 WARN mapred.LocalJobRunner: job_local_0004
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(
at org.apache.hadoop.util.ReflectionUtils.setConf(
at org.apache.hadoop.util.ReflectionUtils.newInstance(
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(
at org.apache.hadoop.mapred.MapTask.runNewMapper(
at org.apache.hadoop.mapred.LocalJobRunner$
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(
... 6 more
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Deleted path /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:59 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0004
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: job job_local_0004 has failed! Stop running all dependent jobs
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: 100% complete
12/04/05 10:01:04 ERROR pigstats.PigStatsUtil: 1 map reduce job(s) failed!
12/04/05 10:01:04 INFO pigstats.PigStats: Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2-cdh3u3 0.8.1-cdh3u3 cchuang 2012-04-05 10:00:34 2012-04-05 10:01:04 GROUP_BY,ORDER_BY
Some jobs have failed! Stop running all dependent jobs
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_local_0001 0 0 0 0 0 0 0 0 all_influence_scores,grouped_user_similarity,simplified_user_similarity,user_similarity GROUP_BY
job_local_0002 0 0 0 0 0 0 0 0 grouped_influence_scores,influence_scores GROUP_BY,COMBINER
job_local_0003 0 0 0 0 0 0 0 0 ordered_influence_scores SAMPLER
Failed Jobs:
JobId Alias Feature Message Outputs
job_local_0004 ordered_influence_scores ORDER_BY Message: Job failed! Error - NA /tmp/cc-test-results-1,
Successfully read 0 records from: "/tmp/sample-sim-score-results-31/part-r-00000"
Failed to produce result in "/tmp/cc-test-results-1"
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local_0001 -> job_local_0002,
job_local_0002 -> job_local_0003,
job_local_0003 -> job_local_0004,
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: Some jobs have failed! Stop running all dependent jobs
Make sure the PIG_HOME environment variable is set to your Pig installation directory.
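Since the script is launched from Java, the check can be done in the launcher before Pig is started. The following is a minimal sketch; the class name PigHomeCheck and the method resolvePigHome are made up for illustration, and whether a missing PIG_HOME is really what breaks the ORDER BY job is the suggestion above, not something this code verifies.

```java
import java.io.File;
import java.util.Map;

public class PigHomeCheck {
    // Returns the PIG_HOME value if it is set and names an existing directory, else null.
    static String resolvePigHome(Map<String, String> env) {
        String home = env.get("PIG_HOME");
        if (home == null || home.isEmpty()) {
            return null;
        }
        return new File(home).isDirectory() ? home : null;
    }

    public static void main(String[] args) {
        String home = resolvePigHome(System.getenv());
        if (home == null) {
            System.err.println("PIG_HOME is not set or does not point to a directory");
        } else {
            System.out.println("Using Pig installation at " + home);
        }
    }
}
```

The environment is passed in as a map so that the check can be exercised without changing the real process environment.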
