I was reading an article about how small files degrade the performance of Hive queries.
https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1
I understand the first part regarding overloading the NameNode.
However, what the author says about MapReduce does not seem to happen, for either MapReduce or Tez:
When a MapReduce job launches, it schedules one map task per block of
data being processed
I don't see a mapper task created per file. Maybe the reason is that he is referring to version 1 of MapReduce, and much has changed since then.
Hive Version: Hive 1.2.1000.2.6.4.0-91
My table:
create table temp.emp_orc_small_files (id int, name string, salary int)
stored as orcfile;
Data:
The following code will create 100 small files, each containing only a few KB of data.
for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', `shuf -i 1000-5000 -n 1`);";done
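To confirm that the loop really produced one small ORC file per insert, I can list the table's backing directory from the Hive CLI. The warehouse path below is an assumption based on the default HDP layout, not taken from the logs; adjust it to your installation.

-- List the files behind temp.emp_orc_small_files; roughly 100 tiny ORC files are expected.
-- The path /apps/hive/warehouse/temp.db/emp_orc_small_files is assumed (default warehouse layout).
dfs -ls /apps/hive/warehouse/temp.db/emp_orc_small_files;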
However, I see only one mapper and one reducer task being created for the following query.
[root@sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 7.36 s
--------------------------------------------------------------------------------
OK
4989
Time taken: 13.643 seconds, Fetched: 1 row(s)
The same result with MapReduce:
hive> set hive.execution.engine=mr;
hive> select max(salary) from temp.emp_orc_small_files;
Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job -kill job_1536258296893_0259
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-11 20:05:57,213 Stage-1 map = 0%, reduce = 0%
2018-09-11 20:06:04,727 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.37 sec
2018-09-11 20:06:12,189 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.36 sec
MapReduce Total cumulative CPU time: 7 seconds 360 msec
Ended Job = job_1536258296893_0259
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.36 sec HDFS Read: 66478 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 360 msec
OK
4989
This is because the following configuration is taking effect:
hive.hadoop.supports.splittable.combineinputformat
From the documentation:
Whether to combine small input files so that fewer mappers are
spawned.
So essentially Hive can infer that the input is a group of files smaller than the block size and combine them, reducing the required number of mappers.
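If you want to reproduce the one-mapper-per-file behaviour the article describes, you can experiment with turning the combining off. Defaults vary between distributions and Hive versions, so treat the settings below as a sketch rather than a recipe:

-- Variant A (MapReduce engine): stop combining small files into one split,
-- using plain HiveInputFormat instead of the default CombineHiveInputFormat.
set hive.execution.engine=mr;
set hive.hadoop.supports.splittable.combineinputformat=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Variant B (Tez engine): split grouping is governed by these sizes;
-- shrinking them yields more map tasks for the same input.
set hive.execution.engine=tez;
set tez.grouping.min-size=16384;
set tez.grouping.max-size=65536;

-- Then re-run the query and compare the number of map tasks.
select max(salary) from temp.emp_orc_small_files;

Conversely, if the goal is simply to get rid of the small files, an ORC table can be compacted in place with ALTER TABLE temp.emp_orc_small_files CONCATENATE;, which merges the small ORC files into larger ones.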
Here is my workflow. It works well if I have simple SQL such as show tables or drop partition (tried with Tez as well; it fails with a cryptic error message again).
<workflow-app xmlns="uri:oozie:workflow:0.4" name="UDEX-OOZIE POC">
<credentials>
<credential name="HiveCred" type="hcat">
<property>
<name>hcat.metastore.uri</name>
<value>thrift://xxxx.local:9083</value>
</property>
<property>
<name>hcat.metastore.principal</name>
<value>hive/_HOST@xxxx.LOCAL</value>
</property>
</credential>
</credentials>
<start to="IdEduTranCell-pm"/>
<action name="IdEduTranCell-pm" cred="HiveCred">
<hive xmlns="uri:oozie:hive-action:0.5">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${HiveConfigFile}</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<script>${IdEduTranCell_path}</script>
<param>SOURCE_DB_NAME=${SOURCE_DB_NAME}</param>
<param>STRG_DB_NAME=${STRG_DB_NAME}</param>
<param>TABLE_NAME=${TABLE_NAME}</param>
<file>${IdEduTranCell_path}#${IdEduTranCell_path}</file>
<file>${HiveConfigFile}#${HiveConfigFile}</file>
</hive>
<ok to="sub-workflow-end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Sub-workflow failed while loading data into hive tables, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="sub-workflow-end"/>
</workflow-app>
But it fails for a SQL statement. The data is not too large (it fails for even one record), so something is wrong that I can't spot in the log. Please help.
INSERT OVERWRITE TABLE xxx1.fact_tranCell
PARTITION (Timestamp)
select
`(Timestamp)?+.+`, Timestamp as Timestamp
from xxx2.fact_tranCell
order by tranCell_ID,ADMIN_CELL_ID, SITE_ID;
The SQL is not bad; it runs fine on the command line:
Status: Running (Executing on YARN cluster with App id application_1459756606292_15271)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 7 7 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... SUCCEEDED 10 10 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 19.72 s
--------------------------------------------------------------------------------
Loading data to table xxx1.fact_trancell partition (timestamp=null)
Time taken for load dynamic partitions : 496
Loading partition {timestamp=1464012900}
Time taken for adding to write entity : 8
Partition 4g_oss.fact_trancell{timestamp=1464012900} stats: [numFiles=10, numRows=4352, totalSize=9660382, rawDataSize=207776027]
OK
Time taken: 34.595 seconds
--------------------------- LOG ----------------------------
Starting Job = job_1459756606292_15285, Tracking URL = hxxp://xxxx.local:8088/proxy/application_1459756606292_15285/
Kill Command = /usr/bin/hadoop job -kill job_1459756606292_15285
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-05-27 17:32:35,792 Stage-1 map = 0%, reduce = 0%
2016-05-27 17:32:51,692 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.97 sec
2016-05-27 17:33:02,263 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.97 sec
MapReduce Total cumulative CPU time: 14 seconds 970 msec
Ended Job = job_1459756606292_15285
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1459756606292_15286, Tracking URL = hxxp://xxxx.local:8088/proxy/application_1459756606292_15286/
Kill Command = /usr/bin/hadoop job -kill job_1459756606292_15286
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2016-05-27 17:33:16,583 Stage-2 map = 0%, reduce = 0%
2016-05-27 17:33:29,814 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.29 sec
2016-05-27 17:33:45,587 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 38.74 sec
2016-05-27 17:33:53,990 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.29 sec
2016-05-27 17:34:08,662 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 39.27 sec
2016-05-27 17:34:17,061 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.29 sec
2016-05-27 17:34:28,576 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 38.28 sec
2016-05-27 17:34:36,940 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.29 sec
2016-05-27 17:34:48,435 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 38.09 sec
MapReduce Total cumulative CPU time: 38 seconds 90 msec
Ended Job = job_1459756606292_15286 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://xxxx.local:8088/proxy/application_1459756606292_15286/
Examining task ID: task_1459756606292_15286_m_000000 (and more) from job job_1459756606292_15286
Task with the most failures(4):
-----
Task ID:
task_1459756606292_15286_r_000000
URL:
hxxp://xxxx.local:8088/taskdetails.jsp?jobid=job_1459756606292_15286&tipid=task_1459756606292_15286_r_000000
-----
Diagnostic Messages for this Task:
Error: Java heap space
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.97 sec HDFS Read: 87739161 HDFS Write: 16056577 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 38.09 sec HDFS Read: 16056995 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 53 seconds 60 msec
Intercepting System.exit(2)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [2]
I can see in the log that the error occurs while writing the ORC files. Strangely, the ORC files are written fine from the command line!
2016-05-30 11:11:20,377 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 1
2016-05-30 11:11:21,307 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 10
2016-05-30 11:11:21,917 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 100
2016-05-30 11:11:22,420 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 1000
2016-05-30 11:11:24,181 INFO [main] org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 finished. closing...
2016-05-30 11:11:24,181 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: 1 finished. closing...
2016-05-30 11:11:24,181 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 4352
2016-05-30 11:11:33,028 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at org.apache.hadoop.hive.ql.io.orc.OutStream.getNewInputBuffer(OutStream.java:107)
at org.apache.hadoop.hive.ql.io.orc.OutStream.write(OutStream.java:128)
at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerWriterV2.writeDirectValues(RunLengthIntegerWriterV2.java:374)
at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerWriterV2.writeValues(RunLengthIntegerWriterV2.java:182)
at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerWriterV2.write(RunLengthIntegerWriterV2.java:762)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.flushDictionary(WriterImpl.java:1211)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.writeStripe(WriterImpl.java:1132)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.writeStripe(WriterImpl.java:1616)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1996)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2288)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:186)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:952)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
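The heap-space error is raised inside the ORC writer in the reduce task, so one experiment worth trying (this is only a guess, not something confirmed by these logs) is to give the Oozie-launched reducers more heap and/or shrink the ORC writer's memory buffer, for example by adding settings like these at the top of the Hive script the action runs. The values are illustrative, not taken from the failing job's configuration.

-- Container size for reduce tasks (illustrative value).
set mapreduce.reduce.memory.mb=4096;
-- JVM heap inside that container (illustrative value).
set mapreduce.reduce.java.opts=-Xmx3276m;
-- Fraction of the heap the ORC writer may use for buffering (default is 0.5).
set hive.exec.orc.memory.pool=0.3;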
I'm trying out something simple in Hive on HDFS.
The problem is that the queries do not run MapReduce when I use a WHERE clause. However, MapReduce does run for count(*) and even for GROUP BY clauses.
Here are the data and the queries with their results:
Create External Table:
CREATE EXTERNAL TABLE testtab1 (
id STRING, org STRING)
row format delimited
fields terminated by ','
stored as textfile
location '/usr/ankuchak/testtable1';
Simple select * query:
0: jdbc:hive2://> select * from testtab1;
15/07/01 07:32:46 [main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
OK
+---------------+---------------+--+
| testtab1.id | testtab1.org |
+---------------+---------------+--+
| ankur | idc |
| user | idc |
| someone else | ssi |
+---------------+---------------+--+
3 rows selected (2.169 seconds)
Count(*) query
0: jdbc:hive2://> select count(*) from testtab1;
Query ID = ankuchak_20150701073407_e7fd66ae-8812-4e02-87d7-492f81781d15
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
15/07/01 07:34:08 [HiveServer2-Background-Pool: Thread-40]: ERROR mr.ExecDriver: yarn
15/07/01 07:34:08 [HiveServer2-Background-Pool: Thread-40]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1435425589664_0005, Tracking URL = http://slc02khv:8088/proxy/application_1435425589664_0005/
Kill Command = /scratch/hadoop/hadoop/bin/hadoop job -kill job_1435425589664_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
15/07/01 07:34:16 [HiveServer2-Background-Pool: Thread-40]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2015-07-01 07:34:16,291 Stage-1 map = 0%, reduce = 0%
2015-07-01 07:34:23,831 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.04 sec
2015-07-01 07:34:30,102 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.41 sec
MapReduce Total cumulative CPU time: 2 seconds 410 msec
Ended Job = job_1435425589664_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.41 sec HDFS Read: 6607 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 410 msec
OK
+------+--+
| _c0 |
+------+--+
| 3 |
+------+--+
1 row selected (23.527 seconds)
Group by query:
0: jdbc:hive2://> select org, count(id) from testtab1 group by org;
Query ID = ankuchak_20150701073540_5f20df4e-0bd4-4e18-b065-44c2688ce21f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
15/07/01 07:35:40 [HiveServer2-Background-Pool: Thread-63]: ERROR mr.ExecDriver: yarn
15/07/01 07:35:41 [HiveServer2-Background-Pool: Thread-63]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1435425589664_0006, Tracking URL = http://slc02khv:8088/proxy/application_1435425589664_0006/
Kill Command = /scratch/hadoop/hadoop/bin/hadoop job -kill job_1435425589664_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
15/07/01 07:35:47 [HiveServer2-Background-Pool: Thread-63]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2015-07-01 07:35:47,200 Stage-1 map = 0%, reduce = 0%
2015-07-01 07:35:53,494 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2015-07-01 07:36:00,799 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.53 sec
MapReduce Total cumulative CPU time: 2 seconds 530 msec
Ended Job = job_1435425589664_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.53 sec HDFS Read: 7278 HDFS Write: 14 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
+-------+------+--+
| org | _c1 |
+-------+------+--+
| idc | 2 |
| ssi | 1 |
+-------+------+--+
2 rows selected (21.187 seconds)
Now the simple where clause:
0: jdbc:hive2://> select * from testtab1 where org='idc';
OK
+--------------+---------------+--+
| testtab1.id | testtab1.org |
+--------------+---------------+--+
+--------------+---------------+--+
No rows selected (0.11 seconds)
It would be great if you could provide me with some pointers.
Please let me know if you need further information in this regard.
Regards,
Ankur
A map job is occurring in your last query, so it's not that MapReduce is not happening. However, some rows should be returned by that query. The likely culprit is that, for some reason, it is not finding a match on the value "idc". Check your table and ensure that the org value for the ankur and user rows actually contains the string 'idc'.
Try this to see if you get any results:
Select * from testtab1 where org rlike '.*(idc).*';
or
Select * from testtab1 where org like '%idc%';
These will grab any row that has a value containing the string 'idc'. Good luck!
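If the LIKE/RLIKE variants do return rows while the exact match does not, the column values most likely carry hidden whitespace or a case difference. A quick, hedged way to check, using standard Hive string functions (nothing specific to this table):

-- Expose hidden whitespace and length differences in the org column.
Select org, length(org), concat('[', org, ']') from testtab1;
-- Normalise before comparing.
Select * from testtab1 where trim(lower(org)) = 'idc';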
Details of the same error, which was fixed recently, are described here. Try verifying the version you are using.
I have a question about running a Hadoop MapReduce job. I have a table staff, partitioned by join date.
The create statement looks like this:
create table staff (id int, age int) partitioned by (join_date string) row format delimited fields terminated by '\;';
I put some data into partition '20130921', and when I execute the statement below, the result is OK:
select count(*) from staff where join_date='20130921';
But when I execute it on partition '20130922' (a partition without data), the MapReduce job stays pending for a very long time; it seems to run forever:
hive> select count(*) from staff where join_date='20130922';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201309231116_0131, Tracking URL = ....jobid=job_201309231116_0131
Kill Command = /u01/hadoop-0.20.203.0/bin/../bin/hadoop job -kill job_201309231116_0131
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2013-09-23 17:19:07,182 Stage-1 map = 0%, reduce = 0%
2013-09-23 17:19:07,182 Stage-1 map = 0%, reduce = 0%
2013-09-23 17:19:07,182 Stage-1 map = 0%, reduce = 0%
The JobTracker shows the reduce task as pending, and the job never seems to finish.
I'm using hadoop-0.20.203.0 and hive-0.10.0. I googled all day but didn't find any topic describing the same problem. Please help me.
Best regards.
This seems to be a problem with your Hive installation. I came across a similar problem; restarting the Hive Server and the Hive Metastore fixed it for me, so you could try that.
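Before or after restarting the services, it may also be worth confirming that the partition really exists and is simply empty. A hedged set of checks follows; the warehouse path is an assumption based on the default layout, not taken from your setup.

-- Confirm the partition is registered in the metastore.
show partitions staff;
-- Show its metadata, including its HDFS location.
describe formatted staff partition (join_date='20130922');
-- List the directory from the Hive CLI; the path assumes the default warehouse layout.
dfs -ls /user/hive/warehouse/staff/join_date=20130922;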
Hive needs to process 45 files, each about 1 GB in size. After the mappers finished at 100%, Hive failed with the error shown in the log below.
Driver returned: 1. Errors: OK
Hive history file=/tmp/hue/hive_job_log_hue_201308221004_1738621649.txt
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1376898282169_0441, Tracking URL = http://SH02SVR2882.hadoop.sh2.ctripcorp.com:8088/proxy/application_1376898282169_0441/
Kill Command = //usr/lib/hadoop/bin/hadoop job -kill job_1376898282169_0441
Hadoop job information for Stage-1: number of mappers: 236; number of reducers: 0
2013-08-22 10:04:40,205 Stage-1 map = 0%, reduce = 0%
2013-08-22 10:05:07,486 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 121.28 sec
.......................
2013-08-22 10:09:18,625 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7707.18 sec
MapReduce Total cumulative CPU time: 0 days 2 hours 8 minutes 27 seconds 180 msec
Ended Job = job_1376898282169_0441
Ended Job = -541447549, job is filtered out (removed at runtime).
Ended Job = -1652692814, job is filtered out (removed at runtime).
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job Submission failed with exception
'java.io.IOException(Max block location exceeded for split: Paths:/tmp/hive-beeswax-logging/hive_2013-08-22_10-04-32_755_6427103839442439579/-ext-10001/000009_0:0+28909,....,/tmp/hive-beeswax-logging/hive_2013-08-22_10-04-32_755_6427103839442439579/-ext-10001/000218_0:0+45856
Locations:10.8.75.17:...:10.8.75.20:; InputFormatClass: org.apache.hadoop.mapred.TextInputFormat
splitsize: 45 maxsize: 10)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 236 Cumulative CPU: 7707.18 sec HDFS Read: 63319449229 HDFS Write: 8603165 SUCCESS
Total MapReduce CPU Time Spent: 0 days 2 hours 8 minutes 27 seconds 180 msec
But I didn't set the maxsize. I executed the query many times and got the same error. I tried adding the mapreduce.jobtracker.split.metainfo.maxsize property for Hive, but in that case Hive failed without doing any map work.
Set mapreduce.job.max.split.locations to a value greater than 45.
In our situation this solved the problem.
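For example, in the Hive session before re-running the failing query (60 is just an illustrative value; anything comfortably above the 45 block locations reported in the exception should do):

-- Raise the per-split block-location limit above the 45 reported in the exception.
set mapreduce.job.max.split.locations=60;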
I set up a test integration of Cassandra + Pig/Hadoop. 8 nodes are Cassandra + TaskTracker nodes, 1 node is the JobTracker/NameNode.
I fired up the Cassandra client and created the simple bit of data listed in the README.txt in the Cassandra distribution:
[default@unknown] create keyspace Keyspace1;
[default@unknown] use Keyspace1;
[default@Keyspace1] create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
[default@KS1] set Users[jsmith][first] = 'John';
[default@KS1] set Users[jsmith][last] = 'Smith';
[default@KS1] set Users[jsmith][age] = long(42)
Then I ran the sample pig query listed in CASSANDRA_HOME (using pig_cassandra):
grunt> rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
grunt> cols = FOREACH rows GENERATE flatten(columns);
grunt> colnames = FOREACH cols GENERATE $0;
grunt> namegroups = GROUP colnames BY (chararray) $0;
grunt> namecounts = FOREACH namegroups GENERATE COUNT($1), group;
grunt> orderednames = ORDER namecounts BY $0;
grunt> topnames = LIMIT orderednames 50;
grunt> dump topnames;
It took about 3 minutes to complete.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.0 0.9.1 root 2012-01-12 22:16:53 2012-01-12 22:20:22 GROUP_BY,ORDER_BY,LIMIT
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201201121817_0010 8 1 12 6 9 21 21 21 colnames,cols,namecounts,namegroups,rows GROUP_BY,COMBINER
job_201201121817_0011 1 1 6 6 6 15 15 15 orderednames SAMPLER
job_201201121817_0012 1 1 9 9 9 15 15 15 orderednames ORDER_BY,COMBINER hdfs://xxxx/tmp/temp-744158198/tmp-1598279340,
Input(s):
Successfully read 1 records (3232 bytes) from: "cassandra://Keyspace1/Users"
Output(s):
Successfully stored 3 records (63 bytes) in: "hdfs://xxxx/tmp/temp-744158198/tmp-1598279340"
Counters:
Total records written : 3
Total bytes written : 63
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
There were no errors or warnings in the logging.
Is this normal, or is there something wrong?
Yes, this is normal: running a MapReduce job on Hadoop usually takes about a minute just for startup, and Pig generates multiple MapReduce jobs depending on the complexity of the script.