RHadoop reduce job failed - hadoop

I am following the RHadoop tutorial at https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md and running the second example, but I am getting errors that I can't resolve.
The code is as follows:
groups = rbinom(32, n = 50, prob = 0.4)
groupsdfs = to.dfs(groups)
mapreduceResult <- mapreduce(
  input = groupsdfs,
  map = function(., v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, sum(vv)))
from.dfs(mapreduceResult)
The map phase is successful, but the reduce tasks fail. Part of the error message is as follows:
14/07/24 11:22:59 INFO mapreduce.Job: map 100% reduce 58%
14/07/24 11:23:01 INFO mapreduce.Job: Task Id : attempt_1406189659246_0001_r_000016_1, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:409)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 9 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeReducer.configure(PipeReducer.java:67)
... 14 more
Caused by: java.io.IOException: Cannot run program "Rscript": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 15 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 16 more
14/07/24 11:23:42 INFO mapreduce.Job: Job job_1406189659246_0001 failed with state FAILED due to: Task failed task_1406189659246_0001_r_000007
Job failed as tasks failed. failedMaps:0 failedReduces:1
14/07/24 11:23:42 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=1631
FILE: Number of bytes written=2036200
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1073
HDFS: Number of bytes written=5198
HDFS: Number of read operations=67
HDFS: Number of large read operations=0
HDFS: Number of write operations=38
Job Counters
Failed map tasks=2
Failed reduce tasks=28
Killed reduce tasks=1
Launched map tasks=4
Launched reduce tasks=48
Other local map tasks=2
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=18216
Total time spent by all reduces in occupied slots (ms)=194311
Total time spent by all map tasks (ms)=18216
Total time spent by all reduce tasks (ms)=194311
Total vcore-seconds taken by all map tasks=18216
Total vcore-seconds taken by all reduce tasks=194311
Total megabyte-seconds taken by all map tasks=18653184
Total megabyte-seconds taken by all reduce tasks=198974464
Map-Reduce Framework
Map input records=3
Map output records=25
Map output bytes=2196
Map output materialized bytes=2266
Input split bytes=214
Combine input records=0
Combine output records=0
Reduce input groups=10
Reduce shuffle bytes=1859
Reduce input records=21
Reduce output records=30
Spilled Records=46
Shuffled Maps =38
Failed Shuffles=0
Merged Map outputs=38
GC time elapsed (ms)=1339
CPU time spent (ms)=40060
Physical memory (bytes) snapshot=5958418432
Virtual memory (bytes) snapshot=33795457024
Total committed heap usage (bytes)=7176978432
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=859
File Output Format Counters
Bytes Written=5198
rmr
reduce calls=10
14/07/24 11:23:42 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Could somebody help? I can't proceed further from here. Thanks.

The problem is solved: R and the rhadoop-related packages need to be installed on all the nodes in the cluster. For rhadoop questions it is better to post in their Google group, https://groups.google.com/forum/#!forum/rhadoop, where you can get hints pretty fast.
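For concreteness, here is a rough sketch of what "installed on all the nodes" can look like in practice. This is not from the original answer: the dependency list is only the typical set of rmr2 prerequisites and the tarball path is a placeholder, so check the rmr2 documentation for your version.
# Run on every node in the cluster (e.g. via ssh), not only the node you submit from.
install.packages(c("Rcpp", "RJSONIO", "digest", "functional",
                   "reshape2", "stringr", "plyr", "caTools"))
# rmr2 itself is not on CRAN; install the tarball from the RevolutionAnalytics repository.
# "/tmp/rmr2.tar.gz" is a placeholder path.
install.packages("/tmp/rmr2.tar.gz", repos = NULL, type = "source")
# The "Cannot run program \"Rscript\"" error above means a reduce task could not find
# Rscript on its node, so Rscript must also be on the PATH seen by the task containers.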

This is a working example of wordcount (running on the Cloudera Sandbox 4.6/5/5.1).
The important part is the initialization at the beginning! ;)
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar")
Sys.setenv(JAVA_HOME="/usr/java/jdk1.7.0_55-cloudera")
Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR="/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/native")
Sys.setenv(HADOOP_OPTS="-Djava.library.path=HADOOP_HOME/lib")
library(rhdfs)
hdfs.init()
library(rmr2)
## space and word delimiter
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text", map = map, reduce = reduce)
}
## variables
hdfs.root <- '/user/node'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
## run mapreduce job
##out <- wordcount(hdfs.data, hdfs.out)
system.time(out <- wordcount(hdfs.data, hdfs.out))
## fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')
##head(results.df)
## sorted output TOP10
head(results.df[order(-results.df$count),],10)

Related

Hive On MR java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter not found

I am running a GROUP BY query on a Kerberos-secured Hadoop cluster using Hive, but it fails with an error. From the logs we can see that the map phase completes, and then I hit this error; sometimes I get a "class not found" error for some other class instead. How do I debug such issues, and what are the possible causes of this kind of problem?
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2023-02-16 19:18:34,771 Stage-1 map = 0%, reduce = 0%
2023-02-16 19:18:54,195 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.47 sec
2023-02-16 19:19:34,108 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 10.47 sec
MapReduce Total cumulative CPU time: 10 seconds 470 msec
Ended Job = job_1676554595644_0001 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1676554595644_0001_m_000000 (and more) from job job_1676554595644_0001
Task with the most failures(4):
-----
Task ID:
task_1676554595644_0001_r_000000
URL:
http://hadoop-namenode.hadoop.com:8088/taskdetails.jsp?jobid=job_1676554595644_0001&tipid=task_1676554595644_0001_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2427)
at org.apache.hadoop.mapred.JobConf.getOutputCommitter(JobConf.java:725)
at org.apache.hadoop.mapred.Task.initialize(Task.java:602)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:332)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2419)
... 8 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2299)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
... 9 more

ACID transactions on data added from Spark not working

I'm trying to use ACID transactions in Hive, but I have a problem when the data is added with Spark.
First, I created a table with the following statement:
CREATE TABLE testdb.test(id string, col1 string)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true');
Then I added data with these queries:
INSERT INTO testdb.test VALUES("1", "A");
INSERT INTO testdb.test VALUES("2", "B");
INSERT INTO testdb.test VALUES("3", "C");
And I've been able to delete rows with this query:
DELETE FROM testdb.test WHERE id="1";
All that worked perfectly, but a problem occurs when I try to delete rows that were added with Spark.
What I do in Spark (IPython):
hc = HiveContext(sc)
data = sc.parallelize([["1", "A"], ["2", "B"], ["3", "C"]])
data_df = hc.createDataFrame(data)
data_df.registerTempTable("data_df")
hc.sql("INSERT INTO testdb.test SELECT * FROM data_df");
Then, when I come back to Hive, I'm able to run a SELECT query on the "test" table.
However, when I try to run the exact same DELETE query as before, I get the following error (it happens after the reduce phase):
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253)
... 7 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:723)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
... 7 more
I have no idea where this is coming from, which is why I'm looking for ideas.
I'm using the Cloudera Quickstart VM (5.4.2).
Hive version: 1.1.0
Spark version: 1.3.0
And here is the complete output of the Hive DELETE command :
hive> delete from testdb.test where id="1";
Query ID = cloudera_20160914090303_795e40b7-ab6a-45b0-8391-6d41d1cfe7bd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1473858545651_0036, Tracking URL =http://quickstart.cloudera:8088/proxy/application_1473858545651_0036/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1473858545651_0036
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 4
2016-09-14 09:03:55,571 Stage-1 map = 0%, reduce = 0%
2016-09-14 09:04:14,898 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.66 sec
2016-09-14 09:04:15,944 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.33 sec
2016-09-14 09:04:44,101 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 4.21 sec
2016-09-14 09:04:46,523 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 4.79 sec
2016-09-14 09:04:47,673 Stage-1 map = 100%, reduce = 42%, Cumulative CPU 5.8 sec
2016-09-14 09:04:50,041 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 7.45 sec
2016-09-14 09:05:18,486 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.69 sec
MapReduce Total cumulative CPU time: 7 seconds 690 msec
Ended Job = job_1473858545651_0036 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://quickstart.cloudera:8088/proxy/application_1473858545651_0036/
Examining task ID: task_1473858545651_0036_m_000000 (and more) from job job_1473858545651_0036
Task with the most failures(4):
-----
Task ID:
task_1473858545651_0036_r_000001
URL:
http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1473858545651_0036&tipid=task_1473858545651_0036_r_000001
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253)
... 7 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:723)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
... 7 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 4 Cumulative CPU: 7.69 sec HDFS Read: 21558 HDFS Write: 114 FAIL
Total MapReduce CPU Time Spent: 7 seconds 690 msec
Thanks!
Use the Spark HiveAcid Datasource - http://github.com/qubole/spark-acid
val df = spark.read.format("HiveAcid").option("table", "testdb.test").load()
df.collect()
Spark needs to run with HMS 3.1.1 so that the underlying data source can take the necessary locks etc.
The directory structure and file formats are different for a Hive ACID table compared with a normal table; CRUD needs to happen from Hive.
With respect to Spark, normal table reads are not compatible with Hive ACID table reads, so we could not use the native Spark APIs to read the table.
Also, there is currently no support for updates, deletes, or inserts from Spark.
As for reading the data, one can use the connector: http://github.com/qubole/spark-acid
I had the same issue running from Hue, but after I set these parameters from the Hive CLI, it started working:
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DBTxnManager;
set hive.compactor.initiator.on=true;

Hive will not write to AWS S3

I have an external table in Hive that is stored on my Hadoop cluster, and I want to move its contents into an external table that is stored on Amazon S3.
So I created an S3-backed table like so:
CREATE EXTERNAL TABLE IF NOT EXISTS export.export_table
like table_to_be_exported
ROW FORMAT SERDE ...
with SERDEPROPERTIES ('fieldDelimiter'='|')
STORED AS TEXTFILE
LOCATION 's3a://bucket/folder';
Then I run: INSERT INTO export.export_table SELECT * FROM table_to_be_exported
It outputs the following:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : Starting Job = job_1435176004514_0028, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1435176004514_0028/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1435176004514_0028
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2015-07-06 09:22:18,379 Stage-1 map = 0%, reduce = 0%
INFO : 2015-07-06 09:22:27,795 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.9 sec
INFO : MapReduce Total cumulative CPU time: 2 seconds 900 msec
INFO : Ended Job = job_1435176004514_0028
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: s3a://bucket/folder/.hive-staging_hive_2015-07-06_09-22-10_351_9216807769834089982-3/-ext-10000 from s3a://bucket/folder/.hive-staging_hive_2015-07-06_09-22-10_351_9216807769834089982-3/-ext-10002
ERROR : Failed with exception Wrong FS: s3a://bucket/folder/.hive-staging_hive_2015-07-06_09-22-10_351_9216807769834089982-3/-ext-10002, expected: hdfs://quickstart.cloudera:8020
java.lang.IllegalArgumentException: Wrong FS: s3a://bucket/folder/.hive-staging_hive_2015-07-06_09-22-10_351_9216807769834089982-3/-ext-10002, expected: hdfs://quickstart.cloudera:8020
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
at org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1916)
at org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1187)
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2449)
at org.apache.hadoop.hive.ql.exec.MoveTask.moveFile(MoveTask.java:105)
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:222)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1638)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1397)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1181)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1047)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1042)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:145)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:70)
at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:209)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask (state=08S01,code=1)
I have the s3a key and secret set in my Hadoop core-site.xml and am able to do reads and writes from S3 using Hadoop directly, e.g. hdfs dfs -ls s3a://.
Any guesses as to what I could do to get this to work?
Try using s3 instead of s3a; my guess is that s3a is not yet supported in EMR's Hive distribution.
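To make that suggestion concrete, here is a hedged sketch that reuses the table and bucket names from the question; whether the s3 (or s3n) scheme works depends on the Hadoop/Hive build, and those schemes may need their own access-key properties in core-site.xml.
-- Repoint the export table at an s3:// location instead of s3a://
ALTER TABLE export.export_table SET LOCATION 's3://bucket/folder';
-- then re-run the export
INSERT INTO export.export_table SELECT * FROM table_to_be_exported;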

DataStax Enterprise 3.2 Hive timeout exception

I tried to run a simple Hive query through DataStax Enterprise, but it always fails with a timeout (on a small data set or even on empty tables). I've got 4 nodes of m1.large on AWS (2x Cassandra & 2x Analytics). See below:
cqlsh:intracker> select count(*) from event_tracks_by_browser_date LIMIT 100000;
count
-------
15030
Then with Hive:
hive> select * from event_tracks_by_browser_date where type_id=10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201312292327_0003, Tracking URL = http://node3:50030/jobdetails.jsp?jobid=job_201312292327_0003
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=10.234.9.204:8012 -kill job_201312292327_0003
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2013-12-30 10:30:27,578 Stage-1 map = 0%, reduce = 0%
2013-12-30 10:31:27,890 Stage-1 map = 0%, reduce = 0%
2013-12-30 10:32:28,137 Stage-1 map = 0%, reduce = 0%
2013-12-30 10:33:12,344 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201312292327_0003 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201312292327_0003_m_000003 (and more) from job job_201312292327_0003
Exception in thread "Thread-10" java.lang.RuntimeException: Error while reading from task log url
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getStackTraces(TaskLogProcessor.java:240)
at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:227)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:92)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://node2:50060/tasklog?taskid=attempt_201312292327_0003_m_000000_1&start=-8193
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
at java.net.URL.openStream(URL.java:1037)
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getStackTraces(TaskLogProcessor.java:192)
... 3 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 2 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I checked system.log and it looks like some kind of timeout occurs:
INFO [IPC Server handler 6 on 8012] 2013-12-30 10:32:24,880 TaskInProgress.java (line 551) Error from attempt_201312292327_0003_m_000001_2: java.io.IOException: java.io.IOEx
ception: java.lang.RuntimeException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:243)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:522)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:197)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Caused by: java.io.IOException: java.lang.RuntimeException
at org.apache.hadoop.hive.cassandra.cql3.input.HiveCqlInputFormat.getRecordReader(HiveCqlInputFormat.java:100)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:240)
... 9 more
Caused by: java.lang.RuntimeException
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:648)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:301)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:167)
at org.apache.hadoop.hive.cassandra.cql3.input.CqlHiveRecordReader.initialize(CqlHiveRecordReader.java:91)
at org.apache.hadoop.hive.cassandra.cql3.input.HiveCqlInputFormat.getRecordReader(HiveCqlInputFormat.java:94)
... 10 more
Caused by: TimedOutException()
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result.read(Cassandra.java:42710)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1724)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1709)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:637)
... 14 more
Cassandra cqlsh works with no problems... any ideas?
Try increasing your replication factor to 2 for Analytics.
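For reference, a minimal sketch of what that looks like in cqlsh, assuming the keyspace from the prompt above (intracker) and the default DSE data center names Cassandra and Analytics; adjust both to whatever nodetool status reports, and run a repair afterwards.
-- run in cqlsh
ALTER KEYSPACE intracker
  WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'Cassandra': 2, 'Analytics': 2};
-- then run: nodetool repair (on the Analytics nodes)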

Hadoop mapper over-consumption of memory (heap)

I wrote a simple hash join program in Hadoop MapReduce. The idea is the following:
A small table is distributed to every mapper using the DistributedCache provided by the Hadoop framework. The large table is distributed over the mappers with a split size of 64M.
The setup code of the mapper creates a hashmap by reading every line of this small table. In the mapper code, every key is looked up (get) in the hashmap, and if it exists it is written out. There is no need for a reducer at this point. This is the code we use:
public class Map extends Mapper<LongWritable, Text, Text, Text> {

    private HashMap<String, String> joinData = new HashMap<String, String>();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String textvalue = value.toString();
        String[] tokens;
        tokens = textvalue.split(",");
        if (tokens.length == 2) {
            String joinValue = joinData.get(tokens[0]);
            if (null != joinValue) {
                context.write(new Text(tokens[0]), new Text(tokens[1] + ","
                        + joinValue));
            }
        }
    }

    public void setup(Context context) {
        try {
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                    .getConfiguration());
            if (null != cacheFiles && cacheFiles.length > 0) {
                String line;
                String[] tokens;
                BufferedReader br = new BufferedReader(new FileReader(
                        cacheFiles[0].toString()));
                try {
                    while ((line = br.readLine()) != null) {
                        tokens = line.split(",");
                        if (tokens.length == 2) {
                            joinData.put(tokens[0], tokens[1]);
                        }
                    }
                    System.exit(0);
                } finally {
                    br.close();
                }
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
While testing this code, our small table was 32M and the large table was 128M, with one master and 2 slave nodes.
This code fails with the above inputs when I have 256M of heap. I use -Xmx256m in mapred.child.java.opts in the mapred-site.xml file. When I increase it to 300m it proceeds very slowly, and with 512m it reaches its maximum throughput.
I don't understand where my mapper is consuming so much memory. With the inputs given above and with the mapper code, I don't expect my heap to ever reach 256M, yet it fails with a Java heap space error.
I would be thankful if you could give some insight into why the mapper is consuming so much memory.
EDIT:
13/03/11 09:37:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/03/11 09:37:33 INFO input.FileInputFormat: Total input paths to process : 1
13/03/11 09:37:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/03/11 09:37:33 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/11 09:37:34 INFO mapred.JobClient: Running job: job_201303110921_0004
13/03/11 09:37:35 INFO mapred.JobClient: map 0% reduce 0%
13/03/11 09:39:12 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000000_0, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:40:43 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_0, Status : FAILED
org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: File /usr/home/hadoop/hadoop-1.0.3/libexec/../logs/userlogs/job_201303110921_0004/attempt_201303110921_0004_m_000001_0/log.tmp already exists
at org.apache.hadoop.io.SecureIOUtils.insecureCreateForWrite(SecureIOUtils.java:130)
at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:157)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
at org.apache.hadoop.mapred.Child$4.run(Child.java:257)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
attempt_201303110921_0004_m_000001_0: Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java heap space
attempt_201303110921_0004_m_000001_0: at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
attempt_201303110921_0004_m_000001_0: at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
attempt_201303110921_0004_m_000001_0: at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
attempt_201303110921_0004_m_000001_0: at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
attempt_201303110921_0004_m_000001_0: at org.apache.hadoop.mapred.Child$3.run(Child.java:141)
attempt_201303110921_0004_m_000001_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201303110921_0004_m_000001_0: log4j:WARN Please initialize the log4j system properly.
13/03/11 09:42:18 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_1, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:43:48 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_2, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:45:09 INFO mapred.JobClient: Job complete: job_201303110921_0004
13/03/11 09:45:09 INFO mapred.JobClient: Counters: 7
13/03/11 09:45:09 INFO mapred.JobClient: Job Counters
13/03/11 09:45:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=468506
13/03/11 09:45:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/11 09:45:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/03/11 09:45:09 INFO mapred.JobClient: Launched map tasks=6
13/03/11 09:45:09 INFO mapred.JobClient: Data-local map tasks=6
13/03/11 09:45:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/03/11 09:45:09 INFO mapred.JobClient: Failed map tasks=1
It's hard to say for sure where the memory consumption is going, but here are a few pointers:
You're creating 2 Text objects for every line of your input. You should instead use 2 Text objects that are initialized once in your Mapper as class variables, and for each line just call text.set(...). This is a common usage pattern in Map/Reduce jobs and can save quite a bit of memory overhead (see the sketch after these pointers).
You should consider using the SequenceFile format for your input, which would avoid the need to parse the lines with textValue.split; you would instead have the data directly available as an array. I've read several times that doing string splits like this can be quite intensive, so you should avoid it as much as possible if memory is really an issue. You can also think about using KeyValueTextInputFormat if, as in your example, you only care about key/value pairs.
If that isn't enough, I would advise looking at this link, especially part 7, which gives you a very simple method to profile your application and see what gets allocated where.
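To illustrate the first pointer, here is a minimal sketch of the mapper from the question with the two per-line Text allocations replaced by reused instances; setup() is unchanged and omitted.
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, Text> {

    private HashMap<String, String> joinData = new HashMap<String, String>();

    // Allocated once per task instead of twice per input line.
    private Text outKey = new Text();
    private Text outValue = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split(",");
        if (tokens.length == 2) {
            String joinValue = joinData.get(tokens[0]);
            if (joinValue != null) {
                outKey.set(tokens[0]);
                outValue.set(tokens[1] + "," + joinValue);
                context.write(outKey, outValue);
            }
        }
    }

    // setup() is the same as in the question: it fills joinData from the DistributedCache file.
}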
