Is throughput in the mapreduce metrics in MB or Mb - hadoop

After running TestDFSIO I got the following metrics:
2019-04-30 09:50:35,790 INFO fs.TestDFSIO: Date & time: Tue Apr 30 09:50:35 EDT 2019
2019-04-30 09:50:35,791 INFO fs.TestDFSIO: Number of files: 100
2019-04-30 09:50:35,791 INFO fs.TestDFSIO: Total MBytes processed: 10000
2019-04-30 09:50:35,791 INFO fs.TestDFSIO: Throughput mb/sec: 376.9
2019-04-30 09:50:35,791 INFO fs.TestDFSIO: Average IO rate mb/sec: 387.16
2019-04-30 09:50:35,791 INFO fs.TestDFSIO: IO rate std deviation: 60.42
2019-04-30 09:50:35,791 INFO fs.TestDFSIO: Test exec time sec: 115.21
Is "Average IO rate mb/sec" in megabytes or megabits?

TestDFSIO is a useful tool, but the only available documentation is its source code.
Looking at the code in TestDFSIO.java, the throughput appears to be expressed in mebibytes per second.
The source code shows how the throughput is computed:
" Throughput mb/sec: " + df.format(toMB(size) / msToSecs(time)),
The function toMB() returns the number of bytes divided by MEGA:
static float toMB(long bytes) {
return ((float)bytes)/MEGA;
}
MEGA is in turn the constant 0x100000L, that is, the integer 1048576 = 1024 * 1024.
From the code:
private static final long MEGA = ByteMultiple.MB.value();
and
enum ByteMultiple {
B(1L),
KB(0x400L),
MB(0x100000L),
GB(0x40000000L),
TB(0x10000000000L);
...
So the throughput is expressed in mebibytes per second (MiB/sec), not in megabytes per second (MB/sec). Either way, the unit is bytes, not bits.
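To make the difference between the units concrete, here is a minimal sketch that converts the reported 376.9 figure, read as MiB/sec, into SI megabytes and megabits per second. The function name and the dictionary keys are illustrative and not part of TestDFSIO.

```python
# Bytes per mebibyte: the same value as ByteMultiple.MB (0x100000) in TestDFSIO.
MEBI = 1024 * 1024
# Bytes per SI megabyte.
MEGA_SI = 1000 * 1000


def mib_per_sec_to_units(mib_per_sec):
    """Convert a rate in MiB/sec to MB/sec and Mbit/sec (SI units)."""
    bytes_per_sec = mib_per_sec * MEBI
    return {
        "MiB/s": mib_per_sec,
        "MB/s": bytes_per_sec / MEGA_SI,
        "Mbit/s": bytes_per_sec * 8 / MEGA_SI,
    }


if __name__ == "__main__":
    # 376.9 is the "Throughput mb/sec" value from the log in the question.
    for unit, value in mib_per_sec_to_units(376.9).items():
        print(f"{unit}: {value:.1f}")
```

The gap between MiB/sec and MB/sec is just under 5%, while reading the figure as megabits would be off by a factor of eight.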

Related

How to achieve high concurrency in Spring Boot

I have a requirement to create a product that supports 40 concurrent users per second (I am new to working with concurrency).
As a first step, I developed a hello-world Spring Boot project with:
Spring Boot 1.5.9
Jetty 9.4.15
a REST controller with a single GET endpoint
The code is below:
@GetMapping
public String index() {
    return "Greetings from Spring Boot!";
}
The app runs on a DL360 Gen10 machine.
I then benchmarked it with ApacheBench (ab).
75 concurrent users:
ab -t 120 -n 1000000 -c 75 http://10.93.243.87:9000/home/
Server Software:
Server Hostname: 10.93.243.87
Server Port: 9000
Document Path: /home/
Document Length: 27 bytes
Concurrency Level: 75
Time taken for tests: 37.184 seconds
Complete requests: 1000000
Failed requests: 0
Write errors: 0
Total transferred: 143000000 bytes
HTML transferred: 27000000 bytes
Requests per second: 26893.28 [#/sec] (mean)
Time per request: 2.789 [ms] (mean)
Time per request: 0.037 [ms] (mean, across all concurrent requests)
Transfer rate: 3755.61 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 23.5 0 3006
Processing: 0 2 7.8 1 404
Waiting: 0 2 7.8 1 404
Total: 0 3 24.9 2 3007
100 concurrent users:
ab -t 120 -n 1000000 -c 100 http://10.93.243.87:9000/home/
Server Software:
Server Hostname: 10.93.243.87
Server Port: 9000
Document Path: /home/
Document Length: 27 bytes
Concurrency Level: 100
Time taken for tests: 36.708 seconds
Complete requests: 1000000
Failed requests: 0
Write errors: 0
Total transferred: 143000000 bytes
HTML transferred: 27000000 bytes
Requests per second: 27241.77 [#/sec] (mean)
Time per request: 3.671 [ms] (mean)
Time per request: 0.037 [ms] (mean, across all concurrent requests)
Transfer rate: 3804.27 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 2 35.7 1 3007
Processing: 0 2 9.4 1 405
Waiting: 0 2 9.4 1 405
Total: 0 4 37.0 2 3009
500 concurrent users:
ab -t 120 -n 1000000 -c 500 http://10.93.243.87:9000/home/
Server Software:
Server Hostname: 10.93.243.87
Server Port: 9000
Document Path: /home/
Document Length: 27 bytes
Concurrency Level: 500
Time taken for tests: 36.222 seconds
Complete requests: 1000000
Failed requests: 0
Write errors: 0
Total transferred: 143000000 bytes
HTML transferred: 27000000 bytes
Requests per second: 27607.83 [#/sec] (mean)
Time per request: 18.111 [ms] (mean)
Time per request: 0.036 [ms] (mean, across all concurrent requests)
Transfer rate: 3855.39 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 14 126.2 1 7015
Processing: 0 4 22.3 1 811
Waiting: 0 3 22.3 1 810
Total: 0 18 129.2 2 7018
1000 concurrent users:
ab -t 120 -n 1000000 -c 1000 http://10.93.243.87:9000/home/
Server Software:
Server Hostname: 10.93.243.87
Server Port: 9000
Document Path: /home/
Document Length: 27 bytes
Concurrency Level: 1000
Time taken for tests: 36.534 seconds
Complete requests: 1000000
Failed requests: 0
Write errors: 0
Total transferred: 143000000 bytes
HTML transferred: 27000000 bytes
Requests per second: 27372.09 [#/sec] (mean)
Time per request: 36.534 [ms] (mean)
Time per request: 0.037 [ms] (mean, across all concurrent requests)
Transfer rate: 3822.47 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 30 190.8 1 7015
Processing: 0 6 31.4 2 1613
Waiting: 0 5 31.4 1 1613
Total: 0 36 195.5 2 7018
In the test runs above, I achieved ~27K requests per second with just 75 users, but increasing the number of users also increases the latency. The connect time clearly increases as well.
My application is required to support 40k concurrent users (assume each uses a separate browser), and each request should finish within 250 milliseconds.
Please help me with this.
I am not a grand wizard on this topic myself, but here is some advice:
There is a hard limit on how many requests one instance can handle, so to support a lot of users you need more instances.
If you work with multiple instances, you have to distribute the requests among them somehow. One popular solution is Netflix Eureka.
If you don't want to maintain additional resources and the product will run in the cloud, use the provider's load-balancing service (e.g. a load balancer on AWS).
You can also fine-tune your server's connection pool settings.
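The ab figures in the question follow Little's law (requests in flight = throughput × latency): once throughput saturates at about 27K requests per second, the mean time per request grows in proportion to the concurrency level. The sketch below reproduces ab's "Time per request (mean)" from the reported numbers and estimates the throughput needed for the stated target. The function names are illustrative, and the estimate assumes the worst case where each of the 40k users always has one request in flight; real users with think time need less.

```python
def mean_latency_ms(concurrency, requests_per_sec):
    """Little's law: mean time per request for a given concurrency and throughput."""
    return concurrency / requests_per_sec * 1000.0


def required_throughput(concurrent_users, max_latency_ms):
    """Requests/sec needed if every user always has one request in flight (assumption)."""
    return concurrent_users / (max_latency_ms / 1000.0)


# (concurrency level, requests per second) taken from the ab runs in the question.
AB_RUNS = [(75, 26893.28), (100, 27241.77), (500, 27607.83), (1000, 27372.09)]

if __name__ == "__main__":
    for concurrency, rps in AB_RUNS:
        print(f"c={concurrency}: {mean_latency_ms(concurrency, rps):.3f} ms per request")
    print(f"needed for 40k users at 250 ms: {required_throughput(40000, 250):.0f} req/s")
```

Under that worst-case assumption the target is several times the throughput measured on a single instance, which is consistent with the advice to run more instances behind a load balancer.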

Performance issues of small files on Hive

I was reading an article on how small files degrade the performance of Hive queries.
https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1
I understand the first part regarding overloading the NameNode.
However, what the author says about MapReduce does not seem to happen, with either MapReduce or Tez:
When a MapReduce job launches, it schedules one map task per block of
data being processed
I don't see a mapper task created per file. Maybe the reason is that he is referring to version 1 of MapReduce, and much has changed since then.
Hive Version: Hive 1.2.1000.2.6.4.0-91
My table:
create table temp.emp_orc_small_files (id int, name string, salary int)
stored as orcfile;
Data:
The following command creates 100 small files, each containing only a few KB of data.
for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', `shuf -i 1000-5000 -n 1`);";done
However, I see only one mapper task and one reducer task created for the following query.
[root@sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 7.36 s
--------------------------------------------------------------------------------
OK
4989
Time taken: 13.643 seconds, Fetched: 1 row(s)
Same result with map-reduce.
hive> set hive.execution.engine=mr;
hive> select max(salary) from temp.emp_orc_small_files;
Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job -kill job_1536258296893_0259
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-11 20:05:57,213 Stage-1 map = 0%, reduce = 0%
2018-09-11 20:06:04,727 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.37 sec
2018-09-11 20:06:12,189 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.36 sec
MapReduce Total cumulative CPU time: 7 seconds 360 msec
Ended Job = job_1536258296893_0259
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.36 sec HDFS Read: 66478 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 360 msec
OK
4989
This is because the following configuration is taking effect:
hive.hadoop.supports.splittable.combineinputformat
From the documentation:
Whether to combine small input files so that fewer mappers are
spawned.
So, essentially, Hive can infer that the input is a group of files smaller than the block size and combine them, reducing the number of mappers required.
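The effect of combining can be illustrated with a simplified greedy packing of files into splits. This is only a sketch of the idea and not Hive's or Hadoop's actual algorithm, which also takes node and rack locality into account. The function name, the 4 KiB file size, and the 256 MiB split limit are all hypothetical.

```python
def combine_into_splits(file_sizes, max_split_bytes):
    """Greedily pack consecutive files into splits no larger than max_split_bytes.

    Returns a list of splits, each a list of file indexes. A file that is larger
    than the limit still gets a split of its own.
    """
    splits = []
    current, current_bytes = [], 0
    for index, size in enumerate(file_sizes):
        # Close the current split when adding this file would exceed the limit.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    small_files = [4096] * 100  # hypothetical: 100 files of 4 KiB each
    combined = combine_into_splits(small_files, 256 * 1024 * 1024)
    per_file = combine_into_splits(small_files, 4096)
    print(len(combined), "split(s) when combining,", len(per_file), "without")
```

With one mapper per split, the 100 small files fit in a single split, which matches the single map task seen in the question.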

Hive action fails on Oozie, while works well on hive commandline

Here is my workflow. It works well with simple SQL such as SHOW TABLES or DROP PARTITION. (I tried with Tez as well; it fails again with a cryptic error message.)
<workflow-app xmlns="uri:oozie:workflow:0.4" name="UDEX-OOZIE POC">
<credentials>
<credential name="HiveCred" type="hcat">
<property>
<name>hcat.metastore.uri</name>
<value>thrift://xxxx.local:9083</value>
</property>
<property>
<name>hcat.metastore.principal</name>
<value>hive/_HOST@xxxx.LOCAL</value>
</property>
</credential>
</credentials>
<start to="IdEduTranCell-pm"/>
<action name="IdEduTranCell-pm" cred="HiveCred">
<hive xmlns="uri:oozie:hive-action:0.5">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${HiveConfigFile}</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<script>${IdEduTranCell_path}</script>
<param>SOURCE_DB_NAME=${SOURCE_DB_NAME}</param>
<param>STRG_DB_NAME=${STRG_DB_NAME}</param>
<param>TABLE_NAME=${TABLE_NAME}</param>
<file>${IdEduTranCell_path}#${IdEduTranCell_path}</file>
<file>${HiveConfigFile}#${HiveConfigFile}</file>
</hive>
<ok to="sub-workflow-end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Sub-workflow failed while loading data into hive tables, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="sub-workflow-end"/>
</workflow-app>
But it fails for the following SQL. The data is not large (it fails even for 1 record), so something is wrong that I can't spot in the log. Please help.
INSERT OVERWRITE TABLE xxx1.fact_tranCell
PARTITION (Timestamp)
select
`(Timestamp)?+.+`, Timestamp as Timestamp
from xxx2.fact_tranCell
order by tranCell_ID,ADMIN_CELL_ID, SITE_ID;
The SQL is not the problem; it runs fine on the command line:
Status: Running (Executing on YARN cluster with App id application_1459756606292_15271)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 7 7 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... SUCCEEDED 10 10 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 19.72 s
--------------------------------------------------------------------------------
Loading data to table xxx1.fact_trancell partition (timestamp=null)
Time taken for load dynamic partitions : 496
Loading partition {timestamp=1464012900}
Time taken for adding to write entity : 8
Partition 4g_oss.fact_trancell{timestamp=1464012900} stats: [numFiles=10, numRows=4352, totalSize=9660382, rawDataSize=207776027]
OK
Time taken: 34.595 seconds
--------------------------- LOG ----------------------------
Starting Job = job_1459756606292_15285, Tracking URL = hxxp://xxxx.local:8088/proxy/application_1459756606292_15285/
Kill Command = /usr/bin/hadoop job -kill job_1459756606292_15285
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-05-27 17:32:35,792 Stage-1 map = 0%, reduce = 0%
2016-05-27 17:32:51,692 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.97 sec
2016-05-27 17:33:02,263 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.97 sec
MapReduce Total cumulative CPU time: 14 seconds 970 msec
Ended Job = job_1459756606292_15285
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1459756606292_15286, Tracking URL = hxxp://xxxx.local:8088/proxy/application_1459756606292_15286/
Kill Command = /usr/bin/hadoop job -kill job_1459756606292_15286
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2016-05-27 17:33:16,583 Stage-2 map = 0%, reduce = 0%
2016-05-27 17:33:29,814 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.29 sec
2016-05-27 17:33:45,587 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 38.74 sec
2016-05-27 17:33:53,990 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.29 sec
2016-05-27 17:34:08,662 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 39.27 sec
2016-05-27 17:34:17,061 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.29 sec
2016-05-27 17:34:28,576 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 38.28 sec
2016-05-27 17:34:36,940 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.29 sec
2016-05-27 17:34:48,435 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 38.09 sec
MapReduce Total cumulative CPU time: 38 seconds 90 msec
Ended Job = job_1459756606292_15286 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://xxxx.local:8088/proxy/application_1459756606292_15286/
Examining task ID: task_1459756606292_15286_m_000000 (and more) from job job_1459756606292_15286
Task with the most failures(4):
-----
Task ID:
task_1459756606292_15286_r_000000
URL:
hxxp://xxxx.local:8088/taskdetails.jsp?jobid=job_1459756606292_15286&tipid=task_1459756606292_15286_r_000000
-----
Diagnostic Messages for this Task:
Error: Java heap space
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.97 sec HDFS Read: 87739161 HDFS Write: 16056577 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 38.09 sec HDFS Read: 16056995 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 53 seconds 60 msec
Intercepting System.exit(2)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [2]
I can see in the log that the error occurs while writing the ORC files. Strangely, the ORC files are written fine from the command line.
2016-05-30 11:11:20,377 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 1
2016-05-30 11:11:21,307 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 10
2016-05-30 11:11:21,917 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 100
2016-05-30 11:11:22,420 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 1000
2016-05-30 11:11:24,181 INFO [main] org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 finished. closing...
2016-05-30 11:11:24,181 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: 1 finished. closing...
2016-05-30 11:11:24,181 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 4352
2016-05-30 11:11:33,028 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at org.apache.hadoop.hive.ql.io.orc.OutStream.getNewInputBuffer(OutStream.java:107)
at org.apache.hadoop.hive.ql.io.orc.OutStream.write(OutStream.java:128)
at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerWriterV2.writeDirectValues(RunLengthIntegerWriterV2.java:374)
at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerWriterV2.writeValues(RunLengthIntegerWriterV2.java:182)
at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerWriterV2.write(RunLengthIntegerWriterV2.java:762)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.flushDictionary(WriterImpl.java:1211)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.writeStripe(WriterImpl.java:1132)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.writeStripe(WriterImpl.java:1616)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1996)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2288)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:186)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:952)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Hive not running Map Reduce with "where" clause

I'm trying out something simple in Hive on HDFS.
The problem is that queries with a "where" clause do not run MapReduce. However, MapReduce does run for count(*) and even for group by clauses.
Here's data and queries with result:
Create External Table:
CREATE EXTERNAL TABLE testtab1 (
id STRING, org STRING)
row format delimited
fields terminated by ','
stored as textfile
location '/usr/ankuchak/testtable1';
Simple select * query:
0: jdbc:hive2://> select * from testtab1;
15/07/01 07:32:46 [main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
OK
+---------------+---------------+--+
| testtab1.id | testtab1.org |
+---------------+---------------+--+
| ankur | idc |
| user | idc |
| someone else | ssi |
+---------------+---------------+--+
3 rows selected (2.169 seconds)
Count(*) query
0: jdbc:hive2://> select count(*) from testtab1;
Query ID = ankuchak_20150701073407_e7fd66ae-8812-4e02-87d7-492f81781d15
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
15/07/01 07:34:08 [HiveServer2-Background-Pool: Thread-40]: ERROR mr.ExecDriver: yarn
15/07/01 07:34:08 [HiveServer2-Background-Pool: Thread-40]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1435425589664_0005, Tracking URL = http://slc02khv:8088/proxy/application_1435425589664_0005/
Kill Command = /scratch/hadoop/hadoop/bin/hadoop job -kill job_1435425589664_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
15/07/01 07:34:16 [HiveServer2-Background-Pool: Thread-40]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2015-07-01 07:34:16,291 Stage-1 map = 0%, reduce = 0%
2015-07-01 07:34:23,831 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.04 sec
2015-07-01 07:34:30,102 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.41 sec
MapReduce Total cumulative CPU time: 2 seconds 410 msec
Ended Job = job_1435425589664_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.41 sec HDFS Read: 6607 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 410 msec
OK
+------+--+
| _c0 |
+------+--+
| 3 |
+------+--+
1 row selected (23.527 seconds)
Group by query:
0: jdbc:hive2://> select org, count(id) from testtab1 group by org;
Query ID = ankuchak_20150701073540_5f20df4e-0bd4-4e18-b065-44c2688ce21f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
15/07/01 07:35:40 [HiveServer2-Background-Pool: Thread-63]: ERROR mr.ExecDriver: yarn
15/07/01 07:35:41 [HiveServer2-Background-Pool: Thread-63]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1435425589664_0006, Tracking URL = http://slc02khv:8088/proxy/application_1435425589664_0006/
Kill Command = /scratch/hadoop/hadoop/bin/hadoop job -kill job_1435425589664_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
15/07/01 07:35:47 [HiveServer2-Background-Pool: Thread-63]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2015-07-01 07:35:47,200 Stage-1 map = 0%, reduce = 0%
2015-07-01 07:35:53,494 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2015-07-01 07:36:00,799 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.53 sec
MapReduce Total cumulative CPU time: 2 seconds 530 msec
Ended Job = job_1435425589664_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.53 sec HDFS Read: 7278 HDFS Write: 14 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
+-------+------+--+
| org | _c1 |
+-------+------+--+
| idc | 2 |
| ssi | 1 |
+-------+------+--+
2 rows selected (21.187 seconds)
Now the simple where clause:
0: jdbc:hive2://> select * from testtab1 where org='idc';
OK
+--------------+---------------+--+
| testtab1.id | testtab1.org |
+--------------+---------------+--+
+--------------+---------------+--+
No rows selected (0.11 seconds)
It would be great if you could provide me with some pointers.
Please let me know if you need further information in this regard.
Regards,
Ankur
A map job is occurring in your last query, so it's not that MapReduce is not happening. However, some rows should be returned by your last query. The likely culprit is that, for some reason, no match is found on the value "idc". Check your table and ensure that the org values in the ankur and user rows contain the string idc.
Try this to see if you get any results:
Select * from testtab1 where org rlike '.*(idc).*';
or
Select * from testtab1 where org like '%idc%';
These will grab any row that has a value containing the string 'idc'. Good luck!
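One way an equality test can fail while a substring match succeeds is stray whitespace around the stored value, which a formatted result table would hide. The sketch below illustrates that case with plain Python strings. The padded values are hypothetical and are not taken from the actual table, and the helper names are illustrative.

```python
# Hypothetical rows: the org values carry stray spaces that a formatted
# result table would not reveal.
ROWS = [("ankur", " idc"), ("user", "idc "), ("someone else", "ssi")]


def exact_match(rows, value):
    """Behaves like: where org = 'idc'."""
    return [row for row in rows if row[1] == value]


def contains_match(rows, value):
    """Behaves like: where org like '%idc%'."""
    return [row for row in rows if value in row[1]]


def trimmed_match(rows, value):
    """Behaves like: where trim(org) = 'idc'."""
    return [row for row in rows if row[1].strip() == value]


if __name__ == "__main__":
    print("exact:   ", exact_match(ROWS, "idc"))
    print("contains:", contains_match(ROWS, "idc"))
    print("trimmed: ", trimmed_match(ROWS, "idc"))
```

If the substring queries above return rows while the equality query does not, comparing on a trimmed value is a quick way to confirm this kind of cause.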
This is the same error that was reported and fixed recently. Try verifying the version you are using.

Node.js slower than Apache

I am comparing performance of Node.js (0.5.1-pre) vs Apache (2.2.17) for a very simple scenario - serving a text file.
Here's the code I use for node server:
var http = require('http')
, fs = require('fs')
fs.readFile('/var/www/README.txt',
function(err, data) {
http.createServer(function(req, res) {
res.writeHead(200, {'Content-Type': 'text/plain'})
res.end(data)
}).listen(8080, '127.0.0.1')
}
)
For Apache I am just using the default configuration that ships with Ubuntu 11.04.
When running Apache Bench with the following parameters against Apache
ab -n10000 -c100 http://127.0.0.1/README.txt
I get the following runtimes:
Time taken for tests: 1.083 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 27630000 bytes
HTML transferred: 24830000 bytes
Requests per second: 9229.38 [#/sec] (mean)
Time per request: 10.835 [ms] (mean)
Time per request: 0.108 [ms] (mean, across all concurrent requests)
Transfer rate: 24903.11 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.8 0 9
Processing: 5 10 2.0 10 23
Waiting: 4 10 1.9 10 21
Total: 6 11 2.1 10 23
Percentage of the requests served within a certain time (ms)
50% 10
66% 11
75% 11
80% 11
90% 14
95% 15
98% 18
99% 19
100% 23 (longest request)
When running Apache bench against node instance, these are the runtimes:
Time taken for tests: 1.712 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 25470000 bytes
HTML transferred: 24830000 bytes
Requests per second: 5840.83 [#/sec] (mean)
Time per request: 17.121 [ms] (mean)
Time per request: 0.171 [ms] (mean, across all concurrent requests)
Transfer rate: 14527.94 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.9 0 8
Processing: 0 17 8.8 16 53
Waiting: 0 17 8.6 16 48
Total: 1 17 8.7 17 53
Percentage of the requests served within a certain time (ms)
50% 17
66% 21
75% 23
80% 25
90% 28
95% 31
98% 35
99% 38
100% 53 (longest request)
This is clearly slower than Apache, which is especially surprising considering that Apache is doing a lot of other work, such as logging.
Am I doing it wrong? Or is Node.js really slower in this scenario?
Edit 1: I do notice that node's concurrency is better: when I increase the number of simultaneous requests to 1000, Apache starts dropping a few of them, while node works fine with no connections dropped.
Dynamic requests
node.js is very good at handling a lot of small dynamic requests (which can be hanging/long-polling), but it is not good at handling large buffers. Ryan Dahl (the author of node.js) explained this in one of his presentations. I recommend studying these slides. I also watched the talk online somewhere.
Garbage Collector
As you can see from slide 13 of 45, it is bad at big buffers.
Slide 15 of 45:
V8 has a generational garbage
collector. Moves objects around
randomly. Node can’t get a pointer to
raw string data to write to socket.
Use Buffer
Slide 16 of 45:
Using Node’s new Buffer object, the
results change.
Still not as good as, for example, nginx, but a lot better. These slides are also pretty old, so Ryan has probably improved this since.
CDN
Still, I don't think you should be using node.js to host static files. You are probably better off hosting them on a CDN, which is optimized for serving static files. Wikipedia lists some popular CDNs (some of them even free).
NGinx(+Memcached)
If you don't want to use a CDN to host your static files, I recommend using Nginx with memcached instead, which is very fast.
In this scenario Apache is probably using sendfile, which results in the kernel sending a chunk of in-memory data (cached by the filesystem driver) directly to the socket. In the case of node there is some overhead in copying data in userspace between V8, libeio and the kernel (see this great article on using sendfile in node).
There are plenty of possible scenarios where node will outperform Apache, such as 'send a stream of data at a constant slow speed to as many TCP connections as possible'.
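The two data paths can be sketched in Python, used here only for illustration since the servers in the question are Apache and Node.js. One function reads the file into a userspace buffer and writes it to the socket; the other calls socket.sendfile(), which uses os.sendfile() where the platform supports it and otherwise falls back to plain send(). The function names, the temporary file, and the local socket pair are assumptions made for the demo, and no timing is claimed.

```python
import os
import socket
import tempfile


def send_with_userspace_copy(path, sock):
    """Read the whole file into a bytes object, then write it to the socket."""
    with open(path, "rb") as f:
        data = f.read()
    sock.sendall(data)
    return len(data)


def send_with_sendfile(path, sock):
    """Hand the file object to socket.sendfile(), which avoids the userspace
    copy when os.sendfile() is available."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo():
    """Send the same small file both ways over a local socket pair."""
    payload = b"hello, world\n" * 100
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    results = {}
    try:
        senders = (("copy", send_with_userspace_copy), ("sendfile", send_with_sendfile))
        for name, sender in senders:
            left, right = socket.socketpair()
            with left, right:
                sent = sender(path, left)
                left.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
                received = b""
                while chunk := right.recv(65536):
                    received += chunk
                results[name] = (sent, received == payload)
    finally:
        os.unlink(path)
    return results


if __name__ == "__main__":
    print(demo())
```

Both paths deliver identical bytes; the difference lies in how many times the data is copied between kernel and userspace on the way to the socket.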
The result of your benchmark can change in favor of node.js if you increase the concurrency and use a cache in node.js.
Sample code from the book "Node Cookbook":
var http = require('http');
var path = require('path');
var fs = require('fs');

var mimeTypes = {
  '.js' : 'text/javascript',
  '.html': 'text/html',
  '.css' : 'text/css'
};

// In-memory cache keyed by file path.
var cache = {};

// Deliver the file from the cache if present; otherwise read it from disk once and cache it.
function cacheAndDeliver(f, cb) {
  if (!cache[f]) {
    fs.readFile(f, function(err, data) {
      if (!err) {
        cache[f] = {content: data};
      }
      cb(err, data);
    });
    return;
  }
  console.log('loading ' + f + ' from cache');
  cb(null, cache[f].content);
}

http.createServer(function (request, response) {
  var lookup = path.basename(decodeURI(request.url)) || 'index.html';
  var f = 'content/' + lookup;
  fs.exists(f, function (exists) {
    if (exists) {
      // Go through the cache instead of calling fs.readFile on every request.
      cacheAndDeliver(f, function(err, data) {
        if (err) {
          response.writeHead(500);
          response.end('Server Error!');
          return;
        }
        var headers = {'Content-type': mimeTypes[path.extname(lookup)]};
        response.writeHead(200, headers);
        response.end(data);
      });
      return;
    }
    response.writeHead(404); // no such file found
    response.end('Page Not Found!');
  });
}).listen(8080); // assumption: the excerpt was cut off before listen(); the port is illustrative
Really all you're doing here is getting the system to copy data between buffers in memory, in different process's address spaces - the disk cache means you aren't really touching the disk, and you're using local sockets.
So the fewer copies have to be done per request, the faster it goes.
Edit: I suggested adding caching, but in fact I see now you're already doing that - you read the file once, then start the server and send back the same buffer each time.
Have you tried appending the header part to the file data once upfront, so you only have to do a single write operation for each request?
$ cat /var/www/test.php
<?php
for ($i=0; $i<10; $i++) {
echo "hello, world\n";
}
$ ab -r -n 100000 -k -c 50 http://localhost/test.php
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests
Server Software: Apache/2.2.17
Server Hostname: localhost
Server Port: 80
Document Path: /test.php
Document Length: 130 bytes
Concurrency Level: 50
Time taken for tests: 3.656 seconds
Complete requests: 100000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 100000
Total transferred: 37100000 bytes
HTML transferred: 13000000 bytes
Requests per second: 27350.70 [#/sec] (mean)
Time per request: 1.828 [ms] (mean)
Time per request: 0.037 [ms] (mean, across all concurrent requests)
Transfer rate: 9909.29 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 3
Processing: 0 2 2.7 0 29
Waiting: 0 2 2.7 0 29
Total: 0 2 2.7 0 29
Percentage of the requests served within a certain time (ms)
50% 0
66% 2
75% 3
80% 3
90% 5
95% 7
98% 10
99% 12
100% 29 (longest request)
$ cat node-test.js
var http = require('http');
http.createServer(function (req, res) {
res.writeHead(200, {'Content-Type': 'text/plain'});
res.end('Hello World\n');
}).listen(1337, "127.0.0.1");
console.log('Server running at http://127.0.0.1:1337/');
$ ab -r -n 100000 -k -c 50 http://localhost:1337/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests
Server Software:
Server Hostname: localhost
Server Port: 1337
Document Path: /
Document Length: 12 bytes
Concurrency Level: 50
Time taken for tests: 14.708 seconds
Complete requests: 100000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 7600000 bytes
HTML transferred: 1200000 bytes
Requests per second: 6799.08 [#/sec] (mean)
Time per request: 7.354 [ms] (mean)
Time per request: 0.147 [ms] (mean, across all concurrent requests)
Transfer rate: 504.62 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 3
Processing: 0 7 3.8 7 28
Waiting: 0 7 3.8 7 28
Total: 1 7 3.8 7 28
Percentage of the requests served within a certain time (ms)
50% 7
66% 9
75% 10
80% 11
90% 12
95% 14
98% 16
99% 17
100% 28 (longest request)
$ node --version
v0.4.8
The benchmarks above were run in the following environment.
Apache:
$ apache2 -version
Server version: Apache/2.2.17 (Ubuntu)
Server built: Feb 22 2011 18:35:08
PHP APC cache/accelerator is installed.
Test run on my laptop, a Sager NP9280 with Core I7 920, 12G of RAM.
$ uname -a
Linux presto 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
KUbuntu natty
