Any tips for making Hive run faster on Hadoop? - performance

I'm new to Hive and Hadoop. I have Hadoop configured for pseudo-distributed operation with one data node and one name node, all on localhost.
I have a trivial employee table containing 4 records. I can select the records in a reasonable amount of time, but anything beyond that takes a very long time. For example:
0: jdbc:hive2://localhost:10000> select * from emp;
+------------+------------+-------------+-------------+------------+
| emp.empno | emp.ename | emp.job | emp.deptno | emp.etype |
+------------+------------+-------------+-------------+------------+
| 7369 | SMITH | CLERK | 10 | PART_TIME |
| 7400 | JONES | ENGINEER | 10 | FULL_TIME |
| 7500 | BROWN | NIGHTGUARD | 20 | FULL_TIME |
| 7510 | LEE | ENGINEER | 20 | FULL_TIME |
+------------+------------+-------------+-------------+------------+
4 rows selected (0.643 seconds)
0: jdbc:hive2://localhost:10000> select * from emp order by empno;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+------------+------------+-------------+-------------+------------+
| emp.empno | emp.ename | emp.job | emp.deptno | emp.etype |
+------------+------------+-------------+-------------+------------+
| 7369 | SMITH | CLERK | 10 | PART_TIME |
| 7400 | JONES | ENGINEER | 10 | FULL_TIME |
| 7500 | BROWN | NIGHTGUARD | 20 | FULL_TIME |
| 7510 | LEE | ENGINEER | 20 | FULL_TIME |
+------------+------------+-------------+-------------+------------+
4 rows selected (225.852 seconds)
What's it doing that's taking so long? Are there polling periods that I could reduce? I know that Hive isn't optimized for small tasks, but this seems ridiculous.
Here are the various config files:
hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.mapred.mode</name>
<value>nostrict</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/home/hadoop/tmp</value>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/home/hadoop/tmp/${hive.session.id}_resources</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>/home/hadoop/tmp</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/home/hadoop/tmp/operation_logs</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

Use the Beeline shell; the Hive CLI is deprecated.
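For example, to connect with Beeline using the same JDBC URL already shown in the question's prompt:
beeline -u jdbc:hive2://localhost:10000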

You are using ORDER BY in the query. ORDER BY imposes a total order on the entire result set, so there has to be a single reducer to sort the final output.
Even though the number of records is small in this case, the two stages (map and reduce) and the disk I/O between them make the query take time to process. If the number of rows in the output were large, that single reducer could take a very long time to finish.
I suggest you run Hive on the Tez or Spark engine, and avoid queries that require full table scans unless really needed.

Your Hive is using MapReduce to process the data, which is deprecated in Hive 2 and is also what is giving you the slow response, so you should set Tez or Spark as your Hive execution engine for faster results.
In my experience, Tez as the execution engine gives the best performance. To set up Hive on Tez, please follow this document:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_command-line-installation/content/ref-d677ca50-0a14-4d9e-9882-b764e689f6db.1.html
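Once Tez is installed, a sketch of making it the default engine in hive-site.xml, rather than setting it in every session:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>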

You can do two things to make your query run faster:
1. Use the combination of DISTRIBUTE BY and SORT BY. DISTRIBUTE BY ensures that all rows with the same key value end up in the same reducer; SORT BY then sorts the data within each reducer (a sketch follows below).
2. Set the execution engine to Tez before executing your query:
set hive.execution.engine=tez;
I think this will certainly make your query run faster.
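A hedged sketch of suggestion 1 against the question's emp table (deptno as the distribution key is just an illustrative choice):
set hive.execution.engine=tez;
select * from emp distribute by deptno sort by empno;
Unlike ORDER BY, this gives no global ordering across reducers, so use it only when per-reducer order is enough.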

set hive.fetch.task.conversion=none;
Set this property and then run the query again.

Use the following at the start of your Hive query:
-- use cost-based query optimization
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
-- use dynamic hive partition
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
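Note that the statistics-based settings above only pay off if statistics have actually been gathered; a hedged sketch for the question's table:
-- gather table-level and column-level statistics
analyze table emp compute statistics;
analyze table emp compute statistics for columns;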
Please add a comment if you like my answer!

Related

Odd Hadoop behavior, master performing all work?

I have set up a cluster using this guide: https://medium.com/@jootorres_11979/how-to-set-up-a-hadoop-3-2-1-multi-node-cluster-on-ubuntu-18-04-2-nodes-567ca44a3b12
Currently I have one datanode and one master node.
When I run a Hadoop job, the datanode's network activity indicates that it is sending a lot of data, and the namenode receives that data. Also, the namenode's CPU is fully utilized while the datanode's CPU is not used at all. See the figure:
The nodes are VMs on the same machine. This happens for several different scripts; the figure is from running a WordCount algorithm.
Why is the work not being performed on the datanode? What could cause such behavior?
Any help is appreciated.
According to the guide, mapred-site.xml was not changed, which means the default values are used. The default for mapreduce.framework.name is "local", so all computation is performed in a single local process instead of being distributed via YARN. This must be changed to "yarn".
I created the following mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
I also had to change yarn-site.xml to:
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
After restarting yarn and Hadoop, everything worked as expected. The work is performed at datanodes.
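For reference, a sketch of that restart, assuming the standard Hadoop sbin scripts are on the PATH:
stop-yarn.sh
stop-dfs.sh
start-dfs.sh
start-yarn.sh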

Issue in Inserting data to hive partition table with over 100k partitions

I created a staging table with 20 million records and only two fields, viewerid and viewedid. From that I am trying to create a dynamically partitioned ORC table partitioned on the "viewerid" column, but the map job is not completing, as shown in the attached pic.
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-master:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-master:8031</value>
</property>
</configuration>
job status:
my staging table:
hive> desc formatted bmviews;
OK
# col_name data_type comment
viewerid int
viewedid int
# Detailed Table Information
Database: bm
Owner: sudheer
CreateTime: Tue Aug 29 18:22:34 IST 2017
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://hadoop-master:54311/user/hive/warehouse/bm.db/bmviews
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 9
numRows 0
rawDataSize 0
totalSize 539543256
transient_lastDdlTime 1504070146
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
my partition table description:
I have increased the maximum dynamic partitions per node to 200k but am still facing the issue. I have two datanodes (8 GB and 6 GB of RAM respectively) and a namenode with 16 GB of RAM.
How can I insert the data into my partitioned table?
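For context, here is a hedged sketch of the kind of dynamic-partition insert described above (the target table name bmviews_part and the limit values are assumptions for illustration):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000000;
set hive.exec.max.dynamic.partitions.pernode=200000;
create table bmviews_part (viewedid int)
partitioned by (viewerid int)
stored as orc;
-- the dynamic partition column must come last in the select list
insert overwrite table bmviews_part partition (viewerid)
select viewedid, viewerid from bmviews;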

Spark: Spark UI not reflecting the right executor count

We are running a Spark streaming application and want to increase the number of executors Spark uses, so we updated spark-defaults.conf, raising spark.executor.instances from 28 to 40, but the change is not reflected in the UI.
1 Master/Driver Node:
Memory: 24 GB, Cores: 8
4 Worker Nodes:
Memory: 24 GB, Cores: 8
spark.streaming.backpressure.enabled true
spark.streaming.stopGracefullyOnShutdown true
spark.executor.instances 28
spark.executor.memory 2560MB
spark.executor.cores 4
spark.driver.memory 3G
spark.driver.cores 1
Note: we restarted Spark with start-master.sh and start-slaves.sh, but there was no change. Any help in this regard will be greatly appreciated. This is the yarn-site.xml:
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hdfs-name-node</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>22528</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>7</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>22528</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///tmp/hadoop/data/nm-local-dir,file:///tmp/hadoop/data/nm-local-dir/filecache,file:///tmp/hadoop/data/nm-local-dir/usercache</value>
</property>
<property>
<name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
<value>500</value>
</property>
<property>
<name>yarn.nodemanager.localizer.cache.target-size-mb</name>
<value>512</value>
</property>
</configuration>
The yarn-site config allocates 7 vcores per node, so overall you have 35 vcores, which means at most you can run 34 executors of 1 core each (1 core goes to the driver). Note also that with spark.executor.cores set to 4, each node's 7 vcores fit only one 4-core executor, so the cluster cannot even reach the configured 28 executors, let alone 40.
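A hedged sketch of lining the numbers up in spark-defaults.conf (values are illustrative, not prescriptive):
# 1-core executors let YARN pack up to 34 of them into 35 vcores
spark.executor.instances 34
spark.executor.cores 1
spark.executor.memory 2560m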

hadoop: having more than one reducers under pseudo distributed environment?

I am a newbie to Hadoop. I have successfully configured a Hadoop setup in pseudo-distributed mode. I want to have multiple reducers, using the option -D mapred.reduce.tasks=2 (with hadoop-streaming); however, there's still only one reducer.
According to Google, I'm sure that mapred.LocalJobRunner limits the number of reducers to 1, but I wonder: is there any workaround to have more reducers?
my hadoop configuration files:
[admin@localhost string-count-hadoop]$ cat ~/hadoop-1.1.2/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/admin/hadoop-data/tmp</value>
</property>
</configuration>
[admin@localhost string-count-hadoop]$ cat ~/hadoop-1.1.2/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
[admin@localhost string-count-hadoop]$ cat ~/hadoop-1.1.2/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/admin/hadoop-data/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/admin/hadoop-data/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
the way I start job:
[admin@localhost string-count-hadoop]$ cat hadoop-startjob.sh
#!/bin/sh
~/hadoop-1.1.2/bin/hadoop jar ~/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar \
-D mapred.job.name=string-count \
-D mapred.reduce.tasks=2 \
-mapper mapper \
-file mapper \
-reducer reducer \
-file reducer \
-input $1 \
-output $2
[admin@localhost string-count-hadoop]$ ./hadoop-startjob.sh /z/programming/testdata/items_sequence /z/output
packageJobJar: [mapper, reducer] [] /tmp/streamjob837249979139287589.jar tmpDir=null
13/07/17 20:21:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/17 20:21:10 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/17 20:21:10 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/17 20:21:11 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
...
...
Try modifying core-site.xml's property
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
to:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000/</value>
</property>
Put an extra / after 9000 and restart all the daemons.

mahout ssvd job performance

I need to compute an SSVD.
For a 50,000 x 50,000 matrix reduced to 300x300, libraries such as ssvdlibc and others can compute it in less than 3 minutes.
I wanted to do it for big data and tried using Mahout. First I tried to run it locally on my small data set (50,000 x 50,000), but it takes 32 minutes to complete that simple job, uses around 5.5 GB of disk space for spill files, and causes my Intel i5 with 8 GiB of RAM and an SSD drive to freeze a few times.
I understand that Mahout and Hadoop must do lots of additional steps to perform everything as a map-reduce job, but the performance hit just seems too big. I think I must have something wrong in my setup.
I've read some Hadoop and Mahout documentation and added a few parameters to my config files, but it's still incredibly slow. Most of the time it uses only one CPU.
Can someone please tell me what is wrong with my setup? Can it be tuned somehow for this simple, one-machine use, just to see what to look for in a bigger deployment?
my config files :
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>local</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx5000M</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>3</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>3</value>
</property>
<property>
<name>io.sort.factor</name>
<value>35</value>
</property>
</configuration>
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
<!--
<property>
<name>fs.inmemory.size.mb</name>
<value>200</value>
</property>
<property>
<name>io.sort.factor</name>
<value>100</value>
</property>
-->
<property>
<name>io.sort.mb</name>
<value>200</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
I run my job like this:
mahout ssvd --rank 400 --computeU true --computeV true --reduceTasks 3 --input ${INPUT} --output ${OUTPUT} -ow --tempDir /tmp/ssvdtmp/
I also configured Hadoop and Mahout with -Xmx4000m.
Well, first of all I would verify that it is running in parallel: make sure HDFS replication is set to "1", and just generally check your params. Seeing only one core used is definitely an issue!
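One likely culprit, given the config above: mapred.job.tracker is set to "local", which runs the whole job inside a single LocalJobRunner process. A sketch of pointing mapred-site.xml at a real JobTracker instead, assuming a pseudo-distributed Hadoop 1.x setup with the JobTracker on localhost:9001:
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>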
But!
The slowness is probably not going to go away completely. You might be able to speed things up significantly with proper configuration, but at the end of the day the Hadoop model is not going to outcompete optimized shared-memory libraries on a single computer.
The power of Hadoop/Mahout is big data, and honestly 50k x 50k is still fairly small and easily manageable on a single computer. Essentially, Hadoop trades speed for scalability. So while it might not outcompete those other libraries at 50,000 x 50,000, try getting them to work on 300,000 x 300,000, while with Hadoop you are sitting pretty on a distributed cluster.
