How to change mapreduce temporary working directory /tmp to other folder

How to change mapreduce temporary working directory /tmp to other folder - hadoop

I am using hive and I want to change the mapreduce temporary working directory from /tmp to some other directory. I tried everything which could I find on internet but nothing is working. I can see by du -h command that /tmp is filling up during the mapreduce task. Please somebody help me to change the directory.
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/bd/tmp/hadoop-${user.name}</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data/bd/tmp/hadoop/dfs/journalnode/</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.cluster.local.dir</name>
<value>/data/bd/tmp/mapred/local</value>
</property>
<property>
<name>mapreduce.task.tmp.dir</name>
<value>/data/bd/tmp</value>
</property>
<property>
<name>mapreduce.cluster.temp.dir</name>
<value>/data/bd/tmp/mapred/temp</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/data/bd/tmp/hadoop-yarn/staging</value>
</property>
<property>
<name>mapreduce.jobtracker.system.dir</name>
<value>/data/bd/tmp/mapred/system</value>
</property>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/data/bd/tmp/mapred/staging</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/data/bd/tmp/logs</value>
<description>The staging dir used while submitting jobs</description>
</property>
<property>
<name>yarn.timeline-service.entity-group-fs-store.active-dir</name>
<value>/data/bd/tmp/entity-file-history/active</value>
<description>HDFS path to store active application’s timeline data</description>
</property>
<property>
<name>yarn.timeline-service.entity-group-fs-store.done-dir</name>
<value>/data/bd/tmp/entity-file-history/done/</value>
<description>HDFS path to store done application’s timeline data</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/data/bd/tmp/hadoop-ubuntu/nm-local-dir</value>
<description>List of directories to store localized files</description>
</property>
</configuration>
hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password for connecting to mysql server</description>
</property>
<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
<description>How many jobs at most can be executed in parallel</description>
</property>
<property>
<name>hive.cbo.enable</name>
<value>true</value>
<description>Flag to control enabling Cost Based Optimizations using Calcite framework.</description>
</property>
<property>
<name>hive.compute.query.using.stats</name>
<value>true</value>
<description>
When set to true Hive will answer a few queries like count(1) purely using stats
stored in metastore. For basic stats collection turn on the config hive.stats.autogather to true.
For more advanced stats collection need to run analyze table queries.
</description>
</property>
<property>
<name>hive.stats.fetch.partition.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires partition level basic
statistics like number of rows, data size and file size. Partition statistics are fetched from
metastore. Fetching partition statistics for each needed partition can be expensive when the
number of partitions is high. This flag can be used to disable fetching of partition statistics
from metastore. When this flag is disabled, Hive will make calls to filesystem to get file sizes
and will estimate the number of rows from row schema.
</description>
</property>
<property>
<name>hive.stats.fetch.column.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires column statistics.
Column statistics are fetched from metastore. Fetching column statistics for each needed column
can be expensive when the number of columns is high. This flag can be used to disable fetching
of column statistics from metastore.
</description>
</property>
<property>
<name>hive.stats.autogather</name>
<value>true</value>
<description>A flag to gather statistics automatically during the INSERT OVERWRITE command.</description>
</property>
<property>
<name>hive.stats.dbclass</name>
<value>fs</value>
<description>
Expects one of the pattern in [jdbc(:.*), hbase, counter, custom, fs].
The storage that stores temporary Hive statistics. In filesystem based statistics collection ('fs'),
each task writes statistics it has collected in a file on the filesystem, which will be aggregated
after the job has finished. Supported values are fs (filesystem), jdbc:database (where database
can be derby, mysql, etc.), hbase, counter, and custom as defined in StatsSetupConst.java.
</description>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/data/bd/tmp</value>
<description>Scratch space for Hive jobs</description>
</property>
<property>
<name>hive.service.metrics.file.location</name>
<value>/data/bd/tmp/report.json</value>
<description>For metric class org.apache.hadoop.hive.common.metrics.metrics2.CodahaleMetrics JSON_FILE reporter, the location of local JSON metrics file. This file will get overwritten at every interval.</description>
</property>
<property>
<name>hive.query.results.cache.directory</name>
<value>/data/bd/tmp/hive/_resultscache_</value>
<description>unknown</description>
</property>
<property>
<name>hive.llap.io.allocator.mmap.path</name>
<value>/data/bd/tmp</value>
<description>unknown</description>
</property>
<property>
<name>hive.hbase.snapshot.restoredir</name>
<value>/data/bd/tmp</value>
<description>unknown</description>
</property>
<property>
<name>hive.druid.working.directory</name>
<value>/data/bd/tmp//workingDirectory</value>
<description>unknown</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/data/bd/tmp</value>
<description>logs hive</description>
</property>
</configuration>

For hadoop 2.7.1
Configure mapreduce.cluster.local.dir in $HADOOP_HOME/etc/hadoop/mapred-site.xml, it also supports comma-separated list of directories on different devices.
https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Related

Apache Kylin not able to load models/configuration

I'm new to hadoop,hive, hbase and kylin. I tried to install thoose first three, and it's seems to be working.
After that I tried to install apache kylin, run the sample.sh and success.
After running the script I restart and open the web interface. Some page cannot be opened ex: /cube, /models, /admin/config
The problem is: I can see there are 5 tables created in hive, and also 2 cubes created. But when I open in web gui, the models is in loading-state and I cannot build the cube.
When I try to build the cube
I cannot find any infomative log (Or maybe there is one, but I don't know about it)
kylin.log
https://pastebin.com/TUZkQepa
hadoop-hadoop-namenode-master.log
https://pastebin.com/T8eNt3PY
hadoop-hadoop-secondarynamenode-master.log
https://pastebin.com/iMJDNFfU
yarn-hadoop-resourcemanager-master.log
https://pastebin.com/TGwJWTRF
hbase-hadoop-zookeeper-master.log
https://pastebin.com/Ym6eky5h
hbase-hadoop-master-master.log
https://pastebin.com/p1ygfw4W
Here is the configuration for hadoop
(yarn-site.xml)
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Configuration for hbase
regionservers
slave2
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/datadir</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master,slave2</value>
</property>
</configuration>
Configuration for hive
hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master:3306/metastore?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>gwudainget</value>
<description>password for connecting to mysql server</description>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
<description>Whether to include the current database in the Hive prompt.</description>
</property>
</configuration>
For kylin, I use default configuration, because I don't really know what to do with the kylin configuration.
What i use:
hadoop 2.7.5 binary
hbase 1.2.6 binary
hive 1.2.2 binary
kylin 2.2.0 source (I just add logs)

Saving Hive tables between different warehouse directories in HDFS via Spark App

as of now, i'm currently figuring out how to save properly a specific hive table that was derived from a mapped source table in a specific database. let's say that the there would be a separate database for both the tester and developer side. how can i segregate the list of tables that they can access from one another?
For now, i monitor the state of the two databases via HUE. Now, I have a spark program that runs on a yarn cluster that creates a table to be stored depending on whether or not he is a developer or a tester.
The spark program that I've just created is a simple app that reads a table from the current warehouse location and saves a new table named new_table
I have the following hive configuration xml such as the following:
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://xxxx:9083</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>300</value>
</property>
<!--<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/yyyy/warehouse</value>
</property>-->
<property>
<name>hive.warehouse.subdir.inherit.perms</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join.noconditionaltask.size</name>
<value>20971520</value>
</property>
<property>
<name>hive.optimize.bucketmapjoin.sortedmerge</name>
<value>false</value>
</property>
<property>
<name>hive.smbjoin.cache.rows</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.logging.operation.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/var/log/hive/operation_logs</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>-1</value>
</property>
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>67108864</value>
</property>
<property>
<name>hive.exec.copyfile.maxsize</name>
<value>33554432</value>
</property>
<property>
<name>hive.exec.reducers.max</name>
<value>1099</value>
</property>
<property>
<name>hive.vectorized.groupby.checkinterval</name>
<value>4096</value>
</property>
<property>
<name>hive.vectorized.groupby.flush.percent</name>
<value>0.1</value>
</property>
<property>
<name>hive.compute.query.using.stats</name>
<value>false</value>
</property>
<property>
<name>hive.vectorized.execution.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.vectorized.execution.reduce.enabled</name>
<value>false</value>
</property>
<property>
<name>hive.merge.mapfiles</name>
<value>true</value>
</property>
<property>
<name>hive.merge.mapredfiles</name>
<value>false</value>
</property>
<property>
<name>hive.cbo.enable</name>
<value>false</value>
</property>
<property>
<name>hive.fetch.task.conversion</name>
<value>minimal</value>
</property>
<property>
<name>hive.fetch.task.conversion.threshold</name>
<value>268435456</value>
</property>
<property>
<name>hive.limit.pushdown.memory.usage</name>
<value>0.1</value>
</property>
<property>
<name>hive.merge.sparkfiles</name>
<value>true</value>
</property>
<property>
<name>hive.merge.smallfiles.avgsize</name>
<value>16777216</value>
</property>
<property>
<name>hive.merge.size.per.task</name>
<value>268435456</value>
</property>
<property>
<name>hive.optimize.reducededuplication</name>
<value>true</value>
</property>
<property>
<name>hive.optimize.reducededuplication.min.reducer</name>
<value>4</value>
</property>
<property>
<name>hive.map.aggr</name>
<value>true</value>
</property>
<property>
<name>hive.map.aggr.hash.percentmemory</name>
<value>0.5</value>
</property>
<property>
<name>hive.optimize.sort.dynamic.partition</name>
<value>false</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>mr</value>
</property>
<property>
<name>spark.executor.memory</name>
<value>996461772</value>
</property>
<property>
<name>spark.driver.memory</name>
<value>966367641</value>
</property>
<property>
<name>spark.executor.cores</name>
<value>4</value>
</property>
<property>
<name>spark.yarn.driver.memoryOverhead</name>
<value>102</value>
</property>
<property>
<name>spark.yarn.executor.memoryOverhead</name>
<value>167</value>
</property>
<property>
<name>spark.dynamicAllocation.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.dynamicAllocation.initialExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.minExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.maxExecutors</name>
<value>2147483647</value>
</property>
<property>
<name>hive.metastore.execute.setugi</name>
<value>true</value>
</property>
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>xxxx,xxxx</value>
</property>
<property>
<name>hive.zookeeper.client.port</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>xxxx,xxxx</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hive.zookeeper.namespace</name>
<value>hive_zookeeper_namespace_hive</value>
</property>
<property>
<name>hive.cluster.delegation.token.store.class</name>
<value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hive.server2.use.SSL</name>
<value>false</value>
</property>
<property>
<name>spark.shuffle.service.enabled</name>
<value>true</value>
</property>
</configuration>
Based from my current understanding, If i change the warehouse location to something upon submitting the spark app on the yarn cluster via hive.warehouse.dir using --files /file/hive-site.xml such as the value of
hdfs:/user/diff/warehouse, the hive configurations on the spark app should detect the following hive tables that exist on the specific directory.
However, upon doing so, it still persists to the location of the default database of the hive.metastore.uris which points to the directory hdfs:/user/hive/warehouse. Based from my understanding, the hive.metastore.uris overrides the database location in hive.metastore.dir.
What am I doing wrong at this point? is there something i need to properly configure in Hive-site.xml? any answers would be appreciated. Thank you. I'm currently a novice developer when it comes to spark and hadoop.

Create separate databases
Demo
Creating the databases is a one time thing
hive> create database dev_db location '/user/hive/my_databases/dev';
hive> create database tst_db location '/user/hive/my_databases/tst';
When you create the table you choose the database you want to work with
hive> create table dev_db.my_dev_table (i int);
hive> create table tst_db.my_tst_table (i int);
hive> desc formatted dev_db.my_dev_table;
# col_name data_type comment
i int
# Detailed Table Information
Database: dev_db
...
Location: hdfs://quickstart.cloudera:8020/user/hive/my_databases/dev/my_dev_table
...
hive> desc formatted tst_db.my_tst_table;
Database: tst_db
...
Location: hdfs://quickstart.cloudera:8020/user/hive/my_databases/tst/my_tst_table
...

Hadoop Cluster. Map reduce job stuck at map 100% and reduce 0%

I am new to Hadoop. I tried to create a hadoop cluster based on the example given on the Apache Hadoop site.
However when I run the map reduce example the application is stuck at map 100% and reduce 0%.
Please help
I have setup the environment using Vagrant and Virtual box. Created two instances.
I am running name node and a data node in one instance and resource manager and node manager in the other instance.
mapred-siet.xml configuration
<configuration>
<!-- Map Reduce applications configuration -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024M</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2560M</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
</property>
<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>50</value>
</property>
<!-- Map Reduce Job History Server -->
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
</property>
yarn-site.xml
e<configuration>
<!-- Resource Manager -->
<property>
<name>yarn.acl.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
<!-- Node Manager -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/opt/hadoop-2.6.2/tempData</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/opt/hadoop-2.6.2/logDir</value>
</property>
<property>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>10800</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/logs</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- History Server -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>-1</value>
</property>
<property>
<name>yarn.log-aggregation.retain-check-interval-seconds</name>
<value>-1</value>
</property>

I was able to run the application now. As I thought it was a problem with the memory required by the system. I changed the following properties as given below
yarn.scheduler.maximum-allocation-mb
8192
<!-- Node Manager -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
and repeated the process. its working fine now.

Building hadoop cluster on small nodes

I'm preparing hadoop cluster on four very small virtual servers (2GB RAM, 2Cores each) for a proof of concept.
One server as name node and resource manager and three are data nodes.
Every time I'm running the test job (3,4 GB file with data) - two of data nodes (random ones) are working at maximum capability and one of them is sleeping (monitoring via htop).
All 3 data nodes are visible in the hadoop GUI.
What am I missing?
Any help will be much appreciated.
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop-master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-master:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>67108864</value>
</property>

I found the solution.
To increase number of reducers in the file mapred-site.xml I added
<property>
<name>A</name>
<value>5</value>
</property>
After I added additional nodes to cluster, hadoop has increased mappers without any additional change in the configuration. All data nodes are working at maximum capability now.

hive slow performance with this configuration

When i add this configuration command in hive-site.xml ,hive queries is very slow.
Anyone can explain why happened for me?how can we fix it?
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
this is my hive-site.xml:
<configuration>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value></value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to change mapreduce temporary working directory /tmp to other folder - hadoop

For hadoop 2.7.1 Configure mapreduce.cluster.local.dir in $HADOOP_HOME/etc/hadoop/mapred-site.xml, it also supports comma-separated list of directories on different devices. https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Related

Apache Kylin not able to load models/configuration

Saving Hive tables between different warehouse directories in HDFS via Spark App

Hadoop Cluster. Map reduce job stuck at map 100% and reduce 0%

Building hadoop cluster on small nodes

hive slow performance with this configuration

Categories

Resources