Hive job always running in-process (local Hadoop)

When I set this property in hive-site.xml:
<property>
<name>hive.exec.mode.local.auto</name>
<value>false</value>
</property>
Hive still always runs the Hadoop job locally:
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 55
Job running in-process (local Hadoop)
Why does this happen?

As mentioned in HIVE-2585, going forward Hive will assume that the metastore is operating in local mode if the configuration property hive.metastore.uris is unset, and will assume remote mode otherwise.
Ensure the following properties are set in hive-site.xml:
<property>
<name>hive.metastore.uris</name>
<value>thrift://<metastore host>:9083</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
The hive.metastore.local property is no longer supported as of Hive 0.10; setting hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
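A quick way to confirm which metastore mode a session actually picked up (not from the original answer, just a sanity check) is to print the effective value from the Hive CLI:
hive> SET hive.metastore.uris;
If the property comes back empty, Hive is still running with a local/embedded metastore; if it shows one or more thrift:// URIs, it is using the remote metastore.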
EDIT:
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:
hive> SET hive.exec.mode.local.auto=false;
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
1. The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
2. The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
3. The total number of reduce tasks required is 1 or 0.
So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.
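The thresholds themselves can be printed from the CLI with the same SET syntax (a quick check, not part of the referenced guide):
hive> SET hive.exec.mode.local.auto;
hive> SET hive.exec.mode.local.auto.inputbytes.max;
hive> SET hive.exec.mode.local.auto.tasks.max;
If local mode still kicks in even though hive.exec.mode.local.auto=false is in hive-site.xml, it is worth checking whether a .hiverc file or an earlier session-level SET is overriding the value, since session settings take precedence over hive-site.xml.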
Reference: Hive Getting Started

Related

Pig job fails with "org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120"

We are processing 50 million records, and at the end of the Pig script we apply the RANK function. The Pig job fails while executing RANK with the error below:
"org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120"
We have used the command below in the Pig script, but we still get the same error:
set mapreduce.job.counters.max 1000
I would really appreciate it if anyone could get me past this error or suggest an alternative way to use RANK on 50+ million processed records.
Check the counter limit value in mapred-site.xml. Most likely the limit is set to 120 in that file. The file is located in your Hadoop configuration directory, e.g. $HADOOP_HOME/conf/mapred-site.xml:
<property>
<name>mapreduce.job.counters.limit</name>
<value>1000</value> <!-- most likely this is set to 120 in your case -->
</property>
In Hadoop 2.0 it is mapreduce.job.counters.max:
<property>
<name>mapreduce.job.counters.max</name>
<value>1000</value> <!-- most likely this is set to 120 in your case -->
</property>
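If the cluster-side limit has been raised (and the JobTracker/ResourceManager restarted so the change takes effect), the per-job value can also be passed when launching the script. A sketch, assuming a script named rank_job.pig (hypothetical name; the -D flag must come before other Pig options):
pig -Dmapreduce.job.counters.max=1000 -f rank_job.pig
Either way, the answer above suggests the cluster-side value in mapred-site.xml is what actually governs the limit, which would explain why raising it in the script alone did not help.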

YARN (MRv2) performance tuning: number of mappers and reducers; MRv1 performance is better

I'm running an MR Java program on YARN. Even though the number of mappers is 24, only 10 mappers are actually running; the remaining 14 map tasks are in the pending state. How can I get them running as well? We are running a 6-node MapR cluster.
I changed the properties below in mapred-site.xml and yarn-site.xml.
These values override the defaults, but I'm still not seeing any performance improvement.
Note: when I run the same program with MRv1, performance is somehow better. So please suggest how to make good use of the cluster's resources.
Command used:
yarn jar /opt/cluster/bin/logmessage-1.0-SNAPSHOT.jar com.message.WordPreprocessDriver -Dmapreduce.input.fileinputformat.split.maxsize=33554432 /data/123.txt
In yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>20960</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
In mapred-site.xml:
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value>
</property>
<property>
<name>mapreduce.job.maps</name>
<value>4</value>
</property>
As you can see, the total number of running containers was actually 11. There were 40 vcores available, of which 11 were used while running the MR program. Can you please share which properties need to change for that?
Thank you,
Madhu
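One rough way to sanity-check how many map containers these settings allow (a sketch, assuming the default memory-based resource calculator and that the MapR distribution behaves like stock YARN here): each NodeManager can host about floor(yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb) map containers, i.e. floor(20960 / 4096) = 5 per node, or roughly 30 across a 6-node cluster, minus one container for the ApplicationMaster. Current node and application usage can be listed with:
yarn node -list -all
yarn application -list
If far fewer containers run than that estimate, queue or scheduler limits (for example per-queue capacity or yarn.scheduler.capacity.maximum-am-resource-percent) are the next thing to check.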

Discrepancy between the map tasks run and the tasks configured in mapred-site.xml

I stopped all the daemons running in my pseudo-distributed setup by issuing the following command:
stop-all.sh
Then I changed the mapred-site.xml configuration file to allow 1 map task:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>1</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>1</value>
</property>
</configuration>
As you can see, I have set 1 map task and 1 reduce task to run.
Then I started all the daemons again:
start-all.sh
and ran the map-reduce program, but I still see 2 tasks instead of the 1 configured in mapred-site.xml.
A screenshot of the tasks is shown below.
Why is this discrepancy occurring? Please guide me through it.
Thanks.
Okay, so the property mapred.tasktracker.map.tasks.maximum sets the maximum number of map tasks that a TaskTracker can run at a time. Basically, you are restricting each node running a TaskTracker to one concurrent mapper.
If you have 10 nodes then you should be able to run 10 mappers in parallel.
However, if your job requires 2 mappers (which is determined entirely by the input data size and block size, unless you extend the InputFormat) and you have only one node, the two map tasks will simply be executed sequentially on that node.
Hope this is clearer now.
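If the goal is for the job itself to produce a single map task (rather than just limiting how many run at once), the split size has to be raised so the whole input fits into one split. A sketch using the classic MR1 property name (the exact name varies by version, and the 256 MB value here is just an illustration):
<property>
<name>mapred.min.split.size</name>
<value>268435456</value>
</property>
With FileInputFormat, each split is at least this large, so a small input ends up in a single split per input file and therefore a single map task.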

Why does the Hadoop capacity scheduler use 200% of capacity?

I encountered the same problem on our cluster and went back to my PC to do some simple experiments, hoping to figure it out. I configured Hadoop in pseudo-distributed mode, used the default capacity-scheduler.xml, and configured mapred-site.xml as follows:
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>io.sort.mb</name>
<value>5</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx10m</value>
</property>
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
<name>mapred.queue.names</name>
<value>default</value>
</property>
<property>
<name>mapred.cluster.map.memory.mb</name>
<value>100</value>
</property>
<property>
<name>mapred.cluster.max.map.memory.mb</name>
<value>200</value>
</property>
</configuration>
The web UI looks like this:
Queue Name: default
Scheduling Information:
Queue configuration
Capacity Percentage: 100.0%
User Limit: 100%
Priority Supported: NO
-------------
Map tasks
Capacity: 2 slots
Used capacity: 2 (100.0% of Capacity)
Running tasks: 1
Active users:
User 'luo': 2 (100.0% of used capacity)
-------------
Reduce tasks
Capacity: 2 slots
Used capacity: 0 (0.0% of Capacity)
Running tasks: 0
-------------
Job info
Number of Waiting Jobs: 0
Number of users who have submitted jobs: 1
Actually, it worked without anything going wrong when I submitted a streaming job with one map task, which occupies 2 slots, and no reduce task. The streaming command is rather simple:
~/hadoop/hadoop-0.20.2/bin/hadoop jar Streaming_blat.jar -D mapred.job.map.memory.mb=199 -D mapred.job.name='memory alloc' -D mapred.map.tasks=1 -input file://`pwd`/input/ -mapper ' /home/luo/hadoop/hadoop-0.20.2/bin/a.out' -output file://`pwd`/output/ -reducer NONE
a.out is just a C program simply outputting the pid and ppid to a specified file.
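Presumably (assuming the capacity scheduler's memory-based slot accounting, which is not spelled out in the post), a task that requests mapred.job.map.memory.mb=199 against a cluster slot size of mapred.cluster.map.memory.mb=100 occupies ceil(199 / 100) = 2 slots, which matches the 2 slots of used capacity in the UI output above.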
And problems came when I set mapred.map.tasks=3. The web UI showed
Map tasks
Capacity: 2 slots
Used capacity: 4 (200.0% of Capacity)
Running tasks: 2
Active users:
User 'luo': 4 (100.0% of used capacity)
which means it already exceeds the limit of map slots I set in mapred-site.xml. As a result, it printed something like this again and again:
Killing one of the least progress tasks - attempt_201210121915_0012_m_000000_0, as the cumulative memory usage of all the tasks on the TaskTracker exceeds virtual memory limit 207618048.
What I want it to do is hold the map tasks until slots are available, without exceeding the capacity. So what have I done wrong? Could anyone provide a solution? Thanks a lot.
All right, I'll answer it myself. After digging through the code, I found that all four of these properties must be set in mapred-site.xml, otherwise the scheduler does not perform the memory check (I had only set two of them):
mapred.cluster.map.memory.mb
mapred.cluster.reduce.memory.mb
mapred.cluster.max.map.memory.mb
mapred.cluster.max.reduce.memory.mb
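For completeness, a sketch of what mapred-site.xml would look like with all four properties present (the map values are the ones from the original post; the reduce values below are illustrative placeholders):
<property>
<name>mapred.cluster.map.memory.mb</name>
<value>100</value>
</property>
<property>
<name>mapred.cluster.reduce.memory.mb</name>
<value>100</value>
</property>
<property>
<name>mapred.cluster.max.map.memory.mb</name>
<value>200</value>
</property>
<property>
<name>mapred.cluster.max.reduce.memory.mb</name>
<value>200</value>
</property>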

How to set the datanode timeout?

I have a 3 node hadoop setup, with replication factor as 2.
When one of my datanodes dies, the namenode waits for 10 minutes before removing it from the live nodes. Until then, my HDFS writes fail with a bad ack from the node.
Is there a way to set a smaller timeout (like 1 minute) so that a node whose datanode dies is discarded sooner?
Setting the following in your hdfs-site.xml will give you a 1-minute timeout:
<property>
<name>heartbeat.recheck.interval</name>
<value>15</value>
<description>Determines datanode heartbeat interval in seconds</description>
</property>
If the above doesn't work, try the following (the property name seems to be version-dependent):
<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>15</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
The timeout equals 2 * heartbeat.recheck.interval + 10 * heartbeat.interval. The default for heartbeat.interval is 3 seconds.
In the version of Hadoop that we use, dfs.heartbeat.recheck.interval should be specified in milliseconds (check the code/doc of your version of Hadoop, to validate that).
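As a worked check of the formula against the first snippet above: with a recheck interval of 15 read as seconds and the default 3-second heartbeat, the timeout is 2 * 15 + 10 * 3 = 60 seconds. If your version reads the recheck interval in milliseconds, a value of 15000 gives the same 1-minute timeout (2 * 15 s + 10 * 3 s = 60 s).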
I've managed to make this work. I'm using Hadoop version 0.2.2.
Here's what I added to my hdfs-site.xml:
<property>
<name>dfs.heartbeat.interval</name>
<value>2</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>1</value>
<description>Determines when machines are marked dead</description>
</property>
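Plugging these values into the formula above: with a 2-second heartbeat and a recheck interval of 1 (which is negligible whether it is read as seconds or milliseconds), the expected timeout is roughly 2 * 1 + 10 * 2, i.e. about 20-22 seconds, consistent with the ~20 seconds observed below.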
These parameters can differ for other versions of Hadoop. Here's how to check that you're using the right parameter names: once you set them, start your master and check the configuration at:
http://your_master_machine:19888/conf
If you don't find "dfs.heartbeat.interval" and/or "dfs.heartbeat.recheck.interval" in there, that means you should try using their version without the "dfs." prefix:
"heartbeat.interval" and "heartbeat.recheck.interval"
Finally, to check that the dead datanode is no longer used after the desired amount of time, kill a datanode, then repeatedly check the console at:
http://your_master_machine:50070
For me, with the configuration shown here, I can see that a dead datanode is removed after about 20 seconds.
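As an alternative to the web console (not part of the original answer), the live and dead datanode lists can also be checked from the command line with:
hdfs dfsadmin -report
or, on older releases, hadoop dfsadmin -report.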
