YARN minimum-user-limit-percent not working? - hadoop

I'm using the Capacity Scheduler in YARN and I saw that users can be guaranteed a minimum percentage of the queue through the minimum-user-limit-percent property. I set this property to 20, and what I would expect is that resources get distributed equally among up to 5 users, according to this:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_yarn-resource-management/content/setting_user_limits.html
But that's not the case when users start running applications. For instance, if I run an application when the queue is idle, YARN allocates all the requested resources to that application. When another user runs the same application afterwards, YARN allocates whatever resources are left in the queue, and the queue fills up. At this point, I thought that with this property the second user would get 50% of the queue, and the first one would end up with fewer resources.
If a third user comes in, I would expect him/her to get 33% of the queue, but YARN doesn't even schedule the application because there are no available resources left.
Am I missing something? I thought this parameter made requests independent of the resources already in use, until the minimum percentage per user was reached.
Here are my yarn-site.xml and capacity-scheduler.xml:
yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>true</value>
</property>
<property>
<name>yarn.admin.acl</name>
<value>*</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoopLogin:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoopLogin:8033</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoopLogin:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoopLogin:8031</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoopLogin:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.https.address</name>
<value>hadoopLogin:8090</value>
</property>
<property>
<name>yarn.resourcemanager.client.thread-count</name>
<value>50</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.client.thread-count</name>
<value>50</value>
</property>
<property>
<name>yarn.resourcemanager.admin.client.thread-count</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.increment-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>14336</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.increment-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>32</value>
</property>
<property>
<name>yarn.resourcemanager.amliveliness-monitor.interval-ms</name>
<value>1000</value>
</property>
<property>
<name>yarn.am.liveness-monitor.expiry-interval-ms</name>
<value>600000</value>
</property>
<property>
<name>yarn.resourcemanager.am.max-attempts</name>
<value>2</value>
</property>
<property>
<name>yarn.resourcemanager.container.liveness-monitor.interval-ms</name>
<value>600000</value>
</property>
<property>
<name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
<value>1000</value>
</property>
<property>
<name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
<value>600000</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.client.thread-count</name>
<value>50</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
<name>yarn.scheduler.fair.user-as-default-queue</name>
<value>true</value>
</property>
<property>
<name>yarn.scheduler.fair.preemption</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.fair.sizebasedweight</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.fair.assignmultiple</name>
<value>false</value>
</property>
<property>
<name>yarn.resourcemanager.max-completed-applications</name>
<value>10000</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
</configuration>
capacity-scheduler.xml
<configuration>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>batch,notebook</value>
<description>Definition of console and batch jobs (batch) and notebook jobs (notebook) queues</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.capacity</name>
<value>50</value>
<description>Percentage of capacity for root.batch queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.capacity</name>
<value>50</value>
<description>Percentage of capacity for root.notebook queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
<value>55</value>
<description>Percentage of maximum capacity for root.batch queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.maximum-capacity</name>
<value>55</value>
<description>Percentage of maximum capacity for root.notebook queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.state</name>
<value>RUNNING</value>
<description>Current state of the root.batch queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.state</name>
<value>RUNNING</value>
<description>Current state of the root.notebook queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.acl_submit_applications</name>
<value>hadoop,yarn,mapred,hdfs,spark</value>
<description>The ACL of who can submit jobs to the root queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.acl_submit_applications</name>
<value>scienceUser1 root,gaia,ub,ucm,uac,udc,esac,upo,une,inaf</value>
<description>The ACL of who can submit jobs to the root.batch queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.acl_submit_applications</name>
<value>* root,gaia,ub,ucm,uac,udc,esac,upo,une,inaf</value>
<description>The ACL of who can submit jobs to the root.notebook queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.acl_administer_queue</name>
<value>gaia</value>
<description>The ACL of who can administer jobs to the root.batch queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.acl_administer_queue</name>
<value>gaia</value>
<description>The ACL of who can administer jobs to the root.notebook queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.acl_administer_queue</name>
<value>gaia</value>
<description>The ACL of who can administer jobs on the root queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.acl_administer_queue</name>
<value>gaia</value>
<description>The ACL of who can administer jobs on the batch queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.acl_administer_queue</name>
<value>gaia</value>
<description>The ACL of who can administer jobs on the notebook queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.minimum-user-limit-percent</name>
<value>33</value>
<description>Minimum percentage of resources a user gets from the queue.</description>
</property>
</configuration>

Actually, this is not how minimum-user-limit-percent should be understood. Suppose you have 100 GB and there are 4 users. Shared fairly, each can get up to 25 GB. But what if tomorrow 100 users come in? In that case, dividing equally would end up giving 1 GB to each user. That is something we may not want, because every process would end up starving for resources, which might lead to worse performance. So we limit ourselves by setting minimum-user-limit-percent to 25, which means each user gets a minimum of 25 GB, and if there are more users, their applications are kept in the accepted state. So now: if 1 user comes in, it can go up to 100 GB; 2 users can take up to 50 GB each; 3 users up to 33 GB each; 4 users up to 25 GB each. But if a 5th user comes in, that user is sent to waiting, because the users cannot be given 20 GB each when we have set the limit to 25% (of 100 GB), which is 25 GB. P.S. This is my understanding; if anyone finds it wrong, please comment. It is indeed a bit confusing.
Now in your scenario, let's say the first two users took all the resources. That case is different: the 25% limit only says that you can get in when YARN has at least 25% of the queue available to allocate to you. To ensure that two users cannot take the full resources, you can tweak user-limit-factor; if it is set to 0.2, each user cannot take more than 20% of the queue.
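For reference, here is a minimal sketch of how those two settings could look in capacity-scheduler.xml. The notebook queue name and the values are only illustrative, not taken from your setup:
<property>
<name>yarn.scheduler.capacity.root.notebook.minimum-user-limit-percent</name>
<value>20</value>
<!-- each active user is guaranteed at least 20% of the queue when there is demand;
     once 5 users are active, a 6th user's application stays pending -->
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.user-limit-factor</name>
<value>0.2</value>
<!-- caps how much of the queue capacity a single user can take, so one user cannot fill it up -->
</property>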

The behavior you're seeing is correct if you haven't enabled preemption.
Check your yarn-site.xml to see whether yarn.resourcemanager.scheduler.monitor.enable is set to true.
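For example, enabling the monitor with the standard preemption policy looks roughly like this in yarn-site.xml (a minimal sketch; the policy class shown is the ProportionalCapacityPreemptionPolicy that ships with the Capacity Scheduler):
<property>
<name>yarn.resourcemanager.scheduler.monitor.enable</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.monitor.policies</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>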
If you haven't enabled it, refer to the following for guidance:
http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Capacity_Scheduler_container_preemption
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_yarn-resource-management/content/preemption.html
https://community.hortonworks.com/questions/89894/capacity-scheduler-preemption-doesnt-work.html

Related

Problems with memory kill limits for YARN

I have a problem understanding the YARN configuration.
I have these lines in my yarn/mapreduce configs:
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
Here it is written:
By default ("yarn.nodemanager.vmem-pmem-ratio") is set to 2.1. This means that a map or reduce container can allocate up to 2.1 times the ("mapreduce.reduce.memory.mb") or ("mapreduce.map.memory.mb") of virtual memory before the NM will kill the container.
When will the NodeManager kill my container?
When a whole container reaches 2048 MB * 2.1 = 4300.8 MB? Or 1024 MB * 2.1 = 2150.4 MB?
Can I get a better explanation?
Each Mapper and Reducer runs in its own separate container (containers are not shared between Mappers and Reducers, unless it is an Uber job; see What is the purpose of "uber mode" in hadoop?).
Typically, memory requirements for a Mapper and a Reducer differ.
Hence, there are different configuration parameters for Mapper (mapreduce.map.memory.mb) and Reducer (mapreduce.reduce.memory.mb).
So, as per the settings in your configs, the virtual memory limits for the Mapper and Reducer are:
Mapper limit: 2048 * 2.1 = 4300.8 MB
Reducer limit: 1024 * 2.1 = 2150.4 MB
In short, Mappers and Reducers have different memory settings and limits.
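Putting it together, a minimal sketch of the relevant settings with the resulting kill thresholds noted as comments (the values are the ones from your question; the thresholds are just the products above, not separate settings):
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value> <!-- per map container; vmem limit = 2048 * 2.1 = 4300.8 MB -->
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value> <!-- per reduce container; vmem limit = 1024 * 2.1 = 2150.4 MB -->
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value> <!-- virtual-to-physical ratio applied to each container individually -->
</property>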

Where moved to trash files go?

I have the following questions regarding the "move to trash" functionality in the hue GUI:
Where do these files go?
How long are they stored?
Can I restore them?
1) They go to /user/hduser/.Trash in HDFS, where hduser is the operating-system user who deleted the files (it can also be a Windows user if you are using a Java client from Windows + Eclipse).
2) That depends on the following configuration in core-site.xml:
<property>
<name>fs.trash.interval</name>
<value>30</value>
<description>Number of minutes after which the checkpoint
gets deleted. If zero, the trash feature is disabled.
</description>
</property>
3) For this recovery method, trash must be enabled in HDFS. Trash is enabled by setting the property fs.trash.interval (as in the XML above) to a value greater than 0.
By default the value is zero. Its value is the number of minutes after which a checkpoint gets deleted; if zero, the trash feature is disabled. This property has to be set in core-site.xml.
There is one more property related to the one above, called fs.trash.checkpoint.interval. It is the number of minutes between trash checkpoints and should be smaller than or equal to fs.trash.interval.
Every time the checkpointer runs, it creates a new checkpoint out of the current directory and removes checkpoints created more than fs.trash.interval minutes ago.
The default value of this property is zero.
<property>
<name>fs.trash.checkpoint.interval</name>
<value>15</value>
<description>Number of minutes between trash checkpoints.
Should be smaller or equal to fs.trash.interval.
Every time the checkpointer runs it creates a new checkpoint
out of current and removes checkpoints created more than
fs.trash.interval minutes ago.
</description>
</property>
If the above properties are enabled in your cluster, deleted files will be present in the .Trash directory in HDFS. You have time to recover the files until the next checkpoint occurs; after a new checkpoint the deleted files will no longer be in .Trash, so recover them before then. If trash is not enabled in your cluster, you can enable it for future recovery.. :)
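As an illustration, restoring a file is just a move out of the trash directory. A minimal sketch, assuming the default trash layout and a hypothetical file name data.csv:
hdfs dfs -ls /user/hduser/.Trash/Current/user/hduser
hdfs dfs -mv /user/hduser/.Trash/Current/user/hduser/data.csv /user/hduser/data.csv
The first command lists what is currently in the trash (the original path is preserved under Current); the second moves the file back to its old location.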

hadoop: how to increase the limit of failed tasks

I want to run a job so that all task failures are just logged and are otherwise ignored (basically to test my input). Right now, when a task fails I get "# of failed Map Tasks exceeded allowed limit". How do I increase the limit?
I use Hadoop 1.2.1
Set mapred.max.map.failures.percent and mapred.max.reduce.failures.percent in mapred-site.xml to control the failure threshold. Both default to 0. Check the code of JobConf.java for more details.
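A minimal sketch for Hadoop 1.x, assuming you want the job to succeed no matter how many tasks fail (the same thresholds can also be set per job through JobConf):
<property>
<name>mapred.max.map.failures.percent</name>
<value>100</value> <!-- tolerate up to 100% of map tasks failing; failures are only logged -->
</property>
<property>
<name>mapred.max.reduce.failures.percent</name>
<value>100</value> <!-- same for reduce tasks -->
</property>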
In order to increase the limit of MapTasks, try adding the following to the mapred-site.xml file.
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>{cores}</value>
</property>
This sets the maximum number of MapTasks. In place of {cores} you should substitute the number of cores you have. Setting this to exactly the number of available cores is not considered good practice. Let me know if you have any questions.
Hope this helps.
Happy Hadooping!!!

Nutch segments folder grows every day

I have configured Nutch/Solr 1.6 to crawl/index an intranet with about 4000 documents and HTML pages every 12 hours.
If I execute the crawler with an empty database the process takes about 30 minutes.
When the crawling is executed for several days, it becomes very slow.
Looking at the log file, it seems that last night the final step (SolrIndexer) started after 1 hour and 20 minutes and took a bit more than 1 hour.
Since the number of documents indexed doesn't grow, I'm wondering why it is so slow now.
Nutch is executed with the following command:
bin/nutch crawl -urlDir urls -solr http://localhost:8983/solr -dir nutchdb -depth 15 -topN 3000
The nutch-site.xml contains:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>Internet Site Agent</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata|more|http-header)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<!-- Used only if plugin parse-metatags is enabled. -->
<property>
<name>metatags.names</name>
<value>description;keywords;published;modified</value>
<description> Names of the metatags to extract, separated by;.
Use '*' to extract all metatags. Prefixes the names with 'metatag.'
in the parse-metadata. For instance to index description and keywords,
you need to activate the plugin index-metadata and set the value of the
parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
</description>
</property>
<property>
<name>index.parse.md</name>
<value>metatag.description,metatag.keywords,metatag.published,metatag.modified</value>
<description> Comma-separated list of keys to be taken from the parse metadata to generate fields.
Can be used e.g. for 'description' or 'keywords' provided that these values are generated
by a parser (see parse-metatags plugin)
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>Set this to false if you start crawling your website from
for example http://www.example.com but you would like to crawl
xyz.example.com. Set it to true otherwise if you want to exclude external links
</description>
</property>
<property>
<name>http.content.limit</name>
<value>10000000</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>1</value>
<description>
If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
This is also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
This is also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>1.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>http.redirect.max</name>
<value>0</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>2</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time. Replaces
deprecated parameter 'fetcher.threads.per.host'.
</description>
</property>
<property>
<name>link.delete.gone</name>
<value>true</value>
<description>Whether to delete gone pages from the web graph.</description>
</property>
<property>
<name>link.loops.depth</name>
<value>20</value>
<description>The depth for the loops algorithm.</description>
</property>
<!-- moreindexingfilter plugin properties -->
<property>
<name>moreIndexingFilter.indexMimeTypeParts</name>
<value>false</value>
<description>Determines whether the index-more plugin will split the mime-type
in sub parts, this requires the type field to be multi valued. Set to true for backward
compatibility. False will not split the mime-type.
</description>
</property>
<property>
<name>moreIndexingFilter.mapMimeTypes</name>
<value>false</value>
<description>Determines whether MIME-type mapping is enabled. It takes a
plain text file with mapped MIME-types. With it the user can map both
application/xhtml+xml and text/html to the same target MIME-type so it
can be treated equally in an index. See conf/contenttype-mapping.txt.
</description>
</property>
<!-- Fetch Schedule Configuration -->
<property>
<name>db.fetch.interval.default</name>
<!-- for now always re-fetch everything -->
<value>10</value>
<description>The default number of seconds between re-fetches of a page (less than 1 day).
</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<!-- for now always re-fetch everything -->
<value>10</value>
<description>The maximum number of seconds between re-fetches of a page
(less than one day). After this period every page in the db will be re-tried, no
matter what is its status.
</description>
</property>
<!--property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
<description>The implementation of fetch schedule. DefaultFetchSchedule simply
adds the original fetchInterval to the last fetch time, regardless of
page changes.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.inc_rate</name>
<value>0.4</value>
<description>If a page is unmodified, its fetchInterval will be
increased by this rate. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.dec_rate</name>
<value>0.2</value>
<description>If a page is modified, its fetchInterval will be
decreased by this rate. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.min_interval</name>
<value>60.0</value>
<description>Minimum fetchInterval, in seconds.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.max_interval</name>
<value>31536000.0</value>
<description>Maximum fetchInterval, in seconds (365 days).
NOTE: this is limited by db.fetch.interval.max. Pages with
fetchInterval larger than db.fetch.interval.max
will be fetched anyway.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.sync_delta</name>
<value>true</value>
<description>If true, try to synchronize with the time of page change.
by shifting the next fetchTime by a fraction (sync_rate) of the difference
between the last modification time, and the last fetch time.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.sync_delta_rate</name>
<value>0.3</value>
<description>See sync_delta for description. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.sync_delta_rate</name>
<value>0.3</value>
<description>See sync_delta for description. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property-->
<property>
<name>fetcher.threads.fetch</name>
<value>1</value>
<description>The number of FetcherThreads the fetcher should use.
This is also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/apache-nutch/tmp/</value>
</property>
<!-- Boilerpipe -->
<property>
<name>tika.boilerpipe</name>
<value>true</value>
</property>
<property>
<name>tika.boilerpipe.extractor</name>
<value>ArticleExtractor</value>
</property>
</configuration>
As you can see, I have configured nutch to always refetch all the documents.
Because the site is small, it should be ok for now to refetch everything (the first time takes only 30 minutes...).
I have noticed that every day more or less 40 new segments are created in the crawldb/segments folder.
The disk size of the database is of course growing very fast.
Is this the expected behaviour ? Is there something wrong with the configuration?
It is necessary to delete from the nutchdb the segments that are older than db.fetch.interval.default. This interval defines when a page should be re-fetched.
Once a page has been re-fetched, the old segments can be deleted.
If the segments are not deleted, the SolrIndexer step has to read too many segments and becomes very slow (in my case one hour instead of 4 minutes).
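As an illustration, a minimal cleanup sketch, assuming the crawl runs in local mode so the segments are plain timestamped directories under nutchdb/segments, and that everything older than one day has already been re-fetched (check your own fetch interval before deleting anything):
# remove segment directories older than one day
find nutchdb/segments -mindepth 1 -maxdepth 1 -type d -mtime +1 -exec rm -rf {} \;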

How to view the FileSystem of Hadoop out of the local cluster, using webHDFS

I'm new to Hadoop, and it took me one week to find webHDFS, which I think can help me view the FileSystem from outside the cluster. I can view the filesystem at "http://master:50070/webhdfs/v1/user/hadoop?user.name=hadoopes&op=LISTSTATUS",
however, it shows:
{"FileStatuses":{"FileStatus":[
{"accessTime":0,"blockSize":0,"group":"supergroup","length":0,"modificationTime":1337823103411,"owner":"hadoop","pathSuffix":"Yijin","permission":"777","replication":0,"type":"DIRECTORY"},
{"accessTime":1337824794722,"blockSize":67108864,"group":"supergroup","length":11,"modificationTime":1337751080433,"owner":"pc","pathSuffix":"hello.txt","permission":"644","replication":2,"type":"FILE"},
{"accessTime":0,"blockSize":0,"group":"supergroup","length":0,"modificationTime":1337848266732,"owner":"hadoop","pathSuffix":"test","permission":"755","replication":0,"type":"DIRECTORY"},
{"accessTime":1337824798450,"blockSize":67108864,"group":"supergroup","length":18,"modificationTime":1337751301976,"owner":"pc","pathSuffix":"test2.txt","permission":"644","replication":2,"type":"FILE"},
{"accessTime":0,"blockSize":0,"group":"supergroup","length":0,"modificationTime":1337821412488,"owner":"hadoop","pathSuffix":"small","permission":"777","replication":0,"type":"DIRECTORY"}
]}}
It's very hard to read.
Is there any other way to view the filesystem through webHDFS?
Here is my hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
WebHDFS returns all its content in JSON format. If you want a user-friendly output format, just point your browser to http://master:50070/ and drill down from there.
You can build a class that follows the JSON schema of the object returned by the LISTSTATUS operation. Use a mapper (e.g. Jackson's ObjectMapper) to read the JSON and convert it into your class. Finally you can present it however you wish!
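For instance, a minimal sketch of that approach with Jackson; the POJO classes, URL and output formatting are my own, only the field names come from the LISTSTATUS response shown above:
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URL;
import java.util.List;

public class ListStatusClient {

    // One entry of the FileStatus array; unknown JSON fields are ignored.
    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class FileStatus {
        public String pathSuffix;
        public String type;
        public long length;
        public String owner;
        public String group;
        public String permission;
        public long modificationTime;
    }

    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class FileStatuses {
        public List<FileStatus> FileStatus;
    }

    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class ListStatusResponse {
        public FileStatuses FileStatuses;
    }

    public static void main(String[] args) throws Exception {
        // Adjust host, path and user.name to your cluster.
        URL url = new URL("http://master:50070/webhdfs/v1/user/hadoop?user.name=hadoop&op=LISTSTATUS");
        ObjectMapper mapper = new ObjectMapper();
        ListStatusResponse response = mapper.readValue(url, ListStatusResponse.class);
        // Print one readable line per file or directory.
        for (FileStatus fs : response.FileStatuses.FileStatus) {
            System.out.printf("%-9s %-8s %10d  %s%n", fs.type, fs.owner, fs.length, fs.pathSuffix);
        }
    }
}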
