Understanding Scheduling and /proc/[pid]/sched - linux-kernel

I have a single-threaded application that spins and is pinned to a core with taskset and CPU isolation (isolcpus=12-23 nohz_full=21,22,23 rcu_nocbs=12-23; the machine has 2 Skylake CPUs with 12 cores each):
exec taskset -c 22 setuidgid myuser envdir ./env -c = /opt/bin/UtilityServer > /tmp/logs/utility-server.log
After it had been running for a few hours, I checked the stats:
UtilityServer (1934, #threads: 1)
se.exec_start : 78998944.120048
se.vruntime : 78337609.962134
se.sum_exec_runtime : 78337613.040860
se.nr_migrations : 6
nr_switches : 41
nr_voluntary_switches : 31
nr_involuntary_switches : 10
se.load.weight : 1024
policy : 0
prio : 120
clock-delta : 13
mm->numa_scan_seq : 925
numa_migrations, 0
numa_faults_memory, 0, 0, 0, 0, 1
numa_faults_memory, 1, 0, 0, 0, 1
numa_faults_memory, 0, 1, 1, 1, 0
numa_faults_memory, 1, 1, 0, 1, 9
How can I stop the switching and migration, i.e. keep se.nr_migrations, nr_switches, nr_voluntary_switches and nr_involuntary_switches at zero? My application really wants to use the whole core.
Why did the kernel try to migrate the task, given that I have already isolated the core and assigned only one single-threaded application to it?
Does nr_voluntary_switches track the number of times my application voluntarily gave up the core? If so, in what situations would it give up the core? My application does some non-blocking disk I/O (fwrite_unlocked(), etc.) but no networking at all.
In what situations would my application be forced to switch? Does nr_involuntary_switches = 10 mean that it was forced off the core 10 times?
What do the numbers after the numa_faults_memory entries mean?
I am on 3.10.0-862.2.3.el7.x86_64 if it matters.
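To watch whether these counters keep growing, the "name : value" lines can be parsed and compared between two snapshots. The following is only a minimal sketch: the helper names parse_sched and counter_delta are made up for illustration, and SAMPLE is an excerpt of the output quoted above.

```python
SAMPLE = """\
UtilityServer (1934, #threads: 1)
se.nr_migrations : 6
nr_switches : 41
nr_voluntary_switches : 31
nr_involuntary_switches : 10
numa_migrations, 0
"""

def parse_sched(text):
    """Parse 'name : value' lines into a dict of floats.

    Lines without a colon, or whose value is not a number (such as the
    header line), are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = float(value)
        except ValueError:
            continue
    return stats

def counter_delta(before, after, names):
    """Return how much each named counter grew between two snapshots."""
    return {n: after[n] - before[n] for n in names}

stats = parse_sched(SAMPLE)
# In the sample, voluntary plus involuntary switches add up to nr_switches.
total = stats["nr_voluntary_switches"] + stats["nr_involuntary_switches"]
print(stats["nr_switches"], total)
```

On a live system the text would come from reading /proc/<pid>/sched twice, some time apart, and passing both parsed snapshots to counter_delta.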


Prometheus Histogram Vector: All buckets fill equally?

I intend to use a Prometheus Histogram vector to monitor the execution time of request handlers in Go.
I register it so:
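var RequestTimeHistogramVec = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "request_duration_seconds",
		Help:    "Request duration distribution",
		Buckets: []float64{0.125, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5, 7.5, 10, 20},
	},
	[]string{"endpoint"}, // label name as it appears in the /metrics output
)
// The body of init was cut off in the post; registering the vector is assumed.
func init() { prometheus.MustRegister(RequestTimeHistogramVec) }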
I use it so:
startTime := time.Now()
// handle request here
metrics.RequestTimeHistogramVec.WithLabelValues("get:" + endpointName).Observe(time.Since(startTime).Seconds())
When I do an HTTP GET to the /metrics endpoint after using my endpoint a couple of times, I get - amongst other things - the following:
# HELP request_duration_seconds Request duration distribution
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{endpoint="get:/position",le="0.125"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="0.25"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="0.5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="1"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="1.5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="2"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="3"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="4"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="7.5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="10"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="20"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="+Inf"} 6
request_duration_seconds_sum{endpoint="get:/position"} 0.022002387
request_duration_seconds_count{endpoint="get:/position"} 6
From the looks of it, all buckets are filled by the same amount, equal to the number of times I used my endpoint (6 times).
Why does this happen and how may I fix it?
Prometheus histogram buckets are cumulative, so in this case all the requests took less than or equal to 125ms.
Your choice of buckets may not be the best fit here: the six requests took 0.022 s in total (under 4 ms each on average), so you might want to add some buckets well below 125 ms.
This is not an error. Notice that each bucket is labelled le=..., meaning "less than or equal to". Since all 6 requests completed quickly, every bucket counted all of them.
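The cumulative counting described in these answers can be reproduced in a few lines. This is only an illustrative sketch in plain Python, not the Prometheus client's code; the bucket bounds are copied from the question, while the function names and the observation values are made up.

```python
from bisect import bisect_left

# Bucket upper bounds copied from the question (seconds).
BOUNDS = [0.125, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5, 7.5, 10, 20]

def cumulative_counts(observations, bounds=BOUNDS):
    """Return the count exposed for each le=... bucket, ending with +Inf.

    An observation v is counted in every bucket whose bound is >= v.
    """
    per_bucket = [0] * (len(bounds) + 1)  # last slot stands for +Inf
    for v in observations:
        # Index of the first bound that is >= v (len(bounds) means +Inf).
        per_bucket[bisect_left(bounds, v)] += 1
    counts, running = [], 0
    for n in per_bucket:
        running += n
        counts.append(running)
    return counts

def per_bucket_counts(cumulative):
    """Undo the accumulation: observations that fell into each interval."""
    return [c - p for c, p in zip(cumulative, [0] + cumulative[:-1])]

fast = [0.003, 0.004, 0.004, 0.003, 0.004, 0.004]  # hypothetical durations
print(cumulative_counts(fast))                      # every bucket shows 6
print(per_bucket_counts(cumulative_counts(fast)))   # all 6 are in the first one
```

With slower, more varied durations the cumulative counts rise step by step instead of being identical, which is the shape a well-chosen set of buckets should produce.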

How are YARN cluster metrics calculated? Are they an instant snapshot or an average over a period?

For example, by executing this:
I get an output like this:
"clusterMetrics": {
"appsSubmitted": 502521,
"appsCompleted": 501201,
"appsPending": 0,
"appsRunning": 19,
"appsFailed": 454,
"appsKilled": 847,
"reservedMB": 140400,
"availableMB": 12615232,
"allocatedMB": 8830800,
"reservedVirtualCores": 39,
"availableVirtualCores": 6140,
"allocatedVirtualCores": 2065,
"containersAllocated": 1692,
"containersReserved": 39,
"containersPending": 3960,
"totalMB": 21446032,
"totalVirtualCores": 8205,
"totalNodes": 199,
"lostNodes": 1,
"unhealthyNodes": 1,
"decommissionedNodes": 8,
"rebootedNodes": 0,
"activeNodes": 189
For instance, what does allocatedMB mean?
Is it an instantaneous value?
Is it averaged over an interval? If so, is the interval configurable?
allocatedMB is the memory that has been allocated to running containers (though not necessarily used by them). Yes, it is an instantaneous value. There is no interval; it is a snapshot of your cluster at that instant (minus the time it takes to compute these values from the data structures in the Resource Manager and return them via the REST API).
If you want to translate your metrics into words, they say:
You currently have 19 apps running.
These 19 apps have been allocated a total of 2065 vcores.
They have also been allocated 8830800 MB of memory.
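One way to convince yourself that the numbers are a single consistent snapshot is to check that they add up. The sketch below uses only values from the output in the question; the function name summarize and the percentage fields are illustrative additions, not part of the YARN API.

```python
import json

# Subset of the clusterMetrics output quoted in the question.
SNAPSHOT = json.loads("""
{"clusterMetrics": {
  "allocatedMB": 8830800, "availableMB": 12615232, "totalMB": 21446032,
  "allocatedVirtualCores": 2065, "availableVirtualCores": 6140,
  "totalVirtualCores": 8205
}}
""")

def summarize(snapshot):
    """Check that allocated + available equals total, and compute usage."""
    m = snapshot["clusterMetrics"]
    return {
        "memory_adds_up": m["allocatedMB"] + m["availableMB"] == m["totalMB"],
        "vcores_add_up": m["allocatedVirtualCores"] + m["availableVirtualCores"]
        == m["totalVirtualCores"],
        "memory_allocated_pct": round(100 * m["allocatedMB"] / m["totalMB"], 1),
        "vcores_allocated_pct": round(
            100 * m["allocatedVirtualCores"] / m["totalVirtualCores"], 1
        ),
    }

print(summarize(SNAPSHOT))
```

In this sample both sums match exactly, which is what you would expect from values read at one instant rather than averaged over time.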

embedded elasticsearch - second start up takes long time

I am working on a solution that uses embedded elasticsearch server - on one local machine. The scenario is:
1) Create a cluster with one node. Import data: 3 million records in ~180 indexes and 911 shards. The data is available, search works and returns the expected data, and the health looks good:
"cluster_name" : "cn1441023806894",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 911,
"active_shards" : 911,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
2) Now I shut down the server. This is my console output:
sie 31, 2015 2:51:36 PM org.elasticsearch.node.internal.InternalNode stop
INFO: [testbg] stopping ...
sie 31, 2015 2:51:50 PM org.elasticsearch.node.internal.InternalNode stop
INFO: [testbg] stopped
sie 31, 2015 2:51:50 PM org.elasticsearch.node.internal.InternalNode close
INFO: [testbg] closing ...
sie 31, 2015 2:51:50 PM org.elasticsearch.node.internal.InternalNode close
INFO: [testbg] closed
The database folder is around 2.4 GB.
3) Now I start the server again, and it takes around 10 minutes to reach status green. Example health during that time:
"cluster_name" : "cn1441023806894",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 68,
"active_shards" : 68,
"relocating_shards" : 0,
"initializing_shards" : 25,
"unassigned_shards" : 818
After that process, the database folder is ~0.8 GB.
Then I shut down the database and open it again, and now it goes green in 10 seconds. All subsequent close/start operations are quite fast.
My configuration:
settings.put(SET_NODE_NAME, projectNameLC);
settings.put(SET_PATH_DATA, projectLocation + "\\" + CommonConstants.ANALYZER_DB_FOLDER);
settings.put(SET_CLUSTER_NAME, clusterName);
settings.put(SET_NODE_DATA, true);
settings.put(SET_NODE_LOCAL, true);
settings.put(SET_INDEX_REFRESH_INTERVAL, "-1");
settings.put(SET_INDEX_MERGE_ASYNC, true);
//the following settings are my attempt to speed up loading on the 2nd startup
settings.put("cluster.routing.allocation.disk.threshold_enabled", false);
settings.put("index.number_of_replicas", 0);
settings.put("cluster.routing.allocation.disk.include_relocations", false);
settings.put("cluster.routing.allocation.node_initial_primaries_recoveries", 25);
settings.put("cluster.routing.allocation.node_concurrent_recoveries", 8);
settings.put("indices.recovery.concurrent_streams", 6);
settings.put("indices.recovery.concurrent_small_file_streams", 4);
The questions:
1) What happens during the second startup? The database folder shrinks from 2.4 GB to about 0.8 GB.
2) If this process is necessary, can it be triggered manually, so that I can show a nice "please wait" dialog?
The user experience on the second database opening is very bad and I need to change it.
On another forum (https://discuss.elastic.co/t/initializing-shards-second-db-start-up-takes-long-time/28357) I got an answer from Mike Simos. The solution is to call a synced flush on each index after I have finished adding data to it:
client.admin().indices().flush(new FlushRequest(idxName));
And it did the trick: now my database starts in 30 seconds instead of 10 minutes. The time to flush the data has moved to the import part of my business logic, which is acceptable. I also noticed that the impact on import time is not very big.

Get rid of unassigned shard

I have an ELK stack with two ElasticSearch nodes running, and the cluster state turned red due to some unassigned shards which I can't get rid of. This is what I get when I look up the shards of the incomplete index:
# curl -s elastic01.local:9200/_cat/shards | grep "logstash-2014.09.29"
logstash-2014.09.29 4 p STARTED 745489 481.3mb Crimson and the Raven
logstash-2014.09.29 4 r STARTED 745489 481.3mb Glenn Talbot
logstash-2014.09.29 0 p STARTED 781110 502.3mb Crimson and the Raven
logstash-2014.09.29 0 r STARTED 781110 502.3mb Glenn Talbot
logstash-2014.09.29 3 p INITIALIZING Crimson and the Raven
logstash-2014.09.29 3 r UNASSIGNED
logstash-2014.09.29 1 p STARTED 762991 490.1mb Crimson and the Raven
logstash-2014.09.29 1 r STARTED 762991 490.1mb Glenn Talbot
logstash-2014.09.29 2 p STARTED 761811 491.3mb Crimson and the Raven
logstash-2014.09.29 2 r STARTED 761811 491.3mb Glenn Talbot
My attempt to assign the shard to the other node fails:
curl -XPOST -s 'http://elastic01.local:9200/_cluster/reroute?pretty=true' -d '{
"commands" : [ {
"allocate" : {
"index" : "logstash-2014.09.29",
"shard" : 3 ,
"node" : "Glenn Talbot",
"allow_primary" : 1
}
} ]
}'
NO(primary shard is not yet active)]
I can't seem to find an API to push the shard states any further. How could I proceed here?
For a complete picture, this is what the cluster health looks like:
"cluster_name" : "logstash_es",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 114,
"active_shards" : 228,
"relocating_shards" : 0,
"initializing_shards" : 1,
"unassigned_shards" : 1
Thank you for your time and help
I actually ran into this situation with ElasticSearch 1.5 just the other day. After initially getting the same error, I simply repeated the /_cluster/reroute request the next day for lack of other ideas, and it worked, and it put the cluster back into a green state immediately.
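When retrying like this, it helps to see at a glance which shards are still not started. The sketch below filters _cat/shards output for such rows; the helper name problem_shards is made up for illustration, and SAMPLE is an excerpt of the output quoted in the question.

```python
SAMPLE = """\
logstash-2014.09.29 4 p STARTED 745489 481.3mb Crimson and the Raven
logstash-2014.09.29 4 r STARTED 745489 481.3mb Glenn Talbot
logstash-2014.09.29 3 p INITIALIZING Crimson and the Raven
logstash-2014.09.29 3 r UNASSIGNED
logstash-2014.09.29 1 p STARTED 762991 490.1mb Crimson and the Raven
"""

def problem_shards(cat_shards_output):
    """Return (index, shard, prirep, state) for every shard not STARTED.

    Only the first four whitespace-separated columns are used, so node
    names containing spaces do not matter.
    """
    problems = []
    for line in cat_shards_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[1].isdigit():
            continue  # skip blank lines and a possible header row
        index, shard, prirep, state = fields[:4]
        if state != "STARTED":
            problems.append((index, int(shard), prirep, state))
    return problems

for row in problem_shards(SAMPLE):
    print(row)
```

An empty result means every listed shard has started, at which point the cluster health should no longer be red for that index.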

Oozie coordinator application does not work when start and end times are more than one hour apart

I have a problem with my Oozie coordinator application.
Case 1 :
For -
start = "2012-09-07 13:00Z" end="2012-09-07 16:00Z" frequency="coord:hour(1)"
No of actions : 1 (expected is 3)
Nominal Times -
1) 2012-09-07 13:00Z (Two more are expected. 2012-09-07 14:00Z,2012-09-07 15:00Z)
Case 2 :
For -
start = "2012-09-07 13:00Z" end="2012-09-07 16:00Z" frequency = "coord:minutes(10)"
No of actions : 6 (expected is 18)
Nominal Times :
1) 2012-09-07 13:00Z
2) 2012-09-07 13:10Z
3) 2012-09-07 13:20Z
4) 2012-09-07 13:30Z
5) 2012-09-07 13:40Z
6) 2012-09-07 13:50Z (12 more are expected. 2012-09-07 14:00Z,2012-09-07 14:10Z and so on..).
Generalization based on observation:
For any frequency from coord:minutes(1) to coord:minutes(59), the nominal times are calculated correctly, but only for the first hour.
Please tell me if I am missing anything here. I am using Oozie 2.0 and trying a basic coordinator app, which works fine for:
start = "2012-09-07 13:00Z" end = "2012-09-07 13:30Z" frequency = "coord:minutes(10)"
Do the 6 actions finish successfully? There are 3 conditions that the Oozie coordinator checks before invoking a new action: 1) data dependency, 2) frequency, 3) concurrency limit. Any of these 3 conditions may stop the action from being started. It would be helpful if you could show us the coordinator's XML file.
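For reference, the action counts the question expects can be derived by stepping from start to end at the given frequency. This is a minimal sketch of that arithmetic, not Oozie's own materialization logic; it assumes the end time is exclusive (which matches the expected counts of 3 and 18), and the function name nominal_times is made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%MZ"  # timestamp layout as written in the question

def nominal_times(start, end, frequency_minutes):
    """List nominal times from start up to, but not including, end."""
    current = datetime.strptime(start, FMT)
    stop = datetime.strptime(end, FMT)
    step = timedelta(minutes=frequency_minutes)
    times = []
    while current < stop:
        times.append(current.strftime(FMT))
        current += step
    return times

hourly = nominal_times("2012-09-07 13:00Z", "2012-09-07 16:00Z", 60)
every_ten = nominal_times("2012-09-07 13:00Z", "2012-09-07 16:00Z", 10)
print(len(hourly), hourly)
print(len(every_ten), every_ten[-1])
```

Comparing this list with the nominal times Oozie actually created shows exactly which actions are missing, here everything from 14:00Z onwards.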
