What does "L" means in Elasticsearch Node status? - elasticsearch

Below is the node status of my Elasticsearch cluster (note the node.role column):
[root@manager]# curl -XGET http://192.168.6.51:9200/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.6.54 20 97 0 0.00 0.00 0.00 dim - siem03.arif.local
192.168.6.51 34 55 0 0.16 0.06 0.01 l - siem00.arif.local
192.168.6.52 15 97 0 0.00 0.00 0.00 dim * siem01.arif.local
192.168.6.53 14 97 0 0.00 0.00 0.00 dim - siem02.arif.local
From Elasticsearch Documentation,
node.role, r, role, nodeRole
(Default) Roles of the node. Returned values include m (master-eligible node), d (data node), i (ingest node), and - (coordinating node only).
So, from the above output, dim means data + master + ingest node, which is correct. But I configured the host siem00.arif.local as a coordinating node, and it shows l, which is not one of the values described in the documentation.
So what does it mean? It was just - before, but after an update (which I pushed to each of the nodes) it shows l in the node.role column.
UPDATE:
All the nodes except the coordinating node were one version behind. I have now updated all of the nodes to the exact same version. Now it works, and here is the output:
[root@manager]# curl -XGET http://192.168.6.51:9200/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.6.53 9 79 2 0.00 0.20 0.19 dilm * siem02.arif.local
192.168.6.52 13 78 2 0.18 0.24 0.20 dilm - siem01.arif.local
192.168.6.51 33 49 1 0.02 0.21 0.20 l - siem00.arif.local
192.168.6.54 12 77 4 0.02 0.19 0.17 dilm - siem03.arif.local
The current version is:
[root@manager]# rpm -qa | grep elasticsearch
elasticsearch-7.4.0-1.x86_64

The built-in roles are indeed d, m, i and -, but any plugin is free to define new roles if needed. There's another one called v for voting-only nodes.
The l role is for Machine Learning nodes (i.e. those with node.ml: true) as can be seen in the source code of MachineLearning.java in the MachineLearning plugin.
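For reference, a coordinating-only node in Elasticsearch 7.x is one with every role disabled, including machine learning. A minimal elasticsearch.yml sketch (these settings exist in 7.4; adapt to your own config):
node.master: false
node.data: false
node.ingest: false
node.ml: false    # without this the node keeps the ml role and _cat/nodes shows "l"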

Related

Aerospike - No improvements in latency on moving to in-memory cluster from on-disk cluster

To begin with, we had an Aerospike cluster of 5 nodes of type i2.2xlarge in AWS, which our production fleet of around 200 servers was using to store/retrieve data. The Aerospike config of the cluster was as follows -
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
pidfile /var/run/aerospike/asd.pid
service-threads 8
transaction-queues 8
transaction-threads-per-queue 4
fabric-workers 8
transaction-pending-limit 100
proto-fd-max 25000
}
logging {
# Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
}
network {
service {
address any
port 3000
}
heartbeat {
mode mesh
port 3002 # Heartbeat port for this node.
# List one or more other nodes, one ip-address & port per line:
mesh-seed-address-port <IP> 3002
mesh-seed-address-port <IP> 3002
mesh-seed-address-port <IP> 3002
mesh-seed-address-port <IP> 3002
# mesh-seed-address-port <IP> 3002
interval 250
timeout 10
}
fabric {
port 3001
}
info {
port 3003
}
}
namespace FC {
replication-factor 2
memory-size 7G
default-ttl 30d # 30 days, use 0 to never expire/evict.
high-water-disk-pct 80 # How full may the disk become before the server begins eviction
high-water-memory-pct 70 # Evict non-zero TTL data if capacity exceeds # 70% of 15GB
stop-writes-pct 90 # Stop writes if capacity exceeds 90% of 15GB
storage-engine device {
device /dev/xvdb1
write-block-size 256K
}
}
It was properly handling the traffic corresponding to the namespace "FC", with latencies within 14 ms, as shown in the following graph plotted using Graphite -
However, on turning on another namespace with much higher traffic on the same cluster, it started to give a lot of timeouts and higher latencies as we scaled up the number of application servers using the same 5-node cluster (stepping the server count from 20 to 40 to 60), with the following namespace configuration -
namespace HEAVYNAMESPACE {
replication-factor 2
memory-size 35G
default-ttl 30d # 30 days, use 0 to never expire/evict.
high-water-disk-pct 80 # How full may the disk become before the server begins eviction
high-water-memory-pct 70 # Evict non-zero TTL data if capacity exceeds # 70% of 35GB
stop-writes-pct 90 # Stop writes if capacity exceeds 90% of 35GB
storage-engine device {
device /dev/xvdb8
write-block-size 256K
}
}
Following were the observations -
----FC Namespace----
20 - servers, 6k Write TPS, 16K Read TPS
set latency = 10ms
set timeouts = 1
get latency = 15ms
get timeouts = 3
40 - servers, 12k Write TPS, 17K Read TPS
set latency = 12ms
set timeouts = 1
get latency = 20ms
get timeouts = 5
60 - servers, 17k Write TPS, 18K Read TPS
set latency = 25ms
set timeouts = 5
get latency = 30ms
get timeouts = 10-50 (fluctuating)
----HEAVYNAMESPACE----
20 - del servers, 6k Write TPS, 16K Read TPS
set latency = 7ms
set timeouts = 1
get latency = 5ms
get timeouts = 0
no of keys = 47 million x 2
disk usage = 121 gb
ram usage = 5.62 gb
40 - del servers, 12k Write TPS, 17K Read TPS
set latency = 15ms
set timeouts = 5
get latency = 12ms
get timeouts = 2
60 - del servers, 17k Write TPS, 18K Read TPS
set latency = 25ms
set timeouts = 25-75 (fluctuating)
get latency = 25ms
get timeouts = 2-15 (fluctuating)
* Set latency refers to latency in setting aerospike cache keys and similarly get for getting keys.
We had to turn off the namespace "HEAVYNAMESPACE" after reaching 60 servers.
We then started a fresh POC with a cluster whose nodes were r3.4xlarge AWS instances (details here: https://aws.amazon.com/ec2/instance-types/), with the key difference in the Aerospike configuration being the use of memory only for caching, hoping that it would give better performance. Here is the aerospike.conf file -
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
pidfile /var/run/aerospike/asd.pid
service-threads 16
transaction-queues 16
transaction-threads-per-queue 4
proto-fd-max 15000
}
logging {
# Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
}
network {
service {
address any
port 3000
}
heartbeat {
mode mesh
port 3002 # Heartbeat port for this node.
# List one or more other nodes, one ip-address & port per line:
mesh-seed-address-port <IP> 3002
mesh-seed-address-port <IP> 3002
mesh-seed-address-port <IP> 3002
mesh-seed-address-port <IP> 3002
mesh-seed-address-port <IP> 3002
interval 250
timeout 10
}
fabric {
port 3001
}
info {
port 3003
}
}
namespace FC {
replication-factor 2
memory-size 30G
storage-engine memory
default-ttl 30d # 30 days, use 0 to never expire/evict.
high-water-memory-pct 80 # Evict non-zero TTL data if capacity exceeds 80% of 30GB
stop-writes-pct 90 # Stop writes if capacity exceeds 90% of 30GB
}
We began with the FC namespace only, and decided to go ahead with the HEAVYNAMESPACE only if we saw significant improvements with the FC namespace, but we didn't. Here are the current observations with different combinations of node count and server count -
Current stats
Observation Point 1 - 4 nodes serving 130 servers.
Point 2 - 5 nodes serving 80 servers.
Point 3 - 5 nodes serving 100 servers.
These observation points are highlighted in the graphs below -
Get latency -
Set successes (giving a measure of the load handled by the cluster) -
We also observed that -
Total memory usage across cluster is 5.52 GB of 144 GB. Node-wise memory usage is ~ 1.10 GB out of 28.90 GB.
There were no observed write failures yet.
There were occasional get/set timeouts which looked fine.
No evicted objects.
Conclusion
We are not seeing the improvements we had expected from the memory-only configuration. We would like some pointers on how to scale up at the same cost -
- via tweaking the aerospike configurations
- or by using some more suitable AWS instance type (even if that would lead to cost cutting).
Update
Output of the top command on one of the Aerospike servers, to show si (pointed out by @Sunil in his answer) -
$ top
top - 08:02:21 up 188 days, 48 min, 1 user, load average: 0.07, 0.07, 0.02
Tasks: 179 total, 1 running, 178 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 0.1%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 125904196k total, 2726964k used, 123177232k free, 148612k buffers
Swap: 0k total, 0k used, 0k free, 445968k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
63421 root 20 0 5217m 1.6g 4340 S 6.3 1.3 461:08.83 asd
If I am not wrong, the si appears to be 0.2%. I checked the same on all the nodes of the cluster: it is 0.2% on one and 0.1% on the remaining three.
Also, here is the output of the network stats on the same node -
$ sar -n DEV 10 10
Linux 4.4.30-32.54.amzn1.x86_64 (ip-10-111-215-72) 07/10/17 _x86_64_ (16 CPU)
08:09:16 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:09:26 lo 12.20 12.20 5.61 5.61 0.00 0.00 0.00 0.00
08:09:26 eth0 2763.60 1471.60 299.24 233.08 0.00 0.00 0.00 0.00
08:09:26 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:09:36 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:09:36 eth0 2772.60 1474.50 300.08 233.48 0.00 0.00 0.00 0.00
08:09:36 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:09:46 lo 17.90 17.90 15.21 15.21 0.00 0.00 0.00 0.00
08:09:46 eth0 2802.80 1491.90 304.63 245.33 0.00 0.00 0.00 0.00
08:09:46 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:09:56 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:09:56 eth0 2805.20 1494.30 304.37 237.51 0.00 0.00 0.00 0.00
08:09:56 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:06 lo 9.40 9.40 5.05 5.05 0.00 0.00 0.00 0.00
08:10:06 eth0 3144.10 1702.30 342.54 255.34 0.00 0.00 0.00 0.00
08:10:06 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:16 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:10:16 eth0 2862.70 1522.20 310.15 238.32 0.00 0.00 0.00 0.00
08:10:16 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:26 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:10:26 eth0 2738.40 1453.80 295.85 231.47 0.00 0.00 0.00 0.00
08:10:26 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:36 lo 11.79 11.79 5.59 5.59 0.00 0.00 0.00 0.00
08:10:36 eth0 2758.14 1464.14 297.59 231.47 0.00 0.00 0.00 0.00
08:10:36 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:46 lo 12.00 12.00 5.60 5.60 0.00 0.00 0.00 0.00
08:10:46 eth0 3100.40 1811.30 328.31 289.92 0.00 0.00 0.00 0.00
08:10:46 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:10:56 lo 9.40 9.40 5.05 5.05 0.00 0.00 0.00 0.00
08:10:56 eth0 2753.40 1460.80 297.15 231.98 0.00 0.00 0.00 0.00
Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: lo 12.07 12.07 6.45 6.45 0.00 0.00 0.00 0.00
Average: eth0 2850.12 1534.68 307.99 242.79 0.00 0.00 0.00 0.00
From the above, I think the total number of packets handled per second should be 2850.12 + 1534.68 = 4384.8 (sum of rxpck/s and txpck/s), which is well within the 250K packets per second mentioned in the Amazon EC2 deployment guide on the Aerospike site referenced in @RonenBotzer's answer.
Update 2
I ran the asadm command followed by show latency on one of the nodes of the cluster and from the output, it appears that there is no latency beyond 1 ms for both reads and writes -
Admin> show latency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~read Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
ip-10-111-215-72.ec2.internal:3000 11:35:01->11:35:11 1242.1 0.0 0.0 0.0
ip-10-13-215-20.ec2.internal:3000 11:34:57->11:35:07 1297.5 0.0 0.0 0.0
ip-10-150-147-167.ec2.internal:3000 11:35:04->11:35:14 1147.7 0.0 0.0 0.0
ip-10-165-168-246.ec2.internal:3000 11:34:59->11:35:09 1342.2 0.0 0.0 0.0
ip-10-233-158-213.ec2.internal:3000 11:35:00->11:35:10 1218.0 0.0 0.0 0.0
Number of rows: 5
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~write Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
ip-10-111-215-72.ec2.internal:3000 11:35:01->11:35:11 33.0 0.0 0.0 0.0
ip-10-13-215-20.ec2.internal:3000 11:34:57->11:35:07 37.2 0.0 0.0 0.0
ip-10-150-147-167.ec2.internal:3000 11:35:04->11:35:14 36.4 0.0 0.0 0.0
ip-10-165-168-246.ec2.internal:3000 11:34:59->11:35:09 36.9 0.0 0.0 0.0
ip-10-233-158-213.ec2.internal:3000 11:35:00->11:35:10 33.9 0.0 0.0 0.0
Number of rows: 5
Aerospike has several modes for storage that you can configure (see the config sketch after this list):
Data in memory with no persistence
Data in memory, persisted to disk
Data on SSD, primary index in memory (AKA Hybrid Memory architecture)
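As a rough illustration of how these map onto the namespace stanza of aerospike.conf (parameter names are standard Aerospike options; the file path, sizes and device below are placeholders):
# 1. Data in memory with no persistence
storage-engine memory
# 2. Data in memory, persisted to disk (RAM copy backed by a file or device)
storage-engine device {
    file /opt/aerospike/data/fc.dat
    filesize 16G
    data-in-memory true
}
# 3. Data on SSD, primary index in memory (hybrid memory)
storage-engine device {
    device /dev/xvdb1
    write-block-size 128K
    data-in-memory false
}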
In-Memory Optimizations
Releases 3.11 and 3.12 of Aerospike include several big performance improvements for in-memory namespaces.
Among these is a change to how partitions are represented, from a single red-black tree to sprigs (many sub-trees). The new config parameters partition-tree-sprigs and partition-tree-locks should be used appropriately. In your case, as r3.4xlarge instances have 122G of DRAM, you can afford the 311M of overhead associated with setting partition-tree-sprigs to the max value of 4096.
You should also consider the auto-pin=cpu setting. This option requires Linux kernel >= 3.19, which is part of Ubuntu >= 15.04 (but not many other distributions yet).
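A hedged sketch of where these settings would sit in aerospike.conf for an in-memory namespace like yours (syntax assumed from the standard 3.12+ configuration reference; verify against the docs for your build):
service {
    auto-pin cpu                 # requires Linux kernel >= 3.19
}
namespace FC {
    replication-factor 2
    memory-size 30G
    partition-tree-sprigs 4096   # max value; ~311M of extra index overhead
    # partition-tree-locks can also be raised from its default; see the docs
    storage-engine memory
}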
Clustering Improvements
The recent releases 3.13 and 3.14 include a rewrite of the cluster manager. In general you should consider using the latest version, but I'm pointing out the aspects that will directly affect your performance.
EC2 Networking and Aerospike
You don't show the latency numbers of the cluster itself, so I suspect the problem is with the networking, rather than the nodes.
Older instance family types, such as the r3, c3, i2, come with ENIs - NICs which have a single transmit/receive queue. The software interrupts of cores accessing this queue may become a bottleneck as the number of CPUs increases, all of which need to wait for their turn to use the NIC. There's a knowledge base article in the Aerospike community discussion forum on using multiple ENIs with Aerospike to get around the limited performance capacity of the single ENI you initially get with such an instance. The Amazon EC2 deployment guide on the Aerospike site talks about using RPS to maximize TPS when you're in an instance that uses ENIs.
Alternatively, you should consider moving to the newer instances (r4, i3, etc) which come with multiqueue ENAs. These do not require RPS, and support higher TPS without adding extra cards. They also happen to have better chipsets, and cost significantly less than their older siblings (r4 is roughly 30% cheaper than r3, i3 is about 1/3 the price of the i2).
Your title is misleading. Please consider changing it. You moved from on-disk to in-memory.
mem+disk means data is both on disk and mem (using data-in-memory=true).
My best guess is that one CPU is bottlenecked doing network I/O. Take a look at the top output and check the si (software interrupts) figure per CPU. If one CPU is showing a much higher value than the others, the simplest thing you can try is RPS (Receive Packet Steering):
echo f|sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
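To see the per-CPU breakdown directly (top shows it after pressing 1), something like mpstat from the sysstat package works too; a quick sketch:
$ mpstat -P ALL 2 5    # per-CPU utilisation, 5 samples 2 seconds apart
# Watch the %soft column: if one CPU carries nearly all of the softirq load
# while the others sit idle, the single receive queue is the bottleneck.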
Once you confirm that it is a network bottleneck, you can try ENA as suggested by @Ronen.
Going into details: when you had 15 ms latency with only FC, I assume the TPS was low. But when you added the heavy load of HEAVYNAMESPACE in prod, the latency kept increasing as you added more client nodes and hence more TPS. Similarly, in your POC the latency increased with the number of client nodes. The latency is under 15 ms even with 130 servers, which is partly good. I am not sure I understood your set_success graph; I am assuming it is in kTPS.
Update:
After looking at the server-side latency histogram, it looks like the server is doing fine.
Most likely it is a client issue. Check CPU and network on the client machine(s).

Indexing very slow in elasticsearch

I can't increase indexing beyond 10,000 events/second no matter what I do. I am getting around 13,000 events per second from Kafka in a single Logstash instance, and I am running 3 Logstash instances on different machines reading from the same Kafka topic.
I have set up an ELK cluster with 3 Logstash instances reading data from Kafka and sending it to my Elasticsearch cluster.
My cluster contains 3 Logstash instances, 3 Elasticsearch master nodes, 3 Elasticsearch client nodes and 50 Elasticsearch data nodes.
Logstash 2.0.4
Elastic Search 5.0.2
Kibana 5.0.2
All are Citrix VMs with the same configuration:
Red Hat Linux 7
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 6 cores
32 GB RAM
2 TB spinning media
Logstash Config file :
output {
elasticsearch {
hosts => ["dataNode1:9200","dataNode2:9200","dataNode3:9200" upto "**dataNode50**:9200"]
index => "logstash-applogs-%{+YYYY.MM.dd}-1"
workers => 6
user => "uname"
password => "pwd"
}
}
Elasticsearch data node's elasticsearch.yml file:
cluster.name: my-cluster-name
node.name: node46-data-46
node.master: false
node.data: true
bootstrap.memory_lock: true
path.data: /apps/dataES1/data
path.logs: /apps/dataES1/logs
discovery.zen.ping.unicast.hosts: ["master1","master2","master3"]
network.host: hostname
http.port: 9200
The only change that I made in my jvm.options file is:
-Xms15g
-Xmx15g
The system config changes that I made are as follows:
vm.max_map_count=262144
and in /etc/security/limits.conf I added :
elastic soft nofile 65536
elastic hard nofile 65536
elastic soft memlock unlimited
elastic hard memlock unlimited
elastic soft nproc 65536
elastic hard nproc unlimited
Indexing Rate
On one of the active data nodes:
$ sudo iotop -o
Total DISK READ : 0.00 B/s | Total DISK WRITE : 243.29 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 357.09 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
5199 be/3 root 0.00 B/s 3.92 K/s 0.00 % 1.05 % [jbd2/xvdb1-8]
14079 be/4 elkadmin 0.00 B/s 51.01 K/s 0.00 % 0.53 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
13936 be/4 elkadmin 0.00 B/s 51.01 K/s 0.00 % 0.39 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
13857 be/4 elkadmin 0.00 B/s 58.86 K/s 0.00 % 0.34 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
13960 be/4 elkadmin 0.00 B/s 35.32 K/s 0.00 % 0.33 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
13964 be/4 elkadmin 0.00 B/s 31.39 K/s 0.00 % 0.27 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
14078 be/4 elkadmin 0.00 B/s 11.77 K/s 0.00 % 0.00 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
Index Details :
index shard prirep state docs store
logstash-applogs-2017.01.23-3 11 r STARTED 30528186 35gb
logstash-applogs-2017.01.23-3 11 p STARTED 30528186 30.3gb
logstash-applogs-2017.01.23-3 9 p STARTED 30530585 35.2gb
logstash-applogs-2017.01.23-3 9 r STARTED 30530585 30.5gb
logstash-applogs-2017.01.23-3 1 r STARTED 30526639 30.4gb
logstash-applogs-2017.01.23-3 1 p STARTED 30526668 30.5gb
logstash-applogs-2017.01.23-3 14 p STARTED 30539209 35.5gb
logstash-applogs-2017.01.23-3 14 r STARTED 30539209 35gb
logstash-applogs-2017.01.23-3 12 p STARTED 30536132 30.3gb
logstash-applogs-2017.01.23-3 12 r STARTED 30536132 30.3gb
logstash-applogs-2017.01.23-3 15 p STARTED 30528216 30.4gb
logstash-applogs-2017.01.23-3 15 r STARTED 30528216 30.4gb
logstash-applogs-2017.01.23-3 19 r STARTED 30533725 35.3gb
logstash-applogs-2017.01.23-3 19 p STARTED 30533725 36.4gb
logstash-applogs-2017.01.23-3 18 r STARTED 30525190 30.2gb
logstash-applogs-2017.01.23-3 18 p STARTED 30525190 30.3gb
logstash-applogs-2017.01.23-3 8 p STARTED 30526785 35.8gb
logstash-applogs-2017.01.23-3 8 r STARTED 30526785 35.3gb
logstash-applogs-2017.01.23-3 3 p STARTED 30526960 30.4gb
logstash-applogs-2017.01.23-3 3 r STARTED 30526960 30.2gb
logstash-applogs-2017.01.23-3 5 p STARTED 30522469 35.3gb
logstash-applogs-2017.01.23-3 5 r STARTED 30522469 30.8gb
logstash-applogs-2017.01.23-3 6 p STARTED 30539580 30.9gb
logstash-applogs-2017.01.23-3 6 r STARTED 30539580 30.3gb
logstash-applogs-2017.01.23-3 7 p STARTED 30535488 30.3gb
logstash-applogs-2017.01.23-3 7 r STARTED 30535488 30.4gb
logstash-applogs-2017.01.23-3 2 p STARTED 30524575 35.2gb
logstash-applogs-2017.01.23-3 2 r STARTED 30524575 35.3gb
logstash-applogs-2017.01.23-3 10 p STARTED 30537232 30.4gb
logstash-applogs-2017.01.23-3 10 r STARTED 30537232 30.4gb
logstash-applogs-2017.01.23-3 16 p STARTED 30530098 30.3gb
logstash-applogs-2017.01.23-3 16 r STARTED 30530098 30.3gb
logstash-applogs-2017.01.23-3 4 r STARTED 30529877 30.2gb
logstash-applogs-2017.01.23-3 4 p STARTED 30529877 30.2gb
logstash-applogs-2017.01.23-3 17 r STARTED 30528132 30.2gb
logstash-applogs-2017.01.23-3 17 p STARTED 30528132 30.4gb
logstash-applogs-2017.01.23-3 13 r STARTED 30521873 30.3gb
logstash-applogs-2017.01.23-3 13 p STARTED 30521873 30.4gb
logstash-applogs-2017.01.23-3 0 r STARTED 30520172 30.4gb
logstash-applogs-2017.01.23-3 0 p STARTED 30520172 30.5gb
I tested the incoming data in Logstash by dumping it to a file: I got a 290 MB file with 377822 lines in 30 seconds. So there is no issue on the Kafka side; across my 3 Logstash servers I am receiving roughly 35,000 events per second, but Elasticsearch is only able to index a maximum of about 10,000 events per second.
Can someone please help me with this issue?
Edit: I tried sending requests in batches of the default 125, then 500, 1000 and 10000, but I still didn't get any improvement in indexing speed.
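(For reference, a hedged sketch of where the batch size is set on the Logstash side, assuming the logstash-output-elasticsearch plugin of that era, which accepted a flush_size option; check your plugin version's documentation:)
output {
  elasticsearch {
    hosts => ["dataNode1:9200"]                  # shortened host list for the example
    index => "logstash-applogs-%{+YYYY.MM.dd}-1"
    flush_size => 1000                           # events per bulk request
    workers => 6
  }
}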
I improved the indexing rate by moving to larger machines for the data nodes.
Data Node: A VMWare virtual machine with the following config:
14 CPU # 2.60GHz
64GB RAM, 31GB dedicated for elasticsearch.
The fastest disk available to me was a SAN over Fibre Channel, as I couldn't get any SSDs or local disks.
I achieved maximum indexing rate of 100,000 events per second. Each document size is around 2 to 5 KB.

What do large times spent in Thread#initialize and Thread#join mean in JRuby profiling?

I'm trying to profile an application using JRuby's built-in profiler.
Most of the time is taken in ClassIsOfInterest.method_that_is_of_interest, which in turn has most of its time taken in Thread#initialize and Thread#join
total self children calls method
----------------------------------------------------------------
31.36 0.02 31.35 4525 Array#each
31.06 0.00 31.06 2 Test::Unit::RunCount.run_once
31.06 0.00 31.06 1 Test::Unit::RunCount.run
31.06 0.00 31.06 1 MiniTest::Unit#run
31.06 0.00 31.05 1 MiniTest::Unit#_run
31.01 0.00 31.01 2219 Kernel.send
31.00 0.00 31.00 1 MiniTest::Unit#run_tests
31.00 0.00 31.00 1 MiniTest::Unit#_run_anything
30.99 0.00 30.99 1 Test::Unit::Runner#_run_suites
30.99 0.00 30.99 5 MiniTest::Unit#_run_suite
30.99 0.00 30.98 21629 Array#map
30.98 0.00 30.98 1 Test::Unit::TestCase#run
30.98 0.00 30.98 1 MiniTest::Unit::TestCase#run
30.98 0.00 30.98 659 BasicObject#__send__
30.98 0.00 30.98 1 MyTestClass#my_test_method
30.80 0.00 30.80 18 Enumerable.each_with_index
30.77 0.00 30.77 15 MyTestHelper.generate_call_parser_based_on_barcoded_sequence
30.26 0.00 30.25 4943 Class#new_proxy
26.13 0.00 26.13 15 MyProductionClass1#my_production_method1
<snip boring methods with zero self time>
24.27 0.00 24.27 15 ClassIsOfInterest.method_that_is_of_interest
13.71 0.01 13.71 541 Enumerable.map
13.48 0.86 12.63 30 Range#each
12.62 0.22 12.41 450 Thread.new
12.41 12.41 0.00 450 Thread#initialize
10.78 10.78 0.00 450 Thread#join
4.03 0.12 3.91 539 Kernel.require
3.34 0.00 3.34 248 Kernel.require
2.49 0.00 2.49 15 MyTestFixture.create_fixture
<snip boring methods with small total times>
Each invocation of ClassIsOfInterest.method_that_is_of_interest is creating 30 threads, which is probably overkill, but I assume it shouldn't degrade performance that much. When I only had three threads created per invocation, I got
23.16 0.00 23.15 15 ClassIsOfInterest.method_that_is_of_interest
22.73 22.73 0.00 45 Thread#join
4.18 0.08 4.10 539 Kernel.require
3.56 0.00 3.56 248 Kernel.require
2.78 0.00 2.78 15 MyTestFixture.create_fixture
Do large time values for Thread#initialize (in the first profile) and Thread#join indicate that the code responsible for threading is taking a while, or merely that the code that is executed within the thread is taking a while?
The reason you see Thread#join is that your main thread is spending lots of time waiting for the other threads to finish. Most of the time spent in method_that_is_of_interest is spent blocking on Thread#join because it's not doing any other work. I wouldn't worry too much about it -- the profile is just saying that one of your threads is blocking on what other threads are doing. A better performance measurement in this case is the total running time: run the code with different numbers of threads and see where the sweet spot is.
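For example, a minimal sketch of that measurement in plain Ruby (run_workload stands in for whatever code spawns the threads):
require 'benchmark'

[1, 2, 4, 8, 16, 30].each do |n_threads|
  elapsed = Benchmark.realtime { run_workload(thread_count: n_threads) }
  puts format('%2d threads: %.2fs', n_threads, elapsed)
end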
The reason why Thread.new/Thread#initialize shows up is that threads are expensive objects to create. If you're calling this method often and it creates all those threads every time, I suggest you look into Java's Executors API. Create a thread pool with Executors once (when your application starts up) and submit all the tasks to the pool instead of creating new threads (you can use ExecutorCompletionService to wait for all tasks to complete, or just call #get on the FutureTask instances you get when you submit your tasks).
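A minimal JRuby sketch of that approach (do_expensive_work and the pool size are placeholders; the Java classes come from java.util.concurrent):
require 'java'

java_import java.util.concurrent.Executors
java_import java.util.concurrent.Callable

# Create the pool once, e.g. when the application boots, and reuse it.
THREAD_POOL = Executors.new_fixed_thread_pool(8)

def method_that_is_of_interest(items)
  futures = items.map do |item|
    # Wrap the work in a Callable so submit returns a Future holding the result.
    THREAD_POOL.submit(Callable.impl { do_expensive_work(item) })
  end
  futures.map(&:get) # blocks until every task has finished
end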

Why is it slower to prespecify type in a data.frame?

I was preallocating a big data.frame to fill in later, which I normally do with NA's like this:
n <- 1e6
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
and I wondered if it would make things any faster later if I specified data types up front, so I tested
f1 <- function() {
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
a$c2 <- 1:n
a$c3 <- sample(LETTERS, size= n, replace = TRUE)
}
f2 <- function() {
b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
b$c2 <- 1:n
b$c3 <- sample(LETTERS, size= n, replace = TRUE)
}
> system.time(f1())
user system elapsed
0.219 0.042 0.260
> system.time(f2())
user system elapsed
1.018 0.052 1.072
So it was actually much slower! I tried again with a factor column too, and the difference was closer to 2x than 4x, but I'm curious about why this is slower, and wonder if it is ever appropriate to initialize with data types rather than NAs.
--
Edit: Flodel pointed out that 1:n is integer, not numeric. With that correction the runtimes are nearly identical; of course it hurts to incorrectly specify a data type and change it later!
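For reference, flodel's point can be checked directly in the console:
class(1:5)         # "integer"
class(NA)          # "logical" -- the NA preallocation starts as a logical column
class(numeric(5))  # "numeric" (double), so c2 in f2 starts out as the wrong type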
Assigning any data to a large data frame takes time. If you're going to assign your data all at once in a vector (as you should), it's much faster not to assign the c2 and c3 columns in the original definition at all. For example:
f3 <- function() {
c <- data.frame(c1 = 1:n)
c$c2 <- 1:n
c$c3 <- sample(LETTERS, size= n, replace = TRUE)
}
print(system.time(f1()))
# user system elapsed
# 0.194 0.023 0.216
print(system.time(f2()))
# user system elapsed
# 0.336 0.037 0.374
print(system.time(f3()))
# user system elapsed
# 0.057 0.007 0.063
The reason for this is that when you preassign, a column of length n is created, e.g.
str(data.frame(x=1:2, y = character(2)))
## 'data.frame': 2 obs. of 2 variables:
## $ x: int 1 2
## $ y: Factor w/ 1 level "": 1 1
Note that the character column has been converted to a factor, which will be slower than creating the column with stringsAsFactors = FALSE.
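For example, a variant of f2 that avoids the factor conversion (a sketch, using n as defined in the question):
f2b <- function() {
  b <- data.frame(c1 = 1:n, c2 = integer(n),
                  c3 = character(n), stringsAsFactors = FALSE)
  b$c2 <- 1:n
  b$c3 <- sample(LETTERS, size = n, replace = TRUE)
  b
}
system.time(f2b())  # should land much closer to f1() than the original f2()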
@David Robinson's answer is correct, but I will add some profiling here to show how to investigate why some things are slower than you might expect.
The best thing to do here is some profiling to see what is being called; that can give a clue as to why some calls are slower than others.
library(profr)
profr(f1())
## Read 9 items
## f level time start end leaf source
## 8 f1 1 0.16 0.00 0.16 FALSE <NA>
## 9 data.frame 2 0.04 0.00 0.04 TRUE base
## 10 $<- 2 0.02 0.04 0.06 FALSE base
## 11 sample 2 0.04 0.06 0.10 TRUE base
## 12 $<- 2 0.06 0.10 0.16 FALSE base
## 13 $<-.data.frame 3 0.12 0.04 0.16 TRUE base
profr(f2())
## Read 15 items
## f level time start end leaf source
## 8 f2 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.frame 2 0.12 0.00 0.12 TRUE base
## 10 : 2 0.02 0.12 0.14 TRUE base
## 11 $<- 2 0.02 0.18 0.20 FALSE base
## 12 sample 2 0.02 0.20 0.22 TRUE base
## 13 $<- 2 0.06 0.22 0.28 FALSE base
## 14 as.data.frame 3 0.08 0.04 0.12 FALSE base
## 15 $<-.data.frame 3 0.10 0.18 0.28 TRUE base
## 16 as.data.frame.character 4 0.08 0.04 0.12 FALSE base
## 17 factor 5 0.08 0.04 0.12 FALSE base
## 18 unique 6 0.06 0.04 0.10 FALSE base
## 19 match 6 0.02 0.10 0.12 TRUE base
## 20 unique.default 7 0.06 0.04 0.10 TRUE base
profr(f3())
## Read 4 items
## f level time start end leaf source
## 8 f3 1 0.06 0.00 0.06 FALSE <NA>
## 9 $<- 2 0.02 0.00 0.02 FALSE base
## 10 sample 2 0.04 0.02 0.06 TRUE base
## 11 $<-.data.frame 3 0.02 0.00 0.02 TRUE base
Clearly f2() is slower than f1(), as there are a lot of character-to-factor conversions, recreating of levels, etc.
For efficient use of memory I would suggest the data.table package. This avoids (as much as possible) the internal copying of objects:
library(data.table)
f4 <- function(){
f <- data.table(c1 = 1:n)
f[,c2:=1L:n]
f[,c3:=sample(LETTERS, size= n, replace = TRUE)]
}
system.time(f1())
## user system elapsed
## 0.15 0.02 0.18
system.time(f2())
## user system elapsed
## 0.19 0.00 0.19
system.time(f3())
## user system elapsed
## 0.09 0.00 0.09
system.time(f4())
## user system elapsed
## 0.04 0.00 0.04
Note, that using data.table you could add two columns at once (and by reference)
# Thanks to @Thell for pointing this out.
f[,`:=`(c('c2','c3'), list(1L:n, sample(LETTERS,n, T))), with = F]
EDIT -- functions that will return the required object (well picked up, @Dwin)
n= 1e7
f1 <- function() {
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
a$c2 <- 1:n
a$c3 <- sample(LETTERS, size = n, replace = TRUE)
a
}
f2 <- function() {
b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
b$c2 <- 1:n
b$c3 <- sample(LETTERS, size = n, replace = TRUE)
b
}
f3 <- function() {
c <- data.frame(c1 = 1:n)
c$c2 <- 1:n
c$c3 <- sample(LETTERS, size = n, replace = TRUE)
c
}
f4 <- function() {
f <- data.table(c1 = 1:n)
f[, `:=`(c2, 1L:n)]
f[, `:=`(c3, sample(LETTERS, size = n, replace = TRUE))]
}
system.time(f1())
## user system elapsed
## 1.62 0.34 2.13
system.time(f2())
## user system elapsed
## 2.14 0.66 2.79
system.time(f3())
## user system elapsed
## 0.78 0.25 1.03
system.time(f4())
## user system elapsed
## 0.37 0.08 0.46
profr(f1())
## Read 105 items
## f level time start end leaf source
## 8 f1 1 2.08 0.00 2.08 FALSE <NA>
## 9 data.frame 2 0.66 0.00 0.66 FALSE base
## 10 : 2 0.02 0.66 0.68 TRUE base
## 11 $<- 2 0.32 0.84 1.16 FALSE base
## 12 sample 2 0.40 1.16 1.56 TRUE base
## 13 $<- 2 0.32 1.76 2.08 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.12 0.54 0.66 TRUE base
## 17 $<-.data.frame 3 1.24 0.84 2.08 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f2())
## Read 145 items
## f level time start end leaf source
## 8 f2 1 2.88 0.00 2.88 FALSE <NA>
## 9 data.frame 2 1.40 0.00 1.40 FALSE base
## 10 : 2 0.04 1.40 1.44 TRUE base
## 11 $<- 2 0.36 1.64 2.00 FALSE base
## 12 sample 2 0.40 2.00 2.40 TRUE base
## 13 $<- 2 0.36 2.52 2.88 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 numeric 3 0.06 0.02 0.08 TRUE base
## 16 character 3 0.04 0.08 0.12 TRUE base
## 17 as.data.frame 3 1.06 0.12 1.18 FALSE base
## 18 unlist 3 0.20 1.20 1.40 TRUE base
## 19 $<-.data.frame 3 1.24 1.64 2.88 TRUE base
## 20 as.data.frame.integer 4 0.04 0.12 0.16 TRUE base
## 21 as.data.frame.numeric 4 0.16 0.18 0.34 TRUE base
## 22 as.data.frame.character 4 0.78 0.40 1.18 FALSE base
## 23 factor 5 0.74 0.40 1.14 FALSE base
## 24 as.data.frame.vector 5 0.04 1.14 1.18 TRUE base
## 25 unique 6 0.38 0.40 0.78 FALSE base
## 26 match 6 0.32 0.78 1.10 TRUE base
## 27 unique.default 7 0.38 0.40 0.78 TRUE base
profr(f3())
## Read 37 items
## f level time start end leaf source
## 8 f3 1 0.72 0.00 0.72 FALSE <NA>
## 9 data.frame 2 0.10 0.00 0.10 FALSE base
## 10 : 2 0.02 0.10 0.12 TRUE base
## 11 $<- 2 0.08 0.14 0.22 FALSE base
## 12 sample 2 0.26 0.22 0.48 TRUE base
## 13 $<- 2 0.16 0.56 0.72 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.02 0.08 0.10 TRUE base
## 17 $<-.data.frame 3 0.58 0.14 0.72 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f4())
## Read 15 items
## f level time start end leaf source
## 8 f4 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.table 2 0.02 0.00 0.02 FALSE data.table
## 10 [ 2 0.26 0.02 0.28 FALSE base
## 11 : 3 0.02 0.00 0.02 TRUE base
## 12 [.data.table 3 0.26 0.02 0.28 FALSE <NA>
## 13 eval 4 0.26 0.02 0.28 FALSE base
## 14 eval 5 0.26 0.02 0.28 FALSE base
## 15 : 6 0.02 0.02 0.04 TRUE base
## 16 sample 6 0.24 0.04 0.28 TRUE base

Why is running "unique" faster on a data frame than a matrix in R?

I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique on matrices and data frames: it seems to run faster on a data frame.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time({
u1 = unique(a)
})
user system elapsed
1.840 0.000 1.846
system.time({
u2 = unique(b)
})
user system elapsed
0.380 0.000 0.379
The timing results diverge even more substantially as the number of rows is increased. So, there are two parts to this question.
Why is this slower for a matrix? It seems faster to convert to a data frame, run unique, and then convert back.
Is there any reason not to just wrap unique in myUnique, which does the conversions in part #1?
Note 1. Given that a matrix is atomic, it seems that unique should be faster for a matrix, rather than slower. Being able to iterate over fixed-size, contiguous blocks of memory should generally be faster than running over separate blocks of linked lists (I assume that's how data frames are implemented...).
Note 2. As demonstrated by the performance of data.table, running unique on a data frame or a matrix is a comparatively bad idea - see the answer by Matthew Dowle and the comments for relative timings. I've migrated a lot of objects to data tables, and this performance is another reason to do so. So although users should be well served to adopt data tables, for pedagogical / community reasons I'll leave the question open for now regarding the why does this take longer on the matrix objects. The answers below address where does the time go, and how else can we get better performance (i.e. data tables). The answer to why is close at hand - the code can be found via unique.data.frame and unique.matrix. :) An English explanation of what it's doing & why is all that is lacking.
In this implementation, unique.matrix is the same as unique.array
> identical(unique.array, unique.matrix)
[1] TRUE
unique.array has to handle multi-dimensional arrays, which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()) that are not needed in the 2-dimensional case. The key section of code is:
collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
temp <- if (collapse)
apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest; it just isn't in the current implementation.
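A minimal sketch of the wrapper the question suggests (note that the round trip adds column names V1, V2, ..., so the result is not strictly identical to unique(a); only the rows match):
myUnique <- function(m) {
  as.matrix(unique(as.data.frame(m)))
}
u3 <- myUnique(a)
nrow(u3) == nrow(unique(a))  # TRUE: same unique rows, usually obtained faster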
Note that in all cases (unique.{array,matrix,data.table}) where there is more than one dimension it is the string representation that is compared for uniqueness. For floating point numbers this means 15 decimal digits so
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))
is 1 while
NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))
and
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))
are both 2. Are you sure unique is what you want?
I'm not sure, but my guess is that because a matrix is one contiguous vector, R copies it into column vectors first (like a data.frame) because paste needs a list of vectors. Note that both are slow because both use paste.
Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository because that has the fix to unique you raised in this question. data.table doesn't use paste to do unique.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time(u1<-unique(a))
user system elapsed
2.98 0.00 2.99
system.time(u2<-unique(b))
user system elapsed
0.99 0.00 0.99
c = as.data.table(b)
system.time(u3<-unique(c))
user system elapsed
0.03 0.02 0.05 # 60 times faster than u1, 20 times faster than u2
identical(as.data.table(u2),u3)
[1] TRUE
In attempting to answer my own question, especially part 1, we can see where the time is spent by looking at the results of Rprof. I ran this again, with 5M elements.
Here are the results for the first unique operation (for the matrix):
> summaryRprof("u1.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 5.70 52.58 5.96 54.98
"apply" 2.70 24.91 10.68 98.52
"FUN" 0.86 7.93 6.82 62.92
"lapply" 0.82 7.56 1.00 9.23
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"c" 0.10 0.92 0.10 0.92
"unlist" 0.08 0.74 1.08 9.96
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"duplicated.default" 0.02 0.18 0.02 0.18
$by.total
total.time total.pct self.time self.pct
"unique" 10.84 100.00 0.00 0.00
"unique.matrix" 10.84 100.00 0.00 0.00
"apply" 10.68 98.52 2.70 24.91
"FUN" 6.82 62.92 0.86 7.93
"paste" 5.96 54.98 5.70 52.58
"unlist" 1.08 9.96 0.08 0.74
"lapply" 1.00 9.23 0.82 7.56
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"do.call" 0.14 1.29 0.00 0.00
"c" 0.10 0.92 0.10 0.92
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"aperm" 0.06 0.55 0.00 0.00
"duplicated.default" 0.02 0.18 0.02 0.18
$sample.interval
[1] 0.02
$sampling.time
[1] 10.84
And for the data frame:
> summaryRprof("u2.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 1.72 94.51 1.72 94.51
"[.data.frame" 0.06 3.30 1.82 100.00
"duplicated.default" 0.04 2.20 0.04 2.20
$by.total
total.time total.pct self.time self.pct
"[.data.frame" 1.82 100.00 0.06 3.30
"[" 1.82 100.00 0.00 0.00
"unique" 1.82 100.00 0.00 0.00
"unique.data.frame" 1.82 100.00 0.00 0.00
"duplicated" 1.76 96.70 0.00 0.00
"duplicated.data.frame" 1.76 96.70 0.00 0.00
"paste" 1.72 94.51 1.72 94.51
"do.call" 1.72 94.51 0.00 0.00
"duplicated.default" 0.04 2.20 0.04 2.20
$sample.interval
[1] 0.02
$sampling.time
[1] 1.82
What we notice is that the matrix version spends a lot of time on apply, paste, and lapply. In contrast, the data frame version simply runs duplicated.data.frame and most of the time is spent in paste, presumably aggregating results.
Although this explains where the time is going, it doesn't explain why these have different implementations, nor the effects of simply changing from one object type to another.
