How does storage beign used in elasticsearch 2.x - elasticsearch

I am using elasticsearch 2.x . I have total 9 nodes with M4.4xlarge and I want to downsize to lesser nodes .
Total docs I need to store 2 Million
Per doc size = 30KB
With this stats I Believe Elasticsearch would need
30 KB * 2000000 docs = 60000000 KB = 60000 = 60GB ~
However when I have indexed all docs I see its 500GB data.
I am confused as how my index has grown this much
could someone please give me some insights to work upon

Related

What will be specific benchmark of elasticsearch reads, writes and updates?

We are using Elasticsearch (version 5.6.0) data updates of around 13M documents with each document in the nested structure having max 100 key value pair, it takes around 34 min to update 99 indices. Hardware is as follows:
5 M4-4x large machines (32G RAM and 8 cores)
500GB disk
So, what should be the Ideal update time elasticsearch should take for the update?
What are the optimization I can do to get good performance?

Elasticsearch Indexing Speed Degrade with the Time

I am doing some performance tuning in elastic search for my project and I need some help in improving the elastic search indexing speed. I am using ES 5.1.1 and I have 2 nodes setup with 8 shards for the index. I have the servers for 2 nodes with 16GB RAM and 12CPUs allocated for each server with 2.2GHz clock speed. I need to index around 25,000,000 documents within 1.5 hours, which I am currently doing in around 4 hours. I have done the following config changes to improve the indexing time.
Setting ‘indices.store.throttle.type’ to ‘none’
Setting ‘refresh_interval’ to ‘-1’
Increasing ‘translog.flush_threshold_size’ to 1GB
Setting ‘number_of_replicas’ to ‘0’
Using 8 shards for the index
Setting VM Options -Xms8g -Xmx8g (Half of the RAM size)
I am using the bulk processor to generate the documents in my java application and I’m using the following configurations to setup the bulk processor.
Bulk Actions Count : 10000
Bulk Size in MB : 100
Concurrent Requests : 100
Flush Interval : 30
Initially I can index around 356167 in the first minute. But with the time, It decreases and after around 1 hour its around 121280 docs per minute.
How can I keep the indexing rate steady over the time? Is there any other ways to improve the performance?
I highly encourage not to change configuration parameters like the translog flush size, the throttling, unless you know what you are doing (and this does not mean reading some blog post on the internet :-)
Try a single shard per server and especially reduce the bulk size to something like 10MB. 100MB * 100 concurrent requests means you need 10GB of heap to deal with those (without doing anything else). I suppose not all of the documents get indexed because of your rejected tasks in your threadpools.
Start small and get bigger instead of starting big but not have any insight in your indexing.

Elasticsearch and Lucene document limit

Document count in our elasticsearch installation from stats api shows about 700 million when the actual document count is about 27 million from the count api. We understand that this difference is from nested documents count - stats api shows all.
In Lucene documentation, we read that there is 2 billion hard document count limit for a shard. Should I worry that elasticsearch is about to hit the document limit? Or should I monitor the data from the count api?
Yes there is limit to the number of docs per shard of 2 billion, which is a hard lucene limit.
There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents.
You should consider scaling horizontally.

heap memory usage in elasticsearch

Currently I'm working on a project which uses elasticsearch 1.4.1.
The cluster consists of 22 nodes, and 20 of them are data nodes, each node has 16GB as the heap memory.
The thing is, when I'm doing massive queries, some of the nodes(2 or 3) consume 70% of the heap memory, while the rest just use less than 10%.
So I'm wondering is this because most of the queries go to those 2 or 3 nodes?
If not, can I further achieve better performance?
Thanks!
Just updated:
I just ran this command: curl -XGET localhost:9200/_cat/shards?v, and it returned:
index shard prirep state docs store ip node
....
mm 2 r STARTED 2248969 293.6mb 10.2.4.117 Mark Todd
mm 2 p STARTED 2248969 293.6mb 10.2.4.129 Saint Elmo
mm 19 r STARTED 30172116 3.5gb 10.2.4.126 Fixer
mm 19 p STARTED 30172116 3.5gb 10.2.4.123 Loki
....
I'm wondering what the store here means? if it is the actual size of the documents, can I load all of them into memory?
This could be because the document matches for that query is somehow colonized to just those 2 machines.
That is , if there are 20 million matches , chances are that 8 million match belongs to one machine , 8 million match comes from another and only just 2 million comes from rest of the 18 machines.
I am guessing you are also using aggregation in the process due to which field data cache is fabricated.

What is elastic search bounded by? Is it cpu, memory etc

I am running elastic search in my personal box.
Memory: 6GB
Processor: Intel® Core™ i3-3120M CPU # 2.50GHz × 4
OS: Ubuntu 12.04 - 64-bit
ElasticSearch Settings: Only running locally
Version : 1.2.2
ES_MIN_MEM=3g
ES_MAX_MEM=3g
threadpool.bulk.queue_size: 3000
indices.fielddata.cache.size: 25%
http.compression: true
bootstrap.mlockall: true
script.disable_dynamic: true
cluster.name: elasticsearch
index size: 252MB
Scenario:
I am trying to test the performance of my bulk queries/aggregations. The test case is to run asynchronous http requests to node.js which in turn will call elastic search. The tests are running from a Java method. Started with 50 requests at a time. Each request is divided and parallized in to two asynchronous(async.parallel) bulk queries in node.js. I am using node-elasticsearch api (uses elasticsearch 1.3 api). The two bulk queries contain 13 and 10 queries respectively.And the two are asynchronously sent to elastic search from node.js. When the Elastic Search returns, the query results are combined and sent back to the test case.
Observations:
I see that all the cpu cores are utilized 100%. Memory is utilized around 90%. The response time for all 50 requests combined is 30 seconds. If I run just the single queries each alone, in the bulk queries, each are returning in less than 100 milli-seconds. Node.js is taking negligible time to forward requests to elastic search and combine responses from elastic search.
Even if run the test case synchronously from java, the response time does not change. I may say that elastic search is not doing parallel processing. Is this because I am CPU or memory bound? One more observation: if I change heap size for elastic search from 1 - 3GB, the response time does not change.
Also I am pasting top command output:
top - 18:04:12 up 4:29, 5 users, load average: 5.93, 5.16, 4.15
Tasks: 224 total, 3 running, 221 sleeping, 0 stopped, 0 zombie
Cpu(s): 98.2%us, 1.0%sy, 0.0%ni, 0.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 5955796k total, 5801920k used, 153876k free, 1548k buffers
Swap: 6133756k total, 708336k used, 5425420k free, 460436k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17410 root 20 0 7495m 3.3g 27m S 366 58.6 5:09.57 java
15356 rmadd 20 0 1015m 125m 3636 S 19 2.2 1:14.03 node
Questions:
Is this expected, because I am running Elastic Search in my local machine and not in a cluster? Can I improve my performance in my local machine? I would definitely start a cluster. But I want to know, how to improve the performance scalably. What is it that the elastic search is bound to?
I am not able to find this in forums. And am sure this would help others. Thanks for your help.

Resources