How can I upload and parse big files from Logstash to Elasticsearch?

I have a 3-node cluster with 1 master and 2 data nodes, each set up with 1TB.
I have set both -Xms24g and -Xmx24g, half my RAM (48GB total).
I then successfully uploaded a 140MB file to Elasticsearch through the Kibana GUI, after increasing the upload limit from 100MB to 1GB.
When I tried to upload the same file with Logstash alone, the process got stuck and broke Elasticsearch.
my pipeline is fairly simple
input {
  file {
    path => "/tmp/*_log"
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
  stdout { codec => rubydebug }
}
Small files work great, but I'm not able to push big files.
The log contains 1 million rows.
I set all fields in /etc/security/limits.conf to unlimited.
Any ideas what I'm missing?

You will need to increase the Logstash heap size in /etc/logstash/jvm.options.
The recommended heap size for typical ingestion scenarios should be no less than 4GB and no more than 8GB.
CPU utilization can increase unnecessarily if the heap size is too low, resulting in the JVM constantly garbage collecting. You can check for this issue by doubling the heap size to see if performance improves.
Do not increase the heap size past the amount of physical memory. Some memory must be left to run the OS and other processes. As a general guideline for most installations, don’t exceed 50-75% of physical memory. The more memory you have, the higher percentage you can use.
Set the minimum (Xms) and maximum (Xmx) heap allocation size to the same value to prevent the heap from resizing at runtime, which is a very costly process.
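For example, a minimal sketch of the relevant lines in /etc/logstash/jvm.options, assuming you settle on an 8GB heap (the exact value depends on your ingest volume and the guidance above):
-Xms8g
-Xmx8g
Restart Logstash after changing these values so the new heap size takes effect.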
You can make more accurate measurements of the JVM heap by using either the jmap command-line utility distributed with Java or by using VisualVM.
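For example, on a JDK 8 toolchain (the exact jmap flags vary by JDK version; the PID is a placeholder):
jmap -heap <logstash_pid>
This prints the configured heap regions and their current utilization, which you can compare against your -Xmx setting.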

Related

How to limit memory usage of Elasticsearch in Ubuntu 17.10?

My Elasticsearch service is consuming around 1GB.
My total memory is 2GB. The Elasticsearch service keeps getting shut down. I guess the reason is the high memory consumption. How can I limit the usage to just 512MB?
This is the memory before starting Elasticsearch.
After running sudo service elasticsearch start the memory consumption jumps.
I appreciate any help! Thanks!
From the official doc:
The default installation of Elasticsearch is configured with a 1 GB heap. For just about every deployment, this number is usually too small. If you are using the default heap values, your cluster is probably configured incorrectly.
So you can change it like this
There are two ways to change the heap size in Elasticsearch. The easiest is to set an environment variable called ES_HEAP_SIZE. When the server process starts, it will read this environment variable and set the heap accordingly. As an example, you can set it via the command line as follows: export ES_HEAP_SIZE=512m
But it's not recommended; you just can't run Elasticsearch optimally with so little RAM available.
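Note that on more recent Elasticsearch versions (5.x and later) ES_HEAP_SIZE is no longer honored; the heap is set in the jvm.options file instead. A minimal sketch for a 512MB heap (the file path may differ depending on how Elasticsearch was installed):
# /etc/elasticsearch/jvm.options
-Xms512m
-Xmx512m
Either way, the warning above still applies: with only 2GB of total RAM the node will struggle.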

Hadoop Data Node: why is there a magic "number" for threshold of data blocks?

Experts,
We may see our block count grow in our Hadoop cluster. "Too many" blocks have consequences such as increased heap requirements at the data node, declining execution speeds, more GC, etc. We should take notice when the number of blocks exceeds a certain "threshold".
I have seen different static numbers for thresholds, such as 200,000 or 500,000 -- "magic" numbers. Shouldn't it be a function of the memory of the node (Java Heap Size of DataNode in Bytes)?
Other interesting related questions:
What does a high block count indicate?
a. too many small files?
b. running out of capacity?
Is it (a) or (b)? How to differentiate between the two?
What is a small file? A file whose size is smaller than block size (dfs.blocksize)?
Does each file take a new data block on disk? or is it the meta data associated with new file that is the problem?
The effects are more GC, declining execution speeds, etc. How to "quantify" the effects of a high block count?
Thanks in advance
Thanks everyone for their input. I have done some research on the topic and share my findings.
Any static number is a magic number. I propose the block count threshold to be: heap memory (in GB) x 1 million x comfort percentage (say 50%)
Why?
Rule of thumb: 1gb for 1M blocks, Cloudera [1]
The actual amount of heap memory required by namenode turns out to be much lower.
Heap needed = (number of blocks + inode (files + folders)) x object size (150-300 bytes [1])
For 1 million small files: heap needed = (1M + 1M) x 300b = 572mb <== much smaller than rule of thumb.
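As a worked example of the proposed threshold (the 16GB heap and the 50% comfort factor are illustrative assumptions): block threshold = 16GB x 1M x 50% = 8 million blocks.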
High block count may indicate both.
The NameNode UI states the heap capacity used.
For example,
http://namenode:50070/dfshealth.html#tab-overview
9,847,555 files and directories, 6,827,152 blocks = 16,674,707 total filesystem object(s).
Heap Memory used 5.82 GB of 15.85 GB Heap Memory. Max Heap Memory is 15.85 GB.
** Note, the heap memory used is still higher than 16,674,707 objects x 300 bytes = 4.65gb
To find out small files, do:
hdfs fsck / -blocks | grep "Total blocks (validated):"
It would return something like:
Total blocks (validated): 2402 (avg. block size 325594 B) <== which is smaller than 1mb
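One way to spot-check for small files under a given directory (the /data path and the 1MB cutoff are assumptions; adjust them to your layout and dfs.blocksize):
hadoop fs -ls -R /data | awk '$1 !~ /^d/ && $5 < 1048576 {print $5, $8}'
This prints the size and path of every non-directory entry smaller than 1MB.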
Yes, a file is small if its size < dfs.blocksize.
Each file takes a new data block on disk, though the block only takes up as much space as the file needs, so a small file means a small block.
For every new file, an inode-type object is created (~150B), which puts stress on the NameNode heap.
Impact on name and data nodes:
Small files pose problems for both name node and data nodes:
name nodes:
- Lower the ceiling on the number of files, since the NameNode needs to keep the metadata for each file in memory
- Long restart times, as it must read the metadata of every file from a cache on local disk
data nodes:
- A large number of small files means a large amount of random disk IO; HDFS is designed for large files and benefits from sequential reads.
[1] https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_nn_memory_config.html
Your first assumption is wrong: the DataNode does not maintain the file structure in memory; it is the job of the NameNode to keep track of the filesystem (via INodes) in memory. So small files will actually cause your NameNode to run out of memory faster (since more metadata is required to represent the same amount of data), and execution speed will be affected since a Mapper is created per block.
For an answer to your first question, check: Namenode file quantity limit
Execute hadoop fs -du -s -h <path> to get the total size of your data, then divide it by the number of files (from the NameNode UI or hdfs fsck); if the resulting average file size is much smaller than the configured block size, you are facing the small-files problem. To check whether you are running out of space: hadoop fs -df -h
Yep, it can be much smaller. If the file is too big, though, it will require an additional block. Once a block is allocated to a file it cannot be used by another file.
A block does not reserve disk space beyond what it actually needs to store the data; it is the metadata on the NameNode that imposes the limits.
As I said before, it is more mapper tasks that need to be executed for the same amount of data. Since each mapper runs in a new JVM, GC is not the problem; the overhead of starting a JVM to process a tiny amount of data is.
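As a rough illustration of that overhead (the numbers are assumptions, not measurements): 10GB stored as 128MB blocks yields 80 map tasks, while the same 10GB spread across 100,000 small files yields 100,000 map tasks, each paying the JVM startup cost to process a tiny amount of data.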

What's the ideal NameNode memory size when there are a lot of files in HDFS?

I will have 200 million files in my HDFS cluster. We know each file will occupy 150 bytes of NameNode memory, plus 3 blocks at 150 bytes each, so that is about 600 bytes per file in the NN.
So I plan to give my NN 250GB of memory to handle 200 million files. My question is: will such a big heap (250GB) cause too much GC pressure? Is it feasible to run a NN with 250GB of memory?
Can someone just say something? Why does nobody answer?
The ideal NameNode memory size is roughly the total space used by the metadata + the OS + the size of the daemons, plus 20-30% headroom for processing related data.
You should also consider the rate at which data comes into your cluster. If you have data coming in at 1TB/day then you must consider a bigger memory allocation, or you will soon run out of memory.
It's always advised to have at least 20% memory free at any point in time. This helps avoid the NameNode going into a full garbage collection.
As Marco mentioned earlier, for GC configuration you may refer to NameNode Garbage Collection Configuration: Best Practices and Rationale.
In your case 256GB looks good if you aren't going to ingest a lot of new data or run lots of operations on the existing data.
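As a rough sanity check using the numbers from the question (illustrative arithmetic only): 200 million files x 600 bytes ≈ 120GB of filesystem objects, which leaves roughly 130GB of a 250GB heap for GC headroom, daemons and processing, in line with the 20-30% guideline above.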
Refer: How to Plan Capacity for Hadoop Cluster?
Also refer: Select the Right Hardware for Your New Hadoop Cluster
You can have 256GB of physical memory in your NameNode host. If your data grows in huge volumes, consider HDFS federation. I assume you already have multiple cores (with or without hyperthreading) in the NameNode host. I guess the link below addresses your GC concerns:
https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html

10 Gb JVM Heap Memory full but only 1Gb Field data cache

We have an ES 1.6 cluster with 4 nodes used to store mostly logging data (~500 documents a second).
ES is configured with 10GB of heap, but after numerous OutOfMemoryExceptions and stop-the-world GCs we limited the field data cache to 10%.
My question is: why are all the nodes' JVMs constantly using ~9GB of heap when field data (which I understand to be one of the primary users of heap) is limited to 1GB?
Some graphs:
It's worth pointing out that our filter cache size is much smaller (~200MB), and yes, the aggressively limited field data cache size does cause a lot of field data cache evictions.
What else is using so much heap?
Thanks

How to improve percolator performance in ElasticSearch?

Summary
We need to increase percolator performance (throughput).
Most likely approach is scaling out to multiple servers.
Questions
How to do scaling out right?
1) Would increasing number of shards in underlying index allow running more percolate requests in parallel?
2) How much memory does ElasticSearch server need if it does percolation only?
Is it better to have 2 servers with 4GB RAM or one server with 16GB RAM?
3) Would having an SSD meaningfully help percolator performance, or is it better to increase RAM and/or the number of nodes?
Our current situation
We have 200,000 queries (job search alerts) in our job index.
We are able to run 4 parallel queues that call percolator.
Every queue is able to percolate a batch of 50 jobs in about 35 seconds, so we can percolate about:
4 queues * 50 jobs per batch / 35 seconds * 60 seconds per minute = 343 jobs per minute
We need more.
Our jobs index has 4 shards and we are using .percolator sitting on top of that jobs index.
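For reference, this is roughly how queries are registered and documents are percolated with the 1.x percolator API (the query, document type and field names below are illustrative assumptions, not our actual alerts):
curl -XPUT 'localhost:9200/jobs/.percolator/alert-1' -d '{ "query": { "match": { "title": "java developer" } } }'
curl -XGET 'localhost:9200/jobs/job/_percolate' -d '{ "doc": { "title": "Senior Java Developer", "location": "New York" } }'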
Hardware: a 2-processor server with 32 cores total and 32GB of RAM.
We allocated 8GB RAM to ElasticSearch.
When percolator is working, 4 percolation queues I mentioned above consume about 50% of CPU.
When we tried to increase number of parallel percolation queues from 4 to 6, CPU utilization jumped to 75%+.
What is worse, percolator started to fail with NoShardAvailableActionException:
[2015-03-04 09:46:22,221][DEBUG][action.percolate ] [Cletus Kasady] [jobs][3] Shard multi percolate failure
org.elasticsearch.action.NoShardAvailableActionException: [jobs][3] null
That error seems to suggest that we should increase the number of shards and eventually add a dedicated Elasticsearch server (and later increase the number of nodes).
Related:
How to Optimize elasticsearch percolator index Memory Performance
Answers
How to do scaling out right?
Q: 1) Would increasing number of shards in underlying index allow running more percolate requests in parallel?
A: No. Sharding is only really useful when creating a cluster. Additional shards on a single instance may in fact worsen performance. In general the number of shards should equal the number of nodes for optimal performance.
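For example, since the shard count of an existing index cannot be changed in place, scaling out typically means re-creating (or re-indexing into) an index whose shard count matches the target node count; a sketch, assuming a two-node target (the index name and counts are illustrative):
curl -XPUT 'localhost:9200/jobs_v2' -d '{ "settings": { "index": { "number_of_shards": 2, "number_of_replicas": 1 } } }'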
Q: 2) How much memory does ElasticSearch server need if it does percolation only?
Is it better to have 2 servers with 4GB RAM or one server with 16GB RAM?
A: Percolator Indices reside entirely in memory so the answer is A LOT. It is entirely dependent on the size of your index. In my experience 200 000 searches would require a 50MB index. In memory this index would occupy around 500MB of heap memory. Therefore 4 GB RAM should be enough if this is all you're running. I would suggest more nodes in your case. However as the size of your index grows, you will need to add RAM.
Q: 3) Would having SSD meaningfully help percolator's performance, or it is better to increase RAM and/or number of nodes?
A: I doubt it. As I said before percolators reside in memory so disk performance isn't much of a bottleneck.
EDIT: Don't take my word on those memory estimates. Check out the site plugins on the main ES site. I found Big Desk particularly helpful for watching performance counters for scaling and planning purposes. This should give you more valuable info on estimating your specific requirements.
EDIT in response to the comment from @DennisGorelik below:
I got those numbers purely from observation but on reflection they make sense.
200K Queries to 50MB on disk: This ratio means the average query occupies 250 bytes when serialized to disk.
50MB index to 500MB on heap: Rather than serialized objects on disk we are dealing with in-memory Java objects. Think about deserializing XML (or any data format, really): you generally get in-memory objects that are roughly 10x larger.
