Reducing Elasticsearch's index size

I currently have a large number of log files being analyzed by Logstash, and consequently a large amount of space being used in Elasticsearch.
However, a lot of this data is useless to me, since not all of it is ever displayed in Kibana.
So I'm wondering: is there a way to keep the index size minimal and only store matching events?
Edit: Perhaps I am being unclear on what I would like to achieve. I have several logs that fall under different categories (because they do not serve the same purpose and are not built the same way). I created several filter configuration files that correspond to these different types of logs.
Currently, all the data from all my logs is being stored in Elasticsearch. For example, if I search for the text "cat" in one of my logs, the event containing "cat" will be stored, but so will the other 10,000 lines. I want to avoid this and store only that one event in Elasticsearch's index.

You've not really given much information to go on, but as I see it you have two choices. The first is to update your Logstash filters so that you only send the data you're interested in to Elasticsearch: use conditional logic around drop {} to discard entire events, or mutate { remove_field } to get rid of individual fields within certain events.
Your other choice is to close or delete old indices in your Elasticsearch cluster. This reduces the amount of information occupying heap space and has an immediate effect, whereas the first option only affects future logs. The easiest way to close or delete old indices is to use Curator.
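For example, a minimal Curator action file could look like this (a sketch only: it assumes daily Logstash indices named logstash-YYYY.MM.dd and a 30-day retention, both of which are illustrative):

```yaml
actions:
  1:
    action: delete_indices
    description: "Delete logstash- indices older than 30 days"
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30
```

You would run it with curator --config config.yml action.yml, typically from a daily cron job.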
EDIT:
From your further question I would suggest:
Add a tag like "drop" to all your inputs.
As part of each grok, remove that tag on a successful match, so that when the grok works, the "drop" tag is gone.
In your output, wrap the output in conditional logic so that you only save records without the "drop" tag, e.g.:
output { if "drop" not in [tags] { etc } }
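Put together, a sketch of such a pipeline could look like this (the file path, grok pattern, and hosts are placeholders, not from the original post):

```conf
input {
  file {
    path => "/var/log/myapp/*.log"   # hypothetical log location
    tags => ["drop"]                 # start every event as droppable
  }
}

filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:msg}" }
    remove_tag => ["drop"]           # only applied on a successful match
  }
}

output {
  if "drop" not in [tags] {
    elasticsearch { hosts => ["localhost:9200"] }
  }
}
```

Events that fail the grok keep the "drop" tag and never reach Elasticsearch.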

Related

How do I exclude/predefine fields for Index Patterns in Kibana?

I am using ELK to monitor REST API servers. Logstash decomposes the URL into a JSON object with fields for query parameters, headers, and request duration.
TL;DR: I want all these fields retained so that when I look at a specific message I can see all the details, but I only need a few of them to query and generate reports/visualizations in Kibana.
I've been testing for a few weeks and adding some new fields on the server side. Whenever I do, I need to rescan the index. However, the auto-detection now finds 300+ fields, and I'm guessing it indexes all of them.
I would like to restrict it to indexing just a set of fields, as I suspect that the more fields it detects, the larger the index gets.
It was about 300 MB/day for a week (100-200 fields); then, after I added a new field and needed to refresh, it went to 350 fields and 1 GB/day. After I accidentally deleted the ELK instance yesterday, I redid everything, and now the indexes are around 100 MB/day so far, which is why I got curious.
I found these docs, but I'm not sure which ones are relevant or how they relate or fit together.
Mapping, index patterns, indices, templates/filebeats/rollup policy
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
https://discuss.elastic.co/t/index-lifecycle-management-for-existing-indices/181749/3
https://www.elastic.co/guide/en/elasticsearch/reference/7.3/indices-templates.html
(One has a PUT call that sends a huge JSON body, but I'm not sure how you would enter something like that in PuTTY. Postman or JMeter maybe, but these calls need to be executed on the server itself, which is just an SSH session with no GUI or text window.)
To remove fields from your log (since you are using logstash), you can use remove_field option of logstash mutate filter.
Ref: Mutate filter plugin
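A minimal sketch (the field names are placeholders for your actual header/query fields):

```conf
filter {
  mutate {
    # Keep the fields you query on; drop the rest before indexing.
    remove_field => ["headers", "query_params", "[request][raw]"]
  }
}
```

If you also want to stop Elasticsearch from dynamically mapping every new field, you can put an index template in place. A sketch, with an illustrative template name and pattern; "dynamic": false keeps unmapped fields in _source but does not index them:

```json
PUT _template/myapp
{
  "index_patterns": ["myapp-*"],
  "mappings": {
    "dynamic": false,
    "properties": {
      "url":              { "type": "keyword" },
      "request_duration": { "type": "long" }
    }
  }
}
```

From a plain SSH session you can save the JSON body to a file and send it with curl -XPUT -H 'Content-Type: application/json' localhost:9200/_template/myapp -d @template.json, so no GUI is needed.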

How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I suspect that keeping all of it under one index would make it rough to query. The row data is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it is explaining what to do, but not necessarily all the time explaining why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and whether their results are relevant enough. You can then change the index mappings or the queries to get faster and more relevant results, and, if needed, add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside the field and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with something like ["draft", "review", "published"] values.
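The last two suggestions could be sketched as follows (index and field names are illustrative). A Logstash elasticsearch output that writes to daily indices:

```conf
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "index-%{+YYYY.MM.dd}"  # one index per day, e.g. index-2019.08.11
  }
}
```

and a mapping that uses keyword for exact-match fields while keeping full-text search on the message body:

```json
{
  "mappings": {
    "properties": {
      "status":  { "type": "keyword" },
      "tags":    { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}
```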
Good luck!

Can I get messages from the Kibana visualization?

Wondering if there is a way to get a list of the messages related to a Kibana visualization. I understand that if I apply the same filter in "Discover" as in the "Visualization", I can filter the related messages. But I want a more direct user experience: a user clicks on a region of a graph and gets the related messages that formed that region. Is there any way to do it?
This helped me:
https://discuss.elastic.co/t/can-i-get-the-related-messages-from-a-kibana-visualization/101692/2
It says:
Not directly, unfortunately. You can click on the visualization to create a filter, and you can pin that filter and take it to discover, which will do what you're asking, but isn't very obvious.
The reason is that visualizations are built using aggregate data, so they don't know what the underlying documents are, they only know the aggregate representation of the information. For example, if you have a bunch of traffic data, and you are looking at bytes over time, the records get bucketed by time and the aggregate of the bytes in that bucket are shown (average, sum, etc.).
In contrast, Discover only works with the raw documents, showing you exactly what you have stored in Elasticsearch. Both documents and aggregations can use filters and queries, which is why you can create a filter in one and use it in the other, but the underlying data is not the same.

Bulk read of all documents in an elasticsearch alias

I have the following elasticsearch setup:
4 to 6 small-ish indices (<5 million docs, <5Gb each)
they are unioned through an alias
they all contain the same doc type
they change very infrequently (i.e. >99% of the indexing happens when the index is created)
One of the use cases for my app requires to read all documents for the alias, ordered by a field, do some magic and serve the result.
I understand that using deep pagination will most likely bring down my cluster, or at the very least have dismal performance, so I'm wondering if the scroll API could be the solution. I know the documentation says it is not intended for use in real-time user queries, but what are the actual reasons for that?
Generally, how are people dealing with having to read through all the documents in an index? Should I look for another way to chunk the data?
When you use the scroll API, Elasticsearch creates a sort of cursor over the current state of the index, so the reason it is not recommended for real-time search is that you will not see any new documents that were inserted after you created the scroll token.
Since your use case indicates that you rarely update or insert new documents into your indices, that may not be an issue for you.
When generating the scroll token you can specify a query with a sort, so if your documents have some sort of timestamp, you could create one scroll context for all documents with timestamp: { lte: "now" } and another scroll (or even a simple query) for the rest of the documents that were not included in the first search context, by specifying an appropriate date-range filter.
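As a sketch (the alias name and timestamp field are illustrative), the initial scroll request could look like:

```json
POST /my-alias/_search?scroll=1m
{
  "size": 1000,
  "sort": [{ "timestamp": "asc" }],
  "query": { "range": { "timestamp": { "lte": "now" } } }
}
```

Subsequent pages are then fetched by POSTing the returned _scroll_id to /_search/scroll with the same scroll=1m keep-alive, until no more hits are returned.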

In Elasticsearch, what happens if I set 'store' to yes on a few fields, but _source to false?

We're building a "unified" search across a lot of different resources in our system. Our index schema includes about 10 generic fields that are indexed, plus 5 which are required to identify the appropriate resource location in our system when results are returned.
The indexed fields often contain sensitive data, so we don't want them stored at all, only indexed for matching, thus we set the _source to FALSE.
I do however want the 5 ident fields returned, so is it possible to set the ident fields to store = yes, but the overall index _source to FALSE and get what I'm looking for in the results?
Have a look at this other answer as well. As mentioned there, in most cases the _source field helps a lot. Even though it might seem like a waste, because Elasticsearch effectively stores the whole document that comes in, it is really handy (e.g. when you need to update documents without sending the whole updated document). It also hides a Lucene implementation detail: the fact that you need to explicitly store fields if you want to get them back, while users usually expect to get back what they sent to the search engine. Surprisingly, the _source helps performance-wise too, as it requires a single disk seek instead of the multiple disk seeks that retrieving several stored fields might cause. At the end of the day, the _source field is just a big Lucene stored field containing JSON, which can be parsed to get at specific fields and work with them, without needing to store each field separately.
That said, depending on your use case (how many fields you retrieve), it might be useful to have a look at source includes/excludes at the bottom of the _source field reference, which lets you keep parts of your documents (e.g. the sensitive parts) out of the stored _source. That is useful if you want to keep relying on the _source but don't want part of the input documents to be returned, while still being able to search against those fields, since they are still indexed (but not stored!) in the underlying Lucene index.
In both cases (whether you disable the _source completely or exclude some parts of it), if you plan to update your documents, keep in mind that you'll need to send the whole updated document using the index API. You cannot rely on the partial updates provided by the update API, because the index no longer holds the complete document that you indexed in the first place, which is what the changes would need to be applied to.
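As a sketch of the exclude approach (the field names are hypothetical), the mapping could look like:

```json
PUT /my-index
{
  "mappings": {
    "_source": {
      "excludes": ["sensitive_*"]
    },
    "properties": {
      "sensitive_notes": { "type": "text" },
      "resource_id":     { "type": "keyword" }
    }
  }
}
```

Here sensitive_notes remains searchable, because it is still indexed, but it is never kept in or returned from _source.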
Yes, stored fields do not rely on the _source field, nor vice versa. They are separate, and changing or disabling one shouldn't impact the other.
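A minimal sketch of the setup described in the question, with illustrative field names: disable _source entirely and store only the ident field.

```json
PUT /my-index
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "content":     { "type": "text" },
      "resource_id": { "type": "keyword", "store": true }
    }
  }
}
```

Searches can then request "stored_fields": ["resource_id"] to get the ident fields back in the hits, even though _source is disabled and the matched content itself is never returned.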
