How do I exclude/predefine fields for Index Patterns in Kibana? - elasticsearch

I am using ELK to monitor REST API servers. Logstash decomposes the URL into a JSON object with fields for query parameters, header params, request duration, headers.
TLDR: I want all these fields retained so when I look at a specific message, I can see all the details. But I only need a few of them to query and generate reports/visualizations in Kibana.
I've been testing for a few weeks and adding some new fields on the server side. So whenever I do, I need to rescan the index. However the auto-detection now finds 300+ fields and I'm guessing it indexes all of them.
I would like to restrict it to indexing only a set of fields, since I think the more fields it detects, the larger the index gets?
It was about 300MB/day for a week (100-200 fields). Then, when I added a new field and needed to refresh, it went to 350 fields and 1 GB/day. After I accidentally deleted the ELK instance yesterday, I redid everything and now the indexes are about 100MB/day so far, which is why I got curious.
I found these docs but I'm not sure which ones are relevant or how they relate/need to be put together.
Mapping, index patterns, indices, templates/filebeats/rollup policy
(One has a PUT call that sends a huge JSON text but not sure how you would enter something like that in putty. POSTMAN/JMeter maybe but these need to be executed on the server itself which is just an SSH session, no GUI/text window.)

To remove fields from your logs (since you are using Logstash), you can use the remove_field option of the Logstash mutate filter.
Ref: Mutate filter plugin


Elasticsearch: Update of an existing record (which has a custom routing param set) results in a duplicate record if custom routing is not set during the update

Env Details:
Elastic Search version 7.8.1
The routing param is optional in the index settings.
As per ElasticSearch docs -
When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed across all of the shards in the index. In fact, documents with the same _id might end up on different shards if indexed with different _routing values.
We have ended up in the same scenario: earlier we were using a custom routing param (let's say customerId), and for some reason we now need to remove custom routing.
This means the docId will now be used as the default routing param. During an index operation this creates a duplicate record with the same id on a different shard. Earlier (before removing custom routing) the same operation resulted in an update of the record, as expected.
I am thinking of the following approaches to get out of this. Please advise if you have a better approach to suggest; the key here is to AVOID DOWNTIME.
Approach 1:
As we receive the update request, let the duplicate record get created. Once the record without custom routing has been created, issue a delete request for the record with custom routing.
CONS: If there is no update on a record, it will linger around with custom routing. We want to avoid this, as it might result in unforeseen scenarios in the future.
Approach 2:
We use the Re-Index API to migrate data to a new index (turning off custom routing during migration). The application will use the new index after a successful migration.
CONS: Some of our indexes are huge and take 12+ hours to re-index. Since the Elasticsearch re-index API works on a snapshot of the source, it will not migrate records created during that 12-hour window, so this approach needs downtime.
Please suggest an alternative if you have faced this before.
Thanks @Val. I also found a few other approaches, like writing to both indexes and reading from the old one, then switching reads to the new one after re-indexing has finished. Something along the following lines:
Create aliases pointing to the old indices (*_v1)
Point the application to these aliases instead of the actual indices
Create new indices (*_v2) with the same mapping
Move data from the old indices to the new ones using re-indexing, and make sure we don't retain custom routing during this
Post re-indexing, change the aliases to point to the new indices instead of the old ones (need to verify this though, but there are easy alternatives if this doesn't work)
Once verification is done, delete the old indices
What do we do in the transition period (the window between re-indexing start and re-indexing finish):
Write to both indices (old and new) and read from the old indices via the aliases
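To make the re-index and alias-switch steps a bit more concrete, here is a minimal Python sketch that only builds the request bodies. The index and alias names (customers, customers_v1, customers_v2) are hypothetical, and it assumes the bodies are sent to POST /_reindex and POST /_aliases with whatever HTTP client you already use.

```python
import json


def reindex_body(source_index: str, dest_index: str) -> dict:
    """Body for POST /_reindex that copies documents without their custom routing."""
    return {
        "source": {"index": source_index},
        # "discard" drops the routing value stored on each source document,
        # so the copy in the destination index is routed by its _id.
        "dest": {"index": dest_index, "routing": "discard"},
    }


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that moves the alias in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(json.dumps(reindex_body("customers_v1", "customers_v2"), indent=2))
    print(json.dumps(alias_swap_body("customers", "customers_v1", "customers_v2"), indent=2))
```

Both actions in the alias request are applied together, so readers should not see a moment where the alias points at nothing. The dual writes during the transition window still have to be handled in the application.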

Dealing with random failure datatypes in Elasticsearch 2.X

So I'm working on a system that logs bad data sent to an API, along with the full request. I would love to be able to see this in Kibana.
The issue is that the datatypes could be random, so when I send them to the bad_data field, indexing fails if the value doesn't match the original mapping.
Does anyone have a suggestion for the right way to handle this?
(ES 2.X is required due to a sub-dependency.)
You could use the ignore_malformed flag in your field mappings. In that case values in the wrong format will not be indexed for that field, but the rest of the document will still be saved.
See the Elastic documentation for more information.
If you want to be able to query such fields as the original text, you could use fields in your mapping (multi-fields) to index the same value in more than one way and get fast queries on the raw text values.
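As a rough illustration, the sketch below builds an ES 2.x style mapping body with ignore_malformed set on bad_data. The type name request and the choice of long for the field are assumptions made for the example; the body would be sent when creating the index.

```python
import json


def bad_data_mapping(doc_type: str = "request") -> dict:
    """Index-creation body (ES 2.x layout) with ignore_malformed on bad_data."""
    return {
        "mappings": {
            doc_type: {  # hypothetical mapping type name
                "properties": {
                    # Assumed field type: values that cannot be parsed as a
                    # long are skipped for this field instead of rejecting
                    # the whole document.
                    "bad_data": {"type": "long", "ignore_malformed": True}
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(bad_data_mapping(), indent=2))
```

Note that, as far as I know, this only covers scalar values in the wrong format (for example a string where a number is expected). Sending a JSON object to a field mapped as a number would still be rejected.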

Elastic search document storing

The basic use case we are trying to solve is for users to be able to search the contents of a log file.
Let's say a simple situation where a user searches for a keyword that is present in a log file, and I want to render that log file back to the user.
We plan to use Elasticsearch for handling this. The idea I have in mind is to use Elasticsearch as a mechanism to store the indexed log files.
Having this concept in mind, I went through
A couple of questions I have:
1) I understand the input provided to Elasticsearch is a JSON doc. It is going to scan the JSON provided and create/update indexes. So do I need a mechanism to convert my input log files to JSON?
2) Elasticsearch would scan this input document and create/update inverted indexes. These inverted indexes actually point to the exact document. So does that mean ES would store these documents somewhere? Would it store them as JSON docs? Is it purely in memory, or on the file system/in a database?
3) Now when a user searches for a keyword, ES returns the document that contains the searched keyword. Do I need the ability to convert this JSON doc back to the original log document that the user expects?
Clearly I'm missing something. Sorry for asking questions this silly, but I'm trying to improve my skills and it's a WIP.
Also, I understand that there is the ELK stack out there. For some reasons we just want to use ES and not the Logstash and Kibana parts of the stack.
Logs need to be parsed to JSON before they can be inserted into Elasticsearch.
All documents are stored on the filesystem. Some data is also kept in memory, but all data is persistent.
When you search Elasticsearch you get back the matching JSON documents. If you want to display the original log message, you can store that original message in one of the JSON fields and display just that.
So if you just want to store log messages and not break them into fields or anything, you can simply take each line and send it to Elasticsearch like so:
{ "message": "This is my log message" }
To parse logs, break them into fields and add some logic, you will need to use some sort of app, like Logstash for example.
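If you go without Logstash, the "take each line and send it" step can be done with a few lines of your own code. Below is a minimal Python sketch that turns raw log lines into a newline-delimited body for the _bulk endpoint, using the same message field as above. It assumes the target index is given in the URL (for example POST /logs/_bulk, where logs is a made-up index name; older versions also expect a type in the path), so the action lines can stay empty.

```python
import json
from typing import Iterable


def lines_to_bulk(lines: Iterable[str]) -> str:
    """Turn raw log lines into an NDJSON body for the _bulk endpoint."""
    out = []
    for line in lines:
        text = line.rstrip("\r\n")
        if not text:
            continue  # skip empty lines
        # Action line: index the following document into the index
        # named in the request URL.
        out.append(json.dumps({"index": {}}))
        # Source line: the original log line, kept verbatim in "message".
        out.append(json.dumps({"message": text}))
    if not out:
        return ""
    # A bulk body must end with a newline.
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = ["This is my log message\n", "\n", "Another line\n"]
    print(lines_to_bulk(sample), end="")
```

Because the original line is stored untouched in message, nothing has to be converted back when showing search hits to the user.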

Elasticsearch: Indexing tweets - mapping, template or ETL

I am about to index tweets coming from Apache NiFi into Elasticsearch via POST and want to do the following:
Map the created_at field as a date. Should I use a mapping or an index template for this?
Make some fields not analyzed, like hashtags, URLs, etc.
Store not the entire tweet but only some important fields: the text, some (not all) user fields, and the hashtags and URLs from entities (in post URLs). I don't need the quoted source, etc.
What should I use in this case? A template? Or should I pre-process the tweets with some ETL process to extract the data I need and then index it in ES?
I am a bit confused and would really appreciate advice.
Thanks in advance.
I guess in your NiFi flow you have something like GetTwitter and PostHTTP configured. NiFi is already some sort of ETL, so you probably don't need another one. However, since you don't want to index the whole JSON coming out of Twitter, you clearly need another NiFi processor in between to select what you want and transform the raw JSON into a more lightweight one. Here is an example on how to do it for Solr, but I'm not sure the same processor exists for Elasticsearch.
This article about streaming Twitter data to Elasticsearch using Logstash shows a possible index template that you could use as a base for your own (i.e. add the created_at date field if you like).
Since you don't want to index everything, the way to go is clearly to come up with your own mapping, which you can then use in an index template. Using index templates, you will be able to create daily/weekly/monthly twitter indices as you see fit.
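To illustrate what the in-between "select and transform" step has to do (whatever tool ends up doing it), here is a small Python sketch. It assumes the classic Twitter JSON layout with created_at, text, user.screen_name, entities.hashtags[].text and entities.urls[].expanded_url; the output field names are just a suggestion.

```python
from datetime import datetime


def slim_tweet(tweet: dict) -> dict:
    """Keep only a handful of fields from a raw tweet (assumed layout)."""
    entities = tweet.get("entities", {})
    # Twitter timestamps look like "Wed Aug 27 13:08:45 +0000 2008";
    # converting to ISO 8601 lets a default date mapping parse them.
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "created_at": created.isoformat(),
        "text": tweet.get("text", ""),
        "user": tweet.get("user", {}).get("screen_name"),
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
    }


if __name__ == "__main__":
    raw = {
        "created_at": "Wed Aug 27 13:08:45 +0000 2008",
        "text": "Hello #elasticsearch",
        "user": {"screen_name": "someone", "followers_count": 3},
        "entities": {"hashtags": [{"text": "elasticsearch"}], "urls": []},
        "source": "web",
    }
    print(slim_tweet(raw))
```

The slimmed-down document is what the mapping in your index template then needs to describe: a date for created_at and non-analyzed fields for hashtags and urls.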

Reducing elasticsearch's index size

I currently have a large number of log files being analyzed by Logstash, and therefore a considerable amount of space being used in Elasticsearch.
However, a lot of this data is useless to me, as not everything is displayed in Kibana.
So I'm wondering: is there a way to keep the index size minimal and only store matching events?
Edit: Perhaps I am being unclear on what I would like to achieve. I have several logs that fall under different categories (because they do not serve the same purpose and are not built the same way). I created several filter configuration files that correspond to these different types of logs.
Currently, all the data from all my logs is being stored in Elasticsearch. For example, say I am looking for the text "cat" in one of my logs, the event containing "cat" will be stored, but so will the other 10,000 lines. I want to avoid this and only store this 1 event in Elasticsearch's index.
You've not really given much information to go on, but as I see it you have 2 choices. The first is to update your Logstash filters so that you only send the data you're interested in to Elasticsearch. You can do this with conditional logic that uses "drop {}" on certain events, or you can use mutate { remove_field } to get rid of individual fields within certain events.
Your other choice would be to close/delete old indexes in your Elasticsearch database. This would reduce the amount of information occupying heap space and would have an immediate effect, while my first option would only affect future logs. The easiest way to close/delete old indexes is to use curator.
From your further question I would suggest:
Add a tag like "drop" to all your inputs
As part of each grok you can remove a tag on a successful match, so when the grok works, remove the drop tag
As part of your output, put conditional logic around the output, so that you only save records without the drop tag on them, e.g.:
output { if "drop" not in [tags] { etc } }
