Elasticsearch document storing - elasticsearch

The basic use case we are trying to solve is for users to be able to search the contents of log files.
Let's say a simple situation where a user searches for a keyword that is present in a log file, and I want to render that file back to the user.
We plan to use Elasticsearch for handling this. The idea I have in mind is to use Elasticsearch as a mechanism to store the indexed log files.
With this concept in mind, I went through https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
A couple of questions I have:
1) I understand the input provided to Elasticsearch is a JSON doc. It is going to scan the JSON provided and create/update indexes. So do I need a mechanism to convert my input log files to JSON?
2) Elasticsearch would scan this input document and create/update inverted indexes. These inverted indexes actually point to the exact document. So does that mean ES stores these documents somewhere? Would it store them as JSON docs? Is it purely in memory, or on the file system/in a database?
3) Now when a user searches for a keyword, ES returns the document that contains the searched keyword. Do I need the ability to convert this JSON doc back to the original log document the user expects?
Clearly I'm missing something. Sorry for asking such basic questions, but I'm trying to improve my skills and it's a work in progress.
Also, I understand that there is the ELK stack out there. For various reasons we want to use only ES, and not the Logstash and Kibana parts of the stack.
Thanks

Logs need to be parsed to JSON before they can be inserted into Elasticsearch.
All documents are stored on the filesystem; some data is kept in memory, but all data is persistent.
When you search Elasticsearch you get back the matching JSON documents. If you want to display the original log message, you can store that original message in one of the JSON fields and display just that.
So if you just want to store log messages and not break them into fields, you can simply take each row and send it to Elasticsearch like so:
{ "message": "This is my log message" }
To parse logs, break them into fields and add some logic, you will need some sort of app, such as Logstash.
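The one-field-per-line approach above pairs naturally with the Elasticsearch _bulk API. Here is a minimal Python sketch of building such a bulk request body; the index name "logs" and the local cluster URL in the comment are assumptions, not anything from the thread:

```python
import json

def lines_to_bulk_body(log_lines, index="logs"):
    """Wrap each raw log line in a {"message": ...} JSON doc and build
    an Elasticsearch _bulk request body (newline-delimited JSON)."""
    parts = []
    for line in log_lines:
        # action line, then the document itself
        parts.append(json.dumps({"index": {"_index": index}}))
        parts.append(json.dumps({"message": line}))
    # the _bulk body must end with a trailing newline
    return "\n".join(parts) + "\n"

body = lines_to_bulk_body(["ERROR something bad happened", "INFO all good"])
# POST this body to e.g. http://localhost:9200/_bulk
# with header Content-Type: application/x-ndjson
```

Each log line becomes its own searchable document, which is exactly the "don't break them into fields" case described above.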

Related

What does Elasticsearch store, and how?

Elasticsearch is a search engine, according to Wikipedia. This implies it is not a database and does not store the data it is indexing (but presumably does store its indexes).
There are presumably two ways to get data into ES: log shipping, or directly via the API.
Let's say my app wants to write an old-fashioned log file entry:
logger.error(now() + " something bad happened in module " + module + "; " + message);
This could either write to a file or put the data directly into ES using a REST API.
If it was done via the REST API, does ES store the entire log message? In that case you don't need to waste disk writing the logs to files for compliance etc. Or does it only index the data, so you need to keep a separate copy? If you delete or move the original log file, how does ES know, and is what it does store still useful?
If you write to a log file, then use Logstash or similar to put the log data into ES, does ES store the entire log file as well as any indexes?
How does ES parse or index arbitrary log files? Does it treat a log line as a single string, or does it require logs to have a specific format such as CSV or JSON?
Does anyone know of a resource with this key info?
Elasticsearch does store the data you are indexing.
When you ingest data into Elasticsearch, it is stored in one or more indices and can then be searched. To be able to search something with Elasticsearch you need to store the data in Elasticsearch; it cannot, for example, search external files.
In your example, if you have an app sending logs to Elasticsearch, it will store the entire message you send, and once it is in Elasticsearch you don't need the original log anymore.
If you need to parse your documents into different fields, you can do it before sending the log to Elasticsearch as a JSON document, use Logstash to do this, or use an ingest pipeline in Elasticsearch.
A good starting point to learn more about how it works is the official documentation.
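The ingest-pipeline option mentioned above could look roughly like the following sketch (assuming ES 5.0+, where ingest pipelines exist; the grok pattern, field names, and pipeline name are hypothetical):

```python
import json

# A minimal ingest-pipeline definition: a grok processor that splits
# the raw "message" field into timestamp, level, and text fields.
pipeline = {
    "description": "parse raw log lines into fields",
    "processors": [
        {
            "grok": {
                "field": "message",
                "patterns": [
                    "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:text}"
                ],
            }
        }
    ],
}
# PUT this as JSON to _ingest/pipeline/logs-pipeline, then index
# documents with ?pipeline=logs-pipeline on the index request.
print(json.dumps(pipeline, indent=2))
```

This keeps the parsing inside Elasticsearch itself, so no separate Logstash process is needed for simple cases.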

Dealing with random failure datatypes in Elasticsearch 2.X

So I'm working on a system that logs bad data sent to an API, along with the full request. I would love to be able to see this in Kibana.
The issue is that the datatypes can be random, so when I send them to the bad_data field, indexing fails if the value doesn't match the original mapping.
Anyone have a suggestion for the right way to handle this?
(ES 2.x is required due to a sub-dependency)
You could use the ignore_malformed flag in your field mappings. In that case, values with the wrong format will not be indexed, but the document itself will still be saved.
See the Elastic documentation for more information.
If you want to be able to query such fields as original text, you could use the fields parameter in your mapping for multi-field indexing, to get fast queries on the raw text values.
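Put together, a 2.x-style mapping might look like this sketch (the index type, field names, and the choice of long as the primary type are hypothetical; in 2.x the raw multi-field uses the string type with index: not_analyzed):

```python
import json

# "bad_data" tolerates wrong-typed values via ignore_malformed, and a
# not_analyzed multi-field keeps the raw text queryable alongside it.
mapping = {
    "mappings": {
        "bad_request": {
            "properties": {
                "bad_data": {
                    "type": "long",
                    "ignore_malformed": True,
                    "fields": {
                        "raw": {"type": "string", "index": "not_analyzed"}
                    },
                }
            }
        }
    }
}
# PUT this as JSON when creating the index, e.g. PUT /bad-data-log
print(json.dumps(mapping, indent=2))
```

With this, a document whose bad_data value is not a valid long is still saved, and bad_data.raw can be used for exact-text queries.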

Can Beats update existing documents in Elasticsearch?

Consider the following use case:
I want the information from one particular log line to be indexed into Elasticsearch, as a document X.
I want the information from some log line further down the log file to be indexed into the same document X (not overriding the original, just adding more data).
The first part I can obviously achieve with Filebeat.
For the second, does anyone have any idea how to approach it? Could I still use Filebeat plus some pipeline on an ingest node, for example?
Clearly, I can use the ES API to update said document, but I was looking for a solution that doesn't require changes to my application - rather, is it possible to achieve it all using the log files?
Thanks in advance!
No, this is not something that Beats were intended to accomplish. Enrichment like you describe is one of the things that Logstash can help with.
Logstash has an Elasticsearch input that would allow you to retrieve data from ES and use it in the pipeline for enrichment. And the Elasticsearch output supports upsert operations (update if the document exists, insert if not). Using both of those features you can enrich and update documents as new data comes in.
You might want to consider ingesting the log lines as-is to Elasticsearch. Then, using Logstash, build a separate index that is entity-specific and driven by data from the logs.
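For the upsert part, a hypothetical Logstash elasticsearch output could be sketched like this (the transaction_id field used to correlate the two log lines into one document X is an assumption about your log format):

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "mylogs"
    # some field parsed from each log line that identifies document X
    document_id => "%{transaction_id}"
    action => "update"
    doc_as_upsert => true
  }
}
```

The first matching log line creates the document; later lines with the same transaction_id update it with their additional fields instead of creating new documents.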

Elasticsearch: Indexing tweets - mapping, template or ETL

I am about to index tweets coming from Apache NiFi into Elasticsearch via POST, and want to do the following:
Make the created_at field a date. Should I use a mapping or an index template for this?
Make some fields not_analyzed, like hashtags, URLs, etc.
I want to store not the entire tweet but only some important fields: the text, a few user fields (not all the user information), hashtags, and URLs from entities (the URLs in the post). I don't need the quoted source, etc.
What should I use in this case? A template? Pre-process tweets with some ETL process in order to extract the data I need and index it in ES?
I am a bit confused and would really appreciate advice.
Thanks in advance.
I guess in your NiFi flow you have something like GetTwitter and PostHTTP configured. NiFi is already some sort of ETL, so you probably don't need another one. However, since you don't want to index the whole JSON coming out of Twitter, you clearly need another NiFi processor in between to select what you want and transform the raw JSON into a more lightweight one. Here is an example of how to do it for Solr, but I'm not sure the same processor exists for Elasticsearch.
This article about streaming Twitter data to Elasticsearch using Logstash shows a possible index template that you could use to mold your own (i.e., add the created_at date field if you like).
Since you don't want to index everything, the way to go for you is clearly to come up with your own mapping, which you can then use in an index template. Using index templates, you will be able to create daily/weekly/monthly Twitter indices as you see fit.
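A sketch of such an index template (2.x-style syntax; the template name, index pattern, and the exact set of fields kept are assumptions based on the question):

```python
import json

# created_at mapped as a date, hashtag/URL text kept not_analyzed.
template = {
    "template": "twitter-*",  # matches daily/weekly/monthly indices
    "mappings": {
        "tweet": {
            "properties": {
                "created_at": {
                    "type": "date",
                    # Twitter's timestamp format,
                    # e.g. "Wed Aug 27 13:08:45 +0000 2008"
                    "format": "EEE MMM dd HH:mm:ss Z yyyy",
                },
                "text": {"type": "string"},
                "entities": {
                    "properties": {
                        "hashtags": {"type": "string", "index": "not_analyzed"},
                        "urls": {"type": "string", "index": "not_analyzed"},
                    }
                },
            }
        }
    },
}
# PUT this as JSON to _template/twitter
print(json.dumps(template, indent=2))
```

The template answers both sub-questions at once: the date mapping and the not_analyzed fields live in one place, and they are applied automatically to every new twitter-* index.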

ElasticSearch, get the field where a text matches

I index JSON documents where the structure is 'unknown', and I want to search for content and have the result be the FIELD NAME the content belongs to. Is there any way besides doing a query and then iterating through the _source document fields to find the match again in the results? I also thought about the percolate feature and generating the queries when indexing a JSON document (but this would create hundreds of queries to check while indexing...). Maybe there is a simple feature I don't know about.
E.g., I know "Main Street" will be stored in the data received (e.g., from a web crawler), but to optimise the crawler it would be helpful to get a suggestion to crawl only the field "property.address.street". The point is that a customer might know some sample data to extract from JSON coming from different sources. To apply this knowledge to already collected data, the relevant field name must be found, especially if you want to do it automatically from provided sample content.
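One possible approach is highlighting: request highlights on all fields and read the matching field names back from each hit's highlight object, instead of re-scanning _source yourself. A sketch of such a query (the phrase is the example from above; require_field_match is relaxed so fields other than the one the query ran against can still be highlighted):

```python
import json

# Search everywhere, then ask: which fields produced the match?
query = {
    "query": {"query_string": {"query": "\"Main Street\""}},
    "highlight": {
        "require_field_match": False,
        "fields": {"*": {}},  # highlight on every field
    },
}
# POST this to <index>/_search; each hit then carries a "highlight"
# object whose KEYS are the matching field names, e.g.
# {"highlight": {"property.address.street": ["<em>Main Street</em>"]}}
print(json.dumps(query))
```

The keys of the highlight object are exactly the field-name suggestions wanted here, so no second pass over _source is needed.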
