Remove Duplicate Fields Used for document_id Before Elasticsearch in Logstash

I wrote my own filter for Logstash and I'm trying to calculate my own document_id, something like this:
docIdClean = "%d %s %s" % [ event["@timestamp"].to_f * 1000, event["type"], event["message"] ]
event["docId"] = Digest::MD5.hexdigest(docIdClean)
And the Logstash configuration looks like this:
output {
  elasticsearch {
    ...
    index => "analysis-%{+YYYY.MM.dd}"
    document_id => "%{docId}"
    template_name => "logstash_per_index"
  }
}
The more or less minor downside is that all documents in Elasticsearch contain both _id and docId holding the same value. Since nobody searches for an MD5 hash, docId is completely pointless and I want to remove it, but I don't know how.
The docId has to exist when the event hits the output, otherwise the output can't refer to it, so I can't remove it beforehand. Since I can't remove it afterwards either, docId sits there occupying space.
I tried setting the event field _id instead, but that only causes an exception in Elasticsearch saying that the document id is different.
For illustration, here is one document:
{
  "_index": "analysis-2014.09.16",
  "_type": "access",
  "_id": "022d9055423cdd0756b6cfa06886f866",
  "_score": 1,
  "_source": {
    "@timestamp": "2014-09-16T19:36:31.000+02:00",
    "type": "access",
    "tags": [
      "personalized"
    ],
    "importDate": "2014/09/17",
    "docId": "022d9055423cdd0756b6cfa06886f866"
  }
}
EDIT:
This is about Logstash 1.3

There's nothing you can do about this in Logstash 1.4.
In Logstash 1.5, you can use @metadata fields, which are not passed to Elasticsearch.
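For example, on 1.5+ the hash can be written to a field under @metadata (whether in your custom filter or, as sketched here, in a ruby filter) and the output can still reference it, while the field itself never reaches Elasticsearch. A minimal sketch, assuming the same fields as in the question:

filter {
  ruby {
    init => "require 'digest/md5'"
    code => "
      doc_id_clean = '%d %s %s' % [ event['@timestamp'].to_f * 1000, event['type'], event['message'] ]
      event['[@metadata][docId]'] = Digest::MD5.hexdigest(doc_id_clean)  # lives only inside the pipeline
    "
  }
}
output {
  elasticsearch {
    ...
    index => "analysis-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][docId]}"
    template_name => "logstash_per_index"
  }
}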

Related

Calculate field data size and store to other field at indexing time ElasticSearch 7.17

I am looking for a way to store the size of a field (bytes) in a new field of a document.
I.e. when a document is created with a field message that contains the value hello, I want another field message_size_bytes to be written, which in this example would have the value 5.
I am aware of the possibilities using _update_by_query and _search with scripted fields, but I have so much data that I want to calculate the sizes at index time rather than at query time.
Is there a way to do this using Elasticsearch 7.17 only? I do not have access to the data before it's passed to Elasticsearch.
You can use an ingest pipeline with a script processor.
You can create the pipeline using the command below:
PUT _ingest/pipeline/calculate_bytes
{
  "processors": [
    {
      "script": {
        "description": "Calculate bytes of message field",
        "lang": "painless",
        "source": """
          // length() counts characters, which equals the byte count for ASCII content
          ctx['message_size_bytes'] = ctx['message'].length();
        """
      }
    }
  ]
}
After creating the pipeline, you can use the pipeline name while indexing data as shown below (the same works from Logstash, Java, or any other client as well):
POST 74906877/_doc/1?pipeline=calculate_bytes
{
  "message": "hello"
}
Result:
"hits": [
{
"_index": "74906877",
"_id": "1",
"_score": 1,
"_source": {
"message": "hello",
"message_size_bytes ": 5
}
}
]
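Since the answer mentions Logstash, here is a minimal sketch of how the elasticsearch output plugin could point at that pipeline (hosts and index are placeholders, not taken from the question):

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "74906877"
    pipeline => "calculate_bytes"  # documents are run through the ingest pipeline at index time
  }
}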

Elasticsearch query to get results irrespective of spaces in search text

I am trying to fetch data from Elasticsearch by matching on a field called name. I have the following two records:
{
  "_index": "sam_index",
  "_type": "doc",
  "_id": "key",
  "_version": 1,
  "_score": 2,
  "_source": {
    "name": "Sample Name"
  }
}
and
{
  "_index": "sam_index",
  "_type": "doc",
  "_id": "key1",
  "_version": 1,
  "_score": 2,
  "_source": {
    "name": "Sample Name"
  }
}
When I try to search using texts like sam, sample, Sa, etc., I am able to fetch both records using a match_phrase_prefix query. The query I tried is:
GET sam_index/doc/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "sample"
    }
  }
}
I am not able to fetch the records when I search with the string samplen. I need to search and get results irrespective of spaces in the text. How can I achieve this in Elasticsearch?
First, you need to understand how Elasticsearch works and why it does or does not return a result.
ES works on token matches. Documents which you index in ES go through an analysis process, and the tokens generated by this process are stored in an inverted index, which is used for searching.
Now when you make a query, that query also generates search tokens; these can be the query text as-is in the case of a term query, or tokens produced by the analyzer defined on the search field in the case of a match query. Hence it's very important to understand the internals of your search query.
Also, it's very important to understand the mapping of your index; ES uses the standard analyzer by default on text fields.
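For example, you can see what the default analysis does to your field value with the _analyze API; the standard analyzer splits Sample Name into the two tokens sample and name, which is why the single prefix samplen never matches:

POST _analyze
{
  "analyzer": "standard",
  "text": "Sample Name"
}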
You can use the Explain API to understand the internals of the query like which search tokens are generated by your search query, how documents matched to it and on what basis score is calculated.
In your case, I created the name field as text using the word-joined analyzer explained in Ignore spaces in Elasticsearch, and I was able to get the document containing Sample Name when searching for samplen.
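That approach boils down to an analyzer that strips whitespace before tokenizing, so Sample Name is indexed as the single token samplename and samplen then matches it as a prefix. A minimal sketch of such a mapping (the analyzer and filter names are illustrative, not taken from the linked answer; shown in 7.x typeless mapping syntax):

PUT sam_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_spaces": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        }
      },
      "analyzer": {
        "word_joined": {
          "type": "custom",
          "char_filter": ["remove_spaces"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "word_joined"
      }
    }
  }
}

With this mapping, the same match_phrase_prefix query as above matches samplen against samplename.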
Let us know if you also want to achieve the same and if it solves your issue.

Truncate and Index String values in Elasticsearch 2.3.x

I am running ES 2.3.3. I want to index a non-analyzed String but truncate it to a certain number of characters. The ignore_above property, according to the documentation, will NOT index a field above the provided value. I don't want that. I want to take say a field that could potentially be 30K long and truncate it to 10K long, but still be able to filter and sort on the 10K that is retained.
Is this possible in ES 2.3.3, or do I need to do this using Java prior to indexing a document?
I want to index a non-analyzed String but truncate it to a certain number of characters.
Technically it's possible with the Update API and the upsert option, but, depending on your exact needs, it may not be very handy.
Let's say you want to index this document:
{
  "name": "foofoofoofoo",
  "age": 29
}
but you need to truncate the name field so that it has only 5 characters. Using the Update API, you'd have to execute a script:
POST http://localhost:9200/insert/test/1/_update
{
  "script": "ctx._source.name = ctx._source.name.substring(0,5);",
  "scripted_upsert": true,
  "upsert": {
    "name": "foofoofoofoo",
    "age": 29
  }
}
It means that, if ES does not find a document with the given id (here id=1), it should index the document inside the upsert element and then run the given script. So as you can see, it's rather inconvenient if you want automatically generated ids, as you have to provide the id in the URI.
Result:
GET http://localhost:9200/insert/test/1
{
  "_index": "insert",
  "_type": "test",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "foofo",
    "age": 29
  }
}

elasticsearch: define field's order in returned doc

I'm sending queries to Elasticsearch and it responds with the fields of its documents in no particular order.
How can I fix the order in which Elasticsearch returns the fields inside documents?
I mean, I'm sending this query:
{
  "index": "my_index",
  "_source": {
    "includes": ["field1","field2","field3","field14"]
  },
  "size": X,
  "body": {
    "query": {
      // stuff
    }
  }
}
and when it responds, the fields do not come back in the order I asked for.
I ultimately want to convert this to CSV, and I want to fix the CSV headers.
Is there something I can do so that I get something like
doc1: {"field1","field2","field3","field14"}
doc2: {"field1","field2","field3","field14"}
...
in the same order as my "_source"?
Thanks for your help.
A document in Elasticsearch is a JSON hash/map and by definition maps are unordered.
One solution would be to use Logstash to extract the docs from ES using an elasticsearch input and output them to CSV using a csv output. That way you can guarantee that the fields in the CSV file will have the exact same order as specified. Another benefit is that you don't have to write your own boilerplate code to extract from ES and sink to CSV; Logstash does it all for you for free.
The Logstash configuration would look something like this:
input {
  elasticsearch {
    hosts => "localhost"
    query => '{ "query": { "match_all": {} } }'
    size => 100
    index => "my_index"
  }
}
filter {}
output {
  csv {
    fields => ["field1","field2","field3","field14"]
    path => "/path/to/file.csv"
  }
}

Create a Kibana graph from logstash logs

I need to create a graph in Kibana based on a specific value.
Here is my raw log from Logstash:
2016-03-14T15:01:21.061Z Accueil-PC 14-03-2016 16:01:19.926 [pool-3-thread-1] INFO com.github.vspiewak.loggenerator.SearchRequest - id=300,ip=84.102.53.31,brand=Apple,name=iPhone 5S,model=iPhone 5S - Gris sideral - Disque 64Go,category=Mobile,color=Gris sideral,options=Disque 64Go,price=899.0
In this log line, I have the id information "id=300".
In order to create graphs in Kibana using the id value, I want a new field. So I have a specific grok configuration:
grok {
  match => ["message", "(?<mycustomnewfield>id=%{INT}+)"]
}
With this transformation I get the following JSON:
{
  "_index": "metrics-2016.03.14",
  "_type": "logs",
  "_id": "AVN1k-cJcXxORIbORG7w",
  "_score": null,
  "_source": {
    "message": "{\"message\":\"14-03-2016 15:42:18.739 [pool-1950-thread-1] INFO com.github.vspiewak.loggenerator.SellRequest - id=300,ip=54.226.24.77,email=client951@gmail.com,sex=F,brand=Apple,name=iPad R\\\\xE9tina,model=iPad R\\\\xE9tina - Noir,category=Tablette,color=Noir,price=509.0\\\\r\",\"@version\":\"1\",\"@timestamp\":\"2016-03-14T14:42:19.040Z\",\"path\":\"D:\\\\LogStash\\\\logstash-2.2.2\\\\logstash-2.2.2\\\\bin\\\\logs.logs.txt\",\"host\":\"Accueil-PC\",\"type\":\"metrics-type\",\"mycustomnewfield\":\"300\"}",
    "@version": "1",
    "@timestamp": "2016-03-14T14:42:19.803Z",
    "host": "127.0.0.1",
    "port": 57867
  },
  "fields": {
    "@timestamp": [
      1457966539803
    ]
  },
  "sort": [
    1457966539803
  ]
}
A new field was actually created (the field 'mycustomnewfield'), but within the message field! As a result I can't see it in Kibana when I try to create a graph. I tried to create a "scripted field" in Kibana, but only numeric fields can be accessed.
Should I create an index in Elasticsearch with a specific mapping to create a new field?
There was actually something wrong with my configuration. I should have pasted the whole configuration with my question. In fact I'm using Logstash as a shipper and also as a log server. On the server side, I modified the configuration:
input {
  tcp {
    port => "yyyy"
    host => "x.x.x.x"
    mode => "server"
    codec => json # I forgot this option
  }
}
Because the Logstash shipper is actually sending JSON, I need to advise the server about this. Now I no longer have a message field within a message field, and my new field is inserted in the right place.
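For completeness, the shipper side then has to emit JSON over TCP. A minimal sketch of what that output might look like (host and port are placeholders; the codec mirrors the json codec on the server input):

output {
  tcp {
    host => "x.x.x.x"
    port => "yyyy"
    codec => json # send events as JSON so the server-side tcp input can decode them
  }
}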
