Split a field from a document into multiple fields - elasticsearch

I have logs captured in an Elasticsearch index, and the "Message" field in each document holds the entire log message. I want to split that data into multiple fields such as timestamp, IP, etc.
Note: the logs are pushed directly into Elasticsearch from our application using POST requests.
I have created a grok pattern to split this information, but I am not sure how to apply this transformation on the fly.
{
  "_index" : "logs_exception",
  "_type" : "_doc",
  "_id" : "9RI-BGoBwdzZ5ffB3_Sj",
  "_score" : 2.4795628,
  "_source" : {
    "CorrelationId" : "bd3fc7d6-ca39-44e1-9a59-xxasdasd1",
    "Message" : "2019-04-10 10:36:27,780 [8] ERROR LoggingService.TestConsole.Program [(null)] - System.AppDomainUnloadedException: Attempted to access an unloaded AppDomain."
  }
}
Can we create a pipeline in Elasticsearch that reads from one index, applies the grok pattern, and pushes the result into another index? Or what's the best way to do this?

The best way to do this is to use an ingest pipeline to pre-process your documents before they are indexed into Elasticsearch.
In your case you need a grok processor to match the Message field and split it into separate fields. Below is a sample pipeline definition with a grok processor to ingest your document into Elasticsearch:
{
  "description" : "...",
  "processors" : [
    {
      "grok" : {
        "field" : "Message",
        "patterns" : ["%{DATESTAMP:timestamp}%{SPACE}%{SPACE}\\[(?<misc1>.*)\\]%{SPACE}%{WORD:loglevel}%{SPACE}%{JAVACLASS:originator}%{SPACE}\\[(?<misc2>.*)\\]%{SPACE}%{GREEDYDATA:data}"]
      }
    }
  ]
}
With the above pipeline definition in place, your data will be indexed as below:
{
  "_index" : "logs_exception",
  "_type" : "_doc",
  "_id" : "9RI-BGoBwdzZ5ffB3_Sj",
  "_score" : 2.4795628,
  "_source" : {
    "CorrelationId" : "bd3fc7d6-ca39-44e1-9a59-xxasdasd1",
    "timestamp" : "19-04-10 10:36:27,780",
    "misc1" : "8",
    "loglevel" : "ERROR",
    "originator" : "LoggingService.TestConsole.Program",
    "misc2" : "(null)",
    "data" : "- System.AppDomainUnloadedException: Attempted to access an unloaded AppDomain.",
    "Message" : "2019-04-10 10:36:27,780 [8] ERROR LoggingService.TestConsole.Program [(null)] - System.AppDomainUnloadedException: Attempted to access an unloaded AppDomain."
  }
}
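To directly address the question of feeding from one index, applying grok, and pushing into another index: below is a rough sketch of registering such a pipeline and applying it, both at index time and via the _reindex API. The pipeline name split_message and the destination index logs_exception_parsed are placeholder names, not part of the original question.
# register the pipeline
PUT _ingest/pipeline/split_message
{
  "description" : "split the Message field into separate fields",
  "processors" : [
    {
      "grok" : {
        "field" : "Message",
        "patterns" : ["%{DATESTAMP:timestamp}%{SPACE}\\[(?<misc1>.*)\\]%{SPACE}%{WORD:loglevel}%{SPACE}%{JAVACLASS:originator}%{SPACE}\\[(?<misc2>.*)\\]%{SPACE}%{GREEDYDATA:data}"]
      }
    }
  ]
}

# route new documents through the pipeline at index time
POST logs_exception/_doc?pipeline=split_message
{
  "CorrelationId" : "bd3fc7d6-ca39-44e1-9a59-xxasdasd1",
  "Message" : "2019-04-10 10:36:27,780 [8] ERROR LoggingService.TestConsole.Program [(null)] - System.AppDomainUnloadedException: Attempted to access an unloaded AppDomain."
}

# or re-process documents from the existing index into another index
POST _reindex
{
  "source" : { "index" : "logs_exception" },
  "dest" : { "index" : "logs_exception_parsed", "pipeline" : "split_message" }
}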

You can make use of the Logstash json filter:
filter {
  json {
    source => "message"
    target => "event"
  }
}
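If you would rather stay inside Elasticsearch than go through Logstash, the rough equivalent is the json ingest processor, which parses a JSON string held in a field into structured fields. A minimal sketch, assuming the message field actually contains JSON and using an illustrative pipeline name:
PUT _ingest/pipeline/parse_json_message
{
  "description" : "parse a JSON string held in the message field",
  "processors" : [
    {
      "json" : {
        "field" : "message",
        "target_field" : "event"
      }
    }
  ]
}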

Related

How to de-normalize the relational data in Elasticsearch

I am working on a social networking application and I am using Elasticsearch for service data. I have multiple joins in Elasticsearch. Users can share posts, and each post has one parent user. I have a scenario where I show the posts of the users you follow.
Type Post
{
  "_index" : "xxxxxx",
  "_type" : "_doc",
  "_id" : "p-370648",
  "_score" : null,
  "_routing" : "2",
  "_source" : {
    "uid" : "9a73b0e0-a52c-11ec-aa58-37061b467b8c",
    "user_id" : 87,
    "id" : 370648,
    "type" : {
      "parent" : "u-87",
      "name" : "post"
    },
    "item_type_number" : 2,
    "source_key" : "youtube-5wcpIrpbvXQ#2"
  }
}
Type User
{
  "_index" : "trending",
  "_type" : "_doc",
  "_id" : "u-56432",
  "_score" : null,
  "_routing" : "1",
  "_source" : {
    "gender" : "female",
    "picture" : "125252125.jpg",
    "uid" : "928de1a5-cc93-4fd3-adec-b9fb220abc2b",
    "full_name" : "Shannon Owens",
    "dob" : "1990-08-18",
    "id" : 56432,
    "username" : "local_12556",
    "type" : {
      "name" : "user"
    }
  }
}
Type Follow
{
  "_index" : "trending",
  "_type" : "_doc",
  "_id" : "fr-561763",
  "_score" : null,
  "_routing" : "6",
  "_source" : {
    "user_id" : 25358,
    "id" : 561763,
    "object_id" : 36768,
    "status" : "U",
    "type" : {
      "parent" : "u-36768",
      "name" : "followers"
    }
  }
}
So in this scenario, when a user follows someone we save a record in Elasticsearch with object_id set to the user being followed, user_id set to the user who follows, and type "followers"; on the other hand, each post has one parent user. So when I try to fetch posts of type post from Elasticsearch, I need a two-level join to fetch them.
The first level joins the post to its parent user, and the second checks the following status against that user (roughly the query sketched below). This query works well when there is no traffic on the system, but when traffic arrives and concurrent requests come in, the Elasticsearch query breaks down under the processing load, even though I tried to fix the issue with a more powerful server with better CPU/RAM; it still falls over.
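For reference, a two-level join of the kind described above would look roughly like the sketch below, assuming a join field named type with relations user → post and user → followers; the index name and IDs are taken from the example documents, the rest is illustrative:
# posts whose parent user is followed by user 25358
GET trending/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "post" } },
        {
          "has_parent": {
            "parent_type": "user",
            "query": {
              "has_child": {
                "type": "followers",
                "query": { "term": { "user_id": 25358 } }
              }
            }
          }
        }
      ]
    }
  }
}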
So I decided to denormalize the post data, but the problem is that I can no longer check the following status along with a post.
If I run another query against the DB and use some caching, I face memory exhaustion when thousands of followed users come back in the query. So is there any way to check the following status directly on documents of type post instead of adding a parent join to the query?

Using field instead of "_id" for more-like-this query

I have a slug field that I want to use to identify the object to use as a reference, instead of the "_id" field. But instead of using it as a reference, the doc seems to be used as a query to compare against. Since slug is a unique field with a simple analyzer, it just returns exactly one result, like the following. As far as I know, there is no way to use a custom field as the _id field:
https://github.com/elastic/elasticsearch/issues/6730
So is a double lookup, first finding Elasticsearch's _id and then doing more_like_this, the only way to achieve what I am looking for? Someone seems to have asked a similar question three years ago, but it doesn't have an answer.
ArticleDocument.search().query(
    "bool",
    should=Q(
        "more_like_this",
        fields=["slug", "text"],
        like={"doc": {"slug": "OEXxySDEPWaUfgTT54QvBg"}, "_index": "article", "_type": "doc"},
        min_doc_freq=1,
        min_term_freq=1,
    ),
).to_queryset()
Returns:
<ArticleQuerySet [<Article: OEXxySDEPWaUfgTT54QvBg)>]>
You can use one of your document's fields as the _id while ingesting the data.
Logstash
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_name"
    document_id => "%{some_field_id}"
  }
}
Spark (Scala)
DF.saveToEs("index_name" + "/some_type", Map("es.mapping.id" -> "some_field_id"))
Index API
PUT twitter/_doc/1
{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}

{
  "_shards" : {
    "total" : 2,
    "failed" : 0,
    "successful" : 2
  },
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "result" : "created"
}
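Once the slug value is the document _id, the more_like_this query can reference the document by ID instead of passing an artificial doc. A sketch, assuming the articles were indexed into an article index with the slug as the _id:
GET article/_search
{
  "query": {
    "more_like_this": {
      "fields": ["text"],
      "like": [
        { "_index": "article", "_id": "OEXxySDEPWaUfgTT54QvBg" }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}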

Using Delete By Query API and Bulk API together in Elastic

I couldn't find any documentation or examples about using the Delete By Query API together with the Bulk API in Elasticsearch.
Put simply, I want to delete all the documents that have the same value in field A and insert many documents right after that. If the delete fails, it shouldn't insert any documents.
e.g.
POST _bulk
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
??? { "delete_by_query???" : { "_index" : "test", "_type" : "type1", "query"... } }
Is there any way to use them together?
Thanks.

How do you bulk index documents into the default mapping of ElasticSearch?

The documentation for ElasticSearch 5.5 offers no examples of how to use the bulk operation to index documents into the default mapping of an index. It also gives no indication why this is not possible, unless I'm missing that somewhere else in the documentation.
The ES 5.5 documentation gives one explicit example of bulk indexing:
POST _bulk
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
But it also says that
The endpoints are /_bulk, /{index}/_bulk, and {index}/{type}/_bulk.
When the index or the index/type are provided, they will be used by
default on bulk items that don’t provide them explicitly.
So, the middle endpoint is valid, and it implies to me that a) you have to explicitly provide a type in the metadata for each document indexed, or b) that you can index documents into the default mapping ("_default_").
But I can't get this to work.
I've tried the /myindex/_bulk endpoint with no type specified in the metadata.
I've tried it with "_type": "_default_" specified.
I've tried /myindex/_default_/_bulk.
This has nothing to do with the _default_ mapping. This is about falling back to the default type that you specify in the URL. You can do the following
POST _bulk
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
However, the following snippet is exactly the same:
POST /test/type1/_bulk
{ "index" : { "_id" : "1" } }
{ "field1" : "value1" }
And you can mix this
POST foo/bar/_bulk
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "index" : { "_id" : "1" } }
{ "field1" : "value1" }
In this example, one document would be indexed into foo and one into test.
Hope this makes sense.

Is the order of operations guaranteed in a bulk update?

I am sending delete and index requests to elasticsearch in bulk (the example is adapted from the docs):
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
The sequence above is intended to first delete a possible document with _id=1, then index a new document with the same _id=1.
Is the order of the actions guaranteed? In other words, for the example above, can I be sure that the delete will not touch the document indexed afterwards (because the order was not respected for one reason or another)?
The delete operation is useless in this scenario: if you simply index a document with the same ID, it will automatically and implicitly delete/replace the previous document with that ID.
So if a document with ID=1 already exists, simply sending the command below will replace it (read: delete and re-index it):
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
According to an Elastic Team Member:
Elasticsearch is distributed and concurrent. We do not guarantee that requests are executed in the order they are received.
https://discuss.elastic.co/t/are-bulk-index-operations-serialized/83770/6
