Search result fluctuations - elasticsearch

I have a bunch of collections with documents and I have encountered something strange. When I execute the same request a few times in a row, the result changes from one call to the next.
It would be fine if these were small fluctuations, but the result count changes by ~75,000 documents.
So my question is: what's going on?
My request is:
POST mycollection/mytype/_search
{
"fields": ["timestamp", "bool_field"],
"filter" : {
"terms":{
"bool_field" : [true]
}
}
}
The result counts alternate like this:
=> 148866
=> 75381
=> 148866
=> 75381
=> 148866
=> 75381
=> 148866
When the count is 148k, I see some records with bool_field: "False" in Sense.
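One way to narrow this down is to pin repeated searches to a single shard copy with the preference parameter; if the count stops alternating, the primary and replica copies of a shard most likely disagree. A sketch of that check (the preference value is just an arbitrary string, reused on every call):
POST mycollection/mytype/_search?preference=copy_check
{
  "fields": ["timestamp", "bool_field"],
  "filter": {
    "terms": {
      "bool_field": [true]
    }
  }
}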

Related

Return five following documents from an id with Elasticsearch and NEST

I think I have blinded myself by staring at an error over and over again and could really use some input. I have a time-series set of documents, and I want to find the five documents following a specific id. I start by fetching that single document, then fetch the following five documents while excluding that id:
var documents = client.Search<Document>(s => s
    .Query(q => q
        .ConstantScore(cs => cs
            .Filter(f => f
                .Bool(b => b
                    .Must(must => must
                        .DateRange(dr => dr
                            .Field(field => field.Time)
                            .GreaterThanOrEquals(startDoc.Time)))
                    .MustNot(mustNot => mustNot
                        .Term(term => term.Id, startDoc.Id))))))
    .Take(5)
    .Sort(sort => sort.Ascending(asc => asc.Time)))
    .Documents;
My problem is that while 5 documents are returned and sorted correctly, the start document is included in the returned data. I'm trying to filter it away with the must_not filter, but that doesn't seem to be working. I'm pretty sure I have done this in other places, so it might be a small issue that I simply cannot see :)
Here's the query generated by NEST:
{
"query":{
"constant_score":{
"filter":{
"bool":{
"must":[
{
"range":{
"time":{
"gte":"2020-08-31T10:47:12.2472849Z"
}
}
}
],
"must_not":[
{
"term":{
"id":{
"value":"982DBC1BE9A24F0E"
}
}
}
]
}
}
}
},
"size":5,
"sort":[
{
"time":{
"order":"asc"
}
}
]
}
This could be happening because the id field might be an analyzed field. Analyzed fields are tokenized. Having a non-analyzed version for exact matching (which, as you mentioned in the comments, you already have) and using it within your filter will fix the difference you are seeing.
More about analyzed vs non-analyzed fields here
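For example, the must_not part of the NEST query above could point at that non-analyzed version; a sketch, assuming it is exposed as a sub-field named "raw" (use whatever name your mapping actually has):
// Sketch only: "raw" is an assumed name for the not-analyzed sub-field of id;
// this fragment replaces the MustNot portion of the query above.
.MustNot(mustNot => mustNot
    .Term(term => term.Id.Suffix("raw"), startDoc.Id))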

How to evaluate time between log messages with ElasticSearch

I want to find out how long different actions in my old PHP web application take. There is a log file that writes out messages when an action is started and ended. It looks like this:
LOGFILE
2018-08-13 13:05:07,217 [30813] ControllerA: actionA start
2018-08-13 13:05:07,280 [30813] ControllerA: actionA end
2018-08-13 13:05:08,928 [30813] ControllerB: actionA start
2018-08-13 13:05:08,942 [30813] ControllerB: actionA end
2018-08-13 13:05:09,035 [17685] ControllerC: actionA start
2018-08-13 13:05:09,049 [17685] ControllerC: actionA end
2018-08-13 13:05:09,115 [8885] ControllerB: actionB start
2018-08-13 13:05:09,128 [8885] ControllerB: actionB end
I parsed the logs with logstash and a grok filter to get a JSON format that ElasticSearch can understand.
LOGSTASH FILTER
grok {
match => { "message" => "%{EXIM_DATE:timestamp} \[%{NUMBER:pid}\] %{WORD:controller}: %{WORD:action} %{WORD:status}" }
}
The result is then indexed by ElasticSearch, but I don't know how I can find out how long each action takes. Based on the pid, the name of the controller, the name of the action, and the start/end status, I have all the information needed to work out an action's duration.
I want to display the duration of each action in Kibana, but first I tried to get the data out of the index with a query. I read about aggregations and thought they might be suitable for this kind of task.
I created the following query:
ES QUERY
{
  "aggs": {
    "group_by_pid": {
      "terms": {
        "field": "pid"
      },
      "aggs": {
        "group_by_controller": {
          "terms": {
            "field": "controller"
          },
          "aggs": {
            "group_by_action": {
              "terms": {
                "field": "action"
              }
            }
          }
        }
      }
    }
  }
}
But the response is always empty. I'm currently unsure whether I can even calculate the time between each start and end action this way, or whether I have to change the logging completely and calculate the duration in PHP.
Any suggestions are welcome!
Thanks to the tip from Val and his response to another question, I managed to get aggregated times for the different log events using Logstash.
This is the configuration:
input {
  file {
    path => "path/to/log.log"
  }
}
filter {
  grok {
    # Parse the log line and tag the event with its start/end status
    match => { "message" => "%{EXIM_DATE:timestamp} \[%{NUMBER:pid}\] %{WORD:controller}: %{WORD:action} %{WORD:status}" }
    add_tag => [ "%{status}" ]
  }
  # Pair start/end events that share the same pid and compute elapsed_time
  elapsed {
    unique_id_field => "pid"
    start_tag => "start"
    end_tag => "end"
    new_event_on_match => false
  }
  if "elapsed" in [tags] {
    # Store the duration (in milliseconds) in the per-pid aggregate map
    aggregate {
      task_id => "%{pid}"
      code => "map['duration'] = [(event.get('elapsed_time')*1000).to_i]"
      map_action => "create"
    }
  }
}
output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "my_index_%{+xxxx_M}"
    action => "index"
  }
}
In Kibana I can now use the elapsed_time field created by the elapsed-filter to visualize the time each request takes.
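Once the events are indexed, the per-action durations the question originally asked about can also be pulled out with nested terms aggregations over elapsed_time; a sketch (the index pattern matches the output above; if controller and action are analyzed strings in your mapping, point the terms aggregations at their not_analyzed variants instead):
POST my_index_*/_search
{
  "size": 0,
  "aggs": {
    "by_controller": {
      "terms": { "field": "controller" },
      "aggs": {
        "by_action": {
          "terms": { "field": "action" },
          "aggs": {
            "avg_elapsed": { "avg": { "field": "elapsed_time" } }
          }
        }
      }
    }
  }
}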

Timezone causing different results when doing a search query to an index in Elastic Search

I'm trying to get the results of a search query (i.e. the results for a given date range) against a particular index, so that I can look at the results on a daily basis.
This is the query : http://localhost:9200/dialog_test/_search?q=timestamp:[2016-08-03T00:00:00.128%20TO%202016-08-03T23:59:59.128]
In the above, timestamp is a field which I added using my logstash.conf in order to capture the actual log time. When I tried this query, I surprisingly got a number of hits (total hits: 24), which should have been 0 since I don't have any log records for 2016-08-03. It actually displays the count for the next day (2016-08-04), which has 24 records in the log file. I'm sure something has gone wrong with the timezone.
My timezone is GMT+5:30.
Here is my filtering part of logstash conf:
filter {
grok {
patterns_dir => ["D:/ELK Stack/logstash/logstash-2.3.4/bin/patterns"]
match => { "message" => "^%{LOGTIMESTAMP:logtimestamp}%{GREEDYDATA}" }
}
mutate {
add_field => { "timestamp" => "%{logtimestamp}" }
remove_field => ["logtimestamp"]
}
date {
match => [ "timestamp" , "ISO8601" , "yyyyMMdd HH:mm:ss.SSS" ]
target => "timestamp"
locale => "en"
}}
EDIT:
This is a snapshot of the first 24 records, which have the date 2016-08-04, from the log file:
And this is a snapshot of the JSON response I got when I searched for 2016-08-03:
Where am I going wrong? Any help would be appreciated.
In your date filter you need to add a timezone
date {
match => [ "timestamp" , "ISO8601" , "yyyyMMdd HH:mm:ss.SSS" ]
target => "timestamp"
locale => "en"
timezone => "Asia/Calcutta" <--- add this
}
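Alternatively (or as a cross-check), the date range itself can be interpreted in your offset at query time with the range query's time_zone parameter, instead of the URI query string; a sketch using the index and field from the question:
POST dialog_test/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "2016-08-03T00:00:00.128",
        "lte": "2016-08-03T23:59:59.128",
        "time_zone": "+05:30"
      }
    }
  }
}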

Get the first document from every Elasticsearch route

I have an Elasticsearch index with a routing key for the day in the format "yyyyMMdd". Each day a lot of new documents are added. At the end of the month I would like to query whether there are any days when, for some reason, no document has been added by a source. There is a source_id field representing the source.
I've got as far as knowing that I need to supply all the routing keys, like 20160101,20160102 etc., and filter by the source_id. But this can return hundreds of thousands of documents, and I may need to paginate through them all.
Is there a way to find out only whether there is a routing key which doesn't have a matching document for the given source_id? Essentially I would return 31 results or fewer to my application code, so it would be easy to iterate through and check whether there is a day without a document.
Any ideas?
You can use a Terms Aggregation on the _routing field to find out which routing values have actually been used. See the query below:
POST <index>/<type>/_search
{
"size": 0,
"query": {
"term": {
"source_id": {
"value": "VALUE" <-- Value of source_id to filter on
}
}
},
"aggs": {
"routings": {
"terms": {
"field": "_routing",
"size": 31 <-- We don't expect to get more than 31 unique _routing values
}
}
}
}
The corresponding NEST code is as follows:
var response = client.Search<object>(s => s
.Index("<index name>")
.Type("<type>")
.Query(q => q
.Term("source_id", "<source value>"))
.Aggregations(a => a
.Terms("routings", t => t
.Field("_routing")
.Size(31))));
var routings = response.Aggs.Terms("routings").Items.Select(b => b.Key);
routings will contain the list of routing values you need.
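On the application side, those keys could then be diffed against every day of the month to spot the missing ones; a rough sketch (the month, year, and variable names are assumptions for illustration, and it assumes the usual System.Linq using):
// Hypothetical follow-up: which days of (say) January 2016 have no document
// for this source_id? "routings" is the key list extracted above.
var allDays = Enumerable.Range(0, DateTime.DaysInMonth(2016, 1))
    .Select(i => new DateTime(2016, 1, 1).AddDays(i).ToString("yyyyMMdd"));
var missingDays = allDays.Except(routings).ToList();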

Query Mongo Embedded Documents with a size

I have a ruby on rails app using Mongoid and MongoDB v2.4.6.
I have the following MongoDB structure, a record which embeds_many fragments:
{
"_id" : "76561198045636214",
"fragments" : [
{
"id" : 76561198045636215,
"source_id" : "source1"
},
{
"id" : 76561198045636216,
"source_id" : "source2"
},
{
"id" : 76561198045636217,
"source_id" : "source2"
}
]
}
I am trying to find all records in the database that contain fragments with duplicate source_ids.
I'm pretty sure I need to use $elemMatch as I need to query embedded documents.
I have tried
Record.elem_match(fragments: {source_id: 'source2'})
which works but doesn't restrict to duplicates.
I then tried
Record.elem_match(fragments: {source_id: 'source2', :source_id.with_size => 2})
which returns no results (but is a valid query). The query Mongoid produces is:
selector: {"fragments"=>{"$elemMatch"=>{:source_id=>"source2", "source_id"=>{"$size"=>2}}}}
Once that works I need to update it so that the size check is greater than 1.
Is this possible? It feels like I'm very close. This is a one-off cleanup operation so query performance isn't too much of an issue (however we do have millions of records to update!)
Any help is much appreciated!
I have been able to achieve the desired outcome, but in testing it's far too slow (it would take many weeks to run across our production system). The problem is the double query per record (we have ~30 million records in production).
Record.where('fragments.source_id' => 'source2').each do |record|
query = record.fragments.where(source_id: 'source2')
if query.count > 1
# contains duplicates, delete all but latest
query.desc(:updated_at).skip(1).delete_all
end
# needed to trigger after_save filters
record.save!
end
The problem with the current approach here is that the standard MongoDB query forms do not actually "filter" the nested array documents in any way, which is essentially what you need in order to "find the duplicates" within your documents.
For this, MongoDB provides the aggregation framework, which is probably the best approach to finding them. There is no direct "mongoid" style approach to these queries, as Mongoid is geared towards the existing "rails" style of dealing with relational documents.
You can, however, access the "moped" form through the .collection accessor on your class model:
Record.collection.aggregate([
# Find arrays two elements or more as possibles
{ "$match" => {
"$and" => [
{ "fragments" => { "$not" => { "$size" => 0 } } },
{ "fragments" => { "$not" => { "$size" => 1 } } }
]
}},
# Unwind the arrays to "de-normalize" as documents
{ "$unwind" => "$fragments" },
# Group back and get counts of the "key" values
{ "$group" => {
"_id" => { "_id" => "$_id", "source_id" => "$fragments.source_id" },
"fragments" => { "$push" => "$fragments.id" },
"count" => { "$sum" => 1 }
}},
# Match the keys found more than once
{ "$match" => { "count" => { "$gte" => 2 } } }
])
That would return you results like this:
{
"_id" : { "_id": "76561198045636214", "source_id": "source2" },
"fragments": ["76561198045636216","76561198045636217"],
"count": 2
}
That at least gives you something to work with on how to deal with the "duplicates" here.
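As a rough, untested sketch of how that output could drive the cleanup, reusing the "keep the latest, delete the rest" logic from the question (here `pipeline` is assumed to hold the aggregation stages shown above), so that only the records actually flagged as having duplicates are visited:
# Untested sketch: iterate only the (record, source_id) pairs the aggregation flagged
Record.collection.aggregate(pipeline).each do |dup|
  record = Record.find(dup["_id"]["_id"])
  record.fragments
        .where(source_id: dup["_id"]["source_id"])
        .desc(:updated_at)
        .skip(1)
        .delete_all
  record.save!   # still needed to trigger the after_save filters
end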
