ES keeps returning every document - elasticsearch

I recently inherited an ES instance and ensured I read an entire book on ES cover-to-cover before posting this, however I'm afraid I'm unable to get even simple examples to work.
I have an index on our staging environment which exhibits behavior where every document is returned no matter what - I have a similar index on our QA environment which works like I would expect it to. For example I am running the following query against http://staging:9200/people_alias/_search?explain:
{ "query" :
{ "filtered" :
{ "query" : { "match_all" : {} },
"filter" : { "term" : { "_id" : "34414405382" } } } } }
What I noticed on this staging environment is the score of every document is 1 and it is returning EVERY document in my index no matter what value I specify ...using ?explain I see the following:
_explanation: {
value: 1
description: ConstantScore(*:*), product of:
details: [
{
value: 1, description: boost
}, { value: 1, description: queryNorm } ] }
On my QA environment, which correctly returns only one record I observe for ?explain:
_explanation: {
value: 1
description: ConstantScore(cache(_uid:person#34414405382)), product of:
details: [ {
value: 1,
description: boost
}, {
value: 1,
description: queryNorm
}
]
}
The mappings are almost identical on both indices - the only difference is I removed the manual field-level boost values on some fields as I read field-level boosting is not recommended in favor of query-time boosting, however this should not affect the behavior of filtering on the document ID (right?)
Is there any clue I can glean from the differences in the explain output or should I post the index mappings? Are there any server-level settings I should consider checking? It doesn't matter what query I use on Staging, I can use match queries and exact match lookups on other fields and Staging just keeps returning every result with Score 1.0
I feel like I'm doing something very glaringly and obviously wrong on my Staging environment. Could someone please explain the presence of ConstantScore, boost and queryNorm? I thought from looking at examples in other literature I would see things like term frequency etc.
EDIT: I am issuing the query from Elastic Search Head plugin

In your HEAD plugin, you need to use POST in order to send the query in the payload, otherwise the _search endpoint is hit without any constraints.
In your browser, if you open the developer tools and look at the networking tab, you'll see that nothing is sent in the payload when using GET.
It's a common mistake people often do. Some HTTP clients (like curl) do send a payload using GET, but others (like /head/) don't. Sense will warn you if you use GET instead of POST when sending a payload and will automatically force POST instead of GET.
So to sum it up, it's best to always use POST whenever you wish to send some payload to your servers, so you don't have to care about the behavior of the HTTP client you're using.

Related

Kibana scripted field which loops through an array

I am trying to use the metricbeat http module to monitor F5 pools.
I make a request to the f5 api and bring back json, which is saved to kibana. But the json contains an array of pool members and I want to count the number which are up.
The advice seems to be that this can be done with a scripted field. However, I can't get the script to retrieve the array. eg
doc['http.f5pools.items.monitor'].value.length()
returns in the preview results with the same 'Additional Field' added for comparison:
[
{
"_id": "rT7wdGsBXQSGm_pQoH6Y",
"http": {
"f5pools": {
"items": [
{
"monitor": "default"
},
{
"monitor": "default"
}
]
}
},
"pool.MemberCount": [
7
]
},
If I try
doc['http.f5pools.items']
Or similar I just get an error:
"reason": "No field found for [http.f5pools.items] in mapping with types []"
Googling suggests that the doc construct does not contain arrays?
Is it possible to make a scripted field which can access the set of values? ie is my code or the way I'm indexing the data wrong.
If not is there an alternative approach within metricbeats? I don't want to have to make a whole new api to do the calculation and add a separate field
-- update.
Weirdly it seems that the number values in the array do return the expected results. ie.
doc['http.f5pools.items.ratio']
returns
{
"_id": "BT6WdWsBXQSGm_pQBbCa",
"pool.MemberCount": [
1,
1
]
},
-- update 2
Ok, so if the strings in the field have different values then you get all the values. if they are the same you just get one. wtf?
I'm adding another answer instead of deleting my previous one which is not the actual question but still may be helpful for someone else in future.
I found a hint in the same documentation:
Doc values are a columnar field value store
Upon googling this further I found this Doc Value Intro which says that the doc values are essentially "uninverted index" useful for operations like sorting; my hypotheses is while sorting you essentially dont want same values repeated and hence the data structure they use removes those duplicates. That still did not answer as to why it works different for string than number. Numbers are preserved but strings are filters into unique.
This “uninverted” structure is often called a “column-store” in other
systems. Essentially, it stores all the values for a single field
together in a single column of data, which makes it very efficient for
operations like sorting.
In Elasticsearch, this column-store is known as doc values, and is
enabled by default. Doc values are created at index-time: when a field
is indexed, Elasticsearch adds the tokens to the inverted index for
search. But it also extracts the terms and adds them to the columnar
doc values.
Some more deep-dive into doc values revealed it a compression technique which actually de-deuplicates the values for efficient and memory-friendly operations.
Here's a NOTE given on the link above which answers the question:
You may be thinking "Well that’s great for numbers, but what about
strings?" Strings are encoded similarly, with the help of an ordinal
table. The strings are de-duplicated and sorted into a table, assigned
an ID, and then those ID’s are used as numeric doc values. Which means
strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using
fixed, variable or prefix-encoded strings.
Also, if you dont want this behavior then you can disable doc-values
OK, solved it.
https://discuss.elastic.co/t/problem-looping-through-array-in-each-doc-with-painless/90648
So as I discovered arrays are prefiltered to only return distinct values (except in the case of ints apparently?)
The solution is to use params._source instead of doc[]
The answer for why doc doesnt work
Quoting below:
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
Doc-values can only return "simple" field values like numbers, dates,
geo- points, terms, etc, or arrays of these values if the field is
multi-valued. It cannot return JSON objects
Also, important to add a null check as mentioned below:
Missing fields
The doc['field'] will throw an error if field is
missing from the mappings. In painless, a check can first be done with
doc.containsKey('field')* to guard accessing the doc map.
Unfortunately, there is no way to check for the existence of the field
in mappings in an expression script.
Also, here is why _source works
Quoting below:
The document _source, which is really just a special stored field, can
be accessed using the _source.field_name syntax. The _source is loaded
as a map-of-maps, so properties within object fields can be accessed
as, for example, _source.name.first.
.
Responding to your comment with an example:
The kyeword here is: It cannot return JSON objects. The field doc['http.f5pools.items'] is a JSON object
Try running below and see the mapping it creates:
PUT t5/doc/2
{
"items": [
{
"monitor": "default"
},
{
"monitor": "default"
}
]
}
GET t5/_mapping
{
"t5" : {
"mappings" : {
"doc" : {
"properties" : {
"items" : {
"properties" : {
"monitor" : { <-- monitor is a property of items property(Object)
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
}

Elasticsearch 6.0 Removal of mapping types - Alternatives

Background
I migrating my ES index into ES version 6. I currenly stuck because ES6 removed the using on "_type" field.
Old Implementation (ES2)
My software has many users (>100K). Each user has at least one document in ES. So, the hierarchy looks like this:
INDEX -> TYPE -> Document
myindex-> user-123 -> document-1
The key point here is with this structure I can easily remove all the document of specific user.
DELETE /myindex/user-123
(Delete all the document of specific user, with a single command)
The problem
"_type" is no longer supported by ES6.
Possible solution
Instead of using _type, use the index name as USER-ID. So my index will looks like:
"user-123" -> "static-name" -> document
Delete user is done by delete index (instead of delete type in previous implementation).
Questions:
My first worry is about the amount of index and performance: Having like 1M indexes is something that acceptable in terms of performance? don't forget I have to search on them frequently.
Most of my users has small amount of documents stored in ES. Is that make sense to hold a shard, which should be expensive, for < 10 documents?
My data architecture sounds reasonable for you?
Any other tip will be welcome!
Thanks.
I would not have one index per user, it's a waste of resources, especially if there are only 10 docs per user.
What I would do instead is to use filtered aliases, one per user.
So the index would be named users and the type would be a static name, e.g. doc. For user 123, the documents of that user would all be stored in users/doc/xyz and in each document you need to add the user id, e.g.
PUT users/doc/xyz
{
...
"userId": 123,
...
}
Then you can define a filtered alias for all documents of user 123, like this:
POST /_aliases
{
"actions" : [
{
"add" : {
"index" : "users",
"alias" : "user-123",
"filter" : { "term" : { "userId" : "123" } }
}
}
]
}
If you need to delete all documents of user 123, then you can simply do it like this:
POST user-123/_delete_by_query?q=*
Having these many indexes is definitely not a good approach. If your only concern to delete multiple documents with a single command. Then you can use Delete by Query API provided by ElasticSearch
You can introduce "subtype" attribute in all your document containing value for each document like "user-" value. So in your case, document would looks like.
{
"attribute1":"value",
"subtype":"user-123"
}

Nested count queries

i'm looking to add a feature to an existing query. Basically, I run a query that returns say 1000 documents. Those documents all have the same structure, only the values of certain fields vary. What i'd like, is to not only get the full list as a result, but also count how many results have a field X with the value Y, how many results have the same field X with the value Z etc...
Basically get all the results + 4 or 5 "counts" that would act like the SQL "group by", in a way.
The point of this is to allow full text search over all the clients in our database (without filtering), while showing how many of those are active clients, past clients, active prospects etc...
Any way to do this without running additional / separate queries ?
EDIT WITH ANSWER :
Aggregations is the way to go. Here's how I did it, it's so straightforward that I expected much harder work !
{
"query": {
"term": {
"_type":"client"
}
},
"aggregations" : {
"agg1" : {
"terms" : {
"field" : "listType.typeRef.keyword"
}
}
}
}
Note that it's even in a list of terms and not a single field, that's just how easy it was !
I believe what you are looking for is the aggregation query.
The documentation should be clear enough, but if you struggle please give us your ES query and we will help you from there.

Kibana 4.1 - use JSON input to create an Hour Of Day field from #timestamp for histogram

Edit: I found the answer, see below for Logstash <= 2.0 ===>
Plugin created for Logstash 2.0
Whomever is interested in this with Logstash 2.0 or above, I created a plugin that makes this dead simple:
The GEM is here:
https://rubygems.org/gems/logstash-filter-dateparts
Here is the documentation and source code:
https://github.com/mikebski/logstash-datepart-plugin
I've got a bunch of data in Logstash with a #Timestamp for a range of a couple of weeks. I have a duration field that is a number field, and I can do a date histogram. I would like to do a histogram over hour of day, rather than a linear histogram from x -> y dates. I would like the x axis to be 0 -> 23 instead of date x -> date y.
I think I can use the JSON Input advanced text input to add a field to the result set which is the hour of day of the #timestamp. The help text says:
Any JSON formatted properties you add here will be merged with the elasticsearch aggregation definition for this section. For example shard_size on a terms aggregation which leads me to believe it can be done but does not give any examples.
Edited to add:
I have tried setting up an entry in the scripted fields based on the link below, but it will not work like the examples on their blog with 4.1. The following script gives an error when trying to add a field with format number and name test_day_of_week: Integer.parseInt("1234")
The problem looks like the scripting is not very robust. Oddly enough, I want to do exactly what they are doing in the examples (add fields for day of month, day of week, etc...). I can get the field to work if the script is doc['#timestamp'], but I cannot manipulate the timestamp.
The docs say Lucene expressions are allowed and show some trig and GCD examples for GIS type stuff, but nothing for date...
There is this update to the BLOG:
UPDATE: As a security precaution, starting with version 4.0.0-RC1,
Kibana scripted fields default to Lucene Expressions, not Groovy, as
the scripting language. Since Lucene Expressions only support
operations on numerical fields, the example below dealing with date
math does not work in Kibana 4.0.0-RC1+ versions.
There is no suggestion for how to actually do this now. I guess I could go off and enable the Groovy plugin...
Any ideas?
EDIT - THE SOLUTION:
I added a filter using Ruby to do this, and it was pretty simple:
Basically, in a ruby script you can access event['field'] and you can create new ones. I use the Ruby time bits to create new fields based on the #timestamp for the event.
ruby {
code => "ts = event['#timestamp']; event['weekday'] = ts.wday; event['hour'] = ts.hour; event['minute'] = ts.min; event['second'] = ts.sec; event['mday'] = ts.day; event['yday'] = ts.yday; event['month'] = ts.month;"
}
This no longer appears to work in Logstash 1.5.4 - the Ruby date elements appear to be unavailable, and this then throws a "rubyexception" and does not add the fields to the logstash events.
I've spent some time searching for a way to recover the functionality we had in the Groovy scripted fields, which are unavailable for scripting dynamically, to provide me with fields such as "hourofday", "dayofweek", et cetera. What I've done is to add these as groovy script files directly on the Elasticsearch nodes themselves, like so:
/etc/elasticsearch/scripts/
hourofday.groovy
dayofweek.groovy
weekofyear.groovy
... and so on.
Those script files contain a single line of Groovy, like so:
Integer.parseInt(new Date(doc["#timestamp"].value).format("d")) (dayofmonth)
Integer.parseInt(new Date(doc["#timestamp"].value).format("u")) (dayofweek)
To reference these in Kibana, firstly create a new search and save it, or choose one of your existing saved searches (Please take a copy of the existing JSON before you change it, just in case) in the "Settings -> Saved Objects -> Searches" page. You then modify the query to add "Script Fields" in, so you get something like this:
{
"query" : {
...
},
"script_fields": {
"minuteofhour": {
"script_file": "minuteofhour"
},
"hourofday": {
"script_file": "hourofday"
},
"dayofweek": {
"script_file": "dayofweek"
},
"dayofmonth": {
"script_file": "dayofmonth"
},
"dayofyear": {
"script_file": "dayofyear"
},
"weekofmonth": {
"script_file": "weekofmonth"
},
"weekofyear": {
"script_file": "weekofyear"
},
"monthofyear": {
"script_file": "monthofyear"
}
}
}
As shown, the "script_fields" line should fall outside the "query" itself, or you will get an error. Also ensure the script files are available to all your Elasticsearch nodes.

Performance tuning MongoDB query/update?

So I have a MongoDB instance where I am trying to update data in one collection with data from another collection. The two collections are participants with about 180k documents and questions with about 95k documents.
Documents in participants typically look something like this:
{
"_id" : ObjectId("52f90b8bbab16dd8594b82b4"),
"answers" : [
{
"_id" : ObjectId("52f90b8bbab16dd8594b82b9"),
"question_id" : 2081,
"sub_id" : null,
"values" : [
"Yes"
]
},
{
"_id" : ObjectId("52f90b8bbab16dd8594b82b8"),
"question_id" : 2082,
"sub_id" : 123,
"values" : [
"Would prefer to go alone"
]
},
{
"_id" : ObjectId("52f90b8bbab16dd8594b82b7"),
"question_id" : 2082,
"sub_id" : 456,
"values" : [
"Yes"
]
}
],
"created" : ISODate("2012-03-01T17:40:21Z"),
"email" : "anonymous",
"id" : 65,
"survey" : ObjectId("52f41d579af1ff4221399a7b"),
"survey_id" : 374
}
I am using the query below to perform the update:
db.participants.ensureIndex({"answers.question_id": 1, "answers.sub_id": 1});
print("created index for answer arrays!")
db.questions.find().forEach(function(doc){
db.participants.update(
{
"answers.question_id": doc.id,
"answers.sub_id": doc.sub_id
},
{
$set:
{
"answers.$.question": doc._id
}
},
false,
true
);
});
db.participants.dropIndex({"answers.question_id": 1, "answers.sub_id": 1});
But this takes about 20 minutes to run. I was hoping that adding the index would help with the performance, but it is still pretty slow. Is this index setup correctly considering that I am indexing fields in an array of objects? Can anyone see anything that I am doing that would cause the slowness? Suggestions on where to start looking to improve the performance of this query?
I think you need to consider what you are actually doing here in order to understand why the index is not helping and indeed why this operation takes so long.
The first part of the answer is explained by what you are doing here:
db.questions.find()
Now that part alone basically says that you are asking to retrieve every document in your questions collection. So we can see what you are trying to do is exactly that, as you want to update that content into your participants collection, particularly the document _id for the "question". But here, by definition of getting all documents, no index will be used.
So what you are doing is looping every document in the questions, then asking with your update operation to match the participants record with data from the "question". And what that means is you are pulling "over the wire" all of your 95K documents and sending back "over the wire" your update operation, 95K times. This is not happening on the server and there is network traffic between your application and your MongoDB.
The index itself is not going to do much other than improve the search of each participants record, which is better than scanning and you should be getting the match. But that's not the part that taking the time, its the fetching of the questions that will be the largest issue. Also note that if you were updating
So if it's possible to run your update process on a machine that is as close as possible in networking terms to the MongoDB server then that is going to be your best performance improvement. You could also wind back your Write Concern if you want to be a little daring and/or can live with checking the integrity in another opertation, and that will reduce your network traffic and waiting for a response to the update (which is actually happening) if you put it in "fire and forget" mode.
Also see the guide if you are not sure of the concepts:
http://docs.mongodb.org/manual/core/write-concern/
In case anyone is interested I was able to take the run time of this update query from 20 minutes down to about a minute and a half by using projection when selecting the questions documents. Since I am only using the _id, id and sub_id fields I was able to do the following:
db.questions.find({},{_id: 1, id: 1, sub_id: 1}).forEach(function(doc){
....
Which drastically improved performance. Hope this helps someone!

Resources