Spurious results from Elasticsearch

I suspect I can't (or I'm just not quite desperate enough to try yet!) give you enough information to work with, but I'm hoping someone may be able to give me an idea of where to investigate...
I have an Elasticsearch index which is in a live system and is working fine. I've added 3 attributes to the core entity in the index (productId). I'm getting the correct data back, but every now and then it includes spurious data in the results.
So for example (I've cut the list of fields down, which is why it is a multi_match query).
Using Postman I am sending:
{
  "query" : {
    "multi_match" : {
      "query" : "FD41D359-1066-47C5-B930-C839F380FBDE",
      "fields" : [ "softwareitem.productId" ]
    }
  }
}
I'm expecting 1 item to come back in this example and I'm getting 2. I've modified the result a little, but the key thing is the productId. You can see that in the 2nd item returned it is not the productId being searched for.
Can anyone give me an idea of where I should look next with this? Is there a fault with my query, or do you think the index might be corrupt in some way?
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 27.424479,
    "hits": [
      {
        "_index": "core_products",
        "_type": "softwareitem",
        "_id": "040EEEA1-4758-4F01-A55A-CAE710117C81",
        "_score": 27.424479,
        "_source": {
          "id": "040EEEA1-4758-4F01-A55A-CAE710117C81",
          "productId": "FD41D359-1066-47C5-B930-C839F380FBDE",
          "softwareitem": {
            "id": "040EEEA1-4758-4F01-A55A-CAE710117C81",
            "title": "Code Library",
            "description": "Blah Blah Blah",
            "rmType": "Software",
            "created": 1424445765000,
            "updated": null
          },
          "searchable": true
        }
      },
      {
        "_index": "core_products",
        "_type": "softwareitem",
        "_id": "806B8F04-3E53-4278-BCC2-C2E1A17D2813",
        "_score": 1.049637,
        "_source": {
          "id": "806B8F04-3E53-4278-BCC2-C2E1A17D2813",
          "productId": "9FB80ABA-B09C-47C5-929A-9FB6C48BD5A8",
          "softwareitem": {
            "id": "806B8F04-3E53-4278-BCC2-C2E1A17D2813",
            "title": "Video Game",
            "description": "Blah Blah Blah",
            "rmType": "Software",
            "created": 1424445765000,
            "updated": null
          },
          "searchable": true
        }
      }
    ]
  }
}

It seems softwareitem.productId is a string field that is being analysed. For exact matching of a string field, use a not_analyzed string field in your mapping, something like:
"productId" : {
"type" : "string",
"index" : "not_analyzed"
}
Even if your field is already not_analyzed, you still have to make an additional change.
At query time you shouldn't use a multi_match / match query. These types of queries analyze your input string and build a more complex query out of that input; that is why you are seeing the second, unexpected result (its productId contains 47C5, so the analyzer is probably tokenising the full string and building a query in which only one token needs to match). You should use a term / terms query instead.
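For example, a minimal sketch of a term query using the field from your multi_match (with a not_analyzed field the value has to match exactly, including case):
GET core_products/softwareitem/_search
{
  "query" : {
    "term" : {
      "softwareitem.productId" : "FD41D359-1066-47C5-B930-C839F380FBDE"
    }
  }
}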

Related

Kibana including versioned documents in visualizations

I have a document with _id "123456", and when I do a GET in Elasticsearch for that ID in my index I can see that it is _version: 2 which makes sense because I updated it.
However in my Kibana visualizations it seems like it is picking up both versions of the same document when showing the results.
How do I exclude versioned documents from re-appearing in the visualization? For example, this record is showing up twice in my bar graph.
Please and thank you
Example GET response:
{
  "_index": "censored",
  "_type": "censored",
  "_id": "123456",
  "_version": 2,
  "found": true,
  "_source": {
    ... omitted
  }
}
Also I am sure there is only one actual document with that ID because if I do a _search on the _id field I can see this:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 7.53924,
    "hits": [
      {
        "_index": "censored",
        "_type": "censored",
        "_id": "123456",
        "_score": 7.53924,
        "_source": {
          ... omitted
        }
      }
    ]
  }
}
EDIT: Things I've tried below
aggs": {
"latest": {
"terms": {
"field": "_id"
}
}
}
and
"aggs": {
"latest": {
"max": {
"field": "version"
}
}
}
Frankly, this is just a workaround; if someone finds a better solution I will mark that as the answer instead. Anyway, this is how I've been able to prevent multiple records with the same _id from showing up in the visualizations on my dashboard:
I just changed the "Y Axis - Count" on all the visualizations to "Y Axis - Unique Count on field _id".
Honestly it seems silly that I have to do this, because I think different versions should just automatically be excluded from my saved searches & visualizations. I couldn't find any information about why this was happening. I even tried a _forcemerge to try and delete previous versions of the records, but it didn't do anything.
Would be nice if someone found a real solution.
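For reference, Kibana's "Unique Count" metric maps to a cardinality aggregation in the query DSL; a rough sketch of the equivalent request (the index name my-index is a placeholder, and aggregating on _id may be restricted in newer Elasticsearch versions):
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_docs": {
      "cardinality": {
        "field": "_id"
      }
    }
  }
}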

Elasticsearch match phrase prefix not matching all terms

I am having an issue where the match_phrase_prefix query in Elasticsearch is not returning all the results I would expect, particularly when the query is one word followed by one letter.
Take this index mapping (this is a contrived example to protect sensitive data):
http://localhost:9200/test/drinks/_mapping
returns:
{
  "test": {
    "mappings": {
      "drinks": {
        "properties": {
          "name": {
            "type": "text"
          }
        }
      }
    }
  }
}
And amongst millions of other records are these:
{
  "_index": "test",
  "_type": "drinks",
  "_id": "2",
  "_score": 1,
  "_source": {
    "name": "Johnnie Walker Black Label"
  }
},
{
  "_index": "test",
  "_type": "drinks",
  "_id": "1",
  "_score": 1,
  "_source": {
    "name": "Johnnie Walker Blue Label"
  }
}
The following query, which is one word followed by two letters:
POST http://localhost:9200/test/drinks/_search
{
  "query": {
    "match_phrase_prefix" : {
      "name" : "Walker Bl"
    }
  }
}
returns this:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "drinks",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "name": "Johnnie Walker Black Label"
        }
      },
      {
        "_index": "test",
        "_type": "drinks",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "name": "Johnnie Walker Blue Label"
        }
      }
    ]
  }
}
Whereas this query with one word and one letter:
POST http://localhost:9200/test/drinks/_search
{
  "query": {
    "match_phrase_prefix" : {
      "name" : "Walker B"
    }
  }
}
returns no results. What could be happening here?
I will assume that you are working with Elasticsearch 5.0 and above.
I think it might be because of the max_expansions default value.
As explained in the documentation here, the max_expansions parameter controls how many terms the prefix of the last word can be expanded into. The default value is 50, which might explain why you find "Black" and "Blue" with the first two letters "Bl", but not with "B" only.
The documentation is pretty clear about it:
The match_phrase_prefix query is a poor-man's autocomplete. It is very easy to use, which lets you get started quickly with search-as-you-type, but its results, which usually are good enough, can sometimes be confusing.
Consider the query string quick brown f. This query works by creating a phrase query out of quick and brown (i.e. the term quick must exist and must be followed by the term brown). Then it looks at the sorted term dictionary to find the first 50 terms that begin with f, and adds these terms to the phrase query.
The problem is that the first 50 terms may not include the term fox, so the phrase quick brown fox will not be found. This usually isn't a problem, as the user will continue to type more letters until the word they are looking for appears.
I can't tell you whether it is okay to increase this parameter above 50 if you are looking for good performance, since I have never tried it myself.
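If you do want to experiment, here is a sketch of the same query with max_expansions raised (the value 200 is arbitrary; measure the impact on your own data):
POST http://localhost:9200/test/drinks/_search
{
  "query": {
    "match_phrase_prefix" : {
      "name" : {
        "query" : "Walker B",
        "max_expansions" : 200
      }
    }
  }
}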

elasticsearch: copying meta-field _id to other field while creating document

I am using Elasticsearch. I see there is a meta-field _id for each document. I want to search documents using this meta-field, as I don't have any other unique field in the document. But _id is a string and can have dashes, which are not searchable unless we add a mapping for the field as type keyword. But that is possible, as mentioned here. So now I am thinking of adding another field, newField, to the document and making it the same as _id. One way to do it is to first create the document, then assign _id to that field and save the document again. But this takes 2 requests, which is not great. So I want to find a way to set newField while creating the document itself. Is that even possible?
You can search for a document whose id contains dashes:
PUT my_index/tweet/testwith-
{
  "fullname" : "Jane Doe",
  "text" : "The twitter test!"
}
We just created a document with a dash in its id.
GET my_index/tweet/_search
{
  "query": {
    "terms": {
      "_id": [
        "testwith-"
      ]
    }
  }
}
We search for the document that has the following id: testwith-
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "tweet",
        "_id": "testwith-",
        "_score": 1,
        "_source": {
          "fullname": "Jane Doe",
          "text": "The twitter test!"
        }
      }
    ]
  }
}
We found it. So we can search for documents that have a - in their id.
You could also use a set processor in an ingest pipeline to store the id in an additional field; see https://www.elastic.co/guide/en/elasticsearch/reference/5.5/accessing-data-in-pipelines.html and https://www.elastic.co/guide/en/elasticsearch/reference/5.5/set-processor.html.
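A rough sketch of that approach on Elasticsearch 5.x (the pipeline name copy-id and the field newField are made up, and {{_id}} may not be available when you let Elasticsearch auto-generate the id):
PUT _ingest/pipeline/copy-id
{
  "description": "copy the document id into newField",
  "processors": [
    {
      "set": {
        "field": "newField",
        "value": "{{_id}}"
      }
    }
  ]
}
PUT my_index/tweet/testwith-?pipeline=copy-id
{
  "fullname" : "Jane Doe",
  "text" : "The twitter test!"
}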

Query with `field` returns nothing

I'm new to Elasticsearch and am having trouble with my queries.
When I do a match_all query I get this:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "stations",
        "_type": "station",
        "_id": "4432",
        "_score": 1,
        "_source": {
          "SiteName": "Abborrkroksvägen",
          "LastModifiedUtcDateTime": "2015-02-13 10:34:20.643",
          "ExistsFromDate": "2015-02-14 00:00:00.000"
        }
      },
      {
        "_index": "stations",
        "_type": "station",
        "_id": "9110",
        "_score": 1,
        "_source": {
          "SiteName": "Abrahamsberg",
          "LastModifiedUtcDateTime": "2012-03-26 23:55:32.900",
          "ExistsFromDate": "2012-06-23 00:00:00.000"
        }
      }
    ]
  }
}
My search query looks like this:
{
  "query": {
    "query_string": {
      "fields": ["SiteName"],
      "query": "a"
    }
  }
}
The problem is that when I run the query above I get empty results, which is strange. I should receive both of the documents from my index, right?
What am I doing wrong? Did I index my data wrong or is my query just messed up?
Appreciate any help I can get. Thanks guys!
There is nothing wrong with either your data or your query. It seems you haven't quite seen how data gets stored in Elasticsearch!
Firstly, when you index data ("SiteName": "Abborrkroksvägen" and "SiteName": "Abrahamsberg"), the values are stored as individually analysed terms.
When you query ES using "query": "a" (meaning you are looking for the term "a"), it looks for a match with the term a, but as there is no such term you get empty results.
When you query ES using "query": "a*" (meaning all terms starting with "a"), it returns what you expected.
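For example, the query from the question with the wildcard added (note that expanding a single-letter prefix like this can be expensive on a large index):
{
  "query": {
    "query_string": {
      "fields": ["SiteName"],
      "query": "a*"
    }
  }
}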
Hope this clarifies your question!
Also, you may have a look at an article about search that I found recently: https://www.timroes.de/2016/05/29/elasticsearch-kibana-queries-in-depth-tutorial/

Change the structure of ElasticSearch response json

In some cases, I don't need all of the fields in response json.
For example,
// request json
{
  "_source": "false",
  "aggs": { ... },
  "query": { ... }
}
// response json
{
  "took": 123,
  "timed_out": false,
  "_shards": { ... },
  "hits": {
    "total": 123,
    "max_score": 123,
    "hits": [
      {
        "_index": "foo",
        "_type": "bar",
        "_id": "123",
        "_score": 123
      }
    ],
    ...
  },
  "aggregations": {
    "foo": {
      "buckets": [
        {
          "key": 123,
          "doc_count": 123
        },
        ...
      ]
    }
  }
}
Actually, I don't need the _index/_type every time. And when I do aggregations, I don't need the hits block at all.
"_source": false or "_source": { "exclude": [ "foobar" ] } can help ignore/exclude the _source fields in the hits block.
But can I change the structure of ES response json in a more common way? Thanks.
I recently needed to "slim down" the Elasticsearch response, as it was well over 1 MB of JSON, and I started using the filter_path request parameter.
It lets you include or exclude specific fields and supports different types of wildcards. Do read the docs linked above, as there is quite a bit of information there.
e.g.
_search?filter_path=aggregations.**.hits._source,aggregations.**.key,aggregations.**.doc_count
This reduced (in my case) the response size by half without significantly increasing the search duration, so it was well worth the effort.
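Applied to the example in the question, a sketch that drops the hits block and keeps only the timing plus the aggregation keys and counts (the index and aggregation names foo come from the sample response):
GET foo/_search?filter_path=took,aggregations.foo.buckets.key,aggregations.foo.buckets.doc_count
{
  "_source": "false",
  "aggs": { ... },
  "query": { ... }
}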
In the hits section, you will always have the _index, _type and _id fields. If you want to retrieve only some specific fields in your search results, you can use the fields parameter in the root object:
{
  "query": { ... },
  "aggs": { ... },
  "fields": ["fieldName1", "fieldName2", etc...]
}
When doing aggregations, you can use the search_type parameter (documentation) with the count value, like this:
GET index/type/_search?search_type=count
It won't return any documents, only the result count, and your aggregations will be computed in exactly the same way.
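Note that search_type=count was deprecated and later removed in newer Elasticsearch versions; the equivalent there is "size": 0 in the request body, e.g.:
GET index/type/_search
{
  "size": 0,
  "aggs": { ... }
}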
