Using multi-match query in a multi-field does not work - elasticsearch

Our system stores accounts in the following format: acct:username#domain
But for many searches we only need the username, so for the user-created memos I've decided to make the user field a multi_field, like this:
{
'text': {
'type': 'string'
},
'user': {
'type': 'multi_field',
'path': 'just_name',
'fields': {
'user': {
'type': 'string',
'index': 'analyzed',
'analyzer': 'lower_keyword'
},
'username': {
'type': 'string',
'index': 'analyzed',
'analyzer': 'username'
}
}
}
}
and other settings:
__settings__ = {
'analysis': {
'tokenizer': {
'username': {
'type': 'pattern',
'group': 1,
'pattern': '^acct:(.+)#.*$'
}
},
'analyzer': {
'lower_keyword': {
'type': 'custom',
'tokenizer': 'keyword',
'filter': 'lowercase'
},
'username': {
'tokenizer': 'username',
'filter': 'lowercase'
}
}
}
}
Now, if I make a query for the username, it works. I.e., if I have the following user: acct:testuser#testdomain
and I make a query like this:
{
"query": {
"bool": {
"must": [
{
"terms": {
"username": [
"testuser"
]
}
}
],
"minimum_number_should_match": 1
}
},
"size": 50
}
It works (I know it could be done much more simply, but this is a system-generated query).
But I need to make searches which look for a string in both the text and the username fields.
I've decided to use a multi-match query for this.
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"operator": "and",
"query": "testuser",
"type": "cross_fields",
"fields": [
"text",
"username"
]
}
}
],
"minimum_number_should_match": 1
}
},
"size": 50
}
Now the problem is that this query does not work for the username field. It does work for the text field, and for other fields if I include them, but it does not bring back any results for the username field.
Can you help me figure out what I am doing wrong?

I'd forgotten that the username analyzer would also tokenize my search input for match/multi_match queries. That way the string 'testuser' was analyzed and it produced zero tokens.
So the solution is to change the username field's mapping to:
'username': {
'type': 'string',
'index': 'analyzed',
'index_analyzer': 'username',
'search_analyzer': 'lower_keyword'
}
and now both queries are working.
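For reference, a quick way to confirm this behaviour is the _analyze API (a sketch against a hypothetical index called memos, using the pre-2.x query-string form that matches the mappings above):
GET /memos/_analyze?analyzer=username&text=testuser
returns no tokens, because the pattern ^acct:(.+)#.*$ does not match a bare username, while
GET /memos/_analyze?analyzer=lower_keyword&text=testuser
returns the single token testuser, which is why lower_keyword works as the search_analyzer for the multi_match query.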

Related

How to add suggestion inside term query in DSL

My docs are below:
[
{'id':1, 'name': 'sachin messi', 'description': 'football#football.com', 'type': 'football', 'var':'sports'},
{'id':2, 'name': 'lionel messi', 'description': 'messi#fifa.com','type': 'soccer','var':'sports'},
{'id':3, 'name': 'sachin', 'description': 'was', 'type': 'cricket', 'var':'sports'}
]
I need to get suggestions for a string, but only for documents matching the terms query.
My DSL query is below:
quer = {
"query": {
"bool": {
"must": [
{
"terms": {
"var.keyword": [
"notsports"
]
}
},
{
"query_string": {
"query": "schin",
"fields": [
"name^128",
"description^64",
]
}
}
]
}
},
"suggest": {
"my-suggestion": {
"text": "schin",
"term": {
"field": "name",
"prefix_length": 0,
"min_word_length": 3,
"string_distance": "ngram"
}
}
}
}
In my query, var.keyword is notsports (which matches no documents), yet I am still getting a suggestion:
'suggest': {'my-suggestion': [{'text': 'schin', 'offset': 0, 'length': 5, 'options': [{'text': 'sachin', 'score': 0.75, 'freq': 1}]}]}
When I tried to put suggest inside the terms list, I got RequestError: RequestError(400, 'x_content_parse_exception', 'unknown query [suggest]')
I need to get the suggestion only if var.keyword matches sports.
I have also asked this question on the Elasticsearch forum: https://discuss.elastic.co/t/how-to-add-suggestion-inside-term-query-in-dsl/309893

How to sort aggregations where the keys have international characters?

Given a database which contains a list of people, where they live, and their wealth/income/tax level, I've given my Elasticsearch 5.6.2 this mapping:
mappings => {
person => {
properties => {
name => {
type => 'text',
fields => {
raw => {
type => 'keyword',
},
},
},
county => {
type => 'text',
fields => {
raw => {
type => 'keyword',
},
},
},
community_name => {
type => 'text',
fields => {
raw => {
type => 'keyword',
},
},
},
wealth => {
type => 'long',
},
income => {
type => 'long',
},
tax => {
type => 'long',
},
},
},
},
One county can have several communities, and I want to do an aggregation that creates an overview of the average wealth/income/tax for each county and for each county's communities.
This seems to work:
aggs => {
counties => {
terms => {
field => 'county.raw',
size => 100,
order => { _term => 'asc' },
},
aggs => {
communities => {
terms => {
field => 'community_name.raw',
size => 1_000,
order => { _term => 'asc' },
},
aggs => {
avg_wealth => {
avg => {
field => 'wealth',
},
},
avg_income => {
avg => {
field => 'income',
},
},
avg_tax => {
avg => {
field => 'tax',
},
},
},
},
avg_wealth => {
avg => {
field => 'wealth',
},
},
avg_income => {
avg => {
field => 'income',
},
},
avg_tax => {
avg => {
field => 'tax',
},
},
},
},
},
However, the "county" and "community_name" aren't sorted correctly because some of them have Norwegian characters in them, meaning that ES sorts "Ål" before "Øvre Eiker", which is wrong.
How can I achieve correct Norwegian sorting?
EDIT: I tried changing the "community_name" field to use "icu_collation_keyword" instead of "keyword":
community_name => {
type => 'text',
fields => {
raw => {
type => 'icu_collation_keyword',
index => 'false',
language => 'nb',
},
},
},
But this results in garbled output:
Akershus - 276855 - 229202 - 80131
ᦥ免⡠႐໠  - 314430 - 243684 - 87105
↘卑◥猔᠈〇㠖 - 202339 - 225665 - 78186
⚞乀⃠᷀  - 306985 - 237405 - 83186
⦘卓敫တ倎瀤 - 218060 - 218407 - 75602
⸳䄓†怜〨 - 271174 - 216843 - 75257
If the field on which you want to aggregate (community_name in your example) always has only one value, then I think you might try the following approach, which is an extension of what you already have.
Basically, you can add another sub-aggregation on the original, non-garbled value, and fetch it on the client side for display.
I will show it on a simplified mapping:
PUT /icu_index
{
"mappings": {
"my_type": {
"properties": {
"community": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
},
"norwegian": {
"type": "icu_collation_keyword",
"index": false,
"language": "nb"
}
}
},
"wealth": {
"type": "long"
}
}
}
}
}
We store community name as:
unchanged as community;
as keyword in community.raw;
as icu_collation_keyword in community.norwegian.
Then we index a couple of documents (note: the community field holds a single string, not a list of strings):
PUT /icu_index/my_type/2
{
"community": "Ål",
"wealth": 10000
}
PUT /icu_index/my_type/3
{
"community": "Øvre Eiker",
"wealth": 5000
}
Now we can do the aggregation:
POST /icu_index/my_type/_search
{
"size": 0,
"aggs": {
"communities": {
"terms": {
"field": "community.norwegian",
"order": {
"_term": "asc"
}
},
"aggs": {
"avg_wealth": {
"avg": {
"field": "wealth"
}
},
"community_original": {
"terms": {
"field": "community.raw"
}
}
}
}
}
}
We are still sorting by community.norwegian, but we also add a sub-aggregation on community.raw. Let's see the result:
"aggregations": {
"communities": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "⸳䃔楦၃৉瓅ᘂก捡㜂\u0000\u0001",
"doc_count": 1,
"community_original": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Øvre Eiker",
"doc_count": 1
}
]
},
"avg_wealth": {
"value": 5000
}
},
{
"key": "⸳䄏怠怜〨\u0000\u0000",
"doc_count": 1,
"community_original": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Ål",
"doc_count": 1
}
]
},
"avg_wealth": {
"value": 10000
}
}
]
}
}
Now the buckets are sorted by ICU collation of community name. The first bucket with key "⸳䃔楦၃৉瓅ᘂก捡㜂\u0000\u0001" has its original value in community_original.buckets[0].key, which is "Øvre Eiker".
NB: This hack will of course not work if community_name can be a list of values.
Hope this hack helps!
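For completeness, reading the original names back on the client side might look roughly like this (a sketch in Python, assuming response is the parsed search response shown above, for example as returned by the official Python client):
# `response` is assumed to be the parsed JSON body of the search above.
for bucket in response["aggregations"]["communities"]["buckets"]:
    # The bucket key is the binary ICU collation key, which looks garbled when
    # printed, so take the display name from the community_original sub-aggregation.
    name = bucket["community_original"]["buckets"][0]["key"]
    avg_wealth = bucket["avg_wealth"]["value"]
    print(name, avg_wealth)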

How to force index on a field?

I have indexed some entries containing a GeoJSON point object.
Example:
{
"_id": "48LEDd5imvEpFnCQx",
"loc": {
"type": "Point",
"coordinates": [-2.7577078342437744, 47.65381454210301]
},
"geoip": false,
"trackerId": "RG-DEMO-1",
"date": "2015-07-25T21:12:07.286Z"
}
The mapping:
{
'trace': {
'properties': {
'loc': {
'type': 'nested',
'properties': {
'type': {
'type': 'string'
},
'coordinates':{
'type': 'geo_point',
'geohash':true,
'geohash_prefix':true,
'lat_lon':true,
'fielddata' : {
'format' : 'compressed',
'precision' : '1cm'
}
}
}
...
The geohash is generated, but Kibana says that loc.coordinates is not indexed and I can't use the map visualisation: "Unindexed fields can not be searched".
What is the trick to force indexing on this type of field?
As stated in the docs, nested type sub-objects are not indexed.
So there are two workarounds:
To keep loc as a nested object, set include_in_parent to true (sketched below);
Or change the loc type to object.
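For illustration, the first workaround might look roughly like this (a sketch only, reusing the 1.x-era mapping from the question with the same geo_point options):
{
  'trace': {
    'properties': {
      'loc': {
        'type': 'nested',
        'include_in_parent': true,
        'properties': {
          'type': {
            'type': 'string'
          },
          'coordinates': {
            'type': 'geo_point',
            'geohash': true,
            'geohash_prefix': true,
            'lat_lon': true
          }
        }
      }
    }
  }
}
With include_in_parent enabled, the nested fields are also indexed flattened into the parent document, so loc.coordinates becomes searchable from the parent document as well.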

How to get latest values for each group with an Elasticsearch query?

I have some documents indexed on Elasticsearch, looking like these samples:
{'country': 'France', 'collected': '2015-03-12', 'value': 20}
{'country': 'Canada', 'collected': '2015-03-12', 'value': 21}
{'country': 'Brazil', 'collected': '2015-03-12', 'value': 33}
{'country': 'France', 'collected': '2015-02-01', 'value': 10}
{'country': 'Canada', 'collected': '2015-02-01', 'value': 11}
{'country': 'Mexico', 'collected': '2015-02-01', 'value': 9}
...
I want to build a query that gets one result per country, getting only the ones with max(collected).
So, for the examples shown above, the results would be something like:
{'country': 'France', 'collected': '2015-03-12', 'value': 20}
{'country': 'Canada', 'collected': '2015-03-12', 'value': 21}
{'country': 'Brazil', 'collected': '2015-03-12', 'value': 33}
{'country': 'Mexico', 'collected': '2015-02-01', 'value': 9}
I realized I need to do an aggregation on country, but I'm failing to understand how to limit the results to max(collected).
Any ideas?
You can use a terms aggregation on the country field with a top_hits sub-aggregation that returns 1 doc per group, ordered by the collected date descending:
POST /test/_search?search_type=count
{
"aggs": {
"group": {
"terms": {
"field": "country"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"collected": {
"order": "desc"
}
}
]
}
}
}
}
}
}
For those like user1892775 who run into "Fielddata is disabled on text fields by default...", you can create a multi-field (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html). So you might have a mapping like:
"mapping": {
"properties": {
"country": {"type": "string", "fields": {"raw": {"type": "string", "index": "not_analyzed"}}}
}
}
Then your query would look like
POST /test/_search?search_type=count
{
"aggs": {
"group": {
"terms": {
"field": "country.raw"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"collected": {
"order": "desc"
}
}
]
}
}
}
}
}
}
(Note the use of country.raw)
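As a side note, the mapping above uses the pre-5.x string / not_analyzed syntax; the "Fielddata is disabled on text fields by default" error comes from Elasticsearch 5.x and later, where the equivalent multi-field would be declared roughly like this (a sketch, keeping the raw sub-field name from above):
"country": {
  "type": "text",
  "fields": {
    "raw": { "type": "keyword" }
  }
}
The terms aggregation then targets country.raw exactly as in the query above (with dynamic mapping, the automatically created sub-field is usually called country.keyword instead).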
The answer marked correct worked great for me. Here is how I added some extra filters. This is version 7.4 on AWS.
The field I'm grouping by is a keyword field named tags.
For each group (tag), get top 3 documents sorted by date_uploaded descending.
Also show the total amount of documents within each group (tag).
Only consider non-deleted documents belonging to user 22.
Only return 10 groups (tags), sorted alphabetically.
For each document, return its ID (book_id) and date_uploaded. (Default is that all info is returned.)
"size": 0 keeps the query from returning lots of info about all the documents.
{'query': {'bool': {'filter': [{'terms': {'user_id': [22]}}, {'terms': {'deleted': ['false']}}]}},
'size': 0,
"aggs": {
"group": {
"terms": {
"field": "tags.keyword",
"size":10,
"order":{ "_key": "asc" }
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 3,
"_source":["book_id","date_uploaded"],
"sort": [ {"date_uploaded": { "order": "desc" }}]
}
}
}
}
}
}
Here is how to get each group (tag in my case) and the document matches for each group.
query_results = ...  # result of the query
buckets = query_results["aggregations"]["group"]["buckets"]
for bucket in buckets:
    tag = bucket["key"]
    tag_doc_count = bucket["doc_count"]
    print(tag, tag_doc_count)
    tag_hits = bucket["group_docs"]["hits"]["hits"]
    for hit in tag_hits:
        source = hit["_source"]
        print(source["book_id"], source["date_uploaded"])
FYI, the "group" term can be named anything. Just make sure to use the same name when getting buckets from your query results.

Mapping ElasticSearch GeoPoint Fields

We store documents that look something like this:
{
"id": "dQesbpxeQniUWXpsnjPQ",
"title": "Golf in Hamburg, Altona",
"user": "CtGjEaDxSrhPbf7W7NcH",
"location": {
"id": "Q6sZhRHdiS3mP2innbJ9",
"name": "Hamburg, Altona",
"lat": 53.55,
"lon": 9.93333,
"slug": "hamburg-altona"
},
"_type": "announcement"
}
We need the announcement.location.slug to be not_analyzed (it's a slug, after all).
However, the mapping won't take; we have these settings:
Tire.index(##index_name) do
delete
create(mappings: {
announcement: {
properties: {
"id" => { type: 'string', index: 'not_analyzed' },
"user" => { type: 'string', index: 'not_analyzed' },
"location" => {
type: 'geo_point',
properties: {
"slug" => { type: 'string', index: 'not_analyzed' }
}
},
"times" => { type: 'string', analyzer: 'keyword' },
"sport" => {
"properties" => {
"slug" => { type: 'string', index: 'not_analyzed' }
}
}
}
}
},
settings: {
index: {
number_of_shards: 1,
number_of_replicas: 0
}
})
refresh
end
Note: The same mapping in curl syntax also doesn't work, but is less readable for SO, so I'm posting the Ruby code.
It seems like geo_point is overriding all other mappings on that part of the document. The documentation seems to agree.
I'm sure there's a way to use the lat_lon option, but I can't find any documentation on how that might work. (I assume one maps the individual lat and lon fields with lat_lon settings)
I had also hoped it might be possible to use the multi_field type, but that doesn't seem to apply to whole sub-trees of the main document's attributes.
How can I proceed without having to change my whole data model?
I'm afraid that you have to change your model, as geo_point is a full data type and you cannot add properties (meta) to it.
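To illustrate what such a change could look like (a sketch only; the coords sub-field name is hypothetical): map location as a plain object and move the point into a dedicated geo_point sub-field, keeping slug and the other attributes as ordinary fields next to it:
"location" => {
  properties: {
    "coords" => { type: 'geo_point' },
    "slug" => { type: 'string', index: 'not_analyzed' },
    "name" => { type: 'string' },
    "id" => { type: 'string', index: 'not_analyzed' }
  }
},
Documents would then carry the point under location.coords (for example "coords": { "lat": 53.55, "lon": 9.93333 }), and location.slug stays not_analyzed as intended.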
