Why are my completion suggester options empty? - elasticsearch

I'm currently trying to set up my suggestion implementation.
My index settings / mappings:
{
"settings" : {
"analysis" : {
"analyzer" : {
"trigrams" : {
"tokenizer" : "mesh_default_ngram_tokenizer",
"filter" : [ "lowercase" ]
},
"suggestor" : {
"type" : "custom",
"tokenizer" : "standard",
"char_filter" : [ "html_strip" ],
"filter" : [ "lowercase" ]
}
},
"tokenizer" : {
"mesh_default_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "3",
"max_gram" : "3"
}
}
}
},
"mappings" : {
"default" : {
"properties" : {
"uuid" : {
"type" : "string",
"index" : "not_analyzed"
},
"language" : {
"type" : "string",
"index" : "not_analyzed"
},
"fields" : {
"properties" : {
"content" : {
"type" : "string",
"index" : "analyzed",
"analyzer" : "trigrams",
"fields" : {
"suggest" : {
"type" : "completion",
"analyzer" : "suggestor"
}
}
}
}
}
}
}
}
My query:
{
"suggest": {
"query-suggest" : {
"text" : "som",
"completion" : {
"field" : "fields.content.suggest"
}
}
},
"_source": ["fields.content", "uuid", "language"]
}
The query result:
{
"took" : 44,
"timed_out" : false,
"_shards" : {
"total" : 20,
"successful" : 20,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 0.0,
"hits" : [ {
"_index" : "node-08c5d084d4e842b385d084d4e8a2b301-fe6212a62ad94590a212a62ad9759026-44874a2a8d2e4483874a2a8d2e44830c-draft",
"_type" : "default",
"_id" : "c6b7391075cc437ab7391075cc637a05-en",
"_score" : 0.0,
"_source" : {
"language" : "en",
"fields" : {
"content" : "This is<pre>another set of <strong>important</strong>content s<b>om</b>e text with more content you can poke a stick at"
},
"uuid" : "c6b7391075cc437ab7391075cc637a05"
}
}, {
"_index" : "node-08c5d084d4e842b385d084d4e8a2b301-fe6212a62ad94590a212a62ad9759026-44874a2a8d2e4483874a2a8d2e44830c-draft",
"_type" : "default",
"_id" : "96e2c6765b6841fea2c6765b6871fe36-en",
"_score" : 0.0,
"_source" : {
"language" : "en",
"fields" : {
"content" : "This is<pre>another set of <strong>important</strong>content no text with more content you can poke a stick at"
},
"uuid" : "96e2c6765b6841fea2c6765b6871fe36"
}
}, {
"_index" : "node-08c5d084d4e842b385d084d4e8a2b301-fe6212a62ad94590a212a62ad9759026-44874a2a8d2e4483874a2a8d2e44830c-draft",
"_type" : "default",
"_id" : "fd1472555e9d4d039472555e9d5d0386-en",
"_score" : 0.0,
"_source" : {
"language" : "en",
"fields" : {
"content" : "This is<pre>another set of <strong>important</strong>content someth<strong>ing</strong> completely different"
},
"uuid" : "fd1472555e9d4d039472555e9d5d0386"
}
}, {
"_index" : "node-08c5d084d4e842b385d084d4e8a2b301-fe6212a62ad94590a212a62ad9759026-44874a2a8d2e4483874a2a8d2e44830c-draft",
"_type" : "default",
"_id" : "5a3727b134064de4b727b134063de4c4-en",
"_score" : 0.0,
"_source" : {
"language" : "en",
"fields" : {
"content" : "This is<pre>another set of <strong>important</strong>content some<strong>what</strong> strange content"
},
"uuid" : "5a3727b134064de4b727b134063de4c4"
}
}, {
"_index" : "node-08c5d084d4e842b385d084d4e8a2b301-fe6212a62ad94590a212a62ad9759026-44874a2a8d2e4483874a2a8d2e44830c-draft",
"_type" : "default",
"_id" : "865257b6be4340c69257b6be4340c603-en",
"_score" : 0.0,
"_source" : {
"language" : "en",
"fields" : {
"content" : "This is<pre>another set of <strong>important</strong>content some <strong>more</strong> content you can poke a stick at too"
},
"uuid" : "865257b6be4340c69257b6be4340c603"
}
} ]
},
"suggest" : {
"query-suggest" : [ {
"text" : "som",
"offset" : 0,
"length" : 3,
"options" : [ ]
} ]
}
}
I'm currently using Elasticsearch 2.4.6 and I can't update.
There are 5 documents in my index and only 4 contain the word "some".
Why do I see 5 hits but no options?
The options are not empty if I start my suggest text with the first word of the field string (e.g. "this").
Is my usage of the suggest feature valid when dealing with fields that contain full HTML pages? I'm not sure whether the feature was meant to handle many tokens per document.
I already tried using an ngram tokenizer for my suggestor analyzer, but that did not change the situation. Any hint or feedback would be appreciated.

It seems that the issue I'm seeing is a restriction of completion suggesters:
Matching always starts at the beginning of the text. So, for example, “Smi” will match “Smith, Fed” but not “Fed Smith”. However, you could list both “Smith, Fed” and “Fed Smith” as two different inputs for the one output.
http://rea.tech/implementing-autosuggest-in-elasticsearch/
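The restriction quoted above suggests one workaround: supply several inputs per document, one starting at each word, so that a prefix such as "som" can also match in the middle of the text. Below is a minimal Python sketch, assuming the HTML is stripped in the client before indexing and that the resulting list is sent as the `input` array of a dedicated completion field. The names `suggestion_inputs` and `max_len` are hypothetical, and the regex is a crude simplification, not a real HTML parser.

```python
import re

def suggestion_inputs(html, max_len=50):
    """Return one completion input per word position in the text."""
    # Crude tag removal: every tag becomes a space (not a full HTML parser).
    text = re.sub(r"<[^>]+>", " ", html)
    words = text.lower().split()
    inputs = []
    for i in range(len(words)):
        # Input that starts at word i, truncated to keep entries short.
        candidate = " ".join(words[i:])[:max_len].rstrip()
        if candidate not in inputs:
            inputs.append(candidate)
    return inputs

print(suggestion_inputs("some <strong>more</strong> content"))
# prints ['some more content', 'more content', 'content']
```

For full HTML pages this produces one input per word, which may be too many; whether that is acceptable depends on the size of the documents.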

Related

ElasticSearch - knn search. Sometimes returns _score = null

The longer the array I pass to the knn_vector field, the more results come back with _score = null.
For example, when I sent an array of length 2, I got 3 results with a valid _score. But when I sent an array of length 60, all results had a null _score.
Request
{
"_source":[],
"collapse":{
"field":"id"
},
"query":{
"knn":{
"vector":{
"k":10,
"vector":[
0,
// array size - 46
0
]
}
}
},
"size":100,
"track_scores":false
}
Response (the first and second scores are null, but the third is a float)
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 7,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "sb_index_images_ba7587a1-35ab-482f-93d8-a433dd132556_1667904180",
"_type" : "_doc",
"_id" : "207445df53a7b54c76ff76c0bec352c9",
"_score" : null,
"fields" : {
"id" : [
"377007"
]
}
},
{
"_index" : "sb_index_images_ba7587a1-35ab-482f-93d8-a433dd132556_1667904180",
"_type" : "_doc",
"_id" : "ea374a9b90d83ab93a77fb03226cafd3",
"_score" : null,
"fields" : {
"id" : [
"377009"
]
}
},
{
"_index" : "sb_index_images_ba7587a1-35ab-482f-93d8-a433dd132556_1667904180",
"_type" : "_doc",
"_id" : "1f93035d08e2b7af7d482a89f36e3c7c",
"_score" : 0.134376,
"fields" : {
"id" : [
"377014"
]
}
}
]
}
}
My index mapping
{
"sb_index_images_ba7587a1-35ab-482f-93d8-a433dd132556_1667904180" : {
"mappings" : {
"properties" : {
"colors" : {
"type" : "long"
},
"colors_vector" : {
"type" : "knn_vector",
"dimension" : 9
},
"id" : {
"type" : "keyword"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"params" : {
"properties" : {
"0d23f34d9f2168ab98e5542149eb2f3d" : {
"properties" : {
"name" : {
"type" : "keyword",
"ignore_above" : 256
},
"value" : {
"type" : "keyword",
"eager_global_ordinals" : true,
"ignore_above" : 256,
"fields" : {
"float" : {
"type" : "float",
"ignore_malformed" : true
}
}
}
}
}
}
},
"vector" : {
"type" : "knn_vector",
"dimension" : 2048
}
}
}
}
}

How to count the number of repetitions of a specific word in specific fields of each document in the ElasticSearch index?

I'm pretty new to Elasticsearch and would be thankful for any help.
I have an index.
Here is an example of the data:
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 20,
"successful" : 20,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1834,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "profile_similarity",
"_id" : "9c346fe0-253b-4c68-8f11-97bbb18d9c9a",
"_score" : 1.0,
"_source" : {
"country" : "US",
"city" : "Salt Lake City Metropolitan Area",
"headline" : "Product Manager"
}
},
{
"_index" : "profile_similarity",
"_id" : "e97cdbe8-445f-49f0-b659-6a19829a0a14",
"_score" : 1.0,
"_source" : {
"country" : "US",
"city" : "Los Angeles",
"headline" : "K2 & Amazon, Smarter King, LLC."
}
},
{
"_index" : "profile_similarity",
"_id" : "a7a69710-4fad-4b7d-88e4-bd0873e6fd03",
"_score" : 1.0,
"_source" : {
"country" : "CA",
"city" : "Greater Toronto Area",
"headline" : "Senior Product Manager"
}
}
]
}
}
Its mappings:
{
"profile_similarity_ivan" : {
"mappings" : {
"properties" : {
"city" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"fielddata" : true
},
"country" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"fielddata" : true
},
"headline" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"fielddata" : true
}
}
}
}
}
For the fields country and headline, I would like to count the number of occurrences of a specific word.
For example, if I search for 'US', the output might look like this:
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 20,
"successful" : 20,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1834,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "profile_similarity",
"_id" : "9c346fe0-253b-4c68-8f11-97bbb18d9c9a",
"_score" : 1.0,
"_source" : {
"country" : "US",
"city" : "Salt Lake City Metropolitan Area",
"headline" : "Product Manager",
"country_count_US" : 1,
"headline_count_US" : 0
}
},
{
"_index" : "profile_similarity",
"_id" : "e97cdbe8-445f-49f0-b659-6a19829a0a14",
"_score" : 1.0,
"_source" : {
"country" : "US",
"city" : "Los Angeles",
"headline" : "K2 & Amazon, Smarter King, LLC.",
"country_count_US" : 1,
"headline_count_US" : 0
}
},
{
"_index" : "profile_similarity",
"_id" : "a7a69710-4fad-4b7d-88e4-bd0873e6fd03",
"_score" : 1.0,
"_source" : {
"country" : "CA",
"city" : "Greater Toronto Area",
"headline" : "Senior Product Manager",
"country_count_US" : 0,
"headline_count_US" : 0
}
}
]
}
}
I noticed that this can be done using runtime fields in Elasticsearch and scripting with Painless.
In general, I have trouble writing the Painless script for this task.
Can you please help me write this script and create the right query in Elasticsearch?
I would also be thankful for any advice on whether this task can be done with other Elasticsearch functionality (not only runtime fields).
Thanks
This can be done but you need to fix three things.
You seem not to have created a mapping for your index; what you show looks like the dynamic mapping ES assigns on its own to any given field. Even with your current mappings, you can simply run a terms aggregation on the results of your query and you will get the count of the words that you need. Just pass them as individual terms to be aggregated. Something like this will give you some output.
GET _search
{
"query": {
"match": {
"country": "US"
}
},
"aggs": {
"country_count": {
"composite" : {
"sources" : [
{"country" : {"terms" : {"field" : "country"}}},
{"id" : {"terms" : {"field" : "_id", "include" : "US"}}}
]
}
}
}
}
The composite aggregation will return, per document, how many times the word "US" occurs.
Look at the docs for how to paginate the composite aggregation. This way you can get the required counts for every single document.
Composite Aggregation
Generally, aggregations are used to get such answers. You may need to tweak the mappings of the fields to use a different analyzer (e.g. whitespace).
But generally you just need to use terms aggregations.
HTH.
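As a point of comparison, the per-document counts in the desired output can also be computed in the client after the hits come back. This is a minimal Python sketch, not an Elasticsearch feature: the function name `add_word_counts` is hypothetical, it assumes the hits were already fetched as plain dictionaries, and it counts case-sensitive whole-word matches only.

```python
import re

def add_word_counts(hits, word, fields):
    """Return a copy of each hit's _source with <field>_count_<word> keys added."""
    # \b limits matches to whole words, so "US" does not match "USA".
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    results = []
    for hit in hits:
        source = dict(hit["_source"])  # copy, leave the original hit untouched
        for field in fields:
            value = source.get(field) or ""
            source[f"{field}_count_{word}"] = len(pattern.findall(value))
        results.append(source)
    return results

hits = [
    {"_source": {"country": "US", "headline": "Product Manager"}},
    {"_source": {"country": "CA", "headline": "Senior Product Manager"}},
]
for row in add_word_counts(hits, "US", ["country", "headline"]):
    print(row)
```

This only works on the page of hits that was actually retrieved, so it is a fit for small result sets rather than for all 1834 documents at once.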

ElasticSearch: What is the param limit in painless scripting?

I will have documents with the following data -
1. id
2. user_id
3. online_hr
4. offline_hr
My use case is the following -
I wish to sort the users who are active using online_hr field,
While I want to sort the users who are inactive using the offline_hr field.
I am planning to use an Elasticsearch Painless script for this use case.
I will be passing 2 arrays, online_user_list and offline_user_list, into the script params,
and I plan to check each document's user_id:
if it is present in either of the params lists, sort accordingly.
I want to know if there is any limit on the params object,
as the user base may be in the hundreds of thousands,
and whether passing 2 lists of that size in the ES scripting params would be troublesome.
Is there a better approach?
Query to add data -
POST /products/_doc/1
{
"id":1,
"user_id" : "1",
"online_hr" : "1",
"offline_hr" : "2"
}
Sample data -
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "products",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"user_id" : "1",
"online_hr" : "1",
"offline_hr" : "2"
}
}
]
}
}
Mapping -
{
"products" : {
"aliases" : { },
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"offline_hr" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"online_hr" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"user_id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1566466257331",
"number_of_shards" : "1",
"number_of_replicas" : "1",
"uuid" : "g2F3UlxQSseHRisVinulYQ",
"version" : {
"created" : "7020099"
},
"provided_name" : "products"
}
}
}
}
I found that Painless scripts have a default size limit of 65,535 bytes,
while the Elasticsearch compiler had a limit of 16384 characters.
Reference -
https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-walkthrough.html
https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-sort-context.html
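To get a feel for how large two such lists become on the wire, the serialized size of the params object can be estimated in the client before sending anything. This is a rough Python sketch under stated assumptions: `params_size_bytes` is a hypothetical helper, the user ids are made-up sequential strings, and the result is only the size of the JSON payload, not a statement about any limit Elasticsearch enforces on params.

```python
import json

def params_size_bytes(params):
    """Size of the params object once serialized as UTF-8 JSON."""
    return len(json.dumps(params).encode("utf-8"))

# Made-up ids: 100,000 online users and 100,000 offline users.
params = {
    "online_user_list": [str(i) for i in range(100_000)],
    "offline_user_list": [str(i) for i in range(100_000, 200_000)],
}
print(params_size_bytes(params), "bytes")
```

Comparing this number with the limits quoted above gives a first indication of whether the lists are better kept in the index itself than shipped with every request.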

How to sort an Elasticsearch query result by a determined field in DESC?

Let's say I have the following query:
curl -XGET 'localhost:9200/library/document/_search?pretty=true'
It returns the following example results:
{
"took" : 108,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 1.0,
"hits" : [
{
"_index" : "library",
"_type" : "document",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"page content" : [
"Page 0:",
"Page 1: something"
],
"publish date" : "2015-12-05",
"keywords" : "sample, example, article, alzheimer",
"author" : "Author name",
"language" : "",
"title" : "Sample article",
"number of pages" : 2
}
},
{
"_index" : "library",
"_type" : "document",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"page content" : [
"Page 1: eBay",
"Page 2: Paypal",
"Page 3: Google"
],
"publish date" : "2017-08-03",
"keywords" : "something, another, thing",
"author" : "Alex",
"language" : "english",
"title" : "Microsoft Word - TL0032.doc",
"number of pages" : 21
}
},
...
I want to order by publish date and by id (different queries) so that the most recent one shows first in the list. Is this possible? I know I have to use the sort function of Elasticsearch together with the DESC parameter, but somehow it is not working for me.
EDIT: Mapping of the fields
curl -XGET 'localhost:9200/library/_mapping/document?pretty'
{
"library" : {
"mappings" : {
"document" : {
"properties" : {
"author" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"keywords" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"language" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"number of pages" : {
"type" : "long"
},
"page content" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"publish date" : {
"type" : "date"
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
First you need a good mapping like this:
PUT my_index
{
"mappings": {
"documents": {
"properties": {
"post_date" : {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
}
}
}
}
}
And then the search:
GET my_index/_search
{
"sort": [
{
"post_date": {
"order": "desc"
}
}
]
}
Thank you everyone. Managed to get it working with this query:
curl -XGET 'localhost:9200/library/document/_search?pretty=true' -d '{"query": {"match_all": {}},"sort": [{"publish date": {"order": "desc"}}]}'
Didn't need additional mapping.
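The same request body can be built programmatically, which avoids quoting problems with field names that contain spaces. This is a minimal Python sketch; `sort_body` is a hypothetical helper, and it only constructs the JSON without sending it.

```python
import json

def sort_body(field, order="desc"):
    """Build a match_all search body sorted on a single field."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "query": {"match_all": {}},
        "sort": [{field: {"order": order}}],
    }

# The field name is passed as an ordinary string, space included.
print(json.dumps(sort_body("publish date")))
```

The printed JSON can be passed to curl with -d, as in the query above.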

Elasticsearch wildcard query with spaces

I'm trying to do a wildcard query with spaces. It easily matches the words on a per-term basis but not on a whole-field basis.
I've read the documentation, which says that the field needs to be not_analyzed, but with this setting the query returns nothing.
This is the mapping with which it works on a per-term basis:
{
"denshop" : {
"mappings" : {
"products" : {
"properties" : {
"code" : {
"type" : "string"
},
"id" : {
"type" : "long"
},
"name" : {
"type" : "string"
},
"price" : {
"type" : "long"
},
"url" : {
"type" : "string"
}
}
}
}
}
}
This is the mapping with which the exact same query returns nothing:
{
"denshop" : {
"mappings" : {
"products" : {
"properties" : {
"code" : {
"type" : "string"
},
"id" : {
"type" : "long"
},
"name" : {
"type" : "string",
"index" : "not_analyzed"
},
"price" : {
"type" : "long"
},
"url" : {
"type" : "string"
}
}
}
}
}
}
The query is here:
curl -XPOST http://127.0.0.1:9200/denshop/products/_search?pretty -d '{"query":{"wildcard":{"name":"*test*"}}}'
Response with the not_analyzed property:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
Response without not_analyzed:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 1.0,
"hits" : [ {
...
EDIT: Adding requested info
Here is the list of documents:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 1.0,
"hits" : [ {
"_index" : "denshop",
"_type" : "products",
"_id" : "3L1",
"_score" : 1.0,
"_source" : {
"id" : 3,
"name" : "Testovací produkt 2",
"code" : "",
"price" : 500,
"url" : "http://www.denshop.lh/damske-obleceni/testovaci-produkt-2/"
}
}, {
"_index" : "denshop",
"_type" : "products",
"_id" : "4L1",
"_score" : 1.0,
"_source" : {
"id" : 4,
"name" : "Testovací produkt 3",
"code" : "",
"price" : 666,
"url" : "http://www.denshop.lh/damske-obleceni/testovaci-produkt-3/"
}
}, {
"_index" : "denshop",
"_type" : "products",
"_id" : "2L1",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "Testovací produkt",
"code" : "",
"price" : 500,
"url" : "http://www.denshop.lh/damske-obleceni/testovaci-produkt/"
}
}, {
"_index" : "denshop",
"_type" : "products",
"_id" : "5L1",
"_score" : 1.0,
"_source" : {
"id" : 5,
"name" : "Testovací produkt 4",
"code" : "",
"price" : 666,
"url" : "http://www.denshop.lh/damske-obleceni/testovaci-produkt-4/"
}
}, {
"_index" : "denshop",
"_type" : "products",
"_id" : "6L1",
"_score" : 1.0,
"_source" : {
"id" : 6,
"name" : "Testovací produkt 5",
"code" : "",
"price" : 666,
"url" : "http://www.denshop.lh/tricka-tilka-tuniky/testovaci-produkt-5/"
}
} ]
}
}
Without the not_analyzed it returns with this:
curl -XPOST http://127.0.0.1:9200/denshop/products/_search?pretty -d '{"query":{"wildcard":{"name":"*testovací*"}}}'
But not with this (notice the space before asterisk):
curl -XPOST http://127.0.0.1:9200/denshop/products/_search?pretty -d '{"query":{"wildcard":{"name":"*testovací *"}}}'
When I add the not_analyzed to mapping, it returns no hits no matter what I put in the wildcard query.
Add a custom analyzer that lowercases the text. Then, in your client application, lowercase the search text before passing it to the query.
To also keep the original analysis chain, I've added a sub-field to your name field that uses the custom analyzer.
PUT /denshop
{
"settings": {
"analysis": {
"analyzer": {
"keyword_lowercase": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"products": {
"properties": {
"name": {
"type": "string",
"fields": {
"lowercase": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
}
}
}
}
And the query will work on the sub-field:
GET /denshop/products/_search
{
"query": {
"wildcard": {
"name.lowercase": "*testovací *"
}
}
}
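The client-side lowercasing step mentioned above could look like the following. This is a minimal Python sketch; `wildcard_query` is a hypothetical helper that only builds the request body, and it assumes the user's text contains no wildcard characters of its own.

```python
import json

def wildcard_query(field, text):
    """Build a wildcard query body with the search text lowercased."""
    # Lowercase in the client so the pattern matches the lowercased
    # tokens produced by the keyword_lowercase analyzer.
    pattern = "*" + text.lower() + "*"
    return {"query": {"wildcard": {field: pattern}}}

body = wildcard_query("name.lowercase", "Testovací ")
print(json.dumps(body, ensure_ascii=False))
```

Note that the trailing space in the input is preserved, which is what lets the pattern span the word boundary.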

Resources