Can't query an edge_ngram field in _all - elasticsearch

So I'm setting up an index and I'd like to have a single search that would do a partial-word edge_ngram search for one field and a more normal search of the rest of the fields. From what I understand this should be easy to do by just matching on _all. However I just can't seem to make it work.
I have been able to get the desired results from a bool query that searches _all and the specific ngram field separately but that seems hackey and I'm guessing there's just something simple that I'm missing.
Here is just a minimal example to show what I'm doing and how it's not working for me.
Here is the index setup:
curl -XPUT "http://localhost:9200/test_index?pretty=true" -d'
{
"settings": {
"analysis": {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"edge_ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"edge_ngram_filter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"text_field": {
"type": "string",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "standard"
}
}
}
}
}'
And add a simple document:
curl -XPUT "http://localhost:9200/test_index/doc/1?pretty=true" -d'
{
"text_field": "Hello, World!"
}'
_all partial search doesn't work. It returns an empty result.
curl -XPOST "http://localhost:9200/test_index/_search?pretty=true" -d'
{
"query": {
"match": {
"_all": "hell"
}
}
}'
_all whole word search works though
curl -XPOST "http://localhost:9200/test_index/_search?pretty=true" -d'
{
"query": {
"match": {
"_all": "hello"
}
}
}'
And a partial search on the specific field works
curl -XPOST "http://localhost:9200/test_index/_search?pretty=true" -d'
{
"query": {
"match": {
"text_field": "hell"
}
}
}'
The term vector looks fine too
curl -XGET "http://localhost:9200/test_index/doc/1/_termvector?fields=text_field&pretty=true"
I really can't figure out what I'm doing wrong here. Any help would be appreciated.
Here are some details about my environment.
Elasticsearch version: Version: 2.3.3, Build: 218bdf1/2016-05-17T15:40:04Z, JVM: 1.8.0_92
Linux OS: Arch Linux
Kernel version: 4.4.3-1-custom

The _all field combines the original values of all fields as a string, not the terms produced for each field. So in your case, it doesn't contain the terms produced by the edge_ngram_analyzer, just the text from the text_field field. It's just like any other text field, you can specify analyzers for it, etc. In your example, it's using the default analyzer.

Related

Elasticseach not using synonyms from synonym file

I am new to elasticsearch so before downvoting or marking as duplicate, please read the question first.
I am testing synonyms in elasticsearch (v 2.4.6) which I have installed on Ubuntu 16.04. I am giving synonyms through a file named synonym.txt which I have placed in config directory. I have created an index synonym_test as follows-
curl -XPOST localhost:9200/synonym_test/ -d '{
"settings": {
"analysis": {
"analyzer": {
"my_synonyms": {
"tokenizer": "whitespace",
"filter": ["lowercase","my_synonym_filter"]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"ignore_case": true,
"synonyms_path" : "synonym.txt"
}
}
}
}
}'
The index contains two fields- id and some_text. I configure the field some_text with the custom analyzer as follows-
curl -XPUT localhost:9200/synonym_test/rulers/_mapping -d '{
"properties": {
"id": {
"type": "double"
},
"some_text": {
"type": "string",
"search_analyzer": "my_synonyms"
}
}
}'
Then I have inserted some data as -
curl -XPUT localhost:9200/synonym_test/external/5 -d '{
"id" : "5",
"some_text":"apple is a fruit"
}'
curl -XPUT localhost:9200/synonym_test/external/7 -d '{
"id" : "7",
"some_text":"english is spoken in england"
}'
curl -XPUT localhost:9200/synonym_test/external/8 -d '{
"id" : "8",
"some_text":"Scotland Yard is a popular game."
}'
curl -XPUT localhost:9200/synonym_test/external/9 -d '{
"id" : "9",
"some_text":"bananas contain potassium"
}'
The synonym.txt file contains following-
"britain,england,scotland"
"fruit,bananas"
After doing all this, when I run the query for term fruit (which should also return the text containing bananas as they are synonyms in file), I get the text containing fruit only.
{
"took":117,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":1,
"max_score":0.8465736,
"hits":[
{
"_index":"synonym_test",
"_type":"external",
"_id":"5",
"_score":0.8465736,
"_source":{
"id":"5",
"some_text":"apple is a fruit"
}
}
]
}
}
I have also tried the following links, but none seem to have helped me -
Synonym analyzer not working ,
Elasticsearch synonym analyzer not working , How to apply synonyms at query time instead of index time in Elasticsearch , how to configure the synonyms_path in elasticsearch and many other links.
So, can anyone please tell me if I am doing anything wrong? Is there anything wrong with the settings or synonym file? I want the synonyms to work (query time) so that when I search for a term, I get all documents related to that term.
Please refer to following url: Custom Analyzer on how you should configure custom analyzers.
If we follow the guides from above documentation our schema will be as follows:
curl -XPOST localhost:9200/synonym_test/ -d '{
"settings": {
"analysis": {
"analyzer": {
"type": "custom"
"my_synonyms": {
"tokenizer": "whitespace",
"filter": ["lowercase","my_synonym_filter"]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"ignore_case": true,
"synonyms_path" : "synonym.txt"
}
}
}
}
}
Which currently works on my elasticsearch instance.

Elasticsearch Automatic Synonyms

I am looking at Elasticsearch to handle search queries made by users in on my website.
Say that I have a document person with field vehicles_owned which is a list of strings. For example:
{
"name":"james",
"surname":"smith",
"vehicles_owned":["car","bike","ship"]
}
I would like to query which people own a certain vehicle. I understand it's possible to configure ES so that boat is treated as a synonym of ship and if I query with boat I am returned the user james who owns a ship.
What I don't understand is whether this is done automatically, or if I have to import lists of synonyms.
The idea is to create a custom analyzer for the vehicles_owned field which leverages the synonym token filter.
So you first need to define your index like this:
curl -XPUT localhost:9200/your_index -d '{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt" <-- your synonym file
}
}
}
}
},
"mappings": {
"syn": {
"properties": {
"name": {
"type": "string"
},
"surname": {
"type": "string"
},
"vehicles_owned": {
"type": "string",
"index_analyzer": "synonym" <-- use the synonym analyzer here
}
}
}
}
}'
Then you can add all the synonyms you want to handle in the $ES_HOME/config/synonyms.txt file using the supported formats, for instance:
boat, ship
Next, you can index your documents:
curl -XPUT localhost:9200/your_index/your_type/1 -d '{
"name":"james",
"surname":"smith",
"vehicles_owned":["car","bike","ship"]
}'
And finally searching for either ship or boat will get you the above document we just indexed:
curl -XGET localhost:9200/your_index/your_type/_search?q=vehicles_owned:boat
curl -XGET localhost:9200/your_index/your_type/_search?q=vehicles_owned:ship

Elasticsearch: sorting Spanish double names alphabetically

I am doing an Elasticsearch query and I want the results ordered alphabetically by last name. My problem: the last names are all Spanish double names, and ES doesn't order them the way I would like it.
I would prefer the order to be:
Batres Rivera
Batrín Chojoj
Fion Morales
Lopez Giron
Martinez Castellanos
Milán Casanova
This is my query:
{
"query": {
"match_all": {}
},
"sort": [
{
"Last Name": {
"order": "asc"
}
}
]
}
The order that I get with this is:
Batres Rivera
Batrín Chojoj
Milán Casanova
Martinez Castellanos
Fion Morales
Lopez Giron
So it is not sorting by the first string, but by either of both (Batres, Batrín, Casanova, Castellanos, Fion, Giron).
If I try additionally
{
"order": "asc",
"mode": "max"
}
then I get:
Batrín Chojoj
Lopez Giron
Martinez Castellanos
Milán Casanova
Fion Morales
Batres Rivera
All the fields are indexed by default, I checked with
curl -XGET localhost/my_index/_mapping
and I get back
my_index: {
my_type: {
properties: {
FirstName: {
type: string
}LastName: {
type: string
}MiddleName: {
type: string
}
...
}
}
}
Does anyone know how to make the results to be ordered to be ordered alphabetically by the beginning string of the last name?
Thanks!
The problem is that your LastName field is analyzed, so the string Batres Rivera is indexed as a multi-value field with two terms: batres and rivera. But this isn't like an ordered array, it's more like a "bag of values". So when you try to sort on the field, it chooses one of the terms (the min or max) and sorts on that.
What you need to do is to store the LastName as a single term (Batres Rivera) for sorting purposes, by mapping the field as
{ "type": "string", "index": "not_analyzed"}
Obviously you can't then use that field for search purposes: you wouldn't be able to search for rivera and match on that field.
The way to support both searching and sorting is to use multi-fields: ie index the same value in two ways, one for searching and one for sorting.
In 0.90.* the syntax for multi-fields is:
curl -XPUT "http://localhost:9200/my_index" -d'
{
"mappings": {
"my_type": {
"properties": {
"LastName": {
"type": "multi_field",
"fields": {
"LastName": {
"type": "string"
},
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}'
In 1.0.* the multi_field type has been removed and now any core field type supports sub-fields as follows:
curl -XPUT "http://localhost:9200/my_index" -d'
{
"mappings": {
"my_type": {
"properties": {
"LastName": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}'
So you can use the LastName field for searching, and the LastName.raw field for sorting:
curl -XGET "http://localhost:9200/my_index/my_type/_search" -d'
{
"query": {
"match": {
"LastName": "rivera"
}
},
"sort": "LastName.raw"
}'
Language specific sorting
You should also look at using the ICU analysis plugin to sort using the Spanish sort order (or collation). This is a bit more complex but is worth using:
curl -XPUT "http://localhost:9200/my_index" -d'
{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding"
]
},
"es_sorting": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"spanish"
]
}
},
"filter": {
"spanish": {
"type": "icu_collation",
"language": "es"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"LastName": {
"type": "string",
"analyzer": "folding",
"fields": {
"raw": {
"type": "string",
"analyzer": "es_sorting"
}
}
}
}
}
}
}'
We create a folding analyzer which we'll use for the LastName field, which will analyze a string like Muñoz Rivera into the two terms munoz (without the ~) and rivera. So a user can search for munoz or muñoz and either will match.
Then we create the es_sorting analyzer which indexes the proper sort order for muñoz rivera (lowercased) in Spanish.
Searching would be done in the same way:
curl -XGET "http://localhost:9200/my_index/my_type/_search" -d'
{
"query": {
"match": {
"LastName": "rivera"
}
},
"sort": "LastName.raw"
}'
We need to know how you are indexing the name.
Please check this discussion link.
http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-way-to-search-terms-lower-cased-td932996.html
This should be very helpful for your case. This depends on your mapping settings. What analyzer you use for the name field.
Need your mapping definition to decide on a proper solution.

Query string returning results not found in edgeNGram

I'm having trouble getting a edgengram query to behave properly. I have one record "blue grass" with an edgengram minimum of 2. A query string of "blv" however returns "blue grass" although it shouldn't.
curl -X POST http://localhost:9200/test -d '{
"mappings": {
"product/fragrance": {
"properties": {
"name_query": {
"index_analyzer": "query_index_analyzer",
"search_anaylzer": "query_search_analyzer",
"as": {},
"type": "string"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"query_edgengram": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 20,
"side": "front"
}
},
"analyzer": {
"query_index_analyzer": {
"tokenizer": "lowercase",
"filter": ["asciifolding", "query_edgengram"]
},
"query_search_analyzer": {
"tokenizer": "lowercase",
"filter": ["asciifolding"]
}
}
}
}
}'
curl -X POST "http://localhost:9200/test/product%2Ffragrance/1" -d '{
"name_query": "blue grass"
}'
curl -X GET "http://localhost:9200/test/product%2Ffragrance/_search?load=true&pretty=true" -d '{
"query": {
"bool": {
"must": [{
"query_string": {
"query": "blv",
"fields": ["name_query"],
"default_operator": "OR"
}
}]
}
}
}'
For some reason, I get a result from that. Can anyone explain why? Thanks. What I want to happen is "blv" shouldn't be returning "blue grass" although "bl" should. I've used the analyze API and see "blue grass" being broken down to "bl", "blu", "blue", "gr", "gra", "gras", "grass" but "blv" doesn't match any of those.
As David told you in his answer some elasticsearch queries are analyzed. Usually you don't want to apply ngrams to your queries, but you seem to already know that given your mapping. In fact, the reason why your search analyzer without ngrams is not taken into account is a typo: search_anaylzer instead of search_analyzer. That's why your query becomes bl and blv, and bl does match the returned document.
When you search for something using a MatchQuery or QueryString, the same analyzer is applied.
So blv is tokenized into bl, blv and bl matches bl!
You can use a TermQuery which is not analyzed.
Hard to say more as I don't have your query.
David

How to match on prefix in Elasticsearch

let's say that in my elasticsearch index I have a field called "dots" which will contain a string of punctuation separated words (e.g. "first.second.third").
I need to search for e.g. "first.second" and then get all entries whose "dots" field contains a string being exactly "first.second" or starting with "first.second.".
I have a problem understanding how the text querying works, at least I have not been able to create a query which does the job.
Elasticsearch has Path Hierarchy Tokenizer that was created exactly for such use case. Here is an example of how to set it for your index:
# Create a new index with custom path_hierarchy analyzer
# See http://www.elasticsearch.org/guide/reference/index-modules/analysis/pathhierarchy-tokenizer.html
curl -XPUT "localhost:9200/prefix-test" -d '{
"settings": {
"analysis": {
"analyzer": {
"prefix-test-analyzer": {
"type": "custom",
"tokenizer": "prefix-test-tokenizer"
}
},
"tokenizer": {
"prefix-test-tokenizer": {
"type": "path_hierarchy",
"delimiter": "."
}
}
}
},
"mappings": {
"doc": {
"properties": {
"dots": {
"type": "string",
"analyzer": "prefix-test-analyzer",
//"index_analyzer": "prefix-test-analyzer", //deprecated
"search_analyzer": "keyword"
}
}
}
}
}'
echo
# Put some test data
curl -XPUT "localhost:9200/prefix-test/doc/1" -d '{"dots": "first.second.third"}'
curl -XPUT "localhost:9200/prefix-test/doc/2" -d '{"dots": "first.second.foo-bar"}'
curl -XPUT "localhost:9200/prefix-test/doc/3" -d '{"dots": "first.baz.something"}'
curl -XPOST "localhost:9200/prefix-test/_refresh"
echo
# Test searches.
curl -XPOST "localhost:9200/prefix-test/doc/_search?pretty=true" -d '{
"query": {
"term": {
"dots": "first"
}
}
}'
echo
curl -XPOST "localhost:9200/prefix-test/doc/_search?pretty=true" -d '{
"query": {
"term": {
"dots": "first.second"
}
}
}'
echo
curl -XPOST "localhost:9200/prefix-test/doc/_search?pretty=true" -d '{
"query": {
"term": {
"dots": "first.second.foo-bar"
}
}
}'
echo
curl -XPOST "localhost:9200/prefix-test/doc/_search?pretty=true&q=dots:first.second"
echo
There is also a much easier way, as pointed out in elasticsearch documentation:
just use:
{
"text_phrase_prefix" : {
"fieldname" : "yourprefix"
}
}
or since 0.19.9:
{
"match_phrase_prefix" : {
"fieldname" : "yourprefix"
}
}
instead of:
{
"prefix" : {
"fieldname" : "yourprefix"
}
Have a look at prefix queries.
$ curl -XGET 'http://localhost:9200/index/type/_search' -d '{
"query" : {
"prefix" : { "dots" : "first.second" }
}
}'
You should use a commodin chars to make your query, something like this:
$ curl -XGET http://localhost:9200/myapp/index -d '{
"dots": "first.second*"
}'
more examples about the syntax at: http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html
I was looking for a similar solution - but matching only a prefix. I found #imtov's answer to get me almost there, but for one change - switching the analyzers around:
"mappings": {
"doc": {
"properties": {
"dots": {
"type": "string",
"analyzer": "keyword",
"search_analyzer": "prefix-test-analyzer"
}
}
}
}
instead of
"mappings": {
"doc": {
"properties": {
"dots": {
"type": "string",
"index_analyzer": "prefix-test-analyzer",
"search_analyzer": "keyword"
}
}
}
}
This way adding:
'{"dots": "first.second"}'
'{"dots": "first.third"}'
Will add only these full tokens, without storing first, second, third tokens.
Yet searching for either
first.second.anyotherstring
first.second
will correctly return only the first entry:
'{"dots": "first.second"}'
Not exactly what you asked for but somehow related, so I thought could help someone.

Resources