Elasticsearch: sorting Spanish double names alphabetically

I am doing an Elasticsearch query and I want the results ordered alphabetically by last name. My problem: the last names are all Spanish double names, and ES doesn't order them the way I would like.
I would prefer the order to be:
Batres Rivera
Batrín Chojoj
Fion Morales
Lopez Giron
Martinez Castellanos
Milán Casanova
This is my query:
{
"query": {
"match_all": {}
},
"sort": [
{
"Last Name": {
"order": "asc"
}
}
]
}
The order that I get with this is:
Batres Rivera
Batrín Chojoj
Milán Casanova
Martinez Castellanos
Fion Morales
Lopez Giron
So it is not sorting by the first word of the last name, but by whichever of the two it picks (Batres, Batrín, Casanova, Castellanos, Fion, Giron).
If I try additionally
{
"order": "asc",
"mode": "max"
}
then I get:
Batrín Chojoj
Lopez Giron
Martinez Castellanos
Milán Casanova
Fion Morales
Batres Rivera
All the fields are indexed by default; I checked with
curl -XGET localhost/my_index/_mapping
and I get back
my_index: {
my_type: {
properties: {
FirstName: {
type: string
},
LastName: {
type: string
},
MiddleName: {
type: string
}
...
}
}
}
Does anyone know how to make the results be ordered alphabetically by the first word of the last name?
Thanks!

The problem is that your LastName field is analyzed, so the string Batres Rivera is indexed as a multi-value field with two terms: batres and rivera. But this isn't like an ordered array, it's more like a "bag of values". So when you try to sort on the field, it chooses one of the terms (the min or max) and sorts on that.
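You can see those terms directly with the _analyze API (a quick sketch, assuming the field uses the default standard analyzer):
curl -XGET "http://localhost:9200/my_index/_analyze?analyzer=standard" -d 'Batres Rivera'
This returns the two separate terms batres and rivera, and sorting then picks one of them.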
What you need to do is to store the LastName as a single term (Batres Rivera) for sorting purposes, by mapping the field as
{ "type": "string", "index": "not_analyzed"}
Obviously you can't then use that field for search purposes: you wouldn't be able to search for rivera and match on that field.
The way to support both searching and sorting is to use multi-fields: ie index the same value in two ways, one for searching and one for sorting.
In 0.90.* the syntax for multi-fields is:
curl -XPUT "http://localhost:9200/my_index" -d'
{
"mappings": {
"my_type": {
"properties": {
"LastName": {
"type": "multi_field",
"fields": {
"LastName": {
"type": "string"
},
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}'
In 1.0.* the multi_field type has been removed and now any core field type supports sub-fields as follows:
curl -XPUT "http://localhost:9200/my_index" -d'
{
"mappings": {
"my_type": {
"properties": {
"LastName": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}'
So you can use the LastName field for searching, and the LastName.raw field for sorting:
curl -XGET "http://localhost:9200/my_index/my_type/_search" -d'
{
"query": {
"match": {
"LastName": "rivera"
}
},
"sort": "LastName.raw"
}'
Language specific sorting
You should also look at using the ICU analysis plugin to sort using the Spanish sort order (or collation). This is a bit more complex but is worth using:
curl -XPUT "http://localhost:9200/my_index" -d'
{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding"
]
},
"es_sorting": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"spanish"
]
}
},
"filter": {
"spanish": {
"type": "icu_collation",
"language": "es"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"LastName": {
"type": "string",
"analyzer": "folding",
"fields": {
"raw": {
"type": "string",
"analyzer": "es_sorting"
}
}
}
}
}
}
}'
We create a folding analyzer which we'll use for the LastName field, which will analyze a string like Muñoz Rivera into the two terms munoz (with the ñ folded to a plain n) and rivera. So a user can search for munoz or muñoz and either will match.
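If you want to check what the folding analyzer produces, here is a quick sketch with the _analyze API (assuming the index above has been created):
curl -XGET "http://localhost:9200/my_index/_analyze?analyzer=folding" -d 'Muñoz Rivera'
which should return the terms munoz and rivera.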
Then we create the es_sorting analyzer which indexes the proper sort order for muñoz rivera (lowercased) in Spanish.
Searching would be done in the same way:
curl -XGET "http://localhost:9200/my_index/my_type/_search" -d'
{
"query": {
"match": {
"LastName": "rivera"
}
},
"sort": "LastName.raw"
}'

We need to know how you are indexing the name field. Please check this discussion, which should be very helpful for your case:
http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-way-to-search-terms-lower-cased-td932996.html
The right approach depends on your mapping settings, in particular on which analyzer you use for the name field, so please share your mapping definition so a proper solution can be suggested.

Related

How do I do a terms aggregation by concatenating two arrays?

I have an Elasticsearch mapping that looks like this:
"product": {
"properties": {
"attributes": {
"type": "keyword",
"normalizer": "lowercase"
},
"skus": {
"type": "nested",
"properties": {
"attributes": {
"type": "keyword",
"normalizer": "lowercase"
}
}
}
}
}
I'm trying to do a terms aggregation on both the field attributes and the field skus.attributes by concatenating them but I haven't figured out how. Both fields are simple string arrays. This is as far as I've gotten:
{
"query": {
"match_all": {}
},
"aggregations": {
"unique_attrs": {
"terms": {
"field": "attributes"
}
}
}
}
Of course, I could reindex my data so that there is another field containing a concatenation of the values of both fields, but that doesn't seem right.
As mentioned on the Elasticsearch forums (https://discuss.elastic.co/t/combining-nested-and-non-nested-aggregations/82583), the recommendation is to merge the two fields into one with a copy_to mapping when indexing the data.
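A hedged sketch of what that could look like (the index name products, the combined field all_attributes, and reusing your lowercase normalizer are assumptions based on the question's mapping; copy_to from inside the nested skus object up to a root-level field should be verified against your Elasticsearch version):
PUT /products
{
  "mappings": {
    "product": {
      "properties": {
        "all_attributes": {
          "type": "keyword",
          "normalizer": "lowercase"
        },
        "attributes": {
          "type": "keyword",
          "normalizer": "lowercase",
          "copy_to": "all_attributes"
        },
        "skus": {
          "type": "nested",
          "properties": {
            "attributes": {
              "type": "keyword",
              "normalizer": "lowercase",
              "copy_to": "all_attributes"
            }
          }
        }
      }
    }
  }
}
After reindexing, the terms aggregation can then run on the single combined field:
GET /products/_search
{
  "size": 0,
  "aggregations": {
    "unique_attrs": {
      "terms": {
        "field": "all_attributes"
      }
    }
  }
}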

Elasticsearch Mapping and Settings definitions

I tried setting a new index by configuring its mapping and settings.
Here is the code I used:
POST /test/text
{
"settings": {
"analysis": {
"filter": {
"greek_stop": {
"type": "stop",
"stopwords": "_greek_"
},
"greek_lowercase": {
"type": "lowercase",
"language": "greek"
},
"greek_stemmer": {
"type": "stemmer",
"language": "greek"
}
},
"analyzer": {
"greek": {
"tokenizer": "standard",
"filter": [
"greek_lowercase",
"greek_stop",
"greek_stemmer"
]
}
}
}
},
"mappings": {
"article": {
"properties": {
"title": {
"type": "string",
"fields": {
"greek": {
"type": "string",
"analyser": "greek"
}
}
},
"content": {
"type": "string",
"fields": {
"greek": {
"type": "string",
"analyser": "greek"
}
}
},
"indexed_date": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
}
}
}
}
}
I used POST because I didn't care much about the _id; that way ES assigns a random value instead of me having to specify one with PUT. The reason I created two sub-fields for title and content is that I want both the raw version of the text and the stemmed, stop-words-removed version, so that I can weight a hit higher when the term is found exactly as the user entered it (instead of storing only the stemmed version of a word).
After populating the index with data, e.g.:
PUT /test/text
{
"title": " ",
"content": " ",
"date": " "
}
I tried doing a search query like this:
GET /test/text/_search
{
"query":{
"multi_match":{
"query":"όμορφος",
"type":"most_fields",
"fields":["content","content.greek","title","title.greek"]}}}
Then I change the query to "όμορφη". Both words have the same stemmed form, "όμορφ", so thanks to ES's Greek language analyzer I should get the same entry back, but I don't.
Any idea why? Should I be doing something more while indexing my documents? After reading the documentation I was under the impression that, once the mapping is defined, a piece of text would automatically get indexed both ways, and each query would be analyzed with the appropriate analyzer automatically.
If that impression is right, why doesn't my query return the same results? Any ideas?
Thank you in advance.
To create the index in the first place you need to call PUT /test, not POST /test/text. The latter will simply create a new document of type text in a new index called test, but with the default settings and mappings.
So first:
PUT /test
{
"settings": {
...
},
"mappings": {
...
}
}
Then you can create new documents with (note that your mapping type was called article, not text)
POST /test/article
{
"title": " ",
"content": " ",
"date": " "
}
Only then will your search query work.

How to implement case sensitive search in elasticsearch?

I have a field in my indexed documents on which I need to search case-sensitively. I am using the match query to fetch the results.
An example of my data document is :
{
"name" : "binoy",
"age" : 26,
"country": "India"
}
Now when I give the following query:
{
"query" : {
"match" : {
"name" : "Binoy"
}
}
}
It gives me a match for "binoy" against "Binoy". I want the search to be case-sensitive. It seems that, by default, Elasticsearch is case-insensitive. How do I make the search case-sensitive in Elasticsearch?
In the mapping you can define the field as not_analyzed.
curl -X PUT "http://localhost:9200/sample" -d '{
"index": {
"number_of_shards": 1,
"number_of_replicas": 1
}
}'
echo
curl -X PUT "http://localhost:9200/sample/data/_mapping" -d '{
"data": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}'
Now if you index and search as normal, the field will not be analyzed, so matching is exact and therefore case-sensitive.
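As a sketch, assuming the sample index and the example document above, an exact lookup can then be done with a term query; searching for "Binoy" would not match, since the indexed value is the untouched string binoy:
curl -XGET "http://localhost:9200/sample/data/_search" -d '{
  "query": {
    "term": {
      "name": "binoy"
    }
  }
}'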
It depends on the mapping you have defined for your field name. If you haven't defined any mapping then Elasticsearch will treat it as a string and use the standard analyzer (which lower-cases the tokens) to generate tokens. Your query will also use the same analyzer for search, hence matching is done by lower-casing the input. That's why "Binoy" matches "binoy".
To solve it you can define a custom analyzer without lowercase filter and use it for your field name. You can define the analyzer as below
"analyzer": {
"casesensitive_text": {
"type": "custom",
"tokenizer": "standard",
"filter": ["stop", "porter_stem" ]
}
}
You can define the mapping for name as below
"name": {
"type": "string",
"analyzer": "casesensitive_text"
}
Now you can do the search on name.
Note: the analyzer above is for example purposes; you may need to change it to suit your needs.
Have your mapping like:
PUT /whatever
{
"settings": {
"analysis": {
"analyzer": {
"mine": {
"type": "custom",
"tokenizer": "standard"
}
}
}
},
"mappings": {
"type": {
"properties": {
"name": {
"type": "string",
"analyzer": "mine"
}
}
}
}
}
meaning, no lowercase filter for that custom analyzer.
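As a sketch of how this behaves (index and field names taken from the mapping above): the query text is analyzed with the same mine analyzer, so it is tokenized but not lowercased, and only documents containing exactly Binoy (capital B) will match:
GET /whatever/type/_search
{
  "query": {
    "match": {
      "name": "Binoy"
    }
  }
}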
Here is the full index template which worked for my ElasticSearch 5.6:
{
"template": "logstash-*",
"settings": {
"analysis" : {
"analyzer" : {
"case_sensitive" : {
"type" : "custom",
"tokenizer": "standard",
"filter": ["stop", "porter_stem" ]
}
}
},
"number_of_shards": 5,
"number_of_replicas": 1
},
"mappings": {
"fluentd": {
"properties": {
"message": {
"type": "text",
"fields": {
"case_sensitive": {
"type": "text",
"analyzer": "case_sensitive"
}
}
}
}
}
}
}
As you can see, the logs come from Fluentd and are saved into a time-based index logstash-*. To make sure I can still execute wildcard queries on the message field, I put a multi-field mapping on that field. Wildcard/analyzed queries can be done on the message field, and case-sensitive ones on the message.case_sensitive field.
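For example, a case-sensitive query against such an index might look like this (the index pattern and the search term Error are just placeholders):
GET logstash-*/_search
{
  "query": {
    "match": {
      "message.case_sensitive": "Error"
    }
  }
}
Because neither the indexed terms nor the query terms are lowercased, this matches Error but not error.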

How to force a terms filter to ignore stopwords?

I have an Elasticsearch index with a bunch of fields, some of which I want to use along with the default stopword list. On the other hand, I have a username field which should return results for users called the, be etc.
Of course, when I run the following query:
{
"query": {
"constant_score": {
"filter": {
"terms": {
"username": [
"be"
]
}
}
}
}
}
nothing is returned. I have seen various solutions for changing the standard analyzer to remove stopwords, but am struggling to find how I would do so for this one field only. Thanks for any pointers.
You can do it like the following: add a custom analyzer that doesn't use stopwords and then explicitly specify this analyzer just for those fields in which you want stopwords to be searchable (like your username field).
PUT /stopwords
{
"settings": {
"analysis": {
"analyzer": {
"my_english": {
"type": "english",
"stopwords": "_none_"
}
}
}
},
"mappings": {
"text": {
"properties": {
"title": {
"type": "string"
},
"content": {
"type": "string"
},
"username": {
"type": "string",
"analyzer": "my_english"
}
}
}
}
}
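To check the difference, here is a quick sketch with the _analyze API (assuming the stopwords index above): an analyzer with English stopwords, such as the built-in english analyzer, drops be, while my_english keeps it, so the terms filter on username can now find it:
curl -XGET "http://localhost:9200/stopwords/_analyze?analyzer=english" -d 'be'
curl -XGET "http://localhost:9200/stopwords/_analyze?analyzer=my_english" -d 'be'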

Why prefix returns documents without the specific prefix?

I want to return only documents whose name starts with "pizza". This is what I've done:
{
"query": {
"filtered": {
"filter": {
"prefix": {
"name": "pizza"
}
}
}
}
}
But I've got these 3 documents:
{
"name": "Viana Pizza",
"city": "Mashhad",
"address": "Vakil abad",
"foods": ["Pizza"],
"salad": true,
"rate": 5.0
}
{
"name": "Pizza Pizza",
"city": "Mashhad",
"address": "Bahar st",
"foods": ["Pizza"],
"salad": true,
"rate": 8.5
}
{
"name": "Reza Pizza",
"city": "Tehran",
"address": "Vali Asr",
"foods": ["Pizza"],
"salad": true,
"rate": 7.5
}
As you can see, only one of them has "pizza" at the beginning of the name field.
What's wrong?
Probably the simplest explanation, given that you didn't provide the actual mapping, is that you have the "name" field as "string" and "analyzed" (the default), which means that "Reza Pizza" will be transformed into the terms "reza" and "pizza".
Your filter matches against terms, not against entire field values, because ES analyzes the field and forms terms when the standard mapping is used.
You need to either change your "name" field to "not_analyzed", or add another field that mirrors "name" but is itself "not_analyzed". Also, for the lowercase text "pizza" to match in this case, you need to create a custom analyzer.
Below you have the solution with the mirror field:
PUT /pizza
{
"settings": {
"analysis": {
"analyzer": {
"my_keyword_lowercase_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"restaurant": {
"properties": {
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"analyzer": "my_keyword_lowercase_analyzer"
}
}
}
}
}
}
}
And in searching you need to use the mirror field:
GET /pizza/restaurant/_search
{
"query": {
"filtered": {
"filter": {
"prefix": {
"name.raw": "pizza"
}
}
}
}
}
That's all about Elasticsearch analyzers. Let's read the documentation on prefix filter:
Filters documents that have fields containing terms with a specified prefix (not analyzed).
Here we can see that this filter matches terms, not the whole field value. When you index a document, ES splits your field values into terms using analyzers. The default analyzer splits the value on whitespace and converts the parts to lowercase. So all three results have the term pizza in the name field, and that term perfectly matches the pizza prefix. If you want to match the field value as is, I'd suggest mapping the name field as not_analyzed.
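For illustration, a quick sketch of how the standard analyzer breaks the value up:
GET /_analyze?analyzer=standard&text=Reza+Pizza
which returns the terms reza and pizza, so the prefix pizza matches all three documents.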
