Can't deal with accents in Elasticsearch indexing and search

I have an issue with Elasticsearch and the way the data are indexed and retrieved; I don't understand what is happening.
The idea is simple, in theory... I have a string analyzer with lowercase and asciifolding filters. I don't want to care about case or accents, and I would like to use this analyzer both to index and to search.
This is the mapping I use (sorry, it's in YAML format):
settings:
  index:
    analysis:
      filter:
        autocomplete_filter:
          type: edgeNGram
          side: front
          min_gram: 1
          max_gram: 20
      analyzer:
        autocomplete:
          type: custom
          tokenizer: standard
          filter: [lowercase, asciifolding, autocomplete_filter]
        string_analyzer:
          type: custom
          tokenizer: standard
          filter: [lowercase, asciifolding]
types:
  city:
    mappings:
      cityName:
        type: string
        analyzer: string_analyzer
        search_analyzer: string_analyzer
      location: {type: geo_point}
When I run this query:
{
  "query": {
    "prefix": {
      "cityName": "per"
    }
  },
  "size": 20
}
I get some results like "Perpezat", "Pern", and "Péreuil", which is the expected result.
But if I run the following query:
{
  "query": {
    "prefix": {
      "cityName": "pér"
    }
  },
  "size": 20
}
Then I get no results at all.
If you have any clue or suggestion, I would be happy to hear it.
Thanks

In the Prefix Query, your search input is not analyzed like in other cases:
Matches documents that have fields containing terms with a specified prefix (not analyzed)
Your first example works because the documents are analyzed at index time using your analyzer with lowercase and asciifolding, so they contain a term starting with per (perpezat, pern, pereuil).
Your second example does not work because those documents don't contain any terms starting with pér.
Since I couldn't find a way to tell Elasticsearch to analyze the prefix before performing the search, you could achieve your goal by manually adding these steps:
1. Ask Elasticsearch to analyze your input by calling the Analyze API.
2. Use the output from step 1 (it should be per in your examples) for the prefix query.
For this to work, your search input should be a single term (I think that could be why Elasticsearch doesn't analyze it in the first place).
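For illustration, a rough sketch of those two steps could look like this (the index name cities is an assumption, and the exact Analyze API syntax varies between Elasticsearch versions):
GET cities/_analyze
{
  "analyzer": "string_analyzer",
  "text": "Pér"
}
This should return the single token per, which can then be fed into the prefix query:
{
  "query": {
    "prefix": {
      "cityName": "per"
    }
  },
  "size": 20
}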

@mario-trucco Finally, I've found this post that explains a better way to analyze the strings:
What is an effective way to search world-wide location names with ElasticSearch?
Of course it doesn't answer my initial question and I still don't understand what happened, but it solves my problem by removing it.
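In short, the approach from that post comes down to indexing with an edge ngram analyzer (like the autocomplete analyzer already defined in the settings above), searching with the plain string_analyzer, and replacing the term-level prefix query with an analyzed match query, roughly like this (a sketch, not the exact mapping from that post):
cityName:
  type: string
  analyzer: autocomplete
  search_analyzer: string_analyzer
{
  "query": {
    "match": {
      "cityName": "pér"
    }
  },
  "size": 20
}
Because the match query analyzes "pér" with string_analyzer into per, it matches the edge ngram per produced at index time.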
Thanks again for your help and time.

Related

ElasticSearch Search query is not case sensitive

I am trying a search query and it works fine for exact searches, but if the user enters lowercase or uppercase text it does not work, as ElasticSearch is case insensitive.
Example:
{
  "query": {
    "bool": {
      "should": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "city": "pune"
        }
      }
    }
  }
}
It works fine when the city is exactly "pune"; if we change the text to "PUNE" it does not work.
ElasticSearch is case insensitive.
"Elasticsearch" is not case-sensitive. A JSON string property will be mapped as a text datatype by default (with a keyword datatype sub or multi field, which I'll explain shortly).
A text datatype has the notion of analysis associated with it; at index time, the string input is fed through an analysis chain, and the resulting terms are stored in an inverted index data structure for fast full-text search. With a text datatype where you haven't specified an analyzer, the default analyzer will be used, which is the Standard Analyzer. One of the components of the Standard Analyzer is the Lowercase token filter, which lowercases tokens (terms).
When it comes to querying Elasticsearch through the search API, there are a lot of different types of query to use, to fit pretty much any use case. One family of queries such as match, multi_match queries, are full-text queries. These types of queries perform analysis on the query input at search time, with the resulting terms compared to the terms stored in the inverted index. The analyzer used by default will be the Standard Analyzer as well.
Another family of queries such as term, terms, prefix queries, are term-level queries. These types of queries do not analyze the query input, so the query input as-is will be compared to the terms stored in the inverted index.
In your example, your term query on the "city" field does not find any matches when capitalized because it's searching against a text field whose input underwent analysis at index time. With the default mapping, this is where the keyword sub field could help. A keyword datatype does not undergo analysis (well, it has a type of analysis with normalizers), so can be used for exact matching, as well as sorting and aggregations. To use it, you would just need to target the "city.keyword" field. An alternative approach could also be to change the analyzer used by the "city" field to one that does not use the Lowercase token filter; taking this approach would require you to reindex all documents in the index.
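As a sketch (assuming the default mapping created the keyword sub field), the original filter could target that sub field like this; note that the value must then match the stored string exactly, including its case:
{
  "query": {
    "bool": {
      "should": { "match_all": {} },
      "filter": {
        "term": { "city.keyword": "pune" }
      }
    }
  }
}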
Elasticsearch lowercases text fields during analysis unless you define a custom mapping.
Exact values (like numbers, dates, and keywords) have the exact value specified in the field added to the inverted index in order to make them searchable.
However, text fields are analyzed. This means that their values are first passed through an analyzer to produce a list of terms, which are then added to the inverted index. There are many ways to analyze text: the default standard analyzer drops most punctuation, breaks up text into individual words, and lower cases them.
See: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
So if you want to use a term query, analyze the term on your own before querying, or just lowercase the term in this case.
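For example, either lowercase the input in your application before running the term query, or switch to an analyzed full-text query such as match, which lowercases the input at search time (a minimal sketch, assuming the default standard analyzer on the city field):
{
  "query": {
    "match": { "city": "PUNE" }
  }
}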
To solve this issue I created a custom normalizer and updated the mapping to use it; before that, we have to delete the index and add it again.
First, delete the index:
DELETE http://localhost:9200/users
Now create the index again:
PUT http://localhost:9200/users
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "city": {
          "type": "keyword",
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
}
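With that normalizer in place, a term query should match regardless of case or accents, because the normalizer is applied to the query input as well as at index time (a sketch, assuming the documents have been reindexed):
POST http://localhost:9200/users/_search
{
  "query": {
    "term": { "city": "PUNE" }
  }
}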

How can you match long strings?

One of the main challenges that I am facing at the moment is how to match long strings while applying fuzziness to them.
For example, let's say that we have the following document:
PUT my_index/type/2
{
  "name": "longnameyesverylong"
}
If I apply a fuzzy search on that name, like the following:
"match": {
  "name": {
    "query": "longnameyesverylong",
    "fuzziness": 2
  }
}
I can find it, but my goal would be to be able to widen the net and allow more than two mistakes for this type of string.
Let's say, for example, that I index something like:
PUT my_index/type/2
{
  "name": "l1ngnam2yesver3long"
}
The previous match query won't be able to find this document, as the fuzziness is greater than 2 and that is not supported in ES.
I tried to use ngrams, but the tokens did not meet the requirement either and the index would grow too much.
The only option I have off the top of my head is to split the string manually at index time, creating my "own tokenizer", and index a document that looks like
PUT my_index/type/2
{
  "name": "longnamey esverylong"
}
And then, at search time, split the string again and apply a boolean query with fuzziness on each token. This can probably do what I need, but I feel that there is a better solution for this problem.
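For instance, the search side of that idea would translate into something like this (a sketch of the approach just described, not a recommendation):
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": { "query": "longnamey", "fuzziness": 2 } } },
        { "match": { "name": { "query": "esverylong", "fuzziness": 2 } } }
      ]
    }
  }
}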
Is there any other approach that you think might be appropriate?
Thank you.
Problem solved. The key for this problem is the pattern_capture filter.
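The answer doesn't include the configuration, but a sketch of the idea might look like the following: a pattern_capture token filter that splits the long value into fixed-size chunks, so that the fuzziness budget applies per chunk rather than to the whole string (the filter and analyzer names, the chunk length, and the field mapping are illustrative assumptions, not taken from the original):
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "chunk_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": ["(\\w{1,6})"]
        }
      },
      "analyzer": {
        "chunked": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "chunk_filter"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "name": { "type": "string", "analyzer": "chunked" }
      }
    }
  }
}
With this setup, "longnameyesverylong" is indexed as the chunks longna, meyesv, erylon, g (plus the original value), and a match query with fuzziness 2 on name analyzes its input the same way, so each chunk can tolerate up to two edits.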

Is it possible to chain fquery filters in elastic search with exact matches?

I have been having trouble writing a method that will take in various search parameters in elasticsearch. I was working with queries that looked like this:
body:
{
  query: {
    filtered: {
      filter: {
        and: [
          {term: {some_term: "foo"}},
          {term: {is_visible: true}},
          {term: {"term_two": "something"}}
        ]
      }
    }
  }
}
Using this syntax I thought I could chain these terms together and programmatically generate these queries. I was using simple strings, and if there was a term like "person_name" I could split the query in two and say "where person_name matches 'JOHN'" and "where person_name matches 'SMITH'", getting accurate results.
However, I just came across "fquery" upon asking this question:
Escaping slash in elasticsearch
I was not able to use this "and"/"term" filter to search a value with slashes in it, so I learned that I can use fquery to search for the full value, like this:
"fquery": {
"query": {
"match": {
"by_line": "John Smith"
But how can I search like this for multiple items? It seems that when I combine fquery and my filtered/filter/and/term queries, my "and" term queries are ignored. What is the best practice for making nested/chained queries using Elasticsearch?
As in the comment below, yes, I can just add fquery to the "and" block like so:
{:filtered=>
  {:filter=>
    {:and=>[
      {:term=>{:is_visible=>true}},
      {:term=>{:is_private=>false}},
      {:fquery=>
        {:query=>{:match=>{:sub_location=>"New JErsey"}}}}]}}}
Why would elasticsearch also return results with "sub_location" = "new York"? I would like to only return "new jersey" here.
A match query analyzes the input and by default it is a boolean OR query if there are multiple terms after the analysis. In your case, "New JErsey" gets analyzed into the terms "new" and "jersey". The match query that you are using will search for documents in which the indexed value of field "sub_location" is either "new" or "jersey". That is why your query also matches documents where the value of field "sub_location" is "new York" because of the common term "new".
To only match for "new jersey", you can use the following version of the match query:
{
  "query": {
    "match": {
      "sub_location": {
        "query": "New JErsey",
        "operator": "and"
      }
    }
  }
}
This will not match documents where the value of field "sub_location" is "New York". But, it will match documents where the value of field "sub_location" is say "York New" because the query finally translates into a boolean query like "York" AND "New". If you are fine with this behaviour, well and good, else read further.
All these issues arise because you are using the default analyzer for the field "sub_location", which breaks tokens at word boundaries and indexes them. If you really do not care about partial matches and want to always match the entire string, you can make use of custom analyzers with the Keyword Tokenizer and the Lowercase Token Filter. Mind you, going ahead with this approach will require you to re-index all your documents.
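A sketch of such an analyzer (the analyzer name and the mapping type are illustrative assumptions, and existing documents would need to be reindexed):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "sub_location": { "type": "string", "analyzer": "lowercase_keyword" }
      }
    }
  }
}
With this mapping, "New JErsey" is indexed as the single term "new jersey", and a match query for "New JErsey" produces the same single term, so only whole-string matches are returned.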

Elasticsearch: field "title" was indexed without position data; cannot run PhraseQuery

I have an index in ElasticSearch with the following mapping:
mappings: {
  feed: {
    properties: {
      html_url: {
        index: not_analyzed
        omit_norms: true
        index_options: docs
        type: string
      }
      title: {
        index_options: offsets
        type: string
      }
      created: {
        store: true
        format: yyyy-MM-dd HH:mm:ss
        type: date
      }
      description: {
        type: string
      }
    }
  }
}
I get the following error when performing a phrase search ("video games"):
IllegalStateException[field \"title\" was indexed without position data; cannot run PhraseQuery (term=video)];
Single-word searches work fine. I tried "index_options: positions" as well, but with no luck. The title field contains text in multiple languages and is sometimes empty. Interestingly, it seems to fail randomly; for example, it would fail with 200K documents or with 800K using the same dataset. Is there a reason some titles wouldn't get indexed with positions?
Elasticsearch version 0.90.5
Just in case someone else has the same issue: there was another type/table (feed2) in the same index with the same "title" field, which was set to "not_analyzed".
For some reason, even if you specify the type (http://elasticsearchhost.com:9200/index_name/feed/_search), the other type is still being searched as well. Changing the mapping for the feed2 type fixed the problem.
You probably have another field named 'title' with a different mapping in another type but in the same index.
Basically, if you have 2 fields with the same name in the same index, even if they are in different types, they cannot have different mappings: to be more precise, even if they have the same type (e.g. "string") but one of them is "analyzed" and the other is "not analyzed", problems will arise.
I mean, yeah, you can try to set up 2 different mappings, and ElasticSearch will not complain, but when searching you get strange results and everything goes bananas.
You can read more about this issue here where they say:
[...] In the end, we opted to enforce the rule that all fields with the same name in the same index must have the same mapping [...]
And yeah, considering how the promise of ElasticSearch has always been "it just works" this little detail took a lot of people by surprise.

Elasticsearch: Constructing mappings for Java Client

In my elasticsearch.yml file I am trying to implement a mapping where one field belonging to one type is indexed using a different analyzer from the rest.
At present the yaml file has the following structure:
index:
  bookshelf:
    types:
      book:
        mappings:
          title: {analyzer: customAnalyzer}
  analysis:
    analyzer:
      # set standard analyzer with no stop words as the default
      default:
        type: standard
        stopwords: _none_
      # set custom analyser to provide relative search results
      customAnalyzer:
        type: custom
        tokenizer: nGramTokenizer
        filter: [lowercase, stopWordsFilter, asciifolding]
    tokenizer:
      nGramTokenizer:
        type: nGram
        min_gram: 1
        max_gram: 2
    filter:
      nGramFilter:
        type: nGram
        min_gram: 1
        max_gram: 2
      stopWordsFilter:
        type: stop
        stopwords: _none_
This does not apply the custom analyzer to the title field, so I was hoping someone might be able to point me in the right direction for applying custom analyzers to individual fields.
I answered this on the mailing list:
If you are using Java, you don't have to use a yml file. You can, but you don't have to.
If you are using Spring, you can have a look at the ES spring factory project:  https://github.com/dadoonet/spring-elasticsearch
If not, there are different ways of creating indices and mappings in Java:
You can have a look here to see how I'm doing this by reading a json mapping file: https://github.com/dadoonet/spring-elasticsearch/blob/master/src/main/java/fr/pilato/spring/elasticsearch/ElasticsearchAbstractClientFactoryBean.java#L616
You can also use XContent objects provided by ES to build your mappings in Java: https://github.com/dadoonet/rssriver/blob/master/src/test/java/org/elasticsearch/river/rss/RssRiverTest.java#L14
Using this object is described here: https://github.com/dadoonet/rssriver/blob/master/src/test/java/org/elasticsearch/river/rss/AbstractRssRiverTest.java#L98
Adding the mapping as follows:
node.client().admin().indices()
    .preparePutMapping("yourindex")
    .setType("yourtype")
    .setSource(mapping())
    .execute().actionGet();
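For illustration, the mapping() method used above could build its content with an XContentBuilder (from org.elasticsearch.common.xcontent), roughly like this; the field names follow the question's yml, and this is a sketch rather than the original code:
XContentBuilder mapping = XContentFactory.jsonBuilder()
    .startObject()
        .startObject("book")
            .startObject("properties")
                .startObject("title")
                    .field("type", "string")
                    .field("analyzer", "customAnalyzer")
                .endObject()
            .endObject()
        .endObject()
    .endObject();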
I hope this helps.
