Elasticsearch UTF-8 characters in search - elasticsearch

I have indexed record:
"žiema"
Elasticsearch settings:
index:
    cmpCategory: {type: string, analyzer: like_analyzer}
Analyzer:
analysis:
    char_filter:
        lt_characters:
            type: mapping
            mappings: ["ą=>a","Ą=>a","č=>c","Č=>c","ę=>e","Ę=>e","ė=>e","Ė=>e","į=>i","Į=>i","š=>s","Š=>s","ų=>u","Ų=>u","ū=>u","ž=>z","Ū=>u"]
    analyzer:
        like_analyzer:
            type: snowball
            tokenizer: standard
            filter: [lowercase, asciifolding]
            char_filter: [lt_characters]
What I want:
Searching for the keyword "žiema" should find the record "žiema", AND searching for "ziema" should also find "žiema". How can I do that?
I tried replacing the characters with the char_filter and applying the asciifolding filter.
What am I doing wrong?

You can try indexing your field twice, as shown in the documentation.
PUT /my_index/_mapping/my_type
{
  "properties": {
    "cmpCategory": {
      "type": "string",
      "analyzer": "standard",
      "fields": {
        "folded": {
          "type": "string",
          "analyzer": "like_analyzer"
        }
      }
    }
  }
}
so the cmpCategory field is indexed with the standard analyzer (keeping the diacritics), and the cmpCategory.folded field is indexed without diacritics.
Then, when searching, you query both fields like this:
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "žiema",
      "fields": [ "cmpCategory", "cmpCategory.folded" ]
    }
  }
}
Also, I'm not sure if the char_filter is necessary since the asciifolding filter already does that transformation.
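To verify that, you can run the analyzer directly with the _analyze API and check the emitted tokens; a minimal sketch (the exact request syntax depends on your Elasticsearch version, this is the JSON body form):
GET /my_index/_analyze
{
  "analyzer": "like_analyzer",
  "text": "Žiema"
}
If the returned token already has the diacritics folded away, the lt_characters char_filter is indeed redundant.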

Related

Query search string on elastic search

I have a field that's defined as below.
"findings": {
"type": "string",
"fields": {
"orig": {
"type": "string"
},
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
The findings field contains the following text:
S91 - fiber cut
Now, when I do a 'term' search on 'findings.orig' for the word 'fiber', I get a search response, but when I do a 'query string' search on 'findings.orig' for 'fiber cut', I don't get any search response.
When I do a 'query string' search on '_all' for 'fiber cut', I do get the search response.
Why don't I get any response for 'fiber cut' with a 'query string' search on 'findings.orig'?
You can try this; hope it works:
GET index/type/_search
{
  "query": {
    "query_string": {
      "fields": ["findings.orig"],
      "query": "S91 - fiber cut"
    }
  }
}
If you want to search in nested fields, wrap the query_string in a nested query (see "Elasticsearch: query_string nested search"), as sketched below.
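A minimal sketch of that nested variant, assuming a hypothetical field mapped with "type": "nested" at path "events" (the findings field in this question is a multi-field, not a nested field, so these names are only illustrative):
GET index/type/_search
{
  "query": {
    "nested": {
      "path": "events",
      "query": {
        "query_string": {
          "fields": ["events.description"],
          "query": "S91 - fiber cut"
        }
      }
    }
  }
}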

elasticsearch: multifield mapping of multiple fields

I will have a document with multiple fields, let's say 'title', 'meta1', 'meta2', 'full_body'.
I want to index each of them in a few different ways (raw, stemming without stop-words, shingles, synonyms, etc.). Therefore I will have fields like title.stemming, title.shingles, meta1.stemming, meta1.shingles, etc.
Do I have to copy-paste the mapping definition for each field? Or is it possible to create one definition of all the ways of indexing/analysing and then only apply it to each of the 4 top-level fields? If so, how?
mappings:
    my_type:
        properties:
            title:
                type: string
                fields:
                    shingles:
                        type: string
                        analyzer: my_shingle_analyzer
                    stemming:
                        type: string
                        analyzer: my_stemming_analyzer
            meta1:
                ... <-- do i have to repeat everything here?
            meta2:
                ... <-- and here?
            full_body:
                ... <-- and here?
In your case, you could use dynamic templates with the match_mapping_type setting so that you can apply the same setting to all your string fields:
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "fields": {
                "shingles": {
                  "type": "string",
                  "analyzer": "my_shingle_analyzer"
                },
                "stemming": {
                  "type": "string",
                  "analyzer": "my_stemming_analyzer"
                },
                ... other sub-fields and analyzers
              }
            }
          }
        }
      ]
    }
  }
}
As a result, whenever you index a string field, its mapping will be created according to the defined template. You can also use the match setting to restrict the template to specific field names only, as sketched below.
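A minimal sketch of such a restriction, assuming you only want the template applied to the four top-level fields from the question; the regex form via match_pattern is one way to express this:
"dynamic_templates": [
  {
    "analyzed_strings": {
      "match_pattern": "regex",
      "match": "^(title|meta1|meta2|full_body)$",
      "match_mapping_type": "string",
      "mapping": {
        "type": "string",
        "fields": {
          "shingles": { "type": "string", "analyzer": "my_shingle_analyzer" },
          "stemming": { "type": "string", "analyzer": "my_stemming_analyzer" }
        }
      }
    }
  }
]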

ElasticSearch "H & R Block" with partial word search

The requirement is to be able to search for the following terms:
"H & R" should find "H & R Block".
I have managed to implement this requirement on its own using word_delimiter, as mentioned in this answer: elasticsearch tokenize "H&R Blocks" as "H", "R", "H&R", "Blocks"
Using Ruby code:
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  filter: {
    my_splitter: {
      type: "word_delimiter",
      preserve_original: true
    }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      type: "custom",
      tokenizer: "whitespace",
      filter: %w[lowercase asciifolding my_splitter]
    }
  }
}
But also, in the same query, we want autocomplete functionality or partial word matching, so
"Ser", "Serv", "Servi", "Servic" and "Service" should all find "Service" and "Services".
I have managed to implement this requirement on its own, using ngram.
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      tokenizer: "my_ngram",
      filter: %w[lowercase asciifolding]
    }
  },
  tokenizer: {
    my_ngram: {
      type: "nGram",
      min_gram: "3",
      max_gram: "10",
      token_chars: %w[letter digit]
    }
  }
}
I just can't manage to implement them together. When I use ngram, short words are ignored, so "H & R" is left out. When I use word_delimiter, partial word searches stop working. Below is my latest attempt at merging both requirements; it supports partial word searches, but not "H & R".
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  filter: {
    my_splitter: {
      type: "word_delimiter",
      preserve_original: true
    }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      type: "custom",
      tokenizer: "my_tokenizer",
      filter: %w[lowercase asciifolding my_splitter]
    }
  },
  tokenizer: {
    my_tokenizer: {
      type: "nGram",
      min_gram: "3",
      max_gram: "10",
      token_chars: %w[letter digit]
    }
  }
}
You can use multi-fields in your mapping to index the same field in multiple ways: keep your full-text search with the custom tokenizer on the default field, and create a special indexing for your autocompletion needs. For example:
"title": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
Your query will need to be slightly different when performing the autocomplete, as the field will be title.raw instead of just title.
Once the field is indexed in all the ways that make sense for your queries, you can query the index with a boolean "should" query that matches both the tokenized version and the word-start version, as sketched below. A larger boost should likely be given to the clause matching complete words, so the direct hits end up on top.
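A minimal sketch of such a query, assuming hypothetical sub-fields name (word_delimiter analysis) and name.partial (ngram analysis); the field names and boost value are only illustrative:
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name":         { "query": "H & R Block", "boost": 3 } } },
        { "match": { "name.partial": { "query": "H & R Block" } } }
      ]
    }
  }
}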

How to search with keyword analyzer?

I have the keyword analyzer as the default analyzer, like so:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
But now I can't search for anything, e.g.:
{
  "query": {
    "query_string": {
      "query": "cast"
    }
  }
}
This gives me 0 results, although "cast" is a common value in the indexed documents. (http://gist.github.com/baelter/b0720a52ee5a27e27d3a)
Searching for "*" works fine, by the way.
I only have explicit defaults in my mapping:
{
  "oceanography_point": {
    "_all": {
      "enabled": true
    },
    "properties": {}
  }
}
The index behaves as if no fields are included in _all, because field:value queries work fine.
Am I misusing the keyword analyzer?
With the keyword analyzer, you can only do an exact string match.
Let's assume that you have used the keyword analyzer and no filters. In that case, for a string indexed as "Cast away in forest", neither a search for "cast" nor for "away" will work. You need to search for the exact string "Cast away in forest" to match it. (Assuming no lowercase filter is used, you also need to give the right case.)
A better approach would be to use multi-fields to declare one copy as keyword-analyzed and the other one as normally analyzed, as sketched below.
You can then search on one of these fields and aggregate on the other.
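A minimal sketch of such a multi-field mapping, using a hypothetical description field (the field name is only illustrative): the parent field keeps the standard analysis, while the raw sub-field uses the keyword analyzer.
PUT my_index/_mapping/oceanography_point
{
  "properties": {
    "description": {
      "type": "string",
      "analyzer": "standard",
      "fields": {
        "raw": { "type": "string", "analyzer": "keyword" }
      }
    }
  }
}
You would then search on description and facet/aggregate on description.raw.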
Okay, after some 15 hours of trial and error, I can conclude that this works for search:
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
However, this breaks faceting, so I ended up using a dynamic template instead:
"dynamic_templates" : [
{
"strings_not_analyzed" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
],

elasticsearch search query for exact match not working

I am using query_string for search. Searching works fine, but it matches records in both lowercase and uppercase. I want an exact, case-sensitive match.
For example:
Search field: "title"
Current output:
title
Title
TITLE
I want only the first one (title). How can I resolve this issue?
My code in Java:
QueryBuilder qbString = null;
qbString = QueryBuilders.queryString("title").field("field_name");
You need to configure your mappings / text processing so that tokens are indexed without being lowercased.
The "standard" analyzer lowercases (and can remove stopwords, if configured).
Here's an example that shows how to configure an analyzer and a mapping to achieve this: https://www.found.no/play/gist/7464654
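A minimal sketch of that idea, assuming a hypothetical index and a made-up analyzer name case_sensitive: it tokenizes with the standard tokenizer but applies no lowercase filter, so tokens keep their original case.
PUT caseindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_sensitive": {
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "field_name": { "type": "string", "analyzer": "case_sensitive" }
      }
    }
  }
}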
With version 5+ of Elasticsearch there is no concept of analyzed and not-analyzed for an index; it is driven by the type!
The string data type is deprecated and replaced with text and keyword, so if your data type is text it will behave like string and can be analyzed and tokenized.
But if the data type is defined as keyword, then it is automatically NOT analyzed and returns full exact matches.
So you should remember to mark the type as keyword when you want to do an exact, case-sensitive match.
Code example below for creating an index with this definition:
PUT testindex
{
  "mappings": {
    "original": {
      "properties": {
        "#timestamp": {
          "type": "date"
        },
        "#version": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "APPLICATION": {
          "type": "text",
          "fields": {
            "exact": { "type": "keyword" }
          }
        },
        "type": {
          "type": "text",
          "fields": {
            "exact": { "type": "keyword" }
          }
        }
      }
    }
  }
}
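A minimal sketch of querying the exact-match sub-field from that mapping: a term query on the keyword sub-field is not analyzed, so it is case sensitive ("title" here is just the example value from the question).
GET testindex/_search
{
  "query": {
    "term": {
      "APPLICATION.exact": "title"
    }
  }
}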
