Elasticsearch not analyzed field - elasticsearch

I have an analyzed field which contains the following: 'quick brown foxes' and another one which contains: 'quick brown fox'.
I want to find those documents which explicity contains 'foxes' (not fox). As I know I have to create a multi-field with an analyzed and not-analyzed subfield (see my mapping below). But how can I query this?
Here's an example (note that my analyzer is set to hungarian but I guess this not matters here):
{
"settings" : {
"number_of_replicas": 0,
"number_of_shards": 1,
"analysis" : {
"analyzer" : {
"hu" : {
"tokenizer" : "standard",
"filter" : [ "lowercase", "hu_HU" ]
}
},
"filter" : {
"hu_HU" : {
"type" : "hunspell",
"locale" : "hu_HU",
"language" : "hu_HU"
}
}
}
},
"mappings": {
"foo": {
"_source": { "enabled": true },
"properties": {
"text": {
"type": "string",
"analyzer": "hu",
"store": false,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"store": false
}
}
}
}
}
}
}
Queries that I tried: match, term, span_term, query_string. All were executed on text and text.raw field.

"index": "not_analyzed" means this field will be not analyzed at all (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html). So it will not be even split into words. I believe this is not what you want.
Instead of that, you need to add new analyzer, which will include only tokenizer whitespace (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html):
"analyzer" : {
"hu" : {
"tokenizer" : "standard",
"filter" : [ "lowercase", "hu_HU" ]
},
"no_filter":{
"tokenizer" : "whitespace"
}
}
Then you need to use this new analyzer for your field:
"raw": {
"type": "string",
"analyzer": "no_filter",
"store": false
}

Related

Custom indexing template is not being applied

I have a project where I am to analyze and visualize access log data. I use Logstash to send data to Elasticsearch and then visualize some stuff with Kibana.
Everything has worked fine until I discovered that I needed the Path Hierarchy Analyzer to show what I want to. I now have a custom template (JSON) and changed the out section of my Logstash configuration. But when I index data, my template is not being applied.
(Version 5.2 of Elasticseach and Logstash, can't update since that is the version in use at the place where I work).
My JSON file is valid. As far as the input and filters go, my Logstash configuration is fine, too. I guess I made a mistake in the output.
I already tried setting manage_template to false. I also tried template_overwrite => "false" just for the sake of it.
I tried creating the index first (Kibana Dev Tools) and populating it after. I created the index template and then the index. That way my template was applied and when I created the index pattern, everything seemed correct. Then I indexed one of my log files. I ended up with a Courier Fetch Error. http://localhost:9200/_all/_mapping?pretty=1 showed my that while indexing my data a default template was being used instead of my custom one. Nothing was different from before adding a custom template.
I searched the web and read everything I could find on stackoverflow and in the elastic forum about custom templates not being applied. I tried out all the solutions provided there, that is why I ended up opting for a custom template saved locally and providing the path in my logstash output. But I am all out of ideas now.
This is the output of my logstash configuration:
output {
elasticsearch {
hosts => ["localhost:9200"]
template => "/etc/logstash/conf.d/template.json"
index => "beam-%{+YYYY.MM.dd}"
manage_template => "true"
template_overwrite => "true"
document_type => "beamlogs"
}
stdout {
codec => rubydebug
}
}
And this is my custom template:
{
"template": "beam_custom",
"index_patterns": "beam-*",
"order" : 5,
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"custom_path_tree": {
"tokenizer": "custom_hierarchy"
},
"custom_path_tree_reversed": {
"tokenizer": "custom_hierarchy_reversed"
}
},
"tokenizer": {
"custom_hierarchy": {
"type": "path_hierarchy",
"delimiter": "/"
},
"custom_hierarchy_reversed": {
"type": "path_hierarchy",
"delimiter": "/",
"reverse": "true"
}
}
}
},
"mappings": {
"beamlogs": {
"properties": {
"object": {
"type": "text",
"fields": {
"tree": {
"type": "text",
"analyzer": "custom_path_tree"
},
"tree_reversed": {
"type": "text",
"analyzer": "custom_path_tree_reversed"
}
}
},
"referral": {
"type": "text",
"fields": {
"tree": {
"type": "text",
"analyzer": "custom_path_tree"
},
"tree_reversed": {
"type": "text",
"analyzer": "custom_path_tree_reversed"
}
}
},
"#timestamp" : {
"type" : "date"
},
"action" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"datetime" : {
"type" : "date",
"format": "time_no_millis",
"fields" : {
"keyword" : {
"type": "keyword"
}
}
},
"id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"info" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"message" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"page" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"path" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"result" : {
"type" : "long"
},
"s_direct" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"s_limit" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"s_mobile" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"s_terms" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"size" : {
"type" : "long"
},
"sort" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
}
}
}
After indexing my data this is part of what I get with http://localhost:9200/_all/_mapping?pretty=1
"datetime" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"object" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
datetime should not have the type text. But worse than that, fields like objet.tree are not even created.
I really don't care about the wrong mapping for datetime, but I need to get the Path Hierarchy Analyzer to work. I just don't know what to do anymore.
So. What I just tried was creating the index template in Kibana.
PUT _template/beam_custom
/followed by what is in my template.json
I then checked if the template was created.
GET _template/beam_custom
The output was this:
{
"beam_custom": {
"order": 100,
"template": "beam_custom",
"settings": {
"index": {
"analysis": {
"analyzer": {
"custom_path_tree_reversed": {
"tokenizer": "custom_hierarchy_reversed"
},
"custom_path_tree": {
"tokenizer": "custom_hierarchy"
}
},
"tokenizer": {
"custom_hierarchy": {
"type": "path_hierarchy",
"delimiter": "/"
},
...
So I guess creating the template worked.
Then I created an index
PUT beam-2019-07-15
But when I checked the index, I got this:
{
"beam-2019.07.15": {
"aliases": {},
"mappings": {},
"settings": {
"index": {
"creation_date": "1563044670605",
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "rGzplctSQDmrI_NSlt47hQ",
"version": {
"created": "5061699"
},
"provided_name": "beam-2019.07.15"
}
}
}
}
Shouldn't the index pattern have been recognized? I think this is the heart of the problem. I thought that my template would have been used and the output should have been something like this instead:
{
"beam-2019.07.15": {
"aliases": {},
"mappings": {
"logs": {
"properties": {
"#timestamp": {
"type": "date"
},
"action": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},...
Why doesn't it recognize the pattern?
So, I found the mistake.
When I looked up how to build my own template, at some point I looked at the documentation for the current version. But in 5.2., "index_patterns =>" doesn't exist.
"template": "beam_custom",
"index_patterns": "beam-*",
This doesn't work then, of course.
Instead, I dropped the "index_patterns" line and defined my pattern in the template-parameter.
"template": ["beam-*"],
//rest
This fixed the problem. After that, my pattern was recognized.
Yet I am facing a different problem now. The Path Hierarchy Analyzer is not working properly. object.tree and the rest of the fields I want are not being created.
GET beam-*/_search
{
"query": {
"term": {
"object.tree": "/belletristik/"
}
}
}
yields nothing, though I should have a few hundred hits. Looking at my data, there are no analyzed fields for my paths. Any ideas?

Search with asciifolding and UTF-8 characters in Elasticsearch

I am indexing all the names on a web page with characters with accents like "José". I want to be able to search the this name with "Jose" and "José".
How should I set up my index mapping and analyzer(s) for a simple index with one field "name"?
I set up an analyzer for the name field like this:
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": ["lowercase", "asciifolding"]
}
}
But it folds all accents into ASCII equivalents and ignores the accent when indexing the "é". I want the "é" char to be in the index and I want to be able to search "José" with either "José" or "Jose".
You need to preserve the original token with the accent. To achieve that you need to redefine your own asciifolding token filter, like this:
PUT /my_index
{
"settings" : {
"analysis" : {
"analyzer" : {
"folding" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "my_ascii_folding"]
}
},
"filter" : {
"my_ascii_folding" : {
"type" : "asciifolding",
"preserve_original" : true
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"name": {
"type": "text",
"analyzer": "folding"
}
}
}
}
}
After that, both tokens jose and josé will be indexed and searchable
This is what I can think of to resolve the folding problem with diacritical marks:
Analyzer used:
{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
}
}
}
}
Below is the mapping to be used:
mappings used:
{
"properties": {
"title": {
"type": "string",
"analyzer": "standard",
"fields": {
"folded": {
"type": "string",
"analyzer": "folding"
}
}
}
}
}
The title field uses the standard analyzer and will contain the original word with diacritics in place.
The title.folded field uses the folding analyzer, which strips the diacritical marks.
Below is the search query I will use:
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "esta loca",
"fields": [ "title", "title.folded" ]
}
}
}

Elastic Search,lowercase search doesnt work

I am trying to search again content using prefix and if I search for diode I get results that differ from Diode. How do I get ES to return result where both diode and Diode return the same results? This is the mappings and settings I am using in ES.
"settings":{
"analysis": {
"analyzer": {
"lowercasespaceanalyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"articles": {
"properties": {
"title": {
"type": "text"
},
"url": {
"type": "keyword",
"index": "true"
},
"imageurl": {
"type": "keyword",
"index": "true"
},
"content": {
"type": "text",
"analyzer" : "lowercasespaceanalyzer",
"search_analyzer":"whitespace"
},
"description": {
"type": "text"
},
"relatedcontentwords": {
"type": "text"
},
"cmskeywords": {
"type": "text"
},
"partnumbers": {
"type": "keyword",
"index": "true"
},
"pubdate": {
"type": "date"
}
}
}
}
here is an example of the query I use
POST _search
{
"query": {
"bool" : {
"must" : {
"prefix" : { "content" : "capacitance" }
}
}
}
}
it happens because you use two different analyzers at search time and at indexing time.
So when you input query "Diod" at search time because you use "whitespace" analyzer your query is interpreted as "Diod".
However, because you use "lowercasespaceanalyzer" at index time "Diod" will be indexed as "diod". Just use the same analyzer both at search and index time, or analyzer that lowercases your strings because default "whitespace" analyzer doesn't https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html
There will be no term of Diode in your index. So if you want to get same results, you should let your query context analyzed by same analyzer.
You can use Query string query like
"query_string" : {
"default_field" : "content",
"query" : "Diode",
"analyzer" : "lowercasespaceanalyzer"
}
UPDATE
You can analyze your context before query.
AnalyzeResponse resp = client.admin().indices()
.prepareAnalyze(index, text)
.setAnalyzer("lowercasespaceanalyzer")
.get();
String analyzedContext = resp.getTokens().get(0);
...
Then use analyzedContext as new query context.

Multiple document types with same mapping in Elasticseach

I have index named test which can be associated to n number of documents types named sub_test_1 to sub_text_n. But all will have same mapping.
Is there any way to make an index such all document types have same mapping for their documents? I.e. test\sub_text1\_mapping should be same as test\sub_text2\_mapping.
Otherwise if I have like 1000 document types, I will we having 1000 mappings of the same type referring to each document types.
UPDATE:
PUT /test_index/
{
"settings": {
"index.store.type": "default",
"index": {
"number_of_shards": 5,
"number_of_replicas": 1,
"refresh_interval": "60s"
},
"analysis": {
"filter": {
"porter_stemmer_en_EN": {
"type": "stemmer",
"name": "porter"
},
"default_stop_name_en_EN": {
"type": "stop",
"name": "_english_"
},
"snowball_stop_words_en_EN": {
"type": "stop",
"stopwords_path": "snowball.stop"
},
"smart_stop_words_en_EN": {
"type": "stop",
"stopwords_path": "smart.stop"
},
"shingle_filter_en_EN": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "2",
"output_unigrams": true
}
}
}
}
}
Intended mapping:
{
"sub_text" : {
"properties" : {
"_id" : {
"include_in_all" : false,
"type" : "string",
"store" : true,
"index" : "not_analyzed"
},
"alternate_id" : {
"include_in_all" : false,
"type" : "string",
"store" : true,
"index" : "not_analyzed"
},
"text" : {
"type" : "multi_field",
"fields" : {
"text" : {
"type" : "string",
"store" : true,
"index" : "analyzed",
},
"pdf": {
"type" : "attachment",
"fields" : {
"pdf" : {
"type" : "string",
"store" : true,
"index" : "analyzed",
}
}
}
}
}
}
}
}
I want this mapping to be an individual mapping for all sub_texts I create so that I can change it for one sub_text without affecting others e.g. I may want to add two custom analyzers to sub_text1 and three analyzers to sub_text3, rest others will stay same.
UPDATE:
PUT /my-index/document_set/_mapping
{
"properties": {
"type": {
"type": "string",
"index": "not_analyzed"
},
"doc_id": {
"type": "string",
"index": "not_analyzed"
},
"plain_text": {
"type": "string",
"store": true,
"index": "analyzed"
},
"pdf_text": {
"type": "attachment",
"fields": {
"pdf_text": {
"type": "string",
"store": true,
"index": "analyzed"
}
}
}
}
}
POST /my-index/document_set/1
{
"type": "d1",
"doc_id": "1",
"plain_text": "simple text for doc1."
}
POST /my-index/document_set/2
{
"type": "d1",
"doc_id": "2",
"pdf_text": "cGRmIHRleHQgaXMgaGVyZS4="
}
POST /my-index/document_set/3
{
"type": "d2",
"doc_id": "3",
"plain_text": "simple text for doc3 in d2."
}
POST /my-index/document_set/4
{
"type": "d2",
"doc_id": "4",
"pdf_text": "cGRmIHRleHQgaXMgaGVyZSBpbiBkMi4="
}
GET /my-index/document_set/_search
{
"query" : {
"filtered" : {
"filter" : {
"term" : {
"type" : "d1"
}
}
}
}
}
This gives me the documents related to type "d1". How to add analyzers only to document of type "d1"?
At the moment a possible solution is to use index templates or dynamic mapping. However they do not allow wildcard type matching so you would have to use the _default_ root type to apply the mappings to all types in the index and thus it would be up to you to ensure that all your types can be applied to the same dynamic mapping. This template example may work for you:
curl -XPUT localhost:9200/_template/template_1 -d '
{
"template" : "test",
"mappings" : {
"_default_" : {
"dynamic": true,
"properties": {
"field1": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
'
Do not do this.
Otherwise if I have like 1000 document types, I will we having 1000 mappings of the same type referring to each document types.
You're exactly right. For every additional _type with an identical mapping you are needlessly adding to the size of your index's mapping. They will not be merged, nor will any compression save you.
A much better solution is to simply create a shared _type and to create a field that represents the intended type. This completely avoids having wasted mappings and all of the negatives associated with it, including an unnecessary increase for your cluster state's size.
From there, you can imitate what Elasticsearch is doing for you and filter on your custom type without ballooning your mappings.
$ curl -XPUT localhost:9200/my-index -d '{
"mappings" : {
"my-type" : {
"properties" : {
"type" : {
"type" : "string",
"index" : "not_analyzed"
},
# ... whatever other mappings exist ...
}
}
}
}'
Then, for any search against sub_text1 (etc.), then you can do a term (for one) or terms (for more than one) filter to imitate the _type filter that would happen for you.
$ curl -XGET localhost:9200/my-index/my-type/_search -d '{
"query" : {
"filtered" : {
"filter" : {
"term" : {
"type" : "sub_text1"
}
}
}
}
}'
This is doing the same thing as the _type filter and you can create _aliases that contain the filter if you want to have the higher level search capability without exposing client-level logic to the filtering.

Elasticsearch multi-word, multi-field search with analyzers

I want to use elasticsearch for multi-word searches, where all the fields are checked in a document with the assigned analyzers.
So if I have a mapping:
{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
}
}
},
"mappings" : {
"typeName" :{
"date_detection": false,
"properties" : {
"stringfield" : {
"type" : "string",
"index" : "folding"
},
"numberfield" : {
"type" : "multi_field",
"fields" : {
"numberfield" : {"type" : "double"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
},
"datefield" : {
"type" : "multi_field",
"fields" : {
"datefield" : {"type" : "date", "format": "dd/MM/yyyy||yyyy-MM-dd"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
}
}
}
}
}
As you see I have different types of fields, but I do know the structure.
What I want to do is starting a search with a string to check all fields using the analyzers too.
For example if the query string is:
John Smith 2014-10-02 300.00
I want to search for "John", "Smith", "2014-10-02" and "300.00" in all the fields, calculating the relevance score as well. The better solution is the one that have more field matches in a single document.
So far I was able to search in all the fields by using multi_field, but in that case I was not able to parse 300.00, since 300 was stored in the string part of multi_field.
If I was searching in "_all" field, then no analyzer was used.
How should I modify my mapping or my queries to be able to do a multi-word search, where dates and numbers are recognized in the multi-word query string?
Now when I do a search, error occurs, since the whole string cannot be parsed as a number or a date. And if I use the string representation of the multi_search then 300.00 will not be a result, since the string representation is 300.
(what I would like is similar to google search, where dates, numbers and strings are recognized in a multi-word query)
Any ideas?
Thanks!
Using whitespace as filter in analyzer and then applying this analyzer as search_analyzer to fields in mapping will split query in parts and each of them would be applied to index to find the best matching. And using ngram for index_analyzer would very improve results.
I am using following setup for query:
"query": {
"multi_match": {
"query": "sample query",
"fuzziness": "AUTO",
"fields": [
"title",
"subtitle",
]
}
}
And for mappings and settings:
{
"settings" : {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"standard",
"lowercase",
"ngram"
]
}
},
"filter": {
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
}
},
"mappings": {
"title": {
"type": "string",
"search_analyzer": "whitespace",
"index_analyzer": "autocomplete"
},
"subtitle": {
"type": "string"
}
}
}
See following answer and article for more details.

Resources