How to search for # or . in Elasticsearch

I have a field, under a type company, in my Elasticsearch index which captures the technologies that the company uses. So people coming to our site might enter java, Java, C#, c#, .Net, .net, etc. in the search box to find the companies.
Initially I indexed this in the default way, and then I couldn't search for .Net or C# because of the special characters in the search query. When I searched for Net or C it returned companies that use C or C#, which again is not correct.
I did some research and changed the mapping for the field to "index": "not_analyzed" and re-indexed the companies. Now it returned the correct companies for C# and .Net, but it failed in cases where the search term was not an exact match. So it didn't return companies with Java technologies when the search term was java, but it returned them correctly when the search term was Java. I understand that not_analyzed requires an exact match.
How do I index and query on the same field to get both these cases sorted out?

The way to achieve what you need is to create a custom analyzer that does a little bit more than what not_analyzed does, i.e. it also lowercases the terms.
curl -XPUT localhost:9200/test_index -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "lowercase_keyword"
        }
      }
    }
  }
}'
Then when you index a document that contains Java, it will be indexed as java, C# as c#, and so on.
This gives you the benefit of case-insensitive exact matches.
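For example, a match query against the name field now runs the search input through the same analyzer, so .NET, .Net and .net all end up as the same term before matching (a quick sketch against the hypothetical test_index above):
curl -XPOST localhost:9200/test_index/test_type/_search -d '{
  "query": {
    "match": {
      "name": ".NET"
    }
  }
}'
Because the match query analyzes its input with the field's analyzer, the query term is lowercased to .net before it is compared to the indexed value.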

Related

How to conditionally apply an analyzer at index time to a field that could be one of many languages?

I have documents with a field (e.g. input_text) that contains a string that could be in one of 20-odd languages. I have another field that holds the short form of the language (e.g. lang).
I want to conditionally apply an analyzer at index time to the text field, depending on the language detected from the language field.
I eventually want a Kibana dashboard with a single word cloud of the most common words in the text field (i.e. in multiple languages), but only words that have been stemmed and tokenized with stop words removed.
Is there a way to do this?
The Elasticsearch docs suggest using multiple fields, one per language, and then specifying an analyzer for the appropriate field, but I can't do this as there are some 20 languages and that would overload my nodes.
There is no way to achieve what you want in Elasticsearch (applying an analyzer to field A based on the value of field B).
I would recommend creating one index per language, then creating an index alias that groups all those indices, and querying against that alias.
PUT lang_de
{
  "mappings": {
    "properties": {
      "input_text": {
        "type": "text",
        "analyzer": "german"
      }
    }
  }
}
PUT lang_en
{
  "mappings": {
    "properties": {
      "input_text": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "lang_*",
        "alias": "lang"
      }
    }
  ]
}
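Your ingest code would then pick the target index based on the lang field (that routing has to happen outside Elasticsearch), and all searches can go through the alias. A quick sketch with made-up sample data:
PUT lang_de/_doc/1
{
  "lang": "de",
  "input_text": "Die Kinder spielten im Wald"
}

GET lang/_search
{
  "query": {
    "match": {
      "input_text": "wald"
    }
  }
}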

Elasticsearch - can I define index time analyzer on document level?

I want to index pages in multiple languages into a single index, but for each language I need to define a custom language analyzer. So an English page would use the english analyzer, and a Czech page the czech analyzer.
At search time I would set the correct analyzer based on the current locale, as I do not need to search across languages.
It appears that this was possible in early versions of Elasticsearch, but I cannot find a way to do it in 7.6.
Is there a way to achieve this, or do I really need to create an index for each type in each language? That would lead to many indices with only a small number of documents each.
Or is there a better way to handle this scenario? We are considering about 20 languages and several document types (as far as I understand, types are now deprecated, so each needs its own index).
You can use the fields feature (multi-fields), available in Elasticsearch 7.6, which allows you to store the different languages in a single index; at query time you can simply target the sub-field for the language you want to query.
In fact, there is a nice official blog post from Elastic discussing different approaches to multilingual search, and the approach given here is inspired by the one it calls per-field language search.
Example
A sample index mapping would look like this:
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "fr": {
            "type": "text",
            "analyzer": "french"
          },
          "es": {
            "type": "text",
            "analyzer": "spanish"
          },
          "estonian": {
            "type": "text",
            "analyzer": "estonian"
          }
        }
      }
    }
  }
}
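At query time you can then hit only the sub-fields for the languages you care about, for instance with a multi_match across the variants. A sketch, assuming the mapping above lives in an index called my_index and using a made-up search term:
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "papillon",
      "fields": ["title", "title.fr", "title.es"],
      "type": "most_fields"
    }
  }
}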

How to remove a stop word from the default _english_ stop-words list in Elasticsearch?

I am filtering text using the default English stop words. I found that 'and' is an English stop word, but I need to search for results containing 'and'. I just want to remove the word 'and' from the default English stop-words filter and keep using the other stop words as usual. My Elasticsearch schema looks similar to the below.
"settings": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "whitespace" ,
"filter": ["stop_english"]
}
}....,
"filter":{
"stop_english": {
"type": "stop",
"stopwords": "_english_"
}
}
I expect to see the docs containing the word AND with the _search API.
You can set the stop words for a given index manually like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
I also found the list of English stop words used by Elasticsearch here. If you manually set that same list of stop words, minus "and", in a new index and reindex your data into that newly configured index, you should be good to go!
Regarding reindexing your data, you should check out the Reindex API. I believe it is required, since the tokenization of your data happens at ingest time, so you need to redo the ingestion by reindexing. It is required most of the time when changing index settings or making mapping changes (not 100% sure, but I think it makes sense).
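Putting the two together, here is a sketch of a new index whose default analyzer uses the _english_ list minus "and" (double-check the word list against the docs linked above), followed by a Reindex API call to copy the old data over; my_old_index and my_new_index are placeholder names:
PUT /my_new_index
{
  "settings": {
    "analysis": {
      "filter": {
        "stop_english": {
          "type": "stop",
          "stopwords": ["a", "an", "are", "as", "at", "be", "but", "by", "for", "if",
                        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
                        "that", "the", "their", "then", "there", "these", "they", "this",
                        "to", "was", "will", "with"]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["stop_english"]
        }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "my_old_index" },
  "dest": { "index": "my_new_index" }
}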

Boost field on index in Elastic

I'm using Elasticsearch 1.7.3 and I would like to boost some fields in an index with documents like this fictional example:
{
  "title": "Mickey Mouse",
  "content": "Mickey Mouse is a fictional ...",
  "related_articles": [
    { "title": "Donald Duck" },
    { "title": "Goofy" }
  ]
}
Here, e.g., title is really important, content too, and related_articles is a bit less important. My real documents have a lot of fields and nested objects.
I would like to give more weight to the title field than to content, and more to content than to related_articles.
I have seen the title^5 approach, but I would have to use it in every query and (I guess) list all my fields instead of querying _all.
I have searched a lot, but I mostly found deprecated solutions (e.g. _boost).
As I used to work with Sphinx, I am looking for something like its field weights option, where you can give more weight to the fields that are really important in your index than to others.
You're right that the _boost meta-field that you could use at the type level has been deprecated.
But you can still use the boost property when defining each field in your mapping, which will boost your field at indexing time.
Your mapping would look like this:
{
  "my_type": {
    "properties": {
      "title": {
        "type": "string", "boost": 5
      },
      "content": {
        "type": "string", "boost": 4
      },
      "related_articles": {
        "type": "nested",
        "properties": {
          "title": {
            "type": "string", "boost": 3
          }
        }
      }
    }
  }
}
You have to be aware, though, that it's not necessarily a good idea to boost your field at index time, because once set, you cannot change it unless you are willing to re-index all of your documents, whereas using query-time boosting achieves the same effect and can be changed more easily.
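For comparison, the query-time equivalent would look roughly like the sketch below (related_articles is nested, so its title has to be boosted inside a nested query); the boost values are just the ones from the mapping above:
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "mickey",
            "fields": ["title^5", "content^4"]
          }
        },
        {
          "nested": {
            "path": "related_articles",
            "query": {
              "match": {
                "related_articles.title": { "query": "mickey", "boost": 3 }
              }
            }
          }
        }
      ]
    }
  }
}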

ElasticSearch what analyzer should be used for searching for both url fragment and exact url path

I want to store a URI in a mapping and I want to make it searchable in the following ways:
Exact match (i.e. if I stored http://stackoverflow.com/questions, then searching for the term http://stackoverflow.com/questions retrieves the item).
A bit like the letter tokenizer, all "words" should be searchable. So searching for questions, stackoverflow, or maybe com will bring back http://stackoverflow.com/questions as a hit.
URL fragments separated by '.' or '/' should still be searchable. So searching for stackoverflow.com will bring back http://stackoverflow.com/questions as a hit.
Searching should be case-insensitive (i.e. lowercased).
The http://, https://, www. etc. should be optional for searching. So searching for either http://stackoverflow.com or stackoverflow.com will bring back http://stackoverflow.com/questions as a hit.
Maybe the solution is something like chaining tokenizers. I'm quite new to ES, so this may be a trivial question.
So what kind of analyzer should I use/build to achieve this functionality?
Any help would be greatly appreciated.
You are absolutely correct. You will want to set your field type to multi_field and then create analyzers for each scenario. At the core, you can then do a multi_match query:
=============type properties===============
{
  "fun_documents": {
    "properties": {
      "url": {
        "type": "multi_field",
        "fields": {
          "keyword": {
            "type": "string",
            "analyzer": "keyword"
          },
          "alphanum_only": {
            "type": "string",
            "analyzer": "my_custom_alpha_num_analyzer"
          },
          {
            "etc": "etc"
          }
        }
      }
    }
  }
}
==================query=====================
{
  "query": {
    "multi_match": {
      "query": "stackoverflow",
      "fields": [
        "url.keyword",
        "url.alphanum_only",
        "url.optional_fun"
      ]
    }
  }
}
Note that you can get fancy with multi_field aliases and reusing the same name, but this is the simple demonstration.
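The answer only names my_custom_alpha_num_analyzer without defining it; one possible definition (an assumption on my part, including the tokenizer name alpha_num_splitter) is a custom analyzer built on a pattern tokenizer that splits on anything that is not a letter or digit, plus a lowercase filter, so http://stackoverflow.com/questions becomes the tokens http, stackoverflow, com, and questions:
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "alpha_num_splitter": {
          "type": "pattern",
          "pattern": "[^a-zA-Z0-9]+"
        }
      },
      "analyzer": {
        "my_custom_alpha_num_analyzer": {
          "type": "custom",
          "tokenizer": "alpha_num_splitter",
          "filter": ["lowercase"]
        }
      }
    }
  }
}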
