How to keep *only* the longest term produced by the PathHierarchy tokenizer in Elasticsearch?

I need to use the PathHierarchy tokenizer during the indexing stage (so I can generate terms like "a", "a/b", "a/b/c").
But during the search stage I would like to keep only the longest term ("a/b/c"). I need this because Kibana uses query_string-type queries, so the query string itself is analyzed.
(My question regarding the Kibana queries is here:
do the queries for values analyzed with hierarchical path work correctly in Kibana and ElasticSearch?)
Is it possible to create a custom analyzer that uses the path_hierarchy tokenizer and then applies a filter that keeps only the longest term?

You can use a different analyzer for indexing and for searching. Maybe this mapping can help you:
PUT /myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "path": {
          "type": "string",
          "index_analyzer": "path_hierarchy",
          "search_analyzer": "keyword"
        }
      }
    }
  }
}
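To confirm what each side produces, the _analyze endpoint is handy (a minimal sketch using the current request syntax; older versions take query parameters instead):
GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "a/b/c"
}
This returns the terms a, a/b and a/b/c. The same text run through the keyword analyzer comes back as the single term a/b/c, which is what the search side will then look up, so only the longest path form is ever queried.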

Related

Elasticsearch 7.9 forward slashes

I'm using Elasticsearch 7.9.1 and want to search for "/abc" (including the forward slash) in the field named "Path", as in "mysite.com/abc/xyz". Here's the index template, but it doesn't work:
"Path": {
  "type": "text",
  "index": false
}
What did I do wrong? Can you please help? Thanks!
They changed the syntax for "not analyzed" text only once (in ES 5), from
{
  "type": "string",
  "index": "not_analyzed"
}
to
{
  "type": "keyword"
}
If you want special characters like / not to be removed at indexing time during analysis, you should use keyword instead of text.
Moreover, if your intent is to search within URLs, you should prefer the wildcard field type, or keep using text with an appropriate custom analyzer that splits your URLs into parts.
If you upgrade to 7.11, you also have access to the URI parts ingest processor, which does all the work for you.
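As a concrete sketch (the index name myindex is illustrative; the field name is taken from the question), mapping Path as keyword and matching the slash with a wildcard query could look like this:
PUT /myindex
{
  "mappings": {
    "properties": {
      "Path": { "type": "keyword" }
    }
  }
}

GET /myindex/_search
{
  "query": {
    "wildcard": {
      "Path": {
        "value": "*/abc*"
      }
    }
  }
}
The wildcard query on a keyword field sees the raw value mysite.com/abc/xyz, so the slash is preserved. Note that leading wildcards can be slow on large indices, which is where the wildcard field type mentioned above helps.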

Elasticsearch - can I define index time analyzer on document level?

I want to index pages in multiple languages into a single index, but for each language I need to use a custom language analyzer: an English page would use the english analyzer, a Czech page the czech analyzer.
At search time I would set the correct analyzer based on the current locale, as I do not need to search across languages.
It appears that this was possible in early versions of Elasticsearch, but I cannot find a way to do it in 7.6.
Is there a way to achieve this, or do I really need to create an index for each type in each language? That would lead to many indices with only a small number of indexed documents.
Or is there a better way to handle this scenario? We are considering about 20 languages and several document types (as far as I understand, types are now deprecated, so each needs its own index).
You can use the fields feature (multi-fields), available in Elasticsearch 7.6, which allows you to store the different languages in a single index; at query time you can then target just the subfield for the language you want to query.
In fact, there is a nice official blog post from Elastic discussing different approaches to multi-lingual search, and the approach given here is inspired by the one it calls per-field language search.
Example
A sample index mapping would look like this:
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "fr": {
            "type": "text",
            "analyzer": "french"
          },
          "es": {
            "type": "text",
            "analyzer": "spanish"
          },
          "estonian": {
            "type": "text",
            "analyzer": "estonian"
          }
        }
      }
    }
  }
}
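At query time you pick the subfield that matches the user's locale. A minimal sketch, assuming the mapping above lives in an index called pages:
GET /pages/_search
{
  "query": {
    "match": {
      "title.fr": "bonjour"
    }
  }
}
Searching title uses the english analyzer, title.fr the french one, and so on; each subfield is indexed from the same source value, so no duplicate documents are needed.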

Elasticsearch - match not_analyzed field with partial search term

I have a "name" field, set to not_analyzed, in my Elasticsearch index.
Let's say the value of the "name" field is "some name". My question is: if I search for the term some name some_more_name someother name, will not_analyzed allow a match because the query contains some name in it? If not, how can I get a match for this search term?
During indexing, the text of the name field is stored in the inverted index. If this field were analyzed, 2 terms would go into the inverted index: some and name. But as it is not analyzed, only 1 term is stored: some name.
During the search (using a match query), your search query is analyzed and tokenized by default, so there will be several terms: some, name, some_more_name and someother. Elasticsearch will then look at the inverted index to see if at least one term from the search query is there. But the only term there is some name, so you won't see this document in the result set.
You can play with analyzers using the _analyze endpoint.
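For example (the index name my_index is just a placeholder):
GET my_index/_analyze
{
  "analyzer": "standard",
  "text": "some name some_more_name someother name"
}
This returns the individual tokens (some, name, some_more_name, someother, name), which is exactly what the match query searches for.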
Returning to your question: if you want to get a match for the proposed search query, your field must be analyzed.
If you need to keep the non-analyzed version as well, you should use multi-fields:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "keyword",
          "fields": {
            "analyzed": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}
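With that mapping in place, a sketch of a query against the analyzed subfield (same my_index as above):
GET my_index/_search
{
  "query": {
    "match": {
      "name.analyzed": "some name some_more_name someother name"
    }
  }
}
The match on name.analyzed tokenizes both sides, so the document with "some name" is found, while name itself stays available for exact keyword matches.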
Taras has explained it clearly, and your issue might already be resolved, but if you can't change the mapping of your index, you can use this query (I have tested it on ES 5.4):
GET test/_search
{
  "query": {
    "query_string": {
      "default_field": "namekey",
      "query": "*some* *name*",
      "default_operator": "OR"
    }
  }
}

How to search for # or . in Elasticsearch

I have a field, under a type company, in my Elasticsearch index which captures the technologies that the company uses. So people coming to our site might enter java, Java, C#, c#, .Net, .net etc. in the search box to get the companies.
Initially I indexed this in the default way and then I couldn't search for .Net or C#, as there were wildcard characters in the search query. When I searched with Net or C it returned companies that use C or C#, which again is not correct.
I did some research and changed the mapping for the field to "index": "not_analyzed" and re-indexed the companies. Now it returned the correct companies for C# and .Net, but failed when the search term was not an exact match. So it didn't return companies with Java technologies when the search term was java, but it returned them correctly when the search term was Java. I understand that not_analyzed requires an exact match.
How do I index and query on the same field to get both of these cases sorted out?
The way to achieve what you need is to create a custom analyzer that does a little bit more than not_analyzed, i.e. it also lowercases the terms.
curl -XPUT localhost:9200/test_index -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "lowercase_keyword"
        }
      }
    }
  }
}'
Then when you index a document that contains Java, it will be indexed as java, C# as c#, etc.
This brings the benefit of case-insensitive exact matches.
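A quick sketch of how this behaves (same test_index and field as above; the document ID and search value are just illustrative):
curl -XPUT localhost:9200/test_index/test_type/1 -d '{
  "name": "C#"
}'

curl -XPOST localhost:9200/test_index/_search -d '{
  "query": {
    "match": {
      "name": "c#"
    }
  }
}'
Both C# and c# (or .Net and .net) analyze to the same single lowercase token, so either casing finds the document, while Net alone no longer matches C#.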

Is match query case sensitive in elasticsearch?

I have followed an example from here
The mapping for the index is
{
  "mappings": {
    "my_type": {
      "properties": {
        "full_text": {
          "type": "string"
        },
        "exact_value": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
And the document indexed is
{
  "full_text": "Quick Foxes!",
  "exact_value": "Quick Foxes!"
}
I have noticed while using a simple match query on the "full_text" field like below
{
  "query": {
    "match": {
      "full_text": "quick"
    }
  }
}
I can see that the document matches. If I use uppercase, that is "QUICK", as the search term, it also shows the document matching.
Why is that? By default the tokenizer would have split the text in the "full_text" field into "quick" and "foxes". So how does the match query match the document for upper-cased values?
Because you haven't specified which analyzer to use for the "full_text" field in your index mapping, the default analyzer is used. The default is the standard analyzer.
Quote from the Elasticsearch docs:
An analyzer of type standard is built using the Standard Tokenizer with the Standard Token Filter, Lower Case Token Filter, and Stop Token Filter.
Before executing the query against your index, Elasticsearch applies the same analyzer configured for your field to your query values. Because the default analyzer uses the Lower Case Token Filter, searching for "Quick", "QUICK" or "quick" results in the same query: the filter lowercases them all to just "quick".
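You can see this directly with the _analyze endpoint (a minimal sketch; no index is needed since standard is built in):
GET _analyze
{
  "analyzer": "standard",
  "text": "Quick Foxes!"
}
The output tokens are quick and foxes, the same terms the indexed document produced, which is why both "quick" and "QUICK" match.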
