ElasticSearch: how can I check the way multiple analyzers split a text into tokens?

I'm quite new to ElasticSearch, so if I overlook something obvious/basic, please forgive me.
I'm using ElasticSearch and want to see how the analyzers applied to an index split a text into tokens. As the result of GET /my_index/_settings shows, multiple analyzers are applied to the index, like the following:
{
  "my_index": {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_ngram_search_analyzer": {},
          "my_ngram_index_analyzer": {},
          "my_kuromoji_search_analyzer": {},
          "my_kuromoji_index_analyzer": {}
        }
      }
    }
  }
}
So I tried to analyze a text with multiple analyzers using an array, but it does not work.
GET /my_index/_analyze
{
  "text": "本日は晴天なり",
  "analyzer": ["my_ngram_search_analyzer", "my_kuromoji_search_analyzer"]
}
How can I perform this?
I'd appreciate it if you would share some information.

The Analyze API doesn't support analyzing a text with multiple analyzers; it works with just one analyzer at a time. In an index you can define many analyzers, which can then be applied to various fields of the Elasticsearch index.
Note that you also cannot apply multiple index-time or search-time analyzers to a single field in Elasticsearch.
Hope this helps.
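A simple workaround, given the one-analyzer-at-a-time limit, is to issue one _analyze request per analyzer and compare the token lists. A minimal sketch using the analyzers from the question:

GET /my_index/_analyze
{
  "text": "本日は晴天なり",
  "analyzer": "my_ngram_search_analyzer"
}

GET /my_index/_analyze
{
  "text": "本日は晴天なり",
  "analyzer": "my_kuromoji_search_analyzer"
}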

Related

Elasticsearch Dynamic Analyzer & synonyms

Hi, I have a use case where I want my application to dynamically decide on xyz_tokenizer, xyz_filter, xyz_synonyms, etc.,
something similar to this
'''
GET test/_search
{
  "query": {
    "match": {
      "content": {
        "query": "search_text",
        "analyzer": {
          "filter": "xyz_filter",
          "tokenizer": "xyz_tokenizer"
        }
      }
    }
  }
}
'''
However, it throws an error. As per the Elasticsearch docs, I found out that we can specify only analyzers that are defined in the index settings. Similarly, how can I specify filters and tokenizers dynamically?
You can't; these analyzers need to be registered in your index. What you can do is choose the search-time analyzer dynamically, according to your requirements.
At index time, though, you can't add them dynamically; they need to be present in your index settings. You can change the index settings to add a new analyzer and add new fields that use the newly added analyzer (incremental changes), but changing the existing analyzer of a field is a breaking change and you need to reindex the whole data.
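To illustrate the search-time option: a minimal sketch, assuming an analyzer named my_search_analyzer has already been registered in the index settings (the name is hypothetical). The analyzer parameter of a match query must be a plain string referring to a registered analyzer, not an object of filters and tokenizers:

# my_search_analyzer is a hypothetical analyzer already defined in the index settings
GET test/_search
{
  "query": {
    "match": {
      "content": {
        "query": "search_text",
        "analyzer": "my_search_analyzer"
      }
    }
  }
}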

Elasticsearch - Making a field aggregatable but not searchable

My elasticsearch data has a large number of fields that I don't need to search by. But I would like to get aggregations like percentiles, median, count, avg. etc. on these fields.
Is there a way to disable searchability of a field but let it still be aggregatable?
Most fields are indexed by default, which makes them searchable. If you want to make a field non-searchable, all you have to do is set its index param to false and doc_values to true.
As per elastic documentation:
All fields which support doc values have them enabled by default.
So you need not explicitly set "doc_values": true for such fields.
For example:
{
  "mappings": {
    "_doc": {
      "properties": {
        "only_agg": {
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}
If you try to search on the field only_agg in the above example, Elasticsearch will throw an exception with the reason below:
Cannot search on field [only_agg] since it is not indexed.
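The field remains aggregatable thanks to doc values. A quick sketch of an aggregation that still works with the mapping above (the index and aggregation names are illustrative):

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "only_agg_counts": {
      "terms": { "field": "only_agg" }
    }
  }
}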
Yeah, take a look at doc_values:
https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html

Elasticsearch 6.2: terms query requires lowercase input when searching on keyword

I've created an example index, with the following mapping:
{
  "_doc": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "status": { "type": "keyword" }
    }
  }
}
And indexed a document:
{"status": "CMP"}
When searching the documents with this status with a terms query, I find no results:
{
  "query": {
    "terms": { "status": ["CMP"] }
  }
}
However, if I make the same query by putting the input in lowercase, I will find my document:
{
  "query": {
    "terms": { "status": ["cmp"] }
  }
}
Why is it? Since I'm searching on a keyword field, the indexed content should not be analyzed and should match an uppercase value...
@Oliver Charlesworth Now, in Elastic 6.x, you can continue to use a keyword datatype, lowercasing your text with a normalizer (doc here). However, in either case you should change your index mapping and reindex your docs.
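A minimal sketch of such a lowercase normalizer on a keyword field (the index and normalizer names are hypothetical; the mapping uses the 6.x _doc type):

PUT my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "status": {
          "type": "keyword",
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
}

With this in place, "CMP" and "cmp" match the same stored token: the normalizer is applied both at index time and to the input of term-level queries.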
The index and mapping creation and the search were part of a test suite. It seems that the setup part of the test suite was not executed, and the mapping was not applied to the index.
The index was then using the default types instead of the mapping types, resulting in the use of string fields instead of keywords.
After changing the setup method of the automated tests, the mappings are well applied to the index, and the uppercase values for the status "CMP" are now matching documents.
The symptoms you're seeing shouldn't occur, unless something else is wrong.
A keyword field is not analysed, so your index should contain only CMP. A terms query is also not analysed, so it searches only for CMP. Hence there should be a match.

In Elasticsearch match query how to deal with slash

I have a match query searching for a type of doc:
{
  "query": {
    "bool": {
      "should": {
        "match": {
          "ph1_enc": "EAAQnb1kMr/e2/ADqo"
        }
      }
    }
  }
}
"EAAQnb1kMr/e2/ADqo" is the string i'm trying to match, however in the search results I can see multiple records with substring "/e2/" are also returned.
Looks like "/e2/" is indexed separately, so that this could happen.I thought the match query is to do full-text match... Is it because I missed something when creating the template? Any idea?
Add-on instead of reindex, how to modify the query to match the exact value in the query?
Which analyzer did you set in the mapping to index your data?
If you are using the default one (the standard analyzer), then according to the documentation it uses the standard tokenizer, which also splits the text on slashes ('/'). The documentation redirects here for more information about the tokenizer.
So it will index the words 'EAAQnb1kMr', 'e2', and 'ADqo'. Accordingly, your query value will also be analyzed the same way the field was indexed. That is why documents with 'e2' are also being returned.
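You can see this tokenization yourself with the _analyze API; a quick sketch:

GET _analyze
{
  "analyzer": "standard",
  "text": "EAAQnb1kMr/e2/ADqo"
}

This returns the tokens eaaqnb1kmr, e2, and adqo (note that the standard analyzer also lowercases them).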
If you don't need to tokenize the 'ph1_enc' field, you can just set its type in the mapping as 'keyword'.
"properties": {
"ph1_enc": {
"type": "keyword"
}
}
That will not analyze the field, and it will match exactly when you query.
I hope that helps.
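As for the add-on question: if the index was created with default dynamic mapping, Elasticsearch typically creates a keyword subfield alongside the text field, so a term query against it may match the exact value without reindexing. A sketch under that assumption (the .keyword subfield and the index name are assumptions, not confirmed by the question):

GET my_index/_search
{
  "query": {
    "term": {
      "ph1_enc.keyword": "EAAQnb1kMr/e2/ADqo"
    }
  }
}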

Elasticsearch to wildcard search email addresses

I'm trying to use elasticsearch for a project I'm working on. I was wondering if someone could help steer me in the right direction. I'm using an index with 100+ million records.
I need to be able to search with a wildcard query like the following:
b*g@gmail.com
b*g@*.com
*gus@gmail.com
br*gu*@gmail.com
*g*@*
When I try using wildcard and other searches, I don't get the results I expect.
What type of search should I look into implementing with Elasticsearch? Is Elasticsearch even the right tool to be using? The source I'm pulling this from is MySQL, so if not I may consider using Sphinx or Solr.
I assume that you have tried out the wildcard query as described here.
However, it behaves very differently depending on whether your email field is analyzed or not analyzed. I would suggest you delete your index and change your mapping, e.g.
PUT /emails
{
  "mappings": {
    "email": {
      "properties": {
        "email": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
Once you have this, you can just do the normal wildcard query or query_string, e.g.
GET emails/_search
{
  "query": {
    "wildcard": {
      "email": {
        "value": "s*com"
      }
    }
  }
}
As an aside, when you just index the email without setting it as not_analyzed, the default mapping actually splits up the email prefix from the domain, and that's why you don't get results when you do s*@gmail.com. You would still get results for s* or *gmail.com, but for your case, using not_analyzed works correctly. If you want to support case insensitivity, then you might want to look at a custom analyzer that uses the uax_url_email tokenizer, as described here.
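A minimal sketch of such a custom analyzer, using the same era-appropriate string mapping syntax as the answer above (the analyzer name email_analyzer is illustrative): the uax_url_email tokenizer keeps an email address as a single token, and the lowercase filter makes wildcard matching case-insensitive.

PUT /emails
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "email": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email_analyzer"
        }
      }
    }
  }
}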
