Elasticsearch Dynamic Analyzer & synonyms

Hi, I have a use case where I want my application to dynamically decide on xyz_tokenizer, xyz_filter, xyz_synonyms, etc.,
something similar to this:
'''
GET test/_search
{
  "query": {
    "match": {
      "content": {
        "query": "search_text",
        "analyzer": {
          "filter": "xyz_filter",
          "tokenizer": "xyz_tokenizer"
        }
      }
    }
  }
}
'''
However, it throws an error. As per the Elasticsearch docs, I found that we can only specify analyzers that are already defined in the index settings. Similarly, how can I specify filters and tokenizers dynamically?

You can't; these analyzers need to be registered in your index. What you can do is choose a search-time analyzer dynamically, according to your requirements.
But at index time you can't add them dynamically; they need to be present in your index settings. You can also change the index settings to add a new analyzer and add new fields that use the newly added analyzer (incremental changes), but changing the existing analyzer of a field is a breaking change and you need to reindex the whole data set.
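For illustration, a minimal sketch of the working search-time variant: the analyzer parameter accepts only the name of an analyzer already registered in the index settings, not an inline tokenizer/filter definition (my_registered_analyzer below is a placeholder for whatever you defined in your index).
# my_registered_analyzer is a placeholder for an analyzer defined in your index settings
GET test/_search
{
  "query": {
    "match": {
      "content": {
        "query": "search_text",
        "analyzer": "my_registered_analyzer"
      }
    }
  }
}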

Related

ElasticSearch: how can I check the way multiple analyzers split a text into tokens?

I'm quite new to ElasticSearch, so if I overlook something obvious/basic, please forgive me.
I'm using ElasticSearch and want to see how the analyzers applied to an index split a text into tokens. As far as the result of GET /my_index/_settings shows, multiple analyzers are applied to the index, like the following:
{
  "my_index": {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_ngram_search_analyzer": {},
          "my_ngram_index_analyzer": {},
          "my_kuromoji_search_analyzer": {},
          "my_kuromoji_index_analyzer": {}
        }
      }
    }
  }
}
So I tried to analyze a text with multiple analyzers using an array, but it does not work.
GET /my_index/_analyze
{
  "text": "本日は晴天なり",
  "analyzer": ["my_ngram_search_analyzer", "my_kuromoji_search_analyzer"]
}
How can I perform this?
I'd appreciate it if you would share some information.
The Analyze API doesn't support analyzing a text with multiple analyzers; it works with just one analyzer per request. In an index you can define many analyzers, which can be applied to various fields of that Elasticsearch index.
You also can't apply multiple index-time or search-time analyzers to a single field in Elasticsearch.
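If it helps, the practical workaround is simply to call the Analyze API once per analyzer and compare the resulting token lists; a sketch using the analyzer names from your settings:
# one _analyze request per analyzer, same text each time
GET /my_index/_analyze
{
  "analyzer": "my_ngram_search_analyzer",
  "text": "本日は晴天なり"
}

GET /my_index/_analyze
{
  "analyzer": "my_kuromoji_search_analyzer",
  "text": "本日は晴天なり"
}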
Hope this helps.

Elasticsearch Suggest+Synonyms+fuzziness

I am looking for a way to implement auto-suggest with synonyms & fuzziness.
For example, when the user tries to search for "replce ar",
my synonym list has ar => audio record,
so the result should include the items matching:
changing audio record
replacing audio record
etc.
Here we need fuzziness because there is a typo in "replace" (in the user's search text).
Synonyms to match ar => audio record.
Auto-suggest with a regex pattern.
Is it possible to implement all three features in a single field?
Edit:
A regex + fuzzy query just throws an error.
I hadn't explained my need for a regex pattern well:
I needed a regex for doing a partial-word lookup ('encyclopedic' contains 'cyclo').
After investigating what options I have for this purpose, which pointed me to the NGram tokenizer, and looking into the other suggesters, I found that the phrase suggester may really be what I'm looking for, so I'll try it and report back.
Yes, you can use synonyms as well as fuzziness for suggestions. Synonyms are handled by defining a synonym token filter and adding that filter to your analyzer. Then, when you create the field mapping for the field(s) you want to use for suggestions, you assign that analyzer to the field.
As for fuzziness, that happens at query time. Most text-based queries support a fuzziness option, which allows you to specify how many corrections to allow. The default auto value adjusts the number of corrections depending on how long the term is, so that's usually best.
Notional analysis setup (synonym_graph reference)
{
  "analysis": {
    "filter": {
      "synonyms": {
        "type": "synonym_graph",
        "expand": "false",
        "synonyms": [
          "ar => audio record"
        ]
      }
    },
    "analyzer": {
      "synonyms": {
        "tokenizer": "standard",
        "type": "custom",
        "filter": [
          "lowercase",
          "synonyms"
        ]
      }
    }
  }
}
Notional Field Mapping (Analyzer + Mapping reference)
(Note that the analyzer matches the name of the analyzer defined above)
{
  "properties": {
    "suggestion": {
      "type": "text",
      "analyzer": "synonyms"
    }
  }
}
Notional Query
{
  "query": {
    "match": {
      "suggestion": {
        "query": "replce ar",
        "fuzziness": "auto",
        "operator": "and"
      }
    }
  }
}
Keep in mind that there are several different options for suggestions, so depending on which option you use, you may need to adjust the way the field is mapped, or even add another token filter to the analyzer. But a custom analyzer is just a tokenizer plus a series of token filters, so you can usually combine whatever token filters you need to achieve your goal. Just make sure you understand what each filter is doing so you get the filters in the correct order.
If you get stuck in part of this process, just submit another question with the specific issue you're running into. Good luck!

Remove custom analyzer / filter on Elasticsearch for all new indexes

We tried adding a custom analyzer / lowercase normalizer to all new indexes in Elasticsearch. It looks something like this:
"analysis": {
"normalizer": {
"lowercase_normalizer": {
"filter": [
"lowercase"
],
"type": "custom",
"char_filter": []
}
}
},
This is automatically applied to all new indexes. How do I remove it? I realize I cannot remove it from existing indexes, but how do I stop it from being automatically added to new ones?
It appears that these settings are located somewhere in my master "template". I can see the template using GET /_template, which contains all of the unwanted lowercase normalizers... but how do I remove them?
Thanks!
Here is how you can delete an index template:
DELETE /_template/template_1
Also, in the future, if you want to add a new custom analyzer to any template, first create a test index with your custom analyzer and then check whether the analyzer gives you the desired results, as follows:
GET <index_name>/_analyze
{
  "analyzer": "analyzer_name",
  "text": "this is a test"
}
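If you'd rather keep the rest of the template and only drop the normalizer, one option is to re-PUT the template under the same name with the analysis block removed, since a PUT replaces the whole template. A rough sketch only: template_1, the pattern, and the settings below are placeholders, so copy the real body from GET /_template for your cluster and delete the "analysis" section from it.
# placeholder body: start from what GET /_template/template_1 returns, minus the "analysis" block
PUT /_template/template_1
{
  "index_patterns": ["*"],
  "settings": {
    "number_of_shards": 1
  }
}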

Specifying Field Types Indexing from Logstash to Elasticsearch

I have successfully ingested data using the XML filter plugin from Logstash to Elasticsearch; however, all the fields are of type "text."
Is there a way to manually or automatically specify the correct type?
I found the following technique good for my use:
Logstash can filter the data and change a field from the default text type to whatever form you want. The documentation can be found here. The example given in the documentation is:
filter {
  mutate {
    convert => { "fieldname" => "integer" }
  }
}
You add this in the /etc/logstash/conf.d/02-... file, in the filter section. I believe the downside of this practice is that, from my understanding, it is less recommended to alter data on its way into Elasticsearch.
After you do this you will probably run into a mapping-conflict problem. If your DB is a test DB whose old data you can erase, just DELETE the index so that there is no conflict (for example, if a field that until now was text is now received as a date, old and new data would conflict). If you can't erase the old data, then read the answer in the link I mentioned.
What you want to do is specify a mapping template.
PUT _template/template_1
{
  "index_patterns": ["te*", "bar*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "type1": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "host_name": {
          "type": "keyword"
        },
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z YYYY"
        }
      }
    }
  }
}
Change the settings to match your needs, such as listing the properties you want and what you want them mapped to.
Setting index_patterns is especially important because it tells Elasticsearch how to apply this template. You can set an array of index patterns and use * as a wildcard where appropriate. For example, Logstash's default is to rotate indices by date; they will look like logstash-2018.04.23, so your pattern could be logstash-* and any index that matches the pattern will receive the template.
If you want to match based on some pattern, then you can use dynamic templates.
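For instance, a minimal sketch of a dynamic template that maps any new string field whose name ends in _id as keyword instead of text. The template name strings_as_keywords, the *_id field pattern, and the type1 type name are illustrative assumptions, following the same older mapping-type syntax as the example above.
# strings_as_keywords and *_id are placeholders for your own rule name and field-name pattern
PUT _template/template_2
{
  "index_patterns": ["logstash-*"],
  "mappings": {
    "type1": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "match": "*_id",
            "mapping": {
              "type": "keyword"
            }
          }
        }
      ]
    }
  }
}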
Edit: adding a little update here. If you want Logstash to apply the template for you, here is a link to the settings you'll want to be aware of.

Elasticsearch to wildcard search email addresses

I'm trying to use elasticsearch for a project I'm working on. I was wondering if someone could help steer me in the right direction. I'm using an index with 100+ million records.
I need to be able to search with a wildcard query like the following:
b*g@gmail.com
b*g@*.com
*gus@gmail.com
br*gu*@gmail.com
*g*@*
When I try using wildcard and other searches, I don't get the results I expect.
What type of search should I look into implementing with Elasticsearch? Is Elasticsearch even the right tool to be using? The source I'm pulling this from is MySQL, so if not I may consider using Sphinx or Solr.
I assume that you have tried out the wildcard query as described here.
However, it behaves very differently depending on whether your email field is analyzed or not analyzed. I would suggest you delete your index and change your mapping, e.g.:
PUT /emails
{
  "mappings": {
    "email": {
      "properties": {
        "email": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
Once you have this, you can just do the normal wildcard query or query_string. e.g.
GET emails/_search
{
  "query": {
    "wildcard": {
      "email": {
        "value": "s*com"
      }
    }
  }
}
As an aside, when you just index email without setting it as not_analyzed, the default mapping actually splits the email prefix from the domain, and that's why you don't get results when you search for s*@gmail.com. You would still get results for s* or *gmail.com, but for your case, using not_analyzed works correctly. If you want to support case insensitivity, then you might want to look at a custom analyzer that uses the uax_url_email tokenizer as described here.
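As a rough sketch of that last suggestion (the index, analyzer, and field names are placeholders, and this analyzer replaces the not_analyzed mapping shown earlier): a custom analyzer built on the uax_url_email tokenizer with a lowercase filter keeps each address as a single lowercased token, so case-insensitive wildcard patterns can match it.
# email_analyzer is a placeholder name; mapping uses the same older string syntax as above
PUT /emails
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "email": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email_analyzer"
        }
      }
    }
  }
}
Note that the wildcard query itself is not analyzed, so you would lowercase the search pattern before sending it.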
