I have the following text:
Lurasidone is a dopamine D<sub>2</sub>
I would like to tokenize it such that I get the following tokens:
Lurasidone
dopamine
D2
How do I achieve this using a tokenizer or filter? I've attempted to use the html_strip char filter; however, D<sub>2</sub> is tokenized as:
D
2
whereas I need it to tokenize as:
D2
You can use the Pattern Replace Char Filter. This is what I did:
"char_filter": {
"html_pattern": {
"type": "pattern_replace",
"pattern": "<.*>(.*)<\\/.*>",
"replacement": "$1"
}
}
I included that in my custom analyzer like this
"my_custom_analyzer": {
"tokenizer": "standard",
"char_filter": [
"html_pattern"
],
"filter": ["stop"]
}
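Putting the two snippets together, a full index definition plus an _analyze call to reproduce this would look roughly as follows (the index name test_html is just an example):
PUT test_html
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_pattern": {
          "type": "pattern_replace",
          "pattern": "<.*>(.*)<\\/.*>",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["html_pattern"],
          "filter": ["stop"]
        }
      }
    }
  }
}
GET test_html/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Lurasidone is a dopamine D<sub>2</sub>"
}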
These are the tokens generated for your text
{
"tokens": [
{
"token": "Lurasidone",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "dopamine",
"start_offset": 16,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "D2",
"start_offset": 25,
"end_offset": 38,
"type": "<ALPHANUM>",
"position": 5
}
]
}
I hope this helps.
My use case is to search for edge_ngrams with synonym support where the tokens to match should be in sequence.
While trying out the analysis, I observed two different behaviours of the filter chain with respect to position increments.
With the filter chain lowercase, synonym, there is no position increment due to the SynonymFilter.
With the filter chain lowercase, edge_ngram, synonym, there is a position increment due to the SynonymFilter.
Here are the queries I'm running for each case:
Case 1. No position increment
PUT synonym_test
{
"index": {
"analysis": {
"analyzer": {
"by_smart": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"custom_synonym"
]
}
},
"filter": {
"custom_synonym": {
"type": "synonym",
"synonyms": [
"begin => start"
]
}
}
}
}
}
GET synonym_test/_analyze
{
"text": "begin working",
"analyzer": "by_smart"
}
Output:
{
"tokens": [
{
"token": "start",
"start_offset": 0,
"end_offset": 5,
"type": "SYNONYM",
"position": 0
},
{
"token": "working",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 1
}
]
}
Case 2. Position increment
PUT synonym_test
{
"index": {
"analysis": {
"analyzer": {
"by_smart": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"custom_edge_ngram",
"custom_synonym"
]
}
},
"filter": {
"custom_synonym": {
"type": "synonym",
"synonyms": [
"begin => start"
]
},
"custom_edge_ngram": {
"type": "edge_ngram",
"min_gram": "2",
"max_gram": "60"
}
}
}
}
}
GET synonym_test/_analyze
{
"text": "begin working",
"analyzer": "by_smart"
}
Output:
{
"tokens": [
{
"token": "be",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "beg",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "begi",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "start",
"start_offset": 0,
"end_offset": 5,
"type": "SYNONYM",
"position": 1
},
{
"token": "wo",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "wor",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "work",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "worki",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "workin",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "working",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
}
]
}
Notice how in Case 1, when the token begin is replaced by start, the synonym gets the same position and there is no position increment. However, in Case 2, when the begin token is replaced by start, the position is incremented for the subsequent tokens in the stream.
Now here are my questions:
Why is the position increment not happening in Case 1 but happening in Case 2?
The main issue this is causing: when the input query is begi wor with a match_phrase query (and the default slop of 0), it doesn't match begin work, because begi and wor end up 2 positions apart. Any suggestions on how I can get this to match without impacting my use case?
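For reference, this is roughly the phrase query I'm running; my_field here is just a placeholder for the actual field mapped with the by_smart analyzer:
GET synonym_test/_search
{
  "query": {
    "match_phrase": {
      "my_field": {
        "query": "begi wor",
        "slop": 0
      }
    }
  }
}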
I'm using Elasticsearch version 5.6.8, which uses Lucene version 6.6.1. I've read several documentation links and articles, but I couldn't find one that properly explains why this is happening, or whether there is a setting that gives me the desired behaviour.
In my situation, my field value is like "abc,123", and I want it to be searchable by either "abc" or "123".
My index mapping is like the code below:
{
  "myfield": {
    "type": "text",
    "analyzer": "stop",
    "search_analyzer": "stop"
  }
}
But when I test with the ES _analyze API, I get this result:
{
"tokens": [
{
"token": "abc",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
}
]
}
"123" was lost.
To handle my situation, do I need to choose some other analyzer, or just add some special configuration?
You need to choose the standard analyzer instead, as the stop analyzer breaks text into terms whenever it encounters a character which is not a letter, and also removes stop words like 'the'. In your case, "abc,123" results in the single token abc when using the stop analyzer. Using the standard analyzer, it returns abc and 123, as shown below:
POST _analyze
{
"analyzer": "standard",
"text": "abc, 123"
}
Output:
{
"tokens": [
{
"token": "abc",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "123",
"start_offset": 5,
"end_offset": 8,
"type": "<NUM>",
"position": 1
}
]
}
EDIT 1: Using the Simple Pattern Split tokenizer
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern_split",
"pattern": ","
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "abc,123"
}
Output:
{
"tokens": [
{
"token": "abc",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "123",
"start_offset": 4,
"end_offset": 7,
"type": "word",
"position": 1
}
]
}
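Whichever analyzer you choose, remember to reference it in the field mapping as well; for example, reusing your mapping fragment with the custom analyzer (just a sketch, not a complete mapping request):
"myfield": {
  "type": "text",
  "analyzer": "my_analyzer",
  "search_analyzer": "my_analyzer"
}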
I have a problem with some Turkish characters and need something like alternative-character matching to fix it.
Example: "İzmir" is one of our cities, but some users search for it as "ızmır", some as "Izmır", and sometimes as "izmir".
How can I treat "i", "ı", "İ", and "I" as the same letter whenever users type any of them?
You just need to change your mapping from English to Turkish. If you have already done so, can you paste your mapping here? The mapping determines how the field gets analyzed and searched.
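For example, a minimal sketch of such a mapping using the built-in turkish analyzer (the field name city is just an example):
"city": {
  "type": "text",
  "analyzer": "turkish"
}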
The Turkish analyzer doesn't do that out of the box for you. Example:
GET _analyze
{
"analyzer": "turkish",
"text": "ızmır Izmır izmir"
}
{
"tokens": [
{
"token": "ızmır",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "ızmır",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "izmir",
"start_offset": 12,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
}
]
}
But you could use ASCII folding for that purpose: create an index with a custom analyzer and test the example against it:
PUT /asciifold_example
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "asciifolding"]
}
}
}
}
}
GET asciifold_example/_analyze
{
"analyzer": "my_analyzer",
"text": "ızmır Izmır izmir"
}
{
"tokens": [
{
"token": "izmir",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "izmir",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "izmir",
"start_offset": 12,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
}
]
}
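As a quick sanity check you can also run the canonical spelling through the same analyzer; with lowercase plus asciifolding it should come out as izmir as well:
GET asciifold_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "İzmir"
}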
I have a string property called summary that has analyzer set to trigrams and search_analyzer set to words.
"filter": {
"words_splitter": {
"type": "word_delimiter",
"preserve_original": "true"
},
"english_words_filter": {
"type": "stop",
"stop_words": "_english_"
},
"trigrams_filter": {
"type": "ngram",
"min_gram": "2",
"max_gram": "20"
}
},
"analyzer": {
"words": {
"filter": [
"lowercase",
"words_splitter",
"english_words_filter"
],
"type": "custom",
"tokenizer": "whitespace"
},
"trigrams": {
"filter": [
"lowercase",
"words_splitter",
"trigrams_filter",
"english_words_filter"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
I need input query strings like React and HTML (or React, html) to match documents whose summary contains the words React, reactjs, react.js, html, or html5. The more matching keywords a document has, the higher its score should be (ideally, documents where only one word matches, and not even at 100%, would score lower).
The thing is, I guess that at the moment react.js is split into both react and js, since I also get all the documents that contain js. On the other hand, Reactjs returns nothing. I also think I need words_splitter in order to ignore the comma.
You can solve the problem with names like react.js by using a keyword marker filter and defining the analyzer so that it uses that filter. This will prevent react.js from being split into react and js tokens.
Here is an example configuration for the filter:
"filter": {
"keywords": {
"type": "keyword_marker",
"keywords": [
"react.js",
]
}
}
And the analyzer:
"analyzer": {
"main_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"keywords",
"synonym_filter",
"german_stop",
"german_stemmer"
]
}
}
You can see whether your analyzer behaves as required using the analyze command:
GET /<index_name>/_analyze?analyzer=main_analyzer&text="react.js is a nice library"
This should return the following tokens where react.js is not tokenized:
{
"tokens": [
{
"token": "react.js",
"start_offset": 1,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "is",
"start_offset": 10,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "a",
"start_offset": 13,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "nice",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "library",
"start_offset": 20,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 4
}
]
}
Do you have a fixed set of keywords that you want to match? For words that are similar but not exactly the same, such as React.js and Reactjs, you could use a synonym filter.
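A minimal sketch of such a filter, mapping the variants onto a single canonical term (the filter name react_synonyms and the target term react are just examples); it would go into the analyzer's filter chain after lowercase:
"filter": {
  "react_synonyms": {
    "type": "synonym",
    "synonyms": [
      "react.js, reactjs => react"
    ]
  }
}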
I found a solution.
Basically, I define the word_delimiter filter with catenate_all enabled:
"words_splitter": {
"catenate_all": "true",
"type": "word_delimiter",
"preserve_original": "true"
}
and give it to the words analyzer, together with the keyword tokenizer:
"words": {
"filter": [
"words_splitter"
],
"type": "custom",
"tokenizer": "keyword"
}
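Putting the two pieces together, the relevant part of the index settings looks roughly like this (sample_index matches the URL used below):
PUT sample_index
{
  "settings": {
    "analysis": {
      "filter": {
        "words_splitter": {
          "type": "word_delimiter",
          "catenate_all": "true",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "words": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["words_splitter"]
        }
      }
    }
  }
}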
Calling http://localhost:9200/sample_index/_analyze?analyzer=words&pretty=true&text=react.js I get the following tokens:
{
"tokens": [
{
"token": "react.js",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "react",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "reactjs",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
},
{
"token": "js",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 1
}
]
}
Some of the sentences I want to define as synonyms have commas in them, for example:
"Employment Information" and "Your Activity, Your Job" are synonyms.
However, if I define them in the following way, the result is not what I envisioned, since "," has a special meaning in the Elasticsearch synonym format:
Employment Information=>Your Activity, Your Job
Is the only solution for me to use the WordNet synonym format in this case, or can I just ignore the comma entirely and take it out?
I don't think the comma will be an issue. If you are using the standard analyzer, it will remove the comma. This is my test setup:
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"employment information=>your activity your job"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
It is good to use the lowercase filter to avoid case-sensitivity issues. Now this query
GET my_index/_analyze?text=employment Information&analyzer=my_synonyms
will give you the following tokens:
{
"tokens": [
{
"token": "your",
"start_offset": 0,
"end_offset": 10,
"type": "SYNONYM",
"position": 1
},
{
"token": "activity",
"start_offset": 11,
"end_offset": 22,
"type": "SYNONYM",
"position": 2
},
{
"token": "your",
"start_offset": 11,
"end_offset": 22,
"type": "SYNONYM",
"position": 3
},
{
"token": "job",
"start_offset": 11,
"end_offset": 22,
"type": "SYNONYM",
"position": 4
}
]
}
There is a gotcha with multi-word synonyms. If you analyze the output of
GET my_index/_analyze?text=employment Information is useful&analyzer=my_synonyms
you will get unexpected results like this:
{
"tokens": [
{
"token": "your",
"start_offset": 0,
"end_offset": 10,
"type": "SYNONYM",
"position": 1
},
{
"token": "activity",
"start_offset": 11,
"end_offset": 22,
"type": "SYNONYM",
"position": 2
},
{
"token": "is",
"start_offset": 23,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "your",
"start_offset": 23,
"end_offset": 25,
"type": "SYNONYM",
"position": 3
},
{
"token": "useful",
"start_offset": 26,
"end_offset": 32,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "job",
"start_offset": 26,
"end_offset": 32,
"type": "SYNONYM",
"position": 4
}
]
}
You can solve this issue with simple contraction; write your synonyms like this:
"synonyms": [
"employment information,your activity your job=>sentence1"
]
If you are using the keyword analyzer, you could use a pattern replace char filter to remove the comma.
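A minimal sketch of that variant, built as a custom analyzer around the keyword tokenizer and reusing my_synonym_filter from the test setup above (the names remove_commas and my_keyword_synonyms are just examples):
"char_filter": {
  "remove_commas": {
    "type": "pattern_replace",
    "pattern": ",",
    "replacement": ""
  }
},
"analyzer": {
  "my_keyword_synonyms": {
    "type": "custom",
    "tokenizer": "keyword",
    "char_filter": ["remove_commas"],
    "filter": ["lowercase", "my_synonym_filter"]
  }
}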