In Elasticsearch, how can I tokenize words separated by spaces and still match when typing without spaces? - elasticsearch

Here is what I want to achieve:
My field value: "one two three"
I want to be able to match this field by typing any of: one, onetwo, onetwothree, onethree, twothree, two, or three.
For that, the tokenizer needs to produce these tokens:
one
onetwo
onetwothree
onethree
two
twothree
three
Do you know how I can implement this analyzer?

The same problem arises in German, where separate words are joined into one. Elasticsearch handles this with "compound word" token filters (dictionary_decompounder and hyphenation_decompounder), which try to find sub-words from a given dictionary inside a token. You only have to define a dictionary for your language. The full specification is at the link below.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-compound-word-tokenfilter.html
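As a rough illustration of the idea, the sketch below builds index settings that use a dictionary_decompounder filter and mimics its effect in plain Python. The filter name decompound_numbers, the analyzer name decompound_analyzer, the word list, and the decompound helper are all assumptions made for this example; the helper is only a toy approximation, not Lucene's actual algorithm.

```python
# Hypothetical index settings: names and word list are illustrative assumptions.
WORD_LIST = ["one", "two", "three"]

settings = {
    "analysis": {
        "filter": {
            "decompound_numbers": {
                "type": "dictionary_decompounder",
                "word_list": WORD_LIST,
            }
        },
        "analyzer": {
            "decompound_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "decompound_numbers"],
            }
        },
    }
}


def decompound(token, word_list, min_subword_size=2):
    """Toy stand-in for the filter: keep the token, add dictionary sub-words."""
    lowered = token.lower()
    words = set(word_list)
    subwords = []
    for start in range(len(lowered)):
        for end in range(start + min_subword_size, len(lowered) + 1):
            piece = lowered[start:end]
            if piece in words:
                subwords.append(piece)
    return [lowered] + subwords
```

Under this sketch, a typed onethree yields the sub-words one and three, which can then match the tokens indexed for "one two three". Whether every sub-word has to match depends on the operator of the query you use.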

Related

Struggling to understand user dictionary format in Elasticsearch Kuromoji tokenizer

I want to use the Elasticsearch Kuromoji plugin for Japanese. However, I'm struggling to understand the format of the tokenizer's user_dictionary file. The Elastic docs (https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-tokenizer.html) describe it as a CSV of the following form:
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
The documentation does not say much more than that.
The sample entry shown in the docs looks like this:
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
So, breaking it down, the first element is the dictionary text:
東京スカイツリー - Tokyo Sky Tree
東京 スカイツリー - Tokyo Sky Tree again. I assume the space marks a token boundary, but I wonder why only "Tokyo" is a separate token while "sky tree" is not split into "sky" and "tree".
トウキョウ スカイツリー - Then we have the reading forms, again "Tokyo" and "sky tree". Why is it split this way? Can I specify more than one reading form of the text in this column (if there are any)?
The last element is the part of speech, which is the bit I don't understand. カスタム名詞 means "custom noun". I assume I can define the part of speech, such as verb or noun. But what are the rules? Does the part-of-speech name have to follow some format? I have seen examples where it is specified as "noun" (名詞), but in this example it is "custom noun".
Does anyone have ideas or materials, especially about the part-of-speech field, such as which values are available? Additionally, what impact does this field have on the tokenizer's overall behavior?
Thanks
Did you try to define "Tokyo Sky Tree" like this?
"東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞"
"東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
I ran into another problem: Found duplicate term [東京スカイツリー] in user dictionary at line [1]
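To make the four-column format concrete, here is a minimal Python sketch that splits user dictionary lines into their fields and flags surface forms that occur more than once, which is the situation the "Found duplicate term" message above describes. The helper names parse_user_dictionary and duplicate_terms are made up for this example, and this is not how the plugin itself loads the file.

```python
import csv
import io
from collections import Counter


def parse_user_dictionary(text):
    """Parse lines of <text>,<tokens>,<readings>,<part-of-speech> into dicts."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        surface, tokens, readings, pos = row
        entries.append({
            "text": surface,
            "tokens": tokens.split(),      # space-separated segmentation
            "readings": readings.split(),  # one reading per token
            "pos": pos,
        })
    return entries


def duplicate_terms(entries):
    """Return surface forms that appear on more than one line."""
    counts = Counter(entry["text"] for entry in entries)
    return [term for term, n in counts.items() if n > 1]


# The two lines quoted above, placed in one dictionary file.
sample = (
    "東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞\n"
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞\n"
)
entries = parse_user_dictionary(sample)
```

Both lines share the same first column, so duplicate_terms(entries) reports 東京スカイツリー; keeping only one of the two lines avoids the clash in this sketch.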

Matching two words as a single word

Consider that I have a document which has a field with the following content: 5W30 QUARTZ INEO MC 3 5L
A user wants to be able to search for MC3 (no space) and get the document; however, search for MC 3 (with spaces) should also work. Moreover, there can be documents that have the content without spaces and that should be found when querying with a space.
I tried indexing without spaces (e.g. 5W30QUARTZINEOMC35L), but that does not really work: with a wildcard search I would match too much (e.g. MC35 would also match), and I only want to match two exact words concatenated together (as well as an exact single word).
So far I'm thinking of additionally indexing all combinations of two words, e.g. 5W30QUARTZ, QUARTZINEO, INEOMC, MC3, 35L. However, does Elasticsearch have a native solution for this?
I'm pretty sure what you want can be done with the shingle token filter. Depending on your mapping, you would add a filter like this to the analyzer of your content field so that tokens are indexed in pairs as well as individually:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":2,
"min_shingle_size":2,
"output_unigrams":true,
"token_separator":""
}
Note that the shingle sizes and output_unigrams shown here are already the defaults; I included them for clarity. The token_separator is not a default: it defaults to a single space, and setting it to an empty string is what makes the pair MC 3 come out as the token MC3.
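To see which tokens two-word shingles would give for the example field, here is a plain-Python approximation. It assumes the shingles are joined with an empty separator (the filter's token_separator setting, which defaults to a single space); the shingles function is a made-up helper for illustration, not the filter's real implementation.

```python
def shingles(text, size=2, separator="", output_unigrams=True):
    """Approximate a shingle filter: emit each word plus adjacent word groups."""
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        if output_unigrams:
            tokens.append(word)
        group = words[i:i + size]
        if len(group) == size:
            tokens.append(separator.join(group))
    return tokens


tokens = shingles("5W30 QUARTZ INEO MC 3 5L")
```

The resulting list contains MC3 and 35L but not MC35, which matches the requirement of finding exactly two concatenated words without the over-matching of a wildcard search.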

elasticsearch - fulltext search for words with special/reserved characters

I am indexing documents that may contain any special/reserved characters in their fulltext body. For example
"PDF/A is an ISO-standardized version of the Portable Document Format..."
I would like to be able to search for pdf/a without having to escape the forward slash.
How should I analyze my query string, and what type of query should I use?
The default standard analyzer will tokenize a string like that so that "PDF" and "A" are separate tokens. The "A" token might get cut out by the stop token filter (see Standard Analyzer). So without any custom analyzers, you will typically match any document that merely contains "PDF".
You can try creating your own analyzer, modeled on the standard analyzer, that includes a Mapping Char Filter. The idea would be that "PDF/A" gets transformed into something like "pdf_a" at both index and query time. A simple match query will then work just fine. But this is a very simplistic approach, and you might want to consider how '/' characters are used in your content and use slightly more complex regex filters, which are not perfect solutions either.
Sorry, I completely missed your point about having to escape the character. Can you elaborate on your use case if this turns out to not be helpful at all?
To support queries containing reserved characters, I now use the Simple Query String Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html).
Because it does not use the full query parser, it is a bit limited (e.g. no field queries like id:5), but it serves the purpose.
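The two suggestions above can be sketched together in Python. The settings show a custom analyzer with a mapping char filter that turns "/" into "_", and the analyze function is a toy approximation of what that analyzer would emit. The names slash_to_underscore and slash_aware and the field name body are assumptions for illustration only.

```python
import re

# Hypothetical index settings: all names below are illustrative assumptions.
settings = {
    "analysis": {
        "char_filter": {
            "slash_to_underscore": {
                "type": "mapping",
                "mappings": ["/ => _"],
            }
        },
        "analyzer": {
            "slash_aware": {
                "type": "custom",
                "char_filter": ["slash_to_underscore"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}


def analyze(text):
    """Toy approximation: map '/' to '_', lowercase, then split into words."""
    mapped = text.replace("/", "_").lower()
    return re.findall(r"\w+", mapped)


# A simple_query_string body; "body" is a hypothetical field name.
query = {"query": {"simple_query_string": {"query": "pdf/a", "fields": ["body"]}}}
```

Because the same mapping is applied at index and query time, both "PDF/A" in a document and a typed pdf/a end up as the single token pdf_a in this sketch.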

Ignore elements in cts:search

I have some XML documents with a structure like this:
<root>
<intro>...</intro>
...
<body>
<p>..................
some text CO<sub>2</sub>
.................. </p>
</body>
</root>
Now I want to search for the phrase CO2 and also get documents of the above type in the search results.
For this purpose, I am using this query:
cts:search(
  fn:collection("urn:iddn:collections:searchable"),
  cts:element-query(
    fn:QName("http://iddn.icis.com/ns/fields", "body"),
    cts:word-query(
      "CO2",
      ("case-insensitive", "diacritic-sensitive", "punctuation-insensitive",
       "whitespace-sensitive", "unstemmed", "unwildcarded", "lang=en"),
      1
    )
  ),
  ("unfiltered", "score-logtfidf"),
  0.0
)
But with this query I do not get documents containing CO<sub>2</sub>. I only get documents containing the plain phrase CO2.
If I change the search phrase to CO 2, then I get only documents containing CO<sub>2</sub> and not those containing CO2.
I want the search results to include both CO<sub>2</sub> and CO2.
So can I ignore <sub> by any means, or is there another way to handle this problem?
The issue here is tokenization. "CO2" is a single word token. CO<sub>2</sub>, even with phrase-through, is a phrase of two word tokens: "CO" and "2". Just as "blackbird" does not match "black bird", so too does "CO2" not match "CO 2". The phrase-through setting just means that we're willing to look for a phrase that crosses the <sub> element boundary.
You can't splice together CO<sub>2</sub> into one token, but you might be able to use customized tokenization overrides to break "CO2" into two tokens. Define a field and define overrides for the digits as 'symbol'. This will make each digit its own token and will break "CO2" into two tokens in the context of that field. You'd then need to replace the word-query with a field-word-query.
You probably don't want this to apply everywhere in a document, so you'd be best off adding markup around these kinds of chemical phrases in your documents. Fields in general and tokenization overrides in particular come at a performance cost. The contents of a field are indexed completely separately, so the index is bigger, and the tokenization overrides mean that we have to retokenize as well, both on ingest and at query time. This will slow things down a little (not a lot).
It appears that you want to add a phrase-through configuration.
Example:
<p>to <b>be</b> or not to be</p>
A phrase-through on <b> would then be indexed as "to be or not to be"
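The tokenization point made in the answers above can be illustrated with a toy Python tokenizer. This is not MarkLogic code: strip_tags and tokenize are made-up helpers, and the digits_as_symbols flag only imitates the effect of a tokenizer override that classifies digits as symbols.

```python
import re


def strip_tags(xml_text):
    """Replace element tags with spaces, since a tag boundary ends a token."""
    return re.sub(r"<[^>]+>", " ", xml_text)


def tokenize(text, digits_as_symbols=False):
    """Toy word tokenizer; optionally make every digit its own token."""
    pattern = r"[A-Za-z]+|[0-9]" if digits_as_symbols else r"[A-Za-z0-9]+"
    return re.findall(pattern, text)


plain = "some text CO2"
marked_up = "some text CO<sub>2</sub>"
```

With the default behavior, plain ends in the single token CO2 while marked_up ends in the two tokens CO and 2, so the two never match. With digits_as_symbols=True both texts produce the same token sequence, which is why the override makes one query find both forms.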

ElasticSearch Nest AutoComplete based on words split by whitespace

I have autocomplete working with Elasticsearch (NEST), and it works fine when the user types letters from the beginning of the phrase. However, I would like a more specialized type of autocomplete, if possible, that caters for words within a sentence.
To clarify further, my requirement is to be able to "auto complete" like this:
Imagine the full indexed string is "this is some title". When the user types in "th", this comes back as a suggestion with my current code.
I would also like the same thing to be returned if the user types in "som" or "title" or any letters that form a word (word being classified as a string between two spaces or the start/end of the string).
The code I have is:
var result = _client.Search<ContentIndexable>(
    body => body
        .Index(indexName)
        .SuggestCompletion("content-suggest" + Guid.NewGuid(),
            descriptor => descriptor
                .OnField(t => t.Title.Suffix("completion"))
                .Text(searchTerm)
                .Size(size)));
And I would like to see if it would be possible to write something that matches my requirement using SuggestCompletion (and not by doing a match query).
Many thanks,
Update:
This question already has an answer elsewhere, but I am leaving it here since the title/description is probably a little easier for search engines to find.
The correct solution to this problem can be found here:
Elasticsearch NEST client creating multi-field fields with completion
@Kha, I think it's better to use the NGram tokenizer.
You would use this tokenizer when you create the mapping.
If you want more info, or maybe an example, write back.
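As a rough picture of what the n-gram suggestion buys you, the Python sketch below indexes the leading n-grams (prefixes) of every word in a title, a narrower variant of full n-grams, and looks the typed text up among them. It is a plain-Python illustration rather than NEST code; the helper names and the min_gram and max_gram values are assumptions.

```python
def word_prefixes(text, min_gram=2, max_gram=10):
    """Return the leading n-grams of every whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        longest = min(len(word), max_gram)
        for length in range(min_gram, longest + 1):
            grams.append(word[:length])
    return grams


def suggest(typed, titles):
    """Return titles in which some word starts with the typed text."""
    typed = typed.lower()
    return [title for title in titles if typed in word_prefixes(title)]


titles = ["this is some title", "another heading"]
```

Here "th", "som", and "title" all return "this is some title", because each is a prefix of one of its words, whereas "itle" returns nothing since only word beginnings are indexed.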

Resources