In Elasticsearch, is there a way to set up an analyzer that produces position gaps between tokens when it encounters line breaks or punctuation marks?
Let's say I index an object with the following nonsensical string (with line break) as one of its fields:
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.
The standard analyzer will yield the following tokens with respective positions:
0 the
1 quick
2 brown
3 fox
4 runs
5 after
6 the
7 rabbit
8 then
9 comes
10 the
11 jumpy
12 frog
This means that a match_phrase query for "the rabbit then comes" will match this document as a hit.
Is there a way to introduce a position gap between rabbit and then so that it doesn't match unless some slop is introduced?
Of course, a workaround could be to transform the single string into an array (one line per entry) and use position_offset_gap in the field mapping, but I would really rather keep a single string with newlines (and an ideal solution would involve larger position gaps for newlines than, say, for punctuation marks).
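For reference, the array-based workaround would look roughly like this; the field name body is hypothetical, and on recent Elasticsearch versions the parameter is called position_increment_gap on a text field rather than position_offset_gap on a string field:
PUT /index
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "position_offset_gap": 100
        }
      }
    }
  }
}
The document would then be indexed with "body": ["The quick brown fox runs after the rabbit.", "Then comes the jumpy frog."], and each array entry gets its positions shifted by 100 from the previous one.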
I eventually figured out a solution using a char_filter to introduce extra tokens on line breaks and punctuation marks:
PUT /index
{
"settings": {
"analysis": {
"char_filter": {
"my_mapping": {
"type": "mapping",
"mappings": [ ".=>\\n_PERIOD_\\n", "\\n=>\\n_NEWLINE_\\n" ]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": ["my_mapping"],
"filter": ["lowercase"]
}
}
}
}
}
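To actually use the analyzer at index time, it still has to be attached to a field in the mapping; a minimal sketch, with a hypothetical body field (type string to match the Elasticsearch version implied by the rest of the post, text on current versions):
PUT /index/_mapping/doc
{
  "properties": {
    "body": {
      "type": "string",
      "analyzer": "my_analyzer"
    }
  }
}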
Testing with the example string
POST /index/_analyze?analyzer=my_analyzer&pretty
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.
yields the following result:
{
"tokens" : [ {
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
}, {
... snip ...
"token" : "rabbit",
"start_offset" : 35,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 8
}, {
"token" : "_period_",
"start_offset" : 41,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 9
}, {
"token" : "_newline_",
"start_offset" : 42,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 10
}, {
"token" : "then",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 11
... snip ...
} ]
}
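With the extra tokens in place, a phrase query spanning the sentence boundary only matches once enough slop is allowed; a sketch, again assuming a hypothetical body field analyzed with my_analyzer:
POST /index/_search
{
  "query": {
    "match_phrase": {
      "body": {
        "query": "the rabbit then comes",
        "slop": 2
      }
    }
  }
}
Since _period_ and _newline_ occupy positions 9 and 10 between rabbit (position 8) and then (position 11), the phrase no longer matches with slop 0, and a slop of at least 2 is needed to bridge the gap.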
Our documents have a field of keyword type called "version_numbers", which essentially stores an array of version numbers. For example: ["1.00.100.01", "2.00.470.00"].
This versioning follows a specific pattern, where each group should be associated with search keywords. Here's a breakdown of the versioning pattern:
1.00.240.15
  |   |  |
  |   |  \_ maintenance version (15)
  |   \____ minor version (240)
  \________ major version (1.00)
I want to build an analyzer such that:
the major version is associated with the keyword APP_VERSION and can be searched with queries like APP_VERSION1.0, APP_VERSION1.00, APP_VERSION1, etc.
the minor version is associated with the keyword XP and can be searched with queries like XP24, XP 24, XP240, etc.
the maintenance version is associated with the keyword Rev and can be searched with queries like Rev15, Rev 15, etc.
documents can also be queried with a combination of all three, like APP_VERSION1 XP240 Rev15.
How do I associate each group of the version pattern with the keywords specified above?
Here's what I've tried so far to tokenize versions:
{
"analysis": {
"analyzer": {
"version_analyzer": {
"tokenizer": "version_tokenizer"
}
},
"tokenizer": {
"version_tokenizer": {
"type": "simple_pattern_split",
"pattern": "\\d"
}
}
}
}
But this seems to be splitting by dots only.
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "00",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "470",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "00",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 3
}
]
}
I'm new to Elasticsearch, so I highly appreciate any hints/guidance on this.
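One direction that might work, sketched here only under assumptions and not verified: a pattern_replace char filter can rewrite the version string into the keyworded form before tokenization, e.g. turning 1.00.240.15 into APP_VERSION1.00 XP240 Rev15 (applied through a text multi-field rather than the keyword field itself); matching truncated forms such as APP_VERSION1 or XP24 would additionally need something like an edge_ngram token filter or prefix queries. The index name below is hypothetical:
PUT /versions_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "version_to_keywords": {
          "type": "pattern_replace",
          "pattern": "(\\d+\\.\\d+)\\.(\\d+)\\.(\\d+)",
          "replacement": "APP_VERSION$1 XP$2 Rev$3"
        }
      },
      "analyzer": {
        "version_analyzer": {
          "char_filter": ["version_to_keywords"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}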
This activity is the one that should match the query:
"_type": "_doc",
"_id": "V47jV3oBb_MDv65iR4vj",
"_score": 0.0,
"_source": {
"ActivitatID": "30",
"Nombre": "A la romana",
"Tipo": "Proyecto ",
"Descripcion": "Los alumnos y alumnas conocerán diversos aspectos relacionados con la Antigua Roma y utilizarlos para crear actividades lúdico-didácticas. Es decir, los chicos y chicas “gamificarán” el conocimiento que adquieran sobre los romanos elaborando un juego sobre el mundo romano, que será el producto final de la secuencia.",
"Idioma": "Castellano",
"Assignaturas": "Matemáticas, Educación Artística, Ciencias Sociales y Lengua Castellana",
"Competencias": "Comunicación lingüística, Aprender a aprender, Sociales y cívicas, Conciencia y expresiones culturales, Matemáticas y ciencias, iniciativa y espiritu emprendedor",
"EdadMinima": "10",
"EdadMaxima": "11",
"Link": "https://descargas.intef.es/cedec/proyectoedia/reaprimaria/a_la_romana/_gua_didctica_.html",
"Puntuacion": "2.823",
"Votos": "10035",
"Guardados": "994",
"Tags": ""
}
If I try the following GET query, it works:
{
"query": {
"bool": {
"filter": [
{ "term": { "Idioma": "castellano" }},
{ "term": { "Assignaturas": "matemáticas" }},
{ "range": { "EdadMinima":{"gte":10}}},
{ "range": { "EdadMaxima":{"lte":11}}}
]
}
}
}
but if I change "matemáticas" to "Matemáticas" or "Educación Artística" or anything else, 0 results are found.
Why?
Thanks
The field Assignaturas is of type text, so its tokens are generated by the standard analyzer:
GET index162/_analyze
{
"text": ["Matemáticas, Educación Artística, Ciencias Sociales y Lengua Castellana"],
"analyzer": "standard"
}
Result:
"tokens" : [
{
"token" : "matemáticas",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "educación",
"start_offset" : 13,
"end_offset" : 22,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "artística",
"start_offset" : 23,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "ciencias",
"start_offset" : 34,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "sociales",
"start_offset" : 43,
"end_offset" : 51,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "y",
"start_offset" : 52,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "lengua",
"start_offset" : 54,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "castellana",
"start_offset" : 61,
"end_offset" : 71,
"type" : "<ALPHANUM>",
"position" : 7
}
]
The standard analyzer splits the text and lowercases the tokens.
Term query
Returns documents that contain an exact term in a provided field.
So "term": { "Assignaturas": "Matemáticas" } is trying to match search text Matemáticas with token matemáticas hence no document is returned.
You should use match query for this
match query uses same analyzer as used in your field on search text so similar tokens are generated
Term queries are not analyzed. If you search for "Batman" in a field, it will only return docs that contain exactly "Batman": no analysis is applied to your query, so the case stays the same, and even if the term has spaces, like "Pac Man", it will look for that exact term.
A term query is not the best way to search if you want that kind of flexibility. If your fields Idioma and Assignaturas are text fields, you probably want to use match.
From the documentation:
Avoid using the term query for text fields. By default, Elasticsearch changes the values of text fields as part of analysis. This can make finding exact matches for text field values difficult. To search text field values, use the match query instead.
If they are text fields:
{
"query": {
"bool": {
"filter": [
{ "match": { "Idioma": "castellano" }},
{ "match": { "Assignaturas": "matemáticas" }},
{ "range": { "EdadMinima":{"gte":10}}},
{ "range": { "EdadMaxima":{"lte":11}}}
]
}
}
}
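Alternatively, if exact filtering is really what you want, a term query can target a keyword sub-field, assuming the default dynamic mapping created one (e.g. Assignaturas.keyword). Keyword values are not lowercased or split, so the full original string has to be supplied:
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "Assignaturas.keyword": "Matemáticas, Educación Artística, Ciencias Sociales y Lengua Castellana" }},
        { "term": { "Idioma.keyword": "Castellano" }}
      ]
    }
  }
}
For searching a single subject inside that comma-separated list, the match query above remains the right tool.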
I have a case where I want to use Elasticsearch as a text search engine for pretty long Arabic HTML text.
The search works pretty well, except for words with diacritics: it doesn't seem to recognize them.
For example:
This sentence: ' وَهَكَذَا في كُلّ عَقْدٍ' (this is the one stored in the db)
is exactly the same as this one: 'وهكذا في كل عقد' (this is what the user enters for search),
except for the added diacritics, which are handled as separate characters by computers (but are just rendered on top of other characters).
I want to know if there's a way to make the search ignore all diacritics.
The first method I'm thinking of is telling Elasticsearch to completely ignore diacritics when indexing (kind of like stopwords?).
If not, is it suitable to have another field in the document (text_normalized) where I manually remove the diacritics before adding it to Elasticsearch, and would that be efficient?
To solve your problem you can use the arabic_normalization token filter; it removes diacritics from the text before indexing. You need to define a custom analyzer, which should look something like this:
"analyzer": {
"rebuilt_arabic": {
"tokenizer": "standard",
"filter": [
"lowercase",
"decimal_digit",
"arabic_stop",
"arabic_normalization",
"arabic_keywords",
"arabic_stemmer"
]
}
}
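Note that arabic_stop, arabic_keywords and arabic_stemmer are names of custom filters, so they also need to be defined under analysis.filter. The definitions below follow the rebuilt arabic analyzer from the documentation (the keyword_marker list is only a placeholder):
"filter": {
  "arabic_stop": {
    "type": "stop",
    "stopwords": "_arabic_"
  },
  "arabic_keywords": {
    "type": "keyword_marker",
    "keywords": ["مثال"]
  },
  "arabic_stemmer": {
    "type": "stemmer",
    "language": "arabic"
  }
}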
Analyzer API check:
GET /_analyze
{
"tokenizer" : "standard",
"filter" : ["arabic_normalization"],
"text" : "وَهَكَذَا في كُلّ عَقْدٍ"
}
Result from Analyzer:
{
"tokens" : [
{
"token" : "وهكذا",
"start_offset" : 0,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "في",
"start_offset" : 10,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "كل",
"start_offset" : 13,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "عقد",
"start_offset" : 18,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
As you can see, the diacritics are removed. For more information, see the Elasticsearch documentation on the arabic analyzer and the arabic_normalization token filter.
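As for the second idea in the question: a separate manually normalized field shouldn't be necessary. It is enough to attach the custom analyzer to the text field in the mapping, roughly like this (index and field names are hypothetical, and the analyzer body is the one defined above):
PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_arabic": { ... }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "rebuilt_arabic"
      }
    }
  }
}
Both the indexed text and the search text then go through the same normalization, so a query without diacritics matches text stored with them.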
I'm using Elasticsearch 1.7.1 on Mac.
Here are my index settings:
{
"settings":{
"analysis":{
"filter":{
"my_edgengram":{
"max_gram":15,
"token_chars":[
"letter",
"digit"
],
"type":"edgeNGram",
"min_gram":1
}
},
"analyzer":{
"stop_edgengram_analyzer":{
"filter":[
"lowercase",
"asciifolding",
"stop",
"my_edgengram"
],
"type":"custom",
"tokenizer":"whitespace"
}
}
}
}
}
Debugging analyzer:
$ curl -XGET 'http://localhost:9200/objects/_analyze?analyzer=stop_edgengram_analyzer&text=America,s&pretty=True'
{
"tokens" : [
... skipped ...
, {
"token" : "america",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "america,",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "america,s",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
} ]
}
Why is the america,s token in the output?
, is a punctuation symbol; I expected only letters and digits, as specified in the token_chars property of the my_edgengram filter.
You are confusing the edge_ngram tokenizer with the edge_ngram token filter.
From the documentation:
Tokenizers are used to break a string down into a stream of terms or tokens.
In the example provided in the question, whitespace is the tokenizer being used.
A token filter, on the other hand, will:
accept a stream of tokens from a tokenizer and can modify tokens (eg lowercasing), delete tokens (eg remove stopwords) or add tokens (eg synonyms).
In the example provided in the OP, the edge_ngram token filter is being used.
token_chars is not supported by the edge_ngram token filter and is therefore ignored.
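If the goal is to keep only letters and digits before generating edge n-grams, one option, sketched here and not tested on 1.7.1, is to move the n-gramming into an edge_ngram tokenizer, where token_chars is honored:
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_tokenizer": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 15,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "stop_edgengram_analyzer": {
          "type": "custom",
          "tokenizer": "my_edge_tokenizer",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
Note that the stop filter is dropped here: with the tokenizer doing the n-gramming first, stopword removal would have to be handled differently.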
Is it possible to use the path_hierarchy tokenizer with paths that contain whitespace and have it create tokens based only on the delimiter, not the whitespace? For example,
"/airport/hangar 1"
would be tokenized as
"airport", "hangar 1",
not
"airport", "hangar", "1"?
The path_hierarchy tokenizer works perfectly fine with paths that contain whitespace:
curl "localhost:9200/_analyze?tokenizer=path_hierarchy&pretty=true" -d "/airport/hangar 1"
{
"tokens" : [ {
"token" : "/airport",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "/airport/hangar 1",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 1
} ]
}
However, based on your example, you might need to use the pattern tokenizer instead.
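A minimal sketch of that alternative, with an arbitrary index name: a pattern tokenizer that splits on / leaves the whitespace inside each path segment intact:
PUT /paths
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "slash_tokenizer": {
          "type": "pattern",
          "pattern": "/"
        }
      },
      "analyzer": {
        "path_analyzer": {
          "tokenizer": "slash_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
Analyzing "/airport/hangar 1" with path_analyzer should then produce the tokens airport and hangar 1, with the empty leading segment dropped. The trade-off compared to path_hierarchy is that you lose the prefix tokens (/airport, /airport/hangar 1).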