Make Elasticsearch handle only the numeric part of a string field and parse/copy it into a numeric field - elasticsearch

In my data, I have a field that contains a string representation of a year. The field can contain other characters and sometimes several year strings.
Examples:
1995-2000
[2000]
cop. 1865
I want Elasticsearch to extract these years and parse them into a numeric (multi-valued) field so that I can build histogram aggregations on it.
I have tried the following configuration, which gives me only the numeric parts of the strings as tokens, but I cannot figure out how to take the final step and have those tokens interpreted as integers/shorts.
{
"analysis": {
"analyzer": {
"numeric_extractor": {
"filter": [
"numeric_keeper"
],
"tokenizer": "numeric_keeper_tokenizer"
}
},
"char_filter": {
"non_numeric_remover": {
"type": "pattern_replace",
"pattern": "[^0-9]+",
"replacement": " "
}
},
"tokenizer": {
"numeric_keeper_tokenizer": {
"type": "pattern",
"group": 1,
"pattern": "([0-9]{4})"
}
},
"filter": {
"numeric_keeper": {
"type": "pattern_capture",
"preserve_original": 0,
"patterns": [
"([0-9]{4})"
]
}
}
},
"properties": {
"date": {
"fields": {
"date": {
"analyzer": "numeric_extractor",
"index": "analyzed",
"type": "string"
}
},
"type": "multi_field"
}
}
}
Elasticsearch version 2.4.
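For context, the histogram aggregation I would eventually like to run over the extracted years would look roughly like this (the index name and the numeric sub-field date.year are illustrative; producing that numeric field is exactly the part I cannot figure out):
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "years_histogram": {
      "histogram": {
        "field": "date.year",
        "interval": 10
      }
    }
  }
}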

Related

how to require minimum length letters from query to match in elasticsearch

I want to require that the query contain at least 5 consecutive characters that match a particular field. They can be somewhat fuzzy (ideally, the longer the sequence, the fuzzier it can be).
In this example I defined an edge n-gram with a minimum of 5 characters per gram. That way it is only possible to match with at least 5 characters.
PUT teste
{
"mappings": {
"properties": {
"name": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "shingle_analyzer"
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle_filter"
]
}
},
"filter": {
"shingle_filter": {
"type": "edge_ngram",
"min_gram": 5,
"max_gram": 8
}
}
}
}
}
POST teste/_doc
{
"name":"example text match fiver terms sequence"
}
GET teste/_search
{
"query": {
"match": {
"name.ngram": "exampl"
}
}
}
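For a quick sanity check, the generated grams can be inspected with the _analyze API; for the word example this should return the tokens examp, exampl and example:
GET teste/_analyze
{
  "analyzer": "shingle_analyzer",
  "text": "example"
}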

How to search by words written together among data where these words are written apart in Elasticsearch?

I have documents which have, let's say, one field: the name of the document. The name may consist of several words written separately, for example:
{
"name": "first document"
},
{
"name": "second document"
}
My goal is to be able to search for these documents by strings:
firstdocument, seconddocumen
As you can see, the search strings are misspelled, but they would still match those documents if we deleted the whitespace from the documents' names. This could be handled by creating another field with the same string but without whitespace, but that seems like storing extra data unless there is no other way to do it.
I need something similar to this:
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":2,
"output_unigrams":"true",
"token_separator": ""
}
],
"text": "first document"
}
But the other way around: I need to apply this kind of transformation not to the search text but to the searched objects (the document names), so that I can find documents even when there is a small misspelling in the search text. How should this be done?
I suggest using multi-fields with an analyzer that removes whitespace.
Analyzer
"no_spaces": {
"filter": [
"lowercase"
],
"char_filter": [
"remove_spaces"
],
"tokenizer": "standard"
}
Char Filter
"remove_spaces": {
"type": "pattern_replace",
"pattern": "[ ]",
"replacement": ""
}
Field Mapping
"name": {
"type": "text",
"fields": {
"without_spaces": {
"type": "text",
"analyzer": "no_spaces"
}
}
}
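Put together, the pieces above go into a single index-creation request, roughly like this (the index name test_index is illustrative):
PUT test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_spaces": {
          "type": "pattern_replace",
          "pattern": "[ ]",
          "replacement": ""
        }
      },
      "analyzer": {
        "no_spaces": {
          "tokenizer": "standard",
          "char_filter": [
            "remove_spaces"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "without_spaces": {
            "type": "text",
            "analyzer": "no_spaces"
          }
        }
      }
    }
  }
}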
Query
GET /_search
{
"query": {
"match": {
"name.without_spaces": {
"query": "seconddocumen",
"fuzziness": "AUTO"
}
}
}
}
EDIT:
For completeness: an alternative to the remove_spaces char filter could be the shingle filter:
"analysis": {
"filter": {
"shingle_filter": {
"type": "shingle",
"output_unigrams": "false",
"token_separator": ""
}
},
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"shingle_filter"
]
}
}
}
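The effect can be checked with _analyze; with output_unigrams set to false this should produce the single token firstdocument for the text first document:
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "shingle",
      "output_unigrams": false,
      "token_separator": ""
    }
  ],
  "text": "first document"
}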

Search a list of names and categorizing each letter type

I want to index a large list of names using ES.
I want to distinguish between consonants and vowels in each word, and be able to search based on the position of each letter and whether it is a consonant or a vowel.
So, given a name like:
JOHN
I want to enter this:
CVCC
and when I run the search, JOHN should be in the result set.
Is it possible to index names in Elasticsearch such that I could then search them using the token C for a consonant and V for a vowel?
So Elasticsearch would somehow have to index the character type at each position of each word; how can this be done?
You can do this with pattern_replace char filters in a custom analyzer. In my solution I have also used a sub-field for the custom analyzer, assuming that you may want other kinds of searches on the name field, the consonant-vowel one being only one of them.
DELETE test
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"replace_filter_lowercase_CONS",
"replace_filter_uppercase_CONS",
"replace_filter_lowercase_VOW",
"replace_filter_uppercase_VOW"
]
}
},
"char_filter": {
"replace_filter_lowercase_CONS": {
"type": "pattern_replace",
"pattern": "[b-df-hj-np-tv-z]{1}",
"replacement": "c"
},
"replace_filter_uppercase_CONS": {
"type": "pattern_replace",
"pattern": "[B-DF-HJ-NP-TV-Z]{1}",
"replacement": "C"
},
"replace_filter_lowercase_VOW": {
"type": "pattern_replace",
"pattern": "[aeiou]{1}",
"replacement": "v"
},
"replace_filter_uppercase_VOW": {
"type": "pattern_replace",
"pattern": "[AEIOU]{1}",
"replacement": "V"
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "text",
"fields": {
"cons_vow": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
}
POST /test/test/1
{"name":"JOHN"}
POST /test/test/2
{"name":"Andrew"}
POST /test/test/3
{"name":"JOhn DOE"}
GET /test/_search
{
"query": {
"term": {
"name.cons_vow": {
"value": "CVCC"
}
}
}
}
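A quick way to verify the analyzer is the _analyze API; for JOHN this should return a single CVCC token, which is exactly what the term query above matches:
GET /test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "JOHN"
}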

"Letter" tokenizer and "word_delimiter" filter not working with underscores

I built an Elasticsearch index using a custom analyzer which uses the letter tokenizer and the lowercase and word_delimiter token filters. Then I tried searching for documents containing underscore-separated sub-words, e.g. abc_xyz, using only one of the sub-words, e.g. abc, but it didn't come back with any result. When I tried the full word, i.e. abc_xyz, it did find the document.
Then I changed the document to have dash-separated sub-words instead, e.g. abc-xyz and tried to search by sub-words again and it worked.
To try to understand what is going on, I checked the terms generated for my documents using the _termvector service, and the result was identical for both the underscore-separated and the dash-separated sub-words, so I would expect the search results to be identical in both cases.
Any idea what I could be doing wrong?
If it helps, these are the settings I used for my index:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"cmt_value_analyzer": {
"tokenizer": "letter",
"filter": [
"lowercase",
"my_filter"
],
"type": "custom"
}
},
"filter": {
"my_filter": {
"type": "word_delimiter"
}
}
}
}
},
"mappings": {
"alertmodel": {
"properties": {
"name": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"productId": {
"type": "double"
},
"productName": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"link": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"updatedOn": {
"type": "date"
}
}
}
}
}
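For reference, the terms an analyzer produces can also be inspected directly with the _analyze API, which makes it easy to compare the underscore and dash variants side by side (the index name is illustrative; depending on the Elasticsearch version the analyzer and text may need to be passed as query-string parameters instead):
GET /myindex/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": "abc_xyz"
}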

multiple like query in elastic search

I have a field path in my Elasticsearch documents which has entries like this:
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_011007/stderr
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_008874/stderr
Note: I want to select all the documents having the line below in the path field
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr
I want to make a "like" query on this path field given certain things (basically an AND condition on all three):
I have given application number 1451299305289_0120
I have also given a task number 009257
The path field should also contain stderr
Given the above criteria, the document whose path field is the third line should be selected.
This is what I have tried so far:
http://localhost:9200/logstash-*/_search?q=application_1451299305289_0120 AND path:stderr&size=50
This query fulfills the third criterion, and partially the first, i.e. if I search for 1451299305289_0120 instead of application_1451299305289_0120, I get 0 results. (What I really need is a "like" search on 1451299305289_0120.)
When I tried this
http://10.30.145.160:9200/logstash-*/_search?q=path:*_1451299305289_0120*008779 AND path:stderr&size=50
I got the result, but using * at the start is a costly operation. Is there another way to achieve this effectively (like using nGram and the fuzzy search of Elasticsearch)?
This can be achieved by using the pattern_replace char filter. You just extract the important bits of information with regexes. This is my setup:
POST log_index
{
"settings": {
"analysis": {
"analyzer": {
"app_analyzer": {
"char_filter": [
"app_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
},
"path_analyzer": {
"char_filter": [
"path_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
},
"task_analyzer": {
"char_filter": [
"task_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"app_extractor": {
"type": "pattern_replace",
"pattern": ".*application_(.*)/container.*",
"replacement": "$1"
},
"path_extractor": {
"type": "pattern_replace",
"pattern": ".*/(.*)",
"replacement": "$1"
},
"task_extractor": {
"type": "pattern_replace",
"pattern": ".*container.{27}(.*)/.*",
"replacement": "$1"
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "keyword",
"fields": {
"application_number": {
"type": "string",
"analyzer": "app_analyzer"
},
"path": {
"type": "string",
"analyzer": "path_analyzer"
},
"task": {
"type": "string",
"analyzer": "task_analyzer"
}
}
}
}
}
}
}
I am extracting the application number, task number and path with regexes. You might want to tweak the task regex a bit if you have some other log pattern. Then we can use filters to search. A big advantage of using filters is that they are cached and make subsequent calls faster.
I indexed a sample log like this:
PUT log_index/your_type/1
{
"name" : "/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr"
}
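To double-check the extraction, the _analyze API should return a single 1451299305289_0120 token for the app_analyzer on the sample path (the exact _analyze request syntax varies slightly between versions):
GET log_index/_analyze
{
  "analyzer": "app_analyzer",
  "text": "/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr"
}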
This query will give you the desired results:
GET log_index/_search
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"name.application_number": "1451299305289_0120"
}
},
{
"term": {
"name.task": "009257"
}
},
{
"term": {
"name.path": "stderr"
}
}
]
}
}
}
}
}
On a side note, the filtered query is deprecated in ES 2.x; just use a bool query with a filter clause directly, as sketched below. The path_hierarchy tokenizer might also be useful for some other use cases.
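As a rough sketch, the same query without the deprecated filtered wrapper would be:
GET log_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "name.application_number": "1451299305289_0120" } },
        { "term": { "name.task": "009257" } },
        { "term": { "name.path": "stderr" } }
      ]
    }
  }
}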
Hope this helps :)
