Search a list of names and categorize each letter type - Elasticsearch

I want to index a large list of names using ES.
I want to distinguish between consonants and vowels in each word, and be able to search based on the position of each letter and whether it is a consonant or a vowel.
So, given a name like:
JOHN
I want to enter this:
CVCC
and when I run the search, JOHN should be in the result set.
Is it possible to index names in Elasticsearch such that I can search them using the tokens C for consonant and V for vowel?
Elasticsearch would have to index the character type at each position of each word. How can this be done?

You can do this with pattern_replace char filters in a custom analyzer. In my solution I applied the custom analyzer to a sub-field, on the assumption that you will also want other kinds of searches on the name field, the consonant-vowel one being only one of them.
DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "replace_filter_lowercase_CONS",
            "replace_filter_uppercase_CONS",
            "replace_filter_lowercase_VOW",
            "replace_filter_uppercase_VOW"
          ]
        }
      },
      "char_filter": {
        "replace_filter_lowercase_CONS": {
          "type": "pattern_replace",
          "pattern": "[b-df-hj-np-tv-z]{1}",
          "replacement": "c"
        },
        "replace_filter_uppercase_CONS": {
          "type": "pattern_replace",
          "pattern": "[B-DF-HJ-NP-TV-Z]{1}",
          "replacement": "C"
        },
        "replace_filter_lowercase_VOW": {
          "type": "pattern_replace",
          "pattern": "[aeiou]{1}",
          "replacement": "v"
        },
        "replace_filter_uppercase_VOW": {
          "type": "pattern_replace",
          "pattern": "[AEIOU]{1}",
          "replacement": "V"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "cons_vow": {
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}
POST /test/test/1
{"name":"JOHN"}
POST /test/test/2
{"name":"Andrew"}
POST /test/test/3
{"name":"JOhn DOE"}
GET /test/_search
{
  "query": {
    "term": {
      "name.cons_vow": {
        "value": "CVCC"
      }
    }
  }
}
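To sanity-check the analysis chain, you can run the analyzer directly against the index (a quick verification sketch reusing the index and analyzer names defined above):
GET /test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "JOHN"
}
This should return a single token CVCC, which is exactly what the term query above matches against name.cons_vow.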

Related

Elasticsearch - search the data ignoring periods or hyphens

The Elasticsearch index contains data with CPFs (Brazilian taxpayer IDs).
{
  "name": "A",
  "cpf": "718.881.683-23"
}
{
  "name": "B",
  "cpf": "404.833.187-60"
}
I want to search the data by the cpf field as follows:
query: 718
output: doc with name "A"
query: 718.881.683-23
output: doc with name "A"
The above is working.
But the following is not working.
query: 71888168323
output: doc with name "A"
That is, I want to be able to find the document by its cpf value even when the periods and hyphen are omitted from the query.
You can add a custom analyzer that will remove all characters that are not digits and only index the digits.
The analyzer looks like this:
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "number_only": {
          "type": "pattern_replace",
          "pattern": "\\D"
        }
      },
      "analyzer": {
        "cpf_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "number_only"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "cpf": {
        "type": "text",
        "analyzer": "cpf_analyzer"
      }
    }
  }
}
Then you can index your documents as usual:
POST test/_doc
{
  "name": "A",
  "cpf": "718.881.683-23"
}
POST test/_doc
{
  "name": "B",
  "cpf": "404.833.187-60"
}
Searching for a prefix like 718 can be done like this:
POST test/_search
{
  "query": {
    "prefix": {
      "cpf": "718"
    }
  }
}
Searching for the exact value with non-digit characters can be done like this:
POST test/_search
{
  "query": {
    "match": {
      "cpf": "718.881.683-23"
    }
  }
}
And finally, you can also search with numbers only:
POST test/_search
{
  "query": {
    "match": {
      "cpf": "71888168323"
    }
  }
}
With the given analyzer, all the above queries will return the document you expect.
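If you want to double-check what actually gets indexed, you can run the analyzer on a sample value (a quick sketch reusing the cpf_analyzer defined above):
GET test/_analyze
{
  "analyzer": "cpf_analyzer",
  "text": "718.881.683-23"
}
This should return the single token 71888168323, which is why the digits-only match query works.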
If you cannot recreate your index for whatever reason, you can create a sub-field with the right analyzer and update your data in place:
PUT test/_mapping
{
  "properties": {
    "cpf": {
      "type": "text",
      "fields": {
        "numeric": {
          "type": "text",
          "analyzer": "cpf_analyzer"
        }
      }
    }
  }
}
And then simply run the following command which will reindex all the data in place and populate the cpf.numeric field:
POST test/_update_by_query
All your searches will then need to be done on the cpf.numeric field instead of cpf directly.
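For example, the numbers-only search from above would then target the sub-field (a sketch reusing the field names from the mapping update above):
POST test/_search
{
  "query": {
    "match": {
      "cpf.numeric": "71888168323"
    }
  }
}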
718.881.683-23 is tokenized to 718 881 683 23 by the standard analyzer. So by default, you will find document A with 718, 881, 683 or 23 (or any combination of those tokens), but not with 7188, as there is no such token in the field. You probably want to specify a different analyzer, for example one using the edge n-gram tokenizer.
You can create a custom analyzer with a pattern_replace char filter like the following, which strips everything that is not a digit:
"my_char_filter": {
  "type": "pattern_replace",
  "pattern": "[^\\d]",
  "replacement": ""
}
and an edge n-gram tokenizer:
"my_tokenizer": {
  "type": "edge_ngram",
  "min_gram": 1,
  "max_gram": 11,
  "token_chars": [
    "digit"
  ]
}
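Wired together, the index settings could look roughly like this (a sketch, not tested against your data; the index, analyzer and field names are illustrative, and a separate search analyzer is used so that a full query like 718.881.683-23 is reduced to digits without being edge-ngrammed itself):
PUT cpf_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "[^\\d]",
          "replacement": ""
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 11,
          "token_chars": [
            "digit"
          ]
        }
      },
      "analyzer": {
        "cpf_index_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter"
          ],
          "tokenizer": "my_tokenizer"
        },
        "cpf_search_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "cpf": {
        "type": "text",
        "analyzer": "cpf_index_analyzer",
        "search_analyzer": "cpf_search_analyzer"
      }
    }
  }
}
With this setup, match queries for 718, 718.881 or 71888168323 should all hit document A, since the indexed edge n-grams contain every digit prefix of 71888168323.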

Achieving literal text search combined with subword matching in Elasticsearch

I have populated an Elasticsearch database using the following settings:
mapping = {
    "properties": {
        "location": {
            "type": "text",
            "analyzer": "ngram_analyzer"
        },
        "description": {
            "type": "text",
            "analyzer": "ngram_analyzer"
        },
        "commentaar": {
            "type": "text",
            "analyzer": "ngram_analyzer"
        },
    }
}
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "ngram_filter"
                    ]
                }
            }
        }
    },
    "mappings": {"custom_tool": mapping}
}
I used the ngram analyzer because I wanted to be able to do subword matching, so a search for "ackoverfl" would return the entries containing "stackoverflow".
My search queries are made as follows:
q = {
    "simple_query_string": {
        "query": needle,
        "default_operator": "and",
        "analyzer": "whitespace"
    }
}
Where needle is the text from my search bar.
Sometimes I would also like to do literal phrase searching. For example:
If my search term is:
"the ap hangs in the tree"
(Notice that I use quotation marks here with the intention of searching for a literal piece of text.)
Then in my results I get a document containing:
the apple hangs in the tree
This result is unwanted.
How could I implement subword matching while also keeping the option to search for literal phrases (for example by using quotation marks)?
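One possible approach (a sketch, not a definitive solution): keep the ngram-analyzed fields for subword matching, add a standard-analyzed sub-field to each of them (here called exact, an assumed name), and use simple_query_string's quote_field_suffix so that quoted parts of the query run as phrase queries against the exact sub-fields while unquoted parts still hit the ngram fields. The mapping fragment for one field could look like this:
"description": {
  "type": "text",
  "analyzer": "ngram_analyzer",
  "fields": {
    "exact": {
      "type": "text",
      "analyzer": "standard"
    }
  }
}
and the query:
{
  "simple_query_string": {
    "query": "\"the ap hangs in the tree\"",
    "fields": ["location", "description", "commentaar"],
    "quote_field_suffix": ".exact",
    "default_operator": "and",
    "analyzer": "whitespace"
  }
}
A quoted phrase then has to match word for word on the .exact sub-fields, so the document containing "the apple hangs in the tree" is no longer returned, while unquoted searches such as ackoverfl keep their subword behaviour. Note that the whitespace query analyzer does not lowercase, so depending on your data you may need to adjust the analysis of the quoted case.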

How to search using words written together when those words are written apart in the data in Elasticsearch?

I have documents which have, let's say, one field - the name of the document. A name may consist of several words written apart, for example:
{
"name": "first document"
},
{
"name": "second document"
}
My goal is to be able to search for these documents by strings:
firstdocument, seconddocumen
As you can see, the search strings are written incorrectly, but they would still match those documents if we deleted the whitespace from the documents' names. This could be handled by creating another field with the same string but without whitespace, but that seems like extra data unless there is no other way to do it.
I need something similar to this:
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "shingle",
      "max_shingle_size": 3,
      "min_shingle_size": 2,
      "output_unigrams": "true",
      "token_separator": ""
    }
  ],
  "text": "first document"
}
But the other way around: I need to apply this not to the search text but to the indexed values (the documents' names), so I can find documents even when the search text contains a small misspelling. How should this be done?
I suggest using multi-fields with an analyzer that removes whitespace.
Analyzer
"no_spaces": {
  "filter": [
    "lowercase"
  ],
  "char_filter": [
    "remove_spaces"
  ],
  "tokenizer": "standard"
}
Char Filter
"remove_spaces": {
  "type": "pattern_replace",
  "pattern": "[ ]",
  "replacement": ""
}
Field Mapping
"name": {
  "type": "text",
  "fields": {
    "without_spaces": {
      "type": "text",
      "analyzer": "no_spaces"
    }
  }
}
Query
GET /_search
{
  "query": {
    "match": {
      "name.without_spaces": {
        "query": "seconddocumen",
        "fuzziness": "AUTO"
      }
    }
  }
}
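To see why the fuzzy match works, you can reproduce the no_spaces analysis with an ad-hoc _analyze request (a sketch using inline filter definitions):
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "[ ]",
      "replacement": ""
    }
  ],
  "text": "second document"
}
The indexed token is seconddocument, so the query seconddocumen with AUTO fuzziness is within edit distance one of it.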
EDIT:
For completeness, an alternative to the remove_spaces char filter could be the shingle filter:
"analysis": {
"filter": {
"shingle_filter": {
"type": "shingle",
"output_unigrams": "false",
"token_separator": ""
}
},
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"shingle_filter"
]
}
}
}
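You can check what the shingle_analyzer would index with an ad-hoc _analyze request (a quick sketch using an inline filter definition equivalent to shingle_filter above):
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "shingle",
      "output_unigrams": "false",
      "token_separator": ""
    }
  ],
  "text": "first document"
}
This produces the single token firstdocument, so a fuzzy match for firstdocument or seconddocumen against a field using this analyzer behaves much like the without_spaces approach above.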

Make Elasticsearch handle only the numeric part of a string field and parse/copy it into a numeric field

In my data, I have a field that contains a string representation of a year. The field can contain other characters and sometimes several year strings.
Examples:
1995-2000
[2000]
cop. 1865
I want to extract these years in Elasticsearch and parse them into a numeric (multi-valued) field in order to build histogram aggregations.
I have tried the following configuration, which gives me only the numeric parts of the strings as tokens, but I cannot figure out how to make the final step and have those tokens interpreted as integers/shorts.
{
  "analysis": {
    "analyzer": {
      "numeric_extractor": {
        "filter": [
          "numeric_keeper"
        ],
        "tokenizer": "numeric_keeper_tokenizer"
      }
    },
    "char_filter": {
      "non_numeric_remover": {
        "type": "pattern_replace",
        "pattern": "[^0-9]+",
        "replacement": " "
      }
    },
    "tokenizer": {
      "numeric_keeper_tokenizer": {
        "type": "pattern",
        "group": 1,
        "pattern": "([0-9]{4})"
      }
    },
    "filter": {
      "numeric_keeper": {
        "type": "pattern_capture",
        "preserve_original": 0,
        "patterns": [
          "([0-9]{4})"
        ]
      }
    }
  },
  "properties": {
    "date": {
      "fields": {
        "date": {
          "analyzer": "numeric_extractor",
          "index": "analyzed",
          "type": "string"
        }
      },
      "type": "multi_field"
    }
  }
}
Elasticsearch version 2.4.
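As a side note, you can confirm that the posted analysis chain emits only the four-digit groups by running it through the _analyze API once the index is created (a sketch; the index name myindex is illustrative, and on 2.4 you may need the query-string form of _analyze instead of a request body):
GET /myindex/_analyze
{
  "analyzer": "numeric_extractor",
  "text": "1995-2000"
}
This returns the tokens 1995 and 2000, but they are still string tokens; an analyzer alone cannot change the field's data type to numeric.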

Elasticsearch "pattern_replace", replacing whitespaces while analyzing

Basically I want to remove all whitespaces and tokenize the whole string as a single token. (I will use nGram on top of that later on.)
These are my index settings:
"settings": {
  "index": {
    "analysis": {
      "filter": {
        "whitespace_remove": {
          "type": "pattern_replace",
          "pattern": " ",
          "replacement": ""
        }
      },
      "analyzer": {
        "meliuz_analyzer": {
          "filter": [
            "lowercase",
            "whitespace_remove"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
Instead of "pattern": " ", I also tried "pattern": "\\u0020" and "\\s".
But when I analyze the text "beleza na web", it still creates three separate tokens, "beleza", "na" and "web", instead of a single "belezanaweb" token.
An analyzer processes a string by tokenizing it first and then applying a series of token filters. Since you specified the standard tokenizer, the input is already split into separate tokens before the pattern_replace filter is applied to each of them.
Use the keyword tokenizer instead of the standard tokenizer; the rest of the mapping is fine.
You can change your mapping as below:
"settings": {
"index": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
},
"analyzer": {
"meliuz_analyzer": {
"filter": [
"lowercase",
"whitespace_remove",
"nGram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
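To verify the whitespace removal independently of the nGram step, you can run an ad-hoc _analyze request with just the keyword tokenizer and the two filters (a sketch; inline filter definitions in _analyze require Elasticsearch 5.x or later):
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    {
      "type": "pattern_replace",
      "pattern": " ",
      "replacement": ""
    }
  ],
  "text": "beleza na web"
}
This yields the single token belezanaweb; with the mapping above, the nGram filter is then applied on top of that single token.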
