Elasticsearch: concatenate two words into one

I have a field ManufacturerName
"ManufacturerName": {
"type": "keyword",
"normalizer" : "keyword_lowercase"
},
And a normalizer
"normalizer": {
"keyword_lowercase": {
"type": "custom",
"filter": ["lowercase"]
}
}
When searching for 'ripcurl' it matches. However, when searching for 'rip curl' it doesn't.
How/what would I use to concatenate certain words, i.e. 'rip curl' -> 'ripcurl'?
Apologies if this is a duplicate, I've spent some time seeking a solution to this.

You would want to make use of a text field for what you are looking for, and handle this kind of requirement via the NGram tokenizer.
Below is a sample mapping, query and response:
Mapping:
PUT mysomeindex
{
"mappings": {
"mydocs":{
"properties": {
"ManufacturerName":{
"type": "text",
"analyzer": "my_analyzer",
"fields":{
"keyword":{
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
},
"settings": {
"analysis": {
"normalizer": {
"my_normalizer":{
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter": [ "synonyms" ]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
},
"filter": {
"synonyms":{
"type": "synonym",
"synonyms" : ["henry loyd, henry loid, henry lloyd => henri lloyd"]
}
}
}
}
}
Notice that the field ManufacturerName is a multi-field with both a text type and a sibling keyword type. That way you can use the keyword field for exact matches and aggregation queries, while the text field covers this requirement.
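For example, an exact, case-insensitive match against the keyword sub-field could look like the sketch below (index and field names taken from the mapping above):
POST mysomeindex/_search
{
  "query": {
    "term": {
      "ManufacturerName.keyword": "ripcurl"
    }
  }
}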
Sample Documents:
POST mysomeindex/mydocs/1
{
"ManufacturerName": "ripcurl"
}
POST mysomeindex/mydocs/2
{
"ManufacturerName": "henri lloyd"
}
What Elasticsearch does when you ingest the above documents is create tokens of length 3 to 5 and store them in the inverted index, e.g. rip, ipc, pcu, etc.
You can execute the below query to see what tokens get created:
POST mysomeindex/_analyze
{
"text": "ripcurl",
"analyzer": "my_analyzer"
}
I'd also suggest looking into the Edge NGram tokenizer to see if it fits your requirement better.
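A minimal sketch of what an edge_ngram-based tokenizer could look like, assuming a separate test index (the index name and gram sizes here are only illustrative):
PUT mysomeindex_edge
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_edge_analyzer": {
          "tokenizer": "my_edge_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}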
Query:
POST mysomeindex/_search
{
"query": {
"match": {
"ManufacturerName": "rip curl"
}
}
}
Response:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.25316024,
"hits": [
{
"_index": "mysomeindex",
"_type": "mydocs",
"_id": "1",
"_score": 0.25316024,
"_source": {
"ManufacturerName": "ripcurl"
}
}
]
}
}
Query for Synonyms:
POST mysomeindex/_search
{
"query": {
"match": {
"ManufacturerName": "henri lloyd"
}
}
}
Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 2.2784421,
"hits": [
{
"_index": "mysomeindex",
"_type": "mydocs",
"_id": "2",
"_score": 2.2784421,
"_source": {
"ManufacturerName": "henry lloyd"
}
}
]
}
}
Note: If you intend to make use of synonyms, then the best way is to keep them in a text file and add it relative to the config folder location, as mentioned here.
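As a rough sketch (the file name analysis/synonyms.txt is only an assumption, and the path is resolved relative to the config directory), the filter definition above would then become:
"filter": {
  "synonyms": {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt"
  }
}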
Hope this helps!

Related

Keyword normalizer not applied on document

I'm using Elasticsearch 6.8
Here is my mapping:
{
"index_patterns": [
"my_index_*"
],
"settings": {
"index": {
"number_of_shards": 3,
"number_of_replicas": 1
},
"analysis": {
"analyzer": {
"lower_ascii_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"normalizer": {
"my_normalizer": {
"type": "custom",
"char_filter": [],
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"audit_conformity": {
"dynamic": "false",
"properties": {
"country": {
"type": "keyword",
"normalizer": "my_normalizer"
},
[…]
Then I post a document with this body
{
"_source": {
"company_id": "a813bec1-f9f3-44c7-96ac-11157f64b79b",
"country": "MX",
"user_entity_id": "1"
}
}
When I search for the document, the country is still capitalized
GET /my_index_country/_search
I get
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index_country",
"_type": "my_index",
"_id": "LOT0fYIBCNP9gFG_7cet",
"_score": 1,
"_source": {
"_source": {
"company_id": "a813bec1-f9f3-44c7-96ac-11157f64b79b",
"country": "MX",
"user_entity_id": "1",
}
}
}
]
}
}
What am I doing wrong?
You do nothing wrong, but normalizers (and analyzers alike) will never modify your source document, only whatever is indexed from it.
This means that the source document keeps holding MX, but underneath, mx is what gets indexed for the country field.
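You can verify this with a term-level query for the lowercased value; it should match even though _source still shows MX (a sketch, assuming the country field is populated at the top level as your mapping expects):
GET my_index_country/_search
{
  "query": {
    "term": {
      "country": "mx"
    }
  }
}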
If you want to lowercase the country field, you should use an ingest pipeline with a lowercase processor instead which will modify your source document before indexing it:
PUT _ingest/pipeline/lowercase-pipeline
{
"processors": [
{
"lowercase": {
"field": "country"
}
}
]
}
Then use it when indexing your documents:
PUT my_index_country/my_index/LOT0fYIBCNP9gFG_7cet?pipeline=lowercase-pipeline
{
"company_id": "a813bec1-f9f3-44c7-96ac-11157f64b79b",
"country": "MX",
"user_entity_id": "1",
}
GET my_index_country/my_index/LOT0fYIBCNP9gFG_7cet
Result =>
{
"company_id": "a813bec1-f9f3-44c7-96ac-11157f64b79b",
"country": "mx",
"user_entity_id": "1",
}
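If you want to check the pipeline output before reindexing anything, the simulate API can help (sketch):
POST _ingest/pipeline/lowercase-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "country": "MX"
      }
    }
  ]
}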

Extract keywords (multi word) from text using elastic search and return offset of the searched words

I have a lot of keywords that I want to extract from a query, and I want to know the position (offset) of where those keywords appear in that text.
So this is my progress: I created two custom analyzers, keyword and shingle:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
},
"my_analyzer_shingle": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase",
"shingle"
]
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"keyword": {
"type": "string",
"index_analyzer": "my_analyzer_keyword",
"search_analyzer": "my_analyzer_shingle"
}
}
}
}
And here are the keywords that I have:
{
"hits": {
"total": 2000,
"hits": [
{
"id": 1,
"keyword": "python programming"
},
{
"id": 2,
"keyword": "facebook"
},
{
"id": 3,
"keyword": "Microsoft"
},
{
"id": 4,
"keyword": "NLTK"
},
{
"id": 5,
"keyword": "Natural language processing"
}
]
}
}
And I make a query something like this:
{
"query": {
"match": {
"keyword": "I post a lot of things on Facebook and quora"
}
}
}
So with the code above I get
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.009332742,
"hits": [
{
"_index": "test",
"_type": "your_type",
"_id": "2",
"_score": 0.009332742,
"_source": {
"id": 2,
"keyword": "facebook"
}
},
{
"_index": "test",
"_type": "your_type",
"_id": "4",
"_score": 0.009207102,
"_source": {
"id": 4,
"keyword": "quora"
}
}
]
}
}
But I don't know where in the text those words are, i.e. the offset of those words.
I want to know that quora starts at index 40, but not by highlighting it between tags or something like that.
I want to mention that my post is based on this post:
Extract keywords (multi word) from text using elastic search

Elasticsearch Edge-NGrams Prefer Shorter Terms

I like the results I am getting from Elasticsearch using Edge-NGrams to index data and a different analyzer for searching. I would, however, prefer that shorter terms that match get ranked higher than longer terms.
For example, take the terms ABC100 and ABC100xxx. If I perform a query using the term ABC, I get back both of these documents as hits with the same score. What I would like is for ABC100 to be scored higher than ABC100xxx because ABC closer matches ABC100 according to something like the Levenshtein distance algorithm.
Setting up the index:
PUT stackoverflow
{
"settings": {
"index": {
"number_of_replicas": 0,
"number_of_shards": 1
},
"analysis": {
"filter": {
"edge_ngram": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"edge_ngram"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"product": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "whitespace"
}
}
}
}
}
Inserting documents:
PUT stackoverflow/doc/1
{
"product": "ABC100"
}
PUT stackoverflow/doc/2
{
"product": "ABC100xxx"
}
Search query:
GET stackoverflow/_search?pretty
{
"query": {
"match": {
"product": "ABC"
}
}
}
Results:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.28247002,
"hits": [
{
"_index": "stackoverflow",
"_type": "doc",
"_id": "2",
"_score": 0.28247002,
"_source": {
"product": "ABC100xxx"
}
},
{
"_index": "stackoverflow",
"_type": "doc",
"_id": "1",
"_score": 0.28247002,
"_source": {
"product": "ABC100"
}
}
]
}
}
Does anyone know how I may have a shorter term such as ABC100 ranked higher than ABC100xxx?
After finding plenty of less-than-optimal solutions involving storing field length as a separate field or using a script query, I found the root of my problem: I was simply using the edge_ngram token filter instead of the edge_ngram tokenizer.
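For completeness, a rough sketch of analysis settings built around the edge_ngram tokenizer instead of the token filter (gram sizes kept from the original settings; the tokenizer name and token_chars are my own additions):
"analysis": {
  "tokenizer": {
    "my_edge_tokenizer": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 20,
      "token_chars": ["letter", "digit"]
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "tokenizer": "my_edge_tokenizer"
    }
  }
}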

Querying elasticsearch with OR and wildcards

I'm trying to do a simple query to my elasticsearch _type and match multiple fields with wildcards. My first attempt was like this:
POST my_index/my_type/_search
{
"sort" : { "date_field" : {"order" : "desc"}},
"query" : {
"filtered" : {
"filter" : {
"or" : [
{
"term" : { "field1" : "4848" }
},
{
"term" : { "field2" : "6867" }
}
]
}
}
}
}
This example successfully matches every record where field1 is exactly equal to 4848 OR field2 is exactly equal to 6867.
What I'm trying to do is match any value of field1 that contains 4848 and any value of field2 that contains 6867, but I'm not really sure how to do it.
I appreciate any help I can get :)
It sounds like your problem has mostly to do with analysis. The appropriate solution depends on the structure of your data and what you want to match. I'll provide a couple of examples.
First, let's assume that your data is such that we can get what we want just using the standard analyzer. This analyzer will tokenize text fields on whitespace, punctuation and symbols. So the text "1234-5678-90" will be broken into the terms "1234", "5678", and "90", so a "term" query or filter for any of those terms will match that document. More concretely:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"analyzer": "standard"
},
"field2":{
"type": "string",
"analyzer": "standard"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "1212-2323-4848","field2": "1234-5678-90"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "0000-0000-0000","field2": "0987-6543-21"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "1111-2222-3333","field2": "6867-4545-90"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "1212-2323-4848",
"field2": "1234-5678-90"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "1111-2222-3333",
"field2": "6867-4545-90"
}
}
]
}
}
(Explicitly writing "analyzer": "standard" is redundant since that is the default analyzer used if you do not specify one; I just wanted to make it obvious.)
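You can confirm how the standard analyzer breaks such values apart with the _analyze API; the exact syntax varies by Elasticsearch version, but on recent versions it is roughly:
POST test_index/_analyze
{
  "analyzer": "standard",
  "text": "1234-5678-90"
}
which should return the terms 1234, 5678 and 90.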
On the other hand, if the text is embedded in such a way that the standard analysis doesn't give you what you want, say something like "121223234848" where you want to match on "4848", you will have to do something a little more sophisticated, using ngrams. Here is an example of that (notice the difference in the data):
DELETE /test_index
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"field2":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "121223234848","field2": "1234567890"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "000000000000","field2": "0987654321"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "111122223333","field2": "6867454590"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "121223234848",
"field2": "1234567890"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "111122223333",
"field2": "6867454590"
}
}
]
}
}
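To see which grams the nGram_analyzer produces for one of these values (and why the term filter for "4848" matches), you can again use _analyze (sketch; syntax depends on your version):
POST test_index/_analyze
{
  "analyzer": "nGram_analyzer",
  "text": "121223234848"
}
Among the 2- to 20-character grams returned you should find 4848.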
There is a lot going on here, so I won't attempt to explain it in this post. If you want more explanation I would encourage you to read this blog post: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. Hope you'll forgive the shameless plug. ;)
Hope that helps.

ElasticSearch edgeNGram

I have the following settings and analyzer:
PUT /tests
{
"settings": {
"analysis": {
"analyzer": {
"standardWithEdgeNGram": {
"tokenizer": "standard",
"filter": ["lowercase", "edgeNGram"]
}
},
"tokenizer": {
"standard": {
"type": "standard"
}
},
"filter": {
"lowercase": {
"type": "lowercase"
},
"edgeNGram": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15,
"token_chars": ["letter", "digit"]
}
}
}
},
"mappings": {
"test": {
"_all": {
"analyzer": "standardWithEdgeNGram"
},
"properties": {
"Name": {
"type": "string",
"analyzer": "standardWithEdgeNGram"
}
}
}
}
}
And I posted the following data into it:
POST /tests/test
{
"Name": "JACKSON v. FRENKEL"
}
And here is my query:
GET /tests/test/_search
{
"query": {
"match": {
"Name": "jax"
}
}
}
And I got this result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.19178301,
"hits": [
{
"_index": "tests",
"_type": "test",
"_id": "lfOxb_5bS86_CMumo_ZLoA",
"_score": 0.19178301,
"_source": {
"Name": "JACKSON v. FRENKEL"
}
}
]
}
}
Can someone explain to me why it still matches even though there is no "jax" anywhere in "Name"?
Thanks in advance
A match query analyzes its given value. By default, "jax" is analyzed with standardWithEdgeNGram, whose edge n-gram filter expands it into ["ja", "jax"]; the first of these matches the "ja" edge n-gram produced from the analyzed "JACKSON v. FRENKEL".
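You can check this yourself by running the query text through the same analyzer (a sketch; the JSON body form of _analyze assumes a reasonably recent version):
POST /tests/_analyze
{
  "analyzer": "standardWithEdgeNGram",
  "text": "jax"
}
which should return the tokens ja and jax.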
If you don't want this behavior you can specify a different analyzer for the match query, using the analyzer field, for example keyword:
GET /tests/test/_search
{
"query": {
"match": {
"Name": "jax",
"analyzer" : "keyword"
}
}
}
In ES 1.3.2, the below query gave an error:
GET /tests/test/_search
{
"query": {
"match": {
"Name": "jax",
"analyzer" : "keyword"
}
}
}
Error : query parsed in simplified form, with direct field name, but included more options than just the field name, possibly use its 'options' form, with 'query' element?]; }]
status: 400
I fixed the issue as below:
{
"query": {
"query_string": {
"fields": [
"Name"
],
"query": "jax",
"analyzer": "simple"
}
}
}
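As the error message hints, another option that should work here (untested in this post) is the match query's expanded 'options' form, with the text nested under a query element:
GET /tests/test/_search
{
  "query": {
    "match": {
      "Name": {
        "query": "jax",
        "analyzer": "keyword"
      }
    }
  }
}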
