How to access mapping sub-fields in painless? - elasticsearch

Assuming there is an Elasticsearch mapping:
"mappings": {
"dynamic_templates": [],
"properties": {
"name": {
"type": "text",
"fields": {
"x": {
"type": "completion"
},
"y": {
"type": "keyword"
},
"z": {
"type": "text",
"analyzer": "shingles",
"fielddata": true
}
}
How can I access the name.x/y/z data/tokens from a Painless script?
The following do not work:
ctx._source.name.x
ctx._source['name.fields.x']
ctx._source['name.x']

Oh, I see: you want the tokens produced in name.z to be fed back into name.x. First, I think there's a misconception about how multi-fields work. You only ever specify the value for the name field, and each sub-field gets a value according to whatever type/analyzer it has; you cannot set a value directly on any of those sub-fields.
What you should do instead is specify the shingles analyzer directly on your completion field, and that should do the trick.
UPDATE
What you want is not possible automatically with multi-fields; you need a standalone completion field, not a sub-field of name. The process you want can be done like this, but it needs to be coded on the client side:
# 1. Index your document
PUT test/_doc/1
{ "name": "the big brown fox" }
# 2. For each indexed document, retrieve the term vectors for name.z
GET test/_termvectors/1?fields=name.z
=>
"term_vectors" : {
"name.z" : {
"field_statistics" : {
"sum_doc_freq" : 7,
"doc_count" : 1,
"sum_ttf" : 7
},
"terms" : {
"big" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 4,
"end_offset" : 7
}
]
},
"big brown" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 4,
"end_offset" : 13
}
]
},
# 3. Retrieve all the keys from the terms map obtained in step 2
"big", "big brown", ...
# 4. Feed the retrieved tokens back into the document's top-level completion field
POST test/_doc/1/_update
{
  "doc": {
    "completion": {
      "input": [ "big", "big brown", ... ]
    }
  }
}
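The client-side loop above could be sketched roughly like this (a sketch, not a full implementation: the index name, field names, and the `extract_terms` helper are illustrative; only the term-key extraction of step 3 is shown in full, the HTTP calls are left as pseudocode):

```python
def extract_terms(termvectors_response, field):
    """Step 3: pull the term keys out of a _termvectors response body."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    return sorted(terms.keys())

# Steps 2-4, in pseudocode (any HTTP client works):
#   resp   = GET test/_termvectors/1?fields=name.z            -> dict
#   tokens = extract_terms(resp, "name.z")
#   POST test/_doc/1/_update  {"doc": {"completion": {"input": tokens}}}

# A response shaped like the excerpt above:
sample = {"term_vectors": {"name.z": {"terms": {
    "big": {"term_freq": 1},
    "big brown": {"term_freq": 1},
}}}}
print(extract_terms(sample, "name.z"))  # ['big', 'big brown']
```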

Related

Tokens in Index Time vs Query Time are not the same when using common_gram filter ElasticSearch

I want to use the common_grams token filter based on this link.
My Elasticsearch version is 7.17.8.
Here is the setting of my index in Elasticsearch.
I have defined a filter named "common_grams" that uses "common_grams" as its type.
I have defined a custom analyzer named "index_grams" that uses "whitespace" as the tokenizer and the above filter as a token filter.
I have just one field, named "title_fa", and I have used my custom analyzer for this field.
PUT /my-index-000007
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the", "is" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "boost": 40
      }
    }
  }
}
It works fine at index time and the tokens are what I expect them to be. Here I get the tokens via the Kibana dev tools:
GET /my-index-000007/_analyze
{
  "analyzer": "index_grams",
  "text": "brown is the"
}
Here is the result of the tokens for the text.
{
  "tokens" : [
    {
      "token" : "brown",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "brown_is",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "gram",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "is_the",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "gram",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "the",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    }
  ]
}
When I search the query "brown is the", I expect these tokens to be searched:
["brown", "brown_is", "is", "is_the", "the" ]
But these are the tokens that will actually be searched:
["brown is the", "brown is_the", "brown_is the"]
Here you can see the details (query-time tokens screenshot).
UPDATE:
I have added a sample document like this:
POST /my-index-000007/_doc/1
{ "title_fa" : "brown" }
When I search "brown coat"
GET /my-index-000007/_search
{
  "query": {
    "query_string": {
      "query": "brown coat",
      "default_field": "title_fa"
    }
  }
}
it returns the document because it searches:
["brown", "coat"]
When I search "brown is coat", it can't find the document because it is searching for
["brown is coat", "brown_is coat", "brown is_coat"]
Clearly, when it gets a query that contains a common word, it acts differently, and I guess that's because of the difference between the index-time and query-time tokens.
Do you know where I am getting this wrong? Why is it acting differently?

Elasticsearch char_filter not affecting search

My understanding of how the char_filter works must be wrong. My goal here is to treat all apostrophe and quote-like characters the same (in this case, remove them entirely) in Elasticsearch. (Apparently there are like 5 apostrophe-like Unicode characters... and my database has all versions :facepalm:)
Aside: This approach to the solution was inspired by this thread
So here is a toy problem that illustrates my issue.
I create an index with the char_filter, and then populate it with 3 documents:
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "quote_analyzer": {
          "char_filter": [
            "quotes"
          ],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "quotes": {
          "mappings": [
            "\u0091=>",
            "\u0092=>",
            "\u2018=>",
            "\u2019=>"
          ],
          "type": "mapping"
        }
      }
    }
  }
}
POST test/_doc
{
"name": "The King’s men",
"id": "1"
}
POST test/_doc
{
"name": "Zoom LeBron the Soldier 7 'King's Pride'",
"id": "2"
}
POST test/_doc
{
"name": "Kings Kings Kings",
"id": "3"
}
As you can see, each document contains some form of the word Kings. I then check that my analyzer is doing what I think it should be doing:
GET test/_analyze
{
  "analyzer": "quote_analyzer",
  "text": "King’s boat"
}
Which yields:
{
  "tokens" : [
    {
      "token" : "Kings",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "boat",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
It appears that the apostrophe in King’s has been removed and the token is Kings. Great! So now when I search for King’s, since the analyzer removes the apostrophe, I should get all three results. Or at LEAST just id:3, since Kings with the apostrophe removed matches only Kings Kings Kings. However, searching for:
GET test/_search
{
  "query": {
    "match": {
      "name": "King’s boat"
    }
  }
}
Yields:
{
  "took" : 1,
  // collapsing ....
  "hits" : {
    // collapsing ....
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1e2x_38Bn0QWlup8OIvp",
        "_score" : 1.1220688,
        "_source" : {
          "name" : "The King’s men",
          "id" : "1"
        }
      }
    ]
  }
}
Similarly, searching Kings boat only retrieves id:3. And searching King's boat only retrieves id:2.
What am I missing? How do I accomplish the goal of treating all apostrophe characters the same?
Modify your char_filter to also cover the plain ASCII apostrophe (\u0027), like you already did for the curly quote characters; your search for King's boat matched only id:2 precisely because the straight apostrophe is missing from your mappings list.
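The intent of the mapping char filter can be checked in plain Python (a sketch; the extra \u0027 entry for the straight apostrophe is the addition here, the other four code points mirror the char_filter above):

```python
# Code points to strip: the four from the char_filter above,
# plus the plain ASCII apostrophe (the missing one).
APOSTROPHES = "\u0091\u0092\u2018\u2019\u0027"

def strip_apostrophes(text):
    """Remove every apostrophe-like character, as the mapping filter would."""
    return text.translate({ord(c): None for c in APOSTROPHES})

print(strip_apostrophes("King’s"))  # Kings
print(strip_apostrophes("King's"))  # Kings
```

With all five variants mapped away, King’s, King's, and Kings all normalize to the same token.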

Emails not being searched properly in elasticsearch

I have indexed a few documents in elasticsearch which have email ids as a field. But when I query for a specific email id, the search results are showing all the documents without filtering.
This is the query I have used
{
  "query": {
    "match": {
      "mail-id": "abc@gmail.com"
    }
  }
}
By default, your mail-id field is analyzed by the standard analyzer, which will tokenize the email abc@gmail.com into the following two tokens:
{
  "tokens" : [ {
    "token" : "abc",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "gmail.com",
    "start_offset" : 4,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
What you need instead is to create a custom analyzer using the UAX URL email tokenizer, which will tokenize email addresses as a single token.
So you need to define your index as follows:
curl -XPUT localhost:9200/people -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email"
        }
      }
    }
  },
  "mappings": {
    "person": {
      "properties": {
        "mail-id": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
After creating that index, you can see that the email abc@gmail.com is tokenized as a single token and your search will work as expected.
curl -XGET 'localhost:9200/people/_analyze?analyzer=my_analyzer&pretty' -d 'abc@gmail.com'
{
  "tokens" : [ {
    "token" : "abc@gmail.com",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<EMAIL>",
    "position" : 1
  } ]
}
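To see why the standard analyzer breaks the address apart, here is a rough regex approximation of both behaviors (a sketch only; the real Lucene tokenizers are far more sophisticated):

```python
import re

def standard_like(text):
    """Rough approximation of the standard tokenizer: word characters,
    keeping dots between alphanumerics, splitting at '@'."""
    return re.findall(r"\w+(?:\.\w+)*", text)

def uax_like(text):
    """Rough approximation of uax_url_email: keep whole email
    addresses together as one token."""
    return re.findall(r"[\w.]+@[\w.]+|\w+(?:\.\w+)*", text)

print(standard_like("abc@gmail.com"))  # ['abc', 'gmail.com']
print(uax_like("abc@gmail.com"))       # ['abc@gmail.com']
```

The match query fails with the default mapping because "abc" and "gmail.com" are indexed as separate terms, so partial overlaps score across all documents.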
This happens when you use the default mapping. Elasticsearch has the uax_url_email tokenizer, which identifies URLs and emails as a single entity/token.
You can read more about this here and here.

Elasticsearch exact matches on analyzed fields

Is there a way to have ElasticSearch identify exact matches on analyzed fields? Ideally, I would like to lowercase, tokenize, stem and perhaps even phoneticize my docs, then have queries pull "exact" matches out.
What I mean is that if I index "Hamburger Buns" and "Hamburgers", they will be analyzed as ["hamburger","bun"] and ["hamburger"]. If I search for "Hamburger", it will only return the "hamburger" doc, as that's the "exact" match.
I've tried using the keyword tokenizer, but that won't stem the individual tokens. Do I need to do something to ensure that the number of tokens is equal or so?
I'm familiar with multi-fields and using the "not_analyzed" type, but this is more restrictive than I'm looking for. I'd like exact matching, post-analysis.
Use a shingle filter together with stemming and whatever else you need. Add a sub-field of type token_count that will count the number of tokens in the field.
At search time, you need an additional filter to match the number of tokens in the index against the number of tokens in the search text. That requires an extra step when you perform the actual search: counting the tokens in the search string. This is necessary because shingles create multiple combinations of tokens, and you need to make sure the match is the same size as your search text.
An attempt for this, just to give you an idea:
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 10,
          "min_shingle_size": 2,
          "output_unigrams": true
        },
        "filter_stemmer": {
          "type": "porter_stem",
          "language": "_english_"
        }
      },
      "analyzer": {
        "ShingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "snowball",
            "filter_stemmer",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "ShingleAnalyzer",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "ShingleAnalyzer"
            }
          }
        }
      }
    }
  }
}
And the query:
{
  "query": {
    "filtered": {
      "query": {
        "match_phrase": {
          "text": {
            "query": "HaMbUrGeRs BUN"
          }
        }
      },
      "filter": {
        "term": {
          "text.word_count": "2"
        }
      }
    }
  }
}
The shingle filter is important here because it can create combinations of tokens, and, more than that, combinations that keep the order of the tokens. IMO, the most difficult requirement to fulfill here is to change the tokens (stemming, lowercasing, etc.) and also to reassemble the original text. Unless you define your own "concatenation" filter, I don't think there is any way other than using the shingle filter.
But shingles have another issue: they create combinations that are not needed. For a text like "Hamburgers buns in Los Angeles" you end up with a long list of shingles:
"angeles",
"buns",
"buns in",
"buns in los",
"buns in los angeles",
"hamburgers",
"hamburgers buns",
"hamburgers buns in",
"hamburgers buns in los",
"hamburgers buns in los angeles",
"in",
"in los",
"in los angeles",
"los",
"los angeles"
If you are interested only in documents that match exactly, meaning the document above matches only when you search for "hamburgers buns in los angeles" (and not something like "any hamburgers buns in los angeles"), then you need a way to filter that long list of shingles. The way I see it is to use word_count.
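For reference, the shingle list above can be reproduced client-side (a sketch of what the shingle filter emits with output_unigrams: true and a large max size; the real filter also tracks positions and offsets):

```python
def shingles(tokens, min_size=2, max_size=10, output_unigrams=True):
    """Emit unigrams plus every ordered run of min_size..max_size tokens."""
    out = set(tokens) if output_unigrams else set()
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.add(" ".join(tokens[i:i + size]))
    return sorted(out)

print(shingles("hamburgers buns in los angeles".split()))
# 15 entries, from 'angeles' up to 'hamburgers buns in los angeles'
```

Counting the full shingle set makes it clear why a plain match against the shingled field is not enough and why the word_count filter is needed to pin down the exact-length match.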
You can use multi-fields for that purpose and have a not_analyzed sub-field within your analyzed field (let's call it item in this example). Your mapping would have to look like this:
{
  "yourtype": {
    "properties": {
      "item": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
With this kind of mapping, you can check how each of the values Hamburgers and Hamburger Buns are "viewed" by the analyzer with respect to your multi-field item and item.raw
For Hamburger:
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item&pretty' -d 'Hamburger'
{
  "tokens" : [ {
    "token" : "hamburger",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item.raw&pretty' -d 'Hamburger'
{
  "tokens" : [ {
    "token" : "Hamburger",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  } ]
}
For Hamburger Buns:
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item&pretty' -d 'Hamburger Buns'
{
  "tokens" : [ {
    "token" : "hamburger",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "buns",
    "start_offset" : 11,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item.raw&pretty' -d 'Hamburger Buns'
{
  "tokens" : [ {
    "token" : "Hamburger Buns",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 1
  } ]
}
As you can see, the not_analyzed field is going to be indexed untouched exactly as it was input.
Now, let's index two sample documents to illustrate this:
curl -XPOST localhost:9200/yourtypes/_bulk -d '
{"index": {"_type": "yourtype", "_id": 1}}
{"item": "Hamburger"}
{"index": {"_type": "yourtype", "_id": 2}}
{"item": "Hamburger Buns"}
'
And finally, to answer your question, if you want to have an exact match on Hamburger, you can search within your sub-field item.raw like this (note that the case has to match, too):
curl -XPOST localhost:9200/yourtypes/yourtype/_search -d '{
  "query": {
    "term": {
      "item.raw": "Hamburger"
    }
  }
}'
And you'll get:
{
  ...
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "yourtypes",
      "_type" : "yourtype",
      "_id" : "1",
      "_score" : 0.30685282,
      "_source":{"item": "Hamburger"}
    } ]
  }
}
UPDATE (see comments/discussion below and question re-edit)
Taking your example from the comments, and trying to have HaMbUrGeR BuNs match Hamburger Buns, you could simply achieve it with a match query like this:
curl -XPOST localhost:9200/yourtypes/yourtype/_search?pretty -d '{
  "query": {
    "match": {
      "item": {
        "query": "HaMbUrGeR BuNs",
        "operator": "and"
      }
    }
  }
}'
Which, based on the same two indexed documents above, will yield:
{
  ...
  "hits" : {
    "total" : 1,
    "max_score" : 0.2712221,
    "hits" : [ {
      "_index" : "yourtypes",
      "_type" : "yourtype",
      "_id" : "2",
      "_score" : 0.2712221,
      "_source":{"item": "Hamburger Buns"}
    } ]
  }
}
You can keep the analyzer as you intended (lowercase, tokenize, stem, ...), use query_string as the main query, and match_phrase as a boosting query. Something like this:
{
  "bool" : {
    "should" : [
      {
        "query_string" : {
          "default_field" : "your_field",
          "default_operator" : "OR",
          "phrase_slop" : 1,
          "query" : "Hamburger"
        }
      },
      {
        "match_phrase": {
          "your_field": {
            "query": "Hamburger"
          }
        }
      }
    ]
  }
}
It will match both documents, and the exact match (match_phrase) will rank on top, since that document matches both should clauses (and gets a higher score).
default_operator is set to OR so that the query "Hamburger Buns" (hamburger OR bun) also matches the document "Hamburger".
phrase_slop is set to 1 to match terms at distance 1 only; e.g. searching for Hamburger Buns will not match the document Hamburger Big Buns. You can adjust this depending on your requirements.
You can refer to Closer is better and Query string for more details.

Elasticsearch, search for domains in urls

We index HTML documents which may include links to other documents. We're using elasticsearch and things are pretty smooth for most keyword searches, which is great.
Now, we're adding more complex searches similar to Google's site: or link: searches: basically we want to retrieve documents which point to either specific URLs or even whole domains. (If document A has a link to http://a.site.tld/path/, the search link:http://a.site.tld should yield it.)
And we're now trying to figure out the best way to achieve this.
So far, we have extracted the links from the documents and added a links field to our document. We set up the links field to be not analyzed. We can then run searches that match the exact URL, link:http://a.site.tld/path/, but of course link:http://a.site.tld does not yield anything.
Our initial idea is to create a new field linkedDomains which would work similarly... but perhaps there is a better solution?
You could try the Path Hierarchy Tokenizer:
Define a mapping as follows:
PUT /link-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "link": {
          "type": "string",
          "index_analyzer": "path-analyzer"
        }
      }
    }
  }
}
Index a doc:
POST /link-demo/doc
{
  "link": "http://a.site.tld/path/"
}
The following term query returns the indexed doc:
POST /link-demo/_search?pretty
{
  "query": {
    "term": {
      "link": {
        "value": "http://a.site.tld"
      }
    }
  }
}
To get a feel for how this is being indexed:
GET link-demo/_analyze?analyzer=path-analyzer&text="http://a.site.tld/path"&pretty
Shows the following:
{
  "tokens" : [ {
    "token" : "\"http:",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "\"http:/",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "\"http://a.site.tld",
    "start_offset" : 0,
    "end_offset" : 18,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "\"http://a.site.tld/path\"",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  } ]
}
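(The stray quote characters in the tokens above are just the literal quotes from the text parameter being included in the analyzed string.) The tokenizer's prefix behavior can be approximated in Python (a sketch with the default / delimiter; the real tokenizer also supports replacement, skip, and reverse options):

```python
def path_hierarchy(text, delimiter="/"):
    """Emit the prefix ending before each delimiter, plus the full text,
    roughly like the path_hierarchy tokenizer with default settings."""
    tokens = [text[:i] for i, ch in enumerate(text) if ch == delimiter]
    tokens.append(text)
    # drop empties and de-duplicate while keeping order
    return list(dict.fromkeys(t for t in tokens if t))

print(path_hierarchy("http://a.site.tld/path"))
# ['http:', 'http:/', 'http://a.site.tld', 'http://a.site.tld/path']
```

This is why the term query for http://a.site.tld matches: that exact prefix is one of the indexed tokens.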
