Query to partially match every word in a search term in Elasticsearch

I have an array of tags containing words.
tags: ['australianbrownsnake', 'venomoussnake', ...]
How do I match this against these search terms:
'brown snake', 'australian snake', 'venomous', 'venomous brown snake'
I am not even sure if this is possible since I am new to Elasticsearch.
Help would be appreciated. Thank you.
Edit: I have created an ngram analyzer and added a field called ngram, like so:
properties": {
"tags": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
I tried the following query, but no luck:
"query": {
"multi_match": {
"query": "snake",
"fields": [
"tags.ngram"
],
"type": "most_fields"
}
}
My tag mapping is as follows:
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
},
"ngram" : {
"type" : "text",
"analyzer" : "my_analyzer"
}
}
},
My settings are:
{
  "image": {
    "settings": {
      "index": {
        "max_ngram_diff": "10",
        "number_of_shards": "1",
        "provided_name": "image",
        "creation_date": "1572590562106",
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "token_chars": [
                "letter",
                "digit"
              ],
              "min_gram": "3",
              "type": "ngram",
              "max_gram": "10"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "pO9F7W43QxuZmI9vmXfKyw",
        "version": {
          "created": "7040299"
        }
      }
    }
  }
}
Update:
This config should work fine. I believe it was my mistake; I was searching the wrong index.

You need to index your tags in the way you want to search them. For queries like 'brown snake' or 'australian snake' to match your tags, you would need to break them into smaller tokens.
By default, Elasticsearch indexes strings by passing them through its standard analyzer. You can always create a custom analyzer to store a field however you want, for example one that tokenizes strings into nGrams. You can specify a gram size of 3-10, which will store your 'australianbrownsnake' tag as something like: ['aus', 'aust', ..., 'tra', 'tral', ...]
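To see what such an analyzer produces, you can run the text through the _analyze API. A minimal sketch, assuming the my_analyzer settings and the image index shown in the question:
GET image/_analyze
{
  "analyzer": "my_analyzer",
  "text": "australianbrownsnake"
}
The response lists every 3- to 10-character gram, so tokens like 'brown' and 'snake' end up in the index and can be matched directly.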
You can then modify your search query to match on your tags.ngram field and you should get the desired results.
The tags.ngram field can be created as a multi-field, like so:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
using ngram tokenizer:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
EDIT1: Elasticsearch uses the analyzer of the field being matched on to analyze the query keywords. You might not need the user's query to be tokenized into nGrams, since there should already be a matching nGram stored in the tags field. You could specify the standard analyzer as a search_analyzer in your mappings.
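A minimal sketch of that field definition (names taken from the question; the only addition is search_analyzer):
"tags": {
  "type": "text",
  "fields": {
    "ngram": {
      "type": "text",
      "analyzer": "my_analyzer",
      "search_analyzer": "standard"
    }
  }
}
Documents are still indexed with the nGram analyzer, but a query for 'snake' is analyzed into the single term 'snake', which matches the stored 5-character gram directly.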

Related

Get shingle result from Elasticsearch

I'm already familiar with the shingle analyzer and I am able to create one as follows:
"index": {
"number_of_shards": 10,
"number_of_replicas": 1
},
"analysis": {
"analyzer": {
"shingle_analyzer": {
"filter": [
"standard",
"lowercase"
"filter_shingle"
]
}
},
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 2,
"min_shingle_size": 2,
"output_unigrams": false
}
}
}
}
and then I use the defined analyzer in the mapping for a field in my document named content. The problem is that the content field is a very long text, and I want to use it as data for an autocomplete suggester, so I just need the one or two words that follow the matched phrase. I wonder if there is a way to get the search (or suggest or analyze) API result as shingles too. Since the shingle analyzer makes Elasticsearch index the text as shingles, is there a way to access those shingles?
For instance, the query I pass is:
GET the_index/_search
{
  "_source": ["content"],
  "explain": true,
  "query": {
    "match": { "content.shngled_field": "news" }
  }
}
The result is:
{
  "took": 395,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 7.8647532,
    "hits": [
      {
        "_shard": "[v3_kavan_telegram_201911][0]",
        "_node": "L6vHYla-TN6CHo2I6g4M_A",
        "_index": "v3_kavan_telegram_201911",
        "_type": "_doc",
        "_id": "g1music/70733",
        "_score": 7.8647532,
        "_source": {
          "content": "Find the latest breaking news and information on the top stories, weather, business, entertainment, politics, and more."
          ....
}
As you can see, the result contains the whole content field, which is a very long text. The result I expect is
"content" : "news and information on"
which is the matched shingle itself.
After you've created an index and ingested a doc:
PUT sh
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "shingled": {
            "type": "text",
            "analyzer": "shingle_analyzer"
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "filter_shingle"
          ]
        }
      },
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 2,
          "min_shingle_size": 2,
          "output_unigrams": false
        }
      }
    }
  }
}
POST sh/_doc/1
{
  "content": "and then I use the defined analyzer in mapping for a field in my document named content.The problem is the content field is a very long text and I want to use it as data for a autocomplete suggester, so I just need one or two words that follow the matched phrase. I wonder if there is a way to get the search (or suggest or analyze) API result as shingles too. By using shingle analyzer the elastic itself indexes the text as shingles, is there a way to access those shingles?"
}
You can either call _analyze with the corresponding analyzer to see how a given text would be tokenized:
GET sh/_analyze
{
  "text": "and then I use the defined analyzer in mapping for a field in my document named content.The problem is the content field is a very long text and I want to use it as data for a autocomplete suggester, so I just need one or two words that follow the matched phrase. I wonder if there is a way to get the search (or suggest or analyze) API result as shingles too. By using shingle analyzer the elastic itself indexes the text as shingles, is there a way to access those shingles?",
  "analyzer": "shingle_analyzer"
}
Or check out the term vectors information:
GET sh/_termvectors/1
{
  "fields": ["content.shingled"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}
Will you be highlighting too?
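If so, a minimal sketch (not from the original answer) of a query against the sh index above that highlights the shingled subfield:
GET sh/_search
{
  "query": {
    "match": { "content.shingled": "shingle analyzer" }
  },
  "highlight": {
    "fields": {
      "content.shingled": {}
    }
  }
}
The highlight section of each hit should then return short fragments around the matched shingles instead of the whole content field.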

Elasticsearch - no hit though there should be a result

I've encountered the following problem with Elasticsearch; does anyone know where I should troubleshoot?
I'm happily retrieving results with the following query:
{
  "query": {
    "match": { "name": "A1212001" }
  }
}
But when I trim the value of the search field "name" down to a substring, I get no hit:
{
  "query": {
    "match": { "name": "A12120" }
  }
}
"A12120" is a substring of already hit query "A1212001"
If you don't have too many documents, you can go with a regexp query
POST /index/_search
{
  "query": {
    "regexp": {
      "name": "A12120.*"
    }
  }
}
or even a wildcard one
POST /index/_search
{
  "query": {
    "wildcard": { "name": "A12120*" }
  }
}
However, as @Waldemar suggested, if you have many documents in your index, the best approach is to use an EdgeNGram tokenizer, since the above queries are not very performant.
First, you define your index settings like this:
PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "edge_tokens",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "edge_tokens": {
          "type": "edgeNGram",
          "min_gram": "1",
          "max_gram": "10",
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
Then, when indexing a document whose name field contains A1212001, the following tokens will be indexed: A, A1, A12, A121, A1212, A12120, A121200, A1212001. So when you search for A12120, you'll find a match.
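To double-check, you can run the value through the analyzer (a sketch using the index and analyzer names defined above):
GET index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "A1212001"
}
The response lists the progressively longer prefixes of A1212001 (lowercased by the filter), and the standard search_analyzer lowercases the query term A12120 the same way, so the two meet on the same term.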
You are using a match query; this query checks for exact terms inside Lucene, and your indexed term is A1212001. If you need to find part of a term, you can use a regexp query, but be aware that regex has some internal performance impact, because the shard has to check all of your terms.
If you need a more "professional" way to search for part of a term, you can use nGrams.

How do I get the most frequent uni-, bi-, tri-grams using shingles in Elasticsearch across all documents

I am using the following field definition in my Elasticsearch index:
"my_text" :{
"type" : "string",
"index" : "analyzed",
"analyzer" : "my_ngram_analyzer",
"term_vector": "with_positions",
"term_statistics" : true
}
where my_ngram_analyzer is used to tokenize text into n-grams using shingles and is defined as:
"settings" : {
"analysis" : {
"filter" : {
"nGram_filter": {
"type": "shingle",
"max_shingle_size": 5,
"min_shingle_size": 2,
"output_unigrams":"true"
}
},
"analyzer" : {
"my_ngram_analyzer" :{
"tokenizer" : "standard",
"filter" : [
"lowercase",
"nGram_filter"
]
}
}
}
}
I have two questions:
How can I find the most frequent n-gram (n = 1 to 5) and its frequency across all documents?
Is there a way to get the total term frequency of an n-gram without querying for a document using the term vectors API with term_statistics?

Elasticsearch multi-word, multi-field search with analyzers

I want to use Elasticsearch for multi-word searches, where all the fields of a document are checked with their assigned analyzers.
So if I have a mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "typeName": {
      "date_detection": false,
      "properties": {
        "stringfield": {
          "type": "string",
          "index": "folding"
        },
        "numberfield": {
          "type": "multi_field",
          "fields": {
            "numberfield": { "type": "double" },
            "untouched": { "type": "string", "index": "not_analyzed" }
          }
        },
        "datefield": {
          "type": "multi_field",
          "fields": {
            "datefield": { "type": "date", "format": "dd/MM/yyyy||yyyy-MM-dd" },
            "untouched": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
As you can see, I have different types of fields, but I do know the structure.
What I want to do is run a search with a single string that checks all fields, using their analyzers too.
For example if the query string is:
John Smith 2014-10-02 300.00
I want to search for "John", "Smith", "2014-10-02" and "300.00" in all the fields, calculating the relevance score as well. The best result is the document that has the most field matches.
So far I have been able to search in all the fields by using multi_field, but in that case I was not able to match 300.00, since 300 was stored in the string part of the multi_field.
If I searched the "_all" field, then no analyzer was used.
How should I modify my mapping or my queries to be able to do a multi-word search, where dates and numbers are recognized in the multi-word query string?
Right now, when I do a search, an error occurs, since the whole string cannot be parsed as a number or a date. And if I use the string representation of the multi_field, then 300.00 will not be a result, since the string representation is 300.
(What I would like is similar to Google search, where dates, numbers and strings are recognized in a multi-word query.)
Any ideas?
Thanks!
Using the whitespace tokenizer in an analyzer and applying that analyzer as the search_analyzer on the fields in your mapping will split the query into parts, and each part is then matched against the index to find the best match. Using an ngram filter in the index_analyzer improves the results considerably.
I am using the following setup for the query:
"query": {
"multi_match": {
"query": "sample query",
"fuzziness": "AUTO",
"fields": [
"title",
"subtitle",
]
}
}
And for mappings and settings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "standard",
            "lowercase",
            "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "title": {
      "type": "string",
      "search_analyzer": "whitespace",
      "index_analyzer": "autocomplete"
    },
    "subtitle": {
      "type": "string"
    }
  }
}
See the following answer and article for more details.

Partial Search using Analyzer in ElasticSearch

I am using Elasticsearch to build an index of URLs.
I split each URL into 3 parts: "domain", "path", and "query".
For example: testing.com/index.html?user=who&pw=no will be separated into
domain = testing.com
path = index.html
query = user=who&pw=no
There are problems when I want to do a partial search on my index, such as "user=who" or "ing.com".
Is it possible to use an analyzer when I search, even though I didn't use an analyzer when indexing?
How can I do a partial search based on the analyzer?
Thank you very much.
2 approaches:
1. Wildcard search - easy and slow
"query": {
"query_string": {
"query": "*ing.com",
"default_field": "domain"
}
}
2. Use an nGram tokenizer - harder but faster
Index Settings
"settings" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "1",
"max_gram" : "50"
}
}
}
}
Mapping
"properties": {
"domain": {
"type": "string",
"index_analyzer": "my_ngram_analyzer"
},
"path": {
"type": "string",
"index_analyzer": "my_ngram_analyzer"
},
"query": {
"type": "string",
"index_analyzer": "my_ngram_analyzer"
}
}
Querying
"query": {
"match": {
"domain": "ing.com"
}
}
The trick with the query string is to split a string like "user=who&pw=no" into the tokens ["user=who&pw=no", "user=who", "pw=no"] at index time. That lets you easily run queries like "user=who". You could do this with a pattern_capture token filter (a sketch follows below), but there may be better ways to do it as well.
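A minimal sketch of that idea, with made-up names (query_parts, url_query_analyzer): the keyword tokenizer keeps the whole value as one token, and the pattern_capture filter then emits each key=value pair as an extra token:
"settings": {
  "analysis": {
    "filter": {
      "query_parts": {
        "type": "pattern_capture",
        "preserve_original": true,
        "patterns": ["([^&]+)"]
      }
    },
    "analyzer": {
      "url_query_analyzer": {
        "type": "custom",
        "tokenizer": "keyword",
        "filter": ["lowercase", "query_parts"]
      }
    }
  }
}
Applied to "user=who&pw=no", this stores the original token plus "user=who" and "pw=no", so a match query for "user=who" hits.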
You can also make the hostname and path more searchable with the path_hierarchy tokenizer; for example, "/some/path/somewhere" becomes ["/some", "/some/path", "/some/path/somewhere"]. You can also index the hostname with the path_hierarchy tokenizer by using the settings reverse: true and delimiter: ".". You may also want to use a stopwords filter to exclude top-level domains.
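For the domain part, that would look roughly like this (the tokenizer name domain_hierarchy is made up for illustration):
"tokenizer": {
  "domain_hierarchy": {
    "type": "path_hierarchy",
    "delimiter": ".",
    "reverse": true
  }
}
With reverse: true, "www.testing.com" is tokenized as ["www.testing.com", "testing.com", "com"], so a search for "testing.com" matches, and a stop filter can then drop bare top-level domains such as "com".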
