I'm aiming to build an index that, for each document, will break it down by word ngrams (uni, bi, and tri), then capture term vector analysis on all of those word ngrams. Is that possible with Elasticsearch?
For instance, for a document field containing "The red car drives.", I would be able to get the following information:
red - 1 instance
car - 1 instance
drives - 1 instance
red car - 1 instance
car drives - 1 instance
red car drives - 1 instance
Thanks in advance!
Assuming you already know about the Term Vectors API, you could apply the shingle token filter at index time so that those word n-grams are added to the token stream as tokens independent of each other.
Set min_shingle_size to 1 (instead of the default of 2) and max_shingle_size to at least 3 (instead of the default of 2).
And since you left "the" out of the expected terms, you should also apply a stop-words filter before the shingle filter.
The analyzer settings would be something like this:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "evolutionAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle"
          ]
        }
      },
      "filter": {
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_",
          "enable_position_increments": "false"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "1",
          "max_shingle_size": "3"
        }
      }
    }
  }
}
You can test the analyzer using the _analyze api endpoint.
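For example, assuming the settings above were applied to an index named my_index (a placeholder name), a request like this should return the unigram, bigram, and trigram tokens produced for the sample sentence:
GET /my_index/_analyze
{
  "analyzer": "evolutionAnalyzer",
  "text": "The red car drives."
}
Once documents are indexed, the per-document counts themselves can then be pulled for that field with the Term Vectors API.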
Related
I'm trying to use a combined_fields query with a synonym_graph search-time token filter in Elasticsearch. When I query for a multi-term phrase that appears in my synonym file, Elasticsearch seems to switch from "or" logic to "and" logic between my original terms, and I can't find a way to configure this. Here's an example Elasticsearch query that has been exaggerated for demonstration purposes:
GET /products/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "combined_fields": {
            "query": "boxes other rectangle hinged lid hook cutout",
            "operator": "or",
            "minimum_should_match": 1,
            "fields": [
              "productTitle^9",
              "fullDescription^5"
            ],
            "auto_generate_synonyms_phrase_query": false
          }
        }
      ]
    }
  }
}
When I submit the query on my index with an empty synonyms.txt file, it returns >1000 hits. As expected, the top hits contain all or many of the terms in the query, and the result set is composed of all documents that contain any of the terms. However, when I add this line to the synonyms.txt file:
black spigot, boxes other rectangle hinged lid hook cutout
the query only returns 4 hits. These hits either contain all of the terms in my query across the queried fields, or both the terms "black" and "spigot".
My conclusion is that the presence of the phrase in the synonyms file is influencing how the "non-synonym-replaced" phrase is being searched for. This seems counterintuitive: adding a phrase to the synonyms file should only be able to increase the number of results that a search for that exact phrase produces, right?
Does anyone know what I'm doing incorrectly, or if my expectations are reliant upon some fundamental misunderstanding of how Elasticsearch works? I observe the same behavior when I use a multi-match query or an array of match queries, and I've tried every combination of query options that I reasonably think might resolve the problem.
For reference, here is my analyzer configuration:
"analysis": {
"analyzer": {
"indexAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"porter_stem",
"stop"
]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"porter_stem",
"stop",
"productSynonym"
]
}
},
"filter": {
"productSynonym": {
"type": "synonym_graph",
"synonyms_path": "analysis/synonyms.txt"
}
}
}
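One thing that may help when debugging this (not from the original post, just a suggestion): run the search text through the search analyzer with the _analyze API and inspect the token graph it produces, since multi-term synonyms in a synonym_graph filter emit tokens whose position lengths downstream queries can treat as phrases:
GET /products/_analyze
{
  "analyzer": "searchAnalyzer",
  "text": "boxes other rectangle hinged lid hook cutout"
}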
I indexed some data using an nGram analyzer (which emits only trigrams) to solve the compound-words problem exactly as described in the ES guide.
However, this doesn't work as expected: the corresponding match query returns all documents where at least one nGram token (per word) matched.
Example:
Let's take these two indexed documents with a single field, using that nGram analyzer:
POST /compound_test/doc/_bulk
{ "index": { "_id": 1 }}
{ "content": "elasticsearch is awesome" }
{ "index": { "_id": 2 }}
{ "content": "some search queries don't perform good" }
Now if I run the following query, I get both results:
"match": {
"content": {
"query": "awesome search",
"minimum_should_match": "100%"
}
}
The query constructed from this could be expressed like this:
(awe OR wes OR eso OR ome) AND (sea OR ear OR arc OR rch)
That's why the second document matches (it contains "some" and "search"). It would even match a document with words that contain the tokens "som" and "rch".
What I actually want is a query where each analyzed token must match (in the best case depending on the minimum-should-match), so something like this:
"match": {
"content": {
"query": "awe wes eso ome sea ear arc rch",
"analyzer": "whitespace",
"minimum_should_match": "100%"
}
}
...without actually building that query by hand / pre-analyzing it on the client side.
All settings and data to reproduce that behavior can be found at https://pastebin.com/97QxfaSb
Is there such a possibility?
While writing the question, I accidentally found the answer:
If the ngram analyzer uses an ngram filter to generate trigrams (as described in the guide), it works the way described above. (I guess this is because the actual tokens are not the single ngrams but the combination of all created ngrams.)
To achieve the wanted behavior, the analyzer must use the ngram tokenizer:
"tokenizer": {
"trigram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"trigrams_with_tokenizer": {
"type": "custom",
"tokenizer": "trigram_tokenizer"
}
}
Producing the tokens this way yields the desired result when querying that field.
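For completeness, a minimal sketch of the full setup (reusing the compound_test index, doc type, and content field from the example above; the exact settings in the pastebin may differ slightly):
PUT /compound_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "trigrams_with_tokenizer": {
          "type": "custom",
          "tokenizer": "trigram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "trigrams_with_tokenizer"
        }
      }
    }
  }
}
With this mapping, the original match query for "awesome search" with a minimum_should_match of 100% no longer matches document 2, because that document does not contain all of the trigrams of "awesome" (awe, wes, eso).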
I have a search query that does a basic search after a complete word is typed in. I'm looking for auto-suggestions after 3 letters.
For Example,
Title- samsung galaxy s4
I want to see auto suggestions after "sam" instead of complete word "samsung".
While the ngram filter works, there is a dedicated suggester for this use case, called the completion suggester. It uses a different data structure internally, which allows suggestions to be executed in the millisecond range, making it much faster than a regular query using edge n-grams. Check out the documentation here:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-suggesters-completion.html
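A minimal sketch of what that could look like (index and field names here are made up for illustration; on 5.x the properties block sits under a mapping type):
PUT /products
{
  "mappings": {
    "properties": {
      "title_suggest": { "type": "completion" }
    }
  }
}
POST /products/_search
{
  "suggest": {
    "title_suggestion": {
      "prefix": "sam",
      "completion": {
        "field": "title_suggest"
      }
    }
  }
}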
You need to use an edgeNGram tokenizer for this.
{
  "analysis": {
    "tokenizer": {
      "autocomplete_tokenizer": {
        "type": "edgeNGram",
        "min_gram": "3",
        "max_gram": "20"
      }
    },
    "analyzer": {
      "autocomplete_edge_ngram": {
        "filter": ["lowercase"],
        "type": "custom",
        "tokenizer": "autocomplete_tokenizer"
      }
    }
  }
}
and the mapping will be:
{
  "title_edge_ngram": {
    "type": "text",
    "analyzer": "autocomplete_edge_ngram",
    "search_analyzer": "standard"
  }
}
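With that in place, typing three or more characters can be handled with a normal match query against the edge n-gram field, for example (my_index is just a placeholder index name):
GET /my_index/_search
{
  "query": {
    "match": {
      "title_edge_ngram": "sam"
    }
  }
}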
Or you can use the completion suggester in elasticsearch.
For the three-character minimum, you have to enforce it on the client side yourself.
I am wondering if it is possible to use shingles with the Simple Query String query. My mapping for the relevant field looks like this:
{
  "text_2": {
    "type": "string",
    "analyzer": "shingle_analyzer"
  }
}
The analyzer and filters are defined as follows:
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "custom_delimiter", "lowercase", "stop", "snowball", "filter_shingle"]
}
},
"filter": {
"filter_shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"custom_delimiter": {
"type": "word_delimiter",
"preserve_original": True
}
}
I am performing the following search:
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "analyzer": "shingle_analyzer",
            "fields": [
              "text_2"
            ],
            "lenient": "false",
            "default_operator": "and",
            "query": "porsches small red"
          }
        }
      ]
    }
  }
}
Now, I have a document with text_2 = small red porsches. Since I am using the AND operator, I would expect my document NOT to match, because the above query should produce a shingle of "porsches small red", which is in a different order. However, when I look at the match explanation I only see the single-word tokens "red", "small", and "porsche", which of course match.
Is SQS incompatible with shingles?
The answer is "Yes, but...".
What you're seeing is normal given the fact that the text_2 field probably has the standard index analyzer in your mapping (according to the explanation you're seeing), i.e. the only tokens that have been produced and indexed for small red porsches are small, red and porsches.
On the query side, you're probably using a shingle analyzer with output_unigrams set to true (default), which means that the unigram tokens will also be produced in addition to the bigrams (again according to the explanation you're seeing). Those unigrams are the only reason why you get matches at all. If you want to match on bigrams, then one solution is to use the shingle analyzer at indexing time, too, so that bigrams small red and red porsches can be produced and indexed as well in addition to the unigrams small, red and porsches.
Then at query time, the unigrams would match as well but small red bigram would definitely match, too. In order to only match on the bigrams, you can have another shingle analyzer just for query time whose output_unigrams is set to false, so that only bigrams get generated out of your search input. And in case your query only contains one single word (e.g. porsches), then that shingle analyzer would only generate a single unigram (because output_unigrams_if_no_shingles is true) and the query would still match your document. If that's not desired you can simply set output_unigrams_if_no_shingles to false in your shingle search analyzer.
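A sketch of what that could look like (the analyzer and filter names below are illustrative, not taken from the question):
"analysis": {
  "filter": {
    "filter_shingle": {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 5,
      "output_unigrams": true
    },
    "query_shingle": {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 5,
      "output_unigrams": false,
      "output_unigrams_if_no_shingles": true
    }
  },
  "analyzer": {
    "shingle_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "filter_shingle"]
    },
    "shingle_search_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "query_shingle"]
    }
  }
}
The text_2 field would then declare "analyzer": "shingle_analyzer" and "search_analyzer": "shingle_search_analyzer", so that only shingles of two or more words have to match at query time.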
Here's the mapping for one of the fields in my index:
"resourceId": {
"type": "string",
"index_analyzer": "partial_match",
"search_analyzer": "lowercase",
"include_in_all": true
}
Here are the custom analyzers used in the index:
"analysis": {
"filter": {
"partial_match_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 50
}
},
"analyzer": {
"partial_match": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"partial_match_filter"
]
},
"lowercase": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
This field will contain an array of strings, which are the multiple IDs that a resource can have (it can have multiple IDs due to different systems calling each resource by a different id).
Now let's suppose that resource #1 has three IDs:
"resourceId": ["ID:MATCH", "MATCH", "ID:ALT"]
And that resource #2 has only one ID:
"resourceId": ["ID:MATCHFIVE"]
And let's suppose that we run this query against my index:
{
  "from": 0,
  "size": 30,
  "query": {
    "query_string": {
      "query": "resourceId:ID\\:MATCH"
    }
  }
}
What I'd like is for resource #1 to show up first, since its array contains an exact match. However, resource #2 is the one coming out on top.
When I used the explain parameter on the query request, I saw that the tf and idf scores were the same for both resources. However, the norm score was lower for resource #1.
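(For reference, that per-document scoring breakdown can be requested by adding "explain": true to the search body, e.g.:)
{
  "explain": true,
  "query": {
    "query_string": {
      "query": "resourceId:ID\\:MATCH"
    }
  }
}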
My theory is that since resource #1 has three items in the array (which I assume are concatenated together during indexing), the field is considered larger, and thus the norm value is decreased. When it comes to resource #2, it has only one item (and it's shorter than the concatenation of the other array), so the norm is higher, bumping the resource to the top.
My question, therefore, is: when calculating the score, is it possible for the norm calculation to only consider the size of the item that matched in the array?
For example: the search for "ID:MATCH" would find the exact match on resource #1 on resourceId[0]. At this point, all other items in the array would be put aside and the norm would be calculated based on that single item (resourceId[0]), showing a perfect match. As for resource #2, the norm would be lower, since the resourceId field would be larger.
If this isn't possible, would there be workarounds to get the exact match to the top? Or maybe I'm completely off on my theory?