I have a database of records, each of which has a right and a left field, and both these fields contain text. The database is indexed with Elasticsearch.
I want to search both fields and find the records in which either field contains two or more words with given prefixes. The search should be strict: it should return only the records that match all the words in the query, not just some of them.
For example, the query "qui bro" should return the record containing the sentence "The quick brown fox jumped over the lazy dog", but not the one containing the sentence "The quick fox jumped over the lazy dog".
I've seen a description of how to perform prefix queries with Elasticsearch (and can reproduce it when searching for one word in one field).
I've also seen a description of how to perform multi-match queries to search through several fields at once.
But what I need is a combination of these techniques that lets me search several fields at once and match on parts of words, while returning only those records that contain every word whose prefix appears in the query.
How can I do that? Any method will do (prefixes, ngrams, whatever).
(P.S.: My question may, to a certain extent, be a duplicate of this one, but since it never was answered, I hope I'm not breaking any rules by asking mine.)
======================================
UPDATED:
Oh, I might have solved the first part of the question. Here is the syntax that seems to work in my Rails app (using the elasticsearch-rails gem):
response = Paragraph.search query: {bool: { must: [ { prefix: {right: "qui"}}, {prefix: {right: "bro"}} ] } }
Or, to re-write it in pure Elasticsearch syntax:
{
  "bool": {
    "must": [
      { "prefix": { "right": "qui" }},
      { "prefix": { "right": "bro" }}
    ]
  }
}
So my updated question is how to combine this prefix search with a multi_match search (to search through both the right and the left fields).
OK, here is a possible answer that seems to work. The code has to search through multiple fields for several incomplete words and return only the records that contain all these words.
Here is the request written in elasticsearch-rails syntax:
response = Paragraph.search query: {bool: { must: [ { multi_match: { query: "qui", type: "phrase_prefix", fields: ["right", "left"]}}, { multi_match: { query: "brow", type: "phrase_prefix", fields: ["right", "left"]}}]}}
Or, rewritten in the syntax used on the Elasticsearch site:
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "qui",
            "type": "phrase_prefix",
            "fields": ["right", "left"]
          }
        },
        {
          "multi_match": {
            "query": "brow",
            "type": "phrase_prefix",
            "fields": ["right", "left"]
          }
        }
      ]
    }
  }
}
This seems to work. But if somebody has other solutions (particularly if these solutions will make the search case-insensitive), I will be happy to hear them.
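The request above hard-codes one clause per word. As a minimal sketch of how the same body could be generated from a free-text query, the helper below splits the input on whitespace and emits one phrase_prefix multi_match clause per word. The function name build_all_prefixes_query is hypothetical, and the code only builds the request body; it does not talk to Elasticsearch. On the case-sensitivity point: multi_match is a full-text query and analyzes its input, so with an analyzer that lowercases (such as the default standard analyzer) it should already be case-insensitive, whereas the plain prefix query is term-level and does not analyze its input.

```python
def build_all_prefixes_query(text, fields):
    """Build a bool/must body with one phrase_prefix multi_match per word.

    Hypothetical helper: every whitespace-separated word in `text` becomes
    its own clause, so a record must match all of the prefixes.
    """
    words = text.split()
    if not words:
        raise ValueError("query text must contain at least one word")
    must = [
        {
            "multi_match": {
                "query": word,
                "type": "phrase_prefix",
                "fields": list(fields),
            }
        }
        for word in words
    ]
    return {"query": {"bool": {"must": must}}}


body = build_all_prefixes_query("qui brow", ["right", "left"])
```

The resulting dictionary has the same shape as the JSON request above and could be passed as the search body by whichever client is in use.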
I have a test collection with these two documents:
{ _id: ObjectId("636ce11889a00c51cac27779"), sku: 'kw-lids-0009' }
{ _id: ObjectId("636ce14b89a00c51cac2777a"), sku: 'kw-fs66-gre' }
I've created a search index with this definition:
{
  "analyzer": "lucene.standard",
  "searchAnalyzer": "lucene.standard",
  "mappings": {
    "dynamic": false,
    "fields": {
      "sku": {
        "type": "string"
      }
    }
  }
}
If I run this aggregation:
[{
  $search: {
    index: 'test',
    text: {
      query: 'kw-fs',
      path: 'sku'
    }
  }
}]
Why do I get 2 results? I only expected the one with sku: 'kw-fs66-gre' 😬
During indexing, the standard analyzer breaks the string "kw-lids-0009" into 3 tokens [kw][lids][0009], and similarly tokenizes "kw-fs66-gre" as [kw][fs66][gre]. When you query for "kw-fs", the same analyzer tokenizes the query as [kw][fs], and so Lucene matches on both documents, as both have the [kw] token in the index.
To get the behavior you're looking for, you should index the sku field as type autocomplete and use the autocomplete operator in your $search stage instead of text.
You're still getting 2 results because of the tokenization, i.e., you're still matching on [kw] in both documents. If you search for "fs66", you'll get only a single match. Results are scored by relevance; they are not filtered. You can add {$project: {score: { $meta: "searchScore" }}} to your pipeline to see the difference in score between the matching documents.
If you are looking to get exact matches only, you can use the keyword analyzer or a custom analyzer that strips the dashes, so that you deal with a single token per field and not 3.
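To make the tokenization argument concrete, here is a rough sketch in Python. The function rough_standard_tokens is a hypothetical stand-in that only approximates lucene.standard by splitting on non-alphanumeric characters and lowercasing; the real analyzer follows Unicode segmentation rules and will differ on other inputs. The matching rule (any shared token counts as a match) is likewise a simplification of the text operator that ignores scoring.

```python
import re


def rough_standard_tokens(text):
    """Approximate lucene.standard: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def matching_skus(query, skus):
    """Return the skus that share at least one token with the query."""
    query_tokens = set(rough_standard_tokens(query))
    return [s for s in skus if query_tokens & set(rough_standard_tokens(s))]


skus = ["kw-lids-0009", "kw-fs66-gre"]

both = matching_skus("kw-fs", skus)   # the shared [kw] token hits both skus
single = matching_skus("fs66", skus)  # [fs66] exists in only one sku
```

Under this model "kw-fs" matches both documents through the shared [kw] token, while "fs66" matches only the second one.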
Assuming I have an index with two fields, title and loc, I would like to search in these two fields and get the "best" match. So if I have three items:
{"title": "castle", "loc": "something"},
{"title": "something castle something", "loc": "something,pontivy,something"},
{"title": "something else", "loc": "something"}
... I would like to get the second one, which has "castle" in its title and "pontivy" in its loc. I have simplified the example; the real data is a bit more complicated. I tried this query, but the results don't seem accurate (it's a feeling, not really easy to explain):
GET merimee/_search
{
  "query": {
    "multi_match" : {
      "query": "castle pontivy",
      "fields": [ "title", "loc" ]
    }
  }
}
Is this the right way to search in several fields and get the document that matches in all of them?
Not sure my question is clear enough, I can edit if required.
EDIT:
The story is: the user types "castle pontivy" and I want to get the "best" result for this query, which is the second item because it contains "castle" in "title" and "pontivy" in "loc". In other words, I want the result that matches best across both fields.
As the other poster suggested, you could use a bool query, but that might not work for your use case since you have a single search box that you want to query multiple fields with.
I recommend looking at a Simple Query String query as that will likely give you the functionality you're looking for. See: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html
So you could do something similar to this:
{
  "query": {
    "simple_query_string" : {
      "query": "castle pontivy",
      "fields": ["title", "loc"],
      "default_operator": "and"
    }
  }
}
This will try to give you the best documents in which both terms match, with each term allowed to match in either of those fields. The default operator is set to AND here because otherwise it is OR, which might not give you the expected results.
It is worthwhile to experiment with the other options available for this query type as well. You might also explore the Query String query, as it gives more flexibility, but the Simple Query String query works very well for most cases.
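As a simplified model of what the AND operator across several fields means here, the sketch below treats a document as a hit when every query term appears in at least one of the listed fields. This is an illustration of the matching rule only, under the assumption that both fields use the same analyzer; it ignores scoring entirely and is not how Elasticsearch is implemented. The helper names tokens and matches_all_terms are hypothetical.

```python
import re


def tokens(text):
    """Lowercase and split on non-alphanumerics (rough analyzer stand-in)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_all_terms(doc, query, fields):
    """True when every query term occurs in at least one of the fields."""
    field_tokens = [tokens(doc.get(field, "")) for field in fields]
    return all(
        any(term in present for present in field_tokens)
        for term in tokens(query)
    )


docs = [
    {"title": "castle", "loc": "something"},
    {"title": "something castle something", "loc": "something,pontivy,something"},
    {"title": "something else", "loc": "something"},
]

hits = [d for d in docs if matches_all_terms(d, "castle pontivy", ["title", "loc"])]
```

With the three example items, only the second one survives, because it is the only one that contains both "castle" and "pontivy" somewhere in title or loc.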
This can be done by using a bool query and matching each field separately.
GET _search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "castle"}},
        {"match": {"loc": "pontivy"}}
      ]
    }
  }
}
I'm using Elasticsearch 5.2. I'm executing the query below against an index that has only one document.
Query:
GET test/val/_validate/query?pretty&explain=true
{
  "query": {
    "bool": {
      "should": {
        "multi_match": {
          "query": "alkis stackoverflow",
          "fields": [
            "name",
            "job"
          ],
          "type": "most_fields",
          "operator": "AND"
        }
      }
    }
  }
}
Document:
PUT test/val/1
{
  "name": "alkis stackoverflow",
  "job": "developer"
}
The explanation of the query is
+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow))) #(#_type:val)
I read this as:
Field job must have alkis and stackoverflow
AND
Field name must have alkis and stackoverflow
This is not the case with my document, though. Judging by the result I'm getting, the AND between the two fields is actually an OR.
When I change the type to best_fields I get
+(((+job:alkis +job:stackoverflow) | (+name:alkis +name:stackoverflow))) #(#_type:val)
Which is the correct explanation.
Is there a bug with the validate api? Have I misunderstood something? Isn't the scoring the only difference between these two types?
Since you picked the most_fields type with an explicit AND operator, one match query is generated per field, and all terms must be present in a single field for a document to match. That is your case: both terms, alkis and stackoverflow, are present in the name field, which is why the document matches.
So in the explanation of the corresponding Lucene query, i.e.
+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow)))
when no operator is specified between the clauses, the default one is OR.
So you need to read this as: Field job must have both alkis and stackoverflow OR field name must have both alkis and stackoverflow.
The AND operator that you apply only concerns the terms of your query within a single field; it is not an AND between the fields. Said differently, your query will be executed as two match queries (one per field, each with the AND operator) in a bool/should clause, like this:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "job": { "query": "alkis stackoverflow", "operator": "and" }}},
        { "match": { "name": { "query": "alkis stackoverflow", "operator": "and" }}}
      ]
    }
  }
}
In summary, the most_fields type is most useful when querying multiple fields that contain the same text analyzed in different ways. This is not your case, and you'd probably be better off using cross_fields or best_fields depending on your use case, but certainly not most_fields.
UPDATE
When using the best_fields type, ES generates a dis_max query instead of a bool/should, and the | sign (which is not an OR!) separates the sub-queries of that dis_max query.
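The per-field reading described in this answer can be sketched in Python. The first helper rewrites a most_fields multi_match into per-field match clauses inside a bool/should; the second is a deliberately simplified matcher (whitespace tokens, no scoring) that applies AND within a field and OR across fields. Both function names are hypothetical, and the matcher models the semantics rather than reproducing Elasticsearch's implementation.

```python
def expand_most_fields(query, fields, operator="or"):
    """Rewrite a most_fields multi_match as bool/should of match queries."""
    return {
        "bool": {
            "should": [
                {"match": {field: {"query": query, "operator": operator}}}
                for field in fields
            ]
        }
    }


def doc_matches(doc, query, fields):
    """AND within a single field, OR across the fields (no scoring)."""
    terms = query.lower().split()
    return any(
        all(term in doc.get(field, "").lower().split() for term in terms)
        for field in fields
    )


doc = {"name": "alkis stackoverflow", "job": "developer"}
expanded = expand_most_fields("alkis stackoverflow", ["job", "name"], operator="and")

same_field = doc_matches(doc, "alkis stackoverflow", ["job", "name"])
split_fields = doc_matches(doc, "alkis developer", ["job", "name"])
```

Here same_field is true because the name field alone contains both terms, while split_fields is false because "alkis" and "developer" live in different fields and no single field satisfies the AND.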
How would I define an analyzer so a query recalls a document with term "starbucks" when mistakenly querying "star bucks"?
Or, in general: how would I define an analyzer that can find combined terms even when the supplied query contains term separators or spaces that the indexed term does not?
N-grams clearly don't work, since you'd have to know to split up the term 'starbucks' on indexing in 2 separate terms 'star' and 'bucks'. Splitting on syllables might be enough, but not sure if that's possible (or scales)
Thoughts?
You can use Fuzzy Search.
Here is a full working sample:
PUT test1
POST test1/a
{
  "item1": "starbucks"
}
POST test1/a
{
  "item1": "foo"
}
GET test1/a/_search
{
  "query": {
    "fuzzy": {
      "item1": "star bucks"
    }
  }
}
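Why this sample can match: the fuzzy query is a term-level query, so the string "star bucks" is not split into words but compared as a whole against the indexed term "starbucks", and the two differ by a single inserted space. The sketch below computes a plain Levenshtein distance to show that; it also models the length thresholds of the AUTO fuzziness setting, assuming the query is left at its default. Both helper names are hypothetical, and the real query uses Lucene's own automaton-based matching rather than this dynamic-programming loop.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def auto_fuzziness(term):
    """Allowed edits under AUTO: 0 for 0-2 chars, 1 for 3-5, 2 beyond."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


query_term = "star bucks"
allowed = auto_fuzziness(query_term)
distance_to_starbucks = levenshtein(query_term, "starbucks")
distance_to_foo = levenshtein(query_term, "foo")
```

The distance to "starbucks" is within the allowed number of edits, while the distance to "foo" is not, which is consistent with the sample returning only the first document.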
Does anyone have experience with Elasticsearch and getting the searches to be more flexible?
Currently, if I have a query "House" it returns the correct items, but if "Hous" is typed in, nothing gets returned. Also, if I search for "O.J." it returns O.J., but if I search for OJ I get nothing.
Use prefixing:
bool: {
  must: [
    {
      multi_match: {
        query: "your text query",
        type: "phrase_prefix",
        max_expansions: 4,
        fields: ["field1", "field2"]
      }
    }
  ]
}
You can also add fuzziness; this tolerates small misspellings but may yield less accurate results.
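One way to combine the two ideas, sketched under assumptions: as far as I know, the fuzziness parameter is not accepted on a multi_match of type phrase_prefix, so this sketch puts the fuzzy matching in a second multi_match clause of the default type and joins the two with bool/should. The function name build_flexible_query and the choice of "AUTO" are illustrative, not taken from the answer above, and the code only builds the request body.

```python
def build_flexible_query(text, fields, max_expansions=4):
    """Build a body that accepts either a prefix match or a fuzzy match."""
    fields = list(fields)
    prefix_clause = {
        "multi_match": {
            "query": text,
            "type": "phrase_prefix",
            "max_expansions": max_expansions,
            "fields": fields,
        }
    }
    fuzzy_clause = {
        "multi_match": {
            "query": text,
            "fields": fields,
            "fuzziness": "AUTO",  # assumption: let Elasticsearch pick the edit budget
        }
    }
    return {
        "query": {
            "bool": {
                "should": [prefix_clause, fuzzy_clause],
                "minimum_should_match": 1,
            }
        }
    }


body = build_flexible_query("hous", ["field1", "field2"])
```

Keeping the clauses separate also makes it easy to drop the fuzzy one if it brings in too many loosely related results.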