Multi-field Search with Array Using 'AND' Operator in Elasticsearch

I want to query the values of a multi-value field as separate 'fields' in the same way I'm querying the other fields.
I have a data structure like so:
{
name: 'foo one',
alternate_name: 'bar two',
lay_name: 'baz three',
tags: ['stuff like', 'this that']
}
My query looks like this:
{
  query: {
    multi_match: {
      query: 'stuff',
      type: 'best_fields',
      fields: ['name', 'alternate_name', 'lay_name', 'tags'],
      operator: 'and'
    }
  }
}
The 'type' and 'operator' settings work perfectly for the single-value fields: a document only matches when one field's value contains my entire query. For example, querying 'foo two' doesn't return a match.
I'd like the tags field to behave the same way. Right now, querying 'stuff that' will return a match when it shouldn't because no fields or tag values contain both words in a single value. Is there a way to achieve this?
EDIT
Val's assessment was spot on. I've updated my mapping to the following (using elasticsearch-rails/elasticsearch-model):
mapping dynamic: false, include_in_all: true do
... other fields ...
indexes :tags, type: 'nested' do
indexes :tag, type: 'string', include_in_parent: true
end
end

Please show your mapping type, but I suspect your tags field is a simple string field like this:
{
"your_type" : {
"properties" : {
"tags" : {
"type" : "string"
}
}
}
}
In this case ES will "flatten" all your tags under the hood in the tags field at indexing time like this:
tags: "stuff", "like", "this", "that"
This is why you get results when querying "stuff that": the flattened tags field contains both words.
The way forward would be to make tags a nested object type, like this:
{
"your_type" : {
"properties" : {
"tags" : {
"type" : "nested",
"properties": {
"tag" : {"type": "string" }
}
}
}
}
}
You'll need to reindex your data but at least querying for tags: "stuff that" will not return anything anymore. Your tag tokens will be "kept together" as you expect. Give it a try.
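To make the difference concrete, here is a minimal Python sketch of the two behaviours. It does not call Elasticsearch; `tokenize`, `matches_flat` and `matches_nested` are made-up helpers, and the tokenizer is only a rough stand-in for the real analyzer.

```python
import re

def tokenize(text):
    # Rough stand-in for an analyzer: lowercase, split on non-alphanumerics.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_flat(values, query):
    # Plain string field: tokens from every array value end up in one bag.
    bag = set()
    for value in values:
        bag |= tokenize(value)
    return tokenize(query) <= bag

def matches_nested(values, query):
    # Nested objects: all query terms must occur inside a single value.
    terms = tokenize(query)
    return any(terms <= tokenize(value) for value in values)

tags = ["stuff like", "this that"]
print(matches_flat(tags, "stuff that"))    # True: the unwanted match
print(matches_nested(tags, "stuff that"))  # False
print(matches_nested(tags, "this that"))   # True
```

With the flat field the 'and' operator is satisfied by words from different tags, while the per-value check only matches when one tag holds the whole query.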

Related

Atlas Search Index partial match

I have a test collection with these two documents:
{ _id: ObjectId("636ce11889a00c51cac27779"), sku: 'kw-lids-0009' }
{ _id: ObjectId("636ce14b89a00c51cac2777a"), sku: 'kw-fs66-gre' }
I've created a search index with this definition:
{
"analyzer": "lucene.standard",
"searchAnalyzer": "lucene.standard",
"mappings": {
"dynamic": false,
"fields": {
"sku": {
"type": "string"
}
}
}
}
If I run this aggregation:
[{
$search: {
index: 'test',
text: {
query: 'kw-fs',
path: 'sku'
}
}
}]
Why do I get 2 results? I only expected the one with sku: 'kw-fs66-gre' 😬
During indexing, the standard analyzer breaks the string "kw-lids-0009" into 3 tokens [kw][lids][0009], and similarly tokenizes "kw-fs66-gre" as [kw][fs66][gre]. When you query for "kw-fs", the same analyzer tokenizes the query as [kw][fs], so Lucene matches both documents, as both have the [kw] token in the index.
To get the behavior you're looking for, you should index the sku field as type autocomplete and use the autocomplete operator in your $search stage instead of text.
You're still getting 2 results because of the tokenization, i.e., you're still matching on [kw] in two documents. If you search for "fs66", you'll get a single match only. Results are scored by relevance; they are not filtered. You can add {$project: {score: { $meta: "searchScore" }}} to your pipeline to see the difference in score between the matching documents.
If you want exact matches only, you can use the keyword analyzer or a custom analyzer that strips the dashes, so you deal with a single token per field instead of 3.
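The token overlap described above can be reproduced with a small Python sketch. It is only an approximation: `standard_tokens` splits on anything that is not a letter or digit, which is simpler than what lucene.standard really does, and `text_match` and `hits` are hypothetical helpers rather than Atlas Search calls.

```python
import re

def standard_tokens(text):
    # Approximation only: split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def text_match(query, value):
    # A value matches when it shares at least one token with the query.
    return bool(set(standard_tokens(query)) & set(standard_tokens(value)))

skus = ["kw-lids-0009", "kw-fs66-gre"]

def hits(query):
    return [sku for sku in skus if text_match(query, sku)]

print(standard_tokens("kw-fs"))  # ['kw', 'fs']
print(hits("kw-fs"))             # both skus, because both contain 'kw'
print(hits("fs66"))              # ['kw-fs66-gre']
```

Neither document contains the token 'fs', so the two hits for "kw-fs" come entirely from the shared 'kw' token.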

ElasticSearch Query with two fields and 'AND' filter - Java API

Let's say I have an index in elasticsearch with two fields: title and tags. I have a few documents there:
Should be returned by query
{ title: "My main title on this page", tags: ["first", "whatever"] }
Should not be returned by query
{ title: "My on this page", tags: ["first", "page", "whatever"] }
Should be returned by query
{ title: "My main title on this page", tags: ["page", "whatever"]}
I want to find all documents whose title CONTAINS "main title" AND which have the tag "first" OR "page".
I want to use the Java API for this, but I'm not sure how to do it. I know that I can use a filter query to create "or" and "and". I'm not sure how to add the title part to the query, or how to get the "at least one from the list" logic.
Any ideas?
It depends whether you care about the order of the "main" and "title" words (is it a phrase), but this is relatively simple:
{
"query" : {
"bool" : {
"must" : [
{
"match_phrase" : {
"title" : "main title"
}
},
{
"terms" : {
"tags" : [ "first", "page" ]
}
}
]
}
}
}
By default, the terms query matches a document when at least one of the listed tags is present, and matching several tags boosts the score (relevancy). It performs an exact match, which you should only do with not_analyzed strings. Anything within the must clause inherently behaves like an AND; the bool query/filter documentation covers this in more detail. This translates pretty simply into the Java API:
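import static org.elasticsearch.index.query.QueryBuilders.*;
import org.elasticsearch.action.search.SearchResponse;

// "client" is an already-connected org.elasticsearch.client.Client
SearchResponse response =
    client.prepareSearch("your-index").setTypes("your-type")
        .setQuery(
            boolQuery()
                // both must clauses have to match (AND)
                .must(matchPhraseQuery("title", "main title"))
                // at least one of the listed tags has to match
                .must(termsQuery("tags", "first", "page")))
        .execute()
        .actionGet();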
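As a sanity check of the logic (not of the Java API), here is a small Python sketch that applies the same two conditions to the three sample documents from the question. `phrase_match`, `terms_match` and `bool_must` are made-up helpers; the phrase check simply splits on whitespace, which is much cruder than a real analyzer.

```python
docs = [
    {"title": "My main title on this page", "tags": ["first", "whatever"]},
    {"title": "My on this page", "tags": ["first", "page", "whatever"]},
    {"title": "My main title on this page", "tags": ["page", "whatever"]},
]

def phrase_match(text, phrase):
    # The phrase words must appear next to each other, in order.
    words, target = text.lower().split(), phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def terms_match(values, wanted):
    # At least one value has to be in the wanted set.
    return any(value in wanted for value in values)

def bool_must(doc):
    # Every must clause has to hold, so the clauses combine as AND.
    return (phrase_match(doc["title"], "main title")
            and terms_match(doc["tags"], {"first", "page"}))

results = [bool_must(doc) for doc in docs]
print(results)  # [True, False, True]
```

The second document is rejected because its title lacks the phrase, even though two of its tags match.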

Elasticsearch aggregation on object

How can I run an aggregation query on a single object property, but get all properties in the result? E.g. I want to get [{'doc_count': 1, 'key': {'id': 1, 'name': 'tag name'}}], but got [{'doc_count': 1, 'key': '1'}] instead. Aggregation on field 'tags' returns zero results.
Mapping:
{
"test": {
"properties" : {
"tags" : {
"type" : "object",
"properties": {
"id" : {"type": "string", "index": "not_analyzed"},
"name" : {"type": "string", "index": "not_analyzed", "enabled": false}
}
}
}
}
}
Aggregation query: (returns only IDs as expected, but how can I get ID & name pairs in results?)
'aggregations': {
'tags': {
'terms': {
'field': 'tags.id',
'order': {'_count': 'desc'},
},
}
}
EDIT:
Got ID & Name by aggregating on "script": "_source.tags" but still looking for faster solution.
You can use a script if you want, e.g.
"terms":{"script":"doc['tags.id'].value + '|' + doc['tags.name'].value"}
For each created bucket you will get a key with the values of the fields that you have included in your script. To be honest though, the purpose of aggregations is not to return full docs, but to do calculations on groups of documents (buckets) and return the results, e.g. sums and distinct values. What you are actually doing with your query is creating buckets based on the field tags.id.
Keep in mind that the key in the result will include both values separated by a '|', so you might have to split it to extract all the information that you need.
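Splitting the combined key back into its parts could look like the following Python sketch. The `buckets` list is a made-up example of what the terms aggregation might return, and `split_bucket_key` is a hypothetical helper, not part of any Elasticsearch client.

```python
def split_bucket_key(bucket, sep="|"):
    # partition() splits on the first separator only, so a name that
    # itself contains '|' stays intact; the id must not contain it.
    tag_id, _, name = bucket["key"].partition(sep)
    return {"id": tag_id, "name": name, "doc_count": bucket["doc_count"]}

# Hypothetical buckets, shaped like the result in the question.
buckets = [{"key": "1|tag name", "doc_count": 1}]

parsed = [split_bucket_key(bucket) for bucket in buckets]
print(parsed)  # [{'id': '1', 'name': 'tag name', 'doc_count': 1}]
```

The id comes back as a string, so convert it yourself if you need a number.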
It's also possible to nest aggregations: you could aggregate by id, then by name.
As additional information, the answer above (cpard's) works perfectly with nested objects. The weird results that you got may come from the fact that you are using object and not nested.
The difference between these types is that a nested object keeps the internal relation between the elements of an object. That is why "terms":{"script":"doc['tags.id'].value + '|' + doc['tags.name'].value"} makes sense. If you use the object type, elasticsearch doesn't know which tags.name goes with which tags.id.
For more detail:
https://www.elastic.co/blog/managing-relations-inside-elasticsearch

Elasticsearch bool search matching incorrectly

So I have an object with an Id field which is populated by a Guid. I'm doing an elasticsearch query with a "Must" clause to match a specific Id in that field. The issue is that elasticsearch is returning a result which does not match the Guid I'm providing exactly. I have noticed that the Guid I'm providing and one of the results that Elasticsearch is returning share the same digits in one particular part of the Guid.
Here is my query source (I'm using the Elasticsearch head console):
{
query:
{
bool:
{
must: [
{
text:
{
couchbaseDocument.doc.Id: 5cd1cde9-1adc-4886-a463-7c8fa7966f26
}
}]
must_not: [ ]
should: [ ]
}
}
from: 0
size: 10
sort: [ ]
facets: { }
}
And it is returning two results. One with ID of
5cd1cde9-1adc-4886-a463-7c8fa7966f26
and the other with ID of
34de3d35-5a27-4886-95e8-a2d6dcf253c2
As you can see, they both share the same middle term "-4886-". However, I would expect this query to only return a record if the record were an exact match, not a partial match. What am I doing wrong here?
The query is (probably) correct.
What you're almost certainly seeing is the work of the 'Standard Analyzer', which is used by default at index time. This analyzer tokenizes the input (splits it into terms) on the hyphen ('-'), among other characters. That's why a match is found: both Guids contain the term 4886.
To remedy this, you want to set your couchbaseDocument.doc.Id field to not_analyzed.
See: How to not-analyze in ElasticSearch? and the links from there into the official docs.
Mapping would be something like:
{
"yourType" : {
"properties" : {
"couchbaseDocument.doc.Id" : {"type" : "string", "index" : "not_analyzed"}
}
}
}
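The effect of not_analyzed can be sketched in a few lines of Python using the two Guids from the question. This only simulates the idea: `analyzed_terms` is a crude approximation of the standard analyzer, and `not_analyzed_terms` stands in for a field that is indexed as one single term.

```python
import re

def analyzed_terms(value):
    # Crude approximation of the standard analyzer: split on hyphens etc.
    return set(re.findall(r"[a-z0-9]+", value.lower()))

def not_analyzed_terms(value):
    # A not_analyzed field is indexed as one single, unmodified term.
    return {value}

ids = [
    "5cd1cde9-1adc-4886-a463-7c8fa7966f26",
    "34de3d35-5a27-4886-95e8-a2d6dcf253c2",
]
query = "5cd1cde9-1adc-4886-a463-7c8fa7966f26"

# Analyzed field: any shared term is enough to produce a hit.
analyzed_hits = [i for i in ids if analyzed_terms(query) & analyzed_terms(i)]
# not_analyzed field: only the identical value is a hit.
exact_hits = [i for i in ids if query in not_analyzed_terms(i)]

print(analyzed_terms(ids[0]) & analyzed_terms(ids[1]))  # {'4886'}
print(len(analyzed_hits), len(exact_hits))              # 2 1
```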

How to map / query this data with ElasticSearch?

I'm using ElasticSearch along with the tire gem to power the search
functionality of my site. I'm having trouble figuring out how to map and
query the data to get the results I need.
Relevant code is below. I will explain the desired output below that as
well.
# models/product.rb
class Product < ActiveRecord::Base
include Tire::Model::Search
include Tire::Model::Callbacks
has_many :categorizations
has_many :categories, :through => :categorizations
has_many :product_traits
has_many :traits, :through => :product_traits
mapping do
indexes :id, type: 'integer'
indexes :name, boost: 10
indexes :description, analyzer: 'snowball'
indexes :categories do
indexes :id, type: 'integer'
indexes :name, type: 'string', index: 'not_analyzed'
end
indexes :product_traits, type: 'string', index: 'not_analyzed'
end
def self.search(params={})
out = tire.search(page: params[:page], per_page: 12, load: true) do
query do
boolean do
must { string params[:query], default_operator: "OR" } if params[:query].present?
must { term 'categories.id', params[:category_id] } if params[:category_id].present?
# if we aren't browsing a category, search results are "drill-down"
unless params[:category_id].present?
must { term 'categories.name', params[:categories] } if params[:categories].present?
end
params.select { |p| p[0,2] == 't_' }.each do |name,value|
must { term :product_traits, "#{name[2..-1]}##{value}" }
end
end
end
# don't show the category facets if we are browsing a category
facet("categories") { terms 'categories.name', size: 20 } unless params[:category_id].present?
facet("traits") {
terms :product_traits, size: 1000 #, all_terms: true
}
# raise to_curl
end
# process the trait facet results into a hash of arrays
if out.facets['traits']
facets = {}
out.facets['traits']['terms'].each do |f|
split = f['term'].partition('#')
facets[split[0]] ||= []
facets[split[0]] << { 'term' => split[2], 'count' => f['count'] }
end
out.facets['traits']['terms'] = facets
end
out
end
def to_indexed_json
{
id: id,
name: name,
description: description,
categories: categories.all(:select => 'categories.id, categories.name, categories.keywords'),
product_traits: product_traits.includes(:trait).collect { |t| "#{t.trait.name}##{t.value}" }
}.to_json
end
end
As you can see above, I'm doing some pre/post processing of the data
to/from elasticsearch in order to get what I want from the
'product_traits' field. This is what doesn't feel right and where my
questions originate.
I have a large catalog of products, each with a handful of 'traits' such
as color, material and brand. Since these traits are so varied, I
modeled the data to include a Trait model which relates to the Product
model via a ProductTrait model, which holds the value of the trait for
the given product.
First question is: How can I create the elasticsearch mapping to index
these traits properly? I assume that this involves a nested type but I
can't make enough sense of the docs to figure it out.
Second question: I want the facets to come back in groups (in the
manner that I am processing them at the end of the search method
above) but with counts that reflect how many matches there are without
taking into account the currently selected value for each trait. For
example: If the user searches for 'Glitter' and then clicks the link
corresponding to the 'Blue Color' facet, I want all the 'Color' facets
to remain visible and show counts corresponding to the query results
without the 'Blue Color' filter. I hope that is a good explanation,
sorry if it needs more clarification.
If you index your traits as:
[
{
trait: 'color',
value: 'green'
},
{
trait: 'material',
value: 'plastic'
}
]
this would be indexed internally as:
{
trait: ['color', 'material' ],
value: ['green', 'plastic' ]
}
which means that you could only ever query for docs that have some trait named 'color' and some value equal to 'green', not necessarily on the same trait. There is no relationship between the trait and the value.
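A short Python sketch shows the false positive this flattening causes. The helpers `flatten`, `object_match` and `nested_match` are made up for illustration and only mimic the indexing behaviour described above.

```python
traits = [
    {"trait": "color", "value": "green"},
    {"trait": "material", "value": "plastic"},
]

def flatten(objects):
    # Mimics the internal layout: one list of values per field name.
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat

def object_match(objects, trait, value):
    # Both conditions are checked against the flattened lists separately.
    flat = flatten(objects)
    return trait in flat.get("trait", []) and value in flat.get("value", [])

def nested_match(objects, trait, value):
    # Both conditions must hold on the same trait/value pair.
    return any(o["trait"] == trait and o["value"] == value for o in objects)

print(flatten(traits))
print(object_match(traits, "color", "plastic"))  # True: false positive
print(nested_match(traits, "color", "plastic"))  # False
```

The product has no plastic color, yet the flattened form matches because 'color' and 'plastic' each appear somewhere in their own list.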
You have a few choices to solve this problem.
As single terms
The first you are already doing, and it is a good solution, i.e. storing the traits as single terms like:
['color#green','material#plastic']
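For reference, the encode/regroup round trip for such single terms can be sketched in Python as follows. It mirrors what the Ruby code in the question does with partition('#'); the `facet_terms` list and its counts are invented for the example.

```python
def encode(trait, value):
    # Join name and value into one term, e.g. 'color#green'.
    return f"{trait}#{value}"

def group_facets(terms):
    # Split each facet term back into name and value, grouped by name.
    grouped = {}
    for entry in terms:
        name, _, value = entry["term"].partition("#")
        grouped.setdefault(name, []).append(
            {"term": value, "count": entry["count"]}
        )
    return grouped

# Made-up facet results, shaped like the 'terms' list of a terms facet.
facet_terms = [
    {"term": encode("color", "green"), "count": 3},
    {"term": encode("color", "blue"), "count": 2},
    {"term": encode("material", "plastic"), "count": 4},
]

grouped = group_facets(facet_terms)
print(grouped["color"])     # green and blue with their counts
print(grouped["material"])  # plastic with its count
```

This only works as long as trait names never contain '#', since the split happens on the first occurrence.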
As objects
An alternative (assuming you have a limited number of trait names) would be to store them as:
{
traits: {
color: 'green',
material: 'plastic'
}
}
Then you could run queries against traits.color or traits.material.
As nested
If you want to keep your array structure, then you can use the nested type eg:
{
"mappings" : {
"product" : {
"properties" : {
... other fields ...
"traits" : {
"type" : "nested",
"properties" : {
"trait" : {
"index" : "not_analyzed",
"type" : "string"
},
"value" : {
"index" : "not_analyzed",
"type" : "string"
}
}
}
}
}
}
}
Each trait/value pair would be indexed internally as a separate (but related) document, meaning that there would be a relationship between the trait and its value. You'd need to use nested queries or nested filters to query them, eg:
curl -XGET 'http://127.0.0.1:9200/test/product/_search?pretty=1' -d '
{
"query" : {
"filtered" : {
"query" : {
"text" : {
"name" : "my query terms"
}
},
"filter" : {
"nested" : {
"path" : "traits",
"filter" : {
"and" : [
{
"term" : {
"trait" : "color"
}
},
{
"term" : {
"value" : "green"
}
}
]
}
}
}
}
}
}
'
Combining facets, filtering and nested docs
You state that, when a user filters on eg color == green you want to show results only where color == green, but you still want to show the counts for all colors.
To do that, you need to use the filter param to the search API rather than a filtered query. A filtered query filters out the results BEFORE calculating the facets. The filter param is applied to query results AFTER calculating facets.
Here's an example where the final query results are limited to docs where color == green but the facets are calculated for all colors:
curl -XGET 'http://127.0.0.1:9200/test/product/_search?pretty=1' -d '
{
"query" : {
"text" : {
"name" : "my query terms"
}
},
"filter" : {
"nested" : {
"path" : "traits",
"filter" : {
"and" : [
{
"term" : {
"trait" : "color"
}
},
{
"term" : {
"value" : "green"
}
}
]
}
}
},
"facets" : {
"color" : {
"nested" : "traits",
"terms" : { "field" : "value" },
"facet_filter" : {
"term" : {
"trait" : "color"
}
}
}
}
}
'
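The ordering difference between a filtered query and the top-level filter can be illustrated with a small in-memory Python sketch. Nothing here talks to Elasticsearch: the `products` list is invented, and `run_query`, `color_filter` and `color_facet` are hypothetical stand-ins for the query, the filter and the facet.

```python
from collections import Counter

# Invented sample data, flattened to one color per product for brevity.
products = [
    {"name": "glitter pen", "color": "green"},
    {"name": "glitter pen", "color": "blue"},
    {"name": "glitter glue", "color": "blue"},
    {"name": "plain pen", "color": "green"},
]

def run_query(docs, word):
    return [d for d in docs if word in d["name"].split()]

def color_filter(docs, color):
    return [d for d in docs if d["color"] == color]

def color_facet(docs):
    return dict(Counter(d["color"] for d in docs))

matched = run_query(products, "glitter")

# Filtered query: docs are filtered BEFORE the facet is calculated.
filtered_facet = color_facet(color_filter(matched, "green"))

# Top-level filter: the facet sees all query results, and the filter
# only narrows the hits that are returned.
post_facet = color_facet(matched)
post_hits = color_filter(matched, "green")

print(filtered_facet)  # {'green': 1}
print(post_facet)      # {'green': 1, 'blue': 2}
print(len(post_hits))  # 1
```

In the second case the user still sees the blue count next to the selected green filter, which is the behaviour asked for in the question.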
