How to map / query this data with ElasticSearch? - ruby

I'm using ElasticSearch along with the tire gem to power the search
functionality of my site. I'm having trouble figuring out how to map and
query the data to get the results I need.
Relevant code is below. I will explain the desired outbut below that as
well.
# models/product.rb
class Product < ActiveRecord::Base
include Tire::Model::Search
include Tire::Model::Callbacks
has_many :categorizations
has_many :categories, :through => :categorizations
has_many :product_traits
has_many :traits, :through => :product_traits
mapping do
indexes :id, type: 'integer'
indexes :name, boost: 10
indexes :description, analyzer: 'snowball'
indexes :categories do
indexes :id, type: 'integer'
indexes :name, type: 'string', index: 'not_analyzed'
end
indexes :product_traits, type: 'string', index: 'not_analyzed'
end
def self.search(params={})
out = tire.search(page: params[:page], per_page: 12, load: true) do
query do
boolean do
must { string params[:query], default_operator: "OR" } if params[:query].present?
must { term 'categories.id', params[:category_id] } if params[:category_id].present?
# if we aren't browsing a category, search results are "drill-down"
unless params[:category_id].present?
must { term 'categories.name', params[:categories] } if params[:categories].present?
end
params.select { |p| p[0,2] == 't_' }.each do |name,value|
must { term :product_traits, "#{name[2..-1]}##{value}" }
end
end
end
# don't show the category facets if we are browsing a category
facet("categories") { terms 'categories.name', size: 20 } unless params[:category_id].present?
facet("traits") {
terms :product_traits, size: 1000 #, all_terms: true
}
# raise to_curl
end
# process the trait facet results into a hash of arrays
if out.facets['traits']
facets = {}
out.facets['traits']['terms'].each do |f|
split = f['term'].partition('#')
facets[split[0]] ||= []
facets[split[0]] << { 'term' => split[2], 'count' => f['count'] }
end
out.facets['traits']['terms'] = facets
end
out
end
def to_indexed_json
{
id: id,
name: name,
description: description,
categories: categories.all(:select => 'categories.id, categories.name, categories.keywords'),
product_traits: product_traits.includes(:trait).collect { |t| "#{t.trait.name}##{t.value}" }
}.to_json
end
end
As you can see above, I'm doing some pre/post processing of the data
to/from elasticsearch in order to get what i want from the
'product_traits' field. This is what doesn't feel right and where my
questions originate.
I have a large catalog of products, each with a handful of 'traits' such
as color, material and brand. Since these traits are so varied, I
modeled the data to include a Trait model which relates to the Product
model via a ProductTrait model, which holds the value of the trait for
the given product.
First question is: How can i create the elasticsearch mapping to index
these traits properly? I assume that this involves a nested type but I
can't make enough sense of the docs to figure it out.
Second question: I want the facets to come back in groups (in the
manner that I am processing them at the end of the search method
above) but with counts that reflect how many matches there are without
taking into account the currently selected value for each trait. For
example: If the user searches for 'Glitter' and then clicks the link
corresponding to the 'Blue Color' facet, I want all the 'Color' facets
to remain visible and show counts correspinding the query results
without the 'Blue Color' filter. I hope that is a good explanation,
sorry if it needs more clarification.

If you index your traits as:
[
{
trait: 'color',
value: 'green'
},
{
trait: 'material',
value: 'plastic'
}
]
this would be indexed internally as:
{
trait: ['color', 'material' ],
value: ['green', 'plastic' ]
}
which means that you could only ever query for docs which have a trait with value 'color' and a value with value green. There is no relationship between the trait and the value.
You have a few choices to solve this problem.
As single terms
The first you are already doing, and it is a good solution, ie storing the traits as single terms like:
['color#green`','material#plastic']
As objects
An alternative (assuming you have a limited number of trait names) would be to store them as:
{
traits: {
color: 'green',
material: 'plastic'
}
}
Then you could run queries against traits.color or traits.material.
As nested
If you want to keep your array structure, then you can use the nested type eg:
{
"mappings" : {
"product" : {
"properties" : {
... other fields ...
"traits" : {
"type" : "nested",
"properties" : {
"trait" : {
"index" : "not_analyzed",
"type" : "string"
},
"value" : {
"index" : "not_analyzed",
"type" : "string"
}
}
}
}
}
}
}
Each trait/value pair would be indexed internally as a separate (but related) document, meaning that there would be a relationship between the trait and its value. You'd need to use nested queries or nested filters to query them, eg:
curl -XGET 'http://127.0.0.1:9200/test/product/_search?pretty=1' -d '
{
"query" : {
"filtered" : {
"query" : {
"text" : {
"name" : "my query terms"
}
},
"filter" : {
"nested" : {
"path" : "traits",
"filter" : {
"and" : [
{
"term" : {
"trait" : "color"
}
},
{
"term" : {
"value" : "green"
}
}
]
}
}
}
}
}
}
'
Combining facets, filtering and nested docs
You state that, when a user filters on eg color == green you want to show results only where color == green, but you still want to show the counts for all colors.
To do that, you need to use the filter param to the search API rather than a filtered query. A filtered query filters out the results BEFORE calculating the facets. The filter param is applied to query results AFTER calculating facets.
Here's an example where the final query results are limited to docs where color == green but the facets are calculated for all colors:
curl -XGET 'http://127.0.0.1:9200/test/product/_search?pretty=1' -d '
{
"query" : {
"text" : {
"name" : "my query terms"
}
},
"filter" : {
"nested" : {
"path" : "traits",
"filter" : {
"and" : [
{
"term" : {
"trait" : "color"
}
},
{
"term" : {
"value" : "green"
}
}
]
}
}
},
"facets" : {
"color" : {
"nested" : "traits",
"terms" : { "field" : "value" },
"facet_filter" : {
"term" : {
"trait" : "color"
}
}
}
}
}
'

Related

Run a subquery for each of the filtered elasticsearch documents

I have an index named employees with the following structure:
{
id: integer,
name: text,
age: integer,
cityId: integer,
resumeText: text <--------- parsed resume text
}
I want to search employees with certain criteria e.g having age > 40, resumeText contains a specific skill or employee belongs to a certain city etc, and have the following query for so far requirement:
{
query:{
bool:{
should:[
{
term:{
cityId:2990
},
{
match:{
resumeText:"marketing"
},
{
match:{
resumeText:"critical thinking"
}}}
],
filter:{
range:{
age:{
gte:40
}}}}}
}
This gives me expected results but i want to know also among the returned documents/employees which are the ones whose resumeText contains the mentioned skills. e.g in the response, I want to get documents having mentioned that this document had matched "critical thinking" , this employee had matched both the skills and this employee didn't match any skills (as it was returned based on other filters)
What changes do i need to do to get the desired results:
can aggregation help?
can we rum a script for EACH filtered document to compute desired result (sub query for each document)?
any other approach?
Yes, You can use aggregation.
Refer this
You can bucket like how many resumes are matching each skill you are looking for.
GET employees/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"marketing_resume_count" : { "match" : { "resumeText" : "marketing" }},
"thinking_resume_count" : { "match" : { "resumeText" : "thinking" }}
}
}
}
}
}
To extend to your use case:
You can add query section to the query as below
GET employees/_search
{
"size": 0,
"query":{
"match":{
"region":"AM"
}
},
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"marketing_resume_count" : { "match" : { "resumeText" : "marketing" }},
"thinking_resume_count" : { "match" : { "resumeText" : "thinking" }}
}
}
}
}
}
You can use range query to handle gte and let conditions. You can refer this for range query example. This can be used in place of query section.

Search for empty/present arrays in elasticsearch

I'm currently using the elasticsearch 6.5.4 and I'm trying to query for all docs in an index with an empty array on a specific field. I found the the elasticsearch has a exists dsl who is supposed to cover the empty array case.
The problem is: whem I query for a must exists no doc is returned and when I query for must not exists all documents are returned.
Since I can't share the actual mapping for legal reasons, this is the closest I can give you:
{
"foo_production" : {
"mappings" : {
"foo" : {
"properties" : {
"bar" : {
"type" : "text",
"index" : false
}
}
}
}
}
}
And the query I am performing is:
GET foo_production/_search
{
"query": {
"bool": {
"must": {
"exists": {
"field": "bar"
}
}
}
}
}
Can you guys tell me where the problem is?
Note: Upgrading the elasticsearch version is not a viable solution
Enable indexing for the field bar by setting "index" : true
The index option controls whether field values are indexed. It accepts true or false and defaults to true. Fields that are not indexed are not queryable.
Source : https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html

Multi-field Search with Array Using 'AND' Operator in elasticsearch

I want to query the values of a multi-value field as separate 'fields' in the same way I'm querying the other fields.
I have a data structure like so:
{
name: 'foo one',
alternate_name: 'bar two',
lay_name: 'baz three',
tags: ['stuff like', 'this that']
}
My query looks like this:
{
query:
query: stuff
type: 'best_fields',
fields: ['name', 'alternate_name', 'lay_name', 'tags'],
operator: 'and'
}
The 'type' and 'operator' work perfectly for the single value fields in only matching when the value contains my entire query. For example, querying 'foo two' doesn't return a match.
I'd like the tags field to behave the same way. Right now, querying 'stuff that' will return a match when it shouldn't because no fields or tag values contain both words in a single value. Is there a way to achieve this?
EDIT
Val's assessment was spot on. I've updated my mapping to the following (using elasticsearch-rails/elasticsearch-model):
mapping dynamic: false, include_in_all: true do
... other fields ...
indexes :tags, type: 'nested' do
indexes :tag, type: 'string', include_in_parent: true
end
end
Please show your mapping type, but I suspect your tags field is a simple string field like this:
{
"your_type" : {
"properties" : {
"tags" : {
"type" : "string"
}
}
}
}
In this case ES will "flatten" all your tags under the hood in the tags field at indexing time like this:
tags: "stuff", "like", "this", "that"
i.e. this is why you get results when querying "stuff that", because the tags field contains both words.
The way forward would be to make tags a nested object type, like this
{
"your_type" : {
"properties" : {
"tags" : {
"type" : "nested",
"properties": {
"tag" : {"type": "string" }
}
}
}
}
}
You'll need to reindex your data but at least querying for tags: "stuff that" will not return anything anymore. Your tag tokens will be "kept together" as you expect. Give it a try.

Join/Merge Elasticsearch results

We have a documents with a (simplified) structure as shown here in Elasticsearch:
{ _id: ..., patientId: 4711, text: "blue" }
{ _id: ..., patientId: 4711, text: "red" }
{ _id: ..., patientId: 4712, text: "blue" }
{ _id: ..., patientId: 4712, text: "green" }
{ ... }
How can I create a query to find all documents containing the text
blue and red within the SAME patient.
In the above example I would expect a result set of two documents with patientId 4711 (contains blue and red).
Potential solution strategies might be :
Run two queries and "join" results afterward by application logic.
Run separate queries based on prior list of patients. Only feasible if number of potential patients are small.
Are there better ways (ideal one query) to handle this use case?
How about changing the way you store data into elastisearch.
Just store one document for one patient id, and keep text as array of all distinct colors assigned to that patient.
You can simply use bool query or bool filter
Example using bool filter
{
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"bool" : {
"Must" : [
{
"term" : { "text" : "blue" }
},
{
"term" : { "text" : "red" }
}
]
}
}
}
}
Edit: misread the requirement:
You should be using field collapsing

Query Mongo Embedded Documents with a size

I have a ruby on rails app using Mongoid and MongoDB v2.4.6.
I have the following MongoDB structure, a record which embeds_many fragments:
{
"_id" : "76561198045636214",
"fragments" : [
{
"id" : 76561198045636215,
"source_id" : "source1"
},
{
"id" : 76561198045636216,
"source_id" : "source2"
},
{
"id" : 76561198045636217,
"source_id" : "source2"
}
]
}
I am trying to find all records in the database that contain fragments with duplicate source_ids.
I'm pretty sure I need to use $elemMatch as I need to query embedded documents.
I have tried
Record.elem_match(fragments: {source_id: 'source2'})
which works but doesn't restrict to duplicates.
I then tried
Record.elem_match(fragments: {source_id: 'source2', :source_id.with_size => 2})
which returns no results (but is a valid query). The query Mongoid produces is:
selector: {"fragments"=>{"$elemMatch"=>{:source_id=>"source2", "source_id"=>{"$size"=>2}}}}
Once that works I need to update it to $size is >1.
Is this possible? It feels like I'm very close. This is a one-off cleanup operation so query performance isn't too much of an issue (however we do have millions of records to update!)
Any help is much appreciated!
I have been able to achieve desired outcome but in testing it's far too slow (will take many weeks to run across our production system). The problem is double query per record (we have ~30 million records in production).
Record.where('fragments.source_id' => 'source2').each do |record|
query = record.fragments.where(source_id: 'source2')
if query.count > 1
# contains duplicates, delete all but latest
query.desc(:updated_at).skip(1).delete_all
end
# needed to trigger after_save filters
record.save!
end
The problem with the current approach in here is that the standard MongoDB query forms do not actually "filter" the nested array documents in any way. This is essentially what you need in order to "find the duplicates" within your documents here.
For this, MongoDB provides the aggregation framework as probably the best approach to finding this. There is no direct "mongoid" style approach to the queries as those are geared towards the existing "rails" style of dealing with relational documents.
You can access the "moped" form though through the .collection accessor on your class model:
Record.collection.aggregate([
# Find arrays two elements or more as possibles
{ "$match" => {
"$and" => [
{ "fragments" => { "$not" => { "$size" => 0 } } },
{ "fragments" => { "$not" => { "$size" => 1 } } }
]
}},
# Unwind the arrays to "de-normalize" as documents
{ "$unwind" => "$fragments" },
# Group back and get counts of the "key" values
{ "$group" => {
"_id" => { "_id" => "$_id", "source_id" => "$fragments.source_id" },
"fragments" => { "$push" => "$fragments.id" },
"count" => { "$sum" => 1 }
}},
# Match the keys found more than once
{ "$match" => { "count" => { "$gte" => 2 } } }
])
That would return you results like this:
{
"_id" : { "_id": "76561198045636214", "source_id": "source2" },
"fragments": ["76561198045636216","76561198045636217"],
"count": 2
}
That at least gives you something to work with on how to deal with the "duplicates" here

Resources