Join/Merge Elasticsearch results - elasticsearch

We have documents in Elasticsearch with the following (simplified) structure:
{ _id: ..., patientId: 4711, text: "blue" }
{ _id: ..., patientId: 4711, text: "red" }
{ _id: ..., patientId: 4712, text: "blue" }
{ _id: ..., patientId: 4712, text: "green" }
{ ... }
How can I write a query that finds all documents containing the texts
blue and red for the SAME patient?
In the above example I would expect a result set of two documents with patientId 4711 (that patient has both blue and red).
Potential solution strategies might be:
Run two queries and "join" the results afterward in application logic.
Run separate queries based on a prior list of patients. This is only feasible if the number of potential patients is small.
Are there better ways (ideally a single query) to handle this use case?
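For reference, the first strategy (two queries joined in application logic) could look roughly like the sketch below. It assumes the hits of each single-term query have already been fetched as lists of dicts; the function name `patients_with_all` and the sample hits are made up for illustration.

```python
def patients_with_all(hits_per_term):
    """Intersect the patient IDs found by each single-term query."""
    id_sets = [{hit["patientId"] for hit in hits} for hits in hits_per_term]
    return set.intersection(*id_sets) if id_sets else set()


# Hypothetical results of a "blue" query and a "red" query.
blue_hits = [
    {"patientId": 4711, "text": "blue"},
    {"patientId": 4712, "text": "blue"},
]
red_hits = [{"patientId": 4711, "text": "red"}]

both = patients_with_all([blue_hits, red_hits])
# Keep only the documents that belong to patients present in every result set.
matching_docs = [h for h in blue_hits + red_hits if h["patientId"] in both]
print(both, matching_docs)
```

This keeps the index unchanged, but every additional term costs another round trip and another result set held in memory.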

How about changing the way you store the data in Elasticsearch?
Store just one document per patient ID, and keep text as an array of all distinct colors assigned to that patient.
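A minimal sketch of that restructuring, done in application code before indexing. The helper name `merge_by_patient` is hypothetical, and the input is the sample data from the question.

```python
from collections import defaultdict


def merge_by_patient(docs):
    """Collapse per-color documents into one document per patient."""
    colors = defaultdict(set)
    for doc in docs:
        colors[doc["patientId"]].add(doc["text"])
    # Sort patients and colors so the output is deterministic.
    return [
        {"patientId": pid, "text": sorted(texts)}
        for pid, texts in sorted(colors.items())
    ]


docs = [
    {"patientId": 4711, "text": "blue"},
    {"patientId": 4711, "text": "red"},
    {"patientId": 4712, "text": "blue"},
    {"patientId": 4712, "text": "green"},
]
merged = merge_by_patient(docs)
print(merged)
```

With one document per patient, a query that requires both blue and red in text matches whole patients rather than individual rows.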

You can simply use a bool query or a bool filter.
Example using a bool filter:
{
  "filtered" : {
    "query" : {
      "match_all" : { }
    },
    "filter" : {
      "bool" : {
        "must" : [
          { "term" : { "text" : "blue" } },
          { "term" : { "text" : "red" } }
        ]
      }
    }
  }
}
Edit: I misread the requirement.
You should be using field collapsing.

Related

Run a subquery for each of the filtered elasticsearch documents

I have an index named employees with the following structure:
{
id: integer,
name: text,
age: integer,
cityId: integer,
resumeText: text <--------- parsed resume text
}
I want to search employees by certain criteria, e.g. age > 40, resumeText contains a specific skill, or the employee belongs to a certain city, and I have the following query for the requirements so far:
{
  "query": {
    "bool": {
      "should": [
        { "term": { "cityId": 2990 } },
        { "match": { "resumeText": "marketing" } },
        { "match": { "resumeText": "critical thinking" } }
      ],
      "filter": {
        "range": {
          "age": { "gte": 40 }
        }
      }
    }
  }
}
This gives me the expected results, but I also want to know which of the returned documents/employees have a resumeText that contains the mentioned skills. For example, the response should tell me that this document matched "critical thinking", this employee matched both skills, and this employee didn't match any skill (it was returned because of the other filters).
What changes do I need to make to get the desired results?
Can aggregation help?
Can we run a script for EACH filtered document to compute the desired result (a subquery per document)?
Any other approach?
Yes, you can use an aggregation.
A filters aggregation lets you bucket how many resumes match each skill you are looking for:
GET employees/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"marketing_resume_count" : { "match" : { "resumeText" : "marketing" }},
"thinking_resume_count" : { "match" : { "resumeText" : "thinking" }}
}
}
}
}
}
To extend this to your use case, you can add a query section to the request as below:
GET employees/_search
{
"size": 0,
"query":{
"match":{
"region":"AM"
}
},
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"marketing_resume_count" : { "match" : { "resumeText" : "marketing" }},
"thinking_resume_count" : { "match" : { "resumeText" : "thinking" }}
}
}
}
}
}
You can use a range query to handle gte and lte conditions. It can be used in place of the query section above.
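As a rough sketch of how the two pieces fit together, the request body can be assembled in code with the range query in the query section and one filters bucket per skill. The function name `build_skill_count_request` and the bucket-naming scheme are assumptions for illustration; only the body structure mirrors the requests above.

```python
import json


def build_skill_count_request(skills, min_age):
    """Build a search body: a range query plus one filters bucket per skill."""
    return {
        "size": 0,
        "query": {"range": {"age": {"gte": min_age}}},
        "aggs": {
            "messages": {
                "filters": {
                    "filters": {
                        # Bucket name derived from the skill (illustrative scheme).
                        f"{skill.replace(' ', '_')}_resume_count": {
                            "match": {"resumeText": skill}
                        }
                        for skill in skills
                    }
                }
            }
        },
    }


body = build_skill_count_request(["marketing", "critical thinking"], min_age=40)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of GET employees/_search.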

How to apply aggregations on grouped fields in Elasticsearch?

On my eCommerce store I want to only include the first item in each group (grouped by item_id) in the final results. At the same time I don't want to lose my aggregations (little numbers next to attributes that indicate how many items with that attribute are found).
Here is a little example:
Suppose I make a search for items and only 25 show up. This is the result for the color aggregation that I currently get:
black (65)
green (32)
white (13)
And I want it to be:
black (14)
green (6)
white (5)
The numbers should amount to the total number the user actually sees on the page.
How could I achieve that with Elasticsearch? I have tried both Grouping (Top Hits) and Field Collapsing and both don't seem to fit my use case. Solr does it almost by default with its Grouping functionality.
It should be rather easy. When you ask for an aggregation, you simply send a request to the _search endpoint. Example:
POST /exams/_search
{
"aggs" : {
"avg_grade" : { "avg" : { "field" : "grade" } }
}
}
In the above example you get the aggregation over all documents.
If you want the aggregation over specific documents only, you just need to add a query to the request body, like:
POST /exams/_search
{
"query": {
"bool" : {
"must" : {
"query_string" : {
"query" : "some query string here"
}
},
"filter" : {
"term" : { "user" : "kimchy" }
}
}
},
"aggs" : {
"avg_grade" : { "avg" : { "field" : "grade" } }
}
}
You can send the size and from parameters as well.
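The scoping rule can be illustrated without a cluster: the aggregation only sees the documents that the query selects. The sketch below is a plain-Python stand-in, not Elasticsearch code, and the sample exams are invented.

```python
from statistics import mean

# Invented sample documents standing in for the exams index.
exams = [
    {"user": "kimchy", "grade": 90},
    {"user": "kimchy", "grade": 70},
    {"user": "other", "grade": 50},
]


def avg_grade(docs, user=None):
    """Average grade over the docs that pass the optional user filter."""
    selected = [d for d in docs if user is None or d["user"] == user]
    return mean(d["grade"] for d in selected)


all_avg = avg_grade(exams)                    # no query: every document counts
kimchy_avg = avg_grade(exams, user="kimchy")  # with filter: only matching docs
print(all_avg, kimchy_avg)
```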

Springdata mongodb aggregation match

After asking a question to understand the aggregation framework in MongoDB a bit better, I finally found a way to do the aggregation I need (thanks to a StackExchange user).
So basically here is a document from my collection:
{
"_id" : ObjectId("s4dcsd5s4d6c54s6d"),
"items" : [
{
type : "TYPE_1",
text : "blablabla"
},
{
type : "TYPE_2",
text : "blablabla"
},
{
type : "TYPE_3",
text : "blablabla"
},
{
type : "TYPE_1",
text : "blablabla"
},
{
type : "TYPE_2",
text : "blablabla"
},
{
type : "TYPE_1",
text : "blablabla"
}
]
}
The idea was to be able to filter only some elements of my collections (avoiding Type 2 and 3). In fact I have more than 30 types and 6 are not allowed but for simplicity I made this example.
So the aggregation command in command line is this one:
db.history.aggregate([{
$match: {
_id: ObjectId("s4dcsd5s4d6c54s6d")
}
}, {
$unwind: '$items'
}, {
$match: {
'items.type': { '$nin': [ "TYPE_2" , "TYPE_3"] }
}
},
{ $limit: 10 }
]);
With this I am able to retrieve 10 items of this document that are neither TYPE_2 nor TYPE_3.
However, when I use Spring Data there is no output. I looked at the examples to build mine, but it's still not working.
So I did:
Aggregation aggregation = newAggregation(
match(Criteria.where("id").is(myID)),
unwind("items"),
match(Criteria.where("items.type").nin(ignoreditemstype)),
limit(3),
skip(offsetLong)
);
AggregationResults<PersonnalHistory> results = mongAccess.getOperation().aggregate(query,
"items", PersonnalHistory.class);
PersonnalHistory is annotated with @Document(collection = "history") and its id field with the @Id annotation.
ignoreditemstype is a list containing TYPE_2 and TYPE_3.
Here is what the toString method of the aggregation returns:
{
"aggregate" : "__collection__" ,
"pipeline" : [
{ "$match": { "id" : "s4dcsd5s4d6c54s6d"} },
{ "$unwind": "$items"},
{ "$match": { "items.type": { "$nin" : [ "TYPE_2" , "TYPE_3" ] } } },
{ "$limit" : 3},
{ "$skip" : 0 }
]
}
I tried a lot of stuff (to have at least an answer :) ) like removing id or the nin:
aggregation = newAggregation(
unwind("items"),
match(Criteria.where("items.type").nin(ignoreditemstype)),
limit(3),
skip(offsetLong)
);
aggregation = newAggregation(
match(Criteria.where("id").is(myid)),
unwind("items")
);
For information when I do a simple query like:
query.addCriteria(Criteria.where("id").is(myID));
My document is returned. However, I have thousands of items, so I just want the first 15 (in fact the first 15 are the last 15 added).
Do you see what I am doing wrong?
It looks like you are passing a plain String where an ObjectId is expected:
Aggregation aggregation = newAggregation(
match(Criteria.where("_id").is(new ObjectId(myID))),
unwind("items"),
match(Criteria.where("items.type").nin(ignoreditemstype)),
limit(3),
skip(offsetLong)
);
Now the question is why it works with a simple query. My guess is that the Spring Data driver is not that mature, at least not for the aggregation pipeline.

How to use lucene SpanQuery in ElasticSearch

For my project, I am thinking of using Elasticsearch span near queries, with the constraint that certain tokens may have to be searched with fuzziness. I was able to generate a set of SpanQuery (org.apache.lucene.search.spans.SpanQuery) objects, some with fuzziness enabled and some without. I couldn't figure out how to use this set of SpanQuery objects in an Elasticsearch spanNearQuery.
Can someone point me to the right samples or docs? And is there any way to construct an ES SpanNearQueryBuilder with some clauses fuzzy-enabled?
You can wrap a fuzzy query in a span query with the Span Multi Term Query:
{
  "span_near" : {
    "clauses" : [
      { "span_term" : { "field" : "value1" } },
      { "span_multi" : {
          "match" : {
            "fuzzy" : { "field" : { "value" : "value2" } }
          }
        }
      }
    ],
    ...
  }
}

How to map / query this data with ElasticSearch?

I'm using ElasticSearch along with the tire gem to power the search
functionality of my site. I'm having trouble figuring out how to map and
query the data to get the results I need.
Relevant code is below. I will explain the desired output below that as
well.
# models/product.rb
class Product < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks
  has_many :categorizations
  has_many :categories, :through => :categorizations
  has_many :product_traits
  has_many :traits, :through => :product_traits
  mapping do
    indexes :id, type: 'integer'
    indexes :name, boost: 10
    indexes :description, analyzer: 'snowball'
    indexes :categories do
      indexes :id, type: 'integer'
      indexes :name, type: 'string', index: 'not_analyzed'
    end
    indexes :product_traits, type: 'string', index: 'not_analyzed'
  end
  def self.search(params={})
    out = tire.search(page: params[:page], per_page: 12, load: true) do
      query do
        boolean do
          must { string params[:query], default_operator: "OR" } if params[:query].present?
          must { term 'categories.id', params[:category_id] } if params[:category_id].present?
          # if we aren't browsing a category, search results are "drill-down"
          unless params[:category_id].present?
            must { term 'categories.name', params[:categories] } if params[:categories].present?
          end
          params.select { |p| p[0,2] == 't_' }.each do |name,value|
            must { term :product_traits, "#{name[2..-1]}##{value}" }
          end
        end
      end
      # don't show the category facets if we are browsing a category
      facet("categories") { terms 'categories.name', size: 20 } unless params[:category_id].present?
      facet("traits") {
        terms :product_traits, size: 1000 #, all_terms: true
      }
      # raise to_curl
    end
    # process the trait facet results into a hash of arrays
    if out.facets['traits']
      facets = {}
      out.facets['traits']['terms'].each do |f|
        split = f['term'].partition('#')
        facets[split[0]] ||= []
        facets[split[0]] << { 'term' => split[2], 'count' => f['count'] }
      end
      out.facets['traits']['terms'] = facets
    end
    out
  end
  def to_indexed_json
    {
      id: id,
      name: name,
      description: description,
      categories: categories.all(:select => 'categories.id, categories.name, categories.keywords'),
      product_traits: product_traits.includes(:trait).collect { |t| "#{t.trait.name}##{t.value}" }
    }.to_json
  end
end
As you can see above, I'm doing some pre/post processing of the data
to/from Elasticsearch in order to get what I want from the
'product_traits' field. This is what doesn't feel right and where my
questions originate.
I have a large catalog of products, each with a handful of 'traits' such
as color, material and brand. Since these traits are so varied, I
modeled the data to include a Trait model which relates to the Product
model via a ProductTrait model, which holds the value of the trait for
the given product.
First question is: How can I create the Elasticsearch mapping to index
these traits properly? I assume that this involves a nested type but I
can't make enough sense of the docs to figure it out.
Second question: I want the facets to come back in groups (in the
manner that I am processing them at the end of the search method
above) but with counts that reflect how many matches there are without
taking into account the currently selected value for each trait. For
example: If the user searches for 'Glitter' and then clicks the link
corresponding to the 'Blue Color' facet, I want all the 'Color' facets
to remain visible and show counts corresponding to the query results
without the 'Blue Color' filter. I hope that is a good explanation,
sorry if it needs more clarification.
If you index your traits as:
[
{
trait: 'color',
value: 'green'
},
{
trait: 'material',
value: 'plastic'
}
]
this would be indexed internally as:
{
trait: ['color', 'material' ],
value: ['green', 'plastic' ]
}
which means that you can only query for docs that have some trait named 'color' and some value equal to 'green'. The relationship between a particular trait and its own value is lost.
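The effect can be mimicked in a few lines of Python. This is only a toy model of the flattening described above, not how Elasticsearch is implemented, and the helper names `flatten` and `matches` are made up.

```python
def flatten(traits):
    """Mimic how an array of plain objects collapses into per-field value lists."""
    return {
        "trait": [t["trait"] for t in traits],
        "value": [t["value"] for t in traits],
    }


def matches(flat_doc, trait, value):
    """Match each field independently, with no pairing between trait and value."""
    return trait in flat_doc["trait"] and value in flat_doc["value"]


doc = flatten([
    {"trait": "color", "value": "green"},
    {"trait": "material", "value": "plastic"},
])

# The product has no trait color=plastic, yet the flattened doc still matches.
false_positive = matches(doc, "color", "plastic")
print(doc, false_positive)
```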
You have a few choices to solve this problem.
As single terms
The first is what you are already doing, and it is a good solution, i.e. storing the traits as single terms like:
['color#green','material#plastic']
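A small sketch of the round trip for this encoding: joining name and value into one term at index time, and splitting the facet terms back into groups afterward (the same idea as the partition('#') post-processing in the question). The function names and the facet counts are invented for illustration.

```python
def encode_traits(traits):
    """Join each trait name and value into one 'name#value' term."""
    return [f"{name}#{value}" for name, value in traits.items()]


def group_facets(term_counts):
    """Split 'name#value' facet terms back into a dict keyed by trait name."""
    grouped = {}
    for term, count in term_counts:
        name, _, value = term.partition("#")
        grouped.setdefault(name, []).append({"term": value, "count": count})
    return grouped


terms = encode_traits({"color": "green", "material": "plastic"})
# Invented facet counts, as they might come back from a terms facet.
facets = group_facets([
    ("color#green", 14),
    ("color#blue", 6),
    ("material#plastic", 5),
])
print(terms, facets)
```

This assumes '#' never occurs in a trait name; a value containing '#' still survives, because partition splits on the first separator only.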
As objects
An alternative (assuming you have a limited number of trait names) would be to store them as:
{
traits: {
color: 'green',
material: 'plastic'
}
}
Then you could run queries against traits.color or traits.material.
As nested
If you want to keep your array structure, then you can use the nested type eg:
{
"mappings" : {
"product" : {
"properties" : {
... other fields ...
"traits" : {
"type" : "nested",
"properties" : {
"trait" : {
"index" : "not_analyzed",
"type" : "string"
},
"value" : {
"index" : "not_analyzed",
"type" : "string"
}
}
}
}
}
}
}
Each trait/value pair would be indexed internally as a separate (but related) document, meaning that there would be a relationship between the trait and its value. You'd need to use nested queries or nested filters to query them, eg:
curl -XGET 'http://127.0.0.1:9200/test/product/_search?pretty=1' -d '
{
"query" : {
"filtered" : {
"query" : {
"text" : {
"name" : "my query terms"
}
},
"filter" : {
"nested" : {
"path" : "traits",
"filter" : {
"and" : [
{
"term" : {
"trait" : "color"
}
},
{
"term" : {
"value" : "green"
}
}
]
}
}
}
}
}
}
'
Combining facets, filtering and nested docs
You state that, when a user filters on eg color == green you want to show results only where color == green, but you still want to show the counts for all colors.
To do that, you need to use the filter param to the search API rather than a filtered query. A filtered query filters out the results BEFORE calculating the facets. The filter param is applied to query results AFTER calculating facets.
Here's an example where the final query results are limited to docs where color == green but the facets are calculated for all colors:
curl -XGET 'http://127.0.0.1:9200/test/product/_search?pretty=1' -d '
{
"query" : {
"text" : {
"name" : "my query terms"
}
},
"filter" : {
"nested" : {
"path" : "traits",
"filter" : {
"and" : [
{
"term" : {
"trait" : "color"
}
},
{
"term" : {
"value" : "green"
}
}
]
}
}
},
"facets" : {
"color" : {
"nested" : "traits",
"terms" : { "field" : "value" },
"facet_filter" : {
"term" : {
"trait" : "color"
}
}
}
}
}
'
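The ordering described above (facets first, filter afterward) can be pictured with a toy model in Python. It is not Elasticsearch code; the product list and the `search` helper are invented for illustration, and a simple substring test stands in for the text query.

```python
from collections import Counter

# Invented sample catalog.
products = [
    {"name": "glitter pen", "color": "green"},
    {"name": "glitter glue", "color": "blue"},
    {"name": "glitter card", "color": "green"},
    {"name": "plain card", "color": "blue"},
]


def search(docs, query_term, color=None):
    """Run the query, count color facets, then apply the color filter to hits only."""
    matched = [d for d in docs if query_term in d["name"]]
    facets = Counter(d["color"] for d in matched)  # counted before filtering
    hits = [d for d in matched if color is None or d["color"] == color]
    return hits, dict(facets)


hits, facets = search(products, "glitter", color="green")
print([h["name"] for h in hits], facets)
```

The hits are limited to green products, while the facet counts still cover every color that matched the query.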
