Is it possible to "join" two indexes in elasticsearch using query - elasticsearch

I know that there is no option to join indexes in Elasticsearch, but I need to find a way to solve this problem:
I have 2 indexes, A and B.
IndexA has fields like field1, field2, field3.
IndexB has field4, field5, field6.
If I search by field5 (in this case "test"), I want to get all the relationships as a tree:
match every document from IndexA whose "field2" equals "field5" of a document from IndexB. For example:
IndexA documents:
5, "test", "test2",
10, "test", "test7"
11, "test10", "test11"
IndexB documents:
1, "test", (...)
2, "test", (...)
3, "test100", (...)
The example response:
For id 5 (from IndexA) I want an object with ids 1 and 2 from IndexB, like {id:5, responses: {1, 2}}
For id 10 (from IndexA) I want an object with ids 1 and 2 from IndexB, like {id:10, responses: {1, 2}}
For id 11 there is no match ("test10" != "test"), so {id:11, responses:{}}
Maybe there is a way to solve this? Eventually I need to do this for four indexes (but if it is possible between two, then I can do it for four as well).

I don't think it's possible in elasticsearch, just like you said. You shouldn't create indexes with such relations. It would be better to rethink your model and denormalize the data.
In order to solve this, you'll have to do the processing programmatically in your backend. Pseudocode:
// Get all objects from indexA
const allIndexA = indexA.getAll();
const result = [];
// For each object in indexA, select the corresponding objects in indexB
allIndexA.forEach((entryA) => {
  const entriesB = indexB.get({field5: entryA.field2});
  result.push({
    entryA,
    entriesB
  });
});
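The pseudocode above issues one lookup per IndexA entry. A minimal sketch of the same application-side join, assuming both result sets have already been fetched into memory as lists of dicts (the dict layout, the field placement of the sample values, and the helper name `join_on_field` are assumptions made for illustration, not part of any Elasticsearch client):

```python
from collections import defaultdict

# Sample documents copied from the question; the dict layout is an assumption.
index_a = [
    {"id": 5, "field2": "test", "field3": "test2"},
    {"id": 10, "field2": "test", "field3": "test7"},
    {"id": 11, "field2": "test10", "field3": "test11"},
]
index_b = [
    {"id": 1, "field5": "test"},
    {"id": 2, "field5": "test"},
    {"id": 3, "field5": "test100"},
]


def join_on_field(docs_a, docs_b, key_a="field2", key_b="field5"):
    """Pair every docs_a entry with the ids of docs_b entries that share its key value."""
    # Build the lookup once, so the join needs a single pass over each list.
    ids_by_value = defaultdict(list)
    for doc in docs_b:
        ids_by_value[doc[key_b]].append(doc["id"])
    return [
        {"id": doc["id"], "responses": ids_by_value.get(doc[key_a], [])}
        for doc in docs_a
    ]


result = join_on_field(index_a, index_b)
for row in result:
    print(row)
```

Grouping IndexB by the join value first avoids a round trip per IndexA entry, at the cost of holding both result sets in memory.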

I was trying the following:
GET /_msearch
{
"_index": [
"index1",
"index2",
"index3"
]
}
{
"query": {
"bool": {
"should": [
{
"match": {
"index3id": "1" // this field exists only in the 3rd index, so I get responses only from the 3rd index
}
}
]
}
},
"size": 100,
"aggregations": {
"firstLevel": {
"top_hits": {
"size": 100,
"_source": {
"includes": "index3id"
}
}
}
}
}
The aggregation part of the response:
"aggregations": {
"firstLevel": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "index3",
"_type": "someTypeNotRelevant",
"_id": "81",
"_score": 1,
"_source": {
"index3id": 1
}
},
{
"_index": "index3",
"_type": "someTypeNotRelevant",
"_id": "61",
"_score": 1,
"_source": {
"index3id": 1
}
}
]
}
}
}
Now I want to run a new query against index2 on some field, using the values that came back in _source (in this case, all the index3id values). I was thinking about a sub-aggregation under the "firstLevel": {} aggregation, but one that runs a new query against index2.
There are 2 problems:
1. How do I pass these index3id values?
2. After the first query I only have data from index3, because the query matches on index3id.
Anyway, thank you for the advice.
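One way to pass the values along is to do it in the application: read them out of the first response, then send a second request whose body uses a `terms` query. A minimal sketch, assuming the first response has the shape shown above and that documents in index2 carry a field with the same name (`index3id` in index2 is an assumption; both helper names are made up for illustration):

```python
def collect_source_values(top_hits_agg, field):
    """Return the distinct values of `field` found in a top_hits aggregation result."""
    values = []
    for hit in top_hits_agg["hits"]["hits"]:
        value = hit["_source"].get(field)
        if value is not None and value not in values:
            values.append(value)
    return values


def build_terms_query(field, values, size=100):
    """Build the body of a follow-up search that matches any of `values`."""
    return {"query": {"terms": {field: values}}, "size": size}


# The "firstLevel" aggregation from the response above, trimmed to the relevant keys.
first_level = {
    "hits": {
        "total": 2,
        "max_score": 1,
        "hits": [
            {"_index": "index3", "_id": "81", "_score": 1, "_source": {"index3id": 1}},
            {"_index": "index3", "_id": "61", "_score": 1, "_source": {"index3id": 1}},
        ],
    }
}

ids = collect_source_values(first_level, "index3id")
second_query = build_terms_query("index3id", ids)
print(second_query)
```

The resulting body would then be sent to index2 as a second search request; this is two round trips, not a server-side join.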

Related

How to get inner hits field values in the Nest or Elastic.Net library? Alternatively, how to specify the output type in the Nest or Elastic.Net library?

I am new to Elasticsearch and I am having trouble with the Nest/Elastic.Net library.
I would like to retrieve not the entire document but just part of it. I am able to do it in Postman, but I cannot do it via the Elastic.Net library or the Nest library.
The document structure looks like the following:
{
  "Doc_id": "id_for_cross_refference_with_othersystem",
  "Ocr": [
    {
      "word": "example_word1",
      "box": [],
      "cord": "some_number"
    },
    {
      "word": "example_word2",
      "box": [],
      "cord": "some_number2"
    }
  ]
}
The document has a huge number of properties, but I am interested only in Doc_id, ocr.word, ocr.box and ocr.cord.
The following Postman request fully satisfies my needs:
{
"query": {
"bool": {
"must": [
{
"match": {
"doc_id": "2a558865-7dc2-4e4d-ad02-3f683159984e"
}
},
{
"nested": {
"path": "ocr",
"query": {
"match": {
"ocr.word": "signing"
}
},
"inner_hits": {
"_source": {
"includes":[
"ocr.word",
"ocr.box",
"ocr.conf"
]
}
}
}
}
]
}
},
"_source":"false"
}
The result of that request is the following:
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 18.99095,
"hits": [
{
"_index": "irrelevant",
"_type": "irrelevant",
"_id": "irrelevant",
"_score": 18.99095,
"_source": {},
"inner_hits": {
"ocr": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 7.9260864,
"hits": [
{
"_index": "irrelevant",
"_type": "irrelevant",
"_id": "irrelevant",
"_nested": {
"field": "ocr",
"offset": 11
},
"_score": 7.9260864,
"_source": {
"box": [
],
"conf": "96.452858",
"word": "signing"
}
}
]
}
}
}
},
{
"_index": "there_rest _of_object_is_ommited",
},
{
"_index": "there_rest _of_object_is_ommited",
}
]
}
}
However, when I try to convert that request to the Nest query DSL I am not able to achieve the same result.
When I try to use the NEST library I don't see any way to provide an output result model/type. It looks like the document type has to match the output type, which is not my case.
The query that I am using:
var searchResponse = client2.Search<Model>(s => s
.Query(q1 => q1.Bool(b1 => b1.Must(s1 =>
s1.Match(m => m.Field(f => f.doc_id).Query("2a558865-7dc2-4e4d-ad02-3f683159984e")),
s2 => s2.Nested(n => n.Path("ocr").Query(q2 => q2.Bool(b => b.Must(m => m.Match(m => m.Field(f => f.ocr.First().word).Query("signing")))))
.InnerHits(ih => ih.Source(s => s.Includes(i => i.Field(f => f.ocr.First().word).Field(f => f.ocr.First().conf))))
)
)))
.Source(false)
);
Because the Model type is designed for the whole document and doesn't match the output shape, I am getting [null, null, null] as the output.
There is a Hits property on ISearchResponse, but when I look into it I cannot see the values of the fields.
I tried using the low-level client (Elastic.Net) and providing the JSON request as a string, but it looks like there is no way of specifying the output type there either. When I ran my code with the low-level library, it returned 3 objects of class Model with empty fields.
My questions are:
Is it possible to specify an output type different from the document type for the Nest query DSL or the Elastic.Net library?
Is it possible to get the values of the fields that I specified in the request for inner hits with the help of the Nest or Elastic.Net libraries?
How would you solve such a problem? We have huge documents and we don't want to pass unnecessary information back and forth. The inner hits approach looks like a neat solution for us, but it doesn't seem to work with the recommended libraries, unless I am making some silly mistake.
NOTE: I can achieve the desired result by simply using HTTPClient and manually doing what I need, but I hope to leverage a library that is written for this purpose (Nest or Elastic.Net).
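For reference, the manual approach mentioned in the note amounts to walking the raw response and reading `_source` from each inner hit rather than from the top-level hit. A minimal sketch in Python, assuming the response has already been parsed into a dict with the shape shown above (the helper name `extract_inner_hits` and the output layout are made up for illustration; this does not use Nest or Elastic.Net):

```python
def extract_inner_hits(response, name):
    """Collect the _source of every inner hit called `name`, keyed to its parent hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(name, {}).get("hits", {}).get("hits", [])
        for inner_hit in inner:
            rows.append(
                {
                    "parent_id": hit.get("_id"),
                    "offset": inner_hit.get("_nested", {}).get("offset"),
                    "source": inner_hit.get("_source", {}),
                }
            )
    return rows


# Trimmed copy of the response shown above; the second hit stands for an omitted one.
response = {
    "hits": {
        "hits": [
            {
                "_id": "irrelevant",
                "_source": {},
                "inner_hits": {
                    "ocr": {
                        "hits": {
                            "hits": [
                                {
                                    "_nested": {"field": "ocr", "offset": 11},
                                    "_source": {"box": [], "conf": "96.452858", "word": "signing"},
                                }
                            ]
                        }
                    }
                },
            },
            {"_index": "irrelevant"},
        ]
    }
}

rows = extract_inner_hits(response, "ocr")
print(rows)
```

The point of the sketch is only that the requested fields live under `inner_hits.<name>.hits.hits[]._source`, which is why a model mapped to the top-level `_source` comes back empty.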

How to filter match in top 3 - elasticsearch?

I have the following data in Elasticsearch:
[
  {
    "_index": "media",
    "_type": "information",
    "_id": "6838",
    "_source": {
      "demographics_countries": {
        "AE": 0.17543859649122806,
        "CA": 0.013157894736842105,
        "FR": 0.017543859649122806,
        "GB": 0.043859649122807015,
        "IT": 0.02631578947368421,
        "LB": 0.013157894736842105,
        "SA": 0.49122807017543857,
        "TR": 0.017543859649122806,
        "US": 0.09210526315789472
      }
    }
  },
  {
    "_index": "media",
    "_type": "information",
    "_id": "57696",
    "_source": {
      "demographics_countries": {
        "TN": 0.8125,
        "MA": 0.034375,
        "DZ": 0.032812,
        "FR": 0.0125,
        "EG": 0.0125,
        "IN": 0.009375,
        "SA": 0.009375
      }
    }
  }
]
Expected result:
Find the documents in which a specific country, SA (Saudi Arabia), is among the top 3 in demographics_countries.
For example:
"_id": "6838" (the first document) matches because SA (Saudi Arabia) is among the top 3 in its demographics_countries.
What I tried: I have tried to filter using top_hits, but it's not working as expected.
Any suggestion would be appreciated.
With the current data model it's quite difficult to do that. What I'd suggest might not be the easiest way to do it, but it will definitely be the fastest to query eventually.
I'd suggest remodelling your documents to already include the top countries:
[
{
"_index": "media",
"_type": "information",
"_id": "6838",
"_source": {
"top_demographics_countries": ["TN", "MA", "DZ"],
"demographics_countries": {
"AE": 0.17543859649122806,
"CA": 0.013157894736842105,
"FR": 0.017543859649122806,
"GB": 0.043859649122807015,
"IT": 0.02631578947368421,
"LB": 0.013157894736842105,
"SA": 0.49122807017543857,
"TR": 0.017543859649122806,
"US": 0.09210526315789472
}
}
},
{
"_index": "media",
"_type": "information",
"_id": "57696",
"_source": {
"top_demographics_countries": ["TN", "MA", "DZ"],
"demographics_countries": {
"TN": 0.8125,
"MA": 0.034375,
"DZ": 0.032812,
"FR": 0.0125,
"EG": 0.0125,
"IN": 0.009375,
"SA": 0.009375
}
}
}
]
Ignore the values I've picked for top_demographics_countries; they are placeholders. With this kind of approach, you can always precalculate the top countries, and then you can use a simple term query to check whether a document contains that value or not:
{
"query": {
"bool": {
"filter": {
"term": {
"top_demographics_countries": "SA"
}
}
}
}
}
It's going to be cheaper to compute them once when saving than to build that clause dynamically on every query.
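The precalculation itself can be done in the application just before indexing. A minimal sketch, assuming each document is available as a dict with a demographics_countries mapping (the helper name `top_countries` is made up for illustration):

```python
def top_countries(demographics, n=3):
    """Return the n country codes with the highest share, highest first."""
    ranked = sorted(demographics.items(), key=lambda item: item[1], reverse=True)
    return [code for code, _ in ranked[:n]]


# The two documents from the question, reduced to the relevant field.
doc_6838 = {
    "demographics_countries": {
        "AE": 0.17543859649122806,
        "CA": 0.013157894736842105,
        "FR": 0.017543859649122806,
        "GB": 0.043859649122807015,
        "IT": 0.02631578947368421,
        "LB": 0.013157894736842105,
        "SA": 0.49122807017543857,
        "TR": 0.017543859649122806,
        "US": 0.09210526315789472,
    }
}
doc_57696 = {
    "demographics_countries": {
        "TN": 0.8125,
        "MA": 0.034375,
        "DZ": 0.032812,
        "FR": 0.0125,
        "EG": 0.0125,
        "IN": 0.009375,
        "SA": 0.009375,
    }
}

# Store the precomputed list next to the raw data before indexing the document.
for doc in (doc_6838, doc_57696):
    doc["top_demographics_countries"] = top_countries(doc["demographics_countries"])

print(doc_6838["top_demographics_countries"])
print(doc_57696["top_demographics_countries"])
```

With these values stored, the term filter on "SA" matches document 6838 and not 57696, which is the result the question asks for.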
@Evaldas is right: it's better to extract the top 3 beforehand.
But if you can't help yourself and feel compelled to use java/painless, here's one approach:
{
"query": {
"bool": {
"must": [
{
"exists": {
"field": "demographics_countries.SA"
}
},
{
"script": {
"script": {
"source": """
def tuple_list = new ArrayList();
for (def c : params.all_countries) {
def key = 'demographics_countries.'+c;
if (!doc.containsKey(key) || doc[key].size() == 0) {
continue;
}
def val = doc[key].value;
tuple_list.add([c, val]);
}
// sort tuple list by the country values
Collections.sort(tuple_list, (arr1, arr2) -> arr1[1] < arr2[1] ? 1 : -1);
// bail out when fewer than 3 countries are present; subList(0, 3) would throw otherwise
if (tuple_list.size() < 3) {
return false;
}
// slice & take only the top 3
def top_3_countries = tuple_list.subList(0, 3).stream().map(arr -> arr[0]).collect(Collectors.toList());
return top_3_countries.contains(params.country_of_interest);
""",
"params": {
"country_of_interest": "SA",
"all_countries": [
"AE",
"CA",
"FR",
"GB",
"IT",
"LB",
"SA",
"TR",
"US",
"TN",
"MA",
"DZ",
"EG",
"IN"
]
}
}
}
}
]
}
}
}

Has child query with date filter in Elasticsearch

I have two types in an index my_index. One is the parent of the other. The child is called calendar and the parent is called stock. One stock can have many calendar entries, but not vice versa. I need to calculate the number of days on which stock was available within a range of dates (greater than or equal to a start date and less than or equal to an end date). How can I do it?
stock entries look more or less like this:
{
"_index": "my_index",
"_type": "stock",
"_id": "21321",
"_score": 1,
"_source": {
"description": "Screwdriver",
"size": "big",
"some_characteristic1": null,
"some_characteristic2": "Good",
"id": 21321,
...
}
}
calendar entries look more or less like this:
{
"_index": "my_index",
"_type": "calendar",
"_id": "21321-20161129",
"_score": 1,
"_routing": "21321",
"_parent": "21321",
"_source": {
"available": true,
"date": "2016-11-29",
"prices": {
"price": 2150
}
}
}, ...
The result of this query should look more or less like this:
{
"description": "Screwdriver",
"size": "big",
"some_characteristic1": null,
"some_characteristic2": "Good",
"id": 21321,
"available_days_in_february": {
"count": 12,
"total_count": 28
}
}
A similar query in SQL could look like this:
SELECT s.*, sum(c.available = true) as count, count(c.available) as total_count
FROM stock as s
LEFT JOIN calendar as c ON s.id = c._parent
WHERE c.date >= '2017-02-01' AND c.date <= '2017-02-28'
GROUP BY s.id
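To make the intended numbers concrete, here is a minimal sketch of the same count computed in application code over calendar entries that have already been fetched for one stock item. It does not show how to express this inside Elasticsearch; the sample entries and the helper name `availability_summary` are made up for illustration:

```python
from datetime import date


def availability_summary(calendar_entries, start, end):
    """Count available days and total days among entries dated within [start, end]."""
    in_range = [
        entry
        for entry in calendar_entries
        if start <= date.fromisoformat(entry["date"]) <= end
    ]
    return {
        "count": sum(1 for entry in in_range if entry["available"]),
        "total_count": len(in_range),
    }


# Hypothetical calendar entries for a single stock item.
entries = [
    {"date": "2017-01-31", "available": True},
    {"date": "2017-02-01", "available": True},
    {"date": "2017-02-02", "available": False},
    {"date": "2017-02-28", "available": True},
    {"date": "2017-03-01", "available": True},
]

summary = availability_summary(entries, date(2017, 2, 1), date(2017, 2, 28))
print(summary)
```

Both range bounds are inclusive, matching the ">= start and <= end" condition in the SQL.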

Elastic search Nest TopHits aggregation

I've been struggling with a problem for a while now, so I thought I would run it by Stack Overflow.
My document type has a title, a language field (used to filter) and a grouping id field (I'm leaving out all the other fields to keep this to the point).
When I search for documents I want to find all documents containing the text in the title. I only want one document for each unique grouping id.
I've been looking at the top_hits aggregation, and from what I can see it should be able to solve my problem.
When running this query against my index:
{
"query": {
"match": {
"title": "dingo"
}
},
"aggs": {
"top-tags": {
"terms": {
"field": "groupId",
"size": 1000000
},
"aggs": {
"top_tag_hits": {
"top_hits": {
"_source": {
"include": [
"*"
]
},
"size": 1
}
}
}
}
}
}
I get the following response (all results are in the same language):
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"top-tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
"doc_count": 2,
"top_tag_hits": {
"hits": {
"total": 2,
"max_score": 1.4983996,
"hits": [{
"_index": "elasticsearch",
"_type": "productdocument",
"_id": "FB15279FB18E4B34AD66ACAF69B96E9E",
"_score": 1.4983996,
"_source": {
"groupId": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
"title": "wombat, dingo and zetapunga actionfigures",
}
}]
}
}
},
{
"key": "F11799ABD0C14B98ADF2554C84FF0DA0",
"doc_count": 1,
"top_tag_hits": {
"hits": {
"total": 1,
"max_score": 1.30684,
"hits": [{
"_index": "elasticsearch",
"_type": "productdocument",
"_id": "42562A25E4434A0091DE0C79A3E7F3F4",
"_score": 1.30684,
"_source": {
"groupId": "F11799ABD0C14B98ADF2554C84FF0DA0",
"title": "awesome dingo raptor"
}
}]
}
}
}]
}
}
}
This is exactly what I expected (two hits in one bucket, but only one document retrieved for that bucket). However, when I try this in NEST I can't seem to retrieve all of the documents.
My query looks like this:
result = _elasticClient.Search<T>(s => s
.From(skip)
.Filter(fd => fd.Term(f => f.Language, language))
.Size(pageSize)
.SearchType(SearchType.Count)
.Query(
q => q.Wildcard(f => f.Title, query, 2.0)
|| q.Wildcard(f => f.Description, query)
)
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(100000) //We sadly need all products
)
.TopHits("top_tag_hits", thagd => thagd
.Size(1)
.Source(ssd => ssd.Include("*")))
));
var topHits = result.Aggs.TopHits("top_tag_hits");
var documents = topHits.Documents<ProductDocument>(); //contains only one document (I would expect it to contain two, one for each bucket)
Inspecting the aggregations in the debugger reveals a "groupId" aggregation with 2 buckets, matching what I see in my "raw" query against the index, but without any apparent way to retrieve the documents.
So my question is: how do I retrieve the top hit for each bucket? Or am I doing this completely wrong? Is there some other way to achieve what I am trying to do?
EDIT
After the help I received, I was able to retrieve my results with the following:
result = _elasticClient.Search<T>(s => s
.From(skip)
.Filter(fd => fd.Term(f => f.Language, language))
.Size(pageSize)
.SearchType(SearchType.Count)
.Query(
q => q.Wildcard(f => f.Title, query, 2.0)
|| q.Wildcard(f => f.Description, query)
)
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(0)
.Aggregations(tagdaggs =>
tagdaggs.TopHits("top_tag_hits", thagd => thagd
.Size(1)))
)
)
);
var groupIdAggregation = result.Aggs.Terms("groupId");
var topHits =
groupIdAggregation.Items.Select(key => key.TopHits("top_tag_hits"))
.SelectMany(topHitMetric => topHitMetric.Documents<ProductDocument>()).ToList();
Your NEST query runs the Terms aggregation and the TopHits aggregation side by side, while your original query runs Terms first and then calls TopHits for each bucket.
You simply have to move your TopHits agg into Terms in your NEST query to make it work.
This should fix it:
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(0)
.Aggregations(tagdaggs =>
tagdaggs.TopHits("top_tag_hits", thagd => thagd
.Size(1)))
)
)
By the way, you don't have to use Include("*") to include all fields; just remove this option. Also, specifying .Size(0) should bring back ALL possible terms for you.
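The nesting matters because the top hits live inside each terms bucket, not next to the terms aggregation. A minimal sketch in Python that walks the raw aggregation response from the question and takes the first hit of every bucket (the helper name `top_hit_per_bucket` is made up for illustration; the sample is a trimmed copy of the response shown in the question):

```python
def top_hit_per_bucket(aggregations, terms_name, top_hits_name):
    """Return the _source of the best hit in each bucket of a terms aggregation."""
    documents = []
    for bucket in aggregations[terms_name]["buckets"]:
        hits = bucket[top_hits_name]["hits"]["hits"]
        if hits:
            documents.append(hits[0]["_source"])
    return documents


aggregations = {
    "top-tags": {
        "buckets": [
            {
                "key": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
                "doc_count": 2,
                "top_tag_hits": {
                    "hits": {
                        "total": 2,
                        "hits": [
                            {
                                "_id": "FB15279FB18E4B34AD66ACAF69B96E9E",
                                "_source": {
                                    "groupId": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
                                    "title": "wombat, dingo and zetapunga actionfigures",
                                },
                            }
                        ],
                    }
                },
            },
            {
                "key": "F11799ABD0C14B98ADF2554C84FF0DA0",
                "doc_count": 1,
                "top_tag_hits": {
                    "hits": {
                        "total": 1,
                        "hits": [
                            {
                                "_id": "42562A25E4434A0091DE0C79A3E7F3F4",
                                "_source": {
                                    "groupId": "F11799ABD0C14B98ADF2554C84FF0DA0",
                                    "title": "awesome dingo raptor",
                                },
                            }
                        ],
                    }
                },
            },
        ]
    }
}

documents = top_hit_per_bucket(aggregations, "top-tags", "top_tag_hits")
print([doc["title"] for doc in documents])
```

This mirrors what the Select/SelectMany over the bucket items does in the NEST code from the question's edit: one lookup of the sub-aggregation per bucket.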

Pyes search on particular field gives no results

I have an indexed document whose structure is as follows:
{
"_index": "yelp",
"_type": "user",
"_id": "ABC",
"_score": 1,
"_source": {
"yelping_since": "2007-07",
"votes": {},
"review_count": 1798,
"name": "TEST_USER",
"user_id": "123",
"friends": [],
"fans": 89,
"average_stars": 3.48,
"type": "user",
"compliments": {},
"elite": []
}
}
When I do:
myFilter = TermFilter("_id", "ABC")
q = FilteredQuery(MatchAllQuery(), myFilter).search()
results = conn.search(query=q, indices=index_name,type=doc_typeU)
the exact result is returned.
But when I try to search on another field, say user_id, as follows,
myFilter = TermFilter("user_id", "123")
q = FilteredQuery(MatchAllQuery(), myFilter).search()
results = conn.search(query=q, indices=index_name,type=doc_typeU)
no results are returned. Any idea why?
Please try this one, which uses TermQuery instead of TermFilter:
myFilter = TermQuery("user_id", "123")
q = FilteredQuery(MatchAllQuery(), myFilter).search()
results = conn.search(query=q, indices=index_name,type=doc_typeU)
