Getting child documents - elasticsearch

I have an Elasticsearch index. Each document in that index has a number (e.g. 1, 2, 3, etc.) and an array named ChildDocumentIds, among other properties. Each item in that array is the _id of a document related to this document.
I have a saved search named "Child Documents". I would like to use the number (e.g. 1, 2, 3, etc.) to get the child documents associated with it.
Is there a way to do this in Elasticsearch? I can't seem to find a way to do a relational-type query in Elasticsearch for this purpose. I know it will be slow, but I'm OK with that.

The terms query allows you to do this. If document #1000 had child documents 3, 12, and 15 then the following two queries would return identical results:
"terms" : { "_id" : [3, 12, 15] }
and:
"terms" : {
  "_id" : {
    "index" : <parent_index>,
    "type" : <parent_type>,
    "id" : 1000,
    "path" : "ChildDocumentIds"
  }
}
The reason that it requires you to specify the index and type a second time is that the terms query supports cross-index lookups.
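As a sketch of how you might build such a request from code (the index name "parents" and the helper function are assumptions for illustration, not from the thread), the lookup form above is just a dict you would POST to the child index's _search endpoint:

```python
# Sketch: build a terms-lookup query body. "parents" is an assumed name
# for the index holding the parent document whose ChildDocumentIds array
# lists the children to fetch. (Older ES versions also need a "type" key.)
def child_docs_query(parent_index, parent_id, path="ChildDocumentIds"):
    return {
        "query": {
            "terms": {
                "_id": {
                    "index": parent_index,
                    "id": str(parent_id),
                    "path": path,
                }
            }
        }
    }

body = child_docs_query("parents", 1000)
```

Sending `body` to the child index's `_search` endpoint returns the documents whose `_id` values appear in the parent's ChildDocumentIds array.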

Related

improving performance of search query using index field when working with alias

I am using an alias name when writing data using Bulk Api.
I have 2 questions:
Can I get the index name after writing data using the alias name maybe as part of the response?
Can I improve performance if I send search queries to specific indexes instead of searching across all the indexes behind the same alias?
If you're using an alias name for writes, that alias can only point to a single index, and you'll receive that index's name back in the bulk response.
For instance, if test_alias is an alias to the test index, then when sending this bulk command:
POST test_alias/_doc/_bulk
{"index":{}}
{"foo": "bar"}
You will receive this response:
{
  "index" : {
    "_index" : "test",          <---- here is the real index name
    "_type" : "_doc",
    "_id" : "WtcviYABdf6lG9Jldg0d",
    "_version" : 1,
    "result" : "created",
    "_shards" : {
      "total" : 2,
      "successful" : 2,
      "failed" : 0
    },
    "_seq_no" : 0,
    "_primary_term" : 1,
    "status" : 201
  }
}
Common sense has it that searching on a single index is always faster than searching on an alias spanning several indexes, but if the alias only spans a single index, then there's no difference.
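If you parse that response in code, pulling out the real index name per item is a one-liner. A minimal sketch, assuming the standard bulk-response shape where per-document results sit in an "items" array:

```python
# Sketch: extract the concrete index name behind the alias for each item
# of a parsed bulk response (a real response wraps items in an "items" list).
def indices_from_bulk(resp):
    return [item["index"]["_index"] for item in resp.get("items", [])]

sample = {
    "items": [
        {"index": {"_index": "test", "_id": "WtcviYABdf6lG9Jldg0d", "result": "created"}}
    ]
}
```

Here `indices_from_bulk(sample)` yields `["test"]`, the real index behind `test_alias`.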
You can provide multiple index names while searching the data. If you are using an alias that spans multiple indices, a search on the alias hits all of them by default, but you can restrict the search to just a few of the indices behind the alias, based on the fields in the underlying indices.
You can read the "Filter-based aliases to limit access to data" section in this blog to see how to achieve it; since such a search queries fewer indices and less data, search performance will be better.
Also, an alias can have only a single write index, whose name you can get from the _cat/aliases?v API response as well; it shows which index is the write index for the alias, and you can see the sample output here.

ElasticSearch Sorted Index not working as expected with multiple shards

I have an Elasticsearch index with a default sort setting on price:
shop_prices_sort_index
"sort" : {
  "field" : "enrich.price",
  "order" : "desc"
},
If I insert 10 documents:
100, 98, 10230, 34, 1, 23, 777, 2323, 3, 109
and fetch results using /_search, it returns documents in descending order of price by default:
10230, 2323...
But if I distribute my documents across 3 shards, the same query returns some other sequence of products:
100, 98, 34...
I am really stuck here; I am not sure whether I am missing something basic or need extra settings to make a sorted index behave correctly.
PS: I also tried 'routing' & 'preference', but no luck.
Any help much appreciated.
When configuring index sorting, you're only making sure that each segment inside each shard is properly sorted. The goal of index sorting is to provide some more optimization during searches.
Due to the distributed nature of ES, when your index has many shards, each shard will be properly sorted, but your search query still needs to request sorting explicitly.
So if your index settings contains the following to apply sorting at indexing time
"sort" : {
  "field" : "enrich.price",
  "order" : "desc"
}
your search queries will also need to request the same sort explicitly at query time (note that the query-time syntax differs from the index setting):
"sort" : [
  { "enrich.price" : "desc" }
]
By using index sorting you'll hit a little overhead at indexing time, but your search queries will be a bit faster in the end.
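To make the distinction concrete, here is a sketch of the two shapes side by side as plain Python dicts (field name from the question; the exact settings layout can vary by ES version):

```python
# Index-time sorting: lives under the index settings, applied when
# the index is created. It sorts segments, not search responses.
settings = {
    "settings": {
        "index": {
            "sort.field": "enrich.price",
            "sort.order": "desc",
        }
    }
}

# Query-time sorting: a "sort" array in the search request body.
# This is what actually orders the hits returned across shards.
search_body = {
    "query": {"match_all": {}},
    "sort": [{"enrich.price": {"order": "desc"}}],
}
```

With index sorting configured, the query-time sort can terminate early per segment, which is where the speedup comes from.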

Sum field and sort on Solr

I'm implementing a grouped search in Solr. I'm looking for a way of summing one field and sorting the results by that sum. I hope the following data example makes it clearer.
[
  {
    "id" : 1,
    "parent_id" : 22,
    "valueToBeSummed" : 3
  },
  {
    "id" : 2,
    "parent_id" : 22,
    "valueToBeSummed" : 1
  },
  {
    "id" : 3,
    "parent_id" : 33,
    "valueToBeSummed" : 1
  },
  {
    "id" : 4,
    "parent_id" : 5,
    "valueToBeSummed" : 21
  }
]
If the search is made over this data I'd like to obtain
[
  {
    "numFound" : 1,
    "summedValue" : 21,
    "parent_id" : 5
  },
  {
    "numFound" : 2,
    "summedValue" : 4,
    "parent_id" : 22
  },
  {
    "numFound" : 1,
    "summedValue" : 1,
    "parent_id" : 33
  }
]
Do you have any advice on this ?
Solr 5.1+ (and 5.3) introduces Solr Facet functions to solve this exact issue.
From Yonik's introduction of the feature:
$ curl http://localhost:8983/solr/query -d 'q=*:*&
json.facet={
  categories:{
    type : terms,
    field : cat,
    sort : "x desc",    // can also use sort:{x:desc}
    facet:{
      x : "avg(price)",
      y : "sum(price)"
    }
  }
}
'
So the suggestion would be to upgrade to the newest version of Solr (the most recent version is currently 5.2.1, be advised that some of the syntax that's on the above link will be landed in 5.3 - the current release target).
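Adapting that example to the asker's fields, the request could be sketched like this (the facet names "groups" and "total" are illustrative choices, not from the thread):

```python
import json

# Sketch: JSON Facet API request that buckets documents by parent_id and
# sorts the buckets by the sum of valueToBeSummed, descending.
facet = {
    "groups": {
        "type": "terms",
        "field": "parent_id",
        "sort": "total desc",
        "facet": {"total": "sum(valueToBeSummed)"},
    }
}

# Parameters for a GET/POST to /solr/<core>/query.
params = {"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)}
```

Each returned bucket would carry the parent_id, its document count, and the computed `total`, which matches the desired output in the question.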
So you want to group your results on the field parent_id and inside each group you want to sum up the fields valueToBeSummed and then you want to sort the entire results (the groups) by this new summedvalue field. That is a very interesting use case...
Unfortunately, I don't think there is a built in way of doing what you have asked.
There are function queries which you can use to sort, there is a group.func parameter also, but they will not do what you have asked.
Have you already indexed this data, or are you still in the process of charting out how to store it? If it's the latter, then one possible way would be to have a summedvalue field on each document and calculate it as and when a document gets indexed. For example, given the sample documents in your question, the first document will be indexed as
{
  "id" : 1,
  "parent_id" : 22,
  "valueToBeSummed" : 3,
  "summedvalue" : 3,
  "timestamp" : current-timestamp
},
Before indexing the second document id:2 with parent_id:22 you will run a solr query to get the last indexed document with parent_id:22
Solr Query q=parent_id:22&sort=timestamp desc&rows=1
and add the summedvalue of id:1 with valueToBeSummed of id:2
So the next document will be indexed as
{
  "id" : 2,
  "parent_id" : 22,
  "valueToBeSummed" : 1,
  "summedvalue" : 4,
  "timestamp" : current-timestamp
}
and so on.
Once you have documents indexed this way, you can run a regular Solr query with &group=true&group.field=parent_id&sort=summedvalue desc.
Please do let us know how you decide to implement it. Like I said its a very interesting use case! :)
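The running-sum scheme above can be sketched client-side; in this minimal sketch a dict stands in for the Solr query that fetches the last indexed summedvalue per parent_id:

```python
# Sketch: compute the running summedvalue for each document before it is
# indexed. `last` simulates "query the newest doc for this parent_id".
def with_running_sums(docs):
    last = {}
    out = []
    for d in docs:
        total = last.get(d["parent_id"], 0) + d["valueToBeSummed"]
        last[d["parent_id"]] = total
        out.append({**d, "summedvalue": total})
    return out

docs = [
    {"id": 1, "parent_id": 22, "valueToBeSummed": 3},
    {"id": 2, "parent_id": 22, "valueToBeSummed": 1},
    {"id": 3, "parent_id": 33, "valueToBeSummed": 1},
    {"id": 4, "parent_id": 5, "valueToBeSummed": 21},
]
indexed = with_running_sums(docs)
```

After this pass, the second parent_id:22 document carries summedvalue 4, matching the worked example above.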
You can add the below query
select?q=*:*&stats=true&stats.field={!tag=piv1 sum=true}valueToBeSummed&facet=true&facet.pivot={!stats=piv1 facet.sort=index}parent_id&wt=json&indent=true
You need to use the Stats Component for this requirement; you can get more information here. The idea is to first define what you need stats on (here it is valueToBeSummed) and then group on parent_id. We use facet.pivot for this functionality.
Regarding sort: when we do grouping, the default sort order is based on the count in each group. We can sort on a value too; I have done this above using facet.sort=index, so it sorts on parent_id, the field we used for grouping. But your requirement is to sort on valueToBeSummed, which is different from the grouping attribute.
As of now I'm not sure if we can achieve that, but I will look into it and let you know.
In short, you've got the grouping and the sum above; only the sort is pending.

How to retrieve all the document ids from an elasticsearch index

How do I retrieve all the document ids (the internal document '_id') from an Elasticsearch index? If I have 20 million documents in that index, what is the best way to do that?
I would just export the entire index and read off the file system. My experience with size/from and scan/scroll has been a disaster when dealing with query result sets in the millions. It just takes too long.
If you can use a tool like knapsack, you can export the index to the file system and iterate through the directories. Each document is stored under its own directory named after its _id. No need to actually open files; just iterate through the dir.
link to knapsack:
https://github.com/jprante/elasticsearch-knapsack
edit: hopefully you are not doing this often... or this may not be a viable solution
For that amount of documents, you probably want to use the scan and scroll API.
Many client libraries have ready helpers to use the interface. For example, with elasticsearch-py you can do:
import elasticsearch
import elasticsearch.helpers

es = elasticsearch.Elasticsearch(eshost)
# stream through every hit, asking only for the _id of each document
scroll = elasticsearch.helpers.scan(es, query={"fields": "_id"}, index=idxname, scroll='10s')
for res in scroll:
    print(res['_id'])
First you can issue a request to get the full count of records in the index.
curl -X GET 'http://localhost:9200/documents/document/_count?pretty=true'
{
  "count" : 1408,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}
Then you'll want to loop through the set using a combination of the size and from parameters until you reach the total count. Passing an empty fields parameter will return only the index, type, and _id values that you're interested in.
Find a good page size that you can consume without running out of memory and increment the from each iteration.
curl -X GET 'http://localhost:9200/documents/document/_search?fields=&size=1000&from=5000'
Example item response:
{
  "_index" : "documents",
  "_type" : "document",
  "_id" : "1341",
  "_score" : 1.0
},
...
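The paging loop described above boils down to generating from/size pairs until the count is covered. A minimal sketch (the `pages` helper is hypothetical, just the offset arithmetic; each pair maps onto one `?fields=&size={size}&from={from}` request):

```python
# Sketch: compute the (from, size) pairs needed to page through `total`
# documents in chunks of `size`.
def pages(total, size=1000):
    return [(start, size) for start in range(0, total, size)]

# For the count of 1408 shown above, with a page size of 1000,
# two requests are needed: from=0 and from=1000.
offsets = pages(1408, 1000)
```

Note that the scan/scroll approach from the other answer avoids the deep-pagination cost this loop incurs on very large offsets.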

Uniformly distributing results in elastic search based on an attribute

I am using tire to perform searches on sets of objects that have a category attribute (there are 6 different categories).
I want the results to come in pages of 6 with one of each category on a page (while it is possible).
E.g. 1: if the first, second, and third categories had 2 objects each and the fourth, fifth, and sixth categories had 1 object each, the pages would look like:
Data: [1,1,2,2,3,3,4,5,6]
1: 1,2,3,4,5,6
2: 1,2,3
E.g. 2: [1,1,1,1,1,2,2,3,4,5]
1: 1,2,3,4,5,1
2: 2,1,1,1
In something like Ruby it wouldn't be too difficult to sort based on the number of times a category has already appeared.
Something like
times_seen = {}
results.sort_by do |r|
  times_seen[r.category] ||= 0
  [times_seen[r.category] += 1, r.category]
end
E.g.
irb(main):032:0> times_seen = {};[1,1,1,1,1,2,2,3,4,5].sort_by{|i| times_seen[i] ||= 1; [times_seen[i] += 1, i];}
=> [1, 2, 3, 4, 5, 1, 2, 1, 1, 1]
To do this with a large number of results would be really slow because we would need to pull them all into ruby first and then sort.
Ideally we want to do this in elastic search and let it handle the pagination for us.
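For reference, the Ruby sort above translates to Python like this; it is still a client-side sort, so it has the same scaling problem just described:

```python
from collections import defaultdict

# Round-robin interleaving by category: sort by (occurrence number, category),
# so the first item of each category comes before any second item, and so on.
# sorted() computes keys in input order, so the counter increments correctly.
def interleave(items):
    seen = defaultdict(int)

    def rank(item):
        seen[item] += 1
        return (seen[item], item)

    return sorted(items, key=rank)

result = interleave([1, 1, 1, 1, 1, 2, 2, 3, 4, 5])
# result == [1, 2, 3, 4, 5, 1, 2, 1, 1, 1], matching the irb example above
```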
There is Script based sorting in elastic search:
http://www.elasticsearch.org/guide/reference/api/search/sort/
{
  "query" : {
    ....
  },
  "sort" : {
    "_script" : {
      "script" : "doc['field_name'].value * factor",
      "type" : "number",
      "params" : {
        "factor" : 1.1
      },
      "order" : "asc"
    }
  }
}
So if we could do something like this but with the times_seen logic from above in it, it would make life really easy; however, it would require a times_seen variable that persists between script invocations.
Any ideas on how to achieve a uniform distribution based on an attribute or if it is possible to somehow use a variable in the script sort?
