Finding most commonly coincident values within an array with Elasticsearch

I would like to display the most commonly ordered pairs of products for a set of orders placed. An abbreviated version of my search index would look something like this:
{
  "_type" : "order",
  "_id" : "10",
  "_score" : 1.0,
  "_source" : {
    ...
    "product_ids" : [
      1, 2
    ]
    ...
  }
},
{
  "_type" : "order",
  "_id" : "11",
  "_score" : 1.0,
  "_source" : {
    ...
    "product_ids" : [
      1, 2, 3
    ]
    ...
  }
}
Given that my search index contains a set of orders, each with a product_ids field that contains an array of the product ids that are in the order, is it possible to put together an Elasticsearch aggregation that will return either:
1. The most common pairs of product ids (which may be members of an arbitrarily long list of product ids) that occur most frequently together in orders.
2. The most common sets of product ids of arbitrary length that occur most frequently together in orders.
I've been reading the documentation, and I'm not sure if an adjacency matrix might be appropriate for this problem. My current hunch is to write a scripted cardinality query that orders and joins the product_ids in the search document to get results in line with #2, since #1 seems like it might involve too many permutations of product ids to be reasonably efficient.
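For reference, a minimal sketch of what an adjacency_matrix aggregation could look like here (the orders index name is a placeholder, and every product id of interest has to be enumerated explicitly as a filter, which is exactly the permutation concern above):
GET orders/_search
{
  "size": 0,
  "aggs": {
    "product_pairs": {
      "adjacency_matrix": {
        "filters": {
          "product_1": { "term": { "product_ids": 1 } },
          "product_2": { "term": { "product_ids": 2 } },
          "product_3": { "term": { "product_ids": 3 } }
        }
      }
    }
  }
}
The response buckets are keyed like "product_1&product_2" with a doc_count per pair, which would cover option 1, but only for the products listed in the filters.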

Related

How can I remove items from another search?

In Elasticsearch we make two searches: one for exact items, and another for non-exact items.
We search for input = dev, and in the exact results we get this item:
{"_id" : "users-USER#1-name",
"_source" : {
"pk" : "USER#1",
"entity" : "users",
"field" : "name",
"input" : "dev",
}}
Then we do a second search, and in the non-exact results we get this item:
{"_id" : "users-USER#1-description",
"_source" : {
"pk" : "USER#1",
"entity" : "users",
"field" : "name",
"input" : "Dev1",
}}
We want to remove the exact results of the first search from the second, non-exact search by pk: any item whose pk appeared in the first search should be removed from the second search.
I'll greatly appreciate any ideas.
For example, in the first search we got this item:
"_id" : "users-USER#1-name"
"pk" : "USER#1"
Since we got this item in the first search, we want to remove all items with that pk from the second search.
So the second search would be empty:
empty
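A rough sketch of one way this could be wired up (the my-index name is a placeholder, and depending on the mapping pk may need to be queried as pk.keyword): collect the pk values from the hits of the exact search, then exclude them in the non-exact search with a bool must_not clause:
POST my-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "input": "dev" } }
      ],
      "must_not": [
        { "terms": { "pk": ["USER#1"] } }
      ]
    }
  }
}
With USER#1 excluded, the second search would indeed come back empty in the example above.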

improving performance of search query using index field when working with alias

I am using an alias name when writing data using the Bulk API.
I have 2 questions:
Can I get the index name after writing data using the alias name, maybe as part of the response?
Can I improve performance if I send search queries to specific indexes instead of searching all indexes behind the same alias?
If you're using an alias name for writes, that alias can only point to a single index, which you're going to receive back in the bulk response.
For instance, if test_alias is an alias to the test index, then when sending this bulk command:
POST test_alias/_doc/_bulk
{"index":{}}
{"foo": "bar"}
You will receive this response:
{
  "index" : {
    "_index" : "test",          <---- here is the real index name
    "_type" : "_doc",
    "_id" : "WtcviYABdf6lG9Jldg0d",
    "_version" : 1,
    "result" : "created",
    "_shards" : {
      "total" : 2,
      "successful" : 2,
      "failed" : 0
    },
    "_seq_no" : 0,
    "_primary_term" : 1,
    "status" : 201
  }
}
Common sense has it that searching on a single index is always faster than searching on an alias spanning several indexes, but if the alias only spans a single index, then there's no difference.
You can provide multiple index names while searching the data. If you are using an alias that points to multiple indices, by default the search runs against all of them, but you can also restrict it to a few of the indices behind your alias, based on the fields in the underlying indices.
You can read the "Filter-based aliases to limit access to data" section in this blog on how to achieve it; since this queries fewer indices and less data, search performance will be better.
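As a rough sketch, a filter-based alias is created through the _aliases API (the index, alias, and region field names below are hypothetical), and searching that alias only touches the matching subset of data:
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "test",
        "alias": "test_alias_emea",
        "filter": { "term": { "region": "emea" } }
      }
    }
  ]
}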
Also, an alias can have only a single write index, and you can get its name from the _cat/aliases?v API response as well, which shows which index is the write index for the alias; you can see the sample output here.

ElasticSearch, Multiple Indices search

Can anyone help me with the below use case:
I want to have 3 schools, each school with multiple students, and I want to be able to search the student names, but the search result should tell me which school the matched text belongs to. Here is what I am thinking the solution would be, along with the problem I'm having:
I should have an index for each school and then use multi_match to match the entered text against all the indexes, but the problem is that I want to know which index each matched result belongs to. Please suggest a better solution for the use case, or how I can solve the mentioned problem. Thank you all.
BR
When you run a search, you get back a response that contains this:
"hits" : [
{
"_index" : ".async-search",
"_type" : "_doc",
"_id" : "CdT9fKXfQpOEIPuZazz0BA",
"_score" : 1.0,
"_source" : {
So you can use that _index value to determine which index (and therefore which school) each hit came from.
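For example, a single search across all three school indices might look like the sketch below (index and field names are made up for illustration); every hit in the response carries the _index it was found in:
GET school_a,school_b,school_c/_search
{
  "query": {
    "match": { "student_name": "john" }
  }
}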
ps - it's Elasticsearch, not ElasticSearch :)

Getting child documents

I have an Elasticsearch index. Each document in that index has a number (i.e. 1, 2, 3, etc.) and an array named ChildDocumentIds, along with additional properties. Each item in this array is the _id of a document that is related to this document.
I have a saved search named "Child Documents". I would like to use the number (i.e. 1, 2, 3, etc.) and get the child documents associated with it.
Is there a way to do this in Elasticsearch? I can't seem to find a way to do a relational-type query in Elasticsearch for this purpose. I know it will be slow, but I'm OK with that.
The terms query allows you to do this. If document #1000 had child documents 3, 12, and 15 then the following two queries would return identical results:
"terms" : { "_id" : [3, 12, 15] }
and:
"terms" : {
"_id" : {
"index" : <parent_index>,
"type" : <parent_type>,
"id" : 1000,
"path" : "ChildDocumentIds"
}
}
The reason that it requires you to specify the index and type a second time is that the terms query supports cross-index lookups.
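For completeness, a full request using the terms lookup might look like the following sketch (the documents index name is a placeholder here, and on Elasticsearch 7+ the type parameter shown above is no longer needed):
GET documents/_search
{
  "query": {
    "terms": {
      "_id": {
        "index": "documents",
        "id": "1000",
        "path": "ChildDocumentIds"
      }
    }
  }
}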

How to retrieve all the document ids from an elasticsearch index

How to retrieve all the document ids (the internal document '_id') from an Elasticsearch index? If I have 20 million documents in that index, what is the best way to do that?
I would just export the entire index and read off the file system. My experience with size/from and scan/scroll has been a disaster when querying result sets in the millions. It just takes too long.
If you can use a tool like knapsack, you can export the index to the file system and iterate through the directories. Each document is stored under its own directory named after the _id. No need to actually open files; just iterate through the directories.
link to knapsack:
https://github.com/jprante/elasticsearch-knapsack
edit: hopefully you are not doing this often... or this may not be a viable solution
For that amount of documents, you probably want to use the scan and scroll API.
Many client libraries have ready-made helpers for the interface. For example, with elasticsearch-py you can do:
import elasticsearch
import elasticsearch.helpers

es = elasticsearch.Elasticsearch(eshost)  # eshost: your cluster URL
# Return only hit metadata (no _source); each hit still carries its _id.
scroll = elasticsearch.helpers.scan(es, query={"_source": False}, index=idxname, scroll="10s")
for res in scroll:
    print(res["_id"])
First you can issue a request to get the full count of records in the index.
curl -X GET 'http://localhost:9200/documents/document/_count?pretty=true'
{
  "count" : 1408,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}
Then you'll want to loop through the set using a combination of the size and from parameters until you reach the total count. Passing an empty fields parameter will return only the index and _id that you're interested in.
Find a good page size that you can consume without running out of memory, and increment from on each iteration.
curl -X GET 'http://localhost:9200/documents/document/_search?fields=&size=1000&from=5000'
Example item response:
{
  "_index" : "documents",
  "_type" : "document",
  "_id" : "1341",
  "_score" : 1.0
},
...
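On newer Elasticsearch versions, where the fields= URL parameter no longer exists, the same page request can be sketched as a search body with _source disabled (index name as above); note that from/size paging is capped by index.max_result_window (10,000 by default), so very deep pages still need scroll instead:
GET documents/_search
{
  "_source": false,
  "size": 1000,
  "from": 5000
}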
