ElasticSearch, Multiple Indices search - elasticsearch

Can anyone help me with the use case below?
I have 3 schools, and each school has multiple students. I want to search student names, and each search result should tell me which school the matched student belongs to. Here is the solution I have in mind, and the problem I see with it:
I would create one index per school and use multi match to match the entered text against all the indexes. The problem is that I need to know which index each matched result belongs to. Is there a better solution for this use case, or how can I solve the problem above? Thank you all.
BR

When you run a search, you get back a response that contains this:
"hits" : [
{
"_index" : ".async-search",
"_type" : "_doc",
"_id" : "CdT9fKXfQpOEIPuZazz0BA",
"_score" : 1.0,
"_source" : {
So you can use the _index value of each hit to determine which index (and therefore which school) it came from.
ps - it's Elasticsearch, not ElasticSearch :)
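To make the idea concrete, here is a minimal sketch of grouping hits by their _index value on the client side. It assumes one index per school with hypothetical names (school_a, school_c) and a hypothetical name field; the sample response is hand-written to mirror the shape shown above, not real output. A single search request can target several indices by listing them comma-separated in the path (for example school_a,school_b,school_c/_search).

```python
from collections import defaultdict


def group_names_by_index(response, field="name"):
    """Map each index name to the values of `field` found in its hits."""
    grouped = defaultdict(list)
    for hit in response["hits"]["hits"]:
        # _index tells us which index (here: which school) the hit came from.
        grouped[hit["_index"]].append(hit["_source"][field])
    return dict(grouped)


# Hand-written sample shaped like a search response (index and field names are assumptions).
sample_response = {
    "hits": {
        "hits": [
            {"_index": "school_a", "_id": "1", "_score": 1.0,
             "_source": {"name": "Alice Smith"}},
            {"_index": "school_c", "_id": "7", "_score": 0.8,
             "_source": {"name": "Alice Jones"}},
        ]
    }
}

print(group_names_by_index(sample_response))
```

An alternative design is a single students index with a school field on each document; the per-hit _index approach above just avoids changing the mapping.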

Related

improving performance of search query using index field when working with alias

I am using an alias name when writing data with the Bulk API.
I have 2 questions:
Can I get the index name after writing data through the alias, maybe as part of the response?
Can I improve performance by sending search queries to specific indexes instead of searching all indexes behind the same alias?
If you're using an alias name for writes, that alias must resolve to a single index (either it points to only one index, or one of its indices is flagged as the write index), and you will get that index name back in the bulk response.
For instance, if test_alias is an alias to the test index, then when sending this bulk command:
POST test_alias/_doc/_bulk
{"index":{}}
{"foo": "bar"}
You will receive this response:
{
"index" : {
"_index" : "test", <---- here is the real index name
"_type" : "_doc",
"_id" : "WtcviYABdf6lG9Jldg0d",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1,
"status" : 201
}
}
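As a rough sketch of reading that value programmatically: a full bulk response wraps entries like the one above in an items list, one entry per action. The helper name below is hypothetical and the sample response is hand-written and trimmed to the relevant fields.

```python
def indices_written(bulk_response):
    """Return the concrete index name reported for each bulk item, in order."""
    names = []
    for item in bulk_response["items"]:
        # Each item has a single key naming the action ("index", "create", ...).
        for result in item.values():
            names.append(result["_index"])
    return names


# Hand-written sample, trimmed to the fields used here.
sample_bulk_response = {
    "took": 5,
    "errors": False,
    "items": [
        {"index": {"_index": "test", "_id": "WtcviYABdf6lG9Jldg0d", "status": 201}}
    ],
}

print(indices_written(sample_bulk_response))
```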
Common sense has it that searching a single index is generally faster than searching an alias that spans several indexes, but if the alias spans only a single index, there is no difference.
You can provide multiple index names when searching. If you search through an alias that has multiple indices, all of them are searched by default, but you can also restrict the search to a subset of the data based on fields in the underlying indices.
You can read the "Filter-based aliases to limit access to data" section in this blog on how to achieve it. Because the search touches fewer indices and less data, performance should be better.
Also, an alias can have only a single writable index. You can get its name from the _cat/aliases?v API response, which shows which index is the write index for the alias (the is_write_index column); you can see the sample output here.
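For illustration only, a filtered alias is defined by sending an actions list to the _aliases endpoint. The sketch below just builds that request body as a Python dict; the alias name, the tenant_id field and its value are hypothetical, and the helper is not part of any client library.

```python
import json


def filtered_alias_actions(index, alias, field, value):
    """Build an _aliases request body that adds a term-filtered alias."""
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": alias,
                    # Searches through the alias only see documents matching this filter.
                    "filter": {"term": {field: value}},
                }
            }
        ]
    }


# Hypothetical names; the body would be sent with POST _aliases.
body = filtered_alias_actions("test", "test_alias_tenant_1", "tenant_id", "1")
print(json.dumps(body, indent=2))
```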

Finding most commonly coincident values within an array with Elasticsearch

I would like to display the most commonly ordered pairs of products for a set of orders placed. An abbreviated version of my search index would look something like this:
{
"_type" : "order",
"_id" : "10",
"_score" : 1.0,
"_source" : {
...
"product_ids" : [
1, 2
]
...
}
},
{
"_type" : "order",
"_id" : "11",
"_score" : 1.0,
"_source" : {
...
"product_ids" : [
1, 2, 3
]
...
}
}
Given that my search index contains a set of orders, each with a product_ids field that contains an array of the product ids that are in the order, is it possible to put together an Elasticsearch aggregation that will return either:
The most common pairs of product ids (which may be members of an arbitrarily long list of product ids) that occur the most frequently together in orders.
The most common sets of product ids of an arbitrary length that occur most frequently together in orders.
I've been reading the documentation, and I'm not sure if an adjacency matrix might be appropriate for this problem. My current hunch is to write a scripted cardinality query that orders and joins the product_ids in the search document to get results in line with #2, since #1 seems like it might involve too many permutations of product ids to be reasonably efficient.
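To pin down what the two desired results mean, here is a small client-side sketch in plain Python. It is not an Elasticsearch aggregation and would not scale to a large index; it only counts over product_ids lists that have already been fetched. The function names are made up for this example, and the second function reads #2 as "identical full sets of product ids", in line with the ordered-and-joined key described above.

```python
from collections import Counter
from itertools import combinations


def count_pairs(orders):
    """Count how often each unordered pair of product ids shares an order (#1)."""
    pairs = Counter()
    for product_ids in orders:
        # Sort and de-duplicate so (1, 2) and (2, 1) count as the same pair.
        unique_sorted = sorted(set(product_ids))
        pairs.update(combinations(unique_sorted, 2))
    return pairs


def count_exact_sets(orders):
    """Count how often each complete set of product ids occurs as an order (#2)."""
    return Counter(tuple(sorted(set(ids))) for ids in orders)


# The two orders from the abbreviated index above.
orders = [[1, 2], [1, 2, 3]]
pair_counts = count_pairs(orders)
set_counts = count_exact_sets(orders)
print(pair_counts.most_common(1))
```

An order with n distinct products contributes n(n-1)/2 pairs, which is the growth the question worries about for #1.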

How to retrieve all the document ids from an elasticsearch index

How to retrieve all the document ids (the internal document '_id') from an Elasticsearch index? If I have 20 million documents in that index, what is the best way to do that?
I would just export the entire index and read off the file system. My experience with size/from and scan/scroll has been a disaster when querying result sets in the millions. It just takes too long.
If you can use a tool like knapsack, you can export the index to the file system and iterate through the directories. Each document is stored under its own directory named after its _id. No need to actually open files; just iterate through the directories.
link to knapsack:
https://github.com/jprante/elasticsearch-knapsack
edit: hopefully you are not doing this often... or this may not be a viable solution
For that amount of documents, you probably want to use the scan and scroll API.
Many client libraries have ready helpers to use the interface. For example, with elasticsearch-py you can do:
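import elasticsearch
import elasticsearch.helpers

es = elasticsearch.Elasticsearch(eshost)
# "_source": False skips the document bodies; each hit still carries its _id
scroll = elasticsearch.helpers.scan(es, query={"_source": False}, index=idxname, scroll='10s')
for res in scroll:
    print(res['_id'])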
First you can issue a request to get the full count of records in the index.
curl -X GET 'http://localhost:9200/documents/document/_count?pretty=true'
{
"count" : 1408,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
Then you'll want to loop through the set using a combination of the size and from parameters until you reach the total count. Passing an empty fields parameter will return only the _index and _id that you're interested in.
Find a good page size that you can consume without running out of memory, and increase from by that size on each iteration.
curl -X GET 'http://localhost:9200/documents/document/_search?fields=&size=1000&from=5000'
Example item response:
{
"_index" : "documents",
"_type" : "document",
"_id" : "1341",
"_score" : 1.0
},
...
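The paging loop described in this answer can be sketched as follows. The code only computes the from offsets and the matching request URLs, reusing the index/type path and the count from the example; it does not send any requests, and both helper names are made up for illustration.

```python
def page_offsets(total, size):
    """Yield the `from` offset of each page needed to cover `total` hits."""
    if size <= 0:
        raise ValueError("size must be positive")
    for offset in range(0, total, size):
        yield offset


def search_urls(base, total, size):
    """Build one search URL per page, using the query string from the answer."""
    return [
        f"{base}/_search?fields=&size={size}&from={offset}"
        for offset in page_offsets(total, size)
    ]


# 1408 is the count returned by the _count request above.
urls = search_urls("http://localhost:9200/documents/document", 1408, 1000)
for url in urls:
    print(url)
```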

MongoDB complex find

I need to grab the top 3 results for each of 8 users. Currently I am looping through the users and making 8 calls to the db. Is there a way to structure the query to pull the same 8x3 dataset in a single db call?
selected_users = users.sample(8)
selected_users.each do |user|
  # one query per user: the 3 highest-scoring documents
  cursor = status_store.find({'user' => user}, {:fields => params}).sort('score', -1).limit(3)
  # *do something* with cursor
end
The collection I am pulling from looks like the example below. Each user can have an unbounded number of tweets, so I have not embedded them within a user document.
{
"_id" : ObjectId("51e92cc8e1ce7219e40003eb"),
"id_str" : "57915476419948544",
"score" : 904,
"text" : "Yesterday we had a bald eagle on the show. Oddly enough, he was in the country illegally.",
"timestamp" : "19/07/2013 08:10",
"user" : {
"id_str" : "115485051",
"name" : "Conan O'Brien",
"screen_name" : "ConanOBrien",
"description" : "The voice of the people. Sorry, people."
}
}
Thanks in advance.
Yes, you can do this using the aggregation framework.
Another way would be to keep track of the top 3 scores in the user documents. Whether this is faster depends on how often you write scores versus how often you read the top scores per user.
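One possible shape for such a pipeline, written as a Python list of stages, is sketched below. This is an assumption-laden example rather than the answerer's own query: it matches users by user.id_str (taken from the sample document), relies on the $slice aggregation operator (MongoDB 3.2 or later), and depends on $push keeping the order produced by the preceding $sort. The helper name is made up.

```python
def top_n_per_user_pipeline(user_ids, n=3):
    """Build an aggregation pipeline returning the n highest-scoring docs per user."""
    return [
        # Keep only documents belonging to the selected users.
        {"$match": {"user.id_str": {"$in": list(user_ids)}}},
        # Highest scores first, so each group's array starts with the best ones.
        {"$sort": {"score": -1}},
        # Collect every document of a user into one array.
        {"$group": {"_id": "$user.id_str", "tweets": {"$push": "$$ROOT"}}},
        # Trim each array to its first n elements.
        {"$project": {"tweets": {"$slice": ["$tweets", n]}}},
    ]


pipeline = top_n_per_user_pipeline(["115485051"])
print(pipeline)
```

With a driver such as pymongo, the list would be passed to the collection's aggregate method.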

how to include doc urls in result set

In Elasticsearch, I am wondering how I can get document URLs back in the search result set. Here is an example of what I mean.
Let's say I index a doc using the following curl command:
curl -XPUT 'http://localhost:9200/ads/offers/1234' -d '{
"name": "blah blah",
"Weight":0.0001,
...
}'
Then I run a search and I want to get the document URL itself in the result set. In the above case, the document URL is the following:
http://localhost:9200/ads/offers/1234.
How can I do that? Is there a special field name for this, or do I have to create some kind of field to store it explicitly?
The Elasticsearch search response contains all the pieces needed to build this URL on the client. The record for the URL in your example will look like this:
"hits" : [ {
"_index" : "ads",
"_type" : "offers",
"_id" : "1234",
...
If you really need to get this URL from Elasticsearch, you can use a script field to combine these pieces into a single field on the server side, although I cannot think of a legitimate scenario where that would be needed.
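Building the URL on the client is a one-liner once you have a hit. A minimal sketch, assuming the base URL is known to the client (it is not part of the search response) and using a hypothetical helper name:

```python
def document_url(base_url, hit):
    """Join the base URL with the hit's _index, _type and _id."""
    return "/".join([base_url.rstrip("/"), hit["_index"], hit["_type"], hit["_id"]])


# The hit from the example above, trimmed to the fields that matter here.
hit = {"_index": "ads", "_type": "offers", "_id": "1234"}
print(document_url("http://localhost:9200", hit))  # http://localhost:9200/ads/offers/1234
```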

Resources