MongoDB complex find - ruby

I need to grab the top 3 results for each of 8 users. Currently I am looping through the users and making 8 calls to the db. Is there a way to structure the query to pull the same 8x3 dataset in a single db pull?
selected_users = users.sample(8)

# one query per user: 8 round trips to the db
selected_users.each do |user|
  cursor = status_store.find({'user' => user}, {:fields => params}).sort('score', -1).limit(3)
  # *do something*
end
The collection I am pulling from looks like the example below. Each user can have an unbounded number of tweets, so I have not embedded them within a user document.
{
    "_id" : ObjectId("51e92cc8e1ce7219e40003eb"),
    "id_str" : "57915476419948544",
    "score" : 904,
    "text" : "Yesterday we had a bald eagle on the show. Oddly enough, he was in the country illegally.",
    "timestamp" : "19/07/2013 08:10",
    "user" : {
        "id_str" : "115485051",
        "name" : "Conan O'Brien",
        "screen_name" : "ConanOBrien",
        "description" : "The voice of the people. Sorry, people."
    }
}
Thanks in advance.

Yes, you can do this using the aggregation framework.
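As a minimal sketch in the mongo shell - assuming the collection is called status_store as in the Ruby snippet, that the 8 sampled users can be matched by their user.id_str values (the selectedUserIds array here is hypothetical), and a MongoDB 3.2+ server for $slice inside $project:

var selectedUserIds = ["115485051"]; // hypothetical: the id_str values of the 8 sampled users

db.status_store.aggregate([
    // keep only tweets belonging to the sampled users
    { $match: { "user.id_str": { $in: selectedUserIds } } },
    // highest scores first
    { $sort: { score: -1 } },
    // collect each user's tweets in score order
    { $group: {
        _id: "$user.id_str",
        tweets: { $push: { id_str: "$id_str", score: "$score", text: "$text" } }
    } },
    // keep only the first (i.e. top) 3 per user
    { $project: { tweets: { $slice: ["$tweets", 3] } } }
])

On servers older than 3.2 the $slice stage is not available and you would need a different trimming step, but either way the whole 8x3 dataset comes back in one round trip.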
Another way would be to keep track of the top 3 scores directly in each user document. Whether that is faster depends on how often you write new scores versus how often you read the top scores per user.

Related

Elasticsearch: count rows in a table

I have a big table (15000 x 2000 entries). In this table, I need to count rows with certain properties, like "all rows that have a 1 or 2 in column 5 and a 0 in column 6". I will call this type of operation a count operation. For my use case, the count operations need to be very fast, as I am executing several hundred of them.
I tried to do this with Elasticsearch, but the performance seems very bad (around 10 seconds for 180 count operations). I am wondering whether I am building my queries the wrong way, or whether Elasticsearch is simply the wrong technology for this.
My queries all have the same form. I build them in Java, so it is hard to post them here exactly as they are sent, but I will do my best to explain.
I build each single count operation as a BoolQuery. For the example above it would be a query similar to this (don't blame me if it's wrong, I cannot copy the exact query, as it is built in Java):
"query": {
"bool" : {
"must" : [
"should" : [
{ "column 5" : "1" },
{ "column 5" : "2" }
],
"should" : [
{ "column 6" : "0" }
],
"minimum_should_match" : 1
],
"boost" : 1.0
}
}
The many bool queries of this form are then grouped into a MultiSearchRequest. I use the option "fetchSource = false" to prevent Elasticsearch from loading the documents themselves.
Please tell me if you need any further information, or if it is unclear what I am trying to do!
I just fixed the problem myself. For anyone with a similar question, here is how:
I changed the SearchSourceBuilder so that it now uses a ValueCountAggregator. This counts the values and allows me to set SearchSourceBuilder.size() to 0. That way I get rid of the hits themselves and retrieve only the aggregation values.
Requests that took 4 seconds before are now executed in less than 100 ms.
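For reference, the REST form of that setup would look roughly like this - the index name my_table is an assumption, and value_count is the aggregation the Java builder produces; with "size" : 0 only the aggregation result comes back, not the hits:

GET /my_table/_search
{
    "size" : 0,
    "query" : {
        "bool" : {
            "must" : [
                {
                    "bool" : {
                        "should" : [
                            { "term" : { "column 5" : "1" } },
                            { "term" : { "column 5" : "2" } }
                        ],
                        "minimum_should_match" : 1
                    }
                },
                {
                    "bool" : {
                        "should" : [
                            { "term" : { "column 6" : "0" } }
                        ],
                        "minimum_should_match" : 1
                    }
                }
            ]
        }
    },
    "aggs" : {
        "row_count" : { "value_count" : { "field" : "column 5" } }
    }
}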

Handle Frequently updated geo data in Elastic Search

I have location (geo point) data updated every 5 minutes for millions of users. We have to search for users with specific attributes (age/interests/languages) within a particular geo range. I want to understand the right strategy to store such data in Elasticsearch.
Option 1
Create a user document with the following keys:
user metadata & attributes (age, interests, languages, salary etc. - around 8-10 searchable attributes)
live location (changing every few minutes):
"liveLocation" : {
"type" : "Point",
"coordinates" : [-72.333077, 30.856567]
}
location data - multiple addresses (home address, work address etc.) along with geo points (not updated frequently):
"addresses" : [
    {
        "type" : "home",
        "address" : "first floor, xyz, near landmark",
        "city" : "Newyork",
        "country" : "Country",
        "zipcode" : "US1029",
        "location" : {
            "type" : "Point",
            "coordinates" : [-73.856077, 40.848447]
        }
    },
    {
        ... more address types
    }
]
We want to perform geo search queries over all the geo-type fields. My worry: the live location will be updated quite frequently. Is this a viable approach?
Option 2
Treat every location update as time series data and insert a new document. This avoids updating documents; instead a new document is inserted for each user every few minutes.
Problems -
When searching for all the users (home/office/live location) in a particular geo range, I have to consider only the last updated document for each user. How do I do that in Elasticsearch? (One option is sketched below.)
We have to search users with specific attributes (age/interests/languages) within a particular geo range. If option 2 is preferable, should user attribute metadata and location updates be treated as a parent-child relationship, or is some other approach better?
In conclusion: what is the right approach?
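One common way to get "only the latest document per user" under option 2 is field collapsing (Elasticsearch 5.3+): sort the matching documents by update time and collapse on the user id. A minimal sketch, where the index name user_locations and the userId and updatedAt fields are assumptions:

GET /user_locations/_search
{
    "query" : {
        "bool" : {
            "filter" : [
                { "term" : { "languages" : "en" } },
                { "geo_distance" : { "distance" : "5km", "liveLocation" : [-72.333077, 30.856567] } }
            ]
        }
    },
    "collapse" : { "field" : "userId" },
    "sort" : [ { "updatedAt" : "desc" } ]
}

Each user then contributes only their most recent matching document. One caveat: the filter runs before collapsing, so if a user's newest document falls outside the geo range, an older in-range document of theirs can still surface.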

elasticsearch - query between document types

I have a production_order document_type
i.e.
{
    part_number: "abc123",
    start_date: "2018-01-20"
},
{
    part_number: "1234",
    start_date: "2018-04-16"
}
I want to create a commodity document type
i.e.
{
    part_number: "abc123",
    commodity: "1 meter machining"
},
{
    part_number: "1234",
    commodity: "small flat & form"
}
Production orders are data-warehoused every week and are immutable.
Commodities, on the other hand, can change over time, e.g. abc123 could change from 1 meter machining to 5 meter machining, so I don't want to store this data with the production_order records.
If a user searches for "small flat & form" in the commodity document type, I want to pull all matching records from the production_order document type, the match being between part number.
Obviously I can do this in a relational database with a join. Is it possible to do the same in elasticsearch?
If it helps, we have about 500k part numbers that will be commoditized and our production order data warehouse currently holds 20 million records.
I have found that you can indeed now query between indexes in Elasticsearch; however, you have to ensure your data is stored correctly. Here is an example from the 6.x Elasticsearch docs:
Terms lookup twitter example: at first we index the information for user with id 2, specifically its followers, then index a tweet from user with id 1. Finally we search on all the tweets that match the followers of user 2.
PUT /users/user/2
{
    "followers" : ["1", "3"]
}

PUT /tweets/tweet/1
{
    "user" : "1"
}

GET /tweets/_search
{
    "query" : {
        "terms" : {
            "user" : {
                "index" : "users",
                "type" : "user",
                "id" : "2",
                "path" : "followers"
            }
        }
    }
}
Here is the link to the original page
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-terms-query.html
In my case above I need to set up my storage so that commodity is a field whose values are an array of part numbers, i.e.
{
    "1 meter machining" : ["abc123", "1234"]
}
I can then look up the 1 meter machining part numbers against my production_order documents.
I have tested and it works.
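Adapted to that layout, the lookup would look roughly like this - the index names commodities and production_orders and the document id are assumptions, and with a document keyed as above, the terms-lookup path is the commodity name itself:

PUT /commodities/commodity/1
{
    "1 meter machining" : ["abc123", "1234"]
}

GET /production_orders/_search
{
    "query" : {
        "terms" : {
            "part_number" : {
                "index" : "commodities",
                "type" : "commodity",
                "id" : "1",
                "path" : "1 meter machining"
            }
        }
    }
}

A fixed part_numbers field with the commodity name as the document _id would be a tidier variant, since a mapping field per commodity name grows the mapping with every new commodity.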
There are no joins supported in Elasticsearch.
You can query twice: first get all the part numbers matching "small flat & form", then use those part numbers to query the other index.
Otherwise, try to find a way to merge these into a single index; that would be better, and combining the two would not make updating the commodities a problem.
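A sketch of the two-query variant, with index and field names assumed; the client copies the part numbers from the first response into the second request:

GET /commodities/_search
{
    "query" : { "match" : { "commodity" : "small flat & form" } },
    "_source" : ["part_number"]
}

GET /production_orders/_search
{
    "query" : { "terms" : { "part_number" : ["1234"] } }
}

Here ["1234"] stands in for whatever part numbers the first query returned.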

How to retrieve all the document ids from an elasticsearch index

How can I retrieve all the document ids (the internal '_id' field) from an Elasticsearch index? If I have 20 million documents in that index, what is the best way to do that?
I would just export the entire index and read it off the file system. My experience with size/from and scan/scroll has been disastrous when querying result sets in the millions - it just takes too long.
If you can use a tool like knapsack, you can export the index to the file system and iterate through the directories. Each document is stored under its own directory named after its _id, so there is no need to actually open files; just iterate through the directories.
link to knapsack:
https://github.com/jprante/elasticsearch-knapsack
edit: hopefully you are not doing this often... or this may not be a viable solution
For that number of documents, you probably want to use the scan and scroll API.
Many client libraries have ready-made helpers for the interface. For example, with elasticsearch-py you can do:
import elasticsearch
import elasticsearch.helpers

es = elasticsearch.Elasticsearch(eshost)
# request only the _id of each hit ("fields" is the pre-2.x syntax)
scroll = elasticsearch.helpers.scan(es, query={"fields": ["_id"]}, index=idxname, scroll='10s')
for res in scroll:
    print(res['_id'])
First you can issue a request to get the full count of records in the index.
curl -X GET 'http://localhost:9200/documents/document/_count?pretty=true'
{
    "count" : 1408,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    }
}
Then you'll want to loop through the set using a combination of the size and from parameters until you reach the total count. Passing an empty fields parameter returns only the index metadata and the _id, which is what you're interested in.
Find a page size that you can consume without running out of memory, and increment from on each iteration:
curl -X GET 'http://localhost:9200/documents/document/_search?fields=&size=1000&from=5000'
Example item response:
{
    "_index" : "documents",
    "_type" : "document",
    "_id" : "1341",
    "_score" : 1.0
},
...
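Putting the two requests together, a hypothetical shell loop over the pages could look like this (the total comes from the _count call above; 1000 is an arbitrary page size):

total=1408
size=1000
for (( from=0; from<total; from+=size )); do
    # each page returns up to $size ids starting at offset $from
    curl -X GET "http://localhost:9200/documents/document/_search?fields=&size=${size}&from=${from}"
done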

Performance tuning MongoDB query/update?

So I have a MongoDB instance where I am trying to update data in one collection with data from another collection. The two collections are participants, with about 180k documents, and questions, with about 95k documents.
Documents in participants typically look something like this:
{
    "_id" : ObjectId("52f90b8bbab16dd8594b82b4"),
    "answers" : [
        {
            "_id" : ObjectId("52f90b8bbab16dd8594b82b9"),
            "question_id" : 2081,
            "sub_id" : null,
            "values" : [
                "Yes"
            ]
        },
        {
            "_id" : ObjectId("52f90b8bbab16dd8594b82b8"),
            "question_id" : 2082,
            "sub_id" : 123,
            "values" : [
                "Would prefer to go alone"
            ]
        },
        {
            "_id" : ObjectId("52f90b8bbab16dd8594b82b7"),
            "question_id" : 2082,
            "sub_id" : 456,
            "values" : [
                "Yes"
            ]
        }
    ],
    "created" : ISODate("2012-03-01T17:40:21Z"),
    "email" : "anonymous",
    "id" : 65,
    "survey" : ObjectId("52f41d579af1ff4221399a7b"),
    "survey_id" : 374
}
I am using the script below to perform the update:
// index the array fields the update query matches on
db.participants.ensureIndex({"answers.question_id": 1, "answers.sub_id": 1});
print("created index for answer arrays!")
db.questions.find().forEach(function(doc){
    db.participants.update(
        // match the embedded answer for this question
        {
            "answers.question_id": doc.id,
            "answers.sub_id": doc.sub_id
        },
        // write the question's _id into the matched array element
        {
            $set: { "answers.$.question": doc._id }
        },
        false,   // upsert
        true     // multi
    );
});
db.participants.dropIndex({"answers.question_id": 1, "answers.sub_id": 1});
But this takes about 20 minutes to run. I was hoping that adding the index would help with the performance, but it is still pretty slow. Is this index set up correctly, considering that I am indexing fields in an array of objects? Can anyone see anything I am doing that would cause the slowness? Any suggestions on where to start looking to improve the performance of this query?
I think you need to consider what you are actually doing here in order to understand why the index is not helping and indeed why this operation takes so long.
The first part of the answer is explained by what you are doing here:
db.questions.find()
Now that part alone basically says that you are asking to retrieve every document in your questions collection. So we can see that what you are trying to do is exactly that, as you want to update that content into your participants collection, particularly the document _id for the "question". But here, by definition of getting all documents, no index will be used.
So what you are doing is looping over every document in questions, then asking your update operation to match the participants record with data from the "question". That means you are pulling all of your 95K documents "over the wire" and sending your update operation back "over the wire", 95K times. None of this happens purely on the server; there is network traffic between your application and your MongoDB for every one of them.
The index itself is not going to do much other than speed up the search for each participants record, which is better than scanning, and you should be getting the match. But that is not the part taking the time; it is the fetching and round-tripping of the questions that is the largest issue.
So if it is possible to run your update process on a machine that is as close as possible, in networking terms, to the MongoDB server, then that is going to be your best performance improvement. You could also wind back your Write Concern if you want to be a little daring and/or can live with checking the integrity in another operation; putting the updates in "fire and forget" mode reduces the network traffic and the waiting for a response to each update (which is otherwise happening).
Also see the guide if you are not sure of the concepts:
http://docs.mongodb.org/manual/core/write-concern/
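If your server and shell are 2.6 or newer, here is a hedged sketch of the same update batched through the Bulk API, so that many update commands share one round trip (the batch size of 1000 is an arbitrary choice):

var bulk = db.participants.initializeUnorderedBulkOp();
var count = 0;
db.questions.find().forEach(function(doc){
    bulk.find({
        "answers.question_id": doc.id,
        "answers.sub_id": doc.sub_id
    }).update({ $set: { "answers.$.question": doc._id } });
    count++;
    if (count % 1000 === 0) {
        bulk.execute();   // send 1000 updates in a single round trip
        bulk = db.participants.initializeUnorderedBulkOp();
    }
});
if (count % 1000 !== 0) bulk.execute();   // flush the remainder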
In case anyone is interested, I was able to take the run time of this update query from 20 minutes down to about a minute and a half by using projection when selecting the questions documents. Since I am only using the _id, id and sub_id fields, I was able to do the following:
db.questions.find({},{_id: 1, id: 1, sub_id: 1}).forEach(function(doc){
....
This drastically improved performance. Hope this helps someone!
