Capped collection in MongoDB - performance issues

I did an analysis of capped collections and found that there is no performance improvement with a capped collection.
I created a collection named test1 with 20,000 documents, then used copyTo to copy the same data into test2, which was created as capped: true with a data size specified. I ran the following queries to examine the performance:
db.test1.find( { $query: { "group" : "amazonTigers"}, $explain: 1 } ).pretty()
db.test2.find( { $query: { "group" : "amazonTigers"}, $explain: 1 } ).pretty()
Both result in the same response time of about 124 ms...
Moreover, I don't understand how capped collections work.
I have read through lots of blogs, but I am not able to find a clear explanation of the working principle of MongoDB capped collections.
I have also read about the disadvantages of capped collections; it is said that we are not able to use $set and $push on them. Are there any other disadvantages of capped collections for the stored entries?
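For reference, here is a minimal mongo shell sketch of the kind of setup described above (the 10 MB cap is illustrative, not the exact size used):

db.createCollection("test2", { capped: true, size: 10 * 1024 * 1024 })  // fixed-size collection that preserves insertion order
db.test1.copyTo("test2")    // copy the 20,000 documents into the capped collection
db.test2.isCapped()         // should return true
db.test2.stats()            // shows capped: true plus the maximum and current size
db.test2.find({ group: "amazonTigers" }).explain()   // compare against the same query on test1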
Regards,
Harry

Related

Elasticsearch 2.1: Result window is too large (index.max_result_window)

We retrieve information from Elasticsearch 2.1 and allow the user to page thru the results. When the user requests a high page number we get the following error message:
Result window is too large, from + size must be less than or equal
to: [10000] but was [10020]. See the scroll api for a more efficient
way to request large data sets. This limit can be set by changing the
[index.max_result_window] index level parameter
The Elasticsearch documentation says that this is because of high memory consumption and that we should use the scrolling API:
Values higher than that can consume significant chunks of heap memory per
search and per shard executing the search. It's safest to leave this
value as it is and use the scroll api for any deep scrolling https://www.elastic.co/guide/en/elasticsearch/reference/2.x/breaking_21_search_changes.html#_from_size_limits
The thing is that I do not want to retrieve large data sets. I only want to retrieve a slice from the data set which is very high up in the result set. Also the scrolling docu says:
Scrolling is not intended for real time user requests https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-request-scroll.html
This leaves me with some questions:
1) Would the memory consumption really be lower (and if so, why) if I use the scrolling api to scroll up to result 10020 (and disregard everything below 10000) instead of doing a "normal" search request for results 10000-10020?
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
3) Are there any other options to solve my problem?
If you need deep pagination, one possible solution is to increase the value of max_result_window. You can use curl to do this from your shell command line:
curl -XPUT "http://localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d '{ "index" : { "max_result_window" : 500000 } }'
I did not notice increased memory usage, for values of ~ 100k.
The right solution would be to use scrolling.
However, if you want to extend the search results beyond 10,000 results, you can do it easily with Kibana:
Go to Dev Tools and just post the following to your index (your_index_name), specifying what the new max result window should be:
PUT your_index_name/_settings
{
"max_result_window" : 500000
}
If all goes well, you should see the following success response:
{
"acknowledged": true
}
The following pages in the elastic documentation talk about deep paging:
https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_fetch_phase.html
Depending on the size of your documents, the number of shards, and the
hardware you are using, paging 10,000 to 50,000 results (1,000 to
5,000 pages) deep should be perfectly doable. But with big-enough from
values, the sorting process can become very heavy indeed, using vast
amounts of CPU, memory, and bandwidth. For this reason, we strongly
advise against deep paging.
Use the Scroll API to get more than 10000 results.
Scroll example in ElasticSearch NEST API
I have used it like this:
private static Customer[] GetCustomers(IElasticClient elasticClient)
{
    var customers = new List<Customer>();
    var searchResult = elasticClient.Search<Customer>(s => s
        .Index(IndexAlias.ForCustomers())
        .Size(10000)
        .SearchType(SearchType.Scan)
        .Scroll("1m"));

    do
    {
        var result = searchResult;
        searchResult = elasticClient.Scroll<Customer>("1m", result.ScrollId);
        customers.AddRange(searchResult.Documents);
    } while (searchResult.IsValid && searchResult.Documents.Any());

    return customers.ToArray();
}
If you want more than 10,000 results, memory usage on all the data nodes will be very high, because each node has to return more results for every query request. If you have more data and more shards, merging those results will also be inefficient. Elasticsearch caches the filter context as well, which costs even more memory. You have to find out by trial and error how much you can actually afford. If you receive many requests in a small window, you should issue multiple queries for more than 10k results and merge them yourself in the application code, which is supposed to take less application memory than increasing the window size.
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
--> You can define this value in index templates. An Elasticsearch template applies to new indexes only, so you either have to delete the old indexes after creating the template or wait for new data to be ingested into Elasticsearch.
{
  "order": 1,
  "template": "index_template*",
  "settings": {
    "index.number_of_replicas": "0",
    "index.number_of_shards": "1",
    "index.max_result_window": 2147483647
  }
}
In my case it looks like reducing the results via the from & size parameters of the query removes the error, as we don't need all the results:
GET widgets_development/_search
{
  "from": 0,
  "size": 5,
  "query": {
    "bool": {}
  },
  "sort": {
    "col_one": "asc"
  }
}

Rethinkdb multiple level grouping

Let's say I have a table with documents like:
{
  "country": 1,
  "merchant": 2,
  "product": 123,
  ...
}
Is it possible to group all the documents into a final json structure like:
[
  {
    <country_id>: {
      <merchant_id>: {
        <product_id>: <# docs with this product_id/merchant_id/country_id>,
        ... (other product_id and so on)
      },
      ... (other merchant_id and so on)
    },
    ... (other country_id and so on)
  }
]
And if yes, what would be the best and most efficient way?
I have more than a million of these documents, on 4 shards with powerful servers (22 Gb cache each)
I have tried this (in the data explorer, in JS, for the moment):
r.db('foo')
.table('bar')
.indexCreate('test1', function(d){
return [d('country'), d('merchant'), d('product')]
})
and then
r.db('foo')
.table('bar')
.group({index: 'test1'})
But the data explorer seems to hang, still working on it as you can see...
.group({index: 'test1'}).count() will do something pretty similar to what you want, except it won't produce the nested document structure. To produce the nested document structure it would probably be easiest to ungroup, then map over the ungrouped values to produce objects of the form you want, then merge all of them.
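A rough sketch of that approach in the JavaScript data explorer might look like this (untested; it assumes the compound index 'test1' from the question and that the ids are numeric, hence the coerceTo('string') calls for the object keys):

r.db('foo')
  .table('bar')
  .group({index: 'test1'}).count()
  .ungroup()                                    // [{group: [country, merchant, product], reduction: count}, ...]
  .map(function(row) {
    return r.object(
      row('group')(0).coerceTo('string'),       // country id as object key
      r.object(
        row('group')(1).coerceTo('string'),     // merchant id
        r.object(
          row('group')(2).coerceTo('string'),   // product id
          row('reduction'))));                  // number of documents for this triple
  })
  .reduce(function(left, right) {
    return left.merge(right);                   // merge combines nested objects, so the keys end up nested
  })

The result is a single nested object rather than the array wrapper shown in the question, but the grouping structure is the same.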
The problem with group queries on the whole table though is that they won't stream, you'll need to traverse the whole table to get the end result back. The data explorer is meant for small queries, and I think it times out if your query takes more than 5 minutes to return, so if you're traversing a giant table then it would probably be better to run that query from one of the clients.

For 1 billion documents, populate data from one field to other fields in the same collection using MongoDB

I need to populate data from one field to multiple fields on the same collection. For example:
Currently I have documents like the one below:
{ _id: 1, temp_data: {temp1: [1,2,3], temp2: "foo bar"} }
I want to populate that into two different fields in the same collection, like below:
{ _id: 1, temp1: [1,2,3], temp2: "foo bar" }
I have one billion documents to migrate. Please suggest an efficient way to update all one billion documents.
In your favorite language, write a tool that runs through all the documents, migrates them, and stores them in a new database.
Some hints (a rough sketch putting them together follows the list):
When iterating the results, make sure they are sorted (e.g. on the _id) so you can implement resume should your migration code crash at 90%...
Do batch inserts: read, say, 1000 items, migrate them, then write 1000 items in a single batch to the new database. Reads are automatically batched.
Create indexes after the migration, not before. That will be faster and lead to less fragmentation
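Putting those hints together, a rough mongo shell sketch might look like this (db.source and db.target are placeholders, the field names are the ones from the question, and insertMany needs a reasonably recent shell):

var batch = [];
var lastId = null;

db.source.find().sort({ _id: 1 }).forEach(function(doc) {
    batch.push({ _id: doc._id, temp1: doc.temp_data.temp1, temp2: doc.temp_data.temp2 });
    if (batch.length === 1000) {
        db.target.insertMany(batch);   // one round trip per 1000 migrated documents
        lastId = doc._id;              // remember the last _id written, so a crash can be resumed
        batch = [];
    }
});
if (batch.length > 0) {
    db.target.insertMany(batch);       // flush the final partial batch
}
// To resume after a crash, restart from the last _id that was written:
// db.source.find({ _id: { $gt: lastId } }).sort({ _id: 1 })...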
Here is a query for you; use the following query to migrate your data:
db.collection.find().forEach(function(myDoc) {
    db.collection_new.update(
        { _id: myDoc._id },
        {
            $unset: { 'temp_data': 1 },
            $set: {
                'temp1': myDoc.temp_data.temp1,
                'temp2': myDoc.temp_data.temp2
            }
        },
        { upsert: true }
    );
});
To learn more about the forEach cursor method, please visit the link.
You will need the $limit and $skip operators to migrate the data in batches (a sketch follows below). In the update query I have used upsert because, if a document with that _id already exists, it will be updated; otherwise a new entry will be inserted.
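A sketch of how that skip/limit batching could look in the mongo shell (the batch size of 1000 is illustrative; note that skip() still has to walk past the skipped documents, so later batches get slower):

var batchSize = 1000;
var total = db.collection.count();

for (var skipped = 0; skipped < total; skipped += batchSize) {
    db.collection.find().sort({ _id: 1 }).skip(skipped).limit(batchSize).forEach(function(myDoc) {
        db.collection_new.update(
            { _id: myDoc._id },
            { $set: { temp1: myDoc.temp_data.temp1, temp2: myDoc.temp_data.temp2 } },
            { upsert: true }
        );
    });
}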
Thanks

MongoDB on collection.insert and data-size dilemma

I'm working on a project where I have to perform mass insertion into a MongoDB database.
I understand that MongoDB is a document database and that there is a limit on the size of each document, as seen here.
Now the mass-insertion code looks like this:
RockBand.collection.insert(mass_data)
mass_data is an array of hashes like this:
[
  {
    name: "LedZepplin",
    member: 4,
    studio_album: 10,
    ...
    ...
    ...
  },
  {
    name: "Linkin Park",
    member: 5,
    studio_album: 7,
    ...
    ...
    ...
  },
  {
    ...
    ...
  },
  ...
  ...
]
The total length of the array is somewhere in the range of 100K-500K entries, and I know for sure that none of the hashes in the array (each of which is basically a document in MongoDB) is 16 MB in size.
So whenever I perform
RockBand.collection.insert(mass_data)
why does it keep giving me the 16 MB limit error stated above? I'm quite sure that none of the documents (i.e. hashes) present in the array weighs 16 MB individually.
So why the data-size-exceeded error for a document?
Is it considering the whole array as a single document, when it should be considering each hash in the array as an individual document?
Can anyone suggest anything?
By the way, I'm using the Mongoid driver on top of the MongoDB Ruby driver to connect to MongoDB.
When you insert an array like that, you are inserting the whole array as a single document. You have to insert each object in the array with an individual insert command.

Extremely slow performance of some MongoDB queries

I have a collection of about 30K items, all of which have a field called Program. "Program" is the first part of a compound index, so looking up items with a specific Program value is very fast. It is also fast to run range queries, e.g.:
db.MyCollection.find(
{ $and: [ { Program: { "$gte" : "K", "$lt" : "L" } },
{ Program: { "$gte" : "X", "$lt" : "Y" } } ] }).count();
The query above does not return any results because I am querying for the overlap of two non-overlapping ranges, (K-L) and (X-Y). The left range (K-L) contains about 7K items.
However, if I replace the second "and" clause with a "where" expression, the query execution takes ages:
db.MyCollection.find(
{ $and: [ { Program: { "$gte" : "K", "$lt" : "L" } }, { "$where" : "this.Program == \"Z\"" } ] }).count();
As you can see, the query above should also return an empty result set (range K-L is combined with Program == "Z"). I am aware of the slow performance of "where", but shouldn't Mongo first reduce the potential result set by evaluating the left clause (which would yield about 7K items) and only then apply the "where" check? If it does, shouldn't processing a few thousand items take seconds rather than the minutes it takes on my machine, with the Mongo service consuming about 3 GB of RAM while performing this operation? That looks too heavy for a relatively small collection.
There are a few things that you can do -
Use explain() to see what is happening on your query. explain() is described here. Use the $explain operator to return a document that describes the process and indexes used to return the query. For example -
db.collection.find(query).explain()
If that doesn't return enough information, you can look at using the Database Profiler. However, please bear in mind that this is not free and adds load itself. Within this page, you can also find some basic notes about optimising the query performance.
However, in your case, it all boils down to the $where operator:
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in).
In general, you should use $where only when you can’t express your query using another operator. If you must use $where, try to include at least one other standard query operator to filter the result set. Using $where alone requires a table scan. $where, like Map Reduce, limits your concurrency.
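For the query in the question, the $where clause can be replaced by a plain equality condition, which lets the index on Program be used (a sketch based on the query above):

db.MyCollection.find(
    { $and: [ { Program: { "$gte" : "K", "$lt" : "L" } },
              { Program: "Z" } ] }        // equality match instead of this.Program == "Z"
).count();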
As an FYI, here are a couple of things to note about the output of explain():
ntoreturn Number of objects the client requested for return from a query. For example, findOne() sets ntoreturn so that a single document is returned, while limit() sets the appropriate limit. Zero indicates no limit.
query Details of the query spec.
nscanned Number of objects scanned in executing the operation.
reslen Query result length in bytes.
nreturned Number of objects returned from query.
