MongoDB on collection.insert and data-size dilemma - ruby

I'm working on a project where I have to perform mass insertion into a MongoDB database.
I understand that MongoDB is a document database and that there is a limit on the size of each document, as seen here.
The mass-insertion code looks like this:
RockBand.collection.insert(mass_data)
mass_data is an array of hashes like this:
[
  {
    name: "LedZepplin",
    member: 4,
    studio_album: 10,
    ...
  },
  {
    name: "Linkin Park",
    member: 5,
    studio_album: 7,
    ...
  },
  {
    ...
  },
  ...
]
The total length of the array is 500K - 100K,
and I know for sure that none of the hashes in the array (each of which is basically a document in MongoDB) is 16MB in size.
So whenever I perform this:
RockBand.collection.insert(mass_data)
why does it keep giving me the 16MB limit error? As stated above, I'm quite sure that none of the documents (i.e. hashes) present in the array weighs 16MB individually.
So why the data-size-exceeded error for a document?
Is it considering the whole array as a single document, when it should be considering
each hash in the array as an individual document?
Can anyone suggest what is going on?
Btw, I'm using Mongoid on top of the MongoDB Ruby driver for connecting to MongoDB.

When you insert an array like that, you are inserting the whole array as a single document. You have to insert each object in the array with an individual insert command.
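For illustration, a minimal sketch of that per-document approach, assuming the legacy collection.insert call from the question (newer Ruby driver versions expose insert_one / insert_many instead):

# Insert each hash as its own document, so no single call carries the whole array.
mass_data.each do |doc|
  RockBand.collection.insert(doc)
end

Looping like this trades one huge payload for many small ones, so each call only has to stay under the per-document limit.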

Related

Elasticsearch - Limit of total fields [1000] in index exceeded

I saw that there are some concerns about raising the total limit on fields above 1000.
I have a situation where I am not sure how to approach it from the design point of view.
I have lots of simple key-value pairs:
key1:15, key2:45, key99999:1313123.
The key is a string and the value is an integer that I would like to sort my results on: if a document has a given key, it gets sorted by that key's value.
I ended up creating an object and just putting the key-value pairs inside so I can match them easily.
For example, I have sorting on "object.key".
I was wondering: if I just use a simple object with a bunch of strings inside that are only there for exact matching, should I worry about raising this limit to 10k or 20k?
I now have an issue where there can be more than 1k of these records. I've found I could use nested sorting, but it still has a default limit of 10k.
Is there a good design-pattern approach for this, or should I not be worried about raising the field limits?
Simplified version of the query:
GET products/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "sortingObject.someSortingKey1": {
        "order": "desc",
        "missing": 2,
        "unmapped_type": "float"
      }
    }
  ]
}
The point is that I get the sorting key from the request and use it to sort my results. There are, for example, 100k different ways to sort the results.
There were some recent improvements (in 7.16) that should help there, but 10K or 20K fields is still a lot of overhead.
I'm not sure what kind of queries you need to run on those keyX fields, but maybe the flattened data-type would work for you? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html
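For illustration only, a flattened mapping for that field might look like the following (index and field names taken from the question). Note that flattened indexes the leaf values as keywords, so numeric values would sort lexicographically rather than numerically, which is worth checking against the sorting requirement:

PUT products
{
  "mappings": {
    "properties": {
      "sortingObject": {
        "type": "flattened"
      }
    }
  }
}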

Elasticsearch performance impact on choosing mapping structure for index

I am receiving data in a format like:
{
  name: "index_name",
  status: "good",
  datapoints: [{
    paramType: "ABC",
    batch: [{
      time: "timestamp1<epoch in sec>",
      value: "123"
    }, {
      time: "timestamp2<epoch in sec>",
      value: "123"
    }]
  }, {
    paramType: "XYZ",
    batch: [{
      time: "timestamp1<epoch in sec>",
      value: "123"
    }, {
      time: "timestamp2<epoch in sec>",
      value: "124"
    }]
  }]
}
I would like to store the data in Elasticsearch in such a way that I can query based on a time range, status, or paramType.
As mentioned here, I can define datapoints or batch as a nested data type, which will allow indexing the objects inside the array.
Another way I can think of is dividing the structure into separate documents, e.g.
{
  name: "index_name",
  status: "good",
  paramType: "ABC",
  time: "timestamp<epoch in sec>",
  value: "123"
}
Which one will be the most efficient way?
If I choose the 2nd way, I know there may be ~1000 elements in the batch array and 10-15 paramType entries, which means ~15k documents will be generated and 15k*5 (= 75k) key-value pairs will be repeated in the index.
Here, the advantages and disadvantages of using nested are explained, but no performance-related stats are provided. In my case there won't be any updates to the inner objects, so I'm not sure which one will be better. Also, I have two nested objects, so I would like to know how I can query for data within a time range if I use nested.
A flat structure will perform better than nested. Nested queries are slower compared to term queries; also, while indexing, a single nested document is internally represented as a bunch of documents, just indexed in the same block.
As long as your requirements are met, the second option works better.
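As a rough illustration of the second option (field names from the flat document above; the mapping — keyword for status and paramType, a numeric or date type for time — and the epoch values are assumed), a time-range query then becomes a plain bool query with no nested wrappers:

GET index_name/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "good" } },
        { "term": { "paramType": "ABC" } },
        { "range": { "time": { "gte": 1577836800, "lte": 1577923200 } } }
      ]
    }
  }
}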

Rethinkdb multiple level grouping

Let's say I have a table with documents like:
{
  "country": 1,
  "merchant": 2,
  "product": 123,
  ...
}
Is it possible to group all the documents into a final json structure like:
[
  {
    <country_id>: {
      <merchant_id>: {
        <product_id>: <# docs with this product_id/merchant_id/country_id>,
        ... (other product_id and so on)
      },
      ... (other merchant_id and so on)
    },
    ... (other country_id and so on)
  }
]
And if yes, what would be the best and most efficient way?
I have more than a million of these documents, on 4 shards with powerful servers (22 GB of cache each).
I have tried this (in the data explorer, in JS, for the moment):
r.db('foo')
.table('bar')
.indexCreate('test1', function(d){
return [d('country'), d('merchant'), d('product')]
})
and then
r.db('foo')
.table('bar')
.group({index: 'test1'})
But the data explorer seems to hang, still working on it as you can see...
.group({index: 'test1'}).count() will do something pretty similar to what you want, except it won't produce the nested document structure. To produce the nested document structure it would probably be easiest to ungroup, then map over the ungrouped values to produce objects of the form you want, then merge all of them.
The problem with group queries on the whole table, though, is that they won't stream; you'll need to traverse the whole table to get the end result back. The data explorer is meant for small queries, and I think it times out if your query takes more than 5 minutes to return, so if you're traversing a giant table it would probably be better to run that query from one of the clients.
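A rough, untested sketch of that pipeline in the data explorer, using the test1 index from the question (the group keys are coerced to strings because object keys in ReQL must be strings):

r.db('foo')
  .table('bar')
  .group({index: 'test1'})
  .count()
  .ungroup()   // -> [{group: [country, merchant, product], reduction: count}, ...]
  .map(function(row) {
    return r.object(
      row('group')(0).coerceTo('string'),
      r.object(
        row('group')(1).coerceTo('string'),
        r.object(row('group')(2).coerceTo('string'), row('reduction'))
      )
    );
  })

A final step would then merge these per-group objects into one; note that a plain merge is shallow, so groups sharing the same country need a deeper merge or client-side assembly.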

For 1 billion documents, populate data from one field to other fields in the same collection using MongoDB

I need to populate data from one field into multiple fields in the same collection. For example:
Currently I have documents like the one below:
{ _id: 1, temp_data: {temp1: [1,2,3], temp2: "foo bar"} }
I want to populate the data into two different fields in the same collection, like below:
{ _id: 1, temp1: [1,2,3], temp2: "foo bar" }
I have one billion documents to migrate. Please suggest an efficient way to update all one billion documents.
In your favorite language, write a tool that runs through all documents, migrates them, and stores them in a new database.
Some hints (a rough shell sketch follows below):
When iterating the results, make sure they are sorted (e.g. on the _id) so you can implement resume should your migration code crash at 90%...
Do batch inserts: read, say, 1000 items, migrate them, then write those 1000 items in a single batch to the new database. Reads are automatically batched.
Create indexes after the migration, not before. That will be faster and lead to less fragmentation.
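A rough mongo-shell sketch of those hints (the collection names match the query further below; the 1000-document batch size and the resume checkpoint are illustrative):

// Persist the last migrated _id somewhere so a crashed run can resume from it.
var lastId = null;
var batch = [];

var query = lastId === null ? {} : { _id: { $gt: lastId } };
db.collection.find(query).sort({ _id: 1 }).forEach(function(doc) {
  batch.push({ _id: doc._id, temp1: doc.temp_data.temp1, temp2: doc.temp_data.temp2 });
  lastId = doc._id;

  if (batch.length === 1000) {
    db.collection_new.insert(batch);   // one batched write per 1000 migrated documents
    batch = [];
  }
});

if (batch.length > 0) {
  db.collection_new.insert(batch);
}

// Create any indexes on the new collection only after the migration has finished.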
Here I made a query for you; use the following query to migrate your data:
db.collection.find().forEach(function(myDoc) {
  db.collection_new.update(
    { _id: myDoc._id },
    {
      $unset: { 'temp_data': 1 },
      $set: {
        'temp1': myDoc.temp_data.temp1,
        'temp2': myDoc.temp_data.temp2
      }
    },
    { upsert: true }
  );
});
To learn more about the forEach cursor method, please visit the link.
You will need the $limit and $skip operators to migrate the data in batches. In the update query I have used upsert because, if the document already exists, it will be updated; otherwise the inserted entry will be new.
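A hedged sketch of that batching idea (the batch size is illustrative; sorting on _id keeps the pages stable between runs):

var pageSize = 1000;
var page = 0;
var docs;

do {
  docs = db.collection.find().sort({ _id: 1 }).skip(page * pageSize).limit(pageSize).toArray();
  docs.forEach(function(myDoc) {
    db.collection_new.update(
      { _id: myDoc._id },
      { $set: { temp1: myDoc.temp_data.temp1, temp2: myDoc.temp_data.temp2 } },
      { upsert: true }
    );
  });
  page++;
} while (docs.length === pageSize);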
Thanks

Capped Collection in mongodb issues

I did an analysis of capped collections and found that there is no performance improvement with a capped collection.
I created a collection named test1 with 20,000 records, then did copyTo into test2 with the same data, with capped: true and a data size specified. I ran the following queries to examine the performance:
db.test1.find( { $query: { "group" : "amazonTigers"}, $explain: 1 } ).pretty()
db.test2.find( { $query: { "group" : "amazonTigers"}, $explain: 1 } ).pretty()
Both result in the same response time of 124ms...
Moreover, I don't understand how capped collections work.
I have read through lots of blogs, but I am not able to find the correct working principle of MongoDB capped collections.
I read through the disadvantages of capped collections; it's stated that we are not able to use $set and $push on them. Are there any other disadvantages of capped collections for the specified collection entries?
Regards,
Harry
