Ruby MongoDB and large documents - ruby

I have a populated MongoDB database.
Now I need to add a huge amount of additional data (log file data) to my documents. This data exceeds the BSON document size limit:
Document too large: This BSON document is limited to 16777216 bytes. (BSON::InvalidDocument)
A simplified example of my situation would look like this:
require 'mongo'

cli = MongoClient.new("localhost", MongoClient::DEFAULT_PORT)
db = cli.db("testdb")
coll = db.collection("test")
# a ~17 MB string blows past the 16 MB BSON document limit
data = {:name => "Customer1", :data1 => "some value", :log_file => "A" * 17_000_000}
coll.save data
What is the best way to add this huge amount of data?
Could I use GridFS to store those files and link the GridFS-file-handle to the correct document?
Could I then access the GridFS-file during queries?

I would suggest two approaches:
GridFS - with instructions here: https://github.com/mongodb/mongo-ruby-driver/wiki/GridFS
Advantage: uses an already existing service (MongoDB) to store the files, so it is presumably the easiest and cheapest option to implement, since you already have the infrastructure.
Disadvantage: not necessarily the best use of a database that tries to keep its working set in memory, especially if it's used for other storage as well.
S3 - store links to a hosted data service (such as Amazon S3) which is designed for file storage (redundant, replicated and highly available). In this case you just upload the files and store a pointer to their S3 location in your DB.
Advantage: keeps your DB leaner, and it is probably cheaper, since you keep your Mongo machines optimised for doing Mongo things (i.e. high memory) and take advantage of the really cheap file storage on S3 as well as its near-infinite scalability.
Disadvantage: harder to implement, since you need to write your own code to do this, though there may be off-the-shelf solutions somewhere (a sketch of this approach follows below).
Some more useful discussion on this SO post
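A minimal sketch of the S3 approach, assuming the aws-sdk-s3 gem, a pre-existing bucket called my-log-bucket, and the same old-style Ruby MongoDB driver used elsewhere on this page (bucket, key and file names are placeholders):
require 'aws-sdk-s3'
require 'mongo'

s3 = Aws::S3::Resource.new(:region => 'us-east-1')
obj = s3.bucket('my-log-bucket').object('customer1/logfile.txt')
obj.upload_file('/path/to/logfile.txt') # push the large log file to S3

cli = MongoClient.new("localhost", MongoClient::DEFAULT_PORT)
coll = cli.db("testdb").collection("test")
# the document stores only a small pointer to the S3 object
coll.save({:name => "Customer1", :data1 => "some value", :log_file_s3_key => obj.key})
Retrieving the log later is then a matter of reading log_file_s3_key from the document and downloading that object from S3.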

Maybe you can split up your documents and reference the parts. See this SO post: syntax for linking documents in mongodb
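A minimal sketch of that idea, splitting the oversized log data into chunk documents in a separate collection that reference the parent document (the log_chunks collection and its field names are assumptions, not anything the linked post prescribes):
cli = MongoClient.new("localhost", MongoClient::DEFAULT_PORT)
db = cli.db("testdb")
customers = db.collection("test")
chunks = db.collection("log_chunks")

customer_id = customers.insert({:name => "Customer1", :data1 => "some value"})

# split the log into pieces that each stay well under the 16 MB limit
log_data = "A" * 17_000_000
chunk_size = 10_000_000
(0...log_data.length).step(chunk_size).each_with_index do |offset, i|
  chunks.insert({:customer_id => customer_id, :seq => i, :data => log_data[offset, chunk_size]})
end

# later, reassemble the log for one customer in order
log = chunks.find({:customer_id => customer_id}).sort([["seq", 1]]).map { |c| c["data"] }.join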

The paragraph about document growth finally answered my question. (Found by following Konrad's link.)
http://docs.mongodb.org/manual/core/data-model-operations/#data-model-document-growth
What I am now basically doing is this:
require 'mongo'

cli = MongoClient.new("localhost", MongoClient::DEFAULT_PORT)
db = cli.db("testdb")
coll = db.collection("test")
grid = Grid.new db

# store the large data in GridFS and keep only its file id in the document
id = grid.put "A" * 17_000_000
data = {:name => "Customer1", :data1 => "some value", :log_file => id}
coll.save data

# access the data again via the stored GridFS id
cust = coll.find({:name => "Customer1"})
id = cust.first["log_file"]
data = grid.get id # returns a GridIO object; call .read on it for the raw contents

Related

Caching dataset results in Sequel and Sinatra

I'm building an API with Sinatra, using Sequel as the ORM on a Postgres database.
I have some complex datasets to query in a paging style, so I'd like to keep the dataset in a cache so that subsequent page requests after the first call can reuse it.
I've read that Sequel datasets are cached by default, but I need to keep this object between two requests to benefit from this behavior.
So I was wondering about putting this object somewhere so I can retrieve it later if the same query is run again, rather than building a full new dataset each time.
I tried the Sinatra session hash, but I got a TypeError: can't dump anonymous class #<Class:0x000000028c13b8> when putting the dataset object in it.
I'm wondering whether to use memcached for that.
Any advice on the best way to do that would be very appreciated, thanks.
Memcached or Redis (using LRU) would likely be appropriate solutions for what you are describing. The Ruby Dalli gem makes it pretty easy to get started with memcached. You can find it at https://github.com/mperham/dalli.
On the GitHub page you will see the following basic example:
require 'dalli'
options = { :namespace => "app_v1", :compress => true }
dc = Dalli::Client.new('localhost:11211', options)
dc.set('abc', 123)
value = dc.get('abc')
This illustrates the basics of using the gem. Consider that Memcached is simply a key/value store that uses LRU (least recently used) eviction. This means you allocate memory to Memcached and let your keys expire organically unless there is a reason to expire a key manually.
From there it becomes a matter of attempting to fetch a key from memcached, and only running your real queries when no match is found.
found = dc.get('my_unique_key')
unless found
  # Do your Sequel query here
  dc.set('my_unique_key', 'value_goes_here')
end
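Building on that fetch-or-store pattern, here is a minimal sketch of caching one page of Sequel results in memcached. The connection URL, the items table, the key format and the 5-minute TTL are all placeholder assumptions; note that it is the materialized rows, not the dataset object itself, that get serialized, which also sidesteps the "can't dump anonymous class" error:
require 'json'
require 'dalli'
require 'sequel'

DB = Sequel.connect('postgres://localhost/mydb')
dc = Dalli::Client.new('localhost:11211', :namespace => "app_v1", :compress => true)

def cached_page(dc, page, per_page)
  key = "items:page:#{page}:#{per_page}"
  cached = dc.get(key)
  return JSON.parse(cached) if cached

  # run the real query only on a cache miss, and cache plain row hashes
  rows = DB[:items].order(:id).limit(per_page, (page - 1) * per_page).all
  rows = rows.map { |r| r.transform_keys(&:to_s) }
  dc.set(key, JSON.generate(rows), 300) # keep this page for 5 minutes
  rows
end

first_page = cached_page(dc, 1, 25)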

How to get the collection based upon a wildcard Redis key using the redis-rb gem?

The Redis objects are created using the redis-rb gem.
$redis = Redis.new
$redis.sadd("work:the-first-task", 1)
$redis.sadd("work:another-task", 2)
$redis.sadd("work:yet-another-task", 3)
Is there any method to get the collection that has "work:*" keys?
Actually, if you just want to build a collection on Redis, you only need one key.
The example you provided builds 3 distinct collections, each of them with a single item. This is probably not what you wanted to do. The example could be rewritten as:
$redis = Redis.new
$redis.sadd("work","the-first-task|1")
$redis.sadd("work", "another-task|2")
$redis.sadd("work", "yet-another-task|3")
To retrieve all the items of this collection, use the following code:
x = $redis.smembers("work")
If you need to keep track of the order of the items in your collection, it would be better to use a list instead of a set.
In any case, usage of the KEYS command should be restricted to tooling/debug code only. It is not meant to be used in a real application because of its linear complexity.
If you really need to build several collections, and retrieve items from all these collections, the best way is probably to introduce a new "catalog" collection to keep track of the keys corresponding to these collections.
For instance:
$redis = Redis.new
$redis.sadd("catalog:work", "work:the-first-task" )
$redis.sadd("catalog:work", "work:another-task" )
$redis.sadd("work:the-first-task", 1)
$redis.sadd("work:the-first-task", 2)
$redis.sadd("work:another-task", 3)
$redis.sadd("work:another-task", 4)
To efficiently retrieve all the items:
keys = $redis.smembers("catalog:work")
res = $redis.pipelined do
  keys.each do |x|
    $redis.smembers(x)
  end
end
res.flatten!(1)
The idea is to perform a first query to get the content of catalog:work, and then iterate on the result using pipelining to fetch all the data. I'm not a Ruby user, so there is probably a more idiomatic way to implement it.
Another, simpler option can be used if the number of collections you want to retrieve is limited, and if you do not care about the ownership of the items (i.e. which set each item is stored in):
keys = $redis.smembers("catalog:work")
res = $redis.sunion(*keys)
Here the SUNION command is used to build a set resulting of the union of all the sets you are interested in. It also filters out the duplicates in the result (this was not done in the previous example).
Well, I could get them with $redis.keys("work:*").
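If you do need to enumerate keys by pattern outside of debugging, redis-rb also wraps the incremental SCAN command, which avoids the blocking full key-space walk that KEYS does. A minimal sketch reusing the work:* keys from the question:
$redis = Redis.new
work_keys = $redis.scan_each(:match => "work:*").to_a
# union the members of every matching set, as in the SUNION example above
items = work_keys.empty? ? [] : $redis.sunion(*work_keys)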

Advice on how to model arrays in a MongoDB document

I am building an API using Codeigniter and MongoDB.
I have some questions about how to "model" the MongoDB data.
A user should have basic data like a name, and a user should also be able to
follow other users. As it is now, each user document keeps track of all the people
who are following him and all the people he is following. This is done using arrays
of user _ids.
Like this:
"following": [323424,2323123,2312312],
"followers": [355656,5656565,5656234234,23424243,234246456],
"fullname": "James Bond"
Is this a good way? Perhaps the user document should only contain the ids of the people the user is following, and not of those who are following him? I can imagine that keeping potentially thousands of ids (for followers) in an array will make the document too big.
All input is welcome!
The maximum document size is currently limited to 16 MB (v1.8.x and up), which is pretty big. But I still think it would be fine in this case to move the follower relations into their own collection -- you never know how big your project gets.
However, I would recommend using database references for storing the follower relations: it's much easier to resolve the user from a database reference. Have a look at:
http://www.mongodb.org/display/DOCS/Database+References
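A minimal sketch of moving the relations into their own collection, written in the same Ruby driver style used earlier on this page (the collection and field names are assumptions, and plain _ids are stored here rather than full DBRefs for brevity):
cli = MongoClient.new("localhost", MongoClient::DEFAULT_PORT)
db = cli.db("testdb")
users = db.collection("users")
follows = db.collection("follows")

bond_id = users.insert({:fullname => "James Bond"})
m_id = users.insert({:fullname => "M"})

# one small document per relation instead of an ever-growing array
follows.insert({:follower_id => m_id, :followed_id => bond_id})

# who follows James Bond?
follower_ids = follows.find({:followed_id => bond_id}).map { |f| f["follower_id"] }
followers = users.find({:_id => {"$in" => follower_ids}}).to_a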

How to cache Zend Lucene search results in Code Igniter?

I'm not sure if this is the best way to go about this, but my aim is to have pagination of my Lucene search results.
I thought it would make sense to run the search, store all the results in the cache, and then have a page function on my results controller that could return any particular subset of results from the cached results.
Is this a bad approach? I've never used caching of any sort, so don't know where to begin. The CI Caching Driver looked promising, but everything throws a server error. I don't know if I need to install APC, or Memcached, or what to do.
Help!
Lucene is a search engine that is built for scale. You can push it pretty far till the need arises to cache the search results. I would suggest you use the default settings and run it.
If you still feel the need for a cache, first look at this Lucene FAQ, and then the next level would perhaps be something along the lines of memcache.
Hope it helps!
Zend Search Lucene is indexed on the file system and, as the answer above states, built for scale. Unless you are indexing hundreds of thousands of documents, caching is not really necessary - especially since all you would effectively be doing is taking data from one file and storing it in another.
On the other hand, if you are only storing, say, a product id in your search index and then selecting the products from the database when you get a result, it is well worth caching. This can easily be achieved using Zend_Cache.
A basic example of Zend Db caching is here:
$frontendOptions = array(
    'automatic_serialization' => true
);
$backendOptions = array(
    'cache_dir' => YOUR_CACHE_PATH_ON_THE_FILE_SYSTEM,
    'file_name_prefix' => 'my_cache_prefix',
);
$cache = Zend_Cache::factory(
    'Core',
    'File',
    $frontendOptions,
    $backendOptions
);
Zend_Db_Table_Abstract::setDefaultMetadataCache($cache);
This should be added to your bootstrap file in an _initDbCache (call it whatever you want) method.
Of course, that is a very simple implementation and does not achieve full result caching; more information on Zend caching with Zend Db can be found here.

Pagination with MongoDB

I have been using MongoDB and RoR to store logging data. I am pulling out the data and looking to page the results. Has anyone done paging with MongoDB or know of any resources online that might help get me started?
Cheers
Eef
Pagination in MongoDB can be accomplished by using a combination of limit() and skip().
For example, assume we have a collection called users in our active database.
>> db.users.find().limit(3)
This retrieves a list of the first three user documents for us. Note, this is essentially the same as writing:
>> db.users.find().skip(0).limit(3)
For the next three, we can do this:
>> db.users.find().skip(3).limit(3)
This skips over the first three user records, and gives us the next three. If there is only one more user in your database, don't worry; MongoDB is smart enough to only return data that is present, and won't crash.
This can be generalised, and would be roughly equivalent to what you would do in a web application. Assuming we have a variable called PAGE_SIZE set to 3, and an arbitrary PAGE_NUMBER:
>> db.users.find().skip(PAGE_SIZE * (PAGE_NUMBER - 1)).limit(PAGE_SIZE)
I cannot speak directly as to how to employ this method in Ruby on Rails, but I suspect the Ruby MongoDB library exposes these methods.
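For what it's worth, a minimal sketch of the same skip/limit pattern with the old-style Ruby driver used earlier on this page (the users collection and the page variables are placeholders):
PAGE_SIZE = 3
page_number = 2

cli = MongoClient.new("localhost", MongoClient::DEFAULT_PORT)
users = cli.db("testdb").collection("users")

# same arithmetic as the shell example above
page = users.find({}).skip(PAGE_SIZE * (page_number - 1)).limit(PAGE_SIZE).to_a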