Sequentially Accessed Records in a Hierarchical (Object) DB - data-structures

I am creating an app that doesn't do any searching (or many other random-access activities). It's built on an object DB (ZODB if you're interested) and will store many instances of an identical type. Once they are created, the main access to the objects in this structure will be a cron job working through them all sequentially at periodic intervals.
Is the best way to store them in an object-DB hierarchy simply to place them all one level below the hierarchy root? ZODB storage works very much like a Python dictionary. On the (very) odd occasion they are accessed randomly, would this be a performance issue? I envisage that the maximum number of objects in the DB will be ~10k.

Simply store them in a BTree (part of the ZODB package) and you should be fine; the BTree structures are highly efficient for both sequential and random access.
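A minimal sketch of the cron job's sequential pass. A plain dict stands in for ZODB's `BTrees.OOBTree.OOBTree` so the sketch is self-contained (an OOBTree iterates its items in sorted key order natively, while a dict needs an explicit sort); the record fields are illustrative.

```python
# Plain dict as a stand-in for BTrees.OOBTree.OOBTree; with ZODB the
# records would live under the root and be committed in a transaction.
records = {n: {"id": n, "processed": False} for n in range(10_000)}

def process_all(store):
    """Walk every record in key order, as the periodic cron job would."""
    count = 0
    for key in sorted(store):  # an OOBTree yields keys in sorted order already
        store[key]["processed"] = True
        count += 1
    return count
```

Random access by key stays cheap with a BTree (O(log n)) even though the workload is mostly sequential, which is why the occasional random lookup is not a concern at ~10k objects.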

Related

Advantages of database vs hash for simple key value lookup (in Ruby)

Suppose I have a production app on AWS with, let's say, 50,000 users, and I need to simply take a username and look up one or two pieces of information about them.
Is there any advantage to keeping this information in DynamoDB over a Ruby hash stored in an AWS S3 bucket?
By "advantages" I mean both cost and speed.
At some point will I need to migrate to a DB, or will a simple hash lookup suffice? Again, I will never need to compare entries, or do anything but look up the values associated with a key (username).
The more general question is: what are the advantages of a DB (like DynamoDB) over an S3 hash for the purposes of simple key/value storage?
You should note that a Hash cannot be used as a database by itself; it must be loaded with values from some data store (such as a database, or a JSON or YAML file). DynamoDB, by contrast, is a database and has persistence built in.
Having said that, for 50,000 entries a Ruby Hash should be a viable option; it will perform quite well, as indicated in this article.
A Ruby Hash is not distributed, so if you run your app on multiple servers for availability/scalability, you will have to load that Hash on each server and keep its data consistent. In other words, if one of the user attributes gets updated via one server, you need a way to replicate that change to the other servers. Also, if the number of users in your system is not 50,000 but 50 million, you may have to rethink the Hash-as-cache option.
DynamoDB is a full-blown NoSQL database: it is distributed and promises high scalability. It also costs money to use, so your decision should be based on whether you need the scale and availability DynamoDB offers, and whether you have the budget for it.
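The "hash loaded from a data store" approach the answer describes can be sketched as follows (in Python rather than Ruby, for illustration): persist the user records as JSON, load them into a dict once at startup, then look up fields by username. All names and values here are illustrative.

```python
import json
import os
import tempfile

# Hypothetical user records; in practice this file would live in S3 and be
# fetched once at application startup.
users = {"alice": {"plan": "pro"}, "bob": {"plan": "free"}}

path = os.path.join(tempfile.mkdtemp(), "users.json")
with open(path, "w") as f:
    json.dump(users, f)

# At startup: one read from the store, then O(1) in-process lookups
# for the lifetime of the server process.
with open(path) as f:
    table = json.load(f)
```

The consistency caveat from the answer applies: with multiple servers, each one holds its own copy of `table`, so an update made through one server is invisible to the others until they reload.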

Is it feasible to use a distributed cache for queryable data sets?

My scenario is as follows. I have a data table with a million rows of tuples (say first name and last name), and a client that needs to retrieve a small subset of rows whose first name or last name begins with the query string. Caching this seems like a catch-22, because:
On the one hand, I can't store and retrieve the entire data set on every request (would overwhelm the network)
On the other hand, I can't just store each row individually, because then I'd have no way to run a query.
Storing ranges of values in the cache, with a local "index" or directory, would work... except that you'd have to essentially duplicate the data for each index, which defeats the purpose of even using a distributed cache.
What approach is advisable for this kind of thing? Is it possible to get the benefits of using a distributed cache, or is it simply not feasible for this kind of scenario?
Distributed caching is feasible for queryable data sets.
For this scenario, though, the cache product should offer a native query function or procedure that can return results much faster; otherwise, if broader scopes such as session or application scope are not possible, the server side would need to iterate over a large amount of data for every request.
Maintaining your own index on the application server, in front of the database, is rarely a good idea.
If network load is still an issue, you could consider a document-oriented or column-oriented NoSQL DB, if that is feasible for your application.
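One way to support prefix queries without duplicating the cached rows is a single sorted index of the keys, searched with binary search. A minimal sketch, with illustrative names and data:

```python
import bisect

# Row store plus ONE sorted index of names; the index holds only keys,
# not a copy of the row data, so it avoids the duplication concern above.
rows = {"anderson": 1, "andrews": 2, "baker": 3, "barnes": 4}
index = sorted(rows)

def prefix_query(prefix):
    """Return all indexed names starting with `prefix` via binary search."""
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_left(index, prefix + "\uffff")  # just past the prefix range
    return index[lo:hi]
```

A second sorted index over last names would handle the "first name or last name" case; each index costs O(n) keys, not a second copy of the data set.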

A better idea/data structure to collect analytics data

I'm collecting analytics data. I'm using a master map that holds many other nested maps.
Since the maps are immutable, many new maps are going to be allocated (yes, that is efficient in Clojure, thanks to structural sharing).
The basic operation I'm using is update-in, which is very convenient for updating a value at a given path or creating the binding for a non-existent value.
Once I reach a specific point, I'm going to save that data structure to the database.
What would be a better way to collect this data more efficiently in Clojure? A transient data structure?
As with all optimizations, measure first. If the map update is a bottleneck, then switching to a transient map is a rather unintrusive code change. If you find that GC overhead is the real culprit, as it often is with persistent data structures, and transients don't help enough, then a more effective (though larger) change is to collect the data into a list and batch-add it into a transient map, which is made persistent and saved to the DB at the end. Adding to a list produces very little GC overhead because, unlike adding to a map, the old head does not need to be discarded and GC'd.
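The batching idea translates roughly as follows (sketched in Python for illustration; in Clojure you would accumulate with `conj` and fold with `transient`/`assoc!`/`persistent!`). Events are appended to a flat list as they arrive, then folded into the nested map once, just before the DB save; the event shape here is hypothetical.

```python
# Illustrative (metric, page, count) events collected cheaply in a list;
# appends create almost no garbage compared with rebuilding nested maps
# on every event.
events = [("clicks", "home", 1), ("clicks", "home", 1), ("views", "about", 3)]

def fold(batch):
    """Build the nested analytics map in one pass, just before saving."""
    acc = {}
    for metric, page, n in batch:
        acc.setdefault(metric, {})
        acc[metric][page] = acc[metric].get(page, 0) + n
    return acc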

Running out of RAM

I may need to build a hash table that could grow very large. I am wondering: if the hash table does not fit in memory, what is the best way to address the problem so the application doesn't crash when it runs out of memory?
Use case: this hash table contains a bunch of ids, and a for loop needs to consult the id for a particular word.
Any time you have data that cannot be easily recreated on the fly, you need to make provisions to get it out of RAM and onto disk. Any sort of data store will do that: you could use a flat/text file or a YAML file.
If you need fast access, then you'll be looking at some sort of database, because reading a flat/text file doesn't easily allow random access. SQLite can do it, as can a NoSQL database.
If you need to allow multiple processes access to the data with good access restriction, and/or store the data on one machine and access it from another, then you'll be looking at a database of some sort. At that point I'd look into MySQL or Postgres. I prefer the latter, but they'll both work.
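For the single-process case, a minimal sketch of getting the table out of RAM is the stdlib `shelve` module, which keeps a dict-like mapping on disk (backed by dbm) so only the entries you actually touch are read into memory. The file name and ids here are illustrative.

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "ids")

# Build phase: write the word -> id mapping to disk instead of a dict.
with shelve.open(path) as ids:
    ids["apple"] = 17
    ids["banana"] = 42

# Later, e.g. inside the word loop, reopen and consult individual ids
# without ever holding the whole table in memory.
with shelve.open(path) as ids:
    banana_id = ids["banana"]
```

This keeps the dict-style access pattern of the original code while capping memory use, at the cost of a disk read per lookup (mitigated by the OS file cache).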
If you really think the hash will grow that big, then maybe you should not store this data in an in-RAM hash. I don't think you can easily avoid a crash once your app runs out of memory; the key is to create mechanisms that avoid major memory consumption in the first place.
I don't know your situation, but I really doubt the hash table you described would make a reasonable computer run out of memory. If you really think so, maybe you should use a key-value storage database (Redis is fairly easy to learn: http://redis.io/) or some other kind of NoSQL database.

Storage for Write Once Read Many

I have a list of 1 million digits. Every time the user submits an input, I need to match the input against the list.
As such, the list has Write Once Read Many (WORM) characteristics.
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL database, but is it suitable for WORM? (UPDATE: using a VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would be taking too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question, you'll need to think about which of two things you are trying to minimize: storage space or processing time.
Storing the data in memory will give you the fastest processing time, especially if you can optimize the data structure for your most common operation (in this case a lookup), at the cost of memory space. For persistence, you could store the data in a flat file and read it during startup.
SQL databases are great for storing and reading relational data. For instance, names, addresses, and orders can be normalized and stored efficiently. But does a flat list of digits make sense to store in a relational database? Each access carries a lot of overhead: constructing the query, building the query plan, executing the query plan, and so on. And since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means each data access would amount to a table scan).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant-time lookups), or write a simple indexed-file accessor class that stores the data in a search-optimized order (worst case, a flat file).
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to worry about file-system storage, SQL is probably your best shot. You can optimize your table indexes, but it will also add another external dependency to your project.
EDIT: It seems MySQL has an ARCHIVE storage engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), create the list and store it as a binary file, then read that binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (an extra app dependency and design expertise); fast, but not as fast as holding the whole list in memory.
If you're concerned about tampering, buy a writable DVD (or a CD, if you can find a store which still carries them ...), write the list to it, and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick which has a "write protect" switch, but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk, one entry per line. When you need to match the numbers, just open the file, read each line, and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore file-system cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work? Just store the numbers in a text file and read them into a hash set on application startup. On my computer, reading in 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
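That "simplest thing" can be sketched in a few lines: dump the numbers to a text file once (the write-once half of WORM), read them into a set at startup, and get O(1) membership tests afterwards. The counts and file name here are illustrative.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "numbers.txt")

# Write-once: one number per line, matching the flat-file option above.
with open(path, "w") as f:
    for n in range(1_000_000):
        f.write(f"{n}\n")

# Read-many: one linear read at application startup, then each user
# input is a single hash-set membership test.
with open(path) as f:
    lookup = {line.strip() for line in f}
```

Keeping the entries as strings avoids the INT-vs-VARCHAR question from the update entirely; the input is compared exactly as submitted.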