I may need to build a hash table that could grow very large. I am wondering: if the hash table does not fit in memory, what is the best way to handle this so the application doesn't crash when it runs out of memory?
Use case: the hash table contains a bunch of ids, and a for loop needs to look up the id for a particular word.
Any time you have data that cannot be easily recreated on the fly, you need to make provisions to get it out of RAM and onto disk. Any sort of data store will do: you could use a flat or text file, or a YAML file.
If you need fast access then you'll be looking at some sort of database, because a flat/text file doesn't easily allow random access. SQLite can do it, or a NoSQL database.
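As a rough sketch of the idea, here is a disk-backed word-to-id table using Python's stdlib dbm module (the file path and keys are purely illustrative); a SQLite table keyed on the word works the same way:

```python
import os
import tempfile
import dbm.dumb as dbm  # dbm.dumb is always available; plain `dbm` picks gdbm/ndbm where installed

db_path = os.path.join(tempfile.gettempdir(), "word_ids")

# Entries are written to disk, so the table can grow well past available RAM.
with dbm.open(db_path, "c") as db:
    db["apple"] = b"1"
    db["banana"] = b"2"

# Re-open later (even after a restart) and look an id up without
# loading the whole table into memory.
with dbm.open(db_path, "c") as db:
    apple_id = int(db["apple"].decode())
```

Only the entries you touch are pulled into memory; the rest stay on disk.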
If you need to allow multiple processes access to the data with good access restriction, and/or to store the data on one machine and access it from another, then you'll be looking at a database of some sort. At that point I'd look into MySQL or Postgres. I prefer the latter, but they'll both work.
If you really think the hash will grow that big, then maybe you shouldn't store this data in an in-memory hash at all. You can't easily avoid a crash once your app runs out of memory; the key is to build in mechanisms that avoid major memory consumption in the first place.
I don't know your situation, but I really doubt the hash table you described would make a reasonable computer run out of memory. If you really think it might, maybe you should use a key-value database (Redis is fairly easy to learn: http://redis.io/) or some other kind of NoSQL database.
Related
I am using Realm as the database solution for my app. I need persistent storage ability for my images so I can load them when offline. I also need a cache so I can load the images from there rather than fetching them from the API each time a cell draws them. My first thought was that a Realm database could serve both of these functions just fine if I were to store the images in Realm as NSData. But I have found two answers on SE (here and here) that recommend not doing this if you have many images of a largish size that will change often. Instead they recommend saving the images to disk, and then storing the URL to those images in Realm.
My question is: why is this best practice? The answers linked above don't give reasons, except to say that you end up with a bloated database. But why is that a problem? What is the difference between having lots of images in my database and having lots of images on disk?
Is it a speed issue? If so, is there a marked speed difference in an app being able to access an image from disk to being able to access it from a database solution like Realm?
Thanks in advance.
This isn't really just a problem localised to Realm. I remember the same advice being given with Core Data too.
I'm guessing the main reason above all else as to why storing large binary data in a database isn't recommended is because 'You don't gain anything, and actually stand to lose more than you otherwise would'.
With Core Data (i.e. databases backed by SQLite), you'll actually take a performance hit as the data will be copied into memory when you perform the read from SQLite. If it's a large amount of data, then this is wholly unacceptable.
With Realm at least, since it uses a zero-copy, memory-mapped mechanism, you'll be provided with the NSData mapped straight from the Realm file, but then again, this is absolutely no different than if you simply loaded the image file from disk itself.
Where this becomes a major problem in Realm is when you start changing the image often. Realm uses an internal snapshotting mechanism when working with changing data across threads, which means that during an operation, entire sets of data might be periodically duplicated on disk (to ensure thread safety). If the data sets include large blobs of binary data, these get duplicated too (which can also mean a performance hit). When this happens, the size of the Realm file on disk grows to accommodate the snapshots, but when the operation completes and the snapshots are deleted, the file does not shrink back to its original size. This is because reclaiming that disk space would be a costly performance hit, and since the space could easily be needed again (e.g. by another large snapshotting operation), it would be inefficient to reclaim it pre-emptively (hence the 'bloat').
It's possible to manually perform an operation to reclaim this disk space if necessary, but the generally recommended approach is to optimise your code to minimise this from happening in the first place.
So, to sum that all up, while you totally can save large data blobs to a database, over time, it'll potentially result in performance hits and file size bloat that you could have otherwise avoided. These sorts of databases are designed to help transform small bits of data to a format that can be saved to and retrieved from disk, so it's essentially wasted on binary files that could easily be directly saved without any modification.
It's usually much easier, cleaner and more efficient to store your large binary data on disk, and simply store a file-name reference to it inside the database. :)
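A rough sketch of that pattern, independent of Realm itself (the directory and naming scheme here are made up for illustration): write the image bytes to disk and keep only the file name for the database record.

```python
import os
import tempfile
import uuid

# Hypothetical image folder; a real app would use its Documents/Caches directory.
IMAGE_DIR = os.path.join(tempfile.gettempdir(), "images")
os.makedirs(IMAGE_DIR, exist_ok=True)

def save_image(data: bytes) -> str:
    """Write raw image bytes to disk; the returned file name is what
    you'd store in the database instead of the blob itself."""
    name = uuid.uuid4().hex + ".png"
    with open(os.path.join(IMAGE_DIR, name), "wb") as f:
        f.write(data)
    return name

def load_image(name: str) -> bytes:
    """Resolve a stored file name back to the image bytes."""
    with open(os.path.join(IMAGE_DIR, name), "rb") as f:
        return f.read()
```

The database row then only ever changes a short string, no matter how often the image itself is replaced.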
Suppose I have a production app on AWS with, let's say 50,000 users, and I need to simply take a username and lookup one or two pieces of information about them.
Is there any advantage to keeping this information in a DynamoDB over a Ruby hash stored in an AWS S3 bucket?
By "advantages" I mean both cost and speed.
At some point, will I need to migrate to a DB, or will a simple hash lookup suffice? Again, I will never need to compare entries, or do anything but look up the values associated with a key (username).
The more general question is: what are the advantages of a DB (like DynamoDB) over an S3 hash for the purposes of simple key/value storage?
You should note that a Hash cannot be used as a database; it must be loaded with values from some data store (such as a database, or a JSON or YAML file). DynamoDB, by contrast, is a database and has persistence built in.
Having said that, for 50,000 entries a Ruby Hash should be a viable option; it will perform quite well, as indicated in this article.
A Ruby Hash is not distributed, so if you run your app on multiple servers for availability/scalability, you will have to load that Hash on each server and keep its data consistent. In other words, if one of the user attributes gets updated via one server, you need a way to replicate that value to the other servers. Also, if the number of users in your system is not 50,000 but 50 million, you may have to rethink the Hash-as-cache option.
DynamoDB is a full-blown NoSQL database: it is distributed and promises high scalability. It also costs money to use, so your decision should be based on whether you need the scale and availability DynamoDB offers, and whether you have the budget for it.
I was using bcp (sybase mass insert) to insert millions of records but my company is migrating to oracle.
I am not sure if I should use array binding or SQL*Loader. I have a lot of data in memory, so I can either 1) create a text file with the data and use SQL*Loader to insert it, or 2) use the array-binding library to insert the data. I'm not sure which is more practical for my application. What are the differences between the two? Is one better for certain applications?
Which should I use to replace bcp?
SQL*Loader is the most direct replacement for bcp. If you have an existing process that uses bcp, moving to SQL*Loader is probably the path of least resistance.
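For reference, a minimal SQL*Loader setup looks something like this (the file, table, and column names are placeholders):

```
-- records.ctl
LOAD DATA
INFILE 'records.csv'
APPEND
INTO TABLE target_table
FIELDS TERMINATED BY ','
(id, name, amount)
```

You would then invoke it with something like `sqlldr userid=user/pass control=records.ctl direct=true`, where `direct=true` requests a direct-path load.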
You say that you already have the data in memory. I assume that means the data is in memory on a client machine, not on the database server. Given that starting point, I'd generally prefer a direct path load, assuming that whatever database access API you are using provides a direct path API. Incurring the overhead of writing a bunch of data to a file, only to have SQL*Loader incur the overhead of reading that data back off disk just to use the direct path API to load it (assuming you set it up to do so), should make SQL*Loader less efficient. Of course, as a purpose-built tool, it is likely that a SQL*Loader solution can be put together with acceptable performance more quickly than you can write your own code, particularly if you're just learning the API.
If you don't have access to a direct path API and you're debating between an application doing a conventional path load using array binds or a SQL*Loader solution doing a direct-path load, the question is much closer. You'd probably need to benchmark the two. Direct-path loading is more efficient than a conventional path load. But writing all the data to disk and reading it all back is going to incur an additional cost. Whether the cost of reading and writing that data to disk outweighs the benefit of a direct-path load will depend on a variety of factors that are specific to your application (data volumes, disk speed, network I/O, etc.).
One additional option to consider may be to write the file to disk, copy the file to the database server, and then use an external table to expose the data to the database. This is generally more efficient than using SQL*Loader from a client machine. Whether it performs better than a direct-path load from an application and whether the additional complexity of writing files, moving them around, and generally moving control from an application to operating system utilities and back to the application outweighs the complexity of writing a bit more code in the application is something that you'd need to answer for yourself.
I'm doing some research for a new project, for which the constraints and specifications have yet to be set. One thing that is wanted is a large number of paths, directly under the root domain. This could ramp up to millions of paths. The paths don't have a common structure or unique parts, so I have to look for exact matches.
Now I know it's more efficient to break up those paths, which would also help in the path lookup. However I'm researching the possibility here, so bear with me.
I'm evaluating methods to accomplish this, while maintaining excellent performance. I thought of the following methods:
Storing the paths in an SQL database and doing a lookup on every request. This seems like the worst option and will definitely not be used.
Storing the paths in a key-value store like Redis. This would be a lot better, and perform quite well I think (have to benchmark it though).
Doing string/regex matching - like many frameworks do out of the box - for this amount of possible matches is nuts and thus not really an option. But I could see how doing some sort of algorithm where you compare letter-by-letter, in combination with some smart optimizations, could work.
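That letter-by-letter idea is essentially a trie. A minimal sketch (Python here purely for illustration; a plain hash map is often faster in practice, but a trie stores shared prefixes only once):

```python
class TrieNode:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.value = None    # payload if a path ends here, else None

class PathTrie:
    """Exact-match path lookup; common prefixes share nodes."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, path, value):
        node = self.root
        for ch in path:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def lookup(self, path):
        node = self.root
        for ch in path:
            node = node.children.get(ch)
            if node is None:
                return None   # fell off the trie: no exact match
        return node.value
```

Lookup cost is proportional to the path length, not to the number of stored paths.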
But maybe there are tools/methods I don't know about that are far more suited for this type of problem. I could use any tips on how to accomplish this though.
Oh and in case anyone is wondering, no this isn't homework.
UPDATE
I've tested the Redis approach. Based on two sets of keywords, I got 150 million paths. I added each of them using the SET command, with the value being a serialized string of IDs I can use to identify the actual keywords in the request (SET 'keyword1-keyword2' '<serialized_string>').
A quick test in a local VM with a data set of one million records returned promising results: benchmarking 1000 requests took 2ms on average. And this was on my laptop, which runs tons of other stuff.
Next I did a complete test on a VPS with 4 cores and 8GB of RAM, with the complete set of 150 million records. This yielded a database of 3.1G in file size, and around 9GB in memory. Since the database could not be loaded in memory entirely, Redis started swapping, which caused terrible results: around 100ms on average.
Obviously this will not work or scale nicely. Either each web server needs an enormous amount of RAM for this, or we'll have to use a dedicated Redis-routing server. I've read an article from the engineers at Instagram, who came up with a trick to decrease the database size dramatically, but I haven't tried it yet. Either way, this does not seem to be the right way to do this. Back to the drawing board.
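For what it's worth, the Instagram trick mentioned above amounts to packing roughly a thousand entries into each Redis hash so the hashes stay in Redis's compact encoding. The key arithmetic can be sketched without a server (the prefix and bucket size are illustrative; the actual HSET/HGET calls are shown as comments because they need a live Redis):

```python
BUCKET_SIZE = 1000  # entries per Redis hash; keep below hash-max-ziplist-entries

def bucket(numeric_id: int):
    """Map an id to a (hash key, field) pair so ~1000 ids share one hash."""
    return ("bucket:%d" % (numeric_id // BUCKET_SIZE),
            str(numeric_id % BUCKET_SIZE))

# With a live server (e.g. via redis-py) the calls would look like:
#   r.hset(*bucket(1155), value)   # HSET bucket:1 155 <value>
#   r.hget(*bucket(1155))          # HGET bucket:1 155
```

The saving comes from Redis storing small hashes far more compactly than the same number of top-level string keys.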
Storing the paths in an SQL database and doing a lookup on every request. This seems like the worst option and will definitely not be used.
You're probably underestimating what a database can do. Can I invite you to reconsider your position there?
For Postgres (or MySQL w/ InnoDB), a million entries is a notch above tiny. Store the whole path in a field, add an index on it, vacuum, analyze. Don't do nutty joins until you've identified the ID of your key object, and you'll be fine in terms of lookup speeds. Say a few ms when running your query from psql.
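A sketch of that setup, shown with SQLite in-process so it runs anywhere (the Postgres DDL is essentially the same; the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# PRIMARY KEY on the full path creates a unique index, so the lookup
# below is an index seek rather than a table scan.
conn.execute("CREATE TABLE paths (path TEXT PRIMARY KEY, object_id INTEGER)")
conn.executemany("INSERT INTO paths VALUES (?, ?)",
                 [("/widgets/red-widget", 1), ("/about", 2)])

def object_for(path):
    row = conn.execute("SELECT object_id FROM paths WHERE path = ?",
                       (path,)).fetchone()
    return row[0] if row else None
```

With the index resident in RAM, each lookup is a handful of B-tree page reads regardless of table size.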
Your real issue will be the bottleneck related to disk IO if you get material amounts of traffic. The operating motto here is: the less, the better. Besides the basics such as installing APC on your php server, using Passenger if you're using Ruby, etc:
Make sure the server has plenty of RAM to fit that index.
Cache a reference to the object related to each path in memcached.
If you can categorize all routes in a dozen or so regex, they might help by allowing the use of smaller, more targeted indexes that are easier to keep in memory. If not, just stick to storing the (possibly trailing-slashed) entire path and move on.
Worry about misses. If you have a non-canonical URL that redirects to a canonical one, store the redirect in memcached without any expiration date and be done with it.
Did I mention lots of RAM and memcached?
Oh, and don't overrate that ORM you're using, either. Chances are it's taking more time to build your query than your data store is taking to parse, retrieve and return the results.
RAM... Memcached...
To be very honest, Redis isn't so different from a SQL + memcached option, except when it comes to memory management (as you found out), sharding, replication, and syntax. And familiarity, of course.
Your key decision point (besides excluding iterating over more than a few regexes) ought to be how your data is structured. If it's highly structured with critical needs for atomicity, SQL + memcached ought to be your preferred option. If you have custom fields all over and obese EAV tables, then playing with Redis or CouchDB or another NoSQL store ought to be on your radar.
In either case, it'll help to have lots of RAM to keep those indexes in memory, and a memcached cluster in front of the whole thing will never hurt if you need to scale.
Redis is your best bet I think. SQL would be slow and regular expressions from my experience are always painfully slow in queries.
I would do the following steps to test Redis:
Fire up a Redis instance either with a local VM or in the cloud on something like EC2.
Download a dictionary or two and pump the data into Redis, for example something from http://wordlist.sourceforge.net/. Make sure you normalize the data: for example, always lower-case the strings, strip whitespace from the start/end of the string, etc.
I would ignore the hash; I don't see why you need to hash the URL. It would be impossible to read later if you wanted to debug things, and it doesn't seem to buy you anything. I went to http://www.sha1-online.com/, entered ryan, and got ea3cd978650417470535f3a4725b6b5042a6ab59 as the hash. The original text would be much smaller to keep in RAM, which will help Redis. Obviously for longer paths the hash would be better, but your examples were very small. =)
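The size argument is easy to check (Python's hashlib here just for illustration):

```python
import hashlib

key = "ryan"
digest = hashlib.sha1(key.encode()).hexdigest()

# A hex-encoded SHA-1 digest is always 40 characters, so for short
# keys like this one, "hashing to save space" actually costs more RAM.
print(len(key), len(digest))  # prints: 4 40
```

Hashing only starts to pay off once the original key is longer than the digest.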
Write a tool to read from Redis and see how well it performs.
Profit!
Keep in mind that Redis needs to keep the entire data set in RAM, so plan accordingly.
I would suggest using some kind of key-value store (i.e. a hashing store), possibly along with hashing the key so it is shorter (something like SHA-1 would be OK IMHO).
I have a list of 1 million digits. Every time the user submits an input, I need to match it against the list.
As such, the list would have Write Once Read Many (WORM) characteristics.
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL database, but is it suitable for WORM? (UPDATE: using a VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would be taking too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question, you'll need to think about two things: are you trying to minimize storage space, or are you trying to minimize processing time?
Storing the data in memory will give you the fastest processing time, especially if you can optimize the data structure for your most common operation (in this case a lookup) at the cost of memory space. For persistence, you could store the data in a flat file and read it back during startup.
SQL databases are great for storing and reading relational data; for instance, names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data: constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data from a file during startup and store it in memory in a hash table (which offers constant-time lookups), or write a simple indexed-file accessor class that stores the data in a search-optimized order (worst case, a flat file).
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to worry about file-system storage, SQL is probably your best shot. You can optimize your table indexes, but it also adds another external dependency to your project.
EDIT: It seems MySQL has an ARCHIVE storage engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), then create it and store it as a binary file, and read the binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (an extra app dependency and design expertise); fast, but not as fast as holding the whole list in memory.
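The serialization option might look like this (the file location and sample entries are made up):

```python
import os
import pickle
import tempfile

lookup_path = os.path.join(tempfile.gettempdir(), "lookup.bin")

# Build once and write the binary image to disk ...
lookup = {"0412871", "9917465", "5550001"}   # stand-in for the real list
with open(lookup_path, "wb") as f:
    pickle.dump(lookup, f)

# ... then every startup just reads it back; membership tests are
# then O(1)-average hash lookups.
with open(lookup_path, "rb") as f:
    restored = pickle.load(f)
```

Deserializing a binary image is much faster than re-parsing a text file on every startup.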
If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list to it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick which has a "write protect" switch, but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk with one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore file-system cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work? Just store the numbers in a text file and read them into a hash set on application startup. On my computer, reading 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
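That "simplest thing" is only a few lines, sketched here in Python (the file contents are a small stand-in for the real 1M-line list):

```python
import os
import tempfile

numbers_file = os.path.join(tempfile.gettempdir(), "numbers.txt")

# Stand-in for the real one-number-per-line file.
with open(numbers_file, "w") as f:
    f.write("1234567\n7654321\n1111111\n")

# Startup: a single pass over the file builds the set.
with open(numbers_file) as f:
    numbers = {line.strip() for line in f}

def matches(user_input: str) -> bool:
    """Set membership is an O(1)-average hash lookup."""
    return user_input.strip() in numbers
```

The one-time load cost at startup buys constant-time matching for every request afterwards.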