We have an app that requires very fast access to columns from delimited flat files, without knowing in advance what columns will be in the files. My initial approach was to store each column in a MongoDB document as an array, but this does not scale: as the files get bigger I hit the 16 MB document limit.
My second approach is to split the file column-wise and essentially treat the columns as blobs that can be served off disk to the client app. I'd intuitively think that storing the locations in a database and the files on disk is the best approach, and that storing them in a database as blobs (or in Mongo GridFS) adds unnecessary overhead; however, there may be advantages that are not apparent to me at the moment. So my question is: what would be the advantage of storing them as blobs in a database such as Oracle or MongoDB, and are there any databases that are particularly well suited to this task?
Thanks,
Vackar
Related
I want to store a large number of media files in Oracle. I believe I can store these files as BLOBs using a PL/SQL procedure. However, I want to make sure there is no impact on the resolution or quality of the media files. Are there any other considerations I need to account for when storing media files in an Oracle DB?
Storing and retrieving files as BLOBs does not impact image quality or resolution. Oracle treats them as binary objects: what you store is what you get back when you retrieve.
Typically, any such modifications are done in application-layer logic before the data is stored in a BLOB. For text-based files, compressing them before storage saves some disk space; for images, the resolution or image size is typically reduced to shrink the file, and so on. These are decisions taken while designing the application, as part of the application architecture, to reduce the overall storage requirement.
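As a minimal sketch of the compress-before-store idea (in Python, using the standard-library sqlite3 module as a stand-in for whatever RDBMS you actually use; the files table and its columns are hypothetical):

    import gzip
    import sqlite3

    conn = sqlite3.connect("blobs.db")
    conn.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, payload BLOB)")

    def store_compressed(conn, name, path):
        # Compress the text file before storing it as a BLOB.
        with open(path, "rb") as f:
            payload = gzip.compress(f.read())
        conn.execute("INSERT INTO files (name, payload) VALUES (?, ?)", (name, payload))
        conn.commit()

    def load_compressed(conn, name):
        # What you store is what you get back, so decompress on the way out.
        row = conn.execute("SELECT payload FROM files WHERE name = ?", (name,)).fetchone()
        return gzip.decompress(row[0])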
Also, consider whether this is going to be the right design for you. There are implications in terms of storage requirements, application performance, and scalability. There are several threads on Stack Overflow discussing the advantages of storing images in RDBMSs vs. NoSQL databases vs. filesystems. The average file size also matters a lot.
Some links:
Storing images in NoSQL stores
NoSQL- Is it suitable for storing images?
Storing very big files in database
https://softwareengineering.stackexchange.com/questions/150669/is-it-a-bad-practice-to-store-large-files-10-mb-in-a-database
I am using Realm as the database solution for my app. I need persistent storage ability for my images so I can load them when offline. I also need a cache so I can load the images from there rather than fetching them from the API each time a cell draws them. My first thought was that a Realm database could serve both of these functions just fine if I were to store the images in Realm as NSData. But I have found two answers on SE (here and here) that recommend not doing this if you have many images of a largish size that will change often. Instead they recommend saving the images to disk, and then storing the URL to those images in Realm.
My question is: why is this best practice? The answers linked to above don't give reasons why, except to say that you end up with a bloated database. But why is that a problem? What is the difference between having lots of images in my database vs. having lots of images on disk?
Is it a speed issue? If so, is there a marked speed difference between an app accessing an image from disk and accessing it from a database solution like Realm?
Thanks in advance.
This isn't really just a problem localised to Realm. I remember the same advice being given with Core Data too.
I'm guessing the main reason above all else as to why storing large binary data in a database isn't recommended is because 'You don't gain anything, and actually stand to lose more than you otherwise would'.
With Core Data (i.e. databases backed by SQLite), you'll actually take a performance hit as the data will be copied into memory when you perform the read from SQLite. If it's a large amount of data, then this is wholly unacceptable.
With Realm at least, since it uses a zero-copy, memory-mapped mechanism, you'll be provided with the NSData mapped straight from the Realm file, but then again, this is absolutely no different than if you simply loaded the image file from disk itself.
Where this becomes a major problem in Realm is when you start changing the image often. Realm uses an internal snapshotting mechanism when working with changing data across threads, which essentially means that during an operation, entire sets of data might be periodically duplicated on disk (to ensure thread safety). If the data sets include large blobs of binary data, these will get duplicated too (which might also mean a performance hit). When this happens, the size of the Realm file on disk is increased to accommodate the snapshots, but when the operation completes and the snapshots are deleted, the file does not shrink back to its original size. This is because reclaiming that disk space would be a costly performance hit, and since it's quite possible the space will be needed again (i.e. by another large snapshotting operation), it would be inefficient to reclaim it pre-emptively (hence the 'bloat').
It's possible to manually perform an operation to reclaim this disk space if necessary, but the generally recommended approach is to optimise your code so this happens as little as possible in the first place.
So, to sum that all up: while you totally can save large data blobs to a database, over time it will potentially result in performance hits and file size bloat that you could have otherwise avoided. These sorts of databases are designed to transform small bits of data into a format that can be saved to and retrieved from disk, so that machinery is essentially wasted on binary files that could just as easily be saved directly, without any modification.
It's usually much easier, cleaner and more efficient to store your large binary data on disk, and keep only a file-name reference to it inside the database. :)
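A minimal sketch of that pattern (in Python, with the standard-library sqlite3 module standing in for Realm, since the idea is the same in any database; the images directory and table are hypothetical):

    import os
    import sqlite3
    import uuid

    IMAGE_DIR = "images"  # hypothetical on-disk location for the binary data

    conn = sqlite3.connect("app.db")
    conn.execute("CREATE TABLE IF NOT EXISTS images (file_name TEXT PRIMARY KEY)")

    def save_image(image_bytes):
        # Write the binary data to disk; only the file name goes in the database.
        os.makedirs(IMAGE_DIR, exist_ok=True)
        name = uuid.uuid4().hex + ".jpg"
        with open(os.path.join(IMAGE_DIR, name), "wb") as f:
            f.write(image_bytes)
        conn.execute("INSERT INTO images (file_name) VALUES (?)", (name,))
        conn.commit()
        return name

    def load_image(name):
        with open(os.path.join(IMAGE_DIR, name), "rb") as f:
            return f.read()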
I am loading files into Redshift with the COPY command, using a manifest. The files are in S3. Unfortunately, there are about 2,000 files per table, so it's like
users1.csv.gz, users2.csv.gz, users3.csv.gz, users4.csv.gz, etc
I don't know whether that matters, because the files are loaded with a manifest, and the manifest is supposed to parallelize the load. That said, loading a table is really slow, and I need to speed it up.
What are some things I could do to speed this up?
In my case, I was importing lots of small tables (~100 tables of less than 1k rows each). In this case, adding the following options did help:
COMPUPDATE OFF
and
STATUPDATE OFF
documentation for COPY
documentation for COMPUPDATE
documentation for STATUPDATE
Keep in mind that this does skip automatic compression and stats update. Refer to the documentation for the exact consequences of this.
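For reference, a sketch of what such a load might look like (in Python via psycopg2, which works with Redshift; the cluster endpoint, credentials, bucket, and IAM role below are all placeholders):

    import psycopg2

    # Hypothetical connection details; substitute your cluster's endpoint.
    conn = psycopg2.connect(
        host="my-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dev", user="awsuser", password="...",
    )

    copy_sql = """
        COPY users
        FROM 's3://my-bucket/manifests/users.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        MANIFEST GZIP CSV
        COMPUPDATE OFF
        STATUPDATE OFF;
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)  # skips automatic compression analysis and stats update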
If each user*.csv.gz file is very small, Redshift might be spending some compute effort on uncompressing them. If they are small, consider uploading the CSV files directly, without compressing.
If you want only specific columns from the CSV, you can use a column list to ignore the others. The link below describes column lists:
https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-column-mapping.html#copy-column-list
You may disable the COMPUPDATE option during load if it is unnecessary.
Is the table empty, or does it already contain data? If it contains data, run the VACUUM and ANALYZE commands before/after the load. VACUUM and ANALYZE are time-consuming activities as well; if there is a sort key and the data in your CSV is already in the same sorted order, those operations should be faster.
Define relevant sort keys, which have an impact on disk I/O and columnar compression, and load data in sort key order: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key-order.html
Define relevant distribution styles, which will distribute data across multiple slices and will impact disk I/O across the cluster.
Specify compression types (encodings) for columns, which reduce disk size and, subsequently, disk I/O (see the sketch below).
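A sketch of the table-design advice above (Python with psycopg2 again; the table, columns, encodings, and keys are illustrative only, not a recommendation for your schema):

    import psycopg2

    conn = psycopg2.connect(  # placeholder endpoint and credentials
        host="my-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dev", user="awsuser", password="...",
    )
    conn.autocommit = True  # VACUUM cannot run inside a transaction block

    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE users (
            user_id    BIGINT      ENCODE az64,
            first_name VARCHAR(64) ENCODE lzo,
            last_name  VARCHAR(64) ENCODE lzo,
            created_at TIMESTAMP   ENCODE az64
        )
        DISTSTYLE KEY
        DISTKEY (user_id)
        SORTKEY (created_at);
    """)
    # ... run the COPY here, in sort key order if possible ...
    cur.execute("VACUUM users;")
    cur.execute("ANALYZE users;")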
Could you share the numbers: how many records in total, and how long does the load currently take?
Hope the above points help.
My scenario is as follows. I have a data table with a million rows of tuples (say first name and last name), and a client that needs to retrieve a small subset of rows whose first name or last name begins with the query string. Caching this seems like a catch-22, because:
On the one hand, I can't store and retrieve the entire data set on every request (would overwhelm the network)
On the other hand, I can't just store each row individually, because then I'd have no way to run a query.
Storing ranges of values in the cache, with a local "index" or directory, would work... except that you'd have to essentially duplicate the data for each index, which defeats the purpose of even using a distributed cache.
What approach is advisable for this kind of thing? Is it possible to get the benefits of using a distributed cache, or is it simply not feasible for this kind of scenario?
Distributed caching is feasible for queryable data sets.
For this scenario, though, a native database function or stored procedure would likely give much faster results. If broader cache scopes, such as session or application scope, are not possible, then a lot of server-side iteration would be required to fetch the data for each request.
Re-implementing the database's indexing on the application server is never a good idea.
If you still have network issues, you could move to a document-oriented or column-oriented NoSQL database, if that is feasible.
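If you do stick with a distributed cache, one common pattern is to cache each small result set under its query prefix, rather than caching the whole table. A minimal sketch (in Python with redis-py; query_db is a hypothetical function that runs the real prefix query against the database):

    import json
    import redis

    r = redis.Redis()  # assumes a reachable Redis instance
    TTL_SECONDS = 300  # let entries expire so the cache loosely tracks the table

    def lookup_by_prefix(prefix, query_db):
        key = "name-prefix:" + prefix.lower()
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        # e.g. WHERE first_name LIKE 'prefix%' OR last_name LIKE 'prefix%'
        rows = query_db(prefix)
        r.setex(key, TTL_SECONDS, json.dumps(rows))
        return rows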
I have a list of 1 million digits. Every time the user submits an input, I need to match the input against the list.
As such, the list has Write Once Read Many (WORM) characteristics.
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL database, but is it suitable for WORM? (UPDATE: using a VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would take too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question, you'll need to think about the trade-off you are making: are you trying to minimize storage space, or to minimize processing time?
Storing the data in memory will give you the fastest processing time, especially if you optimize the data structure for your most common operation (in this case, a lookup), at the cost of memory space. For persistence, you can store the data in a flat file and read it in during startup.
SQL databases are great for storing and reading relational data. For instance, names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? Each access carries a lot of overhead: constructing the query, building the query plan, executing the query plan, etc. And since the data is a flat list, an index would essentially just duplicate the values you are storing; without one, every data access is a table scan.
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant-time lookups), or write a simple indexed file accessor class that stores the data in a search-optimized order (worst case, a flat file).
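A minimal sketch of the first option (Python; numbers.txt is a hypothetical file with one entry per line):

    # Load the list once at startup; membership tests on a set are O(1).
    def load_numbers(path):
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    numbers = load_numbers("numbers.txt")

    def matches(user_input):
        return user_input.strip() in numbers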
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to deal with filesystem storage yourself, SQL is probably your best shot. You can optimize your table indexes, but it also adds another external dependency to your project.
EDIT: It seems MySQL has an ARCHIVE storage engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. ARCHIVE is a write-once, read-many storage engine, designed for historical data. It compresses data by up to 90%. It does not support indexes. In version 5.1, the ARCHIVE engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app): create the list, store it as a binary file, and read that file on application startup (see the sketch after this list). Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (an extra app dependency and design expertise); fast, but not as fast as holding the whole list in memory.
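A minimal sketch of the serialization option (Python, using the standard-library pickle module; the file name is hypothetical):

    import pickle

    def save_lookup(numbers, path="lookup.pkl"):
        # Build the lookup structure once and persist it as a binary file.
        with open(path, "wb") as f:
            pickle.dump(set(numbers), f, protocol=pickle.HIGHEST_PROTOCOL)

    def load_lookup(path="lookup.pkl"):
        # Fast reload on application startup.
        with open(path, "rb") as f:
            return pickle.load(f)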
If you're concerned about tampering, buy a writable DVD (or a CD, if you can find a store that still carries them...), write the list to it, and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick that has a "write protect" switch, but they are hard to come by, and the security isn't as good as with a CD/DVD.
Next, write each digit to a file on that disk, one entry per line. When you need to match the numbers, just open the file, read each line, and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore filesystem cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work: store the numbers in a text file and read them into a hash set on application startup. On my computer, reading in 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
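A rough sketch of that measurement (Python; the generated data is a stand-in for the real list, and absolute numbers will of course vary by machine and language):

    import time

    numbers = {str(i) for i in range(1_000_000)}        # stand-in for the real list
    probes = [str(i) for i in range(500_000, 1_500_000)]  # half hits, half misses

    start = time.perf_counter()
    hits = sum(p in numbers for p in probes)
    elapsed = time.perf_counter() - start
    print(f"{len(probes) / elapsed:,.0f} lookups/second ({hits} hits)")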