I want to store a large number of media files in oracle. I believe I can store these files in the form of blobs using pl/sql procedure. However I want to make sure there is no impact to resolution / quality of the media file. Also are there any considerations that I need to account for to store media files in Oracle DB?
Storing and retrieving files as BLOBs does not impact image quality or resolution. Oracle treats them as opaque binary objects: what you store is exactly what you get back when you retrieve it.
Any such modifications are typically done in application-layer logic before the data is stored in the BLOB. For text-based files, compressing them before storage saves some disk space; for images, the resolution or dimensions are often reduced to shrink the file size. These are decisions taken while designing the application, as part of the application architecture, to reduce the overall storage requirement.
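If it helps to see the round trip, here is a minimal sketch using the python-oracledb driver. The table name, IDs, file name and connection details are assumptions for illustration only, not anything from your setup:

```python
# Sketch: store a media file as a BLOB and read it back unchanged.
# Assumes a table like: CREATE TABLE media_files (id NUMBER PRIMARY KEY, content BLOB)
import oracledb

with oracledb.connect(user="demo", password="demo", dsn="localhost/XEPDB1") as conn:
    with conn.cursor() as cur:
        with open("clip.mp4", "rb") as f:
            original = f.read()

        # Bind the raw bytes as a BLOB; Oracle stores them untouched.
        cur.setinputsizes(None, oracledb.DB_TYPE_BLOB)
        cur.execute("INSERT INTO media_files (id, content) VALUES (:1, :2)",
                    [1, original])
        conn.commit()

        # Read it back and confirm the bytes are identical.
        cur.execute("SELECT content FROM media_files WHERE id = :1", [1])
        lob = cur.fetchone()[0]
        assert lob.read() == original  # byte-for-byte, no loss of quality
```

The same holds if you load the BLOB from a PL/SQL procedure instead of a client driver: the payload is not altered on the way in or out.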
Also, consider whether this is going to be the right design for you. There are implications for storage requirements, application performance and scalability. There are several threads on Stack Overflow discussing the trade-offs of storing images in an RDBMS vs. NoSQL databases vs. the filesystem. The average file size also matters a lot.
Some links:
Storing images in NoSQL stores
NoSQL- Is it suitable for storing images?
Storing very big files in database
https://softwareengineering.stackexchange.com/questions/150669/is-it-a-bad-practice-to-store-large-files-10-mb-in-a-database
My current approach is that I have a few containers:
raw (the actual raw files or exports, separated into folders like servicenow-cases, servicenow-users, playvox-evaluations, etc.)
staging (lightly transformed raw data)
analytics (these are Parquet file directories which consolidate and partition the files)
visualization (we use a 3rd party tool which syncs with Azure Blob, but only CSV files currently. This is almost the exact same as the analytics container)
However, it could also make some sense to create more containers and kind of use them like I would use a database schema. For example, one container for ServiceNow data, another for LogMeIn data, another for our telephony system, etc.
Is there any preferred approach?
Based on your description, it seems you are torn between using a small number of containers that each hold a large number of blobs, or a large number of containers that each hold a small number of blobs. If your only concern is parallelism and scalability, you can rest assured and simply design whatever storage structure suits you, because partitioning in Azure Blob Storage is done at the blob level, not at the container level.
Each of these two approaches has its advantages and disadvantages.
With a small number of containers, you save the cost of creating containers (container creation is a billed operation). On the other hand, when you list the blobs in a container, every object in it is returned; if you then have to narrow the results to a subset (for example, one virtual folder), you need extra filtering or further calls, so listing is less efficient than with the many-container approach. Also, any security boundary you set applies to all blobs in that container, which is not necessarily what you want.
With a large number of structured containers, each container can carry its own security boundary (custom access permissions, SAS signatures for access control), and listing blobs is straightforward because there are no messy subsets to filter out. The disadvantage is the extra cost of creating containers (in extreme cases this can add up, although in general it does not matter; you can estimate it with the Azure pricing calculator: https://azure.microsoft.com/en-us/pricing/calculator/?cdn=disable).
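For what it's worth, virtual folders inside a single container are easy to work with from the SDK by filtering on a name prefix. A small sketch with the azure-storage-blob Python package (the container and prefix names are just examples taken from the question, and the connection string is a placeholder):

```python
# Sketch: treat "folders" inside one container as a blob-name prefix filter.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
raw = service.get_container_client("raw")

# List only the ServiceNow case exports, ignoring the other "folders".
for blob in raw.list_blobs(name_starts_with="servicenow-cases/"):
    print(blob.name, blob.size)
```

With one container per source system you would instead call get_container_client once per system; the listing calls themselves behave the same either way.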
I am using Realm as the database solution for my app. I need persistent storage ability for my images so I can load them when offline. I also need a cache so I can load the images from there rather than fetching them from the API each time a cell draws them. My first thought was that a Realm database could serve both of these functions just fine if I were to store the images in Realm as NSData. But I have found two answers on SE (here and here) that recommend not doing this if you have many images of a largish size that will change often. Instead they recommend saving the images to disk, and then storing the URL to those images in Realm.
My question is: why is this best practice? The answers linked to above don't give reasons why, except to say that you end up with a bloated database. But why is that a problem? What is the difference between having lots of images in my database vs having lots of images on disk?
Is it a speed issue? If so, is there a marked speed difference in an app being able to access an image from disk to being able to access it from a database solution like Realm?
Thanks in advance.
This isn't really just a problem localised to Realm. I remember the same advice being given with Core Data too.
I'm guessing the main reason above all else as to why storing large binary data in a database isn't recommended is because 'You don't gain anything, and actually stand to lose more than you otherwise would'.
With Core Data (i.e. databases backed by SQLite), you'll actually take a performance hit as the data will be copied into memory when you perform the read from SQLite. If it's a large amount of data, then this is wholly unacceptable.
With Realm at least, since it uses a zero-copy, memory-mapped mechanism, you'll be provided with the NSData mapped straight from the Realm file, but then again, this is absolutely no different than if you simply loaded the image file from disk itself.
Where this becomes a major problem in Realm is when you start changing the image often. Realm uses an internal snapshotting mechanism when working with changing data across threads, which essentially means that during an operation, entire sets of data might be periodically duplicated on disk (to ensure thread safety). If those data sets include large blobs of binary data, the blobs get duplicated too (which can also mean a performance hit). When this happens, the size of the Realm file on disk grows to accommodate the snapshots, but when the operation completes and the snapshots are deleted, the file does not shrink back to its original size. Reclaiming that disk space would be a costly operation, and since the space could easily be needed again (e.g. by another large snapshotting operation), it would be inefficient to reclaim it pre-emptively (hence the 'bloat').
It's possible to manually perform an operation to reclaim this disk space if necessary, but the generally recommended approach is to optimise your code to minimise this from happening in the first place.
So, to sum that all up, while you totally can save large data blobs to a database, over time, it'll potentially result in performance hits and file size bloat that you could have otherwise avoided. These sorts of databases are designed to help transform small bits of data to a format that can be saved to and retrieved from disk, so it's essentially wasted on binary files that could easily be directly saved without any modification.
It's usually much easier, cleaner and more efficient to store your large binary data on disk, and keep just a file name reference to it inside the database. :)
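To make the pattern concrete (this is deliberately not Realm-specific; it's a Python/SQLite stand-in with made-up names, just to show "bytes on disk, file name in the database"):

```python
# Sketch: write the image bytes to disk, keep only the file name in the DB.
import sqlite3
import uuid
from pathlib import Path

IMAGE_DIR = Path("images")
IMAGE_DIR.mkdir(exist_ok=True)

db = sqlite3.connect("app.db")
db.execute("CREATE TABLE IF NOT EXISTS photos (id TEXT PRIMARY KEY, file_name TEXT)")

def save_photo(image_bytes: bytes) -> str:
    photo_id = str(uuid.uuid4())
    file_name = f"{photo_id}.jpg"
    (IMAGE_DIR / file_name).write_bytes(image_bytes)   # the big blob lives on disk
    db.execute("INSERT INTO photos VALUES (?, ?)", (photo_id, file_name))
    db.commit()
    return photo_id

def load_photo(photo_id: str) -> bytes:
    (file_name,) = db.execute(
        "SELECT file_name FROM photos WHERE id = ?", (photo_id,)).fetchone()
    return (IMAGE_DIR / file_name).read_bytes()         # the DB row stays tiny
```

In the Realm/iOS case the idea is identical: write the NSData to the app's documents or caches directory and persist only the relative file name in the Realm object.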
I am storing text files in Azure Blob Storage. The files will be on the order of 1MB, but I could theoretically reduce that size by perhaps 30% at the cost of significantly increasing my application logic complexity.
I'm leaning toward just using the simpler but larger files. However, I wanted to know what factor blob size would have on retrieval time. Is it negligible or could there be a significant difference? I'm retrieving from a web server directly within the same datacenter as the blobs.
Also, is any compression automatically applied to blobs being sent within the datacenter? (As text files with lots of repeated content, they would compress very well.)
Thanks for any pointers!
It depends on your end-to-end usage scenario, but for simply uploading text files to Azure Blob Storage I would suggest going with the simpler but larger ~1MB files, as the difference after reducing the size will probably be negligible.
You can also take a look at the Azure Storage Scalability and Performance Targets - https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
Considering the network speed within the datacenter, there would not be any issues with 1MB text files, or at least any differences would be negligible.
Also, since you don't want to increase the app's complexity to handle this data, it is better not to attempt compressed file transfers at all; 1MB is fine to transfer without any compression. Otherwise you would end up writing extra logic just to decompress the received files.
See this thread for compression details: Does Windows Azure Blob Storage support serving compressed files similar to Amazon S3?
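As far as I know the blob service does not compress data for you, but if you ever do decide compression is worth the complexity, the usual pattern is to store gzipped content and set Content-Encoding so clients can decompress transparently. A hedged sketch with the azure-storage-blob Python SDK (file, container and blob names are placeholders):

```python
# Sketch: upload a pre-compressed text blob with Content-Encoding: gzip.
import gzip
from azure.storage.blob import BlobClient, ContentSettings

text = open("report.txt", "rb").read()   # ~1MB of repetitive text compresses well
blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="texts", blob_name="report.txt.gz")

blob.upload_blob(
    gzip.compress(text),
    overwrite=True,
    content_settings=ContentSettings(content_type="text/plain",
                                     content_encoding="gzip"),
)
```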
We have an app that requires very fast access to columns from delimited flat files, without knowing in advance what columns will be in the files. My initial approach was to store each column in a MongoDB document as an array, but that does not scale as the files get bigger, because I hit the 16MB document limit.
My second approach is to split the file column-wise and essentially treat the columns as blobs that can be served off disk to the client app. I'd intuitively think that storing the location in the database and the files on disk is the best approach, and that storing them in a database as blobs (or in Mongo GridFS) adds unnecessary overhead; however, there may be advantages that are not apparent to me at the moment. So my question is: what would be the advantage of storing them as blobs in a database such as Oracle or Mongo, and are there any databases that are particularly well suited to this task?
Thanks,
Vackar
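For comparison, the GridFS route mentioned in the question is only a couple of calls with pymongo; a minimal sketch (database, file and metadata names below are purely illustrative):

```python
# Sketch: store one column's data in GridFS and stream it back.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["columns_demo"]
fs = gridfs.GridFS(db)

with open("file42_col_007.bin", "rb") as f:
    file_id = fs.put(f, filename="file42/col_007", source="exports/file42.csv")

column_bytes = fs.get(file_id).read()   # GridFS splits the data into chunks internally
```

GridFS sidesteps the 16MB document limit by chunking, at the cost of the extra read path through the database; serving the same bytes straight from disk avoids that hop.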
I have a list of 1 million digits. Every time the user submits an input, I need to match the input against the list.
As such, would the list have Write Once Read Many (WORM) characteristics?
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL database, but is it suitable for WORM? (UPDATE: using a VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would be taking too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question you'll need to decide which of two things matters more to you:
Are you trying to minimize storage space, or are you trying to minimize processing time?
Storing the data in memory will give you the fastest processing time, especially if you optimize the data structure for your most common operation (in this case a lookup) at the cost of memory space. For persistence, you could store the data to a flat file and read it in during startup.
SQL databases are great for storing and reading relational data. For instance, names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data: constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant lookups), or write a simple indexed file accessor class that stores the data in a search optimized order (worst case a flat file).
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
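As a concrete sketch of the in-memory approach (the file name is hypothetical and the file is assumed to hold one entry per line):

```python
# Sketch: load the list once at startup, answer membership queries from a set.
def load_numbers(path: str = "numbers.txt") -> set[str]:
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

NUMBERS = load_numbers()   # ~1M short strings fits comfortably in RAM

def matches(user_input: str) -> bool:
    return user_input.strip() in NUMBERS   # average O(1) hash lookup
```

In a web application you would do the load once at process startup (module import or app factory), so each request only pays for the hash lookup.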
If you're concerned about speed and don't want to deal with filesystem storage, SQL is probably your best shot. You can optimize your table indexes, but it also adds another external dependency to your project.
EDIT: Seems MySQL have an ARCHIVE Storage Engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. ARCHIVE is a write-once, read-many storage engine, designed for historical data. It compresses data by up to 90%. It does not support indexes. In version 5.1, the ARCHIVE engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), create the list and store it as a binary file, then read that binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (an extra app dependency and design expertise); fast, but not as fast as holding the whole list in memory.
If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list to it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick which has a "write protect" switch, but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk, with one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore filesystem cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work? Just store the numbers in a text file and read them into a hash set on application startup. On my computer, reading in 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
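If you want to sanity-check those figures on your own machine, a rough timing harness might look like this (the exact numbers will obviously vary with hardware):

```python
# Sketch: time loading 1M numbers from a text file and doing random lookups.
import random
import time

# Generate a test file with 1M random 10-digit numbers, one per line.
values = [str(random.randrange(10**9, 10**10)) for _ in range(1_000_000)]
with open("numbers.txt", "w") as f:
    f.write("\n".join(values))

start = time.perf_counter()
with open("numbers.txt") as f:
    lookup = {line.strip() for line in f}
print(f"load: {time.perf_counter() - start:.2f}s")

probes = random.choices(values, k=1_000_000)
start = time.perf_counter()
hits = sum(probe in lookup for probe in probes)
elapsed = time.perf_counter() - start
print(f"lookups: {1_000_000 / elapsed:,.0f} per second ({hits} hits)")
```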