Ruby PStore file too large

I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2 GB, from what I can see) I am no longer able to write it to disk and I receive the following error:
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO, but I was wondering whether there is a workaround. For example, to read large JSON files I would first split the file and then read it in parts. The definitive solution is probably to switch to a proper database backend, but because of some limitations of the specific Ruby environment I am using (SketchUp) this is not always possible.

I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file named with those chars, like rojj-datafile.pstore, and add the entry there. Any record with a name starting 'rojj' goes in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your reading/writing code will have to take this new file structure into account, and processing the data into the PStores will undoubtedly be slower.
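Something like this rough Ruby sketch, assuming each record has a name that can act as a crude key (the base directory, helper names and record contents are just illustrative):

require 'pstore'
require 'fileutils'

BASE_DIR = 'simulation_data'   # illustrative location

# Route each record to a PStore chosen from the first characters of its
# name, e.g. "rojj42" -> simulation_data/r/rojj-datafile.pstore
def bucket_store(name, prefix_len = 4)
  prefix = name[0, prefix_len].downcase
  dir    = File.join(BASE_DIR, prefix[0])
  FileUtils.mkdir_p(dir)
  PStore.new(File.join(dir, "#{prefix}-datafile.pstore"))
end

def write_record(name, record)
  bucket_store(name).transaction { |store| store[name] = record }
end

def read_record(name)
  bucket_store(name).transaction(true) { |store| store[name] }
end

write_record('rojj42', { energy: 1.23, steps: 10_000 })
p read_record('rojj42')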

Related

How to create a partially modifiable binary file format?

I'm creating my own custom binary file format.
I use the RIFF standard for encoding the data, and it seems to work pretty well.
But there are some additional requirements:
Binary files can be large, up to 500 MB.
Data is saved into the binary file in real time, at intervals, whenever data in the application changes.
The application may run in the browser.
The problem I face is that when I want to save data, everything has to be read from memory and the whole binary file rewritten.
This isn't a problem while the data is small, but as it grows, the real-time saving feature does not scale.
So the main requirements for this binary file are:
Able to partially read the binary file (because the file is huge)
Able to partially write changed data into the file without rewriting the whole file
A streaming protocol like .m3u8 is not an option; we can't split the file into chunks and point to them with separate URLs.
Any guidance on how to design a binary file format that scales in this scenario?
There is an answer from a random user that has been deleted here.
It seems great to me.
You can claim your answer back and I'll delete this one.
He said:
If we design the file to support appending, then we are able to add whatever data we want without needing to rewrite the whole file.
This idea gives me a very good starting point.
So I can append more and more changes at the end of the file.
Then mark old chunks of data in the middle of the file as obsolete.
I can then reuse these obsolete data slots later if I want to.
The downside is that I need to clean up the obsolete slots whenever I get a chance to rewrite the whole file.
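A rough Ruby sketch of that append-only idea, using a simple chunk layout of my own (4-byte id, 4-byte length, 1-byte obsolete flag, then the payload) rather than the asker's actual RIFF structure: an update appends a new chunk and flips the old chunk's flag in place, so the whole file is never rewritten.

class ChunkLog
  HEADER_FORMAT = 'a4VC'   # id (4 bytes), payload length (uint32 LE), obsolete flag (uint8)

  def initialize(path)
    @io = File.open(path, File::RDWR | File::CREAT | File::BINARY)
  end

  # Write a new chunk at the end of the file and return its offset.
  def append(id, payload)
    @io.seek(0, IO::SEEK_END)
    offset = @io.pos
    @io.write([id, payload.bytesize, 0].pack(HEADER_FORMAT) + payload)
    offset
  end

  # Flip the obsolete flag of an existing chunk in place.
  def mark_obsolete(offset)
    @io.seek(offset + 8)   # skip id (4) + length (4) to reach the flag
    @io.write([1].pack('C'))
  end

  def update(old_offset, id, payload)
    new_offset = append(id, payload)
    mark_obsolete(old_offset)
    new_offset
  end
end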

Get file offset on disk/cluster number

I need to get any information about where the file is physically located on the NTFS disk. Absolute offset, cluster ID... anything.
I need to scan the disk twice: once to get the allocated files, and a second time where I open the partition directly in raw mode and try to find the rest of the data (from deleted files). I need a way to recognise that the data I find in the raw scan is the same data I have already handled as a file. Since I'm scanning the disk in raw mode, the offset of the data I find should somehow be convertible to the offset within a file (given information about the disk geometry). Is there any way to do this? Other solutions are accepted as well.
Now I'm playing with FSCTL_GET_NTFS_FILE_RECORD, but I can't make it work at the moment and I'm not really sure it will help.
UPDATE
I found the following function
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364952(v=vs.85).aspx
It returns a structure that contains the nFileIndexHigh and nFileIndexLow members.
The documentation says:
The identifier that is stored in the nFileIndexHigh and nFileIndexLow members is called the file ID. Support for file IDs is file system-specific. File IDs are not guaranteed to be unique over time, because file systems are free to reuse them. In some cases, the file ID for a file can change over time.
I don't really understand what this is. I can't connect it to the physical location of the file. Is it possible to extract this file ID from the MFT later?
UPDATE
Found this:
This identifier and the volume serial number uniquely identify a file. This number can change when the system is restarted or when the file is opened.
This doesn't satisfy my requirements, because I'm going to open the file, and the fact that the ID might change doesn't make me happy.
Any ideas?
Use the Defragmentation IOCTLs. For example, FSCTL_GET_RETRIEVAL_POINTERS will tell you the extents which contain file data.

How to do large file integrity check

I need to do an integrity check for a single big file. I have read about the SHA code for Android, but it needs a second file to hold the resulting digest. Is there another method that uses only a single file?
I need a simple and quick method. Can I merge the two files into a single file?
The file is binary and the file name is fixed. I can get the file size using fstat. My problem is that I can only have one single file. Maybe I should use a CRC, but it would be very slow because it is a large file.
My objective is to ensure the file on the SD card is not corrupt. I write it on a PC and read it on an embedded platform. The file is around 200 MB.
You have to store the hash somehow, no way around it.
You can try writing it to the file itself (at the beginning or end) and skip it when performing the integrity check. This can work for things like XML files, but not for images or binaries.
You can also put the hash in the filename, or just keep a database of all your hashes.
It really all depends on what your program does and how it's set up.
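Whether appending to the file itself is acceptable depends on the format, but here is a small Ruby sketch of that option, assuming a raw SHA-256 digest appended as the last 32 bytes that the reader knows to skip (and the file is sealed only once):

require 'digest'

DIGEST_SIZE = 32   # raw SHA-256 digest

# Append the digest of the current contents to the end of the file.
def seal(path)
  digest = Digest::SHA256.file(path).digest
  File.open(path, 'ab') { |f| f.write(digest) }
end

# Hash everything except the trailing digest and compare.
def intact?(path)
  data_size = File.size(path) - DIGEST_SIZE
  sha    = Digest::SHA256.new
  stored = nil
  File.open(path, 'rb') do |f|
    remaining = data_size
    while remaining > 0
      chunk = f.read([remaining, 1 << 20].min)   # stream in 1 MB pieces
      sha.update(chunk)
      remaining -= chunk.bytesize
    end
    stored = f.read(DIGEST_SIZE)
  end
  sha.digest == stored
end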

Are there alternatives for creating large container files that are cross platform?

Previously, I asked the question.
The problem is the demands of our file structure are very high.
For instance, we're trying to create a container with up to 4,500 files and 500 MB of data.
The file structure of this container consists of
SQLite DB (under 1mb)
Text based xml-like file
Images inside a dynamic folder structure that make up the rest of the 4,500ish files
After the initial creation, the image files are read-only, with the exception of deletion.
The small db is used regularly when the container is accessed.
Tar, Zip and the like are all too slow (even with zero compression). Slow is subjective, I know, but untarring a container of this size takes over 20 seconds.
Any thoughts?
As you seem to be doing arbitrary file system operations on your container (say, creation, deletion of new files in the container, overwriting existing files, appending), I think you should go for some kind of file system. Allocate a large file, then create a file system structure in it.
There are several options for the file system: for both Berkeley UFS and Linux ext2/ext3, user-mode libraries are available. It might also be possible to find a FAT implementation somewhere. Make sure you understand the structure of the file system, and pick one that allows for extension - I know that ext2 is fairly easy to extend (by adding another block group), and FAT is difficult to extend (you need to append to the FAT).
Alternatively, you can put a virtual disk format underneath the file system, allowing arbitrary remapping of blocks. Then "free" blocks of the file system don't need to appear on disk, and you can make the virtual disk much larger than the real container file will be.
Three things.
1) What Timothy Walters said is right on; I'll go into more detail.
2) 4,500 files and 500 MB of data is simply a lot of data and disk writes. If you're operating on the entire dataset, it's going to be slow. That's just an I/O truth.
3) As others have mentioned, there's no detail on the use case.
If we assume a read only, random access scenario, then what Timothy says is pretty much dead on, and implementation is straightforward.
In a nutshell, here is what you do.
You concatenate all of the files into a single blob. While you are concatenating them, you track their file names, lengths, and the offset at which each file starts within the blob. You write that information out into a block of data, sorted by name. We'll call this the Table of Contents, or TOC block.
Next, you concatenate the two pieces together. In the simple case, you have the TOC block first, then the data block.
When you wish to get data from this format, search the TOC for the file name, grab the offset from the beginning of the data block, add in the TOC block size, and read FILE_LENGTH bytes of data. Simple.
If you want to be clever, you can put the TOC at the END of the blob file. Then append, at the very end, the offset to the start of the TOC. To read, you lseek to the end of the file, back up 4 or 8 bytes (depending on your number size), take THAT value and lseek even farther back to the start of your TOC. Then you're back to square one. You do this so you don't have to rebuild the archive twice at the beginning.
If you lay out your TOC in blocks (say 1 KB in size), then you can easily perform a binary search on the TOC. Simply fill each block with file information entries, and when you run out of room, write a marker, pad with zeroes and advance to the next block. To do the binary search, you already know the size of the TOC: start in the middle, read the first file name, and go from there. Soon you'll find the right block, and then you read the block in and scan it for the file. This makes reading efficient without having the entire TOC in RAM. The other benefit is that the blocking requires less disk activity than a chained scheme like TAR (where you have to crawl the archive to find something).
I suggest you pad the files to block sizes as well; disks like to work with regularly sized blocks of data, and this isn't difficult either.
Updating this without rebuilding the entire thing is difficult. If you want an updatable container system, then you may as well look in to some of the simpler file system designs, because that's what you're really looking for in that case.
As for portability, I suggest you store your binary numbers in network order, as most standard libraries have routines to handle those details for you.
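Here is a rough Ruby sketch of that blob + table-of-contents layout, with the TOC written after the data and its offset stored in the last 8 bytes; numbers are in big-endian (network) order as suggested, the entry layout is just an illustration, and it assumes file names shorter than 256 bytes:

# Pack: write each file's bytes, then the TOC (sorted by name), then the
# 8-byte offset of the TOC as the very last thing in the file.
# TOC entry: 1-byte name length, name, 8-byte offset, 8-byte length.
def pack_blob(blob_path, file_paths)
  toc = []
  File.open(blob_path, 'wb') do |out|
    file_paths.sort_by { |p| File.basename(p) }.each do |path|
      offset = out.pos
      data   = File.binread(path)
      out.write(data)
      toc << [File.basename(path), offset, data.bytesize]
    end
    toc_offset = out.pos
    toc.each do |name, offset, length|
      out.write([name.bytesize, name, offset, length].pack('Ca*Q>Q>'))
    end
    out.write([toc_offset].pack('Q>'))
  end
end

# Read: recover the TOC offset from the end, scan the TOC for the name,
# then seek straight to the data.
def read_from_blob(blob_path, wanted)
  File.open(blob_path, 'rb') do |io|
    toc_end = File.size(blob_path) - 8
    io.seek(toc_end)
    toc_offset = io.read(8).unpack1('Q>')
    io.seek(toc_offset)
    while io.pos < toc_end
      name_len = io.read(1).unpack1('C')
      name     = io.read(name_len)
      offset, length = io.read(16).unpack('Q>Q>')
      next unless name == wanted
      io.seek(offset)
      return io.read(length)
    end
  end
  nil
end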
Working on the assumption that you're only going to need read-only access to the files, why not just merge them all together and have a second "index" file (or an index in the header) that tells you the file name, start position and length? All you need to do is seek to the start point and read the correct number of bytes. The method will vary depending on your language, but it's pretty straightforward in most of them.
The hardest part then becomes creating your data file + index, and even that is pretty basic!
An ISO disk image might do the trick. It should be able to hold that many files easily, and is supported by many pieces of software on all the major operating systems.
First, thank you for expanding your question; it helps a lot in providing better answers.
Given that you're going to need a SQLite database anyway, have you looked at the performance of putting it all into the database? My experience is based around SQL Server 2000/2005/2008 so I'm not positive of the capabilities of SQLite but I'm sure it's going to be a pretty fast option for looking up records and getting the data, while still allowing for delete and/or update options.
Usually I would not recommend putting files inside the database, but given that the total size of all images is around 500 MB for roughly 4,500 images, you're looking at a little over 100 KB per image, right? If you're using a dynamic path to store the images, then in a slightly more normalized database you could have an "ImagePaths" table that maps each path to an ID; you can then look for images with that PathID and load the data from the BLOB column as needed.
The XML file(s) could also go in the SQLite database, which gives you a single 'data file' for your app that can move between Windows and OS X without issue. You can simply rely on your SQLite engine to provide the performance and compatibility you need.
How you optimize it depends on your usage, for example if you're frequently needing to get all images at a certain path then having a PathID (as an integer for performance) would be fast, but if you're showing all images that start with "A" and simply show the path as a property then an index on the ImageName column would be of more use.
I am a little concerned, though, that this sounds like premature optimization. You really need to find a solution that works 'fast enough', abstract the mechanics of it so your application (or both apps, if you have both Mac and PC versions) uses a simple repository or similar, and then you can change the storage/retrieval method at will without any implications for your application.
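For what it's worth, here is a small Ruby sketch of that layout using the sqlite3 gem; the table and column names follow the suggestion above, but the exact schema is only an assumption:

require 'sqlite3'

db = SQLite3::Database.new('container.db')
db.execute_batch <<~SQL
  CREATE TABLE IF NOT EXISTS ImagePaths (
    PathID INTEGER PRIMARY KEY,
    Path   TEXT UNIQUE NOT NULL
  );
  CREATE TABLE IF NOT EXISTS Images (
    ImageID   INTEGER PRIMARY KEY,
    PathID    INTEGER NOT NULL REFERENCES ImagePaths(PathID),
    ImageName TEXT NOT NULL,
    Data      BLOB NOT NULL
  );
  CREATE INDEX IF NOT EXISTS idx_images_name ON Images(ImageName);
SQL

def store_image(db, path, name, bytes)
  db.execute('INSERT OR IGNORE INTO ImagePaths (Path) VALUES (?)', [path])
  path_id = db.get_first_value('SELECT PathID FROM ImagePaths WHERE Path = ?', path)
  db.execute('INSERT INTO Images (PathID, ImageName, Data) VALUES (?, ?, ?)',
             [path_id, name, SQLite3::Blob.new(bytes)])
end

# All images under one path:
#   db.execute('SELECT ImageName, Data FROM Images WHERE PathID = ?', [path_id])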
Check Solid File System - it seems to be what you need.

Generate File Names Automatically without collision

I'm writing a "file sharing hosting" and I want to rename all the files when uploading to a unique name and somehow keep track of the names on the database. Since I don't want two or more files having same name (which is surely impossible), I'm looking for an algorithm which based on key or something generates random names for me.
Moreover, I don't want to generate a name and search the database to see if the file already exists. I want to make sure 100% or 99% that the generated filename has never been created earlier by my application.
Any idea how I can write such application?
You could produce a hash based on the file contents itself. There are two good reasons to do this:
Allows you to never store the same file twice - for example, if you have two copies of a music file which are identical in content you could check to see if you have already stored that file, and just store it once.
You separate meta-data (file name is just meta data) from the blob. So you would have a storage system which is indexed by the hash of the file contents, and you then associate the file meta-data with that hash lookup code.
The risk of finding two files that compute to the same hash but don't actually have the same contents would be low, depending on the size of the hash, and you can effectively mitigate that by perhaps hashing the file in chunks (which could then lead to some interesting storage optimisation scenarios :P).
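A rough Ruby sketch of that content-addressed idea, with SHA-256 as the hash and an in-memory hash standing in for the meta-data store (paths and names are illustrative):

require 'digest'
require 'fileutils'

STORE_DIR = 'store'   # illustrative location

# Store the blob under its content hash; `index` maps the original
# (meta-data) name to that hash. Identical content is stored only once.
def store_file(upload_path, original_name, index)
  hash = Digest::SHA256.file(upload_path).hexdigest
  dest = File.join(STORE_DIR, hash)
  FileUtils.mkdir_p(STORE_DIR)
  FileUtils.cp(upload_path, dest) unless File.exist?(dest)
  index[original_name] = hash
  hash
end

# index = {}   # in practice this mapping would live in the database
# store_file('/tmp/upload123', 'My Song.mp3', index)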
GUIDs are one way. You're basically guaranteed not to get any repeats (if you have a proper random generator).
You could also append the time since the epoch.
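In Ruby that is a one-liner with SecureRandom (a small sketch of both suggestions):

require 'securerandom'

name            = SecureRandom.uuid
name_with_epoch = "#{Time.now.to_i}-#{SecureRandom.uuid}"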
The best solutions have already been mentioned. I just want to add some thoughts.
The simplest solution is to have a counter and increment it for every new file. This works quite well as long as only one thread creates new files. If multiple threads, processes or even systems add new files, things get a bit more complicated. You must coordinate the creation of new ids with locking or a similar synchronisation method. You could also assign id ranges to each process to reduce the synchronisation work, or extend the file id with a unique process id.
A better solution might be to use GUIDs in this scenario, so you do not have to care about synchronisation between processes.
Finally, you can add some random data to every identifier to make them harder to guess, if that is a requirement.
Also common is storing files in a directory structure where the location of a file depends on its name. File abcdef1234.xyz might be stored as /ab/cd/ef/1234.xyz. This avoids directories with a huge number of files. I am not really sure why this is done - maybe file system limitations, maybe performance issues - but it is quite common. I do not know if similar things are common when the files are stored directly in the database.
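A small Ruby sketch of that directory scheme (the split widths and base directory are just illustrative):

require 'fileutils'

# "abcdef1234.xyz" -> uploads/ab/cd/ef/1234.xyz
def sharded_path(base_dir, filename)
  shards = filename[0, 6].scan(/../)   # ["ab", "cd", "ef"]
  rest   = filename[6..-1]             # "1234.xyz"
  File.join(base_dir, *shards, rest)
end

path = sharded_path('uploads', 'abcdef1234.xyz')
FileUtils.mkdir_p(File.dirname(path))  # creates uploads/ab/cd/ef
# File.binwrite(path, data)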
The best way is to simply use a counter. The first file is 1, the next is 2, another is 3, and so on...
But it seems you want random. To do this quickly, you could make sure that each new random number is greater than the last file created. You can cache the last file name and just offset your random number from it:
file = last_file + rand(1..10)  # assuming numeric file names
