How to represent a bunch of files as one file on Windows for direct reading

I'd like to make one file that represents (links to) a bunch of files, something like what a named pipe does on Linux. The motivation is to avoid concatenating the files (I don't want to create a new file when I already have the originals and want to keep them), so no data gets duplicated. For example, I want to use this to load videos from a camera that are split into parts of approximately 2 GB each.

Related

Load multiple graphml files into JanusGraph

I have 2 heavy graphml files (which is why I don't want to combine them unless absolutely necessary).
Additionally, the node IDs are consistent between the two files, and the first file contains no references to any node from the second file.
Would there be a way to load the first file into JanusGraph, and then load the second as an addition to the first? (If it needs a little reformatting, it is not an issue, I can process the files as I want.)
If it isn't possible that way, how can I load big amounts of data into JanusGraph?
It doesn't seem as though there is a way to load multiple graphml files into JanusGraph. That being said, you can use custom Groovy scripts to load data from csv, txt, ... files.
This is easier and lets you handle large amounts of data split into smaller files. (One way to proceed would be to use one file per type of node / type of relationship, which makes the process relatively easy.)
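For example, here is a minimal sketch of such a loading script, written with gremlinpython rather than Groovy and assuming a Gremlin Server endpoint at ws://localhost:8182/gremlin; the CSV file names and column names (people.csv, knows.csv, id, name, src, dst) are hypothetical:

    # Hypothetical sketch: one CSV per node type and one per relationship type,
    # loaded through JanusGraph's Gremlin Server endpoint.
    import csv
    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
    g = traversal().withRemote(conn)

    # Vertices: one row per node, keyed by an external id column.
    with open('people.csv', newline='') as f:
        for row in csv.DictReader(f):
            g.addV('person').property('ext_id', row['id']) \
             .property('name', row['name']).iterate()

    # Edges: look both endpoints up by the external id.
    with open('knows.csv', newline='') as f:
        for row in csv.DictReader(f):
            g.V().has('person', 'ext_id', row['src']).as_('a') \
             .V().has('person', 'ext_id', row['dst']) \
             .addE('knows').from_('a').iterate()

    conn.close()

Since the node IDs are consistent between your two files, the second batch can be loaded the same way on top of the first.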

Write multiple files atomically

Suppose I have a folder with a few files (images, texts, whatever); it only matters that there are multiple files and that the folder is rather large (> 100 MB). Now I want to update five files in this folder, but I want to do this atomically. Normally I would just create a temporary folder, write everything into it, and if that succeeds, replace the existing folder. But because I/O is expensive, I don't really want to go this way (resaving hundreds of files just to update five seems like a huge overhead). So how am I supposed to write these five files atomically? Note that I want the writing of all the files to be atomic as a whole, not each file separately.
You could adapt your original solution:
Create a temporary folder full of hard links to the original files.
Save the five new files into the temporary folder.
Delete the original folder and move the folder of hard links in its place.
Creating a few links should be speedy, and it avoids rewriting all the files.
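A rough sketch of those three steps in Python, assuming everything lives on one NTFS volume (hard links don't work across volumes) and using made-up folder and file names; note the final swap is two renames, so it is fast but not a single atomic operation:

    # Build a staging folder out of hard links, overwrite the five files there,
    # then swap the folders. Paths and file names below are hypothetical.
    import os
    import shutil

    SRC = r'C:\data\album'            # existing large folder
    TMP = r'C:\data\album.staging'    # staging folder full of hard links
    OLD = r'C:\data\album.old'        # original parked here before deletion

    os.makedirs(TMP)
    for name in os.listdir(SRC):
        os.link(os.path.join(SRC, name), os.path.join(TMP, name))  # link, not copy

    for name in ['a.png', 'b.png', 'c.png', 'd.png', 'e.png']:      # the five updated files
        dst = os.path.join(TMP, name)
        if os.path.exists(dst):
            os.remove(dst)            # drop the link so we don't write through to the original
        shutil.copy2(os.path.join(r'C:\incoming', name), dst)

    os.rename(SRC, OLD)               # the swap: two quick renames
    os.rename(TMP, SRC)
    shutil.rmtree(OLD)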

compress large amount of small files in different folders

In our application, we save some images in different folders, like:
1
2
3
4
...
500
...
And inside each folder there is a large number of images, each 5 KB-20 KB in size.
Now we have found that when we try to transfer these files, we have to compress them first using WinRAR; however, it takes far too much time (two hours to compress one parent folder).
In fact the images in the application are map images, like Google Maps tiles.
So I wonder if there is a good way to save/transfer this large number of small files?
Images like that are likely to already be compressed so you will get little gain in bandwidth use (and so transfer speed) from the compression step.
If the compression process is taking a long time and keeping your CPU busy, try instead just creating a plain tar file (which joins all the files into one archive without applying any compression). I don't know about WinRAR, but most other compression tools (like 7-Zip) can generate a tar file, so I'm guessing WinRAR can too.
If you regularly transfer the whole set of files but only a small number are added or changed each time, you might want to look into other transfer methods like rsync. You don't describe either of your environments, so I can't tell whether this is likely to be available to you, but if it is, rsync does an excellent job of transferring only the changes (speeding up the transfer significantly). It also uses a single connection, so you don't get hit by the per-file latency of FTP and other protocols: one file follows the previous one down the same connection, as if the transferred parts had been tarred together, so you don't need that extra step to pack the files at one end (and unpack them at the other).
Those images are already compressed. However, to increase transfer speed, you might try using rar in 'archive' mode. This does the same thing as tar: concatenates all the files together into one big file. Don't use any compression in your archive format.
Maybe you can use a fast compression library like Snappy. However, it can only compress a single file, and you surely don't want to transfer each file separately. I'd create an uncompressed TAR archive for that.
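A small sketch of that uncompressed-archive idea with Python's tarfile module, assuming the numbered tile folders sit in the current directory; mode 'w' writes a plain tar with no compression:

    # Pack the numbered folders into one plain (uncompressed) tar for transfer.
    import tarfile

    with tarfile.open('tiles.tar', 'w') as tar:   # 'w' = no compression
        for folder in range(1, 501):              # folders named 1 .. 500
            tar.add(str(folder))                  # adds each folder recursively

    # Receiving side:
    # with tarfile.open('tiles.tar') as tar:
    #     tar.extractall('tiles')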

Joining large files into one humongous file

Is there a way in Windows to link multiple files together without having to open the target file and read the contents of the source files to append them to the target file? Something like a shell link api?
Background
I have up to 8 separate processes creating parts of a data file that I want to recombine into one large file.
A less radical solution that should work just fine.
system("copy filefragment.1+filefragmenent.2+filefragment.3+....+filefragment.8 outputfile.bin");
No simple way that I know of. But here's a radical idea.
Use a virtual file system (Dokan, EldoS CBFS, Pismo Technic, etc..) to emulate one logical file that is actually backed by separate files on disk.
I have up to 8 separate processes creating parts of a data file that I want to recombine into one large file.
How do you want them concatenated? Mixed or one after the other?
If you want them mixed, you can just open() your output file and write() to it from your threads. If you want them one after the other, your best bet is to write to separate files and join them together at the end.
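If the fragments do end up as separate files, joining them at the end is just a streamed append. A minimal sketch, with hypothetical file names matching the copy example above:

    # Concatenate the eight fragments into one output file, in order.
    import shutil

    with open('outputfile.bin', 'wb') as out:
        for i in range(1, 9):                          # fragments 1 .. 8
            with open('filefragment.%d' % i, 'rb') as part:
                shutil.copyfileobj(part, out)          # chunked copy, low memory use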

Millions of small graphics files and how to overcome slow file system access on XP

I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files into a single folder at a time, and in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP using an NTFS filesystem; the disk is 500 GB and was formatted using the default operating system options.
I'm finding that the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or from the command line, the whole machine effectively locks up for a number of minutes before it recovers enough to do something again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing a piece of code which monitored the GMapCreator output folder, moving files into a directory hierarchy based upon their filenames; so a file named abcdefg.gif would be moved to \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful; as noted below, I was amazed how much of a difference this made. Cheers :-)
There are several things you could/should do:
Disable automatic NTFS short file name generation (google it)
Or restrict file names to use the 8.3 pattern (e.g. i0000001.jpg, ...)
In any case try making the first six characters of the filename as unique/different as possible
If you reuse the same folder over and over (say adding files, removing files, re-adding files, ...):
Use contig to keep the directory's index file as unfragmented as possible (check this for an explanation)
Especially when removing many files, consider using the folder remove trick to reduce the directory index file size
As already posted consider splitting up the files in multiple directories.
e.g. instead of
directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg
use
directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
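A quick sketch of reshuffling an existing flat folder into that layout (bucketing by the second and third characters of the filename, as in the example above); the folder name is hypothetical:

    # Move every file in a flat folder into <dir>/<2nd char>/<3rd char>/<file>.
    import os
    import shutil

    src = 'directory'                                   # hypothetical flat folder
    for name in os.listdir(src):
        path = os.path.join(src, name)
        if not os.path.isfile(path) or len(name) < 3:
            continue                                    # skip subfolders and very short names
        dst_dir = os.path.join(src, name[1], name[2])   # e.g. abc.jpg -> directory/b/c/
        os.makedirs(dst_dir, exist_ok=True)
        shutil.move(path, os.path.join(dst_dir, name))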
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem; Linux has the same behavior, last time I checked.
Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their name. e.g.
gbp97m.xls
was stored in
g/b/p97m.xls
This works fine provided your files are named appropriately (we had a spread of characters to work with). The resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduce each directory to hundreds of files and relieve the disk bottleneck.
One solution is to implement haystacks. This is what Facebook does for photos, as the metadata overhead and random reads required to fetch a file are quite costly and offer no value for a data store.
Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.
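The core idea can be sketched in a few lines (this is only an illustration of the append-and-index pattern, not Facebook's actual on-disk format): append every image to one store file and keep just an offset/size pair per key in memory, so a read is one seek plus one read.

    # Toy haystack: one big append-only store file plus an in-memory index.
    import os

    class Haystack:
        def __init__(self, path):
            self.f = open(path, 'a+b')       # append-only data file
            self.index = {}                  # key -> (offset, size)

        def put(self, key, data):
            self.f.seek(0, os.SEEK_END)
            offset = self.f.tell()
            self.f.write(data)               # appended at the end of the store
            self.index[key] = (offset, len(data))

        def get(self, key):
            offset, size = self.index[key]
            self.f.seek(offset)
            return self.f.read(size)         # single seek + read per image

    store = Haystack('photos.store')
    store.put('abcdefg.gif', b'...image bytes...')
    print(len(store.get('abcdefg.gif')))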
