How to prevent rsync compression on files with no extension? - shell

I'm using rsync to perform synchronisation between two machines on a network, so I have rsync's --compress setting enabled, however I have various file-types that I'm excluded that I know are already compressed such .jpg, .mp4 etc, using the --skip-compress option.
However, I have a large number of files with no extension that I know to have poor compression (due to encryption), as part of OS X's sparsebundle disk image format (where each "block" of the image is its own file with no file extension.
Anyway, I don't have many other files that should conflict, as other files that I have with no extension should be either excluded already or are quite small (so not really worth compressing).
However, I'm at a loss as to how I should add no extension files to rsync's --skip-compress list?

Going up one level: How much time are you saving with --skip-compress?
On a 0.5 megabyte/s network link, a 21 megabyte mp3 file with suffices mp3, txt and none I tried
--skip-compress="[]/gz/foo" and --skip-compress="gz//foo". I could not find a difference in the speed over 5 tries.

Related

How to compress file on HFS programmatically?

macOS HFS+ supports transparent filesystem-level compression. How can I enable this compression for certain files via a programmatic API? (e.g. Cocoa or C interface)
I'd like to achieve effect of ditto --hfsCompression src dst, but without shelling out.
To clarify: I'm asking how to make uncompressed file compressed. I'm not interested in reading or preserving existing HFS compression state.
There's afsctool which is an open source implementation of HFS+ compression. It was originally by hacker brkirch (macrumors forum link, as he still visits there) but has since been expanded and improved a great deal by #rjvb who is doing amazing things with it.
The copyfile.c file discloses some of the implementation details.
There's also a compression tool based on that: afsctool.
I think you're asking two different questions (and might not know it).
If you're asking "How can I make arbitrary file 'A' an HFS compressed file?" the answer is, you can,'t. HFS compressed files are created by the installer and (technically[1]) only Apple can create them.
If you are asking "How can I emulate the --hfsCompression logic in ditto such that I can copy an HFS compressed file from one HFS+ volume to another HFS+ volume and preserve its compression?" the answer to that is pretty straight forward, albeit not well documented.
HFS Compressed files have a special UF_COMPRESSED file flag. If you see that, the data fork of the file is actually an uncompressed image of a hidden resource. The compressed version of the file is stored in a special extended attribute. It's special because it normally doesn't appear in the list of attributes when you request them (so if you just ls -l# the file, for example, you won't see it). To list and read this special attribute you must pass the XATTR_SHOWCOMPRESSION flag to both the listxattr() and getxattr() functions.
To restore a compressed file, you reverse the process: Write an empty file, then restore all of its extended attributes, specifically the special one. When you're done, set the file's UF_COMPRESSED flag and the uncompressed data will magically appear in its data fork.
[1] Note: It's rumored that the compressed resource of a file is just a ZIPed version of the data, possibly with some wrapper around it. I've never taken the time to experiment, but if you're intent on creating your own compressed files you could take a stab at reverse-engineering the compressed extended attribute.

Undo a botched command prompt copy which concatenated all of my files

In a Windows 8 Command Prompt, I had a backup drive plugged in and I navigated to my User directory. I executed the command:
copy Documents G:/Seagate_backup/Documents
What I assumed was that copy would create the Documents directory on my backup drive and then copy the contents of the C: Documents directory into it. That is not what happened!
I proceeded to wipe my hard-drive and re-install the operating system, thinking I had backed up the important files, only to find out that copy seemingly concatenated all the C: Documents files of different types (.doc, .pdf, .txt, etc) into one file called "Documents." This file is of course unreadable but opening it in Notepad reveals what happened. I can see some of my documents which were plain text throughout the massively long file.
How do I undo this!!? It's terrible because I was actually helping a friend and was so sure of myself but now this has happened. The only thing I can think of doing is searching for some common separator amongst the concatenated files and write some sort of script to split the file back apart. But then I would have to guess the extensions of each of the pieces...
Merging files together in the fashion that copy uses, discards important file system information such as file size and file name. While the file name may not be as important the size is. Both parameters are used by the OS to discriminate files.
This problem might sound familiar if you have damaged your file allocation table before and all files disappeared. In both cases, you will end up with a binary blob (be it an actual disk or something like your file which might resemble a disk image) that lacks any size and filename information.
Fortunately, this is where a lot of file system recovery tools can help. They are specialized in matching patterns. Specifically they are looking for giveaway clues to what type a file is of, where it starts and what it's size is.
This is for instance enabled by many file types having a set of magic numbers that are used to allow a program to check if a file really is of the type that the extension claims to be.
In principle it is possible to undo this process more or less well.
You will need to use data recovery tools or other analysis tools like binwalk to extract the concatenated binary blob. Essentially the same tools that are used to recover deleted files should be able to extract your documents again. Without any filename of course. I recommend renaming the file to a disk image (.img) and either mounting it from within the operating system as a virtual harddisk (don't worry that it has no file system - it should show up as an unformatted drive) or directly using a data recovery tool or analysis tool which can read binary files (binwalk, for instance, can do that directly, but may not find all types of files as it's mainly for unpacking firmware images that may be assembled in the same or a similar way to how your files ended up).

compress large amount of small files in different folders

In our appliation,we save some images in different folders like:
1
2
3
4
...
500
...
And inside each folder there are large amount of images whose size is (5kb-20kb).
Now we found that when we try to transfer these files,we have to compress them first using the winrar,however it cost toooooo much time!! Also two hours to compress one parent folder.
In fact the images in the application are map images like the google map tiles:
||
So I wonder if there is an good idea to save/transfer these small but large amount files?
Images like that are likely to already be compressed so you will get little gain in bandwidth use (and so transfer speed) from the compression step.
If the compression process is taking along time where your CPU is busy then try instead just creating a plain tar file (which joins all the files into one archive without applying any compression). I don't know about winrar but most other compression tools (like 7zip) can generate a tar file, so I'm guessing winrar can too.
If you regularly transfer the whole set of files but only small numbers are added/changed each time, you might want to look into other transfer methods like rsync. You don't describe either of your environments so I can't tell if this is likely to be available to you, but if it is rsync does an excellent job of only transferring changes (speeding up the transfer significantly) and it also always uses one connection so you don't get hit by the per file latency of FTP and other protocols - one file follows the previous one down the same connection as if the parts being transferred had been tared together so you don't need that extra step to pack the files at one and (and unpack them at the other).
Those images are already compressed. However, to increase transfer speed, you might try using rar in 'archive' mode. This does the same thing as tar: concatenates all the files together into one big file. Don't use any compression in your archive format.
Maybe you can use a fast compression library like Snappy. However, it can only compress a single file, and you surely don't want to transfer each file separately. I'd create an uncompressed TAR archive for that.

How many files is most advised to have in a Windows folder (NTFS)?

we have a project that constitutes a large archive of image files...
We try to split them into sub-folders within the main archive folder.
Each sub-folder contains up to 2500 files in it.
For example:
C:\Archive
C:\Archive\Animals\
C:\Archive\Animals\001 - 2500 files...
C:\Archive\Animals\002 - 2300 files..
C:\Archive\Politics\
C:\Archive\Politics\001 - 2000 files...
C:\Archive\Politics\002 - 2100 files...
Etc... What would be the best way of storing files in such way under Windows ? and why exactly, please ... ?
Later on, the files have their EXIF metadata extracted and indexed for keywords, to be added into a Lucene index... (this is done by a Windows service that lives on the server)
We have an application where we try to make sure we don't store more than around 1000 files in a directory. Under Windows at least, we noticed extreme degradation in performance over this number. The folder can theoretically store up to 4,294,967,295 in Windows 7. Note that because the OS does a scan of the folder, doing lookups and lists very quickly degrades as you add many more files. Once we got to 100,000 files in a folder it was almost completely unusable.
I'd recommend breaking down the animals even further, perhaps by first letter of name. Same with the other files. This will let you separate things out more so you won't have to worry about the directory performance. Best advice I can give is to perform some stress tests on your system to see where the performance starts to tail off once you have enough files in a directory. Just be aware you'll need several thousand files to test this out.

Millions of small graphics files and how to overcome slow file system access on XP

I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files in to a single folder at a time, in some cases I need to create about 4.2 million tiles. Im running it on Windows XP using an NTFS filesystem, the disk is 500GB and was formatted using the default operating system options.
I'm finding the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or using the Command line then the whole machine effectively locks up for a number of minutes before it recovers enough to do something again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing piece of code which monitored the GMapCreator output folder, moving files into a directory heirarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful - as noted below I was amazed how much of a difference this made. Cheers :-)
There are several things you could/should do
Disable automatic NTFS short file name generation (google it)
Or restrict file names to use 8.3 pattern (e.g. i0000001.jpg, ...)
In any case try making the first six characters of the filename as unique/different as possible
If you use the same folder over and (say adding file, removing file, readding files, ...)
Use contig to keep the index file of the directory as less fragmented as possible (check this for explanation)
Especially when removing many files consider using the folder remove trick to reduce the direcotry index file size
As already posted consider splitting up the files in multiple directories.
.e.g. instead of
directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg
use
directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the number of entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem---Linux has the same behavior, last time I checked.
Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their name. e.g.
gbp97m.xls
was stored in
g/b/p97m.xls
This works fine provided your files are named appropriately (we had a spread of characters to work with). So the resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduced each directory to 100s of files and free the disk bottleneck.
One solution is to implement haystacks. This is what Facebook does for photos, as the meta-data and random-reads required to fetch a file is quite high, and offers no value for a data store.
Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.

Resources