Identical picture file has different size on different computers - filesize

I am developing IM software in which you can send pictures. Recently I ran into a weird problem: the same picture (same MD5 checksum), when picked to be sent through my software, is read with a different file size.
On my computer, the software reads the correct size of 7489 bytes, but on my customer's computer the size is 8700 bytes. Both machines run Windows 7 Premium, and both use the C stat() call to get the file size.
Does anybody know what's going on?

This is too long for a comment; I hope it reaches the quality of an answer. :)
Check the file system type and the allocation unit size on both machines as well. It is normal for the same file to have a different reported size on different file systems, and different software reports the size differently: some report the amount of useful data, others report the "size of the file on disk", which describes how much space it consumes and is always an integer multiple of the allocation unit size. My guess is that 7489 bytes is the size of the actual data and 8700 bytes is the size on disk. It is a bit unusual, since the allocation unit size has traditionally been a multiple of 512 bytes, but on SSDs that might not be a rule anymore. So I don't think you should worry about it: it is something like a rounding error, and if your checksums are the same then the files are identical.
PS: In Explorer, right click -> Properties shows both the actual size of the file and its size on disk, so you can compare the two.
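If you want to see both numbers from code rather than from Explorer, here is a minimal sketch, assuming a Windows/MSVC build; "picture.jpg" and "C:\" are placeholder names. It compares the logical size reported by stat() with that size rounded up to the volume's allocation unit:

#include <windows.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>

int main(void)
{
    /* Logical size: what stat() and most programs report. */
    struct _stat st;
    if (_stat("picture.jpg", &st) != 0)
        return 1;

    /* Allocation unit (cluster) size of the volume the file lives on. */
    DWORD sectorsPerCluster, bytesPerSector, freeClusters, totalClusters;
    if (!GetDiskFreeSpaceA("C:\\", &sectorsPerCluster, &bytesPerSector,
                           &freeClusters, &totalClusters))
        return 1;

    DWORD cluster = sectorsPerCluster * bytesPerSector;
    long long logical = st.st_size;
    long long onDisk  = ((logical + cluster - 1) / cluster) * cluster;

    printf("logical size:            %lld bytes\n", logical);
    printf("size on disk (rounded):  %lld bytes\n", onDisk);
    return 0;
}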

Related

Is there any limit on the number and size of rc entries embedded in a Windows resource DLL file?

We can embed resource files in a Windows DLL by defining them in an .rc file.
I am wondering whether there are any limits on how many rc resource entries may be stored in a DLL, or whether there is a limit on the file size of the DLL itself.
Will there be a significant performance difference when accessing a resource in a DLL that stores more than 30,000 resource items, compared to a DLL that has fewer than 1,000?
As for the maximum number of resources you can have: if you're asking about limits, you're probably doing something wrong. ;-) Seriously, 30K resources in a single DLL? That said, the maximum number of resources varies by resource type, IIRC, although I can't find anything in the MSDN documentation that specifies any limits.
For the second question (file size limit), it would be the same file size limit that applies to other files on the specific version of Windows. Win95 with FAT32, for instance, has a 4 GiB file size limit; NTFS, exFAT, and UDF have file size limits of 2^64 - 1. There's an MSDN article comparing File System Functionality that might help (it's where I got the file size limit information).
To your third question (performance differences): of course there is. Iterating a resource table (or multiple tables) that has 30K items will obviously take longer than doing the same with fewer than 1K items. There's also more overhead for file I/O with a larger file, although OS file caching would help mitigate that.
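If you want to measure the lookup cost yourself, a small sketch like the following enumerates the entries of one resource type in a module loaded as a data file; "big.dll" is a placeholder name and RT_RCDATA is just one example type. Wrap it in timing code to compare your two DLLs:

#include <windows.h>
#include <stdio.h>

/* Called once per resource name of the requested type. */
static BOOL CALLBACK CountResource(HMODULE hModule, LPCSTR lpType,
                                   LPSTR lpName, LONG_PTR lParam)
{
    (void)hModule; (void)lpType; (void)lpName;
    ++*(long *)lParam;          /* just count the entries */
    return TRUE;                /* keep enumerating */
}

int main(void)
{
    /* LOAD_LIBRARY_AS_DATAFILE avoids running DllMain of the resource DLL. */
    HMODULE mod = LoadLibraryExA("big.dll", NULL, LOAD_LIBRARY_AS_DATAFILE);
    if (!mod) return 1;

    long count = 0;
    EnumResourceNamesA(mod, (LPCSTR)RT_RCDATA, CountResource, (LONG_PTR)&count);
    printf("RT_RCDATA entries: %ld\n", count);

    FreeLibrary(mod);
    return 0;
}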

Large numbers of file handles to large files - potential problems?

Would keeping, say, 512 file handles to files sized 3GB+ open for the lifetime of a program (say a week or so) cause issues on 32-bit Linux? On Windows?
Potential workaround: How bad is the performance penalty of opening/closing file handles?
The size of the files doesn't matter. The number of file descriptors does, though. On Mac OS X, for example, the default limit is 256 open files per process, so your program would not be able to run.
I don't know about Linux, but on Windows, 512 files doesn't seem like that much to me. As a rule of thumb, though, anything beyond a thousand is too many. (Although I have to say that I haven't seen any program first-hand open more than, say, 50.)
And the cost of opening/closing handles isn't that big, unless you do it every time you want to read/write a small amount, in which case it is too high and you should buffer your data.
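On Linux and macOS you can check, and within the hard limit raise, the per-process descriptor limit mentioned above. A minimal sketch:

#include <stdio.h>
#include <sys/resource.h>

/* Print the current (soft) and maximum (hard) limit on open file
   descriptors, then try to raise the soft limit to 1024. */
int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    if (rl.rlim_cur < 1024 && 1024 <= rl.rlim_max) {
        rl.rlim_cur = 1024;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            perror("setrlimit");
    }
    return 0;
}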

Random access of multiple files and file caching

This relates to some software I've been given to "fix". The easiest and quickest solution would have it open and read 10 random files out of hundreds, extract some very short strings for processing, and immediately close them. Another process may come along right afterwards and do the same thing to different (or the same) random files, and this may happen hundreds of times in a few seconds.
I know modern operating systems keep those files in memory to a point, so disk thrashing isn't the issue it was in the past, but I'm looking for any articles or discussions about how to determine when all this opening/closing of many random files becomes a problem.
When your working set (the amount of data read by all your processes) exceeds your available RAM, your throughput will tend towards the I/O capacity of your underlying disk.
From your description of the workload, seek times will be more of a problem than data transfer rates.
When your working set size stays below the amount of RAM you have, the OS will keep all data cached and won't need to go to the disk after having its caches filled.

Why does fsutil.exe take less time to write a huge file to disk than doing it programmatically?

This question follows on from this topic:
creating a huge dummy file in a matter of seconds in c#
I just tried fsutil.exe on XP/Vista/7 to write a huge amount of dummy data to a storage disk, and it takes less time to write such a big file than doing it programmatically.
When I try to do the same thing with .NET, it takes considerably more time than fsutil.exe does.
Note: I know that .NET doesn't use native code, so I also checked the issue with the native API, like the following:
long int size = DiskFree('L' - 64);   // free space on drive L: ('L' - 64 = 12)
const char* full = "fulldisk.dsk";
__try {
    Application->ProcessMessages();
    HANDLE hf = CreateFile(full,
                           GENERIC_WRITE,
                           0,
                           0,
                           CREATE_ALWAYS,
                           0,
                           0);
    // Extend the file to the requested size and set EOF there.
    // Note: with the high-order part passed as 0, this is limited to < 2 GB.
    SetFilePointer(hf, size, 0, FILE_BEGIN);
    SetEndOfFile(hf);
    CloseHandle(hf);
}
__finally {
    ShowMessage("Finished");
    exit(0);
}
and the result was roughly equal to the .NET results.
But fsutil.exe takes less time than either of the approaches above; it is about 2 times faster.
Example:
writing 400 MB with .NET takes ~40 secs;
the same amount with fsutil.exe takes around 20 secs or less.
Is there any explanation for that?
Or which function does fsutil.exe use that gives it this significant write speed?
I don't know exactly what fsutil is doing, but I do know of two ways to write a large file that are faster than what you've done above (or seeking to the length you want and writing a zero, which has the same result).
The problem with those approaches is that they zero-fill the file at the time you do the write.
You can avoid the zero-fill by either:
Creating a sparse file. The size is marked where you want it, but the data doesn't actually exist on disk until you write it. All reads of the unwritten areas will return zeros.
Using the SetFileValidData function to set the valid data length without zeroing the file first. However, due to potential security issues, this call requires elevated permissions.
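Here is a minimal sketch of the SetFileValidData approach; the file name and size are placeholders, and the process token must hold the SeManageVolumePrivilege privilege, which in practice means running elevated:

#include <windows.h>
#include <stdio.h>

/* Enable SeManageVolumePrivilege in the current process token;
   SetFileValidData fails without it. */
static BOOL EnableManageVolumePrivilege(void)
{
    HANDLE tok;
    TOKEN_PRIVILEGES tp;
    if (!OpenProcessToken(GetCurrentProcess(),
                          TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &tok))
        return FALSE;
    if (!LookupPrivilegeValueA(NULL, "SeManageVolumePrivilege",
                               &tp.Privileges[0].Luid)) {
        CloseHandle(tok);
        return FALSE;
    }
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    BOOL ok = AdjustTokenPrivileges(tok, FALSE, &tp, 0, NULL, NULL)
              && GetLastError() == ERROR_SUCCESS;
    CloseHandle(tok);
    return ok;
}

int main(void)
{
    const LONGLONG size = 400LL * 1024 * 1024;   /* 400 MB, as in the question */

    if (!EnableManageVolumePrivilege()) {
        printf("could not enable SeManageVolumePrivilege\n");
        return 1;
    }

    HANDLE hf = CreateFileA("fulldisk.dsk", GENERIC_WRITE, 0, NULL,
                            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hf == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER li;
    li.QuadPart = size;
    SetFilePointerEx(hf, li, NULL, FILE_BEGIN);  /* extend to the target size */
    SetEndOfFile(hf);
    SetFileValidData(hf, size);                  /* skip the zero-fill */
    CloseHandle(hf);
    return 0;
}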
I agree with the last comment. I was experimenting with a minifilter driver and was capturing IRP_MJ_WRITE IRPs in a callback. When I create or write to a file from the command line or from a Win32 application, I can see the writes coming down. But when I created a file using the "fsutil file createnew ..." command, I don't see any writes. I'm seeing this behavior on Win2k8 R2 on an NTFS volume. And I don't think (not 100% sure, though) it is a sparse file either. It is probably setting the size properties in the MFT without allocating any clusters. fsutil does check the available free space, so if the file size is bigger than the free space on the disk, you get error 1.
I also ran a program sending FSCTL_GET_RETRIEVAL_POINTERS to the file and I got one extent for the entire size of the file, so I believe it does get all of its clusters allocated.
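For reference, here is roughly what such a query could look like; a minimal sketch in which the file name is a placeholder, and a fragmented file would need a larger output buffer plus a loop on ERROR_MORE_DATA:

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileA("fulldisk.dsk", GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    STARTING_VCN_INPUT_BUFFER in;
    in.StartingVcn.QuadPart = 0;                 /* start from the first VCN */

    /* Room for the first ~17 extents of the file. */
    union {
        RETRIEVAL_POINTERS_BUFFER rpb;
        BYTE raw[sizeof(RETRIEVAL_POINTERS_BUFFER) + 16 * 2 * sizeof(LARGE_INTEGER)];
    } out;
    DWORD returned = 0;

    if (DeviceIoControl(h, FSCTL_GET_RETRIEVAL_POINTERS,
                        &in, sizeof(in), &out, sizeof(out),
                        &returned, NULL)) {
        printf("extent count: %lu\n", (unsigned long)out.rpb.ExtentCount);
        if (out.rpb.ExtentCount > 0)
            printf("first extent: LCN %lld, next VCN %lld\n",
                   out.rpb.Extents[0].Lcn.QuadPart,
                   out.rpb.Extents[0].NextVcn.QuadPart);
    } else {
        printf("DeviceIoControl failed: %lu\n", GetLastError());
    }

    CloseHandle(h);
    return 0;
}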
This is something that could be:
- written in assembler (raw, blinding speed),
- native C/C++ code, or
- possibly an undocumented system call, or some trick that is not documented anywhere.
Any of the above three points could be a hugely significant factor. When you think about it, when .NET code is loaded it gets JIT'ted by the runtime (OK, the time factor would not be noticeable on a blazingly fast machine, but on a lowly Pentium the sluggish loading would be noticeable).
More than likely it was written in C/C++. It may surprise you if it turned out to be written in assembler.
You can check this for yourself: look at the file size of the executable and compare it to a .NET executable's. You might argue that the file is compressed, but I doubt it and would be inclined to rule that out; I don't reckon Microsoft would go that far with compressing executables.
Hope this answers your question,
Best regards,
Tom.
fsutil is only fast on NTFS and exFAT, not on FAT32 or FAT16.
This is because some file systems have an "initialized size" concept and thus support fast file initialization. This just reserves the clusters but does not zero them out, because the file system records that no data has been written to the file, and valid reads will simply return zero-filled buffers.

How are files copied at the low level?

I have a small question:
For example, I'm using the System.IO.File.Copy() method from the .NET Framework. This method is a managed wrapper around the CopyFile() function from the WinAPI. But how does the CopyFile function work? Does it interact with the HDD's firmware, or are some other operations performed through assembler, or something else entirely?
What does it look like from the highest level down to the lowest?
Better to start at the bottom and work your way up.
Disk drives are organized, at the lowest level, into a collection of sectors, tracks, and heads. Sectors are segments of a track; tracks are areas on the disk itself, traced out by the head's position as the platters spin underneath it; and the head is the actual element that reads the data from the platter.
Since tracks are measured based on the distance of the head from the center of the disk, you can see how a track towards the center of the disk is "shorter" than one at the outer edge of the disk.
Sectors are pieces of a track, typically of a fixed length. So, an inner track will hold fewer sectors than an outer track.
Much of this disk geometry is handled by the drive controllers themselves nowadays, though in the past this organization was managed directly by the operating systems and the disk drivers.
The drive electronics and disk drivers cooperate to try and represent the disk as a sequential series of fixed length blocks.
So, you can see that if you have a 10MB drive, and you use 512 byte disk blocks, then that drive would have a capacity of 20,480 "blocks".
This block organization is the foundation upon which everything else is built. Once you have this capability, you can tell the disk, via the disk driver and drive controller, to go to a specific block on the disk, and read/write that block with new data.
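You can see this abstraction even from user space on a Unix-like system, where a disk device is exposed as a big array of fixed-size blocks you can seek to and read. A minimal sketch; the device path is a placeholder and reading a raw device normally requires root:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 512

/* Read one 512-byte block, by block number, from a raw disk device. */
int main(void)
{
    unsigned char block[BLOCK_SIZE];
    long long blockNumber = 16;                       /* "go to block 16" */
    off_t offset = (off_t)blockNumber * BLOCK_SIZE;

    int fd = open("/dev/sdX", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    if (pread(fd, block, BLOCK_SIZE, offset) != BLOCK_SIZE) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("first byte of block %lld: 0x%02x\n", blockNumber, block[0]);
    close(fd);
    return 0;
}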
A file system organizes this heap of blocks into its own structure. The FS must track which blocks are being used, and by which files.
Most file systems have a fixed location "where they start", that is, some place they can go to at start-up to find out information about the disk layout.
Consider a crude file system that has no directories and supports files with 8-letter names and 3-letter extensions, plus 1 byte of status information and 2 bytes for the block number where the file starts on the disk. We can also assume that the system has a hard limit of 1024 files. Finally, it must know which blocks on the disk are in use. For that it uses 1 bit per block.
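Laid out as C structures, a directory entry in this toy file system might look like the following; purely illustrative, just matching the sizes described above:

#include <stdint.h>

#define MAX_FILES 1024

/* One 14-byte directory entry in the toy file system described above. */
#pragma pack(push, 1)
typedef struct {
    char     name[8];       /* 8-letter file name, space padded   */
    char     ext[3];        /* 3-letter extension                 */
    uint8_t  status;        /* 1 byte of status (e.g. 0 = unused) */
    uint16_t first_block;   /* block number where the file starts */
} DirEntry;                 /* 8 + 3 + 1 + 2 = 14 bytes           */
#pragma pack(pop)

/* The whole directory: 1024 entries * 14 bytes = 14,336 bytes. */
typedef struct {
    DirEntry entries[MAX_FILES];
} Directory;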
This information is commonly called the "file system metadata". When a disk is "formatted", nowadays it's simply a matter of writing new file system metadata. In the old days, it was a matter of actually writing sector marks and other information on blank magnetic media (commonly known as a "low level format"). Today, most drives already have a low level format.
For our crude example, we must allocate space for the directory and space for the "Table of Contents" (TOC), the data that says which blocks are being used.
We'll also say that the file system must start at block 16, so that the OS can use the first 16 blocks for, say, a "boot sector".
So, at block 16, we need to store 14 bytes (each file entry) * 1024 (number of files) = 14,336 bytes, or 14 KB. Divide that by 512 (the block size) and the directory takes 28 blocks. For the TOC bitmap, our 10MB drive has 20,480 blocks; 20,480 bits / 8 (bits per byte) = 2,560 bytes / 512 = 5 blocks.
Of the 20,480 blocks available on the disk, the file system metadata is therefore 33 blocks. Add in the 16 for the OS, and that is 49 blocks out of the 20,480, leaving 20,431 "free blocks".
Finally, each of the data blocks reserves the last 2 bytes to point to the next block in the file.
Now, to read a file, you look up the file name in the directory blocks. From there, you find the number of the first data block of the file. You read that data block and grab the last two bytes. If those two bytes are 00 00, that's the end of the file. Otherwise, take that number, load that data block, and keep going until the entire file has been read.
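Sketched in C against that toy layout (an in-memory "disk" array stands in for real block I/O), the read loop looks roughly like this:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE     512
#define DATA_PER_BLOCK (BLOCK_SIZE - 2)   /* last 2 bytes chain to the next block */
#define NUM_BLOCKS     20480              /* the 10MB example drive */

/* The "disk": an in-memory array of blocks. */
static uint8_t disk[NUM_BLOCKS][BLOCK_SIZE];

/* Follow the block chain starting at first_block, copying each block's
   payload into out. A next-block value of 00 00 marks the end of the file. */
static size_t read_file(uint16_t first_block, uint8_t *out, size_t out_cap)
{
    uint16_t current = first_block;
    size_t total = 0;

    while (current != 0 && total + DATA_PER_BLOCK <= out_cap) {
        const uint8_t *block = disk[current];
        memcpy(out + total, block, DATA_PER_BLOCK);
        total += DATA_PER_BLOCK;
        current = (uint16_t)(block[BLOCK_SIZE - 2] | (block[BLOCK_SIZE - 1] << 8));
    }
    return total;
}

int main(void)
{
    /* Toy setup: one file occupying blocks 100 -> 101 -> end. */
    memset(disk[100], 'A', DATA_PER_BLOCK);
    disk[100][BLOCK_SIZE - 2] = 101;                  /* next block = 101 */
    memset(disk[101], 'B', DATA_PER_BLOCK);           /* next block stays 00 00 */

    static uint8_t out[2 * DATA_PER_BLOCK];
    printf("read %zu bytes\n", read_file(100, out, sizeof(out)));
    return 0;
}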
The file system code hides the details of those trailing pointers and simply loads blocks into memory for use by the program. If the program does a read(buffer, 10000), you can see how this translates into reading several blocks of data from the disk until the buffer has been filled or the end of the file is reached.
To write a file, the system must first find a free slot in the directory. Once it has that, it finds a free block in the TOC bitmap. Finally, it writes the directory entry, sets its first block to the available block from the bitmap, toggles the bit in the bitmap, and then writes the data to the correct block. The system will buffer this information so that it ideally only has to write each block once, when it is full.
As it writes the blocks, it continues to consume bits from the TOC, and chains the blocks together as it goes.
Beyond that, a "file copy" is a simple process for a program leveraging the file system code and the disk drivers: the file copy simply reads a buffer in, fills it up, and writes the buffer out.
The file system has to maintain all of the metadata and keep track of where you are reading from, or writing to, in a file. For example, if you read only 100 bytes from a file, the system will obviously still need to read the entire 512-byte data block, and then "know" it's on byte 101 for when you try to read another 100 bytes from the file.
Also, I hope it's obvious that this is a really, really crude file system layout, with lots of issues.
But the fundamentals are there, and all file systems work in some manner similar to this; the details just vary greatly (most modern file systems don't have hard limits any more, as a simple example).
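As a concrete picture of that "read a buffer in, write the buffer out" copy, here is a minimal user-level sketch; the source and destination names are placeholders, and a real CopyFile also carries over metadata such as attributes and timestamps:

#include <stdio.h>
#include <stdlib.h>

/* Copy src to dst through a fixed-size buffer, the way any file copy
   ultimately works once the file system hides the block details. */
int copy_file(const char *src, const char *dst)
{
    FILE *in  = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    if (!in || !out) { if (in) fclose(in); if (out) fclose(out); return -1; }

    char buffer[64 * 1024];
    size_t n;
    while ((n = fread(buffer, 1, sizeof(buffer), in)) > 0) {
        if (fwrite(buffer, 1, n, out) != n) { fclose(in); fclose(out); return -1; }
    }
    fclose(in);
    fclose(out);
    return 0;
}

int main(void)
{
    /* "source.dat" and "dest.dat" are placeholder names. */
    return copy_file("source.dat", "dest.dat") == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}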
This is a question demanding a really long answer, but I'll try to keep it brief.
Basically, the .NET Framework wraps some "native" calls, calls that are processed in lower-level libraries. These lower-level calls are often wrapped in buffering logic to hide complicated things like synchronizing file contents from you.
Below that is the native level, interacting with the OS kernel. The kernel, the core of any operating system, then translates your high-level instruction into something your hardware can understand. Windows and Linux, for example, both use a hardware abstraction layer, a system that hides hardware-specific details behind a generic interface. Writing a driver for a specific device is then only a matter of implementing all the methods that device has to provide.
Before anything reaches your hardware, the file system gets involved, and the file system itself also buffers and caches a lot, but again transparently, so you don't even notice it. The last element in the call chain is the device itself, and again, most devices conform to some standard (like SATA or IDE) and can thus be driven in a similar manner.
I hope this helps :-)
The .NET framework invokes the Windows API.
The Windows API has functions for managing files across various file systems.
Then it depends on the file system in question. Remember, it's not necessarily a "normal" file system over an HDD; it could even be a shell extension that just emulates a drive and keeps the data in your Gmail account, or whatever. The point is that the same file manipulation functions in the Windows API are used as an abstraction over many possible lower layers of data.
So the answer really depends on the kind of file system you're interested in.
