Calculate the total space consumption of specific files in unix terminal - filesize

I have a folder containing .tcb and .tch files. I need to know the total size of all the .tcb files taken together, and likewise the total size of all the .tch files.
I did it like this:
1) I created a temp folder and then:
mv *tch temp
2) and then:
du -sk temp
I found the command on the Internet, and Wikipedia says this: "du (abbreviated from disk usage) is a standard Unix program used to estimate the file space usage". I think the reason it says "estimate" is that if there are links, the size of the link will be shown instead of the size of the linked file.
But if I do
ls -l
in the temp folder (which contains all the *.tch files) and then sum up the sizes displayed in the terminal, I get a different total. Why is that the case?
In short, what I need is a command that shows me the real total size of all the .tch files in a folder which may also contain other file types.
I hope anyone can help me with that. Thanks a lot!

You can use the -L option to du if you want to follow symbolic links (that is, calculate the size of the link target, not of the link itself). You can also use the -c option to display a grand total at the end.
Armed with those options, try du -skLc *.tch.
For more details on du, see its manpage.

Look at the specific man page for your version of du as they vary considerably in how they count.
"Approximate" can be because:
Either blocks used or bytes used can be reported. Block counts overstate the size of files that aren't exact multiples of the block size, but they more accurately represent "space used that I can't use for other stuff".
Unix files can have "holes" created by seeking a long way and writing. The OS doesn't actually allocate space for the skipped holes.
Symbolic links may or may not be dereferenced to the real file they point to.
If you just want the byte count, use wc -c *.tcb
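If you want to see the blocks-versus-bytes difference for yourself, here is a minimal C sketch (POSIX assumed; the file name is hypothetical) that creates a file with a hole. ls -l and wc -c will report roughly 1 MB for it, while du will report only a few KB, because the hole is never actually allocated.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("sparse_demo.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;
    lseek(fd, 1024 * 1024, SEEK_SET);        /* seek past a 1 MB "hole" */
    write(fd, "x", 1);                       /* one real byte at the end */
    close(fd);

    struct stat sb;
    stat("sparse_demo.dat", &sb);
    printf("apparent size: %lld bytes\n", (long long)sb.st_size);         /* ~1 MB */
    printf("allocated:     %lld bytes\n", (long long)sb.st_blocks * 512); /* a few KB */
    return 0;
}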

Related

True in-place file editing using GNU tools

I have a very large (multiple gigabytes) file that I want to do simple operations on:
Add 5-10 lines in the end of the file.
Add 2-3 lines in the beginning of the file.
Delete a few lines in the beginning, up to a certain substring. Specifically, I need to traverse the file up to a line that says "delete me!\n" and then delete all lines in the file up to and including that line.
I'm struggling to find a tool that can do the editing in place, without creating a temporary file (very long task) that has essentially a copy of my original file. Basically, I want to minimize the number of I/O operations against the disk.
Both sed -i and awk -i do exactly that slow thing (https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands) and are inefficient as a result. What's a better way?
I'm on Debian.
Adding 5-10 lines at the beginning of a multi-GB file will always require fully rewriting the contents of that file, unless you're using an OS and filesystem that provides nonstandard syscalls. (You can avoid needing multiple GB of temporary space by writing back to a point in the file you're modifying from which you've already read to a buffer, but you can't avoid needing to rewrite everything past the point of the edit).
This is because UNIX only permits adding new contents to a file in a manner that changes its overall size at or past its existing end. You can edit part of a file in-place -- that is to say, you can seek 1GB in and write 1MB of new contents -- but this changes the 1MB of contents that had previously been in that location; it doesn't change the total size of the file. Similarly, you can truncate and rewrite a file at a location of your choice, but everything past the point of truncation needs to be rewritten.
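To make the distinction concrete, here is a minimal sketch (POSIX assumed; the appended lines are placeholders) of the one part of your task that really is cheap and in place: appending the trailing lines with O_APPEND. There is no prepend equivalent, which is why the other two edits force a rewrite of everything past the edit point (short of the nonstandard operations below).

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *tail = "appended line 1\nappended line 2\n";  /* hypothetical content */
    int fd = open("bigfile", O_WRONLY | O_APPEND);            /* "bigfile" as in the question */
    if (fd < 0) return 1;
    write(fd, tail, strlen(tail));   /* grows the file at its end; nothing is rewritten */
    close(fd);
    return 0;
}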
Examples of the nonstandard operations referred to above are FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE, which on very new Linux kernels allow blocks to be inserted into or removed from an existing file in place (see the sketch after this list). This is unlikely to be helpful to you here:
Only whole blocks (i.e. 4 KB, or whatever your filesystem is formatted for) can be inserted or removed, not individual lines of text of arbitrary size.
Only XFS and ext4 are supported.
See the documentation for fallocate(2).
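For completeness, here is a hedged sketch of what the collapse operation looks like, assuming a recent Linux kernel and an ext4 or XFS filesystem. The block size and the number of blocks to drop are hypothetical and must be worked out for your file; both the offset and the length have to be block-aligned.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("bigfile", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    off_t block = 4096;          /* hypothetical; query the real block size, e.g. with statvfs() */
    off_t drop  = 256 * block;   /* hypothetical; whole blocks known to end before "delete me!" */

    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, drop) != 0)
        perror("fallocate");     /* EOPNOTSUPP if the filesystem can't do it */

    close(fd);
    return 0;
}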
Here is a recommendation for editing large files (adjust the line count and the number of suffix digits based on your file's length and the number of pieces you want to work with):
split -l 1000 -a 4 -d bigfile bigfile_
For this you need extra space, since bigfile won't be removed.
Insert a header as the first line:
sed -i '1i header' bigfile_0000
Search for a specific pattern, get the name of the matching piece, and remove the pieces before it:
grep pattern bigfile_*
etc.
Once all editing is done, just cat back the remaining pieces
cat bigfile_* > edited_bigfile

How can you identify a file without a filename or filepath?

Suppose I were to give you a file. You can read the file, but you can't change it or copy it. Then I take the file, rename it, and move it to a new location. How could you identify that file (fairly reliably)?
The scenario: I have a database of media files for a program, and the user alters the location/name of a file. Could I find the file again by searching a directory and looking for something?
I have done exactly this; it's not hard.
I take a 256-bit hash of the file (I forget off the top of my head which routine I used) along with the file size and write them to a table. If they match, the files match. (And I think tracking the size is more paranoia than necessity.) To speed things up I also fold that hash down to a 32-bit value; only if the 32-bit values match do I check all the data.
For the sake of performance I persist the last 10 million files I have examined. The 32-bit values go in one file which is read in its entirety; when a main record needs to be examined I pull in a "page" of them (I forget exactly how big) which is padded to align it with the disk.
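Here is a sketch of that fingerprinting idea, assuming OpenSSL's libcrypto is available (link with -lcrypto). The particular 32-bit fold shown (the first four digest bytes) is an arbitrary choice for illustration, not necessarily what the answer above used; the 32-bit keys act only as a cheap pre-filter, and the size and full digest are compared when they match.

#include <openssl/evp.h>
#include <stdint.h>
#include <stdio.h>

/* Computes the (size, SHA-256 digest, folded 32-bit key) triple for one file. */
int fingerprint(const char *path, long long *size, unsigned char digest[32], uint32_t *folded) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

    unsigned char buf[1 << 16];
    size_t n;
    *size = 0;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        EVP_DigestUpdate(ctx, buf, n);
        *size += (long long)n;
    }
    EVP_DigestFinal_ex(ctx, digest, NULL);
    EVP_MD_CTX_free(ctx);
    fclose(f);

    /* Fold the 256-bit hash down to a 32-bit pre-filter key. */
    *folded = (uint32_t)digest[0] | (uint32_t)digest[1] << 8
            | (uint32_t)digest[2] << 16 | (uint32_t)digest[3] << 24;
    return 0;
}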

Finding actual size of a folder in Windows

On my home desktop, which is a Windows machine, I right-click on the C:\Windows folder and open Properties, which displays one size (screenshot omitted here).
If I use the du tool provided by Microsoft sysinternals
du C:\Windows
This produces
Files: 77060
Directories: 21838
Size: 31,070,596,369 bytes
Size on disk: 31,151,837,184 bytes
If I run the same command as administrator
Files: 77894
Directories: 22220
Size: 32,223,507,961 bytes
Size on disk: 32,297,160,704 bytes
With PowerShell ISE running as administrator, I ran the following PowerShell snippet from this SO answer:
"{0:N2}" -f ((Get-ChildItem -path C:\InsertPathHere -recurse | Measure-Object -property length -sum ).sum /1MB) + " MB"
which output
22,486.11 MB
The C# code in the following SO answer from a command prompt running as Administrator returns:
35,163,662,628 bytes
Although close, it still does not match what Windows Explorer displays. None of these methods, therefore, returns the actual size of the directory. So my question is this:
Is there a scripted or coded method that will return the actual folder size of C:\Windows?
If there is no way of retrieving the folder size, is there a way I can programmatically retrieve the information displayed by Windows Explorer?
Windows has a somewhat odd way of accounting for stored data. For example, while a file may be 1 MB in size, it will probably occupy about 1.1 MB on disk, because that figure includes the directory entry pointing to the actual data on disk, and the estimated size doesn't include the additional data Windows may store alongside the file.
Now you're probably thinking: that's nice and all, but how do you explain the large size change when looking at the folder as admin? Good question: that difference comes from additional headers/metadata stored alongside the files which only administrators are allowed to see.
Coming back to your original question about the actual size: that is quite hard to pin down on Windows because of the amount of additional data it keeps alongside the files you care about. For readability, or if you are using the number in code, I'd suggest going with the "size on disk" figure from the elevated command, not because the folder seems to be at its maximum size there (for me it is), but because when you transfer the data that is probably the most reliable figure to go with; once you transfer it, some of the additional data will be removed or changed, and you already know what the likely swing in size will be.
Also take the drive's format (NTFS, FAT32) into account, because the way it allocates files can change the size slightly, especially for huge files (1 GB and up).
Hope that helps, mate; we all know how wonderful Windows can be when you're trying to get information out of it (sigh).
The ambiguities and differences have a lot to do with junctions, soft links, and hard links (similar to symlinks if you come from the *nix world). The biggest issue: Almost no Windows programs handle hard links well--they look like (and indeed are) "normal" files. All files in Windows have 1+ hard links.
You can get an indication of "true" disk storage by using Sysinternals Disk Usage utility
> du64 c:\windows
Yields on my machine:
DU v1.61 - Directory disk usage reporter
Copyright (C) 2005-2016 Mark Russinovich
Sysinternals - www.sysinternals.com
Files: 204992
Directories: 57026
Size: 14,909,427,806 bytes
Size on disk: 15,631,523,840 bytes
Which is a lot smaller than what you would see if you right-click and get the size in the properties dialog. By default du64 doesn't double count files with multiple hard links--it returns true disk space used. And that's also why this command takes a while to process. You can use the -u option to have the disk usage utility naively count the size of all links.
> du64 -u c:\windows
DU v1.61 - Directory disk usage reporter
Copyright (C) 2005-2016 Mark Russinovich
Sysinternals - www.sysinternals.com
Files: 236008
Directories: 57026
Size: 21,334,850,784 bytes
Size on disk: 22,129,897,472 bytes
This is much bigger, but it has double-counted files that have multiple links pointing to the same storage space. Hope this helps.
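If you want to detect the double counting programmatically, a minimal Win32 sketch looks like the following (the path is only an example; notepad.exe simply tends to be hard-linked into WinSxS on typical installs). A du-style scan would remember the (volume serial, file index) pairs it has already seen and skip sizes it has already counted.

#include <stdio.h>
#include <windows.h>

int main(void) {
    HANDLE h = CreateFileW(L"C:\\Windows\\notepad.exe", FILE_READ_ATTRIBUTES,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    BY_HANDLE_FILE_INFORMATION info;
    if (GetFileInformationByHandle(h, &info)) {
        unsigned long long size =
            ((unsigned long long)info.nFileSizeHigh << 32) | info.nFileSizeLow;
        printf("size: %llu bytes, hard links: %lu, file id: %08lx%08lx\n",
               size, info.nNumberOfLinks, info.nFileIndexHigh, info.nFileIndexLow);
    }
    CloseHandle(h);
    return 0;
}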

How does a program determine the size of a file without reading it whole?

In the question Using C++ filestreams (fstream), how can you determine the size of a file?, the top answer is the following C++ snippet:
ifstream file("example.txt", ios::binary | ios::ate);
return file.tellg();
Running it myself I noticed that the size of arbitrarily large files could be determined instantaneously and with a single read operation.
Conventionally I would assume that to determine the size of a file, one would have to move through it byte-by-byte, adding to a byte-counter. How is this achieved instead? Metadata?
The size of the file is embedded in the file metadata in the file system. Different file systems have different ways of storing this information.
Edit: Obviously, this is an incomplete answer. When someone provides an answer that shows, for a common filesystem like ext3, NTFS, or FAT, exactly how the file size is known and stored, I'll delete this one.
The file size is stored as metadata on most filesystems. In addition to GI Joe's answer above, you can use the stat function on POSIX systems:
stat(3) manpage
#include <stdio.h>
#include <sys/stat.h>
struct stat statbuf;
stat("filename.txt", &statbuf);
printf("The file is %lld bytes long\n", (long long)statbuf.st_size);
When ios::ate is set, the initial position will be the end of the file, but you are free to seek thereafter.
tellg returns the position of the current character in the input stream. The other key part is ios::binary.
So all it does is seek to the end of the file for you when it opens the filestream, then tell you the current position (which is the end of the file). I guess you could say it's a sort of hack, but the logic makes sense.
If you would like to learn how filestreams work at a lower level, please read this StackOverflow question.
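For comparison, here is the same seek-to-the-end idea in plain C (POSIX assumed, so that fseeko/ftello stay correct past 2 GB). No file data is read; the position of the end comes from the same filesystem metadata that stat() consults.

#define _FILE_OFFSET_BITS 64
#include <stdio.h>

long long file_size(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    fseeko(f, 0, SEEK_END);                  /* jump straight to the end, no reads */
    long long size = (long long)ftello(f);   /* current position == file size */
    fclose(f);
    return size;
}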

How to see fragmentation of a specific file?

Is there a tool that would show me for a specific file on disk, how fragmented it is? (How many seeks does physical disk need to make if I were to read that file in a linear fashion)
The Sysinternals tool contig, with the -a parameter, can do this for a single file or for all files in a folder and its subfolders.
You can use DeviceIoControl with FSCTL_GET_VOLUME_BITMAP, FSCTL_GET_RETRIEVAL_POINTERS and FSCTL_MOVE_FILE, see Defragmenting Files.
You can also find different code examples if you search for FSCTL_MOVE_FILE.
Here is one in C and another in .NET.
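As a rough illustration of the DeviceIoControl route, here is a hedged C sketch (the path is hypothetical and error handling is minimal) that counts a file's extents with FSCTL_GET_RETRIEVAL_POINTERS. Each extent is one contiguous run of clusters on disk, so the extent count is a reasonable proxy for how fragmented the file is.

#include <stdio.h>
#include <windows.h>
#include <winioctl.h>

int main(void) {
    HANDLE h = CreateFileW(L"C:\\path\\to\\file.dat",   /* hypothetical path */
                           FILE_READ_ATTRIBUTES,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    STARTING_VCN_INPUT_BUFFER in;
    in.StartingVcn.QuadPart = 0;               /* start from the first cluster */
    BYTE out[64 * 1024];                       /* room for many extents per call */
    DWORD bytes = 0;
    int extents = 0;

    for (;;) {
        BOOL ok = DeviceIoControl(h, FSCTL_GET_RETRIEVAL_POINTERS,
                                  &in, sizeof in, out, sizeof out, &bytes, NULL);
        RETRIEVAL_POINTERS_BUFFER *rp = (RETRIEVAL_POINTERS_BUFFER *)out;
        if (!ok && GetLastError() != ERROR_MORE_DATA) break;   /* done, or resident/empty file */

        extents += (int)rp->ExtentCount;
        if (ok) break;                                         /* got the last batch */
        in.StartingVcn = rp->Extents[rp->ExtentCount - 1].NextVcn;  /* continue after it */
    }

    printf("extents: %d\n", extents);
    CloseHandle(h);
    return 0;
}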
filefrag is the tool you're looking for, if you're using Linux.
Use -v parameter with filename to get detailed list of fragmentation.
http://linux.die.net/man/8/filefrag
And, of course, "fragmentation" is suspect:
The file may be in pieces on the same cylinder: no seek overhead, just rotational latency. Or possibly not even that much, if the pieces happen to be in an optimal order (chances are near zero for this one).
The file may be "contiguous" but across several cylinders. Even reading sequentially will result in seeks.
The file may be on a stripe set and you have no idea where the boundaries are. You may skip to another controller, another spindle, or another partition on the same drive.
Be careful about what conclusions you draw.
fsutil file queryallocranges offset=<o> length=<l> <file> will show you the file's extents; you will need admin rights.
