What is sufficient file information for file hashing? - algorithm

Hashing a file's complete binary content would be far too heavy to compute in a reasonably short time, so:
What file information is sufficient for hashing a file? Ideally, the resulting hash should have the following properties:
collision-free in respect to other files in the directory
fast
catch all file changes
As a rule of thumb, the less information we can use to create enough entropy, the better. Since the speed of retrieval for specific information may depend largely on the given environment (OS, file-IO of the language, IO of the used library, etc.), it should be disregarded here.
(This is my first attempt at a community wiki. My reason for making it one is that the information asked here is very generic but (hopefully) informative. I also would like this question to be marked as a community wiki, so it can be improved where fit. )

General Overview
Our goal here is to track as many differences between two file states as possible without using redundant data. Thus each source of information must be a disjoint subset of the information in the file's state.
The following items represent sources of information about a file:
the name of the file
the directory-path relative to the specified document-root (aka absolute from document-root)
the file's permissions
the file's owner (user/group)
the last change time
the size of the file
the hostname of the machine the file resides on
the actual saved binary data
Per Item Considerations
Name of File
The name of the file is the last component of its absolute filesystem path and, as @La-comadreja said, it is unique in that no two files on a system can have the same absolute path. Using the file's name in combination with the rest of its absolute path (see Directory-Path for more information) is highly encouraged to avoid hash collisions with other files.
Directory-Path
While the file's absolute path is unique on a given machine, hashing the absolute path may be inappropriate in certain circumstances. For instance, comparing the hashes of two files on different machines will most likely fail unless both files have the identical absolute path on both machines. This becomes even more problematic on machines with different OSes and/or architectures. It is therefore encouraged to specify a document-root and resolve the path relative to it.
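As a rough illustration of the document-root idea, the sketch below turns two machine-specific absolute paths into the same root-relative key. The helper name path_key and the example roots (/srv/site and C:\site) are made up for this example, and case-sensitivity differences between filesystems are deliberately ignored.

```python
from pathlib import PurePosixPath, PureWindowsPath


def path_key(document_root, file_path, flavour=PurePosixPath):
    """Return file_path relative to document_root, always with '/' separators."""
    return flavour(file_path).relative_to(flavour(document_root)).as_posix()


# Hypothetical locations of the same logical file on two different machines.
linux_key = path_key("/srv/site", "/srv/site/docs/a.txt")
windows_key = path_key(r"C:\site", r"C:\site\docs\a.txt", flavour=PureWindowsPath)

print(linux_key)
print(windows_key)
```

Because only pure path objects are used, nothing is read from disk, so the same code can be tried on any OS.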
Permissions
If you want to track changes to a file's permissions, the tests below indicate that you need to incorporate them into your hash directly, as changing them does not change any other information about the file (most notably the timestamp). Note, however, that permissions are handled quite differently on different machines, so caution must be exercised here (for instance by using a canonical permission translation scheme).
Ownership
Ownership, just as permissions, is handled very differently across architectures and filesystems. A change of ownership does not change other information (as indicated by the tests below).
Timestamp
The timestamp of a file is also not implemented uniformly across all (or even the most common) systems. First of all, different filesystems offer different timestamps we could look at: creation date, modified date, access date, etc. For our purpose the modified date is most suitable, as it is supported by most of the available filesystems [1] and holds the exact information we need: the last change to a file. However, comparing files across different OSes may pose a problem, as Windows and Unix (in general) handle timestamps differently (see here [2] for a detailed article about the problem). Note that the modification date of a file changes whenever the file has been edited (disregarding edge cases), so a change in file size is also reflected in the timestamp (the opposite does not hold true, see File size).
File size
The file size in bytes is a very good indication of whether a file has been edited (it does not reflect permission, ownership or name changes), as most edits change the file's content and thus its size. However, this does not hold true if the additions to a file are exactly as big as the deletions. Thus the file's timestamp may be a better indicator. Note that the size is normally read from the same filesystem metadata as the timestamp rather than computed from the content, so it is cheap to obtain but adds little once the timestamp is included.
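The case where additions and deletions cancel out can be shown with a small sketch: the content is replaced by different content of the same length, and the size stays the same. The file name foo and the strings are arbitrary choices for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "foo")

    # First version of the file.
    with open(path, "w") as handle:
        handle.write("abc")
    size_before = os.stat(path).st_size

    # Different content of exactly the same length.
    with open(path, "w") as handle:
        handle.write("xyz")
    size_after = os.stat(path).st_size

print(size_before, size_after)
```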
Hostname
If one wants to compare files across multiple hosts and regard identical files on different hosts as different, then the hostname of the machine (or another suitable unique identifier for the host) should be included in the hash.
Binary Data
The binary data of the file has, of course, all the information necessary to check whether a file was changed. However, it is also too resource intensive to be practical here. It is highly discouraged to use this information.
Suggestions
The following sources should be used to compare files:
the name of the file
the directory path
the timestamp (see above for problems)
The following extra sources can be used to track more information:
permissions (see above)
ownership (see above)
hostname (when comparing across different machines)
The following sources of information should be disregarded:
file size
binary data
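The suggestions above could be sketched roughly as follows. This is only one possible reading of them, not a reference implementation: the function name metadata_fingerprint, its keyword arguments, the choice of SHA-256 and the NUL separator are all assumptions made for this example.

```python
import hashlib
import os
import socket
import stat


def metadata_fingerprint(document_root, file_path, *, with_permissions=False,
                         with_ownership=False, with_hostname=False):
    """Hash the root-relative path and modification time of a file.

    Permissions, ownership and hostname are only mixed in on request.
    """
    info = os.stat(file_path)
    relative = os.path.relpath(file_path, document_root).replace(os.sep, "/")
    fields = [relative, str(info.st_mtime_ns)]
    if with_permissions:
        fields.append(oct(stat.S_IMODE(info.st_mode)))
    if with_ownership:
        fields.append(f"{info.st_uid}:{info.st_gid}")
    if with_hostname:
        fields.append(socket.gethostname())
    # NUL cannot occur in file names, so it is a safe field separator.
    payload = "\0".join(fields).encode("utf-8", "surrogateescape")
    return hashlib.sha256(payload).hexdigest()
```

A call such as metadata_fingerprint("/srv/site", "/srv/site/docs/a.txt") (hypothetical paths) returns a hex digest that changes when the file is renamed, moved within the root, or modified. The numeric permission bits and uid/gid used here are not a canonical cross-platform scheme.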
Tests
I did some tests on Debian to check whether changing one piece of information would change another. Most interestingly, renaming a file, changing its permissions or changing its owner changed neither the timestamp nor the file size. (Note that these tests were only run on Debian Linux. Other OSes will likely behave differently.)
$ ls -l
-rw-r--r-- 1 alex alex 30 Apr 26 11:04 bar
-rw-r--r-- 1 alex alex 0 Apr 26 11:03 baz
-rw-r--r-- 1 alex alex 14 Apr 26 11:04 foo
$ mv baz baz2
$ ls -l
-rw-r--r-- 1 alex alex 30 Apr 26 11:04 bar
-rw-r--r-- 1 alex alex 0 Apr 26 11:03 baz2
-rw-r--r-- 1 alex alex 14 Apr 26 11:04 foo
$ chmod 777 foo
$ ls -l
-rw-r--r-- 1 alex alex 30 Apr 26 11:04 bar
-rw-r--r-- 1 alex alex 0 Apr 26 11:03 baz2
-rwxrwxrwx 1 alex alex 14 Apr 26 11:04 foo
$ mv baz2 baz
$ echo "Another string" >> bar
$ ls -l
-rw-r--r-- 1 alex alex 45 Apr 26 11:17 bar
-rw-r--r-- 1 alex alex 0 Apr 26 11:03 baz
-rwxrwxrwx 1 alex alex 14 Apr 26 11:04 foo
$ sudo chown root baz
$ ls -l
-rw-r--r-- 1 alex alex 45 Apr 26 11:17 bar
-rw-r--r-- 1 root alex 0 Apr 26 11:03 baz
-rwxrwxrwx 1 alex alex 14 Apr 26 11:04 foo
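The permission and rename parts of the session above can be repeated programmatically, which may help when checking another OS or filesystem. This is a minimal sketch that assumes a POSIX-like system; the chown step is left out because it normally needs root, and the helper name snapshot is invented here.

```python
import os
import tempfile


def snapshot(path):
    """Return the (modification time in ns, size in bytes) pair of a file."""
    info = os.stat(path)
    return info.st_mtime_ns, info.st_size


with tempfile.TemporaryDirectory() as directory:
    foo = os.path.join(directory, "foo")
    with open(foo, "w") as handle:
        handle.write("Some string\n")
    before = snapshot(foo)

    # Equivalent of: chmod 777 foo
    os.chmod(foo, 0o777)
    after_chmod = snapshot(foo)

    # Equivalent of: mv foo foo2
    renamed = os.path.join(directory, "foo2")
    os.rename(foo, renamed)
    after_rename = snapshot(renamed)

print("chmod changed mtime/size:", before != after_chmod)
print("rename changed mtime/size:", before != after_rename)
```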

Assuming all the files are on the same machine, directory path and file name should produce a unique combination because two files in the same directory cannot have the same name. Directory path, filename and timestamp of last change should capture each change.
If the files are on different machines, the machine name should be included in the directory path.

Related

How to force bash to recognize letter case in file names on macOS?

I bumped into the fact that the "ls" command on macOS (GNU bash 5.1.16) doesn't differentiate between lower and upper case in file names (weird, I know)
e.g.
[17:39:28:~/Work/cloud-formation/output/templates$] ls -l Man*
-rw-r--r-- 1 geoku staff 71244 31 Jan 17:23 ManagementProd.json
-rw-r--r-- 1 geoku staff 67569 31 Jan 17:23 ManagementStage.json
Now, I can get exactly the same file listed with a different command:
[17:44:19:~/Work/cloud-formation/output/templates$] cksum ManagementStage.json Managementstage.JsOn
cksum ManagementStage.json Managementstage.JsOn
3327010753 67569 ManagementStage.json
3327010753 67569 Managementstage.JsOn
Is this some sort of bash setting?
OK, it turns out that macOS disk partitions can be created either case-sensitive or case-insensitive, and the default is apparently the latter.
Disk Utility shows the options when you create a new partition.
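Whether a given directory sits on a case-insensitive volume can also be probed from a script. The sketch below is one simple way to do it, assuming write access to the directory; the function name is_case_insensitive and the CaseProbe prefix are made up for this example.

```python
import os
import tempfile


def is_case_insensitive(directory):
    """Create a temporary file and check whether it is reachable
    under the same name with upper and lower case swapped."""
    with tempfile.NamedTemporaryFile(prefix="CaseProbe", dir=directory) as probe:
        folder, name = os.path.split(probe.name)
        swapped = os.path.join(folder, name.swapcase())
        return os.path.exists(swapped)


print(is_case_insensitive(tempfile.gettempdir()))
```

On a volume with the default macOS setting described above, this would be expected to report True; on a typical Linux filesystem it would be expected to report False.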

Recursively searching a directory without changing directory atimes

I'm looking for an alternative to the 'find' command in shell scripting, so as to avoid changing the accessed date of sub-directories.
According to my observation, when the find command is executed to list all the files in a directory, the accessed date of its sub-directories gets changed.
I want to post genuine statistics in one of the junk platforms, so I have been looking at some forums and found an alternative using the 'ls' command. But that doesn't completely fulfill my requirement.
Below is the answer given by @ghostdog74.
ls -R %path% | awk '/:$/&&f{s=$0;f=0} /:$/&&!f{sub(/:$/,"");s=$0;f=1;next} NF&&f{ print s"/"$0 }'
But this finds only the files inside the sub directories. I need all the files and sub-directories' files to be listed.
For example:
bash-3.2# pwd
/Users/manojkapalavai/Desktop/SleepTimeReport
bash-3.2# ls
**6th floor** manoj17 manoj26.txt manoj36 manoj45.txt manoj55 manoj70.txt manoj80 manoj9.txt **test1**
manoj14 manoj23.txt manoj33 manoj42.txt manoj52 manoj61.txt manoj71 manoj80.txt manoj90 **test2**
The highlighted ones are sub-directories inside "SleepTimeReport" directory and remaining are just files. So, when I execute the above command, I get only the below output.
bash-3.2# ls -R ~/Desktop/SleepTimeReport | awk '/:$/&&f{s=$0;f=0} /:$/&&!f{sub(/:$/,"");s=$0;f=1;next} NF&&f{ print s"/"$0 }'
~/Desktop/SleepTimeReport/6th floor/Script to increase the Sleep Time.numbers
~/Desktop/SleepTimeReport/6th floor/Zone1Sleep.pages
~/Desktop/SleepTimeReport/test1/New_folder
~/Desktop/SleepTimeReport/test1/manoj.txt
~/Desktop/SleepTimeReport/test1/sathish.txt
~/Desktop/SleepTimeReport/test1/vara.txt
~/Desktop/SleepTimeReport/test1/New_folder/Script to increase the Sleep Time.numbers
~/Desktop/SleepTimeReport/test1/New_folder/Zone1Sleep.pages
That is, only the files inside sub-directories are listed.
Here is a brief illustration of the issue I'm facing:
Manojs-MacBook-Pro:SleepTimeReport manojkapalavai$ ls -l
total 16
drwxr-xr-x 8 manojkapalavai staff 272 Sep 14 15:07 6th floor
-rwxr-xr-x 1 manojkapalavai staff 59 Nov 13 10:41 AltrFind.sh
-rw-r--r-- 1 manojkapalavai staff 0 Nov 2 15:15 manoj%.txt
-rw-r--r-- 1 manojkapalavai staff 0 Nov 2 18:23 manoj1
When I check the created time and accessed time of the folder 6th floor before using the 'find' command, this is the output:
Manojs-MacBook-Pro:SleepTimeReport manojkapalavai$ stat -f '%N, %SB, %Sa' 6th\ floor/
6th floor/, Sep 13 10:34:55 2017, **Nov 13 11:21:33 2017**
Manojs-MacBook-Pro:SleepTimeReport manojkapalavai$ find /Users/manojkapalavai/Desktop/SleepTimeReport/
/Users/manojkapalavai/Desktop/SleepTimeReport/
/Users/manojkapalavai/Desktop/SleepTimeReport//6th floor
/Users/manojkapalavai/Desktop/SleepTimeReport//6th floor/.DS_Store
/Users/manojkapalavai/Desktop/SleepTimeReport//6th floor/Script to increase the Sleep Time.numbers
/Users/manojkapalavai/Desktop/SleepTimeReport//6th floor/Zone1Sleep.pages
Now, after finding all the files inside the directory, below is the output of stat again. You can notice the change in atime:
Manojs-MacBook-Pro:SleepTimeReport manojkapalavai$ stat -f '%N, %SB, %Sa' 6th\ floor/
6th floor/, Sep 13 10:34:55 2017, **Nov 13 14:26:03 2017**
All that I have done is list the files with find, and the atime of the sub-folders inside the folder gets changed to the current time.
Is there any way to solve this?
ls is the wrong tool for programmatic use. Generally, you should be able to fix your find usage to not have an effect on atimes (actually, it's pretty rare for folks to even have atimes enabled at the filesystem level on modern production systems), but if you really want to avoid it, consider the bash globstar option:
shopt -s globstar
for file in **/*; do
  echo "Doing whatever with $file"
done
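For comparison, a similar recursive listing can be written outside the shell. The sketch below uses pathlib and is only an approximate equivalent of the globstar loop: unlike bash's default behaviour it also returns hidden entries such as .DS_Store, and like any listing it still has to read each directory, so whether atimes change remains a matter of how the filesystem is mounted. The function name list_tree is made up for this example.

```python
from pathlib import Path


def list_tree(root):
    """Return every file and sub-directory below root, relative to root."""
    root = Path(root)
    return sorted(entry.relative_to(root).as_posix() for entry in root.rglob("*"))


# Lists the current directory; any directory path could be passed instead.
for name in list_tree("."):
    print(name)
```

Both top-level files and the contents of sub-directories are included in the result.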

Windows API to access case-sensitive paths (Bash-on-Ubuntu-on-Windows)

Bash-on-Ubuntu-on-Windows supports case-sensitive file paths. This means that I can create two files or directories with names only differing in capitalization. I have issues accessing those files, though.
Running
bash -c "touch Magic ; mkdir magic ; echo Secret! > magic/secret"
creates a file named Magic, a directory named magic and a file named secret in that directory.
bash -c "ls -lR" yields
.:
total 0
drwxrwxrwx 2 root root 0 Aug 23 10:37 magic
-rwxrwxrwx 1 root root 0 Aug 23 10:37 Magic
./magic:
total 0
-rwxrwxrwx 1 root root 8 Aug 23 10:37 secret
(I am not sure why I get root, as it is not the default user, but that does not seem relevant to my question.)
Windows Explorer shows:
Now, while bash can easily access the magic/secret file in the directory, Windows seems to treat the directory and the file as one and the same. So when I double-click the directory, I get a "directory name invalid" error.
The same goes for cd, which prints: The directory name is invalid.
Are there any APIs that allow me to access those case-sensitive paths, or create them? It seems that regular Windows APIs ignore character case completely when accessing existing files.
Case-sensitive paths can be used on Windows with NTFS, but it requires a bit of extra work.
First, case-sensitivity must be enabled system-wide. This is done by setting the HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\kernel\ dword:ObCaseInsensitive registry value to 0, then restarting the system.
I found this part here.
Once case-sensitivity is enabled, it is possible to use CreateFile with case-sensitive paths. To do that, you have to pass FILE_FLAG_POSIX_SEMANTICS as part of the dwFlagsAndAttributes parameter. From MSDN:
Access will occur according to POSIX rules. This includes allowing multiple files with names, differing only in case, for file systems that support that naming.
I found this part in this answer.
By setting the registry setting and the CreateFile flag, I was able to access case-sensitive paths.

Hadoop Log File Analysis from 2 separate machines

I am new to Hadoop. I have to find the trend of symbols traded among users.
I have 2 machines b040n10 and b040n11. The files in the machine are as mentioned below:
b040n10:/u/ssekar>ls -lrt
-rw-r--r-- 1 root root 482342353 Feb 8 2014 A.log
-rw-r--r-- 1 root root 481231231 Feb 8 2014 B.log
b040n11:/u/ssekar>ls -lrt
-rw-r--r-- 1 root root 412312312 Feb 8 2014 C.log
-rw-r--r-- 1 root root 412356315 Feb 8 2014 D.log
There is a field called "symbol_name" on all these logs (example below).
IP=145.45.34.2;***symbol_name=ABC;***timestamp=12:13:05
IP=145.45.34.2;***symbol_name=XYZ;***timestamp=12:13:56
IP=145.45.34.2;***symbol_name=ABC;***timestamp=12:14:56
I have Hadoop running on my Laptop and I have 2 machines connected to my Laptop (can be used as Datanodes).
My task now is to get the list of symbol_name and the Symbol count.
As mentioned below:
ABC - 2
XYZ - 1
Should I now:
1. copy all the files (A.log,B.log,C.log,D.log) from b040n10 and b040n11 to my Laptop,
2. Issue a copyFromLocal command to HDFS system and analyze the data?
Or is there a better way to find out the symbol_name and count without copying these files to my laptop?
The question is a basic one, but I am new to Hadoop; please help me understand and use Hadoop better. Please let me know if more information on the question is needed.
Thanks
Copying the files from Hadoop to your local laptop defeats the entire purpose of Hadoop, which is to move the processing to the data and not the other way around. When you really have "BigData", you won't be able to move the data around to process it locally.
Your problem is a typical case for Map/Reduce: all you need is a job that counts the occurrences of each symbol. Just search for the Map/Reduce WordCount example and adapt it to your case.
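To make the shape of such a job concrete, here is a small sketch of the map and reduce steps in plain Python, run on the three sample lines from the question. It does not use any Hadoop API; the function names map_line and reduce_counts are invented for this example, and a real job would distribute the same two steps across the cluster.

```python
import re
from collections import Counter
from itertools import chain

SYMBOL = re.compile(r"symbol_name=([^;]+)")


def map_line(line):
    """Map step: emit a (symbol, 1) pair for a log line that names a symbol."""
    match = SYMBOL.search(line)
    return [(match.group(1), 1)] if match else []


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per symbol."""
    totals = Counter()
    for symbol, count in pairs:
        totals[symbol] += count
    return dict(totals)


lines = [
    "IP=145.45.34.2;***symbol_name=ABC;***timestamp=12:13:05",
    "IP=145.45.34.2;***symbol_name=XYZ;***timestamp=12:13:56",
    "IP=145.45.34.2;***symbol_name=ABC;***timestamp=12:14:56",
]
counts = reduce_counts(chain.from_iterable(map_line(line) for line in lines))
for symbol, count in sorted(counts.items()):
    print(f"{symbol} - {count}")
```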

what does terminal command: ls -l show?

I know that it outputs the "long" version, but what does each of the sections mean?
On my mac, when I type in
ls -l /Users
I get
total 0
drwxr-xr-x+ 33 MaxHarris staff 1122 Jul 1 14:06 MaxHarris
drwxrwxrwt 8 root wheel 272 May 20 13:26 Shared
drwxr-xr-x+ 14 admin staff 476 May 17 11:25 admin
drwxr-xr-x+ 44 hugger staff 1496 Mar 17 21:13 hugger
I know that the first column is the permissions, although I don't know what the order is. It would be great if that could be explained too. Then what's the number after it?
Basically, what does each one of these things mean? Why do the usernames sometimes appear twice and sometimes not match?
The option '-l' tells the command to use a long list format. It gives back several columns, which correspond to:
Permissions
Number of hardlinks
File owner
File group
File size
Modification time
Filename
The first character in the permissions column shows the file's type. A 'd' means a directory and a '-' means a normal file (there are other characters, but those are the basic ones).
The next nine characters are divided into 3 groups of three, one per class of user. The letters in a group correspond to the read, write and execute permissions, and the groups apply, in order, to the owner of the file, the group of the file and everyone else.
[ File type ][ Owner permissions ][ Group permissions ][ Everyone permissions ]
The characters can be one of four options:
r = read permission
w = write permission
x = execute permission
- = no permission
Finally, the "+" at the end means that the file has extended security information, such as an access control list.
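The layout of the permissions column can be reproduced with Python's standard stat module, which may make the grouping easier to see. In this sketch, describe_mode is a made-up helper that simply slices the string; stat.filemode is the standard-library function that builds the string from numeric mode bits.

```python
import stat


def describe_mode(mode_string):
    """Split an ls -l style mode string such as 'drwxr-xr-x' into its parts."""
    return {
        "type": mode_string[0],
        "owner": mode_string[1:4],
        "group": mode_string[4:7],
        "others": mode_string[7:10],
    }


# 0o040755 is a directory with permissions 755, 0o100644 a regular file with 644.
directory_mode = stat.filemode(0o040755)
file_mode = stat.filemode(0o100644)

print(directory_mode, describe_mode(directory_mode))
print(file_mode, describe_mode(file_mode))
```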
If you type the command
$ man ls
You’ll get the documentation for ls, which says in part:
The Long Format
If the -l option is given, the following information is displayed for
each file: file mode, number of links, owner name, group name, number of
bytes in the file, abbreviated month, day-of-month file was last
modified, hour file last modified, minute file last modified, and the
pathname. In addition, for each directory whose contents are displayed,
the total number of 512-byte blocks used by the files in the directory is
displayed on a line by itself, immediately before the information for the
files in the directory. If the file or directory has extended
attributes, the permissions field printed by the -l option is followed by
a '@' character. Otherwise, if the file or directory has extended
security information (such as an access control list), the permissions
field printed by the -l option is followed by a '+' character.
…
The man command is short for “manual”, and the articles it shows are called “man pages”; try running man manpages to learn even more about them.
The following information is provided:
permissions
number of hard links
owner of the file
group the file belongs to
size
modification date and time
file/directory name
