What is the purpose of creating a symbolic link between files? - symlink

Recently I came across the os library in Python and found out about the existence of symbolic links. I would like to know what a symbolic link is, why it exists, and what are various uses of it?

I will answer this from a perspective of an *nix user (specifically Linux). If you're interested in how this relates to Windows I suggest you look for tutorials like this one. This will be a bit of a roundabout, but I find it that symbolic links or symlinks are best explained together with hard links and generic properties of a filesystem on Linux.
Links and files on Linux
As a rule of thumb, in Linux everything is treated as a file. Directories are files that contain mappings from names (paths) to inodes, which are just unique identifiers of different objects residing on your system. Basically, if I give you a name like /home/gst/mydog.png the accessing process will first look into the / directory (the root directory) where it will find information on where to find home, then opening that file it will look into it to see where gst is and finally in that file it will try to find the location of mydog.png, and if successful try do whatever it set out to do with it. Going back to directory files, the mappings they contain are called links. Which brings us to hard and symbolic links.
Hard vs Symbolic links
A hard link is just a mapping like the one we discussed previously. It points directly to a certain object. A symlink on the other hand does not point directly to an object. Rather it just saves a path to an object. For example, say that I created a symbolic link to /home/gst/mydog.png at /home/gst/Desktop/mycat.png with os.symlink("/home/gst/mydog.png", "/home/gst/Desktop/mycat.png"). When I try to open it, the name /home/gst/Desktop/mycat.png is usually resolved to /home/gst/mydog.png. By following the symlink located at /home/gst/Desktop/mycat.png I actually (try to) access an object pointed to by /home/gst/mydog.png.
If I create a hard link (for example by calling os.link) I just add entries to the relevant directory files, such that the specific name can be followed to the linked object. When I create a symbolic link I create a file that contains a path to another file (which might be another symbolic link).
More specific to your question, if I pass /home/gst/Desktop/mycat.png to os.readlink it will return /home/gst/mydog.png. This name resolution also happens when calling functions in os with an (optional) parameter follow_symlinks set to True, however, if it's set to False the name does not get resolved (for instance you'd set it to false when you want to manipulate the symlink itself not the object it points to). From the module documentation:
not following symlinks: If follow_symlinks is False, and the last element of the path to operate on is a symbolic link, the function will operate on the symbolic link itself instead of the file the link points to. (For POSIX systems, Python will call the l... version of the function.)
You can check whether or not follow_symlinks is supported on your platform using os.supports_follow_symlinks. If it is unavailable, using it will raise a NotImplementedError.
Why use hard links?
This question has already been answered here, quoting from the accepted answer:
The main advantage of hard links is that, compared to soft links, there is no size or speed penalty. Soft links are an extra layer of indirection on top of normal file access; the kernel has to dereference the link when you open the file, and this takes a small amount of time. The link also takes a small amount of space on the disk, to hold the text of the link. These penalties do not exist with hard links because they are built into the very structure of the filesystem.
I'd like to add that hard links allow for an easy method of file backup. For every file the system keeps a count of its hard links. Once this count reaches 0 the memory segment on which the file is located is marked as free, meaning that the system will eventually overwrite it with another data (effectively deleting the previous file - which doesn't happen for at least as long as a running process has an opened stream associated with the file, but that's another story). Why would that matter?
Let's say you have a huge directory full of files you'd like to manipulate somehow (rename some, delete others, etc.) and you write a script to do this for you. However, you're not completely sure that the script will work as intended and you fear it might delete some wrong files. You also don't want to copy all the files, as this would take up too much space and time. One solution is to just create a hard link for each file at some other point in the filesystem. If you delete a file in the target directory, the associated object is still available because there's another hard link associated with it. Creating that many hard links will consume much less time and space than copying all the file, yet it will give you a reasonable backup strategy.
This is not the case with symbolic links. Remember, symlinks point to other links (possibly another symlink as well) not to actual files. Hence, I might create a symlink to a file, but that it will only save the link. If the (eventual) hard link that the symlink is pointing to gets removed from the system, trying to resolve the symlink won't lead you to a file. Such symlinks are said to be "broken" or "dangling". Thus you cannot rely on symlinks to preserve access to a certain file. (Conversely, deleting a symlink does not affect the link count associated with a target file.) So what's their use?
Why use symbolic links?
You can operate on symlinks as if they were the actual files to which they pointing somewhere down the line (except deleting them). This allows you to have multiple "access points" to a file, without having excess copies (that remain up to date, since they always access the same file). If you want to replace the file that is being accessed you only need to change it once and all of the symlinks will point to it (as long as the path saved by them is not changed). However, if you have hard links to a certain file and you then replace that file with another one, you also need to replace the hard links as otherwise they'll still be pointing to the old file.
Lastly, it is not uncommon to have different filesystems mounted on the same Linux machine. That is to say, that the way data is organized and interpreted at some point in the file hierarchy (say /home/gst/fs1) can be different to how it is organized and interpreted at another point (say /home/gst/Desktop/fs2). A hard link can only reside on the same filesystem as the file it's pointing to. Whereas, a symlink can be created on one filesystem but effectively pointing to a file on another filesystem (see answers to this question).

Symbolic links, also known as soft links, are special types of files that point to other files, much like shortcuts in Windows and Macintosh aliases. The data in the target file does not appear in a symbolic link, unlike a hard link. Instead, it points to another file system entry.
Read more here:
https://kb.iu.edu/d/abbe#:~:text=A%20symbolic%20link%2C%20also%20termed,somewhere%20in%20the%20file%20system.

Related

Find next file (but not FindNextFile)

If a user opens a file in a program (for example using GetOpenFileNameW, DragQueryFileW, command line argument, or whatever else to get the path, and a subsequent CreateFileW call), is there a way to find the next file in the parent directory of the opened file?
The obvious solution is to cycle through the results from FindNextFileW or NtQueryDirectoryFileEx until the opened file is encountered, and just open the next file.
However, this seems undesireable.
First, because these functions use paths (instead of for example a handle), the original file is decoupled from the search algorithm, so the original file might not even get encountered in that search. This is not much of an issue (as failing in this case is the expected outcome), and it probably could be resolved with (temporarly) changing the sharing mode, using LockFile or similar (though I would like to avoid that).
Second, this cycling search would have to be done every time, because the contents of the directory might have changed (retaining hFindFile does not work, because only FindFirstFileW calls NtQueryDirectoryFileEx and enumerates the contents of the directory). Which seems like unnecessary work and might even affect performance (for example if the directory contains a lot of files).
In theory any file system has some way of enumerating the files in a directory. Meaning there is some ordered data structure of the files' metadata. And getting the next file should only involve going back from the existing file handle to that file's entry, and then getting the next entry from that data structure. So there does not seem to be a fundamental reason why this cannot be done more sanely.
I thought maybe there exist a better way to do this somewhere in WinAPI...
Same question for finding the previous file.

Undo a botched command prompt copy which concatenated all of my files

In a Windows 8 Command Prompt, I had a backup drive plugged in and I navigated to my User directory. I executed the command:
copy Documents G:/Seagate_backup/Documents
What I assumed was that copy would create the Documents directory on my backup drive and then copy the contents of the C: Documents directory into it. That is not what happened!
I proceeded to wipe my hard-drive and re-install the operating system, thinking I had backed up the important files, only to find out that copy seemingly concatenated all the C: Documents files of different types (.doc, .pdf, .txt, etc) into one file called "Documents." This file is of course unreadable but opening it in Notepad reveals what happened. I can see some of my documents which were plain text throughout the massively long file.
How do I undo this!!? It's terrible because I was actually helping a friend and was so sure of myself but now this has happened. The only thing I can think of doing is searching for some common separator amongst the concatenated files and write some sort of script to split the file back apart. But then I would have to guess the extensions of each of the pieces...
Merging files together in the fashion that copy uses, discards important file system information such as file size and file name. While the file name may not be as important the size is. Both parameters are used by the OS to discriminate files.
This problem might sound familiar if you have damaged your file allocation table before and all files disappeared. In both cases, you will end up with a binary blob (be it an actual disk or something like your file which might resemble a disk image) that lacks any size and filename information.
Fortunately, this is where a lot of file system recovery tools can help. They are specialized in matching patterns. Specifically they are looking for giveaway clues to what type a file is of, where it starts and what it's size is.
This is for instance enabled by many file types having a set of magic numbers that are used to allow a program to check if a file really is of the type that the extension claims to be.
In principle it is possible to undo this process more or less well.
You will need to use data recovery tools or other analysis tools like binwalk to extract the concatenated binary blob. Essentially the same tools that are used to recover deleted files should be able to extract your documents again. Without any filename of course. I recommend renaming the file to a disk image (.img) and either mounting it from within the operating system as a virtual harddisk (don't worry that it has no file system - it should show up as an unformatted drive) or directly using a data recovery tool or analysis tool which can read binary files (binwalk, for instance, can do that directly, but may not find all types of files as it's mainly for unpacking firmware images that may be assembled in the same or a similar way to how your files ended up).

How should I mark a folder as processed in a script?

A script shall process files in a folder on a Windows machine and mark it as done once it is finished in order to not pick it up in the next round of processing.
My tendency is to let the script rename the folder to a different name, like adding "_done".
But on Windows, renaming a folder is not possible if some process has the folder or a file within it open. In this setup, there is a minor chance that some user may have the folder open.
Alternatively I could just write a stamp-file into that folder.
Are there better alternatives?
Is there a way to force the renaming anyway, in particular when it is on a shared drive or some NAS drive?
You have several options:
Put a token file of some sort in each processed folder and skip the folders that contain said file
Keep track of the last folder processed and only process ones newer (Either by time stamp or (since they're numbered sequentially), by sequence number)
Rename the folder
Since you've already stated that other users may already have the folder/files open, we can rule out #3.
In this situation, I'm in favor of option #1 even though you'll end up with extra files, if someone needs to try and figure out which folders have already been processed, they have a quick, easy method of discerning that with the naked eye, rather than trying to find a counter somewhere in a different file. It's also a bit less code to write, so less pieces to break.
Option #2 is good in this situation as well (I've used both depending on the circumstances), but I tend to favor it for things that a human wouldn't really need to care about or need to look for very often.

NTFS Journal USN_REASON_HARD_LINK_CHANGE event

I've written a program that reads the NTFS index and journal similar to what is described here:
http://ejrh.wordpress.com/2012/07/06/using-the-ntfs-journal-for-backups/
And It works fairly well.
In addition to the normal journal events USN_REASON_CLOSE, USN_REASON_FILE_CREATE, USN_REASON_FILE_DELETE etc' I'm receiving an event with reason USN_REASON_HARD_LINK_CHANGE. I'd like to be able to update the directory index according to this event but I can't find any information about it. The only documentation is:
An NTFS file system hard link is added to or removed from the file or
directory. An NTFS file system hard link, similar to a POSIX hard
link, is one of several directory entries that see the same file or
directory.
What does this mean? where was the hard-link created? or was it removed? how do I get more information about what happened?
I know this is ancient, but I stumbled upon this while researching a related problem. Here's what I found: The hard-links are a complicating factor when reading the USN. You can get journal entries describing change to a single file reference number by way of changes made through any hard-link that's been created. Generally, and to the original question, hard-links are alternative directory entries through which a single file might be accessed. Thus, all the file's characteristics are shared for each link (except for the names and parent file reference numbers). Technically, you can't tell which entry is the original and which is a link.
A subtle difference does exist, and it manifests if you query the master file table (using DeviceIOControl and Fsctl_Enum_Usn_Data). The query will return only a single representative file regardless of how many links exist. You can query for the links using NtQueryInformationFile, querying for FILE_HARD_LINK_INFORMATION. I think of the entry returned by the MFT query as the main entry and the NtQueryInformationFile-returned items as links...however, the main entry can get deleted and one of the links will get promoted...so it's only a housekeeping thought and little else.
Note that a problem arises where one of the hard-links is moved or renamed. In this case, the journal entries for the rename or move reflect the filename and parent file reference number of the affected link. The problem arises if you ask for only the summary "on-close" records. In such a case, you won't ever see the USN_REASON_RENAME_OLD_NAME record...because that USN entry never gets an associated REASON_CLOSE associated with it. Without this tidbit, you won't be able to easily determine which link's name or location was changed. You have to read the usn with ReadOnlyOnClose set to 0 in the Read_Usn_Journal_Data_V0. This is a far chattier query, but without it, you can't accurately associate the change with one link or the other.
As always with the USN, I expect you'll need to go through a bit of trial and error to get it to work right. These observations/guesses may, I hope, be helpful:
When the last hard link to a file is deleted, the file is deleted; so if the last hard link has been removed you should see USN_REASON_FILE_DELETE instead of USN_REASON_HARD_LINK_CHANGE. I believe that each reference number refers to a file (or directory, but NTFS doesn't support multiple hard links to directories AFAIK) rather than to a hard link. So immediately after the event is recorded, at least, the file reference number should still be valid, and point to another name for the file.
If the file still exists, you can look it up by reference number and use FindFirstFileNameW and friends to find the current links. Comparing this to the event record in question plus any relevant later events should give you enough information, although if multiple hard links for the same file are deleted and/or created you might not be able to reconstruct the order in which this happened, and if you don't have enough information about the prior state of the file system you might not be able to identify the deleted hard links. I don't know whether that would matter to you or not.
If the file no longer exists, you should still be able to identify it by the USN record in which it was deleted. Again, taking all relevant events into consideration, and with enough information about the prior state, you should be able to reconstruct most of what happened, if not the order.
There is some hope that we can do better than this: the file name and/or ParentFileReference number in the event record might refer to the hard link that was created or deleted, rather than to an arbitrary link to the file. In this case you'll have all the relevant information about the sequence of events except for whether any particular event was a create or a delete, which you should be able to work out by looking at the current state of the file and working backwards through the records.
I assume you've already looked for nearby change records that might contain additional information? There isn't, for example, a USN_REASON_RENAME_NEW_NAME record generated when a hard link is created or a USN_REASON_RENAME_OLD_NAME when a hard link is removed? Or paired USN_REASON_HARD_LINK_CHANGE records, one for the file, one for the directory containing the affected hard link to the file? (Wishful thinking, I expect, but it wouldn't hurt to look!)
For testing purposes, you can create hard links with the mklink command.

identify portable executable

I want to write a program that allows or blocks processes while openning a file depending on a policy.
I could make a control by checking the name of the program. However, it would not be enough because user can change the name to pass the policy. i.e. let's say that policy doesn't allow a.exe to access txt files whereas b.exe is allowed. If user change a.exe with b.exe, i cannot block it.
On the other hand, verifying portable executable signature is not enough for me, because i don't care whether the executable signed or not. I just want to identify the executable that is wanted to execute even its name is changed.
For this type of case, what would you propose? Any solutions are welcome.
Thanks in advance
There are many ways to identify an executable file. Here is a simple list:
Name:
The most simple and straightforward approach is to identify a file by its name. But it is one of the easiest things to change, and you already ruled that out.
Date:
Files have an access, creation, and modification date, and they are managed by the operating system. They are not foolproof, or maybe not even accurate.
Also, they are very simple to change.
Version Information:
Since we are talking about executable files, then most executable files have version information attached to it. For example, original file name, file version, product version, company, description, etc. You can check these fields if you are sure the user cannot modify them by editing your executable. It doesn't require you to keep a database of allowed files. However, it does require you to have something to compare to, like company name, or a product name. Also, If someone made an executable with the same value
they can run instead of the allowed one and bypass your protection.
Location:
If the file is located in a specific place and is protected by file system access rights, and it cannot be changed, then you can use that. You can, for example, put the allowed files in a folder where the user (without admin rights) can only read/execute them, but not rename/move. Then identify the file with its location. If it is run from this location, then allow it, else block it. It is good as it doesn't need a database of
allowed/blocked files, it just compares the location, it it is a valid one, then allow, and you can keep adding and removing files to allowed locations
without affecting your program.
Size:
If the file has a specific file size, you can quickly check its size and compare it. But it is unreliable as files can be changed/patched and without any change in size. You can avoid that by also applying a CRC check to detect if the content of the file changed.
But, both size and CRC can be changed. Also, this requires you to have a list of file names and their sizes/CRC, and keeping it up to date.
Signature:
Deanna mentioned that you can self-sign your executable files. Then check if the signature matches yours and allow/deny based on that. This seems to be a good way if
it is okay for you to sign all the executable files you want to allow. It doesn't require you to keep an updated list of allowed files.
Hash:
arx also pointed out, that you can hash the files. It is one of the slowest methods, as it requires the file to be hashed every time it is executed, then compared to a list of files. But it very reliable as it can uniquely identify each file and hard to break. But, you will need to keep an up to date database of every file hash.
Finally, and depend on your needs and options, you can mix two or more ways together to get the results you want. Like, checking file name + location, etc.
I hope I covered most of things, but I'm sure there are more ways. Anyone can freely edit my post to include anything that I have missed.
I would recommend using the signature, if it has one, or the hash otherwise. Apps such as Office that update frequently are more likely to be signed, whereas smaller apps downloaded off the Internet are unlikely to ever be updated and so should have a consistent hash.

Resources