Git: list case-sensitive paths that have collided during clone

When cloning a git repository that contains case-sensitive file paths (e.g. /README.md and /readme.md) on a case-insensitive file system (like NTFS or APFS), git will only check out one of the colliding files.
On macOS, how can I list all the files that collided because of case insensitivity?

There is no built-in way to find this. phd's comment will get you close, possibly close enough, but may over-fit a bit (although one might still like to know about these things).
For instance, suppose some commit has files:
path/TO/file1.ext
path/to/file2.ext
On your file system, only one of path/TO and path/to can exist. Once one of those exists, these two files will be dropped into the same path/$to folder, where $to is either lowercase or uppercase. They will still be separate files, but will be called out by case-folding and sort-and-unique-dash-c-ing.
On macOS, we can also have collisions in paths due to Unicode normalization. Linux considers a file named 's' 'c' 'h' 'combining-umlaut' 'o' 'n' to be one file name, and a file named 's' 'c' 'h' 'o-with-umlaut' 'n' to be a second, different file name. The macOS default file systems will turn both names into a common form and claim that this is just one name. (I have no idea what Windows does with this.) A proper tool should take this into account as well.
Note that Git will store each file separately in the index, and can update each separate index entry from a file-system-stored-file independent of the stored-file's path name. So we could have Git build a mapping from internal name to external name and make it handle these cases all automatically. But that's a pretty big task.
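As a rough illustration of the case-folding and normalization idea above, here is a minimal Python sketch. The names fold_key and find_collisions are made up for this example, and NFC normalization plus str.casefold() only approximates what a real case-insensitive file system does, so treat the output as a list of candidates rather than a definitive answer. It checks every directory prefix too, so that path/TO versus path/to is reported.

```python
import unicodedata
from collections import defaultdict


def fold_key(path: str) -> str:
    """Approximate how a case-insensitive, normalizing file system sees a path.

    Assumption: NFC + casefold is only a stand-in for the real file system rules.
    """
    return unicodedata.normalize("NFC", path).casefold()


def find_collisions(paths):
    """Return groups of paths (or directory prefixes) that share a folded key."""
    groups = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        # Check every prefix so that directory-level collisions show up as well.
        for i in range(1, len(parts) + 1):
            prefix = "/".join(parts[:i])
            groups[fold_key(prefix)].add(prefix)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

The input could be the path list of a commit, for instance the output of git ls-tree -r --name-only -z HEAD split on NUL bytes (the -z form avoids Git's quoting of non-ASCII names).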

Related

How do I not commit the development team lines in project.pbxproj without deselecting those lines manually?

I am collaborating with my friend on an iOS app. We use different Apple IDs in our Xcodes, so in "Signing and Capabilities" tab of project settings, we select different teams in the "Team" field:
From my observation, changing this affects the MyProject.xcodeproj/project.pbxproj file, which stores the file references that the Xcode project has, in addition to the "Team". Here's a snippet of what is changed:
buildSettings = {
ASSETCATALOG_COMPILER_APPICON_NAME = AppIcon;
ASSETCATALOG_COMPILER_GLOBAL_ACCENT_COLOR_NAME = AccentColor;
CODE_SIGN_STYLE = Automatic;
DEVELOPMENT_TEAM = <my team ID>; /* this is changed */
INFOPLIST_FILE = MyProject/Info.plist;
LD_RUNPATH_SEARCH_PATHS = (
"$(inherited)",
"@executable_path/Frameworks",
);
PRODUCT_BUNDLE_IDENTIFIER = io.github.sweeper777.MyApp;
PRODUCT_NAME = "$(TARGET_NAME)";
SWIFT_VERSION = 5.0;
TARGETED_DEVICE_FAMILY = 1;
};
The problem arises when one of us commits this file and the other person pulls. The "puller" will now have the "Team" set to something invalid. When this person then tries to run the app on a real device, there will be code signing errors for obvious reasons. To solve this, this person must tediously go through all the targets that we have, and set each "Team" to their own team.
How can we make it so that on each person's computer, the "Team" stays the same after pulling, but any other changes to MyProject.xcodeproj/project.pbxproj are applied?
Remarks:
Putting the entire MyProject.xcodeproj/project.pbxproj in .gitignore doesn't work, because that would ignore every other change to it. Adding a new file to the project, for example, also changes MyProject.xcodeproj/project.pbxproj, and we want to be able to pull that change.
Manually deselecting the lines that say "DEVELOPMENT_TEAM = ..." when committing is as tedious as reselecting the correct team every time, so that's not a solution.
I found this. Apparently, I can configure git to run sed before git checkout and git add. However, that answer seems to ignore the line by deleting it completely. This means that my friend, when he pulls, would still have to reselect the correct team. What I want is the kind of "ignore" that simply stops tracking that line. That is, if there is a local version of that line, use that.
I am also aware that this all wouldn't be a problem if we were on the same team. But if I understand this correctly, I can't have multiple people on my team unless I have a Company account, and not only can I not afford that, I don't own a Company.

I don't use Xcode itself and do not know how to smuggle Git hooks and scripts past the Xcode interface, so you'll need more than just this answer. But you mention sed in comments, and given your proposed file format, that may well be the way to go:
buildSettings = {
ASSETCATALOG_COMPILER_APPICON_NAME = AppIcon;
ASSETCATALOG_COMPILER_GLOBAL_ACCENT_COLOR_NAME = AccentColor;
CODE_SIGN_STYLE = Automatic;
DEVELOPMENT_TEAM = <my team ID>; /* this is changed */
INFOPLIST_FILE = MyProject/Info.plist;
LD_RUNPATH_SEARCH_PATHS = (
"$(inherited)",
"@executable_path/Frameworks",
);
PRODUCT_BUNDLE_IDENTIFIER = io.github.sweeper777.MyApp;
PRODUCT_NAME = "$(TARGET_NAME)";
SWIFT_VERSION = 5.0;
TARGETED_DEVICE_FAMILY = 1;
};
Git has the ability to run what it calls clean and smudge filters. These can be used to run any arbitrary program you like, including sed, the "stream editor", which is particularly good at making single-line changes based on regular expression matches.
There is another method that may also work, and may "play better" with Xcode, or may play worse. I'll go over that too, after covering clean and smudge filters.
Before we dive into writing clean and smudge filters, and using them from Git—you'll need to know all of these details as you will have to write your own custom filters—we should start with a simple fact about Git commits: No part of any commit can ever be changed. Once you make a commit, the stuff that's inside the commit—the stored data in all of its files—is the way it is, forever. So these filters have to work within that system. Remember that, as it will help with understanding what we're doing.
How Git makes and stores objects
The files inside a commit are not files, exactly: they're not the same thing as files in your file system, at least. Instead, they are what Git calls objects, specifically blob objects. A blob object holds the file's data; other objects hold the file's name; and commit objects collect everything together to be used all at once. There's one more internal object type for annotated tags but we'll stop here as we're really only interested in the blob-object part.
When Git extracts a commit, it reads the internal blob objects and runs them through internal code to decompress and format them into regular files. This can include doing end-of-line hacking (turning LFs into CRLFs) if desired. Normally, all this happens entirely inside Git, and the end result is that Git writes out an ordinary everyday file for you to use. This ordinary file is what you will work on / with, in Xcode or any other editor and compiler system and so on. These ordinary files are in your working tree.
After you've extracted some commit, you'll do some work on it, by changing some or all of the files in your working tree, to achieve whatever result you wanted. This can include changing the buildSettings, editing Swift code, editing Objective-C code, and so on. You might add all-new files to the working tree, some of which you never commit at all (you can help make sure they are never committed by listing such files in .gitignore).
Eventually, though, you'd like to commit the updated code. To do so, you must run git add, or maybe have your IDE run git add for you (perhaps Xcode has clicky buttons to do this). This invokes code in Git that converts the working tree file(s) back to internal blob objects if and as needed.
Again, normally this is all handled entirely inside of Git. Git will read the working tree file, maybe do CRLF-to-LF-only changes, compress the text, search for duplicate objects, and do all the other complicated things necessary to prepare the file, so that it is ready to be committed. The resulting data need not match what's in your working tree at all: it just has to be something that, when Git later goes to extract the file, produces what you will need in your working tree.
Clean and smudge filters
This is where these clean and smudge filters come in. I said, above, that normally, Git does the extraction and insertion all on its own. For binary files, the only thing Git does here is apply lossless compression.[1] For text files, Git can do CRLF/LF substitutions as well. But what if you'd like to do your own operations?
You can: Git will let you do whatever you want during the extract process with a smudge filter, and will let you do whatever you want during the compress process with a clean filter. The clean filter replaces the in-file data, using a stream-edit type process,[2] and then Git does its CRLF hacking if any and compressing on the "cleaned" data. The smudge filter replaces the decompressed, post-CRLF-hacking data coming out of Git with the data that should go into the working tree.
Hence you can write, as your clean filter, a sed script of the form:
s/DEVELOPMENT_TEAM = .*;/DEVELOPMENT_TEAM = DEVTEAMTEMPLATE;/
With that as the entire sed script, what sed will do is edit the incoming data stream and replace any actual development team text with the word DEVTEAMTEMPLATE.
Your smudge filter has to work slightly harder: it must find the template line and adjust it so that it contains the correct team ID. Where will you get the correct team ID? That's up to you: perhaps you can store it in a file in your home directory, or in a file that you create in the working tree but never commit in Git. You'll have to write this one or two or however-many-liner sed and/or shell script yourself.
[1] There are multiple phases of compression; git add does just one, and git checkout undoes all of them (including reading from "pack" files) as needed. The deeper level of compression, using delta encoding techniques, is entirely invisible at the "object" level, so nobody ever really has to think about it.
[2] With the advent of Git-LFS, Git gained the ability to run long-lived filters. Before that, Git always used simple stream filtering. The stream filtering is easier to understand, but is less efficient for doing en-masse operations on many files. Here, we're only interested in one file per repository anyway, so there's no need to go into the fancier long-lived filter details.
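To make the clean/smudge pair described above a bit more concrete, here is a minimal sketch of the two transformations in Python rather than sed. The function names, the regular expression, and the idea of passing the team ID as an argument are all assumptions for illustration; the only piece taken from the text is the DEVTEAMTEMPLATE placeholder. Unlike the sed pattern, this one stops at the first semicolon, so a trailing comment on the line is left alone.

```python
import re

# Hypothetical pattern, modelled on the buildSettings snippet in the question:
# match the team assignment up to (and including) its first semicolon.
TEAM_LINE = re.compile(r"DEVELOPMENT_TEAM = [^;]*;")
PLACEHOLDER = "DEVTEAMTEMPLATE"


def clean(text: str) -> str:
    """Replace any real team ID with the placeholder before Git stores the data."""
    return TEAM_LINE.sub(f"DEVELOPMENT_TEAM = {PLACEHOLDER};", text)


def smudge(text: str, team_id: str) -> str:
    """Replace the placeholder with the local team ID when Git writes the file."""
    return text.replace(
        f"DEVELOPMENT_TEAM = {PLACEHOLDER};",
        f"DEVELOPMENT_TEAM = {team_id};",
    )
```

A real filter would wrap these in a small script that reads standard input and writes standard output, declared under filter.<name>.clean and filter.<name>.smudge in the Git configuration and attached to the file through .gitattributes. Where the local team ID comes from is still up to you.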
Defining clean and smudge filters
The tricky part here, with Git, is that you must define the filters in one place—in $HOME/.gitconfig or .git/config, for instance—and then tell Git to invoke them from another place, using the .gitattributes file. This is described in the gitattributes documentation. This documentation is pretty thorough, so read it. You can ignore all the long-running filter discussion, as noted above. I will quote one bit from the documentation here for emphasis, though, and expound on it:
Note that "%f" is the name of the path that is being worked on. Depending on the version that is being filtered, the corresponding file on disk may not exist, or may have different contents. So, smudge and clean commands should not try to access the file on disk, but only act as filters on the content provided to them on standard input.
When Git is running the smudge filter, it:
has opened some internal object (which may or may not be packed);
has decompressed it, or is in the process of decompressing it, and pumped / is-pumping out the data; and
is feeding this data to your filter, without writing it out to any file anywhere.
Your filter can use %f to know the name of the target output file, but the data are not in that file yet. The data bytes are only in some OS-level pipes or sockets or whatever your OS uses for connecting the output of one program (Git's internal decompressors) to another (your filter). Your smudge filter must read its standard input to get the data, and write the smudged data to standard output so that Git can read it (if necessary) and/or redirect that output to the correct file. Do not attempt to open the file by name!
(The same holds for the clean filter, except that in many cases, the input to your filter is just the raw data already in the file, so that opening the file and reading it mostly works. So this can mislead you, if you do your tests using a clean filter.)
Note that you can implement this scheme without a clean filter at all: your smudge filter can replace whatever is in the committed file even if it's a real team ID, rather than just a template. If you choose to do this, however, you'll "see" the team ID changing every time a different team-ID commits the file. The nice thing about using the clean filter is that once the committed copies of the file use the template line, every future cleaned file also uses the template line, so that it never changes.
Alternative: a template file
In general, it's unwise to commit actual configurations. Clean and smudge tricks can work, but they can only go so far: this particular file format works well because the change you want made is on a single line, and Git itself shows you file changes on a line-by-line basis, and sed works well with line-oriented input, and so on.
A lot of configuration files, though, wind up storing at least slightly-sensitive data, or perhaps very-sensitive data such as cleartext passwords. Such files should not be stored in Git at all if at all possible. Instead, you would store a template file in Git.
In this case, for instance, instead of storing MyProject.xcodeproj/project.pbxproj, you might have Git store MyProject.xcodeproj/project.pbxproj.template. This file would have template-ized contents. When you clone and check out the repository, you'd subsequently copy the template file into place and do any required adjustments.
Should the MyProject.xcodeproj/project.pbxproj file itself need to change, e.g., to acquire a new SWIFT_VERSION setting, you'd instead edit the template file, add that to Git, and commit. You would then use the usual "convert template to mine" process, or manually update the MyProject.xcodeproj/project.pbxproj file. Since this file is never committed—and is listed in .gitignore—it never goes into any commit and you never have to worry about collisions within it. Only the template file goes into Git.

How do I get a file manifest for each revision in a git repository?

I have a git repository that was created on Microsoft Windows. Microsoft Windows has a case insensitive file system. The people checking into this repository have not been careful about the case of their filenames. This means that the same directory or file sometimes shows up under two different names.
I mean to fix this problem. But in order to really fix it, I have to get a handle on it.
Is there a quick and simple way to get a list of the files at each revision?
I need this in order to figure out which revisions (if any) have the same file under two different names so I can decide on a strategy for fixing such cases. This means I need to get this information en-masse as quickly as possible so the analysis consumes a reasonable amount of time.

One way to get this is with ls-tree:
git ls-tree -r --name-only <commit>
(Note that this looks at the portion of the tree corresponding to your current directory, so you should either run it from the top level of your repo, or give the --full-tree option.)
This is essentially instantaneous, since all Git has to do is recursively examine the tree; it doesn't even have to look at the contents of files.
I'm not sure how you're going to use a list of filenames to detect the same file under two different names. If you just mean that you want to look for filenames that would be the same on a case-insensitive filesystem, then the list of filenames is all you need.
However, if you think the files might actually have the same content, you could drop the --name-only, so that you'll also see the SHA-1s of all the files, and can find identical files by looking for duplicate hashes.
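For the duplicate-hash idea, a small Python sketch along these lines could do the grouping. It assumes the usual git ls-tree -r line layout (mode, type, and hash separated by spaces, then a tab, then the path); the function name duplicate_blobs is made up for this example.

```python
from collections import defaultdict


def duplicate_blobs(ls_tree_output: str):
    """Map each blob hash that appears under several paths to those paths."""
    by_hash = defaultdict(list)
    for line in ls_tree_output.splitlines():
        if not line:
            continue
        # Everything before the tab is metadata; everything after is the path.
        meta, path = line.split("\t", 1)
        _mode, obj_type, obj_hash = meta.split()
        if obj_type == "blob":
            by_hash[obj_hash].append(path)
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Running it once per commit on the text produced by git ls-tree -r <commit> would give, for each revision, the sets of paths that hold byte-identical content.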

You could run something like this:
git log --name-only --pretty="format:%H"
This command will show the SHA-1 and the list of changed files for every revision.

Git rename detection when class and filename changed in one commit

What is the best way to handle class renames (e.g. done with Resharper) with Git?
That is, if both the class name and containing file name are changed together and committed without further changes.
It seems the way Git handles renames via a percentage changed heuristic is a bit hit and miss.
For large classes it will be recognised as a rename but for small classes the percentage threshold is reached such that it will be seen as a delete and add.

Keep in mind that in Git's history, file renames are not stored as "this was renamed from X to Y". Instead, the file X exists in one revision, and in the next revision Y exists (and X doesn't). For example:
Revision | Files
---------+----------------------------------
HEAD^ | a.cpp x.cpp z.cpp
HEAD | a.cpp y.cpp z.cpp
In the above diagram, each revision is a row and each contains three files. Between the two revisions, x.cpp was renamed to y.cpp. The only information that the repository stores is the contents of each separate revision.
When Git (or another tool that reads Git repositories) looks at the above history, it notices that y.cpp is a new file in HEAD. Then it looks at the previous revision to see whether a similar file existed. In the case of a straight file rename, then yes, a file called x.cpp with the identical contents existed in the previous revision (and no longer exists in the current revision). So the new file is shown as a rename from x.cpp to y.cpp.
In the case of a rename-and-modify, Git will look at the previous revision's files to see if one file looks close to the new file (in terms of its contents). This is where the heuristic comes in. If most of the lines are the same, then Git will show it as a rename, but if there are enough changed lines compared to unchanged lines, then Git will simply say it looks like a new file.
To answer your question, the best way to handle ReSharper class renames is to simply do it and commit the new files. Git stores the old and the new files in its repository. Rename detection is handled later, at the time you actually ask about the history. This is why commands such as git log have options like --find-copies and --find-copies-harder.
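To see why small classes fall below the threshold while large ones do not, here is a toy similarity check in Python. This is not Git's actual algorithm: it uses difflib's line-matching ratio as a stand-in, and the function names are made up. The 0.5 default is chosen to mirror the 50% default of Git's rename detection.

```python
import difflib


def similarity(old_text: str, new_text: str) -> float:
    """Rough line-based similarity between two file versions (0.0 to 1.0)."""
    matcher = difflib.SequenceMatcher(
        None, old_text.splitlines(), new_text.splitlines()
    )
    return matcher.ratio()


def looks_like_rename(old_text: str, new_text: str, threshold: float = 0.5) -> bool:
    """Stand-in for the heuristic: similar enough counts as a rename."""
    return similarity(old_text, new_text) >= threshold
```

In a three-line class, renaming the class touches the declaration and the constructor, so only the closing brace still matches and the pair looks like a delete plus an add. In a larger class the same two changed lines barely move the ratio.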

Years later this is still a quirk I guess because it's fundamental to the way git works.
What I do now is do the rename, commit, then do the change. Bit annoying with refactoring tools but no other solution retains history (--find-copies and --find-copies-harder don't seem to work).

Are there any invalid linux filenames?

If I wanted to create a string which is guaranteed not to represent a filename, I could put one of the following characters in it on Windows:
\ / : * ? | < >
e.g.
this-is-a-filename.png
?this-is-not.png
Is there any way to identify a string as 'not possibly a file' on Linux?

There are almost no restrictions - apart from '/' and '\0', you're allowed to use anything. However, some people think it's not a good idea to allow this much flexibility.
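As a quick sketch of that rule in Python (the function name is made up, and it checks a single name, not a whole path): a name is acceptable as long as it is non-empty and contains neither a slash nor a NUL byte. It works on bytes because Linux file names are byte strings.

```python
def is_valid_linux_filename(name: bytes) -> bool:
    """True if `name` could be a single file name (one path component)."""
    return len(name) > 0 and b"/" not in name and b"\0" not in name
```

So a string containing a NUL byte is one that can never name a file, whereas something like ?this-is-not.png is perfectly legal on Linux.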

An empty string is the only truly invalid path name on Linux, which may work for you if you need only one invalid name. You could also use a string like "///foo", which would not be a canonical path name, although it could refer to a file ("/foo"). Another possibility would be something like "/dev/null/foo", since /dev/null has a POSIX-defined non-directory meaning. If you only need strings that could not refer to a regular file you could use "/" or ".", since those are always directories.

Technically it's not invalid, but files with a dash (-) at the beginning of their name will cause you a lot of trouble, because such names conflict with command-line options.

I personally find that a lot of the time the problem is not Linux but the applications one is using on Linux.
Take for example Amarok. Recently I noticed that certain artists I had copied from my Windows machine were not appearing in the library. I checked and confirmed that the files were there, and then I noticed that certain characters in the folder names (named for the artist) were represented with a weird-looking square rather than an actual character.
In a shell terminal the filenames look even stranger: /Music/Albums/Einst$'\374'rzende\ Neubauten is an example of how strange.
While these files were definitely there, Amarok could not see them for some reason. I was able to use some shell trickery to rename them to sane versions which I could then re-name with ASCII-only characters using Musicbrainz Picard. Unfortunately, Picard was also unable to open the files until I renamed them, hence the need for a shell script.
Overall this is a tricky area, and it seems to get very thorny if you are trying to synchronise a music collection between Windows and Linux wherein certain folder or file names contain funky characters.
The safest thing to do is stick to ASCII-only filenames.

definition of filename?

After years of programming it's still some of the simple things that keep tripping me up.
Is there a commonly agreed definition of filename?
Even the Wikipedia article confuses the two interpretations.
It starts by defining it as 'a special kind of string used to uniquely identify a file stored on the file system of a computer'. That seems clear enough, and suggests that a filename is a fully qualified filename, specifying the complete path to the file.
However, it then goes on to:
talks about basename and extension (so basename would contain an absolute path?)
says that the length of a filename in DOS is limited to 8.3
says that a filename without a path part is assumed to be a file in the current working directory (so the filename does not uniquely identify a file)
So, simple questions:
what is a correct definition of 'filename' (include references)
how should I unambiguously name variables for:
a path to a file (which can be absolute/full or relative)
a path to a resource that can be a file/directory/socket

No references, just vernacular from experience. When I'm being specific I tend to use:
path or filespec (or file specification): all of the characters needed to identify a file on a filesystem. The path may be absolute (starting from the root, or topmost, directory) or relative (starting from the currently active directory).
filename: the characters needed to identify a file within the current directory.
extension: characters at the end of the filename that typically identify the type of the file. By convention, the extension usually starts with a dot ("."), and a filename may contain more than one extension.
basename: the filename up to (but not including) the dot that begins the first extension.
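Here is how those four terms might map onto Python's pathlib, as a sketch only. The function name split_terms is made up, and note that pathlib's own .stem strips just the last extension, so the basename as defined above (everything before the first extension) is computed by hand.

```python
from pathlib import PurePosixPath


def split_terms(path_str: str) -> dict:
    """Break a path into the terms used above: path, filename, extensions, basename."""
    path = PurePosixPath(path_str)
    filename = path.name            # the last component only
    extensions = path.suffixes      # e.g. ['.tar', '.gz']
    # Basename in the sense above: the filename with all extensions removed.
    basename = filename[: len(filename) - len("".join(extensions))]
    return {
        "path": str(path),
        "absolute": path.is_absolute(),
        "filename": filename,
        "extensions": extensions,
        "basename": basename,
    }
```

For variable naming, this suggests something like path for anything that may contain directories and filename for the last component alone.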

Javadoc for File.getName() method

file·name also file name (fīl'nām') n. A name given to a computer file to distinguish it from other files, often containing an extension that classifies it by type.
Dictionary.com
It states that a filename is used to name a file, (just like you name a person). And that it's used to distinguish it from other files. This does not tell you it includes a path, or other file-system imposed attributes.
This definition does say that often a filename has an extension. But this definition is very careful... (Which I think is a good thing)
So, before you start thinking about paths and such, you have to set your scope. Are you in a Unix world? Are you in a DOS/Windows world?

Again no references, but the file name specification depends on the operating system or, to be more accurate, the file system. Let's start with early versions of DOS (Disk Operating System). File names were names of up to 8 characters containing numbers, letters, dashes, and underscores. They were followed by a three, two, one, or even zero character extension used to identify the file type. A dot separated the name from the extension. The name had to be unique in the directory.
You could extend the name by adding a directory name, or series of directory names. A slash character separated the directory names from each other and from the file name. This was usually referred to as the path name. The path was relative to the current directory.
Finally, in DOS you could include the drive name, usually a single letter followed by a : and a slash (on some systems, two slashes). Adding the drive to the path made it an absolute path instead of relative.
Today most of us use long file names which do not follow the old 8-character-dot-3-character pattern. Still, many file systems keep such a name and use the long name simply as a pointer to the old-style identifier.
