git: detect file creator - bash

I run a book digitizing project and pay people a certain rate per 10K characters for the text files they upload to a git repo. Until now I was using the following command to detect files authored by CertainEditor:
git log --use-mailmap --no-merges --author="CertainEditor" --name-only --pretty=format:""
Then I would pipe this output into wc -m to get the number of characters authored by CertainEditor and pay him accordingly. Usually editors do not touch each other's files, and everything worked well. However, recently one editor spotted a typo in somebody else's file and corrected it. This behavior is actually good and I would like to encourage it. However, the command above now lists the corrected file as his as well, so he gets paid for all the characters in the file while he changed only one of them. This, obviously, is not good.
Do you have an idea how I can list all the files created (not authored / committed) by a user, so I can implement fair character counting?
Maybe some commit hook could be used that checks whether the author of a new commit differs from the previous one and, if so, overrides it with the previous author?
Those small changes by a non-creator may go unpaid (the work is not so intensive and may be reciprocal), but if you have a good idea for how I can pay for the "diff" the non-creator provides, that would be nice too!

Overall not a nice approach, as mentioned in the comments, but you can still get the author of a file via git:
git log --diff-filter=A --pretty=format:"%an" <file>
So you can find all files created by the author and filter out the rest, or find the author of each file selected by your original command and filter on that.
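As a rough sketch of that filtering in one pipeline (the editor name is a placeholder, and this assumes the files are plain text and still present in the working tree):

# Count characters only in files whose adding commit was authored by the editor.
EDITOR_NAME="CertainEditor"
git log --use-mailmap --no-merges --diff-filter=A \
        --author="$EDITOR_NAME" --name-only --pretty=format:"" \
  | sort -u \
  | while read -r file; do
      [ -f "$file" ] && cat "$file"   # skip files that were later deleted or renamed
    done \
  | wc -m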

Related

Can I remove all duplicates (not only consecutive) to put my histories (.bash_history, .gdb_history) under version control?

I have this mantra:
If it's plain text and it is valuable, put it under version control.
So far I have the following under git:
My editor (Emacs/Spacemacs) configuration.
My bash configuration.
Various todo lists.
I have created a repo to store my histories, but I have come across the issue of having duplicates in them.
To me it is pretty disappointing that HISTCONTROL=erasedups only deals with consecutive duplicates.
Would it be possible to create a hook that is executed every time I enter a new command to remove duplicates in the histories?
Or should it be a pre-push git hook that runs each time I push to the repo?
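If the hook route appeals, the dedup step itself could be as small as this (assuming a plain history file with no timestamp lines; the filename is an example, and the result still has to be committed for the push to include it):

histfile=".bash_history"
# Keep only the most recent occurrence of each command, preserving order.
tac "$histfile" | awk '!seen[$0]++' | tac > "${histfile}.dedup" &&
  mv "${histfile}.dedup" "$histfile"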

Monitor A File For Additions And Get Last Added Line

I'm having trouble monitoring a file for changes. I need to be able to know when a file changes, and when it does, I need the new line that was added. I intend to parse each line and find ones that match certain criteria, and act on information in those lines. I know the expected number of matching lines ahead of time, but I do not know how many lines in total will be added to the file, or where the matching lines will be.
I've tried 2 packages so far, to no avail.
fsnotify/fsnotify
As far as I can tell, fsnotify can only tell me when a file is modified, not what the details of the modification were. Since I need to know exactly what was added to the file, this is no good for me.
(As a side-question, can this be run in a loop? The example that I tried exited after just one modification. I need to monitor for multiple modifications.)
hpcloud/tail
This package tries to mimic the Unix tail command, but it seems to have its own issues. The output that I get includes timestamps and other data - I just want the added line, nothing else. Also, it seems to think a file has been modified multiple times, even when it's just one edit. Further, the deal breaker here is that it does not output the last line if the line was not followed by a newline character.
Delegating to tail
I came across this answer, which suggests delegating this work to the tail command itself, but I need this to work cross-platform (specifically macOS, Linux and Windows). I don't believe an equivalent command exists on Windows.
How do I go about tackling this?
@user2515526,
A changed diff is usually out of scope for file watchers' functionality because, you know, you could change an image, and the watcher would then need to keep several MB of diff in memory per file, and what if there are thousands of files?
However, as bad as it sounds, this may be exactly the way you want to implement it (it depends on your app, of course; it could be fine for text files), i.e. keeping a map of diffs (one diff per file) since the last modification. I can't say I like it, but it sounds like fsnotify has no support for the changes/diffs that you need.
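Purely as a shell sketch of that idea (the watched path is an example, and polling stands in for the fsnotify event; a Go version would keep the previous contents in a map keyed by filename instead of a snapshot file):

WATCHED="app.log"
SNAPSHOT=$(mktemp)
cp "$WATCHED" "$SNAPSHOT"
while true; do
  if ! cmp -s "$WATCHED" "$SNAPSHOT"; then
    # Print only the lines added since the last snapshot.
    diff "$SNAPSHOT" "$WATCHED" | sed -n 's/^> //p'
    cp "$WATCHED" "$SNAPSHOT"
  fi
  sleep 1
done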
Also, regarding your question about running in a loop, maybe you can get some hints here: https://github.com/kataras/iris/blob/8370d76910cdd8de043753ed81ae080eae8dc798/utils/file.go
It's a framework that lets you build a server that watches for TypeScript file changes, so it sounds similar to your case/question.
Cheers,
-D

Does git checkout update all files?

Newbie question; I want to make sure I understand this.
When I git checkout <revision>, does this return the entire project to its state at that moment, or does it only recreate the files changed in that particular revision?
For example: If my folder was completely empty besides the .git repo, and I git checkout master, will the resulting files be the project in its entirety, or only the files changed in the most recent commit?
I ask, because I am checking out my project at various points (starting from the beginning), and instead of the project slowly growing in size as one would expect, the size of each checkout is varying quite a lot.
When I git checkout <revision>, does this return the entire project to its state at that moment, or does it only recreate the files changed in that particular revision?
If your working tree and staging area are completely empty (besides the .git subdirectory, of course) and you run
git checkout <revision>
then your working tree and staging area will perfectly reflect the contents of that particular revision.
On the other hand, if your working tree is not empty when you run git checkout, what happens is much more subtle, and may be broken down into three cases:
The checkout is not problematic and Git carries it out without batting an eyelid: the contents of that particular revision get copied to your working tree (and overwrite stuff already present there, if needed). Or
The checkout, if it were carried out, would result in a loss of local changes; therefore, Git (under the assumption that you didn't use the -f flag) tells you off and aborts the checkout. Or
A more complicated situation may arise in which stuff is only partially checked out, and some local, uncommitted changes are kept in your working tree and/or index. More details about that situation can be found in my answer to Why are unstaged changes still present after checking out a different branch?.
[...] the size of each checkout is varying quite a lot.
Are you taking into account untracked files? Did you commit, then later remove large files? On the basis of the information given in your question alone, we can do little more than hypothesize about the reason why the size varies a lot.
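If you want to see the first case for yourself, here is a quick check (HEAD~5 is just an example revision):

git ls-tree -r --name-only HEAD   | wc -l    # every tracked file at HEAD
git ls-tree -r --name-only HEAD~5 | wc -l    # every tracked file five commits back
git checkout HEAD~5                          # materializes that complete list,
ls                                           # not just the files changed in that commit
git checkout -                               # go back to where you were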
From the documentation: "Updates files in the working tree to match the version in the index or the specified tree." In the case of your example, it will restore the project in its entirety to its state at that revision.
However, as Jubobs pointed out, there is a difference in behaviour if you have made any changes to the state of your repository since your last checkout. His answer is more comprehensive than mine if that is the case.
Also note that this will only apply to files that are tracked by git, so any other files you have lying around will not be affected.

Performance efficient Git history

Is there a way to efficiently read previous versions of files in Git? (I'm using Git as a database for a content management system and need it to display history.)
Git doesn't store full versions of files, it only stores differences, so if you need a previous version you can't just read it from disk; you have to ask Git to calculate it from the differences.
It seems that GitHub somehow does this; for example, you can see a previous version of a file. Does it really calculate it for every HTTP request, or does it optimize this somehow?
Erratum: git ALWAYS stores full versions of files. Thus viewing any revision is equally efficient.
This is in marked contrast to some other revision control systems, which store only diffs (patches). CVS in particular was hideous for accessing deep history or non-trunk branches for this very reason (in a large repository with many users).
For reference, to conveniently access a particular file at a particular version (SHA/reference):
git show HEAD:full/path/to/file
Replace HEAD with a tag, branch, or git SHA (the long hex number).
The path is the full path relative to the base of the git repository, not the file system root. I only mention this because it has bitten me a few times: you can't cd into a directory and expect not to have to specify the full path.
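A few usage examples (the file names here are hypothetical):

git show HEAD:src/utils/parse.sh   # works from the repo root...
cd src/utils
git show HEAD:src/utils/parse.sh   # ...and the path stays the same from a subdirectory
git show v1.2:README.md            # a tag in place of HEAD
git show 4f5e6a7:README.md         # or an (abbreviated) commit SHA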
Wikipedia (home of all that is true and good) backs me up:
Git stores each revision of a file as a unique blob object. The relationships between the blobs can be found through examining the tree and commit objects. Newly added objects are stored in their entirety using zlib compression.
In case Wikipedia isn't your bag, a careful reading of the git internals manual also verifies it.
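If you prefer to verify it directly, the plumbing commands show the stored blob for any revision (the file name is again hypothetical):

git rev-parse HEAD:README.md     # the ID of the blob holding this file at HEAD
git cat-file -s HEAD:README.md   # the size of that stored blob
git cat-file -p HEAD:README.md   # its full contents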

command line wisdom for 2 panel file manager user

I want to improve my file management productivity by replacing my 2-panel file manager with the command line (bash or Cygwin). Can the command line give the same speed? Please advise on a guru way of doing things, e.g. copying a file from directory A to directory B. Is it heavy use of pushd/popd? Or creating links to the most often used directories? What are the best practices and the day-to-day routine for managing files as a command-line master?
Can the command line give the same speed?
My experience is that command-line copying is significantly faster (especially in the Windows environment). Of course, the basic laws of physics still apply: a file that is 1000 times bigger than one that copies in 1 second will still take 1000 seconds to copy.
(how to) copy a file from directory A to directory B
Because I often have 5-10 projects that use similar directory structures, I set up variables for each subdirectory using a naming convention:
project=NewMatch
NM_scripts=${project}/scripts
NM_data=${project}/data
NM_logs=${project}/logs
NM_cfg=${project}/cfg
proj2=AlternateMatch
altM_scripts=${proj2}/scripts
altM_data=${proj2}/data
altM_logs=${proj2}/logs
altM_cfg=${proj2}/cfg
You can make this sort of thing as spartan or baroque as needed to match your theory of living/programming.
Then you can easily copy the cfg files from one project to another:
cp -p $NM_cfg/*.cfg ${altM_cfg}
Is it heavy use of pushd/popd?
Some people seem to really like that. You can try it and see what you think.
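For reference, the basic round trip looks like this (reusing one of the directory variables from above):

pushd "$NM_cfg"   # jump there, remembering where you came from
ls *.cfg
popd              # and jump back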
Or creating links to the most often used directories?
Links to directories are, in my experience, used more in software development, where source code expects a certain set of directory names and your installation has different ones; making links to supply the expected paths is helpful there. For production data, a link is just one more thing that can get messed up or blow up. That's not always true, and you may have a really good reason to use links, but I wouldn't start out that way just because it is possible.
What are the best practices and the day-to-day routine for managing files as a command-line master?
(Per the above, use a standardized directory structure for all projects. Have scripts save any small files to a directory your department keeps under /tmp, e.g. /tmp/MyDeptsTmpFile, named to fit your local conventions.)
It depends. If you're talking about data and logfiles, dated filenames can save you a lot of time. I recommend date formats like YYYYMMDD (or YYYYMMDD_HHMMSS if you need the extra resolution).
Dated logfiles are very handy: when a current process seems to be taking a long time, you can look at the log file from a week, a month, or six months ago (up to however much space you can afford to keep) and quantify exactly how long the process took then. Logfiles should also capture all STDERR messages, so you never have to re-run a bombed program just to see what the error message was.
This is Linux/Unix you're using, right? Read the man page for the cp command installed on your machine. I recommend using an alias like alias CP='/bin/cp -pi' so you always copy a file with the same permissions and the original file's timestamp. Then it is easy to use /bin/ls -ltr to see a sorted list of files with the most recent ones at the bottom (no need to scroll back to the top when you sort by time, reversed). Also, the '-i' option will warn you that you are about to overwrite a file, and that has saved me more than a couple of times.
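Putting those pieces together, a typical session might look like this (the script and directory names are examples only):

alias CP='/bin/cp -pi'                       # preserve permissions/timestamps, prompt before overwrite
logfile="${NM_logs}/load_data_$(date +%Y%m%d_%H%M%S).log"
./load_data.sh > "$logfile" 2>&1             # dated log that also captures STDERR
/bin/ls -ltr "${NM_logs}"                    # newest logs at the bottom of the listing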
I hope this helps.
P.S. As you appear to be a new user: if you get an answer that helps you, please remember to mark it as accepted and/or give it a + (or -) as a useful answer.
