How can I use Git to identify function changes across different revisions of a repository?

I have a repository with a bunch of C files. Given the SHA hashes of two commits,
<commit-sha-1> and <commit-sha-2>,
I'd like to write a script (probably bash/ruby/python) that detects which functions in the C files in the repository have changed across these two commits.
I'm currently looking at the documentation for git log, git commit and git diff. If anyone has done something similar before, could you give me some pointers on where to start or how to proceed?

Git alone doesn't look too well suited to this, but you could combine it with your
favorite tagging system, such as GNU Global, to achieve that. For
example:
#!/usr/bin/env sh
global -f main.c | awk '{print $1}' | while read -r i
do
    if [ "$(git log -L:"$i":main.c HEAD^..HEAD | wc -l)" -gt 0 ]
    then
        printf "%s() changed\n" "$i"
    else
        printf "%s() did not change\n" "$i"
    fi
done
First, you need to create a database of functions in your project:
$ gtags .
Then run the above script to find the functions in main.c that were
modified since the last commit. The script could of course be more
flexible; for example, it could handle all *.c files changed between two commits as reported by git diff --stat (one way to do that is sketched after the quoted documentation below).
Inside the script we use the -L option of git log:
-L <start>,<end>:<file>, -L :<funcname>:<file>
Trace the evolution of the line range given by
"<start>,<end>" (or the function name regex <funcname>)
within the <file>. You may not give any pathspec
limiters. This is currently limited to a walk starting from
a single revision, i.e., you may only give zero or one
positive revision arguments. You can specify this option
more than once.
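Building on that, here is a rough sketch of the two-commit variant mentioned above. It assumes a gtags database already exists at the repository root, and that the function names global emits are plain enough to work as the <funcname> regex; the script is illustrative, not a tested tool:
#!/usr/bin/env sh
# usage: ./changed-functions.sh <commit-sha-1> <commit-sha-2>
# Requires a GNU Global database: run `gtags .` at the repo root first.
git diff --name-only "$1" "$2" -- '*.c' | while read -r f
do
    global -f "$f" | awk '{print $1}' | while read -r func
    do
        # git log -L errors out if the function is absent at a revision;
        # discard stderr and treat any output as "changed".
        if [ "$(git log -L:"$func":"$f" "$1".."$2" 2>/dev/null | wc -l)" -gt 0 ]
        then
            printf '%s: %s() changed\n' "$f" "$func"
        fi
    done
done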

See this question.
Bash script:
#!/usr/bin/env bash
git diff | \
    grep -E '^@@' | \
    grep '(' | \
    sed 's/@@.*@@//' | \
    sed 's/(.*//' | \
    sed 's/\*//' | \
    awk '{print $NF}' | \
    uniq
Explanation:
1: Get diff
2: Get only lines with hunk headers; if the 'optional section heading' of a hunk header exists, it will be the function definition of a modified function
3: Pick only hunk headers containing open parentheses, as they will contain function definitions
4: Get rid of the '@@ [old-file-range] [new-file-range] @@' sections in the lines
5: Get rid of everything after opening parentheses
6: Get rid of '*' from pointers
7: [See 'awk']: Print the last field (i.e: column) of the records (i.e: lines).
8: Get rid of duplicate names.
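The question asks about two specific commits; the same pipeline applies unchanged if the first command diffs those commits instead of the working tree. A sketch, reusing the question's placeholders (sort -u stands in for uniq so duplicate names need not be adjacent):
git diff <commit-sha-1> <commit-sha-2> | \
    grep -E '^@@' | \
    grep '(' | \
    sed 's/@@.*@@//; s/(.*//; s/\*//' | \
    awk '{print $NF}' | \
    sort -u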

Related

Merge 2 text files with differences marked as conflicts

I'd like to see the complete output of 2 merged files where each difference is marked in the same way as git marks conflicts in the files.
The use case is to merge 2 similar configuration files, inspect the merged file, and have visual hints that make all differences between the files evident, so that it becomes easy to decide which one to pick.
I have already tried with diff and diff3, but I could only get the differences (with diff) or a fully merged file where only conflicts are marked (with diff3 -m -A file1 file1 file2). I've used git diff and related tools, but they all merge unconflicting changes, instead of marking them as differences.
The running environment would be a bash shell script, therefore it'd be nice to reach the desired output with common linux tools.
Example:
Contents of file1:
environment:
  base_branch: master
  branch: this_is_the_same_for_both_files
Contents of file2:
environment:
  base_branch: a_different_base_branch
  branch: this_is_the_same_for_both_files
  a_new_key: the_new_key_value
Desired output:
environment:
<<<<< file1
  base_branch: master
=====
  base_branch: a_different_base_branch
>>>>> file2
  branch: this_is_the_same_for_both_files
<<<<< file1
=====
  a_new_key: the_new_key_value
>>>>> file2
Following the suggestions from the comments, I came up with this one-liner, which seems to solve my issue.
I would have liked to use constants for the markers in the sed substitutions, but apparently it's not straightforward to use variables containing \n with sed on macOS.
This code seems to work correctly even in docker alpine:3.8 using diffutils.
Other options (brew gnu-sed and similar) might not be easily portable.
diff -D AAAAAAA "${path}" <(echo -n "$decrypted") | \
    sed -e $'s/#ifndef AAAAAAA/<<<<<<< file-on-disk/g' | \
    sed -e $'s/#endif \/\* ! AAAAAAA \*\//=======\\\n>>>>>>> file-from-secret/g' | \
    sed -e $'s/#else \/\* AAAAAAA \*\//=======/g' | \
    sed -e $'s/#ifdef AAAAAAA/<<<<<<< file-on-disk\\\n=======/g' | \
    sed -e $'s/#endif \/\* AAAAAAA \*\//>>>>>>> file-from-secret/g'
Explanation:
diff -D AAAAAAA "${path}" <(echo -n "$decrypted"): outputs a merged text with '#ifdef NAME' diffs; AAAAAAA is used as the marker name. Diff uses #ifndef AAAAAAA and #endif /* ! AAAAAAA */ to surround text only present in the first file, and #ifdef AAAAAAA and #endif /* AAAAAAA */ to surround text only present in the second. Note the n in the first #ifndef and the ! in the first #endif comment. As all markers are different, it becomes easy to perform substitutions.
sed -e $'s/#endif \/\* ! AAAAAAA \*\//=======\\\n>>>>>>> file-from-secret/g': substitutes the marker with
=======
>>>>>>> file-from-secret
As there is a \n, the substitution string is enclosed in $'', which makes the newline character be interpreted correctly. However, the \ needs to be double-escaped.
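The same approach generalizes to any two files on disk. A minimal sketch wrapping it in a function (conflict_merge and the file1/file2 labels are illustrative names, not part of the original one-liner):
conflict_merge() {
    # $1 and $2 are the two files to merge; conflict-marked output goes to stdout.
    diff -D AAAAAAA "$1" "$2" \
        | sed -e 's/#ifndef AAAAAAA/<<<<<<< file1/' \
              -e $'s/#endif \/\* ! AAAAAAA \*\//=======\\\n>>>>>>> file2/' \
              -e 's/#else \/\* AAAAAAA \*\//=======/' \
              -e $'s/#ifdef AAAAAAA/<<<<<<< file1\\\n=======/' \
              -e 's/#endif \/\* AAAAAAA \*\//>>>>>>> file2/'
}
Usage: conflict_merge file1 file2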

Bash pipes and Shell expansions

I've changed my data source in a bash pipe from cat ${file} to cat file_${part_number}, because preprocessing was causing ${file} to be truncated at 2GB and splitting the output eliminated the preprocessing issues. However, while testing this change, I was unable to work out how to get Bash to keep acting the same for some basic operations I was using to test the pipeline.
My original pipeline is:
cat giantfile.json | jq -c '.' | python postprocessor.py
With the original pipeline, if I'm testing changes to postprocessor.py or the preprocessor and I want to just test my changes with a couple of items from giantfile.json, I can just use head and tail, like so:
cat giantfile.json | head -n 2 - | jq -c '.' | python postprocessor.py
cat giantfile.json | tail -n 3 - | jq -c '.' | python postprocessor.py
The new pipeline that fixes the issues with the preprocessor is:
cat file_*.json | jq -c '.' | python postprocessor.py
This works fine, since every file gets output eventually. However, I don't want to wait 5-10 minutes for each test, so I tried to test with the first 2 lines of input using head:
cat file_*.json | head -n 2 - | jq -c '.' | python postprocessor.py
Bash sits there working far longer than it should, so I try:
cat file_*.json | head -n 2 - | jq -c '.'
And my problem is clear: Bash is outputting the content of all the files as if head was not even there, even though each file now has 1 line of data in it. I've never needed to do this with bash before and I'm flummoxed.
Why does Bash behave this way, and how do I rewrite my little bash command pipeline to work the way it used to, allowing me to select the first/last n lines of data to work with for testing?
My guess is that when you split the json up into individual files, you managed to remove the newline character from the end of each file, with the consequence that the concatenated stream (cat file_*.json) is really only one line in total, because cat will not insert newlines between the files it is concatenating.
If the files really were one line each with a terminating newline character, piping through head -n 2 would work fine.
You can check this hypothesis with wc, since that utility counts newline characters rather than lines. If it reports that the files have 0 lines, then you need to fix your preprocessing.
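A concrete way to run that check, plus one way to repair the files, as a sketch (the tail -c 1 test treats any non-newline final byte as a missing terminator):
# wc -l counts newline characters; a file reported as 0 lines has no final newline.
wc -l file_*.json
# Append a newline to every file whose last byte is not already one.
for f in file_*.json; do
    [ -n "$(tail -c 1 "$f")" ] && echo >> "$f"
done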

How to quickly check a .gz file without unzip? [duplicate]

How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz | head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
    Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
    Decompress.
On some systems (e.g., Mac), you need to use gzcat.
On a Mac you need to use < with zcat:
zcat < CONN.20111109.0057.gz | head
If a contiguous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new subFile. For the sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p' > subFile
which extracts the 1st line, jumps over 4 lines, picks the next line, and so on. (The first~step address form is a GNU sed extension.)
If you want to use zcat, this will show the first 10 lines:
zcat your_filename.gz | head
Let's say you want the first 16 lines:
zcat your_filename.gz | head -n 16
This awk snippet will let you show not only the first few lines, but any range you specify. It will also add line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (number of records seen so far) which is usually equivalent to a line number. The from and to variables are picked up from the command line via the -v options.
NR >= from {
    print NR, $0
    if (NR >= to)
        exit 1
}
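Wrapped as a small shell function for repeated use (a sketch; zrange is an illustrative name):
# Print numbered lines from..to of a gzipped file.
zrange() {
    gunzip -c "$1" | awk -v from="$2" -v to="$3" \
        'NR >= from { print NR, $0; if (NR >= to) exit }'
}
# usage: zrange file.gz 10 20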

Recursive directory-diff accepting external diff

Consider the task of finding what has changed in two projects which were forked.
diff -r gets me close: it is capable of finding which files are missing in each target folder, and points out files which have changed by presenting their diffs.
I want to use a custom diff utility. I would like to not have to implement in my diff utility the recursive directory walking logic.
So, I basically just want a program that does what diff -r does, but which does not actually go ahead and run the diffs.
Does such a thing exist?
I figured the output of diff -r is already plenty structured enough for me to "get clever" with it. Here's the gist of it:
diff -r proj1/src proj2/src | grep 'diff -r' | cut -d ' ' -f 3,4 | xargs -n 2 sift
where sift is my little command-line char-based diff util, which runs circles around diff's output, and I am using diff (GNU diffutils) 2.8.1.
I am open to more elegant solutions as well!
Edit: Thanks @janos, the -q option makes it pretty optimal!
One last thing to mention is that this can be made quite powerful by piping into the opendiff program on a Mac's command line (specifying the corresponding file in the desired dir as the target, which of course is already inside a Git repo, right?) to do the manual merging nice and quickly.
In fact, setting up opendiff to be used by Git when it needs a human merge is probably the way to go.
It's just that I still have not encountered very many merge-conflict situations within the same code repo; it is mainly when forking repos (and having separate repos for divergent projects that contain shared code) that I need to do this kind of merge, to manually bring my "primary" projects up to date with the changes made in the trenches.
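For the Git side of that, the setup is small; opendiff is one of the merge-tool backends Git already knows about (a sketch):
# Use opendiff (FileMerge) whenever Git needs a human merge on macOS.
git config --global merge.tool opendiff
# Then, during a conflicted merge:
git mergetool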
I think this is more elegant:
diff -rq dir1 dir2 | sed -ne 's/^Files \(.*\) and \(.*\) differ$/\1\n\2/p' | xargs -n 2 sift
The main trick is using -q flag, which will print the differences in brief format, for example:
Files dir1/x and dir2/x differ
Only in dir1: path/to/file
Only in dir2: path/to/another
And then how you parse the output is your matter of taste.
Finally, to correctly handle spaces in the file names:
diff -rq dir1 dir2 | sed -ne 's/^Files \(.*\) and \(.*\) differ$/sift "\1" "\2"/p' | sh
It probably makes sense to wrap this in a function:
xdiff() {
    diff=$1; shift
    dir1=$1; shift
    dir2=$1; shift
    diff -rq "$dir1" "$dir2" | sed -ne 's/^Files \(.*\) and \(.*\) differ$/'"$diff"' "\1" "\2"/p' | sh
}
So you can call it like this:
xdiff sift dir1 dir2

bash: shortest way to get n-th column of output

Let's say that during your workday you repeatedly encounter the following form of columnized output from some command in bash (in my case from executing svn st in my Rails working directory):
?       changes.patch
M       app/models/superman.rb
A       app/models/superwoman.rb
In order to work with the output of your command - in this case the filenames - some sort of parsing is required so that the second column can be used as input for the next command.
What I've been doing is to use awk to get at the second column, e.g. when I want to remove all files (not that that's a typical use case :), I would do:
svn st | awk '{print $2}' | xargs rm
Since I type this a lot, a natural question is: is there a shorter (thus cooler) way of accomplishing this in bash?
NOTE:
What I am asking is essentially a shell command question even though my concrete example is on my svn workflow. If you feel that workflow is silly and suggest an alternative approach, I probably won't vote you down, but others might, since the question here is really how to get the n-th column command output in bash, in the shortest manner possible. Thanks :)
You can use cut to access the second field:
cut -f2
Edit:
Sorry, didn't realise that SVN doesn't use tabs in its output, so that's a bit useless. You can tailor cut to the output but it's a bit fragile - something like cut -c 10- would work, but the exact value will depend on your setup.
Another option is something like: sed 's/.\s\+//'
To accomplish the same thing as:
svn st | awk '{print $2}' | xargs rm
using only bash you can use:
svn st | while read a b; do rm "$b"; done
Granted, it's not shorter, but it's a bit more efficient and it handles whitespace in your filenames correctly.
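If filenames might contain backslashes or start with a dash, a slightly hardened variant of the same loop (a sketch):
# read -r keeps backslashes literal; -- stops rm treating names as options.
svn st | while read -r a b; do rm -- "$b"; done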
I found myself in the same situation and ended up adding these aliases to my .profile file:
alias c1="awk '{print \$1}'"
alias c2="awk '{print \$2}'"
alias c3="awk '{print \$3}'"
alias c4="awk '{print \$4}'"
alias c5="awk '{print \$5}'"
alias c6="awk '{print \$6}'"
alias c7="awk '{print \$7}'"
alias c8="awk '{print \$8}'"
alias c9="awk '{print \$9}'"
Which allows me to write things like this:
svn st | c2 | xargs rm
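A single function can stand in for all nine aliases; a minimal sketch (the name c is illustrative):
# Print the n-th whitespace-separated column of stdin.
c() { awk -v n="$1" '{ print $n }'; }
# usage: svn st | c 2 | xargs rm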
Try the zsh. It supports global aliases, so you can define X in your .zshrc to be
alias -g X="| cut -d' ' -f2"
then you can do:
cat file X
You can take it one step further and define an alias for the nth column:
alias -g X2="| cut -d' ' -f2"
alias -g X1="| cut -d' ' -f1"
alias -g X3="| cut -d' ' -f3"
so that, for example, cat file X2 will output the 2nd column of file "file". You can do this for grep output or less output, too. This is very handy and a killer feature of the zsh.
You can go one step further and define D to be:
alias -g D="|xargs rm"
Now you can type:
cat file X1 D
to delete all files mentioned in the first column of file "file".
If you know bash, the zsh is not much of a change except for some new features.
HTH Chris
Because you seem to be unfamiliar with scripts, here is an example.
#!/bin/sh
# usage: svn st | x 2 | xargs rm
col=$1
shift
awk -v col="$col" '{print $col}' "${@--}"
If you save this in ~/bin/x and make sure ~/bin is in your PATH (now that is something you can and should put in your .bashrc), you have the shortest possible command for generally extracting column n: x n.
The script should do proper error checking and bail if invoked with a non-numeric argument or the incorrect number of arguments, etc; but expanding on this bare-bones essential version will be in unit 102.
Maybe you will want to extend the script to allow a different column delimiter. Awk by default parses input into fields on whitespace; to use a different delimiter, use -F ':' where : is the new delimiter. Implementing this as an option to the script makes it slightly longer, so I'm leaving that as an exercise for the reader.
Usage
Given a file file:
1 2 3
4 5 6
You can either pass it via stdin (using a useless cat merely as a placeholder for something more useful):
$ cat file | sh script.sh 2
2
5
Or provide it as an argument to the script:
$ sh script.sh 2 file
2
5
Here, sh script.sh is assuming that the script is saved as script.sh in the current directory; if you save it with a more useful name somewhere in your PATH and mark it executable, as in the instructions above, obviously use the useful name instead (and no sh).
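As for the delimiter option left as an exercise above, one possible shape is the following sketch (not the author's version; the fallback works because awk treats -F ' ' as its default whitespace splitting):
#!/bin/sh
# usage: x [-F delim] col [file...]
if [ "$1" = -F ]; then
    delim=$2
    shift 2
fi
col=$1
shift
awk -F "${delim:- }" -v col="$col" '{print $col}' "${@--}"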
It looks like you already have a solution. To make things easier, why not just put your command in a bash script (with a short name) and just run that instead of typing out that 'long' command every time?
If you are ok with manually selecting the column, you could be very fast using pick:
svn st | pick | xargs rm
Just go to any cell of the 2nd column, press c and then hit enter
Note that the file path does not have to be in the second column of svn st output. For example, if you modify a file and also modify its property, the path will be in the third column.
See possible output examples in:
svn help st
Example output:
 M     wc/bar.c
A  +   wc/qax.c
I suggest cutting the first 8 characters with:
svn st | cut -c8- | while read FILE; do echo whatever with "$FILE"; done
If you want to be 100% sure, and deal with fancy filenames (with whitespace at the end, for example), you need to parse the xml output:
svn st --xml | grep -o 'path=".*"' | sed 's/^path="//; s/"$//'
Of course you may want to use some real XML parser instead of grep/sed.
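A sketch of that, using xmllint from libxml2 (assumed installed) to do the actual parsing; the trailing grep/tr merely peel the quotes off the attribute values xmllint prints:
svn st --xml \
    | xmllint --xpath '//entry/@path' - \
    | grep -o '"[^"]*"' | tr -d '"'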
