Merge 2 text files with differences marked as conflicts - bash

I'd like to see the complete output of 2 merged files where each difference is marked in the same way as git marks conflicts in the files.
The use case is to merge 2 similar configuration files, inspect the merged file, and have visual hints that make all differences between the files evident, so that it becomes easy to decide which one to pick.
I have already tried with diff and diff3, but I could only get the differences (with diff) or a fully merged file where only conflicts are marked (with diff3 -m -A file1 file1 file2). I've used git diff and related tools, but they all merge non-conflicting changes instead of marking them as differences.
The running environment would be a bash shell script, so it would be nice to achieve the desired output with common Linux tools.
Example:
Contents of file1:
environment:
base_branch: master
branch: this_is_the_same_for_both_files
Contents of file2:
environment:
base_branch: a_different_base_branch
branch: this_is_the_same_for_both_files
a_new_key: the_new_key_value
Desired output:
environment:
<<<<< file1
base_branch: master
=====
base_branch: a_different_base_branch
>>>>> file2
branch: this_is_the_same_for_both_files
<<<<< file1
=====
a_new_key: the_new_key_value
>>>>> file2

Following the suggestions from the comments, I came up with this one-liner, which seems to solve my issue.
I would have liked to use constants for the markers in the sed substitutions, but apparently it's not straightforward to use variables containing \n with sed on macOS.
This code seems to work correctly even in docker alpine:3.8 using diffutils.
Other options (brew gnu-sed and similar) might not be easily portable.
diff -D AAAAAAA "${path}" <(echo -n "$decrypted") | \
sed -e $'s/#ifndef AAAAAAA/<<<<<<< file-on-disk/g' | \
sed -e $'s/#endif \/\* ! AAAAAAA \*\//=======\\\n>>>>>>> file-from-secret/g' | \
sed -e $'s/#else \/\* AAAAAAA \*\//=======/g' | \
sed -e $'s/#ifdef AAAAAAA/<<<<<<< file-on-disk\\\n=======/g' | \
sed -e $'s/#endif \/\* AAAAAAA \*\//>>>>>>> file-from-secret/g';
Explanation:
diff -D AAAAAAA "${path}" <(echo -n "$decrypted"): outputs a merged text with '#ifdef NAME' diffs. I'm using AAAAAAA as the marker name. Diff uses #ifndef AAAAAAA and #endif /* ! AAAAAAA */ to surround text only present in the first file, and #ifdef AAAAAAA and #endif /* AAAAAAA */ to surround text only present in the second. Note the n in the first #ifndef and the ! in the first #endif comment. As all markers are different, it becomes easy to perform substitutions.
sed -e $'s/#endif \/\* ! AAAAAAA \*\//=======\\\n>>>>>>> file-from-secret/g': substitutes the marker with
=======
>>>>>>> file-from-secret
As there is a \n, the substitution string is enclosed in $'...', which makes the shell interpret the newline character correctly. However, the backslash needs to be double-escaped.
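For reference, the intermediate diff -D output for the example files above looks roughly like this (a sketch; the exact layout may vary slightly between diffutils versions):
$ diff -D AAAAAAA file1 file2
environment:
#ifndef AAAAAAA
base_branch: master
#else /* AAAAAAA */
base_branch: a_different_base_branch
#endif /* AAAAAAA */
branch: this_is_the_same_for_both_files
#ifdef AAAAAAA
a_new_key: the_new_key_value
#endif /* AAAAAAA */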

Related

How can I use Git to identify function changes across different revisions of a repository?

I have a repository with a bunch of C files. Given the SHA hashes of two commits,
<commit-sha-1> and <commit-sha-2>,
I'd like to write a script (probably bash/ruby/python) that detects which functions in the C files in the repository have changed across these two commits.
I'm currently looking at the documentation for git log, git commit and git diff. If anyone has done something similar before, could you give me some pointers on where to start or how to proceed?
That doesn't look too good but you could combine git with your
favorite tagging system such as GNU global to achieve that. For
example:
#!/usr/bin/env sh
# List the functions defined in main.c (via GNU global) and report
# whether each one changed between HEAD^ and HEAD.
global -f main.c | awk '{print $NF}' | cut -d '(' -f1 | while read -r i
do
    if [ "$(git log -L:"$i":main.c HEAD^..HEAD | wc -l)" -gt 0 ]
    then
        printf "%s() changed\n" "$i"
    else
        printf "%s() did not change\n" "$i"
    fi
done
First, you need to create a database of functions in your project:
$ gtags .
Then run the above script to find functions in main.c that were
modified since the last commit. The script could of course be more
flexible; for example, it could handle all *.c files changed between 2 commits, as reported by git diff --stat (see the sketch after the git log excerpt below).
Inside the script we use -L option of git log:
-L <start>,<end>:<file>, -L :<funcname>:<file>
Trace the evolution of the line range given by
"<start>,<end>" (or the function name regex <funcname>)
within the <file>. You may not give any pathspec
limiters. This is currently limited to a walk starting from
a single revision, i.e., you may only give zero or one
positive revision arguments. You can specify this option
more than once.
See this question.
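As a rough, untested sketch of that extension (COMMIT1 and COMMIT2 are placeholder revisions, and the gtags database is assumed to already exist):
#!/usr/bin/env sh
# Hypothetical extension: report changed functions in every *.c file
# that differs between two commits (COMMIT1 and COMMIT2 are placeholders).
git diff --name-only "$COMMIT1" "$COMMIT2" -- '*.c' | while read -r f
do
    global -f "$f" | awk '{print $NF}' | cut -d '(' -f1 | while read -r i
    do
        # git log -L errors out if the function is not found at these
        # revisions; stderr is silenced so the loop simply skips it.
        if [ "$(git log -L:"$i":"$f" "$COMMIT1".."$COMMIT2" 2>/dev/null | wc -l)" -gt 0 ]
        then
            printf "%s: %s() changed\n" "$f" "$i"
        fi
    done
done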
Bash script:
#!/usr/bin/env bash
git diff | \
grep -E '^(@@)' | \
grep '(' | \
sed 's/@@.*@@//' | \
sed 's/(.*//' | \
sed 's/\*//' | \
awk '{print $NF}' | \
uniq
Explanation:
1: Get diff
2: Get only lines with hunk headers; if the 'optional section heading' of a hunk header exists, it will be the function definition of a modified function
3: Pick only hunk headers containing open parentheses, as they will contain function definitions
4: Get rid of the '@@ [old-file-range] [new-file-range] @@' sections in the lines
5: Get rid of everything after opening parentheses
6: Get rid of '*' from pointers
7: [See 'awk']: Print the last field (i.e. column) of the records (i.e. lines).
8: Get rid of duplicate names.
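For instance, tracing a single made-up hunk header through steps 4-7:
echo '@@ -10,7 +10,8 @@ int parse_config(struct config *cfg)' | \
sed 's/@@.*@@//' | \
sed 's/(.*//' | \
sed 's/\*//' | \
awk '{print $NF}'
# prints: parse_config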

Find words from file a in file b and output the missing word matches from file a

I have two files that I am trying to run a find/grep/fgrep on. I have been trying several different commands to try to get the following results:
File A
hostnamea
hostnameb
hostnamec
hostnamed
hostnamee
hostnamef
File B
hostnamea-20170802
hostnameb-20170802
hostnamec-20170802.xml # some files have extensions
020214-_hostnamed-20170208.tar # some files have different extensions and have different date structure
HOSTNAMEF-20170802
(About the files: most have the date format 20170802; some have a different date format.)
FileA is my control file. I want to search fileb for the whole words hostnamea-f, match them in fileb, and output the non-matches from filea to the terminal, to be used in a shell script.
For this example I made it so hostnamee is not within fileb. I want to run an fgrep/grep/awk (whatever works for this) and output only the missing hostnamee from filea.
I can get this to work, but it does not do quite what I need, and if I swap it around I get nothing.
user#host:/netops/backups/scripts$ fgrep -f filea fileb -i -w -o
hostnamea
hostnameb
hostnamec
hostnamed
HOSTNAMEF
Cool - I get the matches in File-B, but what if I try to reverse it?
host#host:/netops/backups/scripts$ fgrep -f fileb filea -i -w -o
host#host:/netops/backups/scripts$
I have tried several different commands but cannot seem to get it right. I am using -i to ignore case, -w to match whole words, and -o to output only the matched parts of the lines.
I have found some sort of workaround, but was hoping there was a more elegant way of doing this with a single command, whether awk, egrep, fgrep or something else.
user#host:/netops/backups/scripts$ fgrep -f filea fileb -i -w -o > test
user#host:/netops/backups/scripts$ diff filea test -i
5d4
< hostnamee
You can:
- look for "only-matches", i.e. -o, of a in b
- use the result as patterns to look for in a, i.e. -f-
- only list what does not match, i.e. -v
Code:
grep -of a.txt b.txt | grep -f- -v a.txt
Output:
hostnamee
hostnamef
Case-insensitive code:
grep -oif a.txt b.txt | grep -f- -vi a.txt
Output:
hostnamee
Edit:
Responding to the interesting input by Ed Morton, I have made the sample input somewhat "nastier" to test robustness against substring matches and regex-active characters (e.g. "."):
a.txt:
hostnamea
hostnameb
hostnamec
hostnamed
hostnamee
hostnamef
ostname
lilihostnamec
hos.namea
b.txt:
hostnamea-20170802
hostnameb-20170802
hostnamec-20170802.xml # some files have extensions
020214-_hostnamed-20170208.tar # some files have different extensions and have different date structure
HOSTNAMEF-20170802
lalahostnamef
hostnameab
stnam
This makes things more interesting.
I provide this case insensitive solution:
grep -Fwoif a.txt b.txt | grep -f- -Fviw a.txt
additional -F, meaning "no regex tricks"
additional -w, meaning "whole word matching"
I find the output quite satisfying, assuming that the following change of the "requirements" is accepted:
Hostnames in "a" only match parts of "b" if all adjoining _ (and other "word characters") are always considered part of the hostname.
(Note the additional output line of hostnamed, which is now not found in "b" anymore, because in "b", it is preceded by an _.)
To match possible occurrences of valid hostnames which are preceded/followed by other word characters, the list in "a" would have to explicitly name those variations. E.g. "_hostnamed" would have to be listed in order to not have "hostnamed" in the output.
(With a little luck this might even be acceptable for the OP; in that case this extended solution is recommended, for robustness against "EdMortonish traps". Ed, please consider this a compliment on your interesting input; it is not meant in any way negatively.)
Output for "nasty" a and b:
hostnamed
hostnamee
ostname
lilihostnamec
hos.namea
I am not sure whether the changed handling of _ still matches the OP's goal (if not, within the OP's scope the first case-insensitive solution is satisfying).
_ is one of the "word characters" used for whole-word-only matching with -w. More detailed regex control at some point gets beyond grep; as Ed Morton has mentioned, using awk or perl (or sed, for the kind of masochistic brain exercise I enjoy) is then appropriate.
Tested with GNU grep 2.5.4 on Windows.
The files a.txt and b.txt have your content; I made sure, however, that they have UNIX line endings. That is important (at least for a, possibly not for b).
$ cat tst.awk
NR==FNR {
    gsub(/^[^_]+_|-[^-]+$/,"")
    hostnames[tolower($0)]
    next
}
!(tolower($0) in hostnames)
$ awk -f tst.awk fileB fileA
hostnamee
$ awk -f tst.awk b.txt a.txt
hostnamee
ostname
lilihostnamec
hos.namea
The only assumptions in the above are that your host names don't contain underscores and that anything after the last - on the line is a date. If that's not the case and there's a better definition of what the optional hostname prefix and suffix strings in fileB can be, then just tweak the gsub() to use an appropriate regexp.
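For instance (purely illustrative, assuming the prefix is always digits followed by -_ and the suffix is always a dash, an 8-digit date and an optional extension), the first block of tst.awk could be tightened to:
NR==FNR {
    # hypothetical stricter cleanup: strip a "digits-_" prefix and a
    # "-YYYYMMDD" suffix with an optional ".ext"
    gsub(/^[0-9]+-_|-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]([.][A-Za-z]+)?$/,"")
    hostnames[tolower($0)]
    next
}
with the rest of the script unchanged.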

Bash pipes and Shell expansions

I've changed my data source in a bash pipe from cat ${file} to cat file_${part_number} because preprocessing was causing ${file} to be truncated at 2GB; splitting the output eliminated the preprocessing issues. However, while testing this change, I was unable to work out how to get Bash to keep behaving the same for some basic operations I was using to test the pipeline.
My original pipeline is:
cat giantfile.json | jq -c '.' | python postprocessor.py
With the original pipeline, if I'm testing changes to postprocessor.py or the preprocessor and I want to just test my changes with a couple of items from giantfile.json I can just use head and tail. Like so:
cat giantfile.json | head -n 2 - | jq -c '.' | python postprocessor.py
cat giantfile.json | tail -n 3 - | jq -c '.' | python postprocessor.py
The new pipeline that fixes the preprocessor issues is:
cat file_*.json | jq -c '.' | python postprocessor.py
This works fine, since every file gets output eventually. However, I don't want to wait 5-10 minutes for each test. I tried to test with the first 2 lines of input with head.
cat file_*.json | head -n 2 - | jq -c '.' | python postprocessor.py
Bash sits there working far longer than it should, so I try:
cat file_*.json | head -n 2 - | jq -c '.'
And my problem is clear. Bash is outputting the content of all the files as if head was not even there, because each file now has 1 line of data in it. I've never needed to do this with bash before and I'm flummoxed.
Why does Bash behave this way, and how do I rewrite my little bash command pipeline to work the way it used to, allowing me to select the first/last n lines of data to work with for testing?
My guess is that when you split the json up into individual files, you managed to remove the newline character from the end of each line, with the consequence that the concatenated output (cat file_*.json) is really only one line in total, because cat will not insert newlines between the files it is concatenating.
If the files were really one line each with a terminating newline character, piping through head -n 2 should work fine.
You can check this hypothesis with wc, since that utility counts newline characters rather than lines. If it reports that the files have 0 lines, then you need to fix your preprocessing.
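A quick way to check, and, if that is indeed the cause, to work around it (the awk 1 idiom prints every input record with a trailing newline, so a missing final newline in each part no longer matters):
wc -l file_*.json    # a per-file count of 0 would confirm the missing newlines
awk 1 file_*.json | head -n 2 | jq -c '.' | python postprocessor.py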

sed find/replace lines with whitespaces

I'm writing a shell script that will find and replace a line to disable password caching in nscd. The problem is, there is a ton of white space before and in between the parameters, and I can't seem to find a way with my limited knowledge of regex to ignore the spaces and change the no to yes.
Here is the line as it appears in the config file. Just in case it doesn't show properly, there are 8 spaces before enable-cache, 12 spaces after and 10 spaces before no.
        enable-cache            passwd          no
I basically need to change the no to a yes for that line only. Anyone have any thoughts?
Thanks
greg
To show a complete usage example (reading from input-file and writing the result back to it):
$ sed -r -e \
's/^([[:space:]]*enable-cache[[:space:]]+passwd[[:space:]]+)no([[:space:]]*)$/\1yes\2/' \
<input-file >output-file \
&& mv output-file input-file
To do this in-place, you'd want to use ed or ex (both, unlike sed -i, being POSIX-specified tools):
$ printf '%s\n' \
'%s/\([[:space:]]*enable-cache[[:space:]]\+passwd[[:space:]]\+\)no[[:space:]]*/\1yes/' \
'wq' \
| ex file-to-modify -s -
Match and capture the part starting from enable-cache up to no, and also match no. Replace the whole matched part with the captured part plus yes:
sed 's/\(enable-cache[ ]*passwd[ ]*\)no/\1yes/' input
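A quick check on a line shaped like the one in the question (with fewer spaces shown here):
$ printf '    enable-cache    passwd    no\n' | \
    sed 's/\(enable-cache[ ]*passwd[ ]*\)no/\1yes/'
    enable-cache    passwd    yes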

sed not replacing lines

I have a file with 1 line of text, called output. I have write access to the file. I can change it from an editor with no problems.
$ cat output
1
$ ls -l o*
-rw-rw-r-- 1 jbk jbk 2 Jan 27 18:44 output
What I want to do is replace the first (and only) line in this file with a new value, either a 1 or a 0. It seems to me that sed should be perfect for this:
$ sed '1 c\ 0' output
0
$ cat output
1
But it never changes the file. I've tried it spread over 2 lines at the backslash, and with double quotes, but I cannot get it to put a 0 (or anything else) in the first line.
Sed operates on streams and prints its output to standard out.
It does not modify the input file.
It's typically used like this when you want to capture its output in a file:
#
# replace every occurrence of foo with bar in input-file
#
sed 's/foo/bar/g' input-file > output-file
The above command invokes sed on input-file and redirects the output to a new file named output-file.
Depending on your platform, you might be able to use sed's -i option to modify files in place:
sed -i.bak 's/foo/bar/g' input-file
NOTE:
Not all versions of sed support -i.
Also, different versions of sed implement -i differently.
On some platforms you MUST specify a backup extension (on others you don't have to).
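For example (behaviour as documented for GNU and BSD sed; check the man page of your local sed):
sed -i 's/foo/bar/g' input-file        # GNU sed: backup suffix is optional
sed -i.bak 's/foo/bar/g' input-file    # GNU or BSD sed: keep input-file.bak as a backup
sed -i '' 's/foo/bar/g' input-file     # BSD/macOS sed: -i requires an argument; '' means no backup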
Since this is an incredibly simple file, sed may actually be overkill. It sounds like you want the file to have exactly one character: a '0' or a '1'.
It may make better sense in this case to just overwrite the file rather than to edit it, e.g.:
echo "1" > output
or
echo "0" > output
