How to show merged differences between two files? - bash

How can I get only the differing characters between two files?
For example,
file1:
aaa;bbb;ccc
123;456;789
a1a;b1b;c1c
file2:
aAa;bbb;ccc
123;406;789
a1a;b1b;c5c
After the diff I should get only this string of differences taken from the second file: A05

diff -y --suppress-common-lines <(fold -w 1 file1) <(fold -w 1 file2) |
sed 's/.*\(.\)$/\1/' | paste -s -d '' -
This uses process substitution with fold to turn each file into a column of characters that's one character wide and then compares them with diff.
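For instance, the first few lines that fold produces for file1 look like this:
$ fold -w 1 file1 | head -5
a
a
a
;
b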
The -y option prints lines next to each other, and --suppress-common-lines skips lines that are the same between both files. Until here, the output looks like this:
$ diff -y --suppress-common-lines <(fold -w 1 file1) <(fold -w 1 file2)
a | A
5 | 0
1 | 5
We're only interested in the last character of each line. We use sed to discard the rest:
$ diff -y --suppress-common-lines <(fold -w 1 file1) <(fold -w 1 file2) |
> sed 's/.*\(.\)$/\1/'
A
0
5
To get these into a single line, we pipe to paste with the -s option (serial) and the empty string as the delimiter (-d ''). The dash tells paste to read from standard input.
$ diff -y --suppress-common-lines <(fold -w 1 file1) <(fold -w 1 file2) |
> sed 's/.*\(.\)$/\1/' | paste -s -d '' -
A05
An alternative, if you have the GNU diffutils at your disposal, is cmp:
$ cmp -lb file1 file2 | awk '{print $5}' | tr -d '\n'
A05
cmp compares files byte by byte. The -l option ("verbose") makes it print all the differences, not just the first one; the -b option makes it add the ASCII interpretation of the differing bytes:
$ cmp -lb file1 file2
2 141 a 101 A
18 65 5 60 0
34 61 1 65 5
The awk command reduces this output to the fifth column, and tr removes the newlines.
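For instance, running just the cmp and awk stages shows the intermediate result before tr joins the lines:
$ cmp -lb file1 file2 | awk '{print $5}'
A
0
5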

For the example given, you could compare the files character by character and, wherever they differ, print the character from the second file. Here's one way to do that:
paste <(fold -w1 file1) <(fold -w1 file2) |
while read c1 c2; do [[ $c1 = $c2 ]] || printf '%s' "$c2"; done
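If it helps to see what the loop is comparing, the paste output for the example starts like this (two tab-separated columns, one character from each file per line):
$ paste <(fold -w1 file1) <(fold -w1 file2) | head -4
a	a
a	A
a	a
;	;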
For the given example, this will print A05.

Related

Split pipe into two and paste the results together?

I want to pipe the output of the command into two commands and paste the results together. I found this answer and similar ones suggesting using tee but I'm not sure how to make it work as I'd like it to.
My problem (simplified):
Say that I have a myfile.txt with keys and values, e.g.
key1 /path/to/file1
key2 /path/to/file2
What I am doing right now is
paste \
<( cat myfile.txt | cut -f1 ) \
<( cat myfile.txt | cut -f2 | xargs wc -l )
and it produces
key1 23
key2 42
The problem is that cat myfile.txt is repeated here (in the real problem it's a heavier operation). Instead, I'd like to do something like
cat myfile.txt | tee \
<( cut -f1 ) \
<( cut -f2 | xargs wc -l ) \
| paste
But it doesn't produce the expected output. Is it possible to do something similar to the above with pipes and standard command-line tools?
This doesn't answer your question about pipes, but you can use AWK to solve your problem:
$ printf %s\\n 1 2 3 > file1.txt
$ printf %s\\n 1 2 3 4 5 > file2.txt
$ cat > myfile.txt <<EOF
key1 file1.txt
key2 file2.txt
EOF
$ cat myfile.txt | awk '{ ("wc -l " $2) | getline size; sub(/ .+$/,"",size); print $1, size }'
key1 3
key2 5
On each line we first run wc -l on $2 and save the result into a variable. Not sure about yours, but on my system wc -l includes the filename in the output, so we strip it with sub() to match your example output. And finally, we print the $1 field (key) and the size we got from the wc -l command.
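To illustrate on a GNU system, where wc -l prints the count followed by the filename, the value of size before and after the sub() looks like this:
$ awk 'BEGIN { ("wc -l file1.txt") | getline size; print size; sub(/ .+$/, "", size); print size }'
3 file1.txt
3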
This can also be done with plain shell, now that I think about it:
cat myfile.txt | while read -r key value; do
    printf '%s %s\n' "$key" "$(wc -l "$value" | cut -d' ' -f1)"
done
Or more generally, by piping to two commands and using paste, therefore answering the question:
cat myfile.txt | while read -r line; do
    printf '%s\n' "$line" | cut -f1
    printf '%s\n' "$line" | cut -f2 | xargs wc -l | cut -d' ' -f1
done | paste - -
P.S. The use of cat here is useless, I know. But it's just a placeholder for the real command.

append output of each iteration of a loop to the same file in bash

I have 44 files (2 for each chromosome) divided into two types: .vcf and .filtered.vcf.
I would like to run wc -l on each of them in a loop and always append the output to the same file. However, I would like to have 3 columns in this file: chr[1-22], the wc -l of .vcf and the wc -l of .filtered.vcf.
I've been doing an independent wc -l for each file and pasting the two outputs for each chromosome together column-wise, but this is obviously not very efficient, because I'm generating a lot of unnecessary files. This is what I'm trying for the 22 pairs of files:
wc -l file1.vcf | cut -f 1 > out1.vcf
wc -l file1.filtered.vcf | cut -f 1 > out1.filtered.vcf
paste -d "\t" out1.vcf out1.filtered.vcf
I would like to have just one output file containing three columns:
Chromosome VCFCount FilteredVCFCount
chr1 out1 out1.filtered
chr2 out2 out2.filtered
Any help will be appreciated, thank you very much in advance :)
printf "%s\n" *.filtered.vcf |
cut -d. -f1 |
sort |
xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
Output a newline-separated list of the .filtered.vcf files in the directory
Remove the extension with cut (probably something along the lines of xargs -i basename {} .filtered.vcf would be safer)
Sort it (for nice sorted output!) (probably something like sort -t r -k2 -n would sort numerically and be even better)
xargs -n1 - for each file, execute the script with sh -c
printf "%s\t%s\t%s\n" - output with a custom format string ...
"$1" - the filename and...
"$(wc -l <"${1}.vcf")" - the count of the lines in the .vcf file and...
"$(wc -l <"${1}.filtered.vcf")" - the count of the lines in the .filtered.vcf file
Example:
> touch chr{1..3}{,.filtered}.vcf
> echo > chr1.filtered.vcf ; echo > chr2.vcf ;
> printf "%s\n" *.filtered.vcf |
> cut -d. -f1 |
> sort |
> xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.filtered.vcf")" "$(wc -l <"${1}.vcf")"' --
chr1 0 1
chr2 1 0
chr3 0 0
To have a nice-looking table with headers, use column:
> .... | column -N Chromosome,VCFCount,FilteredVCFCount -t -o ' '
Chromosome VCFCount FilteredVCFCount
chr1 0 1
chr2 1 0
chr3 0 0
Maybe try this.
printf 'Chromosome\tVCFCount\tFilteredVCFCount\n' >counts.txt   # remove this line if you don't want the header
for chr in chr*.vcf; do
    # the glob also matches the .filtered.vcf files; skip those here
    case $chr in *.filtered.vcf) continue ;; esac
    base=${chr%.vcf}
    awk -v base="$base" 'BEGIN { OFS="\t" }
        FNR==1 && n { p=n }
        { n=FNR }
        END { print base, p, n }' "$chr" "$base.filtered.vcf"
done >>counts.txt
The very simple Awk script just collects the highest line number for each file (so we basically reimplement wc -l) and prints the collected numbers in the desired format. FNR is the line number in the current input file; we simply save this, and copy the value to p to keep the saved value from the previous file in a separate variable when we switch to a new file (starting over at line number 1).
The shell parameter substitution ${variable%pattern} retrieves the value of variable with any suffix match on pattern removed. (There is also ${variable#pattern} to remove a prefix, and Bash has ## and %% to trim the longest pattern match instead of the shortest.)
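For example:
$ chr=chr1.vcf
$ echo "${chr%.vcf}"
chr1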
If efficiency is important, you could probably refactor all of the script into a single Awk script, but this way, all the pieces are simple and hopefully understandable.
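If you do want that single-Awk refactoring, a rough sketch could look something like this (untested on my side; it assumes none of the files are empty, and the for-in loop makes no promise about output order):
awk 'BEGIN { OFS = "\t"; print "Chromosome", "VCFCount", "FilteredVCFCount" }
{ lines[FILENAME] = FNR }   # remember the last line number seen in each file
END {
    for (f in lines)
        if (f !~ /\.filtered\.vcf$/) {   # report once per plain .vcf file
            base = f
            sub(/\.vcf$/, "", base)
            print base, lines[f], lines[base ".filtered.vcf"]
        }
}' chr*.vcf > counts.txt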

BASH Finding palindromes in a .txt file

I have been given a .txt file in which we have to find all the palindromes in the text (they must have at least 3 letters and can't consist of the same letter repeated, e.g. AAA).
The output should be displayed with the first column being the number of times each word appears and the second being the word itself, e.g.
123 kayak
3 bob
1 dad
#!/bin/bash
tmp='mktemp'
awk '{for(x=1;$x;++x)print $x}' "${1}" | tr -d [[:punct:]] | tr -s [:space:] | sed -e 's/#//g' -e 's/[0-9]*//g'| sed -r '/^.{,2}$/d' | sort | uniq -c -i > tmp1
This outputs the file as it should, ignoring case, words of fewer than 3 letters, punctuation and digits.
However I am now stumped on how to pull the palindromes out of this. I thought a temp file might be the way, I just don't know where to take it.
Any help or guidance is much appreciated.
# modify this to your needs; it should take your input on stdin, and return one word per
# line on stdout, in the same order if called more than once with the same input.
preprocess() {
    tr -d '[:punct:][:digit:]#' \
        | tr -s '[:space:]' \
        | tr '[:space:]' '\n' \
        | sed -E -e '/^(.)\1+$/d'
}
paste <(preprocess <"$1") <(preprocess <"$1" | rev) \
    | awk '$1 == $2 && (length($1) >= 3) { print $1 }' \
    | sort | uniq -c
The critical thing here is to paste together your input file with a stream that has each line from that input file reversed. This gives you two separate columns you can compare.
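For example, pasting a few words against their rev counterparts shows how palindromes line up with themselves:
$ paste <(printf '%s\n' kayak bob goat) <(printf '%s\n' kayak bob goat | rev)
kayak	kayak
bob	bob
goat	taog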

Count the number of whitespaces in a file

File test
musically us
challenged a goat that day
spartacus was his name
ba ba ba blacksheep
grep -oic "[\s]*" test
grep -oic "[ ]*" test
grep -oic "[\t]*" test
grep -oic "[\n]*" test
All give me 4, when I expect 11
grep --version -> grep (BSD grep) 2.5.1-FreeBSD
Running this on OSX Sierra 10.12
Repeating spaces should not be counted as one space.
If you are open to tricks and alternatives you might like this one:
$ awk '{print --NF}' <(tr -d '\n' <file)
11
The above solution counts the "whitespace" between words. As a result, for a string like 'fifteen--> <--spaces' awk will measure 1, like grep.
If you need to count actual single spaces you can use this:
$ awk -F"[ ]" '{print --NF}' <<<"fifteen--> <--spaces"
15
$ awk -F"[ ]" '{print --NF}' <<<" 2 4 6 8 10"
10
$ awk -F"[ ]" '{print --NF}' <(tr -d '\n' <file)
11
One step forward, to count single spaces and tabs:
$ awk -F"[ ]|\t" '{print --NF}' <(echo -e " 2 4 6 8 10\t12 14")
13
tr is generally better for this:
tr -d -C ' ' <file | wc -c
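For the test file from the question this gives the expected count:
$ tr -d -C ' ' <test | wc -c
11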
The grep solution relies on the output of grep -o being newline-separated; it will fail, for example, in the following kind of case where there are multiple adjacent spaces:
v='fifteen--> <--spaces'
echo "$v" | grep -o -E ' +' | wc -l
echo "$v" | tr -d -C ' ' | wc -c
grep only returns 1, when it should be 15.
EDIT: If you wanted to count multiple characters (e.g. TAB and SPACE) you could use:
tr -d -C $' \t' <<< $'one \t' | wc -c
Just use awk:
$ awk -v RS=' ' 'END{print NR-1}' file
11
or if you want to handle empty files gracefully:
$ awk -v RS=' ' 'END{print NR - (NR?1:0)}' /dev/null
0
The -c option counts the number of lines that match, not individual matches. Use grep -o and then pipe to wc -l, which will count the number of lines.
grep -o ' ' test | wc -l
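For the test file from the question:
$ grep -o ' ' test | wc -l
11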

Log the output of diff command to separate files in linux

I have 2 csv files in 2 different directories, and I am running a diff on them like this:
diff -b -r -w <dir-one>/AFB.csv <dir-two>/AFB.csv
I am getting the output as expected:
14c14
< image_collapse,,collapse,,,,,batchcriteria^M
---
> image_collapse1,,collapse1,,,,,batchcriteria^M
16a17
> image_refresh,,refresh,,,,,batchcriteria^M
My requirement is that the lines which have changed should go to a changed.log file, and lines that have been appended should go to append.log.
The output clearly shows that the "c" in 14c14 means that line has changed, and the "a" in 16a17 means a line has been appended. But how do I log them to different log files?
Edit: Same as original answer below but avoiding options not supported by diff on HP-UX. Use something like:
diff -b -r -w /tmp/one.txt /tmp/two.txt \
| sed -n -e '/^[0-9,]*c[0-9,]*$/ {s/^[0-9,]*c\(.*\)/\1 p/;p}' \
| sed -n -f - /tmp/two.txt > /tmp/changed.txt
diff -b -r -w /tmp/one.txt /tmp/two.txt \
| sed -n -e '/^[0-9,]*a[0-9,]*$/ {s/^[0-9,]*a\(.*\)/\1 p/;p}' \
| sed -n -f - /tmp/two.txt > /tmp/new.txt
This converts the line numbers output from diff to sed print (p) commands for added (a) and changed (c) line ranges. The resulting sed scripts are applied to the second file to print just the desired lines. (I hope HP-UX sed supports the -f - for taking script from standard input.)
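For instance, fed the hunk headers from the question's diff output, the first sed emits a one-line script (16a17 is ignored there because it is an added range, not a changed one), which the final sed then runs against the second file:
$ printf '14c14\n16a17\n' | sed -n -e '/^[0-9,]*c[0-9,]*$/ {s/^[0-9,]*c\(.*\)/\1 p/;p}'
14 p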
There is also a solution that does not require interpreting line numbers from the output of diff. diff supports side-by-side formatting (-y), which includes a gutter marking old, new, and changed lines with <, >, and | respectively. You can reduce this side-by-side format to just the markers by using --width=1 (or -W1). If you drop the old-line markers (grep -v '<') and prefix the lines of the second file with what remains (paste), you can then filter (grep) by marker and throw the markers away (cut). You can do this for both the new and the changed lines.
Here is a self-contained "script" as an example:
# create two example files (one character per line)
echo abcdefghijklmnopqrstuvwxyz | grep -o . > /tmp/one.txt
echo abcDeFghiJKlmnopPqrsStuvVVwxyzZZZ | grep -o . > /tmp/two.txt
# diff side-by-side to get markers and apply to new file
diff -b -r -w -y -W1 /tmp/one.txt /tmp/two.txt \
| fgrep -v '<' | paste - /tmp/two.txt \
| grep -e '^|' | cut -c3- > /tmp/changed.txt
diff -b -r -w -y -W1 /tmp/one.txt /tmp/two.txt \
| fgrep -v '<' | paste - /tmp/two.txt \
| grep -e '^>' | cut -c3- > /tmp/new.txt
# dump result
cat /tmp/changed.txt
echo ---
cat /tmp/new.txt
Its output is
D
F
J
K
---
P
S
V
V
Z
Z
Z
I hope this helps you solve your problem.
This can be done with a grep command as follows.
diff -b -r -w <dir-one>/AFB.csv <dir-two>/AFB.csv | grep ">" >> append.log
diff -b -r -w <dir-one>/AFB.csv <dir-two>/AFB.csv | grep "<" >> changed.log
