diff two batches of files - bash

I would like to diff two batches of files. If I simply put them in two different directories and diff by directory, the comparisons will be made alphabetically, which I do not want.
Another approach would be to list files in text1.txt and list files in text2.txt:
text1:
a1
b1
c1
text2:
c2
a2
b2
How can I approach this such that my loop will be:
diff a1 c2
diff b1 a2
diff c1 b2

You can use paste to join the two files, then a bash loop to process.
paste text1.txt text2.txt | while read -r file1 file2; do diff "$file1" "$file2"; done
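If the listed names may contain spaces, a variant that splits only on the tab character paste inserts keeps each pair intact (a sketch with hypothetical list contents; it assumes no tabs inside the names):

```shell
# Build two hypothetical list files whose entries contain spaces
printf '%s\n' "old a" "old b" > text1.txt
printf '%s\n' "new a" "new b" > text2.txt

# paste joins the lists with a tab; splitting on tab alone
# keeps "old a" and "new a" together as single fields
paste text1.txt text2.txt | while IFS=$'\t' read -r file1 file2; do
    echo "diff $file1 $file2"   # echo stands in for the real diff call
done
```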

In bash, you can use the -u flag on read to read from a different fd. This allows you to read from two files in parallel:
while read -r -u3 file1 && read -r -u4 file2; do
    diff "$file1" "$file2"
done 3<text1.txt 4<text2.txt

Another solution:
#!/bin/bash
file1="..."
file2="..."

getSize(){
    wc -l "$1" | cut -d " " -f1
}
getValueFromLineNumber(){
    sed -n "$1p" "$2"
}
diffFromLineNumber(){
    f1=$(getValueFromLineNumber "$1" "$file1")
    f2=$(getValueFromLineNumber "$1" "$file2")
    diff "$f1" "$f2"
}
# get min size
s1=$(getSize "$file1")
s2=$(getSize "$file2")
[[ $s1 -le $s2 ]] && min=$s1 || min=$s2
for (( i=1 ; i <= min ; i++ )); do
    diffFromLineNumber "$i"
done
This solution takes care of the case where the two files don't have the same number of lines.

Related

reading lines in a text file with special characters specifically as quoted '<', '>' in bash shell

I have a text file which is the output difference of two grepped files. The text file has lines like below. I need to read the file (loop through its lines) and, based on the text to the left-hand side of '<' or the right-hand side of '>', do something.
editing to add details:
LHS of < OR RHS of >
If either of those, I will need to store the content into a variable, get the 1st (ABCDEF) and 3rd (10) fields, and search (grep) for them in one of the other two files; if found, print a message and attach those file names in an email DL. All the file names and directories have been stored in separate variables.
How do I do that?
PS: I have basic knowledge of text formatting and bash/shell commands but am still learning the scripting syntax. Thanks.
ABCDEF,20200101,10 <
PQRSTU,20200106,11 <
LMNOPQ,20200101,12 <
EFGHIJ,20200102,13 <
KLMNOP,20200103,14 <
STUVWX,20200104,15 <
PQRSTU,20200105,16 <
> LMNOPQ,20200101,10
ABCDEF,20200107,17 <
What am I doing wrong now?
while IFS= read -r line; do
if $line =~ ([^[:blank:]]+)[[:blank:]]+\<
then
IFS=, read -r f1 f2 f3 <<< "${BASH_REMATCH[1]}"
#echo "f1=$f1 f2=$f2 f3=$f3"
zgrep "$f1" file1 | grep "with seq $f3" || zgrep "$f1" file2 | grep "with seq $f3"
elif $line =~ \>[[:blank:]]+([^[:blank:]]+)
then
IFS=, read -r g1 g2 g3 <<< "${BASH_REMATCH[1]}"
#echo "g1=$g1 g2=$g2 g3=$g3"
zgrep "$g1" file3 | grep "with seq $g3" || zgrep "$g1" file3 | grep "with seq $g3"
fi
Would you please try something like:
#!/bin/bash
while IFS= read -r line; do
if [[ $line =~ ([^[:blank:]]+)[[:blank:]]+\< || $line =~ \>[[:blank:]]+([^[:blank:]]+) ]]; then
IFS=, read -r f1 f2 f3 <<< "${BASH_REMATCH[1]}"
echo "f1=$f1 f2=$f2 f3=$f3"
# do something here with "$f1", "$f2" and "$f3"
fi
done < file.txt
Output:
f1=ABCDEF f2=20200101 f3=10
f1=PQRSTU f2=20200106 f3=11
f1=LMNOPQ f2=20200101 f3=12
f1=EFGHIJ f2=20200102 f3=13
f1=KLMNOP f2=20200103 f3=14
f1=STUVWX f2=20200104 f3=15
f1=PQRSTU f2=20200105 f3=16
f1=LMNOPQ f2=20200101 f3=10
f1=ABCDEF f2=20200107 f3=17
Please modify the echo "f1=$f1 f2=$f2 f3=$f3" line to your desired
command such as grep.
The regex ([^[:blank:]]+)[[:blank:]]+\< matches a line which contains <
and assigns the LHS to the bash variable ${BASH_REMATCH[1]}.
On the other hand, the regex \>[[:blank:]]+([^[:blank:]]+) does the same for
a line which contains >.
The statement IFS=, read -r f1 f2 f3 <<< "${BASH_REMATCH[1]}" splits the bash variable
on , and assigns the fields to f1, f2 and f3.
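The two pieces can be demonstrated in isolation (the sample line below is hypothetical, following the data shape above):

```shell
# A line in the "< on the right" shape
line='ABCDEF,20200101,10                     <'

# The capture group grabs the non-blank run before the blanks and '<'
[[ $line =~ ([^[:blank:]]+)[[:blank:]]+\< ]] && echo "${BASH_REMATCH[1]}"

# Splitting that capture on commas into three variables
IFS=, read -r f1 f2 f3 <<< "${BASH_REMATCH[1]}"
echo "f1=$f1 f2=$f2 f3=$f3"
```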
Please note that if the input file is very large, a bash solution may not
be efficient in execution time. I used bash just because it is convenient
to pass the variables to your grep command.
EDIT
Regarding the updated script in your question, please refer to the following modification:
while IFS= read -r line; do
if [[ $line =~ ([^[:blank:]]+)[[:blank:]]+\< ]]; then
IFS=, read -r f1 f2 f3 <<< "${BASH_REMATCH[1]}"
# echo "f1=$f1 f2=$f2 f3=$f3"
result=$(zgrep "$f1" file1 | grep "with seq $f3" || zgrep "$f1" file2 | grep "with seq $f3")
elif [[ $line =~ \>[[:blank:]]+([^[:blank:]]+) ]]; then
IFS=, read -r g1 g2 g3 <<< "${BASH_REMATCH[1]}"
# echo "g1=$g1 g2=$g2 g3=$g3"
result=$(zgrep "$g1" file3 | grep "with seq $g3" || zgrep "$g1" file3 | grep "with seq $g3")
fi
if [[ -n $result ]]; then
echo "result = $result"
fi
done < file.txt

Linux: Search for coincidences in four different files

Scenario: four files with 300 lines each. I want to know which lines appear in all four files, using bash only (no perl/python/ruby please).
Quick sample
$ cat bad_domains.urlvoid
a
b
c
d
e
$ cat bad_domains.alienvault
f
g
a
c
h
$ cat bad_domains.hphosts
i
j
k
a
h
$ cat bad_domains.malwaredomain
l
b
m
f
a
j
I only want to match the "a". I tried stuff like this, but it's slow as hell:
for void in $(cat bad_domains.urlvoid)
do
for vault in $(cat bad_domains.alienvault)
do
for hphosts in $(cat bad_domains.hphosts)
do
for malwaredomain in $(cat bad_domains.malwaredomain)
do
if [ $void == $vault -a $void == $hphosts -a $void == $malwaredomain -a $vault == $hphosts -a $vault == $malwaredomain -a $hphosts == $malwaredomain ]
then
echo $void
fi
done
done
done
done
Any good tips for optimizing my code? I read something about dichotomic search that might work here.
Using comm:
comm -12 <(awk 'FNR==NR{a[$0];next} $0 in a' f1 f2) <(awk 'FNR==NR{a[$0];next} $0 in a' f3 f4)
a
Which works using these 3 steps:
Get common strings from file1 and file2
Get common strings from file3 and file4
Get common strings from above 2 steps thus getting intersection of 4 sets
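One caveat: comm assumes its inputs are sorted, and the sample files above are not. A variant that sorts each list first (a sketch, recreating the sample data):

```shell
# Recreate the four sample files from the question
printf '%s\n' a b c d e > bad_domains.urlvoid
printf '%s\n' f g a c h > bad_domains.alienvault
printf '%s\n' i j k a h > bad_domains.hphosts
printf '%s\n' l b m f a j > bad_domains.malwaredomain

# Pairwise intersections of the sorted lists, then intersect the results
# (comm's output is itself sorted, so the outer comm is safe)
comm -12 <(comm -12 <(sort bad_domains.urlvoid) <(sort bad_domains.alienvault)) \
         <(comm -12 <(sort bad_domains.hphosts) <(sort bad_domains.malwaredomain))
```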
EDIT: Pure awk solution:
awk 'FNR==NR{a[$0];next} $0 in a' <(awk 'FNR==NR{a[$0];next} $0 in a' f1 f2) <(awk 'FNR==NR{a[$0];next} $0 in a' f3 f4)
If the lines are unique within each file:
cat file1 file2 file3 file4 | sort | uniq -c | grep '^ *4 '
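If a line can repeat inside a single file, de-duplicating each file first keeps the count meaning "number of files" rather than "number of occurrences" (a sketch with hypothetical files f1..f4):

```shell
# 'a' appears twice in f1; without sort -u it would be counted twice
printf '%s\n' a a b > f1
printf '%s\n' a c   > f2
printf '%s\n' a d   > f3
printf '%s\n' a e   > f4

# De-duplicate each file, then keep lines seen in exactly 4 files
cat <(sort -u f1) <(sort -u f2) <(sort -u f3) <(sort -u f4) |
    sort | uniq -c | awk '$1 == 4 { print $2 }'
```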
For bash 4.x (and ksh93)
Create an associative array indexed by the lines of one of the files (master).
For each of the remaining files, create a second array (work) indexed by the file's lines, then iterate over the master
array removing any entry with a key which does not also appear in the work array.
Any keys left in master[] after processing must have been in all files.
list=( bad_domains.* )
typeset -A master
while IFS= read -r key ; do master[$key]=1 ; done < "${list[0]}"
unset list[0]
for file in "${list[@]}" ; do
    typeset -A work
    while IFS= read -r key ; do work[$key]=1 ; done < "$file"
    for key in "${!master[@]}" ; do [[ ${work[$key]+set} = set ]] || unset "master[$key]" ; done
    unset work
done
for key in "${!master[@]}" ; do printf '%s\n' "$key" ; done

File comparision using shell script

I have two files named file1 and file2.
Content of file1 --->
Hello/Good/Morning
World/India
Content of file2 --->
Hello/Good/Morning
World/China
I need to check if the contents of these files are equal or not. Since both files have "Hello/Good/Morning" in common, it should print "EQUAL" as per my requirement. I have written code for this:
file1=/app/webmcore1/Demo/FORLOOP/Kasturi/xyz/pqr.txt
file2=/app/webmcore1/Demo/FORLOOP/Prashast/xyz/pqr.txt
IFS=' '
for i in $(cat $file1)
do
    if [ "$i" != '' ]; then
        echo "$i"
        for j in $(cat $file2)
        do
            if [ "$j" != '' ]; then
                echo "$j"
                if [[ $i -eq $j ]]; then
                    echo "EQUAL"
                fi
            fi
        done
    fi
done
But it is not displaying the output properly.
diff compares files line by line. If diff file1 file2 outputs anything, the files are different;
if the output is empty, they are the same.
There already is a tool to compare files: it's called diff (and it is actually much more powerful than just deciding equal or not, but it can be used for this).
diff -q file1 file2 >/dev/null && echo "EQUAL"
If you also want to print something in case the files are not equal:
diff -q file1 file2 >/dev/null && echo "EQUAL" || echo "NOT EQUAL"
So, the files are "equal" if they have any single word in common?
result=$(
comm -12 <(tr '[:space:]' '\n' <file1 | sort) <(tr '[:space:]' '\n' <file2 | sort)
)
[[ -n $result ]] && echo EQUAL
Or, just in bash
words=( $(< file1) )
for word in $(< file2); do
if [[ " ${words[*]} " == *" $word "* ]]; then
echo "EQUAL due to $word"
break
fi
done
EQUAL due to Hello/Good/Morning

Take two at a time in a bash "for file in $list" construct

I have a list of files where two subsequent ones always belong together. I would like a for loop to extract two files from this list per iteration, and then work on those two files at a time (as an example, let's say I just want to concatenate, i.e. cat, the two files).
In a simple case, my list of files is this:
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt"
I could hack around it and say
FILES="file1 file2"
for file in $FILES
do
actual_mateA=${file}_mateA.txt
actual_mateB=${file}_mateB.txt
cat $actual_mateA $actual_mateB
done
But I would like to be able to handle lists where mate A and mate B have arbitrary names, e.g.:
FILES="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
Is there a way to extract two values out of $FILES per iteration?
Use an array for the list:
files=(fileA1 fileA2 fileB1 fileB2)
for (( i=0; i<${#files[@]} ; i+=2 )) ; do
echo "${files[i]}" "${files[i+1]}"
done
You could read the values from a while loop and use xargs to restrict each read operation to two tokens.
files="fileA1 fileA2 fileB1 fileB2"
while read -r a b; do
echo $a $b
done < <(echo $files | xargs -n2)
You could use xargs(1), e.g.
ls -1 *.txt | xargs -n2 COMMAND
The -n2 switch makes xargs take 2 consecutive filenames at a time from the pipe output, which are handed down to the COMMAND.
To concatenate the 10 files file01.txt ... file10.txt pairwise
one can use
ls *.txt | xargs -n2 sh -c 'cat "$@" > "$1.$2.joined"' dummy
to get the 5 result files
file01.txt.file02.txt.joined
file03.txt.file04.txt.joined
file05.txt.file06.txt.joined
file07.txt.file08.txt.joined
file09.txt.file10.txt.joined
Please see 'info xargs' for an explanation.
How about this:
park=''
for file in $files # wherever you get them from, maybe $(ls) or whatever
do
if [ "$park" = '' ]
then
park=$file
else
process "$park" "$file"
park=''
fi
done
In each odd iteration it just stores the value (in park) and in each even iteration it then uses the stored and the current value.
Seems like one of those things awk is suited for
$ awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }' <<< "$FILES"
file1_mateA.txt file1_mateB.txt
file2_mateA.txt file2_mateB.txt
You could then loop over it by setting IFS=$'\n'
e.g.
#!/bin/bash
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt file3_mateA.txt"
input=$(awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }' <<< "$FILES")
IFS=$'\n'
for set in $input; do
    IFS=' '     # reset so that the unquoted $set splits into the two names
    cat $set    # or something
done
Which will try to do
$ cat file1_mateA.txt file1_mateB.txt
$ cat file2_mateA.txt file2_mateB.txt
And ignore the odd case without the match.
You can transform your string into an array and read the new array element by element:
#!/bin/bash
string="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
array=(${string})
size=${#array[*]}
idx=0
while [ "$idx" -lt "$size" ]
do
echo ${array[$idx]}
echo ${array[$(($idx+1))]}
let "idx=$idx+2"
done
If the delimiter in your string is something other than a space (e.g. ;), you can use the following transformation to an array:
array=(${string//;/ })
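For example (the ;-separated string here is hypothetical):

```shell
string="first.txt;second.txt;third.txt;fourth.txt"
array=(${string//;/ })   # replace every ';' with a space, then word-split
echo "${#array[*]}"      # number of elements: 4
echo "${array[1]}"       # second element: second.txt
```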
You could try something like this:
echo file1 file2 file3 file4 | while read -d ' ' a; do read -d ' ' b; echo $a $b; done
file1 file2
file3 file4
Or this, somewhat cumbersome technique:
echo file1 file2 file3 file4 |tr " " "\n" | while :;do read a || break; read b || break; echo $a $b; done
file1 file2
file3 file4

Taking line intersection of several files

I see comm can do 2 files and diff3 can do 3 files. I want to do this for more files (5-ish).
One way:
comm -12 file1 file2 >tmp1
comm -12 tmp1 file3 >tmp2
comm -12 tmp2 file4 >tmp3
comm -12 tmp3 file5
This process could be turned into a script
comm -12 $1 $2 > tmp1
for i in $(seq 3 1 $# 2>/dev/null); do
comm -12 tmp`expr $i - 2` $(eval echo '$'$i) >tmp`expr $i - 1`
done
if [ $# -eq 2 ]; then
cat tmp1
else
cat tmp`expr $i - 1`
fi
rm tmp*
This seems like poorly written code, even to a newbie like me, is there a better way?
It's quite a bit more convoluted than it has to be. Here's another way of doing it.
#!/bin/bash
# Create some temp files to avoid trashing and deleting tmp* in the directory
tmp=$(mktemp)
result=$(mktemp)
# The intersection of one file is itself
cp "$1" "$result"
shift
# For each additional file, intersect with the intermediate result
for file
do
comm -12 "$file" "$result" > "$tmp" && mv "$tmp" "$result"
done
cat "$result" && rm "$result"
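A worked run of the same loop, inlined on three hypothetical sorted lists (remember that comm requires sorted input):

```shell
# Three small sorted lists; 'a' and 'c' appear in all of them
printf '%s\n' a b c > s1
printf '%s\n' a c d > s2
printf '%s\n' a c e > s3

tmp=$(mktemp)
result=$(mktemp)

cp s1 "$result"                 # the intersection of one file is itself
for file in s2 s3; do
    comm -12 "$file" "$result" > "$tmp" && mv "$tmp" "$result"
done

common=$(cat "$result")
echo "$common"
rm -f "$result" s1 s2 s3
```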
