How to compare parts of lines of two txt files? - bash

I am given two txt files, each with information aligned in several columns separated by a tab. What I want to do is find lines in both files where one of these columns matches: not the whole lines, but only their first columns should be identical. How do I do that in a bash script?
I have tried using grep -Fwf.
So this is what the files look like:
aaaa bbbb
cccc dddd
and
aaaa eeee
ffff gggg
The output I'd like to get is something like:
bbbb and eeee match
I really haven't found a command that does both line-wise and word-wise comparison at the same time.
Sorry for not providing any code of my own, I'm new to programming and couldn't come up with anything reasonable so far. Thanks in advance!

Have you seen the join command? This in combination with sort may be what you are looking for: https://shapeshed.com/unix-join/
for example:
$ cat a
aaaa bbbb
cccc dddd
$ cat b
aaaa eeee
ffff gggg
$ join a b
aaaa bbbb eeee
If the values in the first column are not sorted, then you have to sort them first, otherwise join will not work:
join <(sort a) <(sort b)
Kind regards
Oliver
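To get output closer to the "bbbb and eeee match" format from the question, join's -o option can select which fields to print; here is a rough sketch using the example files a and b above (the awk post-processing is just one way to phrase the message):
join -o 1.2,2.2 <(sort a) <(sort b) | awk '{print $1 " and " $2 " match"}'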

Assuming your tab-separated file maintains the correct structure, this should work:
diff <(awk '{print $2}' f1) <(awk '{print $2}' f2)
# File names: f1, f2
# Column: 2nd column.
The output, when something differs, looks like:
2c2
< dx
---
> ldx
There is no output when the columns are identical.
I tried @Wiimm's answer and it didn't work for me.

There are different kinds of comparison and different tools to do them:
diff
cmp
comm
...
All commands have options to vary the comparison.
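For example (a rough sketch of typical invocations, assuming two files named file1 and file2):
diff file1 file2                        # line-by-line differences
cmp -s file1 file2 && echo "identical"  # byte-by-byte, result only via exit status
comm -12 <(sort file1) <(sort file2)    # lines common to both (comm needs sorted input)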
For each command, you can specify filters. E.g.
# remove comments before comparison
diff <( grep -v ^# file1) <( grep -v ^# file2)
Without concrete examples, it is impossible to be more exact.

You can use awk, like this:
awk 'NR==FNR{a[NR]=$1;b[NR]=$2;next}
a[FNR]==$1{printf "%s and %s match\n", b[FNR], $2}' file1 file2
Output:
bbbb and eeee match
Explanation (the same code broken into multiple lines):
# As long as we are reading file1, the overall record
# number NR is the same as the record number in the
# current input file FNR
NR==FNR{
# Store column 1 and 2 in arrays called a and b
# indexed by the record number
a[NR]=$1
b[NR]=$2
next # Do not process more actions for file1
}
# The following code gets only executed when we read
# file2 because of the above _next_ statement
# Check if column 1 in file1 is the same as in file2
# for this line
a[FNR]==$1{
printf "%s and %s match\n", b[FNR], $2
}
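If you want to run the expanded version as-is, you can save it to a file (compare.awk is just an illustrative name here) and pass it to awk with -f:
awk -f compare.awk file1 file2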

Related

How to keep the last occurrence of duplicate lines in a text file?

I have a text file with contents that may be duplicates. Below is a simplified representation of my txt file (text means a unique character, word, or phrase). Note that the separator ---------- may not be present. Also, the whole content of the file consists of Unicode Japanese and Chinese characters.
EDITED
sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
What I want to achieve is to keep only the line with the last occurrence of the duplicates like so:
sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
The closest I found online is How to remove only the first occurrence of a line in a file using sed, but this requires that you know which matching pattern(s) to delete. The suggested topics provided when writing the title gave Duplicating characters using sed and last occurrence of date, but they didn't work.
I am on a Mac with Sierra. I am writing my executable commands in a script.sh file to execute commands line by line. I'm using sed and gsed as my primary stream editors.
I am not sure if your intent is to preserve the original order of the lines. If that is the case, you could do this:
export LC_ALL=en_US.utf8 # to handle unicode characters in file
nl -n rz -ba file | sort -k2,2 -t$'\t' | uniq -f1 | sort -k1,1 | cut -f2
nl -n rz -ba file adds zero-padded line numbers to the file
sort -k2,2 -t$'\t' sorts the output of nl by the second field (note that nl puts a tab after the line number)
uniq -f1 removes the duplicates, while ignoring the line number field (-f1)
the final sort restores the original order of the lines, with duplicates removed
cut -f2 removes the line number field, restoring the content to the original format
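For illustration, the intermediate output of nl -n rz -ba file looks roughly like this (a zero-padded number, a tab, then the original line), which is what -t$'\t', -f1 and the final cut -f2 all rely on:
000001	sometext1
000002	sometext2
000003	sometext3
...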
This awk solution comes very close (it differs from the desired output only by keeping the ---------- separator line).
Given:
$ cat file
sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
You can do:
$ awk 'BEGIN{FS=":"}
FNR==NR {for (i=1; i<=NF; i++) {dup[$i]++; last[$i]=NR;} next}
/^$/ {next}
{for (i=1; i<=NF; i++)
if (dup[$i] && FNR==last[$i]) {print $0; next}}
' file file
sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
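If you want to drop the ---------- separator as well (an assumption based on the expected output in the question), you could add one more skip rule next to the /^$/ one:
awk 'BEGIN{FS=":"}
FNR==NR {for (i=1; i<=NF; i++) {dup[$i]++; last[$i]=NR;} next}
/^$/ || /^-+$/ {next}
{for (i=1; i<=NF; i++)
if (dup[$i] && FNR==last[$i]) {print $0; next}}
' file file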
This might work for you (GNU sed):
sed -r '1h;1!H;x;s/([^\n]+)\n(.*\1)$/\2/;s/\n-+$//;x;$!d;x' file
Store the first line in the hold space (HS) and append every subsequent line. Swap to the HS and remove any duplicate line that matches the last line. Also delete any separator lines and then swap back to the pattern space (PS). Delete all but the last line, which is swapped with the HS and printed out.
I found a simpler solution, but it sorts the file in the process. So if you don't mind the output being sorted, you can use the following:
$ sort -u input.txt > output.txt
Note: the -u flag makes sort output only the unique lines.
Or, as in the uniq manual, you can list only the duplicated lines (uniq only detects adjacent duplicates, so sort first):
sort input.txt | uniq -d

ksh sort first column and avg second column

Looking for an awk or similar command that can sort a file with 2 columns, then produce an output of unique column-one names and the average of column two.
so for example:
aaaa 11.5
aaaa 1.01
aaaa 5.50
bbbb 12.5
bbbb 1.10
bbbb 9.5
looking for output
aaaa 6.00
bbbb 7.7
Like this with awk:
awk '{a[$1]+=$2;b[$1]++} END{for(i in a)print i,a[i]/b[i]}' File
bbbb 7.7
aaaa 6.00333
If you want to round off, use printf.
awk '{a[$1]+=$2;b[$1]++} END{for(i in a)printf("%s %.2f\n",i,a[i]/b[i])}' File
With the first field as index, update array a (adding up the 2nd fields), and keep a counter b[first field] updated as well. At the end, print all indices of a and the corresponding average. Hope it's clear.
For sorted result, pipe the output to sort (awk '{a[$1]+=$2;b[$1]++} END{for(i in a)print i,a[i]/b[i]}' File | sort)
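Putting both together (two-decimal rounding plus sorted output):
awk '{a[$1]+=$2;b[$1]++} END{for(i in a)printf("%s %.2f\n",i,a[i]/b[i])}' File | sort
which, for the sample data, prints aaaa 6.00 and bbbb 7.70.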

Print lines whose 1st and 4th column differ

I have a file with a bunch of lines of this form:
12 AAA 423 12 BBB beta^11 + 3*beta^10
18 AAA 1509 18 BBB -2*beta^17 - beta^16
18 AAA 781 12 BBB beta^16 - 5*beta^15
Now I would like to print only lines where the 1st and the 4th column differ (the columns are space-separated) (the values AAA and BBB are fixed). I know I can do that by getting all possible values in the first column and then use:
for i in $values; do
cat file.txt | grep "^$i" | grep -v " $i BBB"
done
However, this runs through the file as many times as there are different values in the first column. Is there a way to do that in one pass only? I think I can do the comparison; my main problem is that I have no idea how to extract the space-separated columns.
This is something quite straightforward for awk:
awk '$1 != $4' file
With awk, you refer to the first field with $1, the second with $2 and so on. This way, you can compare the first and the fourth with $1 != $4. If this is true (that is, $1 and $4 differ), awk performs its default action: print the current line.
For your sample input, this works:
$ awk '$1 != $4' file
18 AAA 781 12 BBB beta^16 - 5*beta^15
Note you can define a different field separator with -v FS="...". This way, you can tell awk that your lines contain fields tab / comma / ... separated. All together it would be like this: awk -v FS="\t" '$1 != $4' file.

(shell) How to remove strings from one file which can be found in another file?

file1.txt
aaaa
bbbb
cccc
dddd
eeee
file2.txt
DDDD
cccc
aaaa
result
bbbb
eeee
If it could be case-insensitive, it would be even better!
Thank you!
grep can match patterns read from a file and print out all lines NOT matching those patterns. It can match case-insensitively too, like:
grep -vi -f file2.txt file1.txt
Excerpts from the man pages:
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX.)
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files. (-i is specified by POSIX.)
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
Off the top of my head, use grep -Fiv -f file2.txt < file1.txt.
-F no regexps (fast)
-i case-insensitive
-v invert results
-f <pattern file> get patterns from file
$ grep -iv -f file2 file1
bbbb
eeee
Or you can use awk:
awk 'FNR==NR{ a[tolower($1)]=$1; next }
{
  s=tolower($1)
  f=0
  for(i in a){ if(i==s){ f=1 } }
  if(!f) { print s }
}' file2 file1
ghostdog74's awk example can be simplified:
awk '
FNR == NR { omit[tolower($0)]++; next }
tolower($0) in omit {next}
{print}
' file2 file1
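Run against the sample files from the question, this keeps only the lines of file1 that do not occur (case-insensitively) in file2:
$ awk 'FNR==NR{omit[tolower($0)]++; next} tolower($0) in omit{next} {print}' file2 file1
bbbb
eeee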
For various set operations on files see:
http://www.pixelbeat.org/cmdline.html#sets
In your case the inputs are not sorted, so you want the difference, like:
sort -f file1 file2 file2 | uniq -iu
Listing file2 twice means every line that appears in file2 occurs more than once, so uniq -u drops it and only the lines of file1 that are not in file2 remain.
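Other common set operations can be built with the same counting idea (a sketch; it assumes neither file contains internal duplicate lines):
sort file1 file2 | uniq             # union
sort file1 file2 | uniq -d          # intersection
sort file1 file2 file2 | uniq -u    # difference: lines only in file1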

Shell command to find lines common in two files

I'm sure I once found a shell command which could print the common lines from two or more files. What is its name?
It was much simpler than diff.
The command you are seeking is comm, e.g.:
comm -12 1.sorted.txt 2.sorted.txt
Here:
-1 : suppress column 1 (lines unique to 1.sorted.txt)
-2 : suppress column 2 (lines unique to 2.sorted.txt)
To easily apply the comm command to unsorted files, use Bash's process substitution:
$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321
So the files abc and def have one line in common, the one with "132".
Using comm on unsorted files:
$ comm abc def
123
132
567
132
777
321
$ comm -12 abc def # No output! The common line is not found
$
The last line produced no output, the common line was not discovered.
Now use comm on sorted files, sorting the files with process substitution:
$ comm <( sort abc ) <( sort def )
123
132
321
567
777
$ comm -12 <( sort abc ) <( sort def )
132
Now we got the 132 line!
To complement the Perl one-liner, here's its awk equivalent:
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1 into the array arr[], and then check for each line in file2 if it already exists within the array (i.e. file1). The lines that are found will be printed in the order in which they appear in file2.
Note that the comparison in arr uses the entire line from file2 as index to the array, so it will only report exact matches on entire lines.
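For example, with the abc and def files shown earlier, only the shared line is printed:
$ awk 'NR==FNR{arr[$0];next} $0 in arr' abc def
132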
Maybe you mean comm?
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
The secret to finding this information is the info pages. For GNU programs, they are much more detailed than their man pages. Try info coreutils and it will list all the small useful utils for you.
While
fgrep -v -f 1.txt 2.txt > 3.txt
gives you the differences of two files (what is in 2.txt and not in 1.txt), you could easily do a
fgrep -f 1.txt 2.txt > 3.txt
to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should use comm nonetheless. Regards!
Note: You can use grep -F instead of fgrep.
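One caveat (not from the original answer): -f matching without anchors also matches substrings, so if you only want whole-line matches, adding -x is a common refinement:
grep -Fxf 1.txt 2.txt > 3.txt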
If the two files are not sorted yet, you can use:
comm -12 <(sort a.txt) <(sort b.txt)
and it will work, avoiding the error message comm: file 2 is not in sorted order
when doing comm -12 a.txt b.txt.
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
On limited version of Linux (like a QNAP (NAS) I was working on):
comm did not exist
grep -f file1 file2 can cause some problems, as said by @ChristopherSchultz, and using grep -F -f file1 file2 was really slow (more than 5 minutes - I did not let it finish - versus 2-3 seconds with the method below, on files over 20 MB)
So here is what I did:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep "<" | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep "<" | sed 's/^< *//' > files.same.sorted
If files.same.sorted should be in the same order as the original ones, then add this line for the same order as file1:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same
Or, for the same order as file2:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
For how to do this for multiple files, see the linked answer to Finding matching lines across many files.
Combining these two answers (answer 1 and answer 2), I think you can get the result you need without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
    for file2 in *
    do
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ]; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory and make an all-vs-all comparison, leaving the result in the matching_lines file.
Things to be improved:
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
Not exactly what you were asking, but something that may still be useful for a slightly different scenario:
If you just want to check quickly whether there is any repeated line among a bunch of files, you can use this quick solution:
cat a_bunch_of_files* | sort | uniq | wc
If the number of lines you get is less than the number you get from
cat a_bunch_of_files* | wc
then there is at least one repeated line.
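And if you want to see which lines are repeated, rather than just whether any are, a small variation is:
cat a_bunch_of_files* | sort | uniq -d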
rm -f file3.out
cat file1.out | while read -r line1
do
    cat file2.out | while read -r line2
    do
        if [[ "$line1" == "$line2" ]]; then
            echo "$line1" >> file3.out
        fi
    done
done
This should do it.
