Find difference in second field, report using first field (awk) - shell

I have 2 (dummy) files
file1.txt
Tom 25
John 27
Bob 22
Justin 37
Nick 19
Max 42
file2.txt
Tom 25
John 40
Bob 22
Justin 37
Nick 19
Max 24
I want to compare the Second field of these files (the numbers). Then If they are different, report using the First field (Names). So the expected output would be the following.
John's age in file1.txt is different from file2.txt
Max's age in file1.txt is different from file2.txt
I don't know if my approach is good but I first parse the ages into another file and compare them. If they are different, I will look in which line number is the difference. Then I will go back to the original file and parse the Name of the person from THAT line.
I run the following code in shell.
$ awk '{print $2}' file1.txt > tmp1.txt
$ awk '{print $2}' file2.txt > tmp2.txt
$
$ different=$(diff tmp1.txt tmp2.txt | awk '{$1=""; print $0')
$
$ if ["${different}"]; then
$ #This is to get the line number where the ages are different
$ #so that I can go to THAT line in file1.txt and get the first field.
$ awk 'NR==FNR{a[$0];next}!($0 in a){print FNR}' tmp1.txt tmp2.txt > lineNumber.txt
$ fi
However, I am blocked here. I don't know if my approach is right or there's an easier approach.
Thanks a lot

awk 'NR==FNR{a[$1]=$2;next} $2!=a[$1]{print "Age of "$1" is different"}' file1 file2

awk '
NR==FNR{a[$1]=$2;next}
a[$1] != $2 {print $1"\047s age in "ARGV[1]" is different from "ARGV[2]}
' file1.txt file2.txt

If both files list the same names, something like this works:
join file{1,2}.txt | awk '$2 != $3 { print "Age of " $1 " is different" }'

Related

How to compare two files using if loop?

I have a file named file1.txt that looks like this:
Alex Dog
Ana Cat
Jack Fish
Kyle Mouse
And a file named file2.txt that looks like this:
Alex Lion
Ana Cat
Jack Fish
Kyle Mouse
What would be a good way to run a loop that checks if the names (Alex, Ana etc) still own the same pets (second column)?
I want the script to run the compare and then if they all match do nothing. If there is 1 mismatch or more Echo the pet that has been changed. For example on these two files (file1.txt and file2.txt) the script would print:
Lion
There are a lot of assumptions in the following code, but it works as per KamilCuk's design:
$ join file1.txt file2.txt | awk ' $2 != $3 { print $3} '
Lion
Assumptions:
files are sorted
the files have the same people in the same order
the people have only first names - no spaces in the names
If we want to do a little better, we can sort the inputs and allow the names to have spaces as follows:
$ join <( sort file1.txt) <( sort file2.txt ) | awk ' $(NF - 1) != $NF { print $NF } '
Remove all lines in file2.txt that are found in file1.txt and show after the space.
grep -vf file1.txt file2.txt | cut -d" " -f2
Edit:
You asked for a loop. A loop will use a lot more resources and should be avoided (use awk when you can not think of smart commands, perhaps with some pipes).
# Avoid the following slow loop
while read -r name animal; do
grep "^${name} " file2.txt | grep -v " ${animal}$" | cut -d" " -f2
done < file1.txt

awk or shell command to count occurence of value in 1st column based on values in 4th column

I have a large file with records like below :
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
I want to find the no of person (names in col 1) have apple and oranges both. And the command should take as less memory as possible and should be fast. Any help appreciated!
Output :
awk/sed file => 2 (jon and tom)
Using awk is pretty easy:
awk -F, \
'$4 == "apple" { apple[$1]++ }
$4 == "oranges" { orange[$1]++ }
END { for (name in apple) if (orange[name]) print name }' data
It produces the required output on the sample data file:
jon
tom
Yes, you could squish all the code onto a single line, and shorten the names, and otherwise obfuscate the code.
Another way to do this avoids the END block:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && orange[$1]) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && apple[$1]) print $1 }' data
When it encounters an apple entry for the first time for a given name, it checks to see if the name also (already) has an entry for oranges and prints it if it has; likewise and symmetrically, if it encounters an orange entry for the first time for a given name, it checks to see if the name also has an entry for apple and prints it if it has.
As noted by Sundeep in a comment, it could use in:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && $1 in orange) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple) print $1 }' data
The first answer could also use in in the END loop.
Note that all these solutions could be embedded in a script that would accept data from standard input (a pipe or a redirected file) — they have no need to read the input file twice. You'd replace data with "$#" to process file names if they're given, or standard input if no file names are specified. This flexibility is worth preserving when possible.
With awk
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){print $1}' ip.txt ip.txt
jon
tom
This processes the input twice
In first pass, add key to an array if last field is apple (-F, would set , as input field separator)
In second pass, check if last field is oranges and if first field is a key of array a
To print only number of matches:
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){c++} END{print c}' ip.txt ip.txt
2
Further reading: idiomatic awk for details on two file processing and awk idioms
I did a work around and used only grep and comm commands.
grep "apple" file | cut -d"," -f1 | sort > file1
grep "orange" file | cut -d"," -f1 | sort > file2
comm -12 file1 file2 > names.having.both.apple&orange
comm -12 shows only the common names between the 2 files.
Solution from Jonathan also worked.
For the input:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
the command:
sed -n "/apple\|oranges/p" inputfile | cut -d"," -f1 | uniq -d
will output a list of people with both apples and oranges:
jon
tom
Edit after comment: For an for input file where lines are not ordered by 1st column and where each person can have two or more repeated fruits, like:
jon,1,2,apple
fred,1,2,apple
fred,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
jon,1,2,oranges
tom,1,2,apple
mary,1,2,apple
tom,1,2,oranges
This command will work:
sed -n "/\(apple\|oranges\)$/ s/,.*,/,/p" inputfile | sort -u | cut -d, -f1 | uniq -d

Split file based off of two columns in bash

I have a tab delimited file that I would like to split into smaller files based off of two columns. My data looks like the following:
360.40 hockey james april expensive
1200.00 hockey james may expensive
124.33 baseball liam april cheap
443.12 soccer john may moderate
I want to parse these rows by the third and fifth columns.
The end result would be three different files named after the third and fifth columns like this:
james-expensive.tsv liam-cheap.tsv john-moderate.tsv
In each of those files I want only the first value in the row associated with that name/expense type. So in james-expensive.tsv for exmaple,the file would contain one column:
360.40
1200.00
I thought maybe some sort of awk script or sed script may be able to solve this, but I'm not quite sure where to start.
If it seems like a bad idea to do this with either awk or sed, that would help to know too.
Using awk:
awk '{ print $1 > $3 "-" $5 ".tsv" }' your_file
Result:
$ for F in *.tsv; do echo "---- $F ----"; cat "$F"; done
---- james-expensive.tsv ----
360.40
1200.00
---- john-moderate.tsv ----
443.12
---- liam-cheap.tsv ----
124.33
Update for nawk:
awk '{ f = $3 "-" $5 ".tsv"; print $1 > f }' your_file
Prevent too many opened files:
awk '{ f = $3 "-" $5 ".tsv" } !a[f]++ { printf "" > f } { print $1 >> f; close(f) }' your_file
You didn't tag perl but here is a oneliner:
perl -lane '`echo "$F[0]" >> $F[2]-$F[4].tsv`' file

comparing 2 files and extracting elements from file

I have two files. one has list of names (only one column) and the second file is with three columns with names, phone number, country.
What I want is to extract the data of the people whose names are not present in file 1, but only present in file2.
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1 != "'$i'") {print $1 "\t" $2 "\t" $3 }}'>>NonResp
done
What I get is a weird result with more data than expected.
Kindly help.
You can do this with grep:
grep -v -F -f file1 file2
awk '{print $1}' file2 | comm -1 -3 file1 - | join file2 -
The files must already be sorted for this to work properly.
Explanation:
=> awk '{print $1}' file2 |
print only the first fileld of file2 and feed it to the next command (|)
=> comm -1 -3 file1 - |
compare file1 and the output of the last command (-) and suppress lines only in file1 (-1) as well as lines in both files (-3); that leaves lines in file2 only and feed this to the next command (|)
=> join file2 -
join the original file2 and the output from the last command (-) and write out the fields fo the matching lines (whitespace between fields is truncated, however)
Testcase:
cat <<EOF >file1
alan
bert
cindy
dave
fred
sunny
ted
EOF
cat <<EOF >file2
bert 01 AU
cindy 03 CZ
ginny 05 CN
ted 07 CH
zorro 09 AG
EOF
awk '{print $1}' file2 | comm -1 -3 file1 - | join file2 -
assuming the field delimiter as "," in file2
awk -F, 'FNR==NR{a[$1];next}!($1 in a)' file1 file2
if "," is not the delimiter ,then simply
awk 'FNR==NR{a[$1];next}!($1 in a)' file1 file2
would be sufficient.

bash process data from two files

file1:
456
445
2323
file2:
433
456
323
I want get the deficit of the data in the two files and output to output.txt, that is:
23
-11
2000
How do I realize this? thank you.
$ paste file1 file2 | awk '{ print $1 - $2 }'
23
-11
2000
Use paste to create the formulae, and use bc to perform the calculations:
paste -d - file1 file2 | bc
In pure bash, with no external tools:
while read -u 4 line1 && read -u 5 line2; do
printf '%s\n' "$(( line1 - line2 ))"
done 4<file1 5<file2
This works by opening both files (attaching them to file descriptors 4 and 5); going into a loop in which we read one line from each descriptor per iteration (exiting the loop if either has no value), and calculate and print the result.
You could use paste and awk to operate between columns:
paste -d" " file1 file2 | awk -F" " '{print ($1-$2)}'
Or even pipe to a file:
paste -d" " file1 file2 | awk -F" " '{print ($1-$2)}' > output.txt
Hope it helps!

Resources