compare single column of two files - bash

I have two files, each with two columns separated by a space.
I'd like to find the lines in which column 2 is not the same in both files and output them to a third file.
file A:
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
3 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
4 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
5 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
6 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
7 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
8 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
file B:
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
3 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
4 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
5 WWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
7 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
8 ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
9 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
10 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
11 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
12 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
desired output:
5 WWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
7 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
8 ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
I assumed the easiest way to do this was to grep for each line of file A in file B, but I'm new to bash and can't figure out the next step. Any help is greatly appreciated!

You can use awk for this:
$ awk 'FNR==NR {a[$1]=$2; next} $1 in a && a[$1] != $2' fileA fileB
5 WWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
7 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
8 ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
It loops through the first file, storing its values in an array (a[1st col] = 2nd col). Then it loops through the second file and prints the lines matching both conditions:
The first column is present in the first file.
The second column value is different from the one in the first file.
To store the result in a new file, just redirect the command's output:
awk 'FNR==NR {a[$1]=$2; next} $1 in a && a[$1] != $2' fileA fileB > fileC
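For comparison, join can do the same pairing with awk handling the field check; a minimal sketch, assuming you don't mind sorting both files lexicographically first (join requires sorted input, so the output order changes for keys beyond 9):
$ join <(sort fileA) <(sort fileB) | awk '$2 != $3 {print $1, $3}'
For the sample files this prints the same four lines as the desired output above.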

Related

Optimizing grep -f piping commands [duplicate]

I have two files.
file1 has some keys that have abc in the second column:
et1 abc
et2 abc
et55 abc
file2 has those key values in its last column, plus the numbers I need to add up:
1 2 3 4 5 et1
5 5 5 5 5 et100
3 3 3 3 3 et55
5 5 5 5 4 et1
6 6 6 6 3 et1
For the keys extracted from file1, I need to add up the corresponding column 5 values in file2 whenever the key matches. file2 itself is very large.
This command seems to be working but it is very slow:
egrep -isr "abc" file1.tcl | awk '{print $1}' | grep -vwf /dev/stdin file2.tcl | awk '{tl+=$5} END {print tl}'
How would I go about optimizing the pipe? Also, what am I doing wrong with grep -f? Is it generally not recommended to do something like this?
Edit: Expected output is the sum of all column5 in file2 when the column6 key is present in file1
Edit2: Expected output: since file1 has the keys et1, et2 and et55, adding up column 5 of the matching rows in file2 (rows 1, 3, 4 and 5) gives 5+3+4+3 = 15.
Use a single awk to read file1 into the keys of an array. Then when reading file2, add $5 to a total variable when $6 is in the array.
awk 'NR==FNR { if ($2 == "abc") a[$1] = 0; next }   # file1: remember keys tagged abc
     $6 in a { total += $5 }                        # file2: sum column 5 for known keys
     END { print total }' file1.tcl file2.tcl
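As for the grep -f part of the question: the -v flag inverts the match, so the original pipeline was summing the non-matching lines. A corrected sketch (still slower than the single awk above; with GNU grep, -f - reads the patterns from the pipe, -F treats them as fixed strings, and -w stops et1 from also matching et100):
awk '$2 == "abc" {print $1}' file1.tcl | grep -wFf - file2.tcl | awk '{tl += $5} END {print tl}'
For the sample data this prints 15.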
You could also try the following, which reads file2.tcl first and uses fewer loops. Note that it prints a per-key sum (for the sample data: et1 12, et2 0, et55 3) rather than a single grand total:
awk 'FNR==NR{a[$NF]+=$(NF-1);next} $2=="abc"{print $1,a[$1]+0}' file2.tcl file1.tcl

awk with loops to reorder columns

I am trying to reorder the columns of a comma-separated file by writing an awk program.
My little program to reorder the columns is:
awk -v column=number 'BEGIN {FS=","; ORS="\n"; OFS=","; n=column} {for (i=1; i<=NF; i++){if (i!=n) $(i+1)=$i else $1=$i} {print $0}' file_name
I would like to put the column given by number first, followed by the remaining ones, but it does not work.
You are overwriting fields as you iterate. You should instead "bubble" the value from position column down to the first position.
Consider how would you move column 3 here:
1 2 3 4
to
3 1 2 4
For example with this input file:
$ cat table
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
You could do it like this (separators changed to whitespace for readability):
$ awk -v col=3 '{val=$col; for (i=col; i>1; i--) $i=$(i-1); $1=val; print $0}' table
3 1 2 4
4 2 3 5
5 3 4 6
6 4 5 7
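Since your real file appears to be comma-separated (your program sets FS=","), the same idea with the separators set accordingly; a sketch (assigning to a field makes awk rebuild the record using OFS):
$ awk -v col=3 'BEGIN {FS=OFS=","} {val=$col; for (i=col; i>1; i--) $i=$(i-1); $1=val; print}' file_name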
You can simply print the required column first and then the rest of the columns.
$ awk -v column=3 -F "," '{n=column; printf $column; for (i=1; i<NF; i++){if (i!=n) printf ","$i} print ","$NF}' file
Input:
10,Hello,meow,20,30
hello,world,34,meow,60
Output:
meow,10,Hello,20,30
34,hello,world,meow,60
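One caveat: a bare printf $column uses the field as the printf format string, so a % in the data would be misinterpreted. A safer variant of the same command passes the field as an argument:
$ awk -v column=3 -F "," '{n=column; printf "%s", $column; for (i=1; i<NF; i++){if (i!=n) printf ",%s", $i} print "," $NF}' file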

How to print and store specific named columns from csv file with new row numbers

I'll start by saying I'm very new to using bash and any sort of script writing in general.
I have a csv file that has basic column headers and values underneath which looks something like this as an example:
a b c d
3 3 34 4
2 5 4 94
4 5 8 3
9 8 5 7
Is there a way to extract only the numerical values from a specific column and add a row number to each? The first data row (starting from 1 after the column header) would be numbered 1, then 2, then 3, and so on. For example, for column b the output would be:
1 3
2 5
3 5
4 8
I would like to be able to do this for various different named column headers.
Any help would be appreciated,
Chris
Like this? Using awk:
$ awk 'NR>1{print NR-1, $2}' file
1 3
2 5
3 5
4 8
Explained:
$ awk ' # using awk for the job
NR>1 { # for the records or rows after the first
print NR-1, $2 # output record number minus one and the second field or column
}' file # state the file
As for doing this for various different named column headers: with awk you don't specify the column by its header name but by its number, i.e. not b but $2.
awk 'NR>1 {print i=1+i, $2}' file
NR>1 skips the first line, in your case the header.
print outputs the expressions that follow it.
i=1+i increments i and prints the result; i starts at 0, so the first row prints 1, the next 2, and so on.
$2 prints the second column.
file is the path to your file.
If you have a simple whitespace-delimited file (as in your example), awk is the best tool for the job. To select the column by name in awk you can do something like:
$ awk -v col="b" 'FNR==1 { for (i=1;i<=NF;i++) if ($i==col) x=i; next }
{print FNR-1 OFS $x}' file
1 3
2 5
3 5
4 8
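To also store the output in a file (the title asks to "print and store"), redirect it; a sketch where the output filename is illustrative, with an optional guard that quits if the header is not found:
$ awk -v col="b" 'FNR==1 { for (i=1;i<=NF;i++) if ($i==col) x=i; if (!x) exit 1; next }
  {print FNR-1 OFS $x}' file > column_b.txt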

Count how many occurrences in a line are greater than or equal to a defined value

I have a file (F1) with N=10000 lines, where each line contains M=20000 numbers. I have another file (F2) with N=10000 lines and only one column. How can I count how many numbers in line i of file F1 are greater than or equal to the number found at line i of file F2? I tried using a bash loop with awk/sed, but my output is empty.
Edit:
For now I've only managed to print the number of occurrences that are higher than a fixed value. Here is an example with a 3-line file and a fixed value of 15 (sorry, it's very dirty code):
for i in {1..3};do sed -n "$i"p tmp.txt | sed 's/\t/\n/g' | awk '{if($1 > 15){print $1}}' | wc -l; done;
Thanks in advance,
awk 'FNR==NR {a[FNR]=$1; next}
     {count=0
      for (i=1; i<=NF; i++)
          if ($i >= a[FNR]) count++
      print count}' file2 file1
While processing file2 (the first file read, where FNR==NR), store each value in array a, indexed by the current record number.
For each line of file1, initialize count to 0.
Loop through the fields, incrementing the counter whenever a value is greater than or equal to a[FNR].
Print the count.
$ cat file1
1 3 5 7 3 6
2 5 6 8 7 7
4 6 7 8 9 4
$ cat file2
6
3
1
$ awk -f file.awk file2 file1   # the script above, saved as file.awk
2
5
6
You could do it as a single one-liner (note the >= to match the "greater than or equal" requirement):
awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1
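At N=10000 the array is cheap, but if you would rather not hold file2 in memory, a sketch that reads the two files in lockstep with getline instead (the filename is hardcoded, and both files are assumed to have the same number of lines):
awk '{ getline t < "file2"            # read the matching threshold line of file2
       c = 0
       for (i=1; i<=NF; i++) c += ($i >= t)
       print c }' file1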

Exclude a defined pattern using awk

I have a file with two columns and want to print the first column only if a given pattern is not found in the second column. The file can be, for example:
3 0.
5 0.
4 1.
3 1.
10 0.
and I want to print the values in the first column only if the number 1. is not in the second column, i.e.
3
5
10
I know that to print the first column I can use
awk '{print $1}' fileInput >> fileOutput
Is it possible to have an if block somewhere?
In general, you just need to indicate what pattern you don't want to match:
awk '! /pattern/' file
In this specific case, where you want to print the 1st column of lines where the 2nd column is not "1.", you can say:
$ awk '$2 != "1." {print $1}' file
3
5
10
When the condition is met, {print $1} is performed, so you get the first column of the file.
In this special case, because "1." evaluates to the number 1 (true) and "0." to 0 (false), you can do:
awk '!$2 { print $1 }' file
3
5
10
The part before the { } is the condition under which the commands are executed. In this case, !$2 is true when column 2 is false (i.e. numerically zero).
Edit: this remains the case even with the trailing dot. In fact, all three of these solutions work:
bash-4.2$ cat file
3 0.
5 0.
4 1.
3 1.
10 0.
bash-4.2$ awk '!$2 { print $1 }' file # treat column 2 as a boolean
3
5
10
bash-4.2$ awk '$2 != "1." {print $1}' file # treat column 2 as a string
3
5
10
bash-4.2$ awk '$2 != 1 {print $1}' file # treat column 2 as a number
3
5
10
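And to store the result in a file, redirect as in your original command (> overwrites, >> appends):
awk '$2 != "1." {print $1}' fileInput > fileOutput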
