How do I grep a list in multiple columns? - bash

In bash I use grep -w -f list1.txt list2.txt to search for list1's names in list2; they are one-column files.
Now I need to search for list1's names in a multi-column file (a matrix, tab-delimited or CSV). How do I get each name and the header of the column it appears in?
List1 is:
SERPINA3
ADRA1D
BDNF
ADSS
Matrix is:
CLUST1 CLUST2 CLUST3
AAMP A1BG ACACB
ACADSB A2M ADRA1D
ACO1 SERPINA3 AK4
ACP5 ACADM ALDH1A3
PLIN2 ACR AMD1
ADORA2B ACO2 ARSB
ADSL ALAS1 BDNF
ADSS ALB OSGIN2
output should be
SERPINA3 CLUST2
ADRA1D CLUST2
BDNF CLUST3
ADSS CLUST1
Thanks.

awk to the rescue!
$ awk 'NR==FNR{a[$0];next}
FNR==1{split($0,h);next}
{for(i=1;i<=NF;i++) if($i in a) print $i, h[i]}' file{1,2}
ADRA1D CLUST3
SERPINA3 CLUST2
BDNF CLUST3
ADSS CLUST1
Note that you lose the order of file1; there are other ways to handle that if the order matters (see the sketch after the explanation below).
Explanation
NR==FNR{a[$0];next} store the records of the first file as keys in array a and skip the remaining actions while the first file is being read
FNR==1{split($0,h);next} now we know we are in the second file; split its header (the first row) into array h so column names can be looked up by field number, then skip to the next line
for(i=1;i<=NF;i++) main loop for the second file: for each record (line), iterate over all fields
if($i in a) if the field is present in array a (that is, it was listed in the first file)
print $i, h[i] print the field and the column name (indexed by the field number)
file{1,2} shorthand for file1 file2; in your case it will be List1 Matrix
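If keeping List1's order matters, one way (just a sketch, building a name-to-column map for the whole matrix and printing in the END block) is:
$ awk 'NR==FNR{order[++n]=$0; next}
       FNR==1{split($0,h); next}
       {for(i=1;i<=NF;i++) col[$i]=h[i]}
       END{for(j=1;j<=n;j++) if(order[j] in col) print order[j], col[order[j]]}' List1 Matrix
SERPINA3 CLUST2
ADRA1D CLUST3
BDNF CLUST3
ADSS CLUST1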

Related

How do I match columns in a small file to a larger file and do calculations using awk

I have this small file small.csv:
STATE|STAGE|SUBCAT_ID|TOTAL TIMING|FAMA_COEFF_TIMING|DB_IMPORT_TIMING|COMMENT
SUCCEEDED|fe|L3-002559|110|7|15
SUCCEEDED|staging|L3-002241|46||24
And this bigger file big.csv:
STATE|STAGE|SUBCAT_ID|TOTAL TIMING|FAMA_COEFF_TIMING|DB_IMPORT_TIMING|COMMENT
SUCCEEDED|fe|L3-004082|16|0|8
SUCCEEDED|staging|L3-002730|85||57
SUCCEEDED|staging|L3-002722|83||56
SUCCEEDED|fe|L3-002559|100|7|15
I need a command (probably awk) that will loop over small.csv, check whether the 1st, 2nd and 3rd columns match a record in big.csv, and then compute the difference small minus big of the 4th column. So in the example above, since the 1st record's first 3 columns match the 4th record in big.csv, the output would be:
SUCCEEDED|fe|L3-002559|10
where 10 is 110-100
Thank you
Assuming that lines with the same first three fields do not occur more than twice across the two files taken together, this works:
awk -F '|' 'FNR!=1 { key = $1 "|" $2 "|" $3; if(a[key]) print key "|" a[key]-$4; else a[key]=$4 }' small.csv big.csv
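For readability, here is the same idea spread over lines with comments; the key in a test is a small defensive tweak (not in the one-liner above) so that a TOTAL TIMING of 0 or an empty field in small.csv is still treated as already stored:
awk -F'|' 'FNR != 1 {                  # skip each file's header line
    key = $1 FS $2 FS $3               # STATE|STAGE|SUBCAT_ID
    if (key in a)                      # key already seen (from small.csv): print the difference
        print key "|" a[key] - $4
    else                               # first sighting: remember the small.csv TOTAL TIMING
        a[key] = $4
}' small.csv big.csv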

bash: output contents of diff function into 2 columns

I have a file that looks like so:
file1.txt
rs13339951:45007956:T:C 45007956
rs2838331 45026728
rs5647 12335
rs4687576 5353566
file2.txt
rs13339951 45007956
rs2838331 45026728
rs5647 12335
rs4687576:ATCFHF 5353566
More descriptions:
Some of the values in column1 are identical between the 2 files but not all
The values in column2 are all identical between the 2 files
I want to identify the rows for which the value in column1 differs between the 2 files, i.e. rows 1 and 4 in my example. I can do this with diff file1.txt file2.txt.
However, I would like to obtain an end file like the one below, because I aim to use sed to replace the names of one file in the other so that both files match completely.
rs13339951:45007956:T:C rs13339951
rs4687576 rs4687576:ATCFHF
awk is perfect for this
awk 'FNR==NR {a[$2]=$1; next} a[$2]!=$1 {print a[$2] " " $1}' file1 file2
outputs
rs13339951:45007956:T:C rs13339951
rs4687576 rs4687576:ATCFHF
We are passing two files to awk. It will pass over them consecutively.
FNR==NR {.... next} { ... }
With this "trick" the first action is executed for the first file and the second action is executed for the second file.
a[$2]=$1
A key-value lookup table: the second column is the key, the first column is the value. We build this lookup table while reading the first file.
a[$2]!=$1 {print a[$2] " " $1}
While iterating over the second file, compare the current first column with the value in the lookup table. If they do not match, print the desired output.
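Since the end goal is a sed replacement, the same lookup can also emit sed commands directly. A sketch (rename.sed and file2.renamed are placeholder names, and it assumes the IDs contain no characters special to sed, which is true of the rsIDs shown):
awk 'FNR==NR {a[$2]=$1; next} a[$2]!=$1 {print "s/^" $1 " /" a[$2] " /"}' file1 file2 > rename.sed
sed -f rename.sed file2 > file2.renamed   # file2 names replaced by the file1 names
This replaces the names in file2 with the ones from file1; swap $1 and a[$2] in the print to go the other way.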

search 2 fields in a file in another huge file, passing 2nd file only once

file1 has 100,000 lines. Each line has 2 fields such as:
test 12345678
test2 43213423
Another file has millions of lines. Here is an example of how the above file entries look in file2:
'99' 'databases' 'test' '12345678'
'1002' 'exchange' 'test2' '43213423'
I would like a way to grep these 2 fields from file1 so that I can find any line in file2 that contains both. The gotcha is that I would like to make only a single pass over the 2nd file for all 100,000 entries, since looping a grep is very slow: it could loop 100,000 x 10,000,000 times.
Is that at all possible?
You can do this in awk:
awk -F"['[:blank:]]+" 'NR == FNR { a[$1,$2]; next } $4 SUBSEP $5 in a' file1 file2
First set the field separator so that the quotes around the fields in the second file are consumed.
The first block applies to the first file and sets keys in the array a. The comma in the array index translates to the control character SUBSEP in the key.
Lines are printed in the second file when the third and fourth fields (with the SUBSEP in between) match one of the keys. Due to the ' at the start of the line, the first field $1 is actually an empty string, so the fields you want are $4 and $5.
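A quick check (not part of the answer) of how that field separator splits a sample line:
$ echo "'99' 'databases' 'test' '12345678'" | awk -F"['[:blank:]]+" '{print "[" $1 "] [" $3 "] [" $4 "] [" $5 "]"}'
[] [databases] [test] [12345678]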
If your fields are always quoted in the second file, then you can do this instead:
awk -v q="'" 'NR == FNR { a[q $1 q,q $2 q]; next } $3 SUBSEP $4 in a' file1 file2
This inserts the quotes into the array a, so the fields in the second file match without having to consume the quotes.
fgrep and sed method:
sed "s/\b/'/g" file1 | fgrep -f - file2
Modify a stream from file1 with sed to match the format of the second file (i.e. surround the fields with single quotes), and send the stream to standard output. The fgrep -f - reads that stream as a list of fixed strings (not regexps) and finds every matching line in file2.
Output:
'99' 'databases' 'test' '12345678'
'1002' 'exchange' 'test2' '43213423'
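To see the intermediate stream that the sed stage feeds to fgrep (with GNU sed, where \b is a word boundary):
$ sed "s/\b/'/g" file1
'test' '12345678'
'test2' '43213423'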

Taking x number of last columns from a file in awk

I have a txt file with columns such as name, position, department, etc. I need to get certain characters from each column to generate a username and password for each listing.
I'm approaching this problem by first separating the columns with:
awk 'NR>10{print $1, $2, $3, ...etc}' $filename
#This skips the header information and simply prints out each specified column.
And then moving on to grab each character that I need from their respective fields.
The problem now is that each row is not uniform. For example the listings are organized as such:
(lastName, firstName, middleInitial, position, departmentName)
Some of the entries in the text file do not have certain fields such as middle initial. So when I try to list fields 3, 4, and 5 the entries without their middle initials return:
(position, departmentName, empty/null (I'm not sure how awk handles this))
The good news is that I do not need the middle initial, so I can ignore these listings. How can I grab the last 2 columns (in this example) from the file, so that I can isolate the fields every entry has and cut the necessary characters out of them?
You can get them with $(NF-1) and $NF; they are the second-to-last and the last column.
echo "1,2,3,4" | awk -F, 'BEGIN{OFS=","}{print $(NF-1), $NF}'
NF is the number of fields. If you have 4 fields, NF-1 is 3, so $(NF-1) is the value of column 3.
Output would be
3,4
Another example, with a different number of fields per line:
sample.csv
1,2,3,4,5
a,b,c,d
Run
awk -F, 'BEGIN{OFS=","}{print $(NF-1), $NF}' sample.csv
Output:
4,5
c,d
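Applied to the original file, the same idea combines with the header skip from the question (a sketch; it assumes whitespace-separated columns, so position and department are always the last two fields whether or not a middle initial is present):
awk 'NR>10 {print $(NF-1), $NF}' "$filename"
# prints position and departmentName for every listing after the 10-line header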

Columns addition

I am trying to sum the columns (the values between the ","), grouped by date and time.
Example:
RG Data,2015/02/27,18:02:07,"0","52",50.0,5.3,44.7,5.6,100.0,0.23,0.03,0.20,6.3,4.5
RG Data,2015/02/27,18:02:07,"1","52",36.9,22.3,14.6,39.9,100.0,0.59,0.16,0.43,7.5,29.9
RG Data,2015/02/27,18:03:06,"0","52",21.2,0.7,20.5,50.0,100.0,0.08,0.00,0.08,0.0,4.2
RG Data,2015/02/27,18:03:06,"1","52",245.6,233.4,12.2,73.7,100.0,2.08,1.83,0.25,8.0,21.4
... more lines after...
Output:
RG Data,2015/02/27,18:02:07,86.9,27.6,59.3,....
RG Data,2015/02/27,18:03:06,266.8,234.1,....
where 86.9 comes from 50.0 (1st line) + 36.9 (2nd line), and so on for each column.
Code with awk:
for TIME in $(awk -F ',|/' '{print $4","$5}' FILE | sort -u); do echo -n "$TIME "; awk -F ',' "/$TIME/ {SUM += \$6} END { print SUM }" FILE; done
Many thanks for help
This awk one-liner produces something close to the desired output:
$ awk -F, '{k=$1FS$2FS$3;seen[k];for(i=6;i<=NF;++i)sum[k,i]+=$i}END{for(i in seen){printf "%s,",i;for(j=6;j<=NF;++j)printf "%s%s",sum[i,j],(j<NF?FS:RS)}}' file
RG Data,2015/02/27,18:03:06,266.8,234.1,32.7,123.7,200,2.16,1.83,0.33,8,25.6
RG Data,2015/02/27,18:02:07,86.9,27.6,59.3,45.5,200,0.82,0.19,0.63,13.8,34.4
The variable k is the key, made up of the first, second and third columns of each line, joined with the field separator FS (a comma in this case). The array seen keeps track of every key k that is encountered.
The loop goes through each field from the sixth one to the last, adding to an element of the sum array, whose key is composed of the seen key (the first three fields) and the current field number.
Once the file has been processed, loop through the seen array and print out all of the corresponding elements of the sum array.
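The same program, spread out for readability and piped through sort to restore the date/time order (a sketch; like the one-liner it uses NF in the END block, which assumes every data line has the same number of columns):
awk -F, '{
    k = $1 FS $2 FS $3            # group key: label, date, time
    seen[k]                       # remember the key
    for (i = 6; i <= NF; ++i)     # sum every numeric column from field 6 onward
        sum[k, i] += $i
}
END {
    for (k in seen) {
        printf "%s", k
        for (j = 6; j <= NF; ++j) # NF here comes from the last line read
            printf "%s%s", FS, sum[k, j]
        print ""
    }
}' file | sort -t, -k2,3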
