Pattern matches by row name with multiple columns in multiple files - bash

I have one file with a full list of gene names and three others with partial lists of gene names. I want to merge these files all into one. The partial files have different numbers of rows but all have 3000 columns, each representing a different cell. I have been trying to join these files, but when I use awk only one column is kept.
mergedAll.txt
GENE
SOX2
BRCA1
BRCA2
RHO
ultimatecontrolMed.txt
GENE CELL1 CELL2 CELL3
SOX2 30 152 2000
BRCA2 400 234 73
RHO 12 2 0
My desired output would be:
GENE CELL1 CELL2 CELL3
SOX2 30 152 2000
BRCA1 0 0 0
BRCA2 400 234 73
RHO 12 2 0
I run:
awk 'NR==FNR{k[$1];next}{b[$1]=$0;k[$1]}
END{for(x in k)
if ( x== "GENE" )
printf"%s %s\n",x,b[x]
else
printf"%s %d\n",x,b[x]
}' mergedAll.txt ultimatecontrolMed.txt > test.txt
And I get:
GENE CELL1 CELL2 CELL3
SOX2 2000
BRCA1 0
BRCA2 73
RHO 0
For some reason it keeps the last column of counts but none of the other columns, while keeping all of the cell names. I don't have any experience with awk, so this has been a major challenge for me overall, and I would love it if someone could offer a better solution.

awk to the rescue!
$ awk 'NR==FNR {a[$1]=$0; next}
{print (a[$1]?a[$1]:($1 FS 0 FS 0 FS 0))}' file2 file1 |
column -t
GENE CELL1 CELL2 CELL3
SOX2 30 152 2000
BRCA1 0 0 0
BRCA2 400 234 73
RHO 12 2 0
The final pipe to column is just for pretty printing. Note the order of the files.
To avoid hard-coding the number of columns, you can try this alternative:
$ awk 'NR==1 {for(i=2;i<=NF;i++) missing=missing FS 0}
NR==FNR {a[$1]=$0; next}
{print (a[$1]?a[$1]:($1 missing))}' file2 file1

Could you please try the following awk and let me know if it helps you.
awk 'FNR==NR{a[$0];next} ($1 in a){print;delete a[$1];next} END{for(i in a){print i,"0 0 0"}}' mergedAll.txt ultimatecontrolMed.txt

The problem is that you're printing b[x] with %d format. That's for printing a single integer, so it will ignore all the other integers in b[x]. Change
printf"%s %d\n",x,b[x]
to:
if (b[x]) {
    printf "%s\t%s\n", x, b[x]
} else {
    printf "%s", x
    for (i = 0; i < 3000; i++) printf "\t0"
    print ""
}
so that it will print the entire value. If there's no corresponding value, it will print zeroes.
Replace 3000 with the appropriate number of cells. If you don't want to hard-code it, you can get it from NF-1 when FNR == 1 && FNR != NR (the first line of the second file).
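Putting those pieces together, here is a minimal sketch of the corrected program, untested and keeping the question's structure, that reads the cell count from the header of the second file instead of hard-coding it:
awk 'NR==FNR {k[$1]; next}
     FNR==1  {ncells = NF - 1}                # counts-file header: number of CELL columns
             {b[$1] = $0; k[$1]}
     END {
         for (x in k)                         # note: for..in order is not guaranteed
             if (x in b)
                 print b[x]                   # gene (or header) present in the counts file
             else {
                 printf "%s", x               # gene missing from the counts file: pad with zeroes
                 for (i = 0; i < ncells; i++) printf " 0"
                 print ""
             }
     }' mergedAll.txt ultimatecontrolMed.txt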

join -a 1 -a 2 -e 0 -o 0 2.{2..4} mergedAll.txt ultimatecontrolMed.txt
2.{2..4} expands to a list of output fields (2.2 2.3 2.4) and can easily be adapted to any number of fields.
As you mention three input files, it would be possible to pipe the result of a first join into a second one
join .... file1 file2 | join ... file3
join needs sorted input. That may be a killer argument against this solution.
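For example, a rough sketch with GNU join (-o auto is GNU-specific, and ultimatecontrolHigh.txt is a made-up name for a second counts file), sorting everything on the GENE column first:
# sort each file on the gene column so join can work
sort -k1,1 mergedAll.txt           > genes.srt
sort -k1,1 ultimatecontrolMed.txt  > med.srt
sort -k1,1 ultimatecontrolHigh.txt > high.srt

# left-join the full gene list against each counts file, padding missing fields with 0
join -a 1 -e 0 -o auto genes.srt med.srt |
    join -a 1 -e 0 -o auto - high.srt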

Related

Cross-referencing strings from two files by line number and collecting them into a third file

I have two files that I wish to coordinate into a single file for plotting an xy-graph.
File1 contains a different x-value on each line, followed by a series of y-values on the same line. File2 contains the specific y-value that I need from File1 for each point x.
In reality, I have 50,000 lines and 50-100 columns, but here is a simplified example.
File1 appears like this:
1 15 2 3 1
2 18 4 6 5
3 19 7 8 9
4 23 10 2 11
5 25 18 17 16
column 1 is the line number.
column 2 is my x-value, sorted in ascending order.
columns 3-5 are my y-values. They aren't unique; a y on one line could match a y on a different line.
File2 appears like this:
3
5
2
18
The y on each line in File2 corresponds to a number matching one of the y's in File1 on the same line (for the first few hundred lines). After the first few hundred lines, they may not always have a match; therefore, File2 has fewer lines than File1. I would like to either ignore those rows or fill them with a 0.
Goal
The output, File3, should consist of:
15 3
18 5
19 0
23 2
25 18
or the line with
19 0
removed, whichever works better for the script. If neither option is possible, then I would also be okay with just matching the y-values line by line until there is no match, and then stopping there.
Attempts
I initially routed File2 into an array:
a=( $(grep -e '14,12|:*' File0 | cut -b 9-17) )
but then I noticed that similar questions (1, 2) on Stack Exchange used a second file, hence I routed the above grep command into File2.
These questions are slightly different, since I require specific columns from File1, but I thought I could at least use them as a starting point. The solutions to these questions:
1)
grep -Fwf File2 File1
of course reproduces the entire matching line from File1, and I'm not sure how to proceed from there; or
2)
awk 'FNR==NR {arr[$1];next} $1 in arr' File2 File1
fails entirely for me, with no error message except the general awk help response.
Is this possible to do? Thank you.
awk 'NR==FNR { arr[NR] = $1; next }
{
    for (i = 3; i <= NF; ++i) {
        if ($i == arr[n]) {
            print $2, $i
            n++
            next
        }
    }
    print $2, 0
}' n=1 file2 file1
Another awk; this one will print only the first match:
$ awk 'NR==FNR {a[$1]; next}
{f2=$2; $1=$2="";
for(k in a) if($0 FS ~ FS k FS) {print f2,k; next}}' file2 file1
15 2
18 5
23 2
25 18
The FS padding eliminates substring matches. Note the order of the files: file2 should be provided first.
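For instance, a throwaway illustration with the first sample line: looking for 5 in it, the unpadded match hits the 5 inside 15, while the padded one does not.
$ echo "15 2 3 1" | awk '{print ($0 ~ 5), (FS $0 FS ~ FS 5 FS)}'
1 0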

Subset a file by row and column numbers

We want to subset a text file on rows and columns, where the row and column numbers are read from files, excluding the header (row 1) and row names (column 1).
inputFile.txt Tab delimited text file
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
82 0 0 1 1 1 0
37 0 1 0 0 1 0
18 0 1 0 0 1 0
53 0 0 1 0 0 0
57 1 1 1 1 1 1
subsetCols.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 500K columns, and need to subset ~10K.
1,4,6
subsetRows.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 20K rows, and need to subset about ~300.
1,3,7
Current solution using cut and awk loop (Related post: Select rows using awk):
# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt
# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput
Output file: result.txt
1 4 6
3 3 3
7 7 7
Question:
This solution works fine for small files, but for bigger files (50K rows and 200K columns) it takes too long: 15 minutes plus, and still running. I think cutting the columns works fine; selecting the rows is the slow bit.
Any better way?
Real input files info:
# $fileInput:
# Rows = 20127
# Cols = 533633
# Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers
More information about the file: it contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns), to make the data more manageable (smaller) as input for other statistical software like R.
System:
$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
Update: the solution provided below by @JamesBrown was mixing the order of columns on my system, as I am using a different version of awk; my version is GNU Awk 3.1.7.
Even though in If programming languages were countries, which country would each language represent? they say that...
Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.
... whenever you see yourself piping sed, cut, grep, awk, etc, stop and say to yourself: awk can make it alone!
So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.
awk -v cols="1 4 6" -v rows="1 3 7" '
BEGIN{
split(cols,c); for (i in c) col[c[i]] # extract cols to print
split(rows,r); for (i in r) row[r[i]] # extract rows to print
}
(NR-1 in row){
for (i=2;i<=NF;i++)
(i-1) in col && line=(line ? line OFS $i : $i); # pick columns
print line; line="" # print them
}' file
With your sample file:
$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7
With your sample file, and inputs as variables, split on comma:
awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' $fileInput
I am quite sure this will be way faster. You can for example check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk over grep and others.
One for GNU awk version 4.0 or later, as the column ordering relies on for and PROCINFO["sorted_in"]. The row and col numbers are read from files:
$ awk '
BEGIN {
    PROCINFO["sorted_in"]="#ind_num_asc";
}
FILENAME==ARGV[1] {                   # process rows file
    n=split($0,t,",");
    for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] {                   # process cols file
    m=split($0,t,",");
    for(i=1;i<=m;i++) c[t[i]]
}
FILENAME==ARGV[3] && ((FNR-1) in r) { # process data file
    for(i in c)
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt
1 4 6
3 3 3
7 7 7
Some performance gain could probably come from moving the ARGV[3] processing block to the top, before the other two, and adding a next to its end.
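For illustration, a sketch of what that reordering might look like (same logic as above, untested; the data-file block comes first and ends with next so the two file tests below it are not evaluated for every data line):
$ awk '
BEGIN {
    PROCINFO["sorted_in"]="#ind_num_asc"
}
FILENAME==ARGV[3] {                   # process data file first
    if ((FNR-1) in r)
        for (i in c)
            printf "%s%s", $(i+1), (++j%m?OFS:ORS)
    next
}
FILENAME==ARGV[1] {                   # process rows file
    n=split($0,t,","); for (i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] {                   # process cols file
    m=split($0,t,","); for (i=1;i<=m;i++) c[t[i]]
}' subsetRows.txt subsetCols.txt inputFile.txt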
Not to take anything away from the two excellent answers above; just because this problem involves a large data set, I am posting a combination of the two answers to speed up the processing.
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
n = split(cols, c, /,/)
split(rows, r, /,/)
for (i in r)
row[r[i]]
}
(NR-1) in row {
for (i=1; i<=n; i++)
printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
PS: This should work with older awk versions or non-GNU awk as well.
To refine @anubhava's solution, we can get rid of searching over 10K values for each row to see whether we are on the right row, by taking advantage of the fact that the input is already sorted.
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
n = split(cols, c, /,/)
split(rows, r, /,/)
j=1;
}
(NR-1) == r[j] {
j++
for (i=1; i<=n; i++)
printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.
This should slice columns 20,000 to 30,000.
import csv

with open('foo.txt') as f:
    gwas = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    for row in gwas:
        print(row[20001:30001])

Compare two files and display the count of duplicate occurrences of a string

I have two files:
main1.txt
111
222
333
infoFile.txt
111
111
333
444
I need to compare both files and display how many times each line in file main1.txt is repeated in infoFile.txt, as an example:
111: Total 2
222: Total 0
333: Total 1
I've used grep -f main1.txt infoFile.txt | sort | uniq -c, but it removes all the strings that are not present in infoFile.txt, while I need it to display the count of these as 0.
Using awk you can do:
awk 'FNR==NR{a[$1]++; next} {print $1 ": Total", ($1 in a)?a[$1]:0}' infoFile.txt main1.txt
111: Total 2
222: Total 0
333: Total 1
How it works:
FNR==NR - Execute this block for the first file only
{a[$1]++; next} - Create an associative array a with the key as $1 and the value as an incrementing count, then skip to the next record
{...} - Execute this block for the 2nd input file only
{print $1 ": Total", ($1 in a)?a[$1]:0} - Print the first field followed by the text ": Total", then print 0 if the first field from the 2nd file doesn't exist in array a; otherwise print the count from array a.

awk compare fields from two different files

Here I have tried an awk script to compare fields from two different files.
awk 'NR == FNR {if (NF >= 4) a[$1] b[$4]; next} {for (i in a) for (j in b) if (i >= $2 && i <=$3 && j>=$2 && j<=$3 ) {print $1, $2, $3, i, j; next}}' file1 file2
Input files:
File1:
24926 17 206 25189 5.23674 5.71882 4.04165 14.99721 c
50760 17 48 50874 3.49903 4.25043 7.66602 15.41548 c
104318 15 269 104643 2.94218 5.18301 5.97225 14.09744 c
126088 17 70 126224 3.12993 5.32649 6.14936 14.60578 c
174113 16 136 174305 4.32339 2.36452 8.60971 15.29762 c
196474 14 89 196626 2.24367 5.16966 7.33723 14.75056 c
......
......
File2:
GT_004279 1 280
GT_003663 19891 20217
GT_003416 22299 23004
GT_003151 24916 25391
GT_001715 39470 39714
GT_001585 40896 41380
....
....
The output which I got is:
GT_004279 1 280 2465483 2639576
GT_003663 19891 20217 2005645 2005798
GT_003416 22299 23004 2291204 2269898
GT_003151 24916 25391 2501183 25189
GT_001715 39470 39714 3964440 3950417
......
......
The desired output should contain the lines where the 1st and 4th field values from file1 lie between the 2nd and 3rd field values from file2. For example, taking the lines given above as input files, the output must be:
GT_003151 24916 25391 24926 25189
If I guess correctly, the problem is within the if block. Could someone help rectify this problem?
Thanks
You need to make composite keys and iterate through them. When you create such composite keys, the parts are separated by the SUBSEP variable, so you just split on that and do the check.
awk '
NR==FNR{ flds[$1,$4]; next }
{
    for (key in flds) {
        split (key, fld, SUBSEP)
        if ($2<=fld[1] && $3>=fld[2])
            print $0, fld[1], fld[2]
    }
}' file1 file2
GT_003151 24916 25391 24926 25189
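If SUBSEP is new to you, here is a tiny standalone illustration of how a composite key round-trips (the values are just the two numbers from the matching line above):
$ awk 'BEGIN {
    flds[24926,25189]                 # stored as "24926" SUBSEP "25189"
    for (key in flds) {
        split(key, fld, SUBSEP)
        print fld[1], fld[2]
    }
}'
24926 25189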

Display only lines in which 1 column is equal to another, and a second column is in a range in AWK and Bash

I have two files. The first file looks like this:
1 174392
1 230402
2 4933400
3 39322
4 42390021
5 80022392
6 3818110
and so on
the second file looks like this:
chr1 23987 137011
chr1 220320 439292
chr2 220320 439292
chr2 2389328 3293292
chr3 392329 398191
chr4 421212 3292393
and so on.
I want to return the whole line from FILE2, provided that the first column in FILE1 equals the first column in FILE2 as a string match, AND the 2nd column in FILE1 is greater than column 2 in FILE2 but less than column 3 in FILE2.
So in the above example, the line
1 230402
in FILE1 and
chr1 220320 439292
in FILE2 would satisfy the conditions, because 230402 is between 220320 and 439292 and 1 would be equal to chr1 after I make the strings match; therefore that line in FILE2 would be printed.
The code I wrote was this:
#!/bin/bash
$F1="FILE1.txt"
read COL1 COL2
do
grep -w "chr$COL1" FILE2.tsv \
| awk -v C2=$COL2 '{if (C2>$1 && C2<$2); print $0}'
done < "$F1"
I have tried many variations of this. I do not care if the code is entirely in awk, entirely in bash, or a mixture.
Can anyone help?
Thank you!
Here is one way using awk:
awk '
NR==FNR {
    $1 = "chr" $1
    seq[$1,$2]++;
    next
}
{
    for(key in seq) {
        split(key, tmp, SUBSEP);
        if(tmp[1] == $1 && $2 <= tmp[2] && tmp[2] <= $3 ) {
            print $0
        }
    }
}' file1 file2
chr1 220320 439292
We read the first file into an array, using columns 1 and 2 as the key. We prepend the string "chr" to column 1 while making the key, for easy comparison later on.
When we process file 2, we iterate over our array and split the key.
We compare the first piece of our key to column 1 and check whether the second piece of the key is in the range of the second and third columns.
If it satisfies our condition, we print the line.
awk 'BEGIN {i = 0}
FNR == NR { chr[i] = "chr" $1; test[i++] = $2 }
FNR < NR {
    for (c in chr) {
        if ($1 == chr[c] && test[c] > $2 && test[c] < $3) { print }
    }
}' FILE1.txt FILE2.tsv
FNR is the line number within the current file, NR is the line number within all the input. So the first block processes the first file, collecting all the lines into arrays. The second block processes any remaining files, searching through the array of chrN values looking for a match, and comparing the other two numbers to the number from the first file.
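A quick way to see the difference is to print both counters for every line of the two files:
awk '{print FILENAME, FNR, NR}' FILE1.txt FILE2.tsv
FNR restarts at 1 when FILE2.tsv begins while NR keeps counting, which is why FNR == NR is true only while the first file is being read.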
Thanks very much!
These answers work and are very helpful.
Also at long last I realized I should have had:
awk -v C2=$COL2 '{if (C2>$2 && C2<$3) print $0}'
with the if inside the action block (and comparing against the 2nd and 3rd columns of FILE2) and I would have been fine.
At any rate, thank you very much!

Resources