I have two files that I wish to coordinate into a single file for plotting an xy-graph.
File1 contains a different x-value on each line, followed by a series of y-values on the same line. File2 contains the specific y-value that I need from File1 for each point x.
In reality, I have 50,000 lines and 50-100 columns, but here is a simplified example.
File1 appears like this:
1 15 2 3 1
2 18 4 6 5
3 19 7 8 9
4 23 10 2 11
5 25 18 17 16
column 1 is the line number.
column 2 is my x-value, sorted in ascending order.
columns 3-5 are my y-values. They aren't unique; a y on one line could match a y on a different line.
File2 appears like this:
3
5
2
18
The y on each line in File2 corresponds to a number matching one of the y's in File1 on the same line (for the first few hundred lines). After the first few hundred lines, there may not always be a match, so File2 has fewer lines than File1. I would like to either ignore these rows or fill them with a 0.
Goal
The output, File3, should consist of:
15 3
18 5
19 0
23 2
25 18
or the line with
19 0
removed, whichever works for the script. If neither option is possible, then I would also be okay with just matching the y-values line by line until there is no match, and then stopping there.
Attempts
I initially routed File2 into an array:
a=( $(grep -e '14,12|:*' File0 | cut -b 9-17) )
but then I noticed similar questions (1, 2) on Stack Exchange used a second file, hence I routed the above grep command into File2.
These questions are slightly different, since I require specific columns from File1, but I thought I could at least use them as a starting point. The solutions to these questions:
1)
grep -Fwf File2 File1
of course reproduces the entire matching line from File1, and I'm not sure how to proceed from there; or
2)
awk 'FNR==NR {arr[$1];next} $1 in arr' File2 File1
fails entirely for me, with no error message except the general awk help response.
Is this possible to do? Thank you.
awk 'NR==FNR { arr[NR] = $1; next } {
    for (i = 3; i <= NF; ++i) {
        if ($i == arr[n]) {
            print $2, $i
            n++
            next
        }
    }
    print $2, 0
}' n=1 file2 file1
Another awk; this one will print the first match only:
$ awk 'NR==FNR {a[$1]; next}
       {f2=$2; $1=$2="";
        for(k in a) if($0 FS ~ FS k FS) {print f2,k; next}}' file2 file1
15 2
18 5
23 2
25 18
I padded with FS to eliminate substring matches. Note the order of the files: file2 should be provided first.
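To see why the padding matters, here is a throwaway check of my own (it pads both sides with the default space FS, a slight simplification of the test above): a bare regex 1 also matches the 1 inside 18, while the FS-padded pattern does not.
$ echo "18 17 16" | awk '{ bare = ($0 ~ 1); padded = (FS $0 FS ~ FS 1 FS); print bare, padded }'
1 0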
We want to subset a text file on rows and columns, where the row and column numbers are read from files, excluding the header (row 1) and the row names (column 1).
inputFile.txt Tab delimited text file
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
82 0 0 1 1 1 0
37 0 1 0 0 1 0
18 0 1 0 0 1 0
53 0 0 1 0 0 0
57 1 1 1 1 1 1
subsetCols.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 500K columns, and need to subset ~10K.
1,4,6
subsetRows.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 20K rows, and need to subset about ~300.
1,3,7
Current solution using cut and awk loop (Related post: Select rows using awk):
# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt
# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput
Output file: result.txt
1 4 6
3 3 3
7 7 7
Question:
This solution works fine for small files, but for bigger files (50K rows and 200K columns) it is taking too long: 15 minutes plus, and still running. I think cutting the columns works fine; selecting the rows is the slow bit.
Any better way?
Real input files info:
# $fileInput:
# Rows = 20127
# Cols = 533633
# Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers
More information about the file: it contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns) to make the data more manageable (small) as input for other statistical software such as R.
System:
$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
Update: The solution provided below by @JamesBrown was mixing the order of the columns on my system, as I am using a different version of awk: my version is GNU Awk 3.1.7.
Even though in If programming languages were countries, which country would each language represent? they say that...
Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.
... whenever you see yourself piping sed, cut, grep, awk, etc, stop and say to yourself: awk can make it alone!
So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.
awk -v cols="1 4 6" -v rows="1 3 7" '
BEGIN{
    split(cols,c); for (i in c) col[c[i]]    # extract cols to print
    split(rows,r); for (i in r) row[r[i]]    # extract rows to print
}
(NR-1 in row){
    for (i=2;i<=NF;i++)
        (i-1) in col && line=(line ? line OFS $i : $i);  # pick columns
    print line; line=""                                  # print them
}' file
With your sample file:
$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7
With your sample file, and inputs as variables, split on comma:
awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' $fileInput
I am quite sure this will be way faster. You can for example check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk over grep and others.
One in GNU awk 4.0 or later, as the column ordering relies on for and PROCINFO["sorted_in"]. The row and column numbers are read from files:
$ awk '
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
}
FILENAME==ARGV[1] {                      # process rows file
    n=split($0,t,",");
    for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] {                      # process cols file
    m=split($0,t,",");
    for(i=1;i<=m;i++) c[t[i]]
}
FILENAME==ARGV[3] && ((FNR-1) in r) {    # process data file
    for(i in c)
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt
1 4 6
3 3 3
7 7 7
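If it helps to see what that PROCINFO setting buys you, here is a throwaway demonstration of my own (GNU awk 4.0+ assumed): without "sorted_in", the traversal order of for (i in c) is unspecified; with it, the indices come out in ascending numeric order.
$ awk 'BEGIN { c[4]; c[1]; c[6]
               PROCINFO["sorted_in"]="@ind_num_asc"
               for (i in c) printf "%s ", i
               print "" }'
1 4 6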
Some performance gain could probably come from moving the ARGV[3] processing block to the top, before blocks 1 and 2, and adding a next to its end.
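A sketch of that reordering, for what it is worth (same logic as above, just reshuffled so that the big data file is tested first; untested on the real data):
$ awk '
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc"
}
FILENAME==ARGV[3] && ((FNR-1) in r) {    # data file handled first
    for(i in c)
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
    next                                 # skip the two blocks below
}
FILENAME==ARGV[1] {                      # rows file
    n=split($0,t,",")
    for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] {                      # cols file
    m=split($0,t,",")
    for(i=1;i<=m;i++) c[t[i]]
}' subsetRows.txt subsetCols.txt inputFile.txt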
Not to take anything away from the two excellent answers; because this problem involves a large set of data, I am posting a combination of the two answers to speed up the processing.
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
    n = split(cols, c, /,/)
    split(rows, r, /,/)
    for (i in r)
        row[r[i]]
}
(NR-1) in row {
    for (i=1; i<=n; i++)
        printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
PS: This should work with older awk versions or non-GNU awk as well.
To refine @anubhava's solution, we can get rid of searching over 10k values for each row to see whether we are on the right row, by taking advantage of the fact that the input is already sorted:
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
    n = split(cols, c, /,/)
    split(rows, r, /,/)
    j=1
}
(NR-1) == r[j] {
    j++
    for (i=1; i<=n; i++)
        printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.
This should slice columns 20,000 to 30,000.
import csv
with open('foo.txt') as f:
    gwas = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    for row in gwas:
        print(row[20001:30001])
I have two files:
main1.txt
111
222
333
infoFile.txt
111
111
333
444
I need to compare both files and display how many times each line in file main1.txt is repeated in infoFile.txt, as an example:
111: Total 2
222: Total 0
333: Total 1
I've used grep -f main1.txt infoFile.txt | sort | uniq -c but it removes all the strings that are not available in infoFile.txt, while I need it to display the count of these as 0.
Using awk you can do:
awk 'FNR==NR{a[$1]++; next} {print $1 ": Total", ($1 in a)?a[$1]:0}' infoFile.txt main1.txt
111: Total 2
222: Total 0
333: Total 1
How it works:
FNR==NR - Execute this block for the first file only
{a[$1]++; next} - Create an associative array a with key $1 and the value as an incrementing count, then skip to the next record
{...} - Execute this block for the 2nd input file
{print $1 ": Total", ($1 in a)?a[$1]:0} - Print the first field followed by the text ": Total", then print 0 if the first field from the 2nd file doesn't exist in array a; otherwise print the count from array a.
Here I have tried an awk script to compare fields from two different files.
awk 'NR == FNR {if (NF >= 4) a[$1] b[$4]; next} {for (i in a) for (j in b) if (i >= $2 && i <=$3 && j>=$2 && j<=$3 ) {print $1, $2, $3, i, j; next}}' file1 file2
Input files:
File1:
24926 17 206 25189 5.23674 5.71882 4.04165 14.99721 c
50760 17 48 50874 3.49903 4.25043 7.66602 15.41548 c
104318 15 269 104643 2.94218 5.18301 5.97225 14.09744 c
126088 17 70 126224 3.12993 5.32649 6.14936 14.60578 c
174113 16 136 174305 4.32339 2.36452 8.60971 15.29762 c
196474 14 89 196626 2.24367 5.16966 7.33723 14.75056 c
......
......
File2:
GT_004279 1 280
GT_003663 19891 20217
GT_003416 22299 23004
GT_003151 24916 25391
GT_001715 39470 39714
GT_001585 40896 41380
....
....
The output which I got is:
GT_004279 1 280 2465483 2639576
GT_003663 19891 20217 2005645 2005798
GT_003416 22299 23004 2291204 2269898
GT_003151 24916 25391 2501183 25189
GT_001715 39470 39714 3964440 3950417
......
......
The desired output should be the lines where the 1st and 4th field values from file1 lie between the 2nd and 3rd field values from file2. For example, if I take the above given lines as input files, the output must be:
GT_003151 24916 25391 24926 25189
If I guess correctly, the problem is within the if condition. Could someone help to rectify this problem?
Thanks
You need to make composite keys and iterate through them. When you create such composite keys, the parts are separated by the SUBSEP variable, so you just split on that and do the check.
awk '
NR==FNR { flds[$1,$4]; next }
{
    for (key in flds) {
        split (key, fld, SUBSEP)
        if ($2<=fld[1] && $3>=fld[2])
            print $0, fld[1], fld[2]
    }
}' file1 file2
GT_003151 24916 25391 24926 25189
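To make the SUBSEP mechanics concrete, here is a small throwaway demonstration (default SUBSEP assumed, the control character "\034"): a key built as (a,b) is stored as the single string a SUBSEP b, which split() takes apart again.
$ awk 'BEGIN { k["24926", "25189"]
               for (key in k) {
                   split(key, part, SUBSEP)
                   print part[1], part[2]
               } }'
24926 25189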
I have two files. The first file looks like this:
1 174392
1 230402
2 4933400
3 39322
4 42390021
5 80022392
6 3818110
and so on
the second file looks like this:
chr1 23987 137011
chr1 220320 439292
chr2 220320 439292
chr2 2389328 3293292
chr3 392329 398191
chr4 421212 3292393
and so on.
I want to return the whole line, provided that the first column in FILE1 equals the first column in FILE2 (as a string match, once "chr" is accounted for) AND the 2nd column in FILE1 is greater than column 2 in FILE2 but less than column 3 in FILE2.
So in the above example, the line
1 230402
in FILE1 and
chr1 220320 439292
in FILE2 would satisfy the conditions because 230402 is between 220320 and 439292 and 1 would be equal to chr1 after I make the strings match, therefore that line in FILE2 would be printed.
The code I wrote was this:
#!/bin/bash
$F1="FILE1.txt"
read COL1 COL2
do
grep -w "chr$COL1" FILE2.tsv \
| awk -v C2=$COL2 '{if (C2>$1 && C2<$2); print $0}'
done < "$F1"
I have tried many variations of this. I do not care if the code is entirely in awk, entirely in bash, or a mixture.
Can anyone help?
Thank you!
Here is one way using awk:
awk '
NR==FNR {
    $1 = "chr" $1
    seq[$1,$2]++;
    next
}
{
    for(key in seq) {
        split(key, tmp, SUBSEP);
        if(tmp[1] == $1 && $2 <= tmp[2] && tmp[2] <= $3 ) {
            print $0
        }
    }
}' file1 file2
chr1 220320 439292
We read the first file into an array using columns 1 and 2 as the key. We prepend the string "chr" to column 1 while making it part of the key, for easy comparison later on.
When we process file 2, we iterate over our array and split the key.
We compare the first piece of our key to column 1 and check whether the second piece of the key is in the range of the second and third columns.
If it satisfies our condition, we print the line.
awk 'BEGIN {i = 0}
     FNR == NR { chr[i] = "chr" $1; test[i++] = $2 }
     FNR < NR  { for (c in chr) {
                     if ($1 == chr[c] && test[c] > $2 && test[c] < $3) { print }
                 }
               }' FILE1.txt FILE2.tsv
FNR is the line number within the current file, NR is the line number within all the input. So the first block processes the first file, collecting all the lines into arrays. The second block processes any remaining files, searching through the array of chrN values looking for a match, and comparing the other two numbers to the number from the first file.
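If the FNR/NR distinction is unfamiliar, a throwaway one-liner like this (any two small files will do) shows how FNR resets for each file while NR keeps counting, which is why FNR == NR is only true while reading the first file:
$ awk '{ print FILENAME, FNR, NR }' FILE1.txt FILE2.tsv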
Thanks very much!
These answers work and are very helpful.
Also, at long last, I realized I should have had:
awk -v C2=$COL2 '{if (C2>$1 && C2<$2) print $0}'
with the stray semicolon after the condition removed (so that print is governed by the if), and I would have been fine.
At any rate, thank you very much!