Mixing several files column by column in bash

I would like to merge four .txt files into a single file. The idea is not a simple concatenation, but rather an 'interleaving' of the input files: file1 provides the first three columns, and the columns of files 2-4 must then be pasted column by column in alternating order. Thus we have:
file1:
file1 <- ' AX-1 1 125
AX-2 2 456
AX-3 3 3445'
file1 <- read.table(text=file1, header=F)
write.table(file1, "file1.txt", col.names=F, row.names=F, quote=F)
file2:
file2 <- ' AX-1 AA AB AA
AX-2 AA AA AB
AX-3 BB NA AB'
file2 <- read.table(text=file2, header=F)
write.table(file2, "file2.txt", col.names=F, row.names=F, quote=F)
file3:
file3 <- ' AX-1 0.20 -0.89 0.005
AX-2 0 -0.56 -0.003
AX-3 1.2 0.002 0.005'
file3 <- read.table(text=file3, header=F)
write.table(file3, "file3.txt", col.names=F, row.names=F, quote=F)
file4:
file4 <- ' AX-1 1 0 0.56
AX-2 0 0.56 0
AX-3 1 0 0.55'
file4 <- read.table(text=file4, header=F)
write.table(file4, "file4.txt", col.names=F, row.names=F, quote=F)
Where my expected out file could be something like:
out <- 'AX-1 1 125 AA 0.2 1 AB -0.89 0 AA 0.005 0.56
AX-2 2 456 AA 0 0 AA -0.56 0.56 AB -0.003 0
AX-3 3 3445 BB 1.2 1 NA 0.002 0 AB 0.005 0.55'
out <- read.table(text=out, header=F)
write.table(out, "out.txt", col.names=F, row.names=F, quote=F)
Thus, in the out: the column 1-3 are the file1, the columns 4,7 and 10 came from file2, the columns 5,8 and 11 came from file3 and the columns 6,9 and 12 came from file4.
I have an idea how to do it in R, but my original files are too large and it will take a lot of time. I would be grateful if someone has an idea how to perform it directly in bash.

This should work:
$ join file1.txt file2.txt | join - file3.txt | join - file4.txt | awk '{printf "%s %s %s %s %s %s %s %s %s %s %s %s\n", $1, $2, $3, $4, $7, $10, $5, $8, $11, $6, $9, $12}'
AX-1 1 125 AA 0.20 1 AB -0.89 0 AA 0.005 0.56
AX-2 2 456 AA 0 0 AA -0.56 0.56 AB -0.003 0
AX-3 3 3445 BB 1.2 1 NA 0.002 0 AB 0.005 0.55
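For reference, join requires its inputs to be sorted on the join field. A self-contained sketch of the same pipeline that recreates the question's files (sorting is a no-op here since the AX-* keys are already ordered, but it makes the precondition explicit):

```shell
# Recreate the question's four files (content taken from the question)
printf 'AX-1 1 125\nAX-2 2 456\nAX-3 3 3445\n' > file1.txt
printf 'AX-1 AA AB AA\nAX-2 AA AA AB\nAX-3 BB NA AB\n' > file2.txt
printf 'AX-1 0.20 -0.89 0.005\nAX-2 0 -0.56 -0.003\nAX-3 1.2 0.002 0.005\n' > file3.txt
printf 'AX-1 1 0 0.56\nAX-2 0 0.56 0\nAX-3 1 0 0.55\n' > file4.txt

# Sort each input on its key so join's precondition holds, then reorder fields
join <(sort file1.txt) <(sort file2.txt) \
  | join - <(sort file3.txt) \
  | join - <(sort file4.txt) \
  | awk '{print $1,$2,$3,$4,$7,$10,$5,$8,$11,$6,$9,$12}'
```

After the three joins, fields 2-3 come from file1 and each later file contributes fields in blocks of three, which the awk reorders into the interleaved layout.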

Try this:
paste file1 file2 file3 file4 | awk '{ print $1 " " $2 " " $3 " " $5 " " $9 " " $13 " " $6 " " $10 " " $14 " " $7 " " $11 " " $15 }'
This works only if your files have their rows in the same order; the join approach suggested by Mauro is the better choice.
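If hardcoding fifteen field numbers gets unwieldy, the same interleaving can be generated with loop arithmetic. A sketch, assuming the question's layout (file1 = key plus 2 columns; files 2-4 = key plus 3 data columns each, same row order):

```shell
# Recreate the question's files (content taken from the question)
printf 'AX-1 1 125\nAX-2 2 456\nAX-3 3 3445\n' > file1.txt
printf 'AX-1 AA AB AA\nAX-2 AA AA AB\nAX-3 BB NA AB\n' > file2.txt
printf 'AX-1 0.20 -0.89 0.005\nAX-2 0 -0.56 -0.003\nAX-3 1.2 0.002 0.005\n' > file3.txt
printf 'AX-1 1 0 0.56\nAX-2 0 0.56 0\nAX-3 1 0 0.55\n' > file4.txt

paste file1.txt file2.txt file3.txt file4.txt |
awk '{
  out = $1 OFS $2 OFS $3              # the three file1 columns
  for (c = 0; c < 3; c++)             # each data-column position...
    for (f = 0; f < 3; f++)           # ...across files 2-4
      out = out OFS $(5 + f*4 + c)    # file2 data starts at $5; each file spans 4 fields
  print out
}'
```

The index `5 + f*4 + c` reproduces the hand-written list $5 $9 $13 $6 $10 $14 $7 $11 $15.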


How to merge two tab-separated files and predefine formatting of missing values?

I am trying to merge two unsorted tab separated files by a column of partially overlapping identifiers (gene#) with the option of predefining missing values and keeping the order of the first table.
When using paste on my two example tables missing values end up as empty space.
cat file1
c3 100 300 gene4
c1 300 400 gene1
c13 600 700 gene2
cat file2
gene1 4.2 0.001
gene4 1.05 0.5
paste file1 file2
c3 100 300 gene4 gene1 4.2 0.001
c1 300 400 gene1 gene4 1.05 0.5
c13 600 700 gene2
As you can see, the result unsurprisingly shows empty space on the non-matched line (and paste pairs lines by position, not by gene). Is there a way to keep the order of file1 and fill lines like the third as follows:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 NA 1 1
I assume one way could be to build an awk conditional construct. It would be great if you could point me in the right direction.
With awk please try the following:
awk 'FNR==NR {a[$1]=$1; b[$1]=$2; c[$1]=$3; next}
{if (!a[$4]) {a[$4]="N/A"; b[$4]=1; c[$4]=1}
printf "%s %s %s %s\n", $0, a[$4], b[$4], c[$4]}
' file2 file1
which yields:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 N/A 1 1
[Explanations]
In the 1st line, FNR==NR { command; next} is an idiom to execute the command only when reading the 1st file in the argument list ("file2" in this case). Then it creates maps (aka associative arrays) to associate values in "file2" to genes
as:
gene1 => gene1 (with array a)
gene1 => 4.2 (with array b)
gene1 => 0.001 (with array c)
gene4 => gene4 (with array a)
gene4 => 1.05 (with array b)
gene4 => 0.5 (with array c)
"file2" does not need to be sorted.
The following lines are executed only when reading the 2nd file ("file1") because these lines are skipped when reading the 1st file due to the next statement.
The line {if (!a[$4]) .. is a fallback to assign variables to default values when the associative array a[gene] is undefined (meaning the gene is not found in "file2").
The final line prints the contents of "file1" followed by the associated values via the gene.
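The idiom is easiest to see in isolation. A minimal sketch with two tiny hypothetical files (lookup.txt and main.txt are made-up names):

```shell
printf 'k1 10\nk2 20\n' > lookup.txt      # read first: populates the array
printf 'k2 x\nk1 y\nk3 z\n' > main.txt    # read second: gets annotated

# FNR==NR is true only while reading the first file: stash its values, then `next`.
# For the second file, look the key up and fall back to NA when it is missing.
awk 'FNR==NR {v[$1]=$2; next}
     {print $0, ($1 in v ? v[$1] : "NA")}' lookup.txt main.txt
```

Keys can appear in any order in either file, which is exactly why this idiom suits unsorted input.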
You can use join:
join -e NA -o '1.1 1.2 1.3 1.4 1.5 2.1 2.2 2.3' -a 1 -1 5 -2 1 <(nl -w1 -s ' ' file1 | sort -k 5) <(sort -k 1 file2) | sed 's/NA\sNA$/1 1/' | sort -n | cut -d ' ' -f 2-
-e NA — replace all missing values with NA
-o ... — output format (field is specified using <file>.<field>)
-a 1 — Keep every line from the left file
-1 5, -2 1 — Fields used to join the files
file1, file2 — The files
nl -w1 -s ' ' file1 — file1 with numbered lines
<(sort -k X fileN) — File N ready to be joined on column X
s/NA\sNA$/1 1/ — Replace every NA NA on end of line with 1 1
| sort -n | cut -d ' ' -f 2- — sort numerically and remove the first column
The example above uses spaces on output. To use tabs, append | tr ' ' '\t':
join -e NA -o '1.1 1.2 1.3 1.4 2.1 2.2 2.3' -a 1 -1 4 -2 1 file1 file2 | sed 's/NA\sNA$/1 1/' | tr ' ' '\t'
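The whole pipeline can be checked end-to-end with the question's data (a sketch; GNU join/sed assumed, and the files recreated space-separated here for readability):

```shell
# Recreate the question's files (space-separated variant)
printf 'c3 100 300 gene4\nc1 300 400 gene1\nc13 600 700 gene2\n' > file1
printf 'gene1 4.2 0.001\ngene4 1.05 0.5\n' > file2

# Number file1's lines to remember the original order, join on the gene,
# turn the trailing "NA NA" into "1 1", restore the order, drop the numbers.
join -e NA -o '1.1 1.2 1.3 1.4 1.5 2.1 2.2 2.3' -a 1 -1 5 -2 1 \
  <(nl -w1 -s ' ' file1 | sort -k 5) <(sort -k 1 file2) \
  | sed 's/NA\sNA$/1 1/' | sort -n | cut -d ' ' -f 2-
```

This reproduces the expected output, including the `gene2 NA 1 1` fallback line, while preserving file1's row order.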
The broken lines have a TAB as the last character. Fix this with
paste file1 file2 | sed 's/\t$/\tNA\t1\t1/g'

How does a loop work in awk, and how do we get matched data from two files?

I am trying to extract data from two files with a common column but I am unable to fetch the required data.
File1
A B C D E F G
Dec 3 abc 10 2B 21 OK
Dec 1 %xyZ 09 3F 09 NOK
Dec 5 mnp 89 R5 11 OK
File2
H I J K
abc 10 6.3 A9
xyz 00 0.2 2F
pqr 45 6.9 3c
I am able to output columns A B C D E F G, but I am unable to insert the File2 columns between the File1 columns.
Trial:
awk 'FNR==1{next}
NR==FNR{a[$1]=$2; next}
{k=$3; sub(/^\%/,"",k)} k in a{print $1,$2,$3,$a[2,3,4],$4,$5,$6,$7; delete a[k]}
END{for(k in a) print k,a[k] > "unmatched"}' File2 File1 > matched
Required output:
matched:
A B I C J K D E F G
Dec 3 10 abc 6.3 A9 10 2B 21 OK
Dec 1 00 %xyZ 0.2 2F 09 3F 09 NOK
unmatched :
H I J K
pqr 45 6.9 3c
Could you please help me get this output? Thank you.
awk '
FNR == 1 { next }
FNR==NR {
As[ $3] = $0
S3 = $3
gsub( /%/, "", S3)
ALs[ tolower( S3)] = $3
next
}
{
Bs[ tolower( $1)] = $0
}
END {
print "matched:"
print "A B I C J K D E F G"
for ( B in Bs){
if ( B in ALs){
split( As[ ALs[B]] " " Bs[B], Fs)
printf( "%s %s %s %s %s %s %s %s %s %s\n", Fs[1], Fs[2], Fs[9], Fs[3], Fs[10], Fs[11], Fs[4], Fs[5], Fs[6], Fs[7])
}
}
print "unmatched :"
print "H I J K"
for ( B in Bs) if ( ! ( B in ALs)) print Bs[ B]
}
' File1 File2
I added the unstated constraint of ignoring case in the reference (%xyZ vs xyz).
Both files need to be kept in memory (arrays) so they can be processed at the END. Matching could instead be done while reading; I keep the output at the END level for clarity.
Your problem:
you mainly reference the wrong file in your code (k=$3 is used when reading File2, though that field reference belongs to File1's layout, ...)
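For comparison, a streaming sketch that matches while reading File1 instead of collecting everything for END (array and variable names are my own; headers are not re-printed in the matched output):

```shell
# File1 / File2 as in the question
printf 'A B C D E F G\nDec 3 abc 10 2B 21 OK\nDec 1 %%xyZ 09 3F 09 NOK\nDec 5 mnp 89 R5 11 OK\n' > File1
printf 'H I J K\nabc 10 6.3 A9\nxyz 00 0.2 2F\npqr 45 6.9 3c\n' > File2

awk '
FNR==1 {next}                                # skip both headers
FNR==NR {a[tolower($1)] = $0; next}          # File2: key -> whole line
{
  key = tolower($3); sub(/^%/, "", key)      # ignore case, strip leading %
  if (key in a) {
    split(a[key], f)                         # f[1..4] = the H I J K fields
    print $1, $2, f[2], $3, f[3], f[4], $4, $5, $6, $7
    delete a[key]                            # consumed: whatever is left is unmatched
  }
}
END { for (key in a) print a[key] > "unmatched" }
' File2 File1
```

Matched rows come out in File1's order; the leftover File2 lines (pqr here) land in the "unmatched" file.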

Awk - Control when my $# variables are expanded to merge two files with variable number of columns

My bash script is calling an awk script that nicely merges two files
mapfieldfile1=1
mapfieldfile2=2
awk -v FS="\t" 'BEGIN {OFS="\t"}
FNR==NR{hash1['"\$${mapfieldfile2}"']=$1 FS $3 FS $4 FS $5 FS $6;next}
('"\$${mapfieldfile1}"' in hash1){ print $0, hash1['"\$${mapfieldfile1}"']}' file2 file1
However, I want a more general version, where I don't have to hardcode the columns to print: I simply want to print everything but my id column. Replacing $1 FS $3 FS $4 FS $5 FS $6 with $0 "almost" does the job, except that it repeats the id column. I have been trying to dynamically build a string similar to $1 FS $3 FS $4 FS $5 FS $6, but I am getting the literal strings $1 $3 $4 $5 $6 in the merged file instead of their expanded values. Also, smaller side effects: I am adding a tab in the middle and losing some headers. Below are the code and example files.
I would like to find the solution to my merge and also understand what I am doing wrong and why my variables are not expanding.
I appreciate any help!
mapfieldfile1=1
mapfieldfile2=2
awk -v FS="\t" 'BEGIN {OFS="\t";strfields=""}
FNR==NR{for(i=1;i<=NF;i++) if(i!='"${mapfieldfile2}"') {strfields=strfields" "FS" $"i};
hash1['"\$${mapfieldfile2}"']=strfields;strfields="";next}
('"\$${mapfieldfile1}"' in hash1){print $0, hash1['"\$${mapfieldfile1}"']}' file2 file1
$cat file1
sampleid s1 s2 s3 s4
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
$cat file2
a0 sampleid a1 a2 a3 a4
a0 1 a a a a4
a0 2 b b b a4
a0 3 c c c a4
a0 5 e e e a4
$cat first_code_result.txt (good one!)
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
$cat second_code_result.txt
sampleid s1 s2 s3 s4 $1 $3 $4 $5 $6
1 1 1 1 1 $1 $3 $4 $5 $6
2 2 2 2 2 $1 $3 $4 $5 $6
3 3 3 3 3 $1 $3 $4 $5 $6
Try this (untested):
awk -v mf1="$mapfieldfile1" -v mf2="$mapfieldfile2" '
BEGIN {FS=OFS="\t"}
FNR==NR{sub(/\t[^\t]+/,""); hash1[$mf2]=$0; next}
($mf1 in hash1){ print $0, hash1[$mf1]}
' file2 file1
Don't let shell variables expand within awk scripts; use a regexp to remove fields from the record. I don't know why the script you haven't shown us is printing literal $3, etc., but you must be including them in a string; you'd have to post that script to get help debugging it.
Check where mf1 vs. mf2 should appear; I got confused reading your scripts.
EDIT - I had to tweak it as above I was deleting $2 before using it:
$ awk -v mf1="1" -v mf2="2" '
BEGIN {FS=OFS="\t"}
FNR==NR{key=$mf2; sub(/\t[^\t]+/,""); hash1[key]=$0; next}
($mf1 in hash1){ print $0, hash1[$mf1]}
' file2 file1
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
Note that the sub() above relies on the key field being $2 and FS being a tab. If you need a more general solution let us know.
Here's a version that'll do what you want for any key field values and will work in any awk, it just requires the FS to be a tab or some other fixed string (i.e. not a regexp):
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
key = $mf2
val = ""
nf = 0
for (i=1; i<=NF; i++) {
if (i != mf2) {
val = (nf++ ? val FS : "") $i
}
}
hash1[key] = val
next
}
$mf1 in hash1 { print $0, hash1[$mf1] }
$ awk -v mf1="1" -v mf2="2" -f tst.awk file2 file1
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
If your files are already sorted, the default output of join is what you want:
$ join -t$'\t' -11 -22 file1 file2
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
or, after prettying with column
$ join -t$'\t' -11 -22 file1 file2 | column -t
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4

awk condition always TRUE in a loop [duplicate]

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
Good morning,
I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following:
I have these two files:
head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095
and
head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
I would like to use a bash script to remove all lines from the input file in which the BP column lies within an interval specified in the interval file and in which the CHR column matches the first column of the interval file.
Here is the code I've been working with (although a simpler solution would be welcomed):
while read interval; do
chr=$(echo $interval | awk '{print $1}')
START=$(echo $interval | awk '{print $2}')
STOP=$(echo $interval | awk '{print $3}')
awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
mv tmp <input file>
done < <interval file>
My problem is that no lines are removed from the input file, even though the command
awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l
returns >4000 lines, so the lines clearly are in the input file.
Thank you very much for your help.
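As the duplicate link explains, the reason no lines are removed is that $chr, $START and $STOP sit inside awk's single quotes, so awk treats them as its own (empty) variables rather than the shell's values. A minimal sketch of the usual fix, passing them with -v (cut-down input_file and one hypothetical interval):

```shell
# Cut-down input (same layout as the question's file)
cat > input_file <<'EOF'
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:113000000 1 113000000 T C 1.17 0.01
EOF

# One interval; in the real loop these would come from `read`
chr=1; START=112667912; STOP=114334946

# -v hands the shell values to awk as genuine awk variables;
# keep the header and every row outside the interval or on another chromosome
awk -v chr="$chr" -v start="$START" -v stop="$STOP" \
    'NR==1 || $2 != chr || $3 < start || $3 > stop' input_file
```

The second data row falls inside the interval on chromosome 1 and is dropped; everything else is printed.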
You can try perl instead of awk. The reason is that in perl you can build a hash of arrays to hold the data of the interval file and look it up more easily while processing your input, like:
perl -lane '
$. == 1 && next;
@F == 3 && do {
push @{$h{$F[0]}}, [@F[1..2]];
next;
};
@F == 7 && do {
$ok = 1;
if (exists $h{$F[1]}) {
for (@{$h{$F[1]}}) {
if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
$ok = 0;
last;
}
}
}
printf qq|%s\n|, $_ if $ok;
};
' interval input
$. skips the header of the interval file. @F checks the number of columns, and the push builds the hash of arrays.
Your test data is not ideal because no line would be filtered out, so I changed it to:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
So you can run it and get as result:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097

Compare two file columns (unsorted files)

Input File 1
A1 123 AA
B1 123 BB
C2 44 CC1
D1 12 DD1
E1 11 EE1
Input File 2
A sad21 1
DD1 124f2 2
CC 123tges 3
BB 124sdf 4
AA 1asrf 5
Output File
A1 123 AA 1asrf 5
B1 123 BB 124sdf 4
D1 12 DD1 124f2 2
Making of Output file
We check 3rd column of Input File 1 and 1st Col of Input File 2.
If they match , we print it in Output file.
Note :
The files are not sorted
I tried :
join -t, A B | awk -F "\t" 'BEGIN{OFS="\t"} {if ($3==$4) print $1,$2,$3,$4,$6}'
But this does not work, as the files are unsorted, so the condition ($3==$4) won't hold all the time. Please help.
nawk 'FNR==NR{a[$3]=$0;next}{if($1 in a){p=$1;$1="";print a[p],$0}}' file1 file2
tested below:
> cat file1
A1 123 AA
B1 123 BB
C2 44 CC1
D1 12 DD1
E1 11 EE1
> cat file2
A sad21 1
DD1 124f2 2
CC 123tges 3
BB 124sdf 4
AA 1asrf 5
> awk 'FNR==NR{a[$3]=$0;next}{if($1 in a){p=$1;$1="";print a[p],$0}}' file1 file2
D1 12 DD1 124f2 2
B1 123 BB 124sdf 4
A1 123 AA 1asrf 5
>
You can use join, but you need to sort on the key field first and tell join that the key in the first file is column 3 (-1 3):
join -1 3 <(sort -k 3,3 file1) <(sort file2)
Will get you the correct fields, output (with column -t for output formatting):
AA A1 123 1asrf 5
BB B1 123 124sdf 4
DD1 D1 12 124f2 2
To get the same column ordering listed in the question, you need to specify the output format:
join -1 3 -o 1.1,1.2,1.3,2.2,2.3 <(sort -k 3,3 file1) <(sort file2)
i.e. file 1 fields 1 through 3 then file 2 fields 2 and 3. Output (again with column -t):
A1 123 AA 1asrf 5
B1 123 BB 124sdf 4
D1 12 DD1 124f2 2
perl -F'/\t/' -anle 'BEGIN{$f=1}if($f==1){$H{$F[2]}=$_;$f++ if eof}else{$l=$H{$F[0]};print join("\t",$l,@F[1..$#F]) if defined$l}' f1.txt f2.txt
or shorter
perl -F'/\t/' -anle'$f?($l=$H{$F[0]})&&print(join"\t",$l,@F[1..$#F]):($H{$F[2]}=$_);eof&&$f++' f1.txt f2.txt
One way using awk:
awk 'BEGIN { FS=OFS="\t" } FNR==NR { array[$1]=$2 OFS $3; next } { if ($3 in array) print $0, array[$3] }' file2.txt file1.txt
Results:
A1 123 AA 1asrf 5
B1 123 BB 124sdf 4
D1 12 DD1 124f2 2
This might work for you (GNU sed):
sed 's|\(\S*\)\(.*\)|/\\s\1$/s/$/\2/p|' file2 | sed -nf - file1
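To see what this does: the first sed compiles each file2 line into a sed command of the form /\sKEY$/s/$/ REST/p, which the second sed (-n -f -) then runs against file1. A sketch with a cut-down version of the question's files (GNU sed assumed, for \s and \S):

```shell
# Cut-down versions of the question's files
printf 'A1 123 AA\nB1 123 BB\nC2 44 CC1\n' > file1
printf 'AA 1asrf 5\nBB 124sdf 4\n' > file2

# Show the sed script that file2 is compiled into
sed 's|\(\S*\)\(.*\)|/\\s\1$/s/$/\2/p|' file2
# Then apply it to file1 (-n plus the trailing p prints only matched lines)
sed 's|\(\S*\)\(.*\)|/\\s\1$/s/$/\2/p|' file2 | sed -nf - file1
```

For example, `AA 1asrf 5` becomes `/\sAA$/s/$/ 1asrf 5/p`: on any file1 line ending in AA, append the rest of the file2 line and print it.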
