Print values from both files using awk as vlookup - bash

I have two files, file1:
1 I0626_all 0 0 1 1
2 I0627_all 0 0 2 1
3 I1137_all_published 0 0 1 1
4 I1859_all 0 0 2 1
5 I2497_all 0 0 2 1
6 I2731_all 0 0 1 1
7 I4451_all 0 0 1 1
8 I0626 0 0 1 1
9 I0627 0 0 2 1
10 I0944 0 0 2 1
and file2:
I0626_all 1 138
I0627_all 1 139
I1137_all_published 1 364
I4089 1 365
AfontovaGora2.SG 1 377
AfontovaGora3_d 1 378
In the end I want:
1 I0626_all 138
2 I0627_all 139
3 I1137_all_published 364
I tried using:
awk 'NR==FNR{a[$1]=$2;next} {b[$3]} {print $1,$2,b[$3]}' file2 file1
But it doesn't work.

You may use this awk:
awk 'NR == FNR {map[$1] = $NF; next} $2 in map {print $1, $2, map[$2]}' file2 file1
1 I0626_all 138
2 I0627_all 139
3 I1137_all_published 364
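This reads file2 first (while awk is on the first file, NR and FNR are equal), saving its last field in map keyed on the first field; then every file1 row whose 2nd field has an entry in map is printed with the looked-up value. The original attempt stored $2 of file2 (the constant 1) rather than the count, and printed from an array b that was never filled.

If you would rather keep the file1 rows that have no match in file2, a small variant prints a placeholder instead (NA is just an assumption about what you'd want there):

awk 'NR == FNR {map[$1] = $NF; next} {print $1, $2, ($2 in map ? map[$2] : "NA")}' file2 file1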

How to merge files depending on a string in a specific column

I have two files that I need to merge together based on what string they contain in a specific column.
File 1 looks like this:
1 1655 1552 189
1 1433 1552 185
1 1623 1553 175
1 691 1554 182
1 1770 1554 184
1 1923 1554 182
1 1336 1554 181
1 660 1592 179
1 743 1597 179
File 2 looks like this:
1 1552 0 0 2 -9 G A A A
1 1553 0 0 2 -9 A A G A
1 1554 0 751 2 -9 A A A A
1 1592 0 577 1 -9 G A A A
1 1597 0 749 2 -9 A A G A
1 1598 0 420 1 -9 A A A A
1 1600 0 0 1 -9 A A G G
1 1604 0 1583 1 -9 A A A A
1 1605 0 1080 2 -9 G A A A
I want to match column 3 of file 1 to column 2 of file 2, with my output looking like:
1 1655 1552 189 0 0 2 -9 G A A A
1 1433 1552 185 0 0 2 -9 G A A A
1 1623 1553 175 0 0 2 -9 A A G A
1 691 1554 182 0 751 2 -9 A A A A
1 1770 1554 184 0 751 2 -9 A A A A
1 1923 1554 182 0 751 2 -9 A A A A
1 1336 1554 181 0 751 2 -9 A A A A
1 660 1592 179 0 577 1 -9 G A A A
1 743 1597 179 0 749 2 -9 A A G A
I am not interested in keeping any lines in file 2 that are not in file 1. Thanks in advance!
Thanks to @Abelisto, I managed to figure something out 4 hours later!
sort -k 3,3 File1.txt > Pheno1.txt
awk '($2 > 0)' File2.ped > Ped1.ped
sort -k 2,2 Ped1.ped > Ped2.ped
join -1 3 -2 2 Pheno1.txt Ped2.ped > Ped3.txt
cut -d ' ' -f 1,4,5 --complement Ped3.txt > Output.ped
My real File2 actually contained negative values in the 2nd column (thankfully my real File1 didn't have any negatives), hence the use of awk to remove those rows.
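If the temporary files bother you, the same steps can be collapsed with bash process substitution; this is a sketch of the identical pipeline, not a different method:

join -1 3 -2 2 <(sort -k 3,3 File1.txt) <(awk '($2 > 0)' File2.ped | sort -k 2,2) |
    cut -d ' ' -f 1,4,5 --complement > Output.ped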
Using awk:
awk 'NR == FNR { arr[$2]=$3" "$4" "$5" "$6" "$7" "$8" "$9" "$10 }
     NR != FNR { print $1" "$2" "$3" "$4" "arr[$3] }' file2 file1
Process file2 first (NR==FNR), setting up an array called arr with the 2nd space-delimited field as the index and the 3rd to 10th fields, joined with spaces, as the value. Then, when processing the first file (NR!=FNR), print the 1st to 4th space-delimited fields followed by the contents of arr indexed by field 3.
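Hardcoding $3 through $10 ties the script to exactly ten columns. If the width may vary, a loop over the fields is a sketch of the same idea (assuming, as above, that the key is field 2 of file2 and field 3 of file1):

awk 'NR == FNR { rest = ""                     # collect fields 3..NF of file2
                 for (i = 3; i <= NF; i++) rest = rest OFS $i
                 arr[$2] = rest; next }
     $3 in arr { print $0 arr[$3] }            # append the saved tail to matching file1 rows
' file2 file1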
Since $1 seems to be a constant 1 and I have no idea about the row counts of either file (800,000 columns in file2 sounded like a lot), I'm hashing file1 instead:
$ awk '
NR==FNR {                                 # hash file1, keyed on its 3rd field
    a[$3]=a[$3] (a[$3]==""?"":ORS) $2 OFS $3 OFS $4
    next
}
($2 in a) {                               # file2 records whose 2nd field was seen in file1
    n=split(a[$2],t,ORS)                  # there may be several file1 rows per key
    for(i=1;i<=n;i++) {
        $2=t[i]                           # replace $2 with the stored fields 2-4 of file1
        print
    }
}' file1 file2
Output:
1 1655 1552 189 0 0 2 -9 G A A A
1 1433 1552 185 0 0 2 -9 G A A A
1 1623 1553 175 0 0 2 -9 A A G A
1 691 1554 182 0 751 2 -9 A A A A
1 1770 1554 184 0 751 2 -9 A A A A
1 1923 1554 182 0 751 2 -9 A A A A
1 1336 1554 181 0 751 2 -9 A A A A
1 660 1592 179 0 577 1 -9 G A A A
1 743 1597 179 0 749 2 -9 A A G A
When posting a question, please add details such as row and column counts to it. Better requirements yield better answers.

Print each pair of columns together from a matrix

I have a matrix:
$ cat ifile.txt
2 3 4 5 10 0 2 2 0 1 0 0 0 1
0 3 4 6 2 0 2 0 0 0 0 1 2 3
0 0 0 2 3 0 3 0 3 1 2 3 1 0
It has 14 columns in total, e.g. A1 B1 A2 B2 A3 B3 A4 B4 A5 B5 A6 B6 A7 B7. Each odd-numbered column corresponds to A and each even-numbered column to B.
I would like to print all A in one column and all B in one column, so my desired file looks like:
$ cat ofile.txt
2 3
0 3
0 0
4 5
4 6
0 2
10 0
2 0
3 0
2 0
0 0
0 3
....
I can do it manually in the following way, but I am looking for an easier way to do it.
for c in 1 3 5 7 9 11 13; do
    awk -v c=$c '{printf "%5s %5s\n", $c, $(c+1)}' ifile.txt > A$c.txt
done
cat A1.txt A3.txt A5.txt A7.txt A9.txt A11.txt A13.txt > ofile.txt
$ cat tst.awk
{
    for ( i=1; i<=NF; i++ ) {       # buffer the whole matrix
        a[NR,i] = $i
    }
}
END {
    for ( i=1; i<=NF; i+=2 ) {      # walk the columns two at a time
        for (j=1; j<=NR; j++ ) {    # print that pair for every row
            print a[j,i], a[j,i+1]
        }
    }
}
$ awk -f tst.awk file
2 3
0 3
0 0
4 5
4 6
0 2
10 0
2 0
3 0
2 2
2 0
3 0
0 1
0 0
3 1
0 0
0 1
2 3
0 1
2 3
1 0
If you want to generalize for more than 2 output columns:
$ cat tst.awk
BEGIN { n = (n ? n : 2) }             # group size: default 2, override with -v n=...
{
    for (i=1; i<=NF; i++) {
        a[NR,i] = $i
    }
}
END {
    for ( i=1; i<=NF; i+=n ) {        # NF and NR still hold the last line's values here
        for (j=1; j<=NR; j++) {
            for ( k=1; k<=n; k++ ) {
                printf "%s%s", a[j,i+k-1], (k<n ? OFS : ORS)
            }
        }
    }
}
$ awk -v n=2 -f tst.awk file
2 3
0 3
0 0
4 5
4 6
0 2
10 0
2 0
3 0
2 2
2 0
3 0
0 1
0 0
3 1
0 0
0 1
2 3
0 1
2 3
1 0
$ awk -v n=7 -f tst.awk file
2 3 4 5 10 0 2
0 3 4 6 2 0 2
0 0 0 2 3 0 3
2 0 1 0 0 0 1
0 0 0 0 1 2 3
0 3 1 2 3 1 0

How can I use shell awk to get this dataset?

Thank you for your interest.
The original data
land_cover_classes rows columns LandCoverDist
"1 of 18" "1 of 720" "1 of 1440" 20
"1 of 18" "1 of 720" "2 of 1440" 0
"1 of 18" "1 of 720" "3 of 1440" 0
"10 of 18" "1 of 720" "4 of 1440" 1
"9 of 18" "110 of 720" "500 of 1440" 0
"1 of 18" "1 of 720" "6 of 1440" 354
"1 of 18" "1 of 720" "7 of 1440" 0
"1 of 18" "1 of 720" "8 of 1440" 0
"1 of 18" "720 of 720" "1440 of 1440" 0
And the expected output should be:
land_cover_classes rows columns LandCoverDist
1 1 1 20
......
9 110 500 0
1 1 6 354
......
1 720 1440 0
$ awk -F'["[:space:]]+' 'NR>1{$0 = $2 OFS $5 OFS $8 OFS $11} 1' file
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
$ awk -F'["[:space:]]+' 'NR>1{$0 = $2 OFS $5 OFS $8 OFS $11} 1' file | column -t
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
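The trick is the field separator: ["[:space:]]+ treats every run of double quotes and/or whitespace as a single delimiter, so a data line splits into an empty $1 (the line starts with a quote) followed by 1, of, 18, 1, of, 720, and so on; fields 2, 5, 8 and 11 are exactly the numbers we want. You can watch the splitting with a throwaway one-liner:

awk -F'["[:space:]]+' 'NR==2 { for (i=1; i<=NF; i++) printf "$%d=<%s> ", i, $i; print "" }' file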
Awk solution:
awk 'BEGIN{ FS="\"[[:space:]]+"; OFS="\t" }    # fields are separated by a quote plus whitespace
function get_num(n){
    gsub(/^"| of.*/,"",n);    # strip the leading quote and everything from " of" onward
    return n
}
NR==1; NR>1{ print get_num($1), get_num($2), get_num($3), $4 }' file
The output:
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
$ awk '
BEGIN { FS="\" *\"?" }                         # field separator: quote, spaces, optional quote
NR==1                                          # print header
{
    for(i=2;i<=NF;i++) {                       # starting from the second field
        split($i,a," of ")                     # split at " of "
        printf "%s%s", a[1], (i==NF?ORS:OFS)   # print the first part and a separator
    }
}' file
land_cover_classes rows columns LandCoverDist
1 1 1 20
1 1 2 0
1 1 3 0
10 1 4 1
9 110 500 0
1 1 6 354
1 1 7 0
1 1 8 0
1 720 1440 0
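For completeness, the same cleanup works without changing the field separator at all: two gsub calls first drop the " of N" tails and then the leftover quotes. This is a sketch that assumes the values themselves never contain quotes or the word "of":

awk 'NR==1; NR>1 { gsub(/ of [0-9]+"/, ""); gsub(/"/, ""); print }' file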

How to use this awk command without affecting the header

Good night. I have these two files:
File 1, with phenotype information; the first column holds the IDs, and the original file has 400 rows:
ID a b c d
215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745
File 2, with SNP information; the original file has 400 lines and 42,000 characters per line:
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
217 2 0 2 1
218 0 2 0 2
And I need to remove from file 2 the individuals that do not appear in file 1, for example:
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
I used this code:
awk 'NR==FNR{a[$1]; next}$1 in a{print $0}' file2 file1 > file3
and I get this output (file 3):
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
but I lose the header. How do I keep it?
To keep the header of the second file, add a condition{action} like this:
awk 'NR==FNR {a[$1]; next}
FNR==1 {print $0; next} # <= this will print the header of file2.
$1 in a {print $0}' file1 file2
NR holds the total record number, while FNR is the per-file record number: it counts the records of the file currently being processed. The next statements are also important, so that awk continues with the next record and doesn't try the rest of the actions on it.
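A quick way to see the difference is to print both counters while awk reads your two files:

awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' file1 file2

NR keeps climbing across both files while FNR restarts at 1 when file2 begins, which is why NR==FNR is true only for the first file.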

How to insert whitespace between characters of words in a specific field in a file

I have a file containing 100000 lines like this
1 0110100010010101
2 1000010010111001
3 1000011001111000
10 1011110000111110
123 0001000000100001
I would like to know how I can efficiently display just the second field, adding whitespace between its characters.
0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1
1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1
1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0
1 0 1 1 1 1 0 0 0 0 1 1 1 1 1 0
0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1
One solution would be to get the second column with awk and then add the whitespace using sed. But as the file is very long, I would like to avoid pipes, so I'm wondering if I can do it using just awk.
Thanks in advance
Is this OK?
awk '{gsub(/./,"& ",$2);print $2}' yourFile
Example:
kent$ echo "1 0110100010010101
2 1000010010111001
3 1000011001111000"|awk '{gsub(/./,"& ",$2);print $2}'
0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1
1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1
1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0
Update
More than 2 digits in the 1st column won't work? I didn't get that; it works fine:
kent$ echo "133 0110100010010101
233 1000010010111001
333 1000011001111000"|awk '{gsub(/./,"& ",$2);print $2}'
0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1
1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1
1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0
gsub(/./,"& ", $2)
1. /./ matches any single character
2. "& ": the & here means the matched string, in this case each character
3. $2 is column 2
So it means: replace each character in the 2nd column with the character itself plus " ".
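One detail worth noting: since every character, including the last one, gets a space appended, each output line ends with a trailing space. If that matters, a sub on the same field strips it (a minor variation on the command above):

awk '{gsub(/./,"& ",$2); sub(/ $/,"",$2); print $2}' yourFile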
One way using only awk:
awk '{ gsub( /./, "& ", $2 ); print $2; }' infile
That yields:
0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1
1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1
1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0
EDIT: Kent and I gave the same implementation, so, for this answer to be a bit more useful, I will add the sed one:
sed -e 's/^[^ ]* *//; s/./& /g' infile
Just adding a sed alternative:
sed -e 's/^.* *//;s/./& /g;s/ $//' file
Three commands:
Remove the characters and spaces at the start of the line
Replace every character with itself followed by a space
(Optional) Remove the trailing space at the end of the line
A sed solution:
sed 's/.* //;s/\(.\)/\1 /g'
It adds an extra space at the end of each line. Add ;s/ $// to the expression to remove it.
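Putting the two together gives the complete command:

sed 's/.* //;s/\(.\)/\1 /g;s/ $//' file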
This might work for you (GNU sed):
sed 's/^\S*\s*//;s/\B/ /g' file
