I need to merge two files and create a new file - shell

input 1
1 10611 2 122 C:0.983607 G:0.0163934
input 2
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07
Here, the 1st and 2nd columns must match, the value before ':' in the 5th column of the first file must equal the 4th column of the second file, and the value before ':' in the 6th column of the first file must equal the 5th column of the second file; the output is built from rows that satisfy this match. The input and output lines should make this clear. Both files are .gz files.
output
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;

Here's one way using awk:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' input1 input2
Result:
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;
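The same script, expanded with comments (logic unchanged; just a readable sketch of what the one-liner does):
awk '
FNR == NR {                          # first file only
    split($5, a, ":")                # a[1]=allele, a[2]=frequency, e.g. C and 0.983607
    split($6, b, ":")                # b[1]=allele, b[2]=frequency, e.g. G and 0.0163934
    c[$1, $2, a[1], b[1]] = "REF=" a[2] ";ALT=" b[2] ";"
    next
}
($1, $2, $4, $5) in c {              # second file: columns 1, 2, 4, 5 must all match
    print $0 ";" c[$1, $2, $4, $5]   # append the stored REF=...;ALT=...; string
}' input1 input2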
So for compressed files, try:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' <(gzip -dc input1.gz) <(gzip -dc input2.gz) | gzip > output.gz
EDIT:
From the comments below, try this:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $1, $2, $3, $4, $5, $6, $7, c[$1,$2,$4,$5] $8 ";" }' file1 file2
Result:
1 10611 rs146752890 C G 100 PASS REF=0.983607;ALT=0.0163934;AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;

This should work (assuming you have enough disk space to store the expanded .gz files):
zcat input1.gz | awk '{print $1$2,$0}' | sort > new1
zcat input2.gz | awk '{print $1$2,$0}' | sort > new2
join new1 new2 -1 1 -2 1 -o "2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 1.6 1.7" | sed 's/ C:/;REF=/' | sed 's/ G:/;ALT=/' > output
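For reference, here is roughly what the intermediate files look like with the sample lines above: each line gets a composite chromosome+position key prepended, join matches the two files on that key, -o picks file2's columns plus the two frequency fields from file1, and the two sed calls turn " C:" / " G:" into ";REF=" / ";ALT=".
new1: 110611 1 10611 2 122 C:0.983607 G:0.0163934
new2: 110611 1 10611 rs146752890 C G 100 PASS AC=184;...;EUR_AF=0.07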

merge 2 files based on columns 1,2,3 matching in the 2 files & print specific columns from both files; also retain non-pairing rows from both files in the output

file1
rs12345 G C
rs78901 A T
file2
3 22745180 rs12345 G C,G
12 67182999 rs78901 A G,T
desired output
3 22745180 rs12345 G C
12 67182999 rs78901 A T
I tried
awk 'NR==FNR {h[$1] = $3; next} {print $1,$2,$3,h[$2]}' file1 file2
output generated
3 22745180 rs12345
What I want is to print the first 4 columns of file2 and the 3rd column of file1 as the 5th column in the output.
You may use this awk:
awk 'FNR == NR {map[$1,$2] = $3; next} ($3,$4) in map {$NF = map[$3,$4]} 1' file1 file2 | column -t
3 22745180 rs12345 G C
12 67182999 rs78901 A T
A more readable version:
awk '
FNR == NR {
    map[$1,$2] = $3
    next
}
($3,$4) in map {
    $NF = map[$3,$4]
}
1' file1 file2 | column -t
column -t is used for tabular output only. Note that the trailing 1 prints every record of file2, so rows with no match in file1 are retained unchanged, while matching rows just get their last field replaced.
In the case you presented (the rows of the two files are already aligned line by line), this will work:
paste file1 file2 | awk '{print $4,$5,$6,$7,$3}'

Extracting a part of a value in a column which begins with '+' and ends with '=' using shell script

I have a log file containing a pattern of strings I need to print.
(I am not giving the real log details, so here is a sample case.)
cat file.txt
1234 is so so from 12+3=15
1235 is so so from 123+4=16
1236 is so so from 1543+4=16
1237 is so so from 13+4=16
1237 is so so from 13+5=16
The result I am looking for is:
1234 3
1235 4
1236 4
1237 9
I have tried using
cat file.txt | grep " is so so from " | awk '{print $1,substr($6,3,1);}' | awk '{a[$1]+=$2} END {for(i in a) print i,a[i]}'
but this only works when the 6th column has a fixed layout. To make it dynamic, I am looking for a way to extract the part of the string that has "+" before it and "=" at its other end.
How about using GNU awk (the three-argument match() used below is a gawk extension) and a regular expression to extract the interesting parts?
cat file.txt | awk 'match($0, /([0-9]+)[^+]*\+([0-9]+)=.*/, a) { print a[1], a[2] }'
yields
1234 3
1235 4
1236 4
1237 4
1237 5
Edit: summing the second column where the first one is identical, as shown by #eridal:
cat file.txt | awk 'match($0, /([0-9]+)[^+]*\+([0-9]+)=.*/, a) { print a[1], a[2] }' | awk '{ a[$1] += $2 } END { for(i in a) print i, a[i] }'
yields
1234 3
1235 4
1236 4
1237 9
It's not clear what your input file is, so I'm basing my answer on this file.txt:
1234 is so so from 12+3=15
1235 is so so from 123+4=16
1236 is so so from 1543+4=16
1237 is so so from 13+4=16
1237 is so so from 13+5=16
So with such a file as input, here's how I'd target those values:
cat file.txt \
| grep -Po '^[0-9]+.*\+\d' \
| sed -E 's/^([0-9]+)[^+]+\+([0-9]+)/\1 \2/' \
| awk '{ a[$1] += $2 } END { for(i in a) print i, a[i] }'
How does it work?
grep to extract the portion we care about
sed to remove the in-between noise
awk to compute the required per-ID sums
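To illustrate, on the sample file.txt the grep stage keeps each line up to the digit after the '+':
1234 is so so from 12+3
1235 is so so from 123+4
1236 is so so from 1543+4
1237 is so so from 13+4
1237 is so so from 13+5
The sed stage then reduces each line to an ID/value pair (1234 3, 1235 4, 1236 4, 1237 4, 1237 5), and the awk stage sums the values per ID, merging the two 1237 lines into 1237 9.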
Another solution: a single line, albeit not quite as simple to follow:
cat file.txt \
| awk 'match($0, /^([0-9]+)[^+]+\+([0-9]+)/, m) { a[m[1]] += m[2] } END { for(i in a) print i, a[i] }'

copying columns from different files into a single file using awk

I have more than 500 files, each with two columns: "gene short name" and "FPKM". The number of rows is the same and the "gene short name" column is identical in all the files. I want to create a matrix that keeps the first column as the gene short name (it can be taken from any of the files) and has the FPKM values from each file as the remaining columns.
I have used this command, which works well, but how can I use it for 500 files?
paste -d' ' <(awk -F'\t' '{print $1}' 69_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 69_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 72_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 75_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 78_genes.fpkm.txt) > col.txt
sample data (files are tab separated):
head 69_genes.fpkm.txt
gene_short_name FPKM
DDX11L1 0.196141
MIR1302-2HG 0.532631
MIR1302-2 0
WASH7P 4.51437
Expected outcome
gene_short_name FPKM FPKM FPKM FPKM
DDX11L1 0.196141 0.206591 0.0201256 0.363618
MIR1302-2HG 0.532631 0.0930007 0.0775838 0
MIR1302-2 0 0 0 0
WASH7P 4.51437 3.31073 3.23326 1.05673
MIR6859-1 0 0 0 0
FAM138A 0.505155 0.121703 0.105235 0
OR4G4P 0.0536387 0 0 0
OR4G11P 0 0 0 0
OR4F5 0.0390888 0.0586067 0 0
Also, I want to change the name "FPKM" to "filename_FPKM".
Given the input
$ cat a.txt
a 1
b 2
c 3
$ cat b.txt
a I
b II
c III
$ cat c.txt
a one
b two
c three
you can loop:
cut -f1 a.txt > result.txt
for f in a.txt b.txt c.txt
do
    cut -f2 "$f" | paste result.txt - > tmp.txt
    mv {tmp,result}.txt
done
$ cat result.txt
a 1 I one
b 2 II two
c 3 III three
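For the 500 FPKM files you can iterate over a glob instead of listing every name; the pattern *_genes.fpkm.txt and the sed-based header rename below are assumptions based on the sample file names shown in the question (the rename gives the filename_FPKM headers asked for at the end of the question):
cut -f1 69_genes.fpkm.txt > result.txt            # gene_short_name column from any one file
for f in *_genes.fpkm.txt
do
    # take the FPKM column and rename its header to "<filename>_FPKM"
    cut -f2 "$f" | sed "1s/.*/${f}_FPKM/" | paste result.txt - > tmp.txt
    mv tmp.txt result.txt
done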
In awk, using #Micha's data for clarity:
$ awk '
BEGIN { FS=OFS="\t" }            # set the field separators
FNR==1 {
    $2=FILENAME "_" $2           # on first record of each file rename $2
}
NR==FNR {                        # process the first file
    a[FNR]=$0                    # hash whole record to a
    next
}
{                                # process other files
    a[FNR]=a[FNR] OFS $2         # add $2 to the end of the record
}
END {                            # in the end
    for(i=1;i<=FNR;i++)          # print all records
        print a[i]
}' a.txt b.txt c.txt
Output:
a a.txt_1 b.txt_I c.txt_one
b 2 II two
c 3 III three

Bash: extract columns with cut and filter one column further

I have a tab-separated file and want to extract a few columns with cut.
Two example lines:
(...)
0 0 1 0 AB=1,2,3;CD=4,5,6;EF=7,8,9 0 0
1 1 0 0 AB=2,1,3;CD=1,1,2;EF=5,3,4 0 1
(...)
What I want to achieve is to select columns 2, 3, 5 and 7; however, from column 5 I only want the CD=4,5,6 part.
So my expected result is:
0 1 CD=4,5,6; 0
1 0 CD=1,1,2; 1
How can I use cut for this problem and run grep on one of the extracted columns? Any other one-liner is of course also fine.
Here is another awk:
$ awk -F'\t|;' -v OFS='\t' '{print $2,$3,$6,$NF}' file
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
Or with cut/paste (note this relies on the CD= part always being the second ;-separated chunk):
$ paste <(cut -f2,3 file) <(cut -d';' -f2 file) <(cut -f7 file)
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
Easier done with awk. Split the 5th field using ; as the separator, and then print the second subfield.
awk 'BEGIN {FS="\t"; OFS="\t"}
{split($5, a, ";"); print $2, $3, a[2]";", $7 }' inputfile > outputfile
If you want to print whichever subfield begins with CD=, use a loop:
awk 'BEGIN {FS="\t"; OFS="\t"}
{
    n = split($5, a, ";")
    for (i = 1; i <= n; i++) {
        if (a[i] ~ /^CD=/) subfield = a[i]
    }
    print $2, $3, subfield";", $7
}' < inputfile > outputfile
I think awk is the best tool for this kind of task and the other two answers give you good short solutions.
I want to point out that you can use awk's built-in splitting facility to gain more flexibility when parsing input. Here is an example script that uses implicit splitting:
parse.awk
# Remember second, third and seventh columns
{
a = $2
b = $3
d = $7
}
# Split the fifth column on ";". After this the positional variables
# (e.g. $1, $2, ..., $NF) contain the fields from the previous
# fifth column
{
oldFS = FS
FS = ";"
$0 = $5
}
# For example, to test if the second element starts with "CD", do
# something like this
$2 ~ /^CD/ {
c = $2
}
# Print the selected elements
{
print a, b, c, d
}
# Restore FS
{
FS = oldFS
}
Run it like this:
awk -f parse.awk FS='\t' OFS='\t' infile
Output:
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1

using sed, awk, or sort for csv manipulation

I have a CSV file that needs a lot of manipulation, perhaps using awk and sed?
input:
"Sequence","Fat","Protein","Lactose","Other Solids","MUN","SCC","Batch Name"
1,4.29,3.3,4.69,5.6,11,75,"35361305a"
2,5.87,3.58,4.41,5.32,10.9,178,"35361305a"
3,4.01,3.75,4.75,5.66,12.2,35,"35361305a"
4,6.43,3.61,3.56,4.41,9.6,275,"35361305a"
final output:
43330075995647
59360178995344
40380035995748
64360275964436
I'm able to get through some of it going step by step.
How do I test specific columns for a value over 9.9 and replace it with 9.9?
Also, is there a way to combine any of these steps?
remove first line:
tail -n +2 test.csv > test1.txt
remove commas:
sed 's/,/ /g' test1.txt > test2.txt
remove quotes:
sed 's/"//g' test2.txt > test3.txt
remove columns 1 and 8 and
reorder remaining columns as 1,2,6,5,4,3:
sort test3.txt | uniq -c | awk '{print $3 "\t" $4 "\t" $8 "\t" $7 "\t" $6 "\t" $5}' > test4.txt
test new columns 1,2,4,5,6 - if the value is over 9.9, replace it with 9.9
How should I do this step?
The solution for the following parts was found in a previous question - reformatting a text file:
round columns 1,2,4,5,6 to tenths
column 3 needs to be four characters long, zero-filled on the left
remove periods and spaces
awk '{$0=sprintf("%.1f%.1f%4s%.1f%.1f%.1f", $1,$2,$3,$4,$5,$6);gsub(/ /,"0");gsub(/\./,"")}1' test5.txt > test6.txt
This produces the output you want from the original file. Note that in the question you specified "column 4 round to whole number", but in the desired output you had rounded it to one decimal place instead:
awk -F'[,"]+' 'function m(x) { return x < 9.9 ? x : 9.9 }
NR > 1 {
s = sprintf("%.1f%.1f%04d%.1f%.1f%.1f", m($2),m($3),$7,m($6),m($5),m($4))
gsub(/\./, "", s)
print s
}' test.csv
I have specified the field separator as any number of commas and double quotes together, so this "parses" your CSV format for you without requiring any additional steps.
The function m returns the minimum of 9.9 and the number you pass to it.
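A quick sketch (not from the original answer) of how that field separator splits the first data row; the echo line is just the sample data repeated:
echo '1,4.29,3.3,4.69,5.6,11,75,"35361305a"' | awk -F'[,"]+' '{ print $2, $6, $7, $8 }'
4.29 11 75 35361305a
With the m() function above, m(11) would be capped to 9.9 while m(4.29) passes through unchanged.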
Output:
43330075995647
59360178995344
40380035995748
64360275964436
The first three steps in one go:
awk -F, '{gsub(/"/,"");$1=$1} NR>1' test.csv
1 4.29 3.3 4.69 5.6 11 75 35361305a
2 5.87 3.58 4.41 5.32 10.9 178 35361305a
3 4.01 3.75 4.75 5.66 12.2 35 35361305a
4 6.43 3.61 3.56 4.41 9.6 275 35361305a
tail -n +2 file | sort -u | awk -F , '
{
    $0 = $1 FS $2 FS $6 FS $5 FS $4 FS $3
    for (i = 1; i <= 6; ++i)
        if ($i > 9.9)
            $i = 9.9
    $0 = sprintf("%.1f%.1f%4s%.0f%.1f%.1f", $1, $2, $3, $4, $5, $6)
    gsub(/ /, "0"); gsub(/[.]/, "")
    print
}
'
Or
< file awk -F , '
NR > 1 {
    $0 = $1 FS $2 FS $6 FS $5 FS $4 FS $3
    for (i = 1; i <= 6; ++i)
        if ($i > 9.9)
            $i = 9.9
    $0 = sprintf("%.1f%.1f%4s%.0f%.1f%.1f", $1, $2, $3, $4, $5, $6)
    gsub(/ /, "0"); gsub(/[.]/, "")
    print
}
'
Output:
104309964733
205909954436
304009964838
406409643636
