I have a text file looking like
text_a_3 xxx yyy
- - - - - - - - - - -
text_b_2 xyx zyz
- - - - - - - - - - -
text_b_3 xxy zyy
- - - - - - - - - - -
text_a_2 foo bar
- - - - - - - - - - -
text_a_1 foo bla
- - - - - - - - - - -
text_b_1 bla bla
I want to sort this file numerically, based on the first field, so that my output would look like:
text_a_1 foo bla
- - - - - - - - - - -
text_a_2 foo bar
- - - - - - - - - - -
text_a_3 xxx yyy
- - - - - - - - - - -
text_b_1 bla bla
- - - - - - - - - - -
text_b_2 xyx zyz
- - - - - - - - - - -
text_b_3 xxy zyy
I thought sort would do the job. I thus tried
sort -n name_of_my_file
sort -k1 -n name_of_my_file
But it gives
- - - - - - - - - - -
- - - - - - - - - - -
- - - - - - - - - - -
- - - - - - - - - - -
- - - - - - - - - - -
text_a_1 foo bla
text_a_2 foo bar
text_a_3 xxx yyy
text_b_1 bla bla
text_b_2 xyx zyz
text_b_3 xxy zyy
The --field-separator option is of no help here.
Is there any way to achieve this with a one-line, sort-based command?
Or is the only solution to extract the text-containing lines, sort them, and re-insert the delimiter lines afterwards?
Using GNU awk only, relying on the internal sort function asort() with the record separator set to the dashes line:
awk -v RS='- - - - - - - - - - -\n' '
{a[++c]=$0}
END{
asort(a)
for(i=1;i<=c;i++)
printf "%s%s",a[i],(i==c?"":RS)
}' name_of_my_file
The script first loads the content of the input file into the array a. Once the whole file has been read, the array is sorted and then printed, with the same record separator re-inserted between records.
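For comparison, a minimal sketch of the same idea using gawk's PROCINFO["sorted_in"] traversal ordering instead of an explicit asort() call (assumes GNU awk 4.0 or later):
gawk -v RS='- - - - - - - - - - -\n' '
  { a[++c] = $0 }
  END {
    PROCINFO["sorted_in"] = "@val_str_asc"   # visit elements in ascending value order
    for (i in a)
      printf "%s%s", a[i], (++n == c ? "" : RS)
  }' name_of_my_file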
When the line delimiters are all on the even lines, you can use
paste -d'\r' - - < yourfile | sort -n | tr '\r' '\n'
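To see what travels through the sort, here is the intermediate pairing (a sketch; cat -v renders the \r join character as ^M). Each record moves through sort as one physical line; note that the last record has no delimiter partner, so it carries a bare ^M that becomes an empty line in the final output.
$ paste -d'\r' - - < yourfile | cat -v
text_a_3 xxx yyy^M- - - - - - - - - - -
text_b_2 xyx zyz^M- - - - - - - - - - -
text_b_3 xxy zyy^M- - - - - - - - - - -
text_a_2 foo bar^M- - - - - - - - - - -
text_a_1 foo bla^M- - - - - - - - - - -
text_b_1 bla bla^M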
I actually prefer removing the delimiters up front, sorting, and adding them back afterwards, so please reconsider your requirements:
grep -Ev "(- )*-" yourfile | sort -n | sed 's/$/\n- - - - - - - - - - -/'
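Note that this sed appends a delimiter after every line, including the last one, whereas the original input has no delimiter after the final record. If that matters, a small variant that skips the last line (the $! address is standard sed; the \n in the replacement relies on GNU sed):
grep -Ev "(- )*-" yourfile | sort -n | sed '$!s/$/\n- - - - - - - - - - -/'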
The following sort + awk combination may help you.
sort -t"_" -k2 -k3 Input_file | awk '/^-/ && !val{val=$0} !/^-/{if(prev){print prev ORS val};prev=$0} END{print prev}'
Here is the same solution in expanded form. Splitting on _, the delimiter lines have empty second and third keys, so the sort groups them together while ordering the text lines; the awk then remembers one delimiter line in val and re-inserts it between consecutive text lines:
sort -t"_" -k2 -k3 Input_file |
awk '
/^-/ && !val{
val=$0}
!/^-/{
if(prev){
print prev ORS val};
prev=$0
}
END{
print prev
}'
Hello, I have these two files:
cat file1.tab
1704 1.000000 T G
1708 1.000000 C G
1711 1.000000 G C
1712 0.989011 T A
1712 0.003564 T G
cat file2.tab
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
I'd like this output:
1704 1.000000 T G
1705 0
1706 0
1707 0
1708 1.000000 C G
1709 0
1710 0
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713 0
I was almost able to get it with this:
awk 'NR==FNR { a[$1]=$0;b[$1]=$1; next} { if ($1 == b[$1]) print a[$1]; else print $1,"0";}' file1.tab file2.tab
But I don't know how to deal with repetitions. My script does not check whether the value in column 1 of file1.tab is repeated, so it outputs only the $0 of the last line on which that value appears.
You could use something like this:
$ awk 'NR==FNR{$1=$1 in a?a[$1]:$1;$0=$0;a[$1]=$0;next}{print $1 in a?a[$1]:$1 OFS 0}' file1.tab file2.tab
1704 1.000000 T G
1705 0
1706 0
1707 0
1708 1.000000 C G
1709 0
1710 0
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713 0
Some explanation of how this works:
The block NR==FNR{$1=$1 in a?a[$1]:$1;$0=$0;a[$1]=$0;next} is executed only for the first file, where the overall record number equals the per-file record number. For each line of the first file, we replace the first word with the value already stored in the array, if one exists, or keep the first word otherwise. Then, with $0=$0, we force a re-split of the fields, since the first field may now contain multiple words. After that, we store the line in the array, using the first word as the index.
The block {print $1 in a?a[$1]:$1 OFS 0} is executed only for the lines of the second file (due to the next statement in the previous block). If we find a matching line, we print it; otherwise, we concatenate 0 to the first word and print that.
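To see why the $0=$0 re-split matters, here is a tiny illustrative snippet: assigning a multi-word string to $1 rebuilds $0, but NF is not updated until $0 itself is reassigned.
$ echo 'a b' | awk '{ $1 = "x y"; print NF; $0 = $0; print NF }'
2
3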
You can use this awk:
awk 'FNR==NR{a[$1] = (a[$1]==""?"":a[$1] " ") $2 OFS $3 OFS $4; next}
{print $1, ($1 in a ? a[$1] : 0)}' file1 file2
1704 1.000000 T G
1705 0
1706 0
1707 0
1708 1.000000 C G
1709 0
1710 0
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713 0
Reference: Effective AWK Programming
How it works:
FNR==NR - Execute this block for file1 only
a[$1] = (a[$1]==""?"":a[$1] " ") $2 OFS $3 OFS $4 - Create an associative array a with key as $1 and value as $2 + $3 + $4 (keep appending previous values)
next - skip to next record
{...} - Execute this block for 2nd input file file2
$1 in a - tests whether $1 from the 2nd file exists as a key in array a
print $1, ($1 in a ? a[$1] : 0) - Print $1, followed by the value from the array if $1 is in a; otherwise 0 is printed.
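To inspect what the array holds once file1 has been read, here is a hypothetical debug one-liner (key order in for (k in a) is unspecified, so your output order may differ):
$ awk '{a[$1] = (a[$1]==""?"":a[$1] " ") $2 OFS $3 OFS $4} END{for (k in a) print k " => " a[k]}' file1.tab
1704 => 1.000000 T G
1708 => 1.000000 C G
1711 => 1.000000 G C
1712 => 0.989011 T A 0.003564 T G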
With perl
$ perl -F'/\s+/,$_,2' -lane '
if(!$#ARGV){ $h{$F[0]} .= $h{$F[0]} ? " $F[1]" : $F[1] }
else{ print "$F[0] ", $h{$F[0]} ? $h{$F[0]} : 0 }
' file1.tab file2.tab
1704 1.000000 T G
1705 0
1706 0
1707 0
1708 1.000000 C G
1709 0
1710 0
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713 0
-F'/\s+/,$_,2' split input line on whitespace, maximum of 2 fields
!$#ARGV works similarly to awk's NR==FNR when exactly two file arguments are given on the command line
The %h hash variable saves the appended values, keyed by the first field
When second file is processed, print as per required format
-l option strips newlines from input lines and adds newlines to each print statement
Here is a product of an unstoppable thought process using join, uniq, tac, grep and sort. The idea is to extract the unique key-value pairs (important for key 1712, which occurs twice) and join those, to avoid rows like 1708 1.000000 C G 1.000000 C G; as a consequence, this solution won't support grouping three or more values per key. Also, join -o ... -e "0" would not produce just a single 0 on the non-joining rows, because file1.tab has three columns to fill.
$ join -a 1 <(join -a 1 file2.tab <(uniq -w 4 file1.tab )) <(grep -v -f <(uniq -w 4 file1.tab ) <(tac file1.tab|uniq -w 4|sort))
1704 1.000000 T G
1705
1706
1707
1708 1.000000 C G
1709
1710
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713
More structured layout:
$ join -a 1 \
    <(join -a 1 \
        file2.tab \
        <(uniq -w 4 file1.tab)) \
    <(grep -v -f \
        <(uniq -w 4 file1.tab) \
        <(tac file1.tab | uniq -w 4 | sort))
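For reference, the two building blocks behave like this (uniq -w 4 compares only the first 4 characters, i.e. the key, so it keeps the first row of each key group; the tac variant keeps the last row per key and restores sorted order):
$ uniq -w 4 file1.tab
1704 1.000000 T G
1708 1.000000 C G
1711 1.000000 G C
1712 0.989011 T A
$ tac file1.tab | uniq -w 4 | sort
1704 1.000000 T G
1708 1.000000 C G
1711 1.000000 G C
1712 0.003564 T G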
I would like to merge two files into a new file, adding a new column that exists in neither input, preferably using awk in unix:
File 1: VDR.txt doesn't have a header, is space separated and looks like this:
chr12-45000000-50000000 --- rs192072617 48225416 0.000 0.270 0.999 0 -1 -1 -1
chr12-45000000-50000000 --- rs181728325 48225429 0.000 0.144 1.000 0 -1 -1 -1
chr12-45000000-50000000 --- rs187216594 48225500 0.000 0.007 1.000 0 -1 -1 -1
File 2: METAL1.tbl has a header, is tab separated and looks like this:
MarkerName Allele1 Allele2 Weight Zscore P-value Direction HetISq HetChiSq HetDf HetPVal
rs192072617 a g 2887.00 1.579 0.1143 ++ 0.0 0.032 1 0.8579
rs7929618 c g 2887.00 -1.416 0.1568 -+ 47.4 1.899 1 0.1681
rs181728325 t c 2887.00 1.469 0.1419 ++ 73.9 3.830 1 0.05033
rs7190157 a c 2887.00 1.952 0.05088 +- 72.7 3.669 1 0.05542
rs12364336 a g 2887.00 -1.503 0.1328 -+ 69.8 3.306 1 0.06902
rs187216594 t c 2887.00 -0.082 0.9349 +- 74.8 3.964 1 0.04649
rs12562373 a g 2887.00 -0.290 0.7717 -+ 0.0 0.150 1 0.6984
Files have unequal number of lines, first file (VDR.txt) is much shorter than the second file (METAL1.tbl).
I want to:
Merge these files by the 3rd column of the first file (VDR.txt) and the 1st column of the second file (METAL1.tbl).
Keep only the columns 1, 2, 3 and 4 from first file (VDR.txt) and all columns from the second file (METAL1.tbl).
Keep only the characters before the first dash "-" from the 1st column of the first file (VDR.txt)
Add a new column to the output file that repeats a certain character string (e.g. "VDR")
The output file doesn't have to have a header, but if one is needed it would be nice to have it as given below.
So I would like to have an output file (output.txt) that looks like this at the end:
gene MarkerName chr BP impute Allele1 Allele2 Weight Zscore P-value Direction HetISq HetChiSq HetDf HetPVal
VDR rs192072617 chr12 48225416 --- a g 2887 1.579 0.1143 ++ 0 0.032 1 0.8579
VDR rs181728325 chr12 48225429 --- t c 2887 1.469 0.1419 ++ 73.9 3.83 1 0.05033
VDR rs187216594 chr12 48225500 --- t c 2887 -0.082 0.9349 +- 74.8 3.964 1 0.04649
My attempt at this:
$ awk 'FNR==NR {a[$1]=$1" "$2" "$3" "$4" "$5;next}{print $3, gensub(/-.*/, "", $1), $4, $2, a[$3]}' METAL1.tbl VDR.txt
It does produce the chr column and the desired column order, but unfortunately it prints only the wanted columns from VDR.txt and not the merged file.
I am aware that this is a pretty complex example; any help or suggestion would be much appreciated.
Thanks,
Mel
As long as the header line is not needed, it is straightforward with a single, fairly simple awk script:
$ awk 'FNR == NR { sub(/-.*/, "", $1); row[$3] = "VDR " $3 " " $1 " " $4 " " $2 }
> FNR != NR { if ($1 in row) { name = $1; $1 = ""; print row[name] $0 } }' \
> VDR.txt METAL1.tbl
VDR rs192072617 chr12 48225416 --- a g 2887.00 1.579 0.1143 ++ 0.0 0.032 1 0.8579
VDR rs181728325 chr12 48225429 --- t c 2887.00 1.469 0.1419 ++ 73.9 3.830 1 0.05033
VDR rs187216594 chr12 48225500 --- t c 2887.00 -0.082 0.9349 +- 74.8 3.964 1 0.04649
$
The files must be listed in the order shown for it to work.
The FNR == NR line processes the first file. The sub eliminates the first dash and everything after it in the first field; the assignment is keyed by the marker name in $3 and holds the information for the start of the output line: the fixed code, the marker name, the reduced chromosome name, the BP, and the set of dashes marked 'impute'.
The FNR != NR line processes the other file(s). When the value in column 1 matches a key in the row array, the key is blanked out of the current row (leaving a leading blank at the start of $0), and then the saved value from row is printed concatenated with $0.
There's no need to treat the heading line specially; the value MarkerName won't match any of the actual marker names from the first file, so that line is simply ignored.
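As a minimal illustration of the FNR == NR two-file idiom (using hypothetical throwaway files f1 and f2):
$ printf 'a\nb\n' > f1; printf 'x\ny\n' > f2
$ awk 'FNR == NR { print "first:", $0; next } { print "second:", $0 }' f1 f2
first: a
first: b
second: x
second: y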
$ cat > test.awk
NR==FNR {
sub(/-.*/,"",$1) # remove from 1st dash forward
a[$3]="VDR" OFS $3 OFS $1 OFS $4 OFS $2 # cols 1-4 of the 1st file
next
}
FNR==1 {
printf "%s", "H0" OFS "H3" OFS "H1" OFS "H4" OFS "H2" # 1st part of header
}
FNR==1 || $1 in a { # header and matching rows
print a[$1], $0 # print'em
}
$ awk -f test.awk VDR.txt METAL1.tbl
H0 H3 H1 H4 H2 MarkerName Allele1 Allele2 Weight Zscore P-value Direction HetISq HetChiSq HetDf HetPVal
VDR rs192072617 chr12 48225416 --- rs192072617 a g 2887.00 1.579 0.1143 ++ 0.0 0.032 1 0.8579
VDR rs181728325 chr12 48225429 --- rs181728325 t c 2887.00 1.469 0.1419 ++ 73.9 3.830 1 0.05033
VDR rs187216594 chr12 48225500 --- rs187216594 t c 2887.00 -0.082 0.9349 +- 74.8 3.964 1 0.04649
As a one-liner:
awk 'NR==FNR { sub(/-.*/,"",$1); a[$3]="VDR" OFS $3 OFS $1 OFS $4 OFS $2; next} FNR==1 {printf "%s", "H0" OFS "H3" OFS "H1" OFS "H4" OFS "H2"} FNR==1 || $1 in a {print a[$1], $0}' VDR.txt METAL1.tbl
I've sorted the two data files in order to use the join command. This affects the order of rows in the output; if that is not desirable, I can use another approach.
export LANG=C
genef=$1
metalf=$2
gene=$(basename $genef .txt)
join -13 -21 <(sort -k3,3 $genef) <(sort -k1,1 $metalf)|
awk -vgene=$gene '
{
marker=$1
chr=substr($2, 1, index($2, "-")-1)
bp=$4
impute=$3
printf("%s\t%s\t%s\t%s\t%s", gene, marker, chr, bp, impute)
for(i=12; i<=NF; ++i)
printf("\t%s", $i)
printf("\n")
}
'
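Assuming the script above is saved under a hypothetical name such as merge.sh and made executable, it takes the two input files as positional arguments, so the gene code is derived from the first file's name:
$ ./merge.sh VDR.txt METAL1.tbl > output.txt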
This is the tab-separated output:
VDR rs181728325 chr12 48225429 --- t c 2887.00 1.469 0.1419 ++ 73.9 3.830 1 0.05033
VDR rs187216594 chr12 48225500 --- t c 2887.00 -0.082 0.9349 +- 74.8 3.964 1 0.04649
VDR rs192072617 chr12 48225416 --- a g 2887.00 1.579 0.1143 ++ 0.0 0.032 1 0.8579
I have a quite simple question, but I find it hard to solve this problem.
I have two quite long columns of data, and I want to separate them into several columns. The script should start writing data into a new column each time it finds a specific string in the first column:
input:
A B
1 C
2 C
3 C
4 C
A D
1 D
2 D
3 D
4 D
output:
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D
(the separating pattern is A)
You can do this using a single awk command:
awk 'NR>1 && /^A/{p=1} {if (p) print a[++i], $0; else a[NR]=$0}' OFS='\t' file
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D
awk with paste:
$ awk '$1 == "A" { ++n } { print > ("t.tmp." n) }' input.txt
$ ls t.tmp.*
t.tmp.1 t.tmp.2
$ paste t.tmp.*
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D
EDIT
More efficient (the file name is built only once for each group) and more robust (closing files as we go avoids the chance of having too many open files); thanks, Ed Morton:
awk '$1 == "A" { close(out); out = "t.tmp." ++n} { print > out }' input.txt
(The above assumes the first record contains the pattern. If not, out can be initialized in a BEGIN block, as sketched below.)
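A sketch of that BEGIN-block variant, routing any records that appear before the first A line into a catch-all file:
awk 'BEGIN { out = "t.tmp.0" }                     # catch-all before the first pattern line
     $1 == "A" { close(out); out = "t.tmp." ++n }  # start a new file for each group
     { print > out }' input.txt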
Using csplit and paste
$ csplit -zsf file infile.txt '/A/' {*}
$ paste file*
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D
From man csplit
csplit - split a file into sections determined by context lines
-z, --elide-empty-files
remove empty output files
-s, --quiet, --silent
do not print counts of output file sizes
-f, --prefix=PREFIX
use PREFIX instead of 'xx'
{*} repeat the previous pattern as many times as possible
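For illustration, the intermediate files the csplit call creates on the example input (the -f prefix yields file00, file01, ...):
$ head file0*
==> file00 <==
A B
1 C
2 C
3 C
4 C

==> file01 <==
A D
1 D
2 D
3 D
4 D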
Using GNU awk multiline records; this works for any number of occurrences of the pattern and assumes equal-length columns:
pat=A
awk -vpat=$pat -F'\n' '
BEGIN {RS="(^|\n)"pat" "}                 # one record per group, split on the pattern line
NR>1{                                     # skip the empty record before the first match
    nr=NR-2                               # 0-based group (output column) index
    fld[nr][0]=pat" "$1                   # re-attach the pattern to the group header
    for(i=2; i<=NF; ++i)
        fld[nr][i-1]=$i                   # remaining lines of the group
}
END {
    for(i=0; i < NF; ++i) {               # for each row ...
        for(j=0; j < NR-1; ++j)           # ... print that row of every group
            printf("%s%s", j?"\t":"", fld[j][i])
        printf("\n")
    }
}
'
input
A B
1 C
2 C
3 C
4 C
A D
1 D
2 D
3 D
4 D
A X
1 X
3 X
5 X
7 X
output
A B A D A X
1 C 1 D 1 X
2 C 2 D 3 X
3 C 3 D 5 X
4 C 4 D 7 X
This is the idiomatic awk solution to this problem: collect every line into a two-dimensional array indexed by row and column, starting a new column each time the pattern appears, then print the array row by row at the end.
$ awk -v OFS='\t' '
$1 == "A" { numRows=0; ++numCols }
{ val[++numRows,numCols] = $0 }
END {
for (rowNr=1;rowNr<=numRows;rowNr++) {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
}
}
}
' file
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D
I have a file with the following fields (and an example value to the right):
hg18.ensGene.bin 0
hg18.ensGene.name ENST00000371026
hg18.ensGene.chrom chr1
hg18.ensGene.strand -
hg18.ensGene.txStart 67051161
hg18.ensGene.txEnd 67163158
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
hg18.ensGene.name2 ENSG00000152763
hg18.ensGene.exonFrames 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0,
This is a shortened version of the file:
0 ENST00000371026 chr1 - 67051161 67163158 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0, uc009waw.1,uc009wax.1,uc001dcx.1,
0 ENST00000371023 chr1 - 67075869 67163055 67075869,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163055, ENSG00000152763 0,1,1,1,2,1,2,0,2,0, uc001dcy.1
0 ENST00000395250 chr1 - 67075991 67163158 67075991,67076022,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076018,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,0,1,1,1,2,0,-1,-1,-1,-1, n/a
I need to sum the differences of the exon starts and ends, for example:
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
difference:
1290,157,227,99,122,158,152,87,203,195,156,140,157,113,185,175,226
sum (hg18.ensGene.exonLenSum):
3842
And I would like the output to have the following fields:
hg18.ensGene.name
hg18.ensGene.name2
hg18.ensGene.exonLenSum
such as this:
ENST00000371026 ENSG00000152763 3842
I would like to do this with one awk script for all lines in the input file. How can I do this? This is useful for calculating exon lengths, say for an RPKM (Reads Per Kilobase of exon Model per million mapped reads) calculation.
so ross$ awk -f gene.awk gene.dat
ENST00000371026 ENSG00000152763 3842
ENST00000371023 ENSG00000152763 1645
ENST00000395250 ENSG00000152763 1622
so ross$ cat gene.awk
/./ {
    # field layout: $2 = name, $7 = exonStarts, $8 = exonEnds, $9 = name2
    name = $2
    name2 = $9
    s = $7
    e = $8
    sc = split(s, sa, ",")             # exon start positions
    ec = split(e, ea, ",")             # exon end positions
    if (sc != ec) {
        print "starts != ends ", name, name2, sc, ec
    }
    diffsum = 0
    for (i = 1; i <= sc; ++i) {
        diffsum += ea[i] - sa[i]       # accumulate exon lengths
    }
    print name, name2, diffsum
}
Using the UCSC anonymous mysql server:
mysql -N -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 -e 'select name,name2,exonStarts,exonEnds from ensGene' |\
awk -F ' ' '{n=split($3,a1,"[,]"); split($4,a2,"[,]"); size=0; for(i=1;i<=n;++i) {size+=int(a2[i]-a1[i]);} printf("%s\t%s\t%d\n",$1,$2,size); }'
result:
ENST00000404059 ENSG00000219789 632
ENST00000326632 ENSG00000146556 1583
ENST00000408384 ENSG00000221311 138
ENST00000409575 ENSG00000222003 1187
ENST00000409981 ENSG00000222027 1187
ENST00000359752 ENSG00000197490 126
ENST00000379479 ENSG00000205292 873
ENST00000326183 ENSG00000177693 918
ENST00000407826 ENSG00000219467 2820
ENST00000405199 ENSG00000220902 1231
(...)