I have a file.txt which has the following columns
id chr pos alleleA alleleB
1 01 1234 CT T
2 02 5678 G A
3 03 8901 T C
4 04 12345 C G
5 05 567890 T A
I am looking for a way of creating a new column so that it looks like: chr:pos:alleleA:alleleB
The problem is that alleleA and alleleB should be sorted based on:
1. length: whichever of the two columns has more letters on a given line comes first, followed by the other
2. alphabetical order, when both columns have the same number of letters
In this example, it would look like this:
id chr pos alleleA alleleB newID
1 01 1234 CT T chr1:1234:CT:T
2 02 5678 G A chr2:5678:A:G
3 03 8901 T C chr3:8901:C:T
4 04 12345 C G chr4:12345:C:G
5 05 567890 T A chr5:567890:A:T
I appreciate any help and suggestions. Thanks.
EDIT
Up to now I can modify the chr column so that it looks like "chr:1"...
The alleleA and alleleB columns should be combined so that if either column contains more than one letter, it comes first in the newID column. If both columns contain only one letter, the letters are arranged alphabetically in newID.
gawk solution:
awk 'function custom_sort(i1,v1,i2,v2){ # custom function to compare 2 crucial fields
l1=length(v1); l2=length(v2); # getting length of both fields
if (l1 == l2) {
return (v1 > v2)? 1:-1 # compare characters if field lengths are equal
} else {
return l2 - l1 # otherwise - compare by length (descending)
}
} NR==1 { $0=$0 FS "newID" } # add new column
NR>1 { a[1]=$4; a[2]=$5; asort(a,b,"custom_sort"); # sort the last 2 columns using function `custom_sort`
$(NF+1) = sprintf("chr%s:%s:%s:%s",$1,$3,b[1],b[2])
}1' file.txt | column -t
The output:
id chr pos alleleA alleleB newID
1 01 1234 CT T chr1:1234:CT:T
2 02 5678 G A chr2:5678:A:G
3 03 8901 T C chr3:8901:C:T
4 04 12345 C G chr4:12345:C:G
5 05 567890 T A chr5:567890:A:T
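In case gawk is unavailable (asort with a custom comparator is gawk-only), here is a minimal POSIX awk sketch of the same rule; it assumes whitespace-separated columns as in the sample:
awk 'NR == 1 { print $0, "newID"; next }
     {
       a = $4; b = $5
       # longer allele first; if equal length, alphabetical order
       if (length(b) > length(a) || (length(a) == length(b) && b < a)) {
         t = a; a = b; b = t
       }
       # $2+0 strips the leading zero from the chr column
       printf "%s chr%d:%s:%s:%s\n", $0, $2+0, $3, a, b
     }' file.txt | column -t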
Perl to the rescue:
perl -lane '
if (1 == $.) { print "$_ newID" }
else { print "$_ ", join ":", "chr" . ($F[1] =~ s/^0//r),
$F[2],
sort { length $b <=> length $a
or $a cmp $b
} @F[3,4];
}' -- input.txt
-l removes newlines from input and adds them to print
-n reads the input line by line
-a splits each input line on whitespace into the @F array
$. is the input line number, the condition just prints the header for the first line
s/^0// removes the initial zero from $F[1] (i.e. column 2)
/r returns the result of the substitution
the lengths of the last two columns are compared; if they are the same, string comparison is used.
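As a quick standalone check of that comparator (with made-up alleles):
$ perl -e 'print join " ", sort { length $b <=> length $a or $a cmp $b } qw(T CT A G)'
CT A G T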
I have this table structure (assume that the delimiters are tabs):
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which will easily extend the recommended output width of 80 characters.
03 Etim Last description
What i want is this:
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery
long description which
will easily extend the
recommended output width
of 80 characters.
03 Etim Last description
That means I want to split $3 into an array of strings with a predefined WIDTH, where the first element is appended "normally" to the current line and all subsequent elements get a new line with indentation according to the padding of the first two columns (the padding could also be fixed if that's easier).
Alternatively, the text in $0 could be split by a GLOBAL_WIDTH (e.g. 80 chars) into a first string and a "rest" -> the first string gets printed "normally" with printf, the rest is split by GLOBAL_WIDTH - (COLPAD1 + COLPAD2) and appended with new lines as above.
I tried to work with fmt and fold after my awk formatting (which is basically just putting headings onto the table), but they are of course not aware of awk's notion of fields.
How can I achieve this using bash tools and / or awk?
First build a test file (called file.txt):
echo "AA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which will easily extend the recommended output width of 80 characters.
03 Etim Last description" > file.txt
Now the script (called ./split-columns.sh):
#!/bin/bash
FILE=$1
#find position of 3rd column (starting with 'CCC')
padding=$(head -n1 "$FILE" | grep -aob 'CCC' | grep -oE '[0-9]+')
paddingstr=$(printf "%-${padding}s" ' ')
#set max length
maxcolsize=50
maxlen=$((padding + maxcolsize))
while read -r line; do
    #split the line only if it exceeds the desired length
    if [[ ${#line} -gt $maxlen ]]; then
        echo "$line" | fmt -s -w$maxcolsize - | head -n1
        echo "$line" | fmt -s -w$maxcolsize - | tail -n+2 | sed "s/^/$paddingstr/"
    else
        echo "$line"
    fi
done < "$FILE"
Finally, run it with the file as its single argument:
./split-columns.sh file.txt > fixed-width-file.txt
Output will be:
AA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description
which will easily extend the recommended output
width of 80 characters.
03 Etim Last description
You can try this Perl one-liner:
perl -lpe ' s/(.{20,}?)\s/$1\n\t /g ' file
It repeatedly matches the shortest run of at least 20 characters up to a whitespace and replaces that whitespace with a newline, a tab and a space, wrapping long lines at word boundaries.
with the given inputs
$ cat thurse.txt
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which will easily extend the recommended output width of 80 characters.
03 Etim Last description
$ perl -lpe ' s/(.{20,}?)\s/$1\n\t /g ' thurse.txt
AAA BBBB CCC
01 Item Description
here
02 Meti A very very
veeeery long description
which will easily extend
the recommended output
width of 80 characters.
03 Etim Last description
$
If you want to try a length window of 30/40/50:
$ perl -lpe ' s/(.{30,}?)\s/$1\n\t /g ' thurse.txt
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery
long description which will easily
extend the recommended output width
of 80 characters.
03 Etim Last description
$ perl -lpe ' s/(.{40,}?)\s/$1\n\t /g ' thurse.txt
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description
which will easily extend the recommended
output width of 80 characters.
03 Etim Last description
$ perl -lpe ' s/(.{50,}?)\s/$1\n\t /g ' thurse.txt
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which
will easily extend the recommended output width of
80 characters.
03 Etim Last description
$
#!/usr/bin/awk -f
# Read standard input, which should be a file of lines each line
# containing tab-separated strings. The string values may be very long.
# Columnate the output by
# wrapping long strings onto multiple lines within each field's
# specified length.
# Arguments are numeric field lengths. If an input line contains more
# values than the # of field lengths supplied, the last field length will
# be re-used.
#
# arguments are the field lengths
# invoke like this: wrapcolumns 30 40 40
BEGIN {
    FS = "\t";
    for (i = 1; i < ARGC; i++) {
        fieldlengths[i-1] = ARGV[i];
        ARGV[i] = "";
    }
    if (ARGC < 2) {
        print "usage: wrapcolumns length1 ... lengthn";
        exit;
    }
}
# return a string of n blanks (built by repeated doubling)
function blanks(n,    result) {
    result = " ";
    while (length(result) < n) {
        result = result result;
    }
    return substr(result, 1, n);
}
{
    # ARGC - 1 is the length of the fieldlengths array,
    # so ARGC - 2 is the index of its last element (it is zero-origin).
    # If the input line has more fields than the fieldlengths array,
    # use the last element.
    # any nonempty fields left?
    gotanyleft = 1;
    while (gotanyleft == 1) {
        gotanyleft = 0;
        for (i = 1; i <= NF; i++) {
            # length of the current field
            len = (ARGC - 2 < i) ? fieldlengths[ARGC - 2] : fieldlengths[i - 1];
            # print that much of the current field and remove it from the front
            printf "%s", substr($(i) blanks(len), 1, len) ":::"
            $(i) = substr($(i), len + 1);
            if ($(i) != "") {
                gotanyleft = 1;
            }
        }
        print ""
    }
}
A loop-free awk solution:
{m,g}awk -v ______="${WIDTH}" 'BEGIN {
1 OFS = ""
1 FS = "\t"
1 ___ = "\32\23"
1 __ = sprintf("\n%*s",
(_+=_^=_<_)+_^!_+(_+=_____=_+=_+_)+_____,__)
1 ____ = sprintf("%*s",______-length(__),"")
1 gsub(".",".",____)
sub("[.].......$","..?.?.?.?.?.?.?.[ ]",____)
1 ______ = _
} $!NF = sprintf("%.*s %*s %-*s %-s", _<_,_= $NF,_____,
$2,______, $--NF, substr("",gsub(____,
("&")___,_) * gsub("("(___)")+$","",_),
__ * gsub( (___), (__),_) )_)'
The output:
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which
will easily extend the recommended output
width of 80 characters.
03 Etim Last description
Say I've got a file.txt
Position name1 name2 name3
2 A G F
4 G S D
5 L K P
7 G A A
8 O L K
9 E A G
and I need to get the output:
name1 name2 name3
2 2 7
4 7 9
7 9
It outputs each name and the position numbers where there is an A or G.
In file.txt, the name1 column has an A in position 2, G's in positions 4 and 7... therefore in the output file: 2,4,7 is listed under name1
...and so on
Strategy I've devised so far (not very efficient): read each column one at a time, outputting the position number when a match occurs; then take the result for each column and cbind them together using R.
I'm fairly certain there's a better way using awk or bash... ideas appreciated.
$ cat tst.awk
NR==1 {
for (nameNr=2;nameNr<=NF;nameNr++) {
printf "%5s%s", $nameNr, (nameNr<NF?OFS:ORS)
}
next
}
{
for (nameNr=2;nameNr<=NF;nameNr++) {
if ($nameNr ~ /^[AG]$/) {
hits[nameNr,++numHits[nameNr]] = $1
maxHits = (numHits[nameNr] > maxHits ? numHits[nameNr] : maxHits)
}
}
}
END {
for (hitNr=1; hitNr<=maxHits; hitNr++) {
for (nameNr=2;nameNr<=NF;nameNr++) {
printf "%5s%s", hits[nameNr,hitNr], (nameNr<NF?OFS:ORS)
}
}
}
$ awk -f tst.awk file
name1 name2 name3
2 2 7
4 7 9
7 9
Save the script below:
#!/bin/bash
gawk '{if( NR == 1 ) {print $2 >>"name1"; print $3 >>"name2"; print $4>>"name3";}}
{if($2=="A" || $2=="G"){print $1 >> "name1"}}
{if($3=="A" || $3=="G"){print $1 >> "name2"}}
{if($4=="A" || $4=="G"){print $1 >> "name3"}}
END{system("paste name*;rm name*")}' $1
as finder. Make finder executable (using chmod) and then do:
./finder file.txt
Note: I have used three temporary files, name1, name2 and name3. You can change the file names at your convenience; these files are deleted at the end.
Edit: removed the BEGIN part of the gawk script.
I have two files like this:
File 1:
1 1987969 1987970 . 7.078307 33
1 2066715 2066716 . 7.426998 34
1 2066774 2066775 . 6.851217 33
File 2:
1 HANASAI gelliu 1186928 1441229
1 FEBRUCA sepaca 3455487 3608150
I want to take each value of column 3 in File 1 and search File 2 with a condition: if (File1_col3_value >= File2_col4_value && File1_col3_value <= File2_col5_value), then print the whole line of File 2 to a new file.
One more thing is important: for every value from file_1 being searched in file_2, the value in column 1 should be the same in both files. E.g. for '1987970' of file_1 the value in column 1 is '1', so the first column in the matching line of file_2 should also be '1'.
Thanks
EDIT: Only considers lines with matching "class" values in column 1
$ cat msh.awk
# Save all the pairs of class and third-column values from file1
NR==FNR { a[$1,$3]; next }
# For each line of file2, if there exists a third-column-file1
# value between the values of columns 4 and 5 in a record of the
# same class, print the line
{
for (cv in a) {
split(cv, class_val, SUBSEP);
c = class_val[1];
v = class_val[2];
if (c == $1 && v >= $4 && v <= $5) {
print
break
}
}
}
$ cat file1
1 1987969 1987970 . 7.078307 33
1 2066715 2066716 . 7.426998 34
1 2066774 1200000 . 6.851217 33
1 2066774 2066775 . 6.851217 33
$ cat file2
1 HANASAI gelliu 1186928 1441229
1 FEBRUCA sepaca 3455487 3608150
$ awk -f msh.awk file1 file2
1 HANASAI gelliu 1186928 1441229
I have a file with a very large number of columns (basically several thousand sets of threes) with three special columns (Chr, Position and Name) at the end.
I want to move these final three columns to the front of the file, so that the columns become Name Chr Position, and then the file continues with the trios.
I think this might be possible with awk, but I don't know enough about how awk works to do it!
Sample input:
Gene1.GType Gene1.X Gene1.Y ....ending in GeneN.Y Chr Position Name
Desired Output:
Name Chr Position (Gene1.GType Gene1.X Gene1.Y ) x n samples
I think the below example does more or less what you want.
$ cat file
A B C D E F G Chr Position Name
1 2 3 4 5 6 7 8 9 10
$ cat process.awk
{
    # print the last three fields first
    printf "%s %s %s", $(NF-2), $(NF-1), $NF
    # then the remaining fields in their original order
    for( i=1; i<NF-2; i++)
    {
        printf " %s", $i
    }
    print ""
}
$ awk -f process.awk file
Chr Position Name A B C D E F G
8 9 10 1 2 3 4 5 6 7
NF in awk denotes the number of fields on a row.
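A quick illustration of NF and $NF:
$ echo 'a b c' | awk '{ print NF, $NF, $(NF-1) }'
3 c b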
A one-liner:
awk '{ Chr=$(NF-2) ; Position=$(NF-1) ; Name=$NF ; $(NF-2)=$(NF-1)=$NF="" ; print Name, Chr, Position, $0 }' file
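Note that blanking the three fields leaves trailing field separators in $0. A variant sketch that drops the fields instead (assigning to NF this way works in GNU awk and most modern awks, but is not strictly portable):
awk '{ Chr=$(NF-2); Position=$(NF-1); Name=$NF; NF-=3; print Name, Chr, Position, $0 }' file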
I am trying to replace values in a large space-delimited text file and could not find a suitable answer to this specific problem:
Say I have a file "OLD_FILE", containing a header and approximately 2 million rows:
COL1 COL2 COL3 COL4 COL5
rs10 7 92221824 C A
rs1000000 12 125456933 G A
rs10000010 4 21227772 T C
rs10000012 4 1347325 G C
rs10000013 4 36901464 C A
rs10000017 4 84997149 T C
rs1000002 3 185118462 T C
rs10000023 4 95952929 T G
...
I want to replace the first value of each row with a corresponding value, using a large (2.8M rows) conversion table. In this conversion table, the first column lists the value I want to have replaced, and the second column lists the corresponding new values:
COL1_b36 COL2_b37
rs10 7_92383888
rs1000000 12_126890980
rs10000010 4_21618674
rs10000012 4_1357325
rs10000013 4_37225069
rs10000017 4_84778125
rs1000002 3_183635768
rs10000023 4_95733906
...
The desired output would be a file where all values in the first column have been changed according to the conversion table:
COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
...
Additional info:
Performance is an issue (the following command takes approximately a year):
while read a b; do sed -i "s/\b$a\b/$b/g" OLD_FILE ; done < CONVERSION_TABLE
A complete match is necessary before replacing
Not every value in the OLD_FILE can be found in the conversion table...
...but every value that could be replaced, can be found in the conversion table.
Any help is very much appreciated.
Here's one way using awk:
awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE
Results:
COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
Explanation, in order of appearance:
NR==1 { next } # simply skip processing the first line (header) of
# the first file in the arguments list (TABLE)
FNR==NR { ... } # This is a construct that only returns true for the
# first file in the arguments list (TABLE)
a[$1]=$2 # So when we loop through the TABLE file, we add the
# column one to an associative array, and we assign
# this key the value of column two
next # This simply skips processing the remainder of the
# code by forcing awk to read the next line of input
$1 in a { ... } # Now when awk has finished processing the TABLE file,
# it will begin reading the second file in the
# arguments list which is OLD_FILE. So this construct
# is a condition that returns true literally if column
# one exists in the array
$1=a[$1] # re-assign column one's value to be the value held
# in the array
1 # The 1 on the end simply enables default printing. It
# would be like saying: $1 in a { $1=a[$1]; print $0 }
This might work for you (GNU sed):
sed -r '1d;s|(\S+)\s*(\S+).*|/^\1\\>/s//\2/;t|' table | sed -f - file
This turns each line of table (after deleting its header) into a sed command of the form /^old\>/s//new/;t and runs the generated script over file.
You can use join:
join -o '2.2 1.2 1.3 1.4 1.5' <(tail -n+2 file1 | sort) <(tail -n+2 file2 | sort)
This drops the headers of both files; you can add the header back with head -n1 file1 (see the grouped command after the output below).
Output:
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
7_92383888 7 92221824 C A
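For example, to keep the header in one pass, the same command can be wrapped in a group:
{ head -n1 file1; join -o '2.2 1.2 1.3 1.4 1.5' <(tail -n+2 file1 | sort) <(tail -n+2 file2 | sort); }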
Another way with join. Assuming the files are sorted on the 1st column:
head -1 OLD_FILE
join <(tail -n+2 CONVERSION_TABLE) <(tail -n+2 OLD_FILE) | cut -f 2-6 -d' '
But with data of this size you should consider using a database engine.
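For instance, a rough sqlite3 sketch of that idea (table and column names here are made up; assumes sqlite3 3.32+ for .import --skip and single-space-separated columns; like the join answer, it omits the header):
sqlite3 :memory: <<'SQL'
.mode list
.separator " "
CREATE TABLE conv (old TEXT PRIMARY KEY, new TEXT);
CREATE TABLE data (c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT);
.import --skip 1 CONVERSION_TABLE conv
.import --skip 1 OLD_FILE data
SELECT coalesce(conv.new, data.c1), c2, c3, c4, c5
FROM data LEFT JOIN conv ON conv.old = data.c1;
SQL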