Given the following string,
>Q07092
MWVSWAPGLWLLGLWATFGHGANTGAQCPPSQQEGLKLEHSSSLPANVTGFNLIHRLSLMKTSAIKKIRNPKGPLILRLGAAPVTQPTRRVFPRGLPEEFALVLTLLLKKHTHQKTWYLFQVTDANGYPQISLEVNSQERSLELRAQGQDGDFVSCIFPVPQLFDLRWHKLMLSVAGRVASVHVDCSSASSQPLGPRRPMRPVGHVFLGLDAEQGKPVSFDLQQVHIYCDPELVLEEGCCEILPAGCPPETSKARRDTQSNELIEINPQSEGKVYTRCFCLEEPQNSEVDAQLTGRISQKAERGAKVHQETAADECPPCVHGARDSNVTLAPSGPKGGKGERGLPGPPGSKGEKGARGNDCVRISPDAPLQCAEGPKGEKGESGALGPSGLPGSTGEKGQKGEKGDGGIKGVPGKPGRDGRPGEICVIGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGIGLPGTPGDPGGPPGPKGDKGSSGIPGKEGPGGKPGKPGVKGEKGDPCEVCPTLPEGFQNFVGLPGKPGPKGEPGDPVPARGDPGIQGIKGEKGEPCLSCSSVVGAQHLVSSTGASGDVGSPGFGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGEPCEPCPALSNLQDGDVRVVALPGPSGEKGEPGPPGFGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGDGCTACPSLQGTVTDMAGRPGQPGPKGEQGPEGVGRPGKPGQPGLPGVQGPPGLKGVQGEPGPPGRGVQGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGASVSGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGECSCPSQGDLIFSGMPGAPGLWMGSSWQPGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGLTAELGSLPIEQHLLKSICGDCVQGQRAHPGYLVEKGEKGDQGIPGVPGLDNCAQCFLSLERPRAEEARGDNSEGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGPQAEKGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGISAVGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGMPGGPGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGDMVNYDEIKRFIRQEIIKMFDERMAYYTSRMQFPMEMAAAPGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGIGIAGENGLPGPPGPQGPPGYGKMGATGPMGQQGIPGIPGPPGPMGQPGKAGHCNPSDCFGAMPMEQQYPPMKTMKGPFG
I first want to grep for runs of six or more xGx repeats, where x is any character other than G. This I can do easily:
grep -EIho -B1 '([^G]G[^G]){6,}' file
which outputs
>Q07092
KGERGLPGPPGSKGEKGARGN
EGPKGEKGESGALGPSGLPGSTGEKGQKGEKGD
IGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGI
PGPKGDKGSSGIPGKEGP
FGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGE
FGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGD
AGRPGQPGPKGEQGPEGV
PGKPGQPGLPGVQGPPGLKGVQGEPGPPGR
QGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGA
SGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGE
PGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGL
EGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGP
KGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGI
VGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGM
PGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGD
PGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGI
AGENGLPGPPGPQGPPGY
MGATGPMGQQGIPGIPGPPGPMGQPGKAGH
Now, I want to find the character position of all G's when they occur in 'TGA' or 'SGA'. The character positions should be based on the input and NOT the output.
Expected output,
$ some-grep-awk-code
>Q07092
TGA: 573
SGA: 384
The awk solution,
awk -v str='TGA' '{ off=0; while (pos=index(substr($0,off+1),str)) { printf("%d: %d\n", NR, pos+off); off+=length(str)+pos } }' file
outputs TGA both at character position 25 and 573. However, I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
Really appreciate any help!
Here's a basic awk solution:
Each sequence must span a single line.
The resulting positions are relative to the start of the line and point to the first character of each search string (add 1 to get the position of the G).
The algorithm first searches for the parts of the line that match ([^G]G[^G]){6,}, then searches for occurrences of SGA and TGA within those parts. The implementation is a little tedious because awk's match() and index() functions have no offset option.
awk '
    BEGIN {
        regexp = "([^G]G[^G]){6,}"
        search["SGA"]
        search["TGA"]
    }
    /^>/ {
        print
        next
    }
    {
        i0 = 1
        s0 = $0
        while ( match( s0, regexp ) ) {
            head = substr(s0,RSTART,RLENGTH)
            tail = substr(s0,RSTART+RLENGTH)
            i0 += RSTART - 1
            for (s in search) {
                s1 = head
                i1 = i0
                while ( i = index(s1, s) ) {
                    s1 = substr(s1, i+1)
                    i1 += i
                    search[s] = search[s] " " i1-1
                }
            }
            s0 = tail
            i0 += RLENGTH
        }
        for (s in search) {
            print s ":" search[s]
            search[s] = ""
        }
    }
'
Example with simplified sequences
>TEST1
SGA.G..G.TGATGA.G..G..G.SGA.....TGA.....SGA.....G..G.SGA.G..G..G.
>TEST2
.G..G.TGA.G..G.G.....G..G..G..G.SGA.G.
>TEST1
SGA: 1 25 54
TGA: 10 13
>TEST2
SGA: 33
TGA:
TODO
Parameterize the regex and the search strings: it's not difficult per se but the current code will run into an infinite loop when a search string is empty or when the regex allows 0-length matches; you'll need to prevent that from happening.
Allow multi-line sequences
Allow overlapping matches for the regex. Basically, this means looking for the next match at RSTART+1 of the previous iteration; that will generate a lot of duplicate results that you need to discard one way or another.
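The two-step search described in this answer (find the repeat regions, then look for the motifs inside them) can also be sketched in Python. This is only an illustration under stated assumptions: the function name g_positions is hypothetical, Python's re module stands in for awk's match(), and the reported position is that of the G (the middle character of the motif) rather than the start of the motif.

```python
import re

# Six or more consecutive xGx triplets, where x is any character except G
REPEAT = re.compile(r"(?:[^G]G[^G]){6,}")


def g_positions(seq, motifs=("SGA", "TGA")):
    """Return {motif: [1-based positions of the G]} for motifs that lie
    entirely inside a repeat region. Assumes G is the 2nd character of
    every motif (true for SGA and TGA)."""
    found = {m: [] for m in motifs}
    for region in REPEAT.finditer(seq):
        for m in motifs:
            # str.find with start/end keeps the search inside the region
            start = seq.find(m, region.start(), region.end())
            while start != -1:
                found[m].append(start + 2)  # 0-based start -> 1-based G
                start = seq.find(m, start + 1, region.end())
    return found


if __name__ == "__main__":
    test1 = "SGA.G..G.TGATGA.G..G..G.SGA.....TGA.....SGA.....G..G.SGA.G..G..G."
    print(g_positions(test1))
```

For the TEST1 line this gives 2, 26 and 55 for SGA and 11 and 14 for TGA, i.e. each value is one more than in the awk output above because the G is reported instead of the first character.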
You could match all occurrences of the regular expression [ST]GA and look at the wider substring surrounding each match to compare that window to (.G.){6}. Here is some code to do that:
$ awk '
    /^>/ { label = $0 ORS; next }
    {
        while (match(substr($0, pos + 1), /[ST]GA/)) {
            pos += RSTART
            if (len = RLENGTH) {
                wbeg = pos - 18 + len  # 18 is the length of .G..G..G..G..G..G.
                wlen = 2 * 18 - len + (wbeg < 1 ? wbeg - 1 : 0)
                wbeg = (wbeg < 1 ? 1 : wbeg)  # substr must start from at least 1
                window = substr($0, wbeg, wlen)
                if (window ~ /.G..G..G..G..G..G./) {
                    str = substr($0, pos, len)
                    print label str ":", pos + int(len / 2)
                    label = ""
                }
                pos += len - 1
            }
            if (pos >= length($0)) {
                break
            }
        }
        pos = 0
    }
' file
>Q07092
SGA: 384
The output only shows SGA: 384 because that is the only portion of the example input that meets the requirement:
I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
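The window idea from this answer can be sketched in Python as well. This is a rough illustration, not a port of the awk code: the function name motifs_in_repeats is hypothetical, a lookahead is used so that overlapping motifs are all visited, and the window is the 15 characters on either side of the motif, so that any 18-character (.G.){6} match inside it necessarily contains the motif.

```python
import re

MOTIF = re.compile(r"(?=([ST]GA))")   # lookahead: also finds overlapping hits
WINDOW = re.compile(r"(?:.G.){6}")    # six xGx triplets = 18 characters


def motifs_in_repeats(seq):
    """Return (motif, 1-based position of its G) for every SGA/TGA whose
    surrounding window contains six consecutive xGx triplets."""
    hits = []
    for m in MOTIF.finditer(seq):
        start = m.start()                  # 0-based start of the motif
        lo = max(0, start - 15)            # slices may not start below 0
        window = seq[lo:start + 3 + 15]    # motif plus 15 chars each side
        if WINDOW.search(window):
            hits.append((m.group(1), start + 2))
    return hits


if __name__ == "__main__":
    print(motifs_in_repeats("AAAAA" + "SGA" + "AGA" * 5 + "AAAAA"))
```

As in the awk version, the window test uses . rather than [^G], so the two approaches in this thread are not guaranteed to agree on every input.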
With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any POSIX awk. You can pass any number of search strings, separated by commas, in the awk variable keyWords; the code looks for each of them inside the matching parts of each line and reports where they occur.
awk -v keyWords="SGA,TGA" '
BEGIN{
  num=split(keyWords,arr1,",")
  for(i=1;i<=num;i++){
    checkValues[arr1[i]]
  }
}
!/>/{
  start=diff=prev=""
  while(match($0,/(.G.){6,}/)){
    lineMatch=substr($0,RSTART,RLENGTH)
    start+=(RSTART>1?RSTART-1:RSTART)
    diff=(start-prev)
    for(key in checkValues){
      if(ind=index(lineMatch,key)){
        print substr(lineMatch,ind,length(key)),(RSTART?RSTART-1:1)+ind+start+diff
      }
      prev=start
    }
    $0=substr($0,RSTART+RLENGTH)
  }
}
' Input_file
Output with shown samples will be as follows:
>Q07092
SGA: 384
Related
How can I replace the numeric genotype code with a DNA letter?
I have a modified VCF file that looks like this:
POS REF ALT A2.bam C10.bam
448 T C 0/0:0,255,255 0/0:0,255,255
2402 C T 1/1:209,23,0 xxx:255,0,255
n...
I want to replace 0/0 with the REF letter and 1/1 with the ALT letter, and delete everything after the genotype in each field.
it should look like this:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
I have been trying to do it with sed, but it didn't work.
I don't know how to approach it.
Would you please try:
awk '{
    if (NR > 1) {
        for (i=4; i<=5; i++) {
            split($i, a, ":")
            $i = a[1]
            if ($i == "0/0") $i = $2
            if ($i == "1/1") $i = $3
        }
    }
    print
}' file.txt
Output:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
The for loop processes the 4th and 5th columns (A2.bam and C10.bam).
First it chops off the substring after ":".
If the remaining value is equal to "0/0", then replace it with the 2nd column (REF).
In case of "1/1", use the 3rd column (ALT).
Hope this helps.
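For comparison, the same steps can be sketched in Python. This is a minimal illustration, assuming whitespace-separated columns with REF and ALT in columns 2 and 3 and the samples from column 4 onwards; the function name decode_genotypes and its first_sample_col parameter are hypothetical.

```python
def decode_genotypes(line, first_sample_col=3):
    """Replace 0/0 with REF and 1/1 with ALT in the sample columns,
    dropping everything after the first ':' in each sample field."""
    fields = line.split()
    ref, alt = fields[1], fields[2]
    for i in range(first_sample_col, len(fields)):
        genotype = fields[i].split(":", 1)[0]   # chop off ":0,255,255"
        # Anything other than 0/0 or 1/1 (e.g. "xxx") is kept unchanged
        fields[i] = {"0/0": ref, "1/1": alt}.get(genotype, genotype)
    return " ".join(fields)


if __name__ == "__main__":
    print(decode_genotypes("448 T C 0/0:0,255,255 0/0:0,255,255"))
    print(decode_genotypes("2402 C T 1/1:209,23,0 xxx:255,0,255"))
```

Unlike the awk loop, which is fixed to columns 4 and 5, this sketch processes every column from the fourth onwards; the header line would need to be printed separately.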
I was wondering what the best way is to format a chart so that each column is centered instead of left-aligned, which is the default of the column command. I have been using the column -t filename command.
Current Output:
Label1        label2
Anotherlabel  label2442
label152      label42242
label78765    label373737737
Desired Output (something like this):
   Label1         label2
Anotherlabel    label2442
  label152      label42242
 label78765   label373737737
Basically, I want it to be centered instead of left aligned.
Here is an awk solution:
# Collect all lines in "data", keep track of maximum width for each field
{
    data[NR] = $0
    for (i = 1; i <= NF; ++i)
        max[i] = length($i) > max[i] ? length($i) : max[i]
}
END {
    for (i = 1; i <= NR; ++i) {
        # Split record into array "arr"
        split(data[i], arr)
        # Loop over array
        for (j = 1; j <= NF; ++j) {
            # Calculate amount of padding required
            pad = max[j] - length(arr[j])
            # Print field with appropriate padding, see below
            printf "%*s%*s%s", length(arr[j]) + int(pad/2), arr[j], \
                pad % 2 == 0 ? pad/2 : int(pad/2) + 1, "", \
                j == NF ? "" : "  "
        }
        # Newline at end of record
        print ""
    }
}
Called like this:
$ awk -f centre.awk infile
   Label1         label2
Anotherlabel    label2442
  label152      label42242
 label78765   label373737737
The printf statement uses padding with dynamic widths:
The first %*s takes care of left padding and the data itself: arr[j] gets printed and padded to a total width of length(arr[j]) + int(pad/2).
The second %*s prints the empty string, left padded to half of the total padding required. pad % 2 == 0 ? pad/2 : int(pad/2) + 1 checks if the total padding was an even number, and if not, adds an extra space.
The last %s prints j == NF ? "" : "  ", i.e., two spaces, unless we're at the last field.
Some older awks don't support the %*s syntax, but the formatting string can be assembled like width = 5; "%" width "s" in that case.
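The padding arithmetic described above can be checked in isolation. The following Python sketch (the function name center_field is hypothetical) splits the total padding the same way: half on the left, rounded down, and the remainder on the right.

```python
def center_field(text, width):
    """Center text in a field of the given width; when the padding is
    odd, the extra space goes on the right."""
    pad = max(width - len(text), 0)
    left = pad // 2          # int(pad/2) in the awk version
    right = pad - left       # pad/2, plus 1 when pad is odd
    return " " * left + text + " " * right


if __name__ == "__main__":
    # "label2442" is 9 characters wide; the widest entry in its column is 14
    print("[" + center_field("label2442", 14) + "]")
```

With a column width of 14, "label2442" gets two spaces on the left and three on the right.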
Here's a Python program to do what you want. It's probably too hard to do in bash, so you'll need to use a custom program or awk script. Basic algorithm:
count number of columns
[optional] make sure each line has the same number of columns
figure out the maximum length of data for each column
print each line using the max lengths
#!/usr/bin/env python3
import sys


def column():
    # Read file and split each line into fields (by whitespace)
    with open(sys.argv[1]) as f:
        lines = [line.split() for line in f]

    # Check that each line has the same number of fields
    num_fields = len(lines[0])
    for n, line in enumerate(lines, 1):  # report 1-based line numbers
        if len(line) != num_fields:
            print('Line {} has wrong number of columns: expected {}, got {}'.format(n, num_fields, len(line)))
            sys.exit(1)

    # Calculate the maximum length of each field
    max_column_widths = [0] * num_fields
    for line in lines:
        line_widths = (len(field) for field in line)
        max_column_widths = [max(z) for z in zip(max_column_widths, line_widths)]

    # Now print them centered using the max_column_widths
    spacing = 4
    format_spec = (' ' * spacing).join('{:^' + str(n) + '}' for n in max_column_widths)
    for line in lines:
        print(format_spec.format(*line))


if __name__ == '__main__':
    column()
I have an input file with fields in several lines. In this file, the field pattern is repeated according to query size.
ZZZZ
21293

YYYYY   XXX      WWWW   VV
13242   MUTUAL   BOTH   NO

UUUUU   TTTTTTTT   SSSSSSSS   RRRRR   QQQQQQQQ   PPPPPPPP
3       0                     3       0

NNNNNN   MMMMMMMMM   LLLLLLLLL   KKKKKKKK   JJJJJJJJ
2        0                       5          3

IIIIII   HHHHHH   GGGGGGG   FFFFFFF   EEEEEEEEEEE   DDDDDDDDDDD
5        3        0                   3
My desired output is one line per complete group of fields. Empty fields should be marked, for example with "X":
21293 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
12345 67890 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
I have been thinking about how to get the desired output with awk or other Unix scripts, but I can't figure it out. Any ideas? Thank you very much!
This isn't really a great fit for awk's style of programming, which is based on fields that are delimited by a pattern, not fields with variable positions on the line. But it can be done.
When you process the first line in each pair, scan through it finding the positions of the beginning of each field name.
awk 'NR%3 == 1 {
    delete fieldpos;
    delete fieldlen;
    lastspace = 1;
    fieldindex = 0;
    for (i = 1; i <= length(); i++) {
        if (substr($0, i, 1) != " ") {
            if (lastspace) {
                fieldpos[fieldindex] = i;
                if (fieldindex > 0) {
                    fieldlen[fieldindex-1] = i - fieldpos[fieldindex-1];
                }
                fieldindex++;
            }
            lastspace = 0;
        } else {
            lastspace = 1;
        }
    }
}
NR%3 == 2 {
    for (i = 0; i < fieldindex; i++) {
        if (i in fieldlen) {
            f = substr($0, fieldpos[i], fieldlen[i]);
        } else { # last field, go to end of line
            f = substr($0, fieldpos[i]);
        }
        gsub(/^ +| +$/, "", f); # trim surrounding spaces
        if (f == "") { f = "X" }
        printf("%s ", f);
    }
}
NR%15 == 14 { print "" } # print newline after 5 data blocks
'
Assuming your fields are separated by blank characters and not tabs, GNU awk's FIELDWIDTHS is designed to handle this sort of situation:
/^ZZZZ/ { if (rec!="") print rec; rec="" }
/^[[:upper:]]/ {
    FIELDWIDTHS = ""
    while ( match($0,/\S+\s*/) ) {
        FIELDWIDTHS = (FIELDWIDTHS ? FIELDWIDTHS " " : "") RLENGTH
        $0 = substr($0,RLENGTH+1)
    }
    next
}
NF {
    for (i=1;i<=NF;i++) {
        gsub(/^\s+|\s+$/,"",$i)
        $i = ($i=="" ? "X" : $i)
    }
    rec = (rec=="" ? "" : rec " ") $0
}
END { print rec }
$ awk -f tst.awk file
2129 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
In other awks you'd use match()/substr(). Note that the above isn't perfect in that it truncates a char off 21293 - that's because I'm not convinced your input file is accurate and if it is you haven't told us why that number is longer than the string on the preceding line or how to deal with that.
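Both answers rely on the same idea: take the column boundaries from the header line and cut the data line at those positions. A minimal Python sketch of that idea follows; it assumes space-padded (not tab-separated) columns, the function name split_by_header and the blank parameter are hypothetical, and, like the first awk answer, it lets the last field run to the end of the line.

```python
import re


def split_by_header(header, data, blank="X"):
    """Cut data at the column positions where the header's field names
    start; empty fields are replaced by the blank marker."""
    starts = [m.start() for m in re.finditer(r"\S+", header)]
    ends = starts[1:] + [None]          # last field runs to end of line
    values = []
    for lo, hi in zip(starts, ends):
        value = data[lo:hi].strip()     # slicing past the end yields ""
        values.append(value or blank)
    return values


if __name__ == "__main__":
    header = "UUUUU  TTTTTTTT  SSSSSSSS  RRRRR  QQQQQQQQ  PPPPPPPP"
    data = "3".ljust(7) + "0".ljust(10) + "".ljust(10) + "3".ljust(7) + "0"
    print(" ".join(split_by_header(header, data)))
```

The column widths in this example are made up for illustration; only the pattern of filled and empty fields follows the sample in the question.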
I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the big file with an awk script to compute the mean.
So for each smallfile small you would run the script
awk -v start="$(head -n 1 small)" -v end="$(tail -n 1 small)" -f script bigfile
Where script can be something simple like
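BEGIN {
    sum = 0;
    count = 0;
    range_start = -1;
    range_end = -1;
}
{
    irow = int($1)
    ival = $2 + 0.0
    if (irow >= start && end >= irow) {
        if (range_start == -1) {
            range_start = NR;
        }
        sum = sum + ival;
        count++;
    }
    else if (irow > end) {
        if (range_end == -1) {
            range_end = NR - 1;
        }
    }
}
END {
    # The range may run to the last row of the big file
    if (range_start != -1 && range_end == -1) {
        range_end = NR;
    }
    # Guard against division by zero when no row falls in the range
    print "start =", range_start, "end =", range_end, "mean =", (count ? sum / count : "n/a")
}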
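The same computation can be sketched in Python. This is only an illustration under the assumption that the big file has already been read into (key, value) pairs; the function name mean_in_range is hypothetical.

```python
def mean_in_range(rows, start, end):
    """Mean of the values whose key lies in [start, end], or None when
    no row falls inside the range."""
    values = [value for key, value in rows if start <= key <= end]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    big = [(2, 0.004), (4, 0.003), (6, 0.034)]
    # A small file whose first and last rows are 3 and 6
    print(mean_in_range(big, 3, 6))
```

Here start and end play the same role as the -v start and -v end variables passed to awk, i.e. the first and last rows of the small file.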
You can try below:
for r in *; do
awk -v r=$r -F' ' \
'NR==1{b=$2;v=$4;next}{if(r >= b && r <= $2){m=(v+$4)/2; print m; exit}; b=$2;v=$4}' bigfile.txt
done
Explanation:
On the first line it saves columns 2 and 4 in temporary variables. On every later line it checks whether the file name r is between the start of the range (previous column 2) and the end of the range (current column 2).
It then works out the mean and prints the result.
I have an input file that is the result of a text comparison. It is in a very simple format with 3 columns: position, original text and new text.
But some of the records look like this:
4 ATCG ATCGC
10 1234 123
How can I write a short script to normalize it to
7 G GC
12 34 3
The whole original text and the whole new text are probably as follows, respectively:
ACCATCGGA1234
ACCATCGCGA123
"Normalize" means "move the position in the first column to the position where the change actually occurs". In other words, on line 1 we remove the common prefix ATC and add its length, 3, to the first field; similarly, on line 2 the prefix we remove has length 2.
This script
awk '
    BEGIN {OFS = "\t"}
    function common_prefix_length(str1, str2,    max_len, idx) {
        idx = 1
        if (length(str1) < length(str2))
            max_len = length(str1)
        else
            max_len = length(str2)
        while (substr(str1, idx, 1) == substr(str2, idx, 1) && idx < max_len)
            idx++
        return idx - 1
    }
    {
        len = common_prefix_length($2, $3)
        print $1 + len, substr($2, len + 1), substr($3, len + 1)
    }
' << END
4 ATCG ATCGC
10 1234 123
END
outputs
7 G GC
12 34 3
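For comparison, here is a minimal Python sketch of the same normalization. The function name normalize is hypothetical; like the awk function above, it never strips the whole of the shorter string, so at least one character remains in each column.

```python
def normalize(pos, old, new):
    """Strip the common prefix of old and new, keeping at least one
    character of the shorter string, and shift pos accordingly."""
    limit = min(len(old), len(new)) - 1   # never consume the last char
    n = 0
    while n < limit and old[n] == new[n]:
        n += 1
    return pos + n, old[n:], new[n:]


if __name__ == "__main__":
    for record in ((4, "ATCG", "ATCGC"), (10, "1234", "123")):
        print(*normalize(*record), sep="\t")
```

For the two sample records this yields 7, G, GC and 12, 34, 3, matching the awk output.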