Split file according to patterns in two consecutive lines - bash

I have files with the following format:
ATOM 3736 CB THR A 486 -6.552 153.891 -7.922 1.00115.15 C
ATOM 3737 OG1 THR A 486 -6.756 154.842 -6.866 1.00114.94 O
ATOM 3738 CG2 THR A 486 -7.867 153.727 -8.636 1.00115.11 C
ATOM 3739 OXT THR A 486 -4.978 151.257 -9.140 1.00115.13 O
HETATM10351 C1 NAG A 203 33.671 87.279 39.456 0.50 90.22 C
HETATM10483 C1 NAG A 702 28.025 104.269 -27.569 0.50 92.75 C
ATOM 3736 CB THR B 486 -6.552 86.240 7.922 1.00115.15 C
ATOM 3737 OG1 THR B 486 -6.756 85.289 6.866 1.00114.94 O
ATOM 3738 CG2 THR B 486 -7.867 86.404 8.636 1.00115.11 C
ATOM 3739 OXT THR B 486 -4.978 88.874 9.140 1.00115.13 O
HETATM10351 C1 NAG B 203 33.671 152.852 -39.456 0.50 90.22 C
HETATM10639 C2 FUC B 402 -48.168 162.221 -22.404 0.50103.03 C
I would like to split the file after each line starting with HETATM* but only if the next line starts with ATOM. I would like the new files to be called $basename_$column, where $basename is the base name of the input file and $column is the character at position 22-23 (either A or B, in the example). I am not able to figure out how to check both consecutive lines to determine the splitting point.

Here's an awk version
awk 'NR==1{n=$5}/HETATM/{f=1}f && /^ATOM/{n=$5;f=0}{print > "file"n".txt"}' file
To base the output names on the input file name, use awk's built-in FILENAME variable instead of the literal "file" prefix.

Here's a simple Python solution with no error checking. Should work in Python 2 or 3; change the first line to match your environment. Don't take this as an example of good coding style.
Edited for unique file names.
#!/usr/bin/env python2.4
import os.path
import sys
fname = sys.argv[1]
bname = os.path.basename(fname)
fin = open(fname)
fout = None
ct = 0
for line in fin:
    if line[:6] == 'HETATM':
        flag = True
    if (not fout) or (flag and line[:4] == 'ATOM'):
        if fout:
            fout.close()
        ct += 1
        fout = open(bname + '_' + line[21:22] + str(ct), 'w')
        flag = False
    fout.write(line)
fout.close()

Related

Mutate new column from random value in existing columns

I'm looking to mutate my data and create a new column which randomly selects a value from the existing data. My data looks something like:
individual  age_2010  age_2011  age_2012  age_2013
a           20        21        NA        21
b           33        34        35        36
c           76        NA        78        79
d           46        46        48        49
And I want it to look like:
individual  age_2010  age_2011  age_2012  age_2013  Random Sample
a           20        21        22        NA        21
b           33        34        35        36        36
c           76        NA        78        79        78
d           46        46        48        49        48
Is there any way to add a new column which includes a random figure from any of the previous age columns, and preferably keeping the data in wide form?
I think this is an easier approach:
d[, RandomSample:=sample(na.omit(t(.SD)),1),individual]
If dealing with the edge cases discussed above is desired, and one wanted to follow this approach, we could do this:
f <- function(df) {
s = na.omit(t(df))
ifelse(length(s)>0, sample(s,1),NA_real_)
}
d[, RandomSample:=f(.SD),individual]
Or we could just wrap the original approach in tryCatch:
d[, RandomSample:=tryCatch(sample(na.omit(t(.SD)),1),error=\(e) NA),individual]
You can reshape longer, then do grouped sampling:
library(data.table)
# Sample data
d <- structure(list(individual = c("a", "b", "c", "d"), age_2010 = c(20, 33, 76, 46), age_2011 = c(21, 34, NA, 46), age_2012 = c(NA, 35, 78, 48), age_2013 = c(21, 36, 79, 49)), row.names = c(NA, -4L), spec = structure(list(cols = list(individual = structure(list(), class = c("collector_character", "collector")), age_2010 = structure(list(), class = c("collector_double", "collector")), age_2011 = structure(list(), class = c("collector_double", "collector")), age_2012 = structure(list(), class = c("collector_double", "collector")), age_2013 = structure(list(), class = c("collector_double", "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 2L), class = "col_spec"), class = c("data.table", "data.frame"))
d
#>    individual age_2010 age_2011 age_2012 age_2013
#> 1:          a       20       21       NA       21
#> 2:          b       33       34       35       36
#> 3:          c       76       NA       78       79
#> 4:          d       46       46       48       49
# Solution
d[, "Random Sample"] <- d |>
  melt("individual") |>               # go long
  (`[`)(!is.na(value),                # drop NAs
        .(x = sample(value, 1)),      # sampling
        keyby = .(individual)) |>     # Grouping variable
  (`[[`)(2)                           # extract vector from frame
d
#>    individual age_2010 age_2011 age_2012 age_2013 Random Sample
#> 1:          a       20       21       NA       21            21
#> 2:          b       33       34       35       36            33
#> 3:          c       76       NA       78       79            76
#> 4:          d       46       46       48       49            49
Alternatively, you can also use apply(), which is less verbose but much slower:
d[, "Random Sample"] <- apply(d[, -1], 1, \(x) x |> na.omit() |> sample(1))
See the benchmark below for a speed comparison. On just 40k observations, apply() takes about 59 times longer and allocates about 8 times the memory.
# Make large sample data set
d_large <- d |>
  list() |>
  rep(1e4) |>
  rbindlist()
bench::mark(
  base = apply(d_large[, -1], 1, \(x) x |> na.omit() |> sample(1)),
  dt = d_large |>
    melt("individual") |>
    (`[`)(!is.na(value),
          .(x = sample(value, 1)),
          keyby = .(individual)) |>
    (`[[`)(2),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base       617.86ms  617.9ms      1.62   103.3MB     12.9
#> 2 dt           6.96ms   10.5ms     80.9     13.1MB     47.3
Created on 2022-07-27 by the reprex package (v2.0.1)
Edit:
Here are versions that work with the edge case where all years are NA. In the first case I went for a join with the original table, which is a bit more expensive than the other version.
# Solution with data.table
d <- d |>
  melt("individual") |>                            # go long
  (`[`)(!is.na(value),                             # drop NAs
        .(`Random Sample` = sample(value, 1)),     # sampling
        keyby = .(individual)) |>                  # Grouping variable
  (`[`)(d)                                         # right join with original frame
Here I simply used purrr::possibly() to return NA when sampling a zero-length vector.
# Solution with apply
d[, "Random Sample"] <- apply(d[, -1], 1,
  \(x) x |> na.omit() |> purrr::possibly(sample, NA)(1))
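The edge-case rule used above (return NA when every year is missing, otherwise sample among the non-missing values) does not depend on R. As a rough illustration only, here is a minimal plain-Python sketch of the same rule. The function name random_non_missing, the use of None in place of NA, and the extra all-missing row "e" are assumptions made for this example, not part of the original answers.

```python
import random


def random_non_missing(row, rng=random):
    """Pick one non-missing value from a row, or None if every value is missing."""
    candidates = [v for v in row if v is not None]
    return rng.choice(candidates) if candidates else None


# Wide-format data: one list of yearly ages per individual (None stands in for NA).
ages = {
    "a": [20, 21, None, 21],
    "b": [33, 34, 35, 36],
    "c": [76, None, 78, 79],
    "d": [46, 46, 48, 49],
    "e": [None, None, None, None],  # hypothetical all-missing row
}

sampled = {name: random_non_missing(values) for name, values in ages.items()}
print(sampled)  # values vary between runs; "e" always maps to None
```

Filtering before sampling avoids both the error on an empty vector and the risk of drawing a missing value.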

Algorithm to produce number series

I am not sure how to attack this problem... I tried many things, and it seems that it shouldn't be so difficult, but I am not getting there...
Is it possible to create a function "series ( _x )" that produces this:
For example, the function should give myfunction( 11 ) => 211
The terms become suffixes for the next terms. See the picture below for more clarity. The boxes with the same color get repeated. So we can just keep prepending 1 and 2 to the previous results.
Code (in Java):
import java.util.ArrayList;
import java.util.List;

public class Solution {
    public static void main(String[] args) {
        List<String> ans = solve(10);
        for (int i = 0; i < ans.size(); ++i) System.out.println(ans.get(i));
    }

    private static List<String> solve(int terms) {
        List<String> ans = new ArrayList<>();
        String[] digits = new String[]{"1", "2"};
        ans.add("1");
        if (terms == 1) return ans;
        ans.add("2");
        if (terms == 2) return ans;
        List<String> final_result = new ArrayList<>();
        final_result.addAll(ans);
        terms -= 2; // since 2 numbers are already added
        while (terms > 0) {
            // Build the next level by prepending each digit to the previous level
            List<String> temp = new ArrayList<>();
            for (String s : digits) {
                for (int j = 0; j < ans.size() && terms > 0; ++j) {
                    temp.add(s + ans.get(j));
                    terms--;
                }
            }
            ans = temp;
            final_result.addAll(ans);
        }
        return final_result;
    }
}
This hint should help you... It isn't quite binary, but it is close. Let me know if you need any further help
0    -> -   -> -
1    -> -   -> -
10   -> 0   -> 1
11   -> 1   -> 2
100  -> 00  -> 11
101  -> 01  -> 12
110  -> 10  -> 21
111  -> 11  -> 22
1000 -> 000 -> 111
1001 -> 001 -> 112
1010 -> 010 -> 121
1011 -> 011 -> 122
1100 -> 100 -> 211
1101 -> 101 -> 212
1110 -> 110 -> 221
1111 -> 111 -> 222
Edit: I didn't like the way I ordered the columns, so I swapped 2 and 3
Python approach
The first thing we need to do is produce binary strings. In Python this can be done with bin(number).
However, this returns a string of the form 0b101.
We can strip the 0b from the beginning with a slice: bin(number)[2:]. The left side of the : says to start two characters in, and since the right side is blank, the slice runs to the end.
Now we have the binary digits, but we still need to strip the leading digit. We already know how to strip leading characters, so change that line to bin(number)[3:].
All that is left is to add one to every digit. To do that, let's build a new string, appending each character from the binary string after incrementing it by one.
# we already had this
binary = bin(user_in + 1)[3:]
new = ""
for char in binary:
    # add to the string the character + 1
    new += str(int(char) + 1)
And we are done. That snippet converts from decimal to whatever this system is. Without the + 1 on the input, the result would be offset by one (2 would give 1, 3 would give 2); adding one to the user input before converting fixes this.
Final code with some convenience (a while loop and a print statement):
while True:
    user_in = int(input("enter number: "))
    binary = bin(user_in + 1)[3:]
    new = ""
    for char in binary:
        new += str(int(char) + 1)
    print(user_in, "\t->\t", binary, "\t->\t", new)
According to OEIS A007931, we should perform 3 steps:
Convert value + 1 to base 2
Remove the leading 1
Add 1 to each of the remaining digits
For instance, for 11 we have
Converting 11 + 1 == 12 to binary: 1100
Removing the leading 1: 100
Adding 1 to each remaining digit: 211
So 11 has the representation 211.
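The three steps above can be sketched as a small Python function. This is only a minimal sketch and not part of the original answer: the function name to_ones_and_twos is made up for illustration, and it assumes the input is a positive integer.

```python
def to_ones_and_twos(value):
    """Return the value-th term of the series 1, 2, 11, 12, 21, ... as a string."""
    if value < 1:
        raise ValueError("value must be a positive integer")
    binary = bin(value + 1)[2:]  # step 1: value + 1 in base 2, e.g. 12 -> '1100'
    rest = binary[1:]            # step 2: drop the leading 1 -> '100'
    # step 3: add 1 to each remaining digit -> '211'
    return "".join(str(int(digit) + 1) for digit in rest)


if __name__ == "__main__":
    print(to_ones_and_twos(11))  # 211
```

Returning a string rather than an integer keeps the result a plain digit sequence, which matches how the other answers print it.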
C# code:
private static String MyCode(int value) =>
    string.Concat(Convert
        .ToString(value + 1, 2)          // To Binary
        .Skip(1)                         // Skip (Remove) 1st 1
        .Select(c => (char)(c + 1)));    // Add 1 to the rest digits
Demo:
var result = Enumerable
    .Range(1, 22)
    .Select(value => $"{MyCode(value),4} : {value,2}");
Console.Write(string.Join(Environment.NewLine, result));
Outcome:
1 : 1
2 : 2
11 : 3
12 : 4
21 : 5
22 : 6
111 : 7
112 : 8
121 : 9
122 : 10
211 : 11
212 : 12
221 : 13
222 : 14
1111 : 15
1112 : 16
1121 : 17
1122 : 18
1211 : 19
1212 : 20
1221 : 21
1222 : 22
In VB.NET, showing both the counting in base-3 and OEIS formula ways, with no attempts at optimisation:
Module Module1
    Function OEIS_A007931(n As Integer) As Integer
        ' From https://oeis.org/A007931
        Dim m = Math.Floor(Math.Log(n + 1) / Math.Log(2))
        Dim x = 0
        For j = 0 To m - 1
            Dim b = Math.Floor((n + 1 - 2 ^ m) / (2 ^ j))
            x += CInt((1 + b Mod 2) * 10 ^ j)
        Next
        Return x
    End Function
    Function ToBase3(n As Integer) As String
        Dim s = ""
        While n > 0
            s = (n Mod 3).ToString() & s
            n \= 3
        End While
        Return s
    End Function
    Function SkipZeros(n As Integer) As String
        Dim i = 0
        Dim num = 1
        Dim s = ""
        While i < n
            s = ToBase3(num)
            If s.IndexOf("0"c) = -1 Then
                i += 1
            End If
            num += 1
        End While
        Return s
    End Function
    Sub Main()
        Console.WriteLine("A007931 Base3 ITERATION")
        For i = 1 To 22
            Console.WriteLine(OEIS_A007931(i).ToString().PadLeft(7) & SkipZeros(i).PadLeft(7) & i.ToString().PadLeft(11))
        Next
        Console.ReadLine()
    End Sub
End Module
Outputs:
A007931 Base3 ITERATION
1 1 1
2 2 2
11 11 3
12 12 4
21 21 5
22 22 6
111 111 7
112 112 8
121 121 9
122 122 10
211 211 11
212 212 12
221 221 13
222 222 14
1111 1111 15
1112 1112 16
1121 1121 17
1122 1122 18
1211 1211 19
1212 1212 20
1221 1221 21
1222 1222 22

glmnet input error, format of the input matrix incorrect

I get the following error:
Error in storage.mode(y) <- "double" : invalid to change the storage mode of a factor
I suspect an issue with getting the format of the input matrix right.
Here is the code:
library("glmnet")
daten = read.csv("test.csv",header = 1)
# Sex Age Weight Height Other
# 0 22 54 154 1.51
# 1 34 76 178 1.94
# 1 38 75 178 1.93
# 1 32 102 178 2.19
# ...
# 1 35 94 184 2.18
trainX <- daten
# outcome variable
Y <- c(0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,0,0,0,0,1)
trainY <- factor(Y)
fit.lasso=glmnet(trainX,trainY,alpha=1)
trainY seems to be formatted correctly as a factor, but what is wrong with trainX?
Any comments would be highly appreciated.
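Problem solved!
Simply skip the step
trainY <- factor(Y)
and use
fit.lasso=glmnet(trainX,Y,alpha=1)
and it works fine! The error came from trainY, not trainX: glmnet's default family is "gaussian", which needs a numeric response and cannot convert a factor to double. To keep a factor response for classification, family = "binomial" would be needed instead.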

Binning Together Allele Frequencies From VCF Sequencing Data

I have a sequencing datafile containing base pair locations from the genome, that looks like the following example:
chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5
I would like to compare certain groups defined by the location of the bp found in column 2. I then want the average of the numbers in column 5 for the matching regions.
So, using the example above, let's say I am looking for the average of the 5th column for all samples spanning chr1 810-820 and chr2 310-330. The first five rows should be identified, and their 5th-column values should be averaged, which gives 0.42.
I tried creating an array of ranges and then using awk to call these locations, but have been unsuccessful. Thanks in advance.
import pandas as pd
from io import StringIO
s = """chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5"""
sio = StringIO(s)
df = pd.read_table(sio, sep=" ", header=None)
df.columns = ["a", "b", "c", "d", "e"]
# The query expression is intuitive
r = df.query("(a=='chr1' & 810<b<820) | (a=='chr2' & 310<b<330)")
print(r["e"].mean())
pandas might be better for this kind of tabular data processing, and it's Python.
Here's some Python code to do what you are asking for. It assumes that your data lives in a text file called 'data.txt'.
#!/usr/bin/env python
data = open('data.txt').readlines()

def avg(keys):
    key_sum = 0
    key_count = 0
    for item in data:
        fields = item.split()
        krange = keys.get(fields[0], None)
        if krange:
            r = int(fields[1])
            if krange[0] <= r and r <= krange[1]:
                key_sum += float(fields[-1])
                key_count += 1
    # round to two decimals to hide floating-point noise in the average
    print(round(key_sum / key_count, 2))

keys = {}  # Create dict to store keys and ranges of interest
keys['chr1'] = (810, 820)
keys['chr2'] = (310, 330)
avg(keys)
Sample Output:
0.42
Here's an awk script answer. For input, I created a 2nd file which I called ranges:
chr1 810 820
chr2 310 330
The script itself looks like:
#!/usr/bin/awk -f
FNR==NR { low_r[$1] = $2; high_r[$1] = $3; next }
{ l = low_r[ $1 ]; h = high_r[$1]; if( l=="" ) next }
$2 >= l && $2 <= h { total+=$5; cnt++ }
END {
    if( cnt > 0 ) print (total/cnt)
    else print "no matched data"
}
The breakdown is:
FNR==NR - read the ranges file, building low_r and high_r arrays keyed on the first column of that file.
Then, for every row in the data, look up the matching entries in low_r and high_r. If there is no match, skip any further processing.
Check an inclusive range against the low and high bounds, incrementing total and cnt for rows that match.
At the END, print the average if there were any matches.
When the script (called script.awk) is made executable it can be run like:
$ ./script.awk ranges data
0.42
where I've called the data file data.

Change column according to previous line with conditions

I have files with the format:
ATOM 3736 CB THR A 486 -6.552 153.891 -7.922 1.00115.15 C
ATOM 3737 OG1 THR A 486 -6.756 154.842 -6.866 1.00114.94 O
ATOM 3738 CG2 THR A 486 -7.867 153.727 -8.636 1.00115.11 C
ATOM 3739 OXT THR A 486 -4.978 151.257 -9.140 1.00115.13 O
HETATM10351 C1 NAG B 203 33.671 87.279 39.456 0.50 90.22 C
HETATM10483 C1 NAG Z 702 28.025 104.269 -27.569 0.50 92.75 C
ATOM 3736 CB THR X 486 -6.552 86.240 7.922 1.00115.15 C
ATOM 3737 OG1 THR X 486 -6.756 85.289 6.866 1.00114.94 O
ATOM 3738 CG2 THR X 486 -7.867 86.404 8.636 1.00115.11 C
ATOM 3739 OXT THR X 486 -4.978 88.874 9.140 1.00115.13 O
HETATM10351 C1 NAG Y 203 33.671 152.852 -39.456 0.50 90.22 C
HETATM10639 C2 FUC C 402 -48.168 162.221 -22.404 0.50103.03 C
For each block of lines starting with HETATM*, I would like to change column 5 to match that of the previous ATOM block. It means that for the first HETATM* block both B and Z will change to A, whereas for the second HETATM* block both Y and C will change to X.
A second question, which I do not really need answered and ask only out of curiosity: how would I split the file after each line starting with HETATM*, but only if the next line is ATOM?
Try this:
awk '{
    if( $1 == "ATOM" ) {
        col5=$5;
    }
    else if( match($1,/HETATM[0-9]*/)) {
        $5=col5;
    }
    print
}' < infile
awk '$1=="ATOM"{c=$5}/^HETATM/{ $5=c };1' file
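To try to preserve spacing, set the field separator explicitly (note that assigning to $5 still makes awk rebuild the line with single-space separators):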
awk -F" " '/^ATOM/{c=$5}/^HETATM/{ $5=c };1' file
Here is my solution, which solves the first problem (replacing the fifth field) while preserving white spaces:
$1=="ATOM" {
    fifthField=$5
    # Block to determine which index position field #5 is
    fifthField_index = 1
    for (i = 0; i < 4; i++) {
        # Skip until white space
        for (; substr($0, fifthField_index, 1) != " "; fifthField_index++) { }
        # Skip white spaces
        for (; substr($0, fifthField_index, 1) == " "; fifthField_index++) { }
    }
    print;next
}
/^HETATM/ {
    before_fifthField = substr($0, 1, fifthField_index - 1)
    after_fifthField = substr($0, fifthField_index + 1, length($0))
    print before_fifthField fifthField after_fifthField
    next
}
1
It is not the most elegant solution, but it works. This solution assumes that the fifth field is a single character.
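The same position-based idea can be sketched in Python. In a record such as HETATM10351 there is no space between the record name and the serial number, so counting whitespace-separated fields can land on different columns for ATOM and HETATM lines; working with a fixed character position avoids that. The sketch below is only an illustration under stated assumptions: the function name propagate_chain is made up, and the chain identifier is assumed to sit in column 22 (0-based index 21), as in the fixed-width PDB layout, which the whitespace-collapsed sample above may not reproduce exactly.

```python
def propagate_chain(lines, chain_col=21):
    """Copy the chain ID of the latest ATOM record onto the HETATM records after it.

    chain_col is the 0-based index of the chain identifier; 21 (column 22)
    is an assumption based on the fixed-width PDB layout.
    """
    chain = None
    out = []
    for line in lines:
        if line.startswith("ATOM") and len(line) > chain_col:
            chain = line[chain_col]  # remember the chain of the current ATOM block
        elif line.startswith("HETATM") and chain is not None and len(line) > chain_col:
            # Replace exactly one character so all other spacing is preserved
            line = line[:chain_col] + chain + line[chain_col + 1:]
        out.append(line)
    return out


if __name__ == "__main__":
    # Hypothetical fixed-width records, shortened for the example
    records = [
        "ATOM   3739  OXT THR A 486",
        "HETATM10351  C1  NAG B 203",
        "HETATM10483  C1  NAG Z 702",
    ]
    for record in propagate_chain(records):
        print(record)
```

HETATM lines that appear before any ATOM line are left unchanged, since there is no chain to copy yet.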

Resources