Ruby multi-line regex - ruby

I have a Ruby multi-line string (called efixes) that looks like:
ID STATE LABEL INSTALL TIME UPDATED BY ABSTRACT
=== ===== ========== ================= ========== ======================================
1 S hayo32.02 xxxxxxx xxxxxxxx xxxxxxxxxxxxxxx
2 S 23434.23 xxxxxxx xxxxxxxx xxxxxxxxxxxxxxx
STATE codes:
S = STABLE
M = MOUNTED
U = UNMOUNTED
Q = REBOOT REQUIRED
B = BROKEN
I = INSTALLING
R = REMOVING
T = TESTED
P = PATCHED
N = NOT PATCHED
SP = STABLE + PATCHED
SN = STABLE + NOT PATCHED
QP = BOOT IMAGE MODIFIED + PATCHED
QN = BOOT IMAGE MODIFIED + NOT PATCHED
RQ = REMOVING + REBOOT REQUIRED
I only want to display the lines that start with a number, but I am having trouble: my pattern doesn't seem to match. I found this solution here (which I don't fully understand yet):
efixes_array = efixes.split("\n").select{|x| /\A[0-9]/.match(x)}
io.puts efixes_array.collect{|x| x.scan(/\A[0-9]/)}.flatten
It only matches the numbers, but I want to display the entire line. As the end result, I want to display what is under the "LABEL" column.

This line from your example code
efixes.split("\n").select{|x| /\A[0-9]/.match(x)}
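returns an array with all lines that start with a number. The second line of your code is what throws the rest away: x.scan(/\A[0-9]/) returns only the matched leading digit, so print efixes_array itself (io.puts efixes_array) to display the entire lines. For the LABEL column, take the third whitespace-separated field of each line, for example x.split[2].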


How to improve running time of my binary search code in peripheral parts?

I am taking this great Coursera course: https://www.coursera.org/learn/algorithmic-toolbox . In the fourth week, we have an assignment related to binary trees.
I think I did a good job. I wrote a binary search that solves this problem using recursion in Python 3. Here is my code:
#python3
data_in_sequence = list(map(int,(input().split())))
data_in_keys = list(map(int,(input()).split()))
original_array = data_in_sequence[1:]
data_in_sequence = data_in_sequence[1:]
data_in_keys = data_in_keys[1:]
def binary_search(data_in_sequence,target):
    answer = 0
    sub_array = data_in_sequence
    #print("sub_array",sub_array)
    if not sub_array:
        # print("sub_array",sub_array)
        answer = -1
        return answer
    #print("target",target)
    mid_point_index = (len(sub_array)//2)
    #print("mid_point", sub_array[mid_point_index])
    beg_point_index = 0
    #print("beg_point_index",beg_point_index)
    end_point_index = len(sub_array)-1
    #print("end_point_index",end_point_index)
    if sub_array[mid_point_index]==target:
        #print ("final midpoint, ", sub_array[mid_point_index])
        #print ("original_array",original_array)
        #print("sub_array[mid_point_index]",sub_array[mid_point_index])
        #print ("answer",answer)
        answer = original_array.index(sub_array[mid_point_index])
        return answer
    elif target>sub_array[mid_point_index]:
        #print("target num higher than current midpoint")
        beg_point_index = mid_point_index+1
        sub_array=sub_array[beg_point_index:]
        end_point_index = len(sub_array)-1
        #print("sub_array",sub_array)
        return binary_search(sub_array,target)
    elif target<sub_array[mid_point_index]:
        #print("target num smaller than current midpoint")
        sub_array = sub_array[:mid_point_index]
        return binary_search(sub_array,target)
    else:
        return None
def bin_search_over_seq(data_in_sequence,data_in_keys):
    final_output = ""
    for key in data_in_keys:
        final_output = final_output + " " + str(binary_search(data_in_sequence,key))
    return final_output
print (bin_search_over_seq(data_in_sequence,data_in_keys))
I usually get the correct output. For instance, if I input:
5 1 5 8 12 13
5 8 1 23 1 11
I get the correct index of each key in the sequence (the first line), or -1 if the key is not in the sequence:
2 0 -1 0 -1
However, my code does not meet the time limit:
Failed case #4/22: time limit exceeded (Time used: 13.47/10.00, memory used: 36696064/536870912.)
I don't think this is caused by my implementation of binary search (I think it is right). Instead, I suspect some inefficiency in a peripheral part of the code, such as the way I build the final output. However, the way I present the final answer does not seem to be really "heavy"... I am lost.
Am I not seeing something? Is there another inefficiency I am not seeing? How can I solve this? Just trying to present the final result in a faster way?
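One likely cause, offered as a sketch rather than a verified fix for the grader: every recursive call copies part of the list through slicing (sub_array[beg_point_index:] and sub_array[:mid_point_index]), and a successful search then calls original_array.index(...), which scans the list linearly. Both are O(n) operations, so the search as a whole is no longer logarithmic. The version below keeps two indices into the same list instead of slicing. The names lo, hi and search_all are illustrative and not from the original code.

```python
def binary_search(sequence, target):
    """Return an index of target in the sorted list sequence, or -1 if absent."""
    lo, hi = 0, len(sequence) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sequence[mid] == target:
            return mid          # mid is already an index into the full list
        if sequence[mid] < target:
            lo = mid + 1        # discard the left half without copying
        else:
            hi = mid - 1        # discard the right half without copying
    return -1


def search_all(sequence, keys):
    """Build the output with one join instead of repeated string concatenation."""
    return " ".join(str(binary_search(sequence, key)) for key in keys)


# Same data as the example above, with the leading counts already removed.
print(search_all([1, 5, 8, 12, 13], [8, 1, 23, 1, 11]))
```

Because mid always refers to a position in the full list, no lookup with index() is needed afterwards.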

Extract multiple protein sequences from a Protein Data Bank along with Secondary Structure

I want to extract protein sequences and their corresponding secondary structure from any Protein Data Bank, say RCSB. I just need short sequences and their secondary structure. Something like:
ATRWGUVT Helix
It is fine even if the sequences are long, but I want a tag at the end that denotes the secondary structure. Is there any programming tool or anything else available for this?
As shown above, I only want this minimal information. How can I achieve this?
import sys
from Bio.PDB import *
from distutils import spawn
Extract sequence:
def get_seq(pdbfile):
    p = PDBParser(PERMISSIVE=0)
    structure = p.get_structure('test', pdbfile)
    ppb = PPBuilder()
    seq = ''
    for pp in ppb.build_peptides(structure):
        seq += pp.get_sequence()
    return seq
Extract secondary structure with DSSP as explained earlier:
def get_secondary_struc(pdbfile):
    # get secondary structure info for whole pdb.
    if not spawn.find_executable("dssp"):
        sys.stderr.write('dssp executable needs to be in folder')
        sys.exit(1)
    p = PDBParser(PERMISSIVE=0)
    ppb = PPBuilder()
    structure = p.get_structure('test', pdbfile)
    model = structure[0]
    dssp = DSSP(model, pdbfile)
    count = 0
    sec = ''
    for residue in model.get_residues():
        count = count + 1
        # print residue,count
        a_key = list(dssp.keys())[count - 1]
        sec += dssp[a_key][2]
    print(sec)
    return sec
This should print both sequence and secondary structure.
You can use DSSP.
The output of DSSP is explained extensively under 'explanation'. The very short summary of the output is:
H = α-helix
B = residue in isolated β-bridge
E = extended strand, participates in β ladder
G = 3-helix (3₁₀ helix)
I = 5 helix (π-helix)
T = hydrogen bonded turn
S = bend
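To get from the two strings to the "ATRWGUVT Helix" style of output asked for in the question, one option is to cut the sequence wherever the DSSP code changes. This is only a sketch: it assumes the sequence and the secondary-structure string are the same length and aligned residue by residue, which may not hold if DSSP skips residues. The names DSSP_NAMES and segments are hypothetical, and the example strings are made up.

```python
from itertools import groupby

# Labels taken from the short DSSP summary above; any other code
# (for example '-' for "no assignment") falls back to "other".
DSSP_NAMES = {
    "H": "alpha-helix",
    "B": "isolated beta-bridge",
    "E": "extended strand",
    "G": "3-10 helix",
    "I": "pi-helix",
    "T": "hydrogen-bonded turn",
    "S": "bend",
}


def segments(sequence, sec_struct):
    """Split sequence into runs that share one secondary-structure code."""
    if len(sequence) != len(sec_struct):
        raise ValueError("sequence and secondary structure must be aligned")
    result = []
    pos = 0
    for code, run in groupby(sec_struct):
        length = len(list(run))
        fragment = sequence[pos:pos + length]
        result.append((fragment, DSSP_NAMES.get(code, "other")))
        pos += length
    return result


# Made-up example strings, not real PDB data.
for fragment, label in segments("ATRWGUVTKL", "HHHHHHHHTT"):
    print(fragment, label)
```

With real data, str(get_seq(pdbfile)) and get_secondary_struc(pdbfile) from the code above would be the inputs, provided their lengths agree.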

Numeral.JS zeroFormat includes $ and % symbol at end result

Why does zero formatting include $ and % symbols in the formatted result?
numeral.js version is 1.5.3
var number = numeral(0);
numeral.zeroFormat('N/A');
var zero = number.format('0.0%')
// 'N/A%'
var zero = number.format('$0.0')
// '$N/A'
// What I expect is 'N/A'
Is it a bug or am I missing something?
Reproduction of the problem: https://jsfiddle.net/wbuu53qr/
I quickly found the solution. This problem occurs in the older version.
Upgrading to the latest version fixes it:
var number = numeral(0);
numeral.zeroFormat('N/A');
var zero = number.format('0.0%')
// 'N/A'
https://jsfiddle.net/4jz4vp5h/

Dealing with underflow while calculating GMM parameters using EM

I am currently running training in MATLAB on a matrix of log-spectrum samples, and I am constantly dealing with underflow problems. I understand that I need to work with logs in order to deal with the underflow.
I am still struggling with underflow, though, when I calculate the mean (mue): because it is negative I can't work with logs, so I need the real values, which underflow.
These are the equations I am working with:
In the MATLAB code I calculate log_tau in order to avoid underflow, but when calculating mue I need exp(log(tau)), which goes to zero.
I am attaching the relevant MATLAB code.
(In the code, the variable called alpha is tau.)
for i = 1 : 50
    log_c = Logsum(log_alpha,1) - log(N);
    c = exp(log_c);
    mue = DataMat*alpha./(repmat(exp(Logsum(log_alpha,1)),FrameSize,1));
    log_abs_mue = log(abs(mue));
    log_SigmaSqr = log((DataMat.^2)*alpha) - repmat(Logsum(log_alpha,1),FrameSize,1) - 2*log_abs_mue;
    SigmaSqr = exp(log_SigmaSqr);
    for j=1:N
        rep_DataMat(:,:,j) = repmat(DataMat(:,j),1,M);
        log_gamma(j,:) = log_c - 0.5*(FrameSize*log(2*pi)+sum(log_SigmaSqr)) + sum((rep_DataMat(:,:,j) - mue).^2./(2*SigmaSqr));
    end
    log_alpha = log_gamma - repmat(Logsum(log_gamma,2),1,M);
    alpha = exp(log_alpha);
end
c = exp(log_c);
SigmaSqr = exp(log_SigmaSqr);
Does anyone see how I can avoid this, or what needs to be fixed in the code?
What I did was add this line to the MATLAB code:
mue(isnan(mue))=0; %fix 0/0 problem
and this one:
SigmaSqr(SigmaSqr==0)=1;%fix if mue_k = x_k
I am not sure if this is the best solution, but it seems to work...
Does anyone have a better idea?
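One commonly used alternative to patching the NaN values afterwards is to subtract the largest log-weight before exponentiating (the log-sum-exp trick). The mean is a ratio of two weighted sums, so a common scale factor cancels and the data values themselves can stay negative: only the weights are handled in log space. The sketch below is in plain Python rather than MATLAB and covers a single component and a single dimension. The function names and the numbers are illustrative, not taken from the code above.

```python
import math


def logsumexp(log_values):
    """Compute log(sum(exp(v))) without underflow by factoring out the maximum."""
    m = max(log_values)
    if m == -math.inf:          # every weight is exactly zero
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in log_values))


def weighted_mean_from_log_weights(log_weights, samples):
    """Mean of samples weighted by exp(log_weights), computed stably.

    Dividing numerator and denominator by exp(max) leaves the ratio
    unchanged, and the largest scaled weight is exactly 1.
    """
    m = max(log_weights)
    scaled = [math.exp(lw - m) for lw in log_weights]
    return sum(w * x for w, x in zip(scaled, samples)) / sum(scaled)


# exp(-1000) underflows to 0.0 in double precision, so the naive
# formula would give 0/0 here; the shifted version does not.
log_weights = [-1000.0, -1001.0, -1002.0]
samples = [-2.0, 4.0, 1.0]      # negative data values are fine
print(weighted_mean_from_log_weights(log_weights, samples))
print(logsumexp(log_weights))
```

The same shift can be applied per column in MATLAB by subtracting the column maximum of log_alpha before calling exp.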

How to extract string from large file only if specific string appears previous using Ruby?

I am trying to extract information from a large file and cannot figure out how to extract strings from file lines only when a previous line in the same record within the file has been matched by regex. An example of one record in the file is as follows:
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent: coordinate IM with MENTAL COMPETENCY (IM)
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization, by a patient or research subject, etc,...
This file contains over 20,000 records like this example. I want to identify a small percent of those records using the "MH" field. In this example, I want to find "Informed Consent", and then use regex to extract the information in the FX, AN, and MS fields only within that record. So far, I have opened the file, accessed the hash that the MH terms are stored in, and been able to extract those terms from the records in the file. I also have a functioning regex that identifies the content in the "FX" field.
File.open('mesh_descriptor.bin').each do |file_line|
  file_line = file_line.chomp
  # read each key of candidate_descriptor_keys
  candidate_descriptor_keys.each do |cand_term|
    if file_line =~ /^MH\s=\s(#{cand_term})$/
      mesh_header = $1
      puts "MH from Mesh Descriptor file is: #{mesh_header}"
      if file_line =~ /^FX\s=\s(.*)$/
        see_also = $1
        puts " See_Also from Descriptor file is: #{see_also}"
      end
    end
  end
end
The hash contains the following MH (keys):
candidate_descriptor_keys = ["Body Weight", "Obesity", "Thinness", "Fetal Weight", "Overweight"]
I had success extracting "FX" when I put the statement outside of the "if" statement to extract "MH", but all of the "FX" from the whole file were retrieved - not what I need. I thought putting the "if" statement for "FX" within the previous "if" statement would restrict the results to only those found when the first statement is true, but I am getting no results (also no errors) with this strategy. What I would like as a result is:
> Informed Consent
> Disclosure
> Mental Competency
> Therapeutic Misconception
> Treatment Refusal
as well as the strings within the "AN" and "MS" fields for only those records matching "MH". Any suggestions would be helpful!
I think this may be what you are looking for, but if not, let me know and I will change it. Look especially at the very end to see if that is the sort of output (for input having two records, both with an "MH" field) you want. I will also add an "explanation" section at the end once I have understood your question correctly.
I have assumed that each record begins with
*NEW RECORD
and you wish to identify all lines beginning "MH" whose field is one of the elements of:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
and for each match, you would like to print the contents of the lines for the same record that begin with "FX", "AN" and "MS".
Code
NEW_RECORD_MARKER = "*NEW RECORD"
def getem(fname, candidate_descriptor_keys)
  line = 0
  found_mh = false
  File.open(fname).each do |file_line|
    file_line = file_line.strip
    case
    when file_line == NEW_RECORD_MARKER
      puts # space between records
      found_mh = false
    when found_mh == false
      candidate_descriptor_keys.each do |cand_term|
        if file_line =~ /^MH\s=\s(#{cand_term})$/
          found_mh = true
          puts "MH from line #{line} of file is: #{cand_term}"
          break
        end
      end
    when found_mh
      ["FX", "AN", "MS"].each do |des|
        if file_line =~ /^#{des}\s=\s(.*)$/
          see_also = $1
          puts " Line #{line} of file is: #{des}: #{see_also}"
        end
      end
    end
    line += 1
  end
end
Example
Let's begin by creating a file, starting with a "here document" that contains two records:
records =<<_
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization
*NEW RECORD
MH = Obesity
AQ = ES HI LJ PX SN ST
ENTRY = Obesity
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = 1st FX
FX = 2nd FX
AN = Only AN
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Only MS
_
If you puts records, you will see it is just a string. (You'll see that I shortened two of the fields.) Now write it to a file:
File.write('mesh_descriptor', records)
If you wish to confirm the file contents, you could do this:
puts File.read('mesh_descriptor')
We also need to define the array candidate_descriptor_keys:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
We can now execute the method getem:
getem('mesh_descriptor', candidate_descriptor_keys)
MH from line 2 of file is: Informed Consent
Line 7 of file is: FX: Disclosure
Line 8 of file is: FX: Mental Competency
Line 9 of file is: FX: Therapeutic Misconception
Line 10 of file is: FX: Treatment Refusal
Line 13 of file is: AN: competency to consent
Line 16 of file is: MS: Voluntary authorization
MH from line 18 of file is: Obesity
Line 23 of file is: FX: 1st FX
Line 24 of file is: FX: 2nd FX
Line 25 of file is: AN: Only AN
Line 28 of file is: MS: Only MS
