Multiline RegEx: match last occurrence only - ruby

I have a string containing a Python stack trace like this (with some irrelevant text before and after):
Traceback (most recent call last):
File "/workspace/r111.py", line 232, in test_assess
exec(code)
File "a111.py", line 17, in
def reset(self):
File "/workspace/r111.py", line 123, in failed
raise AssertionError(msg)
AssertionError: Dein Programm funktioniert nicht. Python sagt:
Traceback (most recent call last):
File "a111.py", line 6, in
File "/workspace/r111.py", line 111, in runcaptured
exec(c, variables)
File "", line 1, in
ZeroDivisionError: division by zero
Now I want to extract the line in which the error occurred (1 extracted from File "", line 1) using a multiline RegEx (in Ruby).
/File ".*", line ([0-9]+)/ works nicely, but matches all occurrences. I only want the last. Iterating over the matches in the target environment is not a valid solution, as I can't change the business logic there.

You may use
/(?m:.*)(?-m:File ".*", line ([0-9]+))/
Details
(?m:.*) - a modifier group with the multiline flag on, so the dot matches any char, including line break chars. Being greedy, it consumes as many chars as possible, which pushes the match of the subsequent subpattern to its last occurrence
(?-m:File ".*", line ([0-9]+)) - another modifier group with the multiline flag off, so the dot matches any char except line break chars:
File - a literal substring with a space after it
".*" - a double quote, any zero or more chars other than line breaks, and then another double quote
, line - a comma, a space, the "line" substring, and a space
([0-9]+) - Group 1, capturing one or more digits.
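As a minimal sketch of how this behaves in Ruby, the snippet below applies the pattern to a shortened copy of the trace from the question. The names trace and LAST_FRAME are made up for illustration; String#[] with a capture index returns Group 1 directly.

```ruby
# Shortened copy of the trace from the question (illustrative input).
trace = <<~TRACE
  Traceback (most recent call last):
  File "a111.py", line 6, in
  File "/workspace/r111.py", line 111, in runcaptured
  exec(c, variables)
  File "", line 1, in
  ZeroDivisionError: division by zero
TRACE

# (?m:.*) greedily swallows everything, line breaks included, then backtracks
# until the rest of the pattern matches, so the last "File ..., line N" wins.
LAST_FRAME = /(?m:.*)(?-m:File ".*", line ([0-9]+))/

line_no = trace[LAST_FRAME, 1]
puts line_no # prints 1
```

Without the leading (?m:.*) group, the same lookup would return the first frame's line number (6 in this shortened trace) instead.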

Related

Sort two text files with its indented text aligned to it

I would like to compare two of my log files, generated before and after an implementation, to see whether it has impacted anything. However, the order of the log entries is not always the same. Since the log file also has multiple indented lines, a plain sort reorders every line individually. I would like to keep each child together with its parent. The indentation uses spaces, not tabs.
Any help would be greatly appreciated. I am fine with either a Windows or a Linux solution.
Eg of the file:
#This is a sample code
Parent1 to be verified
 Child1 to be verified
 Child2 to be verified
  Child21 to be verified
  Child23 to be verified
  Child22 to be verified
   Child221 to be verified
 Child4 to be verified
 Child5 to be verified
  Child53 to be verified
  Child52 to be verified
   Child522 to be verified
   Child521 to be verified
 Child3 to be verified
I am posting another answer here to sort it hierarchically, using python.
The idea is to attach the parents to the children to make sure that the children under the same parent are sorted together.
See the python script below:
"""Attach parent to children in an indentation-structured text"""
from typing import Tuple, List
import sys
# A unique separator to separate the parent and child in each line
SEPARATOR = '#'
# The indentation unit for one nesting level
INDENT = ' '
def parse_line(line: str) -> Tuple[int, str]:
    """Parse a line into indentation level and its content
    with indentation stripped
    Args:
        line (str): One of the lines from the input file, with newline ending
    Returns:
        Tuple[int, str]: The indentation level and the content with
            indentation stripped.
    Raises:
        ValueError: If the line is incorrectly indented.
    """
    # strip the leading white spaces
    lstripped_line = line.lstrip()
    # get the indentation (slice by length so an all-whitespace line works too)
    indent = line[:len(line) - len(lstripped_line)]
    # Let's check if the indentation is correct
    # meaning it should be N * INDENT
    n = len(indent) // len(INDENT)
    if INDENT * n != indent:
        raise ValueError(f"Wrong indentation of line: {line}")
    return n, lstripped_line.rstrip('\r\n')
def format_text(txtfile: str) -> List[str]:
    """Format the text file by attaching the parent to its children
    Args:
        txtfile (str): The text file
    Returns:
        List[str]: A list of formatted lines
    """
    formatted = []
    par_indent = par_line = None
    with open(txtfile) as ftxt:
        for line in ftxt:
            # get the indentation level and line without indentation
            indent, line_noindent = parse_line(line)
            # level 1 parents
            if indent == 0:
                par_indent = indent
                par_line = line_noindent
                formatted.append(line_noindent)
            # children
            elif indent > par_indent:
                formatted.append(par_line +
                                 SEPARATOR * (indent - par_indent) +
                                 line_noindent)
                par_indent = indent
                par_line = par_line + SEPARATOR + line_noindent
            # siblings or dedentation
            else:
                # We just need first `indent` parts of parent line as our prefix
                prefix = SEPARATOR.join(par_line.split(SEPARATOR)[:indent])
                formatted.append(prefix + SEPARATOR + line_noindent)
                par_indent = indent
                par_line = prefix + SEPARATOR + line_noindent
    return formatted
def sort_and_revert(lines: List[str]):
    """Sort the formatted lines and revert the leading parents
    into indentations
    Args:
        lines (List[str]): list of formatted lines
    Prints:
        The sorted and reverted lines
    """
    sorted_lines = sorted(lines)
    for line in sorted_lines:
        if SEPARATOR not in line:
            print(line)
        else:
            leading, _, orig_line = line.rpartition(SEPARATOR)
            print(INDENT * (leading.count(SEPARATOR) + 1) + orig_line)
def main():
    """Main entry"""
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <file>")
        sys.exit(1)
    formatted = format_text(sys.argv[1])
    sort_and_revert(formatted)
if __name__ == "__main__":
    main()
Let's save it as format.py, and we have a test file, say test.txt:
parent2
 child2-1
  child2-1-1
 child2-2
parent1
 child1-2
  child1-2-2
  child1-2-1
 child1-1
Let's test it:
$ python format.py test.txt
parent1
 child1-1
 child1-2
  child1-2-1
  child1-2-2
parent2
 child2-1
  child2-1-1
 child2-2
If you wonder how the format_text function formats the text, here are the intermediate results, which also explain why the file ends up sorted the way we want:
parent2
parent2#child2-1
parent2#child2-1#child2-1-1
parent2#child2-2
parent1
parent1#child1-2
parent1#child1-2#child1-2-2
parent1#child1-2#child1-2-1
parent1#child1-1
You can see that each child has its parents attached, all the way up to the root, so the children under the same parent are sorted together.
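The same idea can be sketched in Ruby without a separator character: keep each line's full chain of ancestors as an array and sort by that array. This is only a sketch; sort_hierarchically and its indent parameter are hypothetical names, and it assumes the input is consistently indented with exactly one indent unit per nesting level.

```ruby
# Sort an indentation-structured text so that children stay under their parent.
# Assumption: each nesting level adds exactly one `indent` unit.
def sort_hierarchically(text, indent: " ")
  path = []
  keyed = text.lines(chomp: true).map do |line|
    stripped = line.lstrip
    level = (line.length - stripped.length) / indent.length
    # Keep the ancestors above this level, then append the current line.
    path = path.first(level) + [stripped]
    path
  end
  # Arrays compare element by element, so a parent sorts before its children.
  keyed.sort.map { |p| indent * (p.length - 1) + p.last }.join("\n")
end

input = "parent2\n child2-1\n  child2-1-1\n child2-2\n" \
        "parent1\n child1-2\n  child1-2-2\n  child1-2-1\n child1-1"
puts sort_hierarchically(input)
```

Because the sort key is an array rather than a joined string, no character has to be reserved as a separator.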
Short answer (Linux solution):
sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'
Test it out:
test.txt
parent2
 child2-1
  child2-1-1
 child2-2
parent1
 child1-1
 child1-2
  child1-2-1
$ sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'
parent1
 child1-1
 child1-2
  child1-2-1
parent2
 child2-1
  child2-1-1
 child2-2
Explanation:
The idea is to replace each newline that is followed by an indentation space with a non-newline character, so that every parent and its children end up on a single line that sort treats as one unit. The replacement character has to be unique in your file (I used # here as an example; if it already occurs in your file, use another character or even a string), because it has to be turned back into a newline plus indentation afterwards.
About sed command:
:a - create a label 'a'
N - append the next line to the pattern space
$!ba - if this is not the last line ($!), branch (go to) label 'a' (ba)
s - substitute
/\n / - regex for a newline followed by a space
/#/ - a unique character to replace the newline and space (if it is not unique in your file, use other characters or even a string)
/g - global match (as many times as it can)
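For comparison, here is a rough Ruby sketch of the same block-wise approach. It is not a translation of the sed command: sort_top_level_blocks is a hypothetical helper that groups each unindented line with the indented lines that follow it and sorts those groups, assuming every child line starts with a space. Like the sed pipeline, it only reorders the top-level blocks and leaves the order inside each block untouched.

```ruby
# Sort only the top-level blocks; lines inside a block keep their order.
# Assumption: every child line starts with a space.
def sort_top_level_blocks(text)
  text.lines(chomp: true)
      .slice_before { |line| !line.start_with?(" ") } # new block at each parent
      .map { |block| block.join("\n") }
      .sort
      .join("\n")
end

input = "parent2\n child2-1\n  child2-1-1\n child2-2\n" \
        "parent1\n child1-1\n child1-2\n  child1-2-1"
puts sort_top_level_blocks(input)
```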

Analyzing protein sequences with the ProtParam module

I'm fairly new to Biopython. Right now, I'm trying to compute protein parameters from several protein sequences (more than 100) in FASTA format. However, I've found it difficult to parse the sequences correctly.
This is the code I'm using:
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis
input_file = open("/Users/matias/Documents/Python/DOE.fasta", "r")
for record in SeqIO.parse(input_file, "fasta"):
    my_seq = str(record.seq)
    analyse = ProteinAnalysis(my_seq)
    print(analyse.molecular_weight())
But I'm getting this error message:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/SeqUtils/__init__.py", line 438, in molecular_weight
weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/SeqUtils/__init__.py", line 438, in <genexpr>
weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
KeyError: '\\'
Printing each sequence as string shows me every seq has a "\" at the end, but so far I haven't been able to remove it. Any ideas would be very appreciated.
That really shouldn't be there in your file, but if you can't get a clean input file, you can use my_seq = str(record.seq).rstrip('\\') to remove it at runtime.

Ruby question mark in filename

I have a little piece of ruby that creates a file containing tsv content with 2 columns, a date, and a random number.
#!/usr/bin/ruby
require 'date'
require 'set'
startDate=Date.new(2014,11,1)
endDate=Date.new(2015,9,1)
dates=File.new("/PATH_TO_FILE/dates_randoms.tsv","w+")
rands=Set.new
while startDate <= endDate do
  random=rand(1000)
  while rands.add?(random).nil? do
    random=rand(1000)
  end
  dates.puts("#{startDate.to_s.gsub("-","")}\t#{random}")
  startDate=startDate+1
end
Then, from another program, I read this file and create a file named after the random number:
dates_file=File.new(DATES_FILE_PATH,"r")
dates_file.each_line do |line|
  parts=line.split("\t")
  random=parts.at(1)
  table=File.new("#{TMP_DIR}#{random}.tsv","w")
end
But when I check the result, I see a file named 645?.tsv, for example.
I initially thought the cause was the line separator in the tsv file (the one containing the date and the random number), but both programs run on the same Unix filesystem, so it is not a DOS-to-Unix conversion issue.
Some lines from the file:
head dates_randoms.tsv
20141101 356
20141102 604
20141103 680
20141104 668
20141105 995
20141106 946
20141107 354
20141108 234
20141109 429
20141110 384
Any advice?
parts = line.split("\t")
random = parts.at(1)
line there will contain a trailing newline char. So for a line
"whatever\t1234\n"
random will contain "1234\n". That newline char then becomes a part of filename and you see it as a question mark. The simplest workaround is to do some sanitization:
random = parts.at(1).chomp
# alternatively use .strip if you want to remove whitespaces
# from beginning of the value too
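A small sketch of the difference, using a made-up line in the same shape as the file above (the variable names raw and clean are only for illustration):

```ruby
# One line as each_line would yield it: the newline is still attached.
line = "20141101\t356\n"

raw   = line.split("\t").at(1)  # second column, newline included
clean = raw.chomp               # trailing line terminator removed

puts raw.inspect    # prints "356\n"
puts clean.inspect  # prints "356"
puts "#{clean}.tsv" # prints 356.tsv
```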

Converting a multi line string to an array in Ruby using line breaks as delimiters

I would like to turn this string
"P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
into an array that looks like this in Ruby:
["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Using split doesn't return what I would like because of the line breaks.
This is one way to deal with blank lines:
string.split(/\n+/)
For example,
string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
string.split(/\n+/)
#=> ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS",
# "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
To accommodate files created under Windows (having line terminators \r\n) replace the regular expression with /(?:\r?\n)+/.
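A quick sketch of that Windows-friendly variant, run on a made-up string that mixes \r\n, a blank line, and plain \n terminators:

```ruby
# Illustrative input: one CRLF terminator, one blank line, one plain LF.
string = "P07091 MMCNEFFEG\r\nP06870 IVGGWECEQHS\n\nSP0A8M0 VVPVADVLQGR\nP01019 VIHNESTCEQ"

# The non-capturing group keeps the separators out of the result array.
lines = string.split(/(?:\r?\n)+/)
p lines
```

The group has to be non-capturing: with a capturing group, split would also return the matched separators as array elements.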
I like to use this as a pretty generic method for handling newlines and returns:
lines = string.split(/\n+|\r+/).reject(&:empty?)
string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
Using CSV::parse
require 'csv'
CSV.parse(string).flatten
# => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Another way, using String#each_line:
ar = []
string.each_line { |line| ar << line.strip unless line == "\n" }
ar # => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Building off of @Martin's answer:
lines = string.split("\n").reject(&:blank?)
That'll give you only the non-blank lines. Note that blank? comes from ActiveSupport (Rails), not from plain Ruby.
Split can take a parameter in the form of the character to use to split, so you can do:
lines = string.split("\n")
I think it should be noted that in some situations, line breaks can include not only newlines (\n) but also carriage returns (\r) and that there could potentially be any combination or quantity thereof. Let's take the following string for example:
str = "Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4... \n
Useful Line 5\r \n
Useful Line 6\n\r
Useful Line 7\n\r\n\r
Useful Line 8 \r\n\r\n
Useful Line 9\r\r\r Useful Line 10\n\n\n\n\nUseful Line 11 \r Useful Line 12"
To deal with all instances of \n and \r, I would do the following to replace all instances of \r with \n using gsub, and then I would combine all consecutive instances of \n using squeeze(arg):
str.gsub("\r", "\n").squeeze("\n")
which would result in :
#=>
"Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4...
Useful Line 5
Useful Line 6
Useful Line 7
Useful Line 8
Useful Line 9
Useful Line 10
Useful Line 11
Useful Line 12"
...which brings me to our next issue. Sometimes those extra line breaks contain unwanted whitespace and not truly blank or empty lines. To deal with not only line breaks but also unwanted empty lines, I would add the each_line, reject, and strip method like so:
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join
which would result in the desired string:
#=>
Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4...
Useful Line 5
Useful Line 6
Useful Line 7
Useful Line 8
Useful Line 9
Useful Line 10
Useful Line 11
Useful Line 12
Now more specifically to the OP, we could then simply use split("\n") to finish it all off (as was already mentioned by others):
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join.split("\n")
or we could skip straight to the desired array by calling split("\n") before reject, which makes each_line and join unnecessary:
str.gsub("\r", "\n").squeeze("\n").split("\n").reject{|x| x.strip == ""}
both of which would result in:
#=>
["Useful Line 1 ....", " Useful Line 2", "Useful Line 3", " Useful Line 4... ", "Useful Line 5", " Useful Line 6", "Useful Line 7", " Useful Line 8 ", "Useful Line 9", " Useful Line 10", "Useful Line 11 ", " Useful Line 12"]
NOTE:
You may also want to strip off leading and trailing whitespace from each line in which case we could replace .join.split("\n") with .map(&:strip) like so:
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.map(&:strip)
or
str.gsub("\r", "\n").squeeze("\n").split("\n").reject{|x| x.strip == ""}.map(&:strip)
which would both result in:
#=>
["Useful Line 1 ....", "Useful Line 2", "Useful Line 3", "Useful Line 4...", "Useful Line 5", "Useful Line 6", "Useful Line 7", "Useful Line 8", "Useful Line 9", "Useful Line 10", "Useful Line 11", "Useful Line 12"]
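If every line should be stripped anyway, the chain can probably be collapsed further. The sketch below is one possible shortcut, not the method used above: clean_lines is a hypothetical helper that splits on any run of \r and \n, strips each piece, and drops the pieces that end up empty.

```ruby
# Split on any run of CR/LF characters, trim each piece, drop empty ones.
def clean_lines(str)
  str.split(/[\r\n]+/).map(&:strip).reject(&:empty?)
end

# Made-up sample with mixed terminators and a whitespace-only line.
sample = "Useful Line 1 \r\n\r\n Useful Line 2\r \n\n\rUseful Line 3"
p clean_lines(sample)
```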

how to replace last comma in a line with a string in unix

I am trying to insert a string in every line of a file except the first and last lines, but I have not been able to get it done. Can anyone give me a clue how to achieve this? Thanks in advance.
How can I replace the last comma in a line with a string xxxxx (except for the first and last rows)
using Unix tools?
Original File
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,01
99,SRI,FF,28
Expected File
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,xxxxx231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,xxxxx231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,xxxxx231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,xxxxx232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,xxxxx232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,xxxxx232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
awk can be quite useful for manipulating data files like this one. Here's a one-liner that does more-or-less what you want. It prepends the string "xxxxx" to the twelfth field of each input line that has at least twelve fields.
$ awk 'BEGIN{FS=OFS=","}NF>11{$12="xxxxx"$12}{print}' 16006747.txt
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
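For readers who prefer Ruby, the same rule (prepend the string to the twelfth field of every line that has at least twelve fields) might be sketched as follows. prefix_twelfth_field is a hypothetical helper, and the three sample rows are copied from the file above.

```ruby
# Prepend `marker` to field 12 of every line with at least 12 comma-separated
# fields; shorter lines (the header and trailer rows) pass through unchanged.
def prefix_twelfth_field(lines, marker = "xxxxx")
  lines.map do |line|
    fields = line.chomp.split(",", -1) # -1 keeps trailing empty fields
    fields[11] = marker + fields[11] if fields.length > 11
    fields.join(",")
  end
end

rows = [
  "10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,02BLYPO",
  "10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,01",
  "99,SRI,FF,28"
]
puts prefix_twelfth_field(rows)
```

As with the awk version, the first and last rows are skipped because of their field count, not because of their position in the file.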
