Multiline RegEx: match last occurrence only - ruby
I have a string containing a Python stack trace like this (with some irrelevant text before and after):
Traceback (most recent call last):
File "/workspace/r111.py", line 232, in test_assess
exec(code)
File "a111.py", line 17, in
def reset(self):
File "/workspace/r111.py", line 123, in failed
raise AssertionError(msg)
AssertionError: Dein Programm funktioniert nicht. Python sagt:
Traceback (most recent call last):
File "a111.py", line 6, in
File "/workspace/r111.py", line 111, in runcaptured
exec(c, variables)
File "", line 1, in
ZeroDivisionError: division by zero
Now I want to extract the line in which the error occurred (1 extracted from File "", line 1) using a multiline RegEx (in Ruby).
/File ".*", line ([0-9]+)/ works nicely, but matches all occurrences. I only want the last. Iterating over the matches in the target environment is not a valid solution, as I can't change the business logic there.
You may use
/(?m:.*)(?-m:File ".*", line ([0-9]+))/
Details
(?m:.*) - a modifier group with the multiline flag on, so the dot also matches line break chars. The greedy .* first consumes as many chars as possible and then backtracks, so the rest of the pattern matches at its last occurrence
(?-m:File ".*", line ([0-9]+)) - another modifier group with the multiline flag off, so the dot matches any char but line break chars:
File - a literal substring with a space after it
".*" - a double quote, any zero or more chars other than line breaks, and then another double quote
, line - a comma, a space, the "line" substring and a space
([0-9]+) - Group 1, capturing one or more digits.
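A minimal sketch of how the pattern behaves, using a shortened version of the trace from the question. The variable and constant names are only illustrative.

```ruby
# Shortened sample input; the surrounding "some text" lines stand in for
# the irrelevant text before and after the trace.
trace = <<~TEXT
  some text before
  Traceback (most recent call last):
    File "/workspace/r111.py", line 232, in test_assess
      exec(code)
    File "/workspace/r111.py", line 111, in runcaptured
      exec(c, variables)
    File "", line 1, in
  ZeroDivisionError: division by zero
  some text after
TEXT

# Greedy (?m:.*) swallows everything, then backtracks to the last "File ..." line.
LAST_FILE_LINE = /(?m:.*)(?-m:File ".*", line ([0-9]+))/

# String#[] with a regex and a group index returns that capture (or nil).
last_line_number = trace[LAST_FILE_LINE, 1]
puts last_line_number  # prints 1
```

Because the whole job is done by a single match, no iteration over matches is needed in the calling code.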
Related
Sort two text files with its indented text aligned to it
I would like to compare two log files generated before and after an implementation to see whether it has changed anything. However, the order of the log entries is not always the same. The log files also contain multiple indented lines, so when I sort them, every line is sorted individually. I would like to keep each child together with its parent. The indentation uses spaces, not tabs. Any help would be greatly appreciated. I am fine with either a Windows or a Linux solution. Example of the file: #This is a sample code Parent1 to be verified Child1 to be verified Child2 to be verified Child21 to be verified Child23 to be verified Child22 to be verified Child221 to be verified Child4 to be verified Child5 to be verified Child53 to be verified Child52 to be verified Child522 to be verified Child521 to be verified Child3 to be verified
I am posting another answer here to sort it hierarchically, using Python. The idea is to attach the parents to the children, so that the children under the same parent are sorted together. See the Python script below:

"""Attach parent to children in an indentation-structured text"""
from typing import Tuple, List
import sys

# A unique separator to separate the parent and child in each line
SEPARATOR = '#'
# The indentation
INDENT = ' '


def parse_line(line: str) -> Tuple[int, str]:
    """Parse a line into indentation level and its content
    with indentation stripped

    Args:
        line (str): One of the lines from the input file, with newline ending

    Returns:
        Tuple[int, str]: The indentation level and the content with
            indentation stripped.

    Raises:
        ValueError: If the line is incorrectly indented.
    """
    # strip the leading white spaces
    lstripped_line = line.lstrip()
    # get the indentation
    indent = line[:-len(lstripped_line)]

    # Let's check if the indentation is correct
    # meaning it should be N * INDENT
    n = len(indent) // len(INDENT)
    if INDENT * n != indent:
        raise ValueError(f"Wrong indentation of line: {line}")

    return n, lstripped_line.rstrip('\r\n')


def format_text(txtfile: str) -> List[str]:
    """Format the text file by attaching the parent to its children

    Args:
        txtfile (str): The text file

    Returns:
        List[str]: A list of formatted lines
    """
    formatted = []
    par_indent = par_line = None

    with open(txtfile) as ftxt:
        for line in ftxt:
            # get the indentation level and line without indentation
            indent, line_noindent = parse_line(line)

            # level 1 parents
            if indent == 0:
                par_indent = indent
                par_line = line_noindent
                formatted.append(line_noindent)

            # children
            elif indent > par_indent:
                formatted.append(par_line +
                                 SEPARATOR * (indent - par_indent) +
                                 line_noindent)
                par_indent = indent
                par_line = par_line + SEPARATOR + line_noindent

            # siblings or dedentation
            else:
                # We just need first `indent` parts of parent line as our prefix
                prefix = SEPARATOR.join(par_line.split(SEPARATOR)[:indent])
                formatted.append(prefix + SEPARATOR + line_noindent)
                par_indent = indent
                par_line = prefix + SEPARATOR + line_noindent

    return formatted


def sort_and_revert(lines: List[str]):
    """Sort the formatted lines and revert the leading parents
    into indentations

    Args:
        lines (List[str]): list of formatted lines

    Prints:
        The sorted and reverted lines
    """
    sorted_lines = sorted(lines)
    for line in sorted_lines:
        if SEPARATOR not in line:
            print(line)
        else:
            leading, _, orig_line = line.rpartition(SEPARATOR)
            print(INDENT * (leading.count(SEPARATOR) + 1) + orig_line)


def main():
    """Main entry"""
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <file>")
        sys.exit(1)

    formatted = format_text(sys.argv[1])
    sort_and_revert(formatted)


if __name__ == "__main__":
    main()

Let's save it as format.py, and say we have a test file, test.txt:

parent2
 child2-1
  child2-1-1
 child2-2
parent1
 child1-2
  child1-2-2
  child1-2-1
 child1-1

Let's test it:

$ python format.py test.txt
parent1
 child1-1
 child1-2
  child1-2-1
  child1-2-2
parent2
 child2-1
  child2-1-1
 child2-2

If you wonder how the format_text function formats the text, here are the intermediate results, which also explain why the file ends up sorted the way we want:

parent2
parent2#child2-1
parent2#child2-1#child2-1-1
parent2#child2-2
parent1
parent1#child1-2
parent1#child1-2#child1-2-2
parent1#child1-2#child1-2-1
parent1#child1-1

You can see that each child has its parents attached, all the way up to the root, so the children under the same parent are sorted together.
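The same parent-attaching idea can also be sketched in Ruby. This is a minimal sketch rather than a port of the script: the method name sort_hierarchically is hypothetical, and it assumes one space per indentation level and that # never occurs in the data.

```ruby
INDENT = " "      # assumption: one space per indentation level
SEPARATOR = "#"   # assumption: this character does not occur in the data

# Returns the text with siblings sorted at every level of the hierarchy.
def sort_hierarchically(text)
  path = []  # contents of the current line's ancestors, root first
  keys = text.each_line.map(&:chomp).reject { |l| l.strip.empty? }.map do |line|
    stripped = line.lstrip
    depth = (line.length - stripped.length) / INDENT.length
    # Keep only the ancestors above this depth, then add the line itself.
    path = path.first(depth) << stripped
    path.join(SEPARATOR)  # e.g. "parent2#child2-1#child2-1-1"
  end

  # Sorting the full paths keeps every child right behind its parent.
  keys.sort.map do |key|
    parts = key.split(SEPARATOR)
    INDENT * (parts.length - 1) + parts.last
  end.join("\n")
end

input = "parent2\n child2-1\n  child2-1-1\n child2-2\n" \
        "parent1\n child1-2\n  child1-2-2\n  child1-2-1\n child1-1\n"
puts sort_hierarchically(input)
```

Keeping the ancestor list in an array avoids splitting and re-joining the parent string for every sibling, at the cost of building one key string per line.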
Short answer (Linux solution):

sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'

Test it out:

test.txt
parent2
 child2-1
  child2-1-1
 child2-2
parent1
 child1-1
 child1-2
  child1-2-1

$ sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'
parent1
 child1-1
 child1-2
  child1-2-1
parent2
 child2-1
  child2-1-1
 child2-2

Explanation: The idea is to replace each newline that is followed by an indentation/space with a non-newline character. That character has to be unique in your file (here I used # as an example; if it is not unique in your file, use another character or even a string), because we need to turn it back into the newline and indentation/space later.

About the sed command:
:a - create a label 'a'
N - append the next line to the pattern space
$! - if not the last line,
ba - branch (go to) label 'a'
s - substitute
/\n / - regex for a newline followed by a space
/#/ - a unique character to replace the newline and space; if it is not unique in your file, use another character or even a string
/g - global match (as many times as it can)
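For comparison, the block-wise approach can be sketched in Ruby without a placeholder character. This is only a sketch: the method name sort_top_level_blocks is hypothetical, and it assumes that every child line starts with at least one space. As with the sed pipeline, only the top-level blocks are reordered and the children keep their original order.

```ruby
# Groups each unindented line with the indented lines that follow it,
# then sorts the groups by their first (parent) line.
def sort_top_level_blocks(text)
  text.lines(chomp: true)
      .slice_before { |line| !line.start_with?(" ") }  # new block at each parent
      .sort_by(&:first)
      .flatten
      .join("\n")
end

input = "parent2\n child2-1\n  child2-1-1\n child2-2\n" \
        "parent1\n child1-1\n child1-2\n  child1-2-1\n"
puts sort_top_level_blocks(input)
```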
Analyzing protein sequences with the ProtParam module
I'm fairly new to Biopython. Right now, I'm trying to compute protein parameters from several protein sequences (more than 100) in FASTA format. However, I've found it difficult to parse the sequences correctly. This is the code I'm using:

from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

input_file = open("/Users/matias/Documents/Python/DOE.fasta", "r")
for record in SeqIO.parse(input_file, "fasta"):
    my_seq = str(record.seq)
    analyse = ProteinAnalysis(my_seq)
    print(analyse.molecular_weight())

But I'm getting this error message:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/SeqUtils/__init__.py", line 438, in molecular_weight
    weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/SeqUtils/__init__.py", line 438, in <genexpr>
    weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
KeyError: '\\'

Printing each sequence as a string shows me that every sequence has a "\" at the end, but so far I haven't been able to remove it. Any ideas would be much appreciated.
That really shouldn't be there in your file, but if you can't get a clean input file, you can use my_seq = str(record.seq).rstrip('\\') to remove it at runtime.
Ruby question mark in filename
I have a little piece of Ruby that creates a file containing TSV content with 2 columns: a date and a random number.

#!/usr/bin/ruby
require 'date'
require 'set'
startDate=Date.new(2014,11,1)
endDate=Date.new(2015,9,1)
dates=File.new("/PATH_TO_FILE/dates_randoms.tsv","w+")
rands=Set.new
while startDate <= endDate do
  random=rand(1000)
  while rands.add?(random).nil? do
    random=rand(1000)
  end
  dates.puts("#{startDate.to_s.gsub("-","")}\t#{random}")
  startDate=startDate+1
end

Then, from another program, I read this file and create a file named after the random number:

dates_file=File.new(DATES_FILE_PATH,"r")
dates_file.each_line do |line|
  parts=line.split("\t")
  random=parts.at(1)
  table=File.new("#{TMP_DIR}#{random}.tsv","w")
end

But when I check the created files, I see 645?.tsv, for example. I initially thought the cause was the line separator in the TSV file (the one containing the date and the random number), but both programs run on the same Unix filesystem, so the file is not being moved from DOS to Unix.

Some lines from the file:

head dates_randoms.tsv
20141101 356
20141102 604
20141103 680
20141104 668
20141105 995
20141106 946
20141107 354
20141108 234
20141109 429
20141110 384

Any advice?
parts = line.split("\t")
random = parts.at(1)

line there will contain a trailing newline char. So for a line "whatever\t1234\n", random will contain "1234\n". That newline char then becomes a part of the filename, and you see it as a question mark. The simplest workaround is to do some sanitization:

random = parts.at(1).chomp
# alternatively use .strip if you want to remove whitespace
# from the beginning of the value too
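A small sketch that makes the stray newline visible. The sample line is made up to match the format of the file in the question, and no file is created here.

```ruby
line = "20141101\t356\n"  # one line as each_line yields it, newline included

raw   = line.split("\t").at(1)        # still carries the trailing newline
clean = line.chomp.split("\t").at(1)  # newline removed before splitting

# inspect shows the escape sequence instead of printing an actual line break
puts "#{raw}.tsv".inspect    # prints "356\n.tsv"
puts "#{clean}.tsv".inspect  # prints "356.tsv"
```

Calling chomp on the whole line before splitting has the same effect as calling it on the last field, and it keeps every field clean if more columns are added later.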
Converting a multi line string to an array in Ruby using line breaks as delimiters
I would like to turn this string

"P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"

into an array in Ruby that looks like this:

["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]

Using split doesn't return what I would like because of the line breaks.
This is one way to deal with blank lines:

string.split(/\n+/)

For example,

string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"

string.split(/\n+/)
#=> ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS",
#    "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]

To accommodate files created under Windows (having line terminators \r\n), replace the regular expression with /(?:\r?\n)+/.
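A quick sketch comparing both line-ending styles with the Windows-aware pattern. The two sample strings are made up from the first records in the question, with one blank line between them.

```ruby
LINE_BREAKS = /(?:\r?\n)+/  # one or more LF or CRLF terminators in a row

unix    = "P07091 MMCNEFFEG\n\nP06870 IVGGWECEQHS\n"
windows = "P07091 MMCNEFFEG\r\n\r\nP06870 IVGGWECEQHS\r\n"

# split drops trailing empty strings, so the final terminator adds nothing
p unix.split(LINE_BREAKS)     # prints ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS"]
p windows.split(LINE_BREAKS)  # prints ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS"]
```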
I like to use this as a pretty generic method for handling newlines and returns: lines = string.split(/\n+|\r+/).reject(&:empty?)
string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"

Using CSV::parse:

require 'csv'
CSV.parse(string).flatten
# => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]

Another way, using String#each_line:

ar = []
string.each_line { |line| ar << line.strip unless line == "\n" }
ar
# => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
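Building off of @Martin's answer:

lines = string.split("\n").reject(&:blank?)

That'll give you only the non-blank lines. Note that blank? comes from ActiveSupport (Rails); in plain Ruby, reject(&:empty?) drops empty strings, and reject { |l| l.strip.empty? } also drops whitespace-only lines.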
Split can take a parameter in the form of the character to use to split, so you can do: lines = string.split("\n")
I think it should be noted that in some situations, line breaks can include not only newlines (\n) but also carriage returns (\r), in any combination or quantity. Let's take the following string for example:

str = "Useful Line 1 .... Useful Line 2 Useful Line 3 Useful Line 4... \n Useful Line 5\r \n Useful Line 6\n\r Useful Line 7\n\r\n\r Useful Line 8 \r\n\r\n Useful Line 9\r\r\r Useful Line 10\n\n\n\n\nUseful Line 11 \r Useful Line 12"

To deal with all instances of \n and \r, I would first replace every \r with \n using gsub, and then collapse all consecutive \n using squeeze(arg):

str.gsub("\r", "\n").squeeze("\n")

which would result in:

#=> "Useful Line 1 .... Useful Line 2 Useful Line 3 Useful Line 4... Useful Line 5 Useful Line 6 Useful Line 7 Useful Line 8 Useful Line 9 Useful Line 10 Useful Line 11 Useful Line 12"

...which brings me to the next issue. Sometimes those extra line breaks are separated by unwanted whitespace, so the lines between them are not truly blank or empty. To deal with not only line breaks but also these unwanted lines, I would add the each_line, reject, and strip methods like so:

str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join

which would result in the desired string:

#=> Useful Line 1 .... Useful Line 2 Useful Line 3 Useful Line 4... Useful Line 5 Useful Line 6 Useful Line 7 Useful Line 8 Useful Line 9 Useful Line 10 Useful Line 11 Useful Line 12

Now, more specifically to the OP, we could then simply use split("\n") to finish it all off (as was already mentioned by others):

str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join.split("\n")

or we could skip straight to the desired array by replacing each_line with split("\n") and leaving off the now unnecessary join, like so:

str.gsub("\r", "\n").squeeze("\n").split("\n").reject{|x| x.strip == ""}

both of which would result in:

#=> ["Useful Line 1 ....", " Useful Line 2", "Useful Line 3", " Useful Line 4... ", "Useful Line 5", " Useful Line 6", "Useful Line 7", " Useful Line 8 ", "Useful Line 9", " Useful Line 10", "Useful Line 11 ", " Useful Line 12"]

NOTE: You may also want to strip off leading and trailing whitespace from each line, in which case we could replace .join.split("\n") with .map(&:strip) like so:

str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.map(&:strip)

or

str.gsub("\r", "\n").squeeze("\n").split("\n").reject{|x| x.strip == ""}.map(&:strip)

which would both result in:

#=> ["Useful Line 1 ....", "Useful Line 2", "Useful Line 3", "Useful Line 4...", "Useful Line 5", "Useful Line 6", "Useful Line 7", "Useful Line 8", "Useful Line 9", "Useful Line 10", "Useful Line 11", "Useful Line 12"]
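The whole chain can probably be condensed by splitting on any run of \r and \n characters directly. The sketch below is one way to do that in plain Ruby; the method name non_blank_lines and the short sample string are hypothetical.

```ruby
# Splits on any run of CR/LF characters, trims each piece and drops
# the pieces that are empty after trimming.
def non_blank_lines(text)
  text.split(/[\r\n]+/).map(&:strip).reject(&:empty?)
end

sample = "Useful Line 1 ....\n   \r\n Useful Line 2\r\r\rUseful Line 3 \n\r"
p non_blank_lines(sample)
# prints ["Useful Line 1 ....", "Useful Line 2", "Useful Line 3"]
```

Stripping before rejecting means whitespace-only lines are removed without needing a separate strip inside the reject block.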
how to replace last comma in a line with a string in unix
I am trying to insert a string into every line of a file except the first and last lines, but I have not been able to get it done. Can anyone give me a clue how to achieve this? Thanks in advance.

How can I replace the last comma in a line with a string xxxxx (except for the first and last rows) using Unix tools?

Original File

00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,01
99,SRI,FF,28

Expected File

00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,xxxxx231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,xxxxx231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,xxxxx231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,xxxxx232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,xxxxx232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,xxxxx232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
awk can be quite useful for manipulating data files like this one. Here's a one-liner that does more or less what you want. It prepends the string "xxxxx" to the twelfth field of each input line that has at least twelve fields.

$ awk 'BEGIN{FS=OFS=","}NF>11{$12="xxxxx"$12}{print}' 16006747.txt
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
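If the rule should depend on line position rather than on the field count, a sketch along the following lines might work. It is written in Ruby rather than awk, the method name tag_last_field is hypothetical, and it assumes that the first and last lines are always the header and trailer records.

```ruby
# Inserts marker right after the last comma of every line except the
# first and the last one.
def tag_last_field(text, marker = "xxxxx")
  lines = text.lines(chomp: true)
  lines.each_with_index.map do |line, i|
    next line if i.zero? || i == lines.length - 1  # header and trailer stay as-is
    # [^,]*\z is the final field; the block form avoids backslash
    # escapes being interpreted in the replacement string.
    line.sub(/,([^,]*)\z/) { ",#{marker}#{$1}" }
  end.join("\n")
end

sample = <<~DATA
  00,SRI,BOM,FF,000004
  10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,02BLYPO
  10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,01
  99,SRI,FF,28
DATA
puts tag_last_field(sample)
```

Unlike the NF>11 test, this leaves a first or last line untouched even if it happens to have twelve or more fields, and it changes middle lines regardless of their field count.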