multiline matching with ruby - ruby

I have a string variable with multiple lines, e.g.:
"SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
I want to find both lines that start with "Seq_vec SVEC" and extract the integer values that follow:
string = "Clone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
seqvector = Regexp.new("Seq_vec\\s+SVEC\\s+(\\d+\\s+\\d+)",Regexp::MULTILINE )
vector = string.match(seqvector)
if vector
vector_start,vector_stop = vector[1].split(/ /)
puts vector_start.to_i
puts vector_stop.to_i
end
However, this only grabs the first match's values and not the second, as I would like.
Any ideas what I could be doing wrong?
Thank you

To capture the groups from every match, use String#scan:
vector = string.scan(seqvector)
=> [["1 65"], ["102 1710"]]

match finds only the first match. To find all matches, use String#scan, e.g.:
string.scan(seqvector)
=> [["1 65"], ["102 1710"]]
or to do something with each match:
string.scan(seqvector) do |match|
# match[0] will be the substring captured by your first regexp grouping
puts match.inspect
end
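If the goal is to get the start and stop values as integers, one option is to give each number its own capture group so that scan returns pairs. This is only a sketch; the names pairs and ranges are mine, not from the question:

```ruby
string = "Clone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\n" \
         "Sequencing_vector \"pCR2.1-topo\"\n" \
         "Seq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"

# Two capture groups, so each scan result is a [start, stop] pair of strings.
# In Ruby, ^ matches at the start of every line, so no extra flag is needed.
pairs = string.scan(/^Seq_vec\s+SVEC\s+(\d+)\s+(\d+)/)

# Convert each pair of strings to integers.
ranges = pairs.map { |start, stop| [start.to_i, stop.to_i] }

ranges.each { |start, stop| puts "#{start} #{stop}" }
```

Note that Regexp::MULTILINE in Ruby only makes . match newlines; it is not needed for matching on several lines.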

Just to make this a bit easier to handle, I would split the whole string into an array first and then would do:
string = "SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
selected_strings = string.split("\n").select{|x| /Seq_vec SVEC/.match(x)}
selected_strings.collect{|x| x.scan(/\s\d+/)}.flatten # => [" 1", " 65", " 102", " 1710"]

Related

count specific lines in specific files in a folder

I'm fairly new to Ruby, and this is testing me.
I want to count all the lines in any file in a folder whose name ends in bowtie.txt.
The lines have to start with a number of varying length followed by a '+' or a '-' (with or without whitespace in between). Sometimes the lines are wrapped, but I don't know if this matters.
I then want to create a hash that stores each filename with its associated count.
I think I've got as far as looping through the directory to select the files and counting the number of lines in each file, but how do I then create the hash and return it?
The file data looks like:
0 + chr12 129402816 ACACAGGGAGGGGAATAACACACACTGGGACCTGTCAGGAGAGGGTAGGGCTGGGGGCATCAGGAGAGCATCAGGAAAAATAGCTAATGCATGCTGGGCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
2 - chr5 93625939 TCAACCTGTCATCTACATTAGGTATTTCTCCTAATGCTATCCCTCCCCTAGCCCCCCACCACCCAACAGACCCTGGTGTGTGATGTTCCCCTCCCTGTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 5:T>C
5 + chr3 155023119 ACACAGGGAGGGGAACATCACACACCGGGGCCTGTAGTGGGGGTGAGGGGCAAGAGGAGGAATAGCATTAGGAGAAATACCTAATGTAGATGACCGGTTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
7 + chr2 22818055 ACACAGGGAGGGGAAAAACACACACTGGGGCTTCTCAGGGGTGGTGGGGGGAGAGCATCAGGATAAATAGCTAATGCATGCAGGGCTTAATACCTAGGTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
8 + chr3 131206106 ACACAGGGAGGGGAACATCACACACCAGGCCCTGTCAGCGGTGAGGGGCTGGGGGAGGGATAGCATTAAGAGAAATACCTAATATAAATGACGAGTTGAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 8:C>A
10 + chrX 108455592 ACACAGGGAGGGGAACATCACACACCAGGGCCTGTCGGGCAGTGGGGGGGCAAAGGGAGGGATTAAGTCATACACCCAATGCATGTGGGGCTTAAAACCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 7:A>G
11 - chr2 31936302 ACCCATTAACTCGTCATTTACATTAGGTATATCTCCTAATGCTATCCCTCCCCCCACCCCACAACAGGCCCCCCGGTGTGTGATGTTCCCCTCCCTGTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 7:T>C
This is what I am trying to get at the end
blablabla.bowtie.txt : 27998
blablafsfds.bowtie.txt : 25987
etc
This is my attempt at the code:
Dir[File.join('/Volumes/SeagateBackupPlusDriv/SequencingRawFiles/TumourOesophagealOCCAMS/SequencingScripts/3finalcounts', '*.bowtie.txt')].each |file| do
puts File.open(file) { |f| f.grep(/^[0-9]*.\+|\-/).count }
end
Untested, since I have no input files, but likely working:
# `Dir[]` expects its own glob format
# ⇓ injects the results into a hash
Dir['/Volumes/.../*.bowtie.txt'].inject({}) do |memo, file|
memo[file] = File.readlines(file).select do |line|
line =~ /^[0-9]+\s*(\+|\-)/ # keep only the matching lines
end.count
memo
end
Additional references: IO#readlines, Enumerable#select, Enumerable#inject.
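A variation on the same idea, as a rough sketch: each_with_object avoids having to return memo from the block, and File.foreach reads one line at a time instead of loading the whole file. The method name count_bowtie_lines and the placeholder path are my own assumptions, not something from the question:

```ruby
# Hypothetical helper: returns { "name.bowtie.txt" => matching_line_count }.
def count_bowtie_lines(dir)
  Dir.glob(File.join(dir, '*.bowtie.txt')).each_with_object({}) do |path, counts|
    # File.foreach without a block returns an enumerator over the lines;
    # count with a block counts the lines for which the block is truthy.
    counts[File.basename(path)] = File.foreach(path).count do |line|
      line =~ /\A\d+\s*[+-]/ # digits, optional whitespace, then + or -
    end
  end
end

# Usage (the directory is a placeholder):
# count_bowtie_lines('/path/to/folder').each { |name, n| puts "#{name} : #{n}" }
```

Wrapped continuation lines are simply not counted unless they happen to start with digits followed by a sign.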

Regex for First Line (Only) that Contains a String

I have a bunch of phone numbers with one per line:
[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s
I want to grab the first one that contains the letter "c" upper or lower case.
So far, I have this /^.*[C].*$/i and that matches C (202) 456-1111, [Cell] (505) 555-1234 and c 12346567s. How do I return only the first? In other words, the match should only be C (202) 456-1111.
I have been blindly putting question marks everywhere without success.
I am using Ruby if it makes a difference http://www.rubular.com/r/h6ReB9IN8t
Edit: Here is another question that Hrishi pointed to but I cannot figure out how to adapt it to match the whole line.
Try match method. Here is an example:
list = <<EOF
[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s
EOF
Update
# match the first line containing the letter "c", even when it is part of a word
puts list.match(/^.*C.*$/i)
# match the first line where the letter "c" is not part of a word
puts list.match(/^\W*C\W.*$/i)
I'd go about this a bit differently. I prefer to reduce regular expressions to very simple patterns:
str = <<EOT
[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s
EOT
Finding the right line to work with is easily done using either select or find:
str.split("\n").select{ |s| s[/c/i] }.first # => "C (202) 456-1111"
str.split("\n").find{ |s| s[/c/i] } # => "C (202) 456-1111"
I'd recommend find because it only returns the first occurrence.
Once the desired string is found, use scan to grab the numbers:
str.split("\n").find{ |s| s[/c/i] }.scan(/\d+/) # => ["202", "456", "1111"]
Then join them. When you have phone numbers stored in a database you don't really want them to be formatted, you just want the numbers. Formatting occurs later when you're outputting them again.
phone_number = str.split("\n").find{ |s| s[/c/i] }.scan(/\d+/).join # => "2024561111"
When you need to output the number, break it into the right grouping based on the regional phone-number representation. You should have some idea where the person is located, because you've usually also got their country code. Based on that you know how many digits you should have, plus the groups:
area_code, prefix, number = phone_number[0 .. 2], phone_number[3 .. 5], phone_number[6 .. 9] # => ["202", "456", "1111"]
Then output them so they're displayed correctly:
"(%s) %s-%s" % [area_code, prefix, number] # => "(202) 456-1111"
As far as your original pattern /^.*[C].*$/i, there are some things wrong with your understanding of regex:
^.* says "start at the beginning of the string and find zero or more characters", which is no more effective than saying /[C]/.
Using [C] creates an unnecessary character set, which means "find one of the letters in the set 'C'". It does nothing useful, so just use C, as in /C/.
.*$ artificially finds the end of the string also, but since you're not capturing it there's nothing accomplished, so don't bother with it. The regex is now /C/.
Since you want to match upper and lower-case, use /C/i or /c/i. (Or you could use /[cC]/ but why?)
Instead:
To find a "c" or "C" anywhere in the string, just use /c/i. That's all that's needed. http://rubular.com/r/uPyxACOWls
To find "c", "C" or "cell" or "Cell", you can use /c(?:ell)?/i. http://rubular.com/r/TkSRPWG2y6
To find "c", "C", "cell" or "Cell" as a separate word, use word-break markers like /\bc(?:ell)?\b/i. http://rubular.com/r/Smo0bFs9w8
You can get a whole lot more complicated, but if you're not accomplishing anything with the additional pattern information, you're just wasting the regex-engine's CPU-time, and slowing your code. A confused regex-engine can waste a LOT of CPU-time, so be efficient and aware of what you're asking it to do.
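To see where the simple pattern and the word-boundary pattern differ, here is a small sketch. The "[Office]" entry is a made-up line I added for illustration; it is not in the original list:

```ruby
lines = [
  "[Office] 555-0000",      # hypothetical entry: has a "c" inside a word
  "C (202) 456-1111",
  "[Cell] (505) 555-1234",
  "W 303-555-5555"
]

# Any "c" or "C" anywhere in the line.
any_c = lines.find { |s| s =~ /c/i }

# Only "c" or "cell" standing alone as a word, in either case.
label_only = lines.find { |s| s =~ /\bc(?:ell)?\b/i }

puts any_c       # the "[Office]" line, because of the "c" in "Office"
puts label_only  # the "C (202) 456-1111" line
```

On the list from the question both patterns pick the same line, so the stricter one only matters if labels such as "Office" can appear.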
EDIT Added two more ways of handling this. The last one is preferable.
This will do what you want. It will search for matches of your regex, and then get the first one. Please note that this will produce an error if string does not have any matches.
string = "[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s"
puts string.match(/^(.*[C].*)$/i).captures.first
puts string.match(/^(.*[C].*)$/i)
puts string[/^(.*[C].*)$/i]
Ruby Docs String#match.
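If the input might contain no matching line, a nil-safe variant could look like the sketch below. The variable names are mine; String#[] with a regexp returns nil when nothing matches, and &. (Ruby 2.3+) skips the call when match returns nil:

```ruby
pattern = /^.*c.*$/i

with_match    = "[Home] (202) 121-7777\nC (202) 456-1111"
without_match = "[Home] (202) 121-7777\nW 303-555-5555"

# String#[] returns the matched text, or nil when there is no match.
first_line = with_match[pattern]
none       = without_match[pattern]

# match returns nil here, so &. short-circuits instead of raising NoMethodError.
also_none = without_match.match(pattern)&.to_s

p first_line
p none
p also_none
```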
Split the string by the new line characters, and select the substring which matches your requirements and grab the first one:
str = '[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s'
p str.split(/\n/).select{|el| el =~ /^.*[C].*$/i}[0]
or use match:
p str.match(/^.*[C].*$/i)[0]
EDITED:
Or, in case you want to find the first chunk that exactly starts with C try this:
p str.match(/^C.*$/)[0]

Join array of strings into 1 or more strings each within a certain char limit (+ prepend and append texts)

Let's say I have an array of Twitter account names:
string = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
And a prepend and append variable:
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
How can I turn this into an array of as few strings as possible, each with a maximum length of 140 characters, starting with the prepend text, ending with the append text, and with the Twitter account names in between, each starting with a #-sign and separated by a space? Like this:
tweets = ['Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday', 'Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday', 'Check out these cool people: #example18 #example19 #example20 #FollowFriday']
(The order of the accounts isn't important so theoretically you could try and find the best order to make the most use of the available space, but that's not required.)
Any suggestions? I'm thinking I should use the scan method, but haven't figured out the right way yet.
It's pretty easy using a bunch of loops, but I'm guessing that won't be necessary when using the right Ruby methods. Here's what I came up with so far:
# Create one long string of #usernames separated by a space
tmp = twitter_accounts.map!{|a| a.insert(0, '#')}.join(' ')
# alternative: tmp = '#' + twitter_accounts.join(' #')
# Number of characters left for mentioning the Twitter accounts
length = 140 - (prepend + append).length
# This method would split a string into multiple strings,
# each with a maximum length of 'length', splitting only on spaces (' ')
# and ideally stripping that space as well (although .map(&:strip) could be used too)
tweets = tmp.some_method(' ', length)
# Prepend and append
tweets.map!{|t| prepend + t + append}
P.S.
If anyone has a suggestion for a better title let me know. I had a difficult time summarizing my question.
The String rindex method has an optional parameter where you can specify where to start searching backwards in a string:
arr = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
str = arr.map{|name|"##{name}"}.join(' ')
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
max_chars = 140 - prepend.size - append.size
until str.size <= max_chars do
p str.slice!(0, str.rindex(" ", max_chars))
str.lstrip! #get rid of the leading space
end
p str unless str.empty?
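The loop above prints only the chunks of names. As a sketch of the same greedy idea wrapped in a method that also adds the prepend and append text, something like the following might work. The method name pack_tweets and the limit parameter are my own, hypothetical choices:

```ruby
# Hypothetical helper; `limit` defaults to the 140 characters from the question.
def pack_tweets(names, prepend, append, limit = 140)
  room = limit - prepend.length - append.length
  groups = names.each_with_object([[]]) do |name, acc|
    candidate = (acc.last + ["##{name}"]).join(' ')
    # Start a new group when adding this name would exceed the room,
    # unless the current group is empty (an over-long name gets its own tweet).
    acc << [] if candidate.length > room && !acc.last.empty?
    acc.last << "##{name}"
  end
  groups.reject(&:empty?).map { |g| prepend + g.join(' ') + append }
end

names  = (1..20).map { |i| "example#{i}" }
tweets = pack_tweets(names, 'Check out these cool people: ', ' #FollowFriday')
tweets.each { |t| puts "#{t.length}: #{t}" }
```

Working on arrays of names rather than on one long string means no stripping of leading spaces is needed afterwards.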
I'd make use of reduce for this:
string = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
prepend = 'Check out these cool people:'
append = '#FollowFriday'
# Extra -1 is for the space before `append`
max_content_length = 140 - prepend.length - append.length - 1
content_strings = string.reduce([""]) { |result, target|
result.push("") if result[-1].length + target.length + 2 > max_content_length
result[-1] += " ##{target}"
result
}
tweets = content_strings.map { |s| "#{prepend}#{s} #{append}" }
Which would yield:
"Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday"
"Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday"
"Check out these cool people: #example18 #example19 #example20 #FollowFriday"

How to convert space-delimited .txt File to ","-delimited .txt file using Ruby?

I have a text file as below:
Employee details.txt
Raja Palit 77489 24 84 12/12/2011
Mathew bargur 77559 25 88 01/12/2011
harin Roy 77787 24 80 12/12/2012
Soumi paul 77251 24 88 11/11/2012
I want the file as below:
Expected file:
Raja,Palit,77489,24,84,12/12/2011
Mathew,bargur,77559,25,88,01/12/2011
harin,Roy,77787,24,80,12/12/2012
Soumi,paul,77251,24,88,11/11/2012
What I tried below:
IO.foreach('D://docs//details.txt') do |line|
splits = line.split("\t")
col1, col2, col3, col4, col5, col6 = splits
splits[6..-1].join(',')
end
Though it seems like a quick way to deal with this sort of data by splitting on whitespace, that will fail if any field contains embedded whitespace. For instance, if the name of the person in the record is something like "Maria Von Trapp" or "Smokey the Bear", the resulting comma-delimited fields will be wrong.
The correct way to deal with this is to parse based on column-field widths, then squeeze and strip whitespace inside those fields, then turn the record into a CSV record.
require 'csv'
require 'scanf' if (RUBY_VERSION >= '1.9.3')
FORMAT = '%15c %d %d %d %10c'
data = <<EOT
Raja Palit      77489 24 84 12/12/2011
Mathew bargur   77559 25 88 01/12/2011
harin Roy       77787 24 80 12/12/2012
Soumi paul      77251 24 88 11/11/2012
Maria Von Trapp 99999 99 99 12/31/2012
Smokey the Bear 99999 99 99 12/31/2012
EOT
data.split("\n").each do |li|
fields = li.scanf(FORMAT)
puts [fields.first.strip, *fields[1 .. -1]].to_csv
end
Which outputs:
Raja Palit,77489,24,84,12/12/2011
Mathew bargur,77559,25,88,01/12/2011
harin Roy,77787,24,80,12/12/2012
Soumi paul,77251,24,88,11/11/2012
Maria Von Trapp,99999,99,99,12/31/2012
Smokey the Bear,99999,99,99,12/31/2012
Note, Ruby 1.9.3 split scanf into its own module, which explains the conditional require.
Strings come with a squeeze method, which squeezes runs of the char(s) in the argument into one char. In this case it reduces the multiple spaces to one space, which is then replaced by a comma:
File.open("test.txt") do |in_file|
File.open("test.csv", 'w') do |out_file| #the 'w' opens the file for writing
in_file.each {|line| out_file << line.squeeze(' ').gsub(' ', ',') }
end # closes test.csv
end # closes test.txt
You could use a regular expression to replace each whitespace character with a comma (Ruby has no /g flag; gsub! replaces every occurrence):
my_string.gsub!(/\s/, ',')
If you want to discard empty fields, you could use this:
my_string.gsub!(/\s+/, ',')
An alternative would be to split it on spaces and join on commas. This will also discard empty fields:
my_string = my_string.split(' ').join(',')
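One thing to watch with split and join: applied to the whole file contents at once, it would also turn the newlines into commas. A minimal sketch that works line by line, using two of the sample records from the question (the variable names are mine):

```ruby
data = <<~EOT
  Raja Palit 77489 24 84 12/12/2011
  Mathew bargur 77559 25 88 01/12/2011
EOT

# split with no argument splits on runs of whitespace and ignores leading and
# trailing whitespace, so the newline at the end of each line disappears too.
csv_lines = data.each_line.map { |line| line.split.join(',') }

puts csv_lines
```

As noted in the first answer, this treats every space as a separator, so it is only safe when no field contains embedded whitespace.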
File.write("details.txt", File.read("details.txt").gsub(/[ \t]+/, ","))

ruby regex for php comment block

I've been trying to find the regex in ruby to match a php comment block:
/**
* @file
* lorum ipsum
*
* @author ME <me@localhost>
* @version 00:00 00-00-0000
*/
Could anyone help? I've tried searching a lot, and although some of the regexes I found worked in a regex tester, they don't work when I put them in my Ruby file.
This is the most successful bit of regex I have found:
(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)
This is the output from my script
file is ./test/123.rb so regex is ((^\s*#\s)+(.*?))+
i = 0
found: my first ruby comment
file is ./test/abc.php so regex is (/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)
i = 0
found: *
i = 1
found: *
Here is the code I have to do this:
def self.extract_comments f
  if @regex[File.extname(f)]
    puts "file is " + f + " so regex is " + @regex[File.extname(f)]
    cur_rgx = Regexp.new @regex[File.extname(f)]
    matches = IO.read( f ).scan( cur_rgx )
    content = ""
    if ! matches.empty?
      # content = "== " + f + " ==\n"
      content += f + "\n"
      for i in 0...f.length
        content += "="
      end
      content += "\n"
      for i in 0...matches.length
        puts "i = " + i.to_s
        puts "found: " + matches[i][2].to_s
        content << matches[i][2].to_s + "\n"
      end
      content << "\n"
    end
  end
  content || '' # return something
end
It seems like /\/\*.*?\*\//m should do.
Also that's really a c-style comment block.
Unless it is important that each line inside the comment block begins with an asterisk, you may want to try this regex:
/\/\*(?:[^*]+|\*+(?!\/))*\*\//
EDIT: And here's a stricter version, which will only match comments that are formatted exactly like your example:
/^( *)\/\*\*\n(?:\1 \*(?:[^*\n]|\*(?!\/))*\n)+\1 \*\//
This version will only match a comment that has /** and */ on separate lines. /** can be indented by an arbitrary number of spaces (but no other white-space characters), but the other lines must be indented by exactly one more space than the /** line.
EDIT 2: Here is another version:
/^([ \t]*)\/\*\*.*?\n(?:^\1 .*?\n)+^\1 \*\//
It allows a mixture of tabs and spaces (ew) for indentation, but still requires all lines to conform to the indentation of the /** one (plus a single space).
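For what it's worth, here is a small sketch of how the simple non-greedy pattern behaves with scan. The PHP snippet is made up for the example. It also shows why the code in the question prints fragments such as "*": when a regex contains capture groups, scan returns arrays of the groups rather than the whole match.

```ruby
# Made-up PHP source for illustration.
php_source = <<~'PHP'
  <?php
  /**
   * lorem ipsum
   */
  function demo() {}
  /* second comment */
PHP

# No capture groups, so scan returns each whole match as a String.
# The m flag lets . match newlines; .*? stops at the first closing */.
comments = php_source.scan(/\/\*.*?\*\//m)
puts comments.length
puts comments.first

# With a capture group, scan returns arrays holding only the captured parts.
grouped = php_source.scan(/\/\*(.*?)\*\//m)
p grouped.first
```

So with a pattern full of groups, indexing into matches[i] picks out one group, not the comment itself; either drop the groups or make them non-capturing with (?:...).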

Resources