I've been trying to find the regex in ruby to match a php comment block:
/**
* #file
* lorum ipsum
*
* #author ME <me#localhost>
* #version 00:00 00-00-0000
*/
Could anyone help I've tried searching alot and even though some regex I found has worked in a regex tester but doesn't when I write it in my ruby file.
This is the most successful bit of regex I have found:
(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)
This is the output from my script
file is ./test/123.rb so regex is ((^\s*#\s)+(.*?))+
i = 0
found: my first ruby comment
file is ./test/abc.php so regex is (/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)
i = 0
found: *
i = 1
found: *
Here is the code I have to do this:
56 def self.extract_comments f
57 if #regex[File.extname(f)]
58 puts "file is " + f + " so regex is " + #regex[File.extname(f)]
59 cur_rgx = Regexp.new #regex[File.extname(f)]
60 matches = IO.read( f ).scan( cur_rgx )
61 content = ""
62 if ! matches.empty?
63 # content = "== " + f + " ==\n"
64 content += f + "\n"
65 for i in 0...f.length
66 content += "="
67 end
68 content += "\n"
69 for i in 0...matches.length
70 puts "i = " + i.to_s
71 puts "found: " + matches[i][2].to_s
72 content << matches[i][2].to_s + "\n"
73 end
74 content << "\n"
75 end
76 end
77 content || '' # return something
78 end
It seems like /\/\*.*?\*\//m should do.
Also that's really a c-style comment block.
Unless it is important that each line inside the comment block begins with an asterisk, you may want to try this regex:
/\/\*(?:[^*]+|\*+(?!\/))*\*\//
EDIT: And here's a stricter version, which will only match comments that are formatted exactly like your example:
/^( *)\/\*\*\n(?:\1 \*(?:[^*\n]|\*(?!\/))*\n)+\1 \*\//
This version will only match a comment that has /** and */ on separate lines. /** can be indented by an arbitrary number of spaces (but no other white-space characters), but the other lines must be indented by exactly one more space than the /** line.
EDIT 2: Here is another version:
/^([ \t]*)\/\*\*.*?\n(?:^\1 .*?\n)+^\1 \*\//
It allows a mixture of tabs and spaces (ew) for indentation, but still requires all lines to conform to the indentation of the /** one (plus a single space).
Related
puts "%-30s%2s%3d%2s%3d%2s%3d%2s%3d%2s%3d" % [tn,ln,a,ln,b,ln,c,ln,d,ln,e]
This is Ruby, but many languages use this formatting. I have forgotten how to output several variables in the same format, without repeating the format in each case. Here, I want "%3d%2s" for 5 integers, each separated by a '|'
You could write the following.
def print_my_string(tn, ln, *ints)
fmt = "%-10s" + ("|#{"%2s" % ln}%3d" * ints.size) + "|"
puts fmt % [tn, *ints]
end
Then, for example,
print_my_string("hello", "ho", 2, 77, 453, 61, 999)
displays
hello |ho 2|ho 77|ho453|ho 61|ho999|
after having computed
fmt = "%-10s" + ("|#{"%2s" % ln}%3d" * ints.size) + "|"
#=> %-10s|ho%3d|ho%3d|ho%3d|ho%3d|ho%3d|"
I'm fairly new to ruby but this is testing me
I want to count all the lines in any file that ends in bowtie.txt in a folder
The lines have to start with a number of varying length followed by a '+' or a '-' (with or without whitespace inbetween. Sometimes the lines are wrapped but I don't know if this matters).
I want to then create a hash that stores the filename with the count associated with it.
I've got as far I think as looping through the directory to select the files out and then counting the number of lines in that file but how do I then create the hash and return it?
The file data looks like:
0 + chr12 129402816 ACACAGGGAGGGGAATAACACACACTGGGACCTGTCAGGAGAGGGTAGGGCTGGGGGCATCAGGAGAGCATCAGGAAAAATAGCTAATGCATGCTGGGCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
2 - chr5 93625939 TCAACCTGTCATCTACATTAGGTATTTCTCCTAATGCTATCCCTCCCCTAGCCCCCCACCACCCAACAGACCCTGGTGTGTGATGTTCCCCTCCCTGTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 5:T>C
5 + chr3 155023119 ACACAGGGAGGGGAACATCACACACCGGGGCCTGTAGTGGGGGTGAGGGGCAAGAGGAGGAATAGCATTAGGAGAAATACCTAATGTAGATGACCGGTTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
7 + chr2 22818055 ACACAGGGAGGGGAAAAACACACACTGGGGCTTCTCAGGGGTGGTGGGGGGAGAGCATCAGGATAAATAGCTAATGCATGCAGGGCTTAATACCTAGGTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
8 + chr3 131206106 ACACAGGGAGGGGAACATCACACACCAGGCCCTGTCAGCGGTGAGGGGCTGGGGGAGGGATAGCATTAAGAGAAATACCTAATATAAATGACGAGTTGAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 8:C>A
10 + chrX 108455592 ACACAGGGAGGGGAACATCACACACCAGGGCCTGTCGGGCAGTGGGGGGGCAAAGGGAGGGATTAAGTCATACACCCAATGCATGTGGGGCTTAAAACCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 7:A>G
11 - chr2 31936302 ACCCATTAACTCGTCATTTACATTAGGTATATCTCCTAATGCTATCCCTCCCCCCACCCCACAACAGGCCCCCCGGTGTGTGATGTTCCCCTCCCTGTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 7:T>C
This is what I am trying to get at the end
blablabla.bowtie.txt : 27998
blablafsfds.bowtie.txt : 25987
etc
This is my attempt at the code:
Dir[File.join('/Volumes/SeagateBackupPlusDriv/SequencingRawFiles/TumourOesophagealOCCAMS/SequencingScripts/3finalcounts', '*.bowtie.txt')].each |file| do
puts File.open(file) { |f| f.grep(/^[0-9]*.\+|\-/).count }
end
Untested, since I have no input files, but likely working:
# `Dir[]` expects it’s own format
# ⇓ will inject results into hash
Dir['/Volumes/.../*.bowtie.txt'].inject({}) do |memo, file|
memo[file] = File.readlines(file).select do |line|
line =~ /^[0-9]+\s*(\+|\-)/ # only those, matching
end.count
memo
end
Additional references: IO#readlines, Enumerable#select, Enumerable#inject.
I do have a text file as below:
Employee details.txt
Raja Palit 77489 24 84 12/12/2011
Mathew bargur 77559 25 88 01/12/2011
harin Roy 77787 24 80 12/12/2012
Soumi paul 77251 24 88 11/11/2012
I want the file as below:
Expected file:
Raja,Palit,77489,24,84,12/12/2011
Mathew,bargur,77559,25,88,01/12/2011
harin,Roy,77787,24,80,12/12/2012
Soumi,paul,77251,24,88,11/11/2012
What I tried below:
IO.foreach('D://docs//details.txt') do |line|
splits = line.split("\t")
col1, col2, col3, col4, col5, col6 = splits
splits[6..-1].join(',')
end
Though it seems like a quick way to deal with this sort of data by splitting on whitespace, that will fail if any field contains embedded whitespace. For instance, if the name of the person in the record is something like "Maria Von Trapp" or "Smokey the Bear", the resulting comma-delimited fields will be wrong.
The correct way to deal with this is to parse based on column-field widths, then squeeze and strip whitespace inside those fields, then turn the record into a CSV record.
require 'csv'
require 'scanf' if (RUBY_VERSION >= '1.9.3')
FORMAT = '%15c %d %d %d %10c'
data = <<EOT
Raja Palit 77489 24 84 12/12/2011
Mathew bargur 77559 25 88 01/12/2011
harin Roy 77787 24 80 12/12/2012
Soumi paul 77251 24 88 11/11/2012
Maria Von Trapp 99999 99 99 12/31/2012
Smokey the Bear 99999 99 99 12/31/2012
EOT
data.split("\n").each do |li|
fields = li.scanf(FORMAT)
puts [fields.first.strip, *fields[1 .. -1]].to_csv
end
Which outputs:
Raja Palit,77489,24,84,12/12/2011
Mathew bargur,77559,25,88,01/12/2011
harin Roy,77787,24,80,12/12/2012
Soumi paul,77251,24,88,11/11/2012
Maria Von Trapp,99999,99,99,12/31/2012
Smokey the Bear,99999,99,99,12/31/2012
Note, Ruby 1.9.3 split scanf into its own module, which explains the conditional require.
Strings come with a squeeze method, it squeezes runs of the char(s) in the argument into one char. In this case it reduces the multiple spaces into one space, which is then replaced by a comma:
File.open("test.txt") do |in_file|
File.open("test.csv", 'w') do |out_file| #the 'w' opens the file for writing
in_file.each {|line| out_file << line.squeeze(' ').gsub(' ', ',') }
end # closes test.csv
end # closes test.txt
You could use a regular expression to replace any whitespace characters with a comma:
my_string.sub! /\s/g, ','
If you want to discard empty fields, you could use this:
my_string.sub! /\s+/g, ','
An alternative would be to split it on spaces and join on commas. This will also discard empty fields:
my_string = my_string.split(' ').join(',')
File.open("details.txt", "r+"){|io| io.write(io.read.gsub(/[ \t]+/, ","))}
I have a list of sql queries beautifully encoded in utf-8. I read them from files, perform the inserts and than do a select.
# encoding: utf-8
def exec_sql_lines(file_name)
puts "----> #{file_name} <----"
File.open(file_name, 'r') do |f|
# sometimes a query doesn't fit one line
previous_line=""
i = 0
while line = f.gets do
puts i+=1
if(line[-2] != ')')
previous_line += line[0..-2]
next
end
puts (previous_line + line) # <---- (1)
$db.execute((previous_line + line))
previous_line =""
end
a = $db.execute("select * from Table where _id=6")
puts a <---- (2)
end
end
$db=SQLite3::Database.new($DBNAME)
exec_sql_lines("creates.txt")
exec_sql_lines("inserts.txt")
$db.close
The text in (1) is different than the one in (2). Polish letters get broken. If I use IRB and call $db.open ; $db.encoding is says UTF-8.
Why do Polish letters come out broken? How to fix it?
I need this database properly encoded in UTF-8 for my Android app, so I'm not interested in manipulating with database output. I need to fix it's content.
EDIT
Significant lines from the output:
6
INSERT INTO 'Leki' VALUES (NULL, '6', 'Acenocoumarolum', 'Acenocumarol WZF', 'tabl. ', '4 mg', '60 tabl.', '5909990055715', '2012-01-01', '2 lata', '21.0, Leki przeciwzakrzepowe z grupy antagonistów witaminy K', '8.32', '12.07', '12.07', 'We wszystkich zarejestrowanych wskazaniach na dzień wydania decyzji', '', 'ryczałt', '5.12')
out:
6
6
Acenocoumarolum
Acenocumarol WZF
tabl.
4 mg
60 tabl.
5909990055715
2012-01-01
2 lata
21.0, Leki przeciwzakrzepowe z grupy antagonistĂł[<--HERE]w witaminy K
8.32
12.07
12.07
We wszystkich zarejestrowanych wskazaniach na dzieĹ[<--HERE] wydania decyzji
ryczaĹ[<--HERE]t
5.12
There are three default encoding.
In you code you set the source encoding.
Perhaps there is a problem with External and Internal Encoding?
A quick test in windows:
#encoding: utf-8
File.open(__FILE__,'r'){|f|
p f.external_encoding
p f.internal_encoding
p f.read.encoding
}
Result:
#<Encoding:CP850>
nil
#<Encoding:CP850>
Even if UTF-8 is used, the data are read as cp850.
In your case:
Does File.open(filename,'r:utf-8') help?
I have a string variable with multiple lines: e.g.
"SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n
I would want to get both of lines that start with "Seq_vec SVEC" and extract the values of the integer part that matches...
string = "Clone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
seqvector = Regexp.new("Seq_vec\\s+SVEC\\s+(\\d+\\s+\\d+)",Regexp::MULTILINE )
vector = string.match(seqvector)
if vector
vector_start,vector_stop = vector[1].split(/ /)
puts vector_start.to_i
puts vector_stop.to_i
end
However this only grabs the first match's values and not the second as i would like.
Any ideas what i could be doing wrong?
Thank you
To capture groups use String#scan
vector = string.scan(seqvector)
=> [["1 65"], ["102 1710"]]
match finds just the first match. To find all matches use String#scan e.g.
string.scan(seqvector)
=> [["1 65"], ["102 1710"]]
or to do something with each match:
string.scan(seqvector) do |match|
# match[0] will be the substring captured by your first regexp grouping
puts match.inspect
end
Just to make this a bit easier to handle, I would split the whole string into an array first and then would do:
string = "SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
selected_strings = string.split("\n").select{|x| /Seq_vec SVEC/.match(x)}
selected_strings.collect{|x| x.scan(/\s\d+/)}.flatten # => [" 1", " 65", " 102", " 1710"]