Ruby 1.8 regexp: index of match in utf string - ruby

I'm trying to search a text for a match and return it with a snippet of the text around it. To do this, I want to find the match with a regex, then cut the string using the match index ± the snippet radius (text.mb_chars[start..finish]).
However, I cannot get Ruby 1.8's regex to return a match index that is multi-byte aware.
I understand that regex is the one place in 1.8 that is supposed to be UTF-8 aware, but it doesn't seem to work despite the /u switch:
"Résumé" =~ /s/u
=> 3
"Resume" =~ /s/u
=> 2
The results should be the same if the regex were really working in multibyte mode (/u), but it's returning the byte index.
How do I get the match index in characters, not bytes?
Or is there some other way to get a snippet around (each) match?

Not a real answer, but too long for a comment.
The code
print "Résumé" =~ /s/u
print "\n"
print "Resume" =~ /s/u
on Windows (Ruby 1.8.6, release 26.) prints:
2
2
And on Linux (ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]) it prints:
3
2

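How about using this jindex function I wrote, which corresponds to the other methods in the jcode library:
$KCODE = 'u'      # assumption: UTF-8 input; needed for split(//) and jlength to count characters
require 'jcode'   # Ruby 1.8 only: provides String#jlength
require 'English' # provides $PREMATCH
class String
  def jslice(*args)
    split(//)[*args].join rescue ""
  end
  def jindex(match, start = 0)
    if match.is_a? String
      match = Regexp.new(Regexp.escape(match))
    end
    if self.jslice(start..-1) =~ match
      $PREMATCH.jlength + start
    else
      nil
    end
  end
end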
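If the goal is only the snippet, one possible approach is to count code points instead of bytes. The sketch below is an illustration under stated assumptions, not part of the answers above: snippet_around is a hypothetical helper name, the input is assumed to be valid UTF-8, and it relies on unpack('U*') and pack('U*'), which decode and encode UTF-8 without consulting $KCODE, so it should behave the same on 1.8 and on later versions.

```ruby
# Hypothetical helper (not from the original post): returns the first match of
# `pattern` in `text`, plus up to `radius` characters on either side.
def snippet_around(text, pattern, radius)
  m = pattern.match(text)
  return nil unless m

  codepoints   = text.unpack('U*')                # one integer per character
  match_start  = m.pre_match.unpack('U*').length  # character index of the match
  match_length = m[0].unpack('U*').length

  first = [match_start - radius, 0].max
  last  = [match_start + match_length + radius, codepoints.length].min - 1
  codepoints[first..last].pack('U*')
end

puts snippet_around("Résumé", /s/u, 1)  # prints ésu
```

Only the first match is handled here; for every match, the same arithmetic could be repeated on the text after each match.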

Related

Ruby: How to find out if a character is a letter or a digit?

I just started tinkering with Ruby earlier this week and I've run into something that I don't quite know how to code. I'm converting a scanner that was written in Java into Ruby for a class assignment, and I've gotten down to this section:
if (Character.isLetter(lookAhead))
{
    return id();
}
if (Character.isDigit(lookAhead))
{
    return number();
}
lookAhead is a single character picked out of the string (advancing by one position on each pass through the loop), and these two methods determine whether it is a letter or a digit, returning the appropriate token type. I haven't been able to figure out a Ruby equivalent of Character.isLetter() and Character.isDigit().
Use a regular expression that matches letters & digits:
def letter?(lookAhead)
  lookAhead.match?(/[[:alpha:]]/)
end
def numeric?(lookAhead)
  lookAhead.match?(/[[:digit:]]/)
end
These are called POSIX bracket expressions, and their advantage is that Unicode characters in the given category also match. For example:
'ñ'.match?(/[A-Za-z]/) #=> false
'ñ'.match?(/\w/) #=> false
'ñ'.match?(/[[:alpha:]]/) #=> true
You can read more in Ruby’s docs for regular expressions.
The simplest way would be to use a Regular Expression:
def numeric?(lookAhead)
  lookAhead =~ /[0-9]/
end
def letter?(lookAhead)
  lookAhead =~ /[A-Za-z]/
end
A regular expression is overkill here, and it is more expensive in terms of performance. If you just need to check whether a character is a digit, there is a simpler way:
def is_digit?(s)
  code = s.ord
  # 48 is the ASCII code of 0
  # 57 is the ASCII code of 9
  48 <= code && code <= 57
end
is_digit?("2")
=> true
is_digit?("0")
=> true
is_digit?("9")
=> true
is_digit?("/")
=> false
is_digit?("d")
=> false
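To tie the answers back to the scanner in the question, the two checks can be folded into one dispatch. This is only a sketch: char_kind is a hypothetical name, it assumes Ruby 1.9 or later (where POSIX bracket expressions are Unicode-aware), and it assumes lookAhead is a one-character string.

```ruby
# Hypothetical helper: classifies a one-character string, mirroring the
# Character.isLetter / Character.isDigit branches of the Java scanner.
def char_kind(look_ahead)
  case look_ahead
  when /\A[[:alpha:]]\z/ then :letter
  when /\A[[:digit:]]\z/ then :digit
  else :other
  end
end

p char_kind("a")  # :letter
p char_kind("7")  # :digit
p char_kind("+")  # :other
```

The scanner could then return id() for :letter and number() for :digit. The \A and \z anchors make the check fail for anything longer than one character instead of matching a letter somewhere inside it.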

How to access the various occurrences of the same match group in Ruby regular expressions?

I have a regular expression which matches several times in a string. I figured out that $1, $2, etc. can be used to access the matched groups. But how do I access the multiple occurrences of the same matched group?
Please take a look at the Rubular page below.
http://rubular.com/r/nqHP1qAqRY
So now $1 gives 916 and $2 gives nil. How can I access the 229885? Is there something similar to $1[1]?
First of all, it is not a good idea to parse XML-based data with regular expressions alone.
Instead, use a library for parsing XML files, like Nokogiri.
But if you're sure that you want to use this approach, you need to know the following.
A single regex match stops as soon as it finds the first match. So you cannot expect to get all possible matches in a string from one match call; you need to iterate through the string, starting a new match after the end of each previous one. You could do it like this:
# ruby 1.9.x version
regex = /<DATA size="(\d+)"/
str = your_string # Your string to be parsed
position = 0
matches = []
while (match = regex.match(str, position)) do # Until there are no matches anymore
  position = match.end(0) # set position to the end of the last match
  matches << match[1] # add the matched number to the matches-array
end
After this, all your parsed numbers should be in matches.
But since your comment suggests that you are using Ruby 1.8.x, here is another version that works in 1.8.x (the method signatures differ between these versions).
# ruby 1.8.x version
regex = /<DATA size="(\d+)"/
str = your_string # Your string to be parsed
matches = []
while (match = regex.match(str)) do # Until there are no matches anymore
  str = match.post_match # set str to the part which is after the match.
  matches << match[1] # add the matched number to the matches-array
end
To expand on my comment and respond to your question:
If you want to store the values in an array, modify the block and collect instead of iterate:
> arr = xml.grep(/<DATA size="(\d+)"/).collect { |d| d.match /\d+/ }
> arr.each { |a| puts "==> #{a}" }
==> 916
==> 229885
The |d| is normal Ruby block parameter syntax; each d is the matching string, from which the number is extracted. It's not the cleanest Ruby, although it's functional.
I still recommend using a parser; note that the rexml version would be this (more or less):
require 'rexml/document'
include REXML
doc = Document.new xml
arr = doc.elements.collect("//DATA") { |d| d.attributes["size"] }
arr.each { |a| puts "==> #{a}" }
Once your "XML" is converted to actual XML you can get even more useful data:
doc = Document.new xml
arr = doc.elements.collect("//file") do |f|
  name = f.elements["FILENAME"].attributes["path"]
  size = f.elements["DATA"].attributes["size"]
  [name, size]
end
arr.each { |a| puts "#{a[0]}\t#{a[1]}" }
~/Users/1.txt 916
~/Users/2.txt 229885
Getting at every capture of a repeated group from a single match is not possible in most regex implementations. (AFAIK only .NET can do this.)
You will have to use an alternate solution, e.g. scan(); see "Equivalent to Python's findall() method in Ruby?".
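A minimal sketch of the scan() approach follows. The sample string is made up for illustration, since the real data lives behind the Rubular link; only the pattern is taken from the answers above.

```ruby
# Sample input invented for this illustration.
xml = '<file><DATA size="916"/></file> <file><DATA size="229885"/></file>'

# With one capture group, scan returns an array of one-element arrays,
# so flatten turns it into a plain list of the captured numbers.
sizes = xml.scan(/<DATA size="(\d+)"/).flatten
p sizes  # ["916", "229885"]
```

String#scan is available in both 1.8 and 1.9, so no version-specific loop is needed.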

How to split a string in Ruby?

I have special strings like name1="value1" name2='value2'. Values can contain whitespaces and are delimited by either single quotes or double quotes. Names never contain whitespaces. name/value pairs are separated by whitespaces.
I want to parse them into a list of name-value pairs like this
string.magic_split() => { "name1"=>"value1", "name2"=>"value2" }
If Ruby understood lookaround assertions, I could do this by
string.split(/[\'\"](?=\s)/).each do |element|
  element =~ /(\w+)=[\'\"](.*)[\'\"]/
  hash[$1] = $2
end
but Ruby does not understand lookaround assertions, so I am somewhat stuck.
However, I am sure that there are much more elegant ways to solve this problem anyway, so I turn to you. Do you have a good idea for solving this problem?
This fails on values like '"hi" she said', but it might be good enough.
str = %q(name1="value1" name2='value 2')
p Hash[ *str.chop.split( /' |" |='|="/ ) ]
#=> {"name1"=>"value1", "name2"=>"value 2"}
This is not a complete answer, but Oniguruma, the standard regexp library in 1.9, supports lookaround assertions. It can be installed as a gem if you are using Ruby 1.8.x.
That said, and as Sorpigal has commented, instead of using a regexp I would be inclined to iterate through the string one character at a time keeping track of whether you are in a name portion, when you reach the equals sign, when you are within quotes and when you reach a matched closing quote. On reaching a closing quote you can put the name and value into the hash and proceed to the next entry.
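The walk described above can be sketched with StringScanner from the standard library, which tracks the current position for you. This is one possible reading of that description rather than the answerer's code: parse_pairs is a hypothetical name, and the sketch assumes names contain neither whitespace nor "=" and that a value never contains its own delimiter.

```ruby
require 'strscan'

# Hypothetical helper: parses name="value" / name='value' pairs into a Hash.
def parse_pairs(input)
  scanner = StringScanner.new(input)
  pairs = {}
  until scanner.eos?
    scanner.skip(/\s+/)                 # whitespace between pairs
    break if scanner.eos?

    name  = scanner.scan(/[^\s=]+/)     # name portion
    equal = scanner.scan(/=/)           # the equals sign
    quote = scanner.scan(/["']/)        # opening quote, remembered for the close
    value = quote && scanner.scan_until(/#{Regexp.escape(quote)}/)
    unless name && equal && value
      raise ArgumentError, "malformed input near position #{scanner.pos}"
    end

    pairs[name] = value.chomp(quote)    # drop the closing quote
  end
  pairs
end

parse_pairs(%q(name1="value1" name2='value 2'))
# => {"name1"=>"value1", "name2"=>"value 2"}
```

Because the closing quote must match the opening one, a value such as '"hi" she said' keeps its inner double quotes.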
class String
  def magic_split
    pairs = self.gsub('"', '\'').gsub('\' ', '\'\, ').split('\, ').map { |pair| pair.gsub("'", "").split("=") }
    Hash[pairs]
  end
end
This should do it for you.
class SpecialString
  def self.parse(string)
    string.split.map { |s| s.split("=") }.inject({}) { |h, a| h[a[0]] = a[1].gsub(/"|'/, ""); h }
  end
end
Have a try with : /[='"] ?/
I don't know Ruby syntax but here is a Perl script you could translate
#!/usr/bin/perl
use 5.10.1;
use warnings;
use strict;
use Data::Dumper;
my $str = qq/name1="val ue1" name2='va lue2'/;
my @list = split/[='"] ?/,$str;
my %hash;
for (my $i=0; $i<@list;$i+=3) {
    $hash{$list[$i]} = $list[$i+2];
}
say Dumper \%hash;
Output :
$VAR1 = {
'name2' => 'va lue2',
'name1' => 'val ue1'
};

Looking to clean up a small ruby script

I'm looking for a much more idiomatic way to do the following little ruby script.
File.open("channels.xml").each do |line|
  if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
    puts line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
  end
end
Thanks in advance for any suggestions.
The original:
File.open("channels.xml").each do |line|
  if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
    puts line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
  end
end
can be changed into this:
m = nil
open("channels.xml").each do |line|
  puts m if m = line.match(%r|(mms://{1}[\w\./-]+)|)
end
File.open can be changed to just open.
if XYZ
  puts XYZ
end
can be changed to puts x if x = XYZ as long as x has occurred at some place in the current scope before the if statement.
The Regexp '(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)' can be refactored a little bit. Using the %rXX notation, you can create regular expressions without the need for so many backslashes, where X is any matching character, such as ( and ) or in the example above, | |.
This character class [a-zA-Z\.\d\/\w-] (read: A to Z, case insensitive, the period character, 0 to 9, a forward slash, any word character, or a dash) is a little redundant. \w denotes "word characters", i.e. A-Za-z0-9 and underscore. Since you specify \w as a positive match, A-Za-z and \d are redundant.
Using those 2 cleanups, the Regexp can be changed into this: %r|(mms://{1}[\w\./-]+)|
If you'd like to avoid the weird m = nil scoping sorcery, this will also work, but is less idiomatic:
open("channels.xml").each do |line|
  m = line.match(%r|(mms://{1}[\w\./-]+)|) and puts m
end
or the longer, but more readable version:
open("channels.xml").each do |line|
  if m = line.match(%r|(mms://{1}[\w\./-]+)|)
    puts m
  end
end
One very easy-to-read approach is to store the result of the match, then print only if there is a match:
File.open("channels.xml").each do |line|
  m = line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
  puts m if m
end
If you want to start getting clever (and have less-readable code), use $&, the special variable that holds the text of the last successful match:
File.open("channels.xml").each do |line|
  puts $& if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end
Personally, I would probably just use the POSIX grep command. But there is Enumerable#grep in Ruby, too:
puts File.readlines('channels.xml').grep(%r|mms://{1}[\w\./-]+|)
Alternatively, you could use some of Ruby's file and line processing magic that it inherited from Perl. If you pass the -p flag to the Ruby interpreter, it will assume that the script you pass in is wrapped with while gets; ...; end and at the end of each loop it will print the current line. You can then use the $_ special variable to access the current line and use the next keyword to skip iteration of the loop if you don't want the line printed:
ruby -pe 'next unless $_ =~ %r|mms://{1}[\w\./-]+|' channels.xml
Basically,
ruby -pe 'next unless $_ =~ /re/' file
is equivalent to
grep -E re file
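Pulling the suggestions above together, a small sketch might look like this. mms_urls is a hypothetical helper name and the sample lines are invented; the pattern is the simplified one from the earlier answer with the redundant {1} dropped.

```ruby
# %r{} avoids escaping "/", and inside a character class "." needs no backslash.
MMS_URL = %r{mms://[\w./-]+}

# Hypothetical helper: returns the first mms:// URL found on each line,
# skipping lines without one.
def mms_urls(lines)
  lines.map { |line| line[MMS_URL] }.compact
end

sample = ['<ref href="mms://host.example/live/stream-1"/>', '<title>none</title>']
puts mms_urls(sample)  # prints mms://host.example/live/stream-1
```

With the real file this would be puts mms_urls(File.foreach("channels.xml")) on Ruby 1.8.7 or later. File.foreach closes the file once iteration finishes, whereas File.open("channels.xml").each leaves the handle open until it is garbage collected.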

Ruby: How to get the first character of a string

How can I get the first character in a string using Ruby?
Ultimately what I'm doing is taking someone's last name and just creating an initial out of it.
So if the string was "Smith" I just want "S".
You can use Ruby's open classes to make your code much more readable. For instance, this:
class String
  def initial
    self[0,1]
  end
end
will allow you to use the initial method on any string. So if you have the following variables:
last_name = "Smith"
first_name = "John"
Then you can get the initials very cleanly and readably:
puts first_name.initial # prints J
puts last_name.initial # prints S
The other method mentioned here doesn't work on Ruby 1.8 (not that you should be using 1.8 anymore anyway!--but when this answer was posted it was still quite common):
puts 'Smith'[0] # prints 83
Of course, if you're not doing it on a regular basis, then defining the method might be overkill, and you could just do it directly:
puts last_name[0,1]
If you use a recent version of Ruby (1.9.0 or later), the following should work:
'Smith'[0] # => 'S'
If you use either 1.9.0+ or 1.8.7, the following should work:
'Smith'.chars.first # => 'S'
If you use a version older than 1.8.7, this should work:
'Smith'.split(//).first # => 'S'
Note that on 1.8, 'Smith'[0,1] gives you the first byte rather than the first character, which matters as soon as the string starts with a multibyte character.
"Smith"[0..0]
works in both ruby 1.8 and ruby 1.9.
For completeness' sake: since Ruby 1.9, String#chr returns the first character of a string. It's still available in 2.0 and 2.1.
"Smith".chr #=> "S"
http://ruby-doc.org/core-1.9.3/String.html#method-i-chr
In MRI 1.8.7 or greater:
'foobarbaz'.each_char.first
Try this:
>> a = "Smith"
>> a[0]
=> "S"
OR
>> "Smith".chr
#=> "S"
In Rails
name = 'Smith'
name.first
>> s = 'Smith'
=> "Smith"
>> s[0]
=> "S"
Another option that hasn't been mentioned yet:
> "Smith".slice(0)
#=> "S"
Because of an annoying design choice in Ruby before 1.9 — some_string[0] returns the character code of the first character — the most portable way to write this is some_string[0,1], which tells it to get a substring at index 0 that's 1 character long.
Try this:
def word(string, num)
  string[0..(num-1)]
end
word('Smith', 1) # => "S"
If you're using Rails, you can also use truncate:
> 'Smith'.truncate(1, omission: '')
#=> "S"
or for additional formatting:
> 'Smith'.truncate(4)
#=> "S..."
> 'Smith'.truncate(2, omission: '.')
#=> "S."
While this is definitely overkill for the original question, for a pure Ruby solution, here is how truncate is implemented in Rails:
# File activesupport/lib/active_support/core_ext/string/filters.rb, line 66
def truncate(truncate_at, options = {})
  return dup unless length > truncate_at
  omission = options[:omission] || "..."
  length_with_room_for_omission = truncate_at - omission.length
  stop = if options[:separator]
    rindex(options[:separator], length_with_room_for_omission) || length_with_room_for_omission
  else
    length_with_room_for_omission
  end
  "#{self[0, stop]}#{omission}"
end
Another way would be to use chars on the string:
def abbrev_name
  first_name.chars.first.capitalize + '.' + ' ' + last_name
end
Any of these will work on Ruby 1.9 or later (on 1.8, name[0] returns the character code 83 instead):
name = 'Smith'
puts name[0..0] # => S
puts name[0] # => S
puts name[0,1] # => S
puts name[0].chr # => S
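For names that may start with a non-ASCII letter, a regex-based variant is one way to stay portable across the versions discussed above. This is a sketch under assumptions: initial_of is a hypothetical name, and the input is assumed to be UTF-8.

```ruby
# Hypothetical helper: first character of a string, or nil if it is empty.
# /m lets "." match a newline; /u makes the 1.8 regex engine read the string as UTF-8.
def initial_of(name)
  name[/\A./mu]
end

puts initial_of("Smith")     # prints S
puts initial_of("Ångström")  # prints Å
```

On 1.9 and later, name[0] already does the same thing; the regex form mainly helps on 1.8, where indexing works on bytes.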
