Remove &thinsp; from Ruby String - ruby

I am trying to parse some data and am having trouble removing a &thinsp; symbol. I know that it is just a "space", but I really can't manage to strip it from the string.
My code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('my_page.html')
price = page.search('#product_buy .price').text.to_s.gsub(/\s+/, "").gsub(" ", "").gsub("&thinsp;", "")
puts price
As a result I always get "4 162", still with that space in it. I don't know what to do.
If anyone has run into this issue before, please help. Thank you.

HTML escape codes don't mean anything to Ruby's regex engine. Looking for "&thinsp;" will look for those literal characters, not a thin space. Instead, Ruby 1.9 and later support Unicode escapes in strings, so you can use the Unicode code point for a thin space in your substitution. That code point is U+2009, which you can write in a double-quoted Ruby string (or in a regex) as \u2009.
Additionally, instead of calling some_string.gsub('some_string', ''), you can just call some_string.delete('some_string').
Note that this isn't appropriate for all situations, because delete removes all instances of all characters appearing in the intersection of its arguments, while gsub removes only the segments matching the pattern provided. For example, 'hellohi'.gsub('hello', '') == 'hi', while 'hellohi'.delete('hello') == 'i'.
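In your specific case, I'd use something like the following (note the double quotes, which are required for \u2009 to be interpreted as an escape; inside single quotes, delete would strip the literal characters u, 2, 0, 9 and s instead):
price = page.search('#product_buy .price').text.delete("\u2009\s")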
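As a minimal sketch of the difference (the sample string is made up to stand in for the scraped text, and Ruby 1.9 or later with UTF-8 strings is assumed), \s in a regex leaves the thin space in place, while delete with a double-quoted argument or gsub with [[:space:]] removes it:

```ruby
# Made-up sample standing in for the scraped price text; \u2009 is THIN SPACE.
raw_price = " 4\u2009162 "

# \s in a regex only matches ASCII whitespace, so the thin space survives.
ascii_only = raw_price.gsub(/\s+/, "")

# Double quotes are needed so that \u2009 is read as an escape sequence.
with_delete = raw_price.delete("\u2009 ")

# [[:space:]] is Unicode-aware and also covers U+2009 and U+00A0.
with_gsub = raw_price.gsub(/[[:space:]]/, "")

puts ascii_only.length # the thin space is still counted here
puts with_delete
puts with_gsub
```

The [[:space:]] variant is probably the more forgiving choice when it is unclear which kind of space the page uses.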

Related

Ruby Regex and string variable

I am trying to match a string with a non-breaking space (&nbsp;) given a variable that contains the string with a regular space. The string I am looking for is the text in an HTML link/anchor and I am using Watir (note the non-breaking space).
<a onclick='DoSomthing()' href=''>Some&nbsp;Text</a>
There appears to be a difference between a regex created by // and one created by Regexp.new.
Interactive Ruby says the following is true (where my_text = 'Some Text'):
/Some Text/ == Regexp.new(my_text)
Yet while this returns true:
browser.link(:text, /Some Text/).exists?
This does not:
browser.link(:text, Regexp.new(my_text)).exists?
Nor does this:
browser.link(:text, /#{my_text}/).exists?
I've also tried the following with no luck:
Regexp.new(my_text.gsub(' ', '[[:space:]]'))
Does anyone know how I can accomplish this match?
Use alternation:
browser.link(:text, / |\u00A0/).exists?
Also, try upgrading Ruby and your gems. I've heard of weird regex issues in Watir being resolved that way.
A non-breaking space appears in HTML as an entity, and as far as I know a regex does not treat it as an ordinary space, so you need to convert one or the other before matching.
my_text = 'Some&nbsp;Text'
In other words, I don't think a regex would ever match a space against "&nbsp;". Change your search string or the source text, whichever is easier.
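One way to build such a pattern from the variable is sketched below. It assumes the browser hands back the decoded character U+00A0 rather than the entity text, which may not hold for every Watir version; link_text is a made-up stand-in for the text Watir would return.

```ruby
my_text = "Some Text"

# Made-up stand-in for the link text; \u00A0 is NO-BREAK SPACE.
link_text = "Some\u00A0Text"

# Escape each word, then join the words with a class that accepts
# either a regular space or a non-breaking space.
words = my_text.split(" ").map { |word| Regexp.escape(word) }
pattern = Regexp.new(words.join("[ \u00A0]"))

puts(pattern =~ link_text ? "match" : "no match")
puts(Regexp.new(my_text) =~ link_text ? "match" : "no match")
```

With Watir, the resulting pattern would then be passed in place of the literal regex, for example browser.link(:text, pattern).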

How do I write regexes for German character classes like letters, vowels, and consonants?

For example, I set up these:
L = /[a-z,A-Z,ßäüöÄÖÜ]/
V = /[äöüÄÖÜaeiouAEIOU]/
K = /[ßb-zBZ&&[^#{V}]]/
So that /(#{K}#{V}{2})/ matches "ᄚ" in "azAZᄚ".
Are there any better ways of dealing with them?
Could I put those constants in a module in a file somewhere in my Ruby installation folder, so I can include/require them inside any new script I write on my computer? (I'm a newbie and I know I'm muddling this terminology; Please correct me.)
Furthermore, could I get just the meta-characters \L, \V, and \K (or whatever isn't already set in Ruby) to stand for them in regexes, so I don't have to do that string interpolation thing all the time?
You're starting pretty well, but you need to look through the Regexp class code that is installed by Ruby. There are tricks for writing patterns that build themselves using String interpolation. You write the bricks and let Ruby build the walls and house with normal String tricks, then turn the resulting strings into true Regexp instances for use in your code.
For instance:
LOWER_CASE_CHARS = 'a-z'
UPPER_CASE_CHARS = 'A-Z'
CHARS = LOWER_CASE_CHARS + UPPER_CASE_CHARS
DIGITS = '0-9'
CHARS_REGEX = /[#{ CHARS }]/
DIGITS_REGEX = /[#{ DIGITS }]/
WORDS = "#{ CHARS }#{ DIGITS }_"
WORDS_REGEX = /[#{ WORDS }]/
You keep building from small atomic characters and character classes and soon you'll have big regular expressions. Try pasting those one by one into IRB and you'll quickly get the hang of it.
A small improvement on what you do now would be to use the regex engine's Unicode support for categories or scripts.
If you mean L to be any letter, use \p{L}. Or use \p{Latin} if you want it to mean any letter in a Latin script (all German letters are).
I don't think there are built-ins for vowels and consonants.
See \p{L} match your example.
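Putting the pieces together, here is a rough sketch of how such constants might be grouped in a module. The module name German, the constant names and the file name mentioned in the comment are all invented for illustration, and Ruby 2.0 or later (UTF-8 source by default) is assumed.

```ruby
# Hypothetical module; it could live in a file such as german_chars.rb and be
# loaded with require_relative (or require, if its directory is on $LOAD_PATH).
module German
  VOWEL_CHARS = "aeiouäöüAEIOUÄÖÜ"
  LETTER      = /\p{L}/
  VOWEL       = /[#{VOWEL_CHARS}]/
  # Character class intersection: any letter that is not a vowel.
  CONSONANT   = /[\p{L}&&[^#{VOWEL_CHARS}]]/
end

word = "Häuser"
puts word.scan(German::VOWEL).join
puts word.scan(German::CONSONANT).join
# A consonant followed by two vowels, built by interpolating the constants.
puts word[/#{German::CONSONANT}#{German::VOWEL}{2}/]
```

As far as I know, Ruby offers no hook for defining new metacharacters such as \V, so interpolating named constants like these is about as close as it gets.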

Ruby 1.9.3 add unsafe characters to URI.escape

I am using Sinatra and get parameters from the URL using the get '/foo/:bar' {} method. Unfortunately, the value in :bar can contain nasty things like /, which leads to a 404, since no route matches /foo/:bar/baz/. I use URI.escape to escape the URL parameter, but it considers / a valid character. As mentioned here, this is because the default Regexp to check against does not differentiate between unsafe and reserved characters. I wanted to change this and tried:
URI.escape("foo_<_>_&_3_#_/_+_%_bar", Regexp.union(URI::REGEXP::UNSAFE, '/'))
just to test it.
URI::REGEXP::UNSAFE is the default regexp to match against according to the Ruby 1.9.3 documentation:
escape(*arg)
Synopsis
URI.escape(str [, unsafe])
Args
str
String to replaces in.
unsafe
Regexp that matches all symbols that must be replaced with
codes. By default uses REGEXP::UNSAFE. When this argument is
a String, it represents a character set.
Description
Escapes the string, replacing all unsafe characters with codes.
Unfortunately, I get this error:
uninitialized constant URI::REGEXP::UNSAFE
As this GitHub issue suggests, this Regexp was removed from Ruby in 1.9.3. The URI module's documentation is generally rather poor, and I really cannot figure this out. Any hints?
Thanks in advance!
URI.escape is not what you are looking for. You want CGI.escape:
require 'cgi'
CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
# => "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
This will properly encode it to allow Sinatra to retrieve it.
Perhaps you would have better luck with CGI.escape?
>> require 'uri'; URI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_&_3_%23_/_+_%25_bar"
>> require 'cgi'; CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
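For readers on newer Rubies, note that URI.escape was removed in Ruby 3.0, so only the CGI variant (or URI.encode_www_form_component, used here as a stand-in) still applies. The following is a small sketch rather than a drop-in Sinatra fix; the variable names are made up.

```ruby
require "cgi"
require "uri"

segment = "foo_<_>_&_3_#_/_+_%_bar"

# CGI.escape percent-encodes reserved characters such as "/" and "+".
escaped = CGI.escape(segment)
puts escaped

# The escaped value no longer contains a slash and round-trips cleanly.
puts escaped.include?("/")
puts CGI.unescape(escaped) == segment

# URI.encode_www_form_component behaves the same way for this input.
form_escaped = URI.encode_www_form_component(segment)
puts form_escaped == escaped
```

Both methods encode a space as "+", which is worth keeping in mind if the value can contain spaces.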

Ruby -- looking for some sort of "Regexp unescape" method

I have a bunch of strings with special escape codes that I want to store unescaped, e.g. the interpreter shows
"\\014\"\\000\"\\016smoothing\"\\011mean\"\\022color\"\\011zero#\\016"
but I want it to show (when inspected) as
"\014\"\000\"\016smoothing\"\011mean\"\022color\"\011zero#\016"
What's the method to unescape them? I imagine that I could make a regex to remove 1 backslash from every consecutive n backslashes, but I don't have a lot of regex experience and it seems there ought to be a "more elegant" way to do it.
For example, when I puts MyString it displays the output I'd like, but I don't know how I might capture that into a variable.
Thanks!
Edited to add context: I have this class that is being used to marshal / restore some stuff, but when I restore some old strings it spits out a type error which I've determined is because they weren't -- for some inexplicable reason -- stored as base64. They instead appear to have just been escaped, which I don't want, because trying to restore them similarly gives the TypeError
TypeError: incompatible marshal file format (can't be read)
format version 4.8 required; 92.48 given
because Marshal looks at the first characters of the string to determine the format.
require 'base64'
class MarshaledStuff < ActiveRecord::Base
validates_presence_of :marshaled_obj
def contents
obj = self.marshaled_obj
return Marshal.restore(Base64.decode64(obj))
end
def contents=(newcontents)
self.marshaled_obj = Base64.encode64(Marshal.dump(newcontents))
end
end
Edit 2: Changed wording -- I was thinking they were "double-escaped" but it was only single-escaped. Whoops!
If your strings give you the correct output when you print them then they are already escaped correctly. The extra backslashes you see are probably because you are displaying them in the interactive interpreter which adds extra backslashes for you when you display variables to make them less ambiguous.
> x
=> "\\"
> puts x
\
=> nil
> x.length
=> 1
Note that even though it looks like x contains two backslashes, the length of the string is one. The extra backslash is added by the interpreter and is not really part of the string.
If you still think there's a problem, please be more specific about how you are displaying the strings that you mentioned in your question.
Edit: In your example the only things that need unescaping are the octal escape codes. You could try this (the first digit class is [0-3] so that codes up to \377 are covered):
x = x.gsub(/\\[0-3][0-7]{2}/){ |c| c[1,3].to_i(8).chr }
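To see why Marshal reports "92.48 given", here is a small round-trip sketch. The escaping step is a guess at how the stored strings might have been produced, not something the question confirms: it writes every byte as a backslash plus three octal digits.

```ruby
original = Marshal.dump("mean")

# Assumed escaping step: each byte becomes a backslash and three octal digits.
escaped = original.bytes.map { |byte| format('\\%03o', byte) }.join

# The first two bytes are now "\" (92) and "0" (48), hence "92.48 given".
p escaped.bytes.first(2)

# Turn every backslash-plus-octal sequence back into the byte it encodes.
unescaped = escaped.gsub(/\\([0-3][0-7]{2})/) { $1.to_i(8).chr }

# The unescaped data starts with bytes 4 and 8 again, so Marshal accepts it.
p unescaped.bytes.first(2)
puts Marshal.load(unescaped)
```

If the stored data mixes escaped and raw bytes, the same gsub still applies, since it only touches the backslash sequences.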

how to convert strings like "this is an example" to "this-is-an-example" under ruby

How do I convert strings like "this is an example" to "this-is-an-example" in Ruby?
The simplest version:
"this is an example".tr(" ", "-")
#=> "this-is-an-example"
You could also do something like this, which is slightly more robust and easier to extend by updating the regular expression:
"this is an example".gsub(/\s+/, "-")
#=> "this-is-an-example"
The above will replace each chunk of whitespace (any combination of multiple spaces, tabs, and newlines) with a single dash.
See the String class reference for more details about the methods that can be used to manipulate strings in Ruby.
If you are trying to generate a string that can be used in a URL, you should also consider stripping other non-alphanumeric characters (especially the ones that have special meaning in URLs), or replacing them with an alphanumeric equivalent (for example, as suggested by Rob Cameron in his answer).
If you are trying to make something that is a good URL slug, there are lots of ways to do it.
Generally, you want to remove everything that is not a letter or number, and then replace all whitespace characters with dashes.
So:
s = "this is an 'example'"
s = s.gsub(/\W+/, ' ').strip
s = s.gsub(/\s+/,'-')
At the end s will equal "this-is-an-example"
I took this particular approach from the source code of a Ruby testing library called contest.
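The same steps can be wrapped in a small helper. This is only a sketch: the name slugify is made up, and the pattern keeps ASCII letters and digits only, so accented characters are dropped rather than transliterated.

```ruby
# Hypothetical helper: lowercases the text, turns every run of characters
# that are not ASCII letters or digits into one dash, then trims stray dashes.
def slugify(text)
  text.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

puts slugify("This is an 'example'!")
puts slugify("  Hello,   world  ")
```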
If you're using Rails take a look at parameterize(), it does exactly what you're looking for:
http://api.rubyonrails.org/classes/ActiveSupport/CoreExtensions/String/Inflections.html#M001367
foo = "Hello, world!"
foo.parameterize # => "hello-world"
