Hpricot error parsing special characters in URI - ruby

I'm working on a Ruby script to grab historical stock prices from Yahoo, using Hpricot to parse the pages. This is mostly straightforward: the URL is "http://finance.yahoo.com/q/hp?s=TickerSymbol". For example, to look up Google, I would use "http://finance.yahoo.com/q/hp?s=GOOG".
Unfortunately, it breaks down when I'm looking up the price of an index. The indexes are prefixed with a caret, such as "http://finance.yahoo.com/q/hp?s=^DJI" for the Dow.
The line:
ticker_symbol = '^DJI'
doc = Hpricot(open("http://finance.yahoo.com/q/hp?s=#{ticker_symbol}"))
throws this exception:
bad URI(is not URI?): http://finance.yahoo.com/q/hp?s=^DJI
Hpricot chokes on the caret (I think because the underlying Ruby URI library does). Is there a way to escape that character or force the library to try it?

Well, don't I feel dumb. Five more minutes and I got this working:
doc = Hpricot(open(URI.encode("http://finance.yahoo.com/q/hp?s=#{ticker_symbol}")))
So if anyone else is wondering, that's how you do it. facepalm

The escape for ^ is %5E; you could do a straight substitution on the URL.
http://finance.yahoo.com/q/hp?s=%5EDJI
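URI.encode was removed in Ruby 3.0, so on current versions a different call is needed. A minimal sketch, assuming only the ticker symbol needs escaping, is to percent-encode that one query value with URI.encode_www_form_component before building the URL (the variable names here are illustrative):

```ruby
require 'uri'

ticker_symbol = '^DJI'

# Percent-encode only the query value; the rest of the URL is already valid.
escaped_symbol = URI.encode_www_form_component(ticker_symbol)
url = "http://finance.yahoo.com/q/hp?s=#{escaped_symbol}"

# URI.parse accepts the escaped string, where the raw caret raised "bad URI".
uri = URI.parse(url)
puts uri
```

The resulting string can then be handed to open-uri in place of the unescaped one.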

Related

Simple alternative to URI.escape

While using URI.parse on a URL, I was confronted with an error message:
URI::InvalidURIError: URI must be ascii only
I found a StackOverflow question that recommended using URI.escape, which works. Using the URL in that question as an example:
URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> URI::InvalidURIError: URI must be ascii only "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/\u041E\u0443\u044D\u043D-\u041C\u044D\u0442\u044C\u044E\u0441.jpg"
URI.encode('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
However, URI.escape is obsolete, as Rubocop warns:
URI.escape method is obsolete and should not be used. Instead, use CGI.escape, URI.encode_www_form or URI.encode_www_form_component depending on your specific use case.
But while URI.escape gives us a usable result, the alternatives don’t:
CGI.escape('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http%3A%2F%2Fdxczjjuegupb.cloudfront.net%2Fwp-content%2Fuploads%2F2017%2F10%2F%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
This is a bother because in my case I’m constructing a URL from data I get via Nokogiri:
my_url = page.at('.someclass').at('img').attr('src')
Since I only need to escape the last part of the resulting URL, but CGI.escape and similar transform the whole string (including necessary characters, such as : and /), getting the escaped result now becomes a multiple-lines-of-code ordeal, having to split the path and using several variables to achieve what could be previously done with a single function (URI.escape).
Is there a simple alternative I’m not seeing? It needs to be done without external gems.
I tend to use Addressable for parsing URLs since the standard URI has flaws:
require 'addressable'
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
#<Addressable::URI:0x3fc37ecc1c40 URI:http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg>
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg').path
# "/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg"
It isn't part of the Ruby core or the standard library but it should be and it always ends up in my Gemfiles.
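If external gems are not an option, one standard-library-only sketch is to leave the ASCII part of the URL alone and percent-encode just the non-ASCII characters. The helper name escape_non_ascii is hypothetical, and this assumes the ASCII portion of the URL is already valid (it does not touch spaces or other reserved characters):

```ruby
require 'uri'

# Percent-encode every non-ASCII character, leaving ":" and "/" untouched.
def escape_non_ascii(url)
  url.gsub(/[^[:ascii:]]/) { |char| URI.encode_www_form_component(char) }
end

url = 'http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg'
escaped = escape_non_ascii(url)

# The escaped string is ASCII-only, so URI.parse no longer raises.
puts URI.parse(escaped)
```

For the example URL this yields the same string that URI.encode produced in the question.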

Trying to remove \"id\":x from json string using regular expression in Ruby

I have a JSON string that looks like {\"heading\":\"Test\",\"id\":1} and I want to wipe the ID data from the string.
I've tried test.gsub(/\,\\"id\\"\:d+/, '') but that's not working.
How best to achieve this?
Sergio's JSON.parse is something you should consider. But barring that, those \'s you are seeing probably aren't really part of the string. That's just how irb is displaying it.
So test.gsub(/,"id":\d+/, '') should be what you want. (Also fixed a few other small bugs in the regex).
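For comparison, the JSON.parse route might look like the following sketch. It assumes the string is a flat JSON object with a top-level "id" key, as in the question; the variable names are illustrative:

```ruby
require 'json'

test = '{"heading":"Test","id":1}'

# Parse into a Hash, drop the key, and serialize back to a JSON string.
data = JSON.parse(test)
data.delete('id')
result = JSON.generate(data)

puts result
```

Unlike the regex, this does not depend on where "id" sits in the string or on its value being an integer.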

Nokogiri - Encoding Issue - Invalid UTF8 characters

Can someone take a look at this? I think there are invalid UTF-8 characters when making this call.
Nokogiri::HTML(open("http://www.next.co.uk/x502062s2"))
Is there a way around this? And is this the issue? I am writing a new open source screen scraper designed for product information capture (when a site does not supply a feed) before anyone says I am doing something a little shifty :-)
Before passing anything to Nokogiri, you can encode the content of the page, and ignore all invalid UTF characters using Iconv.
I was using it like this:
require 'iconv' # a separate gem since Ruby 2.0
require 'open-uri'
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(open('http://example.com').read)
You can also check "Fixing invalid UTF-8 in Ruby, revisited."
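On Ruby 2.1 and later, String#scrub can do a similar job without Iconv. The sketch below uses a hard-coded byte string as a stand-in for the downloaded page body (an assumption, to keep the example offline) and drops the invalid bytes before the string would be handed to Nokogiri:

```ruby
# Stand-in for a fetched page body: "\xE9" alone is not valid UTF-8.
raw = "caf\xE9 ok".b.force_encoding(Encoding::UTF_8)

puts raw.valid_encoding?   # the stray byte makes the string invalid

# Replace each invalid byte sequence with an empty string.
clean = raw.scrub('')

puts clean.valid_encoding?
puts clean
```

Calling scrub with no argument substitutes the Unicode replacement character instead of removing the bytes, which may be preferable when the position of the damage matters.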

Ruby RegEx issue

I'm having a problem getting my RegEx to work with my Ruby script.
Here is what I'm trying to match:
http://my.test.website.com/{GUID}/{GUID}/
Here is the RegEx that I've tested and should be matching the string as shown above:
/([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)/
3 capturing groups:
group 1: ([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)
group 2: (\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)
group 3: ([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])
Ruby is giving me an error when trying to validate a match against this regex:
empty range in char class: (My RegEx goes here) (SyntaxError)
I appreciate any thoughts or suggestions on this.
You could simplify things a bit by using URI to deal with parsing the URL, \h in the regex, and scan to pull out the GUIDs:
uri = URI.parse(your_url)
path = uri.path
guids = path.scan(/\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/)
If you need any of the non-path components of the URL then you can easily pull them out of uri.
You might need to tighten things up a bit depending on your data or it might be sufficient to check that guids has two elements.
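One way to tighten this up, sketched under the assumption that the path must consist of exactly two GUIDs and nothing else, is to anchor a pattern against the whole path. The constant names, the helper extract_guids, and the sample GUIDs are all hypothetical:

```ruby
require 'uri'

GUID = /\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/
# The whole path must be "/<guid>/<guid>/".
TWO_GUID_PATH = %r{\A/(#{GUID})/(#{GUID})/\z}

# Returns the two GUIDs as an array, or nil when the path has another shape.
def extract_guids(url)
  match = TWO_GUID_PATH.match(URI.parse(url).path)
  match ? match.captures : nil
end

url = 'http://my.test.website.com/01234567-89ab-cdef-0123-456789abcdef/fedcba98-7654-3210-fedc-ba9876543210/'
p extract_guids(url)
```

Building the path pattern from a small GUID pattern keeps each piece readable and avoids the nested character classes that triggered the "empty range in char class" error.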
You have several errors in your RegEx. I am very sleepy now, so I'll just give you a hint instead of a solution:
...[\/\/[0-9a-fA-F]....
the first [ does not belong there. Also, having \/\/ inside [] is unnecessary - you only need each character once inside []. Also,
...[-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}...
is greedy, and includes a period - indeed, includes all chars (AFAICS) that can come after it, effectively swallowing the whole string (when you get rid of other bugs). Consider {2,256}? instead.

Detect encoding

I'm getting some string data from the web, and I suspect that it's not always what it says it is. I don't know where the problem is, and I just don't care any more. From day one on this project I've been fighting Ruby string encoding. I really want some way to say: "Here's a string. What is it?", and then use that data to get it to UTF-8 so that it doesn't explode gsub() 2,000 lines down in the depths of my app. I've checked out rchardet, but even though it supposedly works for 1.9 now, it just blows up given any input with multiple bytes... which is not helpful.
You can't really detect the encoding. You can only assume it.
For most Western-language applications, the following construct will work. The traditional encoding is usually "ISO-8859-1"; the newer, preferred encoding is UTF-8. Why not simply try UTF-8 first and fall back to the old encoding?
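def detect_encoding(str)
  # str.encode("UTF-8") is a no-op on a string already labelled UTF-8 and
  # never raises, so check whether the bytes form valid UTF-8 instead.
  if str.dup.force_encoding("UTF-8").valid_encoding?
    "UTF-8"
  else
    "ISO-8859-1"
  end
end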
It is impossible to tell from a string what encoding it is in. You always need some additional metadata that tells you what the string's encoding is.
If you get the string from the web, that metadata is in the HTTP headers. If the HTTP headers are wrong, there is absolutely nothing that you or Ruby or anyone else can do. You need to file a bug with the webmaster of the site where you got the string from and wait till he fixes it. If you have a Service Level Agreement with the website, file a bug, wait a week, then sue them.
Old question, but chardet works on 1.9: http://rubygems.org/gems/chardet
Why not try https://github.com/brianmario/charlock_holmes to detect the encoding, and then use it to convert to UTF-8 as well?
require 'charlock_holmes'
class EncodeParser
  def initialize(text)
    @text = text
  end
  def detected_encoding
    CharlockHolmes::EncodingDetector.detect(@text)[:encoding]
  end
  def convert_to_utf8
    CharlockHolmes::Converter.convert(@text, detected_encoding, "UTF-8")
  end
end
Then just use EncodeParser.new(text).detected_encoding or EncodeParser.new(text).convert_to_utf8.
We had some fine experience with ensure_encoding. It actually does the job for us to convert resource files having unknown encoding to UTF-8.
The README will give you some hints which options would be a good fit for your situation.
I have never tried chardet since ensure_encoding did the job just fine for us.
I covered here how we use ensure_encoding.
Try setting these in your environment.
export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
Try passing -EBINARY or -EASCII-8BIT on the ruby command line.
Try adding -Ku or -Kn to your ruby command line.
Could you paste the error message?
Also try this: http://github.com/candlerb/string19/blob/master/string19.rb
Might try reading this: http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
I know it's an old question, but in modern versions of Ruby it's as simple as str.encoding. You get a return value something like this: #<Encoding:UTF-8>
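Note that str.encoding reports the label attached to the string, not what the bytes actually are, so it is worth pairing with valid_encoding?. The sketch below uses hypothetical sample bytes and a hypothetical helper, to_utf8, which assumes that anything that is not valid UTF-8 is ISO-8859-1:

```ruby
# Bytes for "café" in ISO-8859-1, mislabelled as UTF-8 (sample data).
mislabelled = "caf\xE9".b.force_encoding(Encoding::UTF_8)

label = mislabelled.encoding         # only the label the string carries
valid = mislabelled.valid_encoding?  # whether the bytes fit that label

# Return a UTF-8 copy: keep valid UTF-8 as is, otherwise reinterpret the
# bytes using the fallback encoding (an assumption) and transcode them.
def to_utf8(str, fallback: Encoding::ISO_8859_1)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  str.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

puts label
puts valid
puts to_utf8(mislabelled)
```

The fallback is a guess, as the earlier answers point out; if the HTTP headers or other metadata name an encoding, passing that as the fallback is likely to be more reliable.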
