Encoding Japanese characters in a Google API search string in Ruby

I am trying to perform searches in Japanese using the custom google search api as follows:
require 'httparty'
require 'json'

class Search
  include HTTParty
  format :json
end

response = Search.get('https://www.googleapis.com/customsearch/v1?key=etcetc&q=JAPANESE SEARCH TERM')
When Japanese text is used, it fails with "invalid multibyte char (US-ASCII)".
How can I input Japanese text in a format which Ruby allows and google custom api also accepts?
Thanks for any advice.

Add
# encoding: utf-8
to the top of the file.

As a follow-up, the Google API may still not accept these Japanese search terms. It's simple to escape them for use in your search with URI.escape:
require 'uri'
retVal = URI.escape("Japanese term", Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))
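Putting the two answers together, a minimal sketch (YOUR_KEY is a placeholder, and the Japanese search term is just an example):

# encoding: utf-8
require 'httparty'
require 'uri'

class Search
  include HTTParty
  format :json
end

# Escape everything outside the URI "unreserved" set, multibyte characters included.
term = URI.escape("日本語", Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))
response = Search.get("https://www.googleapis.com/customsearch/v1?key=YOUR_KEY&q=#{term}")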

Related

Some spaces are removed when scraping a string using Nokogiri

I am very new to Ruby and I am currently working on site scraping with Nokogiri for practice. I would like to scrape the details of 'deals' from a random group-buying site. I have been able to scrape a site successfully, but I am having problems parsing the output. I tried the solutions suggested here and also a regex; so far, I have failed.
I am trying to parse the following title/description from this page:
Frosty Frappes starting at P100 for P200 worth at Café Tavolo – up to 55% off
This is what I got:
FrostyFrappes starting at P100 for P200 worth at Caf Tavolo up to 55% off
Here are the snippets in my code:
require 'open-uri'
require 'nokogiri'

html = open(url)
doc = Nokogiri::HTML(html.read)
doc.encoding = "utf-8"
title = doc.at_xpath('/html/body/div/div[9]/div[2]/div/div/div/h1/a')
puts title.content.to_s.strip.gsub(/[^0-9a-z%&!\n\/(). ]/i, '')
Please do tell me if I missed something out. Thank you.
Your xpath is too rigid and your regex is removing chars you want to keep. Here's how I would do it:
title = doc.at('div#contentDealTitle h1 a').text.strip.gsub(/\s+/,' ')
That says: take the text of the first a tag under div#contentDealTitle and h1, strip it (remove leading and trailing whitespace), and replace every run of one or more whitespace characters with a single space.
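As a quick illustration of what that chain does, on a plain string:

"  Frosty Frappes   starting\n at P100  ".strip.gsub(/\s+/, ' ')
#=> "Frosty Frappes starting at P100"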

Trouble opening UTF-8 URIs with Ruby's 'open-uri'

I'm trying to get Danish location addresses from the Google Maps web services API with Ruby and open-uri.
Trying to get Ærø, Denmark: http://maps.googleapis.com/maps/api/geocode/json?address=ærø&sensor=false&region=dk works in Chrome but not with open-uri:
require 'rubygems'
require 'open-uri'
require 'json'
require 'pp'

uri = "http://maps.googleapis.com/maps/api/geocode/json?address=ærø&sensor=false&region=dk"
response = open(uri)
array = JSON.parse(response.read)
pp array
Here it yields
/usr/lib/ruby/1.8/uri/common.rb:436:in `split': bad URI(is not URI?): http://maps.googleapis.com/maps/api/geocode/json?address=ærø&sensor=false&region=dk (URI::InvalidURIError)
Another way of doing it seems to be to escape characters:
uri = "http://maps.googleapis.com/maps/api/geocode/json?address=ærø&sensor=false&region=dk"
uri_escaped = URI.escape(uri)
response = open(uri_escaped)
array = JSON.parse(response.read)
pp array
But this yields an escaped result (which is not what I'm after :-)
Does anyone have any idea how to solve this problem (getting unescaped output, or sending a UTF-8 request)?
Ruby version here is 1.8.7
Figured it out:
Just add
require 'string19'
to the top of the second example and it works
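For reference, this is the question's second example with that one addition (assuming the string19 library is installed):

require 'rubygems'
require 'open-uri'
require 'json'
require 'pp'
require 'string19'

uri = "http://maps.googleapis.com/maps/api/geocode/json?address=ærø&sensor=false&region=dk"
uri_escaped = URI.escape(uri)
response = open(uri_escaped)
array = JSON.parse(response.read)
pp array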

Unescaping characters in a string with Ruby

Given a string in the following format (the Posterous API returns posts in this format):
s="\\u003Cp\\u003E"
How can I convert it to the actual ASCII characters, such that s="<p>"?
On OSX I successfully used Iconv.iconv('ascii', 'java', s), but once deployed to Heroku I receive an Iconv::IllegalSequence exception. I'm guessing that the system Heroku deploys to doesn't support the java encoder.
I am using HTTParty to make a request to the Posterous API. If I use curl to make the same request then I do not get the double slashes.
From HTTParty github page:
Automatic parsing of JSON and XML into ruby hashes based on response content-type
The Posterous API returns JSON (no double slashes) and HTTParty's JSON parsing is inserting the double slash.
Here is a simple example of the way I am using HTTParty to make the request.
class Posterous
  include HTTParty
  base_uri "http://www.posterous.com/api/2"
  basic_auth "username", "password"
  format :json

  def get_posts
    response = Posterous.get("/users/me/sites/9876/posts&api_token=1234")
    # snip, see below...
  end
end
With the obvious information (username, password, site_id, api_token) replaced with valid values.
At the point of snip, response.body contains a Ruby string that is in JSON format and response.parsed_response contains a Ruby hash object which HTTParty created by parsing the JSON response from the Posterous API.
In both cases the unicode sequences such as \u003C have been changed to \\u003C.
I've found a solution to this problem. I ran across this gist, where elskwid had the identical problem and ran the string through a JSON parser. Note that the bare string isn't a valid JSON document, so it has to be wrapped first, e.g. in an array:
s = "\\u003Cp\\u003E"
s = ::JSON.parse(%Q{["#{s}"]}).first
Now s == "<p>".
I ran into this exact problem the other day. There is a bug in the json parser that HTTParty uses (Crack gem) - basically it uses a case-sensitive regexp for the Unicode sequences, so because Posterous puts out A-F instead of a-f, Crack isn't unescaping them. I submitted a pull request to fix this.
In the meantime HTTParty nicely lets you specify alternate parsers so you can do ::JSON.parse bypassing Crack entirely like this:
class JsonParser < HTTParty::Parser
  def json
    ::JSON.parse(body)
  end
end

class Posterous
  include HTTParty
  parser ::JsonParser
  #....
end
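With the parser registered, the same hypothetical request from the question now goes through the stdlib JSON parser instead of Crack, so the uppercase \uXXXX sequences are decoded:

response = Posterous.get("/users/me/sites/9876/posts&api_token=1234")
response.parsed_response # built by ::JSON.parse, no doubled backslashes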
You can also use pack:
"a\\u00e4\\u3042".gsub(/\\u(....)/){[$1.hex].pack("U")} # "aäあ"
Or to do the reverse:
"aäあ".gsub(/[^ -~\n]/){"\\u%04x"%$&.ord} # "a\\u00e4\\u3042"
The doubled-backslashes almost look like a regular string being viewed in a debugger.
The string "\u003Cp\u003E" really is "<p>"; \u003C is the Unicode escape for < and \u003E is >.
>> "\u003Cp\u003E" #=> "<p>"
If you are truly getting the string with doubled backslashes then you could try stripping one of the pair.
As a test, see how long the string is:
>> "\\u003Cp\\u003E".size #=> 13
>> "\u003Cp\u003E".size #=> 3
>> "<p>".size #=> 3
All the above was done using Ruby 1.9.2, which is Unicode aware. v1.8.7 wasn't. Here's what I get using 1.8.7's IRB for comparison:
>> "\u003Cp\u003E" #=> "u003Cpu003E"

Ruby - internationalized domain names

I need to support internationalized domain names in an app I am writing. More specifically, I need to ACE encode domain names before I pass them on to an external API.
The best way to do this seems to be by using libidn. However, I have problems installing it on my development machine (Windows 7, ruby 1.8.6), as it complains about not finding the GNU IDN library (which I have installed, and also provided the full path to).
So basically I am considering two things:
Search the web for a prebuilt win32 libidn gem (fruitless so far)
Find another (hopefully pure) Ruby library that can do the same thing (none found, apparently, as I am asking this question here)
So, have any of you got libidn to work under Windows? Or have you used some other library or code snippet that is able to encode domain names?
Thanks to this snippet, I finally found a solution that did not require libidn. It is built upon punycode4r, together with either the unicode gem (a prebuilt binary can be found here) or ActiveSupport. I will use ActiveSupport since I use Rails anyway, but for reference I include both methods.
With the unicode gem:
require 'unicode'
require 'punycode' # This is not a gem, but a standalone file.

def idn_encode(domain)
  parts = domain.split(".").map do |label|
    encoded = Punycode.encode(Unicode::normalize_KC(Unicode::downcase(label)))
    if encoded =~ /-$/ # Pure ASCII; Punycode.encode leaves a trailing '-'
      encoded.chop!
    else # Contains non-ASCII characters
      "xn--" + encoded
    end
  end
  parts.join(".")
end
With ActiveSupport:
require "punycode"
require "active_support"
$KCODE = "UTF-8" #Have to set this to enable mb_chars
def idn_encode(domain)
parts = domain.split(".").map do |label|
encoded = Punycode.encode(label.mb_chars.downcase.normalize(:kc))
if encoded =~ /-$/ #Pure ASCII
encoded.chop! #Remove trailing '-'
else #Contains non-ASCII characters
"xn--" + encoded
end
end
parts.join(".")
end
The ActiveSupport solution was found thanks to this StackOverflow question.
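As a sanity check, calling either version with the classic IDN example domain should yield its well-known ACE form:

idn_encode("bücher.example")
#=> "xn--bcher-kva.example"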

Nokogiri, open-uri, and Unicode Characters

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")
At this point, the title looks like this:
Rag\303\271
Instead of:
Ragù
How can I have Nokogiri return the proper character (ù in this case)?
Here's an example URL:
http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037
Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read and pass the resulting string to Nokogiri.
Analysis:
If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment on the Ruby file and setting the doc encoding, it's no good:
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1]
puts h52.text, h52.text.encoding
#=> Genealogà a de Jesucristo
#=> UTF-8
We can see that this is not the fault of open-uri:
html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
gene = html.read[/Gene\S+/]
puts gene, gene.encoding
#=> Genealogía
#=> UTF-8
This is a Nokogiri issue when dealing with open-uri, it seems. This can be worked around by passing the HTML as a raw string to Nokogiri:
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'
html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
doc = Nokogiri::HTML(html.read)
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1].text
puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
#=> Genealogía de Jesucristo
#=> UTF-8
#=> true
I was having the same problem and the Iconv approach wasn't working. Nokogiri::HTML is an alias for Nokogiri::HTML.parse(thing, url, encoding, options).
So, you just need to do:
doc = Nokogiri::HTML(open(link).read, nil, 'utf-8')
and it'll convert the page encoding properly to utf-8. You'll see Ragù instead of Rag\303\271.
When you say "looks like this," are you viewing the value in IRB? IRB escapes characters outside the ASCII range with C-style escapes of the byte sequences that represent them.
If you print them with puts, you'll get them back as you expect, presuming your shell console is using the same encoding as the string in question (Apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a handle should also result in UTF-8 sequences.
If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're in Ruby 1.9 or 1.8.6.
For 1.9: http://blog.grayproductions.net/articles/ruby_19s_string
For 1.8, you probably need to look at Iconv.
Also, if you need to interact with COM components in Windows, you'll need to tell ruby to use the correct encoding with something like the following:
require 'win32ole'
WIN32OLE.codepage = WIN32OLE::CP_UTF8
If you're interacting with mysql, you'll need to set the collation on the table to one that supports the encoding that you're working with. In general, it's best to set the collation to UTF-8, even if some of your content is coming back in other encodings; you'll just need to convert as necessary.
Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave explanation of that to someone else.
Try setting the encoding option of Nokogiri, like so:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
doc.encoding = 'utf-8'
title = doc.at_css("title")
Changing Nokogiri::HTML(...) to Nokogiri::HTML5(...) fixed issues I was having with parsing certain special character, specifically em-dashes.
(The accented characters in your link came through fine in both, so don't know if this would help you with that.)
EXAMPLE:
require 'open-uri'
require 'nokogiri'

url = 'https://www.youtube.com/watch?v=4r6gr7uytQA'

doc = Nokogiri::HTML(open(url))
doc.title
=> "Josh Waitzkin â\u0080\u0094 How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"

doc = Nokogiri::HTML5(open(url))
doc.title
=> "Josh Waitzkin — How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"
You need to convert the response from the website being scraped (here epicurious.com) into UTF-8. Per the HTML content of the page being scraped, its encoding is "ISO-8859-1" for now. So, you need to do something like this:
require 'iconv'
doc = Nokogiri::HTML(Iconv.conv('utf-8//IGNORE', 'ISO-8859-1', open(link).read))
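On Ruby 1.9 and later, where Iconv is deprecated, String#encode should do roughly the same thing as the Iconv call above:

html = open(link).read.encode('UTF-8', 'ISO-8859-1', :invalid => :replace, :undef => :replace, :replace => '')
doc = Nokogiri::HTML(html)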
Read more about it here: http://www.quarkruby.com/2009/9/22/rails-utf-8-and-html-screen-scraping
Just to add a cross-reference, this SO page gives some related information:
How to make Nokogiri transparently return un/encoded Html entities untouched?
Tip: you could also use the Scrapifier gem to get metadata, such as the page title, from URIs in a very simple way. The data all comes back encoded in UTF-8.
Check it out: https://github.com/tiagopog/scrapifier
Hope it's useful for you.
