Ruby windows-1250 encoding - ruby

I'm trying to get data from a site that uses the windows-1250 charset.
I have this code:
require 'open-uri'
p open('http://www.ceskybenzin.cz/mapa/0').read.force_encoding('Windows-1250').encode('UTF-8').scan /addMarker\( point, '(.*?) - (.*?) - (.*?) - (.*?)', 'green', (.*?), bublina, 0 \);/
and I'm getting data like:
["EuroOil", "Prun\u00E9\u0159ov ", "U\u0161\u00E1k", "Zat\u00EDm nezadan\u00FD kraj", "181"]
Could someone tell me how to correctly get data from a windows-1250 site?
Thank you.

a[0] => ["Kont.cz (NOVA-KONT)", "Praha 4", "Opatovsk\xC3\xA1", "Hlavn\u00ED m\u011Bsto Praha", "1"]
a.last => ["EuroOil", "Prun\u00E9\u0159ov ", "U\u0161\u00E1k", "Zat\u00EDm nezadan\u00FD kraj", "181"]
a.last.each { |i| puts i.encode("utf-8") } produces:
EuroOil
Prunérov
Usák
Zatím nezadaný kraj
181

Your data contains Unicode characters (shown as \u escapes), not windows-1250 bytes.
To convert your example strings to the correct text you can do this:
data = ["EuroOil", "Prun\u00E9\u0159ov ", "U\u0161\u00E1k", "Zat\u00EDm nezadan\u00FD kraj", "181"]
data.map { |snippet| snippet.encode("UTF-8") }
=> ["EuroOil", "Prunéřov ", "Ušák", "Zatím nezadaný kraj", "181"]
If the output in your example comes from the console, that is because the console outputs in UTF-8, not in the encoding of your source site (so the parsing may well work correctly right up until the text is displayed).
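As a minimal sketch of the two steps in the question's code, the snippet below uses hand-written bytes standing in for what a windows-1250 server might send (the byte string is an assumption for illustration, not data fetched from the site). force_encoding only labels the bytes as Windows-1250, and encode then transcodes them to UTF-8:

```ruby
# Hypothetical raw bytes for "Prunéřov" as a windows-1250 page would deliver them.
# In windows-1250, 0xE9 is "é" and 0xF8 is "ř".
raw = "Prun\xE9\xF8ov".b

# Step 1: label the bytes with their real encoding (no bytes change).
labelled = raw.force_encoding('Windows-1250')

# Step 2: transcode to UTF-8 (the bytes do change here).
decoded = labelled.encode('UTF-8')

puts decoded                      # readable text on a UTF-8 console
p decoded == "Prun\u00E9\u0159ov" # the \u escapes and the readable text are the same string
```

Whether p shows \u escapes or the characters themselves depends on the console's encoding; the string contents are the same either way.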

Related

Net::Telnet - puts or print string in UTF-8

I'm using an API in which I have to send client information as a JSON object over a telnet connection (very strange, I know^^).
I'm German, so the client information very often contains umlauts or the ß.
My procedure:
I generate a Hash with all the command information.
I convert the Hash to a JSON object.
I convert the JSON object to a string (with .to_s).
I send the string with the Net::Telnet.puts command.
My puts command looks like this (cmd is the JSON object):
host.puts(cmd.to_s.force_encoding('UTF-8'))
In the log files I see that the JSON object doesn't contain the umlauts but, for example, Ã¼ instead of ü.
I verified that the string is in UTF-8 (with or without the force_encoding() call), so I think the puts command doesn't send the strings in UTF-8.
Is it possible to send the command in UTF-8? How can I do this?
The complete methods:
host = Net::Telnet::new(
  'Host' => host_string,
  'Port' => port_integer,
  'Output_log' => 'log/'+Time.now.strftime('%Y-%m-%d')+'.log',
  'Timeout' => false,
  'Telnetmode' => false,
  'Prompt' => /\z/n
)
def send_cmd_container(host, cmd, params=nil)
  cmd = JSON.generate({'*C'=>'se','Q'=>[get_cmd(cmd, params)]})
  host.puts(cmd.to_s.force_encoding('UTF-8'))
  add_request_to_logfile(cmd)
end
def get_cmd(cmd, params=nil)
  if params == nil
    return {'*C'=>'sq','CMD'=>cmd}
  else
    return {'*C'=>'sq','CMD'=>cmd,'PARAMS'=>params}
  end
end
Addition:
I also log the requests I send with this method:
def add_request_to_logfile(request_string)
  directory = 'log/'
  File.open(File.join(directory, Time.now.strftime('%Y-%m-%d')+'.log'), 'a+') do |f|
    f.puts ''
    f.puts '> '+request_string
  end
end
In the logfile my requests also don't contain proper UTF-8 umlauts but, for example, this: Ã¼
TL;DR
Set 'Binmode' => true and use Encoding::BINARY.
The above should work for you. If you're interested in why, read on.
Telnet doesn't really have a concept of "encoding." Telnet just has two modes: Normal mode assumes you're sending 7-bit ASCII characters, and binary mode assumes you're sending 8-bit bytes. You can't tell Telnet "this is UTF-8" because Telnet doesn't know what that means. You can tell it "this is ASCII-7" or "this is a sequence of 8-bit bytes," and that's it.
This might seem like bad news, but it's actually great news, because it just so happens that UTF-8 encodes text as sequences of 8-bit bytes. früh, for example, is five bytes: 66 72 c3 bc 68. This is easy to confirm in Ruby:
puts str = "\x66\x72\xC3\xBC\x68"
# => früh
puts str.bytes.size
# => 5
In Net::Telnet we can turn on binary mode by passing the 'Binmode' => true option to Net::Telnet::new. But there's one more thing we have to do: Tell Ruby to treat the string like binary data, i.e. a sequence of 8-bit bytes.
You already tried to use String#force_encoding, but what you might not have realized is that String#force_encoding doesn't actually convert a string from one encoding to another. Its purpose isn't to change the data's encoding—its purpose is to tell Ruby what encoding the data is already in:
str = "früh" # => "früh"
p str.encoding # => #<Encoding:UTF-8>
p str[2] # => "ü"
p str.bytes # => [ 102, 114, 195, 188, 104 ] # Decimal representation of the hexadecimal bytes we saw before, `66 72 c3 bc 68`
str.force_encoding(Encoding::BINARY) # => "fr\xC3\xBCh"
p str[2] # => "\xC3"
p str.bytes # => [ 102, 114, 195, 188, 104 ] # Same bytes!
Now I'll let you in on a little secret: Encoding::BINARY is just an alias for Encoding::ASCII_8BIT. Since ASCII-8BIT doesn't have multi-byte characters, Ruby shows ü as two separate bytes, \xC3\xBC. Those bytes aren't printable characters in ASCII-8BIT, so Ruby shows the \x## escape codes instead, but the data hasn't changed—only the way Ruby prints it has changed.
So here's the thing: Even though Ruby is now calling the string BINARY or ASCII-8BIT instead of UTF-8, it's still the same bytes, which means it's still UTF-8. Changing the encoding it's "tagged" as, however, means when Net::Telnet does (the equivalent of) data[n] it will always get one byte (instead of potentially getting multi-byte characters as in UTF-8), which is just what we want.
And so...
host = Net::Telnet::new(
  # ...all of your other options...
  'Binmode' => true
)
def send_cmd_container(host, cmd, params=nil)
  cmd = JSON.generate('*C' => 'se','Q' => [ get_cmd(cmd, params) ])
  cmd.force_encoding(Encoding::BINARY)
  host.puts(cmd)
  # ...
end
(Note: JSON.generate always returns a UTF-8 string, so you never have to do e.g. cmd.to_s.)
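The difference between retagging a string and transcoding it can be sketched as follows. The payload is made up for illustration; only JSON.generate, String#force_encoding and String#encode from the standard library are used:

```ruby
require 'json'

# Illustrative payload (the key and value are assumptions, not part of the real API).
json = JSON.generate({ 'name' => 'früh' })

# Retagging: same bytes, different label. This is the form handed to Net::Telnet above.
binary = json.dup.force_encoding(Encoding::BINARY)

# Transcoding: different bytes. "ü" becomes the single byte 0xFC in ISO-8859-1.
latin1 = json.encode('ISO-8859-1')

p json.encoding              # the encoding JSON.generate tagged the string with
p binary.bytes == json.bytes # retagging leaves the bytes untouched
p latin1.bytes == json.bytes # transcoding does not
```

Only the retagged string still carries the UTF-8 byte sequence, which is why force_encoding rather than encode is the right call here.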
Useful diagnostics
A quick way to check what data Net::Telnet is actually sending (and receiving) is to set the 'Dump_log' option (in the same way you set the 'Output_log' option). It will write both sent and received data to a log file in hexdump format, which will allow you to see if the bytes being sent are correct. For example, I started a test server (nc -l 5555) and sent the string früh (host.puts "früh".force_encoding(Encoding::BINARY)), and this is what was logged:
> 0x00000: 66 72 c3 bc 68 0a fr..h.
You can see that it sent six bytes: the first two are f and r, the next two make up ü, and the last two are h and a newline. On the right, bytes that aren't printable characters are shown as ., ergo fr..h.. (By the same token, I sent the string I❤NY and saw I...NY. in the right column, because ❤ is three bytes in UTF-8: e2 9d a4).
So, if you set 'Dump_log' and send a ü, you should see c3 bc in the output. If you do, congratulations—you're sending UTF-8!
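If a hexdump is not at hand, the same check can be sketched in plain Ruby. The second half shows one plausible way garbled umlauts could end up in a log: a viewer that reads the UTF-8 bytes as ISO-8859-1 would display ü as Ã¼. That any particular tool does this is an assumption, not something the question confirms:

```ruby
# Hex view of the bytes Ruby holds for "ü" in UTF-8.
hex = 'ü'.bytes.map { |b| format('%02x', b) }.join(' ')
puts hex # c3 bc

# Assumption: a tool that reads those two bytes as ISO-8859-1 sees two characters.
misread = 'ü'.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts misread # Ã¼
```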
P.S. Read Yehuda Katz' article Ruby 1.9 Encodings: A Primer and the Solution for Rails. In fact, read it yearly. It's really, really useful.

Generating KML files with Ruby

I'm using the ruby_kml gem right now to try to generate KML from some data in my model.
I also tried georuby.
Both of them, when they generate XML it seems to be coming back escaped like this:
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<kml xmlns=\"http://earth.google.com/kml/2.1\">\n <Folder>\n <name>San Francisco</name>\n <LineStyle>\n <color>#0D7215</color>\n </LineStyle>\n <Placemark>\n <name>21 Google Bus</name>\n <description>\n <![CDATA[Click to add description.]]>\n </description>\n <LineString>\n <coordinates>37.784282779035216,-122.42228507995605 37.784144999999995,-122.42225699999999,37.784084,-122.42274499999999,37.785472,-122.423023,37.785391,-122.423564,37.785364,-122.423839,37.785418,-122.424714,37.785410999999996,-122.42497999999999,37.785391,-122.42522,37.784839,-122.42956,37.784631,-122.431297,37.782576,-122.43086799999999,37.776969,-122.42975399999999,37.776759999999996,-122.431384,37.776368,-122.431305 37.776368,-122.431305,37.777699999999996,-122.431575,37.778746999999996,-122.42335399999999,37.773609,-122.42231199999999,37.773013999999996,-122.42222799999999,37.772974999999995,-122.42222799999999,37.772915,-122.42226799999999,37.772774,-122.422446,37.772636999999996,-122.422585,37.772562,-122.42263399999999,37.772521999999995,-122.422643,37.771588,-122.42253799999999,37.771631,-122.421759</coordinates>\n </LineString>\n </Placemark>\n <LineStyle>\n <color>#0071CA</color>\n </LineStyle>\n <Placemark>\n <name>45 Inverter</name>\n <description>\n <![CDATA[Click to add description.]]>\n </description>\n <LineString>\n <coordinates>37.792490234462946,-122.40863800048828 37.792516,-122.408429,37.793068,-122.408541,37.792957,-122.409357,37.792051,-122.409189,37.788289999999996,-122.40841499999999,37.785495,-122.407866,37.785713,-122.406229,37.785713,-122.40591599999999,37.785699,-122.40576999999999,37.785658,-122.40568999999999,37.783249999999995,-122.40270699999999,37.778850999999996,-122.40827499999999,37.779104,-122.408577</coordinates>\n </LineString>\n </Placemark>\n <LineStyle>\n <color>#AD0101</color>\n </LineStyle>\n <Placemark>\n <name>82 X Wing</name>\n <description>\n <![CDATA[Click to add description.]]>\n </description>\n 
<LineString>\n <coordinates></coordinates>\n </LineString>\n </Placemark>\n <LineStyle>\n <color>#AD0101</color>\n </LineStyle>\n <Placemark>\n <name>93 X Wing</name>\n <description>\n <![CDATA[Click to add description.]]>\n </description>\n <LineString>\n <coordinates></coordinates>\n </LineString>\n </Placemark>\n </Folder>\n</kml>\n"
I'm not sure why it should be coming out escaped, since it definitely is not valid XML.
georuby does the same.
Does anyone know why it's coming out escaped and also how to unescape it?
Here's the code I'm using:
map = self
kml = KMLFile.new
folder = KML::Folder.new(:name => map[:name])
map.lines.each do |line|
  folder.features << KML::LineStyle.new(
    color: line.color,
  )
  folder.features << KML::Placemark.new(
    :name => line.name,
    :geometry => KML::LineString.new(:coordinates => line.coordinates),
    :description => line.description
  )
end
kml.objects << folder
kml.render
Thanks!!!

How do I parse Google image URLs using Ruby and Nokogiri?

I'm trying to make an array of all the image files on a Google images webpage.
I want a regular expression to pull everything after "imgurl=" and ending before "&amp" as seen in this HTML:
<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>
I feel like I can do this with a regex, but I can't find a way to search my parsed document using one.
str = '<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>'
str.split('imgurl=')[1].split('&amp')[0]
#=> "http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg"
Is that what you're looking for?
The problem with using a regex is you assume too much knowledge about the order of parameters in the URL. If the order changes, or & disappears the regex won't work.
Instead, parse the URL, then split the values out:
# encoding: UTF-8
require 'nokogiri'
require 'cgi'
require 'uri'
doc = Nokogiri::HTML.parse('<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>')
doc.search('a').each do |a|
  query_params = CGI::parse(URI(a['href']).query)
  puts query_params['imgurl']
end
Which outputs:
http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg
Both URI and CGI are used because URI's decode_www_form raises an exception when trying to decode the query.
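On recent Ruby versions, URI.decode_www_form from the standard library can typically decode such a query as well. The sketch below swaps it in for CGI.parse; the href is a shortened, made-up stand-in for a real Google Images link:

```ruby
require 'uri'

# Hypothetical href; a real one carries many more parameters.
href = 'http://www.google.com/imgres?imgurl=http://www.example.com/images/chapel.jpg' \
       '&imgrefurl=http://www.example.com/chapel.html&h=400&w=400'

# decode_www_form returns [key, value] pairs; to_h turns them into a Hash.
params = URI.decode_www_form(URI(href).query).to_h

puts params['imgurl'] # http://www.example.com/images/chapel.jpg
puts params['h']      # 400
```

Unlike a regex, this lookup does not depend on where imgurl sits in the query string.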
I've also been known to decode the query string into a hash using something like:
Hash[URI(a['href']).query.split('&').map{ |p| p.split('=') }]
That will return:
{"imgurl"=>
"http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg",
"imgrefurl"=>
"http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html",
"usg"=>"__YJdf3xc4ydSfLQa9tYnAzavKHYQ",
"h"=>"400",
"w"=>"400",
"sz"=>"58",
"hl"=>"en",
"start"=>"19",
"zoom"=>"1",
"tbnid"=>"ajDcsGGs0tgE9M:",
"tbnh"=>"124",
"tbnw"=>"124",
"ei"=>"qagfUbXmHKfv0QHI3oG4CQ",
"itbs"=>"1",
"sa"=>"X",
"ved"=>"0CE4QrQMwEg"}
To get all the image URLs, you can do:
require 'nokogiri'
require 'open-uri'
# get all links
url = 'some-google-images-url'
links = Nokogiri::HTML( URI.open(url) ).css('a')
# get regex match or nil on desired img (to_s guards against anchors without an href)
img_urls = links.map {|a| a['href'].to_s[/imgurl=(.*?)&/, 1] }
# get rid of nils
img_urls.compact
The regex you want is /imgurl=(.*?)&/ because you want a non-greedy match between imgurl= and &, otherwise the greedy .* would take everything to the last & in the string.
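The difference is easy to see on a small sample. The query string below is invented for illustration, and the third pattern is an optional variant, not part of the answer above, that also matches when imgurl is the final parameter and no & follows it:

```ruby
query = 'imgurl=http://www.example.com/a.jpg&h=400&w=400'

greedy     = query[/imgurl=(.*)&/, 1]    # runs on to the last "&"
non_greedy = query[/imgurl=(.*?)&/, 1]   # stops at the first "&"
to_next    = query[/imgurl=([^&]*)/, 1]  # stops at "&" or at the end of the string

puts greedy     # http://www.example.com/a.jpg&h=400
puts non_greedy # http://www.example.com/a.jpg
puts to_next    # http://www.example.com/a.jpg
```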

Encoding issue with Sqlite3 in Ruby

I have a list of SQL queries beautifully encoded in UTF-8. I read them from files, perform the inserts and then do a select.
# encoding: utf-8
def exec_sql_lines(file_name)
  puts "----> #{file_name} <----"
  File.open(file_name, 'r') do |f|
    # sometimes a query doesn't fit one line
    previous_line=""
    i = 0
    while line = f.gets do
      puts i+=1
      if(line[-2] != ')')
        previous_line += line[0..-2]
        next
      end
      puts (previous_line + line) # <---- (1)
      $db.execute((previous_line + line))
      previous_line =""
    end
    a = $db.execute("select * from Table where _id=6")
    puts a # <---- (2)
  end
end
$db=SQLite3::Database.new($DBNAME)
exec_sql_lines("creates.txt")
exec_sql_lines("inserts.txt")
$db.close
The text in (1) is different from the one in (2). Polish letters get broken. If I use IRB and call $db.open ; $db.encoding it says UTF-8.
Why do Polish letters come out broken? How can I fix it?
I need this database properly encoded in UTF-8 for my Android app, so I'm not interested in manipulating the database output. I need to fix its content.
EDIT
Significant lines from the output:
6
INSERT INTO 'Leki' VALUES (NULL, '6', 'Acenocoumarolum', 'Acenocumarol WZF', 'tabl. ', '4 mg', '60 tabl.', '5909990055715', '2012-01-01', '2 lata', '21.0, Leki przeciwzakrzepowe z grupy antagonistów witaminy K', '8.32', '12.07', '12.07', 'We wszystkich zarejestrowanych wskazaniach na dzień wydania decyzji', '', 'ryczałt', '5.12')
out:
6
6
Acenocoumarolum
Acenocumarol WZF
tabl.
4 mg
60 tabl.
5909990055715
2012-01-01
2 lata
21.0, Leki przeciwzakrzepowe z grupy antagonistĂł[<--HERE]w witaminy K
8.32
12.07
12.07
We wszystkich zarejestrowanych wskazaniach na dzieĹ[<--HERE] wydania decyzji
ryczaĹ[<--HERE]t
5.12
There are three encodings to consider: the source encoding, the external encoding and the internal encoding.
In your code you set only the source encoding.
Perhaps there is a problem with the external and internal encodings?
A quick test on Windows:
#encoding: utf-8
File.open(__FILE__,'r'){|f|
  p f.external_encoding
  p f.internal_encoding
  p f.read.encoding
}
Result:
#<Encoding:CP850>
nil
#<Encoding:CP850>
Even though the source encoding is UTF-8, the data is read as CP850.
In your case:
Does File.open(filename,'r:utf-8') help?
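A sketch of that suggestion, assuming the SQL files really are stored as UTF-8. The helper name read_utf8_lines and the temporary file are made up for the example:

```ruby
require 'tempfile'

# Hypothetical helper: read a file as UTF-8 regardless of the platform default.
def read_utf8_lines(path)
  File.open(path, 'r:utf-8') { |f| f.each_line.map(&:chomp) }
end

# Write a small UTF-8 sample so the example is self-contained.
sample = Tempfile.new('inserts')
sample.binmode
sample.write("rycza\u0142t\n") # "ryczałt" as UTF-8 bytes
sample.close

lines = read_utf8_lines(sample.path)
sample.unlink

p lines.first.encoding        # the encoding the string was tagged with on read
p lines.first.valid_encoding? # whether the bytes are valid in that encoding
```

With the explicit 'r:utf-8' mode the external encoding no longer depends on the Windows code page, so strings passed on to $db.execute are tagged as UTF-8.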

Working with nested hashes in Rails 3

I'm working with the Koala gem and the Facebook Graph API, and I want to break down the results I get for a user's feed into separate variables for inserting into a MySQL database, probably using Active Record. Here is the code I have so far:
@token = Service.where(:provider => 'facebook', :user_id => session[:user_id]).first.token
@graph = Koala::Facebook::GraphAPI.new(@token)
@feeds = params[:page] ? @graph.get_page(params[:page]) : @graph.get_connections("me", "home")
And here is what @feeds looks like:
[{"id"=>"1519989351_1799856285747", "from"=>{"name"=>"April Daggett Swayne", "id"=>"1519989351"},
"picture"=>"http://photos-d.ak.fbcdn.net/hphotos-ak-ash4/270060_1799856805760_1519989351_31482916_3866652_s.jpg",
"link"=>"http://www.facebook.com/photo.php?fbid=1799856805760&set=a.1493877356465.2064294.1519989351&type=1", "name"=>"Mobile Uploads",
"icon"=>"http://static.ak.fbcdn.net/rsrc.php/v1/yx/r/og8V99JVf8G.gif", "type"=>"photo", "object_id"=>"1799856805760", "application"=>{"name"=>"Facebook for Android",
"id"=>"350685531728"}, "created_time"=>"2011-07-03T03:14:04+0000", "updated_time"=>"2011-07-03T03:14:04+0000"}, {"id"=>"2733058_10100271380562998", "from"=>{"name"=>"Joshua Ramirez",
"id"=>"2733058"}, "message"=>"Just posted a photo",
"picture"=>"http://platform.ak.fbcdn.net/www/app_full_proxy.php?app=124024574287414&v=1&size=z&cksum=228788edbab39cb34861aecd197ff458&src=http%3A%2F%2Fimages.instagram.com%2Fmedia%2F2011%2F07%2F02%2F2ad9768378cf405fad404b63bf5e2053_7.jpg",
"link"=>"http://instagr.am/p/G1tp8/", "name"=>"jtrainexpress's photo", "caption"=>"instagr.am",
"icon"=>"http://photos-e.ak.fbcdn.net/photos-ak-snc1/v27562/10/124024574287414/app_2_124024574287414_6936.gif", "actions"=>[{"name"=>"Comment",
"link"=>"http://www.facebook.com/2733058/posts/10100271380562998"}, {"name"=>"Like", "link"=>"http://www.facebook.com/2733058/posts/10100271380562998"}], "type"=>"link",
"application"=>{"name"=>"Instagram", "id"=>"124024574287414"}, "created_time"=>"2011-07-03T02:07:37+0000", "updated_time"=>"2011-07-03T02:07:37+0000"},
{"id"=>"588368718_10150230423643719", "from"=>{"name"=>"Eric Bailey", "id"=>"588368718"}, "link"=>"http://www.facebook.com/pages/Martis-Camp/105474549513998", "name"=>"Martis Camp",
"caption"=>"Eric checked in at Martis Camp.", "description"=>"Rockin the pool", "icon"=>"http://www.facebook.com/images/icons/place.png", "actions"=>[{"name"=>"Comment",
"link"=>"http://www.facebook.com/588368718/posts/10150230423643719"}, {"name"=>"Like", "link"=>"http://www.facebook.com/588368718/posts/10150230423643719"}],
"place"=>{"id"=>"105474549513998", "name"=>"Martis Camp", "location"=>{"city"=>"Truckee", "state"=>"CA", "country"=>"United States", "latitude"=>39.282813917575,
"longitude"=>-120.16736760768}}, "type"=>"checkin", "application"=>{"name"=>"Facebook for iPhone", "id"=>"6628568379"}, "created_time"=>"2011-07-03T01:58:32+0000",
"updated_time"=>"2011-07-03T01:58:32+0000", "likes"=>{"data"=>[{"name"=>"Mike Janes", "id"=>"725535294"}], "count"=>1}}]
I have looked around for clues on this, and haven't found it yet (but I'm still working on my stackoverflow-foo). Any help would be greatly appreciated.
That isn't a Ruby Hash, that's a fragment of a JSON string. First you need to decode into a Ruby data structure:
# If your JSON string is in json...
h = ActiveSupport::JSON.decode(json) # Or your favorite JSON decoder.
Now you'll have a Hash in h so you can access it like any other Hash:
array = h['data']
puts array[0]['id']
# prints out 1111111111_0000000000000
puts array[0]['from']['name']
# prints Jane Done
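Once the data is available as Ruby hashes, pulling individual fields out of each entry could look like the sketch below. The sample data is trimmed from the feed shown in the question, the output keys (post_id, author, city) are invented for illustration, and Hash#dig requires Ruby 2.3 or later:

```ruby
# Two entries trimmed from the feed in the question.
feeds = [
  { 'id' => '1519989351_1799856285747',
    'from' => { 'name' => 'April Daggett Swayne', 'id' => '1519989351' },
    'type' => 'photo' },
  { 'id' => '588368718_10150230423643719',
    'from' => { 'name' => 'Eric Bailey', 'id' => '588368718' },
    'type' => 'checkin',
    'place' => { 'name' => 'Martis Camp', 'location' => { 'city' => 'Truckee' } } }
]

# dig walks the nested keys and returns nil when any level is missing,
# so entries without a "place" do not raise.
rows = feeds.map do |post|
  { post_id: post['id'],
    author:  post.dig('from', 'name'),
    city:    post.dig('place', 'location', 'city') }
end

rows.each { |row| p row }
```

Each resulting hash is flat, so it could be passed to something like an Active Record create call for a model with matching columns (such a model is not shown in the question and would need to be defined).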

Resources