Ruby zlib deflate method is generating invalid characters - ruby

I hope you can help me with this.
I'm trying to implement some simple code to deflate a string using the zlib gem in a Sinatra app, but it seems to be deflating it incorrectly.
Here's my code so far:
require 'sinatra'
require 'zlib'
get '/v1/generate' do
  file_content = "teste"
  generate_diagram_from file_content
end
def generate_diagram_from file_content
  data_compressed = Zlib::Deflate.deflate(file_content)
end
And here's what I'm getting from the deflate method:
x�+I-.I�&
A weird string with strange characters and everything.
I'd like to know what I am doing wrong here.
Thank you in advance!

The only thing wrong is your expectation of a readable string.
Compression produces a sequence of bytes of all values (0..255). This is called binary data. It will always contain "weird" or "strange" characters if you try to display it as you have. In fact, it will almost certainly contain invalid UTF-8 sequences, which is why you are getting the white-on-black question marks. This is why you would never try to display or print such sequences as you have.
If you want to look at them or, for example, put them in a question here, display them in hexadecimal.
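As a minimal sketch of that suggestion (the variable names are illustrative, not from the question), the compressed bytes can be shown as hexadecimal with String#unpack1 and turned back into the original text with Zlib::Inflate.inflate:

```ruby
require 'zlib'

compressed = Zlib::Deflate.deflate("teste")

# The result is binary data, not text
puts compressed.encoding

# Hexadecimal view of the bytes, safe to print or paste into a question
hex = compressed.unpack1("H*")
puts hex

# Inflating the bytes gives back the original string
restored = Zlib::Inflate.inflate(compressed)
puts restored
```

If the compressed data has to travel through a text-only channel, Base64 is another common way to represent it.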

Related

Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8

I'm trying to get a list of files from url like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code raises ArgumentError: invalid byte sequence in UTF-8 on the line with URI.extract (even though html.encoding returns UTF-8).
I've found some solutions to encoding problems, but when I change the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns an empty array, even when I don't call select on the result. Any suggestions?
The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure since there are only two occurrences of the same non-US-ASCII-character and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding is a matter of guessing. The age of the page (HTML 3.2) or the language(s) used might be a clue. In this case the content of the PDF file is especially helpful (it contains SPRÜH-EX and the file is named TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encodings in which "\xDC" and "Ü" are equal, either by knowing it or by writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too. There is the libguess library. A web browser can do it too. However, you need to download the file first, because otherwise the server might tell the browser it's UTF-8 even if it isn't (as in this case). Any decent text editor will also try to detect the file encoding: e.g. ST3 thinks it's Windows-1252, which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
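The difference between the two calls can be seen on a short byte string. This is only a sketch; the sample bytes are made up to mirror the file name mentioned above:

```ruby
# Bytes as they would arrive from the server: "\xDC" is Ü in ISO-8859-1
raw = "SPR\xDCH-EX".b

# force_encoding only relabels the string; the bytes stay the same
latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)
puts latin1.bytesize        # 8

# encode rewrites the bytes so the result is real UTF-8
utf8 = latin1.encode(Encoding::UTF_8)
puts utf8.bytesize          # 9, because Ü takes two bytes in UTF-8
puts utf8.valid_encoding?   # true
```

Relabelling is enough if everything downstream is happy with ISO-8859-1; transcoding is the safer choice when the rest of the program works in UTF-8.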
To answer the other question: URI.extract isn't the method you're looking for. Apparently it's obsolete and, more importantly, it doesn't extract relative URIs.
A simple alternative is using a regular expression with String#scan. It works with this site, but it might not with others. For the best reliability you have to use an HTML parser (there might also be a gem for that). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]

Ruby character encoding issue with scraped HTML

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join("\n") on an array of strings that have been pulled from some HTML, which causes this error:
./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
In my logs, I can see Café showing up for some of the strings that would be included in the join operation.
Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8, and Ruby can't combine them? Do I need to convert or sanitize my strings (into UTF-8) after parsing them with Nokogiri?
I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier when it called to_s on a string containing Café.
Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?
Edit:
I was doing something similar in Perl recently and used a module called Text::Unidecode, which let me pass my strings to a function that translates any problematic characters, e.g. the letter a with an acute accent to the plain letter a. Is there anything similar for Ruby? (This isn't necessarily what I'm aiming for, though; if I can keep the a with the acute accent, that's preferable, I think.)
Edit2:
I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:
[CODE REMOVED]
Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.
Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).
I hate character encoding.
Presumably, Café is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the Café that you're seeing; for example:
> s = 'Café'
=> "Café"
> s.encoding
=> #<Encoding:UTF-8>
> s.force_encoding('iso-8859-1').encode('utf-8')
=> "Café"
So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.
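If the damage has already been done, the same two calls in the opposite order undo it. This is a small sketch that assumes the text went through exactly one wrong Latin-1 round trip; it only patches the symptom and does not replace finding the real cause:

```ruby
good = "Café"

# Reproduce the damage: UTF-8 bytes read as Latin-1, then re-encoded as UTF-8
broken = good.dup.force_encoding('iso-8859-1').encode('utf-8')
puts broken

# Undo it: convert back to the Latin-1 bytes, then relabel them as UTF-8
repaired = broken.encode('iso-8859-1').force_encoding('utf-8')
puts repaired == good   # true
```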
You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.
Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.
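The error from the question can be reproduced without any scraping. The sketch below stands in for the situation described in Edit4, with a made-up binary string playing the role of a database value:

```ruby
scraped  = "Café"     # UTF-8 text from the page
db_value = "Café".b   # same bytes, but labelled ASCII-8BIT

# Joining non-ASCII strings with different labels raises
error = begin
  [scraped, db_value].join("\n")
  nil
rescue Encoding::CompatibilityError => e
  e
end
puts error.message

# Relabelling the binary value (its bytes really are UTF-8) makes join work
fixed  = db_value.dup.force_encoding('UTF-8')
joined = [scraped, fixed].join("\n")
puts joined
```

force_encoding is only the right fix when the bytes really are UTF-8 already, as they were for the database values in this case.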

How do I get a UTF-8 string out of an MD5 digest?

I am trying to use an API that requires an MD5 hash to be sent in UTF-8 format.
Problem is, I can't find any way to actually make that happen.
require 'digest/md5'
api_sig = Digest::MD5.digest "api_key=blahblahblah"
puts api_sig
>> Decode error: not UTF-8
So I try force_encoding(Encoding::UTF_8). Same error. inspect, to_s, nothing gives me what I want.
How can I get a UTF-8 string representing an MD5 digest of another string?
Call Digest::MD5.hexdigest "api_key=blahblahblah"
The documentation of this is very poor, but you can find a lackluster explanation here: http://www.ruby-doc.org/stdlib-2.0/libdoc/digest/rdoc/Digest/Class.html#method-c-hexdigest
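To illustrate why this works: digest returns the 16 raw bytes of the hash, while hexdigest returns those same bytes written out as 32 hexadecimal characters, which are plain ASCII and therefore valid UTF-8. A quick sketch using the string from the question:

```ruby
require 'digest/md5'

input = "api_key=blahblahblah"

raw = Digest::MD5.digest(input)      # 16 raw bytes, not printable text
hex = Digest::MD5.hexdigest(input)   # 32 characters from 0-9 and a-f

puts raw.bytesize                    # 16
puts hex.length                      # 32
puts hex.ascii_only?                 # true
puts raw.unpack1("H*") == hex        # true

# Safe to label as UTF-8, because ASCII is a subset of UTF-8
api_sig = hex.encode("UTF-8")
puts api_sig
```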

Nokogiri - Encoding Issue - Invalid UTF8 characters

Can someone take a look at this? I think there are invalid UTF-8 characters when making this call.
Nokogiri::HTML(open("http://www.next.co.uk/x502062s2"))
Is there a way around this? And is this the issue? I am writing a new open source screen scraper designed for product information capture (when a site does not supply a feed), before anyone says I am doing something a little shifty :-)
Before passing anything to Nokogiri, you can re-encode the content of the page and drop all invalid UTF-8 characters using Iconv.
I was using it like this:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(open('http://example.com').read)
You can also check "Fixing invalid UTF-8 in Ruby, revisited."
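Note that Iconv is no longer part of Ruby's standard library (it was removed in Ruby 2.0). On Ruby 2.1 or later, String#scrub covers the same use case. This is a sketch of that substitute technique, not the code from the answer above, and the sample bytes are invented:

```ruby
# UTF-8 text with one stray byte (\xFF) that is not valid UTF-8
dirty = "caf\xC3\xA9 \xFF ok"
puts dirty.valid_encoding?   # false

# scrub replaces each invalid byte sequence; "" drops it entirely
clean = dirty.scrub("")
puts clean.valid_encoding?   # true
puts clean
```

Called without an argument, scrub inserts the Unicode replacement character instead of dropping the bad bytes, which keeps the damage visible.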

Detect encoding

I'm getting some string data from the web, and I suspect that it's not always what it says it is. I don't know where the problem is, and I just don't care any more. From day one on this project I've been fighting Ruby string encoding. I really want some way to say: "Here's a string. What is it?", and then use that data to get it to UTF-8 so that it doesn't explode gsub() 2,000 lines down in the depths of my app. I've checked out rchardet, but even though it supposedly works for 1.9 now, it just blows up given any input with multiple bytes... which is not helpful.
You can't really detect the encoding. You can only assume it.
For most Western-language applications, the following construct will work. The traditional encoding is usually "ISO-8859-1". The new and preferred encoding is UTF-8. Why not simply try UTF-8 first and fall back to the old encoding?
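def detect_encoding(str)
  # Check whether the bytes form valid UTF-8. Calling str.encode("UTF-8")
  # does nothing on a string already labelled UTF-8, so it would never raise.
  if str.dup.force_encoding("UTF-8").valid_encoding?
    "UTF-8"
  else
    "ISO-8859-1"
  end
end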
It is impossible to tell from a string what encoding it is in. You always need some additional metadata that tells you what the string's encoding is.
If you get the string from the web, that metadata is in the HTTP headers. If the HTTP headers are wrong, there is absolutely nothing that you or Ruby or anyone else can do. You need to file a bug with the webmaster of the site where you got the string from and wait till he fixes it. If you have a Service Level Agreement with the website, file a bug, wait a week, then sue them.
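A short sketch of why the bytes alone cannot settle the question: the same two bytes are perfectly valid in both UTF-8 and ISO-8859-1, they just spell different text. The sample bytes are chosen for illustration.

```ruby
bytes = "\xC3\xA9".b

as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)
as_latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1)

# Both interpretations are valid
puts as_utf8.valid_encoding?     # true
puts as_latin1.valid_encoding?   # true

# But they produce different text
puts as_utf8                     # é
puts as_latin1.encode("UTF-8")   # Ã©
```

The reverse does not always hold: many ISO-8859-1 byte sequences are invalid as UTF-8, which is what makes a UTF-8-first heuristic workable in practice.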
Old question, but chardet works on 1.9: http://rubygems.org/gems/chardet
Why not try https://github.com/brianmario/charlock_holmes to detect the encoding? You can then also use it to convert to UTF-8.
require 'charlock_holmes'
class EncodeParser
  def initialize(text)
    @text = text
  end
  # Name of the encoding that charlock_holmes considers most likely
  def detected_encoding
    CharlockHolmes::EncodingDetector.detect(@text)[:encoding]
  end
  def convert_to_utf8
    CharlockHolmes::Converter.convert(@text, detected_encoding, "UTF-8")
  end
end
Then just use EncodeParser.new(text).detected_encoding or EncodeParser.new(text).convert_to_utf8.
We had some fine experience with ensure_encoding. It actually does the job for us to convert resource files having unknown encoding to UTF-8.
The README will give you some hints which options would be a good fit for your situation.
I have never tried chardet since ensure_encoding did the job just fine for us.
I covered here how we use ensure_encoding.
Try setting these in your environment.
export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
Try passing -EBINARY or -EASCII-8BIT on the ruby command line.
Try adding -Ku or -Kn to your ruby command line.
Could you paste the error message?
Also try this: http://github.com/candlerb/string19/blob/master/string19.rb
Might try reading this: http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
I know it's an old question, but in modern versions of Ruby it's as simple as str.encoding. You get a return value something like this: #<Encoding:UTF-8>
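One caveat worth illustrating: str.encoding reports the label attached to the string, which is not necessarily what the bytes really are. String#valid_encoding? at least tells you whether the bytes are consistent with that label. A minimal sketch with an invented sample:

```ruby
# ISO-8859-1 bytes for "café" that were wrongly labelled as UTF-8
mislabelled = "caf\xE9".b.force_encoding(Encoding::UTF_8)

puts mislabelled.encoding          # UTF-8
puts mislabelled.valid_encoding?   # false

# With the right label, the text can be converted properly
fixed = mislabelled.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts fixed
```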
