Encoding::UndefinedConversionError "\xC2" from ASCII-8BIT to UTF-8 with redcarpet - ruby

I'm using the redcarpet gem to render some markdown text to HTML. A portion of the markdown was user-inserted, and they typed a perfectly valid special character (£), but now when rendering it I get: Encoding::UndefinedConversionError "\xC2" from ASCII-8BIT to UTF-8
I know it's the £ sign because if I replace it in the text to render then it all works, but users might insert other special characters too.
I'm not sure how to deal with this. Here's my code building the html:
def generate_document
  temp_file_service = TempFileService.new
  path = temp_file_service.path
  template_url = TenantConfig.get('DEPOSIT_GUIDE_TEMPLATE') || DEFAULT_DOC
  template = open(template_url, 'rb', &:read)
  html = ERB.new(template).result(binding)
  File.open(path, 'w') do |f|
    f.write html
  end
  File.new(path, 'r')
end
The error is raised on the f.write line.
here's my html.erb:
<%= markdown(clause.text) %>
and here's the helper:
def markdown(text)
  Redcarpet::Markdown.new(Redcarpet::Render::HTML).render(text)
end
Note that the encoding problem happens only when saving the html to a file; elsewhere I use the same markdown helper to render the text to the browser, and there are no problems there.
It would also work the other way: cleaning the markdown code before saving it to the DB and replacing any special characters with the corresponding HTML entity (e.g. £ becomes &pound;).
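That pre-save cleanup could be sketched like this (a minimal, hypothetical helper; clean_markdown is not part of the app above, and it assumes numeric entities are acceptable in place of named ones like &pound;):

```ruby
# Hypothetical before-save cleanup: replace every non-ASCII character
# with its numeric HTML entity so the stored markdown is pure ASCII.
def clean_markdown(text)
  text.gsub(/[^[:ascii:]]/) { |ch| "&##{ch.ord};" }
end

clean_markdown("Deposit of £50")  # => "Deposit of &#163;50"
```

Since &#163; and &pound; render identically, this avoids maintaining a table of named entities.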
I tried having a before_save callback (as suggested here: Encoding::UndefinedConversionError: "\xC2" from ASCII-8BIT to UTF-8) :
before_save :convert_text

private

def convert_text
  self.text = self.text.force_encoding("utf-8")
end
which didn't work
I also tried (as recommended here: Using ERB in Markdown with Redcarpet):
<%= markdown(extra_clause.text).html_safe %>
which didn't work either.
How would I fix either way?

In the end I solved this by adding force_encoding("UTF-8") to the html string, like this:
f.write html.force_encoding("UTF-8")
That fixed it.
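For context (my note, not part of the original answer): force_encoding only changes the label on the string; it never touches the bytes. That works here because the bytes produced by ERB really are UTF-8 but carry the ASCII-8BIT (binary) label, so any transcoding attempt on write fails. A small sketch of the behavior:

```ruby
# Valid UTF-8 bytes wearing the wrong (binary) label, as after ERB rendering:
bytes = "£".dup.force_encoding("ASCII-8BIT")
bytes.encoding.name   # => "ASCII-8BIT"

# Relabel only -- the bytes are untouched and turn out to be valid UTF-8:
utf8 = bytes.force_encoding("UTF-8")
utf8.valid_encoding?  # => true

# An equivalent fix is to open the file in binary mode so no transcoding
# is attempted on write:
#   File.open(path, 'wb') { |f| f.write html }
```

If the bytes were genuinely in another encoding, String#encode (which converts) would be needed instead; force_encoding on wrong bytes just produces an invalid string.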

Related

incompatible character encodings: ASCII-8BIT and UTF-8 in Oga gem

I am using an XML/HTML parser called Oga.
I am attempting to crawl this URL: http://www.johnvanderlyn.com and parse the body for text, like so:
def get_page
  body = Net::HTTP.get(URI.parse(@url))
  document = Oga.parse_html(body)
end

document = get_page
words = document.css('body').text
Then I get this error:
/gems/oga-2.7/lib/oga/xml/node_set.rb:276:in `block in text': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
That is related to this bit of code here.
What could be causing this and how can I fix it? Is there a way for me to fix it locally, or do I have to fork the gem, fix that method and then use my fork?
Thoughts?
The bit of code you linked has nothing to do with the problem; the issue is that body is being interpreted in the wrong encoding. Try adding body = body.force_encoding 'UTF-8' before parsing the document:
def get_page
  body = Net::HTTP.get(URI.parse(@url)).force_encoding 'UTF-8'
  document = Oga.parse_html(body)
end

incompatible character encodings: UTF-8 and ASCII-8BIT in render action

ActionView::Template::Error (incompatible character encodings: UTF-8
and ASCII-8BIT): app/controllers/posts_controller.rb:27:in `new'
# GET /posts/new
def new
  if params[:post]
    @post = Post.new(post_params).dup
    if @post.valid?
      render :action => "confirm"
    else
      format.html { render action: 'new' }
      format.json { render json: @post.errors, status: :unprocessable_entity }
    end
  else
    @post = Post.new
    @document = Document.new
    @documents = @post.documents.all
    @document = @post.documents.build
  end
end
I don't know why it is happening.
Make sure config.encoding = "utf-8" is present in your application.rb file.
Make sure you are using the 'mysql2' gem instead of the mysql gem.
Put # encoding: utf-8 at the top of your rake file.
Above the Rails.application.initialize! line in environment.rb, add the following two lines:
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
Solution from here: http://rorguide.blogspot.in/2011/06/incompatible-character-encodings-ascii.html
If the above didn't help, then I think you either copy/pasted part of your Haml template into the file, or you're working with a non-Unicode/non-UTF-8-friendly editor.
Try recreating that file from scratch in a UTF-8-friendly editor; there are plenty for any platform. See whether that fixes your problem.
Sometimes you may get this error:
incompatible character encodings: ASCII-8BIT and UTF-8
That typically happens because you are trying to concatenate two strings, and one contains characters that do not map to the character set of the other. There are characters in ISO-8859-1 that have no equivalents in UTF-8, and vice versa; handling string joins across those incompatibilities requires the programmer to step in.
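A minimal reproduction of that concatenation failure, and the usual fix of giving the raw bytes their real encoding before joining (the byte values are an assumed example):

```ruby
utf8   = "héllo"                                      # a normal UTF-8 string
binary = "\xC3\xA9".dup.force_encoding("ASCII-8BIT")  # same "é" bytes, binary label

# Joining fails because the binary side contains non-ASCII bytes:
begin
  utf8 + binary
rescue Encoding::CompatibilityError
  # this is the error from the question
end

# Fix: tag (or transcode) the bytes before joining.
joined = utf8 + binary.force_encoding("UTF-8")
joined.valid_encoding?  # => true
```

Note that concatenation only raises when the binary string actually contains bytes above 127; pure-ASCII binary strings join silently, which is why this error can appear intermittently.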
I was upgrading Rails and Spree, and the error was actually coming from the cache.
Deleting the cache solved the problem for me:
rm -rf tmp/cache

Ruby extracting links from html

Hello, here is my script:
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title.gsub(/\s+/, " ").downcase.strip, doc.xpath('//body').inner_text.tr('"', '').gsub("\n", '').downcase.strip
  link = doc.search("a[@href]") # Adding this part generates errors
  filename = File.basename(input_filename, ".*")
  puts %Q("#{title}", "#{body}", "#{filename}", "#{link}").downcase
end
I am having trouble extracting links from a list of HTML files. I believe the issue is due to unconventional encoding in some of the HTML files. Here is the error I am getting:
extractor.rb:9:in `block in <main>': incompatible character encodings: UTF-8 and CP850 (Encoding::CompatibilityError)
from extractor.rb:4:in `each'
from extractor.rb:4:in `<main>'
You can go about it a different way using the CSS selector:
doc.css('a').map { |link| link['href'] }
This would search the doc for all anchors and return their href text in an array.
Nokogiri always stores strings as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings.
You have a conflict between UTF-8 and CP850 (are you working on Windows?).
You can adapt your File.read(input_filename) call. Try
File.read(input_filename, :encoding => 'cp850:utf-8')
if your HTML files are Windows files. If they are already UTF-8, then try:
File.read(input_filename, :encoding => 'utf-8')
Another solution may be Encoding.default_external = 'utf-8' at the beginning of your code (I wouldn't recommend it; use it only for small scripts).
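The 'cp850:utf-8' syntax means "external encoding CP850, transcode to internal UTF-8 on read". It can be sketched end to end with a throwaway file (the 0x81 byte is an assumed sample; it is "ü" in CP850):

```ruby
require 'tempfile'

# Write one raw CP850 byte (0x81 = "ü") to a temp file.
file = Tempfile.new('sample')
file.binmode
file.write("\x81".b)
file.rewind

# Read it back, transcoding CP850 -> UTF-8 on the way in.
text = File.read(file.path, encoding: 'cp850:utf-8')
text           # => "ü"
text.encoding  # => #<Encoding:UTF-8>

file.close!
```

Without the encoding option the same byte would come back tagged with the platform default encoding and likely be invalid, which is exactly how the Encoding::CompatibilityError above arises.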

Html wrongly encoded fetched by Nokogiri

I use Nokogiri to parse an HTML page. I need both the content and the image tags on the page, so I use inner_html instead of the content method. But the value returned by content is encoded correctly, while the one from inner_html is wrongly encoded. One note: the page is in Chinese and does not use UTF-8 encoding.
Here is my code:
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'iconv'
doc = Nokogiri::HTML.parse(open("http://www.sfzt.org/advise/view.asp?id=536"), nil, 'gb18030')
doc.css('td.font_info').each do |link|
  # output: correctly encoded, but missing the image tags I need: 目前市面上影响比
  puts link.content
  # output: wrongly encoded, not what I expect: <img ....></img>Ŀǰ??????Ӱ??Ƚϴ?Ľ????
  # I expect: <img ....></img>目前市面上影响比
  puts link.inner_html
end
That is explained in the 'Encoding' section of the README: http://nokogiri.org/
Strings are always stored as UTF-8 internally. Methods that return
text values will always return UTF-8 encoded strings. Methods that
return XML (like to_xml, to_html and inner_html) will return a string
encoded like the source document.
So, you should convert inner_html string manually if you want to get it as UTF-8 string:
puts link.inner_html.encode('utf-8') # for 1.9.x
I think content strips out tags well; the inner_html method, however, does not handle this well, or at all.
"You can end up with some pretty weird states if you change the inner_html (which contains tags) while you are traversing. In other words, if you are traversing a node tree, you shouldn't do anything that could add or remove nodes."
Try this:
doc.css('td.font_info').each do |link|
  puts link.content
  some_stuff = link.inner_html
  link.children = Nokogiri::HTML.fragment(some_stuff, 'utf-8')
end

Interpreting non-latin characters in Sinatra coming from Mac Excel 2011

I have a Mac VBA script making a request to a Ruby Sinatra web app.
The text passed from Excel contains characters such as é. Ruby (version 1.9.2) chokes on these characters, as Excel is not sending them as UTF-8.
# encoding: utf-8
require 'rubygems'
require 'sinatra'
require "sinatra/reloader" if development?

configure do
  class << Sinatra::Base
    def options(path, opts = {}, &block)
      route 'OPTIONS', path, opts, &block
    end
  end
  Sinatra::Delegator.delegate :options
end

options '/' do
  response.headers["Access-Control-Allow-Origin"] = "*"
  response.headers["Access-Control-Allow-Methods"] = "POST"
  halt 200
end

post '/fetch' do
  chars = []
  params['excel_input'].valid_encoding? # returns false
  params['excel_input']
end
My Excel VBA:
Sub FetchAddress()
    For Each oDest In Selection
        With ActiveSheet.QueryTables.Add(Connection:="URL;http://localhost:4567/fetch", Destination:=oDest)
            .PostText = "excel_input=" & oDest.Offset(0, -1).Value
            .RefreshStyle = xlOverwriteCells
            .SaveData = True
            .Refresh
        End With
    Next
End Sub
The character é comes out the other end as Ž.
It looks like the text in Excel is encoded as Windows-1252 http://en.wikipedia.org/wiki/Windows-1252.
The byte representation of the character is 142 (or Ž in Windows-1252).
iconv can convert the input to UTF-8; it converts text from one character encoding to another. So something like this should work:
require "iconv"
...
post '/fetch' do
  excel_input = Iconv.conv("UTF-8", "WINDOWS-1252", params['excel_input'])
  ...
end
You can also look at https://github.com/jmhodges/rchardet: it can auto-detect the charset, and then you can convert the text to UTF-8.
"Ruby 1.9 Encodings: A Primer and the Solution for Rails" by Yehuda Katz is a good read if you have some time; it goes into depth about encodings and how to convert between them.
