Jekyll encoding name of category special characters - ruby

My Jekyll installation used to work. Since an update, I have had an issue with URLs containing tag names that include special characters.
I now get an error message when trying to reach a URL with special characters in it, like http://127.0.0.1:4000/tag/Actualit%C3%A9%20europ%C3%A9enne/, where Actualité européenne is the name of a category.
The error message is incompatible character encodings: UTF-8 and ASCII-8BIT. All the files in the _posts directory are UTF-8.
Here is the stack trace:
[2017-01-30 17:39:09] ERROR Encoding::CompatibilityError: incompatible
character encodings: UTF-8 and ASCII-8BIT
/usr/lib/ruby/2.1.0/webrick/httpservlet/filehandler.rb:313:in
'set_filename'
/usr/lib/ruby/2.1.0/webrick/httpservlet/filehandler.rb:282:in
'exec_handler'
/usr/lib/ruby/2.1.0/webrick/httpservlet/filehandler.rb:217:in
'do_GET'
/var/lib/gems/2.1.0/gems/jekyll-3.4.0/lib/jekyll/commands/serve/servlet.rb:30:in
'do_GET' /usr/lib/ruby/2.1.0/webrick/httpservlet/abstract.rb:106:in
'service'
/usr/lib/ruby/2.1.0/webrick/httpservlet/filehandler.rb:213:in
'service' /usr/lib/ruby/2.1.0/webrick/httpserver.rb:138:in 'service'
/usr/lib/ruby/2.1.0/webrick/httpserver.rb:94:in 'run'
/usr/lib/ruby/2.1.0/webrick/server.rb:295:in 'block in start_thread'
[2017-01-30 17:41:59] ERROR Encoding::CompatibilityError: incompatible
character encodings: UTF-8 and ASCII-8BIT
/usr/lib/ruby/2.1.0/webrick/httpservlet/filehandler.rb:313:in
'set_filename'
/usr/lib/ruby/2.1.0/webrick/httpservlet/filehandler.rb:282:in
'exec_handler'
/usr/lib/ruby/2.1.0/webrick/httpservlet/filehandler.rb:217:in
'do_GET'
/var/lib/gems/2.1.0/gems/jekyll-3.4.0/lib/jekyll/commands/serve/servlet.rb:30:in
'do_GET' /usr/lib/ruby/2.1.0/webrick/httpservlet/abstract.rb:106:in
'service'
/usr/lib/ruby/2.1.0/webrick/httpservlet/filehandler.rb:213:in
'service' /usr/lib/ruby/2.1.0/webrick/httpserver.rb:138:in 'service'
/usr/lib/ruby/2.1.0/webrick/httpserver.rb:94:in 'run'
/usr/lib/ruby/2.1.0/webrick/server.rb:295:in 'block in start_thread'
I've renamed all the files in _posts to remove special characters from the filenames, but it still does not work. I don't want to rename the tags.

All pages are treated as UTF-8 by default, but you can override this in _config.yml:
encoding: ENCODING
However, Jekyll does not seem to handle non-English Unicode characters well (as of January 2017); see the similar issue "Slugify a string doesn't seem to work on Unicode/Swedish letters" #4623. The space may also cause a small problem if you don't put the category name inside quotes (' ').
A fix would be to slugify your categories explicitly before putting them in the URL, using a generator, with:
slug = category.strip.downcase.gsub(' ', '-').gsub(/[^\w-]/, '') # category slugifier
# use this slug as the category id
The slugifier above only downcases, replaces spaces with -, and removes every character that is not an ASCII letter, digit, underscore, or hyphen, so you'll need to add further gsub substitutions before the final .gsub(/[^\w-]/, '') to replace:
é è ê -> e
à â -> a
...
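As a rough sketch of the substitutions described above, the accent replacement can be folded into a single gsub that takes a hash. The ACCENT_MAP table and the slugify_category name are hypothetical (they are not part of Jekyll), and the table is deliberately incomplete:

```ruby
# Hypothetical helper, not part of Jekyll. The table below is illustrative
# and incomplete; extend it for the characters your categories use.
ACCENT_MAP = {
  'é' => 'e', 'è' => 'e', 'ê' => 'e',
  'à' => 'a', 'â' => 'a'
}.freeze

def slugify_category(category)
  category.strip.downcase
          .gsub(Regexp.union(ACCENT_MAP.keys), ACCENT_MAP) # é è ê -> e, à â -> a
          .gsub(' ', '-')                                   # spaces -> hyphens
          .gsub(/[^\w-]/, '')                               # keep only ASCII letters, digits, _ and -
end

puts slugify_category('Actualité européenne') # prints "actualite-europeenne"
```

Doing the accent substitution before the final gsub matters: otherwise the accented letters would simply be dropped instead of being replaced.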
Update
While reading old Jekyll issues on GitHub to implement a "fix" for this one, I found this detailed solution posted by @david-jacquel in 2014:
This needs to change the way Jekyll generates urls for posts. This can
be done with a plugin.
# _plugins/post.rb
module Jekyll
  class Post
    # override post method in order to return categories names as slug
    # instead of strings
    #
    # An url for a post with category "category with space" will be in
    # slugified form : /category-with-space
    # instead of url encoded form : /category%20with%20space
    #
    # @see utils.slugify
    def url_placeholders
      {
        :year        => date.strftime("%Y"),
        :month       => date.strftime("%m"),
        :day         => date.strftime("%d"),
        :title       => slug,
        :i_day       => date.strftime("%-d"),
        :i_month     => date.strftime("%-m"),
        :categories  => (categories || []).map { |c| Utils.slugify(c) }.join('/'),
        :short_month => date.strftime("%b"),
        :short_year  => date.strftime("%y"),
        :y_day       => date.strftime("%j"),
        :output_ext  => output_ext
      }
    end
  end
end
-- David Jacquel on Jekyll/jekyll-help/issues/129#
That will resolve the space issue and give a starting point for solving the encoding problem with the category name.

Related

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

I am trying to convert a string from ISO-8859-1 encoding to UTF-8 but I can't seem to get it to work. Here is an example of what I have done in irb.
irb(main):050:0> string = 'Norrlandsvägen'
=> "Norrlandsvägen"
irb(main):051:0> string.force_encoding('iso-8859-1')
=> "Norrlandsv\xC3\xA4gen"
irb(main):052:0> string = string.encode('utf-8')
=> "NorrlandsvÃ¤gen"
I am not sure why Norrlandsvägen in iso-8859-1 is converted into NorrlandsvÃ¤gen in utf-8.
I have tried encode, encode!, encode(destinationEncoding, originalEncoding), iconv, force_encoding, and all kinds of weird work-arounds I could think of but nothing seems to work. Can someone please help me/point me in the right direction?
Ruby newbie still pulling hair like crazy but feeling grateful for all the replies here... :)
Background of this question: I am writing a gem that will download an xml file from some websites (which will have iso-8859-1 encoding) and save it in a storage and I would like to convert it to utf-8 first. But words like Norrlandsvägen keep messing me up. Really any help would be greatly appreciated!
[UPDATE]: I realized running tests like this in the irb console might give me different behaviors so here is what I have in my actual code:
def convert_encoding(string, originalEncoding)
  puts "#{string.encoding}" # ASCII-8BIT
  string.encode(originalEncoding)
  puts "#{string.encoding}" # still ASCII-8BIT
  string.encode!('utf-8')
end
but the last line gives me the following error:
Encoding::UndefinedConversionError - "\xC3" from ASCII-8BIT to UTF-8
Thanks to @Amadan's answer below, I noticed that \xC3 actually shows up in irb if you run:
irb(main):001:0> string = 'ä'
=> "ä"
irb(main):002:0> string.force_encoding('iso-8859-1')
=> "\xC3\xA4"
I have also tried to assign a new variable to the result of string.encode(originalEncoding) but got an even weirder error:
newString = string.encode(originalEncoding)
puts "#{newString.encoding}" # can't even get to this line...
newString.encode!('utf-8')
and the error is Encoding::UndefinedConversionError - "\xC3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1
I am still quite lost in all of this encoding mess but I am really grateful for all the replies and help everyone has given me! Thanks a ton! :)
You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
EDIT For your specific problem, this should work:
require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
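The asker's convert_encoding helper could be rewritten along the same lines. This is only a minimal sketch, assuming the bytes really are in the encoding named by original_encoding (the parameter name and the sample bytes are illustrative): it first tags the raw bytes with their true encoding, then transcodes them.

```ruby
# Sketch: tag the raw bytes with their real encoding, then transcode to UTF-8.
# Assumes the caller knows the source encoding (for example from the XML declaration).
def convert_encoding(string, original_encoding)
  string.dup.force_encoding(original_encoding).encode('UTF-8')
end

raw = "Norrlandsv\xE4gen".b            # bytes as downloaded, tagged ASCII-8BIT
converted = convert_encoding(raw, 'ISO-8859-1')
puts converted.encoding                # UTF-8
puts converted                         # Norrlandsvägen
```

Working on a dup keeps the caller's original string untouched, since force_encoding modifies its receiver in place.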
There's a difference between force_encoding and encode. The former sets the encoding for the string, whereas the latter actually transcodes the contents of the string to the new encoding. Consequently, the following code causes your problem:
string = "Norrlandsvägen"
string.force_encoding('iso-8859-1')
puts string.encode('utf-8') # NorrlandsvÃ¤gen
Whereas the following code will actually correctly encode your contents:
string = "Norrlandsvägen".encode('iso-8859-1')
string.encode!('utf-8')
Here's an example running in irb:
irb(main):023:0> string = "Norrlandsvägen".encode('iso-8859-1')
=> "Norrlandsv\xE4gen"
irb(main):024:0> string.encoding
=> #<Encoding:ISO-8859-1>
irb(main):025:0> string.encode!('utf-8')
=> "Norrlandsvägen"
irb(main):026:0> string.encoding
=> #<Encoding:UTF-8>
The above answer was spot on. Specifically this point here:
There's a difference between force_encoding and encode. The former
sets the encoding for the string, whereas the latter actually
transcodes the contents of the string to the new encoding.
In my situation, I had a text file with iso-8859-1 encoding. By default, Ruby uses UTF-8 encoding, so if you were to try to read the file without specifying the encoding, then you would get an error:
results = File.read(file)
results.encoding
=> #<Encoding:UTF-8>
results.split("\r\n")
ArgumentError: invalid byte sequence in UTF-8
You get an invalid byte sequence error because ISO-8859-1 stores each accented character as a single byte, and such a byte on its own is not a valid UTF-8 sequence (UTF-8 uses multi-byte sequences for those characters). Consequently, you need to specify the encoding to the File API. Think of it like force_encoding:
results = File.read(file, encoding: "iso-8859-1")
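To see why the untagged read fails, here is a small sketch that uses valid_encoding? on an in-memory string rather than a real file (the sample bytes are made up for illustration):

```ruby
# The same bytes, interpreted under two different encodings.
latin1_bytes = "Norrlandsv\xE4gen".b   # 0xE4 is "ä" in ISO-8859-1

as_utf8   = latin1_bytes.dup.force_encoding('UTF-8')
as_latin1 = latin1_bytes.dup.force_encoding('ISO-8859-1')

puts as_utf8.valid_encoding?    # false: a lone 0xE4 is not a complete UTF-8 sequence
puts as_latin1.valid_encoding?  # true
puts as_latin1.encode('UTF-8')  # Norrlandsvägen
```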
So everything is good, right? No, not if you want to parse the ISO-8859-1 string using UTF-8 string literals:
results = File.read(file, encoding: "iso-8859-1")
results.each_line do |line|
  puts line.split('¬')
end
Encoding::CompatibilityError: incompatible character encodings: ISO-8859-1 and UTF-8
Why this error? Because the literal '¬' in your source file is a UTF-8 string, and you are using it against an ISO-8859-1 string that contains non-ASCII characters. The two encodings are incompatible. So, after you read the file as ISO-8859-1, ask Ruby to encode it into UTF-8. From then on you are working with UTF-8 strings and the problem goes away:
results = File.read(file, encoding: "iso-8859-1").encode('UTF-8')
results.encoding
results = results.split("\r\n")
results.each do |line|
  puts line.split('¬')
end
Ultimately, with some Ruby APIs, you do not need to use force_encoding('ISO-8859-1'). Instead, you just specify the expected encoding to the API. However, you must convert it back to UTF-8 if you plan to parse it with UTF-8 strings.
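The read-then-transcode step can also be written in one go, because File.read accepts an "external:internal" encoding pair. The following is a sketch under stated assumptions: the read_rows helper and the temporary sample file are hypothetical, and the file is assumed to use "\r\n" line endings and "¬" as a separator, as in the example above.

```ruby
require 'tempfile'

# Hypothetical helper: read a file stored in source_encoding, transcode it to
# UTF-8 while reading, then split it into rows and fields.
def read_rows(path, source_encoding = 'ISO-8859-1')
  text = File.read(path, encoding: "#{source_encoding}:UTF-8")
  text.split("\r\n").map { |line| line.split('¬') }
end

# Build a small ISO-8859-1 sample file (0xAC is "¬" in that encoding).
sample = Tempfile.new('latin1')
sample.binmode
sample.write("a\xACb\r\nc\xACd\r\n".b)
sample.close

rows = read_rows(sample.path)
sample.unlink
p rows
```

Since the text is already UTF-8 by the time split runs, the UTF-8 literal '¬' and the file contents share one encoding.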

incompatible character encodings: UTF-8 and ASCII-8BIT in render action

ActionView::Template::Error (incompatible character encodings: UTF-8
and ASCII-8BIT): app/controllers/posts_controller.rb:27:in `new'
# GET /posts/new
def new
  if params[:post]
    @post = Post.new(post_params).dup
    if @post.valid?
      render :action => "confirm"
    else
      format.html { render action: 'new' }
      format.json { render json: @post.errors, status: :unprocessable_entity }
    end
  else
    @post = Post.new
    @document = Document.new
    @documents = @post.documents.all
    @document = @post.documents.build
  end
end
I don't know why it is happening.
Make sure config.encoding = "utf-8" is set in the application.rb file.
Make sure you are using the 'mysql2' gem instead of the mysql gem.
Put # encoding: utf-8 at the top of the Rakefile.
Above the Rails.application.initialize! line in the environment.rb file, add the following two lines:
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
Solution from here: http://rorguide.blogspot.in/2011/06/incompatible-character-encodings-ascii.html
If the above did not help, then I think you either copy/pasted part of your Haml template into the file, or you're working with an editor that is not Unicode/UTF-8 friendly.
If you can, recreate that file from scratch in a UTF-8 friendly editor (there are plenty for any platform) and see whether this fixes your problem.
Sometimes you may get this error:
incompatible character encodings: ASCII-8BIT and UTF-8
That typically happens because you are trying to concatenate two strings that carry different encodings, and both contain non-ASCII bytes. ASCII-8BIT is Ruby's name for raw binary data, so Ruby cannot tell how those bytes map to UTF-8 characters, and deciding how to join the strings requires the programmer to step in.
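A small sketch of that situation, with made-up sample strings. It assumes the binary bytes really are UTF-8, which is what makes retagging them with force_encoding safe here:

```ruby
utf8   = "caf\u00e9"          # UTF-8 string with a non-ASCII character
binary = "caf\xC3\xA9".b      # the same bytes, but tagged ASCII-8BIT

# Encoding.compatible? returns nil when the two strings cannot be joined.
compatible = Encoding.compatible?(utf8, binary)

error_class = begin
  utf8 + binary
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# Assumption: the binary bytes really are UTF-8, so retagging them is safe.
joined = utf8 + binary.dup.force_encoding('UTF-8')
puts joined  # cafécafé
```

If the binary data were in some other encoding, the right move would be force_encoding with that encoding followed by encode('UTF-8'), not a plain retag.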
I was upgrading my Rails and Spree, and the error was actually coming from the cache.
Deleting the cache solved the problem for me:
rm -rf tmp/cache

Ruby CSV + Spanish Characters: Encoding::UndefinedConversionError

I'm getting an error U+2014 from UTF-8 to ISO-8859-1 when I try to use send_data with a CSV that has spanish characters:
model:
def self.books_data(books)
  csv = CSV.generate(:col_sep => "|", quote_char: '"') do |csv|
    ...
  end
  csv
end
controller:
def export_data
  ...
  data = CsvGenerator.books_data(@books)
  send_data(data.encode("iso-8859-1"), filename: "books_data_#{date}.csv", type: 'text/csv; charset=iso-8859-1; header=present') #<-- error occurs here
end
How would I fix this?
=== UPDATE ===
I think I semi-fixed it by replacing .encode with .force_encoding. However, I now have a lot of characters that don't look right:
Ex: The file contains:
My Diary from Here to There / Mi diario de aqui hasta allÃ¡
when it should look like
My Diary from Here to There / Mi diario de aqui hasta allá
String#force_encoding should not be used for conversion, as it just "tags" the string with a different encoding, while #encode does the actual conversion.
The reason you're getting this error is that somewhere in your data you have a \u2014 character: "—". As the String#encode documentation states:
raise Encoding::UndefinedConversionError for characters that are undefined in the destination encoding [...]
And if you check the ISO-8859-1 table (http://en.wikipedia.org/wiki/ISO/IEC_8859-1), there is no "—"
character in 8859-1. So to solve this, you need to remove or replace those "invalid" characters in the data.
Besides that, unless there is a specific reason, you should avoid such conversions and let the CSV be generated in UTF-8.
http://railscasts.com/episodes/362-exporting-csv-and-excel
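If the ISO-8859-1 output has to stay, one way to handle the unsupported characters is sketched below. The to_latin1 helper is hypothetical, and swapping the dash for a hyphen and everything else for "?" is an arbitrary choice; the undef: and replace: options are documented String#encode options.

```ruby
# Hypothetical helper: swap the em dash (U+2014) for a hyphen, then replace
# anything else that ISO-8859-1 cannot represent with "?".
def to_latin1(text)
  text.gsub("\u2014", '-')
      .encode('ISO-8859-1', undef: :replace, replace: '?')
end

latin1 = to_latin1("Here to There \u2014 hasta all\u00e1")
puts latin1.encoding           # ISO-8859-1
puts latin1.encode('UTF-8')    # Here to There - hasta allá
```

In the controller above, the result of such a helper would be passed to send_data in place of data.encode("iso-8859-1").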

how to select dropdown having Encoding::UndefinedConversionError in watir?

I want to select dropdown having text="Côte d'Ivoire".
ie.select_list(:id, "name01").select("#{text}")
I tried the following:
1. # encoding: UTF-8 (not working)
2. text.force_encoding("ASCII-8BIT").encode('UTF-8', undef: :replace, replace:'')
# text = Cte d'Ivoire
What should I do?
I also want to save this text to my DB. Please help.
If you know the string is UTF-8 encoded, why not just force encoding to UTF-8?
#encoding: ASCII-8BIT
str = "C\xC3\xB4te d'Ivoire" # => "C\xC3\xB4te d'Ivoire"
str.encoding # => #<Encoding:ASCII-8BIT>
str.force_encoding('UTF-8')
str # => "Côte d'Ivoire"
str.encoding # => #<Encoding:UTF-8>
If you are using Côte d'Ivoire as a literal anywhere in your Ruby source files, be sure to add
#encoding: UTF-8
as the first line of the file to tell Ruby that the file is UTF-8 encoded.
I would have expected your solutions to work, unless the software you are using to save/execute the files is overriding the setting. I recall having that issue with NetBeans.
An alternative, if you cannot fix the actual encoding, is to use a regex to match just the standard characters.
text = /C.te d'Ivoire/
browser.select_list.select(text)
The regex replaces each accented character with a . (which matches any single character).
Not a great solution, but perhaps a solution if nothing else works.

Ruby extracting links from html

Hello here is my script:
ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title.gsub(/\s+/, " ").downcase.strip, doc.xpath('//body').inner_text.tr('"', '').gsub("\n", '').downcase.strip
  link = doc.search("a[@href]") # Adding this part generates errors
  filename = File.basename(input_filename, ".*")
  puts %Q("#{title}", "#{body}", "#{filename}", "#{link}").downcase
end
I am having trouble extracting links from a list of HTML files. I believe the issue is due to unconventional encoding in some of the HTML files. Here is the error I am getting.
extractor.rb:9:in `block in <main>': incompatible character encodings: UTF-8 and CP850 (Encoding::CompatibilityError)
from extractor.rb:4:in `each'
from extractor.rb:4:in `<main>'
You can go about it a different way using the CSS selector:
doc.css('a').map { |link| link['href'] }
This would search the doc for all anchors and return their href text in an array.
Nokogiri always stores strings as UTF-8 internally. Methods that return text values always return UTF-8 encoded strings.
You have a conflict between UTF-8 and CP850 (are you working on Windows?).
You could adapt your File.read(input_filename) call.
Try
File.read(input_filename, :encoding => 'cp850:utf-8')
if your HTML files are Windows files.
If your HTML files are already UTF-8, then try:
File.read(input_filename, :encoding => 'utf-8')
Another solution may be to set Encoding.default_external = 'utf-8' at the beginning of your code. (I wouldn't recommend it; use it only for small scripts.)

Resources