Ruby: extracting links from HTML

Hello, here is my script:
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title = doc.title.gsub(/\s+/, ' ').downcase.strip
  body  = doc.xpath('//body').inner_text.tr('"', '').gsub("\n", '').downcase.strip
  link = doc.search("a[@href]") # Adding this part generates errors
  filename = File.basename(input_filename, ".*")
  puts %Q("#{title}", "#{body}", "#{filename}", "#{link}").downcase
end
I am having trouble extracting links from a list of HTML files. I believe the issue is due to unconventional coding in some of the HTML files. Here is the error I am getting:
extractor.rb:9:in `block in <main>': incompatible character encodings: UTF-8 and CP850 (Encoding::CompatibilityError)
from extractor.rb:4:in `each'
from extractor.rb:4:in `<main>'

You can go about it a different way using the CSS selector:
doc.css('a').map { |link| link['href'] }
This searches the doc for all anchors and returns their href values in an array.
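If you only want anchors that actually have an href attribute, a CSS attribute selector narrows it down; this is a small variation on the snippet above, not part of the original answer:
doc.css('a[href]').map { |link| link['href'] }
# => e.g. ["http://example.com/", "/about"]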

Nokogiri always stores strings as UTF-8 internally. Methods that return text values will always return UTF-8-encoded strings.
You have a conflict between UTF-8 and CP850 (are you working on Windows?).
You can adapt your File.read(input_filename) call.
Try
File.read(input_filename, :encoding => 'cp850:utf-8')
if your HTML files are Windows (CP850) files.
If your HTML files are already UTF-8, then try:
File.read(input_filename, :encoding => 'utf-8')
Another solution may be to set Encoding.default_external = 'utf-8' at the beginning of your code (I wouldn't recommend it; use it only for small scripts).
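Putting the pieces together, a minimal sketch of the original loop with the encoding fix applied (assuming the input files really are CP850; adjust the source encoding to match your data):
require 'nokogiri'

ARGV.each do |input_filename|
  # Transcode from CP850 to UTF-8 while reading, so every string ends up UTF-8
  html = File.read(input_filename, :encoding => 'cp850:utf-8')
  doc  = Nokogiri::HTML(html)
  links = doc.css('a').map { |link| link['href'] }
  puts links.inspect
end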

Related

incompatible character encodings: ASCII-8BIT and UTF-8 in Oga gem

I am using an XML/HTML parser called Oga.
I am attempting to crawl this URL: http://www.johnvanderlyn.com and parse the body for text, like so:
def get_page
  body = Net::HTTP.get(URI.parse(@url))
  document = Oga.parse_html(body)
end
document = get_page
words = document.css('body').text
Then I get this error:
/gems/oga-2.7/lib/oga/xml/node_set.rb:276:in `block in text': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
That is related to this bit of code here.
What could be causing this and how can I fix it? Is there a way for me to fix it locally, or do I have to fork the gem, fix that method and then use my fork?
Thoughts?
The bit of code you linked has nothing to do with the glitch; the issue is that body is being interpreted in the wrong encoding. Try adding body = body.force_encoding 'UTF-8' before parsing the document:
def get_page
  body = Net::HTTP.get(URI.parse(@url)).force_encoding 'UTF-8'
  document = Oga.parse_html(body)
end

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in Ruby and split the text line by line.
Here is my code:
def file_read(filename)
  File.open(filename, 'r').read
end
puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
  File.open(filename, 'r').read
end

def line_cutter(file)
  file.scan(/\w/)
end
puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found a suggested fix online for untrusted input and tried to use it for my own code, but it's not working. How can I get rid of this error?
Link to the file: File
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting it isn't desired or possible then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
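With the file read that way, the original line_cutter method works unchanged; a quick usage sketch under that assumption (reusing line_cutter from the question):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
puts line_cutter(s).first(20).join  # no more ArgumentError from scan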
It seems to work if you read the file directly from the page; maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)

SmarterCSV and file encoding issues in Ruby

I'm working with a file that appears to have UTF-16LE encoding. If I run
File.read(file, :encoding => 'utf-16le')
the first line of the file is:
"<U+FEFF>=\"25/09/2013\"\t18:39:17\t=\"Unknown\"\t=\"+15168608203\"\t\"Message.\"\r\n
If I read the file using something like
csv_text = File.read(file, :encoding => 'utf-16le')
I get an error stating
ASCII incompatible encoding needs binmode (ArgumentError)
If I switch the encoding in the above to
csv_text = File.read(file, :encoding => 'utf-8')
I make it to the SmarterCSV section of the code, but get an error that states
`=~': invalid byte sequence in UTF-8 (ArgumentError)
The full code is below. If I run this in the Rails console, it works just fine, but if I run it using ruby test.rb, it gives me the first error:
require 'smarter_csv'

headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
path = '/path/'

Dir.glob("#{path}*.CSV").each do |file|
  csv_text = File.read(file, :encoding => 'utf-16le')
  File.open('/tmp/tmp_file', 'w') { |tmp_file| tmp_file.write(csv_text) }
  puts 'made it here'
  SmarterCSV.process('/tmp/tmp_file', {
    :col_sep => "\t",
    :force_simple_split => true,
    :headers_in_file => false,
    :user_provided_headers => headers
  }).each do |row|
    converted_row = {}
    converted_row[:date_of_message] = row[:date_of_message][2..-2].to_date
    converted_row[:timestamp] = row[:timestamp]
    converted_row[:sender] = row[:sender][2..-2]
    converted_row[:phone_number] = row[:phone_number][2..-2]
    converted_row[:message] = row[:message][1..-2]
    converted_row[:room] = file.gsub(path, '')
  end
end
Update - 05/13/15
Ultimately, I decided to encode the file string as UTF-8 rather than diving deeper into the SmarterCSV code. The first problem in the SmarterCSV code is that it does not allow a user to specify binary mode when reading in a file, but after adjusting the source to handle that, a myriad of other encoding-related issues popped up, many of which related to the handling of various parameters on files that were not UTF-8 encoded. It may have been the easy way out, but encoding everything as UTF-8 before feeding it into SmarterCSV solved my issue.
Add binmode to the File.read call.
File.read(file, :encoding => 'utf-16le', mode: "rb")
"b" Binary file mode
Suppresses EOL <-> CRLF conversion on Windows. And
sets external encoding to ASCII-8BIT unless explicitly
specified.
ref: http://ruby-doc.org/core-2.0.0/IO.html#method-c-read
Now pass the correct encoding to SmarterCSV
SmarterCSV.process('/tmp/tmp_file', {
:file_encoding => "utf-16le", ...
Update
It was found that SmarterCSV does not support binary mode. After the OP attempted to modify the code with no success, it was decided the simple solution was to convert the input to UTF-8, which SmarterCSV supports.
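A hedged sketch of that workaround, based on the code in the question (read the UTF-16LE file, transcode to UTF-8, drop the BOM, write a temp file, then let SmarterCSV read plain UTF-8):
csv_text = File.read(file, :encoding => 'utf-16le', :mode => 'rb')
utf8_text = csv_text.encode('utf-8').sub("\uFEFF", '')  # transcode and strip the BOM
File.open('/tmp/tmp_file', 'w:utf-8') { |tmp_file| tmp_file.write(utf8_text) }
SmarterCSV.process('/tmp/tmp_file', :col_sep => "\t",
                   :headers_in_file => false,
                   :user_provided_headers => headers)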
Unfortunately, you're using a 'flat-file' style of storage and character encoding is going to be an issue on both ends (reading or writing).
I would suggest using something along the lines of str = str.force_encoding("UTF-8") and see if you can get that to work.

incompatible character encodings: UTF-8 and ASCII-8BIT in render action

ActionView::Template::Error (incompatible character encodings: UTF-8
and ASCII-8BIT): app/controllers/posts_controller.rb:27:in `new'
# GET /posts/new
def new
  if params[:post]
    @post = Post.new(post_params).dup
    if @post.valid?
      render :action => "confirm"
    else
      format.html { render action: 'new' }
      format.json { render json: @post.errors, status: :unprocessable_entity }
    end
  else
    @post = Post.new
    @document = Document.new
    @documents = @post.documents.all
    @document = @post.documents.build
  end
end
I don't know why it is happening.
Make sure config.encoding = "utf-8" is there in the application.rb file.
Make sure you are using the 'mysql2' gem instead of the mysql gem.
Put # encoding: utf-8 at the top of the rake file.
Above the Rails.application.initialize! line in the environment.rb file, add the following two lines:
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
solution from here: http://rorguide.blogspot.in/2011/06/incompatible-character-encodings-ascii.html
If the above solutions don't help, then I think you either copy/pasted part of your Haml template into the file, or you're working with a non-Unicode/non-UTF-8-friendly editor.
Try recreating that file from scratch in a UTF-8-friendly editor (there are plenty for any platform) and see whether that fixes your problem.
Sometimes you may get this error:
incompatible character encodings: ASCII-8BIT and UTF-8
That typically happens because you are trying to concatenate two strings, and one contains bytes that do not map to the character set of the other string. For example, there are characters in UTF-8 that have no equivalents in ISO-8859-1, and handling string joins across those incompatibilities requires the programmer to step in.
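A tiny illustration of that situation and one way to resolve it (the variable names are made up):
utf8   = "café"                                     # UTF-8 string
binary = "caf\xC3\xA9".force_encoding('ASCII-8BIT') # raw bytes, e.g. from a socket

# utf8 + binary                        # => Encoding::CompatibilityError
utf8 + binary.force_encoding('UTF-8')  # => "cafécafé" once both sides agree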
I was upgrading my Rails and Spree apps, and the error was actually coming from the cache.
Deleting the cache solved the problem for me:
rm -rf tmp/cache

Ruby CSV UTF8 encoding error while reading

This is what I was doing:
csv = CSV.open(file_name, "r")
I used this for testing:
line = csv.shift
while not line.nil?
puts line
line = csv.shift
end
And I ran into this:
ArgumentError: invalid byte sequence in UTF-8
I read the answer here and this is what I tried
csv = CSV.open(file_name, "r", encoding: "windows-1251:utf-8")
I ran into the following error:
Encoding::UndefinedConversionError: "\x98" to UTF-8 in conversion from Windows-1251 to UTF-8
Then I came across a Ruby gem - charlock_holmes. I figured I'd try using it to find the source encoding.
CharlockHolmes::EncodingDetector.detect(File.read(file_name))
=> {:type=>:text, :encoding=>"windows-1252", :confidence=>37, :language=>"fr"}
So I did this:
csv = CSV.open(file_name, "r", encoding: "windows-1252:utf-8")
And still got this:
Encoding::UndefinedConversionError: "\x8F" to UTF-8 in conversion from Windows-1252 to UTF-8
It looks like you have a problem detecting the valid encoding of your file. CharlockHolmes gives you a useful hint in :confidence=>37, which simply means the detected encoding may not be the right one.
Based on the error messages and test_transcode.rb from https://github.com/MacRuby/MacRuby/blob/master/test-mri/test/ruby/test_transcode.rb, I found an encoding that passes both of your error bytes. With the help of String#encode it's easy to test:
"\x8F\x98".encode("UTF-8","cp1256") # => "ڈک"
Your issue looks strictly related to the file and not to Ruby.
In case we are not sure which encoding to use and can accept losing some characters, we can use the :invalid and :undef params of String#encode, in this case:
"\x8F\x98".encode("UTF-8", "CP1250",:invalid => :replace, :undef => :replace, :replace => "?") # => "Ź?"
Another way is to use Iconv with the //IGNORE option for the target encoding:
Iconv.iconv("UTF-8//IGNORE","CP1250", "\x8F\x98")
As the source encoding, the suggestion from CharlockHolmes should be pretty good.
PS: String#encode was introduced in Ruby 1.9. With Ruby 1.8 you can use Iconv instead.
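For the CSV case in the question, a hedged way to apply that lossy conversion before parsing (assumes Ruby 1.9+ and that losing the odd character is acceptable):
require 'csv'

raw  = File.read(file_name, :encoding => 'windows-1252')
utf8 = raw.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?')
csv  = CSV.parse(utf8)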
