Hpricot - invalid byte sequence in UTF-8 - Ruby

I have already done some searching, but nothing I found solves this peculiar, unexpected problem.
Just look at the code below:
require 'open-uri'
require 'hpricot'
doc = Hpricot(open("http://www.baidu.com/")) #this web page's encoding is GB2312
I don't know what's going on here; you can run this in your irb to see if you get the same problem.
It just pops up "ArgumentError: invalid byte sequence in UTF-8".
I have tried to convert the original HTML to UTF-8 with Iconv, but it still won't work.
I really don't know what to do now, please help.

Hpricot - UTF-8 issues
I am getting "invalid byte sequence in UTF-8" (ArgumentError) with this code:
require 'hpricot'
require 'open-uri'
doc = open('http://www.amazon.co.jp/') {|f| Hpricot(f.read) }
puts doc.to_html
I also tried:
open('http://www.amazon.co.jp/') {|f| Hpricot(f.read.encode("UTF-8")) }
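One approach that may work here is to relabel the raw bytes with the page's real encoding and then transcode to UTF-8 before parsing. This is only a minimal sketch: it assumes amazon.co.jp serves Shift_JIS (check the charset in the response headers or the meta tag), and it replaces any bytes that still will not convert.
require 'open-uri'
require 'hpricot'

raw  = open('http://www.amazon.co.jp/') { |f| f.read }
sjis = raw.force_encoding('Shift_JIS')   # relabel the bytes; nothing is converted yet
utf8 = sjis.encode('UTF-8', invalid: :replace, undef: :replace)  # transcode, replacing bad bytes
doc  = Hpricot(utf8)
puts doc.to_html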

Here is a way to make it work with Net::HTTP (Ruby 1.9.2):
require 'net/http'
require 'uri'
url = URI.parse('http://www.baidu.com')
res = Net::HTTP.start(url.host, url.port) {|http|
  http.get('/')
}
str = res.body.force_encoding('GB2312')
puts str
puts str.encoding.name # => GB2312
Does that help?
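To go one step further and feed the page to Hpricot, the force_encoding call can be followed by an actual conversion with encode. A minimal sketch, assuming the page really is GB2312 and that your Ruby build ships the GB2312 converter:
require 'net/http'
require 'uri'
require 'hpricot'

url = URI.parse('http://www.baidu.com')
res = Net::HTTP.start(url.host, url.port) { |http| http.get('/') }

str  = res.body.force_encoding('GB2312')                        # relabel the bytes, no conversion yet
utf8 = str.encode('UTF-8', invalid: :replace, undef: :replace)  # now actually transcode to UTF-8
doc  = Hpricot(utf8)
puts utf8.encoding.name # => UTF-8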

Related

I am getting an "(eval):1: invalid Unicode codepoint" error while trying to scrape Instagram

I am trying to scrape data from Instagram. Here is my code:
require 'open-uri'
require 'nokogiri'
require 'json'
require "unicode/emoji"
def get_html
  url = 'https://www.instagram.com/muriithi_kabogo/'
  html = open(url)
end

def pass_data
  html = get_html
  doc = Nokogiri::HTML(html)
end

def get_data
  profiles = []
  body = pass_data.at('body')
  script = body.at('script').text
  myText = script
  json_object_data = eval(myText)
end

get_data()
When I try to convert the text into JSON, I get an error:
(eval):1: invalid Unicode codepoint (SyntaxError)
usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr
How do I move past this error?
JSON, like JavaScript, escapes characters outside the Basic Multilingual Plane as UTF-16 surrogate pairs (that is what \ud83d\ude0a is), and Ruby string literals choke on those escapes.
Do not use eval. For one thing, Ruby will reject \ud83d\ude0a as invalid codepoints, as it should; for another, eval on data you do not control is a security hole; and lastly, it slows down your code.
Use JSON.parse, which is safer, faster, and knows how to decode surrogate pairs:
require 'json'
json_str = '"usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr"'
JSON.parse(json_str)
# => "usinessmen #beautiful #smile😊 #teambringit #shebr"
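Putting it together with the scraping code above: a minimal sketch that pulls the JSON out of the page with a regular expression and parses it with JSON.parse instead of eval. The window._sharedData script tag and its exact shape are assumptions about Instagram's markup and may have changed:
require 'open-uri'
require 'nokogiri'
require 'json'

html = open('https://www.instagram.com/muriithi_kabogo/')
doc  = Nokogiri::HTML(html)

# Find the script tag that carries the profile data (assumed structure).
script = doc.css('script').map(&:text).find { |t| t.include?('window._sharedData') }

if script
  json_str = script[/window\._sharedData\s*=\s*(\{.*\});/m, 1]  # grab just the JSON object
  data = JSON.parse(json_str)  # JSON.parse decodes \ud83d\ude0a surrogate pairs correctly
  puts data.keys
end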

How to search in page body and force an encoding conversion

I have pretty simple code:
require 'rubygems'
require 'mechanize'
URL = 'http://yandex.ru'
agent = Mechanize.new
page = agent.get(URL)
# page.encoding => UTF-8
# page.body.encoding => ASCII-8BIT
page.body.include?("Карты")
And on the last line of that code Ruby returned an error:
in `include?': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
Solutions from "How to get Mechanize to auto-convert body to UTF8?" don't help. What should I do to fix it?
You can use the force_encoding method like this:
agent.page.body.force_encoding('utf-8')
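For context, a minimal sketch of that fix in place; it assumes the body really is UTF-8, as Mechanize itself reports via page.encoding, so force_encoding only relabels the bytes without converting anything:
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://yandex.ru')

body = page.body.force_encoding('utf-8')  # relabel the ASCII-8BIT bytes as UTF-8
puts body.include?("Карты")               # both strings are now UTF-8, so no CompatibilityError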

How to remove non-UTF8 characters in RSS with Ruby

I'm using Ruby's RSS library to parse an RSS feed, but I occasionally encounter errors when a bullet point character appears in the feed as a �.
require 'rss'
rss = RSS::Parser.parse('rss_url_here', false)
which results in
#<ArgumentError: invalid byte sequence in UTF-8>
due to the � character. How can I remove � characters?
Update:
I have tried using
require 'net/http'
require 'rss'
uri = URI('https://newyork.craigslist.org/search/jjj?query=graphic%20design&s=100&sort=date&format=rss')
json = Net::HTTP.get(uri)
json.force_encoding('CP1252')
json.force_encoding('utf-8')
rss = RSS::Parser.parse(json, false)
Still getting
ArgumentError: invalid byte sequence in UTF-8
You can use HTMLEntities
HTMLEntities.new.decode(rss_feed_content)
It is worth reading the documentation for the two methods mentioned in the comments and distinguishing force_encoding from encode: force_encoding only relabels the bytes, while encode actually transcodes them.
require 'net/http'
require 'rss'
uri = URI('https://newyork.craigslist.org/search/jjj?query=graphic%20design&s=100&sort=date&format=rss')
text = Net::HTTP.get(uri)
rss = RSS::Parser.parse(text.force_encoding('CP1252').encode('utf-8'), false)
#⇒ #<RSS::RDF:0x000000053791a0 .....
I like to remove junk char codes like this:
json = json.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
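As an end-to-end sketch combining the pieces above: fetch the feed, strip the bytes that cannot be represented, then parse. Note that converting from 'binary' with replace: '' drops every non-ASCII byte, so the bullet points are removed rather than transcoded; if you need them, the CP1252 force_encoding/encode approach shown earlier is the better route.
require 'net/http'
require 'rss'

uri  = URI('https://newyork.craigslist.org/search/jjj?query=graphic%20design&s=100&sort=date&format=rss')
text = Net::HTTP.get(uri)

clean = text.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
rss   = RSS::Parser.parse(clean, false)
puts rss.items.first.title if rss && !rss.items.empty?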

How to save StringIO (PDF) data into a file

I want to use Ruby to save a PDF file that is located on an external remote server. The PDF data comes in as a StringIO. I tried saving the data with File.write, but it is not working; I received the error below.
ArgumentError: string contains null byte
How can I save it?
require 'stringio'
sio = StringIO.new("he\x00llo")
File.open('data.txt', 'w') do |f|
  f.puts(sio.read)
end
$ cat data.txt
hello
Response to comment:
Okay, try this:
require 'stringio'
sio = StringIO.new("\c2\xb5")
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
File.open('data.txt', 'w:utf-8') do |f|
  f.puts(sio.read)
end
--output:--
1.rb:7:in `write': "\xB5" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
To get rid of that error, you can set the encoding of the StringIO to UTF-8:
require 'stringio'
sio = StringIO.new("\c2\xb5")
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
sio.set_encoding('UTF-8') #Change the encoding to what it should be.
File.open('data.txt', 'w:UTF-8') do |f|
  f.puts(sio.read)
end
Or, you can use the File.open modes:
require 'stringio'
sio = StringIO.new("\c2\xb5")
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
File.open('data.txt', 'w:UTF-8:ASCII-8BIT') do |f|
  f.puts(sio.read)
end
But that assumes the data is UTF-8 text. If you actually have binary data, i.e. bytes that do not represent encoded text because they come from a .jpg file for instance, then that won't work.
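For the original PDF question, though, the simplest path is usually to skip encoding altogether and write the raw bytes in binary mode. (As an aside, ArgumentError: string contains null byte is normally raised when the null byte ends up in the file name argument, so it is worth checking the argument order of the File.write call.) A minimal sketch, assuming sio is the StringIO holding the PDF bytes:
require 'stringio'

sio = StringIO.new("%PDF-1.4\x00...")   # placeholder bytes standing in for the real PDF data

File.open('output.pdf', 'wb') do |f|    # 'wb' = binary mode, no encoding conversion or newline handling
  f.write(sio.read)                     # write, not puts, so no trailing newline is appended
end

# Or, equivalently, in one call:
# File.binwrite('output.pdf', sio.read)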

File Creation/Loop Problems in Ruby

EDIT: My original question was way off, my apologies. Mark Reed has helped me find out the real problem, so here it is.
Note that this code works:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
source_url = "www.flickr.com"
puts "Visiting #{source_url}"
page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
textarea = page.css('textarea')
filename = source_url.to_s + ".txt"
create_file = File.open("#{filename}", 'w')
create_file.puts textarea
create_file.close
Which is really awesome, but I need it to do this to ~110 URLs, not just Flickr. Here's my loop that isn't working:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
File.open('sources.txt').each_line do |source_url|
  puts "Visiting #{source_url}"
  page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
  textarea = page.css('textarea')
  filename = source_url.to_s + ".txt"
  create_file = File.open("#{filename}", 'w')
  create_file.puts "#{textarea}"
  create_file.close
end
What am I doing wrong with my loop?
OK, now you're looping over the lines of the input file. When you do that, you get strings that end in a newline. So you're trying to create a file with a newline in the middle of its name, which is not legal on Windows.
Just chomp the string:
File.open('sources.txt').each_line do |source_url|
  source_url.chomp!
  # ... rest of code goes here ...
end
You can also use File.foreach instead of File.open(...).each_line:
File.foreach('sources.txt') do |source_url|
  source_url.chomp!
  # ... rest of code goes here ...
end
You're putting your parentheses in the wrong place:
create_file = File.open(variable, 'w')
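Putting the pieces together, a minimal sketch of the corrected loop; http://website/script.php is kept as the placeholder endpoint from the question, and blank lines in sources.txt are assumed to be skippable:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

File.foreach('sources.txt') do |source_url|
  source_url.chomp!                # strip the trailing newline so the filename is legal
  next if source_url.empty?        # skip blank lines

  puts "Visiting #{source_url}"
  page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
  textarea = page.css('textarea')

  File.open("#{source_url}.txt", 'w') do |f|  # block form closes the file automatically
    f.puts textarea
  end
end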
