Ruby Invalid Byte Sequence in UTF-8 - ruby

I have the following code, which gives me an invalid byte sequence error pointing to the scan method in initialize. Any ideas on how to fix this? For what it's worth, the error does not occur when the (.*) between the h1 tag and the closing > is not there.
#!/usr/bin/env ruby
class NewsParser
def initialize
Dir.glob("./**/index.htm") do |file|
#file = IO.read file
parsed = #file.scan(/<h1(.*)>(.*?)<\/h1>(.*)<!-- InstanceEndEditable -->/im)
self.write(parsed)
end
end
def write output
#contents = output
open('output.txt', 'a') do |f|
f << #contents[0][0]+"\n\n"+#contents[0][1]+"\n\n\n\n"
end
end
end
p = NewsParser.new
Edit: Here is the error message:
news_parser.rb:10:in 'scan': invalid byte sequence in UTF-8 (ArgumentError)
SOLVED: The combination of using:
#file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
and
encoding: UTF-8
solve the issue.
Thanks!

The combination of using: #file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) and #encoding: UTF-8 solved the issue.

While this question already has an accepted answer, I found it while having the same problem with a different style of opening the file:
File.open(file_name).each_with_index do |line, index|
line.gsub!(/[{}]/, "'")
puts "#{index} #{line}"
end
I found that my input file was encoded in ISO-8859-1, so I changed it to the following to avoid the error:
File.open(file_name, 'r:ISO-8859-1:utf-8').each_with_index do |line, index|
line.gsub!(/[{}]/, "'")
puts "#{index} #{line}"
end
See the documentation for the optional mode argument of the File.open method for more details.

Related

Zlib data error while opening file in Ruby - works when a string is specified

I wrote this bit of code that reverses XOR-based encryption in Ruby. The chipertext is XORed with 'key' and the output is passed to Zlib.deflate.
require 'zlib'
def bin_to_hex(s)
s.unpack('H*').first
end
def hex(s)
s.scan(/../).map { |x| x.hex }.pack('c*')
end
chipertext = "Encrypted data"
key = "Some encryption key"
puts hex((bin_to_hex(Zlib::Inflate.inflate(code)).to_i(16) ^ ((bin_to_hex(key) * (bin_to_hex(Zlib::Inflate.inflate(chipertext)).length/bin_to_hex(key).length)) + bin_to_hex(key)[0, bin_to_hex(Zlib::Inflate.inflate(chipertext)).length%bin_to_hex(key).length]).to_i(16)).to_s(16))
The code runs perfectly when I specify chipertext as a string, in the example above. But when I use code like chipertext = File.open(ARGV[0], 'rb') { |f| f.read }, I get a inflate: incorrect header check (Zlib::DataError).
How can I prevent this from happening?

Missing parts after parsing and processing a very large XML file in Ruby

I have to parse and modify a 22.2MB XML file (a wordpress export).
The problem is after parsing, the last part of the file is always missing, but I can't really figure out why.
I've tried using the saxerator gem, but it does not seem to solve my problem
Here I'm just trying to get all the <item> from the input file and display them in an output file:
class SaxImport
def initialize input_file, output_file
f = File.read(input_file, File.size(input_file))
xml_data = Saxerator.parser(f) do |config|
config.output_type = :xml
end
category_fr_list = {}
items = []
output = File.open output_file, "w"
xml_data.for_tag(:item).reverse_each do |item|
output << item.to_xml
end
output.close
end
end
import_en = SaxImport.new 'weekly.xml', 'weekly.processed.xml'

How to encode csv file in Roo (Rails) : invalid byte sequence in UTF-8

I am trying to upload a csv file but getting invalid byte sequence in UTF-8 error. I am using 'roo' gem.
My code is like this :
def upload_results_csv file
spreadsheet = MyFileUtil.open_file(file)
header = spreadsheet.row(1) # THIS LINE RAISES THE ERROR
(2..spreadsheet.last_row).each do |i|
row = Hash[[header, spreadsheet.row(i)].transpose]
...
...
end
class MyFileUtil
def self.open_file(file)
case File.extname(file.original_filename)
when ".csv" then
Roo::Csv.new(file.path,csv_options: {encoding: Encoding::UTF_8})
when ".xls" then
Roo::Excel.new(file.path, nil, :ignore)
when ".xlsx" then
Roo::Excelx.new(file.path, nil, :ignore)
else
raise "Unknown file type: #{file.original_filename}"
end
end
end.
I don't know how to encode csv file. Please help!
Thanks
To safely convert a string to utf-8 you can do:
str.encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: '')
also see this blog post.
Since the roo gem will only take filenames as constructor argument, not plain IO objects, the only solution I can think of is to write a sanitized version to a tempfile and pass it to roo, along the lines of
require 'tempfile'
def upload_results_csv file
tmpfile = Tempfile.new(file.path)
tmpfile.write(File.read(file.path).encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: ''))
tmpfile.rewind
spreadsheet = MyFileUtil.open_file(tmpfile, file.original_filename)
header = spreadsheet.row(1) # THIS LINE RAISES THE ERROR
# ...
ensure
tmpfile.close
tmpfile.unlink
end
You need to alter MyFileUtil as well, because the original filename needs to be passed down:
class MyFileUtil
def self.open_file(file, original_filename)
case File.extname(original_filename)
when ".csv" then
Roo::Csv.new(file.path,csv_options: {encoding: Encoding::UTF_8})
when ".xls" then
Roo::Excel.new(file.path, nil, :ignore)
when ".xlsx" then
Roo::Excelx.new(file.path, nil, :ignore)
else
raise "Unknown file type: #{original_filename}"
end
end
end

Parse remote file with FasterCSV

I'm trying to parse the first 5 lines of a remote CSV file. However, when I do, it raises Errno::ENOENT exception, and says:
No such file or directory - [file contents] (with [file contents] being a dump of the CSV contents
Here's my code:
def preview
#csv = []
open('http://example.com/spreadsheet.csv') do |file|
CSV.foreach(file.read, :headers => true) do |row|
n += 1
#csv << row
if n == 5
return #csv
end
end
end
end
The above code is built from what I've seen others use on Stack Overflow, but I can't get it to work.
If I remove the read method from the file, it raises a TypeError exception, saying:
can't convert StringIO into String
Is there something I'm missing?
Foreach expects a filename. Try parse.each
You could manually pass each line to CSV for parsing:
require 'open-uri'
require 'csv'
def preview(file_url)
#csv = []
open(file_url).each_with_index do |line, i|
next if i == 0 #Ignore headers
#csv << CSV.parse(line)
if i == 5
return #csv
end
end
end
puts preview('http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv/contract.txt')

Strange number conversion while reading a csv file with ruby

i've got a strange problem in ruby on rails
There is a csv file, made with Excel 2003.
5437390264172534;Mark;5
I have a page with upload input and i read the file like this:
file = params[:upload]['datafile']
file.read.split("\n").each do |line|
num,name,type = line.split(";")
logger.debug "row: #{num} #{name} #{type}"
end
etc
So. finally i've got the following:
num = 5437...2534
name = Mark
type = 5
Why num has so strange value?
Also i tried to do like this:
str = file.read
csv = CSV.parse(str)
csv.each do |line|
RAILS_DEFAULT_LOGGER.info "######## #{line.to_yaml}"
end
but again i got
######## ---
- !str:CSV::Cell "5437...2534;Mark;5"
The csv file in win1251 (i can't change file encoding)
ruby file in UTF8
ruby version 1.8.4
rails version 2.0.2
If it indeed has a strange value, it probably has to to do with the code you didn't post. Edit your question, and include the smallest bit of code that will run independently and still produce your questionable output.
split() returns an array of strings. So the first value of your CSV file is a String, not a Bignum. Maybe you need num.to_i, or a test like num.is_a?(Bignum) somewhere in your code.
file = File.open("test.csv", "r")
# Just getting the first line
line = file.gets
num,name,type = line.split(";")
# split() returns an array of String
puts num.class
puts num
# Make num a number
puts num.to_i.class
puts num.to_i
file.close
Running that file here gives me this:
$ ruby test.rb
String
5437390264172534
Bignum
5437390264172534

Resources