Is there a way to remove the BOM from a UTF-8 encoded file? - ruby

Is there a way to remove the BOM from a UTF-8 encoded file?
I know that all of my JSON files are encoded in UTF-8, but the data entry person who edited the JSON files saved it as UTF-8 with the BOM.
When I run my Ruby scripts to parse the JSON, it is failing with an error.
I don't want to manually open 58+ JSON files and convert to UTF-8 without the BOM.

With Ruby >= 1.9.2 you can use the mode r:bom|utf-8.
This should work (I haven't tested it in combination with JSON):
require 'json'
json = nil # define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8") { |file|
  json = JSON.parse(file.read)
}
It doesn't matter whether the file contains a BOM or not.
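One way to check that claim is to write the same JSON document twice, once with a BOM and once without, and parse both. This is only a minimal sketch: the helper name parse_json_file and the file names are made up for illustration.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: parse a JSON file, skipping a UTF-8 BOM if one is present.
def parse_json_file(path)
  File.open(path, 'r:bom|utf-8') { |file| JSON.parse(file.read) }
end

Dir.mktmpdir do |dir|
  with_bom    = File.join(dir, 'with_bom.json')
  without_bom = File.join(dir, 'without_bom.json')

  # Write the same document twice; the first copy starts with the BOM bytes.
  File.binwrite(with_bom, "\xEF\xBB\xBF".b + '{"name":"test"}')
  File.binwrite(without_bom, '{"name":"test"}')

  # Both calls should print the same parsed hash.
  p parse_json_file(with_bom)
  p parse_json_file(without_bom)
end
```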
Andrew remarked that File#rewind can't be used with the BOM mode, because it goes back to position 0, before the BOM.
If you need a rewind function, you must remember the position right after opening and replace rewind with pos=:
# Prepare test file
File.open('file.txt', "w:utf-8") { |f|
  f << "\xEF\xBB\xBF" # add BOM
  f << 'some content'
}
# Read file and skip BOM if available
File.open('file.txt', "r:bom|utf-8") { |f|
  pos = f.pos # position after the BOM (0 if there is none)
  p content = f.read # read and print file content
  f.pos = pos # f.rewind would go to pos 0
  p content = f.read # (re)read and print file content
}

So, the solution was to do a search and replace on the BOM via gsub!.
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.
I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string
def read_json_file(file_name, index)
  # File.read opens, reads and closes the file in one call
  content = File.read("#{file_name}\\game.json").force_encoding("UTF-8")
  # Remove the BOM bytes before handing the text to the JSON parser
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')
  json = JSON.parse(content)
  print json
end

You can also specify the encoding with the File.read and CSV.read methods; there you pass it as an option instead of as part of a read mode.
File.read(path, :encoding => 'bom|utf-8')
CSV.read(path, :encoding => 'bom|utf-8')
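For the original problem of many files saved with a BOM, the same option can be used to rewrite the files once instead of working around the BOM on every read. The following is a rough sketch, not a tested migration script: the helper name strip_bom_in_place and the glob pattern in the usage note are assumptions, and it overwrites files in place, so try it on a copy first.

```ruby
# Hypothetical helper: rewrite every matching file without its UTF-8 BOM.
# Returns the paths of the files that were actually changed.
def strip_bom_in_place(glob_pattern)
  Dir.glob(glob_pattern).select do |path|
    # Look at the raw bytes first so files without a BOM are left untouched.
    next false unless File.binread(path).start_with?("\xEF\xBB\xBF".b)

    # Reading with bom|utf-8 drops the BOM; write the remaining bytes back.
    content = File.read(path, encoding: 'bom|utf-8')
    File.binwrite(path, content)
    true
  end
end
```

A call such as strip_bom_in_place('data/**/*.json') (the directory layout is an assumption) would then process all JSON files in one go.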

The "bom|UTF-8" encoding works well if you only read the file once, but fails if you ever call File#rewind, as I was doing in my code. To address this, I did the following:
def ignore_bom
  return unless @file.pos == 0
  char = @file.getc
  # Put the character back unless it is the BOM (ungetc needs the character as its argument)
  @file.ungetc(char) if char && char != "\xEF\xBB\xBF".force_encoding("UTF-8")
end
which seems to work well. Not sure if there are other similar type characters to look out for, but they could easily be built into this method that can be called any time you rewind or open.
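The same idea can be written as a standalone helper that takes the IO object as an argument. This is a sketch under the assumption that the file is UTF-8 and read without transcoding; the name rewind_skipping_bom is made up.

```ruby
require 'tmpdir'

UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: go back to the start of the file, then step over
# the BOM if the file begins with one.
def rewind_skipping_bom(io)
  io.rewind
  first_bytes = io.read(3) # read(n) returns raw bytes, or nil for an empty file
  io.rewind unless first_bytes == UTF8_BOM
  io
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'file.txt')
  File.binwrite(path, UTF8_BOM + 'some content')

  File.open(path, 'r:bom|utf-8') do |f|
    p f.read                 # first pass, BOM skipped by the open mode
    rewind_skipping_bom(f)
    p f.read                 # second pass, BOM skipped by the helper
  end
end
```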

Server-side cleanup of UTF-8 BOM bytes that worked for me:
csv_text.gsub!("\xEF\xBB\xBF".force_encoding(Encoding::BINARY), '')
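A variant that removes only a BOM at the very start of the text, and treats the data as raw bytes while doing so, could look like this. It is a sketch: the helper name strip_leading_bom is an assumption, and it assumes the input is meant to be UTF-8.

```ruby
UTF8_BOM_BYTES = "\xEF\xBB\xBF".b

# Hypothetical helper: return a UTF-8 copy of `text` without a leading BOM.
def strip_leading_bom(text)
  bytes = text.b # binary copy, the original string is left untouched
  bytes = bytes.byteslice(3..-1) if bytes.start_with?(UTF8_BOM_BYTES)
  bytes.force_encoding(Encoding::UTF_8)
end
```

Working on a binary copy means the comparison is done byte by byte, whatever encoding the incoming string was tagged with.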

Related

Opening files from filepaths on windows

I am trying to define a function that takes a file path and returns the file's contents as a string.
This is the definition I came up with:
def get_book(file_path):
    '''Takes a file path and returns the entire book as a string.'''
    with open(file_path, 'r', 'utf-8') as infile:
        content = infile.read()
    return content
AnnaKarenina = get_book('../Python/Data/books/AnnaKarenina.txt')
I now get TypeError: an integer is required (got type str).
I also tried using the os.path, different kinds of slashes and other tricks for opening files with windows, but that all returns the error file not found.
Does anyone know what I am doing wrong?
The encoding parameter of the open function has to be passed by name. The third positional argument of open is buffering, which expects an integer, hence the TypeError. Specify the encoding like this:
def get_book(file_path):
    '''Takes a file path and returns the entire book as a string.'''
    with open(file_path, 'r', encoding='utf-8') as infile:
        content = infile.read()
    return content
AnnaKarenina = get_book('../Python/Data/books/AnnaKarenina.txt')

ruby csv file encoded to utf-8 but windows excel not recognising

In Ruby I am generating a CSV file, and when opening the file I specify UTF-8 encoding; the code is given below. On Linux and Mac it works fine, but on Windows, when I open the CSV file, Excel does not recognize it as UTF-8. What can I do so that Excel on Windows recognizes the encoding?
CSV.open(File.join(Rails.public_path,"/csv_uploads/#{csv_name}.csv"), "w:UTF-8")
I even manually encode the items in the file to UTF-8:
result[2].encode('UTF-8')
When writing the file, the two important things are the open mode and the BOM:
open_mode = "w+:UTF-16LE:UTF-8"
bom = "\xEF\xBB\xBF"
With this mode the file is stored as UTF-16LE, so the BOM string (the character U+FEFF) ends up on disk as the bytes FF FE.
Insert the BOM before writing the CSV:
f.write bom
f.write(csv_file)
Example I18n content
In Mac and Linux
Swedish : Förnamn
English : First name
In Windows
Swedish : Förnamn
English : First name
Example code:
def user_information_report(report_file_path, user_id)
  user = User.find(user_id)
  I18n.locale = user.current_lang
  open_mode = "w+:UTF-16LE:UTF-8"
  bom = "\xEF\xBB\xBF"
  # report_file_path is passed along because body cannot see this method's local variables
  body user, report_file_path, open_mode, bom
end
def headers
  [
    "ID", "SDN ID",
    I18n.t('sys_first_name'), I18n.t('sys_last_name'), I18n.t('sys_dob'),
    I18n.t('sys_gender'), I18n.t('sys_email'), I18n.t('sys_address'),
    I18n.t('sys_city'), I18n.t('sys_state'), I18n.t('sys_zip'),
    I18n.t('sys_phone_number')
  ]
end
def body(tenant, report_file_path, open_mode, bom)
  File.open(report_file_path, open_mode) do |f|
    # Build the tab-separated text in memory first
    csv_file = CSV.generate(col_sep: "\t") do |csv|
      csv << headers
      tenant.patients.find_each(batch_size: 10) do |patient|
        csv << [
          patient.id, patient.patientid,
          patient.first_name, patient.last_name, "#{patient.dob}",
          "#{translate_gender(patient.gender)}", patient.email, "#{patient.address_1.to_s} #{patient.address_2.to_s}",
          "#{patient.city}", "#{patient.state}", "#{patient.zip}",
          "#{patient.phone_number}"
        ]
      end
    end
    # Write the BOM first, then the CSV content
    f.write bom
    f.write(csv_file)
  end
end
Windows and Mac
The file can be opened directly by double-clicking it.
Linux (Ubuntu)
When opening the file, you are asked for the separator options; choose “TAB”.
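The effect of the open mode on the BOM can be checked without Rails or the CSV library. This is a minimal sketch: the helper name write_utf16le_with_bom and the sample text are made up, and it only inspects the first two bytes of the resulting file.

```ruby
require 'tmpdir'

# Hypothetical helper: write already-generated tab-separated text so that the
# file is UTF-16LE on disk and starts with a byte order mark.
def write_utf16le_with_bom(path, text)
  File.open(path, 'w+:UTF-16LE:UTF-8') do |f|
    f.write("\uFEFF") # same string as "\xEF\xBB\xBF", transcoded on write
    f.write(text)
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'report.csv')
  write_utf16le_with_bom(path, "Förnamn\tFirst name")

  # Print the first two bytes of the file in hex to inspect the BOM on disk.
  puts File.binread(path).byteslice(0, 2).unpack1('H*')
end
```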

Ruby Simple Read/Write File (Copy File)

I am practicing Ruby, and I am trying to copy the contents of file "from" to file "to". Can you tell me what I did wrong?
Thanks!
from = "1.txt"
to = "2.txt"
data = open(from).read
out = open(to, 'w')
out.write(data)
out.close
data.close
Maybe I am missing the point, but I think writing it like so is more 'ruby':
from = "1.txt"
to = "2.txt"
# File.read and File.write open and close the files for you
contents = File.read(from)
File.write(to, contents)
Personally, however, I like to use the operating system's terminal for file operations. Here is an example on Linux:
from = "1.txt"
to = "2.txt"
system("cp #{from} #{to}")
And for Windows I believe you would use:
from = "1.txt"
to = "2.txt"
system("copy #{from} #{to}")
Finally, if you need the output of the command for logging or some other reason, I would use backticks.
#A nice one liner
`cp 1.txt 2.txt`
Here is the documentation for the system and backtick methods:
http://ruby-doc.org/core-1.9.3/Kernel.html
You can't call data.close: data.class would show you that you have a String, and .close is not a valid String method. By opening from the way you chose to, you lost the File reference after using it for your read. One way to fix that would be:
from = "1.txt"
to = "2.txt"
infile = open(from) # Retain the File reference
data = infile.read # Use it to do the read
out = open(to, 'w')
out.write(data)
out.close
infile.close # And finally, close it
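Two other ways to avoid the forgotten close are the block form of File.open, which closes the file when the block ends, and FileUtils.cp from the standard library. The method names copy_with_blocks and copy_with_fileutils below are made up for this sketch.

```ruby
require 'fileutils'

# Block form: both files are closed automatically when the blocks finish.
def copy_with_blocks(from, to)
  File.open(from, 'rb') do |infile|
    File.open(to, 'wb') do |out|
      out.write(infile.read)
    end
  end
end

# Library call: FileUtils.cp copies the file without a manual read and write.
def copy_with_fileutils(from, to)
  FileUtils.cp(from, to)
end
```

Either one could be called as copy_with_blocks("1.txt", "2.txt"), using the file names from the question.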

Create in-memory only gzip

I'm trying to gzip a file in ruby without having to write it to disk first. Currently I only know how to make it work by using Zlib::GzipWriter, but I'm really hoping that I can avoid that and keep it in-memory only.
I've tried this, with no success:
def self.make_gzip(data)
gz = Zlib::GzipWriter.new(StringIO.new)
gz << data
string = gz.close.string
StringIO.new(string, 'rb').read
end
Here is what happens when I test it out:
# Files
normal = File.new('chunk0.nbt')
gzipped = File.new('chunk0.nbt.gz')
# Try to create gzip in program
make_gzip normal
=> "\u001F\x8B\b\u0000\x8AJhS\u0000\u0003S\xB6q\xCB\xCCI\xB52\xA8000OK1L\xB2441J5\xB5\xB0\u0003\u0000\u0000\xB9\x91\xDD\u0018\u0000\u0000\u0000"
# Read from a gzip created with the gzip command
reader = Zlib::GzipReader.open gzipped
reader.read
"\u001F\x8B\b\u0000\u0000\u0000\u0000\u0000\u0000\u0000\xED]\xDBn\xDC\xC8\u0011%\x97N\xB82<\x9E\x89\xFF!\xFF!\xC9\xD6dFp\x80\u0005\xB2y\r\"\xEC\n\x89\xB0\xC6\xDAX+A./\xF94\xBF\u0006\xF1\x83>`\u0005\xCC\u000F\xC4\xF0\u000F.............(for 10,000 columns)
You're actually gzipping normal.to_s (which is something like "#<File:0x007f53c9b55b48>") in the following code.
# Files
normal = File.new('chunk0.nbt')
# Try to create gzip in program
make_gzip normal
You should read the content of the file, and make_gzip on the content:
make_gzip normal.read
As I commented, make_gzip can be simplified:
def self.make_gzip(data)
gz = Zlib::GzipWriter.new(StringIO.new)
gz << data
gz.close.string
end
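To confirm that the in-memory result is valid gzip data, it can be decompressed again, also in memory. The sketch below is written as plain top-level methods rather than class methods, uses an explicitly binary buffer, and adds a read_gzip helper whose name is an assumption.

```ruby
require 'zlib'
require 'stringio'

# Compress a string into gzip format without touching the disk.
def make_gzip(data)
  buffer = StringIO.new(String.new) # String.new without arguments is binary
  gz = Zlib::GzipWriter.new(buffer)
  gz << data
  gz.close.string # close finishes the stream and returns the StringIO
end

# Hypothetical counterpart: decompress gzip data held in a string.
def read_gzip(gzipped)
  reader = Zlib::GzipReader.new(StringIO.new(gzipped))
  data = reader.read
  reader.close
  data
end

original = 'some file content ' * 10
puts read_gzip(make_gzip(original)) == original
```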

Zlib inflate error

I am trying to save compressed strings to a file and load them later for use in the game. I kept getting "in 'finish': buffer error" errors when loading the data back up for use. I came up with this:
require "zlib"
def deflate(string)
zipper = Zlib::Deflate.new
data = zipper.deflate(string, Zlib::FINISH)
end
def inflate(string)
zstream = Zlib::Inflate.new
buf = zstream.inflate(string)
zstream.finish
zstream.close
buf
end
setting = ["nothing","nada","nope"]
taggedskills = ["nothing","nada","nope","nuhuh"]
File.open('testzip.txt','wb') do |w|
w.write(deflate("hello world")+"\n")
w.write(deflate("goodbye world")+"\n")
w.write(deflate("etc")+"\n")
w.write(deflate("etc")+"\n")
w.write(deflate("Setting: name "+setting[0]+" set"+(setting[1].class == String ? "str" : "num")+" "+setting[1].to_s)+"\n")
w.write(deflate("Taggedskill: "+taggedskills[0]+" "+taggedskills[1]+" "+taggedskills[2]+" "+taggedskills[3])+"\n")
w.write(deflate("etc")+"\n")
end
File.open('testzip.txt','rb') do |file|
file.each do |line|
p inflate(line)
end
end
It was throwing errors at the "Taggedskill:" point. I don't know what it is, but trying to change it to "Skilltag:", "Skillt:", etc. continues to throw a buffer error, while things like "Setting:" or "Thing:" work fine, while changing the setting line to "Taggedskill:" continues to work fine. What is going on here?
In testzip.txt, you are storing newline separated binary blobs. However, binary blobs may contain newlines by themselves, so when you open testzip.txt and split it by line, you may end up splitting one binary blob that inflate would understand, into two binary blobs that it does not understand.
Try running wc -l testzip.txt after you get the error. You'll see that the file contains one more line than the number of lines you put in.
What you need to do is compress the whole file at once, not line by line.
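Compressing everything at once could look like the following sketch. The helper names save_records and load_records are made up, and it assumes that no record contains a newline itself, since the newline is used as the separator inside the uncompressed text.

```ruby
require 'zlib'

# Hypothetical helper: join the records, compress them as a single blob,
# and write that blob in binary mode.
def save_records(path, records)
  File.binwrite(path, Zlib::Deflate.deflate(records.join("\n")))
end

# Hypothetical helper: read the whole blob, inflate it once, then split it
# back into records.
def load_records(path)
  Zlib::Inflate.inflate(File.binread(path)).split("\n")
end
```

Because the file is inflated as one unit, a newline byte inside the compressed data no longer matters; only the newlines in the decompressed text are used for splitting.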
