Ruby CSV file encoded as UTF-8 but Windows Excel does not recognise it

In Ruby I am generating a CSV file, and when opening the file I specify UTF-8 encoding (code below). On Linux and Mac it works fine, but on Windows Excel does not recognise the file as UTF-8. What can I do so that Excel on Windows recognises it as UTF-8?
CSV.open(File.join(Rails.public_path,"/csv_uploads/#{csv_name}.csv"), "w:UTF-8")
I even manually encode the items in the file to UTF-8:
`result[2].encode('UTF-8')`

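When writing the file you need to set both the open mode and a BOM (byte order mark).
open_mode = "w+:UTF-16LE:UTF-8"
bom = "\xEF\xBB\xBF"
The open mode makes Ruby transcode everything written from UTF-8 to UTF-16LE. The bom string is U+FEFF spelled as UTF-8 bytes, so it is transcoded as well and ends up in the file as the UTF-16LE BOM (FF FE).
Write the BOM before the CSV content:
f.write bom
f.write(csv_file)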
Example I18n content
In Mac and Linux
Swedish : Förnamn
English : First name
In Windows
Swedish : Förnamn
English : First name
Example code:
def user_information_report(report_file_path, user_id)
  user = User.find(user_id)
  I18n.locale = user.current_lang
  open_mode = "w+:UTF-16LE:UTF-8"
  bom = "\xEF\xBB\xBF"
  # Pass the path along; body cannot see this method's arguments otherwise.
  body report_file_path, user, open_mode, bom
end

def headers
  [
    "ID", "SDN ID",
    I18n.t('sys_first_name'), I18n.t('sys_last_name'), I18n.t('sys_dob'),
    I18n.t('sys_gender'), I18n.t('sys_email'), I18n.t('sys_address'),
    I18n.t('sys_city'), I18n.t('sys_state'), I18n.t('sys_zip'),
    I18n.t('sys_phone_number')
  ]
end

def body(report_file_path, tenant, open_mode, bom)
  File.open(report_file_path, open_mode) do |f|
    # Build the tab-separated content in memory first
    csv_file = CSV.generate(col_sep: "\t") do |csv|
      csv << headers
      tenant.patients.find_each(batch_size: 10) do |patient|
        csv << [
          patient.id, patient.patientid,
          patient.first_name, patient.last_name, "#{patient.dob}",
          "#{translate_gender(patient.gender)}", patient.email, "#{patient.address_1.to_s} #{patient.address_2.to_s}",
          "#{patient.city}", "#{patient.state}", "#{patient.zip}",
          "#{patient.phone_number}"
        ]
      end
    end
    # BOM first, then the CSV content
    f.write bom
    f.write(csv_file)
  end
end
Windows and Mac
The file can be opened directly by double-clicking.
Linux (Ubuntu)
When opening the file, the application asks for separator options; choose “TAB”.
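To check what actually lands on disk with this open mode, here is a minimal sketch. The helper name write_excel_csv, the sample rows and the temp file name are made up for illustration; only the open mode and the BOM come from the answer above. It writes a small tab-separated file and prints the first two bytes, which should be the UTF-16LE BOM.

```ruby
require "csv"
require "tmpdir"

# Hypothetical helper: writes rows as tab-separated UTF-16LE with a leading BOM.
def write_excel_csv(path, rows)
  body = CSV.generate(col_sep: "\t") { |csv| rows.each { |row| csv << row } }
  # External encoding UTF-16LE, internal UTF-8: Ruby transcodes on write.
  File.open(path, "w+:UTF-16LE:UTF-8") do |f|
    f.write "\uFEFF" # U+FEFF, same character as "\xEF\xBB\xBF" in UTF-8
    f.write body
  end
end

# Assumed sample data and file name, for demonstration only
path = File.join(Dir.tmpdir, "excel_bom_demo.csv")
write_excel_csv(path, [["Förnamn", "Efternamn"], ["Åsa", "Öberg"]])

first_bytes = File.binread(path, 2).bytes
puts first_bytes.map { |b| format("%02X", b) }.join(" ") # prints "FF FE"
```

The internal encoding (the second `:UTF-8`) is kept on purpose: without it, Ruby rejects a readable mode such as "w+" for an ASCII-incompatible encoding like UTF-16LE.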

Related

Opening files from filepaths on windows

I am trying to define a function that takes a file path and returns the file's contents as a string.
This is the definition I came up with:
def get_book(file_path):
    '''Takes a file path and returns the entire book as a string.'''
    with open(file_path, 'r', 'utf-8') as infile:
        content = infile.read()
    return content

AnnaKarenina = get_book('../Python/Data/books/AnnaKarenina.txt')
I now get TypeError: an integer is required (got type str)
I also tried using the os.path, different kinds of slashes and other tricks for opening files with windows, but that all returns the error file not found.
Does anyone know what I am doing wrong?
The third positional parameter of open is buffering, which must be an integer; that is what the TypeError is complaining about. The encoding has to be passed as a keyword argument, like this:
def get_book(file_path):
    '''Takes a file path and returns the entire book as a string.'''
    with open(file_path, 'r', encoding='utf-8') as infile:
        content = infile.read()
    return content

AnnaKarenina = get_book('../Python/Data/books/AnnaKarenina.txt')

Python: Opening auto-generated file

As part of my larger program, I want to create a logfile with the current time & date as part of the title. I can create it as follows:
malwareLog = open(datetime.datetime.now().strftime("%Y%m%d - %H.%M " + pcName + " Malware scan log.txt"), "w+")
Now, my app is going to call a number of other functions, so I'll need to open the file, write some output to it and close the file, several times. It doesn't seem to work if I simply go:
malwareLog.open(malwareLog, "a+")
or similar. So how should I open a dynamically created txt file that I don't know the actual filename for...?
When you create the malwareLog object, it has a name attribute that contains the file name.
Here's an example (my test is your malwareLog):
import random
test = open(str(random.randint(0,999999))+".txt", "w+")
test.write("hello ")
test.close()
test = open(test.name, "a+")
test.write("world!")
test.close()
with open(test.name, "r") as f: print(f.read())
You can also store the file name in a variable before or after creating the file.
# Before
file_name = "123"
malwareLog = open(file_name, "w")
# After (the name must be a string; a bare integer would be treated as a file descriptor)
malwareLog = open(str(random.randint(0,999999))+".txt", "w")
file_name = malwareLog.name

Change the delimiter in multiple CSV files from same folder and write them into a new folder

I am a newbie Python programmer. I am trying to read multiple CSV files from a folder, replace the delimiter in all of them with a tab, and then write the files to a new folder. So far I am stuck at the beginning.
Here is the code I started with. It works for a single file, but I am not able to make it work with multiple files in the same folder.
import csv

print("\nWrite same CSV File with different string(Replace ',' with tab delimiter)")
with open('Names.csv','r') as csv_file:
    csv_reader = csv.reader(csv_file)
    with open('Names_new.csv', 'w') as new_file:
        csv_writer = csv.writer(new_file, delimiter = '\t', lineterminator='\r')
        for line in csv_reader:
            csv_writer.writerow(line)
Please can someone point out some tips?
Thank you in advance!
I don't think the code in your question does what you want. However, here's how to embed it in more code that will read the csv files from a specified folder for processing.
listdir takes input_folder and returns a list of all of the files in that folder.
I loop through that list and process only those files whose names end with '.csv'.
from os import listdir
import csv

input_folder = 'catalyst/'
for file_name in listdir(input_folder):
    if file_name.endswith('.csv'):
        print('---> processing input file: ', file_name)
        with open(input_folder + file_name,'r') as csv_file:
            csv_reader = csv.reader(csv_file)
            # Drop the 4-character '.csv' suffix before appending the new one
            out_file_name = file_name[:-4]+'_new.csv'
            print(' creating', out_file_name)
            with open(input_folder + out_file_name, 'w') as new_file:
                csv_writer = csv.writer(new_file, delimiter = '\t', lineterminator='\r')
                for line in csv_reader:
                    csv_writer.writerow(line)

Word Document.SaveAs ignores encoding, when calling through OLE, from Ruby or VBS

I have a script, VBS or Ruby, that saves a Word document as 'Filtered HTML', but the encoding parameter is ignored. The HTML file is always encoded in Windows-1252. I'm using Word 2007 SP3 on Windows 7 SP1.
Ruby Example:
require 'win32ole'
word = WIN32OLE.new('Word.Application')
word.visible = false
word_document = word.documents.open('C:\whatever.doc')
word_document.saveas({'FileName' => 'C:\whatever.html', 'FileFormat' => 10, 'Encoding' => 65001})
word_document.close()
word.quit
VBS Example:
Option Explicit
Dim MyWord
Dim MyDoc
Set MyWord = CreateObject("Word.Application")
MyWord.Visible = False
Set MyDoc = MyWord.Documents.Open("C:\whatever.doc")
MyDoc.SaveAs "C:\whatever2.html", 10, , , , , , , , , , 65001
MyDoc.Close
MyWord.Quit
Set MyDoc = Nothing
Set MyWord = Nothing
Documentation:
Document.SaveAs: http://msdn.microsoft.com/en-us/library/bb221597.aspx
msoEncoding values: http://msdn.microsoft.com/en-us/library/office/aa432511(v=office.12).aspx
Any suggestions, how to make Word save the HTML file in UTF-8?
Hi Bo Frederiksen and kardeiz,
I also encountered the problem of "Word Document.SaveAs ignores encoding" today in my "Word 2003 (11.8411.8202) SP3" version.
Luckily I managed to make msoEncodingUTF8(namely, 65001) work in VBA code. However, I have to change the Word document's settings first. Steps are:
1) From Word's 'Tools' menu, choose 'Options'.
2) Then click 'General'.
3) Press the 'Web Options' button.
4) In the 'Web Options' dialog that pops up, click 'Encoding'.
5) You will find a combobox where you can change the encoding, for example from 'GB2312' to 'Unicode (UTF-8)'.
6) Save the changes and try to rerun the VBA code.
I hope my answer can help you. Below is my code.
Public Sub convert2html()
    With ActiveDocument.WebOptions
        .Encoding = msoEncodingUTF8
    End With
    ActiveDocument.SaveAs FileName:=ActiveDocument.Path & "\" & "file_name.html", FileFormat:=wdFormatFilteredHTML, Encoding:=msoEncodingUTF8
End Sub
Word can't do this as far as I know.
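However, you could add the following lines to the end of your Ruby script. Naming the source encoding explicitly matters: if Ruby's default external encoding is already UTF-8, a bare encode('UTF-8') changes nothing.
text_as_utf8 = File.read('C:\whatever.html', encoding: 'Windows-1252').encode('UTF-8')
File.open('C:\whatever.html','wb') {|f| f.print text_as_utf8}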
If you have an older version of Ruby, you may need to use Iconv. If you have special characters in 'C:\whatever.html', you'll want to look into your invalid/undefined replacement options.
You'll also probably want to update the charset in the HTML meta tag:
text_as_utf8.gsub!('charset=windows-1252', 'charset=UTF-8')
before you write to the file.
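Put together, the re-encoding step might look like the sketch below. The helper name reencode_html_to_utf8, the demo file and its contents are assumptions for illustration, not part of the original script. It reads the file as Windows-1252, transcodes it to UTF-8 with replacement characters for bytes that have no mapping, updates the charset declaration and writes the result back.

```ruby
require "tmpdir"

# Hypothetical helper: re-encodes a Windows-1252 HTML file to UTF-8 in place.
def reencode_html_to_utf8(path)
  # Name the source encoding explicitly instead of relying on the platform default.
  text = File.read(path, encoding: "Windows-1252")
  # Bytes that Windows-1252 leaves undefined are replaced with "?" instead of raising.
  utf8 = text.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  utf8 = utf8.gsub(/charset=windows-1252/i, "charset=UTF-8")
  File.open(path, "wb") { |f| f.write utf8 } # binary mode: bytes are written unchanged
  utf8
end

# Assumed demo input: 0xE9 is "é" in Windows-1252
demo_path = File.join(Dir.tmpdir, "word_export_demo.html")
File.binwrite(demo_path, "<meta charset=windows-1252><p>caf\xE9</p>".b)

converted = reencode_html_to_utf8(demo_path)
puts converted
```

Replacing unmappable bytes with "?" is one choice among several; raising on them instead makes data loss visible at the cost of a failed conversion.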
My solution was to open the HTML file using the same character set that Word used to save it.
I also added a whitelist filter (Sanitize) to clean up the HTML. Further cleaning is done using Nokogiri, which Sanitize also relies on.
require 'sanitize'
# ... add some code converting a Word file to HTML.
# Post export cleanup.
# File.read closes the file when done; the mode transcodes Windows-1252 to UTF-8
html = '<!DOCTYPE html>' + File.read(html_file_name, mode: "r:windows-1252:utf-8")
html_document = Nokogiri::HTML::Document.parse(html)
Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
html_document.css('html').first['lang'] = 'en-US'
html_document.css('meta[name="Generator"]').first.remove()
# ... add more cleaning up of Word's HTML noise.
sanitized_html = html_document.to_html({:encoding => 'utf-8', :indent => 0})
# writing output to (new) file
sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
f.write sanitized_html
end
HTML Sanitizer: https://github.com/rgrove/sanitize/
HTML parser and modifier: http://nokogiri.org/
In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx
I haven't tested SaveAs2, since I don't have Word 2010.

Is there a way to remove the BOM from a UTF-8 encoded file?

Is there a way to remove the BOM from a UTF-8 encoded file?
I know that all of my JSON files are encoded in UTF-8, but the data entry person who edited the JSON files saved it as UTF-8 with the BOM.
When I run my Ruby scripts to parse the JSON, it is failing with an error.
I don't want to manually open 58+ JSON files and convert to UTF-8 without the BOM.
With Ruby >= 1.9.2 you can use the mode r:bom|utf-8.
This should work (I haven't tested it in combination with JSON):
json = nil #define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8"){|file|
json = JSON.parse(file.read)
}
It doesn't matter whether the file has a BOM or not.
Andrew remarked that File#rewind can't be used with the BOM mode, because rewinding goes back to position 0, in front of the BOM.
If you need a rewind function, you must remember the position after opening and replace rewind with pos=:
#Prepare test file
File.open('file.txt', "w:utf-8"){|f|
f << "\xEF\xBB\xBF" #add BOM
f << 'some content'
}
#Read file and skip BOM if available
File.open('file.txt', "r:bom|utf-8"){|f|
  pos = f.pos            # position just after the BOM (0 if there is none)
  p content = f.read     # read and print file content
  f.pos = pos            # f.rewind would go to pos 0, before the BOM
  p content = f.read     # (re)read and print file content
}
So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.
I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string
def read_json_file(file_name, index)
  # File.read opens, reads and closes the file in one call
  content = File.read("#{file_name}\\game.json").force_encoding("UTF-8")
  # Remove the BOM if present
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')
  json = JSON.parse(content)
  print json
end
You can also specify the encoding with the File.read and CSV.read methods; there you pass it as an option instead of as part of a read mode.
File.read(path, :encoding => 'bom|utf-8')
CSV.read(path, :encoding => 'bom|utf-8')
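For the original problem of many JSON files saved with a BOM, the same read mode can be used to rewrite the files once so that later code no longer has to care. The following is a minimal sketch under assumed names: strip_boms, the temporary directory and the sample JSON are hypothetical and only serve the demonstration.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper: rewrites every *.json file in dir without a leading BOM.
def strip_boms(dir)
  Dir.glob(File.join(dir, "*.json")).each do |path|
    text = File.read(path, mode: "r:bom|utf-8") # a BOM, if present, is skipped on read
    File.write(path, text, mode: "w:utf-8")     # written back without a BOM
  end
end

# Assumed demo data: one JSON file saved as UTF-8 with BOM
demo_dir = Dir.mktmpdir("bom_demo")
demo_file = File.join(demo_dir, "game.json")
File.binwrite(demo_file, "\xEF\xBB\xBF{\"name\":\"chess\"}")

strip_boms(demo_dir)
parsed = JSON.parse(File.read(demo_file, encoding: "utf-8"))
puts parsed["name"] # prints "chess"
```

Files that never had a BOM pass through unchanged, so the helper can be run over a mixed directory.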
The "bom|UTF-8" encoding works well if you only read the file once, but fails if you ever call File#rewind, as I was doing in my code. To address this, I did the following:
def ignore_bom
  return unless @file.pos == 0
  first = @file.getc
  # Push the character back unless it is the BOM; ungetc needs the character as its argument
  @file.ungetc(first) if first && first != "\xEF\xBB\xBF".force_encoding("UTF-8")
end
which seems to work well. Not sure if there are other similar type characters to look out for, but they could easily be built into this method that can be called any time you rewind or open.
Server side cleanup of utf-8 bom bytes that worked for me:
csv_text.gsub!("\xEF\xBB\xBF".force_encoding(Encoding::BINARY), '')

Resources