Ruby: Remove invisible characters after converting string to UTF-8 - ruby

I am working with text from this website, which uses the windows-1252 charset. I converted the text to UTF-8 using force_encoding, but it still contains whitespace that I can't get rid of. The whitespace cannot be removed using text.gsub!(/\s/, ' ') or a similar technique.
The iconv gem doesn't do the trick either, as explained here. It is clear that the whitespace is a remnant of the original text and the windows-1252 charset, as I get an invalid multibyte char (US-ASCII) warning if I don't specify the encoding as UTF-8.
I'm not an expert in text encoding, so I may be overlooking something trivial.
Update: This is the script that I currently use.
#!/bin/env ruby
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
html = Nokogiri.HTML(open(URL))
# Extract Paragraphs
text = ''
html.css('p').each do |p|
text += p.text
end
# Clean Up Text
text.gsub!(/\s+/, ' ')
puts text
This is a sample of the text containing the invisible characters that I am trying to remove. The space before the number 16 is what I am referring to.
cobraron aliento para conversar con él.   16 Al punto corrió la voz, y
se divulgó generalmente esta noticia en el palacio del rey: Han

Without seeing your code, it's hard to know exactly what's going on for you. I'll point out, however, that String#force_encoding doesn't transcode the String; it's a way of saying, "No, really, this is UTF-8", for example. To transcode from one encoding to another, use String#encode.
This seems to work for me:
require 'net/http'
s = Net::HTTP.get('www.eximsystems.com', '/LaVerdad/Antiguo/Gn/Genesis.htm')
s.force_encoding('windows-1252')
s.encode!('utf-8')
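To make the difference between force_encoding and encode concrete, here is a minimal sketch on a four-byte string instead of the downloaded page. The sample bytes and variable names are made up for illustration; only the String methods themselves are standard.

```ruby
# Bytes as they might arrive from a windows-1252 page: "café" with é as 0xE9.
raw = "caf\xE9".b

# force_encoding only relabels the bytes; a lone 0xE9 is not valid UTF-8.
relabeled = raw.dup.force_encoding('UTF-8')
puts relabeled.valid_encoding?   # false

# Label the bytes with their real encoding first, then transcode.
transcoded = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
puts transcoded                  # café
puts transcoded.bytes.inspect    # [99, 97, 102, 195, 169]
```

The single byte 0xE9 becomes the two bytes 0xC3 0xA9 only when encode is called, which is the transcoding step that force_encoding never performs.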
In general, /[[:space:]]/ should capture more kinds of whitespace than /\s/ (which is equivalent to /[ \t\r\n\f]/), but it doesn't appear to be necessary in this case. I can't find any abnormal whitespace in s at this point. If you're still having problems, you'll need to post your code and a more precise description of the issue.
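As a small illustration of that difference, the sketch below runs both patterns over a UTF-8 string containing non-breaking spaces (U+00A0). The sample string is an assumption, loosely modeled on the text in the question.

```ruby
# Two non-breaking spaces (U+00A0) sit between "él." and "16".
sample = "con él.\u00A0\u00A016 Al punto"

ascii_only = sample.gsub(/\s+/, ' ')              # \s ignores U+00A0
unicode_aware = sample.gsub(/[[:space:]]+/, ' ')  # POSIX class matches it

puts ascii_only == sample   # true
puts unicode_aware          # con él. 16 Al punto
```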
Update: Thanks for updating your question with your code and an example of the problem. It looks like the issue is non-breaking spaces. I think it's simplest to get rid of them at the source:
require 'nokogiri'
require 'open-uri'
URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
s = open(URL).read # Separate these three lines to convert &nbsp;
s.gsub!('&nbsp;', ' ') # to normal ' ' in source rather than after
html = Nokogiri.HTML(s) # conversion to unicode non-breaking space
# Extract Paragraphs
text = ''
html.css('p').each do |p|
text += p.text
end
# Clean Up Text
text.gsub!(/\s+/, ' ')
puts text
There's now just a single, normal space between the period at the end of 15 and the number 16:
15) Besó también José a todos sus hermanos, orando sobre cada uno de ellos; después de cuyas demostraciones cobraron aliento para conversar con él. 16 Al punto corrió la voz, y se divulgó generalmente esta noticia en el palacio del rey: Han venido los hermanos de José; y holgóse de ello Faraón y toda su corte.

You can try text.strip to remove the whitespace, although it only trims leading and trailing whitespace.

Related

How to print unicode characters in Command Prompt with Ruby

I was wondering how to print unicode characters, such as Japanese or fun characters like 📦.
I can print hearts with:
hearts = "\u2665"
puts hearts.encode('utf-8')
How can I print more unicode characters with Ruby in Command Prompt?
My method works with some characters but not all.
Code examples would be greatly appreciated.
You need to enclose the code point in { and } if the number of hex digits isn't 4 (credit: /u/Stefan), e.g.:
heart = "\u2665"
package = "\u{1F4E6}"
fire_and_one_hundred = "\u{1F525 1F4AF}"
puts heart
puts package
puts fire_and_one_hundred
Alternatively, you could put the unicode character directly in your source. This is quite easy on macOS with the Emoji & Symbols menu, opened by Ctrl + Command + Space by default in most applications, most likely including your text editor or Ruby IDE (a similar menu can be opened on Windows 10 with Win + ;):
heart = "♥"
package = "📦"
fire_and_one_hundred = "🔥💯"
puts heart
puts package
puts fire_and_one_hundred
Output:
♥
📦
🔥💯
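If the code points arrive as integers rather than as literals in the source, the same strings can be built at run time. This is a minimal sketch using Integer#chr and Array#pack; the variable names mirror the ones above, and whether the glyphs actually render still depends on the console's code page and font.

```ruby
# Build the same characters from integer code points.
package = 0x1F4E6.chr(Encoding::UTF_8)
fire_and_one_hundred = [0x1F525, 0x1F4AF].pack('U*')

puts package
puts fire_and_one_hundred

# Going the other way: list the code points of a string.
puts package.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')   # U+1F4E6
```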

Printing list with Polish letters

I am writing a simple program for Windows using Python 2.7. It opens an email, takes some words from it and puts them in a form on the web. The problem starts when the email contains Polish letters like Ó, Ź, Ł etc. Whenever I try to print it I get something like: ['\xc4\x84', '\xc5\xbb', '\xc3\x93', '\xc4\x86', '\xc5\xb9'].
I already know it is because of encoding and that Python 3 has no such problem. Here is what I tried already:
mail = " Ą Ż Ó Ć Ź"
mail = mail.split()
mail = mail.decode("UTF-8")
print mail
or
mail = " Ą Ż Ó Ć Ź"
mail = mail.split()
[x.encode('UTF8') for x in mail]
print mail
Can anyone please show me how to make the list print properly?
Python 2.x assumes ASCII as the default source encoding. To declare that the source file is UTF-8, add this line to the top of your program.
# -*- coding: utf-8 -*-
Also, you should prefix any string literals with 'u', e.g.:
polishLetters = u'Ą Ż Ó Ć Ź'

can't write IP to text file without formatting issues

I'm having trouble reading an IP from a text file and properly writing it to another text file. It shows the written IP in the file as: "ÿþ1 9 2 . 1 6 8 . 1 1 0 . 4"
#Read the first line for the IP
def get_server_ip
File.open("d:\\ip_addr.txt") do |line|
a = line.readline()
b = a.to_s
end
end
#append the ip to file2
def append_ip
FileUtils.cp('file1.txt', 'file2.txt')
file_names = ['file2.txt']
file_names.each do |file_name|
text = File.read(file_name)
b = get_server_ip
new_contents = text.gsub('ip_here', b)
File.open(file_name, "w") {|file| file.puts new_contents }
end
end
I've tried .strip and .delete(' ') with no luck. Can anyone see the issue?
Thank you
The file was generated with Notepad on Windows. It is encoded as UTF-16LE.
The first two bytes in the file are 0xFF and 0xFE; this is the byte order mark (BOM) of UTF-16LE.
Each character is encoded in 2 bytes (16 bits), least significant byte first (little-endian order).
The spaces between the printable characters in the output are, in fact, NUL characters (characters with code 0).
What you can do (apart from converting the file to a more decent format like UTF-8 or even ISO-8859-1) is to pass 'rb:BOM|UTF-16LE' as the second argument of File.open.
r tells File.open to open the file in read-only mode (which it also does by default);
b means "binary mode"; it is required by BOM|UTF-16LE;
:BOM|UTF-16LE tells Ruby to read and ignore the BOM if it is present in the file and to expect the rest of the file to be encoded as UTF-16LE.
If you can, I recommend converting the file to UTF-8 or ISO-8859-1 using a decent editor (even Notepad can be used); then all these problems vanish.
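The mode string described above can be tried out without the original file. The sketch below is only an illustration: first_line_as_utf8 is a hypothetical helper, and the temporary file stands in for the Notepad file by writing a BOM followed by UTF-16LE text.

```ruby
require 'tempfile'

# Hypothetical helper: returns the first line of a UTF-16LE file as UTF-8.
def first_line_as_utf8(path)
  # 'BOM|UTF-16LE' skips the byte order mark if the file starts with one.
  content = File.read(path, mode: 'rb:BOM|UTF-16LE')
  content.encode('UTF-8').lines.first.to_s.strip
end

# Stand-in for the Notepad file: BOM (FF FE) followed by UTF-16LE text.
Tempfile.create('ip_addr') do |tmp|
  tmp.binmode
  tmp.write("\xFF\xFE".b + "192.168.110.4\r\n".encode('UTF-16LE').b)
  tmp.flush
  puts first_line_as_utf8(tmp.path).inspect   # "192.168.110.4"
end
```

Transcoding to UTF-8 right after reading means the later gsub and File.open calls in the question work on an ordinary string with no BOM or NUL bytes in it.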

How to decode IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/
def decode_ifc(str)
str.gsub(ESCAPE_SEQUENCE_EXPR) do
$1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
end
end
str = 'S\X2\00E9\X0\jour/Cuisine'
puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
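For the \X4 directive mentioned above, the same approach can be stretched a little. The following is a sketch under stated assumptions, not a full decoder: it assumes \X2\ payloads are groups of 4 hex digits and \X4\ payloads are groups of 8 hex digits, both closed by \X0\. The names X2_X4_EXPR and decode_ifc_x2_x4 are made up for this example.

```ruby
# Assumption: each group of hex digits is one code point, 4 digits wide after
# \X2\ and 8 digits wide after \X4\; both runs are terminated by \X0\.
X2_X4_EXPR = /\\X([24])\\(\h+?)\\X0\\/

def decode_ifc_x2_x4(str)
  str.gsub(X2_X4_EXPR) do
    width = Regexp.last_match(1) == '2' ? 4 : 8
    payload = Regexp.last_match(2)
    # Split the payload into fixed-width groups and turn each into a character.
    payload.scan(/\h{#{width}}/).map { |hex| hex.to_i(16) }.pack('U*')
  end
end

puts decode_ifc_x2_x4('S\X2\00E9\X0\jour/Cuisine')   # Séjour/Cuisine
puts decode_ifc_x2_x4('box \X4\0001F4E6\X0\ end')    # box 📦 end
```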
If anyone is interested, here is Python code I wrote that decodes three of the IFC encodings: \X\, \X2\ and \S\
import re
def decodeIfc(txt):
    # In regex "\" is hard to manage in Python... I use this workaround
    txt = txt.replace('\\', 'µµµ')
    txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
    txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
    txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
    txt = txt.replace('µµµ', '\\')
    return txt
def decodeIfcX2(match):
    # X2 encodes characters as a multiple of 4 hexadecimal digits.
    return ''.join(chr(int(x, 16)) for x in re.findall('([0-9A-F]{4})', match.group(1)))
def decodeIfcS(match):
    # \S\ sets the high bit of the following character.
    return chr(ord(match.group(1)) + 128)
def decodeIfcX(match):
    # Sometimes IFC files were made on an old Mac, which uses MacRoman encoding.
    num = int(match.group(1), 16)
    if num <= 127 or num >= 160:
        return chr(num)
    else:
        return bytes.fromhex(match.group(1)).decode("macroman")

Encoding issue with Sqlite3 in Ruby

I have a list of sql queries beautifully encoded in utf-8. I read them from files, perform the inserts and then do a select.
# encoding: utf-8
def exec_sql_lines(file_name)
puts "----> #{file_name} <----"
File.open(file_name, 'r') do |f|
# sometimes a query doesn't fit one line
previous_line=""
i = 0
while line = f.gets do
puts i+=1
if(line[-2] != ')')
previous_line += line[0..-2]
next
end
puts (previous_line + line) # <---- (1)
$db.execute((previous_line + line))
previous_line =""
end
a = $db.execute("select * from Table where _id=6")
puts a # <---- (2)
end
end
$db=SQLite3::Database.new($DBNAME)
exec_sql_lines("creates.txt")
exec_sql_lines("inserts.txt")
$db.close
The text in (1) is different from the text in (2). Polish letters get broken. If I use IRB and call $db.open ; $db.encoding it says UTF-8.
Why do Polish letters come out broken? How to fix it?
I need this database properly encoded in UTF-8 for my Android app, so I'm not interested in manipulating the database output. I need to fix its content.
EDIT
Significant lines from the output:
6
INSERT INTO 'Leki' VALUES (NULL, '6', 'Acenocoumarolum', 'Acenocumarol WZF', 'tabl. ', '4 mg', '60 tabl.', '5909990055715', '2012-01-01', '2 lata', '21.0, Leki przeciwzakrzepowe z grupy antagonistów witaminy K', '8.32', '12.07', '12.07', 'We wszystkich zarejestrowanych wskazaniach na dzień wydania decyzji', '', 'ryczałt', '5.12')
out:
6
6
Acenocoumarolum
Acenocumarol WZF
tabl.
4 mg
60 tabl.
5909990055715
2012-01-01
2 lata
21.0, Leki przeciwzakrzepowe z grupy antagonistĂł[<--HERE]w witaminy K
8.32
12.07
12.07
We wszystkich zarejestrowanych wskazaniach na dzieĹ[<--HERE] wydania decyzji
ryczaĹ[<--HERE]t
5.12
There are three default encodings.
In your code you set the source encoding.
Perhaps there is a problem with the external and internal encoding?
A quick test on Windows:
#encoding: utf-8
File.open(__FILE__,'r'){|f|
p f.external_encoding
p f.internal_encoding
p f.read.encoding
}
Result:
#<Encoding:CP850>
nil
#<Encoding:CP850>
Even though the source encoding is set to UTF-8, the data is read as CP850.
In your case:
Does File.open(filename,'r:utf-8') help?
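A minimal sketch of that suggestion follows, using a temporary file in place of inserts.txt. The helper name read_utf8_lines is hypothetical. The last line shows the kind of damage that appears when UTF-8 bytes are decoded with a single-byte code page; the sample output in the question looks consistent with Windows-1250, but that is an inference rather than something the question states.

```ruby
require 'tempfile'

# Hypothetical helper: read a query file, declaring its bytes to be UTF-8
# instead of relying on the platform's default external encoding.
def read_utf8_lines(path)
  File.open(path, 'r:UTF-8') do |f|
    f.each_line.map(&:chomp)
  end
end

Tempfile.create('inserts') do |tmp|
  tmp.binmode
  tmp.write("antagonistów\nryczałt\n".b)   # raw UTF-8 bytes on disk
  tmp.flush
  lines = read_utf8_lines(tmp.path)
  puts lines.inspect          # ["antagonistów", "ryczałt"]
  puts lines.first.encoding   # UTF-8
end

# UTF-8 bytes of "ó" misread as Windows-1250, then converted to UTF-8:
puts "ó".b.force_encoding('Windows-1250').encode('UTF-8')   # Ăł
```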
