Generate UTF-8 character list

I have a UTF-8 file which I convert to ISO-8859-1 before sending it to a consuming system that does not understand UTF-8. Our current issue is that when we run the iconv process on the UTF-8 file, some characters are converted to '?'. Currently, we have been providing a fix for every failing character.
I am trying to understand whether it is possible to create a file which contains all possible UTF-8 characters. The intent is to downgrade them using iconv and identify the characters that get replaced with '?'.

Rather than looking at every possible Unicode character (over 140k of them), I recommend performing an iconv substitution and then seeing where your actual problems are. For example:
iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="<U+%04X>"
This will convert characters that aren't in ISO-8859-1 to a "<U+####>" syntax. You can then search your output for these.
If your data will be read by something that handles C-style escapes (\u####), you can also use:
iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="\\u%04x"
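If you would rather do the same check from within Ruby, String#encode takes a :fallback option that serves the same purpose. Here is a minimal sketch (the sample text is made up; the em dash has no ISO-8859-1 equivalent):
text = "café — naïve"
marked = text.encode("ISO-8859-1",
                     fallback: ->(char) { "<U+%04X>" % char.ord })
puts marked.encode("UTF-8")   # => café <U+2014> naïve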

An exhaustive list of all Unicode characters seems rather impractical for this use case. There are tens of thousands of characters in non-Latin scripts which don't have any obvious near-equivalent in Latin-1.
Instead, probably look for a mapping from Latin characters which are not in Latin-1 to corresponding homographs or near-equivalents.
Some programming languages have existing libraries for this; a common and simple transformation is to attempt to strip any accents from characters which cannot be represented in Latin-1, and use the unaccented variant if this works. (You'll want to keep the accent for any character which can be normalized to Latin-1, though. Maybe also read about Unicode normalization.)
Here's a quick and dirty Python attempt.
from unicodedata import normalize
import sys

def latinize(string):
    """
    Map string to Latin-1, replacing characters which can be approximated
    """
    result = []
    for char in string:
        try:
            # keep the character if its composed (NFKC) form fits in Latin-1
            byte = normalize("NFKC", char).encode('latin-1')
        except UnicodeEncodeError:
            # otherwise decompose it (NFKD) and drop anything that isn't plain ASCII
            byte = normalize("NFKD", char).encode('ascii', 'ignore')
        result.append(byte)
    return b''.join(result)

def convert(fh):
    for line in fh:
        # write the raw Latin-1 bytes; print() would emit the bytes repr instead
        sys.stdout.buffer.write(latinize(line))

def main():
    if len(sys.argv) > 1:
        for filename in sys.argv[1:]:
            with open(filename, 'r', encoding='utf-8') as fh:
                convert(fh)
    else:
        convert(sys.stdin)

if __name__ == '__main__':
    main()
Demo: https://ideone.com/sOEBW9

Related

Text file encoding properly as utf-8 in Scite editor but fails to encode to utf-8 in ruby

I have a text file which, if viewed in the Scite editor with the encoding set to utf-8, displays all text correctly, including capital letters with an acute accent (i.e. Á).
However, if I write a ruby script and use mystring.encode("utf-8") it will give me this error on capital letters that carry an acute accent (i.e. Á):
encode': "\x81" to UTF-8 in conversion from Windows-1252 to UTF-8 (Encoding::UndefinedConversionError)
Is this expected behaviour? How can I encode the whole text to utf-8 using ruby, knowing that otherwise it does get successfully encoded in the Scite editor?
Code:
ine_file = File.open("../../_data/ine_spain_demographics.csv", 'r')
ine_towns_population_hash = Hash.new
ine_file.each do |line|
  values = line.split(";")
  town_name = values[3]
  population = values[4]
  begin
    ine_towns_population_hash[town_name.encode("utf-8")] = population
  rescue
    puts "problematic string: " + town_name
  end
end
You say that ine_file.external_encoding says Windows-1252, so the file is being opened as a Windows-1252 encoded file. Then you call town_name.encode("utf-8") in an attempt to encode the string as UTF-8, and Ruby complains. But the file is actually UTF-8; reading UTF-8 bytes as Windows-1252 and then trying to recode those bytes as UTF-8 isn't going to work.
You need to open the file in UTF-8 mode:
File.open("../../_data/ine_spain_demographics.csv", 'r:UTF-8')
and stop trying to change the encoding of town_name, just use town_name as-is.
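For reference, a minimal sketch of the corrected loop, assuming the CSV really is UTF-8 encoded:
ine_towns_population_hash = Hash.new
File.open("../../_data/ine_spain_demographics.csv", 'r:UTF-8') do |ine_file|
  ine_file.each do |line|
    values = line.split(";")
    # the string is already tagged UTF-8, so no re-encoding is needed
    ine_towns_population_hash[values[3]] = values[4]
  end
end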
It seems like it's misinterpreting the encoding of ine_spain_demographics.csv.
Looking at the docs for encode and open you have two options:
Tell encode to replace characters it can't convert instead of raising: town_name.encode("utf-8", invalid: :replace, undef: :replace, replace: '') (the :replace string only takes effect when :invalid or :undef is set to :replace).
Identify the correct file encoding and specify it when opening: File.open("../../_data/ine_spain_demographics.csv", 'r:ISO-8859-1')

How can I use Net::Http to download a file with UTF-8 characters in it?

I have an application where users can upload text-based files (xml, csv, txt) that are persisted to S3. Some of these files are pretty big. There are a variety of operations that need to be performed on the data in these files, so rather than read them from S3 and have it time out occasionally I download the files locally, then turn the operations loose on them.
Here's the code I use to download the file from S3. Upload is the name of the AR model I use to store this information. This method is an instance method on the Upload model:
def download
  basename = File.basename(self.text_file_name.path)
  filename = Rails.root.join(basename)
  host = MyFitment::Utility.get_host_without_www(self.text_file_name.url)
  Net::HTTP.start(host) do |http|
    f = open(filename)
    begin
      http.request_get(self.text_file_name.url) do |resp|
        resp.read_body do |segment|
          f.write(segment) # Fails when non-ASCII 8-bit characters are included.
        end
      end
    ensure
      f.close()
    end
  end
  filename
end
So you see the line above where the write fails. This code somehow thinks all files that are downloaded are encoded in ASCII 8-bit. How can I:
1) Check the encoding of a remote file like that
2) Download it and write it successfully.
Here's the error that is happening with a particular file right now:
Encoding::UndefinedConversionError: "\x95" from ASCII-8BIT to UTF-8
from /Users/me/code/myapp/app/models/upload.rb:47:in `write'
Thank you for any help you can offer!
How can I: 1) Check the encoding of a remote file like that.
You can check the Content-Type header of the response, which, if present, may look something like this:
Content-Type: text/plain; charset=utf-8
As you can see, the encoding is specified there. If there's no Content-Type header, or if the charset is not specified, or if the charset is specified incorrectly, then you can't know the encoding of the text. There are gems that try to guess the encoding (listed here roughly in order of increasing accuracy), e.g. rchardet, charlock_holmes, but for complete accuracy you have to know the encoding before reading the text.
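For example, a minimal sketch of pulling the declared charset out of a Net::HTTP response (charset_of is just an illustrative helper name; it returns nil when no charset is declared):
def charset_of(resp)
  # resp is a Net::HTTPResponse; the header may be missing or lack a charset
  resp['Content-Type'].to_s[/charset=([^\s;]+)/i, 1]
end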
This code somehow thinks all files that are downloaded are encoded in
ASCII 8-bit.
In ruby, ASCII-8BIT is equivalent to binary, which means the Net::HTTP library just gives you a string containing a series of single bytes, and it's up to you to decide how to interpret those bytes.
If you want to interpret those bytes as UTF-8, then you do that with String#force_encoding():
text = text.force_encoding("UTF-8")
You might want to do that if, for instance, you want to do some regex matching on the string, and you want to match full characters (which might be multi-byte) rather than just single bytes.
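A tiny illustration (the string here is made up) of how the encoding tag changes what a regexp match sees:
bytes = "caf\xC3\xA9".dup.force_encoding("ASCII-8BIT")
# bytes =~ /é/   # would raise Encoding::CompatibilityError (UTF-8 regexp vs. binary string)
utf8 = bytes.force_encoding("UTF-8")
utf8 =~ /é/      # => 3, matches the whole two-byte character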
Encoding::UndefinedConversionError: "\x95" from ASCII-8BIT to UTF-8
Using String#encode('UTF-8') to convert ASCII-8BIT to UTF-8 doesn't work for bytes whose ascii codes are greater than 127:
(0..255).each do |ascii_code|
  str = ascii_code.chr("ASCII-8BIT")
  #puts str.encoding #=> ASCII-8BIT
  begin
    str.encode("UTF-8")
  rescue Encoding::UndefinedConversionError
    puts "Can't encode char with ascii code #{ascii_code} to UTF-8."
  end
end
--output:--
Can't encode char with ascii code 128 to UTF-8.
Can't encode char with ascii code 129 to UTF-8.
Can't encode char with ascii code 130 to UTF-8.
...
...
Can't encode char with ascii code 253 to UTF-8.
Can't encode char with ascii code 254 to UTF-8.
Can't encode char with ascii code 255 to UTF-8.
Ruby just reads one byte at a time from the ASCII-8BIT string and tries to convert the character in the byte to UTF-8. So, while 128 may be a legal byte in UTF-8 when part of a multi-byte character sequence, 128 is not a legal UTF-8 character as a single byte.
As for writing the strings to a file, instead of this:
f = open(filename)
...if you want to output UTF-8 to the file, you would write:
f = open(filename, "w:UTF-8")
By default, ruby uses whatever the value of Encoding.default_external is to encode output to a file. The default_external encoding is pulled from your system's environment, or you can set it explicitly.
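For instance (the exact value depends on your environment):
puts Encoding.default_external      # e.g. UTF-8
# it can also be set explicitly for the whole process:
# Encoding.default_external = Encoding::UTF_8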

`gsub': incompatible character encodings: UTF-8 and IBM437

I tried to use search and Google, but with no luck.
OS: Windows XP
Ruby version 1.9.3p0
Error:
`gsub': incompatible character encodings: UTF-8 and IBM437
Code:
require 'rubygems'
require 'hpricot'
require 'net/http'

source = Net::HTTP.get('host', '/' + ARGV[0] + '.asp')
doc = Hpricot(source)
doc.search("p.MsoNormal/a").each do |a|
  puts a.to_plain_text
end
The program outputs a few strings, but when the text is "NOŻYCE" I get the error above.
Could somebody help?
You could try converting your HTML to UTF-8 since it appears the original is in vintage-retro DOS format:
source.encode!('UTF-8')
That should flip it from 8-bit ASCII to UTF-8 as expected by the Hpricot parser.
The inner encoding of the source variable is UTF-8 but that is not what you want.
As tadman wrote, you must first tell Ruby that the actual characters in the string are in the IBM437 encoding. Then you can convert that string to your favourite encoding, but only if such a conversion is possible.
source.force_encoding('IBM437').encode('UTF-8')
In your case, you cannot convert your string to ISO-8859-2 because not all IBM437 characters can be converted to that charset. Sticking to UTF-8 is probably your best option.
Anyway, are you sure that the file is actually transmitted in IBM437? Maybe it is stored as such on the HTTP server but it is sent over the wire with another encoding. Or it may not even be exactly IBM437; it may be CP852, also called MS-DOS Latin 2 (which is different from ISO Latin 2).
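Putting the two answers together, a minimal sketch (assuming the page really is served as IBM437; try 'CP852' if the Polish characters still come out wrong):
source = Net::HTTP.get('host', '/' + ARGV[0] + '.asp')
# tag the bytes with their real encoding, then transcode to UTF-8 for Hpricot
source = source.force_encoding('IBM437').encode('UTF-8')
doc = Hpricot(source)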

Why do I get an "Invalid Byte Sequence in UTF-8" error reading a text file?

I'm writing a Ruby script to process a large text file, and keep getting an odd encoding error.
Here's the situation:
input_data = File.new(in_path, 'r').read
p input_data.encoding.name # UTF-8
break_char = "\r".encode("UTF-8")
p break_char # "\r"
p break_char.encoding.name # "UTF-8"
input_data.split(",".encode("UTF-8"))
p Encoding.compatible?(input_data, break_char) # #<Encoding:UTF-8>
This produces the error in `split': invalid byte sequence in UTF-8 (ArgumentError).
I read http://blog.grayproductions.net/articles/ruby_19s_string and looked at other solutions to apparently the same problem, but still can't work out why it's happening when I believe I am controlling the encodings.
I'm on OSX working with ruby 1.9.2
Obviously your input file is not UTF-8 (or at least, not entirely). If you don't care about non-ASCII characters, you can simply assume your file is ASCII-8BIT encoded. By the way, your separators (break_char and the comma passed to split) are not causing problems, as both characters are encoded the same way in UTF-8 as in ASCII.
fname = 'test.in'

# create example file and fill it with invalid UTF-8 sequence
File.open(fname, 'w') do |f|
  f.write "\xc3\x28"
end

# then try to read and parse it
s = File.open(fname) do |f|                    # file opened as UTF-8
#s = File.open(fname, 'r:ascii-8bit') do |f|   # file opened as ascii-8bit
  f.read
end

p s.split ','
I fail to get an error here on Linux even when the input file is not UTF-8. (I'm using Ruby 1.9.2, as well.)
Logically, either this problem is linked with OS-X, or it's something to do with your input data. Does it happen regardless of the data in the input file?
(I realise that this is not a proper answer, but I lack the rep to add a comment. And since no-one has responded yet, I thought it better than nothing...)
You read the file using the default encoding your system provides, so Ruby tags the string as UTF-8, which doesn't mean it really is UTF-8 data. Run file <input file> to guess what kind of encoding is in there, then tell Ruby that's what it is (unclean: force_encoding(<encoding>); clean: tell the File object what encoding it is when you open it), and then use encode!("UTF-8") to convert it to UTF-8.
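The "clean" variant looks something like this (assuming file reported ISO-8859-1; substitute whatever it actually says):
# open the file with its real encoding, then transcode to UTF-8
input_data = File.open(in_path, 'r:ISO-8859-1') { |f| f.read }
utf8_data  = input_data.encode('UTF-8')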
Here are 2 common situations and how to deal with them:
Situation 1
You have an UTF-8 input-file with possibly a few invalid bytes
Remove the invalid bytes:
test = "Partly valid\xE4 UTF-8 encoding: äöüß"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
str.scrub('')
=> "Partly valid UTF-8 encoding: äöüß"
Situation 2
You have an input-file that could be in either UTF-8 or ISO-8859-1 encoding
Check which encoding it is and convert to UTF-8 (if necessary):
test = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
unless str.valid_encoding?
  str.encode!( 'UTF-8', 'ISO-8859-1', invalid: :replace )
end #unless
=> "String in ISO-8859-1 encoding: äöüß"
Notes
The above code snippets assume that Ruby encodes all your strings in UTF-8 by default. Even though this is almost always the case, you can make sure of it by starting your scripts with # encoding: UTF-8.
It is programmatically possible to detect whether a string is invalid in most multi-byte encodings like UTF-8 (in Ruby, see #valid_encoding?). However, it is NOT possible (or at least extremely hard) to programmatically detect invalidity in single-byte encodings like ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. it cannot detect whether a String is valid ISO-8859-1.
Even though UTF-8 has become increasingly popular as the default encoding in computer systems, ISO-8859-1 and other Latin-1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings out there that are very similar to, but vary slightly from, ISO-8859-1. Examples: CP1252 (a.k.a. Windows-1252), ISO-8859-15
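A quick illustration of that asymmetry (the byte string is made up):
# every byte value is a defined Latin-1 character, so this can never be false
"\xE4\xF6\xFC".dup.force_encoding('ISO-8859-1').valid_encoding?   # => true
# the same bytes are not a well-formed UTF-8 sequence
"\xE4\xF6\xFC".dup.force_encoding('UTF-8').valid_encoding?        # => false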
Please try this one:
input_data = File.open("path/your_file.pdf", "rb") {|io| io.read}
Thanks

Determine if a text file without BOM is UTF8 or ASCII

Long story short:
+ I'm using ffmpeg to check the artist name of a MP3 file.
+ If the artist has asian characters in its name the output is UTF8.
+ If it just has ASCII characters the output is ASCII.
The output does not use any BOM indication at the beginning.
The problem is that if the artist has, for example, an "ä" in the name, the output is ASCII, just not US-ASCII, so the "ä" is not valid UTF-8 and is skipped.
How can I tell whether the output text file from ffmpeg is UTF-8 or not? The application does not have any switches, and I just think it's plain dumb not to always go with UTF-8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
If anyone knows of a Windows version?
Thanks a lot beforehand, guys!
This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without BOM (Byte Order Mark) and choose the best Encoding ...
You say, "ä" is not valid UTF-8 ... This is not correct...
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system of how to encode Unicode Codepoints. The issue of validity is not in the character itself; it is a question of how it has been encoded...
There are many systems which can encode Unicode Codepoints; UTF-8 is one and UTF-16 is another... "ä" is quite legal in the UTF-8 system. Actually all characters are valid, so long as that character has a Unicode Codepoint.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode Codepoint system. Unicode itself is nothing more than a big look-up table. What does the work is the encoding system; e.g. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values in a single byte, just as ASCII does, this means that the data in an ASCII file is identical to a file with the same data but which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... they are indistinguishable for data in the ASCII range (i.e., the first 128 characters).
You can check a file for 7-bit ASCII compliance..
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
Here is a similar check for UTF-8 compliance..
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"
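If you have Ruby available on the Windows machine, here is a rough Ruby equivalent of the two checks above (path is a placeholder for the file to test; the whole file is read into memory):
bytes = File.binread(path)
if bytes.dup.force_encoding('US-ASCII').valid_encoding?
  puts "7-bit ASCII"
elsif bytes.dup.force_encoding('UTF-8').valid_encoding?
  puts "valid UTF-8"
else
  puts "neither plain ASCII nor valid UTF-8"
end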
