Long story short:
+ I'm using ffmpeg to check the artist name of an MP3 file.
+ If the artist's name contains Asian characters, the output is UTF-8.
+ If it contains only ASCII characters, the output is ASCII.
The output does not include any BOM at the beginning.
The problem is that if the artist's name contains, for example, an "ä", the output is in an extended single-byte encoding rather than US-ASCII, so the byte for "ä" is not valid UTF-8 and gets skipped.
How can I tell whether the output text file from ffmpeg is UTF-8 or not? The application does not have any switches for this, and I just think it's plain dumb not to always go with UTF-8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
Does anyone know of a Windows version?
Thanks a lot beforehand, guys!
This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without a BOM (Byte Order Mark) and choose the best encoding ...
You say "ä" is not valid UTF-8... This is not correct.
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system for encoding Unicode codepoints. The question of validity lies not in the character itself; it is a question of how it has been encoded.
There are many systems which can encode Unicode codepoints; UTF-8 is one and UTF-16 is another. "ä" is quite legal in the UTF-8 system. In fact, all characters are valid, so long as the character has a Unicode codepoint.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode codepoint system. Unicode itself is nothing more than a big look-up table; what does the work is the encoding system, e.g. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values in a single byte, just as ASCII does, the data in an ASCII file is identical to the data in a file with the same content which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... the two are indistinguishable for data in the ASCII range (i.e., the first 128 characters).
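A quick Python 3 session (used here purely as an illustration) shows this concretely: the ASCII and UTF-8 encodings of plain ASCII text are byte-for-byte identical, while "ä" encodes differently in UTF-8 than in a single-byte encoding like Latin-1.

>>> "abc".encode("ascii") == "abc".encode("utf-8")
True
>>> "ä".encode("utf-8")
b'\xc3\xa4'
>>> "ä".encode("latin-1")
b'\xe4'

It is the lone 0xE4 byte from a Latin-1 file, not the character "ä" itself, that a UTF-8 decoder rejects.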
You can check a file for 7-bit ASCII compliance ("$1" here is the file name to test):
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
Here is a similar check for UTF-8 compliance:
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"
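If you have Python available on Windows, a minimal isutf8-like sketch is easy to write (the script name and exit-code convention below are my own, not part of any standard tool): it simply tries to decode the whole file as UTF-8 and reports success or failure.

# isutf8.py -- exit 0 if the file decodes cleanly as UTF-8, 1 otherwise
import sys

def is_utf8(path):
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')  # strict decoding; raises on any invalid byte
        return True
    except UnicodeDecodeError:
        return False

if __name__ == '__main__':
    sys.exit(0 if is_utf8(sys.argv[1]) else 1)

Note that a pure-ASCII file will also pass this check, since ASCII is a subset of UTF-8; that is unavoidable, as the two are indistinguishable in that range.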
In my source files I have strings containing non-ASCII characters, like
sCursorFormat = TRANSLATE("Frequency (Hz): %s\nDegree (°): %s");
But when I extract them they vanish like
msgid ""
"Frequency (Hz): %s\n"
"Degree (): %s"
msgstr ""
I have specified the encoding when extracting as
xgettext --from-code=UTF-8
I'm running under MS Windows and the source files are C++ (not that it should matter).
The encoding of your source file is probably not UTF-8 but ANSI, which stands for whatever the encoding for non-Unicode applications is (probably code page 1252). If you open the file in a hex editor, you will see the byte 0xB0 standing for the degree symbol. This byte is not valid UTF-8 on its own: in UTF-8, the degree symbol is represented with the two bytes 0xC2 0xB0. This is why the character vanishes when using --from-code=UTF-8.
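A quick Python 3 check (just for illustration) shows the two encodings of the degree symbol side by side:

>>> "°".encode("cp1252")
b'\xb0'
>>> "°".encode("utf-8")
b'\xc2\xb0'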
The solution to your problem is to use --from-code=windows-1252, or, better yet, to save all source files as UTF-8 and then use --from-code=UTF-8.
I have a UTF-8 file which I convert to ISO-8859-1 before sending it to a consuming system that does not understand UTF-8. Our current issue is that when we run the iconv process on the UTF-8 file, some characters are converted to '?'. Currently, we have been providing a fix for every failing character.
I am trying to understand whether it is possible to create a file which contains all possible UTF-8 characters. The intent is to downgrade them using iconv and identify the characters that get replaced with '?'.
Rather than looking at every possible Unicode character (over 140k of them), I recommend performing an iconv substitution and then seeing where your actual problems are. For example:
iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="<U+%04X>"
This will convert characters that aren't in ISO-8859-1 to a "<U+####>" syntax. You can then search your output for these.
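For example, a line containing "€" (U+20AC, which has no ISO-8859-1 equivalent) would come out containing "<U+20AC>", while a character such as "±", which does exist in ISO-8859-1, is converted as usual.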
If your data will be read by something that handles C-style escapes (\u####), you can also use:
iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="\\u%04x"
An exhaustive list of all Unicode characters seems rather impractical for this use case. There are tens of thousands of characters in non-Latin scripts which don't have any obvious near-equivalent in Latin-1.
Instead, probably look for a mapping from Latin characters which are not in Latin-1 to corresponding homographs or near-equivalents.
Some programming languages have existing libraries for this; a common and simple transformation is to attempt to strip any accents from characters which cannot be represented in Latin-1, and use the unaccented variant if this works. (You'll want to keep the accent for any character which can be normalized to Latin-1, though. Maybe also read about Unicode normalization.)
Here's a quick and dirty Python attempt.
from unicodedata import normalize
import sys

def latinize(string):
    """
    Map string to Latin-1, replacing characters which can be approximated
    """
    result = []
    for char in string:
        try:
            # Try the composed (NFKC) form; succeeds when the result
            # fits in Latin-1
            byte = normalize("NFKC", char).encode('latin-1')
        except UnicodeEncodeError:
            # Otherwise decompose (NFKD) and drop any non-ASCII marks,
            # which effectively strips accents
            byte = normalize("NFKD", char).encode('ascii', 'ignore')
        result.append(byte)
    return b''.join(result)

def convert(fh):
    for line in fh:
        # latinize() returns bytes, so write them to the raw stdout buffer
        sys.stdout.buffer.write(latinize(line))

def main():
    if len(sys.argv) > 1:
        for filename in sys.argv[1:]:
            with open(filename, 'r', encoding='utf-8') as fh:
                convert(fh)
    else:
        convert(sys.stdin)

if __name__ == '__main__':
    main()
Demo: https://ideone.com/sOEBW9
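To run it locally instead, save the script (say, as latinize.py; the name is arbitrary) and redirect its output:

python3 latinize.py utf8_input.txt > latin1_output.txt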
I am concatenating files using Windows. I have used the TYPE and the COPY commands and I get the same artifact. At the place where my original files are joined in the new file, the character string "" (i.e. Decimal: 239 187 191, Hex: EF BB BF) is inserted.
How can I troubleshoot this? Is there an easy explanation you can provide for how to avoid this, and why does it happen?
A very good explanation of why this happens is in Mark Tolonen's answer, so I will not repeat it here.
Instead of the obsolete TYPE and COPY, you can use PowerShell:
powershell -Command "& { Get-Content a*.txt | Out-File output.txt -Encoding utf8 }"
This command gets the content of all files matching a*.txt in the current folder and concatenates them into the output.txt file using UTF-8.
PowerShell is part of Windows 7 and later.
The extra bytes are a UTF-8 encoding signature. The Unicode byte order mark U+FEFF is encoded in UTF-8 and written to the beginning of the file to indicate the file is encoded in UTF-8. It's not required, but Windows assumes a text file is encoded in the local ANSI encoding (commonly Windows-1252) unless a BOM appears.
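You can see those signature bytes with a quick Python check:

>>> '\ufeff'.encode('utf-8')
b'\xef\xbb\xbf'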
Many file tools don't know about this (DOS copy being one of them), so concatenating files can be troublesome.
Being ignorant of encodings often causes trouble these days. You can't simply concatenate two text files of unknown encoding... they may be different.
If you know the encoding, use a tool that understands the encoding. Here's a very basic concatenate script written in Python that will convert encodings as well.
# cat.py
import sys

if len(sys.argv) < 5:
    print('usage: cat <in_encoding> <out_encoding> <outfile> <infile> [infile...]')
else:
    # Decode each input file with the input encoding and re-encode
    # everything into the output file with the output encoding.
    with open(sys.argv[3], 'w', encoding=sys.argv[2]) as fout:
        for file in sys.argv[4:]:
            with open(file, 'r', encoding=sys.argv[1]) as fin:
                fout.write(fin.read())
Given two files with UTF-8 w/ BOM encoding, this command will output UTF-8 (no BOM):
cat.py utf-8-sig utf-8 out.txt test1.txt test2.txt
Side note about Python: the utf-8-sig encoding removes the BOM from the data if present when reading, so it can be used to read any UTF-8 file, with or without a BOM. When writing, utf-8-sig puts a BOM at the start of the file, but utf-8 does not.
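A quick Python illustration of the reading difference:

>>> b'\xef\xbb\xbfabc'.decode('utf-8-sig')
'abc'
>>> b'\xef\xbb\xbfabc'.decode('utf-8')
'\ufeffabc'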
I am trying to grep for the hexadecimal value of a range of UTF-8 encoded characters and I only want just that specific range of characters to be returned.
I currently have this:
grep -P -n "[\xB9-\xBF]" $str_st_location >> output_st.txt
But this returns every character that has any of those hex values in its hex representation, i.e. it returns everything from 00B9 through FFB9 as long as the B9 is present.
Is there a way I can specify using grep that I only want the exact/specific hex value range I search for?
Sample Input:
STRING_OPEN
Open
æ–å¼€
Ouvert
Abierto
Открыто
Abrir
Now, using my grep statement, it should return the 3rd and 6th lines, but it also includes some text in my file that is Russian or Chinese, because the ranges for those languages include the hex values I'm searching for, like these:
断开
Открыто
I can't give out more sample input unfortunately as it's work related.
EDIT: Actually the below code snippet worked!
grep -P -n "[\x{00B9}-\x{00BF}]" $str_st_location > output_st.txt
It found all the corrupted characters and there were no false positives. The only issue now is that the lines with the corrupted characters automatically get "uncorrupted", i.e. when I open the file, grep's output is the corrected version of the corrupted characters. For example, it finds æ–å¼€, but in the text file it's shown as 断开.
Since you're using -P, you're probably using GNU grep, because that is a GNU grep extension. Your command works using GNU grep 2.21 with pcre 8.37 and a UTF-8 locale, however there have been bugs in the past with multi-byte characters and character ranges. You're probably using an older version, or it is possible that your locale is set to one that uses single-byte characters.
If you don't want to upgrade, it is possible to match this character range by matching individual bytes, which should work in older versions. You would need to convert the characters to bytes and search for the byte values. Assuming UTF-8, U+00B9 is C2 B9 and U+00BF is C2 BF. Setting LC_CTYPE to something that uses single-byte characters (like C) will ensure that it will match individual bytes even in versions that correctly support multi-byte characters.
LC_CTYPE=C grep -P -n "\xC2[\xB9-\xBF]" $str_st_location >> output_st.txt
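If you want to verify those byte values yourself, a quick Python check shows the UTF-8 encodings of the range endpoints:

>>> '\u00b9'.encode('utf-8')
b'\xc2\xb9'
>>> '\u00bf'.encode('utf-8')
b'\xc2\xbf'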
When I use a Unicode 6.0 character (for example, 'beer mug') in Bash (4.3.11), it doesn't display correctly.
Just copying and pasting the character is okay, but if you use a UTF-16 hex code like
$ echo -e '\ud83c\udf7a'
the output is '??????'.
What's the problem?
You can't use UTF-16 with bash and a unix(-like) terminal. Bash strings are strings of bytes, and the terminal will (if you have it configured correctly) be expecting UTF-8 sequences. In UTF-8, surrogate pairs are illegal. So if you want to show your beer mug, you need to provide the UTF-8 sequence.
Note that echo -e interprets Unicode escapes in the forms \uXXXX and \UXXXXXXXX, producing the corresponding UTF-8 sequence. So you can get your beer mug (assuming your terminal font includes it) with:
echo -e '\U0001f37a'
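For reference, here is a small Python sketch (purely illustrative) of the arithmetic that combines a UTF-16 surrogate pair into the single code point you pass to \U:

# Combine a UTF-16 surrogate pair into one Unicode code point
hi, lo = 0xD83C, 0xDF7A
cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
print(hex(cp))  # 0x1f37a, i.e. the \U0001f37a used above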