Identifying bad rows (UTF-8) in an Oracle table

I get the error messages below when trying to export an Oracle table using ora2pg. Is there a query I can run to find out which columns contain these UTF-8 characters?
Unicode surrogate U+DBC0 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.721 (1800 recs/sec)
Unicode surrogate U+DF72 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.
Unicode surrogate U+DBC0 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.
Unicode surrogate U+DF72 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.
Unicode surrogate U+DBC0 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.721 (1802 recs/sec)
Thanks.
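The thread gives no query, but the codepoints in the error (U+DBC0, U+DF72) are UTF-16 surrogates that were stored as individual three-byte UTF-8 sequences (so-called CESU-8-style data). Any such sequence begins with the byte 0xED followed by a continuation byte in 0xA0-0xBF, so one option is to fetch the suspect columns and scan their raw bytes client-side. A minimal sketch in Ruby (`surrogate_positions` is a hypothetical helper name, not part of ora2pg):

```ruby
# An encoded UTF-16 surrogate (U+D800..U+DFFF) always appears as the
# three bytes ED A0-BF 80-BF; return the byte offsets where one starts.
def surrogate_positions(bytes)
  positions = []
  bytes.each_cons(3).with_index do |(a, b, c), i|
    positions << i if a == 0xED &&
                      (0xA0..0xBF).cover?(b) &&
                      (0x80..0xBF).cover?(c)
  end
  positions
end

# U+DBC0 encodes as ED AF 80 and U+DF72 as ED BD B2, reproducing the
# pair from the ora2pg error between two ASCII bytes.
surrogate_positions([0x6D, 0xED, 0xAF, 0x80, 0xED, 0xBD, 0xB2, 0x6E])
# => [1, 4]
```

Once the offending rows are located, the surrogate pairs can be re-encoded as proper four-byte UTF-8 before re-running the export.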

Related

How to display utf-8 characters in hbase shell?

When I use the get command in the HBase shell, like:
hbase(main)> get 't1','00003ab'
result is:
PATHID:path0 timestamp=1463537742385, value={"pathSign":"\xE5\x8C\x97\xE5\xAE\x89\xE9\x97\xA8\xE8\xA1\x97"} //hexadecimal
The UTF-8 characters are displayed as hexadecimal escapes.
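The shell is printing the raw bytes with `\xNN` escapes; the stored value itself is valid UTF-8. A small sketch (in Ruby; `unescape_hbase` is a hypothetical helper, and the input is the escaped string copied from the output above) that turns the escapes back into readable characters:

```ruby
# Turn the shell's \xNN escapes back into raw bytes, then label the
# result as UTF-8 so it prints as characters instead of escapes.
def unescape_hbase(value)
  value.gsub(/\\x([0-9A-Fa-f]{2})/) { [$1.hex].pack('C') }
       .force_encoding('UTF-8')
end

unescape_hbase('\xE5\x8C\x97\xE5\xAE\x89\xE9\x97\xA8\xE8\xA1\x97')
# => "北安门街"
```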

gsub encoding issues with UTF-8

I am trying to create a slug from some usernames in a DB migration.
nick = nick.gsub('á','a')
I also want to change éíóúñ to eioun.
But it doesn't work; I get:
incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
However I try, for example by adding a force_encoding call, I always get encoding errors like:
invalid byte sequence in UTF-8 (ArgumentError)
"\xF3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
This only happens when I have a gsub for changing those vowels or the Spanish ñ.
There's also an # encoding: utf-8 line in my file, and the data comes from a UTF-8 database, but nothing seems to help.
I've seen some questions on SO, but nothing I try fixes it.
By the way, this is not Rails-related.
I finally used transliterate from Rails ActiveSupport:
require 'active_support/all'
v = ActiveSupport::Inflector.transliterate v.downcase
v.gsub(/[^a-z0-9]+/, '-').chomp('-')
Works fine.
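For comparison, the same transliteration can be sketched without ActiveSupport using Unicode normalization from Ruby's standard library (2.2+); `slugify` is a hypothetical name for the helper:

```ruby
# NFD decomposition splits 'á' into 'a' plus a combining acute accent;
# removing the combining marks (\p{Mn}) leaves the plain letters.
def slugify(name)
  name.downcase
      .unicode_normalize(:nfd)
      .gsub(/\p{Mn}/, '')
      .gsub(/[^a-z0-9]+/, '-')
      .gsub(/\A-+|-+\z/, '')
end

slugify('Peñá López')  # => "pena-lopez"
```

Unlike transliterate, this only strips combining marks, so characters with no decomposition (e.g. 'ß' or 'ø') would need explicit handling.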

OpenSSL - Parsing subjectAltName with ASCII and UTF-8 characters

A SubjectAltName contains a mix of ASCII and UTF-8 characters, such as:
6DC3C16E - m(small a with acute)n
I'm using X509_NAME_oneline to parse it, and I'm getting mixed escape sequences like 'm\xC3\xA1n'.
Is there an openssl function which would return a full UTF-8 string?
Thanks
John
"m\xC3\xA1n" is a full UTF-8 string. UTF-8 is a variable length encoding, and all of the ASCII characters (which have codepoints less than 128) are encoded identically in ASCII and UTF-8. The character m, for example, is just the single byte 0x6d in both ASCII and UTF-8.

Invalid multibyte char(UTF-8)

I am trying to run this Ruby code with the --1.9 option:
# encoding: utf-8
module Modd
  def cpd
    # "_¦+?" mySQL
    "ñ,B˜"
  end
end
I edited the file in GVim, and when I ran it I got the following error:
SyntaxError: f3.rb:6: invalid multibyte char (UTF-8)
After that I opened it in Notepad++, changed the encoding to UTF-8, and ran it with:
jruby --1.9 f3.rb
then I get:
SyntaxError: f3.rb:1: \273Invalid char `\273' ('╗') in expression
I have seen this happen when the BOM gets mangled during a charset conversion (the UTF-8 BOM in octal is 357 273 277). If you open the file with a hex editor (:%!xxd in vi), you will more than likely see spurious bytes at the beginning of the file, before the first #.
If you recreate the file directly in UTF-8, or remove those spurious bytes, that should solve your problem.
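To strip the spurious bytes programmatically rather than in an editor, a minimal sketch in Ruby (`strip_bom` is a hypothetical helper; requires Ruby 2.5+ for delete_prefix):

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b  # EF BB BF = octal 357 273 277

# Remove a leading UTF-8 BOM from raw file contents, if present.
def strip_bom(data)
  data.b.delete_prefix(UTF8_BOM)
end

strip_bom("\xEF\xBB\xBFmodule Modd".b)  # => "module Modd"
```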

Determine if a text file without BOM is UTF8 or ASCII

Long story short:
+ I'm using ffmpeg to check the artist name of an MP3 file.
+ If the artist has Asian characters in the name, the output is UTF-8.
+ If it has only ASCII characters, the output is ASCII.
The output does not use any BOM indication at the beginning.
The problem is that if the artist has, for example, an "ä" in the name, the output is in an extended-ASCII encoding rather than US-ASCII, so the "ä" byte is not valid UTF-8 and gets skipped.
How can I tell whether or not the output text file from ffmpeg is UTF-8? The application has no switches for this, and I just think it's plain dumb not to always emit UTF-8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
Does anyone know of a Windows version?
Thanks a lot in advance, guys!
This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without a BOM (Byte Order Mark) and choose the best encoding ...
You say, "ä" is not valid UTF-8 ... This is not correct...
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system of how to encode Unicode Codepoints. The issue of validity is notin the character itself, it is a question of how has it been encoded...
There are many systems which can encode Unicode Codepoints; UTF-8 is one and UTF16 is another... "ä" is quite legal in the UTF-8 system.. Actually all characters are valid, so long as that character has a Unicode Codepoint.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode Codepoint system. Unicode itself is nothing more that a big look-up table. What does the work is teh encoding system; eg. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values is a single byte, just as ASCII does, this means that the data in an ASCII file is identical to a file with the same date but which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... they are indistinguishable for data in the ASCII range (ie, 128 characters).
You can check a file for 7-bit ASCII compliance:
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
Here is a similar check for UTF-8 compliance:
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"
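If installing the Perl one-liners on Windows is a hurdle, the same two checks can be sketched in Ruby (`utf8?` and `ascii?` are hypothetical helper names): a byte string is UTF-8 if it decodes cleanly, and 7-bit ASCII if no byte has the high bit set.

```ruby
# Valid UTF-8 iff relabelling the raw bytes as UTF-8 yields no
# malformed sequences.
def utf8?(data)
  data.dup.force_encoding('UTF-8').valid_encoding?
end

# 7-bit ASCII iff every byte is below 0x80.
def ascii?(data)
  data.b.bytes.all? { |b| b < 0x80 }
end

utf8?("ä".b)                       # => true  (two bytes: C3 A4)
utf8?("ä".encode('ISO-8859-1').b)  # => false (lone E4 byte)
ascii?("plain text")               # => true
```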
