Find and replace non utf8 character - bash

I have a process that inserts data into PDFs that eventually loads into a system that gets searched based on that inserted data. The inserted data looks something like:
<<
/IBM-ODIndexes
<< /Private
<<
/DOB (05031983)
/FULL_NAME (TEST USER)
/YEAR (2020)
>>
/LastModified(D:20210112201530)
>>
However, there are instances where the data in the FULL_NAME field contains non UTF8 characters and then users are unable to search the data. Specifically apostrophes come over from Microsoft Word and then gets interpreted like this:
/FULL_NAME (JERRY OÃ<83>¢ââ<80><9a>‰â<80><9e>¢CONNELL)
In this case I am looking to strip out the apostrophe that is represented as Ã<83>¢ââ<80><9a>‰â<80><9e>¢ and replace it with a white space.

There are several complexities here, but in general I would say that the only reliable way to deal with it is to figure out the text encoding of the incoming document and converting it to the target encoding.
Ã<83>¢ââ<80><9a>‰â<80><9e>¢ is 34 characters (that is, at least 34 bytes), and no single encoding ever used that much space for a single character. What’s probably happening is multiple levels of encoding, such as HTML entities, base64, UTF-8/16/32 or escape characters like %% to represent % in SQL or \\ to represent \ in Bash. Reversing all these levels of encoding manually is going to involve quite a lot of reading the huge docx standard. The simpler alternative is to use a library which can just convert the entire text into a known character encoding for you, at which point you have to do at most a single conversion into UTF-8.
Another argument for this is that the “apostrophe string” does contain otherwise harmless characters like “a” and “e”. Without at least some understanding of the encodings you’re unlikely to be able to separate encoded characters from non-encoded ones, which would make the resulting text full of invalid text.

Related

Error with utf8 encoding

When I get data from some website, sometime the data is encode in utf8 but look like this:
Thỏ , Nạt
The accent mark is seperated from character when in fact these string must be:
Thỏ, Nạt
I don't know what is the problem here and how to correct it. Can someone help me with this
The first sample string contains two Vietnamese characters in decomposed form. The first one of them is “ỏ”, consisting of simple letter “o” followed by U+0309 COMBINING HOOK ABOVE.
The second sample string has those characters in precomposed form. The first one of them is “ỏ” U+1ECF LATIN SMALL LETTER O WITH HOOK ABOVE.
The decomposed and precomposed form are defined to be “canonical equivalent” and are normally expected to result in the same rendering (though this does not always happen). They are not identical, however; in programmatic comparison of characters and strings, they are very much different.
Mostly Latin letters with diacritics, such as “é” and “ä”, are used in precomposed form only, since that’s what keyboard drivers, online keyboards, character picking utilities, etc., normally produce. However, Vietnamese keyboard drivers often work so that some diacritic marks are entered after entering a base character, and the diacritic is thus produced as a combining character, i.e. the letter (like “ỏ”) is then in decomposed form.
One way of dealing with this issue, recommended in many contexts, is to convert your strings to Normalization Form C (NFC). This would put these characters into precomposed form. Note, however, that conversion to NFC removes some other distinctions, too (but this is not relevant if the text is in Vietnamese only and does not contain special symbols).
It remains a mystery why the first sample string has a space character before the comma.

UTF-8 string delimiter

I am parsing a binary protocol which has UTF-8 strings interspersed among raw bytes. This particular protocol prefaces each UTF-8 string with a short (two bytes) indicating the length of the following UTF-8 string. This gives a maximum string length 2^16 > 65 000 which is more than adequate for the particular application.
My question is, is this a standard way of delimiting UTF-8 strings?
I wouldn't call that delimiting, more like "length prefixing". Some people call them Pascal strings since in the early days the language Pascal was one of the popular ones that stored strings that way in memory.
I don't think there's a formal standard specifically for just that, as it's a rather obvious way of storing UTF-8 strings (or any strings of bytes for that matter). It's defined over and over as a part of many standards that deal with messages that contain strings, though.
UTF8 is not normally de-limited, you should be able to spot the multibyte characters in there by using the rules mentioned here: http://en.wikipedia.org/wiki/UTF-8#Description
i would use a delimiter which starts with 0x11......
but if you send raw bytes you will have to exclude this delimiter from the data\messages processed ,this means that if there is a user input similar to that delimiter, you will have to convert it.
if the user inputs any utf8 represented char you may simply send it as is.

How do I escape a Unicode string with Ruby?

I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
To use a unicode character in Ruby use the "\uXXXX" escape; where XXXX is the UTF-16 codepoint. see http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
begin
c.encode("ASCII") # try to encode it to ASCII
rescue Encoding::UndefinedConversionError # but if that fails
encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:
🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443". ̆
encodes to "\u0306".
This is because 🙋🏾 is actually encoded as two code points: a base character (🙋 - U+1F64B), with a modifier (🏾, U+1F3FE; see also). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў – though it looks the same – is formed by combining у (U+0443) with the modifier ̆ (U+0306 - which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it, and you want to find the code point value to encode, if you simply do a web search for that character, you'll frequently find a variety of pages that give unicode details for the particular character. For example, if I do a google search for ✓, I get, among other things, a wiktionary entry, a wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific unicode characters. And each of those pages lists the fact that that check mark is represented by unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).
This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24} – requiring only 2 digits for three of the characters.
You can directly use unicode characters if you just add #Encoding: UTF-8 to the top of your file. Then you can freely use ä, ǹ, ú and so on in your source code.
try this gem. It converts Unicode or non-ASCII punctuation and symbols to nearest ASCII punctuation and symbols
https://github.com/qwuen/punctuate
example usage:
"100٪".punctuate
=> "100%"
the gem uses the reference in https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.

Allowed characters in submit forms (including UTF-8)

Suppose I allow my users to submit a form containing some text fields (I'm not talking about passwords). My users would occasionally use non-ASCII characters like Russian, Chinese, etc. so I use UTF-8 charsets in my database. The question is, should I really allow all of the possible UTF-8 characters? I had a look at the ASCII table and saw that characters 0 to 31 have nothing to do with text, except for newlines and white spaces. Characters 176 to 223 seem to be for decorative purposes :p. Should I restrict them?
The W3C skips these characters in their example regular expression in Multilingual form encoding:
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
Make sure it is valid UTF-8 and Unicode? Yes
Make sure it does not include certain characters, such as control codes? Probably not necessary
You should be aware that even though you are using UTF-8 in your form, you may not get valid UTF-8 from all user-agents when they send form data to you, and you will have to filter it as necessary. Invalid UTF-8 can take many forms, some of them being
Overlong encodings (which can lead to security issues)
Other invalid UTF-8 byte sequences, which may indicate that the user-agent ignored the character encoding and has submitted something like Windows-1252 or ISO-8859-1 encoding instead.
Code points that lie in reserved surrogate space in Unicode
All the above need to be filtered out during input, otherwise you are not storing valid Unicode.
If you want to serve valid HTML or XHTML, which use a subset of Unicode, you will need also need to filter out (either at input or output):
C0 control codes 0x00 to 0x19 (apart from tab, space, new line, carraige return)
0x7F
C1 control codes 0x80 to 0xBF
(probably) any code point above 0x10FFFF
No.
It's a very bad idea to try to "pre-clean" user input. What you consider "decorative" might be absolutely necessary to readers of another language. The best solution is to store the text as-is in the database, and then sanitize it before writing to the page.
When you say "the ASCII table" you're talking about this page, aren't you? That page is garbage. Only the first 128 characters (ie, 0..127) are "ASCII"; the mappings they show for the numbers 128..255 are from an ASCII extension called cp437. There are a lot of "extended ASCII's" out there, and cp437 is far from the most common one.
But I digress. Your question isn't about character encodings, it's about filtering, and a filter should be based on the properties of the characters: is it a letter, a digit, a control character? Most modern programming languages provide methods or functions to obtain such information, and most provide regex support as well. As for what you should filter, or whether you should filter at all, only you can know that.
It sounds like you need to learn more about character encodings and Unicode, though. Start here.

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. The roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~ you could also combine two characters
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
say you want to concatinate str1, str2 and str3
what I do is:
delimitedStr=str1.Replace("#","#a").Replace("|","#p")+"|"+str2.Replace("#","#a").Replace("|","#p")+"|"+str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
note: the order of the replace is important
its unbreakable and easy to implement
Pipe for the win! |
We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep mind mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply seperate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non printable char works if your not hand creating or modifying the file. For quick random access file storage and retrieval field width is used. You don't even have to read the file.. your literally pulling from the file by reference. This is how databases do some storage.. but they also manage the spaces between records and such. And it introduced the problem of max data element width. (Index attach a header which is used to define the width of each element and it's data type in the original old days.. later they introduced compression with remapping chars. This allows for a text file to get about 1/8 the size in transmission.. variable length char encoding for the win
make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor

Resources