I have an Excel file containing a Latin character, which is shown as follows:
abcón
After saving it as a CSV file, the Latin character was lost:
abc??n
What causes this problem, and how can I solve it? Thanks.
It's likely that the ó you're using in the Excel file isn't supported in ASCII text. There are a couple of different symbols that look almost, if not entirely, identical. In the Insert->Symbol character map, 00F3 is supported and belongs to the Latin extended alphabet. However, 1F79 from the Greek extended alphabet is not supported, and from my casual inspection it looks identical. Try replacing the character in question with the one from the character map.
Alternatively, you can use the Alt code 0243 for the character, which should work.
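If you want to confirm which of the two look-alikes is actually in the cell, checking the code points directly is a quick test. Here is a small Python sketch; the sample string is an assumption, standing in for whatever you paste out of the spreadsheet:
# Print the code point of each character so the Latin ó (U+00F3) can be told
# apart from the Greek look-alike (U+1F79).
for ch in "abcón":
    print(ch, hex(ord(ch)))
# The Latin ó prints 0xf3; the Greek character would print 0x1f79.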
Related
I have a process that inserts data into PDFs that eventually load into a system that is searched based on that inserted data. The inserted data looks something like:
<<
/IBM-ODIndexes
<< /Private
<<
/DOB (05031983)
/FULL_NAME (TEST USER)
/YEAR (2020)
>>
/LastModified(D:20210112201530)
>>
However, there are instances where the data in the FULL_NAME field contains non-UTF-8 characters, and then users are unable to search the data. Specifically, apostrophes come over from Microsoft Word and then get interpreted like this:
/FULL_NAME (JERRY OÃ<83>¢ââ<80><9a>‰â<80><9e>¢CONNELL)
In this case I am looking to strip out the apostrophe that is represented as Ã<83>¢ââ<80><9a>‰â<80><9e>¢ and replace it with a space.
There are several complexities here, but in general I would say that the only reliable way to deal with it is to figure out the text encoding of the incoming document and convert it to the target encoding.
Ã<83>¢ââ<80><9a>‰â<80><9e>¢ is 34 characters (that is, at least 34 bytes), and no single encoding has ever used that much space for a single character. What's probably happening is multiple levels of encoding, such as HTML entities, base64, UTF-8/16/32, or escape characters like %% to represent % in SQL or \\ to represent \ in Bash. Reversing all these levels of encoding manually is going to involve quite a lot of reading of the huge docx standard. The simpler alternative is to use a library which can convert the entire text into a known character encoding for you, at which point you have to do at most a single conversion into UTF-8.
Another argument for this is that the “apostrophe string” contains otherwise harmless characters like “a” and “e”. Without at least some understanding of the encodings involved, you're unlikely to be able to separate encoded characters from non-encoded ones, which would leave the resulting text full of invalid fragments.
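As an illustration of undoing a single layer of this kind of damage, here is a rough Python sketch. It assumes one round of UTF-8 bytes mis-read as Windows-1252 (a common Word/Windows mix-up) and uses a simplified sample name, so it is not necessarily the exact chain the document in the question went through:
# Undo one layer of mojibake: text whose UTF-8 bytes were mis-decoded as cp1252.
def undo_cp1252_mojibake(text):
    try:
        # Re-encode the mis-decoded characters back to their original bytes,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # The round trip fails if the text was not simple cp1252 mojibake.
        return text

print(undo_cp1252_mojibake("JERRY Oâ€™CONNELL"))  # JERRY O'CONNELL, with a right single quote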
I'm working with binary data and want to find out what is wrong.
I use Notepad++ to preview the binary. I have set View->Show Symbol->Show All Characters to see all characters, but there are still some characters I cannot identify, e.g. â©ÎÅ. The problem is that ASCII has a strict standard for the numbers 0 to 127, but extended ASCII can be displayed in many different ways, so I have trouble with the characters that represent the numbers 128 to 255.
Is there any table of Notepad++'s extended characters, or some option to make it show the symbol code instead of the symbol?
Maybe not a solution for you, but in the PSPad and SynWrite editors you can:
create a text converter (an INI file) which changes ASCII 127..255 to strings like <127>...<255> or similar (a stand-alone sketch of the same idea follows below)
apply this converter to the text.
Text converter usage is described in the help of both apps.
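Outside those editors, the same idea can be approximated in a few lines of Python; the file name below is just a placeholder:
# Read the file as raw bytes and print it with every byte in the range 128..255
# replaced by a <nnn> marker, so the non-ASCII values can be identified.
with open("data.bin", "rb") as f:
    data = f.read()

visible = "".join(chr(b) if b < 128 else "<%d>" % b for b in data)
print(visible)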
I have some UTF text starting with "ef bb bf". How can I turn this message into human-readable text? vim, gedit, etc. interpret the file as plain text and show all the ef-text even when I force them to read the file with several UTF encodings. I tried the "recode" tool, but it doesn't work. Even PHP's utf8_decode failed to produce the expected text output.
Please help, how can I convert this file so that I can read it?
ef bb bf is the UTF-8 BOM. Strip off the first three bytes and try to utf8_decode the remainder.
$text = "\xef\xbb\xbf...."; // the raw text, starting with the three BOM bytes
echo utf8_decode(substr($text, 3)); // drop the BOM, then convert the UTF-8 remainder to ISO-8859-1
Is it UTF8, UTF16, or UTF32? It matters a lot! I assume you want to convert the text into old-fashioned ASCII (where every character is 1 byte long).
UTF8 should already be (at least mostly) readable as it uses 1 byte for standard ASCII characters and only uses multiple bytes for special/multilingual characters (Character codes > 127). It sounds like your file isn't UTF8, or you'd already be able to read it! Online content is generally UTF-8.
Unicode character codes are the same as the old ASCII codes up to 127.
UTF16 and UTF32 use at least 2 and exactly 4 bytes respectively to encode every character, whether those characters could be represented in a single byte or not. That makes the text unreadable if the text editor is expecting UTF8.
Gedit supports UTF16 and UTF32, but you need to 'add' those encodings explicitly in the open dialog box (and possibly select them explicitly instead of using auto-detect).
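If you'd rather let a script work out which of the three it is, checking the BOM at the start of the file is enough in this case. A rough Python sketch, with a placeholder file name:
# The BOM bytes at the start of the file tell us which UTF encoding wrote it.
import codecs

with open("message.txt", "rb") as f:
    raw = f.read()

if raw.startswith(codecs.BOM_UTF32_LE) or raw.startswith(codecs.BOM_UTF32_BE):
    encoding = "utf-32"      # test UTF-32 first: its LE BOM begins with the UTF-16 LE BOM
elif raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
    encoding = "utf-16"
else:
    encoding = "utf-8-sig"   # handles plain UTF-8 and UTF-8 with the ef bb bf BOM

print(raw.decode(encoding))  # these codecs strip the BOM while decoding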
I have to work with text that was previously copy/pasted from an Excel document into a .txt file. There are a few characters that I assume mean something to Excel but that show up as an unrecognised character (i.e. that '?' symbol in gedit, or one of those rectangles in some other text editors). I wanted to parse those out somehow, but I'm unsure of how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I set about doing this?
You could work with http://spreadsheet.rubyforge.org/, maybe, to read/parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode characters; that means your '?'s and triangles could actually be unrecognized multi-byte sequences.
If you want to properly handle the spreadsheet contents, I recommend that you first export the data to CSV using (Open|Libre)Office and choose UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
If you are not worried about multi-byte sequences, I find this regex to be handy:
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )  # replaces every character outside 0-9a-zA-Z, '-' and '_' (including spaces) with '*'
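If the stray symbols turn out to be invalid byte sequences rather than odd-but-valid characters, another rough option (a Python sketch; the file name is a placeholder) is to decode with a replacement marker and then blank the markers out:
# Anything in the exported text that is not valid UTF-8 becomes U+FFFD,
# which is then replaced with a space.
with open("export.txt", "rb") as f:
    text = f.read().decode("utf-8", errors="replace")

cleaned = text.replace("\ufffd", " ")
print(cleaned)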
I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text, can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points (0 through 127), that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
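To illustrate that second part under the assumption in the update, i.e. that the incoming bytes are single-byte and their character set is known (ISO-8859-1 is used here purely as an example), the conversion is a straight decode-then-encode. A small Python sketch:
# 0xE0 is 'à' in ISO-8859-1; in UTF-8 the same character takes two bytes.
raw_bytes = bytes([0x61, 0x62, 0x63, 0xE0])   # "abc" followed by 0xE0
text = raw_bytes.decode("iso-8859-1")         # interpret the bytes in their source encoding
utf8_bytes = text.encode("utf-8")             # 0xE0 becomes the pair 0xC3 0xA0

print(text)        # abcà
print(utf8_bytes)  # b'abc\xc3\xa0'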