How to remove C2 A0 in QByteArray before converting to QString?

I have a QByteArray that contains the bytes "C2 A0", which show up as "=C2=A0" when it is converted to a QString. I remove this "=C2=A0" and replace it with a space before I send the text to be converted to a PDF. The problem is that I always get a stray character in the resulting PDF. I have tried all kinds of ways to fix this. The file I send to the PDF conversion service is verified not to have the "=C2=A0" in it, yet the character still appears.
I need to remove the "C2 A0" bytes from the QByteArray before converting it to a QString. Is there a function that can remove those bytes before the conversion? I opened the file in a Mac hex editor, and even with the "=C2=A0" removed the "C2 A0" is still there. I need to remove these bytes for a proper conversion.
This answer isn't helping me: What is "=C2=A0" in MIME encoded, quoted-printable text?
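A minimal sketch of the byte-level replacement the question asks for, written against std::string so it runs without Qt. The helper names replaceAll and stripNbsp are made up for illustration. The bytes C2 A0 are the UTF-8 encoding of U+00A0 (no-break space), and "=C2=A0" is the quoted-printable spelling of the same two bytes, so the sketch handles both. With Qt, the analogous step would presumably be a QByteArray::replace call on the raw bytes before QString::fromUtf8, but that is an assumption about the asker's code rather than something shown in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replace every occurrence of `from` in `data` with `to`, scanning left to right.
std::string replaceAll(std::string data, const std::string& from, const std::string& to) {
    assert(!from.empty());  // an empty pattern would never advance
    std::size_t pos = 0;
    while ((pos = data.find(from, pos)) != std::string::npos) {
        data.replace(pos, from.size(), to);
        pos += to.size();  // continue after the text just inserted
    }
    return data;
}

// Turn both spellings of U+00A0 into a plain ASCII space.
std::string stripNbsp(const std::string& raw) {
    std::string out = replaceAll(raw, "\xC2\xA0", " ");  // raw UTF-8 bytes
    return replaceAll(out, "=C2=A0", " ");               // quoted-printable escape
}
```

If a stray character still appears after this, it may be worth checking that the bytes are decoded as UTF-8 rather than as Latin-1, since C2 read as Latin-1 is a visible accented letter.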

Related

File with first bit of every byte set to 0

I was given a file that seems to be encoded in UTF-8, but every byte whose first bit should be 1 has it set to 0 instead.
E.g. in a place where one would expect the Polish letter 'ę', encoded in UTF-8 as \o304\o231, there is \o104\o031. Or, in binary, there is 01000100:00011001 instead of 11000100:10011001.
I assume that this was not done on purpose by an evil file creator who enjoys my headache, but rather is the result of some erroneous operations performed on a correct UTF-8 file.
The question is: what "reasonable" operations could be the cause? I have no idea how the file was created. It was probably exported by some unknown software, then could have been compressed, uploaded, copied & pasted, converted to another encoding etc.
I'll be grateful for any idea : )
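One candidate, offered as a guess rather than a diagnosis, is some step that treated the data as 7-bit and cleared the top bit of each byte (a mask with 0x7F). The sketch below uses a hypothetical helper stripHighBit to show that such a mask turns the UTF-8 bytes of 'ę' into exactly the bytes observed in the file.

```cpp
#include <cassert>
#include <string>

// Clear the most significant bit of every byte, as a 7-bit-only step would.
std::string stripHighBit(const std::string& bytes) {
    std::string out = bytes;
    for (char& c : out) {
        c = static_cast<char>(static_cast<unsigned char>(c) & 0x7F);
    }
    return out;
}
```

With this helper, assert(stripHighBit("\304\231") == "\104\031") holds, which matches the example in the question. Note that the operation is lossy: the cleared bits cannot be recovered from the damaged file alone.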

wxWidgets and UTF8 - some characters missing

So I have this file encoded in UTF8. I load it and print like this:
char buffer[2048] = {0};
FILE *pFile = fopen("D:/localization.csv","rb");
// Read at most sizeof(buffer) - 1 bytes so the buffer stays NUL-terminated.
size_t iret = fread(buffer,1,sizeof(buffer) - 1,pFile);
fclose(pFile);
wxString strMessageText = wxString::FromUTF8(buffer);
wxMessageBox(strMessageText);
The problem is that when the text contains some "invalid" characters, it doesn't get created (length of strMessageText is 0). I noticed, for instance, that Danish or German characters are fine but when I put Polish or Russian chars in the text file the wxString::FromUTF8 function fails to create proper text. Any idea?
If the file contains correctly encoded UTF-8 text, wxString::FromUTF8() will decode it. If it doesn't, you can still use wxMBConvUTF8 with e.g. MAP_INVALID_UTF8_TO_OCTAL to preserve even incorrectly encoded bytes in the input, but this isn't a good idea, in general.
I found the solution here: https://forums.wxwidgets.org/viewtopic.php?f=1&t=41068
It turned out that my wxWidgets lib was out of date. I had version 2.8.12 and updated to 3.0.2 and it's fine.
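To tell "the library is at fault" apart from "the bytes are not valid UTF-8", it can help to validate the buffer independently of wxWidgets. The following is a minimal sketch in plain C++; isValidUtf8 is a made-up name, not a wxWidgets function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return true if `s` is well-formed UTF-8.
// Rejects overlong forms, UTF-16 surrogates and values above U+10FFFF.
bool isValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;      // total bytes in this sequence
        unsigned long cp = 0;     // decoded code point
        unsigned long minCp = 0;  // smallest code point allowed for this length
        if (b0 < 0x80) {
            len = 1; cp = b0; minCp = 0x0;
        } else if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F; minCp = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F; minCp = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07; minCp = 0x10000;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (i + len > n) return false;  // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minCp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += len;
    }
    return true;
}
```

A fixed-size read like the one in the question can also cut a multi-byte character in half at the end of the buffer; this check reports such a truncated tail as invalid.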

Issues with getline/file reading in Windows

I created some .txt files on my Mac (didn't think that would matter at first, but...) so that I could read them in the application I am making in (unfortunately) Visual Studio on a different computer. They are basically files filled with records, with the number of entries per row at the top, e.g.:
2
int int
age name
9 Bob
34 Mary
12 Jim
...
In the code, which I originally just made (and tested successfully) on the Mac, I attempt to read this file and similar ones:
Table TableFromFile(string _filename){ //For a database system
    ifstream infile;
    infile.open(_filename.c_str());
    if(!infile){
        cerr << "File " << _filename << " could not be opened.";
        exit(1);
    }
    //Determine number attributes (columns) in table,
    //which is number on first line of input file
    std::string num;
    getline(infile, num);
    int numEntries = atoi(num.c_str());
...
...
In short, this causes a crash! As I looked into it, I found some interesting "Error reading characters of string" issues and found that numEntries is getting some crazy negative garbage value. This seems to be caused by the fact that "num", which should just be "2" as read from the first line, is actually coming out as "ÿþ2".
From a little research, it seems that these strange characters are formatting things...perhaps unicode/Mac specific? In any case, they are a problem, and I am wondering if there is a fast and easy way to make the text files I created on my Mac cooperate and behave in Windows just like they did in the Mac terminal. I tried connecting to a UNIX machine, putting a txt file there, running unix2dos on it, and putting it back into VS, but to no avail...still those symbols at the start of the line! Should I just make my input files all over again in Windows? I am very surprised to learn that what you see is not always what you get when it comes to characters in a file across platforms...but a good lesson, I suppose.
As the commenter indicated, the bytes you're seeing are the byte order mark. See http://en.wikipedia.org/wiki/Byte_order_mark.
"ÿþ" is the byte pair 0xFF 0xFE, the UTF-16 "little endian" byte order mark. The "2" is your first actual character. In little-endian UTF-16, characters below 256 are stored as byte pairs of the form nn 00, where "nn" is the usual ASCII (or Latin-1) code for that character, so something trying to read the bytes as ASCII or UTF-8 will do OK until it reaches the first null byte.
If you need to puzzle out the Unicode details of a text file the best tool I know of is the free SC Unipad editor (www.unipad.org). It is Windows-only but can read and write pretty much any encoding and will be able to tell you what there is to know about the file. It is very good at guessing the encoding.
Unipad will be able to open the file and let you save it in whatever encoding you want: ASCII, UTF-8, etc.
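A quick programmatic check is also possible. The sketch below inspects the first bytes of a file's contents for the common byte order marks; the Bom enum and the detectBom name are made up for illustration, and the UTF-32 marks are deliberately left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the leading bytes of `data` for a byte order mark.
// Simplification: FF FE 00 00 (UTF-32 LE) is reported as Utf16LE here.
Bom detectBom(const std::string& data) {
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(data[i]); };
    if (data.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF) {
        return Bom::Utf8;
    }
    if (data.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE) return Bom::Utf16LE;
    if (data.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}
```

For the file in the question, the first line read by getline would start with FF FE, so detectBom would return Bom::Utf16LE. That suggests the file needs to be re-saved as ASCII or UTF-8, or decoded as UTF-16, before atoi can make sense of it.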

How can I do a Get on an InputStream?

One annoying thing about encoded packages is that they have to be in a separate file. If we want to distribute a simple self-contained app (encoded), we need to supply two files: the app "interface", and the app package.
If I place all the content of the encoded file inside a string, and transform that string into an InputStream, I'm halfway to viewing that package content as a file.
But Get, which to my knowledge is the only operation (also used by Needs) that has the decoding function, doesn't work on Streams. It only works on real files.
Can someone figure out a way to Get a Stream?
Waiting for Mathematica to arrive on my iPhone so couldn't test anything, but why don't you write the string to a temporary file and Get that?
Update
Here's how to do it:
encoded = ToFileName[$TemporaryDirectory, "encoded"];
Export[encoded, "code string", "Text"]; (*export encrypted code to temp file *)
It's important to copy the contents of the code string from the ASCII file containing the encoded code using an ASCII editor and paste it between existing empty quotes (""). Mathematica will then do automatic escaping of backslashes and quotes that may be in the code. This file has been made earlier using Encode. Can't do it here in the sample code as SO's Markdown messes with the string.
Get[encoded] (* get encrypted code and decode *)
DeleteFile[encoded] (* Remove temp file *)
Final Answer
Get doesn't appear to be necessary for decoding. ImportString does work as well:
ImportString["code string", "NB"]
As above, paste your encoded text from an ASCII editor straight between the "" and let MMA do the escaping.
I don't know of a way to Get a Stream, but you could store the encoded data in your single package, write it out to a temp file, then read the temp file back in with Get.
Just to keep things up to date:
Get works with streams since V9.0.

RSS reader Error : Input is not proper UTF-8 when use simplexml_load_file()

I'm using simplexml_load_file method for parsing feed from external source.
My code looks like this:
$rssFeed['DAILYSTAR'] = 'http://www.thedailystar.net/latest/rss/rss.xml';
$rssParser = simplexml_load_file($rssFeed['DAILYSTAR']);
The output is as follows :
Warning: simplexml_load_file() [function.simplexml-load-file]: http://www.thedailystar.net/latest/rss/rss.xml:12: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x92 0x73 0x20 0x48 in C:\xampp\htdocs\googlebd\index.php on line 39
It ultimately stops with a fatal error. The main problem is that the site's character encoding is ISO-8859-1, not UTF-8.
Can I read this feed using this method (the SimpleXML API)?
If not, is any other method available?
I've searched through Google but no answer. Every method I applied returns with this error.
Thanks,
Rashed
Well, well, when I retrieve this content using Python, I get the following:
'\n<rss version="2.0" encoding="ISO-8859-1">\n [...]
<description>The results of this year\x92s Higher Secondary Certificate
The feed declares ISO-8859-1, but \x92 is not a printable character in that set; in Windows-1252 it is the right single quotation mark (the closing curly quote), used here as an apostrophe. So the page throws an encoding error, and as per the XML spec, clients should be "strict" and not fix errors.
You can retrieve it and filter out the non-ISO-8859-1 characters in some fashion, or better, convert the encoding using mb_convert_encoding() before passing the result to your RSS parser.
Oh, and if you want to incorporate the result into a UTF-8 page, you may have to convert everything to UTF-8, though this is English, which might not even require any different character encodings, if all turns out to be ASCII after all.
We ran into the same issue and used utf8_encode to change the encoding from ISO-8859-1/latin-1 to UTF-8 and get past the error.
$contents = file_get_contents($url);
simplexml_load_string(utf8_encode($contents));
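One caveat with a plain ISO-8859-1 to UTF-8 conversion: it gets past the parser error, but byte 0x92 becomes the control character U+0092 rather than the apostrophe the author intended. The sketch below illustrates the difference in C++; the function names are made up, and the Windows-1252 variant special-cases only the single byte 0x92 as an assumption for brevity instead of carrying the full mapping table.

```cpp
#include <cassert>
#include <string>

// Append code point `cp` (assumed below 0x10000 in this sketch) as UTF-8.
void appendUtf8(std::string& out, unsigned cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// ISO-8859-1: every byte maps to the code point with the same value.
std::string latin1ToUtf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) appendUtf8(out, b);
    return out;
}

// Like latin1ToUtf8, but reads byte 0x92 as Windows-1252, i.e. U+2019.
// Only this one byte is special-cased; the other 0x80..0x9F bytes are not.
std::string cp1252QuoteToUtf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) appendUtf8(out, b == 0x92 ? 0x2019u : b);
    return out;
}
```

For the "year\x92s" fragment above, the first function yields the bytes C2 92 for the apostrophe and the second yields E2 80 99. In PHP, converting from Windows-1252 with mb_convert_encoding() is the corresponding route if the curly quote should survive.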
