Convert encoding from charset iso-8859-1 to UTF-8

I am trying to finish exporting a 1000-article website (ASP, SQL Server) with categories and tags into a WordPress blog. The articles were originally written in Microsoft Word and included many non-ASCII characters. They were then copied and pasted into Microsoft Access. The articles are currently stored in a SQL Server 2008 database and displayed on a website using the iso-8859-1 charset.
I am using the default WordPress import/export XML file (a WordPress eXtended RSS (WXR) file), which I copied from the file produced when exporting a blog from WordPress. This file requires UTF-8 encoding.
My problem is that iso-8859-1 characters break the importer, and many articles are not fully imported. Examples include accented characters such as the ï in naïve, and funny characters such as “ and ’.
My question is how to clean up all the text. I can create a replace function to clean up the funny quotes, but there will always be a random word like naïve that causes a problem.
What is the simplest way to convert the encoding of all the text from iso-8859-1 to UTF-8?

See http://en.wikipedia.org/wiki/Iconv:
iconv is a computer program and a standardized API used to convert between different character encodings.
If you are trapped on pure Windows (i.e. not even Cygwin), and you don't agree that it's probably easiest to copy the files to a Unix system and perform the conversion there, http://www.unicodetools.com/ has a bunch of conversion tools.
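If neither iconv nor a Unix machine is available, the same decode-then-re-encode step can be sketched in a few lines of Python. This is only an illustration, not part of the original answer: the function name to_utf8 is hypothetical, and the default of cp1252 (Windows-1252) is an assumption based on the curly quotes in the question, which Windows-1252 defines in the 0x80-0x9F byte range where iso-8859-1 has only control codes. If the data really is strict iso-8859-1, pass "iso-8859-1" instead.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from a legacy single-byte encoding and re-encode as UTF-8.

    source_encoding defaults to cp1252 (an assumption, see above);
    use "iso-8859-1" if the data is strictly Latin-1.
    """
    text = raw.decode(source_encoding)  # legacy bytes -> Unicode string
    return text.encode("utf-8")         # Unicode string -> UTF-8 bytes


# Sample legacy bytes: 0xEF is "ï", 0x93 and 0x94 are curly double quotes.
sample = b"na\xefve \x93quoted\x94 text"
converted = to_utf8(sample)
print(converted)
```

Python's cp1252 codec raises UnicodeDecodeError on the few bytes Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so text containing them would need raw.decode(source_encoding, errors="replace") or manual cleanup first.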

Related

Has anyone heard about B-FORP?

Recently I got a project where I need to read a binary file that was encoded with a codec called B-FORP, which seems to encode the UI files of the program. The beginning of the file is readable if we open it as UTF-8, for example.
Has anyone heard about B-FORP? I googled it but found nothing. All I know is that a company called BITS seems to have developed it.

Sporadic character encoding issue

Certain characters, most notably all apostrophes, are showing up as question marks (?) on my website for certain users. Normally I'd say "easy, that's a UTF-8 char encoding problem".
But here's what's odd -- the issue only appears for a few international users. On American computers the site appears 100% fine. There is no explicit internationalization being done on the site; everyone gets the same content.
I have the usual headers and meta in place:
in header:
Content-Type: text/html;charset=UTF-8
in html:
<head><meta charset="utf-8"/></head>
The content is coming via
Unix file system -> SOLR -> Spring Boot -> Apache
Any ideas why the same page appears differently for certain international users? My best guess is that perhaps the content is not really UTF-8 encoded and that some browsers are better than others at failing gracefully? As far as I can ascertain though the content truly is UTF-8 -- I ran tests against the raw data from the filesystem according to the question "How can I be sure of the file encoding?"
Even among international users though, one user has the issue on Chrome while another's Chrome is fine and they only see the problem on Safari. (Chrome and Safari are both 100% fine for me.) I've personally verified that the problem does exist on their computers, but cannot reproduce it on my own. Coincidentally both verified cases are using Mac-family products (a MacBook and an iPad), dunno if that's significant.
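For reference, one way to test whether raw bytes really are UTF-8 is a strict decode, sketched below in Python. This is an illustration only, not the test the poster ran: the helper name is_valid_utf8 is hypothetical. It checks the bytes at one point in the pipeline and says nothing about what later stages (SOLR, Spring Boot, Apache) do to them.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")  # strict mode: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True


# A curly apostrophe encoded as UTF-8 is the three bytes E2 80 99.
print(is_valid_utf8(b"it\xe2\x80\x99s"))
# The same apostrophe in Windows-1252 is the single byte 0x92,
# which is not a valid UTF-8 sequence on its own.
print(is_valid_utf8(b"it\x92s"))
```

Pure ASCII text passes this check too, so a clean result does not prove that a file containing no special characters was saved with UTF-8 in mind.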

Can I parse a pdf with powershell, using no extra libraries?

I would like to parse a PDF with a Windows PowerShell script. Is it possible to do this without any open source libraries, though? I am in a situation at work where using them is viewed as a security risk.
The PDFs are in an expected text format, and I need to extract two numbers from them to be used later.
Sample PDF:
Obviously, the Device ID and Agreement number are crossed out for security, but those are the two strings I care about.

How to translate Chinese characters in an image to English?

I recently bought a barcode scanner online, and when I got it I noticed that the entire user guide is in Chinese (Simplified). I was wondering if there is some sort of OCR software out there that could take a scanned copy (.jpg) and turn it into an English-translated copy (.txt or .doc)? I have tried JOCR.exe, and that works perfectly with every language except Chinese, Japanese, and other foreign languages. To use those languages I need to acquire the language packs for MS Office's OCR plugin. If OCR is not the best method in this situation, then what would be? Any and all advice would be appreciated!

Flex and full UTF-8 Support?

Doing some software review for a RIA project - I was hoping to use Flex but need to make sure it has full UTF-8 support - I'm talking all fonts for all languages - everything from English, to Finnish, to Russian, to Japanese, to Thai, to Sanskrit...
I haven't worked with Flash/Flex/ActionScript in years, but I seem to remember it's up to the font you embed into the movie. So if you have, say, Arial Unicode MS, which has the full character set, you simply include it in the movie and the support is there to display the characters? Is this right?
Also, when including that level of character support (that large a font), how much does that bloat the application?
Any insight would be helpful as I am still in the information-gathering stage.
Other software suggestions would also be appreciated.
Thanks
JD
If you use ActionScript 3 (and you should), all strings are Unicode.
And if you use the newer text components (Flash 10), then the text engine supports complex scripts (including Russian, Japanese, and Indic scripts).
All you would have to do is make sure you have the right fonts. You might embed your own (with the mandatory bloat that you can't avoid if you embed a 30 MB Chinese font :-)
In practice you will probably just use the system fonts.
Among other reasons, because there are no free, good-quality Chinese/Japanese fonts. And you have no right to embed a font without the proper licensing (and the prices are not low :-)
