I have been parsing this website for my Windows Phone app using Html Agility Pack.
First I download it using the WebClient class and then pass the result to HtmlDocument.
There were some problems with iso-8859-1 encoding, but HtmlEntity.DeEntitize solved the problem of the letters Ö and ä showing as &Ouml; and &auml;.
But the document still has some Scandinavian characters (äö) in some unknown encoding (which are shown as: �).
Those letters show perfectly in Chrome.
The site is: http://reittiopas.tampere.fi/mobile/fi/
Windows Phone only supports a small set of encodings, and iso-8859-1 is not one of them!
To solve this, just create the encoding handler with Silverlight Encoding Generator, convert the text, and then use HTML Agility Pack as you are now!
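To illustrate what such an encoding handler has to do, here is a minimal sketch in Python (not C#, and not the generator's actual output). ISO-8859-1 maps each byte to the Unicode code point with the same value, so a hand-written decoder is very short. The function name `decode_latin1` and the sample string are assumptions made for this illustration.

```python
def decode_latin1(raw: bytes) -> str:
    """Decode ISO-8859-1 by hand: byte value N is Unicode code point N."""
    return "".join(chr(b) for b in raw)


# Sample bytes as a server using ISO-8859-1 would send them.
raw = "Tampere äö Ö".encode("iso-8859-1")

# Decoding the same bytes as UTF-8 cannot work: the bytes for ä and ö are
# not valid UTF-8 sequences, so they become U+FFFD replacement characters.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the correct mapping restores the original text.
right = decode_latin1(raw)
```

The key point is to decode the downloaded bytes with the right encoding before handing the string to the HTML parser.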
Related
I'm using the HTML5 Boilerplate template to create a basic template for a CMS. How can I make sure this template can display Arabic or any Indian language? I noticed there is a <meta charset="utf-8"> tag in the header. But when I type in an Indian language the text does not show up; I see ???? marks instead. Do I have to change the font-family from the default sans-serif? Thanks in advance.
The problem was with the encoding of my HTML file. I'm using Notepad++ and its encoding was set to 'Encode in UTF-8 without BOM'. I changed that to 'Encode in UTF-8'. Now it's fine.
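For reference, the only difference between those two settings is a three-byte marker (the byte order mark, EF BB BF) at the start of the file. A small Python sketch, assuming Python's `utf-8-sig` codec as the stand-in for "UTF-8 with BOM" and an arbitrary sample string:

```python
import codecs

text = "नमस्ते مرحبا"  # sample Hindi and Arabic text

without_bom = text.encode("utf-8")      # plain UTF-8
with_bom = text.encode("utf-8-sig")     # UTF-8 preceded by the BOM

# The BOM is the only difference between the two byte sequences.
has_bom = with_bom.startswith(codecs.BOM_UTF8)
same_payload = with_bom[len(codecs.BOM_UTF8):] == without_bom

# "utf-8-sig" strips the BOM on decoding if present, so it reads both variants.
decoded = [b.decode("utf-8-sig") for b in (without_bom, with_bom)]
```

Some tools use the BOM as a hint that a file is UTF-8, which may be why switching the setting changed the result here.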
I am converting HTML templates into PDF/A format using JODConverter 3 (with OpenOffice 3.x). This works fine in the development environment (using Eclipse and a JRE).
But when I run the same program in production (Linux, JBoss 5), templates that contain Hindi characters show ??? in the output PDF.
English characters work fine.
I tried running my program from the command line without the app server and got the same output.
java -cp bin:PATH/JARNAME.jar:lib ConvertToPDFA encoding=UTF8 x (Not working).
HTML is also UTF-8 encoded.
Please suggest the problem area.
Check for a Hindi font (TTF file) under /usr/share/fonts/TTF.
If it is missing, create that directory and place the required TTF font file in it.
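A quick way to see which TrueType fonts are present in that directory is sketched below in Python. The helper name `list_ttf_fonts` is hypothetical; whether a listed font actually covers Devanagari still has to be checked separately.

```python
from pathlib import Path
from typing import List


def list_ttf_fonts(font_dir: str) -> List[str]:
    """Return the sorted names of all .ttf files under font_dir.

    Returns an empty list if the directory does not exist.
    """
    root = Path(font_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.rglob("*") if p.suffix.lower() == ".ttf")


if __name__ == "__main__":
    # Directory taken from the answer above; adjust for your distribution.
    for name in list_ttf_fonts("/usr/share/fonts/TTF"):
        print(name)
```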
I have windows XP at home - home ed, with SP3. In any case, at College, they have windows 7. So, basically when I saved my documents and brought them here, things messed up. I was writing up a short bio.
I was coding my website, and so as usual I had used charset utf-8, the standard. But when I get home and I verify my website (locally), I see the weird characters appear! The triangle and the question mark inside it. So, then I'm like WTF? So I decide to go online and check which charset is better. So randomly, I fall onto windows-1252. Voila, it worked! But then, I decided to re-use charset utf-8, being the standard. I don't want to mess up my website lol.
So I basically go back inside my html document, just to notice that very weird characters appeared. So I delete them and replace them with the apostrophes that were originally there. Finally, I check my website, and the apostrophes correctly appear.
So, what the hell is going on??? And should I keep using utf-8?
It sounds like the content of the webpage is actually encoded as Windows-1252 by whatever editor you are using, but you are manually writing a <meta> tag that states UTF-8 instead. That would account for the behavior you describe. An explicit charset declaration must match the actual encoding used by the data. When you tell your editor to save the document, make sure it is saving the data in the correct encoding you are expecting. Some editors do support multiple encodings, so don't just blindly use a default encoding if multiple encodings are available.
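The mismatch can be reproduced in a few lines. This Python sketch assumes the apostrophes in question were the curly kind (U+2019), which is a common culprit; it shows what a browser would display for each combination of actual and declared encoding.

```python
apostrophe = "\u2019"  # right single quotation mark (curly apostrophe)

# The same character as stored by an editor saving in each encoding.
cp1252_bytes = apostrophe.encode("cp1252")  # one byte: 0x92
utf8_bytes = apostrophe.encode("utf-8")     # three bytes: 0xE2 0x80 0x99

# File saved as Windows-1252 but declared as UTF-8:
# 0x92 is not valid UTF-8, so the replacement character U+FFFD is shown.
shown_when_declared_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

# File saved as UTF-8 but declared as Windows-1252:
# each of the three bytes is shown as a separate character.
shown_when_declared_cp1252 = utf8_bytes.decode("cp1252")
```

Here `shown_when_declared_utf8` is the question-mark symbol from the question, and `shown_when_declared_cp1252` is the three-character sequence "â€™", the kind of "very weird characters" that appear in the other direction.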
I have tried almost everything suggested, from font extensions to the jasperreports.properties file, but I am unable to find the solution. Please assist.
I am using iReport Designer to create a PDF that contains UTF-8 encoded text. The font used in all the fields is Arial Unicode MS. The generated PDF shows the Unicode characters correctly in the default previewer and in the iText PDF previewer.
However, when I use my Spring application to display the page, I get blanks instead of the Unicode characters. I have created jasperreports-fonts-arialuni.jar and included it in my project classpath, but I still do not get those Unicode characters.
Can someone please assist?
Thanks,
My C# .NET 3.5 application has an option to export text to PDF. I am using ReportingCloud (based on RDL) as the generation engine. However, Cyrillic text is shown incorrectly in the resulting PDF. How can I generate a PDF with Cyrillic text correctly? A method to generate UTF-8 will also do.
UPD: Particularly, how to embed right fonts into PDF?
I am not familiar with ReportingCloud, so perhaps this is not the easiest answer to your question. But for really great looking PDFs with UTF-8 and Cyrillic support you could use LaTeX. It is a markup language like HTML, just for typeset documents such as PDFs, so you have to generate some source code. It is also possible to embed the desired fonts.
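Generating the source could look roughly like the following Python sketch (the question is about C#, so treat this only as an outline of the idea). It assumes the document is compiled with XeLaTeX or LuaLaTeX, which read UTF-8 and load system fonts through the `fontspec` package. The function names and the choice of DejaVu Serif are assumptions; any installed font with Cyrillic glyphs should do.

```python
# Characters with special meaning in LaTeX and their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_",
    "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def build_tex(body: str, font: str = "DejaVu Serif") -> str:
    """Build a minimal XeLaTeX/LuaLaTeX document containing body."""
    lines = [
        r"\documentclass{article}",
        r"\usepackage{fontspec}",        # Unicode font loading
        r"\setmainfont{" + font + "}",   # assumed to be installed
        r"\begin{document}",
        escape_latex(body),
        r"\end{document}",
    ]
    return "\n".join(lines) + "\n"


source = build_tex("Привет, мир! 100% UTF-8")
```

Writing `source` to a file with UTF-8 encoding and running `xelatex` on it should then produce the PDF.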