I have need where I am writing sentences in MS Word (any version is fine) in a foreign language (specifically Arabic and/or Urdu) with custom font and size and I would like convert those sentences to pictures using a script or something. This is because I need to post these pictures on a website where I cannot upload custom fonts or store them along with the webpage. Right now the way we do it is to crop pictures out of the word document. Is there way to do this in VBScript or any other scripting language?
Related
Is there anything we need to change on pdf template level to support Chinese or Japanese . I am using a fillable template. While writing it with my pdf library(using tcpdf) it is giving some garbled text. While copy pasting Chinese text to this template field also gives some squares where as English is filling perfectly. Does anyone out there have a sample fillable pdf template that supports Chinese to share?
generally you need to embed font to properly display foreign characters in a PDF. I've not tried using tcpdf but saw posts that say embedding a TTF (true type font) of the correct language seems to do the trick.
The question seems to be weird, but I need to ask this, since I am witnessing a quite interesting output when I compare text as image and graphics as image.
Ideally I am in process of identifying an tool, or algorithm to compare two pdfs, generate output which will highlight the difference between them.
There are possibilities in pdfs, which will have text as image format (legacy text on papers, are converted to pdfs).
and we are doing migration of those legacy pdfs, and finally we are comparing with legacy and converted pdf output.
I am evaluating couple of tools like Adobe dc pro, i-net pdfc and power pdf etc, for comparing two pdfs.
While evaluating, I am able to see graphic images are getting compared(not accurate either) on either side of the pdfs. Where as text as images are completely ignored, unanimously same results in all the tools.
But I am more interested in text as image, since we deal more of legacy text pdfs.
Below, is attached graphic image comparison result, where it could able to capture the differences between the images.
But when I compare text image, differences are not highlighted in the tool.
What I understand from this, text is not compared as image graphics, and tool is completely ignoring the comparison. I would like have clarification whether my assumption is correct.
Secondly, I would like to know how to compare text image in pdfs to generate the differences?.
I'm working for the company that is author of i-net PDFC so I'll answer your first question as well:
Your assumption is correct. i-net PDFC is able to compare images and shapes, but it cannot detect if some content completely changed it's meaning, e.G. a line shape that is used to draw a letter or in your case an image that has to be recognized as text. Recognizing ASCII art as image won't work for the same reason either. Such cases will always be detected as differences even though their visual appearance is similar.
On your second question: Using an OCR conversion tool for one or both documents is a common solution to this problem. A simple image comparison of the compared pages in unlikely to work due to the different font styles and line wrappings in the converted file.
Please note that most OCR applications will use the rendered page images for the recognition. This may lead to incorrect recognition results even if there are no images in the PDF file.
i-net Software is aware of this general issue and an OCR module is currently in development. It'll provide an option to apply the recognition solely to the images in the PDF files.
I've read much about PDF extractions and libraries (as iText) but i just haven't found a solution to extract images and text (with coordinates) from a PDF.
The task is to scan PDF with catalog of products and extract each image. There is an image code printed next to each image and also a list of product codes for products that are shown on the image.
I know that there is no way to extract structured info from a PDF like this but with coordinates of all image and text objects I could write code to identify linked text by its distance from the image. Then I could split text using a RegExp and find out what is a product code, what is an image code etc.
Could you recommend a good and working solution for the task?
Use XPDF (http://www.foolabs.com/xpdf/)
It can extract all the characters in the PDF with co-ordinates (pdftotext -bbox [sourcefile] [outputfile]) and also all the images and SVGs in the PDF.
It's open source (GPLv2) and supports a lot of additional extraction functionalities as well.
Several Java libraries can do this. Have you looked at JPedal or PdfBox?
If a commercial library is an option for you, you could try Amyuni PDF Creator .Net or Amyuni PDF Creator ActiveX. You could use the method IacDocument.GetObjectsInRectangle to retrieve all the "graphic objects" of your interest, then use the ObjectType attribute to separate images from text. The library already provides an algorithm for putting close text together. From the documentation:
IacDocument.GetObjectsInRectangle Method
The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.
Usual disclaimer applies.
I'm trying to output RTF (Rich Text Format) from a Ruby program - and I'd prefer to just emit RTF directly without using the RTF gem as I'm doing pretty simple stuff.
I would like to highlight specific characters in a DNA sequence alignment and from the docs it seems that I can either use \highlightN ... \highlight0 or \cbN ... \cb1
The problem is that I cannot get \cb to work in either Word:Mac 2008 or Mac TextEdit (\cf works fine so I know it's not a color table issue)
\highlight does work but seemingly only with two of the possible colors (black and red) and \highlight does not use the custom color table.
By creating simple docs in Word with character shading and saving as RTF I can see blocks of ridiculously verbose RTF code that presumably does what I want, but it is so impenetrable that I'm not seeing the wood for the trees.
Part of the problem may well be that Mac Word is just not implementing RTF properly. I don't have a Windows version of Word handy.
Anyone know the right way to shade blocks of text?
Thanks
--Rob
There is a note in the RTF Pocket Guide that says MS Word does not implement the \cb command. It says MS Word uses \chshdng0\chcbpatN (where "N" is the color number that you would use with \cb). The book recommends using something like the following for compatibility with programs that implement \cbN and/or \chshdng0\chcbpatN: {\chshdng0\chcbpat5\cb5 text}.
Note: The copy of the book I have was published in 2003, so it might be a bit out-of-date.
The sequence of RTF commands that seems to be most universally supported by RTF-capable applications is:
\chshdng10000\chcbpatN\chcfpatN\cbN
These commands:
set the shading to 100 percent
set the pattern foreground and background colors to the color from the color table (we're not actually specifying a shading pattern)
set the character background to the color from the color table
Word was the most difficult application to properly render background colors in:
Despite what the latest (1.9.1) RTF spec says, Word 2013 does not resolve \highlightN colors from the \colortbl. Instead, \highlightN maps to a predefined list of colors. It looks like those colors come from the 1.5 version of the RTF spec.
Regarding \cb, the 1.9.1 spec contains this helpful pointer at the end of the section on Color Table:
Note: Windows versions of Word have never supported \cbN, but it can be emulated by the control word sequence \chshdng0\chcbpatN.
This is almost a useful suggestion, except that if you read the documentation for \chshdngN:
Character shading. The N argument is a value representing the shading of the text in hundredths of a percent.
So, 0 turns out to not be a very useful value; 100 / 0.01 gives us the 10000 we used in the sequence above.
Use WordPad to create RTF documents, not Word. WordPad creates much simpler documents, i.e. approaching human-readable.
I use WordPad every time I need to display formatted text in a WinForms application, and need something that the RichTextBox control can handle being assigned to its Rtf parameter.
i am using zend_pdf to create pdf files from persian language texts.
i have two problem:
1- i want to change document's direction to right to left as persian, arabic and some other languages are.
2- i use Zend_Pdf_Font::fontWithPath('myfont.ttf'); to set a persian font. but when i render the pdf document i see that characters are separated one by one. but in persian characters of a word are joint together.
thx in advance.
i finally used tcpdf class for generating pdf files using php. it has good support for RTL languages and UTF. it also has a good fiture for converting html to pdf.