How can I output .doc files with bolded and colored text - ruby

I need to output text to a .doc file. At the moment I just write to a file as usual and put .doc at the end of the file name:
File.open('output_file.doc', 'a+') {|x| x.write(str)}
The issue is I want to make some of the text red and bold. How can this be achieved? I am using ruby, but I can easily switch to jruby thanks to the amazingness that is rvm, so if there are java libraries for this, that'd be great as well.

The short answer: write .rtf and then convert it to .doc using Word or OpenOffice. The following .rtf file writes "normal text red text more normal text." with "red text" in red and bold:
{\rtf1\ansi\ansicpg1252\cocoartf1038\cocoasubrtf350
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;\red255\green0\blue0;}
\margl1440\margr1440\vieww13280\viewh10420\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\ql\qnatural\pardirnatural
\f0\fs24 \cf0 normal text
\b \cf2 red text
\b0 \cf0 more normal text.}
The long answer:
Strings are just plain text, so there is no command that can make them bold. This is true of files in general, not just of how Ruby works with files.
What text editors do is treat certain key strings within the file as commands to render the text in a certain way. For example, double asterisks surround bold text in the Stack Overflow editor. The file format determines these rules.
.rtf is a basic file format that has the features you want and is easy to convert to .doc using MS Word or OpenOffice. The advantage of .rtf is that it is human readable. So you can write an .rtf file with red text, rename it to .txt, open it in a text editor and see what "decorations" the red font added. Play around with the parameters.
If you are curious, the complete .rtf specifications can be found here:
http://www.biblioscape.com/rtf15_spec.htm
What's all the garbage at the top? That is header stuff. Fortunately you don't need to add more header material to add more text.
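To tie this back to the Ruby in the question, here is a minimal sketch of generating such a file from code. The helper names (rtf_escape, red_bold, rtf_document) are made up for this example, the header is a trimmed-down version of the one above, and it assumes the text is plain ASCII (non-ASCII characters need extra RTF escaping that is not handled here).

```ruby
# Trimmed-down header: one font, and a color table whose entry 1 is red.
# (In the hand-written file above red is entry 2; here it is entry 1.)
RTF_HEADER = [
  '{\rtf1\ansi\deff0',
  '{\fonttbl\f0\fswiss\fcharset0 Helvetica;}',
  '{\colortbl;\red255\green0\blue0;}',
  '\f0\fs24 '
].join("\n")

# Backslashes and braces have special meaning in RTF, so escape them.
def rtf_escape(text)
  text.gsub(/[\\{}]/) { |c| "\\" + c }
end

# Wrap text in a group so bold and color end at the closing brace.
def red_bold(text)
  '{\b\cf1 ' + rtf_escape(text) + '}'
end

# Join already-formatted parts into a complete document.
def rtf_document(*parts)
  RTF_HEADER + parts.join + "}"
end

doc = rtf_document(
  rtf_escape("normal text "),
  red_bold("red text"),
  rtf_escape(" more normal text.")
)
File.write("output_file.rtf", doc)
```

Using a group for the red text means you don't have to switch bold and color off again by hand, which is what the \b0 \cf0 in the hand-written file does.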

Related

Windows Tesseract OCR getting scattered HOCR out put instead of clean standard format

Quick help would be highly appreciated. I am extracting the text from a TIFF image with Tesseract OCR.
The output I am looking for is hOCR (HTML).
The content of the output is perfect, but the formatting looks very disorganized when I open it in Notepad.
When I open the same file in Notepad++, the formatting is clean.
The Windows command line I use is given below:
Tesseract "Path\image.tiff" "Path\output" HOCR
How do I get organized hOCR data, like the enclosed example, when I open the file in Notepad?
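The problem is not in Tesseract but in Notepad, most likely because the file uses Unix-style line endings (LF only), which older versions of Notepad do not show as line breaks. Use a more capable text editor such as Notepad++ or ConTEXT.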
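If the file really has to open cleanly in Notepad, and assuming the cause is LF-only line endings (a guess, the question does not confirm it), one option is to rewrite the file with Windows line endings. A small sketch in Ruby; the function names and file paths are only examples:

```ruby
# Convert every line ending to CRLF; existing CRLF pairs are left as they are.
def to_crlf(text)
  text.gsub(/\r?\n/, "\r\n")
end

# Read and write in binary mode so Ruby on Windows does not
# translate line endings on its own.
def convert_file(src, dst)
  File.binwrite(dst, to_crlf(File.binread(src)))
end

# Example call (paths are placeholders):
# convert_file("output.hocr", "output_crlf.hocr")
```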

AppleScript: renaming PDF with content of PDF

I am trying to do exactly what is described in the following thread:
AppleScript/Automator: renaming PDF with extracted text content of this PDF
So I am using Chino22's version, and there are two issues with it:
First, instead of the contents of the PDF, theFileContentsText gets some metadata stuff.
Second, although the script runs to the end, I get the following error for the last step:
error "The variable thisFile is not defined." number -2753 from "thisFile"
So, how do I get the text contents instead, and how do I define thisFile to the current pdf that is being processed in the loop?
Thanks in advance!
I would not expect the linked script to work.
Except for document metadata, extracting text content from PDF is notoriously difficult and unreliable, and not a road you want to go down if you can possibly avoid it. Adobe’s PDF file format is designed for printing, not for data processing. PDF files contain blocks of PostScript-like page drawing instructions, typically compressed. While it’s possible for PDFs also to include the original plain text for accessibility use, most PDF generators do not do this, so the only way to get the original text is by reconstructing it from those low-level drawing instructions, which is not a trivial job.
AppleScript’s read command only reads that raw file data; it does not parse it into drawing instructions, never mind translating those drawing instructions back into plain text. Change a PDF file’s extension to .txt and open it in a plain text editor, and you’ll see what I mean. Nasty.
If you need to work with the PDF’s original content (text, images, whatever), your best solution is to get those files before they were converted into a PDF.
If you must extract content from a PDF file, use an existing tool that knows how to do it.
For instance, if you’re lucky enough to have PDFs that contain XFDF (XML form) or accessibility data, there are 3rd-party apps and libraries to extract that content in readable form. I can’t think offhand of any that are AppleScriptable (Adobe Acrobat has only minimal AS support) so you’ll probably need to find one you can run from command line (do shell script in AS).
Or, if the PDFs have a consistent visual structure, a 3rd-party library such as Python’s PDFMiner (which I’ve used in the past) can identify blocks of characters by position and convert those back into strings with varying degrees of reliability (it has to convert font glyphs back into Unicode characters, guess at which characters are close enough to constitute a word, and where to insert space and return characters between those words). You’ll have to write some Python code to extract the bits you want, so look for tutorials to get started (or pay someone to write it for you).
But again, if you can possibly avoid having to extract text from PDF, you should. You will save yourself a lot of trouble.
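As an illustration of the "use an existing tool" route, here is a rough sketch in Ruby that shells out to pdftotext from Poppler. This is a different tool from the ones named above, and the sketch assumes pdftotext is installed and on your PATH. The function names and the "first non-blank line becomes the file name" rule are invented for the example; how well it works depends entirely on the PDFs.

```ruby
require "open3"

# Extract text with Poppler's pdftotext CLI (assumed to be installed).
# The trailing "-" tells pdftotext to write the text to standard output.
def pdf_text(path)
  out, status = Open3.capture2("pdftotext", "-layout", path, "-")
  raise "pdftotext failed for #{path}" unless status.success?
  out
end

# Build a file name from the first non-blank line of the extracted text.
# Returns nil if no usable text was found.
def name_from_text(text, max_length = 60)
  first = text.each_line.map(&:strip).find { |line| !line.empty? }
  return nil unless first
  base = first.gsub(/[^\w\- ]+/, "").strip.gsub(/\s+/, "_")[0, max_length]
  base.empty? ? nil : "#{base}.pdf"
end

# Example use (file name is a placeholder):
# new_name = name_from_text(pdf_text("scan_0001.pdf"))
# File.rename("scan_0001.pdf", new_name) if new_name
```

From AppleScript you would run a script like this with do shell script, as mentioned above.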

Spacing issue between letters while converting Word to PDF on Windows

I have a Word document (.docx) of Urdu text in the Jameel Noori Nastaleeq font. In Word the file is 10 pages long, but after exporting to PDF it is 11 pages long because every letter has extra space around it.
Can anyone please provide information?
Edited:
Please download the file from
File
It has to do with the XML formatting of Word. When text is pasted into Word (while the font is Jameel Noori Nastaleeq), Word places extra formatting between the words. That formatting looks fine in Word, but when the file is converted to PDF the extra space becomes visible. When the text is typed directly in Word, the formatting is applied to entire paragraphs rather than to individual words. That is why a typed document doesn't contain the extra spaces.

Converting bold text within a .doc to marked-up text programmatically

I am currently dealing with a large .docx file (roughly 400 pages). It is divided up into sections that are very easily digestible by humans and look like this:
Bold text
Written paragraph
This is perfectly humanly readable and great. Unfortunately we have an in-house program in our University that uses the mark-up of .docx files to sort them out/do some processing on them. By this I mean that sectioning a .doc/.docx using only bold markup is not enough, you must use the in-built tools within MS Office to do this (as below) :
So what I need to write is a simple script that will find the text that is bold within a .docx document and convert this text to properly marked up "Heading 1"s, or similar. It doesn't concern me whether or not the .docx file format is maintained or anything like this.
Is it possible to do this? What APIs/languages/tools should I start looking into to accomplish this relatively simple task?
Using a short VBA macro you can iterate over all paragraphs and change the style for all paragraphs containing only bold text into a heading style:
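Sub FormatBoldAsHeading()
    Dim p As Paragraph
    For Each p In ActiveDocument.Paragraphs
        ' Font.Bold is wdUndefined when a paragraph mixes bold and non-bold text,
        ' so only paragraphs that are bold throughout are converted.
        If p.Range.Font.Bold <> wdUndefined And p.Range.Font.Bold Then
            p.Style = WdBuiltinStyle.wdStyleHeading1
        End If
    Next
End Sub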

Batch file to remove text between 2 characters

Is it possible to write a Windows batch file that can delete all text between 2 characters, including the characters themselves?
I am dynamically generating text files that include a piece of text in HTML format. I want to extract only the non-HTML part of the text, meaning I want to remove all HTML tags from it.
So, I want a Windows batch file that takes a text file as input, removes all characters between < and > (including) and creates an output file. Can you please help me with this?
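This is not a batch file, but to make the requested transformation concrete, here is a minimal sketch in Ruby of removing everything between < and >, including the two characters. The function names and file names are only examples. It is a plain text substitution, not an HTML parser, so a > inside an attribute value or a comment would cut a tag short.

```ruby
# Remove every run that starts at "<" and ends at the next ">".
# [^>]* also matches newlines, so tags that span lines are removed too.
def strip_tags(text)
  text.gsub(/<[^>]*>/, "")
end

# Read the input file, strip the tags, write the result to the output file.
def strip_tags_in_file(src, dst)
  File.write(dst, strip_tags(File.read(src)))
end

# Example call (file names are placeholders):
# strip_tags_in_file("input.txt", "output.txt")
```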
