Is it possible to write a Windows batch file that deletes all text between two characters, including the characters themselves?
I am dynamically generating text files that include a piece of text in HTML format. I want to extract only the non-HTML part of the text; in other words, I want to remove all HTML tags from it.
So, I want a Windows batch file that takes a text file as input, removes all characters between < and > (inclusive), and creates an output file. Can you please help me with this?
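This is not a batch file, but the transformation itself (drop everything from each < up to the next >) can be sketched in Ruby, the language used elsewhere on this page. The names strip_tags and strip_tags_in_file are hypothetical, and the sketch assumes that no literal > appears inside a tag's attributes and that the input is not full HTML with comments or scripts, where a regular expression would not be enough.

```ruby
# Remove every run that starts with "<" and ends at the next ">",
# including the two delimiter characters themselves.
def strip_tags(text)
  text.gsub(/<[^>]*>/, '')
end

# Read the input file, strip the tags, and write the result to a new file.
# Both paths are supplied by the caller (hypothetical helper).
def strip_tags_in_file(input_path, output_path)
  File.write(output_path, strip_tags(File.read(input_path)))
end
```

It could be called as strip_tags_in_file('input.txt', 'output.txt'), where both file names are placeholders.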
Related
I would like to know if there is any way to extract just the relevant data from a PDF file. Suppose we have something like Name:John; can we somehow automate taking just this field value in order to store it somewhere, such as a predefined database or file? Thanks.
Use pdftotext to extract the text content from your PDF file. Then parse the resulting text file with your favorite programming language.
If your PDF doesn't contain real text, just images of text, you will need to use optical character recognition (OCR) software to extract the text.
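As a rough sketch of the parsing step in Ruby: it assumes pdftotext has already been run separately (for example pdftotext input.pdf output.txt) and that each field sits on its own line in Key:Value form, as in the Name:John example. The method name extract_fields and the sample text are hypothetical.

```ruby
# Collect "Key:Value" pairs from text that was already extracted from a PDF.
# Lines without a colon are skipped; the key is everything before the first colon.
def extract_fields(text)
  fields = {}
  text.each_line do |line|
    match = line.match(/\A\s*([^:]+?)\s*:\s*(.+?)\s*\z/)
    next unless match
    fields[match[1]] = match[2]
  end
  fields
end

# Hypothetical sample standing in for the contents of the extracted text file.
sample = "Name:John\nsome line without a field\nCity: Paris\n"
fields = extract_fields(sample)
puts fields['Name']
```

The resulting hash can then be written to a database or file of your choice. Real PDFs often produce messier text than this, so the pattern would likely need adjusting to the actual layout.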
I have a Word file containing multi-line text, including bullet points and headings, and I need to compare this text with the text in a web application. Is it possible to do this using UFT?
I have a Word document (.docx) of Urdu text in the Jameel Noori Nastaleeq font. Word shows it as a 10-page file, but after exporting to PDF it becomes an 11-page file because every letter contains extra space.
Can anyone please provide information?
Edited:
Please download the file from
File
It has to do with the XML formatting of Word. When any text is pasted into Word (while the font is Jameel Noori Nastaleeq), Word places extra formatting between the words. That formatting shows fine in Word; however, when the file is converted to PDF, the extra space becomes visible. When the text is merely typed in Word, the formatting is applied to entire paragraphs rather than to individual words. That is why a typed document doesn't contain the extra spaces.
I found the following issues with iText PDF conversion:
- if a Word file has multiple columns, iText produces a PDF with just one column.
- if a Word file has multiple lines of text next to a picture, only the first line of text is displayed next to the picture. The other lines of text are displayed within the picture.
These seem to be bugs in iText. Is there a way to fix these issues, or a workaround for them?
I need to output text to a .doc file. I am currently just writing to a file as usual and putting .doc at the end of the file name:
File.open('output_file.doc', 'a+') {|x| x.write(str)}
The issue is that I want to make some of the text red and bold. How can this be achieved? I am using Ruby, but I can easily switch to JRuby thanks to the amazingness that is rvm, so if there are Java libraries for this, that'd be great as well.
The short answer: use .rtf and then convert to .doc using Word or OpenOffice. The following .rtf file writes "normal text red text more normal text." with the words "red text" in bold red:
{\rtf1\ansi\ansicpg1252\cocoartf1038\cocoasubrtf350
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;\red255\green0\blue0;}
\margl1440\margr1440\vieww13280\viewh10420\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\ql\qnatural\pardirnatural
\f0\fs24 \cf0 normal text
\b \cf2 red text
\b0 \cf0 more normal text.}
The long answer:
A string is just a sequence of plain characters with no formatting attached, so there is no command that can make it bold. This is a property of plain-text files in general, not just of how Ruby works with files.
What text editors do is treat certain key strings within the file as commands to render the text in a certain way. For example, double asterisks surround bold text in the Stack Overflow editor. The file format determines these rules.
.rtf is a basic file format that has the features you want and is easy to convert to .doc using Word or OpenOffice. The advantage of .rtf is that it is human readable, so you can write an .rtf file with red text, rename it to .txt, open it in a text editor, and see what "decorations" the red font added. Play around with the parameters.
If you are curious, the complete .rtf specifications can be found here:
http://www.biblioscape.com/rtf15_spec.htm
What's all the garbage at the top? That is header stuff. Fortunately you don't need to add more header material to add more text.
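A minimal Ruby sketch of this idea: write the header once, then append as many runs of text as needed. The helper names (rtf_header, rtf_escape, rtf_normal, rtf_red_bold, rtf_document) are hypothetical, the header is a trimmed version of the one in the sample above (without the Cocoa-specific and layout control words), and the sketch assumes the text is plain ASCII.

```ruby
# Trimmed header based on the sample above: one font and a color table
# in which entry 2 is pure red (so \cf2 selects red, \cf0 the default color).
def rtf_header
  '{\rtf1\ansi\ansicpg1252' \
    '{\fonttbl\f0\fswiss\fcharset0 Helvetica;}' \
    '{\colortbl;\red255\green255\blue255;\red255\green0\blue0;}' \
    '\f0\fs24 '
end

# Backslash and braces are RTF syntax, so prefix each with a backslash.
def rtf_escape(text)
  text.gsub(/[\\{}]/) { |ch| "\\#{ch}" }
end

# A run of text with bold off and the default color.
def rtf_normal(text)
  "\\b0 \\cf0 #{rtf_escape(text)}"
end

# A run of text with bold on and color-table entry 2 (red).
def rtf_red_bold(text)
  "\\b \\cf2 #{rtf_escape(text)}"
end

# Header, then the runs one per line, then the closing brace of the document.
def rtf_document(*runs)
  rtf_header + runs.join("\n") + '}'
end

doc = rtf_document(
  rtf_normal('normal text '),
  rtf_red_bold('red text '),
  rtf_normal('more normal text.')
)
puts doc
```

Saving the result with File.write('output_file.rtf', doc) should give a file that Word or OpenOffice can open and re-save as .doc.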