Can I prevent ABCpdf from mashing words together (e.g. mashingwordstogether) when converting PDF to Text? - abcpdf

I'm using ABCpdf to extract the text content of some PDF files, in particular by calling Doc.GetText("Text") in a loop, once per page. This usually works well, but for some PDF files the resulting text is missing most of its space characters, e.g.
Thissentencedoesn'thaveanyspacesbetweenwords.
What's interesting is if I try to extract text from the exact same PDFs using Apache Tika (powered under the hood by PDFBox), I tend to get all the spaces I'd expect between words. That is, the above sentence would be rendered by Tika as
This sentence doesn't have any spaces between words.
Overall, the two tools act like they're afraid of committing different mistakes -- ABCpdf acts like the worst thing in the world would be to insert a space where one doesn't belong, while Tika acts like the worst thing in the world would be to fail to insert a space where one does belong.
Are there any settings to make ABCpdf act more like Tika in this regard?

Short Answer: You can get individual tokens of text via Doc.GetText("SVG"), parsing the XML for TEXT and TSPAN elements, and determining if there is layout spacing that should be treated as actual spaces. The behavior you're seeing from PDFBox is probably their attempt to make that assumption. Also, even Adobe Acrobat can return spaced text via the clipboard as PDFBox does.
Long Answer: Inserting spaces based on layout may cause more problems, since the result may not match the original intent of the text in the PDF.
ABCpdf is doing the correct thing here, as the PDF spec only describes where things should be placed in the output medium. One can construct a PDF file that ABCpdf interprets in both styles, even though the original sentence looks nearly the same.
To demonstrate this, here is a snapshot of a document from Adobe InDesign that shows a text layout matching both cases for your sample sentence.
Note that the first row was not constructed with actual spaces. Instead, the words were placed by hand in individual text regions and lined up to look approximately like a properly spaced sentence. The second row is a single sentence, in a single text region, with actual space characters between the words.
When exported to PDF and then read in by ABCpdf, Doc.GetText("TEXT") will return the following:
ThisSentenceDoesn'tHaveAnySpacesBetweenWords.
This Sentence Doesn't Have Any Spaces Between Words.
Thus if you wish to detect layout spaces, you must use SVG output and step through the tokens of text manually. Doc.GetText("SVG") returns text and other drawing entities as ABCpdf sees them on the page, and you can decide how you want to handle the case of layout based spacing.
You'll receive output similar to this:
<?xml version="1.0" standalone="no"?>
<svg width="612" height="792" x="0" y="0" version="1.1" baseProfile="full" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<text xml:space="preserve" x="36" y="46.1924" font-size="14" font-family="ArialMT" textLength="26.446" transform="translate(36, 46.1924) translate(-36, -46.1924)">This</text>
<text xml:space="preserve" x="66.002" y="46.1924" font-size="14" font-family="ArialMT" textLength="59.15" transform="translate(66.002, 46.1924) translate(-66.002, -46.1924)">Sentence</text>
<text xml:space="preserve" x="129.604" y="46.1924" font-size="14" font-family="ArialMT" textLength="47.46" transform="translate(129.604, 46.1924) translate(-129.604, -46.1924)">Doesn’t</text>
<text xml:space="preserve" x="181.208" y="46.1924" font-size="14" font-family="ArialMT" textLength="32.676" transform="translate(181.208, 46.1924) translate(-181.208, -46.1924)">Have</text>
<text xml:space="preserve" x="219.61" y="46.1924" font-size="14" font-family="ArialMT" textLength="24.122" transform="translate(219.61, 46.1924) translate(-219.61, -46.1924)">Any</text>
<text xml:space="preserve" x="249.612" y="46.1924" font-size="14" font-family="ArialMT" textLength="46.69" transform="translate(249.612, 46.1924) translate(-249.612, -46.1924)">Spaces</text>
<text xml:space="preserve" x="301.216" y="46.1924" font-size="14" font-family="ArialMT" textLength="54.474" transform="translate(301.216, 46.1924) translate(-301.216, -46.1924)">Between</text>
<text xml:space="preserve" x="360.016" y="46.1924" font-size="14" font-family="ArialMT" transform="translate(360.016, 46.1924) translate(-360.016, -46.1924)"><tspan textLength="13.216">W</tspan><tspan dx="-0.252" textLength="31.122">ords.</tspan></text>
<text xml:space="preserve" x="36.014" y="141.9944" font-size="14" font-family="ArialMT" transform="translate(36.014, 141.9944) translate(-36.014, -141.9944)">
<tspan textLength="181.3">This Sentence Doesn’t Have </tspan><tspan dx="-0.756" textLength="150.178">Any Spaces Between W</tspan><tspan dx="-0.252" textLength="31.122">ords.</tspan></text>
</svg>
And note that the basic structure reveals the original intent that gave you problems. (xml:space and attributes removed, whitespace modifications for the sake of example)
<?xml version="1.0" standalone="no"?>
<svg>
<text>This</text>
<text>Sentence</text>
<text>Doesn’t</text>
<text>Have</text>
<text>Any</text>
<text>Spaces</text>
<text>Between</text>
<text><tspan>W</tspan><tspan>ords.</tspan></text>
<text>
<tspan>This Sentence Doesn’t Have </tspan>
<tspan>Any Spaces Between W</tspan>
<tspan>ords.</tspan>
</text>
</svg>
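As a rough illustration of stepping through the tokens manually, here is a minimal Python sketch that works on SVG text shaped like the output above. It is not part of ABCpdf: the function names, the exact-match grouping by y, and the gap threshold of 0.2 times the font size are all assumptions made for this example, and real documents would likely need a tolerance on y and a per-font threshold.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def element_text(elem):
    """Join the text of a <text> element and all of its <tspan> children."""
    return "".join(elem.itertext()).replace("\n", "")


def element_width(elem):
    """Advance width from textLength, summed over <tspan>s (plus dx) if needed."""
    if "textLength" in elem.attrib:
        return float(elem.attrib["textLength"])
    spans = elem.findall(SVG_NS + "tspan")
    if not spans or any("textLength" not in s.attrib for s in spans):
        return None  # width unknown, so no gap test is possible after this token
    return sum(float(s.attrib["textLength"]) + float(s.get("dx", 0)) for s in spans)


def join_tokens(svg_xml, gap_ratio=0.2):
    """Return one string per baseline, adding a space where the layout gap is wide.

    gap_ratio is an assumed heuristic: a gap wider than gap_ratio * font-size
    between the end of one token and the start of the next counts as a space.
    """
    root = ET.fromstring(svg_xml)
    rows = {}
    for t in root.iter(SVG_NS + "text"):
        x = float(t.get("x", 0))
        y = float(t.get("y", 0))
        size = float(t.get("font-size", 12))  # assumed default when missing
        rows.setdefault(y, []).append((x, element_width(t), size, element_text(t)))

    lines = []
    for y in sorted(rows):  # SVG y grows downward, so this is top to bottom
        parts, prev_end = [], None
        for x, width, size, text in sorted(rows[y], key=lambda item: item[0]):
            if prev_end is not None and x - prev_end > gap_ratio * size:
                parts.append(" ")
            parts.append(text)
            prev_end = x + width if width is not None else None
        lines.append("".join(parts))
    return lines


# Trimmed-down version of the first three tokens shown above.
SAMPLE = """<?xml version="1.0" standalone="no"?>
<svg xmlns="http://www.w3.org/2000/svg">
<text x="36" y="46.1924" font-size="14" textLength="26.446">This</text>
<text x="66.002" y="46.1924" font-size="14" textLength="59.15">Sentence</text>
<text x="129.604" y="46.1924" font-size="14" textLength="47.46">Doesn’t</text>
</svg>"""

print(join_tokens(SAMPLE))
```

The gaps between these tokens are roughly 3.5 to 4.5 units at a font size of 14, which is why a threshold of about 2.8 units treats them as spaces, while adjacent fragments such as "W" and "ords." stay joined.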

This question and answer are based around old releases of ABCpdf.
ABCpdf Version 9 will do this all automatically for you.
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

Related

How to add svg file into code? Visual studio 2019?

I need to add an svg in the following format:
<svg class="xxx" width="x" height="x" viewBox="xx" fill="xx" xmlns="xxx">
<path fill-rule="xx" clip-rule="xx" d="192 1920192 1920129.210291090291 192012" fill="blue" />
</svg>
How do I get an svg file and get it into the format above? SVG code above is in a .cshtml file. I've tried dragging and dropping the SVG file into code but it turns into an image tag.
There are many ways to get an SVG image in that format. You can use Illustrator or CorelDRAW, but if you intend to use it for the web then the best option is to use Inkscape.
As for adding the SVG code to HTML, there are several ways to do that, each with pros and cons depending on what you want to do with the SVG.
1- You can simply embed the code within the HTML file. Open the SVG in Visual Studio, copy its markup, and paste it into a div tag in the HTML code.
2- You can use the img tag and reference the SVG file like any other image, for example <img src="icon.svg" alt="icon" /> (the file name here is only a placeholder).
3- You can use the object tag to load the SVG file into the HTML, but I never use this method and can't say exactly how it is done; a quick search should turn up the details.

Why can SVG files created in Inkscape suddenly not be opened in Graphic?

Till today, I was able to create SVG files using Inkscape, then open the file (and modify it) using Graphic on my Mac. Today, suddenly, when Graphic opens the svg files (saved as "plain SVG" or "Inkscape SVG"), the files are blank. If I drag the file into Graphic, I get an error message: "the file ... cannot be imported as a valid image".
So I updated Inkscape to 1.1, the current version. Still, the problem persists. Searching yielded little insight except that perhaps the incompatibility may be in the first line of the svg files. Pulled up a working SVG file, indeed it has a different first line. Copying it over made no difference. So what do I need to do to be able to open Inkscape svg files in Graphic (which I am far more familiar with)?
The first couple of lines from one of the invalid files:
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="595.275591pt" height="841.889764pt" viewBox="0 0 595.275591 841.889764" version="1.2">
<g id="surface71726">
<path style=" stroke:none;fill-rule:nonzero;fill:rgb(99.215686%,99.215686%,99.215686%);fill-opacity:1;" d="M 10.785156 426.429688 L 10.785156 169.285156 L 580.644531 169.285156 L 580.644531 683.570312 L 10.785156 683.570312 Z M 10.785156 426.429688 "/>
And the first line and half from the working svg file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0" y="0" width="744.23" height="512.157" viewBox="0, 0, 744.23, 512.157">
<g id="layer1">
<path d="M608.765,190.665 L608.765,190.665 L602.066,192.274 L602.066,192.349 C599.976,192.587 597.668,192.769 595.613,192.904 L595.334,190.477 C603.239,189.938 610.32,188.216 616.595,185.536 C616.
Try the following:
open a working file
delete all the contents from it
save as ... template with your favourite name
now use the new template for your next drawings
I suspect the difference in the header that you have posted here explains why 'Graphic' cannot deal with the valid SVG document.
You could also try and report a bug for 'Graphic', if they take customer reports.

Display character SQUARE M SQUARED (\u33a1) in generated pdf report

I am using the following code in a Jasper PDF report to display the character SQUARE M SQUARED (\u33a1):
<?xml version="1.0" encoding="UTF-8"?>
...
<textField isStretchWithOverflow="true">
<reportElement x="0" y="0" width="609" height="20" uuid="df8665ef-2226-4aaa-bd04-09805582eaef"/>
<textElement verticalAlignment="Middle">
<font fontName="SomeCustFont" size="20" pdfEncoding="Cp1252" isPdfEmbedded="true"/>
</textElement>
<textFieldExpression><![CDATA["Squared M : \u33a1"]]></textFieldExpression>
</textField>
For this code, I am not able to see the unicode character in PDF. It is simply blank. But in XLSX, I am able to see the character.
I tried following:
Remove pdfEncoding
Set isPdfEmbedded="false"
But no luck
Update: It seems the custom font I am using does not support the squared m character. I cannot add a new font or update the existing custom font, but I can use any of the built-in fonts for that particular character. How can I achieve this using a built-in font?
I tried:
fontName="Courier" pdfFontName="Courier"
This built-in Jasper font should support that character, but I get an error saying the font cannot be located.
The main problem here is that \u33a1 (SQUARE M SQUARED) is a single code point in Unicode's CJK Compatibility block, not an extended-ASCII character, and most free fonts don't include a glyph for it. So instead of this squared m, I used the English 'm' character followed by the superscript two, \u00b2, which is available in almost all fonts.
\u33a1 -> m\u00b2
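The substitution itself is a plain string replacement. The sketch below uses Python only to illustrate it; the function name is made up for this example, and in a Jasper report the same replacement would have to happen in the field expression or in the data feeding the report.

```python
SQUARE_M_SQUARED = "\u33a1"  # single code point, often missing from fonts
FALLBACK = "m\u00b2"         # Latin "m" followed by SUPERSCRIPT TWO


def replace_square_m(text):
    """Swap every SQUARE M SQUARED for the two-character fallback."""
    return text.replace(SQUARE_M_SQUARED, FALLBACK)


print(replace_square_m("Squared M : \u33a1"))
```

An explicit replacement is used here instead of Unicode compatibility normalization, because NFKC would also flatten the superscript two into an ordinary digit.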

Image format which is editable as plaintext

I'm trying to find an image file format whose source is editable as plaintext. So far my searches have only turned up text-in-image scanning with plaintext output, but that's not what I'm after.
My project is intended to create fractal images using seed integers derived from the binary content of an audio file. I have a solution for the first part. I seem to recall an image format which used tables of hexadecimal pairs to describe a 256 color palette in plaintext source.
A way to convert an image to a space delimited text file table of hex pairs and back again would probably get me past this wall.
Thanks in advance.
try SVG
https://picsvg.com/
example in HTML
<!-- `<use>` shape defined ON THIS PAGE somewhere else -->
<svg viewBox="0 0 100 100">
<use xlink:href="#icon-1"></use>
</svg>
<!-- `<use>` shape defined in an EXTERNAL RESOURCE -->
<svg viewBox="0 0 100 100">
<use xlink:href="defs.svg#icon-1"></use>
</svg>
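Separately from SVG, the round trip the question mentions (image bytes to a space-delimited table of hex pairs and back) can be sketched in a few lines of Python. The function names and the 16-pairs-per-row layout are assumptions for illustration; this treats the file as raw bytes and knows nothing about any particular image format.

```python
def bytes_to_hex_table(data, per_row=16):
    """Render bytes as rows of space-delimited two-digit hex pairs."""
    rows = []
    for start in range(0, len(data), per_row):
        chunk = data[start:start + per_row]
        rows.append(" ".join(f"{byte:02x}" for byte in chunk))
    return "\n".join(rows)


def hex_table_to_bytes(table):
    """Parse whitespace-delimited hex pairs back into the original bytes."""
    return bytes(int(pair, 16) for pair in table.split())


original = bytes(range(20))
table = bytes_to_hex_table(original)
print(table)
assert hex_table_to_bytes(table) == original
```

To apply this to a file, read it with open(path, "rb").read(), edit the resulting text table, and write the parsed bytes back with open(path, "wb").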

Bash command line application to do image editing and stitching and create output as A4 PDF?

I use a command line application to create QR images from a given input text. I create enough of these and do some image editing, namely:
resize the QR image,
place on an office document
type an index number next to the image
repeat the above steps by adding more QR images next and below this image
print an A4 page on the printer full of QR images.
The whole process is very repetitive and can be automated, but I don't know where to start. I see GIMP has "Script-Fu", based on the Scheme scripting language, but I can't find (or think of) a relevant function that can do the above. Sure, the resize is easy, but adding text and creating a restricted image tile surface doesn't seem as straightforward.
Is there some application that I could use that edits the image according to some script and places the result in an A4 image which will later be printed?
Or am I plainly asking for too much? If it integrates well with bash / python scripts then that is even better!
thank you
IMO this will require some work; one way to accomplish this is doing the following:
Create an SVG template file at A4 size and lay it out so that a page fits several of the desired QR images. Something like the following:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<!-- A4 size -->
<svg width="210mm" height="297mm" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<image xlink:href="_qr_image_here_1_" x="0" y="0" height="50px" width="50px"/>
<image xlink:href="_qr_image_here_2_" x="60" y="0" height="50px" width="50px"/>
... <!-- More Images --> ...
</svg>
Code a shell script that calls sed to replace the xlink:href="_qr_image_here_N_" placeholders with the paths to the QR images you want to fit in (the SVG tools take care of the resizing for you).
Generate several of these SVG documents from your script; these files will represent the pages of your document.
Convert all the SVG files to PDF; you can use rsvg-convert for this.
Merge all the generated PDF pages into one PDF file; you can use pdftk for this.
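Since the question also mentions Python, the placeholder-replacement step can be sketched there instead of with sed. This is only an illustration under assumptions: the function names, the two-image template, and the file names in the comments are hypothetical, and the template reuses the _qr_image_here_N_ placeholder convention from above.

```python
TEMPLATE = """<?xml version="1.0" standalone="no"?>
<svg width="210mm" height="297mm" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<image xlink:href="_qr_image_here_1_" x="0" y="0" height="50px" width="50px"/>
<image xlink:href="_qr_image_here_2_" x="60" y="0" height="50px" width="50px"/>
</svg>
"""


def fill_page(template, image_paths):
    """Replace _qr_image_here_N_ (N counted from 1) with the N-th image path."""
    page = template
    for n, path in enumerate(image_paths, start=1):
        page = page.replace(f"_qr_image_here_{n}_", path)
    return page


def build_pages(template, image_paths, per_page):
    """Split the images into groups of per_page and fill one template per group.

    A final, partly filled page keeps its unused placeholders; remove or
    handle those before converting.
    """
    return [
        fill_page(template, image_paths[start:start + per_page])
        for start in range(0, len(image_paths), per_page)
    ]


pages = build_pages(TEMPLATE, ["qr1.png", "qr2.png", "qr3.png", "qr4.png"], per_page=2)
print(len(pages))

# Each page would then be written to disk and converted and merged, e.g.:
#   rsvg-convert -f pdf -o page1.pdf page1.svg
#   pdftk page1.pdf page2.pdf cat output all.pdf
```

The trailing underscore in each placeholder keeps _qr_image_here_1_ from matching inside _qr_image_here_10_, so replacing them in order is safe.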
