PDF and Word extraction of text overlaid on images - image

Tools that process PDF or Microsoft Word (DOC, DOCX) content extract text labels overlaid on images separately from the images themselves. As a result, each such image is extracted without its overlaid text and is then followed by one or more paragraphs of that text, out of context.
In such cases, an image like (a)
-------------
| Level 2 |
-------------
| Level 1 |
-------------
is extracted as (b)
-------------
| |
-------------
| |
-------------
Level 2
Level 1
This is "standard" behavior for tools used for PDF or Word processing, like Apache PDFBox and POI.
Is there any way of handling this, in the Apache tools, or any other similar tool?
The ideal solution would be to extract both the image and the labels as a single entity, like (a) above. Alternatively, image and label extraction could be deactivated together.
Ultimately, there should be a way to avoid the "pollution" of the document text with the labels, which otherwise appear out of context.

Related

How to convert xlsx to pdf on one page

I have a 13-column XLSX file that I want to convert to PDF.
I use this command: "soffice", "--headless", "--convert-to", "pdf", filepath, "--outdir", outpath.
The conversion works, but there are too many columns, so they are spread across four pages.
I need them all to appear on one page.
The output is also in portrait orientation; I need it in landscape.
Thanks
XLSX printout settings (used for PDF export) are part of the file contents, so here is the same file saved with different settings but converted with the same export command. (--convert-to implies --headless, so that flag is generally not needed.) The author decides each cell's content and shape, and also sets how many rows and columns will fit on a standard page such as A4 portrait or A4 landscape. Thus only a macro can change the print layout area. The best that may be possible externally is to scale the output up or down onto bigger or smaller paper.
soffice --convert-to pdf:calc_pdf_Export "DataTables example Default.xlsx"
soffice --convert-to pdf:calc_pdf_Export "DataTables example A3.xlsx"
You need to change the layout for printing and export in the preview screen. If you want 13 columns, set the print area from A1 to column M, row Y, where Y is your desired number of rows (whatever their variable height may be).
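If the conversion is driven from a script, the export command above can be wrapped as in the following minimal sketch. It assumes soffice is on the PATH and that LibreOffice names the output after the input file with a .pdf extension; build_convert_command and convert are hypothetical helper names. It does not change the page layout, which still has to be saved in the XLSX file itself.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(xlsx_path: str, out_dir: str) -> List[str]:
    """Return the argument list for the Calc PDF export command."""
    # --headless is left out because --convert-to already implies it.
    return [
        "soffice",
        "--convert-to", "pdf:calc_pdf_Export",
        "--outdir", str(out_dir),
        str(xlsx_path),
    ]


def convert(xlsx_path: str, out_dir: str) -> Path:
    """Run the conversion and return the path where the PDF is expected."""
    # check=True raises CalledProcessError if soffice exits with an error.
    subprocess.run(build_convert_command(xlsx_path, out_dir), check=True)
    # Assumption: the PDF keeps the input file's base name.
    return Path(out_dir) / (Path(xlsx_path).stem + ".pdf")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, such as "DataTables example A3.xlsx".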

Repeat or reference a figure in another rst file

I have two files named a.rst and b.rst, both of which contain a good deal of text. In a.rst, I define a figure:
.. figure:: ../images/some-image.png
   :scale: 70%
   :align: center
   :alt: Some Text

   Some Caption
I would like to have the same image and caption in b.rst with the same figure number, but repeating the above code gives me a new figure.
As a compromise, I can refer to this image in b.rst using the :numref: role, but that does not resolve to the figure. It only displays the name as a piece of code.
I understand that these are two questions, but I think they are sufficiently related. How can I repeat or reference a figure defined in one rst file in another file?
Edit to elaborate on the expected output:
I want the resulting files to have the following content:
a.html:
Fig. 1 + caption
Fig. 2 + caption
Fig. 3 + caption
b.html:
Fig. 4 + caption
Fig. 2 + caption
Fig. 5 + caption
Effectively, this would add the figure to b.rst not as a separate entity, but merely as a mirror of what was in a.rst. This is similar to what was discussed here.
Assuming you want to repeat the exact same content in two different files, put that content into a separate file, then include that file wherever you want it to appear.
includeme.rst
Note document root relative path.
.. figure:: /images/some-image.png
   :scale: 70%
   :align: center
   :alt: Some Text

   Some Caption
a.rst and b.rst
Below this paragraph should appear an image.
.. include:: /includeme.rst
EDIT
To remove the figure number, set numfig to False. This will avoid the incongruent figure numbering, but won't solve it. I think that's the best you can achieve, as Sphinx automatically numbers figures (and other objects) otherwise.
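For reference, numfig is set in the Sphinx project's conf.py. A minimal sketch of that setting follows; the rest of the configuration is assumed to be unchanged.

```python
# conf.py (Sphinx project configuration)

# Turn off automatic numbering of figures, tables, and code blocks,
# so the included figure is not given a second, different number.
numfig = False
```

With numfig disabled, the :numref: role has no figure numbers to show, so it falls back to the label or link text.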

Looking to find text columnar position verification tool. Does one exist?

I am working on creating text-based data feed files that have fixed column widths. Example: positions 1-5 are the record layout ID, positions 6-35 are the part number, positions 36-70 are the description, etc.
I wish there were a tool to which I could give these field widths and then paste in the raw text to see visually where it lines up. Conceptually, this would seem to be a pretty simple tool.
Do you know of any solutions or creative ideas?
Thanks!
Use https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr
Note that substr(start, length) takes a zero-based start index and a length, not an end position.
Layout ID would be str.substr(0, 5)
Part number would be str.substr(5, 30)
etc.
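The same slicing idea can be turned into a small visual checker. The sketch below is one possible approach, not an existing tool: the LAYOUT table and the function names split_record, ruler, and show are hypothetical, and the positions are the 1-based, inclusive ones from the question.

```python
from typing import Dict, List, Tuple

# Hypothetical layout from the question: (field name, first position, last position),
# with positions 1-based and inclusive.
LAYOUT: List[Tuple[str, int, int]] = [
    ("record_layout_id", 1, 5),
    ("part_number", 6, 35),
    ("description", 36, 70),
]


def split_record(line: str, layout=LAYOUT) -> Dict[str, str]:
    """Slice one fixed-width record into named fields."""
    # Convert 1-based inclusive positions into 0-based, end-exclusive slices.
    return {name: line[start - 1:end] for name, start, end in layout}


def ruler(layout=LAYOUT) -> str:
    """Build a guide line with '|' at the first column of each field and '-' elsewhere."""
    width = max(end for _, _, end in layout)
    marks = ["-"] * width
    for _, start, _ in layout:
        marks[start - 1] = "|"
    return "".join(marks)


def show(line: str, layout=LAYOUT) -> None:
    """Print the ruler above the record, then each field in brackets."""
    print(ruler(layout))
    print(line)
    for name, value in split_record(line, layout).items():
        # Brackets make leading and trailing padding visible.
        print(f"{name:<18}[{value}]")
```

Calling show() on a pasted record prints the ruler directly above it, so a field that starts in the wrong column is visible at a glance; the output has to be viewed in a monospaced font for the columns to line up.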

LibreOffice Calc two Alignments - One Cell

Is it possible to write one word with left alignment and one word with right alignment in a single cell in LibreOffice Calc?
Like that: | normal Cell one | Halli _________ Hallo | normal Cell three |
Every time I try to simulate it with many spaces between Halli and Hallo, there are formatting problems when I convert it to PDF.
Format the cell to have a distributed horizontal alignment [my LO 4.1.6.2 on Linux offers this option].
Stayed that way after exporting to PDF.
This is not a programming question - use SuperUser for software handling questions next time, please.

How can I place two images side-by-side with Asciidoctor?

I'm trying to place two images side-by-side, and ideally center the whole block in an Asciidoctor project. The code below works in the HTML5 output, but stacks the images when rendered to a PDF.
[.clearfix]
--
[.left]
.Title1
image::chapter3/images/foo.png[Foo, 450, scaledwidth="75%"]
[.left]
.Title2
image::chapter3/images/bar.png[Bar, 450, scaledwidth="75%"]
--
Is it possible to 1) render side-by-side images in a PDF and 2) center the block of images? If it's possible to specify the space between them, that would be great too.
Thanks,
Matt
Not sure if you can specify the space between them, but you're using the block image instead of the inline (image::...[] vs image:..[], note the colons). I'm also not sure how centring works in pdf as I don't do a lot of pdf generation, but if those are the only things on that line, they may center, or maybe a .center would do it?
1) render side-by-side images in a PDF
Yes. Following eskwayrd's answer to "Asciidoctor: how to layout two code blocks side by side?", you can insert your images inside a table with only 2 columns.
[cols="a,a"]
|===
| image::foo.png[]
| image::bar.png[]
|===
In your case, I would even hide the table completely:
[cols="a,a", frame=none, grid=none]
|===
| image::foo.png[]
| image::bar.png[]
|===
2) center the block of images
This is currently complicated in PDF.
Our block is now a table, so we have a few options in HTML. Aligning the content with < and > is simple enough and works.
[cols=">a,<a", frame=none, grid=none]
|===
| image::foo.png[]
| image::bar.png[]
|===
Setting the table width to automatic and centering it also works in HTML:
[%autowidth, cols="a,a", frame=none, grid=none, role="center"]
|===
| image::foo.png[]
| image::bar.png[]
|===
For some reason, however, these two methods do not work in PDF when converting with asciidoctor-pdf. One "solution" for PDF would be to expand your table with extra empty columns on the left and right and try to adjust their relative widths with integers.
[cols="3,1a,1a,3", frame=none, grid=none]
|===
|
| image::foo.png[]
| image::bar.png[]
|
|===

Resources