Extracting metadata at the individual shape/component level of a PowerPoint slide

I have a slide with a series of text boxes organized into headings, each with associated text beneath it, like below (though it's not an actual table):
Heading A     Heading B     Heading C
A1 content    B1 content    C1 content
A2 content    B2 content    C2 content
The results of parsing (in XML format) show everything jumbled. This lets me extract the text, but I have no way of connecting content to its heading.
<div class="slide-content">
<p>Heading A</p>
<p>Heading B</p>
<p>Heading C</p>
<p>C1 content</p>
<p>A2 content</p>
<p>B1 content</p>
<p>A1 content</p>
<p>B2 content</p>
<p>C2 content</p>
</div>
I would like to use information about the position (e.g. x,y coordinates) and format (e.g. font, size) of text boxes in a PPTX slide to better infer associations between content. However, I don't see any options in the docs for extracting this additional detail.
Is this possible out of the box? Many thanks for any insights!
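Once each text box comes with coordinates, the association step itself is small. The following is a minimal sketch, not a feature of any particular parser: it assumes every box has already been reduced to a dict with the hypothetical keys "text", "left", and "top", that the top-most row holds the headings, and that content sits roughly below its heading. The coordinates are made up for illustration.

```python
def group_by_heading(boxes, tolerance=50):
    """Assign each text box to the heading whose left edge is closest.

    boxes: list of dicts with hypothetical keys "text", "left", "top"
    (any consistent unit). Boxes within `tolerance` of the top-most
    box are treated as the heading row; this is an assumption about
    the slide layout, not something read from the file.
    """
    if not boxes:
        return {}
    top_row = min(b["top"] for b in boxes)
    headings = [b for b in boxes if abs(b["top"] - top_row) <= tolerance]
    body = [b for b in boxes if abs(b["top"] - top_row) > tolerance]

    # Columns are listed left to right.
    columns = {h["text"]: [] for h in sorted(headings, key=lambda b: b["left"])}
    # Walk the body boxes top to bottom so each column keeps reading order.
    for box in sorted(body, key=lambda b: b["top"]):
        nearest = min(headings, key=lambda h: abs(h["left"] - box["left"]))
        columns[nearest["text"]].append(box["text"])
    return columns


# Same jumbled order as the parser output above; positions are invented.
boxes = [
    {"text": "Heading A", "left": 0, "top": 0},
    {"text": "Heading B", "left": 300, "top": 0},
    {"text": "Heading C", "left": 600, "top": 0},
    {"text": "C1 content", "left": 610, "top": 100},
    {"text": "A2 content", "left": 5, "top": 200},
    {"text": "B1 content", "left": 295, "top": 100},
    {"text": "A1 content", "left": 5, "top": 100},
    {"text": "B2 content", "left": 305, "top": 200},
    {"text": "C2 content", "left": 600, "top": 200},
]
columns = group_by_heading(boxes)
```

Matching on the nearest left edge rather than an exact x value tolerates boxes that are slightly misaligned, which is common on hand-built slides.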

Related

How to add a boundary to an image in a webpage

<html>
<h3> MY FIRST WEBPAGE </h3>
<H1> DESIGNING MY FIRST WEBPAGE </H1>
<title> MY FIRST TAB </title>
<img src="3333.jpg"
width="800"
height="500" >
<style>
body {font:12px Verdana,Arial; color:#428bca; background-color:#5bc0de}
</style>
</html>
How do I add a boundary to the image? I need to know how to add a red-coloured boundary that encloses the image, in a table format.
Add Borders to Images Using HTML & CSS
Using HTML/CSS to add a border to an image is easier than you might think. Before you add an image to your post in the text module, switch to the text editor. Then add the image; the editor shows the picture's HTML code, an <img> tag.

After adding the image, type style="border:5px solid #000000; padding:3px; margin:5px" inside the <img> tag to add the border.

Feel free to change the border width, color, padding, and margin. For a red border, for example, use #ff0000 in place of #000000. You can switch back to the visual editor when you are done to see the changes you've made.

Pandoc 2.x renders images' alternative texts in an inaccessible fashion

Since I upgraded from Pandoc v1.19 to 2.9, decorative images are not exported as expected anymore.
First of all, when generating HTML from ![](test.jpg), in v1.19 a <p class="figure"> structure was wrapped around the image, but now it's only a <p>:
<p>
<img src="test.jpg">
</p>
This makes it harder to style in line with other images that have an alternative text.
But what's really a problem here: there's no alt="" attribute produced anymore! This means that e.g. screen readers will not recognise this as a decorative image anymore.
So let's see what happens to an image with an actual alternative text, e.g. when generating HTML from ![Hello](test.jpg):
<div class="figure">
<img src="test.jpg" alt="">
<p class="caption">Hello</p>
</div>
Here we get a class="figure" in the surrounding element, but now it's a <div> instead of a <p> (I don't mind this too much, but again, it makes it harder to style everything the same).
The big problem, though, is that the alt attribute is now empty: this prevents screen readers from perceiving the image at all, which is horribly wrong! I guess that Pandoc concludes that having both alternative text and caption would be redundant, which is correct, and that the caption below would be the right thing to show, which it is not.
The right structure would look something like this:
<div class="figure">
<img src="test.jpg" alt="Hello"><!-- Leave the alternative text on the image -->
<p class="caption" aria-hidden="true">Hello</p><!-- Hide the redundant visual alternative text from screen readers -->
</div>
Any reason why this behaviour would make sense? Can it be changed somehow? Otherwise I will have to fiddle around with some post-processing JavaScript...
The ![](test.jpg) example is no longer treated as a figure, because pandoc now requires that (1) the image is the only element in a paragraph, and (2) it has a caption.
Wrapping of figures with <div> happens when exporting to HTML4. Using the latest pandoc 2.9.2.1, running pandoc -t html5 on the input ![Hello](test.jpg) yields:
<figure>
<img src="test.jpg" alt="" /><figcaption>Hello</figcaption>
</figure>
The rationale for emitting an empty alt attribute is that screen readers would otherwise read the caption twice: first the alt, then the figcaption. Your suggestion seems much better; please open an issue.
If you can't wait for a new release, then use a Lua filter to create figures the way you like:
function Para (p)
  -- Only handle paragraphs that consist of a single image.
  if #p.content == 1 and p.content[1].t == "Image" then
    local image = p.content[1]
    local figure_content = pandoc.List{}
    figure_content:insert(image)
    -- Emit the caption as a paragraph hidden from screen readers.
    figure_content:insert(
      pandoc.RawInline('html', '\n<p class=caption aria-hidden="true">'))
    figure_content:extend(image.caption)
    figure_content:insert(pandoc.RawInline('html', '</p>'))
    -- Wrap everything in <div class="figure">.
    local attr = pandoc.Attr("", {"figure"})
    return pandoc.Div({pandoc.Plain(figure_content)}, attr)
  end
end
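If a filter is not an option, the post-processing mentioned in the question does not have to happen in the browser. The following is a rough Python sketch, assuming the generated file contains figures in exactly the html5 shape shown above (an <img> with alt="" immediately followed by a <figcaption>). The function name fix_figures is hypothetical, and a regular expression is a shortcut here, not a general HTML parser.

```python
import html
import re

# Matches an <img ... alt="" ...> directly followed by its <figcaption>.
FIGURE = re.compile(
    r'(<img\b[^>]*?)\salt=""([^>]*>)\s*<figcaption>(.*?)</figcaption>',
    re.DOTALL,
)


def fix_figures(document):
    """Copy each caption into the empty alt and hide the caption from AT."""
    def repl(match):
        caption_html = match.group(3)
        # alt must be plain text: drop inline tags, then escape quotes.
        plain = re.sub(r"<[^>]+>", "", caption_html)
        alt = html.escape(html.unescape(plain), quote=True)
        return (
            f'{match.group(1)} alt="{alt}"{match.group(2)}'
            f'<figcaption aria-hidden="true">{caption_html}</figcaption>'
        )
    return FIGURE.sub(repl, document)


source = '<figure>\n<img src="test.jpg" alt="" /><figcaption>Hello</figcaption>\n</figure>'
fixed = fix_figures(source)
```

Images without a following figcaption, such as the decorative ![](test.jpg) case, are left untouched by this pattern.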

CKEditor moving br tags

I'm having a problem with CKEditor changing my original paragraph formatting with negative side effects.
I start with a basic paragraph loaded into CKEditor using setData():
<p><span style="font-size:50px">My Text</span></p>
... more document content ...
In the editor, I move the cursor to the end of the phrase "My Text" and press enter (with config.enterMode=CKEDITOR.ENTER_BR setting enabled). Inspecting the markup inside the editor I now see:
<p><span style="font-size:50px">My Text<br><br></span></p>
... more document content ...
Then, when I call getData() to pull the contents from the editor and save the document to a database, the HTML extracted by getData() looks like this:
<p><span style="font-size:50px">My Text</span><br> </p>
... more document content ...
This is a problem because while editing, the <br> tag was inside the <span> and was subject to the 50px font size style. The user saw a 50px blank line before the next piece of document content. After saving the HTML to a database and reloading later the <br> tag is now outside the <span> and is not subject to the 50px font sizing and the blank line appears much smaller than before.
The round-trip fidelity of the text formatting is not preserved, and the user is frustrated by the results.
Can someone help me understand the results I'm seeing with <br> tags being reformatted and moved around during the editing life cycle, and how I might fix this problem?
Using CKEditor v4.4.1

Nokogiri: wrap top-level text elements with <p> tags

Having trouble building the XPath selector for "naked" text nodes that are not already contained by another tag. I'd like to transform this:
some naked text <p>some wrapped text</p> more naked text
into this:
<p>some naked text</p> <p>some wrapped text</p> <p>more naked text</p>
I tried using doc.xpath("//child::text()").wrap('<p></p>') but that seems to grab all text nodes, not just the top-level ones.
doc.xpath('/html/body/text()').wrap('<p/>')
When you use // you are choosing the descendant-or-self axis, i.e. anywhere in the document. Instead you want to use / (with the default child axis) to match only text nodes that are direct children of a particular element.
If this is not an HTML document with <html> and <body> elements, then simply:
doc.xpath('/*/text()').wrap('<p/>')
will select all text nodes that are children of the root XML element (whatever its name).
You could select every text node except those inside paragraphs:
'//text()[not(ancestor::p)]'
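The child-versus-descendant distinction is not specific to Nokogiri. As an illustration only, here is a sketch of the same transformation with Python's standard xml.etree.ElementTree, where text that is a direct child of an element is stored in that element's text and in its children's tail attributes. The helper name wrap_top_level_text is hypothetical, and the sketch assumes well-formed input with a single root element and discards whitespace-only text.

```python
import xml.etree.ElementTree as ET


def wrap_top_level_text(root):
    """Wrap text that is a direct child of root in <p> elements, in place."""
    new_children = []
    # Text before the first child element lives in root.text.
    if root.text and root.text.strip():
        p = ET.Element("p")
        p.text = root.text.strip()
        new_children.append(p)
    root.text = None
    for child in list(root):
        # Text following a child element lives in that child's tail.
        tail = child.tail
        child.tail = None
        new_children.append(child)
        if tail and tail.strip():
            p = ET.Element("p")
            p.text = tail.strip()
            new_children.append(p)
    # Replace the children; text nested inside them is never touched.
    root[:] = new_children
    return root


body = ET.fromstring(
    "<body>some naked text <p>some wrapped text</p> more naked text</body>"
)
wrap_top_level_text(body)
result = ET.tostring(body, encoding="unicode")
```

Because only root.text and the tails of root's direct children are visited, text deeper in the tree is left alone, which mirrors what /*/text() selects.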

Web page dimension to create a standard pdf page size

I want to generate 40 pages of a report from the database. First I want to generate an HTML file containing the 40 reports and then convert that HTML file to PDF. Each report is supposed to occupy one page of the resulting PDF document.
What should the dimensions of each HTML report be so that, when I convert the entire HTML page containing the 40 reports, each report occupies exactly one page of the PDF document?
I would simply make the reports as concise as possible and then force a page break after each report:
<p>This is report 1</p>
<table>
<tr><td>Line 1</td></tr>
<tr><td>Line 2</td></tr>
<tr><td>Line 3</td></tr>
</table>
<div class="EndOfRaport" style="page-break-before:always"></div>
<p>This is another report</p>
<p>All is well in ponyville</p>
<div class="EndOfRaport" style="page-break-before:always"></div>
<p>This is a report with a simple graph</p>
<img src="ponies.png" />
That way you don't really have to care about the contents, and you can be assured that each report starts on its own page; if a report is too long, it will spill onto the next page. You can then use headers and footers to add page-independent elements if you need any.
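Since the reports come from a database, the combined HTML is presumably produced by a script. The following is a minimal sketch of that step in Python, assuming each report is already available as an HTML fragment string; the function name build_report_document and the class name "report" are hypothetical. It uses the CSS property page-break-after on a wrapper div instead of a separate marker element, and skips the break after the last report so the PDF does not end with an empty page.

```python
def build_report_document(report_fragments):
    """Join HTML report fragments, forcing a page break after each but the last."""
    parts = ["<html><body>"]
    last = len(report_fragments) - 1
    for index, fragment in enumerate(report_fragments):
        # No break after the final report, to avoid a trailing blank page.
        style = "" if index == last else ' style="page-break-after:always"'
        parts.append(f'<div class="report"{style}>{fragment}</div>')
    parts.append("</body></html>")
    return "\n".join(parts)


document = build_report_document([
    "<p>This is report 1</p>",
    "<p>This is another report</p>",
    "<p>This is a report with a simple graph</p>",
])
```

Whether the break is honoured depends on the HTML-to-PDF converter, so it is worth checking the output of whichever tool is used.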
