Parse PDF with ABCPDF

I want to parse a PDF document that I download with ABCPDF, but I can't find any elements in the document, or work out how to reach them and iterate over them. I want to extract some text.
var webClient = new WebClient();
var bytes = webClient.DownloadData("http://test.com/test.pdf");
var doc = new Doc();
doc.Read(bytes);

Use the Doc.GetText method to extract content from the current page, specifying the format in which content is to be returned.
doc.PageNumber = 1;
string pageContent = doc.GetText("Text");
The example above will return plain text in layout order. Specifying "SVG" or "SVG+" returns additional information along with the text, such as style and position.

Related

How do I obtain a File ID, Blob, or URL of a screenshot that has been inserted in a Google Sheet cell?

Context: I want to be able to extract text via OCR from a screenshot that I insert into a Google Sheet cell. I have tried for hours, searching the Internet, trying different tactics, but I cannot find a way to get a file ID, URL, or Blob for the screenshot that has been inserted into the cell (the sheet has only 1 cell). BTW I have already worked out the code I need once I get a fileID, URL, or Blob to do the OCR.
Questions:
First, is it possible to get a reference (file ID, etc.) to an image in a cell?
Second, when I select a function in a dropdown (e.g. .getContentURL) and it produces a “not a function” error, what does that indicate (e.g. there is no content, or something else)?
function getImageHandle()
{
  // get the sheet object
  var inputSheet = ss.getSheetByName("InPutSheet");
  // get the file ID of the cell content
  var fileID = inputSheet.getRange("A1").getValue().fileID();
  // get the Blob of the cell content
  var blob = inputSheet.getRange("A1").getValue().getBlob();
  // get the URL of the cell content
  var url = inputSheet.getRange("A1").getContentUrl();
  // return the reference to the image in the cell:
  // one of the above (fileID, blob, or url)
  return fileID;
}

How can I list and merge inline-images in a PDF file using IText7-dotnet?

I have several PDF documents that supposedly contain scanned images, but upon inspection in Acrobat Pro, each page contains a huge number of tiny "inline images". From what I understand these are not regular images inside XObjects, but rather images embedded directly inside content streams.
How could I go about extracting and merging these images?
The only code I could find online starts out like this:
var reader = new PdfReader(@"path\to\file.pdf");
PdfDocument document = new PdfDocument(reader);
for (var i = 1; i <= document.GetNumberOfPages(); i++)
{
    PdfDictionary obj = (PdfDictionary)document.GetPdfObject(i);
    // ... more code goes here
}
...but the rest of the code doesn't work because the PdfDictionary returned from GetPdfObject is not a stream, only a dictionary. I don't know how to access the images inside it.

Not able to add image in header as well as in body part of word file using apache poi

I'm adding a picture in the header of a word document. It shows a frame for the image and says "the image cannot currently be displayed". If I add text to the header, it shows the text, and if I add the image in the document body, it also shows the image. So it is getting the image and it shows text on the header, but not the image.
XWPFHeader head = document.createHeader(HeaderFooterType.DEFAULT);
paragraph = head.getParagraphArray(0);
if (paragraph == null)
    //paragraph = head.createParagraph();
    paragraph = document.createParagraph();
paragraph.setAlignment(ParagraphAlignment.RIGHT);
InputStream inputStream = getClass().getClassLoader().getResourceAsStream("static/Picture.png");
String imgFile = "static/Picture.png";
run = paragraph.createRun();
XWPFPicture picture = run.addPicture(inputStream, XWPFDocument.PICTURE_TYPE_PNG, imgFile,
        Units.toEMU(100), Units.toEMU(50));
System.out.println(picture); //XWPFPicture is added
System.out.println(picture.getPictureData()); //but without access to XWPFPictureData (no blipID)

How to add image to google doc with text wrapping via Google Script

I have an image saved in my Google drive called "logo.png".
I want to add the image to a Google document so that it's in the top left corner and so that it does not distort the text around it. (In other words, I want the text to "wrap" around the image).
How do I add an image to a Google Doc using Google Script so that surrounding text wraps around the image?
My script so far:
function myFunction(e) {
  var t1 = 'Center for Success';
  var t2 = 'Foundational Hall';
  var t3 = 'Instruction Sheet for Testing Requirements';
  var boldRight = {};
  boldRight[DocumentApp.Attribute.BOLD] = true;
  boldRight[DocumentApp.Attribute.HORIZONTAL_ALIGNMENT] = DocumentApp.HorizontalAlignment.RIGHT;
  var boldCenterUnderline = {};
  boldCenterUnderline[DocumentApp.Attribute.BOLD] = true;
  boldCenterUnderline[DocumentApp.Attribute.UNDERLINE] = true;
  boldCenterUnderline[DocumentApp.Attribute.HORIZONTAL_ALIGNMENT] = DocumentApp.HorizontalAlignment.CENTER;
  var filename = 'fileTest';
  var doc = DocumentApp.create(filename);
  var body = doc.getBody();
  body.appendParagraph(t1).setAttributes(boldRight);
  body.appendParagraph(t2).setAttributes(boldRight);
  body.appendParagraph(space);
  body.appendParagraph(t3).setAttributes(boldCenterUnderline);
  doc.saveAndClose();
}
Desired Result:
I saw here that an image can be added in various ways, but none of those approaches has worked for me, and I do not see how I can control a wrapping attribute.
Update
I tried using the following code (with fake ID shown in URL), but it just created a blank document:
var image = "https://drive.google.com/open?id=2PJGK5C64HLKKoQIv52jGhUjjdiXU34Mp";
var fileID = image.match(/[\w\_\-]{25,}/).toString();
var blob = DriveApp.getFileById(fileID).getBlob();
body.appendImage(blob)
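The ID extraction step in the update can be checked on its own, outside Apps Script. This is only a minimal sketch in plain JavaScript, assuming the URL has the open?id= form shown above; extractFileId is a hypothetical helper name, and the ID is the fake one from the question. It returns null instead of throwing when no ID-like run of characters is found.

```javascript
// Hypothetical helper: pulls an ID-like token (25 or more word characters
// or hyphens) out of a URL string.
function extractFileId(url) {
  var match = url.match(/[\w-]{25,}/);
  // match() returns null when nothing matches, so guard before reading it.
  return match ? match[0] : null;
}

var image = "https://drive.google.com/open?id=2PJGK5C64HLKKoQIv52jGhUjjdiXU34Mp";
console.log(extractFileId(image));                  // 2PJGK5C64HLKKoQIv52jGhUjjdiXU34Mp
console.log(extractFileId("https://example.com/")); // null
```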

kendo ui editor how to modify user selection with range object

Kendo UI 2015.2.805, Kendo UI Editor for JavaScript
I want to extend the Kendo UI editor by adding a custom tool that converts a user-selected block spanning two or more paragraphs into a block of single-spaced text. This can be done by locating all interior p tags and converting them into br tags, taking care not to change the first or last tag.
My problem is working with the range object.
Getting the range is easy:
var range = editor.getRange();
The range object has a start and end container, and a start and end offset (within that container). I can access the text (without markup):
console.log(range.toString());
Oddly, other examples I have seen, including working examples, show that
console.log(range);
will dump the text. That does not work in my project; I just get the word 'Range', which is the type of the object. This concerns me.
All I really need, however, is a start and end offset in the editor's markup (editor.value()); then I can locate and change the p's to br's.
I've read the Telerik documentation and the referenced quirksmode site's explanation of HTML ranges, and while both are informative, nothing shows how to locate the range within the text (which seems pretty basic to me).
I suspect I'm overlooking something simple.
Given a range object how can I locate the start and end offset within the editor's content?
EDIT: After additional research it appears much more complex than I anticipated. It seems I must deal with the range and/or selection objects rather than directly with the editor content. Smarter minds than I came up with the range object for reasons I cannot fathom.
Here is what I have so far:
var range = letterEditor.editor.getRange();
var divSelection;
divSelection = range.cloneRange();
//cloning may be needless extra work...
//here manipulate the divSelection to how I want it.
//divSelection is a range, not sure how to manipulate it
var sel = letterEditor.editor.getSelection();
sel.removeAllRanges();
sel.addRange(divSelection);
EDIT 2:
Based on Tim Down's Solution I came up with this simple test:
var html;
var sel = letterEditor.editor.getSelection();
if (sel.rangeCount) {
  var container = document.createElement("div");
  for (var i = 0, len = sel.rangeCount; i < len; ++i) {
    container.appendChild(sel.getRangeAt(i).cloneContents());
  }
  html = container.innerHTML;
}
html = html.replace("</p><p>", "<br/>")
var range = letterEditor.editor.getRange();
range.deleteContents();
var div = document.createElement("div");
div.innerHTML = html;
var frag = document.createDocumentFragment(), child;
while ((child = div.firstChild)) {
  frag.appendChild(child);
}
range.insertNode(frag);
The first part, getting the selected HTML, works fine. The second part also works, but the editor inserts tags around every line, so the result is incorrect: there are extra lines containing fragments of the selection.
The editor supports a view HTML popup that shows the editor content as HTML and allows it to be edited. If I change the targeted p tags to br tags there, I get the desired result. (The editor does support br as the default line feed instead of p, but I want p's most of the time.) Being able to edit the HTML with the viewer tool tells me this is possible; I just need to identify the selection start and end in the editor content, and then a simple textual replacement via regex on the editor value would do the trick.
Edit 3:
Poking around kendo.all.max.js I discovered that pressing shift+enter creates a br instead of a p tag for the line feed. I was going to extend it to do just that as a workaround for the single-space tool. I would still like a solution to this if anyone knows, but for now I will instruct users to shift-enter for single spaced blocks of text.
This will accomplish it. Uses Tim Down's code to get html. RegEx could probably be made more efficient. 'Trick' is using split = false in insertHtml.
var sel = letterEditor.editor.getSelection();
if (sel.rangeCount) {
  var container = document.createElement("div");
  for (var i = 0, len = sel.rangeCount; i < len; ++i) {
    container.appendChild(sel.getRangeAt(i).cloneContents());
  }
  var block = container.innerHTML;
  var rgx = new RegExp(/<br class="k-br">/gi);
  block = block.replace(rgx, "");
  rgx = new RegExp(/<\/p><p>/gi);
  block = block.replace(rgx, "<br/>");
  rgx = new RegExp(/<\/p>|<p>/gi);
  block = block.replace(rgx, "");
  letterEditor.editor.exec("insertHtml", { html: block, split: false });
}
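The string transformation in the snippet above can be tried outside the editor. The following is a minimal sketch, assuming the selected HTML is already available as a string; collapseParagraphs is a hypothetical helper name, not part of Kendo UI, and the sample markup is made up for illustration.

```javascript
// Hypothetical helper: applies the same three replacements to an HTML string.
function collapseParagraphs(block) {
  // Drop any <br class="k-br"> elements found in the selection.
  block = block.replace(/<br class="k-br">/gi, "");
  // Turn each paragraph boundary into a single line break.
  block = block.replace(/<\/p><p>/gi, "<br/>");
  // Remove the remaining opening and closing paragraph tags.
  return block.replace(/<\/p>|<p>/gi, "");
}

var sample = "<p>Line one</p><p>Line two</p><p>Line three</p>";
console.log(collapseParagraphs(sample)); // Line one<br/>Line two<br/>Line three
```

Keeping the replacements in a plain function like this makes the regular expressions easy to check against sample markup before the result is passed to insertHtml.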
