My web application loads a pdf in the browser. I have figured out how to check that the pdf has loaded correctly using:
verifyAttribute
xpath=//embed/#src
{URL of PDF goes here}
It would be really nice to be able to check the contents of the pdf with Selenium - for example verify that some text is present. Is there any way to do this?
While not natively supported, I have found a couple ways using the java driver. One way is to have the pdf open in your browser (having adobe acrobat installed) and then use keyboard shortcut keys to select all text (CTRL+A), then copy it to the clipboard (CTRL+C) and then you can verify the text in the clipboard. eg:
protected String getLastWindow() {
return session().getEval("var windowId; for(var x in selenium.browserbot.openedWindows ){windowId=x;} ");
}
#Test
public void testTextInPDF() {
session().click("link=View PDF");
String popupName = getLastWindow();
session().waitForPopUp(popupName, PAGE_LOAD_TIMEOUT);
session().selectWindow(popupName);
session().windowMaximize();
session().windowFocus();
Thread.sleep(3000);
session().keyDownNative("17"); // Stands for CTRL key
session().keyPressNative("65"); // Stands for A "ascii code for A"
session().keyUpNative("17"); //Releases CTRL key
Thread.sleep(1000);
session().keyDownNative("17"); // Stands for CTRL key
session().keyPressNative("67"); // Stands for C "ascii code for C"
session().keyUpNative("17"); //Releases CTRL key
TextTransfer textTransfer = new TextTransfer();
assertTrue(textTransfer.getClipboardContents().contains("Some text in my pdf"));
}
Another way, still in java, is to download the pdf and then convert the pdf to text with PDFBox, see http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html for an example on how to do this.
You cannot do this using WebDriver natively. However, PDFBox API can be used here to read content of PDF file. You will have to first of all shift a focus to browser window where PDF file is opened. You can then parse all the content of PDF file and search for the desired text string.
Here is a code to use PDFBox API to search within PDF document.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
public class pdfToTextConverter {
public static void pdfToText(String path_to_PDF_file, String Path_to_output_text_file) throws FileNotFoundException, IOException{
//Parse text from a PDF into a string variable
File f = new File("path_to_PDF_file");
PDFParser parser = new PDFParser(new FileInputStream(f));
parser.parse();
COSDocument cosDoc = parser.getDocument();
PDDocument pdDoc = new PDDocument(cosDoc);
PDFTextStripper pdfStripper = new PDFTextStripper();
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
//Write parsed text into a file
PrintWriter pw = new PrintWriter("Path_to_output_text_file");
pw.print(parsedText);
pw.close();
}
}
JAR Source
http://sourceforge.net/projects/pdfbox/files/latest/download?source=files
Unfortunately you can not do this at all with Selenium
There is a way.
Before you click the link you can obtain the href value
element.FindElement(By.TagName("href")).Text
Then after the PDF loads you can get the Url
driver.GetUrl();
Then you can just check to see if the url contains the href.
It's not the best, but it's better than nothing.
Related
My flutter web app creates files (csv and pdf) that should be downloaded on user click. It works fine on PC Chrome but fails on Android Chrome. No downloaded file is shown and a file named ".com.google.Chrome.xxxxxx" is stored (where the suffix is random).
Here is my code:
import 'dart:html' as html;
void saveFile(String name, dynamic data, String type) {
final blob = html.Blob([data], type);
html.AnchorElement(href: html.Url.createObjectUrlFromBlob(blob))
..download = name
..click();
}
I pass either the result of pdf.save() or the csv String into the data parameter.
I also tried this and it works perfectly on the same android Chrome (but I can't set the file name in this case and the automatically generated file name looks awful):
import 'dart:html' as html;
void saveFile(String name, dynamic data, String type) {
final blob = html.Blob([data], type);
html.window.open(html.Url.createObjectUrlFromBlob(blob), '_blank');
}
Any suggestions how to fix this?
I created a small GitHub repo that demonstrates the problem:
See https://github.com/abrighton/itext-bug.
The repo contains a generated HTML file (TEST.html) that causes itext 7 to throw an exception when converting to PDF in landscape mode:
Exception in thread "main" java.lang.UnsupportedOperationException
at com.itextpdf.layout.renderer.AreaBreakRenderer.draw(AreaBreakRenderer.java:83)
at com.itextpdf.layout.renderer.AbstractRenderer.drawChildren(AbstractRenderer.java:855)
at com.itextpdf.layout.renderer.BlockRenderer.draw(BlockRenderer.java:580)
at com.itextpdf.layout.renderer.AbstractRenderer.drawChildren(AbstractRenderer.java:855)
at com.itextpdf.layout.renderer.BlockRenderer.draw(BlockRenderer.java:580)
at com.itextpdf.layout.renderer.DocumentRenderer.flushSingleRenderer(DocumentRenderer.java:147)
at com.itextpdf.layout.renderer.RootRenderer.processRenderer(RootRenderer.java:380)
at com.itextpdf.layout.renderer.RootRenderer.shrinkCurrentAreaAndProcessRenderer(RootRenderer.java:369)
at com.itextpdf.html2pdf.attach.impl.layout.HtmlDocumentRenderer.shrinkCurrentAreaAndProcessRenderer(HtmlDocumentRenderer.java:347)
at com.itextpdf.layout.renderer.RootRenderer.addChild(RootRenderer.java:264)
at com.itextpdf.html2pdf.attach.impl.layout.HtmlDocumentRenderer.processWaitingElement(HtmlDocumentRenderer.java:234)
at com.itextpdf.html2pdf.attach.impl.layout.HtmlDocumentRenderer.close(HtmlDocumentRenderer.java:194)
at com.itextpdf.layout.Document.close(Document.java:135)
at com.itextpdf.html2pdf.HtmlConverter.convertToPdf(HtmlConverter.java:261)
at com.itextpdf.html2pdf.HtmlConverter.convertToPdf(HtmlConverter.java:221)
at ItextBug$.saveAsPdf(ItextBug.scala:15)
at ItextBug$.delayedEndpoint$ItextBug$1(ItextBug.scala:23)
at ItextBug$delayedInit$body.apply(ItextBug.scala:9)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:73)
at scala.App.$anonfun$main$1$adapted(App.scala:73)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
at scala.collection.AbstractIterable.foreach(Iterable.scala:921)
at scala.App.main(App.scala:73)
at scala.App.main$(App.scala:71)
at ItextBug$.main(ItextBug.scala:9)
at ItextBug.main(ItextBug.scala)
Here is the code:
import java.io.{ByteArrayInputStream, FileOutputStream, OutputStream}
import java.nio.file.{Files, Paths}
import com.itextpdf.html2pdf.HtmlConverter
import com.itextpdf.kernel.geom.PageSize
import com.itextpdf.kernel.pdf.{PdfDocument, PdfWriter}
// Run this from the directory containing TEST.html
object ItextBug extends App {
def saveAsPdf(out: OutputStream, html: String, orientation: String): Unit = {
val pageSize = if (orientation == "landscape") PageSize.LETTER.rotate() else PageSize.LETTER
val writer: PdfWriter = new PdfWriter(out)
val document: PdfDocument = new PdfDocument(writer)
document.setDefaultPageSize(pageSize)
HtmlConverter.convertToPdf(new ByteArrayInputStream(html.getBytes()), document)
out.close()
}
val html = new String(Files.readAllBytes(Paths.get("TEST.html")))
val out = new FileOutputStream("TEST.pdf")
// This version crashes
saveAsPdf(out, html, "landscape")
// This version works
// saveAsPdf(out, html, "portrait")
}
Is there anything wrong with this code?
I have only seen this happen on certain input HTML files. There could be something odd in there, however the HTML displays fine in the browser. Browsers don't throw exceptions for bad HTML and the HTML to PDF converter probably should not either, assuming that is the problem.
(Uses Scala-2.13.1, Java-11)
I want to get images of Discogs releases. Can I do it without Discogs API?
They don't have links to the images in their db dumps.
To do this without the API, you would have to load a web page and extract the image from the html source code. You can find the relevant page by loading https://www.discogs.com/release/xxxx where xxxx is the release number. Since html is just a text file, you can now extract the jpeg URL.
I don't know what your programming language is, but I'm sure it can handle String functions, like indexOf and subString. You could extract the html's OG:Image content for picture.
So taking an example: https://www.discogs.com/release/8140515
Find the .indexOf("og:image\" content=\"); save as startPos to some integer.
That's 19 chars so next do a .indexOf(".jpg", startPos + 19); into a endPos. This gets the first occurence of .jpg after index of startPos + 19 any other chars.
Now extract a subString from html text img_URL = myHtmlStr.substring(startPos+19, endPos);
You should end up with a string reading like this below (extracted URL): https://img.discogs.com/_zHBK73yJ5oON197YTDXM7JoBjA=/fit-in/600x600/filters:strip_icc():format(jpeg):mode_rgb():quality(90)/discogs-images/R-8140515-1460073064-5890.jpeg.jpg
The process can be shortened to finding the startPos index of https://img., then find first occurrence of .jpg when searching from after that startPos index. Extract within that length range. This is because the image URL is only mentioned in the html source at https://img.
Compare page at : https://www.discogs.com/release/8140515 with extracted URL image below.
This is how to do it with Java & Jsoup library.
get HTML page of the release
parse HTML & get <meta property="og:image" content=".." /> to get content value
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class DiscogRelease {
private final String url;
public DiscogRelease(String url) {
this.url = url;
}
public String getImageUrl() {
try {
Document doc = Jsoup.connect(this.url).get();
Elements metas = doc.head().select("meta[property=\"og:image\"]");
if (!metas.isEmpty()) {
Element element = metas.get(0);
return element.attr("content");
}
} catch (IOException ex) {
Logger.getLogger(DiscogRelease.class.getName()).log(Level.SEVERE, null, ex);
}
return null;
}
}
Let me give you an overview of my project first. I have a pdf which I need to convert into images(One image for one page) using PDFBox API and write all those images onto a new pdf using PDFBox API itself. Basically, converting a pdf into a pdf, which we refer to as PDF Transcoding.
For certain pdfs, which contain JBIG2 images, PDFbox implementation of convertToImage() method is failing silently without any exceptions or errors and finally, producing a PDF, but this time, just with blank content(white). The message I am getting on the console is:
Dec 06, 2013 5:15:42 PM org.apache.pdfbox.filter.JBIG2Filter decode
SEVERE: Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
Dec 06, 2013 5:15:42 PM org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap getRGBImage
SEVERE: Something went wrong ... the pixelmap doesn't contain any data.
Dec 06, 2013 5:15:42 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
WARNING: getRGBImage returned NULL
I need to know how to resolve this issue? We have something like:
import org.apache.pdfbox.filter.JBIG2Filter;
which I don't know how to implement.
I am searching on that, but to no avail. Could anyone please suggest?
Take a look at this ticket in PDFBox https://issues.apache.org/jira/browse/PDFBOX-1067 . I think the answer to your question is:
to make sure that you have JAI and the JAI-ImageIO plugins installed for your version of Java: decent installation instructions are available here: http://docs.geoserver.org/latest/en/user/production/java.html
to use the JBIG2-imageio plugin, (newer versions are licensed under the Apache2 license) https://github.com/levigo/jbig2-imageio/
I had the same problem and I fixed it by adding this dependency in my pom.xml :
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>jbig2-imageio</artifactId>
<version>3.0.2</version>
</dependency>
Good luck.
I had the exact same problem.
I downloaded the jar from
jbig2-imageio
and I just included it in my project's application libraries, and it worked right out of the box. As adam said, it uses GPL3.
Installing the JAI seems not needed.
I only needed to download the levigo-jbig2-imageio-1.6.5.jar, place it in the folder of my dependency-jars and in eclipse add it to the java build path libraries.
https://github.com/levigo/jbig2-imageio/
import java.awt.image.BufferedImage
import org.apache.pdfbox.cos.COSName
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage
import org.apache.pdfbox.pdmodel.PDPageTree
import org.apache.pdfbox.pdmodel.PDResources
import org.apache.pdfbox.pdmodel.graphics.PDXObject
import org.apache.pdfbox.rendering.ImageType
import org.apache.pdfbox.rendering.PDFRenderer
import org.apache.pdfbox.tools.imageio.ImageIOUtil
import javax.imageio.ImageIO
import javax.imageio.spi.IIORegistry
import javax.imageio.spi.ImageReaderSpi
import javax.swing.*
import javax.swing.filechooser.FileNameExtensionFilter
public class savePDFAsImage{
String path = "c:/pdfImage/"
//allow pdf file selection for extracting
public static File selectPDF() {
File file = null
JFileChooser chooser = new JFileChooser()
FileNameExtensionFilter filter = new FileNameExtensionFilter("PDF", "pdf")
chooser.setFileFilter(filter)
chooser.setMultiSelectionEnabled(false)
int returnVal = chooser.showOpenDialog(null)
if (returnVal == JFileChooser.APPROVE_OPTION) {
file = chooser.getSelectedFile()
println "Please wait..."
}
return file
}
public static void main(String[] args) {
try {
// help to view list of plugin registered. check by adding JBig2 plugin and JAI plugin
ImageIO.scanForPlugins()
IIORegistry reg = IIORegistry.getDefaultInstance()
Iterator spIt = reg.getServiceProviders(ImageReaderSpi.class, false)
spIt.each(){
println it.getProperties()
}
testPDFBoxSaveAsImage()
testPDFBoxExtractImagesX()
} catch (Exception e) {
e.printStackTrace()
}
}
public static void testPDFBoxExtractImagesX() throws Exception {
PDDocument document = PDDocument.load(selectPDF())
PDPageTree list = document.getPages()
for (PDPage page : list) {
PDResources pdResources = page.getResources()
for (COSName c : pdResources.getXObjectNames()) {
PDXObject o = pdResources.getXObject(c)
if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
File file = new File( + System.nanoTime() + ".png")
ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) o).getImage(), "png", file)
}
}
}
document.close()
println "Extraction complete"
}
public static void testPDFBoxSaveAsImage() throws Exception {
PDDocument document = PDDocument.load(selectPDF().getBytes())
PDFRenderer pdfRenderer = new PDFRenderer(document)
for (int page = 0; page < document.getNumberOfPages(); ++page) {
BufferedImage bim = pdfRenderer.renderImageWithDPI(page,300, ImageType.BINARY)
// suffix in filename will be used as the file format
OutputStream fileOutputStream = new FileOutputStream(+ System.nanoTime() + ".png")
boolean b = ImageIOUtil.writeImage(bim, "png",fileOutputStream,300)
}
document.close()
println "Extraction complete"
}
}
The following code is an attempt to search google, and return the results as text or html.
The code was almost entirely copied directly from code snippets online, and i see no reason for it to not return results from the search. How do you return google search results, using htmlunit to submit the search query, without a browser?
import com.gargoylesoftware.htmlunit.WebClient;
import java.io.*;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import java.net.*;
public class GoogleSearch {
public static void main(String[] args)throws IOException, MalformedURLException
{
final WebClient webClient = new WebClient();
HtmlPage page1 = webClient.getPage("http://www.google.com");
HtmlInput input1 = page1.getElementByName("q");
input1.setValueAttribute("yarn");
HtmlSubmitInput submit1 = page1.getElementByName("btnK");
page1=submit1.click();
System.out.println(page1.asXml());
webClient.closeAllWindows();
}
}
There must be some browser detection that changes the generated HTML, because when inspecting the HTML with page1.getWebResponse().getContentAsString(), the submit button is named btnG and not btnK (which is not what I observe in Firefox). Make this change, and the result will be the expected one.
I've just checked this. It's actually 2 ids for 2 google pages:
btnK: on the google home page (where there's 1 long textbox in the middle of the screen). This time the button's id = 'gbqfa'
btnG: on the google result page (where the main textbox is on top of the screen). This time the button's id = 'gbqfb'