I want to get images of Discogs releases. Can I do it without Discogs API?
They don't have links to the images in their db dumps.
To do this without the API, you would have to load a web page and extract the image from the html source code. You can find the relevant page by loading https://www.discogs.com/release/xxxx where xxxx is the release number. Since html is just a text file, you can now extract the jpeg URL.
I don't know which programming language you use, but it surely has string functions such as indexOf and substring. You could extract the content of the page's og:image meta tag to get the picture URL.
So taking an example: https://www.discogs.com/release/8140515
Find the marker with .indexOf("og:image\" content=\"") and save the result to an integer, startPos.
The marker is 19 chars long, so next do a .indexOf(".jpg", startPos + 19) and save the result as endPos. This gets the first occurrence of .jpg after the marker.
Now extract a substring from the html text: img_URL = myHtmlStr.substring(startPos + 19, endPos + 4); (the + 4 keeps the ".jpg" itself, because endPos points at its first character).
You should end up with a string reading like this below (extracted URL): https://img.discogs.com/_zHBK73yJ5oON197YTDXM7JoBjA=/fit-in/600x600/filters:strip_icc():format(jpeg):mode_rgb():quality(90)/discogs-images/R-8140515-1460073064-5890.jpeg.jpg
The process can be shortened: find the startPos index of https://img., then find the first occurrence of .jpg when searching from that index, and extract everything from startPos through the end of .jpg. This works because the image URL is the only place in the html source where https://img. appears.
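The steps above can be sketched in Java roughly as follows. This is a minimal sketch, not a definitive implementation: the class name OgImageExtractor and the method extract are hypothetical, and as a small variation it reads up to the closing quote of the content attribute instead of searching for .jpg, so it does not depend on the file extension.

```java
final class OgImageExtractor {
    // Marker that precedes the image URL in the page source (19 characters).
    private static final String MARKER = "og:image\" content=\"";

    private OgImageExtractor() {}

    /** Returns the og:image URL found in the given HTML text, or null if there is none. */
    static String extract(String html) {
        int startPos = html.indexOf(MARKER);
        if (startPos < 0) {
            return null; // marker not present
        }
        int urlStart = startPos + MARKER.length();
        // Assumption: the attribute value is closed by a double quote.
        int urlEnd = html.indexOf('"', urlStart);
        if (urlEnd < 0) {
            return null; // malformed tag
        }
        return html.substring(urlStart, urlEnd);
    }
}
```

Calling OgImageExtractor.extract(myHtmlStr) on the downloaded page text would then give the image URL, or null if the page layout differs from what is assumed here.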
Compare page at : https://www.discogs.com/release/8140515 with extracted URL image below.
This is how to do it with Java and the Jsoup library:
1. Get the HTML page of the release.
2. Parse the HTML and read the content value of <meta property="og:image" content=".." />.
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class DiscogRelease {
private final String url;
public DiscogRelease(String url) {
this.url = url;
}
public String getImageUrl() {
try {
Document doc = Jsoup.connect(this.url).get();
Elements metas = doc.head().select("meta[property=\"og:image\"]");
if (!metas.isEmpty()) {
Element element = metas.get(0);
return element.attr("content");
}
} catch (IOException ex) {
Logger.getLogger(DiscogRelease.class.getName()).log(Level.SEVERE, null, ex);
}
return null;
}
}
In the end, my goal is to send a raw image data from the front-end, then split that image into however many pages, and lastly send that pdf back to the front-end for download.
But every time I call theDoc.AddImageFile(), it tells me that the "Image is not in a suitable format". I'm using this as reference: https://www.websupergoo.com/helppdfnet/source/5-abcpdf/doc/1-methods/addimagefile.htm
To troubleshoot, I thought that the image might not be rendering correctly, so I added a File.WriteAllBytes to view the rendered image and it was exactly what I wanted, but still not adding to the PDF. I also tried sending the actual path of a previously rendered image thinking that the new image might not have been fully created yet, but it also gave me the same error. Lastly, I thought PNGs might be problematic and changed to JPG but it did not work.
Here is the code:
[HttpPost]
public IActionResult PrintToPDF(string imageString)
{
// Converts dataUri to bytes
var base64Data = Regex.Match(imageString, @"data:image/(?<type>.+?),(?<data>.+)").Groups["data"].Value;
var binData = Convert.FromBase64String(base64Data);
/* Ultimately will be removed, but used for debugging image */
string path = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
string imgName= "Test.jpg";
string filename = Path.Combine(path, imgName);
System.IO.File.WriteAllBytes(filename, binData);
/***********************************************************/
using (Doc theDoc = new Doc())
{
// Using explicit path
theDoc.AddImageFile(@"C:\Users\User\Documents\Test.jpg", 1);
// Using variable
//theDoc.AddImageFile(filename, 1);
// What I really want
//theDoc.AddImageFile(binData , 1);
theDoc.Page = theDoc.AddPage();
theDoc.AddText("Thanks");
Response.Headers.Clear();
Response.Headers.Add("content-disposition", "attachment; filename=test.pdf");
return new FileStreamResult(theDoc.GetStream(), "application/pdf");
}
}
Try something like this (not tested, but cleaned up from my own code):
public int AddImageFile(Doc doc, byte[] data, int insertBeforePageID)
{
int pageid;
using (var img = new XImage())
{
img.SetData(data);
doc.Page = doc.AddPage(insertBeforePageID);
pageid = doc.Page;
doc.AddImage(img);
img.Clear();
}
return pageid;
}
To add a JPEG from a byte array you need Doc.AddImageData instead of Doc.AddImageFile. Note that AddImageFile / AddImageData do not support PNG - for that you would definitely need to use an XImage. The XImage.SetData documentation has the currently supported image formats.
I have below HTML code:
<img title="hotelThumbImage" id="hotelThumbImage01" width="140px" height="129px"
src="/b2c/images/?url=FixedPkgB2c/FF-252-325"/>
It renders in IE as below:
It renders in all other browser like FireFox and Chrome as:
Related question : How to make a Servlet call form UI which returns the Content itself and place an img tag using Script in the output?
My project is suffering from this too, and it's because IE prevents download/display of files which have a different encoding than their extension. It has something to do with malicious code being able to be hidden as image files simply by changing the extension of the file.
Firefox and Chrome are smart enough to display it as an image so long as the encoding is that of an image, but IE takes no chances, it seems.
You'll have to add the extension that matches your image's encoding for it to display in IE.
Edit: It's also possible that your server is sending the file with a header denoting plain text. Again, Firefox and Chrome are smart enough to handle it, but IE isn't. See: https://stackoverflow.com/a/32988576/4793951
Welcome to IE world... :(
What I would do, in order to have better control of the situation, is modify the getter method, so in Holiday.getPkgCode():
public String getPkgCode() throws IOException {
if (!this.pkgCode.contains(".")) {
String ext = ImgUtil.determineFormat(this.pkgCode);
return this.pkgCode + ImgUtil.toExtension(ext);
} else {
return this.pkgCode;
}
}
To use it you will need to handle the IOException and add this ImgUtil class (adapted from here):
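import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
class ImgUtil {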
public static String determineFormat(String name) throws IOException {
// get image format in a file
File file = new File(name);
// create an image input stream from the specified file
ImageInputStream iis = ImageIO.createImageInputStream(file);
// get all currently registered readers that recognize the image format
Iterator<ImageReader> iter = ImageIO.getImageReaders(iis);
if (!iter.hasNext()) {
throw new RuntimeException("No readers found!");
}
// get the first reader
ImageReader reader = iter.next();
String toReturn = reader.getFormatName();
// close stream
iis.close();
return toReturn;
}
public static String toExtension(String ext) {
// normalize case, since readers may report the format name in lower case
switch (ext.toUpperCase()) {
case "JPEG": return ".jpg";
case "PNG": return ".png";
}
return null;
}
}
TEST IT:
NOTE: I placed a JPEG image without an extension in the C:\tmp folder.
public class Q37052184 {
String pkgCode = "C:\\tmp\\yorch";
public static void main(String[] args) throws IOException {
Q37052184 q = new Q37052184();
System.out.println(q.getPkgCode());
}
// the given getter!!!
}
OUTPUT:
C:\tmp\yorch.jpg
You have to set the Content-Type header of the response in the servlet.
For example, in Spring 4 MVC:
@GetMapping(value = "/b2c/images/?url=FixedPkgB2c/FF-252-325")
public ResponseEntity<byte []> getImageThumbnail() {
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.IMAGE_JPEG); // use the media type that matches your image
byte [] content= ...;
return ResponseEntity.ok().headers(headers).body(content);
}
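To decide which media type to put in that header when the stored file has no extension, one option is to look at the first bytes of the image. The following is only a sketch under stated assumptions: ImageTypeSniffer and contentTypeOf are hypothetical names, and only the PNG, JPEG and GIF signatures are checked.

```java
final class ImageTypeSniffer {
    private ImageTypeSniffer() {}

    /** Returns a Content-Type guess based on the file signature, or null if it is not recognized. */
    static String contentTypeOf(byte[] data) {
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47)) {
            return "image/png";  // \x89 P N G
        }
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "image/jpeg"; // JPEG start-of-image marker
        }
        if (startsWith(data, 0x47, 0x49, 0x46, 0x38)) {
            return "image/gif";  // "GIF8"
        }
        return null;
    }

    // Compares the leading bytes of data (as unsigned values) with the given signature.
    private static boolean startsWith(byte[] data, int... signature) {
        if (data == null || data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The returned string could then be passed to the response header, falling back to a default when the method returns null.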
I am trying to parse and display images from a feed that has the image URL inside tags. An example is this:
Note: http://someImage.jpg is not a real image link; it is just an example. This is what I have done so far:
public void startElement(String uri, String localName, String qName, Attributes atts) {
chars = new StringBuilder();
if (qName.equalsIgnoreCase("content:encoded")) {
if (!atts.getValue("src").toString().equalsIgnoreCase("null")) {
feedStr.setImgLink(atts.getValue("src").toString());
Log.d(TAG, "inside if " + feedStr.getImgLink());
} else {
feedStr.setImgLink("");
Log.d(TAG, feedStr.getImgLink());
}
}
}
I believe this part of my program needs to be tweaked. First, when qName equals "content:encoded", the parsing stops: the application runs endlessly and displays nothing. Second, if I change that initial if to something qName can never equal, like "purplebunny", everything works perfectly, except that there are no images. What am I missing? Am I using atts.getValue properly? I have logged the value of ImgLink and it is always null.
You can store the content:encoded data in a String and then extract the image URL from it with the Jsoup library.
Example:
Suppose the raw content:encoded data is stored in the Description variable.
Document doc = Jsoup.parse(Description);
Element image =doc.select("img").first();
String url = image.absUrl("src");
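Presumably the image URL sits in the element's text (an HTML fragment) rather than in an XML attribute of content:encoded, which would explain why atts.getValue("src") has nothing to return. If adding Jsoup is not an option, the same extraction can be roughly sketched with the standard library. This is a fallback sketch, not a replacement for a real HTML parser: FirstImageSrc and find are hypothetical names, and the regular expression only handles quoted src attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstImageSrc {
    // Matches the first <img ... src="..."> (single or double quotes), ignoring case.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    private FirstImageSrc() {}

    /** Returns the src of the first img tag in the fragment, or null if none is found. */
    static String find(String htmlFragment) {
        Matcher m = IMG_SRC.matcher(htmlFragment);
        return m.find() ? m.group(2) : null;
    }
}
```

In the SAX handler, the text of content:encoded would be collected in characters() and passed to FirstImageSrc.find(...) in endElement().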
My web application loads a pdf in the browser. I have figured out how to check that the pdf has loaded correctly using:
verifyAttribute
xpath=//embed/@src
{URL of PDF goes here}
It would be really nice to be able to check the contents of the pdf with Selenium - for example verify that some text is present. Is there any way to do this?
While not natively supported, I have found a couple of ways using the Java driver. One way is to have the pdf open in your browser (with Adobe Acrobat installed), use keyboard shortcuts to select all text (CTRL+A) and copy it to the clipboard (CTRL+C), and then verify the text in the clipboard. For example:
protected String getLastWindow() {
return session().getEval("var windowId; for(var x in selenium.browserbot.openedWindows ){windowId=x;} ");
}
@Test
public void testTextInPDF() throws InterruptedException {
session().click("link=View PDF");
String popupName = getLastWindow();
session().waitForPopUp(popupName, PAGE_LOAD_TIMEOUT);
session().selectWindow(popupName);
session().windowMaximize();
session().windowFocus();
Thread.sleep(3000);
session().keyDownNative("17"); // Stands for CTRL key
session().keyPressNative("65"); // Stands for A "ascii code for A"
session().keyUpNative("17"); //Releases CTRL key
Thread.sleep(1000);
session().keyDownNative("17"); // Stands for CTRL key
session().keyPressNative("67"); // Stands for C "ascii code for C"
session().keyUpNative("17"); //Releases CTRL key
TextTransfer textTransfer = new TextTransfer();
assertTrue(textTransfer.getClipboardContents().contains("Some text in my pdf"));
}
Another way, still in java, is to download the pdf and then convert the pdf to text with PDFBox, see http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html for an example on how to do this.
You cannot do this using WebDriver natively. However, the PDFBox API can be used to read the content of a PDF file. You will first have to switch focus to the browser window where the PDF file is opened. You can then parse the content of the PDF file and search for the desired text string.
Here is code that uses the PDFBox API to extract the text of a PDF document, which you can then search.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
public class pdfToTextConverter {
public static void pdfToText(String path_to_PDF_file, String Path_to_output_text_file) throws FileNotFoundException, IOException{
//Parse text from a PDF into a string variable
File f = new File(path_to_PDF_file);
PDFParser parser = new PDFParser(new FileInputStream(f));
parser.parse();
COSDocument cosDoc = parser.getDocument();
PDDocument pdDoc = new PDDocument(cosDoc);
PDFTextStripper pdfStripper = new PDFTextStripper();
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
//Write parsed text into a file
PrintWriter pw = new PrintWriter(Path_to_output_text_file);
pw.print(parsedText);
pw.close();
}
}
JAR Source
http://sourceforge.net/projects/pdfbox/files/latest/download?source=files
Unfortunately you can not do this at all with Selenium
There is a way.
Before you click the link you can obtain the href value
element.GetAttribute("href")
Then after the PDF loads you can get the Url
driver.Url
Then you can just check to see if the url contains the href.
It's not the best, but it's better than nothing.
I'm still learning Grails and seem to have hit a stumbling block.
Here are the 2 domain classes:
class Photo {
byte[] file
static belongsTo = Profile
}
class Profile {
String fullName
Set photos
static hasMany = [photos:Photo]
}
The relevant controller snippet:
class PhotoController {
def viewImage = {
def photo = Photo.get( params.id )
byte[] image = photo.file
response.outputStream << image
}
}
Finally the GSP snippet:
<img class="Photo" src="${createLink(controller:'photo', action:'viewImage', id:'profileInstance.photos.get(1).id')}" />
Now how do I access the photo so that it will be shown on the GSP? I'm pretty sure that
profileInstance.photos.get(1).id is not correct.
If you have a URL for the image, you just have to make sure you return the appropriate answer in the controller:
def viewImage= {
//retrieve photo code here
response.setHeader("Content-disposition", "attachment; filename=${photo.name}")
response.contentType = photo.fileType //'image/jpeg' will do too
response.outputStream << photo.file //'myphoto.jpg' will do too
response.outputStream.flush()
return;
}
As it is a Set, if you want the first element, you will have to go:
profileInstance.photos.toArray()[0].id
or
profileInstance.photos.iterator().next().id
Now, I actually think storing the photo as a binary blob in the database isn't the best solution, though you might have reasons why it needs to be done that way.
How about storing the name of the photo (and/or the path) instead? If name clashes are likely, use the MD5 checksum of the photo as the name. Then the photo becomes a static resource, a simple file, instead of a more complicated and slower MVC request.
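The checksum-as-name idea could look roughly like this in plain Java (callable from Grails as well). It is a sketch under assumptions: PhotoNames and md5FileName are hypothetical names, and the caller is assumed to know the file extension.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class PhotoNames {
    private PhotoNames() {}

    /** Builds a file name from the MD5 checksum of the photo bytes plus the given extension. */
    static String md5FileName(byte[] photo, String extension) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(photo);
            // Render the 16-byte digest as 32 lower-case hex characters, zero-padded.
            return String.format("%032x", new BigInteger(1, digest)) + extension;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a standard algorithm on the Java platform, so this is not expected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Identical photos then map to the same file name, so an upload that already exists on disk does not need to be written again.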
I'm learning Grails too and was searching for an example like this one.
The GSP snippet didn't work for me. I resolved it by replacing the single quotes around profileInstance.photos.get(1).id
<img class="Photo" src="${createLink(controller:'photo', action:'viewImage', id:'profileInstance.photos.get(1).id')}" />
with double quotes:
<img class="Photo" src="${createLink(controller:'photo', action:'viewImage', id:"profileInstance.photos.get(1).id")}" />
Now Grails resolves the expression inside the double quotes. Otherwise it treats it as a plain string.
My guess is you need to set the content type of the response stream. Something like:
response.ContentType = "image/jpeg"
This may or may not need to be before you stream to the response stream (can't imagine that it would matter). I'd just put it before the outputStream line in your code above.
id:'profileInstance.photos.get(1).id' should be id:profileInstance.photos.get(1).id. No quotes.