How can I extract images from a site that I'm linking to? - image

If you're familiar with Reddit, you'll know how all of their posts containing pictures get a small thumbnail preview beside the title of the submission. How does Reddit go about doing that? Does it just check to see if the link ends with .jpg, .png, .bmp, etc?

reddit will try to pull a thumbnail from any source, not just an image URL. It does this firstly by having set rules for specific sites, and secondly by having one generic process for retrieving thumbnails for unknown URLs. The whole thing runs as an automated periodic task.
One of the (many) benefits of reddit is that the source code is open, and if you understand Python, you should check out /r2/lib/scraper.py for a more detailed look at how this process works.
Also, while StackOverflow is a great place to have programming-related questions answered, you might also want to check out reddit's own /r/redditdev for information on reddit development.

Indeed, if the URL contains .jpg, .png, etc., use that.
If the site is a popular domain (flickr.com, youtube.com, amazon.com, etc.), have a set of predefined rules to extract something you know will be relevant (be it the featured image, YouTube thumbnail, Amazon product image, etc.).
Otherwise, if all you have to work with is some HTML, you'll have to dig it out yourself. You could choose the first one on the page, the biggest by size, or even the one you've algorithmically determined to be the most relevant (e.g. relatively big, inside what you think is the main body content).
If you have to resort to the last option, one technique I'd recommend is to extract multiple images, and A/B test them to find the one which has the best click-through rate. That way you can nearly always get the best one.
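The three-step fallback described above can be sketched roughly as follows. This is only an illustration, not how reddit's scraper actually works: the function name pick_thumbnail, the SITE_RULES table and the videos.example.com rule are all made up for the example, and "biggest" is approximated from the width/height attributes in the markup instead of downloading each image.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def _example_video_rule(url):
    # Hypothetical site rule: pretend this site exposes a preview image
    # under every video page. Real sites each need their own rule.
    return url.rstrip("/") + "/preview.jpg"


# Hypothetical table of per-domain rules (step 2 of the answer).
SITE_RULES = {"videos.example.com": _example_video_rule}


class _ImgCollector(HTMLParser):
    """Collects (src, declared_area) for every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if not src:
            return
        try:
            area = int(attributes.get("width", 0)) * int(attributes.get("height", 0))
        except (TypeError, ValueError):
            area = 0  # missing or non-numeric size attributes
        self.images.append((src, area))


def pick_thumbnail(url, html=""):
    parsed = urlparse(url)
    # Step 1: the link itself points at an image.
    if posixpath.splitext(parsed.path)[1].lower() in IMAGE_EXTENSIONS:
        return url
    # Step 2: a known domain with a predefined rule.
    rule = SITE_RULES.get(parsed.netloc.lower())
    if rule is not None:
        return rule(url)
    # Step 3: dig through the HTML and take the biggest declared image
    # (ties, including "no sizes given", resolve to the first one found).
    collector = _ImgCollector()
    collector.feed(html)
    if not collector.images:
        return None
    src, _ = max(collector.images, key=lambda item: item[1])
    return urljoin(url, src)
```

If you go the A/B testing route, step 3 would return several candidates instead of a single one.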

You can check for the content of the <img> tag.

Related

exist-db how to access a pdf

I am sure it is very simple ... I just cannot get my head around this...
the exist-db Documentation is a bit fuzzy on content extraction...
http://exist-db.org/exist/apps/doc/contentextraction.
I have a pdf-file, consisting of about 162 high-res images (the pdf is quite big ...) and I do not know how to access any of the images that are presumably created ...
please do not destroy me! I am just starting to build a database (for an Edition at Uni). I'd love to have a facsimile edition (so one tab with the image file and one tab with the transcribed texts).
I aim at doing something similar to what Heidelberg University did with the "Welsche Gast Digital" http://digi.ub.uni-heidelberg.de/diglit/cpg389/0190/image
(the chosen image is just an example!)
When clicking on faksimile the Scan opens and when clicking on Transkription the transcribed texts open!
I am quite new to XQuery, XPath and most X-related stuff. I have a "working design" put together in exist-db and am looking at TEI for marking up the transcription etc. I fear I'll have to spend quite some time on this issue ...
(it is not about doing my job for me, it's just about pointing me in the right direction)
I'm afraid the short answer is simply: don't.
Storing a pdf in your db, and then trying to extract images from it, is kind of a recipe for disaster. Instead you should use the source images (not necessarily extracted from the pdf), and store these individually in a collection (e.g. resources/img). Those image files are then the binary resources that the documentation is actually talking about.
You might want to take a look at tei-publisher for creating digital editions in exist, especially this demo app for how to present high-res facsimiles with transcribed portions of text. I'm afraid it's all a bit more involved than just opening a pdf in a browser, but so is the Welsche Gast Digital.

find similar image in library to photo

I work at a printer where we generate thumbnails of artwork for orders and store them in a folder before printing.
I'm looking for a code library that will allow us to take a photo of a printed item and look through the library of thumbnails for the design.
Just wondered if anyone knows of a library or api that could do this?
Thanks
David
pHash is one solution.
There are others but that mainly depends on your requirements: do you only want to identify identical images, if not, what types of transformations do you want to be able to capture etc.
In general you should look for near duplicate image search.
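To give a feel for what near duplicate search does, here is a minimal sketch of an "average hash", which is a simpler relative of pHash and not pHash itself. It assumes each image has already been shrunk to a small grayscale grid (say 8x8) by whatever imaging library you use; that step is not shown, and the function names are just illustrative.

```python
def average_hash(pixels):
    """Return a list of bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the positions where two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def find_closest(query_hash, library):
    """library maps a thumbnail name to its hash; returns (name, distance)."""
    return min(
        ((name, hamming_distance(query_hash, stored)) for name, stored in library.items()),
        key=lambda pair: pair[1],
    )


# Tiny 2x2 grids stand in for downscaled thumbnails and a photo of a print.
library = {
    "design_a": average_hash([[10, 200], [10, 200]]),
    "design_b": average_hash([[200, 10], [200, 10]]),
}
photo = [[30, 180], [25, 190]]
best_match = find_closest(average_hash(photo), library)
```

A small distance means the images are probably the same design. A hash like this tolerates brightness changes and mild blur, but not rotation or perspective, which matters when the query is a photo of a printed item.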
@david-jennings There are numerous methods to look for similar images in libraries. Remember that Google already does this in Google Images.
Your problem falls under the scope of Content Based Image Retrieval (CBIR), which aims at looking for images with similarities in their content. MPEG-7 is a standard established many years ago to address these issues and the research field is very active with new techniques being developed constantly.
The main idea in CBIR is to extract some kind of a signature from an image and try to match it with all previously extracted signatures of all images in your database. Which method to use depends upon the specifics of your problem... According to your initial post, I suppose that SIFT (Scale-Invariant Feature Transform) will probably do the job for you...
You may implement such a system using OpenCV with C/C++/Java/etc., or something more "scientific" using MATLAB.
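As a rough illustration of the "extract a signature, then match it against the stored ones" idea, here is a toy sketch that uses a normalized brightness histogram as the signature and ranks database entries by L1 distance. This is not SIFT and not an MPEG-7 descriptor; the function names and the 4-bin histogram are assumptions made for the example, and images are represented as plain grids of 0-255 grayscale values.

```python
def histogram_signature(pixels, bins=4, max_value=256):
    """Normalized histogram of grayscale values, used as a crude signature."""
    counts = [0] * bins
    total = 0
    for row in pixels:
        for value in row:
            counts[min(value * bins // max_value, bins - 1)] += 1
            total += 1
    return [count / total for count in counts]


def l1_distance(signature_a, signature_b):
    return sum(abs(a - b) for a, b in zip(signature_a, signature_b))


def rank_matches(query_signature, database, top_k=3):
    """Return the names of the top_k closest signatures, best first."""
    scored = sorted(
        (l1_distance(query_signature, signature), name)
        for name, signature in database.items()
    )
    return [name for _, name in scored[:top_k]]


database = {
    "dark": histogram_signature([[0, 10], [20, 30]]),
    "bright": histogram_signature([[250, 240], [230, 220]]),
}
query = histogram_signature([[5, 15], [25, 100]])
ranking = rank_matches(query, database)
```

A real system would swap the histogram for a stronger descriptor and the linear scan for an index, but the extract/store/compare structure stays the same.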

Wordpress theme/image management

So I've been creating a custom responsive theme in Wordpress and I've hit a wall when it comes to image management. I'd like to style images in a way that wordpress doesn't seem to inherently support - I'm looking for something like this:
with the images added via the regular wordpress media management pane, and inserted into posts/pages. The images should be out of the flow of the content but accurately placed next to the correct headers/text blocks. Most importantly, the images ought to collapse into a column with the rest of the content at the correct media query breakpoints.
Here's what I've tried, from worst to best:
1. Hard coded images in template files
Obviously the worst option. Not portable, requires a lot of meddling, and would be almost impossible to align the images with the correct content. Also, no real way of making the images responsive with the content.
2. Use the default image styling and abandon the idea of pulling the images out of the regular flow
Non optimal, but it would allow anyone to change/edit images easily.
3. Remove images from the results of the_content(), then place and style them separately.
Portable, but has the same problems as #1 - difficult to align the images with content and keep responsiveness.
4. Use the featured image on pages that only require one image
Pretty good option for pages that need ONLY one image, but there is no easy way to make the featured image an arbitrary size/aspect ratio.
5. Use markup in the editor to correctly lay out the images
Requires anyone editing the posts/pages to have some knowledge of the underlying theme. This seems to work the best, but it isn't portable (might break stuff on theme change).
While I've had the best results with this option, it seems sort of antithetical to using a cms/wysiwyg editor in the first place.
My question is whether or not the last option really is the best to get the result I want?
In the end, the answer was clearly custom fields, and none of the other options I listed. With the advanced custom fields plugin, it becomes a breeze to do what I wanted. You don't need the plugin, but it makes image management a whole lot easier, as it fully integrates the wordpress media library with the custom field (which you would have to do manually otherwise). With the plugin, custom fields meet all of my needs (responsiveness, portability, and ease of use for the technically challenged).

how to make non copyable html page like google book

I am just curious whether it is possible to copy books from Google or not. I am also curious to know how to make that kind of material.
I suppose the best way is to convert the text pages to images. You'd still be able to capture the images, but they wouldn't be in text form anymore; to get them back in their original form, you'd have to OCR them, which is an arduous process.

How do I Make a Web Crawling Application User-Friendly

I'm creating a web crawling application that I want the 'average' user to be able to use. I'm concerned that a web crawling application is probably just too complex for most users though because users need to:
Understand URL structure (domain, path, etc...).
Understand crawling 'depth'.
Understand file extensions and be able to setup 'filters' to narrow their crawling to achieve better performance (or they'll be frustrated with the program).
Understand where URLs are found in pages (image srcs, links, plain text URLs, etc...).
What can I do to help users get quickly acquainted with my program? Or even better, what can I do so the program is intuitive enough that users just 'get it'? I know this seems pretty broad, but if you can confine your answers to web crawlers that should help. I've read up on general usability, ui design, etc... but I'm struggling with the domain I'm working in. Thanks.
Just because a web crawler is complex in implementation doesn't mean it has to be complex to use. Only offer what is really necessary, and use sensible defaults for the rest. That will get you 80% of use cases, and you can then rely on the other 20% being more willing to gain a deeper understanding.
Why should they have to understand this? It depends on the expected usage, but I would have assumed most uses were crawling a full website, so only the domain is needed.
Gert G's suggestion of a slider with extending folder structure was a good one. This doesn't have to be dynamic with the site in question, just an illustration of what it means.
Forget exposing file extensions, instead offer common types of file with icons, possibly even grouping them (e.g. all common image types, jpg, png, gif, go into one 'images' type). Only give raw file extension settings under an advanced config section, those that need it will understand it.
I don't really see why they need to understand this? Surely that's a job for the crawler.
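A minimal sketch of what "sensible defaults plus an advanced section" could look like in the settings model. Everything here (CrawlSettings, FILE_TYPE_GROUPS, the default depth of 2, the particular extensions in each group) is made up for illustration; the point is that the average user only supplies a start URL and maybe ticks a file type, while raw extensions stay in an advanced field.

```python
from dataclasses import dataclass

# Friendly file types shown in the UI, each hiding a set of raw extensions.
FILE_TYPE_GROUPS = {
    "images": {".jpg", ".jpeg", ".png", ".gif"},
    "documents": {".pdf", ".doc", ".docx"},
    "pages": {".html", ".htm"},
}


@dataclass
class CrawlSettings:
    start_url: str                               # the only required input
    max_depth: int = 2                           # assumed default, driven by the slider
    file_types: tuple = ("pages", "images")      # friendly groups, not extensions
    extra_extensions: frozenset = frozenset()    # advanced section only

    def allowed_extensions(self):
        """Expand the chosen groups (plus advanced extras) to raw extensions."""
        allowed = set(self.extra_extensions)
        for group in self.file_types:
            allowed |= FILE_TYPE_GROUPS[group]
        return allowed


# Typical user: just a URL.
basic = CrawlSettings("http://example.com")

# Advanced user: documents only, plus one raw extension.
advanced = CrawlSettings(
    "http://example.com",
    file_types=("documents",),
    extra_extensions=frozenset({".zip"}),
)
```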
Some ideas:
Make an interactive user interface (e.g. a slider for depth, which shows a small picture of folders and subfolders opening as they move the slider)
Avoid clutter. Divide the settings into logical tabs.
Make video tutorials for the things you need to teach them.
Perhaps you could have a picture of "the web" showing two or three pages each from two or three websites. As the user selects where to find links (for example, images, plain text, links, etc.), the parts of the page they selected would be highlighted in the images.
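To drive that kind of highlighting (or just to keep the "where do URLs come from" question inside the crawler), the crawler can tag every URL it finds with its source. A rough sketch using only the standard library follows; the class and function names are illustrative, and the plain-text regex is deliberately naive (it will, for instance, swallow trailing punctuation).

```python
import re
from html.parser import HTMLParser

# Naive pattern for URLs that appear as visible text in the page.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


class UrlSourceCollector(HTMLParser):
    """Groups the URLs found in a page by where they were found."""

    def __init__(self):
        super().__init__()
        self.found = {"links": [], "images": [], "plain_text": []}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.found["links"].append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.found["images"].append(attributes["src"])

    def handle_data(self, data):
        self.found["plain_text"].extend(URL_PATTERN.findall(data))


def urls_by_source(html):
    collector = UrlSourceCollector()
    collector.feed(html)
    collector.close()
    return collector.found


sample = (
    "<p>See http://example.com/plain for more</p>"
    '<a href="/about">About</a><img src="logo.png">'
)
result = urls_by_source(sample)
```

The UI then only needs three checkboxes (links, images, plain text) that decide which of the three lists get followed.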
