About search engines: how do they take screenshots of web sites? - snapshot

This may be a dumb question, but I really have no idea and I'm utterly curious! So please bear with me.
What I know is that search engines just read the HTML and the words on a site. They usually ignore CSS, or at least part of it. They arguably cannot read images. Or can they?
If they really cannot read those things, or simply ignore them, then my question is: how do they make a screenshot, which shows the page just the way the CSS presents it, images and all?
If they don't read CSS or images, and they don't open the page on a screen the way a human being does, how do they make the screenshot?
Thanks!

Are you referring to Google's new screenshot feature, or their old cache feature? Your question is talking about screenshots and doesn't mention the cache at all, but your comments on your question seem to imply that you're referring to the cache, not the screenshots.
In the case of the screenshots:
You are correct that search engines usually only read the HTML and text on a website, because that's all they need. But that doesn't mean they can't read more.
When they want to take a screenshot of a site, they'll just do exactly what a normal browser does when a user visits the site: download the HTML, the CSS, the images, and everything else, and render it all with the rendering engine of a web browser, such as WebKit.
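To make that concrete, here is a minimal sketch of the idea in Python, using Selenium to drive headless Chrome (my choice of tooling for illustration; it is not necessarily what any particular search engine runs internally):

    # Render a page the way a browser would and save the result as an image.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")            # render without opening a window
    options.add_argument("--window-size=1280,1024")

    driver = webdriver.Chrome(options=options)    # assumes chromedriver is available
    driver.get("https://example.com")             # fetches the HTML, CSS, images...
    driver.save_screenshot("example.png")         # ...renders them, and saves a PNG
    driver.quit()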
In the case of the cache:
The search engine usually just stores the HTML without/before parsing it. It sends the saved HTML to your browser, and your browser pulls all the other stuff in the page (images, etc) from the original website. The search engine isn't reading anything, it's just saving the page verbatim (well, with minor changes, namely URL rewriting), and giving it to your browser.
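As a rough illustration of the "save it verbatim" part, here is a tiny Python sketch (purely illustrative; a real search engine cache is far more involved and does its own URL rewriting):

    # Fetch a page once and store its HTML as-is, without parsing it.
    import requests

    url = "https://example.com/some-page"         # placeholder URL
    html = requests.get(url).text

    with open("cached_page.html", "w", encoding="utf-8") as f:
        # Stored essentially verbatim; when a browser later opens this file,
        # it still pulls images, CSS, etc. from wherever the links point.
        f.write(html)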

There are apps that take screenshots of pages as if they were displayed in a chosen browser.
Browsershots is an example of an online service that does this.
Here are some links to webpage thumbnail generator projects:
Build your own website thumbnail generator with Django (Python)
Zubrag Website Thumb Generator (PHP)

Maybe I'm not understanding your question, but...
You seem to be using "read an image" to mean loading the image data into the search engine. That, the search engine does do (CSS included). When people say search engines ignore images, they mean the engine doesn't treat them as meaningful, searchable data. In other words, if I make an image containing the word "Hello", you and I "read" it in the sense that we see and understand that the image contains a word. A search engine typically will not attempt to do this; it will, however, "read" the image into its storage if it wants to be able to present it to a user at a later time.

Search engines don't use CSS and image content for indexing, but they can store them on their servers to make a cached version of the site.
In the case of Google, I think they store only text files: HTML, CSS, maybe JavaScript, but no images.

Related

Microsoft Web Matrix

Pretty easy question, I hope: does anyone know of a tool that will effectively scrape sites built with Microsoft Matrix? I could write the code in Python, but it would take me way longer than I want to dedicate to the task, mainly because of the really bad and ugly HTML generated by Matrix.
I have tried WebHarvey, Helium Scraper, and the Web Scraper plugin for Chrome. WebHarvey choked on the HTML and couldn't load subsequent pages. Helium Scraper was able to move from one details page to another (the Next links were followed), but content within the details pages was not lifted out. The Chrome Web Scraper plugin was not able to navigate links; its popup window displayed an error page. My gut tells me this has to do with ASP.NET-specific quirks, but I could be wrong.
Any pointers or suggestions appreciated.
You know there are two completely different versions of Microsoft Web Matrix, right? There's the one from 2003; I have no idea what its HTML looks like. Then there's the one from 2011 to current, which uses Razor .cshtml source files to produce its HTML. In the 2011+ one, you write the HTML by hand; there's no drag and drop, so it's unlikely you'll get consistent HTML from site to site.

Web Scraping an Image

I was thinking about the applications of web scraping (still quite new to it) and came up with a question. Can you get an image from a page if there are advertisements on the page (like can you avoid advertisements and only look for the correct image content on the page)? Also, if the image is also a link to another page, can you say go to the next page and get that image (and then go from there until you either reach a certain amount or get all of the images)? This would mean avoiding going to the advertisements pages.
Absolutely. If you use a tool like kimonolabs.com, this can be relatively easy. You click the data that you want on the page, so instead of getting all images, advertisements included, Kimono uses the CSS selectors of the data you clicked to know which data to scrape.
You can use Kimono to scrape data within links as well. It's actually a very common use. Here's a break-down of that strategy: https://help.kimonolabs.com/hc/en-us/articles/203438300-Source-URLs-to-crawl-from-another-kimono-API
This might be a helpful solution for you, especially if you're not a programmer because it doesn't require coding experience. It's a pretty powerful tool.
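If you would rather script the same idea yourself, here is a rough Python sketch using requests and BeautifulSoup: pick images only from the content area (so ad slots are skipped) and follow a link to the next page. The selectors and the start URL are hypothetical placeholders you would adapt to the actual site:

    # Scrape content images by CSS selector and follow "next" links between pages.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def scrape_images(start_url, max_pages=10):
        url, visited = start_url, 0
        while url and visited < max_pages:
            soup = BeautifulSoup(requests.get(url).text, "html.parser")
            # Only look inside the main content container, not ad sidebars.
            for img in soup.select("div.article-body img"):
                print("image:", urljoin(url, img["src"]))
            # Follow the link to the next page, if there is one.
            next_link = soup.select_one("div.article-body a.next")
            url = urljoin(url, next_link["href"]) if next_link else None
            visited += 1

    scrape_images("https://example.com/gallery/1")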
If you are OK with PHP, take a look at PHP Simple HTML DOM Parser. I have used it a lot and have scraped a number of websites with it.

Firefox plugin for a web developer that shows all resources (js, css, html) as a single unified file?

I'm developing with a really complex CMS system, and sometimes I need to know whether something was sent to my HTML rendering.
Since this is a huge CMS system, I have at least 30 resources linked to a page (JS, CSS), and going through each one, clicking and searching for a string, is not the best way to do it.
I would like to have a plugin that gets all the resources from a page and merges them as text, so I can search only once. Is this possible? Does something like this exist?
(I know Firebug can inspect an element and such, but a search option for a specific scenario - like a type=submit somewhere in a CSS file - is faster and more useful.)
The plugin you need is the Web Developer toolbar add-on for Firefox.
You can search all JavaScript files in plain text by clicking Information -> View JavaScript
You can search all CSS files in plain text by clicking CSS -> View CSS
In Firebug, when you inspect an element, it shows all the CSS rules applying to it and a link to the source file involved. I think it is way more powerful than what you want.
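If no add-on quite fits, the merge can also be scripted. Here is a rough Python sketch (the page URL and output filename are placeholders) that downloads a page plus every linked script and stylesheet and concatenates them into one file you can search in a single pass:

    # Collect a page's HTML plus all linked JS/CSS into one searchable text file.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    page_url = "http://localhost/my-cms-page"     # placeholder URL
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, "html.parser")

    resources = [urljoin(page_url, tag["src"]) for tag in soup.find_all("script", src=True)]
    resources += [urljoin(page_url, tag["href"])
                  for tag in soup.find_all("link", href=True)
                  if "stylesheet" in tag.get("rel", [])]

    with open("all_resources.txt", "w", encoding="utf-8") as out:
        out.write(html)
        for res in resources:
            out.write("\n\n/* ===== %s ===== */\n" % res)
            out.write(requests.get(res).text)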

Web Page Rendering Capture

I'll start by describing the problem itself. Rather than a problem, it's more that I'm looking for a better solution. I have an ASP.NET page which has a bunch of images, each with a link underneath it; each image is in fact the latest rendering of the page behind the link underneath it.
I scheduled a .bat script which runs every hour to fetch the images through IECapt, a web page rendering capture utility. One thing that annoys me about this utility is that it takes a lot of time for the 20 images I have, and for a few of them, because of the Flash content, it fails to take an actual screenshot of the website.
Now I'd like to know whether this rendering can be done with traditional programming. I'm not interested in using any utilities; I'm interested in trying this myself. The solution does not necessarily need to be C# based; I'm ready to try any other language, because it gives me a chance to learn.
Thank you.
You should probably look at moz-headless-screenshot
You should be able to embed the functionality you need.
http://blog.mozilla.com/ted/2010/07/29/moz-headless-screenshot/
He also provided a sample embedding client application called moz-headless-screenshot. This is a simple command line tool that takes a URL, image size, and output filename and generates a PNG screenshot of the webpage.
You should look into Browsershots:
http://browsershots.org/
They do what you want to do for lots of different browsers. It is even open source.
There's no simple-simple solution for what you're asking to do. This is because rendering HTML, CSS, and Flash is actually a very sophisticated process.
If you're up for quite a bit of coding, you can use the Gecko engine (which powers Firefox) or another open-source web browser core (e.g. Dillo) to render the page onto a custom canvas, then save that canvas to a file. Unless you implement support for browser plug-ins, you won't get Flash this way, though. You could try using Gnash or its like. Good luck with that.
I don't know of an open-source project that already does this. It would be neat, though :-). If you write something, please push it to the world; it would be really cool to have a "get a screencap of this URL" tool.
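For what it's worth, one scriptable route (my own suggestion, not something mentioned in the answers above) is to shell out to wkhtmltoimage, a WebKit-based command-line page renderer, once per link. A rough Python sketch with placeholder URLs:

    # Capture each page in a list of links as a PNG by calling wkhtmltoimage.
    import subprocess

    links = [
        "http://example.com/page1",    # placeholder URLs; substitute the real links
        "http://example.com/page2",
    ]

    for i, url in enumerate(links, start=1):
        # Basic invocation: wkhtmltoimage <url> <output file>
        subprocess.run(["wkhtmltoimage", url, "capture%02d.png" % i], check=True)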
One way is to use IRobotSoft web scraper. You can design a robot to go to the URL every hour, and capture the whole web page as an image via a function CapturePage(imagefile).
I am not sure if it will be better than IECapt though.
We have used the ACA WebThumb ActiveX Control (http://www.acasystems.com/en/web-thumb-activex/) quite successfully to capture parts or the whole of a web page on the web server and then write them to a file, just by passing in the URL. It performs fast enough for our needs.
I am not familiar with IECapt, but this might be something you want to have a look at.

Best way to make a newsletter slideshow kiosk for the office?

So, I've been tasked with making a kiosk for the office for showing statistics about our Scrum progress, build server status, profitability, and so forth. It should ideally run a slideshow with a bunch of different pages, some of them showing text, some showing graphs, and so on.
What is the best approach for this? I first thought of PowerPoint, but it should be able to take the images from a web server so I can automate the graph generation procedure. I would also like to take text from an external source when showing "Who broke the build" or some page like that.
I have no doubt that ready-made systems exist, but I don't really know where to look for them.
Is this easy or hard in PowerPoint? Or is there a ubiquitous app that everybody but me knows about?
I would recommend creating it as a series of web pages which use JavaScript or the meta refresh tag to cycle through the different pages. Simply full-screen the browser on a spare machine and connect it to a projector/monitor/big TV.
This has lots of benefits:
it's trivial to display images from an external server (an <img> tag)
it will cost nothing to set up (it can run on basically any functioning machine), and it runs in a browser
it is quick to do (you do not have to worry about cross-browser compatibility or different screen resolutions, as you know the exact machine you are developing for)
it's expandable - what you describe is probably possible within PowerPoint, but if you do it as a web page you can use JavaScript (or a JS framework like jQuery), and it's very easy to serve the pages via a web server, at which point you can also use any server-side scripting language.
Basically, you would have a series of files, say slide001.htm, slide002.htm and slide003.htm. Slide001 would redirect to slide002 after 30 seconds, slide002 to slide003, and slide003 would redirect back to slide001.
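As a concrete example of that redirect chain, here is a throwaway Python sketch that writes three such slide pages, each meta-refreshing to the next after 30 seconds and the last looping back to the first (the slide content itself is just a placeholder):

    # Generate slide001.htm ... slide003.htm, each redirecting to the next slide.
    SLIDES = ["slide001.htm", "slide002.htm", "slide003.htm"]

    TEMPLATE = """<!DOCTYPE html>
    <html>
    <head>
      <meta http-equiv="refresh" content="30; url={next}">
      <title>{title}</title>
    </head>
    <body>
      <h1>{title}</h1>
      <!-- graphs, build status, etc. go here -->
    </body>
    </html>
    """

    for i, name in enumerate(SLIDES):
        next_slide = SLIDES[(i + 1) % len(SLIDES)]   # the last slide loops back
        with open(name, "w", encoding="utf-8") as f:
            f.write(TEMPLATE.format(next=next_slide, title="Slide %d" % (i + 1)))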
The specific things you mention: graph generation and "Who broke the build" text:
Not sure which CI tool you use, but many of them generate graphs anyway, so all that would be required is having one "slide" with something like <img src="http://hudson.abc/job/proj042/buildTimeGraph">
For the who-broke-the-build text, it would be easiest to run the slides as .php files served through a web server, for example using XAMPP.
Then you would have a function that scrapes your CI server for whoever broke the last build, and in one of the slides, you would have <?PHP echo(who_broke_build()); ?>
(Obviously if you know some other language/system better, use that!)
The final benefit I can think of is that, if you serve the files through a web server, you can let people display it locally, say as their browser's home page.
Thanks. I found jqS5 which did most of what you mentioned.
It requires 1 document where every h2 becomes a new slide.
I can then use the meta refresh to load the next page every 10 seconds. When I reach the end of the slides, I pull information from an RSS feed aggregated from all the different systems.
http://staticfree.info/projects/jqs5/

Resources