I was thinking about the applications of web scraping (I'm still quite new to it) and came up with a question. Can you get an image from a page that also has advertisements on it? In other words, can you skip the ads and pick up only the image content you actually want? Also, if the image is itself a link to another page, can you follow that link, grab the image there, and keep going until you either reach a certain count or collect all of the images? That would mean avoiding the advertisement pages along the way.
Absolutely. If you use a tool like kimonolabs.com, this can be relatively easy. You click the data that you want on the page, so instead of grabbing all images, advertisements included, Kimono uses the CSS selectors of the data you clicked to know which data to scrape.
You can use Kimono to scrape data within links as well; it's actually a very common use case. Here's a breakdown of that strategy: https://help.kimonolabs.com/hc/en-us/articles/203438300-Source-URLs-to-crawl-from-another-kimono-API
This might be a helpful solution for you, especially if you're not a programmer because it doesn't require coding experience. It's a pretty powerful tool.
If you're OK with PHP programming, then have a look at PHP Simple HTML DOM Parser. I have used it a lot and have scraped a number of websites with it.
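If you'd rather see the same idea in code, here is a minimal sketch in Python with requests and BeautifulSoup (both assumed to be installed). The starting URL and the `#gallery` selector are invented for illustration; the point is that scoping a CSS selector to the page's real content container is what lets you skip the ad markup, which is essentially what Kimono records for you when you click elements on the page.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_images(start_url, max_images=20):
    """Collect image URLs from the content area, following image links."""
    collected, seen, url = [], set(), start_url
    while url and url not in seen and len(collected) < max_images:
        seen.add(url)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # "#gallery img" is a placeholder selector: by scoping it to the
        # content container, ads elsewhere on the page are never touched.
        for img in soup.select("#gallery img"):
            collected.append(urljoin(url, img["src"]))
        # If the image is wrapped in a link, follow it to the next page;
        # links outside the container (e.g., ads) are ignored.
        next_link = soup.select_one("#gallery a[href]")
        url = urljoin(url, next_link["href"]) if next_link else None
    return collected

print(scrape_images("http://example.com/gallery"))  # placeholder URL
```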
I'm working on a SharePoint project. I have about 1,000 images I want to upload, and I need a web part (or something similar) that swaps images on click. Is there a web part that does this?
What is the best method to use in my situation?
Are you a SharePoint developer? If not, I'd strongly suggest not even trying to do this. Modifying SharePoint beyond its out-of-the-box options requires fairly extensive ASP.NET and SharePoint-specific development skills. Even then, it's not a joy to work with.
In the past, for modifying UI interactions, I have found the saner approach is to manipulate the DOM post-render: load up jQuery and, once the page has rendered, do your thing.
Let me start by describing the problem itself; rather than a problem, it's really that I'm looking for a better solution. I have an ASP.NET page with a bunch of images, each with a link underneath it. Each image is in fact the latest rendering of the page behind the link underneath it.
I have scheduled a batch script that runs every hour to fetch the images through IECapt, a web page rendering capture utility. Two things annoy me about this utility: it takes a lot of time for the 20 images I have, and for a few of them it misses the actual screenshot of the website because of the Flash content.
Now I would like to know whether this rendering can be done with traditional programming; I'm not interested in using any utilities, but I am interested in trying this myself. The solution does not necessarily need to be C#-based; I'm ready to try any other language, because it gives me a chance to learn.
Thank you.
You should probably look at moz-headless-screenshot.
You should be able to embed the functionality you need.
http://blog.mozilla.com/ted/2010/07/29/moz-headless-screenshot/
He also provided a sample embedding client application called moz-headless-screenshot: a simple command-line tool that takes a URL, image size, and output filename and generates a PNG screenshot of the webpage.
You should look into Browsershots:
http://browsershots.org/
They do what you want to do for lots of different browsers. It is even open source.
There's no truly simple solution for what you're asking to do, because rendering HTML, CSS, and Flash is actually a very sophisticated process.
If you're up for quite a bit of coding, you can use the Gecko engine (which powers Firefox) or another open-source browser core (e.g., Dillo) to render the page onto a custom canvas, then save that canvas to a file. Unless you implement support for browser plug-ins, though, you won't get Flash this way. You could try using Gnash or the like. Good luck with that.
I don't know of an open-source project that already does this. It would be neat, though :-). If you write something, please push it to the world; it would be really cool to have a "get a screencap of this URL" tool.
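For what it's worth, embedding a browser core is exactly the heavy lifting that headless-browser tooling wraps up nowadays. Here is a minimal sketch with Python and Selenium, assuming Chrome and a matching ChromeDriver are installed; the URL and output filename are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Drive a real browser engine without a visible window and save a PNG.
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1280,1024")
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.com/")        # placeholder URL
    driver.save_screenshot("example.png")    # placeholder filename
finally:
    driver.quit()
```

Run from a scheduled job, a loop over the 20 URLs would stand in for the IECapt batch script, and because a full browser engine does the rendering, dynamic content is generally captured the way a user would see it.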
One way is to use IRobotSoft web scraper. You can design a robot to go to the URL every hour, and capture the whole web page as an image via a function CapturePage(imagefile).
I am not sure if it will be better than IECapt though.
We have used the ACA WebThumb ActiveX Control (http://www.acasystems.com/en/web-thumb-activex/) quite successfully to capture part or all of a web page on the web server and then write it to a file, just by passing in the URL. It performs fast enough for our needs.
I am not familiar with IECapt, but this might be something you want to have a look at.
This may be a dumb question, but I really have no idea and I'm utterly curious! So please bear with me.
What I know is that search engines just read the HTML and the words on a site. They usually ignore CSS, or at least part of it. And supposedly they cannot read images. Or can they?
If they really cannot read those things, or simply ignore them, then my question is: how do they make a screenshot, which shows the page exactly as the CSS presents it, images and all?
If they do not read CSS or images, and they also do not open the page on a screen the way a human being does, how do they make the screenshot?
Thanks!
Are you referring to Google's new screenshot feature, or their old cache feature? Your question is talking about screenshots and doesn't mention the cache at all, but your comments on your question seem to imply that you're referring to the cache, not the screenshots.
In the case of the screenshots:
You are correct that search engines usually read only the HTML and text on a website, because that's all they need. But that doesn't mean they can't read anything else.
When they want to take a screenshot of a site, they'll just do exactly what a normal browser does when a user visits it: download the HTML, the CSS, the images, and everything else, and render it all with a browser rendering engine, such as WebKit.
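As a rough illustration of that flow (not Google's actual pipeline), here is how you might drive a real WebKit engine to do the same thing, using Playwright for Python; the URL, viewport size, and output path are placeholders:

```python
from playwright.sync_api import sync_playwright

# Download the page plus its CSS and images, render it with WebKit,
# and capture the result as an image, just as a browser would show it.
with sync_playwright() as p:
    browser = p.webkit.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 1024})
    page.goto("http://example.com/")                     # placeholder URL
    page.screenshot(path="screenshot.png", full_page=True)
    browser.close()
```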
In the case of the cache:
The search engine usually just stores the HTML without parsing it (or before parsing it). It sends the saved HTML to your browser, and your browser pulls all the other stuff on the page (images, etc.) from the original website. The search engine isn't reading anything; it's just saving the page verbatim (well, with minor changes, namely URL rewriting) and giving it to your browser.
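That URL rewriting mostly amounts to making the page's relative references absolute, so that your browser fetches images and stylesheets from the original site rather than from the cache server. A toy sketch of the idea in Python with BeautifulSoup (the handled tags are simplified; a real cache deals with many more cases):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def rewrite_for_cache(html, original_url):
    """Point relative image/stylesheet/link URLs back at the original site."""
    soup = BeautifulSoup(html, "html.parser")
    for tag, attr in (("img", "src"), ("script", "src"),
                      ("link", "href"), ("a", "href")):
        for el in soup.find_all(tag):
            if el.get(attr):
                el[attr] = urljoin(original_url, el[attr])
    return str(soup)
```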
There are apps that take screenshots of pages as if they were displayed in a chosen browser.
Browsershots is an example of an online service that does this.
Here are some links to webpage thumbnail generator projects:
Build your own website thumbnail generator with Django (Python)
Zubrag Website Thumb Generator (PHP)
Maybe I'm not understanding your question, but...
You seem to be using "read an image" to mean loading the image's data into the search engine. This the search engine does do (and for CSS as well). When people say search engines ignore images, they mean the engines don't treat them as meaningful, searchable data. In other words, if I make an image with the word "Hello" on it, you and I "read" it in the sense that we see and understand that the image contains a word. A search engine typically will not attempt to do this; the search engine will, however, "read" the image into its storage if it wants the ability to present it to a user at a later time.
Search engines don't use the CSS and image content for indexing, but they can store them on their servers to make a cached version of the site.
In the case of Google, I think they store only text files: HTML, CSS, maybe JavaScript, but no images.
What Firefox add-ons do you use that are useful for programmers?
I guess it's silly to mention Firebug; I doubt any of us could live without it. Other than that, I use the following (listing only the dev-related ones):
Console2: next-generation error console
DOM inspector: as the title might indicate, allows you to browse the DOM
Edit Cookies: change cookies on the fly
Execute JS: ad-hoc JavaScript execution
IE Tab: render a page in IE
Inspect This: brings the selected object into the DOM inspector
JSView: display linked JavaScript and CSS
LORI (Life of Request Info): shows how long it takes to render a page
Measure IT: a popup ruler.
URL Params: shows GET and POST variables
Web Developer: a myriad of tools for the web developer
Here are mine (developer centric):
FireBug - a myriad of productivity-enhancing tools: it includes a JavaScript debugger and a DOM inspector, and it lets you edit the CSS/HTML on the fly, which is highly valuable for troubleshooting layout and display problems.
Web Developer - again, another great developer productivity tool. I mostly use it for quickly validating pages, disabling JavaScript (yes, I disable JavaScript sometimes, don't you?), viewing cookies, etc.
Tamper Data - lets you tamper with HTTP headers, form values, cookies, etc. before posting back to a page or getting a page. Incredibly valuable for poking and prodding your pages, and for seeing how your web app responds when used with slightly malicious intent.
JavaScript Debugger - has a few more features than the JavaScript debugger provided by Firebug, although I must admit I use this one sparingly, since Firebug has largely won me over.
Live HTTP Headers - invaluable for troubleshooting; I use it frequently. Lets you spy on all the HTTP headers exchanged between client and server. It has helped me track down nefarious problems, especially when debugging issues that appear as a web app moves between environments.
Header Spy - nice addon for the geeky types, shows you the web server and platform a web site runs on in the status bar.
MeasureIt - I don't use this all too frequently, but I've still found it valuable from time to time.
ColorZilla - again, not something I use all that frequently, but when I need it, I need it. Valuable when you want to know a color and you don't want to dig through a CSS file, or open up a graphics editing app to get a color embedded in some image.
Add N Edit Cookies - this has been a great debugging tool in web farms where the load balancer writes a cookie and uses its value to keep your session "sticky". It allowed me to switch at will between servers to track down problems on a specific machine. Also a good tool if you want to mess with a site that uses cookies to track your login status/account, and you want to see how your code responds to malformed or hacked info.
Yellowpipe Lynx Viewer Tool - yeah, I know what you're thinking: Lynx, who needs it, it's so 1994. But if you are developing a site that needs to take web accessibility into account (meaning accessible to users with visual impairments who use screen readers), or if you need to get a sense of how a web spider/indexer "sees" your site, this tool is invaluable. Granted, you could always just go out and grab Lynx for yourself; here's the Windows XP port that I use.
I've got a handful of other addons that I've used from time to time that I'll just quickly mention: FireFTP (the version I installed wasn't stable, and I've not tried a newer release), Html Validator (I also found this one unstable, at least back when I installed it about a year ago), and IE Tab (I usually just have both IE and Firefox open concurrently, but that's just me; I know many others find this addon useful).
I'd also recommend the Web Developer extension by Chris Pederick.
For web development, especially for JavaScript, I find Firebug to be invaluable. The Web Developer toolbar is also very useful.
The ones I have are...
YSlow
Live Headers
Firebug
DOM Inspector
One that wasn't mentioned yet is this HTML Validator extension that I found very useful.
@Flávio Amieiro
MeasureIt is an unnecessary extension if you install the Web Developer Toolbar, which includes a ruler as one of its features. Under Web Developer's "Miscellaneous" category, click "Display Ruler" to use a ruler identical to MeasureIt's.
That will allow you to reduce the number of extensions you need by at least one.
Firefox addons:
FireBug: helps web developers and designers test and inspect front-end code. It provides many useful features, such as a console panel for logging information, a DOM inspector, detailed information about page elements, and much, much more.
Web Developer: gives you the power to disable CSS, edit CSS on the fly, measure certain areas of a page, and much more.
ColorZilla
Just click on the icon, hover over the area you'd like to know the hex color for, and click.
Window Resizer
lets you make sure the layout displays properly at today's standard resolutions.
Total Validator
makes validating websites much easier by checking HTML, links, CSS, and a lot more.
Web Developer for web development, and ScribeFire if you're a blogger-programmer.
For web developing I use the Web Developer Toolbar, CSS Viewer and MeasureIt.
But I'm really not one of those people who has a thousand extensions to do everything. I like to keep things simple.
EDIT: Thanks to Dan's answer, I don't need MeasureIt anymore. Can't believe I've never seen that! I guess I'll just have to pay more attention to the Web Developer toolbar.
Adding to everyone's lists: Tamper Data is quite useful; it lets you intercept requests and change the data in them.
It can be used to bypass javascript validation and check whether the server side is doing its thing.
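You can run the same server-side check without any extension at all. A quick sketch in Python with requests (the URL and form fields are invented for the example): submit values the page's JavaScript would have blocked and see whether the server rejects them too.

```python
import requests

# Values the page's client-side JavaScript validation would normally refuse.
payload = {
    "email": "not-an-email",   # fails the client-side format check
    "quantity": "-5",          # fails the client-side range check
}
# Placeholder URL; point this at the form's actual action target.
resp = requests.post("http://example.com/order", data=payload)

# A well-behaved server re-validates and rejects the bad input;
# a 2xx response here would suggest the server trusts the client.
print(resp.status_code, resp.reason)
```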
I use Web Developer, it's a real time saver.
+1 for LORI ("life-of-request-info"). It's a very convenient alternative for rough measurements of the load time of a particular web page -- the kind of thing that you might otherwise use an external stopwatch for.
New Tab Homepage. Combined with a "speed dial"-type homepage (a personal, fast-loading page of links that you use frequently), helps you get where you're going faster when you open a new browser tab.
LastTab. Changes the behavior of Ctrl+Tab to let you navigate back and forth between your most-recently-used tabs with repeated presses of Ctrl+Tab, the same way that Alt+Tab works in Windows. Also provides a nice view of all open tabs while Ctrl is still being held down for easy navigation. (The resultant behavior is very similar to the Ctrl+Tab behavior in recent releases of Visual Studio.)
FireFTP is good for grabbing/uploading any necessary files.
I find Hackbar to be quite useful. It's very handy if you want to edit the query-string part of the URL to test for vulnerabilities, or for other general testing where you end up with complicated query-string values.
I was learning the DOM Inspector, but I've switched to Firebug.
Some that have been missed above are listed here:
Load Time Analyzer – View detailed graphs of the loading time of web pages in Firefox. The graphs display events like page requests, image loading times, etc.
Poster – A must-have tool for web developers, enabling them to interact with web services and other web resources.
Aardvark – A cool extension for web developers and designers that lets them view CSS attributes, IDs, and classes by highlighting page elements individually.
Fiddler is a really great debugging proxy. Think of it as a more powerful version of the "Net" panel in Firebug or of Live HTTP Headers.
It used to be an IE-only extension, now it also has hooks into Firefox.
Groundspeed is useful for testing server-side code. It was created for input-validation tests during pentests, but it can be useful for any test that requires manipulating input (similar to Tamper Data).
It lets you manipulate the form elements on the page: you can change their type and other attributes (size, length, JavaScript event handlers, etc.). So, for example, you can change a hidden field or a select into a textbox and then enter any value to test the server's response, and things like that.