I would like to save a web page programmatically.
I don't mean merely saving the HTML; I would also like to automatically store all associated files (images, CSS files, maybe embedded SWF, etc.), and ideally rewrite the links for local browsing.
The intended usage is a personal bookmarks application, in which link content is cached in case the original copy is taken down.
Take a look at wget, specifically the -p flag:
-p, --page-requisites
    This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
The following command:
wget -p http://<site>/1.html
will download 1.html and all the files it requires.
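Since the question also asks for the links to be rewritten for local browsing, it is worth adding -k (--convert-links), and optionally -E (--adjust-extension; --html-extension in older wget versions), e.g. with the same placeholder URL:
wget -p -k -E http://<site>/1.html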
On Windows, you can run IE as a COM object and pull everything out (see the sketch below).
On other platforms, you can build on the Mozilla source.
In Java, there is Lobo.
Or use commons-httpclient and write a lot of code yourself.
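As a rough sketch of the IE-as-COM-object route in Ruby (win32ole ships with Ruby on Windows; the URL is a placeholder, and pulling out the page requisites would still mean walking the DOM for img/link/script nodes afterwards):
require 'win32ole'

ie = WIN32OLE.new('InternetExplorer.Application')   # drive a hidden IE instance via COM
ie.Visible = false
ie.Navigate('http://example.com/')                  # placeholder URL
sleep 0.1 while ie.Busy || ie.ReadyState != 4       # 4 == READYSTATE_COMPLETE
html = ie.Document.documentElement.outerHTML        # the page source as IE rendered it
ie.Quit
File.write('page.html', html)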
You could try the MHTML format (which is what IE uses). http://en.wikipedia.org/wiki/MHTML
In other words, you download each object (image, CSS file, etc.) to your computer and then "embed" it, via Base64, into a single file.
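A rough Ruby sketch of the embedding idea (this builds a data: URI inside a plain HTML file rather than a true MHTML container; the URL and file names are placeholders):
require 'open-uri'
require 'base64'

img_url = 'http://example.com/logo.png'   # placeholder
data    = URI.open(img_url).read          # plain open(img_url) on older Rubies
encoded = Base64.strict_encode64(data)

# Inline the image as a data: URI so the saved page needs no separate image file.
html = %(<img src="data:image/png;base64,#{encoded}">)
File.write('embedded.html', html)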
Related
I would like to download all images in full quality from this blog: http://w899c8kcu.homepage.t-online.de/Blog.
I have access to the server, but I cannot find the directory where the images are stored. When I use Firebug on the first picture, it shows me http://w899c8kcu.homepage.t-online.de/Blog;session=f0577255d9df9185d3abe04af0ce922d&focus=CMTOI_de_dtag_hosting_hpcreator_widget_PictureGallery_15716702&path=image.action&frame=CMTOI_de_dtag_hosting_hpcreator_widget_PictureGallery_15716702?id=34877331&width=1000&height=2000&crop=false.
How can I find the file paths like /dirname/image.jpg?
According to its HTML output the page obviously uses the CM4all content management system (CMS).
I don't know precisely how this CMS works, but CMSs generally either save the files under cryptic names in a folder specified in the CMS's configuration, or not in the file system at all but in a database.
Also, the CMS may only save compressed or resized versions of the original files.
So, if you don't want to or are not able to dig into the server-side script code to find out if and where the images are saved, you should contact the company behind CM4all about this.
I need to download a PDF, using Ruby, from a website that does not provide a link ending in .pdf. When I click the link to download the PDF manually, it takes me to a new page, and the save/open dialog appears after some time.
Please help me download the file.
The link
You can do this:
require 'open-uri'
File.open('my_file_name.pdf', "wb") do |file|
file.write open('http://someurl.com/2013-1-2/somefile/download').read
end
I have been doing this for my projects and it works.
If you just need a simple Ruby script to do it, I'd just shell out to wget, like this: exec 'wget "http://path.to.the.file/and/some/params"'
At that point, though, you might as well run wget directly.
The other way is to just run a GET on the URL that you know the PDF is at:
require 'net/http'
source = Net::HTTP.get('the.website.com', '/and/some/params')
There are a number of other HTTP clients you could use, but as long as you make a GET request to the endpoint the PDF is at, it should give you the raw data. Then just write that data to a file with a .pdf name and you'll have the PDF.
In your case, I ran the following commands to get the PDF:
wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf
Then open the PDF. Note that these are Linux commands. If you want to get the file with a Ruby script, you could use something like what I previously mentioned.
Update:
There is an added complication that was not initially stated: the URL to the PDF changes every time the PDF is updated. To make this work, you probably want to do some web scraping; I suggest Nokogiri. That way you can look at the page where the download link is and then perform a GET request on the desired URL. Furthermore, the server that hosts the PDF is misconfigured and breaks Chrome within a few seconds of opening the page.
How to solve this problem: I went to the site and refreshed it, then broke the connection to the server (press the X where the refresh button would otherwise be). Then right-click next to the download link and select Inspect Element, and browse the DOM for something that definitively identifies the link (like an id). Thankfully, I found <strong id="telecharger"> Download</strong>. This means you can use something like page.css('strong#telecharger')[0].parent['href'], which should give you a URL. Then you can perform a GET request as described above. I don't have time to write the script for you (too much work to do), but this should be enough to solve the problem.
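Putting that together, a minimal Nokogiri sketch of the approach (the page URL here is an assumption, since only the changing download URL is quoted above, and the selector comes from the Inspect Element step):
require 'open-uri'
require 'nokogiri'

page_url = 'http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/'  # assumed listing page
page = Nokogiri::HTML(URI.open(page_url))

# Walk from the <strong id="telecharger"> label up to the surrounding <a href="...">.
href = page.css('strong#telecharger')[0].parent['href']
pdf_url = URI.join(page_url, href).to_s   # resolves a relative href too

File.open('thefile.pdf', 'wb') do |file|
  file.write URI.open(pdf_url).read
end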
I am downloading a page using wget with the -p option (page requisites), which downloads all the files that are necessary to properly display a given HTML page, including such things as inlined images, sounds, and referenced stylesheets. It seems that an image belonging to a different domain (e.g. www.google.com) will not be downloaded. Is there a way to have it downloaded as well?
You could use the -H (--span-hosts) option, which enables going to foreign hosts when retrieving recursively.
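For example, combined with the page-requisites and link-conversion flags from the first answer (same placeholder URL):
wget -p -H -k http://<site>/1.html
Be aware that -H can pull in a lot; you can restrict it with -D and a comma-separated list of allowed domains.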
I am currently writing an application working with specially prepared image data. Another tool prepares the images (basically PNGs with additional data stored in the meta-data section). Now my tool works with these files, but not with all PNGs, so "we" decided to use a different file extension. So far, so good.
Now, because I am a lazy sack I implemented some file type registration to allow double-clicking on the file and opening it in my application (no problem at all).
And here is my question:
It would be cool if the windows explorer could still show me the thumbnail previews for my files. Since they basically are still PNG files, it should be possible without writing my own shell extension (at least I believe so).
I quickly tried to copy all registry keys and values from HKCR\.png to HKCR\.mInDat (my file extension) and it worked. However, I would prefer knowing what I am doing ;-)
Which of the registry settings are responsible for the thumbnail preview control and which can I use to get the preview for my file types?
I tried to google it, but I failed, since it seems I am unable to come up with the right buzz-words to find the info I need. Please, help me.
Thank you!
Yours,
3of4
Simple:
[HKEY_CLASSES_ROOT\.apng]
@="apng"
"Content Type"="image/png"
"PerceivedType"="image"

[HKEY_CLASSES_ROOT\apng\shellex\{BB2E617C-0920-11d1-9A0B-00C04FC2D6C1}]
@="{3F30C968-480A-4C6C-862D-EFC0897BB84B}"
I'm trying to load an image from the Firefox cache as the title suggests. I'm running Ubuntu, so the location of my cache is /home/me/.mozilla/firefox/xxxxxx.default/Cache
However, in the Cache (and this is on Mac, too) the filenames are just ridiculous combinations of letters and numbers. Is there a way to pinpoint a certain file?
You should take a look at the source code of the CacheViewer Add-on.
Download the file instead of installing it (right-click and save as) and then extract it (it's just a Zip file, even though it has a .xpi extension). Then extract the cacheviewer.jar file inside the resulting chrome folder, and finally go into content and then cacheviewer to find the JavaScript and XUL files.
From my brief investigation, the useful routines are in the cacheviewer.js file, though if you were hoping for a simple JavaScript one-liner for accessing cached items, you're probably going to be disappointed. The XUL files (which are just XML) are helpful in working out which JS functions are called to perform particular tasks. I'm not too sure how all this maps onto Greasemonkey, rather than the extension environment, but hopefully there's enough code to get you started.
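In shell terms, the unpacking steps above amount to something like this (file names as described above; the exact layout may differ between add-on versions):
unzip cacheviewer.xpi -d cacheviewer-src                         # an .xpi is just a zip archive
unzip cacheviewer-src/chrome/cacheviewer.jar -d cacheviewer-jar  # the .jar is a zip too
ls cacheviewer-jar/content/cacheviewer                           # the JS and XUL files live here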
Ummm, that really is an internal implementation detail. But I suggest looking at how about:cache?device=disk and about:cache-entry?client=HTTP&sb=1&key=https://stackoverflow.com/Content/img/wmd/blockquote.png are implemented.
The article at http://www.securityfocus.com/infocus/1832 gives details, too. Note that Firefox doesn't use a separate file for everything...
And of course, Firefox may change the format at any time.
Just give your img src= attribute the full URL. If the image happens to be cacheable (the server sends an appropriate Expires: or Cache-Control: header, for example) and it's already in the cache, Firefox will not hit the network.
HTTP caching is supposed to be invisible. When you're generating content, you generally shouldn't worry about it.
You can point REDbot at a URL to see all sorts of delicious information about its cacheability.
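If you'd rather check from code, a quick Ruby sketch along the same lines (the URL is just the blockquote.png example mentioned above; modern Ruby handles the https URI directly):
require 'net/http'

# Print the headers Firefox will use to decide whether its cached copy can be reused.
uri = URI('https://stackoverflow.com/Content/img/wmd/blockquote.png')
response = Net::HTTP.get_response(uri)

puts response['Cache-Control']   # e.g. "max-age=604800" allows reuse for a week
puts response['Expires']         # absolute expiry date, if the server sends one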