wget fetch image from different domain - download

I am downloading a page using wget with the -p option (page requisites), which downloads all the files that are necessary to properly display a given HTML page, including inlined images, sounds, and referenced stylesheets. It seems that an image belonging to a different domain (e.g. www.google.com) is not downloaded. Is there a way to have it downloaded as well?

You could use the -H option (--span-hosts: go to foreign hosts when recursive).
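For example, combining -p with -H pulls in page requisites from other hosts as well (the URL and domain list below are placeholders; -D restricts the spanning to the listed domains and -k rewrites the links for local viewing):
wget -p -H -k -D example.com,cdn.example.com http://example.com/page.html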

Related

Is there a fast way of gathering image files from external sources (Chrome Developer Tab)

Let's say I visit a website named abc.xyz.
When I get to the website, I see the website runs a javascript script to create an interactive book. Obviously, the book must have image files for each page.
Now let's say I go to the developer tools tab and go to the sources tab to find the images - sure enough, I find them. However, the images come from a folder and domain named xyz.abc that displays a 403 error when accessed.
Is there a faster way of gathering these image files than visiting the link for every single image and saving each one individually (bear in mind the images themselves are not restricted access)?
Real World Example:
Image showing files under the sources tab.
In the image above, you can see there are several image files located in a folder (hundreds, in fact). The domain and folder the images reside in display 403 errors when accessed, however the images themselves are not restricted. To download the images, you can individually get the link to each image and use "Save image as". However, this will be time-consuming for hundreds of images - is there a faster way to download all the images?
Edit: Furthermore, would there be a way to quickly order the images by a pre-existing page number on the PDF file?
To get the images of a web page, you can use a Python script to fetch all the image src attributes, which you can then use however you need, such as copying the files to your system or into your website.
I've used BeautifulSoup for web scraping:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://stackoverflow.com/questions/63939080/is-there-a-fast-way-of-gathering-image-files-from-external-sources-chrome-devel")
soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())
for element in soup.find_all("img"):
    try:
        print(element['src'])
    except Exception as e:
        pass
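If you want to save the image files rather than just print their URLs, a minimal sketch along the same lines might look like this (the page URL is a placeholder, and relative src values are resolved against it with urljoin):
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

page_url = "https://abc.xyz/book"  # placeholder for the page that embeds the images
page = requests.get(page_url)
soup = BeautifulSoup(page.content, 'html.parser')

os.makedirs("images", exist_ok=True)
for element in soup.find_all("img"):
    src = element.get("src")
    if not src:
        continue
    full_url = urljoin(page_url, src)  # resolve relative src values against the page URL
    filename = os.path.basename(urlparse(full_url).path) or "image"
    response = requests.get(full_url)
    with open(os.path.join("images", filename), "wb") as f:
        f.write(response.content)  # write the raw image bytes to disk
Note that this only sees images that are present in the page's static HTML; images inserted by JavaScript at runtime will not show up there, in which case you would have to work from the image URLs you found in the developer tools instead.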

How to get full image paths from web page using Firebug?

I would like to download all images in full quality from this blog: http://w899c8kcu.homepage.t-online.de/Blog.
I have access to server, but I can not find the directory where the images lie. When I use Firebug on the first picture, it shows me http://w899c8kcu.homepage.t-online.de/Blog;session=f0577255d9df9185d3abe04af0ce922d&focus=CMTOI_de_dtag_hosting_hpcreator_widget_PictureGallery_15716702&path=image.action&frame=CMTOI_de_dtag_hosting_hpcreator_widget_PictureGallery_15716702?id=34877331&width=1000&height=2000&crop=false.
How can I find the file paths like /dirname/image.jpg?
According to its HTML output, the page obviously uses the CM4all content management system (CMS).
I don't know precisely how this CMS works, but CMSs generally either save the files under cryptic names within a folder specified in the CMS's configuration, or not in the file system at all but in a database.
Also, a CMS may only save compressed or resized versions of the original files.
So, if you don't want to or are not able to dig into the server-side script code to find out if and where the images are saved, you should contact the company behind CM4all about this.

How to download pdf file in ruby without .pdf in the link

I need to download a PDF from a website which does not provide a link ending in .pdf, using Ruby. Manually, when I click on the link to download the PDF, it takes me to a new page and the dialog box to save/open the file appears after some time.
Please help me in downloading the file.
The link
You can do this:
require 'open-uri'

File.open('my_file_name.pdf', "wb") do |file|
  file.write open('http://someurl.com/2013-1-2/somefile/download').read
end
I have been doing this for my projects and it works.
If you just need a simple Ruby script to do it, I'd just run wget, like this: exec 'wget "http://path.to.the.file/and/some/params"'
At that point, though, you might as well run wget directly.
The other way is to just run a GET on the page that you know the PDF is at:
require 'net/http'
source = Net::HTTP.get('the.website.com', '/and/some/params')
There are a number of other HTTP clients that you could use, but as long as you make a GET request to the endpoint that the PDF is at, it should give you the raw data. Then you can just write that data out to a file and you'll have the PDF.
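For example, continuing from the snippet above (the filename is arbitrary):
File.binwrite('my_file_name.pdf', source)  # save the raw response body as the PDF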
In your case, I ran the following commands to get the pdf
wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf
Then open the pdf. Note that these are linux commands. If you want to get the file with a ruby script, you could use something like what I previously mentioned.
Update:
There is an added complication that was not initially stated: the URL of the PDF changes every time the PDF is updated. To make this work, you probably want to do some web scraping; I suggest nokogiri. That way you can look at the page where the download link is and then perform a GET request on the desired URL. Furthermore, the server that hosts the PDF is misconfigured and breaks Chrome within a few seconds of opening the page.
How to solve this problem: I went to the site and refreshed it, then broke the connection to the server (press the X where there would otherwise be a refresh button). Then right-click next to the download link and select "Inspect element", and browse the DOM to find something that is definitively identifying (like an id). Thankfully, I found something: <strong id="telecharger"> Download</strong>. This means you can use something like page.css('strong#telecharger')[0].parent['href']. This should give you a URL; then you can perform a GET request as described above. I don't have time to write the script for you (too much work to do), but this should be enough to solve the problem.
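A minimal sketch of that scraping step with nokogiri might look like the following (the listing URL is illustrative, and if the extracted href turns out to be relative it would need to be joined with the page URL first):
require 'open-uri'
require 'nokogiri'

# Page that contains the download link (illustrative URL)
page_url = 'http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/'
page = Nokogiri::HTML(URI.open(page_url))

# <strong id="telecharger"> sits inside the download anchor, so its parent carries the href
pdf_url = page.css('strong#telecharger')[0].parent['href']

File.open('constitution.pdf', 'wb') do |file|
  file.write URI.open(pdf_url).read
end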

Lynx - how to delay the download process before dumping a website's content

I want to save the whole content of this specific page using lynx:
http://build.chromium.org/f/chromium/perf/dashboard/ui/changelog.html?url=%2Ftrunk%2Fsrc&range=41818%3A40345&mode=html
I used these commands
webpage="http://build.chromium.org/f/chromium/perf/dashboard/ui/changelog.html?url=%2Ftrunk%2Fsrc&range=41818%3A40345&mode=html"
lynx -crawl -dump "$webpage" > output
My output was only like this:
SVN path: ____________________ SVN revision range: ____________________
I expected it to contain all the information about bugs and comments.
The URL includes the "/trunk/src" and "41818:40345" values, which should be filled into the SVN path and SVN revision range fields and submitted to get the content, but that did not happen.
Question: Do you have any idea how to "tell" lynx to wait a bit while the website is rendering its content, until it is complete?
Thanks in advance.
The problem here is that the webpage is being built by a javascript function. Such pages can be tricky to download with tools like lynx (or curl, which IMHO is better at the basic download problem). In order to download the contents you see on that page, you'd need to first load the javascript files needed by the page, and then execute the javascript "as though you were a browser". That javascript will proceed to request some data, which turns out to be XML, and then builds HTML from that data.
Note that the "website" doesn't render its data. Your browser renders the data. Or, to be more accurate, your browser is expected to render it but lynx won't because it doesn't do javascript.
So you have a couple of options. You could try to find a scriptable javascript-aware browser (iirc links does javascript, but I don't know offhand how to script it to do what you want.)
Or you can cheat. By using Chrom{e,ium}'s "developer" tools, you can see what URL is being requested by the javascript. It turns out, in this case, to be
http://build.chromium.org/cgi-bin/svn-log?url=http://src.chromium.org/svn//trunk/src&range=41818:40345
so you could get it with curl as follows
curl -G \
-d url=http://src.chromium.org/svn//trunk/src \
-d range=41818:40345 \
http://build.chromium.org/cgi-bin/svn-log \
> 41818-40345.xml
That XML data is in a pretty straightforward (i.e. apparently easy to reverse-engineer) format. And then you could use a simple scriptable xml tool like xmlstarlet (or any XSLT tool) to take the xml apart and reformat as you wish. With luck, you might even find some documentation (or a DTD) somewhere for the xml.
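For instance, if the XML follows svn's usual log layout, with <logentry> elements that carry a revision attribute and contain <msg> children (an assumption; check the element names in the file you actually downloaded), xmlstarlet could flatten it like this:
xmlstarlet sel -t -m '//logentry' -v "concat(@revision, ': ', msg)" -n 41818-40345.xml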
At least, that's how I would proceed.

How do I save a web page, programmatically?

I would like to save a web page programmatically.
I don't mean merely save the HTML. I would also like automatically to store all associated files (images, CSS files, maybe embedded SWF, etc), and hopefully rewrite the links for local browsing.
The intended usage is a personal bookmarks application, in which link content is cached in case the original copy is taken down.
Take a look at wget, specifically the -p flag
-p, --page-requisites
    This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
The following command:
wget -p http://<site>/1.html
will download 1.html and all the files it requires.
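Since you also want the links rewritten for local browsing and assets pulled from other hosts, the -k (--convert-links), -E (--adjust-extension) and -H (--span-hosts) flags can be added, for example:
wget -p -k -E -H http://<site>/1.html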
On Windows, you can run IE as a COM object and pull everything out.
Alternatively, you can take the source of Mozilla.
In Java, there is Lobo.
Or use commons-httpclient and write a lot of code.
You could try the MHTML format (which is what IE uses). http://en.wikipedia.org/wiki/MHTML
In other words, you'd be downloading each object (image, css, etc.) to your computer, and then "embedding" them, via Base64, into a single file.
