I want to find an automated way to download an entire web page (not the entire site, just a single page) and all elements on the page, then sum the size of these files.
When I say files, I mean the total size of the HTML, CSS, images, local and remote JS files, and any CSS background images. Basically, the entire page weight for a given page.
I thought about using cURL, but was not sure how to make it grab remote and local JS files as well as images referenced in the CSS files.
Try wget:
make it download all required files with the -p or --page-requisites option
download scripts and images local to the site and no further than 2 hops away (this should get local images and code) with -l 2 for --level=2
and change the code files to link to your local copies instead of their original paths with -k for --convert-links:
wget -p -l 2 -k http://full_url/to/page.html
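Once wget has fetched the page into a local directory, summing the page weight is a matter of adding up the file sizes. A minimal sketch, assuming GNU find; the directory name and dummy files below merely stand in for whatever wget actually saved:

```shell
# Simulate the directory that wget -p would create (names are placeholders):
mkdir -p page_mirror
head -c 1024 /dev/zero > page_mirror/index.html   # 1 KiB dummy HTML
head -c 2048 /dev/zero > page_mirror/style.css    # 2 KiB dummy CSS

# Sum the sizes of every downloaded file, in bytes:
total=$(find page_mirror -type f -printf '%s\n' | awk '{s+=$1} END {print s}')
echo "$total bytes"   # 3072 bytes for the dummy files above
```

du -sh page_mirror gives a similar human-readable figure, but it also counts directory blocks, so the find/awk sum is closer to the true page weight.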
Related
A client has a download area where users can download or browse single files. Files are divided into folders (documents, catalogues, newsletters and so on), and their extensions can vary: they can be .pdf, .ai or simple .jpeg. He asked me if I can provide a link to download every item in a specific folder as one big, compressed file. The problem is, I'm on a Windows server, so I'm a bit clueless as to whether there's a way. I can edit the pages of this area, so I can include jQuery and scripts with a little freedom. Any hint?
The built-in Windows archiver is tar, and you need to build a tarball (historically, all related files in one Tape ARchive).
I have a file server which is mapped as S:\ (it does not have the tar command itself, and tar cannot use a URL, but it can use a device:)
For any folder's contents (including subfolders) it is easy to remotely save all current files in a zip with a single command (multiple root locations need a loop or a list).
It will build the tape archive as a Windows .zip when given the -a (auto) switch, but you need to consider the desired level of nesting by collecting all contents at the desired root location.
tar -c -a [other options] -f file.zip [folders/files]
Points to watch out for:
ensure there is not already an older archive with the same name
it will print errors/warnings like the two given during the run; however, it should complete without failing.
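The "loop or a list" needed for multiple root folders can be sketched as follows. This is a portable demo: on Windows 10+ the bundled bsdtar picks the zip format from a .zip extension via -a, while here the archives are named .tar.gz so the same loop also runs under GNU tar; the folder and file names are made up.

```shell
# One archive per top-level folder in the download area:
mkdir -p downloads/documents downloads/catalogues
echo dummy > downloads/documents/a.pdf
echo dummy > downloads/catalogues/b.pdf

for dir in downloads/*/; do
  name=$(basename "$dir")
  # -a chooses the format/compression from the output file's extension
  tar -a -c -f "$name.tar.gz" -C downloads "$name"
done
```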
Once you have the zip file you can post it as a web asset, such as
<a href="\\server\folder\all.zip" download="all.zip">Get All</a>
For other notes see https://stackoverflow.com/a/68728992/10802527
I've tried this on a few forum threads already.
However, I keep getting the same failure as a result.
To replicate the problem :
Here is a URL leading to a forum thread with 6 pages.
http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1/vc/1
What I typed into the console was :
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1"
And here is what I got:
--2018-06-14 10:44:17-- http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/%7B1..6%7D/vc/1
Resolving forex.kbpauk.ru (forex.kbpauk.ru)... 185.68.152.1
Connecting to forex.kbpauk.ru (forex.kbpauk.ru)|185.68.152.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: '1'
1 [ <=> ] 19.50K 58.7KB/s in 0.3s
2018-06-14 10:44:17 (58.7 KB/s) - '1' saved [19970]
The file was saved simply as "1", with no extension.
My expectation was that the file would be saved with an .html extension, because it's a web page.
I'm trying to get wget to work, but if it's possible to do what I want with cURL then I would also accept that as an answer.
Well, there are a couple of issues with what you're trying to do.
The double quotes around your URL actually prevent Bash brace expansion, so you're not really downloading 6 files, but a single URL with "{1..6}" in it. You probably want to drop the quotes around the URL so that bash expands it into 6 different arguments.
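You can check what the shell actually hands to wget without touching the network; with the quotes removed, bash expands the braces into six separate URLs:

```shell
#!/bin/bash
# Unquoted: bash expands {1..6} into six separate arguments
printf '%s\n' http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1 | wc -l

# Quoted: the literal {1..6} survives as a single argument
printf '%s\n' "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1" | wc -l
```

The first count is 6, the second is 1, which is exactly why wget saw only one (malformed) URL.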
I notice that all of the pages are called "1", irrespective of their actual page numbers. This means the server is always serving a page with the same name, making it very hard for Wget or any other tool to actually make a copy of the webpage.
The real way to create a mirror of the forum would be to use this command line:
$ wget -m --no-parent -k --adjust-extension http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1
Let me explain what this command does:
-m --mirror activates the mirror mode (recursion)
--no-parent asks Wget to not go above the directory it starts from
-k --convert-links will edit the HTML pages you download so that the links in them will point to the other local pages you have also downloaded. This allows you to browse the forum pages locally without needing to be online
--adjust-extension This is the option you were originally looking for. It will cause Wget to save the file with a .html extension if it downloads a text/html file but the server did not provide an extension.
Simply use the -O switch to specify the output filename; otherwise wget just defaults to something derived from the URL, in your case 1.
So if you wanted to call your file what-i-want-to-call-it.html then you would do
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1" -O what-i-want-to-call-it.html
If you type wget --help into the console you will get a full list of all the options that wget provides.
To verify it has worked, output the file with:
cat what-i-want-to-call-it.html
Wget has the -H "span host" option
Span to any host—‘-H’
The ‘-H’ option turns on host spanning, thus allowing Wget's recursive run to visit any host referenced by a link. Unless sufficient recursion-limiting criteria are applied (such as depth), these foreign hosts will typically link to yet more hosts, and so on, until Wget ends up sucking up much more data than you have intended.
I want to do a recursive download (say, of level 3), and I want to get images, stylesheets, javascripts, etc. (that is, files necessary to display the page properly) even if they're outside my host. However, I don't want to follow a link to another HTML page (because then it can go to yet another HTML page, and so on, and the number of pages can explode).
Is it possible to do this somehow? It seems like the -H option controls spanning to other hosts for both the images/stylesheets/javascript case and the link case, and wget doesn't allow me to separate the two.
Downloading All Dependencies in a page
The first step is downloading all the resources of a particular page. If you look in the man pages for wget you will find this:
...to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:
wget -E -H -k -K -p http://<site>/<document>
Getting Multiple Pages
Unfortunately, that only works per-page. You can turn on recursion with -r, but then you run into the issue of following external sites and blowing up. If you know the full list of domains that could be used for resources, you can limit it to just those using -D, but that might be hard to do. I recommend using a combination of -np (no parent directories) and -l to limit the depth of the recursion. You might start getting other sites, but it will at least be limited. If you start having issues, you could use --exclude-domains to limit the known problem causers. In the end, I think this is best:
wget -E -H -k -K -p -np -l 1 http://<site>/level
Limiting the domains
To help figure out what domains need to be included/excluded you could use this answer to grep a page or two (you would want to grep the .orig file) and list the links within them. From there you might be able to build a decent list of domains that should be included and limit it using the -D argument. Or you might at least find some domains that you don't want included and limit them using --exclude-domains. Finally, you can use the -Q argument to limit the amount of data downloaded as a safeguard to prevent filling up your disk.
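A rough way to build that domain list: grep the scheme-plus-host pairs out of a saved page. The snippet below uses a canned .orig file with made-up hosts; on a real run you would point it at the pristine backups that wget -K leaves behind.

```shell
# Stand-in for a page saved by wget (hosts are fictitious):
cat > page.html.orig <<'EOF'
<img src="http://cdn.example.com/img/a.png">
<script src="https://ajax.example.org/lib.js"></script>
<a href="http://cdn.example.com/b.html">link</a>
EOF

# Each referenced host, once: candidates for -D or --exclude-domains
grep -oE 'https?://[^/"]+' page.html.orig | sed -E 's~https?://~~' | sort -u
```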
Descriptions of the Arguments
-E
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename.
-H
Enable spanning across hosts when doing recursive retrieving.
-k
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
-K
When converting a file, back up the original version with a .orig suffix.
-p
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
-np
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
-l
Specify recursion maximum depth level depth.
-D
Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on -H.
--exclude-domains
Specify the domains that are not to be followed.
-Q
Specify download quota for automatic retrievals. The value can be specified in bytes (default), kilobytes (with k suffix), or megabytes (with m suffix).
Just run wget -E -H -k -K -p -r http://<site>/ to download a complete site. Don't get nervous if, while the download is running, you open some page and its resources are not available; once wget finishes, it will convert the links!
For downloading all "files necessary to display the page properly" you can use -p or --page-requisites, perhaps together with -Q or --quota.
Try the wget --accept-regex flag; the POSIX --regex-type is compiled into wget as standard, but you can compile in the Perl-compatible regex engine (PCRE) if you need something more elaborate.
For example, the following will get all PNGs on external sites one level deep, plus any other pages that have the word google in the URL:
wget -r -H -k -l 1 --regex-type posix --accept-regex "(.*google.*|.*png)" "http://www.google.com"
It doesn't actually solve the problem of going down multiple levels on external sites; for that you would probably have to write your own spider. But using --accept-regex you can probably get close to what you are looking for in most cases.
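Since --regex-type posix uses ordinary EREs, the pattern itself can be sanity-checked outside wget with grep -E (the URLs below are invented):

```shell
# Which of these URLs would the accept pattern let through?
printf '%s\n' \
  'http://www.google.com/index.html' \
  'http://cdn.example.com/logo.png' \
  'http://other.example.com/page.html' \
  | grep -E '(.*google.*|.*png)'
# prints the first two URLs only
```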
Within a single layer of a domain, you can check all links, both internal and on third-party servers, with the following command.
wget --spider -nd -e robots=off -Hprb --level=1 -o wget-log -nv http://localhost
The limitation here is that it only checks a single layer. This works well with a CMS where you can flatten the site with a GET variable rather than CMS-generated URLs. Otherwise you can use your favorite server-side script to loop this command through directories. For a full explanation of all of the options, check out this GitHub commit:
https://github.com/jonathan-smalls-cc/git-hooks/blob/LAMP/contrib/pre-commit/crawlDomain.sh
I am publishing content from a Drupal CMS to static HTML pages on another domain, hosted on a second server. Building the HTML files was simple (using PHP/MySQL to write the files).
I have a list of images referenced in my HTML, all of which exist below the /userfiles/ directory.
cat *.html | grep -oE "[^'\"]+userfiles[\/.*]*/[^'\"]" | sort | uniq
Which produces a list of files
http://my.server.com/userfiles/Another%20User1.jpg
http://my.server.com/userfiles/image/image%201.jpg
...
My next step is to copy these images across to the second server and translate the tags in the html files.
I understand that sed is probably the tool I would need. E.g.:
sed 's/[^"]\+userfiles[\/image]\?\/\([^"]\+\)/\/images\/\1/g'
should change http://my.server.com/userfiles/Another%20User1.jpg to /images/Another%20User1.jpg, but I cannot work out exactly how I would use the script. I.e. can I use it to update the files in place, or do I need to juggle temporary files, etc.? And how can I ensure that the files are moved to the correct location on the second server?
It's possible to use sed to change the file in-place using the -i option.
For your use, it's up to you whether it's easier/better to create a new file with the changes, then copy it to the 2nd domain using scp (or something similar), or to copy the file first and modify it once it's on the remote server (less management of new filenames that way).
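A minimal sketch of the in-place route, assuming GNU sed (the sample file is a placeholder; -i.bak would keep a backup alongside):

```shell
# A sample page containing one of the userfiles URLs from the question:
cat > sample.html <<'EOF'
<img src="http://my.server.com/userfiles/Another%20User1.jpg">
EOF

# Rewrite the URL to /images/... directly in the file:
sed -i 's/[^"]\+userfiles[\/image]\?\/\([^"]\+\)/\/images\/\1/g' sample.html
cat sample.html   # <img src="/images/Another%20User1.jpg">
```

After the rewrite, something like scp *.html user@second-server:/var/www/ (a hypothetical destination) would move the translated files across.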
How do I export and import images from and into a MediaWiki?
Terminal solutions
A MediaWiki administrator can perform maintenance tasks at the server's terminal using the Maintenance scripts framework. New MediaWiki versions ship all the standard scripts used in the tasks described below, but old versions have some bugs or lack some of the modern scripts: check the version number with grep wgVersion includes/DefaultSettings.php.
Note: all the scripts cited below also have a --help option, for instance php maintenance/importImages.php --help
Original image folder
Users upload files through the Special:Upload page; administrators can configure the allowed file types through an extension whitelist. Once uploaded, files are stored in a folder on the file system, and thumbnails in a dedicated thumb directory.
The MediaWiki images folder can be zipped with the zip -r ~/Mediafiles.zip images command, but this zip is not so good:
there are a lot of spurious files: "deleted files" and "old files" (not the current ones) with filenames like 20160627184943!MyFig.png, and thumbnails like MyFig.png/120px-MyFig.jpg.
for data-interchange or long-term preservation purposes, it is invalid... The ugly images/?/??/* folder layout is not suitable; the usual expectation is "all image files in only one folder".
Images export/import
For "exporting and importing" all current images in one folder at the MediaWiki server's terminal, there is a single step-by-step procedure.
Step-1: generate the image dumps using dumpUploads.php (with the --local or --shared options when needed for preservation), which creates a txt list of all image filenames in use.
mkdir /tmp/workingBackupMediaFiles
php maintenance/dumpUploads.php \
| sed 's~mwstore://local-backend/local-public~./images~' \
| xargs cp -t /tmp/workingBackupMediaFiles
zip -r ~/Mediafiles.zip /tmp/workingBackupMediaFiles
rm -r /tmp/workingBackupMediaFiles
The commands result in a standard zip file of your image backup folder, Mediafiles.zip, in your user root directory (~/).
NOTE: if you are not worried about the ugly folder structure, a more direct way is
php maintenance/dumpUploads.php \
| sed 's~mwstore://local-backend/local-public~./images~' \
| zip ~/Mediafiles.zip -@
Depending on the MediaWiki version, the --base=./ option will work fine and you can remove the sed command from the pipe.
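What the sed in that pipe does to each line of dumpUploads.php output can be seen in isolation (the file path is a made-up example):

```shell
# dumpUploads.php prints mwstore:// storage paths; the sed maps them
# onto the real ./images directory:
echo 'mwstore://local-backend/local-public/a/ab/Example.jpg' \
  | sed 's~mwstore://local-backend/local-public~./images~'
# -> ./images/a/ab/Example.jpg
```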
Step-2: need a backup? Installing a copy of the images? ... You only need Mediafiles.zip and a MediaWiki installation with no contents... If the wiki has contents, check for filename conflicts (!). Another issue is the configuration of file formats and permissions, which must be the same or broader in the new wiki; see Manual:Configuring file uploads.
Step-3: restore the dumps (to the new wiki) with the maintenance tools. Supposing you used step-1 to export and preserve them in a zip file:
unzip ~/Mediafiles.zip -d /tmp/workingBackupMediaFiles
php maintenance/importImages.php /tmp/workingBackupMediaFiles
rm -r /tmp/workingBackupMediaFiles
php maintenance/update.php
php maintenance/rebuildall.php
That is all. Check by navigating in your new wiki's Special:NewFiles.
The full export or preservation
To export "ALL images and ALL articles" of your old MediaWiki, for full backup or content preservation, add some procedures at each step:
Step-1: ... see step-1 above ... and, to generate the text-content dumps from the old wiki:
php maintenance/dumpBackup.php --full | gzip > ~/dumpContent.xml.gz
Note: instead of --full you can use the --current option.
Step-2: ... you need dumpContent.xml.gz and Mediafiles.zip from the old wiki. Suppose both files are in your ~ folder.
Step-3: run in your new Wiki
unzip ~/Mediafiles.zip -d /tmp/workingBackupMediaFiles
gunzip -c ~/dumpContent.xml.gz \
  | php maintenance/importDump.php --no-updates \
    --image-base-path=/tmp/workingBackupMediaFiles
rm -r /tmp/workingBackupMediaFiles
php maintenance/update.php
php maintenance/rebuildall.php
That is all. Check also Special:AllPages of the new Wiki.
There is no automatic way to export images like you export pages; you have to right-click on them and choose "save image". To get the history of the image page, use the Special:Export page.
To import images, use the Special:Upload page on your wiki. If you have lots of them, you can use the Import Images script. Note: you generally have to be in the sysop group to upload images.
- Export ALL:
You can get all pages and all images from a MediaWiki site using the [API], even if you are not the owner of the site (provided, of course, that the owner hasn't disabled this function):
Step 1: Use the API to get all page titles and all image URLs. You can write some code to do it automatically.
Step 2: Next, use [Special:Export] to export all pages with the titles you got, and use wget to get all the images you have links for (like this: wget -i img-list.txt).
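A rough sketch of wiring the two steps together: pull the image URLs out of a list=allimages API response into img-list.txt for wget -i. The JSON below is a canned sample shaped like the real response (hosts and file names are invented); a live call would hit api.php?action=query&list=allimages&aiprop=url&format=json.

```shell
# Canned API response standing in for the live query result:
cat > allimages.json <<'EOF'
{"query":{"allimages":[{"name":"A.jpg","url":"https://wiki.example.org/images/a/ab/A.jpg"},{"name":"B.png","url":"https://wiki.example.org/images/b/bc/B.png"}]}}
EOF

# Crude extraction of the "url" fields (jq would be more robust):
grep -oE '"url":"[^"]+"' allimages.json | cut -d'"' -f4 > img-list.txt
cat img-list.txt
# then: wget -i img-list.txt
```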
- Import ALL:
Step 1: Import pages using [Special:Import]
Step 2: Import images using [Manual:ImportImages.php].
There are a few mass upload tools available.
Commonist - www.djini.de/software/commonist/
Commonplace - commons.wikimedia.org/wiki/Commons:Tools/Commonplace (used to be available, but was deprecated as of Jan. 13, 2010)
Both run on the desktop and can be configured to upload to your local wiki (they are configured for Wikipedia and Wikimedia Commons by default). If you are afraid to edit the content of a .jar file, I suggest you start with Commonplace.
Another useful extension exists for MediaWiki itself.
MultiUpload - http://www.mediawiki.org/wiki/Extension:MultiUpload
This extension allows you to drop images in a folder and load them all at once. It supports annotations for each file if necessary and cleans up the folder once it is done. On the downside, it requires opening a shared folder on the server side.
Hope this helps a bit: http://www.mediawiki.org/wiki/Manual:ImportImages.php
As a committer of MediaWiki-Japi, I'd like to point out:
For the use case of pushing pages, including images, from one wiki to another, MediaWiki-Japi now has a command-line mode; see
Issue 49 - Enable commandline interface with page transfer option
Otherwise you can use the MediaWiki API with the language of your choice and use the functions you find in PushPages.java, e.g. download and upload.