does recursive wget download visited URLs? - bash

I want to use wget to recursively download a complete webpage. If, for example, pages at depth level 2 contain links back to pages from level 1 (which have already been downloaded), will wget download them again? If so, is there a way to prevent this from happening?
Would a hand-written wget-like script do any better than wget itself, or is wget already optimised to avoid downloading the same thing over and over? (I'm especially worried about menu links that appear on every page.)
Thank you in advance

A single wget run should never try to download the same page twice; it wouldn't be very useful for mirroring if it did. :) It also has some other failsafes, such as refusing to recurse to another domain or (with --no-parent) above the starting directory.
If you want to be sure it's doing the right thing, I suggest just trying it out and watching what it does; you can always mash ^C.
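For reference, a one-shot mirroring command along these lines is what I'd try first (example.com is just a placeholder; all of these are standard wget options):

# Mirror the site in a single run. Within that run wget remembers which
# URLs it has already fetched, so a menu link that appears on every page
# is only downloaded once.
wget --recursive --level=inf \
     --no-parent \
     --convert-links \
     --page-requisites \
     --wait=1 \
     https://example.com/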

Related

Is there a way to recover an entire website from the wayback machine?

My website files got corrupted and I somehow lost all the backup files. Can anyone please suggest a process for downloading the entire site?
It's a simple HTML site. Once it's downloaded, how can I host it?
Please help
You can't use a regular crawler, because the content served by the Wayback Machine still contains the original links; in the browser those links are rewritten by a client-side script to point back at the Wayback Machine, so if you crawl without rewriting them you leave the archived copy as soon as you follow a link from the first page.
If it's simple HTML, as you mentioned, and very small, you might want to save the pages manually or even copy the contents by hand into a new website structure. If it's not small, then try the tools mentioned in the answers to this similar question on Super User: https://superuser.com/questions/828907/how-to-download-a-website-from-the-archive-org-wayback-machine
After downloading it you may have to check the structure of the files downloaded for links that may have been incorrectly rewritten or for missing files. The links that point to files that belong to the website should be local links and not external ones. Then you can host it again on a web hosting service of your preference.
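As a rough sketch of that post-download check, you can search the downloaded tree for absolute Wayback Machine links that were never rewritten (restored-site is a placeholder for wherever you saved the copy):

# List files in the local copy that still contain absolute
# web.archive.org links and therefore still need fixing.
grep -rl 'web.archive.org' restored-site/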

Capybara + Downloading and using file

I'm using Capybara to navigate through a login on a website and then download a few files (I'm automating a frequent process that I have to do). There's a few things I tried that aren't working and I'm hoping someone might know a solution...
I have the two links I'm calling .click on, but while one file does start downloading (this is using the Chrome Selenium driver), Capybara seems to stop functioning after that. Calling .click on the other link doesn't do anything... I figured it's because it's not technically on the page anymore (since it followed a download link), but I tried revisiting the page to click the second link and that doesn't work either.
Assuming I can get that working, I'd really like to be able to download to my script location rather than my Downloads folder, but I've tried every profile configuration I've found online and nothing seems to change it.
Because of the first two issues, I decided to try wget... but I would need to continue from the session in capybara to authenticate. Is it possible to pull the session data (just cookies?) from capybara and insert it into a wget or curl command?
Thanks!
For #3 - accessing the cookies is driver-dependent; in Selenium it's
page.driver.browser.manage.all_cookies
or you can use the https://github.com/nruth/show_me_the_cookies gem, which normalizes access across most of Capybara's drivers. With those cookies you can write them out to a file and then use the --load-cookies option of wget (the --cookie option in curl).
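For example, assuming you've written the cookies out to cookies.txt in Netscape cookie-file format (the file name and URL below are placeholders), the download itself can then be driven from the shell:

# Reuse the authenticated Capybara session from the command line.
wget --load-cookies cookies.txt https://example.com/protected/report.csv

# curl equivalent; -b/--cookie also accepts a cookie file.
curl -b cookies.txt -O https://example.com/protected/report.csv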
For #1 you'd need to provide more info: any errors you get, what current_url is, what "doesn't work" actually means, etc.

Joomla 2.5.16 take up to 2min to load

A relative asked me to fix a Joomla website (v2.5.16) that was hacked last year, probably due to a lack of updates (it is up to date now); unfortunately I have no information about the hack itself. The issue is that the front end takes ~2 minutes to load. The administration loads normally, so whatever the issue is, it depends on the front end. I have already disabled all modules one by one and switched the template for another one to make sure the bug is not in the template or plugins folders, without success.
I must add that the problem is "probably" more recent than the hack, according to this person. So maybe there is a script somewhere reaching out to a remote server which may no longer respond.
PS: the website is on shared hosting. I have FTP access but no SSH.
I know I'm not giving enough details to pinpoint the cause; what I really need is a method for tracking down what could be going wrong and where, more than a ready-made fix.
Thanks in advance,
We have written a lengthy post explaining why a website might be slow: http://www.itoctopus.com/20-questions-you-should-be-asking-yourself-if-your-joomla-website-is-slow
From the looks of it, it might be that the website is still hacked. Try overwriting the Joomla files with a fresh Joomla install and see if that addresses the problem.
Solving this issue will probably involve some or all of the following:
- updating Joomla and all third-party extensions to the latest versions
- checking for and fixing malicious files using http://myjoomla.com or https://sucuri.net or similar (see the sketch after this list)
- analysing the performance of the website using http://gtmetrix.com (it's free) or similar to pinpoint and fix whatever is taking the most time to load
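For the malicious-file check, since you only have FTP, one approach is to download a copy of the site and scan it locally; a rough sketch (the patterns are only common examples, not an exhaustive malware signature list):

# Scan a local copy of the site (pulled down over FTP) for PHP files
# containing obfuscation patterns often seen in compromised Joomla sites.
grep -rlE --include='*.php' 'eval\(base64_decode|gzinflate\(base64_decode' .

# Also list PHP files modified in the last 30 days; recently changed
# files can reveal injected code.
find . -name '*.php' -mtime -30 -print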
If the website has been hacked, you may need to reset passwords etc once the malicious files have been removed. See https://joomla.stackexchange.com/a/180/120 for more information about securing the website once it is fixed.

windows hard link - protect against writes

I have a bunch of files that I download at some point and then customize. I want to keep the originals, but also allow modifications, and I want to do this using hard links.
I figure I'll first download the batch of files into some sort of repository, then create hard links to them in my work location. I want to let the user delete his files (i.e. delete the hard link), which doesn't pose problems.
However I also want to let him write to them, in which case I want my original file to be left untouched in the repository, so I can revert later. How can I do this transparently, without actually locking the file and forcing him to delete it and recreate it?
Any ideas greatly appreciated, thanks.
Cosmin
In Windows you have no such option, as NTFS/FAT has no copy-on-write mechanism for this. Hard links are just additional names for the same file anyway: both names point to a single file, so if the file is changed through link A, the change shows up through link B as well.
You can partially achieve the same result with Windows File History; however, I don't know of any way to set it up exactly as you described.
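To illustrate the shared-data behaviour (shown here with GNU coreutils for brevity; mklink /H on NTFS behaves the same way):

# Create a file and a second hard link to it.
echo original > repo.txt
ln repo.txt work.txt

# Writing through one name changes the single underlying file,
# so the "repository" copy is modified as well.
echo edited > work.txt
cat repo.txt    # prints "edited"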

Show which images are used and which are unused in a website directory

This is a little bit of a strange question.
I've been working on a website, and in its early stages of development it went through some drastic redesigns (several of them, in fact), so the directory is now bloated with images and assets that were part of the old designs. Some of these assets were re-used and some were not. The server space I'm uploading the website to is currently smaller than the website itself, and I know that once I clear out the old assets it will fit.
I'm basically wanting some magical tool to filter out which images have been used and which have not - so ultimately I can remove the ones that have not been used.
I ask it in this forum because if there isn't a magical tool to do this (I sincerely hope there is), I'll need to write some sort of script (PHP perhaps?) to accomplish this.
I have never found one, and tend to take the approach of manually removing old images that I can easily tell are no longer needed. And accepting that I will not get them all.
The reverse approach is to remove all of the images and see which ones are needed (using Firebug or something similar to identify missing images on the pages).
The problem with automated tools is that images referenced from CSS and code may not be picked up. If you build an image path in code from a range of parameters, how can any tool find that?
I hope someone else can come along and prove me wrong....
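If you do end up scripting it, a rough shell sketch of the grep-the-filename approach is below. assets/ and the extensions are placeholders for your own layout, and (as above) any image whose name is built dynamically in code will show up as a false positive:

# For every image under assets/, report it as possibly unused if its bare
# file name never appears in any HTML, CSS, JS or PHP source file.
find assets -type f \( -iname '*.png' -o -iname '*.jpg' -o -iname '*.gif' \) |
while read -r img; do
    name=$(basename "$img")
    if ! grep -rqF --include='*.html' --include='*.css' \
                   --include='*.js' --include='*.php' \
                   -e "$name" .; then
        echo "possibly unused: $img"
    fi
done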
