wget recursive/mirror option not following links - shell

I am trying to mirror a website. wget seems to do the job very well; however, it's not working on some pages.
Looking at the manual, the command
wget -r https://www.gnu.org/
should download the GNU page, and it actually does. However, if I use another page, for example the start page of my personal website, this doesn't work anymore.
wget -r https://my-personal.website
The index.html is downloaded, but none of the CSS/JS, let alone anything downloaded recursively. All I end up with is the index.html.
I've tried setting the User-Agent using the -U option, but that didn't help either. Is there an option I'm missing that would keep wget from stopping after the index.html?
UPDATE: I've also tried the --mirror option, which doesn't work either and shows the same behavior.

Your website uses a relatively little-known form of robots control, through the <meta> tag in HTML. You can read more about it here. Wget correctly adheres to the instructions in this robots directive. You can see this happening if you look a little more closely at Wget's debug output when trying to recursively download the website:
no-follow in my-personal.website/index.html: 1
Now, unfortunately, that's not a very helpful message unless you're one of the developers and know the codebase. I will try to update the message to be a little clearer in this case, just the way we do when the same thing happens because of a robots.txt file.
Anyway, the fix is simple: disable robots parsing. While this is okay when accessing your own website, please be mindful of other people's web servers when doing this elsewhere. The full command you need is:
$ wget -r -erobots=off https://my-personal.website
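If you want to confirm that the meta tag is what's blocking the crawl, a quick check from the shell works too. This is just a sketch; the exact attribute order and quoting on your page may differ:
$ curl -s https://my-personal.website | grep -io '<meta[^>]*robots[^>]*>'
If that prints a tag whose content attribute includes nofollow (or none), that's the directive Wget is honoring.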
EDIT: As promised, added an improved message. See here. It now prints:
no-follow attribute found in my-personal.website/index.html. Will not follow any links on this page

Related

Downloading all hyperlinked URLs from a Tumblr blog?

What's the best way to download all images / webms / mp4s from a Tumblr blog?
I'm looking to download all the posts / images / videos from some Tumblr blogs, and they hyperlink gfycat / webm versions in the body of the post, which Tumblripper / BulkImageDownloader / other Tumblr image downloaders don't catch. I think the problem is that they're hyperlinked in the body and not actually hosted "on" Tumblr.
Anyone know of a good solution to download everything from a Tumblr blog? I've also tried wget and httrack but they don't seem to work.
I would prefer to use a program with a GUI to do what I need to do, as opposed to a command-line-based program, since I barely know how to work them. It took me too long to figure out wget, and I don't have the time to learn another tool just to download Tumblr blogs.
I understand that you are averse to command-line tools; however, I would personally use curl to write the page source to a file:
curl www.tumblr.com/something > outfile.html
Then you can parse the file in whatever language you are comfortable with.
This answer has some excellent suggestions on how to do that with grep:
https://unix.stackexchange.com/questions/181254/how-to-use-grep-and-cut-in-script-to-obtain-website-urls-from-an-html-file
such as this one:
$ curl -sL https://www.google.com | grep -Po '(?<=href=")[^"]*(?=")'
Which gives you:
/search?
https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=IN&tab=w1
https://news.google.co.in/nwshp?hl=en&tab=wn
...
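For the Tumblr case specifically, you could narrow that pattern to media links and hand the results straight to curl. A rough sketch — the blog URL is a placeholder, and it assumes the links are absolute URLs ending in the file extension:
$ curl -sL https://example.tumblr.com/post/12345 \
    | grep -Po '(?<=href=")[^"]*\.(webm|mp4|gif|jpg|png)(?=")' \
    | sort -u \
    | xargs -n 1 curl -O
Relative links would need the blog's base URL prepended before the download step.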

godoc without showing source code

I have several Go projects that are documented in a godoc-compatible way. We use godoc internally as a doc server to share docs and code, without significant problems. However, we need more control over exposing the code when we want to share docs with a third party. Is there a way to run godoc in a special mode that shows types and docs but never links to or shows the source code?
I've tried
godoc -http=0.0.0.0:8090 -links=false -src=false
but it's not working; I can still click through to the type definition code. Just wondering if I missed something. Go version: 1.3.
The -src flag only applies to command-line mode, not to server mode, so it won't help you. The way I see it, there are a few options:
1. Rewrite godoc for your needs and use your own fork.
2. Don't use server mode; render the docs in command-line mode and just create a server out of that.
3. Better yet (I'm not entirely sure option 2 will work): rewrite the templates a bit so the source code won't be linked. But you'll still need to make sure people who enter the path manually won't see the code, so it will require fudging the source templates as well. Or...
4. Maybe the simplest thing: run it behind nginx or a similar reverse proxy, and make sure the /src path on the server is closed to outside visitors, or password-protected, or whatever. That way your internal team can still use it.
Personally I'd go with option 4; it's a couple of minutes of work and will be the most robust and flexible solution.
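For option 4, the nginx side can be as small as a deny rule on the source path plus a proxy rule for everything else. A minimal sketch, assuming godoc keeps listening on port 8090 (ideally bound to 127.0.0.1 instead of 0.0.0.0, so nginx is the only way in):
server {
    listen 80;
    # godoc serves source files under /src/; keep that away from outside visitors
    location /src/ {
        deny all;
    }
    # everything else (package docs, type listings) is proxied through untouched
    location / {
        proxy_pass http://127.0.0.1:8090;
    }
}
You could swap deny all for auth_basic if the internal team should still be able to reach the source through the same host.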

Firefox Add-on SDK 1.17 Annotator tutorial: widget/button does not appear

I am trying to work through https://developer.mozilla.org/en-US/Add-ons/SDK/Tutorials/Annotator with jpm (https://developer.mozilla.org/en-US/Add-ons/SDK/Tools/jpm) rather than cfx, and running into difficulties: the button/widget that the add-on adds does not appear in my browser. Not even in the Additional Tools and Features section if I go to customize the browser appearance.
This is the SDK v1.17, and Firefox v38.0.1 for Linux (openSuSE13.2).
I have created the structure and files with given names and contents, telling jpm to use main.js as the entry point, rather than index.js, in order to match the tutorial (which is cfx-based).
I am also passing jpm the -b PATH-TO-FIREFOX-BINARY flag, because it apparently doesn't follow the symlink at /usr/bin/firefox, but it sounds like that's a known issue.
I am also also passing jpm the -p MY-DEV-PROFILE flag because I found that with the introductory tutorial (https://developer.mozilla.org/en-US/Add-ons/SDK/Tutorials/Getting_Started_%28jpm%29) that was the only way I could get that button to show up.
But that doesn't help here, nor does leaving off that option.
The Addon Manager confirms that the extension is installed.
So I am open to suggestions. Obviously I am new to extension development, and pretty new to javascript in general.
I had also better ask while I am here: What I want to do is modify the behaviour of Firefox's Find (in page); can something like that be done with the SDK, or do I need to use the Overlay method?
Any other suggestions helpful for learning addon development would also be welcome (but should probably be done as comments, rather than Answers; let's save Answers for the original question about this tutorial button).
Thanks!
The widget API was removed in Firefox 38. For most cases you can replace widget with the button APIs we introduced in Firefox 29; see this blog post for more information.
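For reference, the port from widget to the button API is small. A minimal sketch of roughly what the tutorial's widget becomes — the id, label and icon path are placeholders, and icon path resolution differs slightly between cfx and jpm:
// main.js — uses the button API available since Firefox 29
var { ActionButton } = require("sdk/ui/button/action");

var button = ActionButton({
  id: "annotator-button",        // must be unique within the add-on
  label: "Annotator",
  icon: "./pencil-icon.png",     // placeholder icon file shipped with the add-on
  onClick: function (state) {
    console.log("annotator button clicked");
  }
});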
Ah, heheh, never mind.
It was just an impedance mismatch between the original cfx instructions and the jpm way of doing things.
While I had told jpm to use main.js instead of index.js, I had failed to tell it that main.js was in the "./lib/" directory instead of the root directory of the extension.
After changing the package.json to say
"main": "./lib/main.js"
it works - as far as that goes. But it turns out that the entire tutorial is no longer valid; see my (Edward's) comment on canuckistani's answer.
My subsidiary questions about whether the SDK will even do what I want (changing some Find behaviour) and any other advice/resources still stand, however.

Wkhtmltopdf version, first page and TOC

Some questions about this very nifty tool, which is unfortunately short on usage examples.
The manual speaks of possible “Reduced Functionality” for wkhtmltopdf. I have version wkhtmltox-0.11.0_rc1-installer.exe; if I run wkhtmltopdf --version, what should I look for to tell whether my version is the reduced one or not?
Currently I like wkhtmltopdf for webpages I want to read later and/or store. To mirror webpages I use httrack, then I generate the PDF with wkhtmltopdf *.html offline.pdf. How can I set/specify the first PDF page from the *.html list? Currently they seem to be converted in alphabetical order.
If I run wkhtmltopdf toc http://qt-project.org/doc/qt-4.8/qstring.html qstring.pdf I simply get a leading blank page, no TOC. What’s wrong?
Thanks for helping
EDIT:
#Nenotlep:
Your TOC trick works perfectly.
As for the first page, I don’t need an actual cover.
What I need is a way to download/convert a given page www.site.com/foo.html and all the linked pages (A.html, B.html ...) up to a certain depth level. Then I want a single PDF starting with foo.html and containing also the pages A.html, B.html ... (with relative links).
I don’t think there is an option to download and insert the linked pages in the final PDF (please correct me if I am wrong), so I use httrack.com to download and wkhtmltopdf to convert. Given the alphabetical behaviour of wkhtmltopdf, the best option now seems to be renaming the target page downloaded with httrack to something like !foo.html.
Please, let me know of possible alternatives.
For part 3 of the question (the blank TOC): the latest stable version 0.12.5 also does not generate it. The pre-release version 0.12.6-dev has fixed this problem on Mac.
I think all the available precompiled builds of wkhtmltopdf are compiled with the patched Qt, so they are not reduced. Reduced functionality means it was compiled without the special patched version of Qt. I use the Windows version and it isn't reduced.
I think the cover command-line argument would work for you. I can't test at the moment, but try a command like wkhtmltopdf cover derpy.html toc --xsl-style-sheet default.xsl rarity.html twilight.html spike.html equestriadaily.pdf
At least on Linux, I think the asterisk in *.html simply expands into all the HTML files before the command is performed, so if you select one HTML file for the cover and then do *.html in the same folder, you will get that file twice. Getting around this issue might need some command-line sorcery, a batch file, or some other trickery.
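One way around the duplicate-file problem, sticking with the shell: build the file list yourself and drop the cover page from it first. A sketch, assuming bash and filenames without spaces, with foo.html standing in for the page you want up front:
$ files=$(ls *.html | grep -vx 'foo.html')
$ wkhtmltopdf cover foo.html toc --xsl-style-sheet default.xsl $files offline.pdf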
This is a bug in wkhtmltopdf. The workaround is to manually set a tocfile. You can get the default tocfile with wkhtmltopdf.exe --dump-default-toc-xsl. Then you can save the output as a file and use it like wkhtmltopdf.exe toc --xsl-style-sheet default.xsl www.stackoverflow.com so.pdf.
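Putting those two steps together against the page from the original question (default.xsl is just whatever filename you save the dumped stylesheet under):
wkhtmltopdf.exe --dump-default-toc-xsl > default.xsl
wkhtmltopdf.exe toc --xsl-style-sheet default.xsl http://qt-project.org/doc/qt-4.8/qstring.html qstring.pdf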

Saving PDF files with Chickenfoot

I'm writing a web-crawler using Chickenfoot and need to save PDF files. I can either click the link on the page or grab the PDF's URL and use
go("http://www.whatever.com/file.pdf")
and I get the firefox "Opening file.pdf" dialog box, but can't click the "OK" button to actually save the file.
I've tried using other means to download the files (wget, python's urllib2, twill), but the PDF files are gated so none of those will work.
Any help is appreciated.
This example of how to save a target in the Mozilla developer documentation looks like it should do exactly what you want. I've tested a very similar Chickenfoot example that gets the temp environment variable, and that worked well for me in Chickenfoot.
https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIWebBrowserPersist#Example
You might have to play with the application associations in Tools, Options, Applications to make sure the action is set to Save File, but those settings might not apply to these functions.
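In case the link moves, here is roughly what that MDN example boils down to in a Chickenfoot/Firefox 3 context. This is a sketch, not tested against your gated PDFs; the destination path is a placeholder, and saveURI gained extra arguments in later Firefox versions, so match the signature to your version:
const Cc = Components.classes, Ci = Components.interfaces;

// Create the persist object that does the actual saving.
var persist = Cc["@mozilla.org/embedding/browser/nsWebBrowserPersist;1"]
                .createInstance(Ci.nsIWebBrowserPersist);
persist.persistFlags = Ci.nsIWebBrowserPersist.PERSIST_FLAGS_REPLACE_EXISTING_FILES;

// The URL to fetch and the local file to write it to (placeholder path).
var ios = Cc["@mozilla.org/network/io-service;1"].getService(Ci.nsIIOService);
var uri = ios.newURI("http://www.whatever.com/file.pdf", null, null);
var file = Cc["@mozilla.org/file/local;1"].createInstance(Ci.nsILocalFile);
file.initWithPath("C:\\downloads\\file.pdf");

// Old 6-argument form (Firefox 3 era): uri, cacheKey, referrer, postData, extraHeaders, file.
persist.saveURI(uri, null, null, null, null, file);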
End Answer, begin related grumblings...
I sure wish someone would fix the many bugs in Chickenfoot, and write a nice Cookbook programming guide. I've been using it for years, and there are still many basic things I've not been able to figure out how to do. I finally broke down and subscribed to the mailing list, as the archives have some decent script examples. It takes a lot of searching through the pdf references, blogs, etc. as the web API reference is very sparse.
I love how simple Chickenfoot can make automating some tasks, but it takes me days of searching javascript, DOM, and Firefox documents to find ways to do some of the things it can't, since I'm not really a web programmer. The goal of Chickenfoot seems to be that I shouldn't have to be, but unfortunately few are refining the proof of concept, as MIT has dropped the project.
I tried to do this several ways using only Chickenfoot commands and confirmed they don't work with the latest Firefox 3 and Chickenfoot 1.0.7.
I hope this helps! Good luck. Sorry I only ran across your question yesterday, but found it too interesting to leave alone.
You won't be able to click on Firefox dialogs, for security reasons.
The best way to download the content of a URL is to read it and then write it out to a file.
// Chickenfoot 1.0.7 Javascript Code to download the content of a url.
include( "fileio.js" ); // enables the write function.
var url = "http://google.com",
    saveFileTo = "c://chickenfoot-google.com";
write( saveFileTo, read( url ) );
You might find it helpful to use jquery with chickenfoot.
http://groups.csail.mit.edu/uid/chickenfoot/scripts/index.php?title=Using_jQuery,_jQuery_UI_and_similar_libraries
This has worked for me to save Excel files from NCES portal.
http://muaz-khan.blogspot.com/2012/10/save-files-on-disk-using-javascript-or.html
I was using Firefox 3.0 and the "old syntax" version of the code. I also stripped code intended for IE and "(window.URL || window.webkitURL).revokeObjectURL(save.href);" which generated an error.
