Downloading all hyperlinked URLs from a Tumblr blog? - download

What's the best way to download all images / webms / mp4s from a Tumblr blog?
I'm looking to download all the posts / images / videos from some Tumblr blogs, and they hyperlink gfycat / webm versions in the body of the post, which Tumblripper / BulkImageDownloader / other Tumblr image downloaders don't catch. I think the problem is that the files are hyperlinked in the body rather than actually hosted "on" Tumblr.
Anyone know of a good solution to download everything from a Tumblr blog? I've also tried wget and httrack but they don't seem to work.
I would prefer to use a program with a GUI to do what I need to do, as opposed to a command-line-based program, since I barely know how to work them. It took me too long to figure out wget and I don't have the time to learn another one to download Tumblr blogs.

I understand that you are averse to command line tools; however, I would personally use curl to write the page source to a file:
curl www.tumblr.com/something > outfile.html
Then you can parse the file in whatever language you are comfortable with.
This answer has some excellent suggestions on how to do that with grep:
https://unix.stackexchange.com/questions/181254/how-to-use-grep-and-cut-in-script-to-obtain-website-urls-from-an-html-file
such as this one:
$ curl -sL https://www.google.com | grep -Po '(?<=href=")[^"]*(?=")'
Which gives you:
/search?
https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=IN&tab=w1
https://news.google.co.in/nwshp?hl=en&tab=wn
...
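Since the question is specifically about gfycat / webm / mp4 links hyperlinked in post bodies, here is a minimal sketch along the same lines. The blog URL, the /page/N pagination pattern, and the 50-page limit are all assumptions you would adjust:
# Pull the first 50 pages of the blog, extract hyperlinked gfycat/webm/mp4 URLs
# from the post bodies, then hand the de-duplicated list to wget.
blog="https://example.tumblr.com"
for page in $(seq 1 50); do
  curl -sL "$blog/page/$page"
done | grep -Po '(?<=href=")[^"]*\.(webm|mp4)(?=")|(?<=href=")https?://gfycat\.com/[^"]*(?=")' \
  | sort -u > urls.txt
wget --content-disposition -i urls.txt
Note that gfycat links point at HTML pages rather than the video files themselves, so you may still need a second pass to resolve them to the underlying webm/mp4 URLs.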

Related

wget recursive/mirror option not following links

I am trying to mirror a website at the moment. wget seems to do the job very well; however, it's not working on some pages.
Looking at the manual, the command
wget -r https://www.gnu.org/
should download the GNU page. And it actually does that. However, if I use another page, for example the start page of my personal website, this doesn't work anymore.
wget -r https://my-personal.website
The index.html is downloaded, but none of the CSS/JS, let alone anything recursive. All that gets saved is the index.html.
I've tried setting the User-Agent using the -U option, but that didn't help either. Is there an option missing that is causing wget to stop after the index.html?
UPDATE: I've also tried the --mirror option, which is also not working and showing the same behavior.
Your website uses a relatively little-known form of robots control, through the <meta> tag in HTML. You can read more about it here. Wget will correctly adhere to the instructions in this robots directive. You can see this happening if you look closely at the debug output of Wget when trying to recursively download the website:
no-follow in my-personal.website/index.html: 1
Now, unfortunately, that's not a very helpful message unless you're one of the developers and know the codebase. I will try to update the message to something a little clearer in this case, just as we already do when the same thing happens because of a robots.txt file.
Anyway, the fix is simple: disable robots parsing. While this is okay when accessing your own website, please be mindful of other people's web servers when doing this elsewhere. The full command you need is:
$ wget -r -erobots=off https://my-personal.website
EDIT: As promised, added an improved message. See here. It now prints:
no-follow attribute found in my-personal.website/index.html. Will not follow any links on this page
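For a full offline copy you would typically combine that switch with wget's mirroring options. A sketch, reusing the same placeholder site from above (the exact flag combination is a common one, not something from the original answer):
wget -e robots=off --mirror --page-requisites --convert-links --adjust-extension https://my-personal.website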

wkhtmltopdf with silverstripe 3.2

I'm playing around with PDF generation. After trying the SilverStripe modules for dompdf and tcpdf, which didn't work the way I wanted them to, I came across BetterBrief's module for wkhtmltopdf: https://github.com/BetterBrief/silverstripe-pdf
It should be exactly what I need, but I can't figure out why it's not creating PDFs. I installed it with Composer following the module instructions, then installed the Debian package and set up a demo template with just three words in it to test it. But the PDF file never gets created.
The error I receive is the following, and it's not very helpful to me: http://www.sspaste.com/paste/show/5676bac4a4186
Perhaps someone has had the same problem or knows a solution.
Creating a PDF from the command line works:
wkhtmltopdf http://google.com google.pdf
Edit: That's not a real solution to this problem, but an alternative way to create a PDF with SilverStripe and wkhtmltopdf: https://github.com/creativeSynergy/silverstripe-wkhtmltopdf
A quick Google search shows it has something to do with wkhtmltopdf needing an X server to work.
https://github.com/knplabs/snappy/issues/20
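If a missing X server really is the cause, a common workaround on a headless Debian box is to run the binary under a virtual framebuffer. This is a sketch assuming the xvfb package is available and that the module lets you change how the wkhtmltopdf binary is invoked:
# Install a virtual X server and wrap the wkhtmltopdf call in it.
sudo apt-get install xvfb
xvfb-run wkhtmltopdf http://google.com google.pdf
Alternatively, the static wkhtmltopdf builds from the project site are compiled against a patched Qt and do not need a running X server at all.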

Wkhtmltopdf version, first page and TOC

Some questions about this very nifty tool, which is unfortunately lacking in usage examples.
The manual speaks of a possible “Reduced Functionality” for wkhtmltopdf. I have version wkhtmltox-0.11.0_rc1-installer.exe; when running wkhtmltopdf --version, what should I look for to understand whether my version is the reduced one or not?
Currently I like wkhtmltopdf for webpages I want to read later and/or store. To mirror webpages I use httrack, then I generate the PDF with wkhtmltopdf *.html offline.pdf. How can I set/specify the first PDF page from the *.html list? Currently they seem to be converted in alphabetical order.
If I run wkhtmltopdf toc http://qt-project.org/doc/qt-4.8/qstring.html qstring.pdf I simply get a leading blank page, no TOC. What’s wrong?
Thanks for helping
EDIT:
#Nenotlep:
Your TOC trick works perfectly.
As for the first page, I don’t need an actual cover.
What I need is a way to download/convert a given page www.site.com/foo.html and all the linked pages (A.html, B.html ...) up to a certain depth level. Then I want a single PDF starting with foo.html and containing also the pages A.html, B.html ... (with relative links).
I don’t think there is an option to download and insert the linked pages into the final PDF (please correct me if I am wrong), so I use httrack.com to download and wkhtmltopdf to convert. Given the alphabetical behaviour of wkhtmltopdf, the best option now seems to be renaming the target page downloaded with httrack to something like !foo.html.
Please, let me know of possible alternatives.
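For reference, a sketch of the download-then-convert pipeline described above, with the file names passed explicitly so that foo.html comes first regardless of alphabetical order (the URL, depth, and mirror paths are placeholders):
# Mirror the target page two levels deep, then convert with an explicit page order.
httrack "http://www.site.com/foo.html" -O ./mirror -r2
cd ./mirror/www.site.com
wkhtmltopdf foo.html A.html B.html offline.pdf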
For part 3 of the question, the blank TOC: the latest stable version, 0.12.5, also does not generate it. The pre-release version 0.12.6-dev has fixed this problem on Mac.
I think all the available precompiled wkhtmltopdf builds are compiled with the patched Qt, so they are not reduced. Reduced functionality means it was compiled without the special patched version of Qt. I use the Windows version and it isn't reduced.
I think the cover command line argument would work for you. I can't test at the moment, but try a command like wkhtmltopdf cover derpy.html toc --xsl-style-sheet default.xsl rarity.html twilight.html spike.html equestriadaily.pdf
At least on Linux, I think the asterisk in *.html simply expands into all the HTML files before the command runs, so if you select one HTML file for the cover and then pass *.html in the same folder, you will get that file twice. Getting around this might need some command-line sorcery, a batch file, or some other trickery; one possibility is sketched below.
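One possibility, assuming a Linux shell and no spaces in the file names (the file names are just the ones from the example above):
# Put derpy.html first as the cover, then every other .html file except it.
wkhtmltopdf cover derpy.html toc --xsl-style-sheet default.xsl \
  $(ls *.html | grep -v '^derpy\.html$') equestriadaily.pdf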
This is a bug in wkhtmltopdf. The workaround is to manually set a tocfile. You can get the default tocfile with wkhtmltopdf.exe --dump-default-toc-xsl. Then you can save the output as a file and use it like wkhtmltopdf.exe toc --xsl-style-sheet default.xsl www.stackoverflow.com so.pdf.
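Spelled out as two commands (redirecting the dumped stylesheet to a file is just the obvious way to capture it):
wkhtmltopdf.exe --dump-default-toc-xsl > default.xsl
wkhtmltopdf.exe toc --xsl-style-sheet default.xsl www.stackoverflow.com so.pdf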

Publishing toolchain with asciidoc / markdown input, html / pdf output

I saw this related question about publishing toolchains, but I know many people have done a lot of work on publishing toolchains recently.
One great example I found is this project from akosma.
Avdi Grimm shared his work with org-mode in this project.
I know there are (should be) many others.
What I'm looking for is a publishing toolchain with:
asciidoc / markdown / textile / org-mode or latex input. I don't want xml input
pdf AND html output, epub output is not a requirement for me.
What I can do:
author templates in latex / html / css / js. again, no xml.
read and write ruby and shell scripts
Take a look at asciidoc; this is what O'Reilly has started using, and it is a refreshing break from DocBook. I use asciidoc; the tools and support leave a little to be desired, but there are people working to create better alternatives (that don't involve Python and the existing DocBook pipeline).
Check out this: https://github.com/runemadsen/asciidoc
EDIT 1/6/13: You also really need to check out AsciiDoctor. Dan Allen from Red Hat and Ryan Waldron have been spending a lot of time on this particular package. I expect great things from AsciiDoctor, as it is starting to emerge as the foundation for a number of important AsciiDoc documentation efforts.
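For what it's worth, the kind of dual-output pipeline the question asks for can be driven by a couple of shell commands once Asciidoctor is installed; a sketch, with file names as placeholders and wkhtmltopdf standing in for the PDF step so that no XML/DocBook stage is involved:
# Same AsciiDoc source, HTML and PDF out, no XML in between.
gem install asciidoctor
asciidoctor -o book.html book.adoc
wkhtmltopdf book.html book.pdf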

Saving PDF files with Chickenfoot

I'm writing a web-crawler using Chickenfoot and need to save PDF files. I can either click the link on the page or grab the PDF's URL and use
go("http://www.whatever.com/file.pdf")
and I get the Firefox "Opening file.pdf" dialog box, but can't click the "OK" button to actually save the file.
I've tried using other means to download the files (wget, python's urllib2, twill), but the PDF files are gated so none of those will work.
Any help is appreciated.
This example of how to save a target in the Mozilla developer documentation looks like it should do exactly what you want. I've tested a very similar Chickenfoot example that gets the temp environment variable, and it worked well for me in Chickenfoot.
https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIWebBrowserPersist#Example
You might have to play with the application associations in Tools, Options, Applications to make sure the action is set to Save File, but those settings might not apply to these functions.
End Answer, begin related grumblings...
I sure wish someone would fix the many bugs in Chickenfoot, and write a nice Cookbook programming guide. I've been using it for years, and there are still many basic things I've not been able to figure out how to do. I finally broke down and subscribed to the mailing list, as the archives have some decent script examples. It takes a lot of searching through the pdf references, blogs, etc. as the web API reference is very sparse.
I love how simple Chickenfoot can make automating some tasks, but it takes me days of searching JavaScript, DOM, and Firefox documentation to find ways to do some of the things it can't, since I'm not really a web programmer. The goal of Chickenfoot seems to be that I shouldn't have to be, but unfortunately few are refining the proof of concept, as MIT has dropped the project.
I tried to do this several ways using only Chickenfoot commands and confirmed they don't work with the latest Firefox 3 and Chickenfoot 1.0.7.
I hope this helps! Good luck. Sorry I only ran across your question yesterday, but found it too interesting to leave alone.
You won't be able to click on Firefox dialogs, for security reasons.
The best way to download the content of a URL is to read the URL and then write its content to a file.
// Chickenfoot 1.0.7 Javascript Code to download the content of a url.
include( "fileio.js" ); // enables the write function.
var url = "http://google.com",
saveFileTo = "c://chickenfoot-google.com";
write( saveFileTo, read( url ) );
You might find it helpful to use jQuery with Chickenfoot.
http://groups.csail.mit.edu/uid/chickenfoot/scripts/index.php?title=Using_jQuery,_jQuery_UI_and_similar_libraries
This has worked for me to save Excel files from the NCES portal.
http://muaz-khan.blogspot.com/2012/10/save-files-on-disk-using-javascript-or.html
I was using Firefox 3.0 and the "old syntax" version of the code. I also stripped out the code intended for IE, as well as "(window.URL || window.webkitURL).revokeObjectURL(save.href);", which generated an error.