Downloading HTML links from an online query search

I have posted this question at SE Bioinformatics, but I think it is ultimately a programming problem.
I have just started using the BlastKOALA website from KEGG to search for (amino acid) sequence similarities.
This is the website: https://www.kegg.jp/blastkoala/
When you get results, there are links for downloading. However, these links do not fetch the detailed results for each query, only the HTML page already being displayed. To get the detailed results I would need to click on each query result on the page manually, which becomes impracticable with >500 entries. So I think what I need is a tool that downloads all of the content linked from a webpage.
I have tried wget, but with no success. Example command below:
wget -r -l3 -H -t1 -nd -N -np -e robots=off "https://www.kegg.jp/kegg-bin/blastkoala_result?id=7345faab3969a3ef10a4a7542ba749386068138f&passwd=NKfm46&type=blastkoala"
It says 'Requested Job Not Found', even if I change several parameters.
Has anyone ever encountered such a problem? Thanks in advance.
A working link (I think it should remain active for a week):
https://www.kegg.jp/kegg-bin/blastkoala_result?id=7345faab3969a3ef10a4a7542ba749386068138f&passwd=NKfm46&type=blastkoala
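One approach I am considering (untested, and assuming the per-query detail pages are reachable through ordinary links rather than generated by JavaScript) is to fetch the result page once and then loop over the links it contains, roughly like this; the grep pattern and the way relative links are resolved are guesses about the page's markup:

# Rough sketch only: download the result page, extract hrefs, fetch each one.
result_url="https://www.kegg.jp/kegg-bin/blastkoala_result?id=7345faab3969a3ef10a4a7542ba749386068138f&passwd=NKfm46&type=blastkoala"
base="https://www.kegg.jp"

wget -q -O result.html "$result_url"

grep -oE 'href="[^"]+"' result.html \
  | sed -e 's/^href="//' -e 's/"$//' \
  | while read -r link; do
      case "$link" in
        http*) url="$link" ;;                 # already absolute
        /*)    url="$base$link" ;;            # site-root relative
        *)     url="$base/kegg-bin/$link" ;;  # guess: relative to /kegg-bin/
      esac
      wget -q -nc "$url"   # -nc: do not re-download files that already exist
      sleep 1              # be polite to the server
    done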

Related

Capybara + Downloading and using file

I'm using Capybara to navigate through a login on a website and then download a few files (I'm automating a frequent process that I have to do). There are a few things I tried that aren't working, and I'm hoping someone might know a solution...
I have the two links I'm calling .click on, but while one file will start downloading (this is using the Chrome Selenium driver), Capybara seems to stop functioning after that. Calling .click on the other link doesn't do anything... I figured it's because it's not technically on the page anymore (since it followed a download link), but I tried revisiting the page to click the second link and that doesn't work either.
Assuming I can get that working, I'd really like to be able to download to my script location rather than my Downloads folder, but I've tried every profile configuration I've found online and nothing seems to change it.
Because of the first two issues, I decided to try wget... but I would need to carry over the session from Capybara to authenticate. Is it possible to pull the session data (just cookies?) out of Capybara and insert it into a wget or curl command?
Thanks!
For #3: accessing the cookies is driver-dependent. In Selenium it's
page.driver.browser.manage.all_cookies
or you can use the https://github.com/nruth/show_me_the_cookies gem, which normalizes access across most of Capybara's drivers. With those cookies you can write them out to a file and then use the --load-cookies option of wget (the --cookie option in curl).
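The file itself just needs to be in the Netscape cookies.txt format that --load-cookies expects. A rough, untested sketch with placeholder domain, cookie name, and value:

# Write one line per cookie; the seven fields must be TAB-separated:
# domain  include-subdomains  path  secure  expiry  name  value
printf '# Netscape HTTP Cookie File\n' > cookies.txt
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    "example.com" "FALSE" "/" "FALSE" "0" "_session_id" "abc123" >> cookies.txt

wget --load-cookies cookies.txt https://example.com/protected/file.zip
# or hand the same cookie straight to curl:
curl --cookie "_session_id=abc123" -O https://example.com/protected/file.zip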
For #1 you'd need to provide more info: any errors you get, what current_url is, what "doesn't work" actually means, etc.

Using wkhtmltopdf from a web page

I am looking for a solution to produce a PDF document from a web page and have read the good reviews of wkhtmltopdf. I might be missing something, but it appears that it is only run from the command line on the machine being used. I have a site that is predominantly PHP, for which there is a login, and a number of session variables and database queries are used to produce the HTML output. What I am looking for is a way to add a single button to the generated HTML page that produces the PDF version.
Also, while I have installed wkhtmltopdf locally and tested it out from the command line, I can't see how you can call it from where my site is hosted.
I have used HTML2PDF in the past, but it struggled with tables, so I thought I would give wkhtmltopdf a go; however, I am not sure it will meet my needs. Can it do what I am asking of it?
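For reference, this is roughly the kind of command I have been testing locally and that a server-side script would ultimately need to run (the URL, output path, and session cookie name/value here are placeholders). wkhtmltopdf's --cookie option, if the installed build supports it, can pass the visitor's session so the logged-in version of the page is rendered:

# Placeholder values throughout; wkhtmltopdf has to be installed on the web host.
wkhtmltopdf --cookie PHPSESSID "abc123def456" \
    "https://example.com/report.php" /tmp/report.pdf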

Automatically download Cacti Weathermap at regular intervals

I was looking for a way to automatically download the weathermap image from the Cacti Weathermap plugin at regular intervals. There does not seem to be an easy way to do this listed anywhere on the internet using only Windows, so I thought I'd a) ask here and b) post what I've managed so far.
P.S. I've posted where I got up to in one of the answers below.
You can easily right-click > Save As while on the weathermap page. This produces a file called weathermap-cacti-plugin.png.
However, no such file is directly available from the webpage. Right-clicking > View URL gave me this:
http://<mydomain>/plugins/weathermap/weathermap-cacti-plugin.php?action=viewimage&id=df9c40dcab42d1fd6867&time=1448863933
I did a quick check in PowerShell to see if this was downloadable (it was):
$client = new-object System.Net.WebClient
$client.DownloadFile("http://<mydomain>/plugins/weathermap/weathermap-cacti-plugin.php?action=viewimage&id=df9c40dcab42d1fd6867&time=1448864049", "C:\data\test.png")
Following a hunch, I refreshed the page and copied a couple more URLs:
<mydomain>/plugins/weathermap/weathermap-cacti-plugin.php?action=viewimage&id=df9c40dcab42d1fd6867&time=1448863989
<mydomain>/plugins/weathermap/weathermap-cacti-plugin.php?action=viewimage&id=df9c40dcab42d1fd6867&time=1448864049
As I had suspected, the time= parameter changed every time I refreshed the page.
Following another hunch, I checked the changing digits (1448863989 etc) on epochconverter.com and got a valid system time which matched my own.
I found a PowerShell command (Keith Hill's answer on this question) to get the current Unix time
[int64](([datetime]::UtcNow)-(get-date "1/1/1970")).TotalSeconds
and added this to the PowerShell download script:
$client = new-object System.Net.WebClient
# current Unix epoch time, matching the time= parameter the weathermap page appends
$time=[int64](([datetime]::UtcNow)-(get-date "1/1/1970")).TotalSeconds
$client.DownloadFile("http://<mydomain>/plugins/weathermap/weathermap-cacti-plugin.php?action=viewimage&id=df9c40dcab42d1fd6867&time=$time", "C:\data\test.png")
This seems to work: the modified timestamp of test.png was updated every time I ran the code, and the file opened as a valid picture of the weathermap.
All that's required now is to put this in a proper script and schedule it to run every X minutes, saving to a folder. I am sure scheduling PowerShell scripts in Task Scheduler is covered elsewhere, so I will not repeat it here.
If anyone knows an easier way of doing this, please let me know. Otherwise, vote up my answer - I have searched a lot and cannot find any other results on the net that let you do this using only Windows. The Cacti forums have a couple of solutions, but they require you to do stuff on the Linux server, which is hard for a Linux noob like me.

Define download size for wget or curl

As part of a bash script I need to download a file with a known file size, but I'm having issues with the download itself: the file only gets partially downloaded every time. The server I'm downloading from doesn't seem particularly well set up: it doesn't report the file size, so wget (which I'm using currently) doesn't know how much data to expect. However, I know the exact size of the file, so in theory I could tell wget what to expect. Does anyone know if there is a way to do this? I'm using wget at the moment, but I can easily switch to curl if that works better. I know how to adjust timeouts (which might help too) and retries, but I assume that for retries to work it needs to know the size of the file it's downloading.
I have seen a couple of other questions suggesting it might be a cookie problem, but that's not it in my case. The actual size downloaded varies from <1 MB to 50 MB, so it looks more like some sort of lost connection.
Could you share the entire command, so we can check which parameters you are using? It is a strange case, though.
You could use the -c parameter, which resumes the download from the point where it stopped after a retry.
Or you can try the --spider parameter, which checks whether the file exists and writes the file info to the log.
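Since the exact size is already known, one (untested) way to use that knowledge from the bash script is to keep resuming with -c until the file on disk reaches that size; the URL, filename, and size below are placeholders:

url="https://example.com/big-file.bin"   # placeholder URL
expected_size=52428800                   # the size you already know, in bytes

# -c resumes the partial file; --tries/--timeout bound each attempt.
# Keep retrying until the file on disk reaches the expected size, since the
# server never reports a Content-Length of its own. (stat -c%s is GNU stat.)
while [ "$(stat -c%s big-file.bin 2>/dev/null || echo 0)" -lt "$expected_size" ]; do
    wget -c --tries=3 --timeout=30 "$url"
    sleep 5
done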

does recursive wget download visited URLs?

I want to use wget to recursively download a complete webpage. If, for example, pages at depth 2 contain links to pages from depth 1 (which have already been downloaded), will wget download them again? If so, is there a way to prevent this from happening?
Would a hand-written wget-like script be better, or is wget already optimised to avoid downloading the same things over and over? (I'm especially worried about menu links that appear on all pages.)
Thank you in advance
A single wget run should never try to download the same page twice. It wouldn't be very useful for mirroring if it did. :) It also has some other failsafes, like refusing to recurse to another domain, or to a higher directory if you pass --no-parent.
If you want to be sure it's doing the right thing, I suggest just trying it out and watching what it does; you can always mash ^C.
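For example (example URL), a run along these lines fetches each page only once no matter how many menus link back to it, stays on the starting host, and, because of --no-parent, stays below the starting directory:

wget --recursive --level=2 --no-parent --wait=1 \
     --adjust-extension --convert-links \
     https://example.com/docs/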

Resources