Downloading a file using wget

I'm trying to download a program from a website using wget. However, whenever I try to download it, I instead get an HTML file. I'm using the following syntax:
wget http://domain.com/downloads/name/
If you go to the link directly, the browser automatically starts downloading the file. Why is it that I'm getting the HTML file instead of the actual file I want?

I don't think wget can get the file for you like this; it will just fetch the HTML of the first page. Your page has some delaying code rather than a direct URL to the file, so you would need a real browser (or the direct file URL) to download it.
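If you can discover the direct file URL (for example, from the browser's developer-tools network tab while the download starts), wget can fetch it directly. The --content-disposition flag additionally makes wget honor the filename the server suggests. A minimal sketch, where the exact file path is a hypothetical placeholder:
wget --content-disposition http://domain.com/downloads/name/the-actual-file.exe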

Related

When I download a file from Box into Google Colab, HTML is downloaded

I am trying to download a file (or several files) from Box into Google Colab using wget, but what is downloaded looks like an HTML page, not the file itself.
I am using the command:
!wget https://AAA.box.com/s/mh7xq8lou9ukb5i7lssz0frou554dupb -O script.py
Is there a problem with the URL that I am using? I got the URL by opening the file in Box and clicking "Get shared link".
You are trying to download from a sharing link, which is a web page, not a direct download link, so wget downloads the web page. As a simple trick, you can click Download in the browser and then cancel it. Then copy the URL from the browser's download list and use that with wget.
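For example (the quoted URL below is a placeholder for whatever direct link the browser shows in its download list):
wget "https://...direct-download-url-copied-from-the-browser..." -O script.py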

Internal links not working when converting .htm to .pdf

I am trying to convert an .htm file from the SEC website to a .pdf and have the internal links work. I am successfully converting to .pdf using wkhtmltopdf, but all the internal links point me back to the first page.
wkhtmltopdf https://www.sec.gov/Archives/edgar/data/1594617/000119312514117433/d640354ds1a.htm test.pdf
It looks like there's an issue with wkhtmltopdf dealing with anchor tags that have no content. There's a PR that was opened in 2017 to resolve it, but it remains open.
As it turns out, your document does indeed have empty anchor tags, so that's probably the root cause:
<A NAME="toc640354_15"></A>
I would suggest using chrome to produce the pdf, with its --headless and --print-to-pdf flags. From your chrome installation directory, do:
chrome.exe --headless --disable-gpu --print-to-pdf="C:\path\to\file.pdf" https://www.sec.gov/Archives/edgar/data/1594617/000119312514117433/d640354ds1a.htm
Make sure you specify an absolute path to the output file, or it doesn't seem to work, for whatever reason. The command returns immediately without any output or indication of success; give it a few seconds to retrieve, render, and write the file.
I tested with your document, and the links work perfectly.

Open URI downloading corrupt files

I am trying to download a .tar.gz file using Ruby. Upon download, the file is always corrupt in some way.
I am using this code to download the file:
require "open-uri"
File.open('img.tar.gz', 'wb') do |fo|
  fo.write open('https://github.com/Arafatk/language-basics/blob/master/img.tar.gz').read
end
Is there a way to fix this?
Change the file mode in the open call:
open('https://github.com/Arafatk/language-basics/blob/master/img.tar.gz', "rb").read
It was opening the remote file in text mode when you wanted binary mode.
You also need to use the proper URL for downloading a raw file from GitHub. The URL you were using points to the blob page, which is HTML. The raw URL, which you can find by right-clicking the Raw link on the file's page in the repo, is the one that serves the actual binary you're trying to download. Change the URL to https://github.com/Arafatk/language-basics/raw/master/img.tar.gz, and together with the mode change above, the download works fine.
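Putting both fixes together, a corrected version of the original script looks like this (on Ruby 3+, Kernel#open no longer accepts URLs, so use URI.open instead):
require 'open-uri'

# Read the remote file in binary mode ('rb') and use GitHub's raw URL,
# which serves the file contents rather than the HTML blob page.
File.open('img.tar.gz', 'wb') do |fo|
  fo.write open('https://github.com/Arafatk/language-basics/raw/master/img.tar.gz', 'rb').read
end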

How to download a PDF file in Ruby without .pdf in the link

I need to download a PDF from a website that does not provide a link ending in .pdf, using Ruby. Manually, when I click the link to download the PDF, it takes me to a new page, and the dialog box to save/open the file appears after some time.
Please help me download the file.
The link: http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
You can do this:
require 'open-uri'
File.open('my_file_name.pdf', "wb") do |file|
  file.write open('http://someurl.com/2013-1-2/somefile/download').read
end
I have been doing this for my projects and it works.
If you just need a simple Ruby script to do it, I'd just shell out to wget, like this: exec 'wget "http://path.to.the.file/and/some/params"'. At that point, though, you might as well run wget directly.
The other way is to just run a GET request against the page that you know the PDF is at:
require 'net/http'
source = Net::HTTP.get('the.website.com', '/and/some/params')
(Note that Net::HTTP.get takes a host and a path, not a full URL with the scheme.)
There are a number of other HTTP clients you could use, but as long as you make a GET request to the endpoint the PDF is at, it should give you the raw data. Then you can just write it out under a .pdf name, and you'll have the PDF.
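For completeness, a one-liner to write that response body to disk (the file name is arbitrary):
File.open('thefile.pdf', 'wb') { |f| f.write(source) }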
In your case, I ran the following commands to get the PDF:
wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf
Then open the PDF. Note that these are Linux commands. If you want to get the file with a Ruby script, you could use something like what I mentioned previously.
Update:
There is an added complication that was not initially stated: the URL to the PDF changes every time the PDF is updated. To make this work, you will probably want to do some web scraping; I suggest nokogiri. That way you can look at the page where the download link lives and then perform a GET request on the URL you find there. Furthermore, the server that hosts the PDF is misconfigured and breaks Chrome within a few seconds of opening the page.
How to solve this: I went to the site and refreshed it, then broke the connection to the server (press the X where there would otherwise be a refresh button). Then right-click next to the download link, select Inspect Element, and browse the DOM to find something definitively identifying (like an id). Thankfully, I found <strong id="telecharger"> Download</strong>. This means you can use something like page.css('strong#telecharger')[0].parent['href']. This should give you a URL; then you can perform a GET request as described above. I don't have time to write the script for you (too much work to do), but this should be enough to solve the problem, as sketched below.
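A minimal sketch of that scraping approach, assuming the strong#telecharger element described above. The listing-page URL here is a hypothetical placeholder for the page that actually contains the download link, and on Ruby 3+ you would use URI.open instead of open:
require 'open-uri'
require 'nokogiri'

# Fetch and parse the page that contains the download link.
# (Placeholder URL: substitute the real listing page.)
page = Nokogiri::HTML(open('http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/'))

# The download link wraps <strong id="telecharger">, so step up to the
# parent <a> element and read its href.
pdf_url = page.css('strong#telecharger')[0].parent['href']

# Fetch the PDF itself in binary mode and save it.
File.open('constitution.pdf', 'wb') do |file|
  file.write open(pdf_url, 'rb').read
end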

Opens archive instead of downloading it

I have a simple question, but I couldn't find an easy solution for it. I have a rented FTP server that I can't moderate, and I have a website with links to this FTP. There are a few archive files that I want to be downloaded rather than opened directly in the browser. My link looks like this:
IMS 200 Client V1.29 (06.02.13)
I solved this problem by using a PHP page that declares the file type, so that the browser understands it is an archive and downloads it rather than trying to open it directly. Is there an easier way to achieve this?
Thank you all for the help!
I hope that my case can help you in some way.
I have a website that allows users to view news and download files. One day, I discovered that if I exposed the download link to a .rar file directly, e.g. http://www.somenet.com/myfile.rar, then it was opened automatically in the browser instead of asking users whether they want to save/open it. If I wrote some code to read and stream the file to the browser, e.g. http://www.somenet.com/download?fileid=123, then the browser asked whether to save/open it.
After googling a while, I inserted a piece of configuration into my Apache Tomcat web.xml (often at CATALINA_HOME/conf/web.xml) as follows:
<mime-mapping>
  <extension>rar</extension>
  <mime-type>application/x-rar-compressed</mime-type>
</mime-mapping>
then restarted the Apache Tomcat server for the change to take effect.
Now I can click on the direct .rar link to download the file.
I also had to restart IE (Firefox picked up the change right away).
Good luck!
If you can use HTML5, you can try the download attribute on the link:
<a href="myfile.rar" download>Download this file</a>
Extracted from: HTML5 link download
