Open URI downloading corrupt files - ruby

I am trying to download a .tar.gz file using Ruby. Upon download, the file is always corrupt in some way.
I am using this code to download the file:
require "open-uri"
File.open('img.tar.gz', 'wb') do |fo|
  fo.write open('https://github.com/Arafatk/language-basics/blob/master/img.tar.gz').read
end
Is there a way to fix this?

Change the file mode in the open call:
open('https://github.com/Arafatk/language-basics/blob/master/img.tar.gz', "rb").read
It was opening the file in text mode, when you wanted binary mode.
You also need the proper URL to download a raw file from GitHub. The URL in the question points to the file's repo page, which is an HTML page, not the file itself. The correct URL can be found by right-clicking the Raw link on that page; that Raw URL is the one that contains the actual binary file you're trying to download. Change the URL to https://github.com/Arafatk/language-basics/raw/master/img.tar.gz, and the change I suggested at the top of the answer works just fine.
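A minimal sketch of both fixes together, assuming Ruby 3.0 or later, where open-uri no longer patches Kernel#open and URI.open is used instead. The helper names github_raw_url, gzip? and download are made up for this illustration and are not part of open-uri.

```ruby
require 'open-uri'

# Rewrite a GitHub "blob" page URL (an HTML page) into the matching "raw" URL.
# Hypothetical helper; it only handles https://github.com/<user>/<repo>/blob/... URLs.
def github_raw_url(blob_url)
  blob_url.sub(%r{\A(https://github\.com/[^/]+/[^/]+)/blob/}, '\1/raw/')
end

# True when the data starts with the gzip magic number (0x1f 0x8b).
def gzip?(data)
  data.byteslice(0, 2).to_s.b == "\x1F\x8B".b
end

# Download url to path in binary mode. Defined only; nothing is fetched on load.
def download(url, path)
  URI.open(url, 'rb') do |remote|
    File.binwrite(path, remote.read)
  end
end

# Example usage (performs a network request):
#   download(github_raw_url('https://github.com/Arafatk/language-basics/blob/master/img.tar.gz'), 'img.tar.gz')
#   puts gzip?(File.binread('img.tar.gz'))
```

Checking the first two bytes after the download is a quick way to tell a real gzip archive from an HTML page saved under a .tar.gz name.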

Related

Internal Links Not Working Convert .HTM to .pdf

I am trying to convert an .htm file from the SEC website to a .pdf and have the internal links work. I am successfully converting to .pdf using wkhtmltopdf, but all the internal links point me back to the first page.
wkhtmltopdf https://www.sec.gov/Archives/edgar/data/1594617/000119312514117433/d640354ds1a.htm test.pdf
It looks like there's an issue with wkhtmltopdf dealing with anchor tags that have no content. There's a PR that was opened in 2017 to resolve it, but it remains open.
As it turns out, your document does indeed have empty anchor tags, so that's probably the root cause:
<A NAME="toc640354_15"></A>
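To check whether a saved copy of a page contains such anchors, a rough Ruby sketch like the following could be used. The function name empty_anchor_names is hypothetical, and the regular expression is only a heuristic for the NAME="..." form shown above, not a real HTML parser.

```ruby
# Return the NAME values of anchors that have no content, e.g. <A NAME="x"></A>.
# Heuristic only: it assumes double-quoted NAME attributes and nothing else in the tag.
def empty_anchor_names(html)
  html.scan(%r{<a\s+name="([^"]*)"\s*>\s*</a>}i).flatten
end

# Example usage with a local copy of the page:
#   puts empty_anchor_names(File.read('d640354ds1a.htm'))
```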
I would suggest using Chrome to produce the PDF, with its --headless and --print-to-pdf flags. From your Chrome installation directory, run:
chrome.exe --headless --disable-gpu --print-to-pdf="C:\path\to\file.pdf" https://www.sec.gov/Archives/edgar/data/1594617/000119312514117433/d640354ds1a.htm
Make sure you specify an absolute path to the output file or it doesn't seem to work, for whatever reason. The command will immediately return without any output or indication of success. Give it a few seconds to retrieve, render and write the file.
I tested with your document, and the links work perfectly.
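If the conversion needs to be scripted, the command above could be assembled in Ruby along these lines. This is only a sketch: chrome_pdf_command and print_to_pdf are hypothetical helper names, and the default executable name 'chrome' is an assumption that depends on your installation. File.expand_path is used so the output path is always absolute, as the answer recommends.

```ruby
# Build the headless Chrome argument list, forcing an absolute output path.
def chrome_pdf_command(url, output, chrome: 'chrome')
  [
    chrome,
    '--headless',
    '--disable-gpu',
    "--print-to-pdf=#{File.expand_path(output)}",
    url
  ]
end

# Run the command; returns true, false or nil as Kernel#system does.
def print_to_pdf(url, output, chrome: 'chrome')
  system(*chrome_pdf_command(url, output, chrome: chrome))
end

# Example usage (needs Chrome installed):
#   print_to_pdf('https://www.sec.gov/Archives/edgar/data/1594617/000119312514117433/d640354ds1a.htm', 'test.pdf')
```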

Downloading a file using wget

I'm trying to download a program from a website using wget. However, whenever I try to download it I get an HTML file instead. I'm using the following syntax.
wget http://domain.com/downloads/name/
If you go to the link directly, the browser automatically tries to download the file. Why is it that I'm getting the HTML file instead of the actual file I want?
I don't think wget can get the file for you like this; it will just fetch the HTML of the first page. Your page uses some delaying code rather than a direct URL to the file, so you will need a real browser to download the file.
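To see what the server actually sends back for such a URL, the response headers can be inspected. The following is a rough Ruby sketch using Net::HTTP from the standard library; html_response? and describe_download are hypothetical names, and the HTML check is only a heuristic.

```ruby
require 'net/http'
require 'uri'

# Heuristic: does this response look like an HTML page rather than a file?
def html_response?(content_type, body)
  return true if content_type.to_s.downcase.start_with?('text/html')

  head = body.to_s.byteslice(0, 512).to_s.b.lstrip.downcase
  head.start_with?('<!doctype html', '<html')
end

# Fetch url and summarise what came back. Defined only; nothing is fetched on load.
def describe_download(url)
  response = Net::HTTP.get_response(URI(url))
  {
    status: response.code,
    content_type: response['content-type'],
    content_disposition: response['content-disposition'],
    html: html_response?(response['content-type'], response.body)
  }
end

# Example usage (performs a network request):
#   p describe_download('http://domain.com/downloads/name/')
```

If the summary reports an HTML body, the URL is a landing page and not the file itself, which matches the explanation above.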

file stored on s3 not rendering in browser

I am copying an image that I extract from an .ipa file to S3. The file is getting moved correctly, but whenever I try to view it in a browser it appears to be broken in Google Chrome. If I download the file directly to my machine and open it, it appears perfectly fine. It also renders OK in Safari.
Dir.mktmpdir do |dir|
  Zip::File.open(tmp_ipa) do |zip_file|
    # Find Icon File
    icon = zip_file.find do |entry|
      entry.name.include? 'AppIcon76x76#2'
    end
    icon.extract(File.join(dir, 'AppIcon.png'))
    s3_icon = bucket.objects[icon_dest]
    s3_icon.write(Pathname.new(File.join(dir, 'AppIcon.png')))
    icon_url = s3_icon.public_url.to_s
  end
end
The most likely problem is that you didn't set the Content-Type to image/png when you uploaded your image to S3. Try this on the command line:
curl -i http://your-bucket.s3.amazonaws.com/path/to/AppIcon.png
What's the Content-Type header? If it isn't image/png, set that at the time you upload.
This is almost certainly because Apple uses a non-standard, proprietary extension of the PNG file format for the PNG files in an iPhone app, such as the .ipa you say it was extracted from.
It works in Safari because Safari uses the OS's image decoding libraries, which do support this non-standard format.
There are some conversion scripts out there that work with varying success.
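Before uploading, it is possible to detect whether an extracted icon is in Apple's variant. A minimal sketch, assuming the variant is recognised by a CgBI chunk appearing where a standard PNG has its IHDR chunk (immediately after the 8-byte signature). The name apple_cgbi_png? is hypothetical, and this only detects the format; it does not convert it.

```ruby
# The 8-byte signature that starts every PNG file.
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# True when the data is a PNG whose first chunk type is CgBI instead of IHDR.
# Layout: 8 signature bytes, 4 length bytes, then the 4-byte chunk type at offset 12.
def apple_cgbi_png?(data)
  bytes = data.b
  bytes.byteslice(0, 8) == PNG_SIGNATURE && bytes.byteslice(12, 4) == 'CgBI'.b
end

# Example usage:
#   puts apple_cgbi_png?(File.binread('AppIcon.png'))
```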

How to download pdf file in ruby without .pdf in the link

I need to download a PDF from a website which does not provide a link ending in .pdf, using Ruby. Manually, when I click on the link to download the PDF, it takes me to a new page, and the dialog box to save/open the file appears after some time.
Please help me in downloading the file.
The link
You can do this:
require 'open-uri'
File.open('my_file_name.pdf', "wb") do |file|
  file.write open('http://someurl.com/2013-1-2/somefile/download').read
end
I have been doing this for my projects and it works.
If you just need a simple Ruby script to do it, I'd just run wget from the script, like this: exec 'wget "http://path.to.the.file/and/some/params"'
At that point though, you might as well run wget directly.
The other way is to just run a GET on the page that you know the PDF is at. Note that Net::HTTP.get takes a host name without the scheme, followed by the path:
require 'net/http'
source = Net::HTTP.get("the.website.com", "/and/some/params")
There are a number of other HTTP clients that you could use, but as long as you make a GET request to the endpoint that the PDF is at, it should give you the raw data. Then you can just rename the file, and you'll have the PDF.
In your case, I ran the following commands to get the PDF:
wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf
Then open the pdf. Note that these are linux commands. If you want to get the file with a ruby script, you could use something like what I previously mentioned.
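The same steps can be sketched in Ruby with Net::HTTP from the standard library. The helper names pdf?, fetch and save_pdf are hypothetical, the redirect limit of 5 is an arbitrary choice, and checking for the %PDF- header is just a sanity check that the endpoint returned a PDF and not an HTML page.

```ruby
require 'net/http'
require 'uri'

# True when the data starts with the PDF header "%PDF-".
def pdf?(data)
  data.to_s.b.start_with?('%PDF-')
end

# GET url, following up to redirects_left redirects, and return the body.
def fetch(url, redirects_left = 5)
  raise ArgumentError, 'too many redirects' if redirects_left.zero?

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    fetch(URI.join(url, response['location']).to_s, redirects_left - 1)
  else
    raise "HTTP #{response.code} for #{url}"
  end
end

# Download url and write it to path in binary mode, refusing non-PDF data.
def save_pdf(url, path)
  body = fetch(url)
  raise "#{url} did not return a PDF" unless pdf?(body)

  File.binwrite(path, body)
end

# Example usage (performs a network request):
#   save_pdf('http://someurl.com/2013-1-2/somefile/download', 'thefile.pdf')
```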
Update:
There is an added complication that was not initially stated, which is that the URL to the PDF changes every time there is an update to the PDF. In order to make this work, you probably want to do something involving web scraping. I suggest Nokogiri. This way you can look at the page where the download is and then perform a GET request on the desired URL. Furthermore, the server that hosts the PDF is misconfigured, and breaks Chrome within a few seconds of opening the page.
How to solve this problem: I went to the site and refreshed it, then broke the connection to the server (press the X where there would otherwise be a refresh button). Then I right-clicked next to the download link and selected Inspect Element, and browsed the DOM to find something definitively identifying (like an id). Thankfully, I found one: <strong id="telecharger"> Download</strong>. This means that you can use something like page.css('strong#telecharger')[0].parent['href'], which should give you a URL. Then you can perform a GET request as described above. I don't have time to make the script for you (too much work to do), but this should be enough to solve the problem.

How to save base64 encoded file in Ruby?

I am trying to download an image from the internet using open-uri. Here is the code:
require 'open-uri'
open('0RB2132__601_K3.jpg', 'wb') do |file|
  file << open('http://luxonline.luxottica.com/luxpics/watermarkedextranet/med?style=0RB2132__601_K3').read
end
But it doesn't save the image correctly. When I try to open it, the program reports:
Error interpreting JPEG image file (Improper call to JPEG library in state 200)
I opened the original image in Firefox and, after examining it, found that it is a base64-encoded image.
How can I download the image from this address: http://luxonline.luxottica.com/luxpics/watermarkedextranet/med?style=0RB2132__601_K3?
Using your script on OS X, it works like a charm. So your mistake is probably somewhere else.
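If the server really did return base64 text instead of raw JPEG bytes (which the answer above could not reproduce), one way to handle both cases is to inspect the bytes before writing them. This is only a sketch under that assumption; jpeg? and image_bytes are hypothetical helper names, and String#unpack1('m') is the core method for decoding base64.

```ruby
# The first three bytes of a JPEG file.
JPEG_MAGIC = "\xFF\xD8\xFF".b

# True when the data starts with the JPEG magic bytes.
def jpeg?(data)
  data.b.start_with?(JPEG_MAGIC)
end

# Return raw image bytes: the body itself if it is already a JPEG,
# the decoded body if it is base64 (optionally with a data: URI prefix),
# and the untouched body otherwise.
def image_bytes(body)
  return body if jpeg?(body)

  text = body.b.sub(/\Adata:[^,]*;base64,/, '')
  decoded = text.unpack1('m')
  jpeg?(decoded) ? decoded : body
end

# Example usage with data fetched elsewhere (for instance by open-uri):
#   File.binwrite('0RB2132__601_K3.jpg', image_bytes(downloaded_body))
```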
