How to download a file in parts

I'm writing a program that downloads files anywhere up to 1 GB in size. Right now I'm using the requests package to download files, and although it works (I think it times out sometimes), it is very slow. I've seen some multi-part download examples using urllib2, but I'm looking for a way to do this with urllib3 or requests, if that package has the ability.

How closely have you looked at requests' documentation?
In the Quickstart documentation, the following approach is described:
r = requests.get(url, stream=True)
r.raw.read(amount)
The better way to do this, however, is:
fd = open(filename, 'wb')
r = requests.get(url, stream=True)
for chunk in r.iter_content(amount):
    fd.write(chunk)
fd.close()
(Assuming you are saving the downloaded content to a file.)
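If you specifically need to download the file in separate parts (for example, to resume or parallelize the transfer), HTTP Range requests are one way to do it, provided the server supports them. Here is a minimal sketch using requests; the URL, part size, and output filename are placeholders:

import requests

url = 'http://example.com/bigfile.bin'   # placeholder URL
out_path = 'bigfile.bin'
part_size = 10 * 1024 * 1024             # 10 MB per part

# Ask the server how big the file is and whether it accepts byte ranges.
head = requests.head(url, timeout=30)
total = int(head.headers['Content-Length'])
assert head.headers.get('Accept-Ranges') == 'bytes', 'server does not support ranges'

with open(out_path, 'wb') as fd:
    for start in range(0, total, part_size):
        end = min(start + part_size - 1, total - 1)
        # Request just this byte range; a compliant server answers with 206 Partial Content.
        part = requests.get(url, headers={'Range': 'bytes=%d-%d' % (start, end)}, timeout=30)
        part.raise_for_status()
        fd.write(part.content)

Each part could also be fetched in its own thread or process and written at the right offset, which is where the real speedup over a single streamed request comes from.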

Related

How to download file using ProtoBuf

I'm trying to implement file download directly in the browser. Our company uses Protocol Buffers as its data communication format. So how can I download the file once I open the web page?
I tried to use the bytes and stream features of Protocol Buffers, but the result is:
{"result":{"data":"Cw4ODg4ODgsMCw4ODg4ODgsMTUwMCwwLDE1MDAsNDAwMDAsMTAwMDAsMzAwMDAKMDMvMTEvMjAxNSxVbmtub3duIEl0ZW0sUHJlIFJvbGwgVmlkZW8gKG1heCAwOjMwKSw2MDAwMCwzMTAwMCwyOTAwMCw1MDAwMCwyNDAwMCwyNjAwMCwyMC4wMCUsODQ0NCwwLDQwMDAsNDQ0NCw4OTAzODgsMCwwLDAsODg4ODg4LDAsODg4ODg4LDE1MDAsMCwxNTAwLDQwMDAwLDIxMDAwLDE5MDAwCg=="}}
Protobuf is good for structured communication, but HTTP already provides the perfect protocol for downloading files. Set the right headers and the browser will download the file.
If you really have to use protobuf to transfer files, then you need to add some JavaScript that parses the protobuf first and then turns it into a file that can be downloaded. See How to create a dynamic file + link for download in Javascript? for reference.
So, send the message as bytes, add the JavaScript that parses the protobuf message to extract the bytes, and then create the file download as in the linked answer.
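For the plain-HTTP route, the download behaviour comes entirely from the response headers, not from protobuf. A minimal sketch, assuming a Flask (2.x) server and a local report.csv, both of which are placeholders, that sets Content-Disposition so the browser saves the response as a file:

from flask import Flask, send_file

app = Flask(__name__)

@app.route('/download/report')
def download_report():
    # as_attachment=True adds a Content-Disposition: attachment header,
    # which tells the browser to save the response instead of rendering it.
    return send_file('report.csv', as_attachment=True,
                     download_name='report.csv', mimetype='text/csv')

if __name__ == '__main__':
    app.run()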

how to copy files from a web site to azure blob storage

I am trying to copy files from this site http://nemweb.com.au/Reports/Current/Daily_Reports/ to my Azure blob storage account.
My first option was to try Azure Data Factory, but it ended up copying the HTML, which is obviously not what I am looking for; I want the zip files linked inside it.
My question is whether ADF is the right tool for that, or should I look at something else? Any direction will be much appreciated.
Currently I am using Power Query to read the data, and it is great; unfortunately, the Power BI service requires a gateway to refresh, which is not very practical in my case. Hence, I am looking for other options in the Microsoft data stack.
Edit: I am going with the Python route, but happy to hear about any alternatives.
I think I found the solution: Python. It has excellent integration with Azure Blob Storage, and the code to download the files is very easy. Now I need to figure out which is the best service to run a Python script in the cloud.
import re
import urllib.request
from urllib.request import urlopen

url = "http://nemweb.com.au/Reports/Current/Daily_Reports/"

# Read the directory listing and pull out every .zip filename.
result = urlopen(url).read().decode('utf-8')
pattern = re.compile(r'[\w.]*\.zip')
filelist = pattern.findall(result)

# Download each zip file into the current directory.
for x in filelist:
    urllib.request.urlretrieve(url + x, x)
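To push the downloaded files on to Blob Storage, the azure-storage-blob package (v12) can handle the upload. A minimal sketch, assuming a storage connection string and an existing container named 'reports'; both are placeholders:

from azure.storage.blob import BlobServiceClient

# Placeholder: use your own storage account connection string.
conn_str = 'DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...'
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client('reports')

for x in filelist:  # filelist comes from the scraping snippet above
    with open(x, 'rb') as data:
        # overwrite=True replaces a blob of the same name if it already exists.
        container.upload_blob(name=x, data=data, overwrite=True)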

Receive file via websocket and save/write to local folder

Our application is entirely built on websockets; we don't do any HTTP request-reply. However, we are stuck on file download. If I receive file content via websockets, can I write it to a local folder on the user's computer?
If it makes a difference, we only support Chrome, so it's not an issue if this doesn't work in other browsers.
Also, I know I can do this via HTTP. I'm trying to avoid that and stick to websockets, since that's how the entire app works.
Thanks a lot in advance!
The solution depends on the size of your file.
If the size is less than about 50 MB, I would encode the file's content to a base64 string on the server and send this string to the client. The client should receive the parts of the string, concatenate them into a single result, and store it. After receiving the whole string, add a link (an <a> tag) to your page with its href attribute set to "data:<data_type>;base64,<base64_encoded_file_content>". <data_type> is the MIME type of your file, for example "text/html" or "image/png". Suggest a file name by setting the download attribute to the name of the file (this doesn't work for Chrome on OS X).
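A minimal Python sketch of the server-side half of that approach, assuming an already-open websocket object with an async send() method (as most Python websocket libraries provide) and a placeholder file name:

import base64

CHUNK = 64 * 1024  # send the base64 string in 64 KB slices

async def send_file(websocket, path='report.pdf'):
    # Encode the whole file at once; fine for files up to ~50 MB.
    with open(path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('ascii')
    # Stream the string in parts; the client concatenates them and
    # builds the data: URI described above.
    for i in range(0, len(encoded), CHUNK):
        await websocket.send(encoded[i:i + CHUNK])
    await websocket.send('EOF')  # arbitrary end-of-transfer marker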
Unfortunately I have no solution for large files. Currently there is only the FileEntry API that allows writing files from JavaScript, but according to the documentation it is supported only by Chrome v13+; learn more here: https://developer.mozilla.org/en-US/docs/Web/API/FileEntry.

Python: the requests module does not cache, so why this error?

I have a link to a raw txt file on GitHub of the form https://raw.githubusercontent.com/XXX/YYY/master/txtfile where I periodically put a new version so that a Python script will know it must update. The Python script (Python 3.5) uses an infinite while loop and the requests module:
import requests
from time import sleep

while True:
    try:
        r = requests.get('https://raw.githubusercontent.com/XXX/YYY/master/txtfile', timeout=10)
        required_version = r.text
    except:
        required_version = 0
    log_in_txt_file(required_version)
    sleep(10)
This script runs under Windows. However, I notice that even though the version is updated on the server, the log still shows the request getting the previous version! If I fetch the version from a browser (Chrome), the same thing happens, but after a few F5 presses the new version appears (in the browser and in the log); the script, though, still sometimes logs the old version and sometimes the new one. I tried to make the URL variable with:
https://raw.githubusercontent.com/XXX/YYY/master/txtfile?_=time.time
But the problem remains. I'm using an Amazon WorkSpace and I'm pretty sure it's an OS issue. My question: how can I work around this using Python? Any ideas?
This is not a client-side caching issue. In effect, GitHub's servers are caching the content, serving you stale data until they are all updated in time.
GitHub serves your data from a series of webservers, distributed geographically to ease loading times. These servers don't all update at the same time; until a change has propagated to all of them, you'll see old and new content returned from that URL, depending on which machine served the content for a particular request.
You can't really use GitHub to reliably detect when a new version has been released. Instead, generate a unique filename (a GUID, perhaps) that at a future time will contain the new version information. Give that filename out with the current version, and poll that. Releasing a new version then consists of generating the filename for the version after it and putting the information at the current 'new version' URL. Each version links to the next file, and when it appears you only need to load it once.
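A sketch of that polling scheme in Python; the base URL and the 'next version' file name are hypothetical, and the file's format (the new version string plus the name of the file after it) is whatever you decide on:

import requests
from time import sleep

BASE = 'https://raw.githubusercontent.com/XXX/YYY/master/'

def poll_for_next_version(next_file, interval=10):
    # next_file is the unique (e.g. GUID-based) name that shipped with
    # the currently installed version.
    while True:
        r = requests.get(BASE + next_file, timeout=10)
        if r.status_code == 200:
            # The release has propagated to at least one server; from here
            # on the file is read exactly once, so stale copies don't matter.
            return r.text
        sleep(interval)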

Download and write .tar.gz files without corruption

How do you download files, specifically .zip and .tar.gz, with Ruby and write them to the disk?
(This question was originally specific to a bug in MacRuby, but the answers are relevant to the general question above.)
Using MacRuby, I've found that the downloaded file appears to be the same size as the reference, but the archives refuse to extract. What I'm attempting now is at https://gist.github.com/arbales/8203385. Thanks!
I've successfully downloaded and extracted GZip files with this code:
require 'open-uri'
require 'zlib'
open('tarball.tar', 'w') do |local_file|
  open('http://github.com/jashkenas/coffee-script/tarball/master/tarball.tar.gz') do |remote_file|
    local_file.write(Zlib::GzipReader.new(remote_file).read)
  end
end
I'd recommend using open-uri in ruby's stdlib.
require 'open-uri'
open(out_file, 'w') do |out|
  out.write(open(url).read)
end
http://ruby-doc.org/stdlib/libdoc/open-uri/rdoc/classes/OpenURI/OpenRead.html#M000832
Make sure you look at the :progress_proc option to open, as it looks like you want a progress hook.
The last time I got corrupted files with Ruby was when I forgot to call file.binmode right after File.open. It took me hours to find out what was wrong. Does that help with your issue?
When downloading a .tar.gz with open-uri via a simple open() call, I was also getting errors uncompressing the file on disk. I eventually noticed that the file size was much larger than expected.
Inspecting the file download.tar.gz on disk, what it actually contained was download.tar, uncompressed, and that could be untarred. This seems to be due to an implicit Accept-encoding: gzip header on the open() call, which makes sense for web content but is not what I wanted when retrieving a gzipped tarball. I was able to work around it and defeat that behavior by sending a blank Accept-encoding header in the optional hash argument to the remote open():
open('/local/path/to/download.tar.gz', 'wb') do |file|
  # Send a blank Accept-encoding header
  file.write open('https://example.com/remote.tar.gz', {'Accept-encoding' => ''}).read
end
