Environment:
Windows 10 x64
Ruby 2.1.0 32 bit
Chef 12.12.15
Azure Gem 0.7.9
Azure-Storage Gem 0.12.1.preview
I am trying to download a ~880MB blob from a container. When I do, it throws the following error after the Ruby process hits ~500MB in size:
C:/opscode/chefdk/embedded/lib/ruby/2.1.0/net/protocol.rb:102:in `read': failed to allocate memory (NoMemoryError)
I have tried this both inside and outside of Chef, and with both the Azure gem and the Azure-Storage gem. The result is the same with all four combinations (Azure in Chef, Azure in Ruby, Azure-Storage in Chef, Azure-Storage in Ruby).
Most of the troubleshooting I have found for these kinds of problems suggests streaming or chunking the download, but there does not appear to be a corresponding method or get_blob option to do so.
Code:
require 'azure/storage'
# vars
account_name = "myacct"
container_name = "myfiles"
access_key = "mykey"
installs_dir = "myinstalls"
# directory for files
create_dir = 'c:/' + installs_dir
Dir.mkdir(create_dir) unless File.exists?(create_dir)
# create azure client
Azure::Storage.setup(:storage_account_name => account_name, :storage_access_key => access_key)
azBlobs = Azure::Storage::Blob::BlobService.new
# get list of blobs in container
dlBlobs = azBlobs.list_blobs(container_name)
# download each blob to directory
dlBlobs.each do |dlBlob|
  puts "Downloading " + container_name + "/" + dlBlob.name
  portalBlob, blobContent = azBlobs.get_blob(container_name, dlBlob.name)
  File.open("c:/" + installs_dir + "/" + portalBlob.name, "wb") { |f|
    f.write(blobContent)
  }
end
I also tried using IO.binwrite() instead of File.open() and got the same result.
Suggestions?
As #coderanger said, your issue is caused by get_blob loading the data into memory all at once. There are two ways to resolve it.
According to the official REST reference:
The maximum size for a block blob created via Put Blob is 256 MB for version 2016-05-31 and later, and 64 MB for older versions. If your blob is larger than 256 MB for version 2016-05-31 and later, or 64 MB for older versions, you must upload it as a set of blocks. For more information, see the Put Block and Put Block List operations. It's not necessary to also call Put Blob if you upload the blob as a set of blocks.
So for a blob that consists of blocks, you can get the block list via list_blob_blocks and write those blocks to a local file one by one.
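For illustration, here is a hedged sketch of that piece-by-piece idea using ranged reads instead of a whole-blob read; the 4 MB chunk size is arbitrary, the :start_range/:end_range options of get_blob are the ones shown in the gem's samples, and azBlobs, container_name, installs_dir and dlBlob are the names from the question:
chunk_size = 4 * 1024 * 1024 # bytes per request; tune as needed

blob_props = azBlobs.get_blob_properties(container_name, dlBlob.name)
total_size = blob_props.properties[:content_length].to_i

File.open("c:/" + installs_dir + "/" + dlBlob.name, "wb") do |file|
  offset = 0
  while offset < total_size
    last_byte = [offset + chunk_size, total_size].min - 1
    # Each call returns only the requested byte range, so memory use stays bounded.
    _blob, chunk = azBlobs.get_blob(container_name, dlBlob.name,
                                    :start_range => offset, :end_range => last_byte)
    file.write(chunk)
    offset += chunk.bytesize
  end
end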
Or generate a blob URL with a SAS token via signed_uri, as in this test code, and then download the blob via streaming and write it to a local file.
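A minimal sketch of that second approach, assuming sas_url is a blob URL that already carries a SAS token (for example one produced by signed_uri as in the linked test code); the URL and local path below are placeholders:
require 'net/http'
require 'uri'

sas_url = URI("https://myacct.blob.core.windows.net/myfiles/bigfile.zip?sv=...&sig=...")

Net::HTTP.start(sas_url.host, sas_url.port, :use_ssl => true) do |http|
  http.request(Net::HTTP::Get.new(sas_url)) do |response|
    File.open("c:/myinstalls/bigfile.zip", "wb") do |file|
      # read_body yields the response in chunks, so the whole blob never sits in memory
      response.read_body { |chunk| file.write(chunk) }
    end
  end
end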
The problem is that get_blob has to load the data into memory all at once rather than streaming it to disk. In Chef we have the remote_file resource to help with this kind of streaming download, but you would need to get the plain URL for the blob rather than downloading it using their gem.
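For the Chef route, a sketch might look like the following, assuming blob_sas_url holds a plain (SAS-signed) URL for the blob that was built elsewhere in the recipe; remote_file streams the download to disk instead of buffering it in the Ruby process:
remote_file 'c:/myinstalls/bigfile.zip' do
  source blob_sas_url
  action :create
end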
I was just looking into using the azure/storage/blob library for a DevOps project I was working on, and it seems to me that the implementation is quite basic and does not use the full underlying API. For example, uploads are slow when streamed from a file, most likely because chunks are not uploaded in parallel. I don't think this library is production ready, and the exposed Ruby API is lacking. It's open source, so if anybody has some time, they can help contribute.
Related
We wanted to download files from a remote URL into memory and then upload them to a public cloud. I am planning to use IO.copy_stream in Ruby. However, I am not sure whether this will work, because I also need to keep memory and CPU usage under control so the transfer does not hamper performance.
Any suggestions or examples of how to achieve this with IO.copy_stream in Ruby, or another library that can do it with good performance?
https://ruby-doc.org/core-2.5.5/IO.html
You can set up src/dst to be simple IO abstractions that respond to read/write:
src = IO.popen(["ssh", srchost, "cat /path/to/source_file | gzip"], "r")
dst = IO.popen(["ssh", dsthost, "gunzip > /path/to/dest_file"], "w")
IO.copy_stream(src, dst)
src.close
dst.close
Set up src to be the downloadable file.
Set up dst to be the cloud resource, with write permission.
Make sure the two are compliant with sendfile().
sendfile() is a kernel-based copy-stream procedure. In terms of RAM use and performance, there is nothing faster; your application will not be involved in the transfer.
For sendfile(), the output socket must have zero-copy support and the input file must have mmap() support. In general, this means you have already downloaded the file to a local file, you do not change the downloaded file during the copy, and you have an open socket to the output.
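A minimal sketch for the original use case, assuming curl is available and that the destination cloud can accept data on stdin (here the AWS CLI's "aws s3 cp -" is used purely as a stand-in for "some public cloud"); the URL and bucket are placeholders. With two pipes you won't get the sendfile() fast path described above, but memory use stays flat because copy_stream moves fixed-size chunks:
# Both ends are plain IO objects, so IO.copy_stream moves the data in
# small chunks without loading the whole file into Ruby's memory.
src = IO.popen(["curl", "-sSL", "https://example.com/big-file.zip"], "r")
dst = IO.popen(["aws", "s3", "cp", "-", "s3://my-bucket/big-file.zip"], "w")

IO.copy_stream(src, dst)

src.close
dst.close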
My goal is to download a large zip file (15 GB) and extract it to Google Cloud using Laravel Storage (https://laravel.com/docs/8.x/filesystem) and https://github.com/spatie/laravel-google-cloud-storage.
My "wish" is to sort of stream the file to Cloud Storage, so I do not need to store the file locally on my server (because it is running in multiple instances, and I want to have the disk size as small as possible).
Currently, there does not seem to be a way to do this without having to save the zip file on the server. Which is not ideal in my situation.
Another idea is to use a Google Cloud Function (e.g. with Python) to download, extract and store the file. However, Google Cloud Functions seem to be limited to a maximum timeout of 9 minutes (540 seconds). I don't think that will be enough time to download and extract 15GB...
Any ideas on how to approach this?
You should be able to use streams for uploading big files. Here’s the example code to achieve it:
$disk = Storage::disk('gcs');
$disk->put($destFile, fopen($sourceZipFile, 'r+'));
I am trying to read a .csv file from windows C drive to databricks. I tried the following code after going through some of the answers.
# remove the 'file' string and use the 'r' or 'u' prefix to indicate a raw/unicode string format
# Option 1
#PATH = r'C:\customers_marketing.csv' # raw string
# Option 2
PATH = u'C:\\customers_marketing.csv' # unicode string
customers_marketing = spark.read.csv(PATH, header="true", inferSchema="true")
However, I was not able to read it into Databricks. I get the following error.
IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: C:%5Ccustomers_marketing.csv
Could anyone please advise how I can read data from the Windows C drive into Databricks?
Thanks in advance
It's not possible, because your file is on your local machine and Databricks runs in the cloud, with no knowledge of your machine.
You need to upload the file to DBFS and then read from it. You can do that, for example, via the UI - the DBFS file browser (docs) or the Upload Data UI (docs).
If the file is huge, then you need to use something like az-copy to upload the file(s) to Azure Storage.
So I am trying to upload a file with Celery, which uses Redis, on my Heroku website. I am trying to upload a .exe file about 20MB in size. Heroku says that the hobby-dev tier has a maximum memory of 25MB. But when I upload the file through Celery (turning it from bytes to base64, decoding it and sending it to the task), I get the error kombu.exceptions.OperationalError: OOM command not allowed when used memory > 'maxmemory'. Keep in mind that when I try to upload e.g. a 5MB file it works fine, but 20MB doesn't. I am using Python with the Flask framework.
There are two ways to store files when using a DB (Redis is just an in-memory DB). You can either store the blob itself in the DB (for small files, say a few KB), or store the file elsewhere and keep only a pointer to it in the DB.
So for your case, store the file on disk and place only the file path in the DB.
The catch here is that Heroku has an ephemeral file system that gets erased every 24 hours, or whenever you deploy a new version of the app.
So you'll have to do something like this:
1. Write a small function to store the file on the local disk (this is temporary storage) and return the path to the file.
2. Add a task to Celery with the file path, i.e. the parameter to the Celery task will be the "file-path", not a serialized blob of 20MB of data.
3. The Celery worker process picks up the task you just enqueued when it gets free and executes it.
If you need to access the file later, and since the local Heroku disk is only temporary, you'll have to place the file in some permanent storage like AWS S3.
(The reason we go through all these hoops and not place the file directly in S3 is because access to local disk is fast while S3 disks might be in some other server farm at some other location and it takes time to save the file there. And your web process might appear slow/stuck if you try to write the file to S3 in your main process.)
I am using the Ruby net-sftp gem. I need to download a large number of small files, and before I download them I need to get a list of the files in the given directory.
To do that I am using sftp.dir.entries('folder path').size to get the file count, but running this operation on a directory with more than 10,000 files takes far too long (even hours). Is there a better way to do this?
I even tried ssh.exec!("ls -l"), but that is also slow.
I am connecting to a Windows box running Windows Server 2008 R2.
To download a series of files with validations, I would do something like the following:
Net::SFTP.start(ftp_host, user, :password => password) do |sftp|
  sftp.dir.entries('/path/to/folder').each do |remote_file|
    if passes_validation?(remote_file)
      file_data = sftp.download!('/path/to/folder' + '/' + remote_file.name)
      # write each download under its own name so files are not overwritten
      local_file = File.open('/path/to/local/' + remote_file.name, 'wb')
      local_file.print file_data
      local_file.close
    end
  end
end
One thing to remember when using this approach is that there are differences between SFTP server protocol versions which affect how many attributes will be accessible for remote_file; you can check which protocol you are working with by calling sftp.protocol after opening the connection.
Alternatively, if you want to apply the validation as part of your query to the SFTP server, you could try .glob("/path/to/folder", "*.ext") instead of .entries if your validation is based on the file extension, though I can't speak for how it will perform speed-wise (documentation here); a short sketch follows below. In theory it could speed up the query (less data to return), but as it involves more work up front, I'm not certain it will help.
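A hedged sketch of the glob variant, assuming the validation is simply "only .zip files"; the path and extension are placeholders:
Net::SFTP.start(ftp_host, user, :password => password) do |sftp|
  # glob yields only the entries whose names match the pattern
  sftp.dir.glob('/path/to/folder', '*.zip') do |remote_file|
    puts remote_file.name
  end
end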
I'm running my script from a VirtualBox VM running Ubuntu 12 with 2 GB of RAM dedicated (the host is Windows 7), connecting to a server running Windows Server 2008 R2 SP1 with SolarWinds for the SFTP portion; Ruby 1.9.3p392, Net-SFTP 2.1.2 and Net-SSH 2.6.8. With those specs, I average roughly 78 files a minute (though that is without validations).