I am using the Ruby net-sftp gem. I need to download a large number of small files, and before downloading I need to get a list of the files in the given directory.
To do that I am using sftp.dir.entries('folder path').size to get the file count, but running this on a directory with more than 10,000 files takes far too long (even hours). Is there a better way to do this?
I also tried ssh.exec!("ls -l"), but that is slow as well.
I am trying to connect to a Windows box running Windows Server 2008 R2.
To download a series of files with validations, I would do something like the following:
Net::SFTP.start(ftp_host, user, :password => password) do |sftp|
  sftp.dir.entries('/path/to/folder').each do |remote_file|
    next unless passes_validation?(remote_file)

    file_data = sftp.download!(File.join('/path/to/folder', remote_file.name))

    # save each file under its own name locally rather than reusing one path
    File.open(File.join('/path/to/local', remote_file.name), 'wb') do |local_file|
      local_file.print file_data
    end
  end
end
One thing to remember with this approach: there are differences between SFTP server protocol versions which affect how many attributes are accessible on remote_file; you can check which protocol you are working with by calling sftp.protocol after opening the connection.
Alternatively, if you want to push the validation into the query itself, you could try .glob("/path/to/folder", "*.ext") instead of .entries, assuming your validation is based on the file extension, though I can't speak for how it performs speed-wise (documentation here). In theory it could speed up the query (less data to return), but since it involves more work up-front I'm not certain it will help.
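A rough sketch of that variant (untested; the paths and the "*.ext" pattern are placeholders, and the protocol check is just the point above made concrete):
Net::SFTP.start(ftp_host, user, :password => password) do |sftp|
  # see which SFTP protocol version the server negotiated (affects available attributes)
  puts "SFTP protocol version: #{sftp.protocol}"

  # let the server do the extension filtering instead of fetching every entry
  sftp.dir.glob('/path/to/folder', '*.ext').each do |remote_file|
    file_data = sftp.download!(File.join('/path/to/folder', remote_file.name))
    File.open(File.join('/path/to/local', remote_file.name), 'wb') do |local_file|
      local_file.print file_data
    end
  end
end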
I'm running my script from a VirtualBox VM running Ubuntu 12 with 2 GB of RAM dedicated (the host is Windows 7), connecting to a server running Windows Server 2008 R2 SP1 with SolarWinds handling the SFTP portion; Ruby 1.9.3p392, net-sftp 2.1.2 and net-ssh 2.6.8. With those specs I average roughly 78 files a minute (though that is without validations).
Environment:
Windows 10 x64
Ruby 2.1.0 32 bit
Chef 12.12.15
Azure Gem 0.7.9
Azure-Storage Gem 0.12.1.preview
I am trying to download a ~880MB blob from a container. When I do, it throws the following error after the Ruby process hits ~500MB in size:
C:/opscode/chefdk/embedded/lib/ruby/2.1.0/net/protocol.rb:102:in `read': failed to allocate memory (NoMemoryError)
I have tried this both inside and outside of Ruby, and with both the Azure gem and the Azure-Storage gem. The result is the same with all four combinations (Azure in Chef, Azure in Ruby, Azure-Storage in Chef, Azure-Storage in Ruby).
Most of the troubleshooting I have found for these kinds of problems suggests streaming or chunking the download, but there does not appear to be a corresponding method or get_blob option to do so.
Code:
require 'azure/storage'
# vars
account_name = "myacct"
container_name = "myfiles"
access_key = "mykey"
installs_dir = "myinstalls"
# directory for files
create_dir = 'c:/' + installs_dir
Dir.mkdir(create_dir) unless File.exists?(create_dir)
# create azure client
Azure::Storage.setup(:storage_account_name => account_name, :storage_access_key => access_key)
azBlobs = Azure::Storage::Blob::BlobService.new
# get list of blobs in container
dlBlobs = azBlobs.list_blobs(container_name)
# download each blob to directory
dlBlobs.each do |dlBlob|
  puts "Downloading " + container_name + "/" + dlBlob.name
  portalBlob, blobContent = azBlobs.get_blob(container_name, dlBlob.name)
  File.open("c:/" + installs_dir + "/" + portalBlob.name, "wb") {|f|
    f.write(blobContent)
  }
end
I also tried using IO.binwrite() instead of File.open() and got the same result.
Suggestions?
As #coderanger said, your issue is caused by get_blob loading the whole blob into memory at once. There are two ways to resolve it.
According to the official REST reference:
The maximum size for a block blob created via Put Blob is 256 MB for version 2016-05-31 and later, and 64 MB for older versions. If your blob is larger than 256 MB for version 2016-05-31 and later, or 64 MB for older versions, you must upload it as a set of blocks. For more information, see the Put Block and Put Block List operations. It's not necessary to also call Put Blob if you upload the blob as a set of blocks.
So, first: for a blob which consists of block blobs, you can get the committed block list via list_blob_blocks and write those blocks one by one to a local file.
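A minimal sketch of that first approach (untested; it reuses the account setup and variables from your question, and the :start_range / :end_range option names for ranged downloads are my assumption based on the gem's range support, so check them against your gem version):
require 'azure/storage'

# Sketch only: walk the committed block list and pull the blob down in byte
# ranges so only one block is held in memory at a time. container_name,
# blob_name, local_path and the credentials are assumed to be set as in the
# question; :start_range/:end_range are assumed option names - verify them.
Azure::Storage.setup(:storage_account_name => account_name, :storage_access_key => access_key)
az_blobs = Azure::Storage::Blob::BlobService.new

blocks = az_blobs.list_blob_blocks(container_name, blob_name)[:committed]

File.open(local_path, 'wb') do |file|
  offset = 0
  blocks.each do |block|
    _props, chunk = az_blobs.get_blob(container_name, blob_name,
                                      :start_range => offset,
                                      :end_range   => offset + block.size - 1)
    file.write(chunk)
    offset += block.size
  end
end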
Second: generate a blob URL with a SAS token via signed_uri (as in this test code), then download the blob via streaming and write it to a local file.
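A sketch of the streaming half of that second approach (it assumes blob_sas_url already holds the SAS URL you generated with signed_uri, since that call's signature differs between gem versions; Net::HTTP hands the body over in chunks, so the whole blob never sits in memory):
require 'net/http'

# Sketch only: blob_sas_url and local_path are placeholders.
uri = URI(blob_sas_url)
Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    File.open(local_path, 'wb') do |file|
      # read_body with a block streams the response in chunks instead of buffering it all
      response.read_body { |chunk| file.write(chunk) }
    end
  end
end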
The problem is that get_blob has to load the data into memory at once rather than streaming it to disk. In Chef we have the remote_file resource to help with this kind of streaming download, but you would need the plain URL for the blob rather than downloading it with their gem.
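In a recipe that would look roughly like this (a sketch, assuming blob_sas_url is a SAS URL for the blob that you generate elsewhere, and the local path is a placeholder):
# Sketch only: remote_file streams the download to disk, so the Chef run never
# holds the whole blob in memory. blob_sas_url and the path are placeholders.
remote_file 'c:/myinstalls/mybigblob.bin' do
  source blob_sas_url
  action :create
end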
I was just looking into using the azure/storage/blob library for a DevOps project I was working on, and it seems to me that the implementation is quite basic and does not utilise the full underlying API. For example, uploads are slow when streamed from a file, most likely because it does not upload chunks in parallel. I don't think this library is production-ready and the exposed Ruby API is lacking. It's open source, so if anybody has some time, they can help contribute.
I'm trying to do this:
import groovy.sql.Sql
def sql = Sql.newInstance(
    url: 'jdbc:sqlserver://localhost\\myDB',
    user: 'server\\user', // I don't think I need this because of SSPI
    password: 'password',
    driver: 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
    SSPI: 'true'
)
The problem I'm having is that the connection just times out. I can ping the machine, and I can also connect to the database with Management Studio logged in as my SSPI user (or whatever you call it; I start Management Studio as a different user).
So I've tried the same with SoapUI, starting the program as a different user, but I still time out when I initiate the connection. So something is very wrong with my connection string, and any help would be appreciated.
P.S. Yes, I don't know what's up with the \ backslashes after the URL to the server; I guess it indicates that it's at the root. If I don't use them I get a message that I'm on the incorrect version.
And then we found the answer... First of all, I had the wrong JDBC driver installed. You need to head over to Microsoft to get the real deal:
https://www.microsoft.com/en-us/download/details.aspx?id=11774
Then you need to unpack it and place the 4 or 4.1 version of the jar in SoapUI's bin directory. (You are apparently supposed to use lib/ext, but that doesn't work for me.)
Then, since we are trying to use SSPI (Windows Authentication) to connect to the SQL Server, you need to take sqljdbc_auth.dll from the driver's enu/auth folder and put it somewhere on your PATH or in SoapUI's lib folder. Remember to use the 32-bit DLL for 32-bit SoapUI! I did not, since my system is 64-bit...
After this I used the string below. Now that the setup is correct, it should work fine as long as you remember to start SoapUI as the correct Windows user (Shift-right-click, "Run as different user", and use the same user the SQL Server is running as).
Again, I wasn't completely aware of this from the start (yes, total newbie here) and it failed.
Finally, when you have done all this, this is the string that works (and probably a lot of derivatives, since the failing parts here were the driver and the DLL):
def sql = Sql.newInstance("jdbc:sqlserver://localhost;Database=myDB;integratedSecurity=true", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
After following this simple tutorial http://www.louisaslett.com/RStudio_AMI/ and video guide http://www.louisaslett.com/RStudio_AMI/video_guide.html I have setup an RStudio environment on EC2.
The only problem is, I can't upload large files (> 1GB).
I can upload small files just fine.
When I try to upload a file via RStudio, it gives me the following error:
Unexpected empty response from server
Does anyone know how I can upload these large files for use in RStudio? This is the whole reason I am using EC2 in the first place (to work with big data).
OK, so I had the same problem myself and it was incredibly frustrating, but eventually I realised what was going on. The default home directory size for AWS is less than 8-10 GB regardless of the size of your instance. Since the upload was going into the home directory, there was not enough room. An experienced Linux user would not have fallen into this trap, but hopefully other Windows users new to this who hit the same problem will see this. If you upload to a different volume on the instance, the problem is solved. Because the Louis Aslett RStudio AMI lives in this 8-10 GB space, you will have to set your working directory outside it, i.e. outside the home directory, which is not intuitively apparent from the RStudio Server interface. While this is an advanced forum and this is a rookie error, I am hoping no one deletes this question, as I spent months on this and I think someone else will too.
Don't you have shell access to your Amazon server? Don't rely on RStudio's upload (which may reasonably have a 2 GB limit); use proper Unix dev tools:
rsync -avz myHugeFile.dat amazonusername@my.amazon.host.ip:
Run that from your local PC command line (install Cygwin or another Unix-compatibility system if you're on Windows); it will transfer your huge file to your Amazon server, resume from where it left off if interrupted, and compress the data for the transfer too.
For a Windows GUI for this sort of thing, WinSCP is what we used in the bad old days before Linux.
This could also have something to do with your web server. Are you using nginx or Apache? If so, you can raise the upload size limit. If you are running nginx on the front end, I would recommend the following fix in your nginx.conf file:
http {
    ...
    client_max_body_size 100M;
}
https://www.tecmint.com/limit-file-upload-size-in-nginx/
I had a similar problem with a 5 GB file. What worked for me was to use SQLite: I created a database from the CSV file I needed, then used a function in RStudio to talk to that local database. That way I was able to bring the CSV file in. I can track down the R code I used if you like.
I'm trying to keep a file updated in real time with the server. It's more like real-time syncing with a very small delay. Is there any application that lets me do this? Or would you suggest using a local host as a server?
I don't know how you are connected to your server, but I assume it is something like SCP / SFTP / FTP, and I don't know your OS. WinSCP will do exactly what you need: you can set it to watch your filesystem (a specified folder) and it will update the server files as soon as a file on your drive changes.
It also has command-line / scripting support, so you can use it from within your own applications.
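As a rough sketch of driving that from a script (assuming winscp.com is on your PATH; the host, credentials, host key and paths are placeholders, and keepuptodate is the WinSCP scripting command that does the watching):
# Sketch only: writes a WinSCP script that keeps a local folder mirrored to the
# server, then hands it to winscp.com. All names below are placeholders.
script = <<~WINSCP
  open sftp://user:password@example.com/ -hostkey="ssh-rsa 2048 xx:xx:..."
  keepuptodate "C:\\local\\folder" /remote/folder
  exit
WINSCP

File.write('keepuptodate.txt', script)
system('winscp.com', '/script=keepuptodate.txt')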
I guess this is kind of a programming question, because I'm going to write a program if this doesn't exist.
So I found a very cheap web-host (I don't really care about the actual web hosting). They will give me a domain name and ftp server with a ton of storage space. Anyway, I want to backup a few hundred gigs of data (mostly family photos and scans of important documents). I also want to backup any future family photos / documents. I don't care if everything on my local NAS dies in a fire, I just want to have the photos and important documents backed up off-site.
So I want some program that lets me select folders locally and schedules them to be backed up to the FTP server. I'm a bit of a security nut, so I'd like the files to be encrypted locally before being transferred up to the server.
I know I can do this with TrueCrypt volumes, but I don't want to transfer an entire encrypted volume blob up to the server every time I change a file in it. I could use multiple TrueCrypt volumes, but that would be a pain to manage.
Also this must be mac/linux compatible although I'll primarily be on linux.
I basically need rsync + truecrypt + cron + sftp all rolled into a cryptographically secure program.
I've been searching for days with no luck. Any ideas?
mozyBackup does this - it doesn't use FTP, it has a custom uploader.
P.S. Remember a typical home ADSL connection only does about 1Gb/day upstream.
Linux option.
An out-of-the-box option is probably duplicity (for example, see http://www.howtoforge.com/creating-encrypted-ftp-backups-with-duplicity-and-ftplicity-on-debian-lenny ).
Otherwise, if these are basically rarely-changed archive copies of files, I would roll my own: gnupg (or dpad) for individual file encryption, a file-change detection script, and ftp or rsync.
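If you do go the roll-your-own route, here is a bare-bones sketch of that gnupg + changed-file check + sftp idea in Ruby (all hosts, paths and the key id are placeholders; it assumes the gpg CLI is installed with the recipient's public key imported, and you'd run it from cron):
require 'net/sftp'
require 'fileutils'

# Sketch only: encrypt locally with gpg, skip files that haven't changed since
# the last run, then push the encrypted copies over SFTP. Subdirectories are
# flattened for brevity.
LOCAL_DIR  = '/home/me/photos'
STAGING    = '/tmp/encrypted'
REMOTE_DIR = '/backup/photos'
GPG_KEY    = 'backup@example.com'

FileUtils.mkdir_p(STAGING)

Net::SFTP.start('ftp.example.com', 'user', :password => 'secret') do |sftp|
  Dir.glob(File.join(LOCAL_DIR, '**', '*')).select { |f| File.file?(f) }.each do |path|
    encrypted = File.join(STAGING, File.basename(path) + '.gpg')

    # only re-encrypt and re-upload when the source file is newer than its encrypted copy
    next if File.exist?(encrypted) && File.mtime(encrypted) >= File.mtime(path)

    # encrypt locally before anything leaves the machine
    system('gpg', '--yes', '--output', encrypted, '--encrypt', '--recipient', GPG_KEY, path) or next

    sftp.upload!(encrypted, File.join(REMOTE_DIR, File.basename(encrypted)))
  end
end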