Why is downloading from Azure blobs taking so long?

In my Azure web role OnStart() I need to deploy a huge unmanaged program the role depends on. The program is compressed beforehand into a 400-megabyte .zip archive, split into files of 20 megabytes each, and uploaded to a blob storage container. That program doesn't change - once uploaded it can stay that way for ages.
My code does the following:
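CloudBlobContainer container = ... ;
String localPath = ...;
using( FileStream writeStream = new FileStream(
    localPath, FileMode.OpenOrCreate, FileAccess.Write ) )
{
    // Append each 20-megabyte part to the same local file, in order.
    // (.Count assumes blobNames is a List<string>; use .Length for an array.)
    for( int i = 0; i < blobNames.Count; i++ ) {
        String blobName = blobNames[i];
        container.GetBlobReference( blobName ).DownloadToStream( writeStream );
    }
    // No explicit Close() needed: the using block disposes the stream.
}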
It just opens a file, then writes the parts into it one by one. It works great, except it takes about 4 minutes when run from a single-core (Extra Small) instance, which means the average download speed is about 1.7 megabytes per second.
This worries me - it seems too slow. Should it be so slow? What am I doing wrong? What could I do instead to solve my problem with deployment?

Adding to what Richard Astbury said: An Extra Small instance has only a very small fraction of the bandwidth that even a Small gives you. You'll see approx. 5Mbps on an Extra Small, and approx. 100Mbps on a Small (for Small through Extra Large, you'll get approx. 100Mbps per core).

The extra small instance has limited IO performance. Have you tried going for a medium sized instance for comparison?

In some ad-hoc testing I have done in the past I found that there is no discernable difference between downloading 1 large file and trying to download in parallel N smaller files. It turns out that the bandwidth on the NIC is usually the limiting factor no matter what and that a large file will just as easily saturate it as many smaller ones. The reverse is not true, btw. You do benefit by uploading in parallel as opposed to one at a time.
The reason I mention this is that it seems like you should be using 1 large zip file here and something like Bootstrapper. That would be 1 line of code for you to download, unzip, and possibly run. Even better, it won't do it more than once on reboot unless you force it to.
As others have already aptly mentioned, the NIC bandwidth on the XS instances is vastly smaller than on even an S instance. You will see much faster downloads by bumping up the VM size slightly.
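The figures in this thread are easier to compare once they are in the same unit. Below is a minimal Python sketch of that arithmetic, using only numbers quoted above (400 megabytes in about 4 minutes, and the approximate 100Mbps figure for a Small instance). The function names are made up for illustration, and the conversion assumes 1 megabyte = 8 megabits with no protocol overhead.

```python
def throughput(total_megabytes, seconds):
    """Return (megabytes per second, megabits per second) for a transfer."""
    mb_per_s = total_megabytes / seconds
    return mb_per_s, mb_per_s * 8  # 1 byte = 8 bits


def transfer_seconds(total_megabytes, link_megabits_per_s):
    """Best-case time to move the data over a link of the given speed."""
    return total_megabytes * 8 / link_megabits_per_s


# Observed in the question: 400 megabytes in about 4 minutes.
mb_s, mbit_s = throughput(400, 4 * 60)
print(f"{mb_s:.2f} MB/s = {mbit_s:.1f} Mbit/s")  # prints "1.67 MB/s = 13.3 Mbit/s"

# Best case for the same archive at the quoted approx. 100 Mbit/s.
print(f"{transfer_seconds(400, 100):.0f} s")  # prints "32 s"
```

Since overhead is ignored, the 32-second figure is a best case rather than a prediction, and the bandwidth numbers quoted in the answers are themselves approximate.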

Related

What is good way of using multiprocessing for bifacial_radiance simulations?

For a university project I am using bifacial_radiance v0.4.0 to run simulations of approx. 270 000 rows of data in an EPW file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the python script for this on a high power computer with 64 cores. Since python natively only uses 1 processor I want to use multiprocessing, which is currently working. However it does not seem very fast, even when starting 64 processes it uses roughly 10 % of the CPU's capacity (according to the task manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as csv), and compare it to the contents of the radObj.metdata object. Both metdata and my result file use dates, so all dates which exist in the metdata file but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two functions:
- A file writer function which every 10 seconds gets all items from the result queue and writes them to the result file. This function is running in a single multiprocessing.Process process like so:
    # start() returns None, so keep the Process object before starting it
    fileWriteProcess = Process(target=fileWriter, args=(resultQueue, resultFileName))
    fileWriteProcess.start()
- A ray trace function with a unique ID which does the following:
  - Get an index idx from the index queue (described above)
  - Use this index in radObj.gendaylit(idx)
  - Create the octfile. For this I have modified the name which the octfile is saved with to use a prefix which is the name of the process. This is to avoid all the processes using the same octfile on the SSD. octfile = radObj.makeOct(prefix=name)
  - Run an analysis:
      analysis = bifacial_radiance.AnalysisObj(octfile, radObj.basename)
      frontscan, backscan = analysis.moduleAnalysis(scene)
      frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
  - Read the desired results from resultDict and put them in the resultQueue as a single line of comma-separated values.
This all works. The processes are running after being created in a for loop.
This speeds up the whole simulation process quite a bit (10 days down to 1½ days), but as said earlier the CPU is running at around 10 % capacity and the GPU at around 25 % capacity. The computer has 512 GB of RAM, so memory is not an issue. The only communication with the processes is through the resultQueue and indexQueue, which should not bottleneck the program. I can see that it is not synchronizing as the results are written slightly unsorted while the input EPW file is sorted.
My question is if there is a better way to do this, which might make it run faster? I can see in the source code that a boolean "hpc" is used to initiate some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.
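The bookkeeping described in the question (skip the timestamps that already have a result, hand the rest to worker processes, write results as they arrive) can also be expressed with multiprocessing.Pool instead of hand-managed queues. The following is only a sketch under assumptions: trace_one is a hypothetical placeholder standing in for the gendaylit / makeOct / analysis steps and is not part of the bifacial_radiance API, and the pool size and chunk size are arbitrary. It does not by itself explain the low CPU utilisation.

```python
import csv
import multiprocessing as mp


def pending_indices(timestamps, done_timestamps):
    """Return the indices of timestamps that have no stored result yet."""
    done = set(done_timestamps)
    return [i for i, ts in enumerate(timestamps) if ts not in done]


def trace_one(idx):
    """Hypothetical placeholder for the per-index ray trace.

    In the real script this is where gendaylit, makeOct and the analysis
    would run; here it just returns a dummy value for the index.
    """
    return idx, float(idx) * 2.0


def run_pending(indices, result_path, processes=4, chunksize=16):
    """Fan the indices out to worker processes and append results to a CSV."""
    with mp.Pool(processes) as pool, open(result_path, "a", newline="") as fh:
        writer = csv.writer(fh)
        # imap_unordered yields results as soon as any worker finishes,
        # so rows arrive unsorted, as the question also observes.
        for idx, value in pool.imap_unordered(trace_one, indices, chunksize):
            writer.writerow([idx, value])
```

Here the parent process does the writing, which replaces the separate file writer process. On Windows the call to run_pending has to sit under an `if __name__ == "__main__":` guard, because worker processes re-import the main module.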

Tensorflow dequeue is very slow on Cloud ML

I am trying to run a CNN on the cloud (Google Cloud ML) because my laptop does not have a GPU card.
So I uploaded my data on Google Cloud Storage. A .csv file with 1500 entries, like so:
| label   | img_path   |
| label_1 | /img_1.jpg |
| label_2 | /img_2.jpg |
and the corresponding 1500 jpgs.
My input_fn looks like so:
def input_fn(filename,
             batch_size,
             num_epochs=None,
             skip_header_lines=1,
             shuffle=False):
    filename_queue = tf.train.string_input_producer(filename, num_epochs=num_epochs)
    reader = tf.TextLineReader(skip_header_lines=skip_header_lines)
    _, row = reader.read(filename_queue)
    row = parse_csv(row)
    pt = row.pop(-1)
    pth = filename.rpartition('/')[0] + pt
    img = tf.image.decode_jpeg(tf.read_file(tf.squeeze(pth)), 1)
    img = tf.to_float(img) / 255.
    img = tf.reshape(img, [IMG_SIZE, IMG_SIZE, 1])
    row = tf.concat(row, 0)
    if shuffle:
        return tf.train.shuffle_batch(
            [img, row],
            batch_size,
            capacity=2000,
            min_after_dequeue=2 * batch_size + 1,
            num_threads=multiprocessing.cpu_count(),
        )
    else:
        return tf.train.batch([img, row],
                              batch_size,
                              allow_smaller_final_batch=True,
                              num_threads=multiprocessing.cpu_count())
Here is what the full graph looks like (very simple CNN indeed):
When I run the training with a batch size of 200 on my laptop (where the data is stored locally), most of the compute time is spent on the gradients node, which is what I would expect. The batch node has a compute time of ~12ms.
When I run it on the cloud (scale-tier is BASIC), the batch node takes more than 20s. The bottleneck seems to be coming from the QueueDequeueUpToV2 subnode according to TensorBoard.
Does anyone have any clue why this happens? I am pretty sure I am getting something wrong here, so I'd be happy to learn.
A few remarks:
- Changing between batch/shuffle_batch with different min_after_dequeue has no effect.
- When using BASIC_GPU, the batch node is also on the CPU, which is normal according to what I read, and it takes roughly 13s.
- Adding a time.sleep after the queues are started to ensure no starvation also has no effect.
- Compute time is indeed linear in batch_size, so with a batch_size of 50, the compute time would be 4 times smaller than with a batch_size of 200.
Thanks for reading and would be happy to give more details if anyone needs.
Best,
Al
Update:
- The Cloud ML instance and the bucket were not in the same region; putting them in the same region improved the result 4x.
- Creating a .tfrecords file made the batching take 70ms, which seems to be acceptable. I used this blog post as a starting point to learn about it; I recommend it.
I hope this will help others to create a fast data input pipeline!

Try converting your images to tfrecord format and reading them directly from the graph. The way you are doing it, there is no possibility of caching, and if your images are small, you are not taking advantage of the high sustained reads from cloud storage. Saving all your jpg images into a tfrecord file or a small number of files will help.
Also, make sure your bucket is a single-region bucket in a region that has GPUs and that you are submitting to Cloud ML in that region.
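The point about many small reads versus one sustained read can be illustrated with a back-of-the-envelope cost model: each object fetched pays a fixed round-trip latency, plus the time to stream its bytes. The Python sketch below is only an illustration. The latency, bandwidth, and image size are hypothetical values picked to show the shape of the trade-off, not measurements of Cloud Storage; only the count of 1500 images comes from the question.

```python
def estimated_read_seconds(n_objects, bytes_per_object, latency_s,
                           bandwidth_bytes_per_s):
    """Rough model: one round trip per object plus time to stream the bytes."""
    total_bytes = n_objects * bytes_per_object
    return n_objects * latency_s + total_bytes / bandwidth_bytes_per_s


IMAGES = 1500            # number of jpgs, from the question
IMAGE_BYTES = 20_000     # hypothetical average jpg size
LATENCY_S = 0.05         # hypothetical per-request latency
BANDWIDTH = 50_000_000   # hypothetical sustained read rate, bytes per second

# 1500 separate objects versus the same bytes packed into a single file.
many_small = estimated_read_seconds(IMAGES, IMAGE_BYTES, LATENCY_S, BANDWIDTH)
one_large = estimated_read_seconds(1, IMAGES * IMAGE_BYTES, LATENCY_S, BANDWIDTH)

print(f"many small objects: {many_small:.1f} s, one large file: {one_large:.2f} s")
```

In this model the per-request latency dominates as soon as objects are small, which is consistent with both updates in the question: moving the bucket into the same region reduces the latency term, and a .tfrecords file reduces the number of requests.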

I had a similar problem before. I solved it by changing tf.train.batch() to tf.train.batch_join(). In my experiment, with a batch size of 64 and 4 GPUs, it took 22 mins using tf.train.batch() whilst it only took 2 mins using tf.train.batch_join().
From the TensorFlow docs:
"If you need more parallelism or shuffling of examples between files, use multiple reader instances using the tf.train.shuffle_batch_join"
https://www.tensorflow.org/programmers_guide/reading_data

UWP ARM System.IO Decompression performance is poor

I am noticing extremely long times to extract a zip file on ARM-based devices. Extracting a 20 MB zip file takes over 60 seconds! I am even seeing 140 seconds on a 950XL, which is supposed to be one of the more powerful ARM models.
This is the code I am using:
var startExtractTime = DateTime.Now;
ZipArchive za = new ZipArchive(archiveMemoryStream, ZipArchiveMode.Read);
za.ExtractToDirectory(path);
var stopExtractTime = DateTime.Now;
var durationInSeconds = stopExtractTime.Subtract(startExtractTime).TotalSeconds;
Is this the kind of performance I can expect from this method? Are there any other ways to get around this? I'd prefer to include a Zip file in my project instead of the HUGE directory structure that is inside this file, but if I can't get good performance from ARM devices I may not have an option.

Zip decompression itself should not take that much time. However, if your archive contains a lot of small files, this can be a bottleneck for the flash drive/internal flash memory. Try decompressing a single-file 20 MB archive to check whether it is a CPU or a file system issue.
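The experiment suggested here can be sketched as follows. This is written in Python with the standard zipfile module rather than the UWP ZipArchive API from the question, so it only illustrates the method: build two archives with the same total payload, one holding a single entry and one holding many small entries, and time the extraction of each. The function names, entry names, and sizes are made up, and the random payload is a stand-in for real content.

```python
import os
import tempfile
import time
import zipfile


def build_archive(path, n_files, total_bytes):
    """Write a zip with n_files entries that together hold about total_bytes."""
    payload = os.urandom(total_bytes // n_files)
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(n_files):
            zf.writestr(f"data/file_{i:05d}.bin", payload)


def time_extract(archive_path, dest_dir):
    """Return the seconds taken to extract the whole archive into dest_dir."""
    start = time.perf_counter()
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
    return time.perf_counter() - start


def compare_layouts(total_bytes=20_000_000, many=5000):
    """Time extraction of the same payload as one entry and as many entries."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for label, n_files in (("single", 1), ("many", many)):
            archive = os.path.join(tmp, f"{label}.zip")
            build_archive(archive, n_files, total_bytes)
            results[label] = time_extract(archive, os.path.join(tmp, label))
    return results
```

If the "many" timing is far larger than the "single" one for the same number of bytes, the file system is the likelier bottleneck; if both are slow, decompression itself deserves a closer look.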

Why does the execution time of my perl code vary so widely?

I found that the following perl code executes at surprisingly varying speeds, sometimes fast, sometimes very slow. I have a few folders containing tens of thousands of files, which I need to run this code through. I am running this on cygwin with Windows 7. I just wonder if someone could please help me to speed it up, or at least to figure out why the speed is varying. My CPU and memory should be plentiful in all these situations.
# outer loop to iterate through a list of $dir's
opendir(DIR, $dir);
@all=readdir(DIR);
@files = (0..$#all);
$i=-1;
foreach $current (@all){
    if (-f "$dir/$current") {
        $files[++$i]=$current;
    }
}
push @Allfiles,@files[0..$i];
closedir(DIR);

You're probably I/O bound, so changes to your code probably won't affect the total runtime - runtime will be affected by whether the directory entries are in cache or not.
But your code uses temporary arrays for no good reason, using too much RAM if the directories are very large. You could simplify it to:
opendir(DIR, $dir);
# defined() keeps the loop going for a file that is literally named "0"
while (defined(my $file = readdir(DIR))) {
    push @Allfiles, $file if (-f "$dir/$file");
}
closedir(DIR);
No temporary arrays.

If it is slow the first time you run, and fast after that, then the problem is that your system is caching the reads. The first time you run your code, data has to be read off your disk. After that, the data is still cached in RAM. If you wait long enough, the cache will flush and you will have to hit the disks again.
Or sometimes you may be running some other disk intensive task at the same time, but not at other times when you run your code.
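One way to check the caching explanation is to time the same directory scan several times in a row and compare the first pass with the later ones. The sketch below does this in Python rather than Perl, purely for illustration. The function names are made up, and os.scandir with is_file() plays the role of readdir plus the -f test.

```python
import os
import time


def list_regular_files(directory):
    """Return the names of regular files directly inside directory."""
    with os.scandir(directory) as entries:
        return [entry.name for entry in entries if entry.is_file()]


def time_scans(directory, repeats=2):
    """Scan the directory repeatedly and return the seconds taken per pass."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        list_regular_files(directory)
        timings.append(time.perf_counter() - start)
    return timings
```

A first pass that is much slower than the following ones points at the disk cache rather than at the code. A pass that is only slow occasionally is more consistent with another disk-intensive task running at the same time.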

Copying Files over an Intermittent Network Connection

I am looking for a robust way to copy files over a Windows network share that is tolerant of intermittent connectivity. The application is often used on wireless, mobile workstations in large hospitals, and I'm assuming connectivity can be lost either momentarily or for several minutes at a time. The files involved are typically about 200KB - 500KB in size. The application is written in VB6 (ugh), but we frequently end up using Windows DLL calls.
Thanks!

I've used Robocopy for this with excellent results. By default, it will retry every 30 seconds until the file gets across.

I'm unclear as to what your actual problem is, so I'll throw out a few thoughts.
Do you want restartable copies (with such small file sizes, that doesn't seem like it'd be that big of a deal)? If so, look at CopyFileEx with COPY_FILE_RESTARTABLE.
Do you want verifiable copies? Sounds like you already have that by verifying hashes.
Do you want better performance? It's going to be tough, as it sounds like you can't run anything on the server. Otherwise, TransmitFile may help.
Do you just want a fire and forget operation? I suppose shelling out to robocopy, or TeraCopy or something would work - but it seems a bit hacky to me.
Do you want to know when the network comes back? IsNetworkAlive has your answer.
Based on what I know so far, I think the following pseudo-code would be my approach:
sourceFile = Compress("*.*");
destFile = "X:\files.zip";
int copyFlags = COPY_FILE_FAIL_IF_EXISTS | COPY_FILE_RESTARTABLE;
// CopyFileEx returns 0 on failure, so keep retrying until it succeeds
while (CopyFileEx(sourceFile, destFile, null, null, false, copyFlags) == 0) {
    do {
        // optionally, increment a failed counter to break out at some point
        Sleep(1000);
    } while (!IsNetworkAlive(NETWORK_ALIVE_LAN));
}
Compressing the files first saves you the tracking of which files you've successfully copied, and which you need to restart. It should also make the copy go faster (smaller total file size, and larger single file size), at the expense of some CPU power on both sides. A simple batch file can decompress it on the server side.
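The retry loop in the pseudo-code above can be sketched in runnable form. This Python version is not the Win32 approach: it restarts the copy from scratch on every attempt instead of resuming like COPY_FILE_RESTARTABLE, and it simply waits a fixed delay instead of polling IsNetworkAlive. The function name, parameters, and defaults are made up for illustration. The copy and sleep arguments exist only so the loop can be exercised without a real network share.

```python
import shutil
import time


def copy_with_retry(src, dst, attempts=5, delay_s=1.0,
                    copy=shutil.copyfile, sleep=time.sleep):
    """Copy src to dst, retrying on OSError; return the attempt that worked."""
    for attempt in range(1, attempts + 1):
        try:
            copy(src, dst)
            return attempt
        except OSError:
            # Give up after the last attempt and let the caller see the error.
            if attempt == attempts:
                raise
            sleep(delay_s)
```

For files of a few hundred kilobytes, restarting from the beginning is cheap, which matches the remark above that restartable copies are not a big deal at this size.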

Try using BITS (Background Intelligent Transfer Service). It's the infrastructure that Windows Update uses, is accessible via the Win32 API, and is built specifically to address this.
It's usually used for application updates, but should work well in any file moving situation.
http://www.codeproject.com/KB/IP/bitsman.aspx

I agree with Robocopy as a solution... that's why the utility is called "Robust File Copy".
"I've used Robocopy for this with excellent results. By default, it will retry every 30 seconds until the file gets across."
And by default, a million retries. That should be plenty for your intermittent connection.
It also does restartable transfers, and you can even throttle transfers with a gap between packets (the /IPG switch), assuming you don't want to use all the bandwidth because other programs are using the same connection.

How about simply sending a hash after or before you send the file, and comparing that with the file you received? That should at least make sure you have a correct file.
If you want to go all out you could do the same process, but for small parts of the file. Then when you have all pieces, join them on the receiving end.
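A minimal sketch of the whole-file variant of this idea, in Python: hash the source, hash the received copy, and compare. SHA-256 and the function names are choices made for this illustration, not something the answer specifies. The file is read in chunks so that memory use stays flat regardless of file size.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(src, dst):
    """True when both files have identical contents according to SHA-256."""
    return file_sha256(src) == file_sha256(dst)
```

The per-piece variant would apply the same comparison to each part before joining them on the receiving end.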

You could use Microsoft SyncToy (free).
http://www.microsoft.com/Downloads/details.aspx?familyid=C26EFA36-98E0-4EE9-A7C5-98D0592D8C52&displaylang=en

Hm, it seems rsync does this, and it does not need a server, daemon, or install as I thought it did - just $ rsync src dst.

SMS if it's available works.
