Windows Azure virtual machine is slow to access network when scaling - windows

I'm running some startup scripts (cmd/bat) on my small Azure VM which include a file-transfer operation from a mounted VHD; normally it finishes in about 3 minutes (copying files and extracting a ~500 MB zip file with command-line 7z).
When I scale out to ~150 instances, the same operation is very slow (up to 15 minutes in total, most of which is spent in 7z). Also, the nodes that are slowest to complete the boot procedure are very hard to access at first using mstsc (the animation lags and logging in takes a long time), but that might not be related.
What could be the problem?
We had the idea to examine the cache, but it would be nice to know of any other potential bottleneck which may be present in the following situation.
UPDATE:
I tried extracting to the D:\ drive instead of C:\, and while scaling to 200 instances the unzip takes about a minute! So it seems the problem is that C:\ is backed by blob storage. But again, I have 3 GB of data in 40 files, so 60 MB/s per blob should be enough to handle it. Or could it be that there is a cap across all blobs?

The VM sizes each have their own bandwidth limitations.

| VM Size | Bandwidth (Mbps) |
| ------------- |:-------------:|
| Extra Small | 5 |
| Small | 100 |
| Medium | 200 |
| Large | 400 |
| Extra Large | 800 |
I suspect you have a single copy of your mounted VHD with ~150 instances hitting it. Increasing the size of the VM hosting the VHD would be a good test, but an expensive solution. Longer term, put the files in blob storage; that means modifying your scripts to access the RESTful endpoints.
It might be easiest to create 2-3 drives on 2-3 different VMs and write a script that keeps their files in sync. Your scripts could then randomly hit one of the 2-3 mounted VHDs to spread out the load.
Here are the most recent limitations per VM size. Unfortunately this table doesn't include network bandwidth: http://msdn.microsoft.com/en-us/library/windowsazure/dn197896.aspx
-Rich
P.S. I got the bandwidth figures from a PowerPoint slide in the Microsoft-provided Azure Training Kit dated January 2013.

One thing to consider is the per-account scalability target of storage. With geo-replication enabled, you have 10 Gbps egress and 20K transactions/sec, which you could be bumping into. Figure that with 150 instances, you could potentially be pulling 150 x 100 Mbps, or 15 Gbps, as all of your instances start up.
Not sure about the "mounted VHD" part of your question. With Azure's drive mounting, only one virtual machine instance can mount a drive at any given time. For this type of file-copy operation, you'd typically grab a file directly from a storage blob rather than from a file stored in a VHD (which, in turn, is stored in a page blob).
EDIT: Just wanted to mention that an individual blob is limited to 60 MB/sec (also mentioned in the blog post I referenced). This could also be related to the throttling you're seeing.

Related

Very slow copy of MyISAM .MYD file

We noticed that a few of our MyISAM .MYD files (MySQL database tables) copy extremely slowly. Both the C: drive and the D: drive are SSDs with a theoretical data rate limit of 500 MB/sec. For the timings we turn off the MySQL service. Here are some sample timings for the 6 GB file test.myd:
NET STOP MYSQL56
Step 1: COPY D:\MySQL_Data\test.myd C:\Temp --> 61 MB/sec copy speed
Step 2: COPY C:\Temp\test.myd D:\temp --> 463 MB/sec
Step 3: COPY D:\Temp\test.myd c:\temp\test1.myd --> 92 MB/sec
Strange results; why would the speed in one direction be so different from the other direction?
Let's try this:
NET START MYSQL56
in MySQL: REPAIR TABLE test; (took about 6 minutes)
NET STOP MYSQL56
Step 4: COPY D:\MySQL_Data\test.myd C:\Temp --> 463 MB/sec
Step 5: COPY C:\Temp\test.myd D:\temp --> 463 MB/sec
Step 6: COPY D:\Temp\test.myd c:\temp\test1.myd --> 451 MB/sec
Can anybody explain the difference in copy speed?
What might have caused the slow copy speed in the first place?
Why would REPAIR make a difference, when OPTIMIZE, which we tried first, did not?
Would there be any kind of performance hit at the SQL level with the initial version (i.e. before the REPAIR)? Sorry, I did not test this before running the copy timings.
- REPAIR would scan through the table and fix issues that it finds. This means that the table is completely read.
- OPTIMIZE copies the entire table over, then RENAMEs it back to the old name. The result is as if the entire table were read.
- COPY reads one file and writes to the other file. If the target file does not exist, it must create it; this is a slow process on Windows.
- When reading a file, the data is fetched from disk (SSD, in your case) and cached in RAM. A second reading will use the cached copy, thereby being faster.
This last bullet item may explain the discrepancies you found.
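To see the caching effect in isolation, here is a small demonstration sketch (written as Linux shell commands because dropping the cache is easy to script there; the principle is the same for a Windows COPY):

    # Create a 1 GB test file, then compare a cold copy with a warm copy.
    dd if=/dev/urandom of=/tmp/testfile bs=1M count=1024
    sync; echo 3 | sudo tee /proc/sys/vm/drop_caches   # empty the page cache
    time cp /tmp/testfile /tmp/copy_cold               # data has to come from the disk
    time cp /tmp/testfile /tmp/copy_warm               # the source is now cached in RAM, so this is much faster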
Another possibility is "wear leveling" and/or "erase-before-write" -- two properties of SSDs.
Wear leveling is when the SSD moves things around to avoid too much "wear". Note that an SSD block "wears out" after N writes to it. By moving blocks around, this physical deficiency is avoided. (It is a feature of enterprise-grade SSDs, but may be missing on cheap drives.)
Before a write can occur on an SSD, the spot must first be "erased". This extra step is simply a physical requirement of how SSDs work. I doubt if it factors into your question, but it might.
I am removing [mysql] and [myisam] tags since the question really only applies to file COPY with Windows and SSD.

HP Fortify. Issues while handling very large fpr reports on fortify server

We have a huge source-code base. We scan it using HP SCA and create an FPR file (approx. 620 MB in size). Then we upload it to our Fortify server using the "fortifyclient" command.
After uploading, if I log in to the Fortify server and go into the details of that project, I see that the artifact is in the "processing" stage. It remains in the processing stage even after a few days. The dashboard provides no way to stop, kill, or delete it.
Question 1: Why is it taking so long to process? (We have one successfully processed FPR report, and that took 6 days.) What can we do to make it faster?
Question 2: If I want to delete an artifact while it is in the processing stage, how do I do that?
Machine Info:
6 CPUs (Intel(R) Xeon(R) 3.07GHz )
RAM: 36 GB
Thanks,
Addition:
We had one report that was successfully processed earlier in the month for the same codebase. The FPR file for that was also of similar size (610 MB). I can see the issue count for that report; here it is:
EDIT:
Fortify Version: Fortify Static Code Analyzer 6.02.0014
HP Fortify Software Security Center Version 4.02.0014
Total issues: 157000
Total issues audited: 0.0%
Critical issues: 4306
High: 151200
Medium: 100
Low: 1640
That's a large FPR file, so it will need time to process. SSC is basically unzipping a huge ZIP file (that's what an FPR file is) and then transferring the data into the database. Here are a few things to check:
Check the amount of memory allotted for SSC. You may need to pass up to 16 GB of memory as the Xmx value to handle an FPR that size, maybe more. The easiest way to tell would be to upload the FPR and then watch the Java process that your app server uses, and see how long it takes to reach the maximum amount of memory (a hedged sketch of one way to raise the heap appears after the command below).
Make sure the database is configured for performance. Having the database on a separate server with the data files on another hard drive can significantly speed up processing.
As a last resort, you could also try making the FPR smaller. You can turn off the source rendering so that source code is not bundled with the FPR file. You can do this with this command:
sourceanalyzer -b mybuild -disable-source-bundling -fvdl-no-snippets -scan -f mySourcelessResults.fpr
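Regarding the memory check above, here is a minimal sketch of one common way to raise the heap, assuming SSC is deployed on Tomcat (setenv.sh is Tomcat's standard hook; the sizes and paths are examples, so adjust for your app server):

    # $CATALINA_HOME/bin/setenv.sh -- create the file if it does not exist
    export CATALINA_OPTS="$CATALINA_OPTS -Xms4g -Xmx16g"
    # Restart Tomcat, upload the FPR, then watch the heap while it processes, e.g.:
    #   jstat -gcutil <ssc-java-pid> 5000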
As far as deleting an in-progress upload goes, I think you have to let it finish. With some tuning, you should be able to get the processing time down.

I/O performance of multiple JVM (Windows 7 affected, Linux works)

I have a program that creates a file of about 50 MB in size. During the process the program frequently rewrites sections of the file and forces the changes to disk (on the order of 100 times). It uses a FileChannel and direct ByteBuffers via fc.read(...), fc.write(...) and fc.force(...).
New text:
I have a better view on the problem now.
The problem appears to be that I use three different JVMs to modify a file (one creates it, two others, launched from the first, write to it). Every JVM closes the file properly before the next JVM is started.
The problem is that the cost of fc.write() to that file occasionally goes through the roof for the third JVM (on the order of 100 times the normal cost). That is, all write operations are equally slow; it is not just one that hangs for a very long time.
Interestingly, one way to help is to insert delays (2 seconds) between the launching of the JVMs. Without the delay, writing is always slow; with the delay, the writing is slow only about every second time or so.
I also found this Stack Overflow question: How to unmap a file from memory mapped using FileChannel in java?, which describes a problem for mapped files, which I'm not using.
What I suspect might be going on:
Java does not completely release the file handle when I call close(). When the next JVM is started, Java (or Windows) recognizes concurrent access to that file and installs some expensive concurrency handling for that file, which makes writing expensive.
Would that make sense?
The problem occurs on Windows 7 (Java 6 and 7, tested on two machines), but not under Linux (SuSE 11.3 64).
Old text:
The problem:
Starting the program as a JUnit test harness from Eclipse or from the console works fine; it takes around 3 seconds.
Starting the program through an Ant task (or through JUnit by kicking off a separate JVM using a ProcessBuilder) slows the program down to 70-80 seconds for the same task (a factor of 20-30).
Using -Xprof reveals that the usage of 'force0' and 'pwrite' goes through the roof, from 34.1% (76+20 ticks) to 97.3% (3587+2913+751 ticks):
Fast run:
27.0% 0 + 76 sun.nio.ch.FileChannelImpl.force0
7.1% 0 + 20 sun.nio.ch.FileDispatcher.pwrite0
[..]
Slow run:
Interpreted + native Method
48.1% 0 + 3587 sun.nio.ch.FileDispatcher.pwrite0
39.1% 0 + 2913 sun.nio.ch.FileChannelImpl.force0
[..]
Stub + native Method
10.1% 0 + 751 sun.nio.ch.FileDispatcher.pwrite0
[..]
GC and compilation are negligible.
More facts:
No other methods show a significant change in the -Xprof output.
It's either fast or very slow, never something in-between.
Memory is not a problem; all test machines have at least 8 GB and the process uses < 200 MB.
Rebooting the machine does not help.
Switching off virus scanners and similar tools has no effect.
When the process is slow, there is virtually no CPU usage
It is never slow when running it from a normal JVM
It is pretty consistently slow when running it in a JVM that was started from the first JVM (via ProcessBuilder or as ant-task)
All JVMs are exactly the same. I output System.getProperty("java.home") and the JVM options via RuntimeMXBean runtimeMxBean = ManagementFactory.getRuntimeMXBean(); List<String> arguments = runtimeMxBean.getInputArguments();
I tested it on two machines with Windows 7 64-bit, Java 7u2, Java 6u26 and JRockit; the hardware of the machines differs, but the results are very similar.
I also tested it from outside Eclipse (command-line Ant), but there was no difference.
The whole program is written by myself; all it does is read from and write to this file, and no other libraries are used, in particular no native libraries.
And some scary facts that I just cannot make any sense of:
Removing all class files and rebuilding the project sometimes (rarely) helps. The program (nested version) runs fast one or two times before becoming extremely slow again.
Installing a new JVM always helps (every single time!), such that the (nested) program runs fast at least once! Installing a JDK counts as two because both the JDK's JRE and the standalone JRE work fine at least once. Installing a JVM over an existing one does not help, and neither does rebooting. I haven't tried deleting/rebooting/reinstalling yet ...
These are the only two ways I ever managed to get fast program runtimes for the nested program.
Questions:
What may cause this performance drop for nested JVMs?
What exactly do these methods do (pwrite0/force0)?
Are you using local disks for all testing (as opposed to a network share)?
Can you set up Windows with a RAM drive to store the data? When a JVM terminates, its file handles will by default have been closed, but what you might be seeing is the flushing of the data to disk. When you overwrite lots of data, the previous version of the data is discarded and may not cause disk IO. The act of closing the file might make the Windows kernel implicitly flush data to disk. Using a RAM drive would let you confirm that, since disk IO time would be removed from your stats.
Find a tool for Windows that allows you to force the kernel to flush all buffers to disk, use it between JVM runs, and see how long that takes.
But I would guess you are hitting some interaction between the demands of the process and the kernel's attempts to manage the disk block buffer cache. On Linux there is a tool, "/sbin/blockdev --flushbufs", that can do this.
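On the Linux side, a minimal sketch of "flush everything between runs" could look like this (the device name is only an example; on Windows the Sysinternals Sync utility plays a similar role):

    sync                                          # flush dirty pages to disk
    sudo /sbin/blockdev --flushbufs /dev/sda      # flush the block-device buffers
    echo 3 | sudo tee /proc/sys/vm/drop_caches    # optionally drop the page cache as well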
FWIW
"pwrite" is a Linux/Unix API for allowing concurrent writing to a file descriptor (which would be the best kernel syscall API to use for the JVM, I think Win32 API already has provision for the same kinds of usage to share a file handle between threads in a process, but since Sun have Unix heritige things get named after the Unix way). Google "pwrite(2)" for more info on this API.
"force" I would guess that is a file system sync, meaning the process is requesting the kernel to flush unwritten data (that is currently in disk block buffer cache) into the file on the disk (such as would be needed before you turned your computer off). This action will happen automatically over time, but transactional systems require to know when the data previously written (with pwrite) has actually hit the physical disk and is stored. Because some other disk IO is dependant on knowing that, such as with transactional checkpointing.
One thing that could help is making sure you explicitly set the FileChannel to null. Then call System.runFinalization() and maybe System.gc() at the end of the program. You may need more than 1 call.
System.runFinalizersOnExit(true) may also help, but it's deprecated so you will have to deal with the compiler warnings.

Why is downloading from Azure blobs taking so long?

In my Azure web role's OnStart() I need to deploy a huge unmanaged program that the role depends on. The program is compressed beforehand into a 400-megabyte .zip archive, split into files of 20 megabytes each, and uploaded to a blob storage container. That program doesn't change - once uploaded, it can stay that way for ages.
My code does the following:
CloudBlobContainer container = ... ;
String localPath = ...;
using( FileStream writeStream = new FileStream(
    localPath, FileMode.OpenOrCreate, FileAccess.Write ) )
{
    for( int i = 0; i < blobNames.Size(); i++ ) {
        String blobName = blobNames[i];
        container.GetBlobReference( blobName ).DownloadToStream( writeStream );
    }
    writeStream.Close();
}
It just opens a file, then writes the parts into it one by one. It works great, except that it takes about 4 minutes when run from a single-core (Extra Small) instance, which means an average download speed of about 1.7 megabytes per second.
This worries me - it seems too slow. Should it be so slow? What am I doing wrong? What could I do instead to solve my problem with deployment?
Adding to what Richard Astbury said: an Extra Small instance has a very small fraction of the bandwidth that even a Small gives you. You'll see approx. 5 Mbps on an Extra Small, and approx. 100 Mbps on a Small (for Small through Extra Large, you'll get approx. 100 Mbps per core).
The extra small instance has limited IO performance. Have you tried going for a medium sized instance for comparison?
In some ad-hoc testing I have done in the past, I found that there is no discernible difference between downloading 1 large file and downloading N smaller files in parallel. It turns out that the bandwidth of the NIC is usually the limiting factor no matter what, and a large file will saturate it just as easily as many smaller ones. The reverse is not true, by the way: you do benefit from uploading in parallel as opposed to one at a time.
The reason I mention this is that it seems like you should be using one large zip file here and something like Bootstrapper. That would be one line of code for you to download, unzip, and possibly run. Even better, it won't do it more than once on reboot unless you force it to.
As others have already aptly mentioned, the NIC bandwidth on the XS instances is vastly smaller than even a Small instance. You will see much faster downloads by bumping up the VM size slightly.

Analyze Local Network Traffic, Update Quota with tshark and BASH [duplicate]

I have a slightly weird problem and I really hope someone can help with this:
I go to university and the wireless network here issues every login a certain quota/week (mine is 2GB). This means that every week, I am only allowed to access 2GB of the Internet - my uploads and downloads together must total at most 2GB (I am allowed access to a webpage that tells me my remaining quota). I'm usually allowed a few grace KB but let's not consider that for this problem.
My laptop runs Ubuntu and has the conky system monitor installed, which I've configured to display (among other things) my remaining wireless quota. Originally, I had conky hit the webpage and grep for my remaining quota. However, since conky refreshes every 5 seconds and I'm on the wireless connection for upwards of 12 hours, checking the webpage itself eats into my wireless quota.
To solve this problem, I figured I could do one of two things:
Hit the webpage much less frequently so that doing so doesn't kill my quota.
Monitor the wireless traffic at my wireless card and keep subtracting it from 2GB
(1) is what I've done so far: I set up a cron job to hit the webpage every minute and store the result in a file on my local filesystem. Conky then reads this file - no need for it to hit the webpage, and no loss of wireless quota thanks to conky.
This solution is a win by a factor of 12, which is still not enough. However, I'm a fan of realtime data and will not reduce the cron frequency further.
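For reference, a minimal sketch of that cron-driven fetch (the URL, grep pattern, user name, and file path are all hypothetical placeholders):

    # /etc/cron.d/wifi-quota -- fetch the quota page once a minute and cache the figure for conky
    * * * * *  myuser  curl -s 'https://netaccess.example.edu/quota' | grep -o '[0-9]* KB' > /home/myuser/.quota_remaining

Conky can then read the cached file with an ${execi ...} variable instead of hitting the network itself.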
So, the only other solution I have is (2). This is when I found out about Wireshark and its command-line version, tshark. Now, here's what I think I should do:
daemonize tshark
set tshark to monitor the amount (in KB, B, or MB - I can convert this later) of traffic flowing through my wireless card (a rough sketch follows this list)
keep appending this traffic information to file1
sum up the traffic information in file1 and subtract it from 2GB; store the result in file2
set conky to read file2 - that is my remaining quota
set up a cron job to delete/erase the contents of file1 every Monday at 6:30 AM (that's when the weekly quota resets)
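Here is a rough sketch of how the tshark step could look (the interface name and file path are placeholders; capturing needs root or equivalent capture privileges): capture for a fixed interval, print each frame's length, and append the byte total to file1.

    # Capture for 60 seconds on wlan0 and append the total number of bytes seen to file1
    tshark -i wlan0 -a duration:60 -T fields -e frame.len 2>/dev/null \
        | awk '{bytes += $1} END {print bytes}' >> ~/quota/file1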
At long last, my questions:
Do you see a better way to do this?
If not, how do I setup tshark to make it do what I want? What other scripts might I need?
If it helps, the website tells me my remaining quota in KB.
I've already looked at the tshark man page, which unfortunately makes little sense to me, being the network-n00b that I am.
Thank you in advance.
Interesting question. I've no experience using tshark, so personally I would approach this using iptables.
Looking at:
[root@home ~]# iptables -nvxL | grep -E "Chain (INPUT|OUTPUT)"
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
Chain OUTPUT (policy ACCEPT 9763462 packets, 1610901292 bytes)
we see that iptables keeps a tally of the bytes that pass through each chain. So you can presumably go about monitoring your bandwidth usage as follows:
When your system starts up, retrieve your remaining quota from the web
Zero the byte tally in iptables (Use the -z option)
Every X seconds, get the usage from iptables and deduct it from the quota (a minimal sketch follows below)
Here are some examples of using iptables for IP accounting.
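A minimal sketch of that loop, assuming a wireless interface named wlan0 and a quota file that conky reads (all names are placeholders). It inserts counting-only rules so the byte totals are not swallowed by existing ACCEPT rules:

    #!/bin/bash
    # One-time setup (as root): rules with no -j target do nothing but count traffic on the interface.
    IFACE=wlan0
    iptables -I INPUT  1 -i "$IFACE"
    iptables -I OUTPUT 1 -o "$IFACE"

    # Periodic part (e.g. a root cron job): read the counters, zero them, update the quota file.
    QUOTA_FILE=/home/myuser/.quota_remaining_bytes    # seeded from the university quota page
    in_bytes=$(iptables  -nvxL INPUT  1 | awk '{print $2}')
    out_bytes=$(iptables -nvxL OUTPUT 1 | awk '{print $2}')
    iptables -Z INPUT && iptables -Z OUTPUT

    remaining=$(( $(cat "$QUOTA_FILE") - in_bytes - out_bytes ))
    echo "$remaining" > "$QUOTA_FILE"                 # conky displays the contents of this file

Running the periodic part from a root cron job and letting conky read only the quota file also sidesteps the root requirement mentioned in the caveats below.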
Caveats
There are some drawbacks to this approach. First of all, you need root access to run iptables, which means you either need conky running as root, or you run a root cron job which writes the current values to a file that conky has access to.
Also, not all INPUT/OUTPUT packets may count towards your bandwidth allocation, e.g. intranet access, DNS, etc. You can filter out only the relevant connections by matching them and placing them in a separate iptables chain (examples in the link given above). An easier approach (if the disparity is not too large) would be to occasionally grab your real-time quota from the web, reset your values, and start again.
It also gets a little tricky when you have existing iptables rules which are either complicated or uses custom chains. You'll then need some knowledge of iptables to retrieve the right values.
