Windows Azure: Parallelization of the code

I have some matrix multiplication operations that I want to parallelize across multiple processors. On a high-performance computing cluster this can be done with MPI (Message Passing Interface).
Likewise, can I do some parallelization in the cloud using multiple worker roles? Is there any way to do that?

The June release of the Azure SDK and Tools, version 1.2, supports .NET 4, so you can now take advantage of the parallel extensions included with .NET 4, such as Parallel.ForEach() and Parallel.For().
The Task Parallel Library (TPL) will only help you on a single VM - it won't divide your work across multiple VMs. But if you set up, say, a 2-, 4-, or 8-core VM, you should see significant performance gains from parallel execution.
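As a rough illustration of that single-VM data parallelism, here is a Python stdlib sketch in the spirit of Parallel.For (not the TPL itself; all names are hypothetical): each row of the matrix product becomes an independent work item handed to a pool.

```python
# Single-VM parallelism sketch in the spirit of Parallel.For (Python
# stdlib, not the TPL): each row of the result matrix is an independent
# work item handed to a pool of workers.
from concurrent.futures import ThreadPoolExecutor

def row_times_matrix(row, b):
    """Multiply one row of A by the whole of B."""
    cols = range(len(b[0]))
    return [sum(a * b[k][j] for k, a in enumerate(row)) for j in cols]

def parallel_matmul(a, b):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(row_times_matrix, row, b) for row in a]
        return [f.result() for f in futures]

# parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# -> [[19, 22], [43, 50]]
```

The TPL analog schedules these work items across all cores of the VM, which is where the multi-core instance sizes pay off.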
Now, if you want to divide work across instances, you'll need a way of assigning work to each one. For example: set up one worker role as a coordinator VM, and another worker role, with n instances, as the computation VM. Have the coordinator enumerate all instances of the computation role and divide the work n ways, sending 1/n of the work messages to each instance via WCF calls over an internal endpoint. Each instance processes its work messages (potentially with the TPL as well) and stores its result in blob or table storage, accessible to all instances.
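The coordinator's "divide the work n ways" step can be as simple as partitioning row ranges of the result matrix; a sketch with hypothetical names, independent of the WCF plumbing:

```python
# Sketch of the coordinator's work-splitting step (hypothetical helper,
# not an Azure API): divide the rows of the result matrix into n
# near-equal contiguous chunks, one per computation instance.
def split_rows(total_rows, n_instances):
    """Return (start, end) row ranges, one per instance."""
    base, extra = divmod(total_rows, n_instances)
    ranges, start = [], 0
    for i in range(n_instances):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

# split_rows(10, 4) -> [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each work message then only needs to carry its (start, end) range plus a reference to the input matrices in blob storage.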

In addition to message passing, Azure Queues are perfect for this situation: each worker role instance reads work items from the queue rather than being assigned them through enumeration. This is a much less brittle approach, since the number of workers may change dynamically as you scale.
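The pattern the queue buys you looks roughly like this (an in-process queue.Queue standing in for an Azure Queue; the worker count can vary without any re-division of the work):

```python
# In-process sketch of the queue-driven pattern (queue.Queue standing in
# for an Azure Queue): workers pull messages until the queue is drained,
# so the number of workers can change without re-dividing the work.
import queue
import threading

def worker(q, results):
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return                   # queue drained, worker exits
        results.append(item * item)  # placeholder "computation"
        q.task_done()

q = queue.Queue()
for i in range(8):                   # enqueue 8 work items
    q.put(i)

results = []
threads = [threading.Thread(target=worker, args=(q, results)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a real Azure Queue the same loop holds, except messages must also be deleted after successful processing so a crashed worker's message reappears for another worker.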

Related

On memory-intensive on-demand compute

We are considering using Azure Functions to run some compute. However, our computations could take up a lot of memory, let's say more than 5 GB.
As I understand it, there is no easy way to scale Azure Functions based on memory usage, i.e., "if you reach 15 GB, start a new instance" (since you don't want to run over the maximum memory of your instance).
Is there a way around this limitation?
OR
Is there another technical alternative to Azure Functions that provides pay-per-use and allows rapid scaling on demand?
Without any further details about what you are actually trying to compute, it is very hard to give you any meaningful advice.
But there are a few things you could consider.
For example, if you are processing CSV files and you know you need 1 GB of memory to process a file, then "spin up" one AWS Lambda per file. You could use serverless orchestration like AWS Step Functions to coordinate your processing. This way you are effectively splitting up your compute-intensive tasks.
Another option would be to automate "pay per use" yourself. You could use automation tools like Terraform to start an EC2 spot instance, run your computing task, and shut the instance down once it finishes. That is more or less pay-as-you-go, with a bit of operations overhead on your part.
There are also services like AWS Fargate, which is marketed as "serverless compute for containers", allowing you to run Docker containers in a pay-per-use manner. Fargate allows provisioning of up to 30 GB of memory.
Another option would be to use services like ElastiCache or Memcached to "externalize" your memory.
Which of these options is best for you is hard to tell, because it depends on your constraints: does everything need to be stored in the "same memory", or can it be split up? What are your data structures? Can the data be processed in chunks (to minimise memory usage)? Is latency important? How long does each task take, and how often do you need to run your tasks?
Note: There are probably equivalent Azure services to the AWS services I talked about in this answer.

When would I want to create more than one verticle (assuming I am using non-blocking DB clients in a stateless microservice)?

Assuming
I am building a stateless microservice which exposes a few simple API endpoints (e.g., a read-through cache, saving an entry to the database),
I am using non-blocking database clients, e.g., for MySQL or Redis, and
I always want my microservices to speak to each other via HTTP (by placing EC2 instances behind a load balancer)
Questions
When would I want to use more than one standard verticle (rather than writing the whole microservice as a single verticle and deploying n instances of it, where n = number of event-loop threads)? Won't adding more verticles only add serialization and context-switching costs?
Let's say I split the microservice into multiple standard verticles (for whatever reason). Wouldn't deploying n instances of each (n = number of event-loop threads) always give better performance than deploying a different ratio of instances? Each verticle is just a listener on an address, so every event-loop thread could handle all kinds of messages, and they are load-balanced already.
When would I want to run my application in cluster mode? From the docs, I get the feeling that cluster mode only makes sense when you have multiple verticles and an actual use case for clustering, e.g., different EC2 instances handling requests for different users to help with data locality (say, using Ignite).
P.S. Please help even if you can only answer one of the questions above.
I always want my microservices to speak to each other via HTTP (by placing EC2 instances behind a load balancer)
It doesn't make much sense to use Vert.x if you have already gone for this overcomplicated approach.
Vert.x uses the Event Bus for in-cluster communication, eliminating the need for HTTP as well as for a load balancer in front.
Answers:
Why should it? If verticles aren't talking to each other, where would the serialization overhead occur?
If your verticles use non-blocking calls (and are thus multithreaded), you won't see any difference between 1 or N instances on the same machine. Also, if your verticle starts an (HTTP) server on a certain port, all instances will share that single server across all threads (Vert.x does some magic rerouting here).
Cluster mode is what I mentioned at the beginning. It is the proper way to distribute and scale your microservices.
A verticle is a way to structure your code. So you'd probably want a verticle of another type when your main verticle grows too big. How big? That depends on your preferences. I try to keep them rather small, about 200 LOC at most, doing one thing.
Not necessarily. Different verticles can perform very different tasks at different paces. Having N instances of all of them isn't necessarily bad, but it is redundant.
Probably never. Clustered mode was a thing before microservices. Using it adds another level of complexity (cluster manager, for example Hazelcast), and it also means you cannot be as polyglot.

Azure web and worker role - 2x small instances or 1x medium?

Which is better in terms of performance, 2 medium role instances or 4 small role instances?
What are the pro's and cons of each configuration?
The only real way to know whether you gain an advantage from larger instances is to try and measure, not to ask here. This article has a table showing that a medium instance has twice the resources of a small one. However, in real life your mileage may vary, and only you can measure how this affects your application.
Smaller roles have one important advantage: since instances fail independently, you get smaller performance degradation per failure. Given the "guaranteed uptime" requirement of having at least two instances, you have to choose between two medium and four small instances. If one small instance fails you lose 1/4 of your performance, but if one medium instance fails you lose half.
Instances will fail if, for example, you have an unhandled exception inside Run() of your role entry point descendant; sometimes something just goes badly wrong, your code can't handle it, and it's better to simply restart. Not that you should deliberately aim for such failures, but you should expect them and take measures to minimize the impact on your application.
So the bottom line is: it's impossible to say which gives better performance, but the uptime implications are just as important, and they clearly favor smaller instances.
Good points by @sharptooth. One more thing to consider: when scaling in, the minimum number of instances is one, not zero. So, say you have a worker role that does some nightly task for an hour, and it requires either 2 Medium or 4 Small instances to get the job done in that timeframe. When the work is done, you may want to save costs by scaling down to one instance and letting it run for 23 hours until the next nightly job. With a single Small instance, you'll burn 23 core-hours; with a single Medium instance, you'll burn 46 core-hours. This thinking also applies to your web role, but probably more so, since you will probably want a minimum of two instances to keep your uptime SLA (that may be less important for your worker if, say, your end user never interacts with it and it's just for utility purposes).
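Spelled out, that scale-in arithmetic (assuming the classic role sizes of 1 core for Small and 2 cores for Medium) looks like:

```python
# Core-hours burned during the scaled-in period (classic Azure role
# sizes assumed: Small = 1 core, Medium = 2 cores).
SMALL_CORES, MEDIUM_CORES = 1, 2
idle_hours = 23  # hours per day spent at the one-instance floor

small_idle = SMALL_CORES * idle_hours    # 23 core-hours
medium_idle = MEDIUM_CORES * idle_hours  # 46 core-hours
```

Since compute is billed per core-hour, the smaller size halves the cost of the idle floor.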
My general rule of thumb when sizing: Pick the smallest VM size that can properly do the work, and then scale out/in as needed. Your choice will primarily be driven by CPU, RAM, and network bandwidth needs (and don't forget you need network when moving data between Compute and Storage).
For a start, you won't get the guaranteed uptime of 99% unless you have at least 2 role instances; this allows one to die and be restarted while the other takes the burden. Otherwise, it is a case of how much you want to pay and what specs you get for each. Having more than one role instance has not caused me any hassle - Azure hides the difficult stuff.
One other point worth a mention: if you use four small roles, you could run two in one datacenter and two in another, and use Traffic Manager to route people to whichever is closer. This might give you some performance gains.
Two mediums give you more room to cache things at the compute level; the more you serve from cache rather than from SQL Azure, the faster it's going to be.
Ideally you should follow @sharptooth's advice and measure and test. This is all very subjective, and I second David: you want to start as small as possible and scale outward. We run this way. You really want to think about designing your app around sharding, as this fits the Azure model better than the traditional approach of just getting a bigger box to run everything on; at some point the bigger-box approach runs into limits, e.g. SQL Azure connection limits.
Tools like JMeter are your friend here and should help you load-test your app.
http://jmeter.apache.org/

What is an appharbor worker...exactly?

The AppHarbor pricing page defines a worker as something you increase to "improve the reliability and responsiveness of your website". But in trying to compare prices with others such as AWS, I am having a hard time defining what a worker is exactly.
Anyone have a better definition than "more is better"?
From this thread:
AppHarbor is a multi-tenant platform and we're running multiple applications on each application server. A worker is an actual worker process that is limited in terms of the amount of resources it can consume.
...
2 workers will always be on two different machines. We're probably going to reuse machines when you scale to more than that and increase process limits instead (this could yield better performance as you need to populate fewer local caches etc.)

Best practices for deploying a high performance Berkeley DB system

I am looking to use Berkeley DB to create a simple key-value storage system. The keys will be SHA-1 hashes, so they are in a 160-bit address space. I have a simple server working; that was easy enough thanks to the fairly well-written documentation on the Berkeley DB website. However, I have some questions about how best to set up such a system for good performance and flexibility. Hopefully, someone has had more experience with Berkeley DB and can help me.
The simplest setup is a single process, with a single thread, handling a single DB; inserts and gets are performed on this one DB, using transactions.
Alternative 1: single process, multiple threads, single DB; inserts and gets are performed on this DB, by all the threads in the process.
Does using multiple threads provide much performance improvement? There is one single DB, and therefore it's on one disk, so I'm guessing I won't get much of a boost. But if Berkeley DB caches a lot of data in memory, then perhaps one thread can answer from the cache while another is blocked waiting for disk? I am using GNU Pth, a user-level cooperative threading library. I am not familiar with the details of Pth, so I'm also not sure whether with Pth one user-level thread can run while another is blocked.
Alternative 2: single process, one or multiple threads, multiple DBs where each DB covers a fraction of the 160-bit address space for keys.
I see a few advantages in having multiple DBs: we can put them on different disks, there is less contention, and it's easier to move/partition DBs onto different physical hosts if we want to. Does anyone have experience with this setup and see significant benefits?
Alternative 3: multiple processes, each with one thread, each handles a DB that covers a fraction of the 160-bit address space for keys.
This has the advantages of multiple DBs, but uses multiple processes. Is it better than the second alternative? I suspect that using processes rather than user-level threads for parallelism will give better SMP caching behavior (fewer invalidations, etc.), but will I get killed by all the process overhead and context switches?
I would love to hear whether someone has tried these options and seen positive or negative results.
Thanks.
Alternative 2 gives you high scalability: you basically partition your database across multiple servers. If you need a high-performance distributed key/value database, I would suggest looking at Membase. I am doing that right now, but we need to run on an appliance and would like to limit dependencies (on Membase).
You can also use Berkeley DB replication and keep read-only copies, with servers to serve read/get requests.
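The key-space split in Alternative 2 can be sketched in a few lines (illustrative Python, not the Berkeley DB API; names are hypothetical). Since the keys are already SHA-1 digests, they are uniformly distributed over the 160-bit space, so the top bits of the key pick a partition evenly:

```python
# Sketch of Alternative 2's key-space split (illustrative, not the
# Berkeley DB API): keys are already SHA-1 digests, so they are uniform
# over the 160-bit space and the top bits pick a partition evenly.
N_PARTITIONS = 4   # a power of two, so prefixes split the space evenly
PREFIX_BITS = 2    # log2(N_PARTITIONS)

def partition_for(sha1_key: bytes) -> int:
    """Map a 20-byte SHA-1 key to a DB index in [0, N_PARTITIONS)."""
    return sha1_key[0] >> (8 - PREFIX_BITS)

# partition_for(bytes.fromhex("ff" + "00" * 19)) -> 3
```

Because the boundaries fall on bit prefixes, each DB (or each host, if you later move DBs onto separate machines) owns a contiguous, equal slice of the address space.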
