I've been hearing a bunch about Apache Thrift lately, though I know very little about it. I understand that it's a remote procedure call framework and abstracts calling functions across languages and on different machines. I've looked into MPI and found it's absurdly low-level. Would Thrift be a good higher level replacement to allow parallel computation to be performed on a networked group of machines?
The answer depends on your performance requirements. If you are looking for pure computational power using the networked group of machines then thrift is not quite ready.
Thrift has its own serialization to abstract away the type conversion between languages and versions of the API. This is great for enterprise/client server systems which can take the performance hit of doing these conversions for the benefits that come from allowing clients and servers in different languages. However for a high performance networked group of machines this may be useless as your nodes will probably use the same language.
Also the asynchronous I/O is fairly new and immature for most of the languages which means the use of blocking network I/O. This is probably not ideal for what you want to do.
I use thrift extensively and it solves a lot of problems and the community is fairly active. However its probably not the right tool for your problem.
Related
So as I have asked in a previous post, I want to be able to make programs or functions written in different languages to communicate between them.
I have come across zeromq recently and I'm trying to figure out whether or not this is something that could help me since it provides some sort of sockets. Can zeromq for example exchange data (or pass arguments) between a program written in python with a program or a function written in C++ or is its function for something completely different?
A: Oh Yes, exactly that is the power of ZeroMQ or nanomsg frameworks
Both of these are not sockets but rather BEHAVIOUR created within a context of a Zero-* -- a set of courageous maxims the Scaleable Formal Communication Pattern Framework was designed, developed and finetuned to meet.
That will enable you to assemble your own fast & smart messaging layer(s).
Q: What is the best next step?
In spite of your first impression, simly do forget anything you know about sockets and multithreaded synchronisation tricks.
Yes, rather forget and build your new understanding on "green field".
Take Pieter HINTJENS' book "Code Connected, Volume 1" (accessible in PDF ) and spend a few weeks on understanding both the motivation and the typical errors Pieter has hammered into this must-read bible of ZeroMQ.
Code-snippets are dangerous in case you did not get or completely missed the full-context of the bigger picture.
Believe me. I could not give you better advice. You may check my other posts on ZeroMQ & nanomsg, to see the difference.
You will definitely benefit from this book and ZeroMQ will give you many powers you would never ( and believe me never ) would be ready to program from scratch on your own. The power is so immense ( if well re-used ).
nota bene
For real-world inter-process communications, there is one minor issue to be aware of. Various ZeroMQ versions' inter-operability. Yes, the power of ZeroMQ is immense, nevertheless, it is necessary to keep the version control built in your messaging layer so as to solve situations, where some platforms do not have an update-path to "newer" releases available. Went into this issue with re-integration of a trading system with a component, where as old as zmq.__version__ == 2.1.11 was necessary, while recent are versions well above 14.x.y, so as to be assured to be 100% end-to-end backward-compatible.
Still, the overall potential is so immense, it makes sense to persevere and get the job done. G/L on that.
ZeroMQ is an abstraction of sockets. It is cross platform and have lots of language binding: I personally don't know any language that doesn't have ZeroMQ bindings.
So yes, you can use ZeroMQ to communicate between a program written in Python and program written in C++.
I recommend going through the zguide as it contains a lot of very useful information about ZeroMQ.
PyZMQ can be used as Python binding, and zmqpp for your C++ code. Note that for the C++ code you could also use cppzmq or the zmq C API directly. I would recommend using zmqpp as its higher level and (imho) easier to use.
Are there analogs of Intel Cluster OpenMP? This library simulates shared-memory machine (like SMP or NUMA) while running on distributed memory machine (like Ethernet-connected cluster of PC's).
This library allows to start openmp programs directly on cluster.
I search for
libraries, which allow multithreaded programms run on distributed cluster
or libraries (replacement of e.g. libgomp), which allow OpenMP programms run on distributed cluster
or compilers, capable of generating cluster code from openmp programms, besides Intel C++
The keyword you want to be searching for is "distributed shared memory"; there's a Wikipedia page on the subject. MOSIX, which became openMOSIX, which is now being developed as part of LinuxPMI, is the closest thing I'm aware of; but I don't have much experience with the current LinuxPMI project.
One thing you need to be aware of is that none of these systems work especially well, performance-wise. (Maybe a more optimistic way of saying it is that it's a tribute to the developers that these things work at all). You can't just abstract away the fact that accessing on-node memory is very very different from memory on some other node over a network. Even making local memory systems fast is difficult and requires a lot of hardware; you can't just hope that a little bit of software will hide the fact that you're now doing things over a network.
The performance ramifications are especially important when you consider that OpenMP programs you might want to run are almost always going to be written assuming that memory accesses are local and thus cheap, because, well, that's what OpenMP is for. False sharing is bad enough when you're talking about different sockets accessing a common cache line - page-based false sharing across a network is just disasterous.
Now, it could well be that you have a very simple program with very little actual shared state, and a distributed shared memory system wouldn't be so bad -- but in that case I've got to think you'd be better off in the long run just migrating the problem away from a shared-memory-based model like OpenMP towards something that'll work better in a cluster environment anyway.
We are looking into the solution that facilitate massively parallel data processing. Our processing graphs are often rather complex, so well developed operator framework like one Pervasive DataRush provides comes handy. Does anyone know any alternative solutions to one from Pervasive? DataRush is Java but I would like to consider all platforms and languages for which such solutions are available.
Not sure if you'd get usefulness out of something like hadoop/map-reduce. That's what they seem to compare themselves on their marketing collateral. Of course, they also claim that they're a better/different solution.
Pervasive DataRush exploits fine-grain parallelism on multicore servers and clusters.
With the release of V5.0 on 2 Feb 2011, Pervasive DataRush now supports all JVM languages, including Java, JRuby, Python and Scala. It also supports integration with Hadoop/MapReduce. It is complementary to Hadoop.
Look at https://github.com/rfqu/df4j - simple but powerful dataflow library. Has Actors and other dataflow constructs. It lacks persistence yet, though, but has interface with NIO2 asyncronous channels
My language of choice is Ruby, but I know because of twitter that Ruby can't handle a lot of requests. It is a good idea using it for socket development? or Should I use a functional language like erlang or haskell or scala like twitter developers did?
The company I work for uses Ruby for our web site. We have so far handled a little over 34,000,000,000 hits. We have no problem handling around 10,000,000 hits per day. Peak hits have exceeded 40,000,000 hits per day.
Scalability depends on a lot of factors. Our databases do a disproportionately high percentage of writes compared to reads, for example. While most websites do about 90% reads to 10% writes, we are closer to 50%-50%. My point is that scalability is affected by a lot of factors. If you are database-limited, as is often the case for web apps, it won't matter what language you use, you'll be waiting on your database.
There's a lot to think about if you are looking at handling large scales. Sharding databases, memcached, etc. etc. etc. etc. The language you use for your application is just one aspect, and often, though not always, a small aspect of scalability.
Ruby may be a good option for you, but there's a lot to like in other languages. Erlang tries hard to make it easier to recover from errors, for example.
I'm not sure that any "lessons" that the Twitter team has learned about Ruby (more specifically, Rails) and scaling would apply to your project. They're looking at WAY more traffic than most people can reasonably expect to see.
As far as sockets and Ruby go, check out I like Unicorn because it's Unix. It's quite an interesting read about doing sockets in Ruby.
I'd like to provide a bit of context first. I'm pretty active with the Scala community, and I would choose Scala over Ruby for any project.
So, having said that, keep with Ruby unless you actually hit barrier. If Ruby is your language of choice, it might just be that you'll never be happy with the choices you mention, particularly the statically typed ones.
It might be good to learn a new language, to have something to fall back on if you need an alternative. In your case, I'd recommend Clojure or Erlang. Scala is a good statically typed, OO language with functional programming perks. It might be easier to learn than the others, but people who really like dynamic typing don't convert to static typing easy.
As for Haskell, it's one of the most awesome languages out there (and much more well support and popular than the equally awesome alternatives), and can open your mind like nothing else. It's also tough to master.
If ruby is your favorite language, yes it is a good idea. It is always better to use what you know and what you like
Whereas you may get better performance from a functional language such as Erlang the suitability of Ruby will really depend on what you are trying to achieve. For example how many requests are you going to be handling is probably the first question, if the performance benefits of using Erlang don't make much difference use something you are comfortable with, why learn a new language if you don't have to?
You at least have the option of staying in your favorite high level language if you use a fast, concurrent language like Haskell, Erlang or Scala. With Ruby, performance bottlenecks will mean switching to compiled C (or Haskell, or ...) for speed anyway.
Ruby has the advantage of good frontend frameworks.
I have also used Ruby for many projects though I've recently moved to Scala and like it quite a bit. One thing that I've heard good things about (but never tried myself) for network stuff in Ruby is EventMachine. It uses the Reactor Pattern just like twisted and it seems quite solid.
The key is to have a low level library in C/C++ that does the socket multiplexing for you. Socket multiplexing is what makes a TCP server process truly multi-user. such libraries in C (which is what you want) could be libevent/libev... and in c++ boost::asio. Python has twisted that does it behind the scenes.
If you get such a library and use it in ruby you should be able to implement most socket programs fairly well. This is especially true on UNIX oses which favour multi-process to multi-threading.
Having recently written (actually still doing so now), a project using sockets with Ruby and Java I would say no. The ruby socket implementation is poorly documented unless you plan on writing a basic blocking chat server. I found writing in C or Java simpler, Ruby wraps up native sockets and your kinda left wondering how the hell to use it now. I have previously written plenty of socket code on windows, Linux and other platforms in C, with less stress.
My Ruby code now is very small and works well, getting to that point was a real pain.
How can i connect two or more machines to form a network grid and how can i distribute work load to the two machines?
What operating systems do i need to run on the machines, and what application should i use to manage the load balancing?
NB: I read somewhere that google uses cheap machines to perform this fete, how do they connect two network cards( 'Teaming' ) and distribute load across the machines?
Good practical examples would serve me good, with actual code samples.
Pointers to some good site i might read this stuff will be highly appreciated.
An excellent place to start is with the Beowulf project. Basically an opensource cluster built on the Linux OS.
There are several software solutions in this expanding market. The term "cloud computing" is certainly gaining traction to describe what you want to do. Are you wanting a service, or do you want to run it in house?
I'm most familiar with Appistry EAF - Runs on commodity based hardware. Its available as a free download. Runs on windows or linux.
Another is GoGrid - I believe this is only available as a service, but I'm not as familiar with it.
There are many different approaches to parallel processing, and many types of system architectures you could use.
For commodity systems, there are clusters and grids, or you can even form a single system image from several pieces of commodity hardware. There is of course also load balancing, high availability, failover, etc.
It's pretty much impossible to answer this question without more detail. What exactly do you want to with these systems? The answer is very highly dependent on the application.
You might want to have a look at some of the stuff to do with Folding At Home and the SETI project, and some of their participants blogs, here is a pretty amazing cluster that a guy built in an IKEA cabinet:
http://helmer.sfe.se/
Might give you some ideas.
The question is too abstract.
One of the (imaginable) ways is to use MPI - a framework for parallel programming, the Wikipedia page includes examples in C++, and there are bindings for other languages.