Clojure: parallel processing using multiple computers - Hadoop

I have 500 directories, and 1000 files (each about 3-4k lines) in each directory. I want to run the same Clojure program (already written) on each of these files. I have 4 octa-core servers. What is a good way to distribute the processes across these cores? Cascalog (Hadoop + Clojure)?
Basically, the program reads a file, uses a 3rd-party Java jar to do computations, and inserts the results into a DB.
Note that: 1. being able to use 3rd-party libraries/jars is mandatory
2. there is no querying of any sort

Because there is no "reduce" stage to your overall process as I understand it, it makes sense to put 125 of the directories on each server and then spend the rest of your time trying to make this program process them faster, up to the point where you saturate the DB, of course.
Most of the "big-data" tools available (Hadoop, Storm) focus on processes that need both very powerful map and reduce operations, perhaps with multiple stages of each. In your case, all you really need is a decent way to keep track of which jobs passed and which didn't. I'm as bad as anyone (and worse than many) at predicting development times, but in this case I'd say there is an even chance that rewriting your process on one of the map-reduce-esque tools will take longer than adding a monitoring process to keep track of which jobs finished and which failed, so you can rerun the failed ones later (preferably automatically).
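For what it's worth, the tracking part can be very small. Here is a minimal Python sketch, assuming the Clojure program is packaged as an uberjar (the name process-file.jar and its command line are placeholders for your real invocation); each server runs this over its 125 directories, appends finished files to done.log and failures to failed.log, and can be rerun safely since already-done files are skipped:

    # Minimal per-server job tracker: run the (assumed) uberjar once per file,
    # record successes and failures so failed files can be rerun later.
    import subprocess
    import sys
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    DONE_LOG = Path("done.log")      # files that finished successfully
    FAILED_LOG = Path("failed.log")  # files to rerun later

    def run_one(path):
        # The jar name and CLI are assumptions; substitute your real invocation.
        result = subprocess.run(
            ["java", "-jar", "process-file.jar", str(path)],
            capture_output=True,
        )
        return path, result.returncode == 0

    def main(root):
        done = set(DONE_LOG.read_text().splitlines()) if DONE_LOG.exists() else set()
        todo = [p for p in Path(root).rglob("*") if p.is_file() and str(p) not in done]
        # 8 concurrent jar invocations per octa-core box; tune down if the DB saturates.
        with ThreadPoolExecutor(max_workers=8) as pool, \
             DONE_LOG.open("a") as ok, FAILED_LOG.open("a") as bad:
            for path, success in pool.map(run_one, todo):
                (ok if success else bad).write(f"{path}\n")

    if __name__ == "__main__":
        main(sys.argv[1])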

Onyx is a recent pure Clojure alternative to Hadoop/Storm. As long as you're familiar with Clojure, working with Onyx is pretty simple. You should give this data-driven approach a try:
https://github.com/MichaelDrogalis/onyx

Related

Make Spark run faster when run locally, for training purposes

My team builds and maintains an API built on top of PySpark SQL. It is meant for production use cases, and does a good job at scaling to large data on clusters. We can also run it locally, which is useful for development, testing, and training people via interactive exercise sessions using Jupyter notebooks.
However, running fairly simple computations on Spark takes a little while, frequently a few dozen seconds, even on a DataFrame of about 50k rows. Our library is for differential privacy, which involves some randomization, so training use cases involve running the same analysis multiple times to get average utility metrics. This means that runtimes quickly reach a few minutes, which is annoyingly long when you're trying to run a 1-2h exercise session.
My question is: are there Spark configuration options I could tweak to lower this runtime for small-data, single-machine use cases, and make teaching a little smoother?
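For what it's worth, the settings that usually matter most for tiny, single-machine runs are the local master and the shuffle partition count; the sketch below shows the knobs to try, and the specific values are suggestions to experiment with rather than guaranteed wins:

    # Sketch of a Spark session tuned for small, local, teaching-sized data.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")                         # use all local cores, no cluster manager
        .config("spark.sql.shuffle.partitions", 1)  # the default of 200 is overkill for ~50k rows
        .config("spark.default.parallelism", 8)     # roughly match the machine's core count
        .config("spark.ui.enabled", "false")        # skip starting the web UI
        .getOrCreate()
    )

Caching the input DataFrame between the repeated analyses can also help, since the randomized runs reuse the same data.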

Parallelization with highly unreliable workers?

Suppose I've got a pool of workers (computers), around 1000 of them, but they're highly unreliable. I expect each to go down multiple times a day, sometimes for extended periods of time. FYI, these are volunteer computers running BOINC (not my botnet, I swear!)
Are there any tools that exist to facilitate using them for parallel computations (mostly trivially parallelizable)? I'm thinking of something like IPython Parallel, where maybe when a node dies the calculation is restarted elsewhere, and maybe where, when a new node joins, it's brought up to speed with the current working environment.
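The restart-on-failure behaviour described above essentially boils down to a requeue loop. The toy Python sketch below only illustrates that pattern; send_to_some_worker is a placeholder for whatever transport a real tool (IPython Parallel, a BOINC work generator, etc.) provides:

    # Toy sketch of the resubmission pattern: any task whose worker dies or
    # times out is put back on the queue and handed to someone else.
    from queue import Queue

    def run_resilient(tasks, send_to_some_worker, timeout_s=600):
        pending = Queue()
        for task_id, payload in tasks.items():
            pending.put((task_id, payload))
        results = {}
        while len(results) < len(tasks):
            task_id, payload = pending.get()
            try:
                results[task_id] = send_to_some_worker(payload, timeout=timeout_s)
            except Exception:
                # Worker vanished or timed out: requeue the task.
                pending.put((task_id, payload))
        return results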

What is the difference and how to choose between distributed queue and distributed computing platform?

There are many files that need to be processed by two computers in real time. I want to distribute them across the two computers, and these tasks need to be completed as soon as possible (meaning real-time processing). I am thinking about the plans below:
(1) a distributed queue like Gearman
(2) a distributed computing platform like Hadoop/Spark/Storm/S4 and so on
I have two questions:
(1) What are the advantages and disadvantages of (1) versus (2)?
(2) If I go with (2), how do I choose: Hadoop? Spark? Storm? S4? Or something else?
Thanks!
Maybe I have not described the question clearly. In most cases there are 1000-3000 files with the same format. These files are independent, you do not need to care about their order, and the size of one file is maybe tens to hundreds of KB; in the future, the number of files and the size of a single file will rise. I have written a program that can process a file, pick out the data, and then store the data in MongoDB. Right now there are only two computers; I just want a solution that can process these files with the program quickly (as soon as possible) and is easy to extend and maintain.
A distributed queue is easy to use in my case but maybe hard to extend and maintain; Hadoop/Spark is too "big" for two computers but easy to extend and maintain. Which is better? I am confused.
It depends a lot on the nature of your "processing". Some dimensions that apply here are:
Are records independent from each other, or do you need some form of aggregation? I.e., do you need some pieces of data to go together? Say, all transactions from a single user account.
Is your processing CPU-bound? Memory-bound? Filesystem-bound?
What will be persisted? How will you persist it?
Whenever you see new data, do you need to recompute any of the old?
Can you discard data?
Is the data somewhat ordered?
What is the expected load?
A good solution will depend on answers to these (and possibly others I'm forgetting). For instance:
If computation is simple but storage and retrieval is the main concern, you should maybe look into a distributed DB rather than either of your choices.
It could be that you are best served by just logging things into a distributed filesystem like HDFS and then running batch computations with Spark (which should generally be better than plain Hadoop).
Maybe not, and you can use Spark Streaming to process as you receive the data.
If order and consistency are important, you might be better served by a publish/subscribe architecture, especially if your load could be more than what your two servers can handle, but there are peak and slow hours where your workers can catch up.
etc. So the answer to "how do you choose?" is "by carefully looking at the constraints of your particular problem, estimating the load demands on your system, and picking the solution that best matches those". None of these solutions and frameworks dominates the others; that's why they are all alive and kicking. The choice is all in the tradeoffs you are willing/able to make.
Hope it helps.
First of all, dannyhow is right - this is not what real-time processing is about. There is a great book, http://www.manning.com/marz/, which says a lot about the lambda architecture.
The two approaches you mentioned serve completely different purposes and are connected to the definition of the word "task". For example, Spark will take a whole job you give it and divide it into "tasks", but the outcome of one task is useless to you on its own; you still need to wait for the whole job to finish. You can create small jobs working on the same dataset and use Spark's caching to speed them up. But then you won't get much advantage from distribution (if they have to be run one after another).
Are the files big? Are they somehow connected to each other? If yes, I'd go with Spark. If not, a distributed queue.
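To make the queue option concrete, a minimal setup for two machines could look like the following sketch. It uses Redis purely as an example broker (Gearman, which the question mentions, works the same way); the host name, queue name, and the process_and_store hook are placeholders for your existing program:

    # Option (1) in miniature: both computers consume file paths from one shared queue.
    import redis

    QUEUE = "files-to-process"
    r = redis.Redis(host="queue-host", port=6379)

    def enqueue(paths):
        # Run wherever the files arrive: push each path onto the shared queue.
        for p in paths:
            r.lpush(QUEUE, p)

    def worker(process_and_store):
        # Run one copy of this loop per core on each of the two computers.
        while True:
            _, path = r.brpop(QUEUE)          # blocks until a file is available
            process_and_store(path.decode())  # your existing program: parse + write to MongoDB

Adding capacity later just means pointing more worker loops (or more machines) at the same queue, which covers the "easy to extend" requirement.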

Best method of having a single process distributed across a cluster

I'm very new to cluster computing and wanted to know more about the various software used for cluster computing and which is best for particular tasks. In particular, the problem I am trying to solve involves a Manager/Workers scenario, where a single Manager is responsible for the creation of 100s to 1000s of jobs. Each job, while relatively large overall, must execute on a small frame-by-frame basis, i.e. the Manager will tell each job, "advance one frame and report back to me". The execution of a single frame will be very short, so the latency between the Manager and the worker machines must be very small, on the order of microseconds.
Thank you! Any information would be appreciated, even stuff that doesn't perfectly fit the scenario I described, just to give me a starting point. Some that I have researched so far are Hadoop, HTCondor, and Akka.
Since communication latency is important to you, you should probably consider using MPI. It's not too difficult to write simple Master/Worker programs using MPI, and it will probably give you the best performance, especially if your cluster has high-performance networking such as InfiniBand.
If, as it seems, you're using Java, you will have to do some research to determine a good Java/MPI package. You'll find some suggestions here: Java openmpi.
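To show the shape of such a Master/Worker program, here is a sketch using mpi4py (Python) rather than a Java MPI binding; a Java package would follow the same send/receive structure, and advance_frame stands in for the real per-frame work:

    # Master/worker sketch: rank 0 hands one frame of work to each worker,
    # collects the results, and repeats; tag 0 tells workers to shut down.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    def advance_frame(job_state):
        return job_state + 1  # placeholder for the real per-frame computation

    if rank == 0:
        n_workers = comm.Get_size() - 1
        states = {w: 0 for w in range(1, n_workers + 1)}
        for frame in range(1000):
            for w in range(1, n_workers + 1):
                comm.send(states[w], dest=w, tag=1)
            for w in range(1, n_workers + 1):
                states[w] = comm.recv(source=w, tag=2)
        for w in range(1, n_workers + 1):
            comm.send(None, dest=w, tag=0)  # shut the workers down
    else:
        status = MPI.Status()
        while True:
            job_state = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
            if status.Get_tag() == 0:
                break
            comm.send(advance_frame(job_state), dest=0, tag=2)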

Hybrid : OpenMPI + OpenMP on a cluster

I am numerically solving some ordinary differential equations.
I have a conceptually very simple, but very long, computation. There is a very long array (~2M cells), and for each cell I need to perform a numerical integration. This procedure should be repeated 1000 times. Using OpenMP parallelism on one 24-core machine, it takes around a week, which is not acceptable.
I have a cluster of 20 such (24-core) machines and am thinking about a hybrid implementation: use MPI to spread the work over these 20 nodes and, within each node, use regular OpenMP parallelism.
Basically, I need to split my very long array into 20 (nodes) x 24 (cores) working units.
Are there any suggestions for a better implementation or better ideas? I've read a lot on this subject and have got the impression that such a hybrid implementation does not always bring a real speed-up.
Maybe I should create a "pool of workers" and "feed" them pieces of my array, or something else.
Any suggestion and useful links are welcome!
If your computation is as embarrassingly parallel as you indicate, you should expect good speedup by spreading the load across all 20 of your machines. By good I mean close to 20x, and by close to 20x I mean any number you actually get that leaves you thinking the effort has been worthwhile.
Your proposed hybrid solution is certainly feasible and you should get good speedup if you implement it.
One alternative to a hybrid MPI+OpenMP program would be a job script (written in your favourite scripting language) which simply splits your large array into 20 pieces and starts 20 jobs, one on each machine running an instance of your program. When they've all finished have another script ready to recombine the results. This would avoid having to write any MPI code at all.
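As a rough illustration of that script-only route, the split and recombine steps might look like the sketch below (file names, the chunk count, and the launch mechanism are placeholders for whatever fits your cluster):

    # Split the big array into one chunk per machine, launch the existing
    # program on each chunk independently, then recombine the results.
    import numpy as np

    N_MACHINES = 20

    def split(input_file="cells.npy"):
        cells = np.load(input_file)
        for i, chunk in enumerate(np.array_split(cells, N_MACHINES)):
            np.save(f"chunk_{i:02d}.npy", chunk)
            # Launch however you like: ssh, a scheduler, a Grid Engine array job...
            print(f"ssh node{i:02d} ./integrate chunk_{i:02d}.npy result_{i:02d}.npy")

    def recombine(output_file="result.npy"):
        parts = [np.load(f"result_{i:02d}.npy") for i in range(N_MACHINES)]
        np.save(output_file, np.concatenate(parts))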
If your cluster has an installation of Grid Engine, you can probably write a job submission script to submit your work as an array job and let Grid Engine take care of parcelling the work out to the individual machines/tasks. I expect that other job-management systems have similar facilities, but I'm not familiar with them.
Another alternative would be an all-MPI code, that is, drop OpenMP altogether and modify your code to use whatever processors it finds available when you run it. Again, if your program requires little or no inter-process communication, you should get good speedup.
Using MPI on a shared memory computer is sometimes a better (in performance terms) approach than OpenMP, sometimes worse. Trouble is, it's difficult to be certain about which approach is better for a particular program on a particular architecture with RAM and cache and interconnects and buses and all the other variables to consider.
One factor I've ignored, largely because you've provided no data to consider, is the load-balancing of your program. If you split your very large dataset into 20 equal-sized pieces, do you end up with 20 equal-duration jobs? If not, and if you have an idea of how job time varies with the inputs, you might do something more sophisticated in splitting the job up than simply chopping your dataset into those 20 equal pieces. You might, for instance, chop it into 2000 equal pieces and serve them one at a time to the machinery for execution. In this case, what you gain in load-balancing might be at risk of being lost to the time costs of job management. You pays yer money and you takes yer choice.
From your problem statement I wouldn't be making a decision about which solution to go for on the basis of expected performance, because I'd expect any of the approaches to get into the same ballpark performance-wise, but on the time to develop a working solution.
