I have something in mind but I don't know the typical solution that could help me achieve that.
I need a distributed environment where not only memory is shared but processing is shared too; that is, all the shared processors work as one big processor executing the code I wrote.
Can this be achieved, given that I have limited knowledge of data grids and Hadoop?
I have looked at data grid platforms (where, as I understand it, only memory is shared) and at Hadoop (where the code is shipped to the nodes, but each node runs it independently of the others on its own subset of the data in HDFS).
But I need a solution that shares not only memory or code, as Hadoop does, but also the processing power of all the machines, so that they act as one single big processor with one single big memory.
Do you expect to just spawn a thread and have it executed somewhere, with the middleware miraculously balancing the load across nodes and moving threads from one node to another? I don't think you will find this directly. The tagged frameworks don't have transparent shared memory either, for good reasons.
When you use multiple nodes, you usually want them for their processing power; hiding everything and pretending you're on a single machine tends to cause unnecessary communication, which slows things down.
Instead, you can always design your app using the distribution API provided by those frameworks. For example in Infinispan, look for the Map-Reduce or Distributed Executors API.
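To give a feel for what an executor-style API asks of you, here is a minimal sketch in Python. It is not Infinispan's API: a local thread pool stands in for the cluster nodes, and the names run_on_workers and count_words are made up for illustration. The point is only the shape of the design: you split the work explicitly, submit one task per piece, and combine the partial results yourself.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def count_words(chunk):
    """Hypothetical unit of work: count the words in one chunk of text."""
    return len(chunk.split())


def run_on_workers(task, chunks, workers=4):
    """Submit one task per chunk and combine the partial results.

    The thread pool is a local stand-in for remote nodes (an assumption
    made for this sketch); a real data grid would ship the task to the
    node that owns each chunk.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, chunk) for chunk in chunks]
        return sum(future.result() for future in as_completed(futures))
```

Nothing here is transparent: the caller decides how the data is chunked and how results are merged, which is exactly what lets the framework avoid unnecessary communication.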
I need a distributed environment where not only memory is shared but processing is shared too; that is, all the shared processors work as one big processor executing the code I wrote.
You do not benefit from doing all the processing on a single machine. An application scales when its processing is spread across multiple machines. If you want to see the benefits of one big processor, you can virtualize a big physical machine into multiple virtual nodes (using technologies like VMware).
But distributed processing across multiple VM nodes on multiple physical machines in a big cluster is best for distributed applications. Hadoop or Spark is the best fit for this type of application, depending on whether you need batch processing (Hadoop) or real-time processing (Spark).
I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.
For example, one common rule of thumb advice I see can be summarized as...
Big data, non-iterative, fault tolerant => MapReduce
Speed, small data, iterative, non-Mapper-Reducer type => MPI
But then, I also see an implementation of MapReduce on MPI (MR-MPI), which does not provide fault tolerance but seems to be more efficient on some benchmarks than MapReduce on Hadoop, and seems to handle big data using out-of-core memory.
Conversely, there are also MPI implementations (MPICH2-YARN) on new generation Hadoop Yarn with its distributed file system (HDFS).
Besides, there seem to be provisions within MPI (Scatter-Gather, Checkpoint-Restart, ULFM and other fault tolerance) that mimic several features of the MapReduce paradigm.
And how do Mahout, Mesos and Spark fit into all this?
What criteria can be used when deciding between (or a combo of) Hadoop MapReduce, MPI, Mesos, Spark and Mahout?
There might be good technical criteria for this decision but I haven't seen anything published on it. There seems to be a cultural divide where it's understood that MapReduce gets used for sifting through data in corporate environments while scientific workloads use MPI. That may be due to underlying sensitivity of those workloads to network performance. Here are a few thoughts about how to find out:
Many modern MPI implementations can run over multiple networks but are heavily optimized for Infiniband. The canonical use case for MapReduce seems to be in a cluster of "white box" commodity systems connected via ethernet. A quick search on "MapReduce Infiniband" leads to http://dl.acm.org/citation.cfm?id=2511027 which suggests that use of Infiniband in a MapReduce environment is a relatively new thing.
So why would you want to run on a system that's highly optimized for Infiniband? It's significantly more expensive than ethernet but has higher bandwidth, lower latency and scales better in cases of high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).
If your application is sensitive to the network effects that the Infiniband optimizations already baked into many MPI libraries address, MPI may be useful for you. If your app is relatively insensitive to network performance and spends more time on computations that don't require communication between processes, maybe MapReduce is a better choice.
If you have the opportunity to run benchmarks, you could do a projection on whichever system you have available to see how much improved network performance would help. Try throttling your network (downclock GigE to 100 Mbit or Infiniband QDR to DDR, for example), draw a line through the results, and see whether purchasing a faster interconnect that MPI is optimized for would get you where you want to go.
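The "draw a line through the results" step can be sketched in a few lines of Python. This assumes a simple model that is my own, not something from a benchmark: runtime = compute time + data moved / bandwidth, so runtime is fitted as a straight line against 1/bandwidth. The function names are made up for illustration, and you would feed in your own throttled measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project_runtime(bandwidths, runtimes, target_bandwidth):
    """Fit runtime against 1/bandwidth and extrapolate to a faster link.

    Assumes runtime is linear in 1/bandwidth (a modelling assumption);
    the intercept is the part of the runtime a faster network cannot fix.
    """
    inverse = [1.0 / b for b in bandwidths]
    slope, intercept = fit_line(inverse, runtimes)
    return slope / target_bandwidth + intercept
```

If the fitted intercept dominates the measured runtimes, the application is mostly compute-bound and a faster interconnect is unlikely to pay for itself.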
The link you posted about FEM being done on MapReduce uses MPI. It states it right there in the abstract. They combined MPI's programming model (non-embarrassingly parallel) with HDFS to "stage" the data to exploit data locality.
Hadoop is purely for embarrassingly parallel computations. Anything that requires processes to organize themselves and exchange data in complex ways will get crap performance with Hadoop. This can be demonstrated both from an algorithmic complexity point of view, and also from a measurement point of view.
Can anyone explain the major differences between HDFS and Grid Computing?
I think you have to replace HDFS with Hadoop in your question.
Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model: the MapReduce framework, which runs on YARN (Yet Another Resource Negotiator).
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
The Grid Computing approach is based on distributing the work across a cluster of machines, which access a shared file system hosted by a storage area network (SAN). This works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes.
HDFS is just a file system. Since you are comparing processing of data, you have to compare Grid Computing with Hadoop Map Reduce (YARN) instead of HDFS.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.
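As a toy illustration of the idea (not Hadoop's actual scheduler), the sketch below assigns each block to a node that already holds a replica when that node has a free slot, and falls back to any free node otherwise. The function name, the data structures and the node names you would pass in are all assumptions made for the example.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring nodes that hold the block.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots:      {node: number of free task slots}
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # work on a copy, leave the input untouched
    assignment = {}
    for block, holders in block_locations.items():
        # First choice: a node that already stores this block.
        node = next((n for n in holders if slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # Fallback: any node with a free slot; data must cross the network.
            node = next((n for n, free in slots.items() if free > 0), None)
            if node is None:
                raise RuntimeError("no free slots left")
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment
```

Every task that ends up with is_local set to False has to pull its block over the network, which is the cost the SAN-based grid approach pays for all of its data.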
You can refer to Hadoop: The Definitive Guide (4th edition) to understand the concepts better.
How Hadoop differs from other distributed systems:
it scales out
it is a proven technology
it is low cost
it is used by industry giants
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
but....
Grid computing is the collection of computer resources from multiple locations to reach a common goal. The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. Grid computing is distinguished from conventional high performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed (thus not physically coupled) than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries.
I think HDFS is not relevant to grid computing, or perhaps it is used in super virtual computers in a grid.
Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed at processing so-called Big Data, and I wonder how it performs with a small amount of data (but a CPU-intensive workload).
I would mainly like to know whether a better approach exists for this scenario or whether I should stick to Hadoop.
Hadoop is a distributed computing framework offering a MapReduce engine. If you can express your parallelizable, CPU-intensive application with this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classic example of a Hadoop computation is the calculation of Pi, which doesn't need any input data. As you'll see here, Yahoo managed to determine the two-quadrillionth digit of pi thanks to Hadoop.
However, Hadoop is indeed specialized for Big Data in the sense that it was developed for this purpose. For instance, you get a file system designed to contain huge files. These huge files are chunked into a lot of blocks across a large number of nodes. In order to ensure your data integrity, each block has to be replicated to other nodes.
To conclude, I'd say that if you already have a Hadoop cluster, you may want to take advantage of it.
If that's not the case, and while I can't recommend anything since I have no idea what exactly you need, I think you can find more lightweight frameworks than Hadoop.
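To show why such a job needs no input data, here is a rough sketch of a Pi estimate laid out as map and reduce steps in plain Python. It uses simple Monte Carlo sampling, which is an assumption for the example and not necessarily the method behind the Yahoo result; map_task and estimate_pi are hypothetical names. Each map task only needs a seed and a sample count, so the tasks are completely independent.

```python
import random


def map_task(seed, samples):
    """Count random points of the unit square that fall in the quarter circle."""
    rng = random.Random(seed)  # per-task generator keeps tasks independent
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_tasks=8, samples_per_task=20_000):
    """Reduce step: sum the hit counts and scale to an estimate of Pi."""
    hits = sum(map_task(seed, samples_per_task) for seed in range(num_tasks))
    return 4.0 * hits / (num_tasks * samples_per_task)
```

Here the map tasks run one after another in a single process; on a cluster each one could run on a different node, and only the small hit counts would travel over the network.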
Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want is to use many CPUs, possibly on many nodes. For this you could use a language designed with scalability in mind, namely Scala. Using Scala with Spark is much, much easier and much, much faster than Hadoop.
If you don't have access to a cluster, it can still be a good idea to use Spark so that you can move to one more easily in the future. Or just use .par in Scala, which will parallelize your code and use all the CPUs on your local machine.
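For readers not using Scala, a rough Python analogue of the "use all local CPUs" idea is a process pool with one worker per core. This is only a sketch: cpu_heavy is a made-up stand-in for the real workload, and parallel_map is a hypothetical helper, not a library function.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_heavy(n):
    """Stand-in for a CPU-bound step: sum of the squares below n."""
    return sum(i * i for i in range(n))


def parallel_map(func, items, executor_cls=ProcessPoolExecutor):
    """Apply func to every item using one worker per local CPU.

    Separate processes are the default because CPython threads do not run
    CPU-bound code in parallel; pass ThreadPoolExecutor for I/O-bound work.
    """
    workers = os.cpu_count() or 1
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(func, items))


# Typical use, inside an `if __name__ == "__main__":` guard:
#     results = parallel_map(cpu_heavy, [10_000, 20_000, 30_000])
```

Results come back in the same order as the inputs, so the call can replace an ordinary map over the collection.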
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we have for Data Normalization. This is a need for parallel processing on cheap hardware and software with ease of use, instead of going through all the special programming for traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations. Indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was to handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general-purpose business use case. Thus, Hadoop as described in its common form is not targeted at the use case you and we have. But Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a purpose-built hardware environment, PSIKLOPS, that is powerful for small clusters (1-10 nodes), with enough power at low cost to run 4-20 parallel jobs. We will be showcasing this in a series of webcasts by Inside Analysis titled Tech Lab, in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as being a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use when launching multiple concurrent containers of custom Java.
At the company I work for, every day we have to process a few thousand files, which takes some hours. The operations are basically CPU intensive, like converting PDFs to high-resolution images and later creating many different sizes of such images.
Each one of those tasks takes a lot of CPU, and therefore we can't simply start many instances on the same machine because there won't be any processing power available for everything. Thus, it takes some hours to finish everything.
The most obvious thing to do, as I see it, is to partition the set of files and have them processed by more machines concurrently (5, 10, 15 machines, I don't know yet how many would be necessary).
I don't want to reinvent the wheel and create a task manager (nor do I want the hassle), but I am not sure which tool I should use.
Although we don't have big data, I have looked at Hadoop for a start (we are running at Amazon), and its capabilities of handling the nodes seem interesting. However, I don't know if it makes sense to use it. I am looking at Hazelcast as well, but I have no experience at all with it or the concepts yet.
What would be a good approach for this task?
Hadoop is being used for a wide variety of data processing problems, some of them related to image processing. The problem mentioned in the OP can also be easily solved using Hadoop. Note that when the data to be processed is small, using Hadoop adds overhead.
If you are new to Hadoop, I would suggest a few things:
Buy the book Hadoop: The Definitive Guide.
Go through the MapReduce resources.
Go through the tutorials (1 and 2) and set up Hadoop on a single node and on a cluster. There is no need for Amazon if 1-2 machines can be spared for learning.
Run the sample programs and understand how they work.
Start migrating the problem area to Hadoop.
The advantage of Hadoop over other software is its ecosystem. As of now, the ecosystem around Hadoop is huge and growing; I am not sure about Hazelcast's.
You can use a Hazelcast distributed queue.
First, put your files (file references) as tasks into a distributed queue.
Then each node takes a task from the queue, processes it, and puts the result into another distributed queue/list or writes it to a DB/storage.
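The two steps above can be sketched with Python's in-process queue standing in for the distributed one. This is not Hazelcast code: threads play the role of nodes, and process_all and handler are names invented for the illustration. The pattern is the same, though: workers pull the next task whenever they are free, so slow files do not hold up the rest.

```python
import queue
import threading


def process_all(file_refs, handler, num_workers=3):
    """Run handler on every file reference using a shared task queue."""
    tasks = queue.Queue()
    results = queue.Queue()
    for ref in file_refs:
        tasks.put(ref)

    def worker():
        # Each worker (a "node" in this sketch) pulls tasks until none remain.
        while True:
            try:
                ref = tasks.get_nowait()
            except queue.Empty:
                return
            results.put((ref, handler(ref)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # All workers are done, so the result queue can be drained safely.
    collected = {}
    while not results.empty():
        ref, value = results.get()
        collected[ref] = value
    return collected
```

With a real distributed queue the workers would be separate machines and would keep polling for new tasks instead of exiting when the queue is empty.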
I have a system I wish to distribute where I have a number of very large non-splittable binary files I wish to process in a distributed fashion. These are of the order of a couple of hundred GB each. For a variety of fixed, implementation-specific reasons, these files cannot be processed in parallel but have to be processed sequentially by the same process through to the end.
The application is developed in C++, so I would be considering Hadoop Pipes to stream the data in and out. Each instance will need to process of the order of 100 GB to 200 GB of its own data sequentially (currently stored in one file), and the application is currently (probably) IO limited, so it's important that each job is run entirely locally.
I'm very keen on HDFS for hosting this data - the ability to automatically maintain redundant copies and to rebalance as new nodes are added will be very useful. I'm also keen on map reduce for its simplicity of computation and its requirement to host the computation as close as possible to the data. However, I'm wondering how suitable Hadoop is for this particular application.
I'm aware that for representing my data it's possible to generate non-splittable files, or alternatively to generate huge sequence files (in my case, these would be of the order of 10 TB for a single file, should I pack all my data into one), and that it's therefore possible to process my data using Hadoop. However, it seems like my model doesn't fit Hadoop that well: does the community agree? Or have suggestions for laying this data out optimally? Or even for other cluster computing systems that might fit the model better?
This question is perhaps a duplicate of existing questions on Hadoop, except that my system requires an order of magnitude or two more data per individual file (previously I've seen the question asked about individual files of a few GB in size). So forgive me if this has been answered before, even for this size of data.
Thanks,
Alex
It seems like you are working with a relatively small number of large files. Since your files are huge and not splittable, Hadoop will have trouble scheduling and distributing jobs effectively across the cluster. I think the more files you process in one batch (like hundreds), the more worthwhile it will be to use Hadoop.
Since you're only working with a few files, have you tried a simpler distribution mechanism, like launching processes on multiple machines using ssh, or GNU Parallel? I've had a lot of success using this approach for simple tasks. Using an NFS-mounted drive shared by all your nodes can limit the amount of copying you would have to do as well.
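As a sketch of the ssh approach, the helper below deals the files out to hosts round-robin and builds one ssh command line per file. Host names, file paths and the program name are whatever you pass in; build_ssh_commands is a made-up name, and the function only builds the commands, it does not run them.

```python
import shlex


def build_ssh_commands(hosts, files, program):
    """Return one ssh argument list per file, assigned to hosts round-robin.

    The remote command is quoted with shlex so that paths containing
    spaces survive the remote shell.
    """
    commands = []
    for index, path in enumerate(files):
        host = hosts[index % len(hosts)]
        remote = f"{shlex.quote(program)} {shlex.quote(path)}"
        commands.append(["ssh", host, remote])
    return commands
```

Each returned list can be handed to subprocess.Popen to start the jobs concurrently; with files on a shared mount, the same path is valid on every host.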
You can write a custom InputSplit for your file, but as bajafresh4life said, it won't really be ideal: unless your HDFS block size is the same as your file size, your files are going to be spread all around and there will be network overhead. And if you do make your HDFS block size match your file size, then you're not getting the benefit of all your cluster's disks. The bottom line is that Hadoop may not be the best tool for you.
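To put a number on "spread all around", here is the arithmetic as a tiny Python helper. The 128 MiB block size in the usage note is an assumption (check dfs.blocksize on your cluster), and block_count is a name made up for the example.

```python
def block_count(file_bytes, block_bytes):
    """Number of HDFS blocks needed to hold a file of the given size."""
    if block_bytes <= 0:
        raise ValueError("block size must be positive")
    return -(-file_bytes // block_bytes)  # integer ceiling division


GIB = 1024 ** 3
MIB = 1024 ** 2
```

For example, block_count(200 * GIB, 128 * MIB) gives 1600 blocks for one 200 GiB file, which a single sequential reader would mostly have to fetch from other nodes, while block_count(200 * GIB, 200 * GIB) gives 1, which keeps the file in one place but ties it to the disks of the nodes holding that one block.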