Assume that there is a globally shared data block that is being used by a set of machines in a cluster.
If it is a distributed memory model, say the whole structure is sent to every node in the cluster.
Each node in the cluster performs different operations to the parts of the shared data block in parallel.
My question is: can it be identified as a Multiple Instruction, Multiple Data (MIMD) model since the whole data block is shared by/transferred to every node in the cluster?
If you apply the same operation to different pieces of data then you have data-level parallelism, which according to Flynn's Taxonomy is equivalent to SIMD (Single Instruction, Multiple Data streams), typically applied in GPU processing.
Multiple Instruction, Multiple Data (MIMD) entail that different processors may be executing different instructions on different data.
So according to the definitions, the answer to your question is yes.
Related
I was exploring NiFi documentation. I must agree that it is one of the well documented open-source projects out there.
My understanding is that the processor runs on all nodes of the cluster.
However, I was wondering about how the content is distributed among cluster nodes when we use content pulling processors like FetchS3Object, FetchHDFS etc. In processor like FetchHDFS or FetchSFTP, will all nodes make connection to the source? Does it split the content and fetch from multiple nodes or One node fetched the content and load balance it in the downstream queues?
I think this document has an answer to your question:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
For other file stores the idea is the same.
will all nodes make connection to the source?
Yes. If you did not limit your processor to work only on primary node - it runs on all nodes.
The answer by #dagget has traditionally been the approach to handle this situation, often referred to as the "list + fetch" pattern. List processor runs on Primary Node only, listings sent to RPG to re-distribute across the cluster, input port receives listings and connect to a fetch processor running on all nodes fetching in parallel.
In 1.8.0 there are now load balanced connections which remove the need for the RPG. You would still run the List processor on Primary Node only, but then connect it directly to the Fetch processors, and configure the queue in between to load balance.
Apache Ignite has two concepts, one of them is NearCache, and another one is the CacheMode enumaration.
What is the main difference between two concepts?
Near cache is the local hot cache that keeps often accessed data. It significantly speeds up data processing, saving time on network round-trips.
CacheMode defines how your data will be stored. It could be LOCAL for single node, which means data are not distributed in grid. Other two PARTITIONED and REPLICATED means respectively: cache data divided between nodes on some equal parts (called partitions) or each node keeps full data from that cache.
PARTITIONED allows you to keep in grid more data than available in separate machine, REPLICATED gives 100% data survivorability (if all nodes crashed except one - you will not loose your data).
More details you can find in documentation https://apacheignite.readme.io/docs/near-caches and https://apacheignite.readme.io/docs/cache-modes
I am running fairly large task on my 4 node cluster. I am reading around 4 GB of filtered data from a single table and running Naïve Baye’s training and prediction. I have HBase region server running on a single machine which is separate from the spark cluster running in fair scheduling mode, although HDFS is running on all machines.
While executing, I am experiencing strange task distribution in terms of the number of active tasks on the cluster. I observed that only one active task or at most two tasks are running on one/two machines at any point of time while the other are sitting idle. My expectation was that the data in the RDD will be divided and processed on all the nodes for operations like count and distinct etcetera. Why are all nodes not being used for large tasks of a single job? Does having HBase on a separate machine has anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions but can make the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions) and then looping through it and printing the number of elements of each of the arrays. (This introduces communication so don't leave it in your production code.)
Many of the API calls on RDD have optional parameters for setting the number of partitions, and then there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix problems you find using the above technique (but sometimes it will expose the need to rethink your algorithm.)
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the master.
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.
I have something in mind but I don't know the typical solution that could help me achieve that.
I need to have a distributed environment where not only memory is shared but processing is also shared, that means ALL Shared Processors work as one Big Processor Computing The code I wrote.
Could this be achieved knowing that I have limited knowledge in Data Grids and Hadoop?
Data Grid Platform (I knew that memory only is shared in that model) or Hadoop (where the code is shared among nodes but each node processes the code separately from other nodes but works on a subset of the data on HDFS).
But I need a solution that not only (shares memory or code as hadoop) but also the processing power of all the machines as one Single Big processor and one single Big Memory?
Do you expect that you just spawn the thread and it get executed somewhere and the middleware miraculously balances the load across nodes, moving threads from one node to another? I think you won't find this directly. The tagged frameworks don't have transparent shared memory either, for good reasons.
When using multiple nodes, you usually need them for processing power, and hiding everything and pretending you're on single machine will tend to unnecessary communication, slowing stuff down.
Instead, you can always design your app using the distribution API provided by those frameworks. For example in Infinispan, look for the Map-Reduce or Distributed Executors API.
I need to have a distributed environment where not only memory is shared but processing is also shared, that means ALL Shared Processors work as one Big Processor Computing The code I wrote.
You are not benefiting with processing on single machine. Application will scale if the processing is spread across multiple machines. If you want to see benefits of one Big Processor Computing, you can virtualize big physical machine into multiple virtual nodes (using technologies like VMWare).
But distributed processing across multiple VM nodes across multiple physical machines in a big cluster is best for distributed applications. Hadoop/Spark is best fit for these type of applications depending on batch processing (Hadoop) or real time processing needs (Spark).
The intended use for Hadoop appears to be for when the input data is distributed (HDFS) and already stored local to the nodes at the time of the mapping process.
Suppose we have data which does not need to be stored; the data can be generated at runtime. For example, the input to the mapping process is to be every possible IP address. Is Hadoop capable of efficiently distributing the Mapper work across nodes? Would you need to explicitly define how to split the input data (i.e. the IP address space) to different nodes, or does Hadoop handle that automatically?
Let me first clarify a comment you made. Hadoop is designed to support potentially massively parallel computation across a potentially large number of nodes regardless of where the data comes from or goes. The Hadoop design favors scalability over performance when it has to. It is true that being clever about where the data starts out and how that data is distributed can make a significant difference in how well/quickly a hadoop job can run.
To your question and example, if you will generate the input data you have the choice of generating it before the first job runs or you can generate it within the first mapper. If you generate it within the mapper then you can figure out what node the mapper's running on and then generate just the data that would be reduced in that partition (Use a partitioner to direct data between mappers and reducers)
This is going to be a problem you'll have with any distributed platform. Storm, for example, lets you have some say in which bolt instance will will process each tuple. The terminology might be different, but you'll be implementing roughly the same shuffle algorithm in Storm as you would Hadoop.
You are probably trying to run a non-MapReduce task on a map reduce cluster then. (e.g. IP scanning?) There may be more appropriate tools for this, your know...
A thing few people do not realize is that MapReduce is about checkpointing. It was developed for huge clusters, where you can expect machines to fail during the computation. By having checkpointing and recovery built-in into the architecture, this reduces the consequences of failures and slow hosts.
And that is why everything goes from disk to disk in MapReduce. It's checkpointed before, and it's checkpointed after. And if it fails, only this part of the job is re-run.
You can easily outperform MapReduce by leaving away the checkpointing. If you have 10 nodes, you will win easily. If you have 100 nodes, you will usually win. If you have a major computation and 1000 nodes, chances are that one node fails and you wish you had been doing similar checkpointing...
Now your task doesn't sound like a MapReduce job, because the input data is virtual. It sounds much more as if you should be running some other distributed computing tool; and maybe just writing your initial result to HDFS for later processing via MapReduce.
But of course there are way to hack around this. For example, you could use /16 subnets as input. Each mapper reads a /16 subnet and does it's job on that. It's not that much fake input to generate if you realize that you don't need to generate all 2^32 IPs, unless you have that many nodes in your cluster...
Number of Mappers depends on the number of Splits generated by the implementation of the InputFormat.
There is NLineInputFormat, which you could configure to generate as many splits as there are lines in the input file. You could create a file where each line is an IP range. I have not used it personally and there are many reports that it does not work as expected.
If you really need it, you could create your own implementation of the InputFormat which generates the InputSplits for your virtual data and force as many mappers as you need.