MapReduce performance speed-up check on a simple MongoDB installation with two secondaries and one primary node

I have a simple MongoDB installation with two secondary nodes and one primary. When I run a MapReduce query on a data size of 5 GB, it takes the same time as it did on a standalone single-node MongoDB installation. I am using the command line. Do I have to use any specific command to exploit the extra replica set members for MapReduce?
Thank you in advance.

You can speed up your job if you can use the aggregation framework instead of mapReduce - the aggregation framework is a lot faster.
You can't really scale your operations using replica sets, since replica sets are for high availability and failover (plus redundancy of data), not for scaling. You can run mapReduce or aggregation on a secondary: just connect to the secondary, specify rs.slaveOk(), and then run mapReduce/aggregate - but you cannot output results to a collection then, since you cannot write to a secondary, so it has to return results inline.
This will move the extra load off the primary, but it won't make the job faster per se. If you want to utilize multiple servers, you need to shard your database - by distributing the data over multiple shards/hosts you will automatically cause your mapReduce and/or aggregation queries to run over multiple servers. Even though a small penalty exists for managing the results (they still have to be merged), the speed-up on the longest part of the job will likely more than offset the extra overhead.
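As a hedged illustration, here is a minimal PyMongo sketch of routing an aggregation to a secondary so the primary carries no extra load (the connection string, database, collection, and field names are hypothetical and need to be adapted to your deployment); note that the results must come back inline, since a secondary cannot write to a collection:

```python
from pymongo import MongoClient, ReadPreference

# Hypothetical replica set connection string -- adjust hosts and set name.
client = MongoClient("mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")

# Route reads to a secondary (the driver-level equivalent of rs.slaveOk()).
db = client.get_database("mydb", read_preference=ReadPreference.SECONDARY)

pipeline = [
    {"$match": {"status": "active"}},  # filter early so less data flows through the pipeline
    {"$group": {"_id": "$category", "total": {"$sum": "$amount"}}},
]

# Results are returned inline via the cursor; writing out with $out is not possible on a secondary.
for doc in db.orders.aggregate(pipeline):
    print(doc)
```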

Related

Is Elasticsearch safe on a single node in a production environment with sensitive information?

Elasticsearch on a single node can be faster than a cluster, but what are the advantages and disadvantages of using it in a production environment with only one node?
I have a problem: my DBA wants to use Elasticsearch on a single node, with speed as the justification, but he is not taking into account availability, redundancy, and resilience against disasters. The system is not slow, but all the documentation I have read about Elasticsearch says that in a production environment it needs to run as a cluster with multiple nodes/shards. We are an information bureau; we provide data for banks, credit analysis, and numerous large customer applications, so I am sure the information we are dealing with needs high availability and redundancy. The index size is about 2.2 TB. I wanted to run on a cluster because the information is very sensitive, but my DBA wants to run on a single node in production. Help me give him an answer: is he right, or am I wrong?
You should run a benchmark of the potential workloads if he needs proof.
It isn't really viable to run any real production load on a single system, because the shards are what really create the speed of Elasticsearch: queries run on many shards in parallel and the partial results are merged to form the full result. If you use a single node to scan the entire dataset, it will take a very long time for that single node to process everything.
I guess the issue really comes down to how much your data is getting updated and how many queries are running in parallel.
Without any network traffic it could seem faster on a small workload, but if your search cluster needs to do continuous indexing and run parallel queries it will just get stuck and stop returning results.
Whether a single-node cluster will be faster or slower than a multi-node cluster depends on the use case and many other factors; the argument that a single node is faster is not valid without a comparative benchmark of the real use case.
For example, if your use case is indexing-heavy, a single-node cluster can be faster; if it is search-heavy, a single-node cluster can be slower.
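As a starting point for such a benchmark, here is a minimal sketch using the elasticsearch-py 8.x client; the endpoint, index name, and query are hypothetical stand-ins for a representative query from your real workload, run once against the single node and once against a comparable multi-node cluster:

```python
import time
from elasticsearch import Elasticsearch

# Hypothetical endpoint -- point it at the single node, then at the cluster you compare against.
es = Elasticsearch("http://localhost:9200")

def bench(runs: int = 50) -> float:
    """Return the median latency (in seconds) of a representative search."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Hypothetical index and field names; use a query taken from the real workload.
        es.search(index="bureau-docs", query={"match": {"document_text": "credit analysis"}})
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]

print(f"median search latency: {bench():.3f}s")
```

Run the same measurement while the cluster is under its normal indexing load, since that is where a single node is most likely to fall behind.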
But even if it is faster in a specific use case, running a single-node cluster in production is not recommended and it is very risky.
The main issue is that a single node has no resilience to failure: if your node is down, your entire cluster is down, and until the node is back up and running, the data in your Elasticsearch cluster is unavailable.
Depending on how you ingest the data, this can also lead to data loss.
If for some reason the data in your node gets corrupted or lost, you will need to restore it from a previous snapshot. On a multi-node cluster, if a node is lost, you can spin up another one and the cluster will take care of replicating the data, assuming you use replicas for your indices.
There are also limitations on how many shards you can have on a node and how much memory you should give to the Java heap. Elastic's recommendations are to keep the number of shards per GB of heap below 20 and not to use more than 30 GB of heap, so a single node gives you a maximum of roughly 600 shards.
Whether this is enough depends on the use case and your indexing/shard strategy.
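As a rough sanity check against those rules of thumb, here is a hedged sketch with the elasticsearch-py client (the endpoint, heap size, and node count are assumptions you must fill in) that compares the cluster's active shard count with the recommended ceiling:

```python
from elasticsearch import Elasticsearch

# Hypothetical endpoint and sizing assumptions -- adjust for your deployment.
es = Elasticsearch("http://localhost:9200")
heap_gb_per_node = 30      # rule of thumb: do not exceed ~30 GB of JVM heap
shards_per_gb_heap = 20    # rule of thumb: at most ~20 shards per GB of heap
data_nodes = 1             # a single-node cluster

max_recommended_shards = heap_gb_per_node * shards_per_gb_heap * data_nodes  # 600 for one node

health = es.cluster.health()
print(f"active shards: {health['active_shards']} / recommended ceiling: {max_recommended_shards}")
```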
You should ask yourself whether you can afford downtime and data loss; if the answer to both is no, then you should not use a single-node cluster.

How to Control Location of Parallel / GroupBy Stage in Google Cloud Dataflow

I'm building an ElasticSearch output for Google Cloud Dataflow that is scalable and does not put load on the ES cluster (batch data flow). My idea for this is to have the nodes of the Dataflow pipeline join the ES cluster and perform the indexing themselves without putting any additional load on the main ES cluster. Therefore I have a stage in the pipeline that, at the start of the bundle, creates an ES node that joins the cluster and then indexes every item passed to it locally (via routing settings).
There are two questions I need to answer to make this work:
How can I make sure that each bundle is started on a different node? I want in the end one bundle per node, otherwise I have too many shards for ES.
What is the best way to create a fixed number of bundles and split work between them? I'm currently doing a group by key based on a random number between 1 and n to create the groups.
Your needs seem closely related to BEAM-68, which is a feature specifically targeting the use case of writing to a service without overwhelming it. Today, using a GroupByKey like you describe is the usual way to manually shard your input data and limit the number of concurrent writes.
As to your more detailed questions: the particular bundles into which Dataflow splits your data, and the distribution of those bundles to workers, are chosen by the service and tuned for performance. Even though you can't (and shouldn't try to) control bundling directly, the GroupByKey partitions your data into groups keyed 1 through n, each group is processed as an indivisible element, and the number of non-empty bundles cannot be greater than the number of groups.
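A minimal sketch of the random-key GroupByKey pattern, written with the Apache Beam Python SDK (the question predates Beam, but the shape of the pipeline is the same); NUM_WRITERS, the source, and the indexing logic are hypothetical placeholders:

```python
import random
import apache_beam as beam

NUM_WRITERS = 8  # the "n" from the question: how many parallel writer groups you want

def add_random_key(element):
    # Assign each element to one of NUM_WRITERS buckets; after the GroupByKey,
    # at most NUM_WRITERS groups carry data, which bounds the concurrent writers.
    return (random.randint(0, NUM_WRITERS - 1), element)

def index_group(keyed_group):
    key, elements = keyed_group
    # Hypothetical placeholder: start one embedded ES node / client here,
    # index every element of this group, then shut it down.
    for element in elements:
        pass  # send `element` to Elasticsearch

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.Create(["doc-%d" % i for i in range(1000)])  # stand-in for the real source
     | "KeyByRandomBucket" >> beam.Map(add_random_key)
     | "GroupIntoBuckets" >> beam.GroupByKey()
     | "IndexEachBucket" >> beam.Map(index_group))
```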

Does HBase use the compute capacity of all nodes in the cluster for query execution?

We have a setup of 1 master and 2 slave nodes. The data is set up in Postgres and in HBase and it's the same dataset (same number of rows) - 65 million rows. Yet, we don't find a measurable increase in performance from HBase for the same query.
My first thought is - does HBase use the compute capacity of all nodes to fork the query out? Perhaps this is why the performance is not measurably better.
Any other reasons for why the performance between Postgres and HBase would be about the same? Any specific configuration items to look for?
EDIT : Something I found while researching this : http://www.flurry.com/2012/06/12/137492485#.VaQP_5QpBpg
This is kind of a yes-and-no answer. Depending on what you are doing for your 'query' and your region distribution, you may or may not be using all the nodes. For example, if you are running a scan across the table, it will run against each region (assuming there is more than one) in sequence. However, if you are using a multi-get for keys that are in different regions, it will run in parallel.
The real benefit is going to come as the number of regions increases and you start parallelizing requests (multiple clients). Regions will be distributed across region servers by the Master as regions are split.
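To make the scan-versus-multi-get distinction concrete, here is a hedged sketch using the happybase Python client (the Thrift gateway host, table, column family, and row keys are all hypothetical):

```python
import happybase

# Hypothetical Thrift gateway host and table/column names.
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("events")

# A table scan walks the regions one after another, so a single scan does not
# spread its work across all region servers at once.
for row_key, data in table.scan(columns=[b"d:payload"]):
    pass  # process each row

# A multi-get for keys that live in different regions can be served by several
# region servers, which is where the parallelism mentioned above comes from.
rows = table.rows([b"user1#20150701", b"user2#20150701", b"user3#20150701"])
for row_key, data in rows:
    pass  # process each row
```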

Task scheduling with Spark

I am running a fairly large task on my 4-node cluster. I am reading around 4 GB of filtered data from a single table and running Naive Bayes training and prediction. I have the HBase region server running on a single machine which is separate from the Spark cluster, which runs in fair scheduling mode, although HDFS is running on all machines.
While executing, I am experiencing strange task distribution in terms of the number of active tasks on the cluster. I observed that only one or at most two active tasks are running on one or two machines at any point in time, while the others sit idle. My expectation was that the data in the RDD would be divided and processed on all the nodes for operations like count and distinct, etcetera. Why are all nodes not being used for large tasks of a single job? Does having HBase on a separate machine have anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions yet leave the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions) and then looping through it and printing the number of elements in each of the arrays; see the sketch after this list. (This introduces communication, so don't leave it in your production code.)
Many of the API calls on RDD have optional parameters for setting the number of partitions, and then there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix problems you find using the above technique (but sometimes it will expose the need to rethink your algorithm.)
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the master.
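A hedged PySpark sketch of that partition check (the RDD here is a stand-in for the HBase-backed RDD in the question, and the repartition target is purely illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="partition-check")  # assumes an existing Spark cluster/master configuration

# Stand-in for the HBase-backed RDD from the question; any large RDD behaves the same way here.
rdd = sc.parallelize(range(1_000_000))

# How many partitions are there, and how evenly is the data spread across them?
print("partitions:", rdd.getNumPartitions())
sizes = rdd.glom().map(len).collect()   # one element count per partition
print("per-partition counts:", sizes)   # debugging only -- remove from production code

# If there are too few partitions, or the counts are badly skewed, rebalance.
balanced = rdd.repartition(4 * sc.defaultParallelism)
print("after repartition:", balanced.getNumPartitions())
```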
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.

MongoDB capacity planning

I have an Oracle database with around 7 million records/day and I want to switch to MongoDB. (~300 GB)
To set up a POC, I'd like to know how many nodes I need. I think 2 shards, each a replica set of 3 nodes, will be enough, but I want to know your thinking about it :)
I'd like to have an HA setup :)
Thanks in advance!
For MongoDB to work efficiently, you need to know your working set size. You need to know how much data 7 million records/day amounts to, because this is the active data that will need to stay in RAM for high performance.
Also, be very sure WHY you are migrating to Mongo. I'm guessing in your case it is scalability,
but know your data well before doing so.
For your POC, keeping two shards means roughly 150 GB on each. If you have that much disk available, no problem.
Give some consideration to your shard keys: which fields does it make sense to shard your data set on? This will impact the decision of how many shards to deploy versus the capacity of each shard. You might go with relatively few shards, maybe two or three big, deep shards if your data can be easily segmented into halves or thirds, or several lighter, thinner shards if you can shard on a more diverse key.
It is relatively straightforward to upgrade from a MongoDB replica set configuration to a sharded cluster (each shard is actually a replica set). Rather than predetermining that sharding is the right solution to start with, I would think about what your reasons for sharding are (eg. will your application requirements outgrow the resources of a single machine; how much of your data set will be active working set for queries, etc).
It would be worth starting with replica sets and benchmarking this as part of planning your architecture and POC.
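If you start the POC with a plain replica set, here is a minimal PyMongo sketch of initiating a three-member set (the host names and the set name rs0 are hypothetical; this is equivalent to running rs.initiate(config) in the shell):

```python
from pymongo import MongoClient

# Connect directly to one mongod that has not yet joined a set (hypothetical host names).
client = MongoClient("mongodb://node1.example.com:27017", directConnection=True)

config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "node1.example.com:27017"},
        {"_id": 1, "host": "node2.example.com:27017"},
        {"_id": 2, "host": "node3.example.com:27017"},
    ],
}

# Initiate the replica set; the members then elect a primary automatically.
client.admin.command("replSetInitiate", config)
```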
Some notes to get you started:
MongoDB's journaling, which is enabled by default as of 1.9.2, provides crash recovery and durability in the storage engine.
Replica sets are the building block for high availability, automatic failover, and data redundancy. Each replica set needs a minimum of three nodes (for example, three data nodes or two data nodes and an arbiter) to enable failover to a new primary via an election.
Sharding is useful for horizontal scaling once your data or writes exceed the resources of a single server.
Other considerations include planning your documents based on your application usage: for example, if your documents will be updated frequently and grow in size over time, you may want to consider manual padding to prevent excessive document moves.
If this is your first MongoDB project you should definitely read the FAQs on Replica Sets and Sharding with MongoDB, as well as for Application Developers.
Note that choosing a good shard key for your use case is an important consideration. A poor choice of shard key can lead to "hot spots" for data writes, or unbalanced shards if you plan to delete large amounts of data.
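If and when you do move to a sharded cluster, a hedged PyMongo sketch of enabling sharding and declaring a shard key is shown below; the mongos host, database, collection, and the compound key are purely illustrative, and the key should be chosen to match your query pattern and spread writes evenly:

```python
from pymongo import MongoClient

# Connect to a mongos router of the sharded cluster (hypothetical host name).
client = MongoClient("mongodb://mongos1.example.com:27017")

# Enable sharding for the database, then shard the collection on the chosen key.
client.admin.command("enableSharding", "bureau")
client.admin.command(
    "shardCollection",
    "bureau.records",
    key={"customer_id": 1, "created_at": 1},  # illustrative compound shard key
)
```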
