How to Control Location of Parallel / GroupBy Stage in Google Cloud Dataflow - elasticsearch

I'm building an ElasticSearch output for Google Cloud Dataflow that is scalable and does not put load on the ES cluster (batch data flow). My idea is to have the nodes of the Dataflow pipeline join the ES cluster and perform the indexing themselves, without putting any additional load on the main ES cluster. To that end, I have a stage in the pipeline that, at the start of each bundle, creates an ES node that joins the cluster and then indexes every item passed to it itself (via routing settings).
I have two questions about making this work:
How can I make sure that each bundle is started on a different node? I want in the end one bundle per node, otherwise I have too many shards for ES.
What is the best way to create a fixed number of bundles and split work between them? I'm currently doing a group by key based on a random number between 1 and n to create the groups.

Your needs seem closely related to BEAM-68, which is a feature specifically targeting the use case of writing to a service without overwhelming it. Today, using a GroupByKey like you describe is the usual way to manually shard your input data and limit the number of concurrent writes.
As to your more detailed questions: the service chooses how to split your data into bundles and how to distribute those bundles to workers, and it tunes both for performance. Even though you can't (and shouldn't try to) control bundling directly, the GroupByKey partitions your data into indivisible elements keyed 1 through n, so the number of non-empty bundles cannot be greater than the number of elements.
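The manual sharding described above can be sketched in plain Python. This is only an illustration of the grouping logic, not Dataflow or Beam SDK code; the function name `shard_by_random_key` and the sample data are hypothetical.

```python
import random
from collections import defaultdict


def shard_by_random_key(items, num_shards, seed=None):
    """Group items under a random key in 1..num_shards.

    This mimics what a GroupByKey over random keys produces:
    at most num_shards groups, each of which is processed as one unit.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        # randint is inclusive on both ends, so keys are 1..num_shards.
        groups[rng.randint(1, num_shards)].append(item)
    return dict(groups)


# Hypothetical input: 1000 documents split across at most 4 groups.
docs = [f"doc-{i}" for i in range(1000)]
shards = shard_by_random_key(docs, num_shards=4, seed=42)
for key in sorted(shards):
    print(key, len(shards[key]))
```

Because each group is indivisible, the number of concurrent writers is bounded by the number of keys, although nothing here forces each group onto a different worker.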

Related

How can I shard request using Aeron Cluster

I'd like to understand the capabilities of Aeron Cluster with respect to sharding requests across different back-end cluster application instances. I am thinking of something similar to partitions in Kafka, where distinct back-end consumers handle the workload in independent processes. There should be a partition key that determines which partition to use, or it could be a consumer-provided hash, etc.
I read this article but it was not much help https://aeroncookbook.com/aeron-cluster/on-sharding/
So far I have only been reading the documentation and the API documents.
I also read the aeroncookbook site: https://aeroncookbook.com/aeron-cluster/on-sharding/
Could someone provide an example of this, if it is possible? The cookbook does not help much here because it poses a similar problem but with dependencies between the shards.
Aeron Cluster does not directly support sharding. Its primary goal is to keep redundant copies of the same data across multiple nodes. Sharding would need to be layered on via your own application logic. One approach would be to run multiple clusters and use a key to partition data across them, then run multiple cluster clients within your client application (one for each cluster) and select the appropriate client based on the data you are interacting with.
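A minimal sketch of that client-side routing, written in Python for brevity. The `ShardRouter` class is hypothetical and the strings stand in for real cluster client objects; nothing here is Aeron API.

```python
import zlib


class ShardRouter:
    """Pick one client per partition key using a stable hash."""

    def __init__(self, clients):
        if not clients:
            raise ValueError("at least one cluster client is required")
        self._clients = list(clients)

    def shard_for(self, key: str) -> int:
        # crc32 is stable across processes, unlike Python's built-in hash().
        return zlib.crc32(key.encode("utf-8")) % len(self._clients)

    def client_for(self, key: str):
        return self._clients[self.shard_for(key)]


# Placeholders: one entry per independent cluster.
router = ShardRouter(["cluster-client-0", "cluster-client-1", "cluster-client-2"])
print(router.client_for("account-17"))
```

The same key always maps to the same cluster, so any state tied to that key lives in exactly one cluster. Changing the number of clusters remaps keys, which this sketch does not handle.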

Distribution of content among cluster nodes within edge NiFi processors

I was exploring NiFi documentation. I must agree that it is one of the well documented open-source projects out there.
My understanding is that the processor runs on all nodes of the cluster.
However, I was wondering how the content is distributed among cluster nodes when we use content-pulling processors like FetchS3Object, FetchHDFS, etc. In a processor like FetchHDFS or FetchSFTP, will all nodes make a connection to the source? Does it split the content and fetch from multiple nodes, or does one node fetch the content and load balance it across the downstream queues?
I think this document has an answer to your question:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
For other file stores the idea is the same.
will all nodes make connection to the source?
Yes. If you did not limit your processor to work only on primary node - it runs on all nodes.
The answer by @dagget has traditionally been the approach to handle this situation, often referred to as the "list + fetch" pattern. The List processor runs on the Primary Node only, the listings are sent to a Remote Process Group (RPG) to redistribute them across the cluster, and an input port receives the listings and connects to a Fetch processor that runs on all nodes and fetches in parallel.
In 1.8.0 there are now load balanced connections which remove the need for the RPG. You would still run the List processor on Primary Node only, but then connect it directly to the Fetch processors, and configure the queue in between to load balance.
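The effect of the "list + fetch" pattern can be illustrated with a small Python simulation. This is not NiFi code: `distribute_listings`, the node names, and the file paths are hypothetical, and round-robin is assumed as the load-balancing strategy.

```python
from itertools import cycle


def distribute_listings(listings, nodes):
    """Assign each listed item to a node in round-robin order."""
    assignment = {node: [] for node in nodes}
    for node, item in zip(cycle(nodes), listings):
        assignment[node].append(item)
    return assignment


# The primary node produces the listing once...
listing = [f"/data/file-{i}.csv" for i in range(7)]
# ...and every node then fetches only its own share.
plan = distribute_listings(listing, ["node-1", "node-2", "node-3"])
for node, files in plan.items():
    print(node, files)
```

Only the lightweight listing is produced on a single node; the heavy fetch work is spread across the cluster.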

Task scheduling with spark

I am running a fairly large task on my 4-node cluster. I am reading around 4 GB of filtered data from a single table and running Naive Bayes training and prediction. I have an HBase region server running on a single machine that is separate from the Spark cluster, which runs in fair scheduling mode, although HDFS is running on all machines.
While executing, I am seeing a strange task distribution in terms of the number of active tasks on the cluster. Only one or at most two tasks are active on one or two machines at any point in time, while the others sit idle. My expectation was that the data in the RDD would be divided and processed on all the nodes for operations like count, distinct, etc. Why are all nodes not being used for large tasks of a single job? Does having HBase on a separate machine have anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions but can make the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions) and then looping through it and printing the number of elements of each of the arrays. (This introduces communication so don't leave it in your production code.)
Many of the API calls on RDD have optional parameters for setting the number of partitions, and then there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix problems you find using the above technique (but sometimes it will expose the need to rethink your algorithm.)
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the master.
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.
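The partition check described above can be sketched as follows. The helper `partition_report` is hypothetical and works on a plain list of per-partition element counts so that it runs without Spark; the comment shows how the counts might be collected from an RDD, assuming a live PySpark RDD named `rdd`.

```python
def partition_report(partition_sizes):
    """Summarise per-partition element counts to spot skew."""
    sizes = list(partition_sizes)
    total = sum(sizes)
    mean = total / len(sizes) if sizes else 0.0
    return {
        "partitions": len(sizes),
        "non_empty": sum(1 for s in sizes if s > 0),
        "total": total,
        "largest": max(sizes, default=0),
        # Ratio of the largest partition to the mean; 1.0 means perfectly even.
        "skew": (max(sizes) / mean) if mean else 0.0,
    }


# With PySpark (assumption), the sizes could be gathered with:
#   sizes = rdd.glom().map(len).collect()
# Hypothetical badly balanced RDD with four partitions:
sizes = [1_000_000, 0, 0, 12]
report = partition_report(sizes)
print(report)
```

A skew value far above 1, or many empty partitions, suggests that a call such as repartition() is worth trying before looking for a more complicated cause.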

MapReduce performance speed-up check on a simple MongoDB installation with two secondary nodes and a primary node

I have a simple MongoDB installation with two secondary nodes and one primary node. When I run a mapReduce query on 5 GB of data, it takes the same time as it did on a standalone MongoDB installation on one node. I am using the command line. Do I have to use any specific command to exploit the extra replica set members for mapReduce?
Thank you in advance.
You can speed up your job if you can use the aggregation framework instead of mapReduce; the aggregation framework is a lot faster.
You can't really scale your operations using replica sets, since replica sets are for high availability and failover (plus redundancy of data), not for scaling. You can run mapReduce or aggregation on a secondary: just connect to the secondary, call rs.slaveOk(), and then run mapReduce/aggregate. However, you cannot output results to a collection in that case, since you cannot write to a secondary, so the results have to be returned inline.
This will move the extra load off the primary, but it won't make the job faster per se. If you want to utilize multiple servers, you need to shard your database. By distributing the data over multiple shards/hosts, you automatically cause your mapReduce and/or aggregation queries to run over multiple servers. There is a small penalty for managing the results (they still have to be merged), but the time saved by running the longest part of the job in parallel will likely more than offset that overhead.

How to setup ElasticSearch cluster with auto-scaling on Amazon EC2?

There is a great tutorial elasticsearch on ec2 about configuring ES on Amazon EC2. I studied it and applied all recommendations.
Now I have AMI and can run any number of nodes in the cluster from this AMI. Auto-discovery is configured and the nodes join the cluster as they really should.
The question is: how do I configure the cluster so that I can automatically launch/terminate nodes depending on cluster load?
For example I want to have only 1 node running when we don't have any load and 12 nodes running on peak load. But wait, if I terminate 11 nodes in cluster what would happen with shards and replicas? How to make sure I don't lose any data in cluster if I terminate 11 nodes out of 12 nodes?
I might want to configure S3 Gateway for this. But all the gateways except for local are deprecated.
There is an article in the manual about shard allocation. Maybe I'm missing something very basic, but I must admit I failed to figure out whether it is possible to configure one node to always hold copies of all the shards. My goal is to make sure that if this were the only node running in the cluster, we still wouldn't lose any data.
The only solution I can imagine now is to configure the index to have 12 shards and 12 replicas. Then, when up to 12 nodes are launched, every node would have a copy of every shard. But I don't like this solution because I would have to reconfigure the cluster if I wanted more than 12 nodes at peak load.
Auto scaling doesn't make a lot of sense with ElasticSearch.
Shard moving and re-allocation is not a light process, especially if you have a lot of data. It stresses IO and network, and can degrade the performance of ElasticSearch badly. (If you want to limit the effect you should throttle cluster recovery using settings like cluster.routing.allocation.cluster_concurrent_rebalance, indices.recovery.concurrent_streams, and indices.recovery.max_size_per_sec. This will limit the impact but will also slow the re-balancing and recovery.)
Also, if you care about your data you don't want to have only 1 node ever. You need your data to be replicated, so you will need at least 2 nodes (or more if you feel safer with a higher replication level).
Another thing to remember is that while you can change the number of replicas, you can't change the number of shards. This is configured when you create your index and cannot be changed (if you want more shards you need to create another index and reindex all your data). So your number of shards should take into account the data size and the cluster size, considering the highest number of nodes you want but also your minimal setup (can fewer nodes hold all the shards and serve the estimated traffic?).
So theoretically, if you want to have 2 nodes at low time and 12 nodes on peak, you can set your index to have 6 shards with 1 replica. So on low times you have 2 nodes that hold 6 shards each, and on peak you have 12 nodes that hold 1 shard each.
But again, I strongly suggest rethinking this and testing the impact of shard moving on your cluster performance.
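The arithmetic behind the 6-shard, 1-replica example can be checked with a short Python sketch. The function name is hypothetical, and it assumes shard copies are spread as evenly as possible across nodes.

```python
import math


def shard_copies_per_node(primary_shards, replicas, nodes):
    """Return (total shard copies, max copies any one node must hold)."""
    # Each primary has `replicas` extra copies.
    total = primary_shards * (1 + replicas)
    return total, math.ceil(total / nodes)


# 6 primaries with 1 replica each gives 12 shard copies in total.
print(shard_copies_per_node(6, 1, nodes=2))   # low load
print(shard_copies_per_node(6, 1, nodes=12))  # peak load
```

Adding nodes beyond the total number of shard copies leaves the extra nodes without any shard of that index, which is why the shard count has to be chosen with the peak cluster size in mind.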
In cases where the elasticity of your application is driven by a variable query load, you could set up ES nodes configured not to store any data (node.data = false, http.enabled = true) and then put them in for auto scaling. These nodes could offload all the HTTP and result conflation processing from your main data nodes (freeing them up for more indexing and searching).
Since these nodes wouldn't have shards allocated to them bringing them up and down dynamically shouldn't be a problem and the auto-discovery should allow them to join the cluster.
I think this is a general concern when employing an auto-scalable architecture to meet temporary demand while data still needs to be saved. I think there is a solution that leverages EBS:
Map shards to specific EBS volumes. Let's say we need 15 shards; we will then need 15 EBS volumes.
Amazon allows you to mount multiple volumes, so we can start with a few instances that each have multiple volumes attached.
As load increases, we can spin up additional instances, up to 15.
The above solution is only advised if you know your maximum capacity requirements.
I can give you an alternative approach using the AWS Elasticsearch service (it will cost a little more than running Elasticsearch on plain EC2). Write a simple script that continuously monitors the load on the service (through the API/CLI) and, if the load goes beyond a threshold, programmatically increases the number of nodes in your AWS Elasticsearch service cluster. The advantage here is that AWS takes care of the scaling (as per the documentation, they take a snapshot and launch a completely new cluster). This works for scaling down as well.
Regarding the auto-scaling approach, there are some challenges: shard movement has an impact on the existing cluster, and we need to be more vigilant while scaling down. You can find a good article on scaling down here, which I have tested. If you can intelligently automate the steps in the above link through scripting (Python, shell) or through automation tools like Ansible, then scaling in/out is achievable. But again, you need to start scaling up well before the normal limits are reached, since the scale-up activities can have an impact on the existing cluster.
Question: is it possible to configure one node to always hold copies of all the shards?
Answer: Yes, it's possible with explicit shard routing. More details here
I would be tempted to suggest solving this a different way in AWS. I don't know what ES data this is or how it's updated, etc. Making a lot of assumptions, I would put the ES instance behind an ALB (Application Load Balancer). I would have a scheduled process that creates updated AMIs regularly (if you do it often, it will be quick to do). Then, based on the load of your single server, I would trigger more instances to be created from the latest AMI you have available, and add the new instances to the ALB to share some of the load. As the load quiets down, I would trigger the termination of the temporary instances. If you go this route, here are a couple more things to consider:
Use spot instances, since they are cheaper, if that fits your use case.
The "T" instances don't fit well here, since they need time to build up credits.
Use Lambdas for the task of turning things on and off. If you want to be fancy, you can trigger them from a webhook to the AWS gateway.
Making more assumptions about your use case, consider putting a Varnish server in front of your ES machine so that you can provide scale more cheaply with a cache strategy (lots of assumptions here). Based on the stress, you can dial in the right TTL for cache eviction. Check out the soft-purge feature; for our ES stuff we have gotten a lot of good value from it.
If you do any of what I suggest here, make sure your spawned ES instances report their logs back to a central addressable place on the persistent ES machine, so you don't lose logs when the machines die.
