I have set up Elasticsearch Certified by Bitnami on GCP, which I would like to put behind HTTP(S) Load Balancing on GCP for auto scaling purposes. What I have done is create a snapshot and use it to create an image for an instance template. But the instance group continuously reports "instance is being verified" and "Recreated instance" for a long time, and I don't know where the problem is, so I decided to use the default instance template from GCP instead.
My question is: when a new node is created, or when the data in Elasticsearch is updated, how can I sync data between the nodes behind the GCP load balancer? Consider a period of high traffic in which the load balancer creates a new node. When a query comes in through the load balancer, how does the new node get exactly the same data as the existing nodes? And when a new index comes in, how do all the nodes get it?
PS: I don't mind a delay; anything under 5 minutes is acceptable.
Thanks in advance for helping out.
In GCP, if you want to sync your data between nodes, we recommend using a centralized location to store your data. You can use Cloud Storage, Cloud SQL, Cloud File System etc. You can check this link to find more options for data storage. Then you can create an instance template specifying that any instance created from it uses the custom image and has access to that centralized data store. This is the recommended approach, rather than replicating new instances together with their data. See this link for a similar thread.
For your Elasticsearch setup, I'd recommend deploying an Elasticsearch cluster, which provides multiple VMs whose configuration you can customize. If you deploy a cluster, this other Stack Overflow post suggests that it is not necessary to use a load balancer, as Elasticsearch distributes the load between the nodes itself.
I have an Elastic Beanstalk environment set up with CodePipeline connected to my repository, which is a Laravel site.
When I push a new change to the master branch, it gets deployed to the EC2 instance AND DELETES ALL DATA IN STORAGE.
I can't find a similar question online, so any ideas on how I can fix this issue?
Elastic Beanstalk runs on instances that don't persist data between redeployments by default. You need to redesign your app to use one of the available persistent storage options if you really need stable storage.
Take a look at the Persistent storage section in the doc below.
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/concepts.concepts.design.html
I have a distributed system running on AWS EC2 instances. My cluster has around 2000 nodes. I want to introduce a stream processing model which can process metadata periodically published by each node (CPU usage, memory usage, IO, etc.). My system only cares about the latest data. It is also OK with missing a couple of data points when the processing model is down. Thus, I picked hazelcast-jet, which is an in-memory processing model with great performance. I have a couple of questions regarding the model:
What is the best way to deploy hazelcast-jet to multiple ec2 instances?
How to ingest data from thousands of sources? The sources push data instead of being pulled.
How do I configure the client so that it knows where to submit the tasks?
It would be super useful if there were a comprehensive example I could learn from.
What is the best way to deploy hazelcast-jet to multiple ec2 instances?
Download and unzip the Hazelcast Jet distribution on each machine:
$ wget https://download.hazelcast.com/jet/hazelcast-jet-3.1.zip
$ unzip hazelcast-jet-3.1.zip
$ cd hazelcast-jet-3.1
Go to the lib directory of the unzipped distribution and download the hazelcast-aws module:
$ cd lib
$ wget https://repo1.maven.org/maven2/com/hazelcast/hazelcast-aws/2.4/hazelcast-aws-2.4.jar
Edit bin/common.sh to add the module to the classpath. Towards the end of the file is the line:
CLASSPATH="$JET_HOME/lib/hazelcast-jet-3.1.jar:$CLASSPATH"
Duplicate this line and, in the copy, replace -jet-3.1 with -aws-2.4.
Edit config/hazelcast.xml to enable the AWS cluster discovery. The details are here. In this step you'll have to deal with IAM roles, EC2 security groups, regions, etc. There's also a best practices guide for AWS deployment.
Start the cluster with jet-start.sh.
How do I configure the client so that it knows where to submit the tasks?
A straightforward approach is to specify the public IPs of the machines where Jet is running, for example:
ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().setName("jet");
clientConfig.addAddress("54.224.63.209", "34.239.139.244");
However, depending on your AWS setup, these addresses may not be stable, so you can configure the client to discover them as well. This is explained here.
How to ingest data from thousands of sources? The sources push data instead of being pulled.
I think your best option for this is to put the data into a Hazelcast Map, and use a mapJournal source to get the update events from it.
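A map suits this workload because each node's newest reading overwrites its previous one, so only the latest state per node is kept. Below is a minimal plain-Java sketch of that keyed-overwrite behaviour. It uses a ConcurrentHashMap as a stand-in for a Hazelcast map and does not use Hazelcast or the event journal at all; the class name LatestMetrics and the node IDs are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for a distributed map: a plain in-process map keyed by node ID.
final class LatestMetrics {
    private final Map<String, Double> cpuByNode = new ConcurrentHashMap<>();

    // Called when a node pushes a reading; overwrites that node's previous value.
    void publish(String nodeId, double cpuUsage) {
        cpuByNode.put(nodeId, cpuUsage);
    }

    // Returns the most recent reading for the node, or null if none was published.
    Double latest(String nodeId) {
        return cpuByNode.get(nodeId);
    }

    // Number of distinct nodes that have published at least once.
    int nodeCount() {
        return cpuByNode.size();
    }

    public static void main(String[] args) {
        LatestMetrics metrics = new LatestMetrics();
        metrics.publish("node-1", 0.42);
        metrics.publish("node-2", 0.10);
        metrics.publish("node-1", 0.87); // replaces the earlier node-1 reading
        System.out.println(metrics.latest("node-1"));
        System.out.println(metrics.nodeCount());
    }
}
```

With around 2000 nodes the map holds around 2000 entries regardless of how often each node publishes. Note that a journal on top of such a map would still deliver every update event, not just the final value.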
I'm a little confused here. Please help me out.
I have a Spring Boot application which feeds data into Elasticsearch. This Spring Boot application runs on an AWS instance. Right now, I do not have proper log aggregation and I want to use the ELK stack for it.
Please help me out with these concerns...
Can I make a new log cluster on the same Elasticsearch instance and feed the log data into it? Is it a good idea?
Should I use a different Elasticsearch instance on the same machine with a different port and direct all the log traffic to this instance?
Should I host Elasticsearch on a new AWS server and direct all the traffic there? Will latency cause problems at later stages, when the volume of log data is huge?
The set of questions you've asked will have broad and varied answers depending on factors such as volume, velocity and capacity.
Think of an ES cluster as a database. If you have multiple log files/sources, you can insert them into different indexes on the same cluster.
If there are multiple read replicas, where can load-balancing-related settings be specified when using the Spring AWS libraries?
Read replicas have their own endpoint addresses, similar to the original RDS instance. Your application will need to take care of using all the replicas and switching between them. You'd need to build this selection logic into your application so that it automatically picks which RDS instance to connect to in turn. The following link can help:
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.Replication.html#Overview.ReadReplica
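One way to use the replicas "in turn" is simple round-robin selection. The following is a minimal sketch in plain Java, not part of any Spring or AWS library; the class name ReadReplicaRouter and the endpoint host names are hypothetical placeholders. It does no health checking, so a failed replica would still be handed out.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: hands out read-replica endpoints in round-robin order.
final class ReadReplicaRouter {
    private final List<String> endpoints;
    private final AtomicInteger counter = new AtomicInteger();

    ReadReplicaRouter(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        this.endpoints = List.copyOf(endpoints);
    }

    // Returns the next endpoint; floorMod keeps the index valid after int overflow.
    String next() {
        int i = Math.floorMod(counter.getAndIncrement(), endpoints.size());
        return endpoints.get(i);
    }

    public static void main(String[] args) {
        // Placeholder host names; real replica endpoints come from the RDS console.
        ReadReplicaRouter router = new ReadReplicaRouter(List.of(
                "replica-1.example.internal",
                "replica-2.example.internal"));
        for (int n = 0; n < 4; n++) {
            System.out.println(router.next());
        }
    }
}
```

The application would call next() each time it opens a read-only connection and keep sending writes to the primary instance's endpoint. The atomic counter makes the router safe to share between threads.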
I have a few questions about BigCouch that I'd like answered before I start using it.
Do I need to choose my shard key carefully, or can I just use an auto-generated GUID? I'm starting with a single server with 1 replica, but I want to be ready when I need to add another shard.
Is there any GUI for managing the cluster, like Couchbase has, or something similar for administering the DB?
How can I back up the data when hosting BigCouch on EC2 (i.e. snapshots)?
Thanks
Since you have not started to use BigCouch yet, and it looks like you need some features that are available out of the box in Couchbase (auto-sharding, administration console, ...), why not go with Couchbase?