Dataflow PubSub to Elasticsearch Template proxy - elasticsearch

We need to create a Dataflow job that ingests from PubSub to Elasticsearch, but the job cannot make outbound internet connections to reach Elastic Cloud.
Is there a way to pass proxy parameters to the Dataflow VMs at creation time?
I found this article, but the proxy parameters are part of a Maven app, and I'm not sure how to use them here.
https://leifengblog.net/blog/run-dataflow-jobs-in-a-shared-vpc-on-gcp/
Thanks

To reach an external endpoint you'll need to configure internet access and firewall settings. Depending on your use case, your VMs may also need access to other resources; you can check in this document which method you'll need to configure for Dataflow. Before selecting a method, please check the document on how to specify a network or a subnetwork.
In GCP you can enable Private Google Access on a subnetwork, and the VMs in that subnetwork will be able to reach all the GCP endpoints (Dataflow, BigQuery, etc.), even if they only have private IPs. There is no need to set up a proxy. See this document.
For instance, for Java pipelines, I normally use private IPs only for the Dataflow workers, and they are able to reach Pub/Sub, BigQuery, Bigtable, etc.
For Python pipelines, if you have external dependencies, the workers will need to reach PyPI, and for that you need internet connectivity. If you want to use private IPs in Python pipelines, you can ship those external dependencies in a custom container, so the workers don't need to download them.
As for Maven: right after you write your pipeline, you must create and stage your template file with mvn; you can follow this example.
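If you end up writing the pipeline yourself in Java, the network-related worker options can be set directly on the pipeline options. A minimal sketch, assuming the Beam Dataflow runner; the project, region, and subnetwork names are placeholders, and the actual pipeline steps are omitted:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PrivateWorkersPipeline {
    public static void main(String[] args) {
        DataflowPipelineOptions options =
                PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        // Placeholder subnetwork URL; it should point at a subnetwork with
        // Private Google Access enabled.
        options.setSubnetwork("https://www.googleapis.com/compute/v1/projects/MY_PROJECT"
                + "/regions/us-central1/subnetworks/MY_SUBNET");
        // Workers get private IPs only, so no public internet egress.
        options.setUsePublicIps(false);

        Pipeline pipeline = Pipeline.create(options);
        // ... add the PubSub-to-Elasticsearch steps here ...
        pipeline.run();
    }
}

The same settings are also available as the --subnetwork and --usePublicIps pipeline options on the command line when you launch the job.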

Related

Deploying jaeger on AWS ECS with Elasticsearch

How should I go about deploying Jaeger on AWS ECS with Elasticsearch as the backend? Is it a good idea to use the Jaeger all-in-one image, or should I use separate images?
While I didn't find any official jaeger reference on this, I think the jaeger all-in-one image is not intended for use in production. It makes one container a single point of failure, so it is better to use separate containers for each jaeger component (if one is down for some reason, the others can continue to operate).
I have recently written a blog post about hosting jaeger on AWS with the AWS Elasticsearch (OpenSearch) service. While it uses the all-in-one image, it is still useful for getting a general idea of how to go about this.
Just to generally outline the process (described in detail in the post):
Create AWS Elasticsearch cluster
Create an ECS Cluster (running on ec2)
Create an ECS Task Definition, configured with a jaeger all-in-one image and the Elasticsearch URL from step 1
Create an ECS Service that runs the created task definition
Make sure security groups on your EC2 allow access to jaeger ports as described here
Send spans to your jaeger endpoint via the OpenTelemetry SDK (a minimal Java sketch follows after this list)
View your spans via the hosted jaeger UI (your-ec2-url:16686)
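Regarding the last two steps, here is a minimal Java sketch of sending a span with the OpenTelemetry SDK. It assumes a Jaeger version recent enough to accept OTLP over gRPC on port 4317 (older versions would need the Jaeger-specific exporter on port 14250 instead); the endpoint and names are placeholders:

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class SendSpanExample {
    public static void main(String[] args) {
        // Placeholder endpoint: the jaeger OTLP gRPC port on your ECS host.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://your-ec2-url:4317")
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build();

        Tracer tracer = sdk.getTracer("demo-service");
        Span span = tracer.spanBuilder("demo-operation").startSpan();
        try {
            // ... the work you want to trace ...
        } finally {
            span.end();
        }

        // Flush and shut down so the span reaches jaeger before the JVM exits.
        tracerProvider.close();
    }
}

Once the span is exported you should be able to find it in the jaeger UI on port 16686.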
The all-in-one image is a useful tool in development to test your work locally.
For deployment it is very limiting. Ideally, to handle a potentially large volume of traffic, you will want to scale parts of your infrastructure.
I would recommend deploying multiple jaeger-collectors, configured to write to the ES cluster. Then you can configure jaeger-agents running as a sidecar to each app or service broadcasting telemetry info. These agents can be configured to forward to one of a list of collectors adding some extra resilience.

How does kubectl configure CRDs?

There are already some CRDs defined in my Kubernetes cluster.
kubectl can create/update/delete those resources just fine.
When I tried to do those operations programmatically, the way I found by searching is to generate code with the tool below:
https://github.com/kubernetes/code-generator
I'm wondering why kubectl can do it out of the box without generating code for CRDs.
Is it necessary to generate code in order to add or delete a CRD resource?
Thanks!
First, let's understand what a CRD is.
The CustomResourceDefinition API resource allows you to define custom resources. Defining a CRD object creates a new custom resource with a name and schema that you specify. The Kubernetes API serves and handles the storage of your custom resource. The name of a CRD object must be a valid DNS subdomain name.
This frees you from writing your own API server to handle the custom resource, but the generic nature of the implementation means you have less flexibility than with API server aggregation.
Why would one create Custom Resources:
A resource is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind. For example, the built-in pods resource contains a collection of Pod objects.
A custom resource is an extension of the Kubernetes API that is not necessarily available in a default Kubernetes installation. It represents a customization of a particular Kubernetes installation. However, many core Kubernetes functions are now built using custom resources, making Kubernetes more modular.
Custom resources can appear and disappear in a running cluster through dynamic registration, and cluster admins can update custom resources independently of the cluster itself. Once a custom resource is installed, users can create and access its objects using kubectl, just as they do for built-in resources like Pods.
So, to answer your question: if you need functionality that is missing from Kubernetes, you need to create it yourself using CRDs. Without them, the cluster won't know what you want or how to get it.
If you are looking for examples of how to use the Kubernetes client-go library, you can find them on the official GitHub under client-go/examples.
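As for the code-generation part of the question: generated, typed clients are a convenience rather than a requirement. kubectl discovers the resources served by the API server at runtime and works with custom resources generically, and client libraries offer dynamic/unstructured clients that can do the same. A minimal sketch using the fabric8 Java client (not part of the original answer; the group, version, and kind below are hypothetical, and a 6.x client version is assumed):

import io.fabric8.kubernetes.api.model.GenericKubernetesResourceList;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.base.ResourceDefinitionContext;

public class DynamicCrdExample {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Hypothetical custom resource: widgets.example.com/v1, namespaced.
            ResourceDefinitionContext context = new ResourceDefinitionContext.Builder()
                    .withGroup("example.com")
                    .withVersion("v1")
                    .withKind("Widget")
                    .withPlural("widgets")
                    .withNamespaced(true)
                    .build();

            // List custom objects without any generated classes.
            GenericKubernetesResourceList widgets = client
                    .genericKubernetesResources(context)
                    .inNamespace("default")
                    .list();
            widgets.getItems().forEach(w -> System.out.println(w.getMetadata().getName()));

            // Delete one of them, again with no generated code involved.
            client.genericKubernetesResources(context)
                    .inNamespace("default")
                    .withName("my-widget")
                    .delete();
        }
    }
}

Generated code (via code-generator or similar) only becomes useful when you want strongly typed access to your custom resources, for example inside a controller.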

hazelcast-jet deployment and data ingestion

I have a distributed system running on AWS EC2 instances. My cluster has around 2000 nodes. I want to introduce a stream processing model which can process metadata periodically published by each node (CPU usage, memory usage, IO, etc.). My system only cares about the latest data. It is also OK to miss a couple of data points when the processing model is down. Thus, I picked hazelcast-jet, which is an in-memory processing model with great performance. Here I have a couple of questions regarding the model:
What is the best way to deploy hazelcast-jet to multiple ec2 instances?
How to ingest data from thousands of sources? The sources push data instead of being pulled.
How to config client so that it knows where to submit the tasks?
It would be super useful if there is a comprehensive example where I can learn from.
What is the best way to deploy hazelcast-jet to multiple ec2 instances?
Download and unzip the Hazelcast Jet distribution on each machine:
$ wget https://download.hazelcast.com/jet/hazelcast-jet-3.1.zip
$ unzip hazelcast-jet-3.1.zip
$ cd hazelcast-jet-3.1
Go to the lib directory of the unzipped distribution and download the hazelcast-aws module:
$ cd lib
$ wget https://repo1.maven.org/maven2/com/hazelcast/hazelcast-aws/2.4/hazelcast-aws-2.4.jar
Edit bin/common.sh to add the module to the classpath. Towards the end of the file is a line
CLASSPATH="$JET_HOME/lib/hazelcast-jet-3.1.jar:$CLASSPATH"
You can duplicate this line and replace -jet-3.1 with -aws-2.4.
Edit config/hazelcast.xml to enable the AWS cluster discovery. The details are here. In this step you'll have to deal with IAM roles, EC2 security groups, regions, etc. There's also a best practices guide for AWS deployment.
Start the cluster with jet-start.sh.
How to config client so that it knows where to submit the tasks?
A straightforward approach is to specify the public IPs of the machines where Jet is running, for example:
ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().setName("jet");
clientConfig.addAddress("54.224.63.209", "34.239.139.244");
However, depending on your AWS setup, these may not be stable, so you can configure the client to discover the addresses as well. This is explained here.
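Once the addresses (or discovery) are in place, the client is created from that config and jobs are submitted through it. A short continuation of the snippet above, assuming the Jet 3.1 API (com.hazelcast.jet.Jet, JetInstance and Job):

JetInstance jet = Jet.newJetClient(clientConfig);
Job job = jet.newJob(pipeline);   // pipeline built as in the sketch below
job.join();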
How to ingest data from thousands of sources? The sources push data instead of being pulled.
I think your best option for this is to put the data into a Hazelcast Map, and use a mapJournal source to get the update events from it.
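A minimal sketch of that approach with the Jet 3.1 pipeline API, assuming a map named "metrics" whose event journal has been enabled in the member configuration (each node would put its latest metrics under its own key, here as a plain String value):

import com.hazelcast.jet.pipeline.JournalInitialPosition;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class MetricsPipeline {
    public static Pipeline build() {
        Pipeline p = Pipeline.create();
        // Stream the update events of the "metrics" IMap; since only the
        // latest data matters, start from the current position rather than
        // replaying the whole journal.
        p.drawFrom(Sources.<String, String>mapJournal("metrics",
                JournalInitialPosition.START_FROM_CURRENT))
         .withoutTimestamps()
         .drainTo(Sinks.logger());   // replace with your real processing/sink
        return p;
    }
}

The nodes push data simply by doing map.put(nodeId, latestMetrics) against the cluster (for example via a lightweight Hazelcast client), and the journal source turns those puts into a stream of events for the job submitted above.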

Setup auto scaling elasticsearch behind GCP load balancer

I have set up Elasticsearch Certified by Bitnami on GCP,
which I would like to put behind HTTP(S) Load Balancing on GCP for auto-scaling purposes. What I have done is create a snapshot and use it to create an image for an instance template. But the instance group kept returning "instance is being verified" and "Recreated instance" for a long time, and I don't know where the problem is, so I decided to use the default instance template from GCP instead.
My question is: when a new node is created, or when the data in Elasticsearch is updated, how can I sync data between the nodes behind the GCP load balancer? Think about when there is high traffic and the load balancer creates a new node: when a query comes in through the load balancer, how does the new node have exactly the same data as the existing nodes, and when a new index comes in, how do all the nodes get the new index?
PS: I don't mind the delay; if it is less than 5 minutes, it is acceptable.
Thanks in advance for helping out.
In GCP, if you want to sync data between nodes, we recommend using a centralized location to store your data. You can use Cloud Storage, Cloud SQL, Filestore, etc.; you can check this link to find more options for data storage. Then you can create an instance template which specifies that any instance created from it will use the custom image and have access to that centralized data store. This is the recommended workaround, rather than replicating the data onto new instances. You can check this link for a similar thread.
For your Elasticsearch setup, I'd recommend deploying an Elasticsearch cluster, which provides multiple VMs whose configuration you can customize. If you deploy a cluster, this other Stack Overflow post suggests that it is not necessary to use a load balancer, as Elasticsearch handles the load between the nodes.

Loadbalancing settings via spring AWS libraries for multiple RDS Read Only Replicas

If there are multiple read replicas, where can load-balancing-related settings be specified when using the Spring AWS libraries?
Read replicas have their own endpoint addresses, similar to the original RDS instance. Your application will need to take care of using all the replicas and switching between them. You'd need to introduce this logic into your application so it automatically detects which RDS instance it should connect to in turn. The following link can help:
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.Replication.html#Overview.ReadReplica
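To illustrate the kind of switching logic meant above, here is a minimal sketch using Spring's AbstractRoutingDataSource to rotate across replicas round-robin. The lookup keys ("replica-0", "replica-1", ...) are hypothetical; you would register one DataSource per read-replica endpoint via setTargetDataSources(...):

import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

public class RoundRobinReplicaDataSource extends AbstractRoutingDataSource {

    private final int replicaCount;
    private final AtomicInteger counter = new AtomicInteger();

    public RoundRobinReplicaDataSource(int replicaCount) {
        this.replicaCount = replicaCount;
    }

    @Override
    protected Object determineCurrentLookupKey() {
        // Pick the next replica key on every connection lookup.
        int index = Math.floorMod(counter.getAndIncrement(), replicaCount);
        return "replica-" + index;
    }
}

This only spreads reads across the replicas; health checks, replica lag handling, and routing writes to the primary instance still have to be handled separately.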

Resources