Assign and maintain a sequential worker number or NodeId in Kubernetes - Spring Boot

When a Kubernetes Spring Boot app is launched with 8 instances, the app running on each node needs to fetch the sequence number of its pod/container. No two pods/containers running the same app should get the same number. Assume that a pod runs a single container, and a container runs only one instance of the app.
There are a few unique identifiers the app can pull from Kubernetes API for each pod such as:
MAC address (networkInterface.getHardwareAddress())
Hostname
nodeName (aks-default-12345677-3)
targetRef.name (my-sample-service-sandbox-54k47696e9-abcde)
targetRef.uid (aa7k6278-abcd-11ef-e531-kdk8jjkkllmm)
IP address (12.34.56.78)
But the app cannot safely use this information from the API to generate and assign itself a unique number within the range [0, max node count - 1]. Any reducer step (e.g. a bitwise &) run over these unique identifiers will eventually repeat numbers. Communicating with the other pods is an anti-pattern, although there are approaches that use consensus/agreement patterns to accomplish this.
My Question is:
Is there a simple way for Kubernetes to assign a sequential number to each node/container/pod when it is created, possibly in an environment variable in the pod? The numbers can begin at 0 or 1 and should go up to the maximum pod count.
Background info and some research:
Executing UUID.randomUUID().hashCode() & 7 eight times will get you repeated numbers between 0 and 7. See the referenced article, which makes this mistake in createNodeId().
Sample outputs from actual runs of the reducer step above:
{0=2, 1=1, 2=0, 3=3, 4=0, 5=1, 6=1, 7=0}
{0=1, 1=0, 2=0, 3=1, 4=3, 5=0, 6=2, 7=1}
{0=1, 1=0, 2=2, 3=1, 4=1, 5=2, 6=0, 7=1}
I went ahead and executed 100 million runs of the above code and found that only 0.24% of the cases had an even distribution.
Uneven Reducers: 99760174 | Even Reducers: 239826
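
For reference, here is a minimal sketch of the reducer experiment above (my reconstruction, not the referenced article's code); a single run of 8 assignments rarely produces one number per bucket:

import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

public class ReducerExperiment {
    public static void main(String[] args) {
        // Count how often each "node id" in [0, 7] is produced across 8 instances.
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int k = 0; k < 8; k++) {
            counts.put(k, 0);
        }
        for (int i = 0; i < 8; i++) {
            int nodeId = UUID.randomUUID().hashCode() & 7; // the reducer step
            counts.merge(nodeId, 1, Integer::sum);
        }
        // An even distribution would map every key 0..7 to exactly 1.
        System.out.println(counts);
    }
}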

app is launched with 8 instances, the app running in each node needs to fetch sequence number of the pod
It sounds like you are requesting a stable pod identity. If you deploy your Spring Boot app as a StatefulSet instead of as a Deployment, this identity is a feature Kubernetes provides out of the box: each pod gets a stable name of the form <statefulset-name>-<ordinal> (my-app-0, my-app-1, ..., my-app-7), and that name is also the pod's hostname.
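A minimal sketch (a hypothetical helper class of my own, not Spring Boot or Kubernetes API code) of how the app could turn that ordinal into its worker number, assuming the container's HOSTNAME environment variable holds the pod name as it does by default:

public final class WorkerNumber {

    // Derives the worker number from a StatefulSet pod name such as "my-app-3" -> 3.
    public static int fromHostname() {
        String hostname = System.getenv().getOrDefault("HOSTNAME", "");
        int idx = hostname.lastIndexOf('-');
        if (idx < 0 || idx == hostname.length() - 1) {
            throw new IllegalStateException("Hostname does not look like a StatefulSet pod name: " + hostname);
        }
        return Integer.parseInt(hostname.substring(idx + 1));
    }

    private WorkerNumber() {
    }
}

With 8 replicas this yields exactly the numbers 0 through 7 with no repeats, and the number survives pod restarts because the StatefulSet reuses the same ordinal.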

Related

Custom gauge for Prometheus Go SDK

I have an app that monitors some external jobs (among other things). It does this monitoring every 5 minutes.
I'm trying to create a Prometheus gauge to get the count of currently running jobs.
Here is how I declared my gauge
JobStats = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace:   "myapi",
        Subsystem:   "app",
        Name:        "job_count",
        Help:        "Current running jobs in the system",
        ConstLabels: nil,
    },
    []string{"l1", "l2", "l3"},
)
In the code that actually counts the jobs, I do:
metrics.JobStats.WithLabelValues(l1, l2, l3).Add(float64(jobs_cnt))
When I query the /metrics endpoint I get the number.
The thing is, this metric only keeps increasing. If I restart the app it gets reset to zero and then again keeps increasing.
I'm using Grafana to graph this in a dashboard.
My questions are:
How do I get the graph to show the actual number of jobs (instead of an ever-increasing line)?
Should this be handled in code (e.g. by setting the gauge to zero before every collection) or in Grafana?

How do you keep a running count of AWS Spot Fleet (request and maintain) instances?

How can I get a running count of the total number of Spot Fleet instances?
If the target capacity is 5 instances and some time later 2 of them terminate, AWS Spot Fleet automatically relaunches 2 instances to sustain the target capacity. I want the count of total instances launched to be 7 (the 5 original instances plus the 2 replacements).
I can use this command to get the index value:
curl -s http://169.254.169.254/latest/meta-data/ami-launch-index
However, once the Spot Fleet target is met and one of the instances terminates for some reason, the fleet relaunches another instance automatically to fulfill the TargetCapacity, but the launch index reverts to "0" instead of incrementing from the last number assigned.
thanks!

How to require one pod per minion/kubelet when configuring a replication controller?

I have 4 nodes (kubelets) configured with a label role=nginx
master ~ # kubectl get node
NAME LABELS STATUS
10.1.141.34 kubernetes.io/hostname=10.1.141.34,role=nginx Ready
10.1.141.40 kubernetes.io/hostname=10.1.141.40,role=nginx Ready
10.1.141.42 kubernetes.io/hostname=10.1.141.42,role=nginx Ready
10.1.141.43 kubernetes.io/hostname=10.1.141.43,role=nginx Ready
I modified the replication controller and added these lines
spec:
  replicas: 4
  selector:
    role: nginx
But when I fire it up I get 2 pods on one host. What I want is 1 pod on each host. What am I missing?
Prior to DaemonSet being available, you can also specify that your pod uses a host port and set the number of replicas in your replication controller to something greater than your number of nodes. The host port constraint will allow only one pod per host.
I was able to achieve this by modifying the labels as follows:
master ~ # kubectl get nodes -o wide
NAME LABELS STATUS
10.1.141.34 kubernetes.io/hostname=10.1.141.34,role=nginx1 Ready
10.1.141.40 kubernetes.io/hostname=10.1.141.40,role=nginx2 Ready
10.1.141.42 kubernetes.io/hostname=10.1.141.42,role=nginx3 Ready
10.1.141.43 kubernetes.io/hostname=10.1.141.43,role=nginx4 Ready
I then created 4 nginx replication controllers each referencing the nginx{1|2|3|4} roles and labels.
A replication controller doesn't guarantee one pod per node, since the scheduler will find the best fit for each pod. I think what you want is the DaemonSet controller, which is still under development. Your workaround posted above would work too.

Can you determine the number of workers running in your application?

In order to properly scale our Sidekiq workers to the size of our database pool, we came up with a little formula in our configuration:
sidekiq.rb
Sidekiq.configure_server do |config|
  config.options[:concurrency] = ((ENV['DB_POOL'] || 5).to_i - 1) / workers
end

def workers
  # ... the number of workers configured for our project ...
  (ENV['HEROKU_WORKERS'] || 1).to_i
end
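With hypothetical numbers, DB_POOL=25 and HEROKU_WORKERS=2 would give each Sidekiq process (25 - 1) / 2 = 12 threads.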
We're setting HEROKU_WORKERS by hand, but it would be sweet if there was a way to interrogate the Heroku API from within the application.
Modulo all the things that can happen (workers going up or down, changing the number of workers, etc.), this seems to get us out of the initial problem, where our workers would consume all of the database pool connections and then start crashing on startup.
The heroku-api gem should provide this.
https://github.com/heroku/heroku.rb
You should find your API key here: https://dashboard.heroku.com/account
require 'heroku-api'
heroku = Heroku::API.new(api_key: API_KEY)
Total number of current processes:
heroku.get_ps('heroku-app-name').body.count
(You should be able to parse this to get total number of workers... or a count of a specific kind of worker, if you have different kinds defined in your Procfile/Heroku app)

Hector is unable to read Cassandra data when nodes reboot or terminate

We are trying to run a Cassandra cluster on AWS/EC2 within a standard VPC footprint (Cassandra nodes on private subnets). Because this is AWS, there is always a chance that an EC2 instance will terminate or reboot with no warning. I have been simulating this case on a test cluster, and I am seeing things that I thought a cluster was supposed to prevent. Specifically, if a node reboots, some data goes temporarily missing until the node completes its reboot. If a node terminates, it appears that some data is lost forever.
For my test I just did a bunch of writes (using QUORUM consistency) to some keyspaces, then interrogated the contents of those keyspaces as I brought down nodes (either through reboot or terminate). I'm just using cqlsh SELECT to do the keyspace/column family interrogation of the cluster, at consistency level ONE.
Note that even though I am performing no writes to the cluster while I am doing the SELECTs, rows temporarily disappear when rebooting and can permanently go missing during termination.
I thought Netflix Priam might be able to help, but sadly, the last time I checked, it doesn't work in a VPC.
Also, because we are using ephemeral storage instances, there is no equivalent of 'shutdown', so I cannot run any scripts during reboot/terminate of an instance to perform a nodetool decommission or nodetool removenode before the instance goes away. Terminate is the equivalent of kicking the plug out of the wall.
Since I am using a replication factor of 3 and QUORUM writes, all data should be written to at least 2 nodes. So, unless I am totally misunderstanding things (which is possible), losing one node should not mean that I lose any data for any period of time when I am using consistency level ONE for the read.
Questions
Why wouldn't a 6 node cluster with a replication factor of 3 work?
Do I need to run something like a 12 node cluster with a replication factor of 7? Don't bother telling me that will fix the problem, because it doesn't.
Do I need to use consistency level of ALL on the writes then use ONE or QUORUM on the reads?
Is there something not quite right with virtual nodes? (unlikely)
Are there nodetool commands besides removenode that I need to run when a node terminates to recover missing data? As mentioned earlier, when a reboot occurs, eventually the missing data reappears.
Is there some cassandra savant who can look at my cassandra.yaml file below and send me on the path to salvation?
More Info added 7/19
I don't think QUORUM vs ONE vs ALL is the issue. The test I set up performs no writes to the keyspaces after the initial population of the column families, so the data has had plenty of time (hours) to make it to all the nodes as required by the replication factor. Plus, the test dataset is REALLY small (2 column families with about 300-1000 values each). In other words, the data is completely static.
The behavior I am seeing seems to be tied to the fact that the EC2 instance is no longer on the network. The reason I say this is that if I log on to a node and just do a cassandra stop, I see no loss of data. But if I do the reboot or terminate, I start getting the following in a stack trace.
CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
CassandraHostRetryService - Downed Host retry shutdown complete
CassandraHostRetryService - Downed Host retry shutdown hook called
Caused by: TimedOutException()
Caused by: TimedOutException()
So it seems to be more of a network communication issue, in that the cluster expects, for example, 10.0.12.74 to be on the network after it has joined the cluster. If that IP is suddenly unreachable due to reboot or termination, the timeouts start happening.
When I do a nodetool status under all three scenarios (cassandra stop, reboot or terminate) the status of the node shows up as DN. Which is what you would expect. Eventually nodetool status will return to UN with cassandra start or reboot, but obviously termination always stays DN.
Details of my Configuration
Here are some details of my configuration (cassandra.yaml is at the bottom of this posting):
Nodes are running in private subnets of a VPC.
Cassandra 1.2.5 with num_tokens: 256 (virtual nodes). initial_token: (blank). I am really hoping this works, because all of our nodes run in autoscaling groups, so the thought that redistribution could be handled dynamically is appealing.
EC2 m1.large; one seed and one non-seed node in each availability zone (so 6 total nodes in the cluster).
Ephemeral storage, not EBS.
Ec2Snitch with NetworkTopologyStrategy and all keyspaces have replication factor of 3.
Non-seed nodes are auto_bootstrapped, seed nodes are not.
sample cassandra.yaml file
cluster_name: 'TestCluster'
num_tokens: 256
initial_token:
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
disk_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
row_cache_provider: SerializingCacheProvider
saved_caches_directory: /opt/company/dbserver/caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "SEED_IP_LIST"
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 8
memtable_flush_queue_size: 4
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: LISTEN_ADDRESS
start_native_transport: false
native_transport_port: 9042
start_rpc: true
rpc_address: 0.0.0.0
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
auto_bootstrap: AUTO_BOOTSTRAP
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: Ec2Snitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
server_encryption_options:
  internode_encryption: none
  keystore: conf/.keystore
  keystore_password: cassandra
  truststore: conf/.truststore
  truststore_password: cassandra
client_encryption_options:
  enabled: false
  keystore: conf/.keystore
  keystore_password: cassandra
internode_compression: all
I think http://www.datastax.com/documentation/cassandra/1.2/cassandra/dml/dml_config_consistency_c.html will clear up a lot of this. In particular, QUORUM/ONE is not guaranteed to return the most recent data. QUORUM/QUORUM is. So is ALL/ONE, but that will be intolerant to failure on write.
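A quick way to see why: with replication factor 3, a QUORUM write is acknowledged by 2 replicas and a ONE read consults 1, and reads are only guaranteed to overlap writes when R + W > RF. Here 1 + 2 = 3, which is not greater than 3, so a ONE read can land on the single replica that missed the write. QUORUM/QUORUM gives 2 + 2 = 4 > 3 and ALL/ONE gives 3 + 1 = 4 > 3, which is why those combinations are guaranteed.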
Edit to go with the new information:
CassandraHostRetryService is part of Hector. I assumed you were testing with cqlsh like a sane person would. Lessons:
Use cqlsh for testing
Use the DataStax Java Driver for building your application, which is faster, easier to use, and has more insight into the cluster state than Hector thanks to the native protocol it's built on.
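As a rough sketch of that recommendation (using the 2.x-era com.datastax.driver.core API; the contact point, keyspace, and table names are placeholders, not from the question), setting QUORUM as the default consistency level for both reads and writes would look something like this. Note that the posted cassandra.yaml has start_native_transport: false, so the native transport on port 9042 would need to be enabled for this driver.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class QuorumReadExample {
    public static void main(String[] args) {
        // Contact point, keyspace, and table are placeholders for this sketch.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.12.74")
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.QUORUM))
                .build();
        try {
            Session session = cluster.connect("my_keyspace");
            for (Row row : session.execute("SELECT * FROM my_table LIMIT 10")) {
                System.out.println(row);
            }
        } finally {
            cluster.close();
        }
    }
}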
