is ReadWriteOnce thread safe? - spring

I tried to search and look for resources regarding this, but what I found is that ReadWriteOnce mode allows a node to read and write once. A node runs many pods, and that's what confuses me: since pods can read or write, is this access going to be scheduled somehow?
What I'm really trying to achieve is thread safety,
since there's one master that writes to this shared volume and many nodes that should read from it afterwards.

ReadWriteOnce is an access mode, and access modes dictate how many nodes can mount the PersistentVolume and with which capabilities (read/write) - but not how many operations can be performed against that volume. Thus, your statement "I found that ReadWriteOnce mode allows a node to read and write once" is inaccurate.
From the docs:
ReadWriteOnce
the volume can be mounted as read-write by a single node. ReadWriteOnce access mode still can allow multiple pods to access the volume when the pods are running on the same node.
So no, this has nothing to do with synchronizing access to the underlying storage.
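For completeness, here is a minimal sketch of where the access mode is declared. The claim name, storage size, and the use of a heredoc are placeholders for illustration, not anything from the question; the only point is the accessModes field.
# Hedged sketch: a PVC requesting ReadWriteOnce
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadWriteOnce        # one node may mount it read-write; pods on that node can share it
  resources:
    requests:
      storage: 1Gi
EOF
Any locking or coordination between the writer and the readers still has to happen at the application level (or via a storage system that provides it).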

Related

Which algorithm does Kubernetes use to assign pods to nodes?

This is more of a cost estimation question than a question about how to use features like node affinity.
So basically there are m pods with some constraints like:
each pod of specific Deployments / StatefulSets should be on a different kubernetes node
pods of specific Deployments / StatefulSets should be balanced over 3 availability zones
Now, I want to find how many nodes (all of the same type) I will need to host a given set of Deployments / StatefulSets.
I first thought of this as an assignment problem to be solved with the Hungarian algorithm, but it seems much more complex given the multi-dimensional constraints.
To my knowledge, the algorithm used by default by the kube-scheduler is described on GitHub here.
It explains how it works. The scheduler first filters out nodes that do not meet the requirements of the pods, e.g. resource requests exceeding the available resources on a node, affinity rules, etc.
It then uses a ranking algorithm to determine the best-fitting node. For deeper insights, refer to the scheduler documentation linked above.
Kubernetes assigns pods based on many constraints, such as:
Resource requirements
Resource availability (node capacity)
Node selectors or affinity rules, if any
Weight of affinity rules
This is a good article on the topic: https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
Also: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
I would suggest reading: https://kubernetes.io/docs/concepts/scheduling-eviction/
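To make the two constraints from the question concrete, here is a rough sketch of how they are commonly expressed in a Deployment spec (the name, labels, and image are placeholders, not taken from the question); the scheduler then applies the filtering and ranking described above to these rules.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      # each replica on a different node
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: demo-app
              topologyKey: kubernetes.io/hostname
      # replicas balanced across availability zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: demo-app
      containers:
        - name: app
          image: nginx    # placeholder image
EOF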
In reference to the very good answer from user Harsh Manvar, I will add some more information of my own. This topic is covered in the documentation, as described above. Besides that, you can find very good material here:
What Happens When You Create a Pod On a Kubernetes Cluster?
In a matter of seconds, the Pod is up and running on one of the cluster nodes. However, a lot has happened within those seconds. Let’s see:
While watching the API server (which it does continuously), the Kubernetes Scheduler detects that there is a new Pod without a nodeName parameter. The nodeName is what indicates which node should own this Pod.
The Scheduler selects a suitable node for this Pod and updates the Pod definition with the node name (through the nodeName parameter).
The kubelet on the chosen node is notified that there is a pod that is pending execution.
The kubelet executes the Pod, and the latter starts running on the node.
You can also find a tutorial about Scheduling Process and Scheduler Algorithms.
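If you want to watch this happen, a couple of commands can show the result of the scheduling step (my-pod is a placeholder name):
# Which node did the scheduler pick? Empty output means the Pod is still Pending.
kubectl get pod my-pod -o jsonpath='{.spec.nodeName}'
# The "Scheduled" event the scheduler recorded for that Pod
kubectl get events --field-selector involvedObject.name=my-pod,reason=Scheduled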

Oracle NoSQL Storage Nodes

What are the guidelines for creating an Oracle NoSQL Database Storage Node (SN)? Can we create multiple Storage Nodes on the same machine? If so, what are the trade-offs? I looked at the product documentation, but it's not clear.
Digging deeper, here's what I found:
It is recommended that Storage Nodes (SNs) be allocated one per node in the cluster for availability and performance reasons. If you believe that a given node has the I/O and CPU resources to host multiple Replication Nodes, the Storage Node's capacity parameter can be set to a value greater than one, and the system will know that multiple RNs may be hosted at this SN. This way, the system can:
ensure that each Replication Node in a shard is hosted on a different Storage Node, reducing a shard's vulnerability to failure
dynamically divide up memory and other hardware resources among the Replication Nodes
ensure that the master Replication Nodes, which are the ones that service write operations in a store, are distributed evenly among the Storage Nodes both at startup and after any failovers
If more than one SN is hosted on the same node, multiple SNs are lost if that node fails, and data may become inaccessible.
You can set the capacity parameter for a Storage Node in several ways:
When using the makebootconfig command
With the change-policy command
With the plan change-params command.
Also, in very limited situations, such as for early prototyping and experimentation, it can be useful to create multiple SNs on the same node.
On a single machine, a Storage Node is uniquely identified by its root directory (KVROOT) plus a configuration file name, which defaults to "config.xml." This means you can create multiple SNs by creating a unique KVROOT directory for each SN. Usually, these would be on different nodes, but it's also possible to have them on a single node.
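For illustration, here is a rough sketch of creating the boot configuration for one SN with a capacity greater than one. The paths, host, ports, and exact flag spellings are examples from memory of the kvstore tooling rather than from the question, so verify them against the documentation for your release.
# One unique KVROOT per Storage Node on this machine; -capacity 3 tells the
# system that this SN may host up to 3 Replication Nodes.
java -jar "$KVHOME/lib/kvstore.jar" makebootconfig \
  -root /var/kvroot-sn1 \
  -host node01 \
  -port 5000 \
  -harange 5010,5025 \
  -capacity 3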

Does Kubernetes consider the current memory usage when scheduling pods

The Kubernetes docs on https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/ state:
The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.
Does Kubernetes consider the current state of the node when calculating capacity? To highlight what I mean, here is a concrete example:
Assuming I have a node with 10Gi of RAM, running 10 Pods each with 500Mi of resource requests, and no limits. Let's say they are "bursting", and each Pod is actually using 1Gi of RAM. In this case, the node is fully utilized (10 x 1Gi = 10Gi), but the resources requests are only 10 x 500Mi = 5Gi. Would Kubernetes consider scheduling another pod on this node because only 50% of the memory capacity on the node has been requested, or would it use the fact that 100% of the memory is currently being utilized, and the node is at full capacity?
By default Kubernetes uses cgroups to manage and monitor the "allocatable" memory on a node for pods. It is possible, though, to configure the kubelet to rely entirely on static reservations and the pod requests from your deployments, so the method depends on how your cluster is deployed.
In either case, a node itself will track "memory pressure", which monitors the existing overall memory usage of a node. If a node is under memory pressure then no new pods will be scheduled and existing pods will be evicted.
It's best to set sensible memory requests and limits for all workloads to help the scheduler as much as possible.
If a kubernetes deployment does not configure cgroup memory monitoring, setting requests is a requirement for all workloads.
If the deployment is using cgroup memory monitoring, setting requests at least gives the scheduler extra detail as to whether the pods to be scheduled will fit on a node.
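As a rough illustration (the name, image, and sizes below are placeholders, loosely echoing the numbers in the question): the scheduler accounts for the request, not for what the container actually uses at runtime.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: demo-burst
spec:
  containers:
    - name: app
      image: nginx        # placeholder image
      resources:
        requests:
          memory: 500Mi   # what the scheduler subtracts from allocatable
        limits:
          memory: 1Gi     # optional cap; exceeding it makes the container a termination candidate
EOF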
Capacity and Allocatable Resources
The Kubernetes Reserve Compute Resources documentation has a good overview of how memory is viewed on a node.
Node Capacity
---------------------------
|      kube-reserved      |
|-------------------------|
|     system-reserved     |
|-------------------------|
|   eviction-threshold    |
|-------------------------|
|                         |
|       allocatable       |
|  (available for pods)   |
|                         |
|                         |
---------------------------
The default scheduler checks a node isn't under memory pressure, then looks at the allocatable memory available on a node and whether the new pods requests will fit in it.
The allocatable memory available is the total-available-memory - kube-reserved - system-reserved - eviction-threshold - scheduled-pods.
Scheduled Pods
The value for scheduled-pods can be calculated via a dynamic cgroup, or statically via the pods resource requests.
The kubelet --cgroups-per-qos option, which defaults to true, enables cgroup tracking of scheduled pods. The pods Kubernetes runs are placed in child cgroups under a kubepods parent cgroup, grouped by QoS class.
If --cgroups-per-qos=false then the allocatable memory is only reduced by the resource requests of the pods scheduled on the node.
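One way to see what the default scheduler is working with on a live node (node01 is just an example name): the "Allocated resources" section of kubectl describe node lists the summed requests of the pods already scheduled there, which is what gets subtracted from allocatable, not the live usage.
kubectl describe node node01 | grep -A 10 'Allocated resources'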
Eviction Threshold
The eviction-threshold is the level of free memory at which Kubernetes starts evicting pods. This defaults to 100MB but can be set via the kubelet command line. This setting is tied to both the allocatable value for a node and also the memory-pressure state of a node described in the next section.
System Reserved
The kubelet's system-reserved value can be configured as a static value (--system-reserved=) or monitored dynamically via a cgroup (--system-reserved-cgroup=).
This is for any system daemons running outside of kubernetes (sshd, systemd etc). If you configure a cgroup, the processes all need to be placed in that cgroup.
Kube Reserved
The kubelet's kube-reserved value can be configured as a static value (via --kube-reserved=) or monitored dynamically via a cgroup (--kube-reserved-cgroup=).
This is for Kubernetes system daemons running outside of pods, usually the kubelet and a container runtime.
Capacity and Availability on a Node
Capacity is stored in the Node object.
$ kubectl get node node01 -o json | jq '.status.capacity'
{
  "cpu": "2",
  "ephemeral-storage": "61252420Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "4042284Ki",
  "pods": "110"
}
The allocatable value can also be found on the Node; note that existing usage doesn't change this value. Only scheduling pods with resource requests takes away from the allocatable value.
$ kubectl get node node01 -o json | jq '.status.allocatable'
{
  "cpu": "2",
  "ephemeral-storage": "56450230179",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "3939884Ki",
  "pods": "110"
}
Memory Usage and Pressure
A Kubernetes node can also report a "memory pressure" condition. This check is done outside of the allocatable resource checks above and is more of a system-level catch-all. Memory pressure looks at the current root cgroup memory usage minus the inactive file cache/buffers, similar to the calculation free does to exclude the file cache.
A node under memory pressure will not have new pods scheduled onto it, and will actively try to evict existing pods until the memory-pressure state is resolved.
You can set the eviction-threshold amount of memory the kubelet will keep available via the --eviction-hard=[memory.available<500Mi] flag. The memory requests and usage of pods help inform the eviction process.
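You can also check a node's memory-pressure condition directly (node01 is a placeholder); "True" means the kubelet has crossed its eviction threshold and the scheduler will keep new pods away:
kubectl get node node01 \
  -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'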
kubectl top node will give you the existing memory stats for each node (if you have a metrics service running).
$ kubectl top node
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node01   141m         7%     865Mi           22%
If you are not using cgroups-per-qos and you run a number of pods without resource requests/limits, or a number of system daemons, then the cluster is likely to have problems scheduling on a memory-constrained system: allocatable will look high but the actual available memory might be very low.
Memory Pressure calculation
The Kubernetes Out Of Resource Handling documentation includes a script which emulates the kubelet's memory monitoring process:
# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.
# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')
memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ];
then
memory_working_set=0
else
memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi
memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))
echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"
Definitely YES, Kubernetes considers memory usage during the pod scheduling process.
The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
There are two key steps in scheduling. First, the scheduler attempts to filter the nodes that are capable of running a given pod based on resource requests and other scheduling requirements. Second, the scheduler weighs the eligible nodes based on the absolute and relative resource utilization of the nodes and other factors. The highest-weighted eligible node is selected to run the pod.
A good explanation of scheduling in Kubernetes can be found here: kubernetes-scheduling.
Simple example: your pod normally uses 100 Mi of RAM, but you run it with a 50 Mi request. If you have a node with 75 Mi free, the scheduler may choose to run the pod there. When the pod's memory consumption later expands to 100 Mi, this puts the node under pressure, at which point the kernel may choose to kill your process. So it is important to get both memory requests and memory limits right. You can read more about memory usage, requests, and limits here: memory-resource.
A container can exceed its memory request if the node has memory available. But a container is not allowed to use more than its memory limit. If a container allocates more memory than its limit, the container becomes a candidate for termination. If the container continues to consume memory beyond its limit, the container is terminated. If a terminated container can be restarted the kubelet restarts it, as with any other type of runtime failure.
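If a container does get terminated for exceeding its limit, you can usually confirm it after the fact (my-pod is a placeholder name):
# Prints "OOMKilled" if the previous container instance was terminated
# for exceeding its memory limit; empty output otherwise.
kubectl get pod my-pod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'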
I hope this helps.
Yes, Kubernetes will consider current memory usage when scheduling Pods (not just requests), so your new Pod wouldn't get scheduled on the full node. Of course, there are also a number of other factors.
(FWIW, when it comes to resources, a request helps the scheduler by declaring a baseline value, and a limit kills the Pod when resources exceed that value, which helps with capacity planning/estimation.)

What happens when an aws emr core node dies in a hadoop environment

I have an EMR cluster with 1 master and 2 core nodes. This automatically sets the replication factor to 1. From what I read in the documentation, this means that when a file is uploaded to a node it is stored only on that node. In my case I have a Spark application which was running pretty well until one of the core nodes died for some reason which I am still investigating. When that node died my application died as well with the following error:
Diagnostics: Could not obtain block: BP-1346795555-172.31.18.53-1503395276403:blk_1073762933_22444 file=/user/hadoop/.sparkStaging/application_1503580640490_3075/__spark_libs__1454218958107463026.zip
So I interpret this in the following way: the Spark libs were located on the node which died, and as the replication factor is 1 they were not stored elsewhere. This led to file corruption. Am I correct in my reasoning, or is there another explanation of what happened? And if I'm correct, what is the best way to avoid that situation? The easiest solution I can think of is to have more core nodes and to increase the replication factor, which would lead to having the data on more nodes.
Spark's staging directory is not located on a distributed file system and is not replicated at all. This is just local storage for the machine. If an executor is lost, it is lost as well, but in a well-designed application this should not result in permanent data loss.
You are mostly right.
If a node dies, then all you can do is add more than 3 core nodes (so the default replication factor becomes 2) or manually set dfs.replication to 2, so that even with 2 nodes your data will be replicated twice; a sketch of that configuration is shown after the list below.
Also be aware that such errors may occur due to running out of memory, which is why the blocks become unavailable. When Linux runs out of memory it runs something called the OOM Killer, which sometimes kills important processes. To prevent this from happening you can:
run on instances with more memory
adjust the cluster configuration to run fewer tasks per node
modify your application to use memory more efficiently and avoid shuffle
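As a sketch of the dfs.replication override mentioned above (cluster name, release label, instance settings, and file names are placeholders; only the hdfs-site classification is the point):
# Override the replication factor when creating the cluster
cat > configurations.json <<'EOF'
[
  {
    "Classification": "hdfs-site",
    "Properties": { "dfs.replication": "2" }
  }
]
EOF
aws emr create-cluster --name demo --release-label emr-5.30.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m5.xlarge --instance-count 4 \
  --use-default-roles \
  --configurations file://configurations.json
# For data already on HDFS, replication can also be raised in place
hdfs dfs -setrep -w 2 /user/hadoop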

Distributed key-value storage for total data size of 80TB

TL;DR:
I'd like recommendations for a distributed key-value store, with an average entry size of up to 50KB, to be installed in a Linux environment (dedicated servers).
A file-system solution would do.
I found a few solutions: Ceph, Cassandra, Riak, and a few more.
Details
I'm looking for a storage solution for one of our components; it should be a key-value store with a flat namespace.
Scenario
The read/write patterns are very simple:
Once a key-value is written, there are a few reads within the next hours.
After that, nothing touches the given key-value. We'd like to keep the data for future purposes, "Storage mode".
Other usage aspects
OS: Linux
Python client/connector
Total size: up to 80TB (this value also represents future needs).
Avg Entry Size (for a single value in a k-v pair): 10 to 50 KB, uncompressed, mostly textual data
Compression: either built-in or external.
Encryption: not needed
Network bandwidth: 1Gb, single LAN
Servers: dedicated (not in the cloud)
Most important requirements
The "base" requirements are:
OS: Linux
Python client/connector OR RESTful API via HTTP
Can easily store up to 80TB (this value also represents future needs).
Max read latency: a few seconds for first reads, 30 seconds for "storage mode" (see above for explanation)
Built in replication (so that data is stored on more than a single node)
Nice to have
RESTful gateway
Background data backup to another store (for data recovery in case of a disaster).
Easy to configure
What I've found so far
Ceph
HDFS
HBase on top of HDFS
Lustre
GlusterFS
Mongo's GridFS - but can I trust Mongo's infrastructure?
Cassandra - not an option, since the merge (compaction) process consumes double the disk space
Riak - looks like it has the same issue as Cassandra, needs more research
Swift + OpenStack (actual storage can be on Amazon S3)
Voldemort
There are dozens of additional tools, but I won't list them here since some of them have proprietary licenses and others seem immature.
I'd appreciate any recommendation on any of the tools I mentioned above (with a total capacity of more than 50TB), or on a tool you think is sufficient.
Just use Ceph (I mean direct librados usage). Don't use GlusterFS - it tends to hang.
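For a feel of what direct RADOS usage looks like as a flat key-value interface (the pool name, object key, and placement-group count below are placeholders; the librados Python bindings expose the same put/get operations programmatically):
# Create a pool and treat object names as keys
ceph osd pool create kvpool 128
rados -p kvpool put my-key ./value.bin        # write
rados -p kvpool get my-key ./restored.bin     # read
rados -p kvpool ls                            # list object names (keys)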
