Azure Availability Sets, Fault Domains, and Update Domains - Windows

I'm an Azure newbie and need some clarification:
When adding machines to an availability set, what's the best strategy to keep the VMs from rebooting at the same time? Should I put them in:
-different update and fault domains
-the same update domain
-the same fault domain?
My logic is that it's enough to put them in different update AND fault domains.
I used this as a reference: https://blogs.msdn.microsoft.com/plankytronixx/2015/05/01/azure-exam-prep-fault-domains-and-update-domains/
Am I correct?
These update/fault domains are confusing.

My logic is that it's enough to put them in different update AND fault domains
You are right, you should put VMs in different update and fault domains.
If you put them in different update domains, then when the Azure hosts need an update, Microsoft engineers will update one update domain and, only once that completes, move on to the next one. This way, your VMs will not all reboot at the same time.
If you put them in different fault domains, then when unexpected downtime happens, only the VMs in the affected fault domain reboot while the other VMs keep running, so the application running on those VMs stays healthy.
In short, adding VMs to an availability set with different update and fault domains gets you the higher SLA, but it does not mean that an individual VM will never reboot.
Hope that helps.

There are three scenarios that can lead to a virtual machine in Azure being impacted: unplanned hardware maintenance, unexpected downtime, and planned maintenance.
-Unplanned hardware maintenance events
-Unexpected downtime
-Planned maintenance events
Each virtual machine in your availability set is assigned an update domain and a fault domain by the underlying Azure platform. For a given availability set, five non-user-configurable update domains are assigned by default (Resource Manager deployments can then be increased to provide up to 20 update domains) to indicate groups of virtual machines and underlying physical hardware that can be rebooted at the same time. When more than five virtual machines are configured within a single availability set, the sixth virtual machine is placed into the same update domain as the first virtual machine, the seventh in the same update domain as the second virtual machine, and so on. The order of update domains being rebooted may not proceed sequentially during planned maintenance, but only one update domain is rebooted at a time. A rebooted update domain is given 30 minutes to recover before maintenance is initiated on a different update domain.
Fault domains define the group of virtual machines that share a common power source and network switch. By default, the virtual machines configured within your availability set are separated across up to three fault domains for Resource Manager deployments (two fault domains for Classic). While placing your virtual machines into an availability set does not protect your application from operating system or application-specific failures, it does limit the impact of potential physical hardware failures, network outages, or power interruptions.
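The round-robin placement described above can be illustrated with a small sketch (plain Python, not an Azure API; the actual placement is decided by the Azure platform), assuming the defaults of five update domains and three fault domains:

    # Sketch of how VMs in an availability set are spread round-robin across
    # update domains and fault domains (defaults: 5 UDs, 3 FDs for Resource Manager).
    def assign_domains(vm_count, update_domains=5, fault_domains=3):
        placement = []
        for i in range(vm_count):
            placement.append({
                "vm": f"vm{i}",
                "update_domain": i % update_domains,  # the 6th VM shares a UD with the 1st
                "fault_domain": i % fault_domains,
            })
        return placement

    for p in assign_domains(7):
        print(p)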
For more details, refer to this documentation.

Related

EC2 1/2 checks passed every 2 or 3 days

Every 2 or 3 days I get an "EC2 1/2 checks passed" alarm and my instance is not reachable. I can resolve it just by stopping and starting it again (not by rebooting); basically the problem is in the "Instance Status Checks".
My instance type is "t2.micro" and it uses the Amazon Linux AMI.
Any help?
From Status Checks for Your Instances - Amazon Elastic Compute Cloud:
System Status Checks
Monitor the AWS systems on which your instance runs. These checks detect underlying problems with your instance that require AWS involvement to repair.
The following are examples of problems that can cause system status checks to fail:
Loss of network connectivity
Loss of system power
Software issues on the physical host
Hardware issues on the physical host that impact network reachability
Instance Status Checks
Monitor the software and network configuration of your individual instance. The following are examples of problems that can cause instance status checks to fail:
Failed system status checks
Incorrect networking or startup configuration
Exhausted memory
Corrupted file system
Incompatible kernel
Basically, the System Status Check says that something is wrong with the hardware, and the Instance Status Check says that something is wrong with the virtual machine (e.g. kernel problems, networking).
When an instance is stopped and then started again, it is provisioned on different hardware. This will resolve hardware-related issues. Please note that rebooting will not re-provision the virtual machine; instead, the operating system simply restarts. That's why a stop/start is more effective than a reboot.
It is quite rare to have the instance status check fail after some period of time — it normally has problems at initial startup. This suggests that something is going wrong with your operating system, but that should be corrected by a reboot, which you say isn't sufficient. So, it's a bit of a mystery.
If possible, I would suggest launching a new instance and installing/configuring it as a replacement to the existing instance.
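If you want to confirm which of the two checks is failing and script the stop/start workaround, a sketch using boto3 could look like the following (the instance ID is a placeholder and AWS credentials are assumed to be configured):

    # Hedged sketch: inspect the two EC2 status checks and, if the instance
    # status check is failing, stop/start to move the VM to different hardware.
    import boto3

    INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

    ec2 = boto3.client("ec2")

    resp = ec2.describe_instance_status(
        InstanceIds=[INSTANCE_ID], IncludeAllInstances=True
    )
    status = resp["InstanceStatuses"][0]
    print("System status:  ", status["SystemStatus"]["Status"])    # AWS hardware side
    print("Instance status:", status["InstanceStatus"]["Status"])  # your OS/network side

    if status["InstanceStatus"]["Status"] != "ok":
        # A reboot keeps the same host; stop/start re-provisions onto new hardware.
        ec2.stop_instances(InstanceIds=[INSTANCE_ID])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])
        ec2.start_instances(InstanceIds=[INSTANCE_ID])
        ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])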

What is the named.exe process and how to keep it from consuming high CPU

I have a Windows Server 2008 with Plesk running two web sites.
Sometimes the server slows down and there is a named.exe process pushing the CPU to a 100% peak.
It lasts a short period of time and after a while it happens again.
I would like to know what this process is for and how to configure it so that it does not consume all this CPU and slow my sites down.
This must be the DNS service, also known as BIND. High CPU usage may indicate one of the following:
-DNS is re-reading its configuration. In this case high CPU usage will be aligned with your activities in Plesk, i.e. adding and removing domains.
-Someone (normally another DNS server) is pulling data from your DNS server. This is a normal process, and since you say it only happens for a short period of time, it doesn't look like a DNS DDoS.
AFAIK there is no default way in Windows to restrict software from taking 100% CPU if no other apps require the CPU at that moment.
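If you want to see when and for how long named.exe spikes before deciding what to do, a small monitoring sketch using the psutil package (my assumption; it is not part of Plesk or Windows and needs pip install psutil) could log the spikes so you can correlate them with Plesk activity or external queries:

    # Hedged sketch: log named.exe CPU spikes with a timestamp.
    import time
    import psutil

    while True:
        for proc in psutil.process_iter(["name"]):
            if proc.info["name"] and proc.info["name"].lower() == "named.exe":
                usage = proc.cpu_percent(interval=1)  # sample the process for one second
                if usage > 80:
                    print(time.strftime("%Y-%m-%d %H:%M:%S"), f"named.exe at {usage:.0f}% CPU")
        time.sleep(10)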
Check the "DNS Treewalk Suite" system, kill the process, and run your antivirus.
Check the error log in the system.

How to debug TFS Lab Management Network Isolation?

How do I debug the TFS Lab Management Network Isolation process? Currently network isolation goes into a Configuring... state and never actually completes. There are informational messages in the additional information, but other environments I have complete successfully with these same informational messages.
What are the steps the environment goes through when configuring network isolation, and where can I look to find out why it stays in the Configuring state?
The following is how my environment was structured: a network-isolated AD domain with multiple other servers, including TMG.
The first thing to ensure was that the network connections were not being treated as a Public Location, so the first fix was to make sure they were treated as Private networks (a scripted alternative is sketched right after these steps):
1. Opened Secpol.msc on each of the VM's
2. Choose Network List Manager Policies
3. Change Unidentified Networks to Location Type Private
4. Change Identifying Networks to Location Type Private
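As an aside, the same Private-network change can also be scripted; this is my own assumption about a workable shortcut rather than anything the Lab Management tooling provides, and it flips the profile category that the Secpol.msc policy above ultimately controls (run as Administrator on each VM):

    # Hedged sketch (Windows only, run elevated): mark every network profile
    # as Private (Category 1) in the NetworkList registry hive.
    import winreg

    PROFILES = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\NetworkList\Profiles"

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, PROFILES) as profiles:
        count = winreg.QueryInfoKey(profiles)[0]  # number of profile sub-keys
        for i in range(count):
            guid = winreg.EnumKey(profiles, i)
            with winreg.OpenKey(profiles, guid, 0, winreg.KEY_SET_VALUE | winreg.KEY_READ) as p:
                # Category: 0 = Public, 1 = Private, 2 = Domain
                winreg.SetValueEx(p, "Category", 0, winreg.REG_DWORD, 1)
                print(f"Profile {guid} set to Private")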
Then, when I created the environment, I had neglected to remove the machines from the domain and add them to a workgroup before storing it in the library. Whenever the environment was then deployed from the library, it exhibited the behaviour above, i.e. it never achieved network isolation.
To resolve this, remove all servers from the domain, add them to a workgroup, store the environment in the library and then deploy a new environment from the library.
Wait for network isolation to be achieved, and then add each of the servers back into the domain one by one. The key here is patience and to wait for network isolation to be achieved, before moving onto the next server.
If network isolation is not achieved after 20 minutes, shut down the environment. Wait 10 minutes and then start it up again. By this time, it should resolve itself.

AppFabric Redundancy

We just tested an AppFabric cluster of 2 servers where we removed the "lead" server. The second server timeouts on any request to it with the error:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:
There is a temporary failure. Please retry later.
(One or more specified Cache servers are unavailable, which could be caused by busy network or servers. Ensure that security permission has been granted for this client account on the cluster and that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Retry later.)
In practice this means that if one server in the cluster goes down, then they all go down. (Note we are not using Windows clustering, only linking multiple AppFabric cache servers to each other.)
I need the cluster to continue operating even if a single server goes down. How do I do this?
(I realize this question is borderline Server Fault material, but IMHO developers should know this.)
You'll have to install the AppFabric cache on at least three lead servers for the cache to survive a single server crash. The docs state that the cluster will only go down if the "majority" of the lead servers go down, but in the fine print, they explain that 1 out of 2 constitutes a majority. I've verified that removing a server from a three lead-node cluster works as advertised.
This is a typical distributed-systems concept: for a read or write quorum to be possible, an ensemble of 2f + 1 servers is needed to tolerate f failing servers. I think AppFabric, like any CP (as in the CAP theorem) consensus-based system, needs this for the cluster to keep working.
--Sai
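To make the majority/quorum arithmetic concrete (plain Python, nothing AppFabric-specific):

    # Quorum arithmetic: with n lead hosts, a majority of floor(n/2) + 1 must be
    # up, so the cluster tolerates floor((n - 1) / 2) lead-host failures.
    def quorum(n_lead_hosts):
        majority = n_lead_hosts // 2 + 1
        tolerated_failures = (n_lead_hosts - 1) // 2
        return majority, tolerated_failures

    for n in (1, 2, 3, 5):
        majority, failures = quorum(n)
        print(f"{n} lead hosts: majority = {majority}, survives {failures} failure(s)")

    # 2 lead hosts survive 0 failures, which is why losing the lead server in the
    # two-node cluster above took everything down; 3 lead hosts survive 1.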
That's actually a problem with the AppFabric architecture, and it is rather confusing in terms of the "lead host" concept. The idea is that the majority of lead hosts must be running for the cluster to stay up. So if you had three servers, you'd have to have at least two lead hosts constantly communicating with each other and eating up server resources, and if both go down then the whole cluster fails. The alternative idea is a peer-to-peer architecture where all servers act as peers, meaning that even if two servers go down the cluster keeps functioning with no application downtime. Try NCache:
http://www.alachisoft.com/ncache/

Common Issues in Developing Cluster Aware non-web-based Enterprise Applications

I have to move a Windows-based multi-threaded application (which uses global variables as well as an RDBMS for storage) to an NLB (i.e., Network Load Balancing) cluster. The common architectural issues that immediately come to mind are:
Global variables (which are both read/ written) will have to be moved to a shared storage. What are the best practices here? Is there anything available in Windows Clustering API to manage such things?
My application uses sockets, and persistent connections are the norm in the field I work in. I believe persistent connections cannot be load balanced. Again, what are the architectural recommendations in this regard?
I'll answer the persistent connection part of the question first since it's easier. All good network load-balancing solutions (including Microsoft's NLB service built into Windows Server, but also including load balancing devices like F5 BigIP) have the ability to "stick" individual connections from clients to particular cluster nodes for the duration of the connection. In Microsoft's NLB this is called "Single Affinity", while other load balancers call it "Sticky Sessions". Sometimes there are caveats (for example, Microsoft's NLB will break connections if a new member is added to the cluster, although a single connection is never moved from one host to another).
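As a toy illustration of the idea behind Single Affinity / sticky sessions (this is just the concept of pinning a client IP to one node, not how NLB is implemented internally):

    # Conceptual sketch: hash the client's source IP to pick a node, so a
    # persistent connection keeps hitting the same host for its lifetime.
    import hashlib

    NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

    def pick_node(client_ip, nodes=NODES):
        digest = hashlib.md5(client_ip.encode()).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]

    print(pick_node("203.0.113.10"))  # always the same node for this client
    print(pick_node("203.0.113.11"))
    # If the node list changes, the mapping changes, which mirrors the caveat
    # above about NLB breaking connections when a member is added.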
Re: global variables, they are the bane of load-balanced systems. Most designers of load-balanced apps will do a lot of re-architecture to minimize dependence on shared state, since it impedes the scalability and availability of a load-balanced application. Most of these approaches come down to a two-step strategy: first, move shared state to a highly-available location, and second, change the app to minimize the number of times that shared state must be accessed.
Most clustered apps I've seen will store shared state (even shared, volatile state like global variables) in an RDBMS. This is mostly out of convenience. You can also use an in-memory database for maximum performance. But the simplicity of using an RDBMS for all shared state (transient and durable), plus the use of existing database tools for high-availability, tends to work out for many services. Perf of an RDBMS is of course orders of magnitude slower than global variables in memory, but if shared state is small you'll be reading out of the RDBMS's cache anyways, and if you're making a network hop to read/write the data the difference is relatively less. You can also make a big difference by optimizing your database schema for fast reading/writing, for example by removing unneeded indexes and using NOLOCK for all read queries where exact, up-to-the-millisecond accuracy is not required.
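For the SQL Server flavour of that, here is a hedged sketch with pyodbc (the connection string, table, and column names are hypothetical) showing a NOLOCK read for shared state where up-to-the-millisecond accuracy is not required:

    # Hedged sketch: dirty-read a piece of shared state with NOLOCK, trading
    # exactness for not blocking on writers. Requires: pip install pyodbc
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=db-host;DATABASE=appstate;"
        "Trusted_Connection=yes;"  # hypothetical connection details
    )

    def read_counter(name):
        # NOLOCK is fine for approximate metrics, not for money
        row = conn.cursor().execute(
            "SELECT value FROM shared_counters WITH (NOLOCK) WHERE name = ?", name
        ).fetchone()
        return row.value if row else 0

    print(read_counter("active_sessions"))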
I'm not saying an RDBMS will always be the best solution for shared state, only that improving shared-state access times are usually not the way that load-balanced apps get their performance-- instead, they get performance by removing the need to synchronously access (and, especially, write to) shared state on every request. That's the second thing I noted above: changing your app to reduce dependence on shared state.
For example, for simple "counters" and similar metrics, apps will often queue up their updates and have a single thread in charge of updating shared state asynchronously from the queue.
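A minimal sketch of that queued-update pattern, where the flush target is just a stand-in function for whatever shared store you use:

    # Sketch: request handlers enqueue counter updates; one background thread
    # drains the queue and writes shared state asynchronously.
    import queue
    import threading
    import time

    updates = queue.Queue()

    def flush_to_shared_store(batch):
        print(f"flushing {batch}")  # stand-in for the real write (RDBMS, cache, ...)

    def writer():
        while True:
            time.sleep(1)  # flush interval
            batch = {}
            while not updates.empty():
                name, delta = updates.get_nowait()
                batch[name] = batch.get(name, 0) + delta
            if batch:
                flush_to_shared_store(batch)

    threading.Thread(target=writer, daemon=True).start()

    # request handlers only do this -- no synchronous shared-state write:
    updates.put(("page_views", 1))
    updates.put(("page_views", 1))
    time.sleep(2)  # give the demo writer a chance to flush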
For more complex cases, apps may switch from Pessimistic Concurrency (checking that a resource is available beforehand) to Optimistic Concurrency (assuming it's available, and then backing out the work later if you ended up, for example, selling the same item to two different clients!).
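A sketch of the optimistic side of that trade-off, using a version counter and an in-memory stand-in for the data store; the commit only succeeds if nobody else changed the row since it was read:

    # Sketch of optimistic concurrency: read value + version, do the work,
    # commit only if the version is unchanged, otherwise back out / retry.
    import threading

    class VersionedStore:
        def __init__(self):
            self._lock = threading.Lock()  # protects the in-memory stand-in only
            self._rows = {"item-42": {"stock": 1, "version": 0}}

        def read(self, key):
            row = self._rows[key]
            return row["stock"], row["version"]

        def try_commit(self, key, new_stock, expected_version):
            # analogous to: UPDATE items SET stock = ?, version = version + 1
            #               WHERE id = ? AND version = ?
            with self._lock:
                row = self._rows[key]
                if row["version"] != expected_version:
                    return False  # someone else got there first; back out
                row["stock"], row["version"] = new_stock, expected_version + 1
                return True

    store = VersionedStore()
    stock, version = store.read("item-42")
    if stock > 0 and store.try_commit("item-42", stock - 1, version):
        print("sold")
    else:
        print("conflict -- compensate, e.g. apologise to the second buyer")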
Net-net, in load-balanced situations, brute force solutions often don't work as well as thinking creatively about your dependency on shared state and coming up with inventive ways to prevent having to wait for synchronous reading or writing shared state on every request.
I would not bother with using MSCS (Microsoft Cluster Service) in your scenario. MSCS is a failover solution, meaning it's good at keeping a one-server app highly available even if one of the cluster nodes goes down, but you won't get the scalability and simplicity you'll get from a true load-balanced service. I suspect MSCS does have ways to share state (on a shared disk) but they require setting up an MSCS cluster which involves setting up failover, using a shared disk, and other complexity which isn't appropriate for most load-balanced apps. You're better off using a database or a specialized in-memory solution to store your shared state.
Regarding persistent connections, look into the port rules, because port rules determine which TCP/IP port is handled and how.
MSDN:
When a port rule uses multiple-host load balancing, one of three client affinity modes is selected. When no client affinity mode is selected, Network Load Balancing load-balances client traffic from one IP address and different source ports on multiple cluster hosts. This maximizes the granularity of load balancing and minimizes response time to clients. To assist in managing client sessions, the default single-client affinity mode load-balances all network traffic from a given client's IP address on a single cluster host. The class C affinity mode further constrains this to load-balance all client traffic from a single class C address space.
In an ASP.NET app, session state can be kept persistent when the client affinity parameter is enabled; NLB then directs all TCP connections from one client IP address to the same cluster host, which allows session state to be maintained in host memory.
The client affinity parameter makes sure that a connection is always routed to the server it initially landed on, thereby maintaining the application state.
Therefore I believe the same would happen for your Windows-based multi-threaded app if you use the affinity parameter.
Network Load Balancing Best Practices and Web Farming with the Network Load Balancing Service in Windows Server 2003 might give you some insight.
-Concurrency (check out Apache Cassandra, et al.)
-Speed-of-light issues (if going cross-country or international you'll want heavy use of transactions)
-Backups and deduplication (companies like FalconStor or EMC can help here in a distributed system; I wouldn't underestimate the need for consulting here)
