Wednesday, September 29, 2010

HA Limitations

Limitations

HA in vSphere 4.1 has these limitations:
  • 320 virtual machines per host
  • 3,000 virtual machines per cluster
  • 32 hosts per cluster
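When planning a cluster, the three limits above can be checked mechanically. Below is a minimal Python sketch. It assumes you already have a planned VM count per host. The function name check_ha_limits and the host names are hypothetical. Note that the cluster-wide limit can bite even when every host is under its own limit.

```python
# Limits listed above for HA in vSphere 4.1.
MAX_VMS_PER_HOST = 320
MAX_VMS_PER_CLUSTER = 3000
MAX_HOSTS_PER_CLUSTER = 32


def check_ha_limits(vms_per_host: dict[str, int]) -> list[str]:
    """Return human-readable violations of the HA limits (empty list if none).

    vms_per_host maps a (hypothetical) host name to the number of VMs planned on it.
    """
    problems = []
    if len(vms_per_host) > MAX_HOSTS_PER_CLUSTER:
        problems.append(f"{len(vms_per_host)} hosts exceeds {MAX_HOSTS_PER_CLUSTER}")
    for host, count in vms_per_host.items():
        if count > MAX_VMS_PER_HOST:
            problems.append(f"{host}: {count} VMs exceeds {MAX_VMS_PER_HOST}")
    total = sum(vms_per_host.values())
    if total > MAX_VMS_PER_CLUSTER:
        problems.append(f"cluster: {total} VMs exceeds {MAX_VMS_PER_CLUSTER}")
    return problems


# Hypothetical 10-host cluster with 310 VMs per host: each host is under the
# per-host limit, but the cluster total (3,100) is over 3,000.
print(check_ha_limits({f"esx{i:02d}": 310 for i in range(1, 11)}))
# ['cluster: 3100 VMs exceeds 3000']
```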

Managed services

Managed services is the practice of transferring responsibility for day-to-day management as a strategic method for improving the effectiveness and efficiency of operations. The person or organization that owns, or has direct oversight of, the organization or system being managed is referred to as the offerer, client, or customer. The person or organization that accepts and provides the managed service is the service provider.

Thursday, September 23, 2010

vSphere Editions


VI3 End of Life Alert


VMotion, HA & FT

VMotion is for a situation where everything works and continues working. It allows you to move a running VM to another ESX host, for load balancing or host evacuation. There are no dropped connections and no reboot. However, as soon as an ESX host disappears (HW crash, ...), VMotion can no longer help you.
HA is for a situation where an ESX host goes down. The VMs that were running on it go down with it, instantly. Other ESX hosts in the same cluster react by restarting the crashed VMs. This involves a reboot of the OS, possibly a file system check, and the start of the application. Obviously, all existing connections to the VM are dropped.
FT is also for a situation where an ESX host goes down. A protected VM runs on one ESX host but has an identical twin (a secondary "shadow" VM) on another ESX host. If the primary VM goes down because of an ESX crash (HW crash, ...), the secondary instantly becomes primary and continues the workload. There is no reboot and there are no dropped connections. FT has a lot of limitations, including a maximum of 1 CPU in the VM, etc.
What is VMware Fault Tolerance?
VMware Fault Tolerance is a feature that provides a new level of guest redundancy. The feature is enabled on a per-virtual-machine basis.
What happens when I turn on Fault Tolerance?
In very general terms, a second virtual machine is created to work in tandem with the virtual machine you have enabled Fault Tolerance on. This virtual machine resides on a different host in the cluster, and runs in virtual lockstep with the primary virtual machine. When a failure is detected, the second virtual machine takes the place of the first one with the least possible interruption of service.
How do I tell if my environment is ready for Fault Tolerance?
The VMware SiteSurvey Tool is used to check your environment for compliance with VMware Fault Tolerance.
What happens during a failure?
When a host running the primary virtual machine fails, a transparent failover occurs to the corresponding secondary virtual machine. During this failover, there is no data loss or noticeable service interruption. In addition, VMware HA automatically restores redundancy by starting a new secondary virtual machine on another host. Similarly, if the host running the secondary virtual machine fails, VMware HA starts a new secondary virtual machine on a different host. In either case, end users notice no outage.
What is the logging time delay between the Primary and Secondary Fault Tolerance virtual machines?
The actual delay depends on the network latency between the Primary and Secondary. vLockstep executes the same instructions on the Primary and Secondary, but because this happens on different hosts there can be a small latency, though no loss of state. This latency is typically less than 1 ms. Fault Tolerance includes a synchronization mechanism that keeps the Primary and Secondary in step.
In a cluster with more than 3 hosts, can you tell Fault Tolerance where to put the Fault Tolerance virtual machine, or does it choose on its own?
You can place the original (or Primary) virtual machine: you have full control, with DRS or VMotion, to assign it to any node. The placement of the Secondary, when created, is automatic, based on the available hosts. Once the Secondary has been created and placed, you can VMotion it to the preferred host.
What happens if the host containing the primary virtual machine comes back online (after a node failure)?
This node is put back in the pool of available hosts. There is no attempt to start or migrate the primary to that host.
Is the failover from the primary virtual machine to the secondary virtual machine dynamic or does Fault Tolerance restart a virtual machine?
The failover from primary to secondary virtual machine is dynamic, with the secondary continuing execution from the exact point where the primary left off. It happens automatically with no data loss, no downtime, and little delay. Clients see no interruption. After the dynamic failover, the secondary virtual machine becomes the new primary virtual machine, and a new secondary virtual machine is spawned automatically.
Does Fault Tolerance support Intel Hyper-Threading Technology?

Yes, Fault Tolerance does support Intel Hyper-Threading Technology on systems that have it enabled. Enabling or disabling Hyper-Threading has no impact on Fault Tolerance.

http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&externalId=1013428

Additional
When VMware FT is enabled for a virtual machine ("the primary"), a second instance of the virtual machine (the "secondary") is created by live-migrating the memory contents of the primary using VMware® VMotion™. Once live, the secondary virtual machine runs in lockstep and effectively mirrors the guest instruction execution of the primary.
If either the primary or the secondary dies, a new secondary is spawned and placed on the candidate host determined by HA. That host may not be the optimal placement for load balancing; however, you can manually VMotion either the primary or the secondary virtual machine to a different host as needed.
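The failover rules above (the secondary takes over, then a new secondary is spawned on another host) can be modeled as a small state transition. This Python sketch is a simplified illustration, not VMware's placement algorithm. The names FtPair and after_host_failure are hypothetical. Picking the first eligible healthy host stands in for HA's real candidate-host selection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FtPair:
    primary_host: str
    secondary_host: Optional[str]  # None: no secondary, so the VM is unprotected


def after_host_failure(pair: FtPair, failed_host: str, healthy_hosts: list[str]) -> FtPair:
    """Return where the FT primary and secondary run after failed_host dies.

    Assumes the pair currently has a secondary. healthy_hosts lists the hosts
    still up. Choosing the first eligible host is a simplification of HA's
    own placement logic.
    """
    if failed_host == pair.primary_host:
        # Transparent failover: the secondary continues from where the
        # primary left off and becomes the new primary.
        new_primary = pair.secondary_host
    elif failed_host == pair.secondary_host:
        # The primary keeps running; only the secondary needs replacing.
        new_primary = pair.primary_host
    else:
        return pair  # this FT pair was not affected

    # The new secondary must run on a different host than the primary.
    candidates = [h for h in healthy_hosts if h not in (new_primary, failed_host)]
    return FtPair(new_primary, candidates[0] if candidates else None)


pair = FtPair("esx01", "esx02")
print(after_host_failure(pair, "esx01", ["esx02", "esx03"]))
# FtPair(primary_host='esx02', secondary_host='esx03')
```

Afterwards, either VM can still be moved with VMotion if the automatically chosen host is not where you want it.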

A VMware HA cluster consists of two kinds of nodes: primary and secondary nodes. Primary nodes hold the cluster settings and all "node states", which are synchronized between primaries. Node states hold, for instance, resource usage information. If vCenter is not available, the primary nodes still have a rough estimate of resource usage and can take it into account when a fail-over needs to occur. Secondary nodes send their state info to the primary nodes.
Nodes send heartbeats to each other; this is the mechanism used to detect possible outages. Primary nodes send heartbeats to both primary and secondary nodes. Secondary nodes send their heartbeats to primary nodes only. By default, nodes send these heartbeats every second. This value can be changed with das.failuredetectioninterval (Advanced Settings on your HA cluster).
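The heartbeat pattern described above can be expressed as a function that returns which nodes a given node sends heartbeats to. This is a Python sketch built on the description above. Treating das.failuredetectioninterval as a millisecond value is an assumption. The function names are hypothetical. The second function counts the messages sent cluster-wide per interval, which shows that heartbeat traffic grows with the number of primaries.

```python
# Default of das.failuredetectioninterval: one heartbeat per second.
# Treating the value as milliseconds is an assumption of this sketch.
DEFAULT_FAILURE_DETECTION_INTERVAL_MS = 1000


def heartbeat_targets(node: str, primaries: set[str], secondaries: set[str]) -> set[str]:
    """Return the nodes that `node` sends heartbeats to."""
    if node in primaries:
        # Primaries heartbeat to the other primaries and to every secondary.
        return (primaries | secondaries) - {node}
    if node in secondaries:
        # Secondaries heartbeat to the primaries only.
        return set(primaries)
    raise ValueError(f"{node} is not a member of this HA cluster")


def heartbeats_per_interval(primaries: set[str], secondaries: set[str]) -> int:
    """Total heartbeat messages sent cluster-wide in one interval."""
    return sum(len(heartbeat_targets(n, primaries, secondaries))
               for n in primaries | secondaries)


primaries = {f"esx{i:02d}" for i in range(1, 6)}    # 5 primaries
secondaries = {f"esx{i:02d}" for i in range(6, 9)}  # 3 secondaries
print(heartbeats_per_interval(primaries, secondaries))  # 5*7 + 3*5 = 50
```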
The first 5 hosts that join the VMware HA cluster are automatically selected as primary nodes. All the others are automatically selected as secondary nodes. When you reconfigure HA, the primary and secondary nodes are selected again, at random. The vCenter client does not show which host is a primary and which is not; however, this can be revealed from the Service Console:
cat /var/log/vmware/aam/aam_config_util_listnodes.log
Another method of showing the primary nodes is:
/opt/vmware/aam/bin/Cli (ftcli on earlier versions)
AAM> ln
The limit of 5 is a soft limit, so you can manually add a 6th, but this is not supported.
To promote a node:
/opt/vmware/aam/bin/Cli (ftcli on earlier versions)
AAM> promotenode
To demote a node:
/opt/vmware/aam/bin/Cli (ftcli on earlier versions)
AAM> demotenode
The promotion of a secondary host only occurs when a primary host is put in "Maintenance Mode", disconnected from the cluster, or removed from the cluster, or when you reconfigure HA. If all primary hosts fail simultaneously, no HA-initiated restart of the VMs takes place: HA needs at least one primary host to restart VMs. This is why you can only take four host failures into account when configuring HA; with five primaries, a fifth simultaneous failure could take out every primary.
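The primary/secondary rules above fit in a small model. The first five hosts to join become primaries. Maintenance mode, disconnect, or removal triggers a promotion, but a plain failure does not. A reconfigure re-selects primaries at random. This is a toy Python sketch, not VMware's AAM code. The class HaCluster and its methods are hypothetical, and promoting the first remaining secondary is a simplification.

```python
import random

MAX_PRIMARIES = 5  # the (soft) limit described above


class HaCluster:
    """Toy model of HA primary selection; not VMware's actual implementation."""

    def __init__(self) -> None:
        self.members: list[str] = []    # active HA hosts, in join order
        self.primaries: list[str] = []

    def join(self, host: str) -> None:
        self.members.append(host)
        # The first five hosts to join become primaries; the rest are secondaries.
        if len(self.primaries) < MAX_PRIMARIES:
            self.primaries.append(host)

    def secondaries(self) -> list[str]:
        return [h for h in self.members if h not in self.primaries]

    def leave(self, host: str) -> None:
        """Host enters maintenance mode, is disconnected, or is removed.

        These events trigger promotion of a secondary. Promoting the first
        remaining secondary is a simplification.
        """
        self.members.remove(host)
        if host in self.primaries:
            self.primaries.remove(host)
            spare = self.secondaries()
            if spare:
                self.primaries.append(spare[0])

    def reconfigure(self, rng: random.Random) -> None:
        # Reconfigure for HA: primaries are re-selected at random.
        self.primaries = rng.sample(self.members, min(MAX_PRIMARIES, len(self.members)))

    def can_restart_vms(self, failed: set[str]) -> bool:
        # A host *failure* promotes no one, so at least one primary must survive.
        return any(p not in failed for p in self.primaries)


cluster = HaCluster()
for i in range(1, 9):
    cluster.join(f"esx{i:02d}")
print(cluster.can_restart_vms(set(cluster.primaries[:4])))  # True: one primary left
print(cluster.can_restart_vms(set(cluster.primaries)))      # False: all primaries lost
```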
At least one primary is needed because the "fail-over coordinator" role, also described as the "active primary", is assigned to a primary. The fail-over coordinator coordinates the restart of VMs on the remaining primary and secondary hosts, taking restart priorities into account. Keep in mind that when two hosts fail at the same time, the restarts are handled sequentially: first the VMs of the host that failed first are restarted (taking restart priorities into account), then the VMs of the host that failed second (again taking restart priorities into account). If the fail-over coordinator fails, one of the other primaries takes over.
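The sequential handling above has a non-obvious consequence. A high-priority VM on the second failed host restarts after a low-priority VM on the first failed host. This Python sketch makes that ordering explicit. It assumes three priority levels (high, medium, low). The function restart_plan and the VM names are hypothetical.

```python
# Restart priorities, highest first.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def restart_plan(failed_hosts: list[tuple[str, list[tuple[str, str]]]]) -> list[str]:
    """Return VM names in the order the fail-over coordinator restarts them.

    failed_hosts: (host, [(vm, priority), ...]) entries, in the order the
    hosts failed. Hosts are handled one after another; within a host, VMs
    are ordered by restart priority.
    """
    plan = []
    for _host, vms in failed_hosts:
        # sorted() is stable, so VMs with equal priority keep their listed order.
        for vm, _priority in sorted(vms, key=lambda v: PRIORITY_RANK[v[1]]):
            plan.append(vm)
    return plan


print(restart_plan([
    ("esx01", [("web01", "low"), ("sql01", "high")]),   # failed first
    ("esx02", [("dc01", "high"), ("app01", "medium")]),  # failed second
]))
# ['sql01', 'web01', 'dc01', 'app01']
```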


das.isolationaddress[x] – IP address the ESX host uses to check for isolation when no heartbeats are received, where [x] = 1-10. VMware HA uses the default gateway as an isolation address and the provided value as an additional checkpoint. It is recommended to add an isolation address when a secondary service console is used for redundancy.
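The isolation check above works as follows. When heartbeats stop arriving, the host tries the default gateway plus any configured das.isolationaddress1 to das.isolationaddress10 values. It only considers itself isolated if none of them answers. Here is a minimal Python sketch of that logic. The ping argument is a hypothetical stand-in for a real reachability check. The function names are hypothetical. The IP addresses are documentation examples.

```python
from typing import Callable


def isolation_addresses(default_gateway: str, advanced: dict[str, str]) -> list[str]:
    """Default gateway first, then any das.isolationaddress1..10 values."""
    addresses = [default_gateway]
    for x in range(1, 11):
        value = advanced.get(f"das.isolationaddress{x}")
        if value:
            addresses.append(value)
    return addresses


def is_isolated(heartbeats_received: bool, addresses: list[str],
                ping: Callable[[str], bool]) -> bool:
    """Isolated only if no heartbeats arrive AND no isolation address answers."""
    if heartbeats_received:
        return False
    return not any(ping(addr) for addr in addresses)


addrs = isolation_addresses("192.0.2.1", {"das.isolationaddress1": "198.51.100.1"})
# Gateway unreachable, but the extra isolation address answers: not isolated.
print(is_isolated(False, addrs, lambda a: a == "198.51.100.1"))  # False
```

This is why an extra address helps with a secondary service console: a single unreachable gateway no longer makes the host declare itself isolated.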


The isolation response determines what happens to the VMs on an isolated host:
Power off – When network isolation occurs, all VMs on that host are powered off. It is a hard stop.
Shut down – When network isolation occurs, all VMs running on that host are shut down via VMware Tools. If this does not succeed within 5 minutes, a "power off" is executed.
Leave powered on – When network isolation occurs, the state of the VMs on that host remains unchanged.
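The three isolation responses can be summarized as a small decision function for a single VM. This Python sketch assumes the 5-minute timeout is the only fallback from "Shut down" to "Power off". The names IsolationResponse and vm_outcome are hypothetical.

```python
from enum import Enum
from typing import Optional

SHUTDOWN_TIMEOUT_S = 5 * 60  # "Shut down" falls back to power off after 5 minutes


class IsolationResponse(Enum):
    POWER_OFF = "Power off"
    SHUT_DOWN = "Shut down"
    LEAVE_POWERED_ON = "Leave powered on"


def vm_outcome(response: IsolationResponse, tools_shutdown_s: Optional[float]) -> str:
    """Return what happens to one VM on an isolated host.

    tools_shutdown_s: how long a guest shutdown via VMware Tools would take,
    or None if it would never complete (e.g. Tools not running).
    """
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return "unchanged"
    if response is IsolationResponse.POWER_OFF:
        return "powered off"  # hard stop
    if tools_shutdown_s is not None and tools_shutdown_s <= SHUTDOWN_TIMEOUT_S:
        return "shut down cleanly"
    return "powered off"  # guest shutdown did not finish in time


print(vm_outcome(IsolationResponse.SHUT_DOWN, 120))   # shut down cleanly
print(vm_outcome(IsolationResponse.SHUT_DOWN, None))  # powered off
```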


http://www.yellow-bricks.com/vmware-high-availability-deepdiv/#HA-primariesandsecondaries

vSphere New Feature "FT"

HA, or High Availability, ensures that if one of your hosts dies (poof, gone: power failure, hardware failure, network failure, etc.), HA detects the failure and powers up the VMs that were running on the failed host on another host in the cluster. Before HA, if your host went down, those VMs stayed down until you repaired the host or manually registered them on another host and powered them up. HA powers them up automatically in the event of a host failure.
FT, or Fault Tolerance, takes the concept of HA to a new level. Enabling FT on a VM causes a standby VM to be set up. That VM is updated constantly so that, in the case of a host failure, the standby VM immediately takes over processing. So, from the guest OS perspective, there is no power failure. The transition from the primary FT VM to the standby FT VM is nearly instantaneous, so there is no downtime for the FT VM.

Memory limits for VMware products


vSphere 4 & 4.1


The Machine SID Duplication Myth

An article by Mark Russinovich explaining why NewSID is irrelevant:

The reason that I began considering NewSID for retirement is that, although people generally reported success with it on Windows Vista, I hadn’t fully tested it myself and I got occasional reports that some Windows component would fail after NewSID was used. When I set out to look into the reports I took a step back to understand how duplicate SIDs could cause problems, a belief that I had taken on faith like everyone else. The more I thought about it, the more I became convinced that machine SID duplication – having multiple computers with the same machine SID – doesn’t pose any problem, security or otherwise. I took my conclusion to the Windows security and deployment teams and no one could come up with a scenario where two systems with the same machine SID, whether in a Workgroup or a Domain, would cause an issue. At that point the decision to retire NewSID became obvious.

http://blogs.technet.com/b/markrussinovich/archive/2009/11/03/3291024.aspx