VM sprawl is the waste of compute resources (CPU cycles and RAM) and
storage capacity that results from a lack of oversight and control over
VM resource provisioning. Because it grows unchecked, VM sprawl degrades
your environment’s performance at best, and in constrained environments
it can lead to more serious complications, including downtime.
Here are some of the factors that cause VM sprawl:
In environments where VMs and their allocated resources are chargeable,
contact your customers to let them know that a VM needs to be resized or
has already been resized (depending on your policies and rules of
engagement) so that they are not billed incorrectly. It is worthwhile to
formalize how VM sprawl management activities will be handled, and to
agree with stakeholders on pre-defined downtime windows in which you can
carry out right-sizing activities.
In VMware environments where storage is thin provisioned at the array
level and Storage DRS is enabled on datastore clusters, it’s also
important to monitor storage consumption at the array level. Capacity
will appear to be freed at the datastore level after a VM is moved or
deleted, but it will not be released on the array, which can lead to
out-of-storage conditions. Reclaiming the unallocated space requires
manually triggering the VAAI UNMAP primitive, ideally outside of business
hours. Your operational procedures should therefore include a capacity
reclamation process that runs on a regular schedule.
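As a rough illustration of the monitoring step described above, the sketch below compares the space a datastore reports as used with the space actually consumed on the thin-provisioned array, and flags datastores where the gap is large enough to justify a reclaim run. The `DatastoreUsage` type, its field names, the sample figures, and the 100 GB threshold are all assumptions made for this example; in practice the numbers would come from your own vCenter and array reporting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatastoreUsage:
    """Hypothetical record combining datastore-level and array-level figures."""
    name: str
    vmfs_used_gb: float   # space in use as reported at the datastore level
    array_used_gb: float  # space consumed on the thin-provisioned array LUN


def reclaimable_gb(ds: DatastoreUsage) -> float:
    """Space still held on the array that the datastore no longer uses."""
    return max(0.0, ds.array_used_gb - ds.vmfs_used_gb)


def reclaim_candidates(datastores: List[DatastoreUsage],
                       min_gap_gb: float = 100.0) -> List[str]:
    """Names of datastores whose gap meets the (assumed) reclaim threshold."""
    return [ds.name for ds in datastores if reclaimable_gb(ds) >= min_gap_gb]


# Illustrative figures only.
inventory = [
    DatastoreUsage("ds-gold-01", vmfs_used_gb=1200.0, array_used_gb=1850.0),
    DatastoreUsage("ds-gold-02", vmfs_used_gb=900.0, array_used_gb=930.0),
    DatastoreUsage("ds-silver-01", vmfs_used_gb=400.0, array_used_gb=380.0),
]

for name in reclaim_candidates(inventory):
    print(f"{name}: schedule a reclaim in the next maintenance window")
```

A list like this can feed the regular reclamation process: on ESXi hosts the reclaim itself is typically issued per datastore with `esxcli storage vmfs unmap`, though the exact procedure depends on your vSphere and array versions.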
Virtual infrastructure management tools with built-in resource analysis
and reclamation capabilities, such as SolarWinds Virtualization Manager,
are a must. With such software, the tedious analysis and reconciliation
tasks no longer have to be done by hand, and dashboards present IT teams
with immediately actionable results.
VM Sprawl and its consequences
Lack of management and control over the environment causes VMs to be created in an uncontrolled way. Sprawl is measured not only by the total number of VMs in a given environment, but also by how resources are allocated to those VMs. A large environment can have minimal sprawl, while a smaller one can have considerable sprawl.
- Oversized VMs: VMs that were allocated more resources than they really need. Consequences:
  - Waste of compute and/or storage resources
  - Over-allocation of RAM causes ballooning and swapping to disk if the environment comes under memory pressure, which results in performance degradation
  - Over-allocation of virtual CPUs causes high co-stop values: the more vCPUs a VM has, the longer it waits for CPU cycles to be available on enough physical cores at the same moment, because it becomes less likely that all the required cores are free at the same time
  - The more RAM and vCPUs a VM has, the higher the RAM overhead required by the hypervisor
- Idle VMs: VMs that are up and running, not necessarily oversized, but unused and showing no activity. Consequences:
  - Waste of compute and/or storage resources, plus RAM overhead at the hypervisor level
  - Resources held by idle VMs may impact CPU scheduling and RAM allocation while the environment is under contention
- Powered-off VMs and orphaned VMDKs: these consume storage capacity
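The categories above can be turned into a simple first-pass triage. The following is a minimal sketch, assuming average CPU and RAM utilization figures are already available for each VM; the `VmStats` type, the sample VMs, and all thresholds are hypothetical and should be replaced with values that match your own policies.

```python
from dataclasses import dataclass


@dataclass
class VmStats:
    """Hypothetical per-VM summary over an observation window."""
    name: str
    powered_on: bool
    avg_cpu_pct: float  # average CPU utilization, percent of allocation
    avg_ram_pct: float  # average active memory, percent of allocation


# Assumed thresholds, for illustration only.
IDLE_CPU_PCT = 2.0
IDLE_RAM_PCT = 5.0
OVERSIZED_PCT = 20.0


def classify(vm: VmStats) -> str:
    """Return 'powered_off', 'idle', 'oversized', or 'ok' for a VM."""
    if not vm.powered_on:
        return "powered_off"
    # Idle: running, but with next to no CPU or memory activity.
    if vm.avg_cpu_pct < IDLE_CPU_PCT and vm.avg_ram_pct < IDLE_RAM_PCT:
        return "idle"
    # Oversized: at least one resource is used far below its allocation.
    if vm.avg_cpu_pct < OVERSIZED_PCT or vm.avg_ram_pct < OVERSIZED_PCT:
        return "oversized"
    return "ok"


fleet = [
    VmStats("db01", True, 55.0, 70.0),
    VmStats("web07", True, 6.0, 12.0),
    VmStats("test03", True, 0.5, 3.0),
    VmStats("old-app", False, 0.0, 0.0),
]

for vm in fleet:
    print(f"{vm.name}: {classify(vm)}")
```

Averages alone can hide short peaks, so a result of "oversized" is best treated as a prompt for closer review rather than an automatic resize.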
How to Manage VM sprawl
Controlling and containing VM sprawl relies on process and operational aspects. The former covers how to prevent VM sprawl from happening, while the latter covers how to tackle sprawl that occurs despite the controls set up at the process level.
Process
On the process side, IT should define standards and implement policies:
- Implement Role-Based Access Control (RBAC) that defines roles and permissions, i.e. who can do what. This greatly helps reduce the creation of rogue VMs and snapshots.
- Define VM categories and acceptable maximums: while not all VMs fit in one box, standardizing on several VM categories (application, database, etc.) helps filter out unusual or oversized requests. Advanced companies with self-service portals may want to restrict or categorize which VMs can be created by which users or business units.
- Challenge any oversized VM request and demand justification for potentially oversized VMs.
- Allocate resources based on real utilization. You can propose a policy under which a VM’s resources are monitored for 90 days, after which IT can adjust the allocation if the VM is undersized or oversized.
- Implement policies on snapshot lifetime, and track snapshot creation requests if possible.
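The 90-day utilization policy can be sketched as a small sizing calculation. This example assumes peak CPU and RAM utilization are recorded per VM over the monitoring period, and it scales the allocation so that the observed peak would land at a target utilization. The `Observation` type, the 70% target, and the sample figures are assumptions for illustration, not values prescribed by the policy itself.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Observation:
    """Hypothetical monitoring summary for one VM."""
    name: str
    vcpus: int
    ram_gb: int
    peak_cpu_pct: float  # highest CPU utilization seen, percent of allocation
    peak_ram_pct: float  # highest memory utilization seen, percent of allocation
    days_observed: int


OBSERVATION_DAYS = 90    # monitoring period from the policy above
TARGET_PEAK_PCT = 70.0   # assumed target: peak load should use ~70% of allocation


def recommend(obs: Observation) -> Optional[Tuple[int, int]]:
    """Return (vcpus, ram_gb) sized to the observed peak, or None if too early."""
    if obs.days_observed < OBSERVATION_DAYS:
        return None  # not enough data yet; keep the current allocation
    vcpus = max(1, math.ceil(obs.vcpus * obs.peak_cpu_pct / TARGET_PEAK_PCT))
    ram_gb = max(1, math.ceil(obs.ram_gb * obs.peak_ram_pct / TARGET_PEAK_PCT))
    return vcpus, ram_gb


oversized = Observation("web07", 16, 128, 14.0, 21.0, 90)
undersized = Observation("db01", 4, 16, 95.0, 98.0, 120)
too_new = Observation("app12", 8, 32, 10.0, 10.0, 30)

for obs in (oversized, undersized, too_new):
    print(obs.name, recommend(obs))
```

The same calculation handles both directions: an oversized VM gets a smaller recommendation and an undersized one a larger recommendation, which matches the policy of adjusting either way once the monitoring period has elapsed.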