![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_g-z-QUdCnIkE7kkwbXZf0rxdTAuKjVRTUOZCAi4WR4BU6cFzFQaaIvkMWDzdQf-i0mHoYPhanYm-Hg1N9eGMDbpHqaay-JqUj44iYZuIOpnRHEsgJRKTqRY8762Y61G_a9PzIDdcoadR/s200/sprwal1.jpg)
VM sprawl is the waste of compute resources (CPU cycles and RAM) and
storage capacity caused by a lack of oversight and control over VM
resource provisioning. Because it is uncontrolled, VM sprawl degrades
your environment’s performance at best, and in constrained environments
it can lead to more serious complications, including downtime.
VM Sprawl and its consequences
Without management and control over the environment, VMs get created in
an uncontrolled way. Sprawl is not only about the total number of VMs in
a given environment, but also about how resources are allocated to these
VMs: a large environment can have minimal sprawl, while a smaller one
can have considerable sprawl.
Here are some of the factors that cause VM sprawl:
- Oversized VMs: VMs that were allocated more resources than they really need. Consequences:
  - Waste of compute and/or storage resources
  - Over-allocating RAM causes ballooning and swapping to disk if the environment comes under memory pressure, which degrades performance
  - Over-allocating virtual CPUs causes high co-stop values: the more vCPUs a VM has, the longer it waits for CPU cycles to be available on all the required physical cores at the same moment, because it is less likely that all of those cores will be free at once
  - The more RAM and vCPUs a VM has, the higher the RAM overhead required by the hypervisor
- Idle VMs: VMs that are up and running, not necessarily oversized, but unused and showing no activity. Consequences:
  - Waste of compute and/or storage resources, plus RAM overhead at the hypervisor level
  - Resources held by idle VMs may affect CPU scheduling and RAM allocation while the environment is under contention
- Powered-off VMs and orphaned VMDKs: they consume storage capacity
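As a rough illustration of the categories above, here is a minimal sketch that labels a VM as powered-off, idle, oversized or ok from utilization samples. The `VmUsage` record, the threshold values and the labels are all assumptions made for this example; they do not come from any particular monitoring tool, and real thresholds depend on your workloads.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class VmUsage:
    """Hypothetical record; samples are percentages of the VM's allocation."""
    name: str
    powered_on: bool
    cpu_pct: list[float]
    mem_pct: list[float]


# Assumed thresholds for illustration only; tune them to your environment.
IDLE_CPU_PCT = 2.0         # average CPU below this looks idle
IDLE_MEM_PCT = 5.0         # average active memory below this looks idle
OVERSIZED_PEAK_PCT = 30.0  # peaks below this suggest over-allocation


def classify(vm: VmUsage) -> str:
    """Return one of: powered-off, unknown, idle, oversized, ok."""
    if not vm.powered_on:
        return "powered-off"
    if not vm.cpu_pct or not vm.mem_pct:
        return "unknown"  # no samples collected, so no verdict
    if mean(vm.cpu_pct) < IDLE_CPU_PCT and mean(vm.mem_pct) < IDLE_MEM_PCT:
        return "idle"
    if max(vm.cpu_pct) < OVERSIZED_PEAK_PCT and max(vm.mem_pct) < OVERSIZED_PEAK_PCT:
        return "oversized"
    return "ok"
```

The idle check runs before the oversized check on purpose: an idle VM would otherwise also match the oversized rule, and the remediation differs (decommission versus resize).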
How to Manage VM sprawl
Controlling and containing VM sprawl relies on process and operational
aspects. The former covers how one prevents VM sprawl from happening,
while the latter covers how to tackle sprawl that happens regardless of
controls set up at the process level.
Process
On the process side, IT should define standards and implement policies:
- Role-Based Access Control, which defines roles and permissions governing who can do what. This will greatly help reduce the creation of rogue VMs and snapshots.
- Define VM categories and acceptable maximums: while not all VMs can fit in one box, standardizing on several VM categories (application, databases, etc.) will help filter out bizarre or oversized requests. Advanced companies with self-service portals may want to restrict or categorize which VMs can be created by which users or business units.
- Challenge any potentially oversized VM request and demand justification for it.
- Allocate resources based on real utilization. You can propose a policy under which a VM’s resources are monitored for 90 days, after which IT can adjust the resource allocation if the VM is undersized or oversized.
- Implement policies on snapshot lifetime and track snapshot creation requests if possible.
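The 90-day review policy above boils down to simple arithmetic: take a high percentile of observed utilization, add headroom, and compare the result with what was allocated. The sketch below shows one possible way to do that for vCPUs. The function names, the 95th percentile and the 25% headroom are assumptions chosen for illustration, not a vendor recommendation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    # Integer ceiling division avoids floating-point surprises in the rank.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def recommend_vcpus(allocated_vcpus, cpu_pct_samples, headroom=1.25, pct=95):
    """Suggest a vCPU count from CPU samples (percent of current allocation).

    headroom and pct are assumed defaults for this sketch.
    """
    peak_fraction = percentile(cpu_pct_samples, pct) / 100
    needed = peak_fraction * allocated_vcpus * headroom
    return max(1, math.ceil(needed))  # never recommend fewer than 1 vCPU
```

A result lower than the current allocation points to an oversized VM, and a higher one to an undersized VM. The same calculation can be applied to RAM by feeding it active-memory samples instead of CPU samples.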
In environments where VMs and their allocated resources are chargeable,
you should contact your customers to let them know that a VM needs to be
resized or has already been resized (based on your policies and rules of
engagement), so that they are not billed incorrectly. It is worthwhile
to formalize the procedures covering VM sprawl management activities,
and to agree with stakeholders on pre-defined downtime windows that
allow you to carry out any right-sizing activities smoothly.
Operational
Even with the controls above, sprawl can still happen, and for a variety
of reasons. For example, a batch of VMs provisioned for a project may
pass all the process controls, then sit idle for months eating up
resources because the project was delayed or cancelled and no one
informed the IT team.
In VMware environments where storage is thin provisioned at the array
level and Storage DRS is enabled on datastore clusters, it’s also
important to monitor storage consumption at the array level. Capacity
will appear to be freed up at the datastore level after a VM is moved or
deleted, but it is not released on the array, and this can lead to
out-of-storage conditions. The VAAI UNMAP primitive has to be triggered
manually, ideally outside of business hours, to reclaim the unused
space. It’s thus important to include in your operational procedures a
capacity reclamation process that is run regularly.
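One simple way to decide when a reclamation run is due is to compare what the array reports as consumed with what the datastores report as used. The sketch below assumes you can obtain both figures in GiB from your own tooling; the function names and the 10% threshold are hypothetical. Array-side deduplication or compression can make the array figure lower than the datastore figure, so treat the gap as a rough signal only.

```python
def reclaimable_gib(array_used_gib, datastore_used_gib):
    """Space consumed on the array but no longer used by the datastores."""
    return max(0.0, array_used_gib - datastore_used_gib)


def needs_reclaim(array_used_gib, datastore_used_gib,
                  array_capacity_gib, threshold_pct=10.0):
    """True when the gap reaches threshold_pct of array capacity.

    threshold_pct is an assumed default for this sketch.
    """
    gap = reclaimable_gib(array_used_gib, datastore_used_gib)
    return gap / array_capacity_gib * 100 >= threshold_pct
```

For example, an array reporting 8,000 GiB consumed against 5,000 GiB used at the datastore level on a 10,000 GiB array shows a 3,000 GiB gap, or 30% of capacity, which would trigger a reclamation run under the assumed threshold.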
Using virtual infrastructure management tools with built-in resource
analysis and reclamation capabilities, such as SolarWinds Virtualization
Manager, is a must. With such software, the tedious analysis and
reconciliation tasks no longer have to be done by hand, and dashboards
present IT teams with immediately actionable results.
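To give a sense of what this kind of reconciliation involves, here is a minimal sketch of orphaned-VMDK detection: list the disk descriptor files found on the datastores, then subtract the ones referenced by registered VMs. Both input lists are assumed to come from your own inventory export, and the function name is hypothetical. Extent and tracking files (`-flat.vmdk`, `-delta.vmdk`, `-sesparse.vmdk`, `-ctk.vmdk`) are skipped because they belong to a descriptor rather than standing alone.

```python
# Suffixes of files that accompany a descriptor and are not disks by themselves.
COMPANION_SUFFIXES = ("-flat.vmdk", "-delta.vmdk", "-sesparse.vmdk", "-ctk.vmdk")


def find_orphaned_vmdks(datastore_files, registered_disks):
    """Return descriptor VMDKs on the datastore that no registered VM references.

    datastore_files: every file path found on the datastores.
    registered_disks: disk paths referenced by registered VMs.
    Paths are compared case-insensitively.
    """
    referenced = {path.lower() for path in registered_disks}
    orphans = []
    for path in datastore_files:
        lowered = path.lower()
        if not lowered.endswith(".vmdk"):
            continue  # not a virtual disk file
        if lowered.endswith(COMPANION_SUFFIXES):
            continue  # extent or tracking file, covered by its descriptor
        if lowered not in referenced:
            orphans.append(path)
    return sorted(orphans)
```

Anything this returns is only a candidate: disks belonging to templates, unregistered VMs kept on purpose, or snapshots in a chain still need a manual check before deletion.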
Conclusion
Even with all the goodwill in the world, VM sprawl will happen. You may
have the best policies in place, but your environment is dynamic, and in
the constant rush of IT Operations you just can’t keep an eye on
everything. And this is coming from a guy whose team successfully
recovered 22 TB of space previously occupied by orphaned VMDKs earlier
this year.