The cost management lifecycle for an enterprise landscape closely follows the path from migration to operations. The diagram below highlights the various parts of the framework.
Figure 1: Cloud Cost Management Framework
Cost control should be considered across the application lifecycle — from the initial planning, to day-to-day operations, to periodic architecture optimization. Many enterprises do this through a well-defined underlying governance framework that is optimized using various automation techniques. In this blog, we provide our experience-based recommendations on how to execute cloud governance, initial planning and sizing, and operational visibility and forecasting for enterprise cloud infrastructures.
I. Cloud Governance
Cloud Resource Ownership
- Transform provisioning practices for the cloud through existing cloud management platforms or cloud-enabled CMDBs like ServiceNow.
- IT teams can provide the enterprise governance and management infrastructure and practices, while individual LOBs are responsible for managing their application infrastructure as per enterprise best practices.
- App teams should be responsible for the cost ownership of resources within projects (i.e., you provision it, you pay for it).
- App teams should be responsible for tagging/labelling all created resources.
- App teams should be responsible for the clean-up of unused resources.
Cloud Resource Provisioning
- Define clear access control policies (i.e., who can provision what resources).
- Build standard enterprise reference architectures and templates for provisioning resources (this should be the Enterprise Architecture team's responsibility).
- Automate the provisioning of reference architectures and templates.
- Use a cloud-based configuration management tool where appropriate (check if existing configuration management databases provide cloud support).
Tagging and Labeling
- Use tags for resource management and labels for resource identification, grouping, searches, and billing.
- Define a list of labels and tags to be applied.
GlobalLogic recommends the following tags (as reference):
- Identification/Classification Tags
- BU/Cost Center
- Owner-email – application owner/group
- Environment – Prod/Dev/Test/QA/Perf
- Environment-Name – Prod1a, Dev4 etc.
- Chargeback/Showback ID
- created-by – User who created resource.
- role = <db, appserver, proxy, etc.> – Classify by application role within a project
- Operations/Automation Tags
- schedule-* – Used to drive instance scheduling
- can-delete = <true/false>
- Can be added by app teams once resources are ready to be removed.
- Can also be added by automation scripts, after untagged resources have been reported and no action taken.
- Subsequently, a delete script will read this label and clean up the resource.
- image-type – App type for baseline images, e.g. Apache, Cassandra etc.
- image-version – Version ID of the baseline image for a given app.
- Reservation-expiry – Used to alert and renew reservations
Other tags can be added as per the business need.
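The tagging rules above can be checked mechanically. The following is a minimal sketch, not a definitive implementation: the required-tag set is an assumed subset of the reference list, the key spellings (e.g. `cost-center`) are illustrative, and the 7-day grace period before `can-delete` is applied is a hypothetical policy value.

```python
# Assumed subset of the reference tags; adjust to your own tag policy.
REQUIRED_TAGS = {"cost-center", "owner-email", "environment", "created-by"}


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or blank, sorted."""
    present = {
        key.lower()
        for key, value in resource_tags.items()
        if str(value).strip()
    }
    return sorted(REQUIRED_TAGS - present)


def apply_can_delete(resource_tags, days_since_report, grace_days=7):
    """Return a copy of the tags, adding can-delete=true when the resource
    is still non-conformant after the (hypothetical) grace period."""
    updated = dict(resource_tags)
    if missing_tags(resource_tags) and days_since_report >= grace_days:
        updated["can-delete"] = "true"
    return updated


if __name__ == "__main__":
    vm_tags = {"owner-email": "team@example.com", "Environment": "Dev"}
    print(missing_tags(vm_tags))
    print(apply_can_delete(vm_tags, days_since_report=10))
```

In practice the tag dictionaries would come from the cloud provider's inventory API, and the separate delete script would act only on resources carrying `can-delete = true`.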
- Build or use a lightweight inventory management system to:
- Track current cloud sprawl
- Report data on current inventory, new resources, projected cost for new resources, etc.
- Find gaps between what was planned and what exists in the cloud
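Finding gaps between the plan and the actual cloud footprint comes down to comparing two sets of resource identifiers. A minimal sketch, assuming both the plan and the inventory can be exported as lists of resource names (the names below are made up for illustration):

```python
def inventory_gaps(planned, actual):
    """Compare planned resource names against what exists in the cloud."""
    planned_set, actual_set = set(planned), set(actual)
    return {
        # Exists in the cloud but was never planned (possible sprawl).
        "unplanned": sorted(actual_set - planned_set),
        # Planned but not (yet) provisioned.
        "not_provisioned": sorted(planned_set - actual_set),
    }


if __name__ == "__main__":
    plan = ["web-1", "web-2", "db-1"]
    cloud = ["web-1", "db-1", "test-tmp-7"]
    print(inventory_gaps(plan, cloud))
```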
II. Initial Planning (Sizing and Provisioning)
TCO and Budgeting
- Budget for the maximum CPU/RAM, but base the initial sizing on observed CPU utilization and similar metrics (especially for dev/test).
- For dev/test, be sure to consider the uptime hours (i.e., 9×5 as opposed to 24×7) for TCO calculations.
- Execute instance right-sizing based on performance characteristics.
- Use on-premise monitoring data to arrive at a more accurate initial cloud sizing.
- For new migrations, enforce budgets from Day 1.
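To illustrate why uptime hours matter for dev/test TCO, the sketch below compares a 24×7 schedule (168 hours per week) with a 9×5 schedule (45 hours per week). The hourly rate is a hypothetical figure, not a quoted price, and the calculation assumes simple on-demand, per-hour billing.

```python
HOURS_PER_WEEK_24X7 = 24 * 7  # 168
HOURS_PER_WEEK_9X5 = 9 * 5    # 45
WEEKS_PER_MONTH = 52 / 12     # average weeks in a month


def monthly_cost(hourly_rate, hours_per_week, instances=1):
    """Estimate monthly compute cost for a given weekly uptime schedule."""
    return hourly_rate * hours_per_week * WEEKS_PER_MONTH * instances


if __name__ == "__main__":
    rate = 0.10  # hypothetical hourly rate in dollars
    always_on = monthly_cost(rate, HOURS_PER_WEEK_24X7)
    office_hours = monthly_cost(rate, HOURS_PER_WEEK_9X5)
    print(f"24x7: ${always_on:.2f}  9x5: ${office_hours:.2f}")
    print(f"Reduction in billed hours: {1 - office_hours / always_on:.0%}")
```

Because 45 of 168 weekly hours is roughly 27%, a dev/test instance that runs only during office hours accrues about 73% fewer billed hours than one left on around the clock.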
Service Catalogs and Provisioning
- Create IAM policies so that teams only create services that are needed by the app in that project.
- Build IT-certified base images and templates for reference architectures.
- Publish and enable self-provisioning through tools like ServiceNow.
- Integrate with approval processes.
- Complement provisioning policies with proactive reporting and automated resource clean-up to build awareness and discipline while controlling costs.
III. Operational Visibility and Forecasting
Daily Reporting on Cost, Utilization, and Non-Conformant Resources
- Automatically send daily reports directly to stakeholders with key data points.
- Obtain intelligence by analyzing individual resource level data points and environment-level correlations.
- Recommendations should be generated based on analytics; data points include:
- No or low CPU, memory, or disk utilization, or utilization only during limited times (e.g., office hours for dev/test)
- No or low network traffic
- No login on VM
- Long VM uptime with no activity
- For cloud services, use cloud-provided metrics
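The data points above can be combined into a simple rule-based recommendation. This is a rough sketch under stated assumptions: the `VmMetrics` fields, the thresholds (5% CPU, 1024 KB of network traffic per day, 14 days without login), and the recommendation strings are all hypothetical and would need tuning against real monitoring data.

```python
from dataclasses import dataclass


@dataclass
class VmMetrics:
    name: str
    avg_cpu_pct: float         # average CPU utilization over the window
    network_kb_per_day: float  # average daily network traffic
    days_since_login: int      # days since the last interactive login
    uptime_days: int           # how long the VM has been running


def recommend(m, cpu_low=5.0, net_low=1024.0, login_days=14):
    """Return a coarse recommendation from idle signals (assumed thresholds)."""
    signals = []
    if m.avg_cpu_pct < cpu_low:
        signals.append("low CPU")
    if m.network_kb_per_day < net_low:
        signals.append("low network traffic")
    if m.days_since_login >= login_days:
        signals.append("no recent login")
    # All signals present on a long-running VM: uptime without activity.
    if len(signals) == 3 and m.uptime_days >= login_days:
        return "cleanup candidate"
    if signals:
        return "review: " + ", ".join(signals)
    return "ok"


if __name__ == "__main__":
    print(recommend(VmMetrics("dev27-app", 1.2, 80.0, 30, 45)))
    print(recommend(VmMetrics("prod1a-db", 55.0, 900000.0, 1, 120)))
```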
Reporting and Automation Architecture
The following diagram describes the reporting and automation architecture for a cloud landscape:
Figure 2: Reporting and Automation Architecture
Data Points to Report
- Cost (filtered by app/environment)
- Daily, MTD, and projected monthly spend
- Budgeted vs actual, and overrun projection
- Alerts on any change in usage pattern and/or budget overruns
- Show unused resources + age + wasted cost:
- Unattached disks
- Orphaned snapshots
- Unallocated IPs
- Unused/unaccessed storage (recommend moving to archive: Glacier/Coldline)
- Show underutilized resources
- Show individual instances
- Show environments that have predominantly no utilization (e.g., dev27 is not being used)
- Current inventory
- New resources created + corresponding cost
- New projected monthly spend based on new resources
- List of resources without tags and labels
- List of resources not conforming to naming conventions
- List of instances based on older versions of baseline images
- Rightsizing + corresponding cost savings
- Reservation planning/committed use recommendations + corresponding cost savings
- Potential savings in the range of 24% to 57%
- Instance scheduling + corresponding cost savings
- Spot/pre-emptible instance recommendations + corresponding cost savings
- Potential savings in the range of 60% to 80%
- Instance/environment cleanup candidates (based on consistent low/no usage)
- Instance/environment cleanup candidates (based on non-conformance)
- Reserved/committed instance renewal alerts (for instances with approaching expiry dates)
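As one example of the cost data points listed above, projected monthly spend and the budget overrun can be derived from month-to-date (MTD) spend. The sketch below assumes a simple linear run rate (average daily spend so far, extrapolated to the end of the month); real forecasts may need to account for new resources and usage-pattern changes. The function and field names are illustrative.

```python
import calendar
from datetime import date


def project_month(mtd_spend, today, budget):
    """Project month-end spend from MTD spend using a linear run rate."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_average = mtd_spend / today.day
    projected = daily_average * days_in_month
    return {
        "daily_average": round(daily_average, 2),
        "projected": round(projected, 2),
        # Zero when the projection stays within budget.
        "overrun": round(max(0.0, projected - budget), 2),
    }


if __name__ == "__main__":
    report = project_month(mtd_spend=1000.0, today=date(2024, 4, 10), budget=2500.0)
    print(report)
    if report["overrun"] > 0:
        print("ALERT: projected budget overrun")
```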
Using the above best practices, enterprises can create an effective governance framework that proactively manages costs across the entire cloud infrastructure lifecycle. In the final installment of this blog series, we will provide recommendations for cost optimization and automation, including some popular tools currently in the market.