Saturday, March 15, 2014

Application Awareness Part II: Service Level Agreements

The end of the decade (wow, where did all the time go since the big Y2K scare?) has brought us to a world in the cloud. Some might say this will never work, and others worry about security, availability and performance, yet the cost of maintaining the IT infrastructure and the drive to bring services and products to market faster are pushing companies to look to the "cloud" for IT resources.


New acronyms such as IaaS, PaaS and SaaS are appearing, and standards groups like the DMTF (Distributed Management Task Force, www.dmtf.org) are all looking to help make the "cloud" a platform/infrastructure for both internal and external customers.


All of this growth and opportunity (IMHO) can lead not only to confusion but also to concern. So how do we, as IT professionals and as an industry, make the shift to the cloud? One way is to ensure we understand the value of both utilizing cloud services and moving to the cloud where it makes sense. This is a dollar discussion about the value (ROI) and the cost associated with the move. The value must outweigh the cost for mass adoption to happen.


The second and more direct path (near and dear to my heart) to acceptance of the cloud is availability and performance. As my other posts have stated, the old world was, and is, used to putting an application onto a physical computer (one for one). All the resources of the physical computing platform (network cards, CPU, memory, and to some extent I/O and storage) are available to that individual application. Virtualization has enabled the move to shared resources (CPU, memory, storage (NAS) and network). The next move is to be hosted in a truly virtual world: the application becomes an appliance, detached from the underlying hardware, that can be moved to and run anywhere on any platform. I have VMs that I created on my laptop, moved to my desktop server, and that are now running on my hypervisor server on shared storage.


So what are the issues around availability and performance? The first is the location of the appliance: which hypervisor is it running on, where is it running, what priorities are set at that location for providing my appliance the resources I contracted for, and many more questions. Once I consider moving my applications/appliances to the cloud, the lawyers get involved and start writing service level agreements (SLAs). Now, I have been involved in SLAs and chargeback (they do go hand in hand) for many years. Most fall somewhere between a handshake and a legal document so loose that it is hard to quantify and adhere to. The outsourcing world has been addressing these issues for years: how much do we report to the customer, and how much leeway do we allow to maintain the SLA?


So back to the cloud. With private and public clouds being created with large investments from companies like IBM, Amazon and VMware (build it and they will come), and from many managed service providers and hosting companies, the need to assure application availability and performance is a must. We start with data collection. How do we collect and correlate the information for an application when that application is running outside your realm of control?
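As a thought experiment, here is a minimal sketch in Python of what that correlation step might look like. The data structures and field names are entirely hypothetical (no real provider API is assumed): per-VM samples coming back from a provider are joined against an application-to-VM mapping we maintain ourselves, so the numbers can be rolled up to an application-level view.

```python
from collections import defaultdict

# Hypothetical per-VM samples as a provider portal might return them.
# Field names and values are illustrative only.
provider_samples = [
    {"vm": "vm-101", "cpu_pct": 72.0, "mem_mb": 3072, "iops": 450, "net_kbps": 800},
    {"vm": "vm-102", "cpu_pct": 15.0, "mem_mb": 1024, "iops": 60,  "net_kbps": 120},
    {"vm": "vm-201", "cpu_pct": 55.0, "mem_mb": 2048, "iops": 900, "net_kbps": 300},
]

# Mapping we own: which VMs make up which business application.
app_to_vms = {
    "order-entry": {"vm-101", "vm-102"},
    "reporting":   {"vm-201"},
}

def rollup_by_application(samples, app_map):
    """Correlate per-VM resource samples into per-application totals."""
    vm_to_app = {vm: app for app, vms in app_map.items() for vm in vms}
    totals = defaultdict(lambda: {"cpu_pct": 0.0, "mem_mb": 0, "iops": 0, "net_kbps": 0})
    for s in samples:
        app = vm_to_app.get(s["vm"], "unmapped")   # VMs we cannot attribute stand out
        for key in ("cpu_pct", "mem_mb", "iops", "net_kbps"):
            totals[app][key] += s[key]
    return dict(totals)

if __name__ == "__main__":
    for app, usage in rollup_by_application(provider_samples, app_to_vms).items():
        print(app, usage)
```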



Sunday, March 6, 2011

Organizational Change

I have become acutely aware of the impact of change on an organization. Without mentioning customers, the ability of some organizations to change (hopefully improve) the way they do business has amazed me.

Take, as an example, the mainframe business world (yes, mainframes still exist!). The institutionalization of a workflow process can be one of the largest hurdles one can face in a sales cycle, up to and including implementation. What does this mean to today's organizations, and how can it apply to everyday life?

I have a friend who has posted on a similar topic on his blog: http://pragmaticarchitect.wordpress.com/2011/03/05/how-to-build-a-roadmap/, where he details an approach to organizational change (gap analysis) that uses software design and development principles to help organizations change. I see this being used across the enterprise to help organizations align themselves after mergers or during annual reviews. Now, you may be thinking I am advocating change for change's sake. That could not be further from the truth. Just look at the amount of merger and acquisition activity going on today and the amount of consolidation over the last decade. These organizations have, to some extent, combined so many different business units of the same type, yet continue to run them as completely separate business units.

What does this have to do with organizational change? Companies merge to provide continued growth to the shareholders. That growth, if managed properly (in my humble opinion), can provide huge savings through the reduction of common jobs. Hold the thought that this is not a job reduction discussion; I will get back to it. What I am talking about here is the excessive amounts of money (software maintenance alone can run into the hundreds of thousands of dollars), time and effort companies waste by not standardizing on a single, unified business process.

Take, for example, the process of workload automation. Working in the mainframe world these days, and having that experience in my past (I am proud to say), shows the tremendous strength of having a workload automation tool (formerly called a scheduler) available to manage the hundreds of thousands of jobs a mainframe can and does run every day. Now, extend that functionality out to the distributed servers where, for one reason or another, a lot of the data for any number of given business processes either starts or ends.

With these as examples, centralizing all of the scheduling packages throughout an organization can be very difficult due to the organization's resistance to change. The issue is: can the organization change enough to get the value out of the solution? There is the cost of the software, and then the cost of the organizational change. Keep in mind, most mainframe people do not even know how many distributed computers connect to the mainframe, other than perhaps those at the application layer who wrote the programs.

My point here is that organizational change can be the largest part of a project and, in my humble opinion, is why most organizations defer change for years. To get back to the "job" discussion: what if we, as computer workers, were more accepting of change? Not change for change's sake, but change that improves organizational efficiencies. The jobs that shift from an old technology that is inefficient in today's highly computerized world to a new, centralized business process provide ongoing job security, demonstrate the ability to change and free up time for improving other business processes.

As noted with the release of the iPad 2 already, technology is constantly changing. Removing the barriers too many of us bring to work every day can help every one of us bring more value to our companies, to the industry and to the bottom line.

Monday, November 16, 2009

Application Awareness Part 1: CPU

In my previous post concerning application awareness, I talked about resource utilization and the need for a workload-manager-type management layer that provides the ability to manage applications from the resource perspective. I would like to break that down into the different resources (the core four of CPU, memory, I/O and network) for discussion.

From the CPU perspective, the good news is that the number of cores is increasing. We started out with a single core, then added hyper-threading, then moved to dual cores, four cores and now six cores per processor. This trend aids the move to virtualization and to 64-bit operating systems and applications, since 32-bit operating systems are limited in how many processors and how much memory they can address. What this means for the virtual world is that there are a number of cores available for an ever-increasing number of virtual machines per host. That's the good news. The bad news is there are only so many cores to share.

A brief note on how I explain cores to business people: a core is a resource on which only one instruction can execute at a time, so work like sorting a report or fetching a record from disk boils down to a stream of such instructions competing for that core.

There's more good news in that core speed is increasing as well, which means applications can run faster, but more importantly the core can finish an instruction and be free for another instruction or VM to use that resource. So let's break down how application awareness comes into play here. As applications grow and scale, they make more and more use of both the increased speed and the increased number of cores. What this means for the application awareness discussion is the need to separate out which applications need faster CPUs (CPU-intensive processing) and which need more cores (multi-threaded applications that process in parallel).
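A rough way to picture that separation is sketched below; the metric names and thresholds are my own invention, not measurements from any tool. The idea is simply to look at how many threads an application keeps busy and how hot each core runs, then tag it as wanting faster cores or more cores.

```python
def classify_cpu_profile(busy_threads: int, per_core_util_pct: float) -> str:
    """Very rough tagging of an application's CPU appetite.

    busy_threads      -- average number of concurrently busy threads observed
    per_core_util_pct -- average utilization of the cores it does use

    Thresholds are illustrative, not a recommendation.
    """
    if busy_threads <= 1 and per_core_util_pct > 80:
        return "needs faster cores (CPU-intensive, serial)"
    if busy_threads > 1 and per_core_util_pct > 50:
        return "needs more cores (parallel, multi-threaded)"
    return "modest CPU needs"

print(classify_cpu_profile(1, 95))   # classic single-threaded batch cruncher
print(classify_cpu_profile(8, 70))   # multi-threaded application server
```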

In the physical world, this is not a problem: an application is assigned to a physical node and gets all of its resources. In the virtual world, it takes several steps.
  1. The creation of the VM requires the number of processors (cores) to be defined.
  2. The VM needs to be placed on a host which has enough processors (cores) to start the VM.
  3. The application needs to perform within limits for Service Level Agreements (SLAs) to be defined and met.
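The sketch below walks those steps for a single VM. The host capacities, VM definition and response-time figures are invented for illustration and are not tied to any particular hypervisor API.

```python
# A sketch of the definition / placement / SLA steps for one VM.
# Hosts, capacities and response times below are invented for illustration.

vm = {"name": "app01", "vcpus": 4, "sla_response_ms": 250}

hosts = [
    {"name": "esx-a", "cores": 16, "cores_in_use": 14},
    {"name": "esx-b", "cores": 24, "cores_in_use": 12},
]

def place_vm(vm, hosts):
    """Return the first host with enough free cores to start the VM, else None."""
    for host in hosts:
        if host["cores"] - host["cores_in_use"] >= vm["vcpus"]:
            return host["name"]
    return None

def sla_met(observed_response_ms, vm):
    """Step three: is the application performing within its agreed limit?"""
    return observed_response_ms <= vm["sla_response_ms"]

target = place_vm(vm, hosts)
print("placed on:", target)          # esx-b has the headroom in this example
print("SLA met:", sla_met(180, vm))  # a sample observation of 180 ms
```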

While VMware's Distributed Resource Scheduler (DRS) provides the ability to set priorities at the VM layer (once the VM is running) to balance resources and ensure VMs have the resources they need, this is a manual process at best. In the cloud, where the operations staff is far removed from the business/application owner, taking the step towards application awareness will become vital for the performance and availability of applications.

The cloud providers will need to factor the creation, placement and performance of the VMs into their portals to maintain customer satisfaction. Moving towards application awareness is the first step.

Saturday, November 14, 2009

Application Awareness in the Cloud? Is it possible?

As we move closer to the cloud (private or public), the vendors are starting to circle the wagons around capacity planning, chargeback and availability. In order to do this, the vendors (hardware and software) need to enable their components to be application aware. What exactly does this mean, and what have we learned from our past?

A bit of history. Back in the mainframe days, there was one computer being shared by batch jobs, online applications (think web today) and databases. There were more, but let's start with this list. With only one computer to run all these applications on, there was a need to prioritize the different applications based on the available resources and the priority of each application (i.e., business unit). IBM and others created something called the Workload Manager. The goal was to provide metrics at the resource level (CPU, memory, I/O and network) and at the application level (which application was using the resources and which one had priority over the others).

Moving back to today's world: in the physical (not virtual) world, this is easy. Each application is installed onto one server and can use all the resources it wants, as it is the only application on that particular server. With average CPU utilization rates of around 10%, this model is a huge waste of resources and power. Virtualization has, in many ways, taken us back to the mainframe days (without the single computer).

So how do we enable the cloud to be application aware? A good start is to look to our past and what Workload Manager did for us in the mainframe days. VMware has done a good job of prioritizing VMs with its resource management functions. This helps ensure resources are available on a given ESX host at start time, and also manages memory and CPU resources during run time. It assumes the VM is a single application, which in most cases it is. With the cloud, many applications from potentially many different customers will be running in the same environment. How do we prioritize these workloads based on service level agreements, price points (think phone companies charging more for daytime calls versus nighttime calls) and resource availability?

I propose the need for deeper awareness of applications and resources by borrowing another piece of mainframe history: the concept of System Management Facilities (SMF) and Resource Measurement Facility (RMF) type records. These records provide the ability to collect resource usage for performance and tuning, application usage for chargeback, and workload data for the prioritization of applications across the cloud.
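To make the idea concrete, here is a minimal sketch of what such a record might carry and how it could feed tuning, chargeback and prioritization alike. The field names and values are my own illustration, not actual SMF/RMF record layouts.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class UsageRecord:
    # An SMF/RMF-inspired interval record; fields are illustrative only.
    timestamp: str      # start of the measurement interval
    application: str    # which business application consumed the resources
    cpu_seconds: float
    memory_mb: int
    io_ops: int
    net_mb: float

records = [
    UsageRecord("2009-11-14T09:00", "billing",  120.0, 4096, 15000, 220.0),
    UsageRecord("2009-11-14T09:00", "web-shop",  45.0, 2048,  3000,  80.0),
    UsageRecord("2009-11-14T09:15", "billing",  140.0, 4096, 17000, 240.0),
]

def cpu_by_application(recs):
    """Roll interval records up per application -- the raw input for tuning,
    chargeback and workload prioritization alike."""
    totals = defaultdict(float)
    for r in recs:
        totals[r.application] += r.cpu_seconds
    return dict(totals)

print(cpu_by_application(records))
```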

VMware has started the charge by opening up its API, and some third-party vendors have started bringing products to market for reporting and alerting on storage, CPU/memory and network usage. The next step is to tag the resources, allowing application-aware reporting and prioritization in the cloud.

Tuesday, March 31, 2009

Virtualization of SharePoint

I read an interesting article this evening in the April edition of Windows IT Pro magazine. The article, "The Essential Guide to Deploying MOSS" by Michael Noel, outlined the architecture of Microsoft Office SharePoint Server (MOSS). What I found interesting was the discussion around scalability options (horizontal or vertical) for the various components (Web, Query, Index, Database, and Application roles) for both physical and virtual deployments.

The article was based on HP hardware (and sponsored by HP) and took the reader through the configurations of a small farm supporting fewer than 400 users, all the way up to a "Highly Available Farm" supporting 1,000+ users.

The article finishes up with an entire section on "Virtualization of MOSS," where the recommendations for virtualization start with the Web role and move on to the Query role. The question is which roles fit best in a virtualized environment, to make the best use of the hardware and to provide scalability as the site grows. As with most three-tiered applications, the queuing should begin at the outside of the application and gradually narrow down to the database layer. Having more web server roles queuing up requests for the Index or Query roles makes sense; these roles require fewer resources (CPU, memory, I/O and network) than the database servers on the back end.

My perspective on this article, and on the need for virtualization performance and capacity planning, is that as your site grows, the ability to go back and fix architectural details (like the clustering of your database, or the number of servers and their utilization) becomes more and more difficult.

I found it interesting that the article stated, "SQL Servers that are heavily utilized may not be the best candidates for virtualization, because their heavy I/O load can cause some contention and they may require a large amount of the resources from the host, which reduces the efficacy of the setup."

The contention of resources in a virtualized world should be the number one point of monitoring and testing for new applications. IOPS, memory and CPU are the most contentious of the core four resources, with networking added to that list in the case of iSCSI and NAS. Proper segmentation of the network load can prevent contention on the network (VLANs on separate physical NICs, for example, or on HBAs). That leaves memory and CPU as points of possible contention. Understanding the OS requirements for cores and addressable memory, as well as the application load, can help properly size these resources per virtual machine and per host. Monitoring the key applications (remember the 80/20 rule) provides awareness of potential problems as usage grows.
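One way to keep an eye on that is sketched below; the host capacity and per-VM demands are invented numbers, not output from any monitoring tool. The idea is simply to add up what the VMs on a host are asking for across the core four and flag whichever resource has the least headroom.

```python
# Sketch: per-host headroom across the core four resources.
# All capacities and demands below are invented for illustration.

host_capacity = {"cpu_cores": 16, "memory_gb": 64, "iops": 20000, "net_mbps": 2000}

vm_demands = [
    {"cpu_cores": 4, "memory_gb": 16, "iops": 6000, "net_mbps": 300},
    {"cpu_cores": 2, "memory_gb": 8,  "iops": 9000, "net_mbps": 200},
    {"cpu_cores": 4, "memory_gb": 16, "iops": 3000, "net_mbps": 400},
]

def headroom(capacity, demands):
    """Return remaining capacity per resource and the tightest resource."""
    used = {k: sum(d[k] for d in demands) for k in capacity}
    remaining = {k: capacity[k] - used[k] for k in capacity}
    tightest = min(remaining, key=lambda k: remaining[k] / capacity[k])
    return remaining, tightest

remaining, tightest = headroom(host_capacity, vm_demands)
print("remaining:", remaining)
print("watch this resource first:", tightest)   # I/O is the squeeze in this example
```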

In the case of web applications, a close tie to the business side of the house is also important. If a new web page or process is placed on a web server, or usage increases for any number of reasons without proper sizing, the whole site can come down.

Happy reading and remember to plan for and monitor your vital applications.

Source: http://www.hp.com/solutions/activeanswers/sharepoint and http://www.hp.com/go/sharepoint

Saturday, March 28, 2009

Cisco Announcement UCS - Chargeback

This past week (Mar 17), Cisco announced the release of their Unified Computing System (UCS), which marks their entry into the server market. From what I have read, these are some very beefy boxes, with half- or full-width blades (dual or quad socket) and up to 384 GB of memory per blade. The networking components supplied with the product include iSCSI and FC networking to connect up to 320 servers.

I see this entry as a great solution for a hosting company focused on offering virtual solutions, where the density, networking and security (segmentation) provide the best possible cost points (CapEx and OpEx).

This brings me to the point of this entry: how are customers dealing with chargeback for their virtual infrastructures? This is primarily an ESX-world question today, although I have a hunch that the Hyper-V world will be coming on strong with its latest release. In the grand scheme of charging customers for infrastructure (intra-enterprise, hosting model or the cloud), there are micro and macro ways of doing things. In the physical world, the customer is charged for the whole box, the setup and administrative time, and the storage. They pay for the whole box and can use as much or as little of the computing resources as their application needs. In the virtual world, this changes dramatically, because VMs vary in size and, more importantly, vary in the amount of resources they use. One cannot, and should not, place all virtual servers of the same size on a given box or data center and assume all VMs of the same size will play together nicely!

From the VMware perspective, the memory management algorithms allow VMs to be paged depending on usage, so a VM may consume the resources defined for it, or less, depending on the needs of the application within the virtual machine itself.

So my question is: how important is chargeback becoming in the new virtualized world?

Let's flash back about 25 years to the mainframe days, when the mainframe was divided up into partitions (sound familiar? :-)) and multiple applications (CICS, IMS, DB2, batch workloads, etc.) were using the compute resources all at the same time. Each of these applications (i.e., business units) was measured to determine how much of the mainframe's resources it consumed. This is where capacity planning, chargeback, and performance and tuning all come together: to monitor, from the business application perspective, what resources are used and how many will be needed going forward to maintain SLAs and performance expectations.

Now, let's branch over to an analogy from the phone company billing systems. Remember when phone call rates were lower at night and on weekends? The phone companies did this to entice users (via costing models) to move resource requirements off prime time to other times. Now let's take another analogy from the mainframe days, with something called the workload manager. This was a process of assigning a priority to a workload (batch jobs were less important than online transactions, and financial reporting was more important than inventory at the gym) and controlling which workload received more of the available resources at any given time.

Now, let's bring it all back together. Measuring resource consumption gives business units and infrastructure managers the ability to know what resources are required and how they are being used. Beginning with measurement, the collected data allows companies to charge for the resources actually consumed, versus a macro view in which the customer is charged a flat rate whether the resources are used or not.
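Pulling those threads together, here is a small sketch of consumption-based chargeback with a cheaper off-peak rate, echoing the phone-company model above; the rates and usage figures are invented for illustration.

```python
# Sketch: consumption-based chargeback with peak/off-peak rates.
# Rates and usage below are illustrative only.

RATE_PER_CPU_HOUR = {"peak": 0.40, "off_peak": 0.15}   # dollars, invented

usage = [
    # (business unit, CPU-hours consumed, rate period)
    ("finance",   120.0, "peak"),
    ("finance",   300.0, "off_peak"),
    ("inventory",  40.0, "peak"),
]

def chargeback(usage_rows):
    """Bill each business unit for measured consumption instead of a flat rate."""
    bills = {}
    for unit, cpu_hours, period in usage_rows:
        bills[unit] = bills.get(unit, 0.0) + cpu_hours * RATE_PER_CPU_HOUR[period]
    return bills

for unit, amount in chargeback(usage).items():
    print(f"{unit}: ${amount:,.2f}")
```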

There are some products out there that are breaking into this market space, including VAlign, VKernel and others.

More on the importance of CMDB

I was reviewing some other blogs this morning and came upon the announcement from Microsoft that Windows Server 2003 SP0 is no longer supported. Now, I may have heard a chuckle from some of you just now: "Who in the world would still be running Win2003 SP0?" Well, let me share a story from an assessment I worked on prior to a virtualization project. The goal of the assessment was to prepare the organization for virtualization. The first step was to determine how many physical servers they had, measure their performance and, through the wonderful world of capacity planning, determine how many ESX hosts they would need (loaded to a pre-determined level).
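The arithmetic behind that last step might look something like the sketch below; the measured loads, host sizes and loading level are invented for illustration. Total up what the physical servers actually use, then divide by what one host can carry at the agreed loading level.

```python
import math

# Sketch: how many virtualization hosts a measured estate might need.
# All figures below are invented for illustration.

measured_servers = [
    # (avg CPU GHz used, avg memory GB used) per physical server
    (1.2, 3.5), (0.4, 2.0), (2.8, 6.0), (0.9, 4.0), (1.5, 8.0),
]

host_cpu_ghz = 2.5 * 16      # 16 cores at 2.5 GHz per host
host_memory_gb = 96
target_load = 0.65           # pre-determined loading level (65%)

def hosts_needed(servers, cpu_capacity, mem_capacity, load):
    total_cpu = sum(c for c, _ in servers)
    total_mem = sum(m for _, m in servers)
    by_cpu = math.ceil(total_cpu / (cpu_capacity * load))
    by_mem = math.ceil(total_mem / (mem_capacity * load))
    return max(by_cpu, by_mem)   # the tighter resource decides

print("hosts needed:", hosts_needed(measured_servers, host_cpu_ghz,
                                    host_memory_gb, target_load))
```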

Well, the assessment went off without a hitch and the results were presented to the project manager. He went through the roof! The problem was not the report, but what the report told him. You see, they had just finished a rather nasty project of upgrading all their servers from Win2000 to Win2003. The project was wrapped up and its completion had been reported to the CIO. The assessment report showed the existence of 12 more Win2000 servers that had not been converted.

The point being, without a complete inventory of computing resources (CPU, memory, network and storage), both physical and virtual, the fires just keep being lit, like trick birthday candles. Put one fire out and it simply starts back up; you quickly run out of breath, or you remove the candle, right? To turn your organization around from a tactical mode to a strategic mode, consider completing a thorough inventory of your organization.

Now, let's move on to how to conduct the inventory. There are usually three different means of conducting a computing resource inventory.
1. Have your agents (Tivoli, HP OpenView, BMC Patrol and many others on the market) tell you what you have. The problem is that you most likely do not have agents on every machine in your environment (think test/dev/QA), so this cannot possibly be complete.
2. Send out an email to all business managers asking them to tell you what they have. Um, think about that: what would you report, and how would you collect the information?
3. Utilize an agentless tool that can scan LanMan directories, perform an IP scan of your subnets AND interview business units for possible hidden assets behind firewalls or on standalone networks (you know they exist out there).
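As a flavor of what the agentless option can look like in practice, here is a bare-bones sketch; the subnet and port are placeholders, and a real assessment tool does far more (multiple ports, WMI/SSH queries, LanMan browsing). It simply sweeps a subnet and notes which addresses answer on the SMB/LanMan port.

```python
import ipaddress
import socket

# Bare-bones agentless sweep: which addresses on a subnet answer on the
# SMB/LanMan port? Subnet and port below are placeholders for illustration.
SUBNET = "192.168.1.0/28"
PORT = 445               # SMB; a fuller scan would try several ports
TIMEOUT_SECONDS = 0.5

def sweep(subnet: str, port: int):
    """Return the addresses in the subnet that accept a TCP connection on the port."""
    alive = []
    for ip in ipaddress.ip_network(subnet).hosts():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(TIMEOUT_SECONDS)
            if sock.connect_ex((str(ip), port)) == 0:   # 0 means the port answered
                alive.append(str(ip))
    return alive

if __name__ == "__main__":
    for host in sweep(SUBNET, PORT):
        print("responds on SMB:", host)
```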

The inventory should then be validated with procurement and any provisioning/software distribution processes to ensure a complete listing. Only then can an organization start down the strategic road of knowing what they have and where they are going.

Keep in mind, an inventory can, and most likely should, include software, security patches, and which services are installed or not.