Site Reliability Engineering is Driving MSP Need for Centralized IT Monitoring

Automated response and single-pane monitoring are helping Managed Service Providers deliver highly reliable IT operations for their customers. Consider these practices.

Site reliability engineering

Site reliability engineering (SRE) continues to grow in prominence as a desired methodology for providing a quick response to infrastructure issues that otherwise might lead to expensive downtime and lost productivity. For SRE to work in the day-to-day world of an IT operation, however, it requires automated, centralized monitoring to identify and resolve events that can easily undermine its goals.

Managed service providers are integrating SRE methodology into their service portfolio to deliver the reliability customers expect. In doing so, they’re seeking monitoring solutions that are capable of supporting SRE standards.

Visibility in a Hybrid Environment

OST, a Michigan-based MSP with offices and customers around the globe, is a great example of an MSP that has embraced the SRE methodology. IT operations monitoring is one of an extensive range of services OST provides, and it is crucial to maintaining reliability and therefore to building trust and success with its customers.

Like many organizations these days, OST’s customers are functioning in a hybrid environment of legacy on-premises and cloud systems. A number of the customers are pursuing digital transformation and, as a result, moving more applications to the cloud. The customers, which are large organizations in sectors as diverse as healthcare and manufacturing, are running up to 10,000 hosts, and growing. OST needed the scalability to support this expanding customer base of hybrid environments, and to find and fix problems before the customer even knows something is wrong.

Another challenge was the breadth of different client systems OST needed to support. These include AS400 mainframes; OpenVMS and other decades-old but mission-critical systems; a dozen different on-premises storage systems; and cloud-native application infrastructure.

Layered over all of this was the need to enact SRE-level, ultra-reliable monitoring for its customers, regardless of which client systems they were running or how large a server environment they were operating.

Scaling Up to Meet Demand

Given the scope of client systems, OST was more than ready to embrace ‘ultra-scalability,’ another tenet of SRE. Scalability was a must-have as OST sought an automated monitoring solution. As new customers came on board, or existing organizations scaled up through acquisition or growth, OST had to be able to quickly manage the increased demand. Its monitoring capability had to include a full feature set that could accommodate any additional needs a customer identified, all at a competitive cost.

The company chose Opsview as its monitoring solution. “We are a smaller organization and we provide services to organizations that are much larger than we are. So, we have to go to them and operate at their level,” says OST’s MSP manager, Matt Glenn. “Opsview represents to them an enterprise toolset with enterprise capabilities. It shows them that we’re enterprise-ready.”

Incorporating SRE

Google launched SRE some 16 years ago with the formation of a team to improve Google’s large-scale systems, with the objective of a high-quality user experience for Google site visitors. Ben Treynor, founder of Google’s Site Reliability Team, is credited with saying SRE is “what happens when a software engineer is tasked with what used to be called operations.”

High reliability and scalability are the twin pillars of SRE. The goal is to free IT from operations that can be automated so teams, notably DevOps, can focus on innovating new features, scaling the business and automating further processes. In the larger view, SRE plays into the trend of IT teams linking themselves more closely to business value and competitive initiatives like digital transformation, and finding the bandwidth to work on these value-added tasks. Automation is key to this evolution.

Benefitting by Automation

MSPs thinking about providing advanced monitoring to better serve their customers, and embracing SRE in the process, need to put automation at the top of their criteria.

Customers will react positively to IT improvements that provide reliability and scalability and enable them to deploy staff to more competitive tasks. In adding automated IT monitoring to their portfolio, MSPs can consider these additional benefits:

  • Accelerated resolution of incidents promotes customer confidence and productivity.  For example, OST was able to reduce by 40% the time taken to find and resolve incidents.
  • Single pane of glass visibility ensures all applications are performing well.  It provides a central source for collecting metrics from a wide range of legacy and digital systems – including hybrid and cloud environments.  Rather than having to navigate through siloed monitoring systems, a centralized monitoring solution provides efficiency for MSPs serving already complex hybrid environments.
  • Reporting capability enables MSPs to add further value by providing quarterly reviews of their customers’ IT infrastructure use and capacity, for example. It can also help with identifying assets that are being under-utilized.
  • Accountability can be improved by direct integration of a monitoring solution like Opsview into the ticketing system, thus enhancing transparency with customers, and highlighting the value of speedy incident response.
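
As a rough illustration of the single-pane and ticketing points above, the following minimal Python sketch merges check results from several siloed feeds into one prioritized view and raises a ticket record for each new critical result. It is not Opsview's API or OST's implementation: the names `CheckResult`, `merge_results` and `open_tickets`, the feed names, and the ticket fields are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class CheckResult:
    source: str   # hypothetical silo name, e.g. "on-prem-storage" or "cloud"
    host: str
    check: str
    status: Status


def merge_results(*feeds):
    """Flatten several per-silo feeds into one list, worst status first."""
    merged = [result for feed in feeds for result in feed]
    return sorted(merged, key=lambda r: (-r.status, r.source, r.host))


def open_tickets(results, already_open):
    """Build ticket records for CRITICAL results that have no open ticket yet.

    `already_open` is a set of (host, check) pairs; it is updated in place so
    that repeated polling does not raise duplicate tickets.
    """
    tickets = []
    for r in results:
        key = (r.host, r.check)
        if r.status == Status.CRITICAL and key not in already_open:
            tickets.append({"host": r.host, "check": r.check, "source": r.source})
            already_open.add(key)
    return tickets


# Hypothetical feeds standing in for separate monitoring silos.
legacy_feed = [CheckResult("legacy", "as400-01", "disk", Status.WARNING)]
cloud_feed = [
    CheckResult("cloud", "web-01", "http", Status.CRITICAL),
    CheckResult("cloud", "web-02", "http", Status.OK),
]

open_keys = set()
dashboard = merge_results(legacy_feed, cloud_feed)
new_tickets = open_tickets(dashboard, open_keys)
```

Sorting by severity puts the most urgent result at the top of the merged view, and tracking open `(host, check)` pairs means a second polling pass over the same results raises no duplicate tickets. A real integration would call the monitoring and ticketing systems' own interfaces instead of building dictionaries.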

Advanced Monitoring Key to SRE Success

MSPs like OST are proving that automated centralized monitoring is the key to delivering the reliability standards their customers expect. Being able to remediate issues even before the customer notices a potential disruption is a powerful means of building customer confidence and trust. As customers grow their business and move more applications to the cloud, a monitoring solution that brings order to this complexity through a single pane of glass gives MSPs an efficient means to show even more value.

“In driving down costs, we can have one person supporting a much larger number of systems than we could before,” says Glenn. “This means that we can hire more knowledgeable and skilled people than before to drive customer stratification, enabling a great balance and growth.”