Proactive Infrastructure Analysis - Part 1: Monitoring

Many teams run complex architectures that make it difficult to monitor proactively, to capture and resolve issues quickly, or to anticipate issues before they happen. I’ve overseen the architecture for several teams at IBM, each with its own complexities across on-prem and cloud technologies, and this problem has been a common thread across all of them.

I typically see a rudimentary setup: server probes or manual checks of server resource allocation, coupled with email alerts once an issue is already being experienced. With on-prem technologies, this usually means an automated probe sends an email alert, someone investigates, and then the proper support team is engaged to look at and resolve the issue. In the best case this takes about an hour to resolve an outage; more likely it takes 2-4 hours in total. Such outages can seem like an eternity to customers of a web application that is expected to maintain 99% or greater availability.
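To put those numbers side by side, here is a minimal sketch of the downtime an availability target allows. The 30-day month and the function name `downtime_budget_hours` are assumptions made for illustration; they are not taken from any monitoring tool.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_MONTH) -> float:
    """Return the hours of downtime a period can absorb at a given availability target."""
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare a 99% target with a stricter, hypothetical 99.9% target.
    for target in (99.0, 99.9):
        print(f"{target}% availability -> {downtime_budget_hours(target):.2f} h/month")
```

Under these assumptions, a 99% target leaves roughly 7.2 hours of downtime per month, so two outages at the upper end of the 2-4 hour range would already exceed it.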

This is where modern technologies and some ingenuity can help reduce overall downtime and, especially, the time to resolve an issue. As I established plans for moving from on-prem to cloud, I worked with each team to define two key tools and one responsibility change that have drastically improved application uptime.

Automated Monitoring

When establishing the cloud instance of our applications, I had the team set up New Relic, which provides a wide variety of monitoring utilities. New Relic helps application teams move faster with confidence, which means reducing costly downtime, improving engineer productivity, and enabling high-performing applications that deliver differentiated experiences for customers. The tool suite offers benefits like the following:

Easy-to-set-up real-time instrumentation and analytics
Flexible instrumentation and dashboarding
Guides appropriate engineer responses
Correlates application performance to end-user experience
Connects application and infrastructure performance
Rich, detailed transaction data that can be mined to understand application behavior and performance
Real-time error analysis with on-demand diagnostic tools
Integration with DevOps tooling to connect with common development team applications
Cloud-service instrumentation
Built to scale with your cloud applications

Figure: New Relic Insights Dashboard

Conclusion

New Relic has become our team’s go-to tool to analyze trends and global performance, identify a potential problem before it happens, or flag an outage in real time. Management also appreciates the rich statistics and the ability to show concretely what application availability was or when a problem occurred.

We have used this tool to stay ahead of potential issues and, when an outage does occur, to react quickly and resolve the problem, sometimes in as few as 10 minutes versus the previous 2-4 hours. The statistical information can easily be mined to provide real-time insight into our applications, helping us optimize and reduce costly downtime and thereby increasing our team’s overall business value.
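As a rough illustration of what the shorter resolution time means for availability, the sketch below assumes a single outage in a 30-day month. The outage count, the period length, and the function name `availability_pct` are hypothetical and chosen only for the example.

```python
def availability_pct(outage_minutes: float,
                     outages: int = 1,
                     period_days: int = 30) -> float:
    """Return availability (in percent) for a period with the given outages."""
    total_minutes = period_days * 24 * 60
    downtime = outages * outage_minutes
    return 100 * (1 - downtime / total_minutes)


if __name__ == "__main__":
    # Resolution times mentioned in the text: 10 minutes vs. 2-4 hours.
    for label, minutes in (("10 min", 10), ("2 h", 120), ("4 h", 240)):
        print(f"one {label} outage -> {availability_pct(minutes):.3f}% availability")
```

With one outage per month, a 10-minute resolution works out to about 99.98% availability, compared with about 99.72% at 2 hours and 99.44% at 4 hours.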

In Part 2, we will look at how to connect the application monitoring to the appropriate teams and utilize modern tools to make sure the right team members know about an issue in real time and are engaged to resolve it immediately.

About The Author

Kerry Landis
