Many teams run complex architectures that make it difficult to proactively monitor, capture, and quickly resolve issues, let alone anticipate them before they happen.  I’ve overseen the architecture for several teams at IBM, each with its own unique complexities across on-prem and cloud technologies, and this problem has been a common thread across all of them.

I typically see a rudimentary setup of server probes or manual checks of server resource allocations, coupled with email alerts when an issue occurs.  With on-prem technologies, this usually means an automated probe sends an email alert, someone investigates, and then the proper support team is engaged to look at and resolve the issue.  Resolving an outage this way takes about an hour in the best case and more likely 2-4 hours in total.  Such outages can seem like an eternity to customers of a web application that is expected to maintain 99% or greater availability.
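
To put those outage durations in perspective against the 99% availability target, here is a minimal back-of-the-envelope sketch. It assumes a 30-day month as the measurement window, and the function names are purely illustrative, not part of any monitoring tool.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: availability is measured over a 30-day month


def downtime_budget_hours(availability: float, period_hours: float = HOURS_PER_MONTH) -> float:
    """Hours of downtime a period can absorb while still meeting the target."""
    return period_hours * (1.0 - availability)


def budget_used(outage_hours: float, availability: float = 0.99) -> float:
    """Fraction of the monthly downtime budget consumed by a single outage."""
    return outage_hours / downtime_budget_hours(availability)


if __name__ == "__main__":
    print(f"Budget at 99%: {downtime_budget_hours(0.99):.1f} h/month")
    for outage in (1, 2, 4):  # best case and typical range from the text above
        print(f"{outage} h outage uses {budget_used(outage):.0%} of the budget")
```

Under these assumptions, a 99% target leaves roughly 7.2 hours of downtime per month, so a single 4-hour outage consumes more than half of the monthly budget.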

This is where modern technologies and some ingenuity come into the picture, helping to reduce overall downtime and especially the time to resolve an issue.  As I established the plans for moving from on-prem to cloud, I worked with each team to define two key tools and one responsibility change that have drastically improved application uptime.

Automated Monitoring

When establishing the cloud instance of our applications, I had the team set up New Relic, which provides a wide variety of monitoring utilities.  New Relic helps application teams exploit the latest technology trends to move faster with confidence, which means reducing costly downtime, improving engineer productivity, and enabling high-performing applications that deliver differentiated experiences for customers.  The tool suite offers benefits like the following:
  • Easy-to-set-up real-time instrumentation and analytics
  • Flexible instrumentation and dashboarding
  • Guides appropriate engineer responses
  • Correlates application performance to end-user experience
  • Connects application and infrastructure performance
  • Rich, detailed transaction data that can be mined to understand application behavior and performance
  • Real-time error analysis with on-demand diagnostic tools
  • Integration with DevOps tooling to connect with common development team applications
  • Cloud-service instrumentation
  • Built to scale with your cloud applications

Figure: New Relic Insights Dashboard

Conclusion

New Relic has become our team’s go-to tool to analyze trends and global performance, identify a potential problem before it happens, or flag an outage in real time.  Management also appreciates the rich statistics and the ability to prove application availability or pinpoint when a problem occurred.

We have used this tool to stay ahead of potential issues and to react quickly when an outage does occur, sometimes resolving the problem in as few as 10 minutes versus the previous 2-4 hours.  The statistical information is easy to mine and provides real-time insight into our applications, helping us optimize them and reduce costly downtime, thereby increasing our team’s overall business value.

In Part 2, we will look at how to connect the application monitoring to the appropriate teams and utilize modern tools to make sure the right team members know about an issue in real time and are engaged to resolve it immediately.