The Customer
Canada Post is the primary postal operator in Canada, delivering mail and parcels nationwide while supporting e-commerce and logistics operations across the country.
The Project
Production Support for Next Generation Track and Trace System (NGTT)
Overview
Improving, working closely with Canada Post, planned and built a Next Generation Track and Trace system from the ground up, drawing on industry-leading expertise in reactive microservice systems architecture. The Track and Trace system is a complex ecosystem that lets the “Parcels” line of business track the full life cycle of a parcel through the network that Canadians and people abroad use to send and receive packages through Canada Post.
The managed services team supported the software reliability engineering demands of this large, scalable solution, which is built on up-to-date cloud infrastructure: Microsoft Azure, Kubernetes clusters, Confluent Kafka, and Akka by Lightbend.
The Challenge
Canada Post faced significant challenges in maintaining the performance and availability of its critical services as parcel volumes grew and demand spiked, both seasonally and unexpectedly. The internal IT team was overwhelmed by constant reactive troubleshooting, alerting, and the management of a large, disparate set of services, resulting in downtime and slow response times, especially during peak seasons.
Services Provided
Managed services
24/7 production support
Incident Management
Proactive Monitoring & Reporting
Systems Optimization
The Solution
A managed services approach was implemented by Improving Ottawa to provide a comprehensive, 24/7 support framework, leveraging industry best practices and automation.
24/7 Monitoring and Incident Management: Implemented continuous monitoring for all critical systems, databases, and applications, with real-time alerts for any anomalies or performance degradation.
Pre-emptive system health checks were scheduled to identify potential issues before they affected production workloads.
Incident Response: Developed an efficient incident management workflow, ensuring that issues were identified and resolved within pre-defined SLAs.
Automated Solutions: Sourced, procured, and implemented the Splunk On-Call tool to accelerate incident response and reduce manual ticket creation efforts.
Root Cause Analysis and Postmortems: Identified patterns in recurring incidents and worked closely with development teams to implement long-term fixes rather than temporary workarounds. Developed an incident process for all high-impact incidents to mitigate recurrence.
Reporting and Analytics: Implemented real-time dashboards to provide visibility into system health and performance metrics.
Status reports were shared regularly with the client to understand the trends and identify areas for future improvement.
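As an illustration of the incident response workflow described above, the following minimal Python sketch computes a mean time to acknowledgement and flags incidents acknowledged outside an SLA. The `Incident` record, the 15-minute `ACK_SLA` target, and the sample timestamps are all hypothetical; the case study does not state the actual SLA values or the data model used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical acknowledgement SLA; the real target is not given in the case study.
ACK_SLA = timedelta(minutes=15)


@dataclass
class Incident:
    """Illustrative incident record with page and acknowledgement times."""
    incident_id: str
    paged_at: datetime
    acknowledged_at: datetime

    @property
    def time_to_ack(self) -> timedelta:
        return self.acknowledged_at - self.paged_at


def mean_time_to_ack_minutes(incidents: list[Incident]) -> float:
    """Average time between paging and acknowledgement, in minutes."""
    return mean(i.time_to_ack.total_seconds() for i in incidents) / 60


def sla_breaches(incidents: list[Incident], sla: timedelta = ACK_SLA) -> list[Incident]:
    """Incidents whose acknowledgement took longer than the SLA."""
    return [i for i in incidents if i.time_to_ack > sla]


# Made-up sample data, for illustration only.
t0 = datetime(2024, 1, 1, 2, 0)
incidents = [
    Incident("INC-1", t0, t0 + timedelta(minutes=4)),
    Incident("INC-2", t0, t0 + timedelta(minutes=6)),
    Incident("INC-3", t0, t0 + timedelta(minutes=20)),
]

mtta = mean_time_to_ack_minutes(incidents)
breached = sla_breaches(incidents)
print(f"MTTA: {mtta:.1f} min, SLA breaches: {[i.incident_id for i in breached]}")
```

In practice these figures would come from the paging tool's own reporting rather than hand-built records; the sketch only shows how the two measures relate.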
Key Success Metrics
Supporting over 1 billion Kafka messages processed per week and 24.6 GB of data per day processed through the Next Generation Track and Trace core systems.
Integrating Splunk On-Call minimized escalations to stakeholders and roughly halved Mean Time to Acknowledgement (from 10.3 minutes in 2023 to 5.11 minutes in 2024).
The managed services team missed zero incident acknowledgements.
Paged out incidents 24/7/365:
2022: 294
2023: 336
2024: 274
Handled over 3,300 incidents combined from 2022 through 2024, with zero escalations to client stakeholders for a missed incident.
Automatically created 2,558 service desk and Jira tickets via the Splunk On-Call integration, saving over 5,000 hours of manual effort.
Transitioned incident management to the client's preferred offshore team and onboarded the client’s existing IT support team through extensive knowledge training sessions and supporting documentation, setting up a successful future support effort.
The services provided by Improving were independently reviewed by Deloitte to evaluate their maturity and received a rating of 4.9/5.
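For readers who want to check the arithmetic behind these figures, the short Python sketch below derives a few values from the numbers quoted above. The variable names are illustrative, and the 5,000 hours is treated as a lower bound because the source says "over 5000 hours".

```python
# Figures quoted in the Key Success Metrics list.
paged_incidents = {2022: 294, 2023: 336, 2024: 274}
mtta_minutes = {2023: 10.3, 2024: 5.11}
tickets_created = 2558
hours_saved_lower_bound = 5000  # source says "over 5000 hours"

# Total incidents paged out across the three years.
total_paged = sum(paged_incidents.values())

# Relative reduction in Mean Time to Acknowledgement from 2023 to 2024.
mtta_reduction_pct = (
    (mtta_minutes[2023] - mtta_minutes[2024]) / mtta_minutes[2023] * 100
)

# Implied minimum manual effort avoided per automatically created ticket.
hours_per_ticket = hours_saved_lower_bound / tickets_created

print(f"Total paged incidents 2022-2024: {total_paged}")
print(f"MTTA reduction 2023 -> 2024: {mtta_reduction_pct:.1f}%")
print(f"Implied hours saved per ticket: at least {hours_per_ticket:.2f}")
```

The paged-out total is smaller than the "over 3,300 incidents" handled, which suggests that only a subset of handled incidents triggered a page; the source does not break this down further.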
The Business Benefits
Increased System Availability: The managed services approach led to a 95% reduction in unscheduled downtime, significantly improving customer experience. Proactive incident management and automated processes led to faster recovery times from disruptions.
Proactive Issue Resolution: With 24/7 monitoring and regular system checks, critical outages were prevented or resolved before escalation, reducing emergency incidents by 40%.
Cost Efficiency: The client saved on operational costs by reducing the need for additional internal resources, allowing its staff to focus on innovation and delivery.
Enhanced IT Team Productivity: Managed Services acted as an extension of the internal team, providing expertise when needed and relieving staff from repeated issues.
Why Choose Improving
This case study exemplifies Improving's ability to build a managed services production support model that enables clients to address their operational challenges, reduce costs, and improve system reliability. The partnership enhanced system availability, especially during peak periods. The proactive support provided by the managed services team allowed Canada Post's internal IT resources to focus more on strategic tasks and product development for future growth.