What CIOs can learn from the massive Optus outage

Planning for inevitable outages

Even if they’re not overseeing vast networks like Optus’, IT leaders and their executive counterparts must plan for outages, their own or those of their service providers, as even small or localized outages can still disrupt the business and its customers.

“It’s important to review your business continuity plans and ensure you’ve got some kind of backup, where possible, to continue with [business as usual],” says Tett.

This business continuity plan might include processes for reverting to paper-based systems, shifting to cellular coverage instead of internet, equipping executives and key staff with dual-SIM phones so they can switch networks and keep communications running, or whatever else is relevant to the organization.

“It’s like having a flight manual so that if you lose a significant part of the technology you can try and ensure there are some offline ways to continue functioning,” he says.

Spark the disaster recovery conversation

CIOs can use these headline-making incidents to spur conversations with their infrastructure leaders to review their disaster recovery plan. “Don’t wait for something to happen. It should be an ongoing, systematic approach to look at where vulnerabilities lie,” says Fredkin, who cites Netflix’s Chaos Monkey, which creates random outages in its production environment, as a key component of the streaming media giant’s strategy for improving the resiliency of its complex systems.

“Causing chaos in their system allows them to expose weak points, see how things might pan out, and plan and run drills of what could happen,” he says.
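To make the idea concrete, here is a minimal tabletop sketch of a chaos-style drill in Python. It is not Netflix's Chaos Monkey and touches no real infrastructure. It only simulates removing one instance per service from an in-memory inventory and reports which services would drop below a minimum. The `Service` class, the `min_healthy` threshold, and the `run_chaos_drill` function are hypothetical names introduced for illustration.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical inventory entry: a service and its running instances."""
    name: str
    instances: set[str] = field(default_factory=set)
    min_healthy: int = 1  # assumed minimum instance count to stay available

    def is_available(self) -> bool:
        return len(self.instances) >= self.min_healthy


def run_chaos_drill(services: list[Service], rng: random.Random) -> list[str]:
    """Simulate losing one random instance per service.

    Returns the names of services that would no longer be available,
    i.e. the weak points the drill exposes. This only mutates the
    in-memory objects passed in; it does not terminate anything real.
    """
    weak_points = []
    for service in services:
        if service.instances:
            # Sort first so a seeded rng gives a repeatable choice.
            victim = rng.choice(sorted(service.instances))
            service.instances.discard(victim)
        if not service.is_available():
            weak_points.append(service.name)
    return weak_points


if __name__ == "__main__":
    inventory = [
        Service("billing", {"billing-1"}),
        Service("auth", {"auth-1", "auth-2"}),
    ]
    print(run_chaos_drill(inventory, random.Random(0)))
```

In this toy inventory, the single-instance service is the one the drill flags, which is the kind of finding a real exercise would feed back into the disaster recovery plan.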

Conversations around disaster recovery need to involve the CFO and CEO to map the risks of being offline and of losing customer trust, as well as the costs to mitigate those risks. “How one company is impacted can differ substantially to the way another company’s impacted, so you’ve got to take that into account too,” Fredkin says.

Understand third-party risks

According to Uptime, managed digital infrastructure services, including cloud, colocation, telecom, and hosting companies, account for a growing proportion of outages today. As such, IT leaders must be aware of, and know how to manage, third-party vendor risks, says Budde, “particularly in a technological landscape where cost-saving measures and outsourcing have become common.”

For software or hardware updates, it’s vital to have a list of critical vendors along with the timing and nature of updates. CIOs need to look at whether it’s feasible to roll out updates to some customers and not others, or to some parts of their infrastructure and not others, Fredkin says. They also need to find “a way you can do some testing so it doesn’t impact the entire production environment,” he adds.

“Having good relationships with the people who provide the hardware and the software is crucial. Knowing when something, like an update, is coming ahead of time, and having some sort of control over when that update is pushed through to your organization can be very beneficial,” he says.
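One common way to update some customers and not others is a percentage-based staged rollout. The sketch below is an illustrative assumption, not a method described by Fredkin or used by any vendor named here. It hashes a customer ID into a stable bucket from 0 to 99 and enables the update only for buckets below the current rollout percentage. The names `rollout_bucket` and `in_rollout` are hypothetical.

```python
import hashlib


def rollout_bucket(customer_id: str) -> int:
    """Map a customer ID to a stable bucket in the range 0-99.

    A cryptographic hash is used only for its even spread; the same
    ID always lands in the same bucket across runs and machines.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(customer_id: str, percent: int) -> bool:
    """Return True if this customer should receive the update.

    `percent` is the share of customers currently targeted (0-100).
    Raising it only adds customers; nobody already updated drops out.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(customer_id) < percent


if __name__ == "__main__":
    customers = [f"customer-{n}" for n in range(1000)]
    updated = [c for c in customers if in_rollout(c, 10)]
    print(f"{len(updated)} of {len(customers)} customers get the update first")
```

Because each customer's bucket is fixed, the rollout can start at a small percentage, be watched for problems, and then be widened or halted without reshuffling who has already received the change.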

Make the case for IT modernization

As unfortunate as they are, headline-grabbing outages often offer the opportunity for IT leaders to make their own case for IT modernization, Fredkin advises. Although not expressly the case with Optus, when systems go offline, it is often related to a legacy technology issue, and these incidents can help motivate buy-in at the leadership and board level to update systems to ensure they’re secure and resilient at speed and at scale, he says.

“When CIOs are making a modernization use case, they need to have the stakeholder buy-in for the business to come along the journey,” he says.

Modernizing complex, mission-critical functions can take two to three years, so there needs to be a way of ordering and prioritizing efforts as well. “Think of it like a traffic-light system,” Fredkin says, looking at what is crucial and critical, and what is urgent. “What are the biggest gaps in the system? And in terms of the longer-term refresh, that’s a different prioritization, because some things need to be done in a specific order,” he says.

“It’s that classic waterfall mentality, which still has a very big place when it comes to redesigning critical infrastructure,” he adds.

Consider the larger picture

Whether they originate with your systems or are the result of connected networks, outages can impact a wide range of businesses at once. As such, IT leaders might want to consider thinking beyond their organization’s four walls, Budde says.

“A tailored disaster and resilience plan needs to include compliance with industry standards and regular review of IT systems and protocols to ensure robustness, particularly in response to potential network stress and security threats,” he says, adding that such efforts might need to go further than just your organization, depending on your industry.

“We may need some out-of-the-box thinking and start looking at nationwide solutions and industry-wide solutions in how organizations can assist each other in these situations,” he says.

Overlook communications at your peril

Last, but by no means least, organizations need a comprehensive communications playbook for when outages or disruptions occur, regardless of whether those outages originate with them.

“It’s vital to have clear, concise communication about any outages or issues,” says Enex Test Labs’ Tett. This communication should go up the chain to the CEO as well as outward to customers and the media to provide as much clarity as possible about the situation.

“The first thing organizations need to think of is how to clearly communicate with their customers, even if it’s not them that’s causing a disruption. And the second is, if they can’t communicate with their customers because of network outages, have a strategy in place to be able to communicate via the media,” he says.

Any communication should also include some kind of time frame to help manage expectations around downtime and restoration of business as usual. “Whether it’s a few hours or 48 hours, be open and transparent,” says Tett.

About Author


Andy Curtis is an award-winning security consultant, researcher and public speaker. He has been working in the computer security industry since the early 1990s, having been employed by state and federal government, leading healthcare and banking providers across three continents. He has given talks about computer security for some of the world’s largest companies, worked with law enforcement agencies on investigations into hacking groups, and is a regular voice on TV and radio explaining IT security threats.
