正しいSREのやり方

Google Explains Why Others Are Doing SRE Wrong

Google诠释其它企业在实施SRE中的错误

SLOに対する理解が重要ですね。
要は、未然防止のためにやる指標で、障害対応のため(SLA)ではない。
ユーザが気づく前に、まずSREチームが問題を検知。

Typical SLOs at Google include:

uptime of 99.9% a month (i.e. 43 minutes of downtime a month)
99.99% of HTTP requests in a month succeed with a 200 OK
50% of HTTP requests returned in under 300ms

For Google, a typical error budget policy is to disallow launching new features once an application has exhausted its error budget (for example, already over the 43 minutes downtime budget for this month), or dedicating a sprint to corrective actions stemming from previous post-mortem analysis.

私が知ってる範囲で、ほとんどの組織は下記のアンチパターンに当てはまる。

Thorne further referenced some clear anti-patterns to implementing SRE such as simply rebranding the ops team to SRE team or hiring for SRE engineers without first putting in place the SRE principles and mechanisms (SLOs, error budget policies and balancing workload) for success.