
System Reliability Engineer
- Australia
- Permanent
- Full-time
- Enjoy ownership and responsibility, with a bias towards identifying problems and proposing and implementing solutions.
- Strong experience with Ruby on Rails, especially in production SaaS systems.
- Deep knowledge of background job processing (Sidekiq or similar), caching, and distributed systems.
- Proven experience improving CI/CD pipelines. We currently use CircleCI but are open to a migration.
- Comfortable designing and improving observability stacks (New Relic, Datadog, Honeycomb, etc.).
- Experience building resilient systems — retries, back-offs, queueing, circuit breakers, graceful degradation, kill switches, isolation of workloads, etc.
- Strong focus on developer ergonomics and reliability culture.
- Bias toward action and delivering tools that improve system behaviour and developer happiness.
- Own and improve our CI/CD pipelines (CircleCI), reducing deploy times and failure rates.
- Build reliable retry/back-off mechanisms for critical user workflows.
- Design and implement observability tooling, including synthetic checks, smoke tests, etc.
- Help architect and implement failover and fallback mechanisms for critical vendors and workflows.
- Work with Support to build debug tooling and dashboards that empower non-engineers.
- Collaborate with engineering to define and template runbooks, kill switches, and disaster mitigation patterns.
- Champion performance tuning.
- Must have worked in a company known for its world-class engineering standards and global scale.
- Must have a strong track record as a member of an engaged team, with practical experience that goes beyond what a leadership book teaches, including some hiccups along the way and the lessons learned from them.
- Must help bring the team up to the expected standards: organise learning sessions and find development opportunities for everyone on the team.
- Must have an appetite for a startup environment and enjoy making decisions and driving change. Have a sound framework for balancing perfectionism against subpar work: what is the minimum work we can do to solve a problem or deliver a feature properly and move on to the next one?
- Deep experience with technical problem-solving and code reviews within a Ruby environment.
- Must have the experience and will to drive projects to successful completion within established timelines and quality standards, and to motivate and guide team members to do the same.