
Site Reliability Engineer (SRE/DevOps) - Engineering Productivity - Sydney
- Sydney, NSW
- Permanent
- Full-time
- Build, safely and incrementally deploy, and operate critical production systems with a focus on scalability, reliability, observability, performance, and security.
- Monitor, support and enhance developer experience across services.
- Build automation to remove toil and efficiently operate production systems.
- Proactively monitor, respond to, and improve alerts, and set up automated alert handling.
- Create and maintain incident response runbooks.
- Triage platform and infrastructure issues, and help Arista software engineers with their own triage. Engage with third-party vendor support.
- Write postmortem documents and build solutions to prevent incidents from recurring.
- Plan and communicate maintenance windows on production systems.
- Work with Arista’s product development teams to identify infrastructural issues that are causing bottlenecks and limitations in their workflows. Design and implement solutions to resolve them.
- Survey and adopt best practices around infrastructure/platform to maintain secure, scalable and fault-tolerant systems.
- Study the design of OSS systems, and enough of their implementation details, to triage and resolve issues more effectively.
- At least BSc Computer Science or Engineering + 3 years’ experience, MS Computer Science or Engineering + 3 years’ experience, or equivalent work experience.
- Knowledge of one or more of Go, Python, or shell scripting, sufficient to implement medium-complexity automation workflows.
- Knowledge of Linux (or UNIX) from an administration and debugging perspective.
- Hands-on experience operating software systems (infrastructure, complex applications, etc.) at scale.
- Experience in server provisioning (especially from a storage and networking perspective).
- Strong problem-solving and software troubleshooting skills.
- Experience with infrastructure-as-code.
- Experience managing databases such as MariaDB, PostgreSQL, and MongoDB.
- Experience with Docker and virtualization technologies such as KVM, QEMU, and Kata Containers.
- Experience managing a monitoring stack (Prometheus, Loki, Tempo, InfluxDB, Grafana, Thanos, etc.).
- Experience managing Elasticsearch clusters.
- Experience managing Artifactory, Docker registries, etc.
- Experience managing CI/CD systems like ArgoCD and Spinnaker.
- Experience managing version control systems like Perforce and Gerrit.
- Experience with infrastructure-as-code frameworks like Ansible.
- Experience managing large Java applications.
- Experience in storage infrastructure management (e.g., NAS, SAN, Ceph).