About the Site Reliability Engineering (Consumer) team:
You’ll be joining Grofers as one of the founding members of the Site Reliability Team, which is part of the DevOps Tribe. The Site Reliability Team is responsible for ensuring that the resiliency measures we have implemented work as expected, discovering gaps and working with teams to fix them, and developing processes, tools, automation, and libraries that help ensure the reliability of our products. This team will be primarily aligned with the Consumer Tribe, which builds and maintains the Grofers apps and website that our customers interact with.
Here is a quick peek into some of the work we have been doing in the DevOps Tribe:
- https://lambda.grofers.com/managing-key-values-in-consul-using-consulkv-crd-7e6874c80278
- https://lambda.grofers.com/reducing-data-transfer-costs-with-a-docker-registry-based-cache-8f93d7e561f3
- https://lambda.grofers.com/learnings-from-two-years-of-kubernetes-in-production-b0ec21aa2814
- https://lambda.grofers.com/reducing-aws-data-transfer-cost-kubernetes-from-multi-az-to-single-az-341d890553b6
And a little about what the Consumer Tribe has been up to:
- https://lambda.grofers.com/the-final-call-the-good-and-the-bad-of-react-native-eaea62395319
- https://lambda.grofers.com/transform-your-automation-suite-into-a-testing-product-part-1-15557f02a612
- https://lambda.grofers.com/react-native-the-sinner-and-the-saint-ef8bab16ba85
What you will do:
- Design and implement processes, tools, automation, and libraries that engineering teams (pods) can use to improve the reliability of the services they own. For example, you might add a feature to our circuit breaker library or extend our internal logging library to collect additional context.
- Drive a culture that puts reliability first, and establish processes, policies, and tools that drive reliability within product engineering teams. This includes SLOs, error budgets, on-call response, incident management, observability best practices, and tools and automation that empower developers to think reliability first.
- Work with product engineering teams to ensure that reliability tools and best practices are adopted in every service. Just creating new tools and guidelines is not enough. We want to make sure that services use these tools and guidelines.
- Drive a culture of incident postmortems. Deeply investigate production incidents with product engineering teams. Apply what we learn from postmortems to fix gaps in our code, architecture, processes, and learning.
- Participate in design meetings, interviews, code reviews, and other organizational activities that help us become an elite engineering team.
EXPERTISE AND QUALIFICATIONS
What you need:
- 3+ years of experience developing complex, distributed web applications.
- Hands-on experience operating your services in production and resolving production issues across the stack.
- Experience working with programming languages and runtimes such as Python, Java, Node.js, Golang, etc. We are polyglot and use almost all of those, but we are big users of Python. It is important for us that you have worked as a developer before.
- Disciplined coding practices, experience with code reviews and pull requests, and a creative and conceptual problem-solving approach.
- Experience with database reliability. You must have dealt with common database-related issues. We primarily use Postgres at Grofers. Postgres experience is not a must, but experience with some RDBMS (like MySQL or MariaDB) and at least one common NoSQL datastore or message broker (like Redis, MongoDB, or RabbitMQ) is critical.
- Experience working with microservice architectures in large distributed cloud environments. We are hosted on AWS.
- Strong communication and team collaboration skills, both written and verbal. As a site reliability engineer, you will need to collaborate with multiple product engineering teams to get reliability-related changes done.
Good to have:
- Experience setting up and operating Kubernetes clusters.
- Experience with GraphQL and RPC frameworks (such as gRPC or Thrift). Understanding how services communicate with each other is crucial to finding where a failure can occur.
- Knowledge of networking protocols such as TCP, HTTP/2, WebSockets, DNS, etc.
- Experience with back-end technologies such as Django, Flask, Rails, etc.
- Understanding of compliance frameworks (such as ISO 27001, PCI, SOX, etc.) and cloud-native security.