As a Staff Site Reliability Engineer at Prodigy Education, I am responsible not only for the day-to-day monitoring and support of Prodigy's architecture on AWS, but also for advising leadership and engineering on the future direction of our infrastructure.
This means that I often investigate and test new tools and services for our infrastructure, write documentation for them, and assist with their implementation.
Prodigy regularly handles up to 60,000 simultaneous students, with peaks exceeding 150,000 during high-traffic periods. On a typical day, our backend processes around 2 million requests per minute. To support this scale, our Kubernetes architecture dynamically scales to hundreds of nodes and thousands of pods across multiple clusters.
I also participate in on-call rotations, managing incidents as they arise. This includes triaging issues in real time, coordinating with engineering teams, and bringing in key stakeholders to help resolve incidents quickly and minimize impact to our users.
Some examples of initiatives I have led or contributed to are:
- Improving Istio service mesh configuration through configuration scoping, and planning a migration to Gateway API
- Rolling out enhanced AWS and Kubernetes networking to enable higher scale
- Leading initiatives to move off legacy tools like Jenkins and JFrog Artifactory to more scalable, modern alternatives
- Replacing Kubernetes External Secrets with External Secrets Operator