Site Reliability Engineer - CTJ - Top Secret

Microsoft
United States, Washington, Redmond
Oct 31, 2025
Overview
Are you interested in working on cutting-edge cloud security products? Would you like to be part of one of the world's most advanced cyber-security solutions and protect millions of computers from thousands of active attack attempts every month? Look no further than the Microsoft Defender engineering team. You will build and deliver cloud solutions at a scale that few companies in the industry are required to support. Leveraging state-of-the-art technologies, you will be instrumental in delivering holistic protection within government environments.

The Microsoft Defender team is responsible for delivering a constantly evolving set of services and solutions to meet the challenging landscape of our ever-evolving attackers. The team provides on-call operational support for the Microsoft Defender products within US Government clouds and improves their operational posture. You will operate our production services and work closely with other engineering teams to ensure that services and systems are highly stable, meet performance SLAs, and meet the expectations of internal and external customers and users.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities
24x7 On-Call Rotation: Participate in a regular on-call schedule to monitor service health, respond to incidents, and escalate complex issues as needed.
Support Development & Design: Make basic code changes to improve reliability, security, and observability, and engage in design/code reviews with guidance from senior engineers.
Incident Response & Postmortems: Troubleshoot issues during on-call, mitigate impact, and contribute to postmortem documentation and review meetings.
Operational Improvements: Use existing tools to identify and suggest fixes for recurring issues affecting performance, reliability, or efficiency.
Safe Deployment & Configuration: Apply safe deployment practices and automation to manage configuration and data changes across product components.
Performance Analysis: Assist in evaluating service performance and identifying areas for deeper analysis under the guidance of experienced engineers.
Technical Growth: Build foundational knowledge in distributed systems, cloud infrastructure, and product operations to contribute to incremental improvements.