The CXone Expert product is a multi-tenant SaaS platform, designed to handle millions of requests with high performance and reliability. Each Expert site can easily host a complex hierarchy of tens of thousands of pages (articles), with layers of fine-grained permissioning, server- and client-side customizations and branding, and other complex business logic. Our enterprise customers have a global presence, and they expect their content to be delivered with low latency worldwide and with near-zero downtime.
CXone Expert is an agile engineering organization, and QA is fully automated. We release new versions of our platform every week through our CI/CD pipeline. Our application infrastructure runs on AWS and is almost entirely containerized and orchestrated by Kubernetes.
We are looking for a Principal DevOps Engineer to round out our Site Reliability / DevOps team. You will be the go-to person for research and development of architectural changes from the infrastructure up. Another important part of this role is helping other engineers on the team design and implement software that scales well and is highly reliable. This is a hands-on role: you will get your hands dirty and refactor existing system and application code yourself.
Responsibilities
Analyze system reliability and performance to address and prevent issues.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Scale systems sustainably through mechanisms like automation and evolve systems by making code and configuration changes that improve reliability and velocity.
Participate in the on-call rotation for service disruptions.
Identify and diagnose infrastructure issues in a live production environment.
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Practice sustainable incident response, blameless postmortems, and root cause analysis.
Define and develop continuous integration and deployment pipelines.
Build Infrastructure as Code.
Coordinate build and release activities with other stakeholders.
Train and mentor other DevOps engineers.
Work with teams to develop code quality metrics and meters.
Identify, research, and prototype new technologies to improve DevOps processes.
Troubleshoot and respond to downtime, performance degradation, and outside attacks.
Prepare documentation and diagrams for informational and compliance purposes.
Requirements
BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience.
8+ years of experience designing, analyzing, and troubleshooting large-scale distributed systems.
Sustained track record of creating major improvements in large business-critical systems around stability, security, performance, and scalability.
Experience in one or more of the following: Java, Python, C#, or JavaScript.
Excellent communication, analytical, and troubleshooting skills.
Ability to work independently, as well as part of a team, on multiple competing projects.
Ability to debug, profile, and optimize code and automate routine tasks.
Ability to facilitate cross-team work effectively and to exert influence well beyond your own group.
Strong sense of ownership.
Lifelong learner able to quickly pick up new frameworks, architectures, and languages.
Desired Skills
Experience running production systems on AWS
A deep understanding of REST and network programming
Experience scaling high-traffic SaaS applications
Deep knowledge of Kubernetes
Experience with application monitoring and metrics tools (AWS X-Ray, CloudWatch, Datadog, etc.)