North America Distributed Operations: Site Reliability Engineer
Position Summary:
The Site Reliability Engineer will help transform operations from a traditional, predominantly reactive team into a predominantly proactive team that applies and maintains best practices, utilizes standard tools, and automates, with the objective of monitoring services, reacting to problems, and proactively addressing issues before they affect performance or availability. In addition, this position will drive system and business service stability, availability, and resilience by working closely with counterparts in Infrastructure, Application Support, and Enterprise Engineering, engaging as needed to repair any service-impacting issues.
Primary Job Responsibilities:
Analyze and resolve complex issues related to infrastructure engineering and operations: system software, storage management, backup methodology, virtualization, monitoring and management tools, business continuity and high availability solutions.
Work closely with network, security, development, application and support teams in the implementation of infrastructure components that support emerging technologies and applications.
Automate operational, monitoring, and integrity verification processes (e.g., runbooks) for hardware, server, and system resources and processes.
Proactively ensure the highest levels of systems and infrastructure availability. Perform daily system monitoring, verifying the integrity and availability of systems and key processes and reviewing system and application logs.
Create and maintain system documentation for infrastructure (systems, storage, virtualization) technologies, including installation, configuration, and appropriate troubleshooting steps.
Collaborate with other technology leads and support teams to ensure integrated end-to-end availability, reliability, and performance.
Improve existing processes by automating solutions to recurring problems and enhancing existing solutions and documentation.
Provide on-call and after-hours support to address incidents, maintain infrastructure, and support operational efforts.
Provide training and mentorship to junior team members. Train team members in best practices and act as subject matter expert and escalation contact for infrastructure-related issues.
Provide call leadership to mitigate critical incidents.
Identify and drive resolution on monitoring and alerting gaps.
Work across multiple projects, providing best-practice advice and contributing to technical tasks.
Solve problems relating to mission-critical services and build automation to prevent problem recurrence.
Provide guidance and engineering solutions to fulfill business requirements using sound and proven industry best practices in accordance with architectural standards and engineering methodology.
Evangelize and influence resiliency, stability, and scalability through best practices, elimination of bottlenecks, and process improvement.
Knowledge, Skills and Competencies:
10+ years of senior-level support and engineering experience in hardware, operating systems, storage, and virtualization technologies in a global, multi-data-center enterprise organization.
Advanced knowledge and experience with multiple server operating systems (Windows Server 2012, 2016, 2019, Red Hat Enterprise Linux 7.x).
Advanced knowledge and experience with virtualization and cloud platforms (VMware vSphere 6.x, 7.x, Azure).
Working knowledge and experience in various storage systems, related technologies, and protocols (Dell EMC PowerMax, Unity, IBM XIV, Dell EMC Isilon NAS).
Skilled in server hardware architecture, configuration, and troubleshooting on Dell EMC PowerEdge and Cisco UCS platforms (VxBlock converged infrastructure a plus).
Advanced knowledge of PowerShell scripting and other commonly used languages (Bash, Python, JavaScript, etc.).
Excellent problem solving and analytical thinking skills; ability to influence change and drive results.
Ability to work within defined change control processes and procedures.
Ability to manage multiple projects in a dynamic development environment; demonstrated project delivery required.
Strong ability to identify, understand and communicate business needs and application architectures for technical projects.
Excellent communication and collaboration skills; ability to effectively communicate across all levels is required.
Working knowledge of networking principles including routing, switching, firewalls, load balancing and VLANs.
Working knowledge of common information security concepts and practices.
Working knowledge of containerization (VMware Tanzu, Docker, Kubernetes) is preferred.
Working knowledge and experience with configuration management tools (Ansible, etc.).
Experience with infrastructure as code (vRealize Automation, Terraform, Git, etc.).
Experience with centralized logging solutions (Splunk, ELK, etc.).
Bachelor’s Degree in Computer Science, Information Technology, or another related discipline.