Senior AI Infrastructure Services Software Engineer at Nvidia Corporation

This job listing has expired and the position may no longer be open for hire.

Posted in General Business 30+ days ago.

Type: Full-Time
Location: Santa Clara, California

Job Description:

We are seeking a highly skilled Senior Infrastructure System Software Engineer with Kubernetes-based infrastructure experience to join our Omniverse Infrastructure team! The ideal candidate will have a solid understanding of system software design principles and experience in deploying, managing, optimizing, and scaling NVIDIA Omniverse™ Cloud, a platform-as-a-service (PaaS) that provides developers and enterprises with a full-stack cloud environment to design, develop, deploy, and manage industrial Omniverse applications and workflows.

The candidate will work closely with cross-functional teams to design and develop common system software blocks within Kubernetes clusters (e.g., Custom Resource Definitions, Operators, and system plug-ins) to meet the highly challenging and multi-faceted requirements of NVIDIA Omniverse™ Cloud. These requirements include, but are not limited to, elasticity, multi-tenancy, high availability, fault tolerance, debuggability, operational efficiency, and sustainability of the cluster-level services needed to onboard and optimize Omniverse applications and workflows at large scale. A key feature of these workflows is the ability to compose one or more high-performance simulation/AI tasks, streaming Kit-based applications of various types, and elastic microservices through Cloud APIs.

What you will be doing:

Design and develop low-level system software solutions within Kubernetes to manage and schedule OVX cluster resources in order to power NVIDIA Omniverse™ Cloud (OVC).

Design and develop cluster-level system software solutions to map a wide range of Omniverse workloads to high-performance interactive tasks (Kit-based applications), elastic microservices, and simulation/AI tasks.

Collaborate with multiple Omniverse product teams to understand customer storage and compute requirements and build supporting infrastructure.

Work across organizational boundaries with diverse hardware and software engineers.

Proactively identify and address system software challenges in compute, networking, and storage resource utilization that affect OVC's availability, multi-tenancy, fault tolerance, debuggability, operational efficiency, and sustainability.

What we need to see:

6+ years of hands-on system software engineering experience extending cluster-level services for large-scale Kubernetes

4+ years of experience building large-scale, fault-tolerant distributed services

Experience with cloud infrastructure platforms like AWS, Azure, and Google Cloud

Strong systems programming skills, including optimizations using multi-threading, asynchronous programming, concurrency and parallelism, caching, and batching

Proficiency in Python, C/C++, and Golang

Working knowledge of elasticity techniques within Kubernetes

Deep understanding of cloud technologies, distributed compute systems, and microservices architecture

Master's or PhD in Computer Science or a related field (or equivalent experience)

Excellent interpersonal skills and ability to work successfully with multi-functional teams, principals, and architects across organizational boundaries and geographies

Ways to stand out from the crowd:

Expert knowledge of virtualization and containerization technologies such as Docker, VMware, and KVM

Strong knowledge of elasticity techniques within Kubernetes

Experience co-designing high-performance application workflows with the underlying cluster-level software, such as Slurm and/or Kubernetes

The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
