About the position
CoreWeave is seeking a highly skilled and motivated Infrastructure/Hardware Engineer, focusing on GPU and PCIe troubleshooting, to join our Hardware Engineering team, reporting to the Hardware Engineering Manager. In this role, you will play a crucial part in the design, development, troubleshooting, and optimization of our server hardware infrastructure. You will collaborate closely with cross-functional teams, external vendors, and stakeholders to ensure the successful delivery of highly performant and reliable hardware solutions.
Responsibilities
• Troubleshoot complex GPU and PCIe-related failures.
• Partner with external vendors on failure analysis.
• Track component RMAs.
• Develop and maintain hardware/firmware management services.
• Automate all aspects of the server hardware lifecycle.
• Serve as the senior point of contact for hardware escalation and troubleshooting.
• Collaborate with cross-functional teams to define hardware requirements, specifications, and system architecture, and to build playbooks for identifying and resolving issues.
• Create and maintain accurate documentation of hardware designs, specifications, test procedures, and results.
• Analyze and optimize the performance of hardware systems, identify bottlenecks, and propose improvements for enhanced efficiency.
• Establish processes for internal hardware testing, deployment, performance optimization and troubleshooting.
Requirements
• 5+ years of experience supporting and troubleshooting data center class GPUs (H100 or newer), including InfiniBand and NVLink.
• Proficiency in Ansible and Python, and experience interacting programmatically with server BMCs using IPMI or Redfish (preferably Redfish).
• Experience using, integrating, and automating data center class GPU diagnostics and troubleshooting tools, including observability platforms like Prometheus and Grafana.
• In-depth knowledge of server hardware, components, and management technologies, particularly GPUs and PCIe devices.
• Proven ability to stay current with the latest industry technologies and trends.
• Previous experience collaborating with hardware vendors to identify novel issues, generate operational playbooks, create alerts, and drive issue resolution to completion.
• Strong passion for automation, with a commitment to automating processes comprehensively.
• Excellent documentation skills and attention to detail.
• Strong analytical and problem-solving abilities.
Benefits
• Medical, dental, and vision insurance - 100% paid for by CoreWeave
• Company-paid Life Insurance
• Voluntary supplemental life insurance
• Short and long-term disability insurance
• Flexible Spending Account
• Health Savings Account
• Tuition Reimbursement
• Ability to participate in the Employee Stock Purchase Program (ESPP)
• Mental Wellness Benefits through Spring Health
• Family-Forming support provided by Carrot
• Paid Parental Leave
• Flexible, full-service childcare support with Kinside
• 401(k) with a generous employer match
• Flexible PTO
• Catered lunch each day in our office and data center locations
• A casual work environment
• A work culture focused on innovative disruption
Apply Now