Role: AWS Cloud Ops SME
Location: Rockville MD (Remote)
Duration: Full-time (FTE)
Experience: 8-10+ years
Required Technical Skills:
• AWS, Terraform, IaC, Python
• AWS Cloud Infra Management
• Control Tower, AWS Organizations policies and management
• Multi-Account deployment and management
• AWS Backup and SSM patching processes (in-depth knowledge)
• AMI deployments & pushing config to multiple accounts
• AWS EC2, ECS, EKS, RDS, S3, SageMaker, CloudFront, Lambda, etc.
• AWS S3, SFTP and Site externalization methods.
• IaC - Terraform, CloudFormation templates and Python.
• IAM policies, access management and restrictions.
• AWS Networking - VPC, ALB, NLB, Transit Gateway, WAF
• Azure AD SSO and App Proxy.
• CI/CD and basic DevOps
• Linux OS troubleshooting, Bash & Ansible.
• Any Windows AD skills would be an added advantage.
Responsibilities
• Oversee the management and maintenance of cloud infrastructure, ensuring high availability and reliability. Act as the primary point of contact for all cloud infrastructure issues and escalations.
• Ensure cloud resources are optimally configured and managed to meet performance and cost objectives.
• Implement and maintain monitoring solutions to track the health and performance of cloud infrastructure.
• Drive major and potential incidents end to end, providing periodic updates to client stakeholders for approvals and recommendations.
• Ensure due diligence and impact analysis for all changes implemented on the cloud platforms.
• Lead and mentor a team of cloud engineers and administrators, fostering a collaborative and high-performing work environment.
• Provide guidance and support to team members, facilitating their professional development and growth.
• Coordinate and manage the team's daily activities, ensuring alignment with organizational goals and priorities.
• Lead the response to cloud-related incidents, ensuring timely resolution and minimal impact on business operations.
• Develop and implement incident management processes and procedures.
• Perform root cause analysis and implement preventive measures to avoid recurrence of issues.
• Identify opportunities to automate repetitive tasks and processes to improve efficiency and reduce operational overhead.
• Develop and implement automation scripts and tools, leveraging Infrastructure as Code (IaC) practices.
• Continuously evaluate and improve cloud operations processes and procedures.
• Ensure cloud infrastructure adheres to security policies, standards, and best practices.
• Implement and maintain security controls to protect cloud resources and data.
• Ensure compliance with regulatory requirements and industry standards (e.g., GDPR, HIPAA).
• Monitor and analyze cloud resource usage, ensuring efficient utilization and avoiding over-provisioning.
• Conduct capacity planning to support future growth and demand.
• Implement cost management strategies to optimize cloud spending.
• Develop and implement disaster recovery and business continuity plans for cloud infrastructure.
• Ensure regular testing and validation of disaster recovery procedures.
• Ensure cloud infrastructure is resilient and can recover quickly from failures or disruptions.
• Work closely with other IT teams, business units, and stakeholders to understand requirements and deliver cloud solutions that meet their needs.
• Collaborate with vendors and service providers to evaluate and integrate new cloud technologies and services.
• Communicate effectively with stakeholders, providing regular updates on cloud operations and performance.
• Maintain comprehensive documentation of cloud infrastructure, configurations, processes, and procedures.
• Generate regular reports on cloud performance, incidents, and operational metrics.
• Ensure documentation is up to date and accessible to relevant stakeholders.