Site Reliability Engineer (DevOps)

Salary Negotiable

Experience 5-7 Years

Job Type Full time

Vacancies 1

Job Context:

Riseup Labs is looking for an experienced Site Reliability Engineer (SRE). The SRE's primary responsibility is to support applications and systems through operations, administration, configuration, troubleshooting, and automation of cloud hosting, while monitoring and improving application performance and advancing all service line objectives. In this role, the SRE will be responsible for the overall performance of our cloud applications and AWS cloud infrastructure, and will work with the product engineering teams, the cloud architecture and engineering team, and the DevOps and DevSecOps teams.

Job Responsibilities:

Monitor and enforce service-level agreements (SLAs) and service-level indicators (SLIs).
Handle and respond to service outages and interruptions. This includes troubleshooting, root cause analysis, and post-mortem reviews to prevent future incidents.
Monitor infrastructure and application performance to predict future system demands. This includes provisioning additional resources or optimizing the existing setup to handle the load.
Complete complex development, design, implementation, architecture design specification, and maintenance activities as needed.
Automate manual operations work, including the deployment of code and configuration changes.
Set up and maintain monitoring, logging, and alerting systems.
Monitor and analyze infrastructure costs to suggest ways to optimize and reduce unnecessary expenses.
Identify and remove bottlenecks in the system to improve performance. This might involve code optimizations, database tuning, or optimizing server configurations.
Build software and systems to manage platform infrastructure and applications
Provide operational support and engineering for multiple large distributed software applications
Fix issues escalated by the development team
Document technical systems
Document best practices, runbooks, and procedures for troubleshooting common issues.
Improve reliability, quality, and time-to-market of our suite of software solutions
Measure and optimize system performance, pushing our capabilities forward, getting ahead of product team needs, and innovating to continually improve
Work collaboratively with product & software engineering professionals to define infrastructure and deployment requirements.
Provision, configure and maintain cloud infrastructure defined as code.
Ensure that the infrastructure and applications meet security standards and comply with relevant regulations. This might involve regular security audits, patching, and vulnerability assessments.
Troubleshoot problems across a wide array of services and functional areas.
Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
Partner with development teams to improve services through rigorous testing and release procedures
Create sustainable systems and services through automation and uplifts
Strong organizational skills, customer service focus, attention to detail, and process orientation
Ability to distill and present information to senior leaders
Flexibly adapt to a changing environment
Consistent and regular attendance, including on-call availability on a rotational basis, is an essential function of this job
Perform other related duties as assigned

Educational Requirements:

B.Sc. in Computer Science and Engineering from any reputed public or private university.

Additional Requirements:

At least 21 years of age.
Bachelor’s degree or equivalent in relevant discipline.
5 years of experience building and maintaining AWS infrastructure (VPC, EC2, Security Groups, IAM, ECS, CodeDeploy, CloudFront, S3)
Strong understanding of how to secure AWS environments and meet compliance requirements
Hands-on experience deploying and managing infrastructure with Terraform
Experience with Kubernetes, GitHub, Jenkins, ELK and deploying applications on AWS
Ability to learn/use a wide variety of open-source technologies and tools
Ability to program (structured and OO) in one or more high-level languages, with a strong preference for Go
Experience with distributed storage technologies such as NFS, HDFS, Ceph, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
Strong bias for action and ownership
Strong interpersonal skills, with the ability to communicate effectively with guests and other team members of different backgrounds and levels of experience.
Must be able to work varied shifts, including nights, weekends and holidays.
Previous startup experience would be a huge plus

Workplace: