Job Context:
Riseup Labs is looking for an experienced Site Reliability Engineer (SRE). The SRE's primary responsibility is to support applications, systems, and operations: administering, configuring, troubleshooting, and automating cloud hosting; monitoring and improving application performance; and advancing all service line objectives. In this role, the SRE will be responsible for the overall performance of our cloud applications and AWS cloud infrastructure, and will work with the product engineering teams, the cloud architecture and engineering team, and the DevOps and DevSecOps teams.
Job Responsibilities:
- Monitor and enforce service-level agreements (SLAs) and service-level indicators (SLIs).
- Handle and respond to service outages and interruptions. This includes troubleshooting, root cause analysis, and post-mortem reviews to prevent future incidents.
- Monitor infrastructure and application performance to predict future system demands. This includes provisioning additional resources or optimizing the existing setup to handle the load.
- Complete complex development, design, implementation, architecture design specification, and maintenance activities as needed.
- Automate manual operations work, including the deployment of code and configuration changes.
- Set up and maintain monitoring, logging, and alerting systems.
- Monitor and analyze infrastructure costs to suggest ways to optimize and reduce unnecessary expenses.
- Identify and remove bottlenecks in the system to improve performance. This might involve code optimizations, database tuning, or optimizing server configurations.
- Build software and systems to manage platform infrastructure and applications
- Provide operational support and engineering for multiple large distributed software applications
- Fix issues escalated by the development team
- Document technical systems
- Document best practices, runbooks, and procedures for troubleshooting common issues.
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Measure and optimize system performance, pushing our capabilities forward, getting ahead of product team needs, and innovating to continually improve
- Work collaboratively with product & software engineering professionals to define infrastructure and deployment requirements.
- Provision, configure and maintain cloud infrastructure defined as code.
- Ensure that the infrastructure and applications meet security standards and comply with relevant regulations. This might involve regular security audits, patching, and vulnerability assessments.
- Troubleshoot problems across a wide array of services and functional areas.
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Create sustainable systems and services through automation and uplifts
- Strong organizational skills, customer service focus, attention to detail, and process orientation
- Ability to distill and present information to senior leaders
- Flexibly adapt to a changing environment
- Consistent and regular attendance, including on-call availability on a rotational basis, is an essential function of this job
- Perform other related duties as assigned
Educational Requirements:
- B.Sc. in Computer Science and Engineering from any reputable public or private university.
Additional Requirements:
- At least 21 years of age.
- Bachelor’s degree or equivalent in relevant discipline.
- 5 years of experience building and maintaining AWS infrastructure (VPC, EC2, Security Groups, IAM, ECS, CodeDeploy, CloudFront, S3)
- Strong understanding of how to secure AWS environments and meet compliance requirements
- Hands-on experience deploying and managing infrastructure with Terraform
- Experience with Kubernetes, GitHub, Jenkins, ELK and deploying applications on AWS
- Ability to learn/use a wide variety of open-source technologies and tools
- Ability to program (structured and OO) in one or more high-level languages, with a strong preference for Go (Golang)
- Experience with distributed storage technologies such as NFS, HDFS, Ceph, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
- Strong bias for action and ownership
- Strong interpersonal skills, with the ability to communicate effectively with guests and other team members of different backgrounds and levels of experience.
- Must be able to work varied shifts, including nights, weekends and holidays.
- Previous startup experience would be a huge plus
Workplace:
Working Hours:
Salary:
- Negotiable (Based on experience and skills)
Compensation & Other Benefits:
- Annual Performance Evaluation and Increment
- Festival Bonus (2)
- Group Life and Health Insurance
- Fully Subsidized Lunch
- Annual Retreats
- Wedding Bonus (As per company policy)
- Celebration of Events & Occasions
- Team Outing
- Training & Development by Organization Assigned Consultants
- 2 Weekly Holidays (Saturday, Sunday)
- Paid Time Off: 24 days (Casual Leave & Sick Leave)
- Maternity Leave with benefits (As per company policy)
- Paternity Leave
- Public holidays as per Riseup Labs calendar
The Application Process:
- Telephone Round.
- Interview with the Technical Lead & Talent Acquisition Team.
- Job Offer.