Sword Services Greece S.A. is seeking to recruit a high-caliber Site Reliability Engineer. The successful candidate will be responsible for ensuring the reliability, performance, and availability of our critical platforms: Kong (API Management), Solace (Messaging), Mulesoft (iPaaS), and Informatica (ETL). This role requires a deep understanding of distributed systems and cloud technologies, and a passion for building resilient and scalable platforms.
Responsibilities
- Ensure the reliability and availability of the Kong, Solace, Mulesoft, and Informatica platforms, applying SRE principles of automation, monitoring, and continuous improvement.
- Proactively identify and resolve potential issues before they impact production environments, using data-driven insights and predictive analysis.
- Develop and implement comprehensive monitoring and alerting systems to ensure platform health and performance.
- Collaborate with the Support team and conduct thorough post-incident reviews to continuously improve platform reliability.
- Conduct root cause analysis (RCA) for incidents and implement preventative measures, with a focus on automation and systemic solutions.
- Collaborate with development, operations, and security teams to ensure smooth platform operations, promoting a culture of shared responsibility for reliability.
- Take ownership of platform SLAs and SLOs, ensuring they are met or exceeded, and proactively identifying opportunities for improvement.
- Evaluate and implement new tools and technologies to improve platform reliability and efficiency, staying up to date with the latest SRE trends and technologies.
Chaos Engineering & Resilience
- Design, implement, and execute chaos engineering experiments to proactively identify weaknesses and vulnerabilities in the integration platforms.
- Develop and maintain a chaos engineering framework to systematically test the resilience of the platforms under various failure scenarios.
- Analyze the results of chaos experiments and collaborate with engineering teams to implement improvements to platform resilience.
- Participate in the design and implementation of fault-tolerant and self-healing systems.
Disaster Recovery & Business Continuity
- Collaborate with DevOps engineers to develop, maintain, and test disaster recovery plans for the integration platforms.
- Participate in disaster recovery exercises to validate the effectiveness of the plans and identify areas for improvement.
- Ensure that disaster recovery plans are aligned with business continuity requirements.
- Implement and maintain backup and recovery procedures for critical platform components.
Upstream/Downstream Dependency Management
- Analyze the dependencies of the integration platforms on other systems (e.g., API Gateway, backend services) and assess the impact of their reliability on the overall service.
- Implement monitoring and alerting to detect issues in upstream and downstream systems that could affect the integration platforms.
- Collaborate with other teams to improve the reliability and performance of dependent systems.
- Design and implement strategies for handling failures in dependent systems, such as circuit breakers, retries, and fallbacks.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in a similar role focused on platform reliability and operations, preferably in a Site Reliability Engineering (SRE) environment.
- Strong understanding of Kong API Gateway, Solace PubSub+, Mulesoft Anypoint Platform, and Informatica PowerCenter.
- Experience with cloud platforms such as AWS, Azure, or GCP.
- Proficiency in scripting languages such as Python, Bash, or Go.
- Experience with infrastructure-as-code (IaC) tools such as Terraform or Ansible.
- Experience with monitoring and alerting tools such as Datadog.
- Strong understanding of networking concepts and protocols.
- Excellent problem-solving and troubleshooting skills.
- Excellent communication and collaboration skills, with the ability to effectively communicate technical information to both technical and non-technical audiences.
- Strong understanding of Site Reliability Engineering (SRE) principles and practices.
- Experience with containerization technologies such as Docker and Kubernetes.
- Experience with CI/CD pipelines and automation tools.
- Relevant certifications (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer Expert, Google Cloud Professional Cloud Architect).
- Experience with Agile development methodologies.
Applications must be in English.