Site Reliability Engineer, Sword Services Greece S.A., GR

Job Description

Sword Services Greece S.A. is seeking to recruit a high-caliber Site Reliability Engineer. The successful candidate will be responsible for ensuring the reliability, performance, and availability of our critical platforms: Kong (API Management), Solace (Messaging), Mulesoft (iPaaS), and Informatica (ETL). This role requires a deep understanding of distributed systems, cloud technologies, and a passion for building resilient and scalable platforms.




Responsibilities


  • Ensure the reliability and availability of the Kong, Solace, Mulesoft, and Informatica platforms, applying SRE principles of automation, monitoring, and continuous improvement.
  • Proactively identify and resolve potential issues before they impact production environments, using data-driven insights and predictive analysis.
  • Develop and implement comprehensive monitoring and alerting systems to ensure platform health and performance.
  • Collaborate with the Support team on thorough post-incident reviews to continuously improve the reliability of the platform.
  • Conduct root cause analysis (RCA) for incidents and implement preventative measures, with a focus on automation and systemic solutions.
  • Collaborate with development, operations, and security teams to ensure smooth platform operations, promoting a culture of shared responsibility for reliability.
  • Take ownership of platform SLAs and SLOs, ensuring they are met or exceeded, and proactively identifying opportunities for improvement.
  • Evaluate and implement new tools and technologies to improve platform reliability and efficiency, staying up to date with the latest SRE trends and technologies.

Chaos Engineering & Resilience


  • Design, implement, and execute chaos engineering experiments to proactively identify weaknesses and vulnerabilities in the integration platforms.
  • Develop and maintain a chaos engineering framework to systematically test the resilience of the platforms under various failure scenarios.
  • Analyze the results of chaos experiments and collaborate with engineering teams to implement improvements to platform resilience.
  • Participate in the design and implementation of fault-tolerant and self-healing systems.

Disaster Recovery & Business Continuity


  • Collaborate with DevOps engineers to develop, maintain, and test disaster recovery plans for the integration platforms.
  • Participate in disaster recovery exercises to validate the effectiveness of the plans and identify areas for improvement.
  • Ensure that disaster recovery plans are aligned with business continuity requirements.
  • Implement and maintain backup and recovery procedures for critical platform components.

Upstream/Downstream Dependency Management


  • Analyze the dependencies of the integration platforms on other systems (e.g., API Gateway, backend services) and assess the impact of their reliability on the overall service.
  • Implement monitoring and alerting to detect issues in upstream and downstream systems that could affect the integration platforms.
  • Collaborate with other teams to improve the reliability and performance of dependent systems.
  • Design and implement strategies for handling failures in dependent systems, such as circuit breakers, retries, and fallbacks.

Qualifications


  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in a similar role, with a focus on platform reliability and operations, preferably with experience in a Site Reliability Engineering (SRE) environment.
  • Strong understanding of Kong API Gateway, Solace PubSub+, Mulesoft Anypoint Platform, and Informatica PowerCenter.
  • Experience with cloud platforms such as AWS, Azure, or GCP.
  • Proficiency in scripting languages such as Python, Bash, or Go.
  • Experience with infrastructure-as-code (IaC) tools such as Terraform or Ansible.
  • Experience with monitoring and alerting tools such as Datadog.
  • Strong understanding of networking concepts and protocols.
  • Excellent problem-solving and troubleshooting skills.
  • Excellent communication and collaboration skills, with the ability to effectively communicate technical information to both technical and non-technical audiences.
  • Strong understanding of Site Reliability Engineering (SRE) principles and practices.
  • Experience with containerization technologies such as Docker and Kubernetes.
  • Experience with CI/CD pipelines and automation tools.
  • Relevant certifications (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer Expert, Google Cloud Professional Cloud Architect).
  • Experience with Agile development methodologies.

Applications must be in English.
