Site Reliability Engineer

OKX, Hong Kong

9 days ago

Job description

The Service Reliability Engineering team aims to make service stability one of the company's core competitive advantages. By building end-to-end, chain-level risk management capabilities, we aim to achieve sustainable, automated identification and analysis of stability risks, transitioning from "reactive governance" to "proactive governance". This approach allows us to address more stability issues preemptively, improving the user experience.

What You'll Be Doing:

Ensure stability and optimize big data platforms (Alibaba Cloud DataWorks, AWS EMR, AWS Databricks, Spark, Flink) and data warehouses (MaxCompute, Hologres, Hive, ClickHouse, StarRocks, etc.).
Deeply understand the architecture and principles of middleware (Kafka, Spring Cloud, Nacos, Apollo, Kong Gateway, etc.), ensuring high performance and availability.
Effectively optimize existing runtime environments (KVM, Docker, K8S, JVM, etc.) to ensure efficient resource utilization and stable service operation.
Lead chaos engineering exercises, coordinating with business units to validate system robustness and recovery capabilities through simulated failure scenarios.
Participate in rapid response and troubleshooting of system failures, continuously optimize monitoring strategies to reduce system downtime and ensure service continuity and stability.
Drive infrastructure automation and intelligence to improve SRE work efficiency and quality.
Collaborate closely with development teams, providing technical support and advice on infrastructure to jointly promote continuous product improvement and innovation.

What We Look For In You:

Bachelor's degree or above in Computer Science or a related field, with 8+ years of experience in large-scale internet or cloud computing platform development / SRE / operations.

In-depth understanding of big data platforms, data warehouses, middleware, runtime environments, and network technology principles and architectures, with rich practical experience and troubleshooting skills.

Proficient in Linux system management and optimization, familiar with scripting languages such as Shell / Python, able to write automation tools and scripts.

Familiar with container and cloud-native technologies like KVM, Docker, and K8S, including their architectures and principles, with extensive experience in handling common issues and failures.

Familiar with network protocols such as TCP / UDP / QUIC, proficient in using network commands like TcpDump, TraceRoute, Netstat, and tools like Wireshark, with rich practical experience in troubleshooting common network issues.

Rich experience with Alibaba Cloud and AWS cloud products, from architecture to usage, with extensive practice in dealing with common issues and failures.

Practitioners with experience in service governance system construction, architecture optimization, stability assurance construction, capacity management, activity support, and chaos engineering are preferred.

Strong sense of responsibility and team spirit, with excellent problem-solving and analytical skills.

Must have Chinese communication skills; proficiency in both Chinese and English communication is preferred.

Perks & Benefits

L&D programs and Education subsidy for employees' growth and development

Various team building programs and company events

Wellness and meal allowances

More that we'd love to tell you about along the way!
