Revolutionizing Data Integration: ZhongAn Insurance’s Journey with Apache SeaTunnel

Apache SeaTunnel
Mar 4, 2024

Written by | Li Zeng, Senior Expert in Big Data Development at ZhongAn Insurance

Edited by | Hui Zeng

Foreword

ZhongAn Insurance began the pre-research work on data integration services in April 2023, aiming to solve two major pain points in the current data synchronization scenario: weak service capabilities and the lack of distributed synchronization capabilities. After researching and performance testing various open-source data synchronization middleware, we finally chose Apache SeaTunnel and its new Zeta engine for service packaging.

In October 2023, we started secondary development based on version 2.3.3, mainly improving the service interfaces and adapting connector features. Around New Year’s Day 2024, we completed development of the data integration service and began replacing existing DataX jobs, starting with batch synchronization from MaxCompute to ClickHouse. Hundreds of jobs have been switched so far; they run smoothly and deliver the expected performance improvement.

We will continue to collect feedback, optimize and improve the service in practical applications, and submit iteration and optimization suggestions to the community.

The Pain Points of Data Integration

ZhongAn Insurance has used DataX as its data integration and synchronization tool since around 2015, moving from Taobao’s internal version 2.0 to the later community version 3.0; its stability and efficiency have been well proven.

However, over time our daily synchronization job count has grown from a few thousand to 34,000. Facing 20 TB of daily data warehousing volume and 15 TB of daily data outflow, plus incremental synchronization scenarios of up to 4 billion records a day under streaming-media interaction, DataX has shown its limitations.

DataX is a classic single-machine, multi-threaded data integration tool. Its design philosophy of declarative job configuration, configurable concurrency, plug-and-play plugins, and in-memory data transfer is excellent and paved the way for many later integration middleware designs. However, its lack of service and distributed processing capabilities limits its use in large-scale data synchronization scenarios.

Reduce Coupling: Because DataX offers limited service capabilities, it has become tightly coupled with our internal development and scheduling platforms, and the CPU consumption of DataX jobs seriously affects the performance of those services.

Capability Expansion: Facing trends such as compute-storage separation and cloud-native deployment, we need a tool that provides service capabilities, supports different integration middleware, and adapts to rapid configuration replacement.

Resource Isolation and Elastic Expansion: We want data synchronization resources to be controlled and managed more elastically. Our 34,000 DataX tasks are deployed on six ECS instances with 16 cores and 64 GB of memory each, isolated by department and subsidiary into three logical clusters. Resource usage across them is uneven, and loads can become extremely high during the nighttime peak, which intensifies the need for elastic, controllable resource usage.

Apache SeaTunnel was chosen against this backdrop, as it not only helps us address the current data integration challenges but also provides us with a smooth migration path to ensure the efficient and stable execution of data synchronization tasks. Moreover, Apache SeaTunnel’s capabilities in CDC real-time synchronization and reducing data synchronization backflow time are also important considerations for our choice.

Why Choose Zeta

Easy to Use

  • Multiple Deployment Modes: Supports single-process and cluster modes, as well as containerized deployment on Kubernetes/Docker;
  • Rich Connectors: The community provides dozens of connector types with relatively complete functionality; after several iterations, they cover the main functions of DataX;
  • Transformers: Provides DAG-level transformers, a significant improvement over DataX’s row-level transformers;
  • Service Capabilities: Provides system RestApi, client proxy, and other modes of service access;
  • Support Scenarios: Offline/real-time synchronization, entire library synchronization, etc.;
  • Fewer Dependencies: Zeta standalone mode can achieve distributed data synchronization without relying on third-party components;

Scalability

  • Connectors: Plug-and-play design, easily supporting more data sources and expanding modes as needed;
  • Multiple Engines: Supports the Zeta, Flink, and Spark engines through a unified translation layer, making it easy to interface with and extend. ZhongAn’s current infrastructure is based mainly on MaxCompute and we do not have a Hadoop-style big data cluster, so Zeta’s distributed capability solves this problem well. It also means that a future big data platform migration (to another cloud EMR or a self-built cluster) can be a seamless job transition.
  • Zeta Multiple Resource Managers: Currently only Standalone is supported; the community plans to support YARN/Kubernetes modes in the future;

Efficient and Stable

  • Faster: Under the same resource configuration, it delivers a 15%–30% performance improvement over DataX;
  • Stable: Provides back pressure, failover, exactly-once semantics, etc.;
  • Optimization: Zeta’s optimizations for specific scenarios, such as the MaxCompute source connector and the ClickHouse sink connector, have significantly improved efficiency and stability.

Community Activity

The Apache SeaTunnel community is very active. As a community driven largely by Chinese developers, communication and cooperation with other community members are very smooth; contributors such as Teacher Gao and Teacher Hailin provide timely guidance and answers to our questions, which helps us tremendously. The community also holds regular weekly meetings, providing a platform for everyone to discuss design approaches and share solutions to problems.

Unified Data Integration Service

We have built a unified data service platform that simplifies data source management and data integration configuration, and supports the entire process from development to testing to release. By managing data sources and integration configurations in the IDE and then dispatching jobs to execution nodes each night through the scheduling system, we have further improved the automation and efficiency of data processing.

Although this approach is effective, we realize there is still room for improvement in service capabilities, especially given the high CPU consumption and the need for monitoring and job management under high load.

In summary, Apache SeaTunnel and Zeta offer a promising solution to the challenges we face in data integration and synchronization. Their capabilities not only meet our current needs but also provide a pathway for future development and scalability. As we continue to deploy and optimize these tools, we look forward to further enhancing our data processing capabilities and efficiency.

DataX Job Migration

We have also focused on migrating from DataX to SeaTunnel.

Plugin Compatibility

This includes comparing the connectors provided by the community with the plugins we use internally to ensure compatibility. Special attention was given to the most common data backflow scenario: tasks that push data from MaxCompute (MC) to ClickHouse (CK). We have about 34,000 tasks in total, of which approximately 14,000 push the underlying metadata of self-service analytical reports to CK on a daily basis. For these scenarios, we developed specific compatibility adaptations.

Job Switching Interface

To support smooth migration and development of jobs, we implemented a job development switching interface. This allows us to migrate jobs flexibly based on job IDs and on how far the relevant connectors have been adapted. After migration, new tasks are registered in the integration service and saved in a common configuration format, making it convenient to operate them through script mode or guided configuration on the management service side.

Configuration Abstraction

We have established a set of internal common configuration standards aimed at compatibility between Apache SeaTunnel and DataX job configurations. This approach not only simplifies the process of replacing data sources in multiple environments but also enhances security by avoiding the exposure of sensitive information such as usernames and passwords in the configuration.

We perform job configuration translation before job execution. This design is inspired by SeaTunnel’s translation layer and includes the replacement of local variables and data source parameters as well as configuration translation for different engines. This dual-layer translation mechanism — one layer converting a specific middleware plugin configuration into the common configuration (pre-transform), the other converting the common configuration into a specific engine configuration (normal transform) — greatly enhances the flexibility and compatibility of job configurations. The common layer is necessary because it allows flexible translation and conversion between different data integration tools, enabling data service execution to be downgraded across multiple engines.
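
As a minimal sketch of this dual-layer idea (the key names, placeholder syntax, and helper class below are hypothetical, not the actual platform code), a DataX-style reader configuration is first normalized into a common configuration, which is then rendered into an engine-specific source configuration:

```java
import java.util.List;
import java.util.Map;

// Sketch of the two-layer translation. The common-config keys, placeholder syntax, and
// DataX parameter names are hypothetical; they only illustrate the
// "plugin config -> common config -> engine config" flow.
public class ConfigTranslator {

    // Layer 1 (pre-transform): middleware-specific plugin config -> common config.
    public static Map<String, Object> toCommonConfig(Map<String, Object> dataxReaderParam) {
        return Map.of(
                "type", "jdbc",
                "url", dataxReaderParam.get("jdbcUrl"),
                "username", "${ds.username}",   // resolved later from the data source registry
                "password", "${ds.password}",   // sensitive values never live in the job config
                "columns", dataxReaderParam.getOrDefault("column", List.of("*")));
    }

    // Layer 2 (normal transform): common config -> engine-specific (e.g. Zeta) source config.
    @SuppressWarnings("unchecked")
    public static Map<String, Object> toZetaSourceConfig(Map<String, Object> common, String table) {
        List<String> columns = (List<String>) common.get("columns");
        return Map.of(
                "plugin_name", "Jdbc",
                "url", common.get("url"),
                "user", common.get("username"),
                "password", common.get("password"),
                "query", "SELECT " + String.join(",", columns) + " FROM " + table);
    }
}
```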

Zeta Cluster Resource Management

Problem: Zeta’s slot resource management currently provides only logical isolation. In dynamic slot mode, a large number of threads are created and compete for resources, which can slow down concurrent jobs overall or even cause a cluster OOM. This mode is better suited to CDC real-time synchronization scenarios with many batches and small data volumes per shard.

Solutions

  • Use Static Slot Mode

For offline batch processing tasks, this mode is more appropriate: it controls resource consumption to some extent and prevents out-of-memory (OOM) issues caused by caching large amounts of data. Size the slots based on the cluster’s CPU/memory capacity, oversell CPU moderately, and configure an appropriate number of resource slots to ensure efficient data processing jobs and effective use of cluster resources; for example, a 16-core node might be configured with roughly 32 static slots under a 2x CPU oversell.

  • New Cluster Slot Service RestApi

By extending Zeta’s Hazelcast-based SlotService and ResourceManager, we store the cluster’s total and allocated slots, improve how slot resources are handled during cluster startup, node online/offline, job submission, and job release, and expose a REST API for querying them.

  • Job Slot Calculation

Initially, we tried to evaluate job concurrency from the physical execution plan, but later versions required us to calculate slot resources from the job configuration instead. Under a consistent degree of parallelism, the job’s resource occupation is calculated with a fixed formula.
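
As an illustrative sketch only (it assumes each source and sink task occupies one slot per degree of parallelism, which may not match the exact internal formula), the estimate and the pre-submission check described in the following items might look like this:

```java
// Rough slot estimate for a simple source -> sink job, used only for pre-submission sizing.
// The assumption that every source and sink task occupies one slot per degree of
// parallelism is an illustration, not the exact Zeta formula.
public final class SlotEstimator {

    public static int estimateSlots(int parallelism, int sourceCount, int sinkCount) {
        return parallelism * (sourceCount + sinkCount);
    }

    public static void main(String[] args) {
        // e.g. one MaxCompute source and one ClickHouse sink with parallelism 4
        int needed = estimateSlots(4, 1, 1);                 // -> 8 slots
        int available = 32 /* total static slots */ - 20 /* already assigned */;
        System.out.println("needed=" + needed + ", canSubmit=" + (needed <= available));
    }
}
```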

This method can be applied to most end-to-end data synchronization scenarios, but for more complex job configurations it may not be flexible enough. We also look forward to the community providing an internal API, similar to SQL EXPLAIN, for resource calculation.

  • Job Control

Before job submission, calculate the slot resources the job will consume based on its configuration, then check whether the cluster’s total and available slot resources can satisfy that consumption. If they can, submit the job through the REST API.

Zeta RestAPI Integration Issues

Problem

After the cluster’s HTTP service address was mounted behind an Alibaba Cloud SLB, a large number of connections were found to be closed remotely. Reason: with health checks enabled, the SLB sends a SYN packet, the backend responds with SYN+ACK, and the SLB then resets the connection. Solution: after trying Hazelcast networking settings and SLB configuration changes without success, we added random routing on the service side: before each HTTP request, a node address is chosen from the cluster configuration and the request is sent to it directly.
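
A minimal sketch of this workaround (the helper class and node addresses are hypothetical examples): the service picks one of the known cluster node addresses at random and sends the REST request to it directly instead of going through the SLB.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of client-side random routing: bypass the SLB for Zeta REST calls and
// talk to a randomly chosen cluster node taken from the cluster configuration.
public class RandomRouteClient {

    private static final List<String> NODES =
            List.of("http://10.0.0.1:5801", "http://10.0.0.2:5801", "http://10.0.0.3:5801");

    public static String get(String path) throws Exception {
        String base = NODES.get(ThreadLocalRandom.current().nextInt(NODES.size()));
        HttpRequest request = HttpRequest.newBuilder(URI.create(base + path)).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```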

Problem

Non-master nodes cannot handle job submission, termination, cluster slot queries, and so on. Reason: in version 2.3.3, the HazelcastInstance on a non-master node cannot obtain the master node’s services; Hazelcast.getAllHazelcastInstances() only returns the instances created in the current JVM, so without additional code, jobs cannot be submitted across nodes.

Solution: The general idea is to bring the statistics to the master by simulating SlotService, using Hazelcast’s Operation mechanism (referencing the HeartbeatHealthOperation mechanism) and reusing the existing GetMetricsOperation to retrieve data from the master node.
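
A simplified sketch of the forwarding idea (the helper class is hypothetical, and exact Hazelcast SPI signatures vary between versions): when a request lands on a non-master node, the corresponding operation is invoked on the master member and the result is returned to the caller.

```java
import com.hazelcast.cluster.Address;
import com.hazelcast.spi.impl.NodeEngine;
import com.hazelcast.spi.impl.operationservice.Operation;

// Sketch: forward a query (e.g. a slot or metrics operation) to the master member
// through Hazelcast's Operation mechanism and wait for the result.
// Exact SPI signatures vary by Hazelcast version; this only shows the pattern.
public class MasterQueryHelper {

    public static <T> T askMaster(NodeEngine nodeEngine, Operation op) throws Exception {
        Address master = nodeEngine.getClusterService().getMasterAddress();
        return nodeEngine.getOperationService()
                .<T>invokeOnTarget(op.getServiceName(), op, master)
                .get();
    }
}
```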

We later contributed this approach to the community, and community members have since improved the interfaces for job submission, termination, and so on.

Connector Support for Pre/Post SQL

In the practice of Apache SeaTunnel, especially when dealing with ClickHouse (CK) report data, the connector’s Pre and Post SQL functionalities have shown high adaptability to complex data processing scenarios. These features allow the execution of specific SQL statements before and after data synchronization tasks, providing greater flexibility and precise control over data handling.

Use Cases

The main application scenarios include preparatory work before data synchronization and cleanup or reorganization work after synchronization. For example, before pushing data to a CK report, instead of directly overwriting or deleting the current table, data might first be written into a temporary table. After the data writing is completed, Post SQL statements can be executed to rename the local table and mount it to the partition table, effectively avoiding data loss or inconsistency issues during the data synchronization process.
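
As an illustration of this pattern (the table names are hypothetical, and the exact SQL depends on the ClickHouse table engine and how the distributed table is organized), the pre/post statements might look like the following, shown here simply as Java string lists:

```java
import java.util.List;

// Illustration of the temporary-table pattern described above.
// Table names are hypothetical; the exact statements depend on the ClickHouse
// table engine and on how the serving table is partitioned or mounted.
public class CkReportBackflowSql {

    // Executed before the sync: prepare an empty temporary local table with the same schema.
    static final List<String> PRE_SQL = List.of(
            "DROP TABLE IF EXISTS report_tmp_local",
            "CREATE TABLE report_tmp_local AS report_local");

    // Executed after all data is written: swap the freshly written table into place.
    static final List<String> POST_SQL = List.of(
            "RENAME TABLE report_local TO report_old_local, report_tmp_local TO report_local",
            "DROP TABLE IF EXISTS report_old_local");
}
```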

PreSql Practice

Problem: Early versions did not support it; it could only be implemented through the prepare method of XxxSink, and that interface was later removed;

Solution: Apache SeaTunnel community version 2.3.4 introduced the combination of schema save mode and data save mode, which supports executing SQL statements (Pre SQL) before data synchronization. This greatly enhanced the flexibility and usability of Apache SeaTunnel in data synchronization scenarios. We implemented Pre SQL execution in the CUSTOM_PROCESSING data save mode and extended it to support executing multiple SQL statements;
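
A minimal sketch of the multi-statement extension (the helper below is hypothetical and sits outside the actual connector code): the configured custom SQL is split into individual statements and executed one by one before the sink starts writing.

```java
import java.sql.Connection;
import java.sql.Statement;

// Sketch: execute several semicolon-separated Pre SQL statements in CUSTOM_PROCESSING mode.
public class PreSqlExecutor {

    public static void executeAll(Connection connection, String customSql) throws Exception {
        try (Statement statement = connection.createStatement()) {
            for (String sql : customSql.split(";")) {
                if (!sql.isBlank()) {
                    statement.execute(sql.trim()); // run each statement before the sink starts writing
                }
            }
        }
    }
}
```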

PostSql Practice

Problem: Implementing in the close method of XxxSink or XxxSinkWriter can lead to concurrency conflict issues;

Solution: Supporting Post SQL, especially ensuring data integrity and consistency in a multi-threaded environment, poses a more complex challenge. Executing Post SQL statements in the close method of two-phase commit provides a feasible solution. This method initially realized the capability to perform necessary post-processing operations after completing data synchronization tasks.

One challenge we encountered was handling the failure of Post SQL execution. This issue was discovered in the release test on January 4th, where the test team carefully examined the system’s behavior when Post SQL execution failed.

It was found that after a failed execution, the Subplan retry (restore processing) mechanism led to problems in job status management, preventing jobs from terminating normally. As a temporary solution, the pipeline’s maximum restore number in Subplan was set to 0 (the default is 3), meaning that in offline batch processing scenarios the system reports an error and terminates the job immediately upon encountering an error.

Although this measure temporarily solves the problem, further collaboration with the community is needed to explore more fundamental solutions.

We also look forward to the community developing better practices for implementing Post SQL, as executing SQL in the two-phase-commit close method means the job checkpoint has already been flushed, so an exception at that point might affect the existing mechanism.

Connector Implicit Column Conversion

Problem

When matching and converting data types between a data source and the target storage, a common issue arises: if implicit column conversion is not handled adequately at the connector or framework level, data cannot be written into the corresponding fields of the target data source. While adapting connector features for DataX, we found that neither the connectors nor the framework level had implemented implicit column conversion.

For example, if the first column of the SeaTunnelRowType is a String with the value “2023-12-01 11:12:13”, it cannot be written into a MaxCompute field of Datetime type.

Solution

At the connector level, a simple RowConverter was implemented that maps and converts values based on the field types in the SeaTunnelRowType and the corresponding MaxCompute field types. We plan to integrate the community’s standard type conversion features in the future.

pull request address: https://github.com/apache/seatunnel/pull/5872
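
A simplified sketch of such a converter (method and type names are illustrative; the actual implementation is in the pull request above): a String value from a row field is parsed into the type expected by the target column.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Simplified sketch of implicit column conversion: a String value from a row field
// is converted to the type expected by the target column. The "DATETIME" target-type
// name and method signature are illustrative.
public class ImplicitColumnConverter {

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static Object convert(Object value, String targetType) {
        if (value instanceof String && "DATETIME".equalsIgnoreCase(targetType)) {
            // e.g. "2023-12-01 11:12:13" -> a java.time value the target SDK can accept
            return LocalDateTime.parse((String) value, FORMAT);
        }
        return value; // other type pairs would be handled analogously
    }
}
```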

Partial Column Synchronization for Connectors

Problem

While adapting connector features for DataX, we found that DataX supports backflow and writing of partial columns; some source connectors in SeaTunnel already implement this, but most sink connectors write all fields;

Solution

Source end: We can set custom columns (instead of the full table columns) in CatalogTable, similar to how DataX handles backflow of partition columns and constant columns, and pass them through the execution plan for the sink end to fetch; JDBC connectors can select the appropriate columns through the query SQL;

Sink end: Currently, alignment can be done between the index positions in the SeaTunnelRow and the indexes of the custom columns to achieve partial writing; JDBC connectors can handle this by specifying the columns in the INSERT statement.
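
A small sketch of the sink-side alignment (the class name is hypothetical): the writer records the positions of the configured columns and emits only those field values from each SeaTunnelRow.

```java
import java.util.List;
import org.apache.seatunnel.api.table.type.SeaTunnelRow;

// Sketch: write only the configured subset of columns by aligning the configured
// column indexes against the field positions in the SeaTunnelRow.
public class PartialColumnWriter {

    private final List<Integer> columnIndexes; // positions of the chosen columns in the row

    public PartialColumnWriter(List<Integer> columnIndexes) {
        this.columnIndexes = columnIndexes;
    }

    public Object[] projectedValues(SeaTunnelRow row) {
        Object[] values = new Object[columnIndexes.size()];
        for (int i = 0; i < columnIndexes.size(); i++) {
            values[i] = row.getField(columnIndexes.get(i));
        }
        return values; // these values back the column list of the INSERT statement
    }
}
```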

With the successful implementation of Apache SeaTunnel, ZhongAn Insurance has taken solid steps in the field of data integration. We look forward to continuing to optimize our data processes in the ever-changing technology environment to support rapid business development and innovation needs.

This case study from ZhongAn Insurance demonstrates the potential and value of open-source technology in enterprise applications, highlighting the importance of open cooperation in driving industry development, and hopefully provides some inspiration to others!

