Introduction - a data fabric approach to multicloud data integration

Modern data integration solutions that are part of a data fabric allow you to create flexible, reusable, augmented data pipelines

Modern enterprises are struggling with increasingly complex data estates as data becomes more diverse, distributed, and dynamic across hybrid cloud environments. New data sources, applications, and requirements are multiplying. According to an IDC report, as architectures get more complicated, it becomes harder to maintain them and to provide access to the data people need, leaving 60–73% of all data unused in the average enterprise.

Multicloud data integration addresses this complexity and sprawl across on-premises, multicloud and hybrid cloud environments. When used as part of a data fabric, it can turn a siloed data architecture into one where the right data is delivered to the right place at the right time. A data fabric is a technology architecture approach that enables organizations to unlock insight from their data wherever it resides while keeping true to secure, governed, and performant principles.

A data fabric strengthens data governance and privacy with automated controls, maintaining regulatory compliance no matter where data resides. It also builds the basis for 360-degree customer intelligence and trustworthy AI.

Modern data integration solutions that are part of a data fabric allow you to create flexible, reusable, augmented data pipelines to create and deliver data products across diverse domains and lines of business. Forward-thinking organizations like Wichita State University and Highmark Health have already experienced the benefits of such an approach.

The building blocks of multicloud data integration

Leverage a flexible architecture that is purpose-built for cloud-native workloads, and infuse ML-powered capabilities

The goal of multicloud data integration is to democratize data access across the enterprise, enable continuous availability for mission-critical data, and empower new data consumers, while keeping true to secure, governed, and performant principles

Modern data integration solutions require a data delivery platform that supports different integration and delivery styles while using knowledge graphs, active metadata, and policy enforcement to unify and govern data across a distributed landscape. By leveraging a flexible architecture that is purpose-built for cloud-native workloads, and by infusing ML-powered capabilities that allow consumers to unlock insight from their data wherever it resides, modern data integration solutions can provide faster access to high-quality data for downstream analytics, business intelligence (BI), AI, and data-intensive applications.

Data integration helps combine structured and unstructured data from disparate sources into meaningful and valuable data sets. Integration styles such as bulk/batch integration (ETL or ELT), data replication and streaming integration, change data capture, and data virtualization enable a variety of integration use cases.


  • ETL and ELT: Both are data integration processes that move raw data from a source system to a target data repository, such as a data lake or data warehouse. Advanced transformation of the data can be applied either in flight (ETL) or post-load (ELT). Data sources can live in multiple, different repositories, including legacy systems or systems of record, and are transferred to the target location using bulk or batch integration (see the sketch after this list).
  • Entity resolution: Resolves entities and detects relationships between them. Pipelines perform entity resolution as they process incoming identity records in three phases: recognize, resolve, and relate.
  • Data replication: Provides the flexibility to replicate data between a variety of heterogeneous sources and targets for dynamic delivery, making it ideal for multisite workload distribution and continuous availability, whether across data centers, on premises, or in the cloud.
  • Change data capture: Captures database changes as they happen, in real time, and delivers them to target databases, message queues, or an ETL solution.
  • Data virtualization: Connects data across data sources and makes it accessible through a single access point, regardless of data location, size, type, or format. Data engineers can quickly and easily fulfill ad hoc data integration requests to validate hypotheses or what-if scenarios.
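
To make the ETL and ELT patterns concrete, here is a minimal sketch in Python. It is illustrative only: the CSV file, table names, and schema are assumptions, and a real pipeline would load into a data warehouse or lake rather than SQLite.

```python
# Minimal ETL vs. ELT sketch. File names, table names, and schema are assumptions.
import csv
import sqlite3

def etl(source_csv: str, conn: sqlite3.Connection) -> None:
    """ETL: transform each record in flight, then load the shaped rows into the target."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders_clean (id INTEGER, amount_usd REAL)")
    with open(source_csv, newline="") as f:
        for row in csv.DictReader(f):
            amount_usd = float(row["amount_cents"]) / 100.0   # transformation before load
            conn.execute("INSERT INTO orders_clean VALUES (?, ?)", (int(row["id"]), amount_usd))
    conn.commit()

def elt(source_csv: str, conn: sqlite3.Connection) -> None:
    """ELT: load the raw data first, then transform it inside the target with SQL."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders_raw (id TEXT, amount_cents TEXT)")
    with open(source_csv, newline="") as f:
        rows = [(r["id"], r["amount_cents"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", rows)
    # Post-load transformation runs where the data now lives.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS orders_clean AS
        SELECT CAST(id AS INTEGER) AS id,
               CAST(amount_cents AS REAL) / 100.0 AS amount_usd
        FROM orders_raw
    """)
    conn.commit()
```

The difference is where the transformation work runs: in the pipeline itself (ETL) or pushed down into the target repository after a raw load (ELT).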

Data cataloging can help users easily find and use the right data through a rich, metadata-driven index of cataloged assets. Data catalogs with built-in quality analysis can make inferences and identify anomalies. Some catalogs have built-in refineries that help with discovering, cleansing, and transforming data through data-shaping operations.
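
As a rough illustration of what a metadata-driven index records, the sketch below models catalog entries as plain Python objects. The fields, tags, and quality scores are assumptions, not any particular product's data model.

```python
# Toy catalog index: assets are described by business metadata, not physical location.
from dataclasses import dataclass, field

@dataclass
class CatalogAsset:
    name: str
    location: str                      # where the data physically lives
    owner: str
    tags: set = field(default_factory=set)
    quality_score: float = 0.0         # populated by built-in quality analysis

def find_assets(catalog, tag, min_quality=0.8):
    """Let consumers search by tag and quality rather than by physical location."""
    return [a for a in catalog if tag in a.tags and a.quality_score >= min_quality]

catalog = [
    CatalogAsset("orders", "s3://lake/orders/", "sales-eng", {"sales", "pii"}, 0.92),
    CatalogAsset("clickstream", "kafka://events", "web-team", {"marketing"}, 0.71),
]
print(find_assets(catalog, "sales"))   # returns the high-quality "orders" asset
```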

Data governance makes the right data easier to find for those who should have access to it, while keeping sensitive data hidden or masked. A governance framework that automates the enforcement of data protection policies, backed by automated metadata tagging, is ideal for establishing and enforcing governance rules.
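
The toy function below sketches what automated policy enforcement can look like in practice: columns carrying a sensitivity tag are masked unless the requesting user holds a privileged role. The tag names, roles, and masking rule are illustrative assumptions.

```python
# Illustrative policy enforcement: mask columns tagged as sensitive for non-privileged users.
SENSITIVE_TAGS = {"pii", "phi"}

def mask(value: str) -> str:
    """Keep the first character, hide the rest."""
    return value[:1] + "*" * (len(value) - 1) if value else value

def enforce(row: dict, column_tags: dict, user_roles: set) -> dict:
    privileged = "data-steward" in user_roles
    masked_row = {}
    for col, val in row.items():
        is_sensitive = bool(column_tags.get(col, set()) & SENSITIVE_TAGS)
        masked_row[col] = mask(str(val)) if is_sensitive and not privileged else val
    return masked_row

tags = {"email": {"pii"}, "region": set()}
print(enforce({"email": "a.smith@example.com", "region": "EMEA"}, tags, {"analyst"}))
# The analyst sees the email masked; a data-steward would see it in full.
```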

Advanced data engineering automates data access and sharing, accelerating data delivery with active metadata. Orchestration can also make DataOps pipelines simpler to build and run, improving the integration and communication of data flows between data producers and data consumers. Automatic workload balancing and elastic scaling make jobs ready for any environment and any data volume.
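
To show the shape of an orchestrated DataOps pipeline, the toy example below declares steps and their dependencies, then runs them in dependency order. It is a stand-in for a real orchestrator, and the step names are assumptions.

```python
# Toy orchestration: declare pipeline steps and dependencies, run them in order.
from graphlib import TopologicalSorter

def ingest():    print("ingest raw data")
def profile():   print("profile and score data quality")
def transform(): print("shape data for consumers")
def publish():   print("publish the data product to the catalog")

steps = {"ingest": ingest, "profile": profile, "transform": transform, "publish": publish}
dependencies = {
    "profile":   {"ingest"},      # each step runs only after its upstream steps finish
    "transform": {"profile"},
    "publish":   {"transform"},
}

for step in TopologicalSorter(dependencies).static_order():
    steps[step]()
```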

Finally, you’ll want an approach that provides self-service access, the ability to query all data, and continuous data availability.


Multicloud data integration success stories

Consider these components

IBM Cloud Pak for Data

IBM Cloud Pak® for Data brings together a comprehensive data integration solution that addresses the needs of a modern data fabric, with built-in data virtualization. It creates an end-to-end user experience rooted in metadata and active policy management. With Cloud Pak for Data, users can view, access, manipulate and analyze data without needing to understand its physical format or location, and without having to move or copy it.

IBM Cloud Pak for Data allows companies to automatically apply industry-specific regulatory policies and rules to their data assets, securing them across the enterprise.

IBM DataStage

IBM® DataStage® is an industry-leading data integration tool that helps you design, develop and run jobs that move and transform data. At its core, the DataStage tool supports extract, transform and load (ETL) and extract, load and transform (ELT) patterns. Get flexible and near real-time data integration, both on premises and in the cloud, with a scalable ETL platform and workload balancing that executes up to 30% faster.5

IBM Watson Query

IBM Watson Query is a universal query engine that executes distributed and virtualized queries across databases, data warehouses, and data lakes. Providing data virtualization capabilities, Watson Query is the tool of choice for quick, easy integration with data sources. The effort required for small and large ETL jobs is often the same; data virtualization improves efficiency for smaller, ad hoc requests, reducing ETL jobs by 25–65%.5 And because Watson Query can enforce Watson Knowledge Catalog policies, governance and data protection are applied wherever data is queried.
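
The snippet below illustrates only the general data virtualization idea of a single access point across sources; it is not Watson Query syntax. It uses SQLite’s ATTACH as a stand-in, and the file names and schema are assumptions.

```python
# Conceptual illustration of data virtualization (not Watson Query syntax):
# one connection acts as the single access point, and one query joins data that
# physically lives in two separate databases, without copying either of them.
import sqlite3

conn = sqlite3.connect("sales.db")                        # first source
conn.execute("ATTACH DATABASE 'customers.db' AS crm")     # second source

rows = conn.execute("""
    SELECT c.name, SUM(s.amount) AS total_spend
    FROM main.sales AS s
    JOIN crm.customers AS c ON c.id = s.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)
```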

IBM Watson Knowledge Catalog

IBM Watson® Knowledge Catalog is a data catalog tool that powers intelligent, self-service discovery of data, models and more. The cloud-based enterprise metadata repository activates information for AI, machine learning (ML) and deep learning and can cut time to automate data discovery, quality and governance by up to 90%.6 Users can access, curate, categorize and share data, knowledge assets and their relationships, wherever they reside.

IBM data replication

The IBM data replication portfolio of products supports high volumes of data with very low latency, making the solution ideal for multisite workload distribution and continuous availability, whether across data centers, on premises, or in the cloud. Robust support for sources, targets and platforms ensures the right data is available in data lakes, data warehouses, data marts and point-of-impact solutions, while enabling optimal resource utilization and rapid ROI.