
Data Virtualization Overview

What Is Data Virtualization?

Data virtualization is synonymous with information agility: it delivers a simplified, unified, and integrated view of trusted business data in real time or near real time, as needed by the consuming applications, processes, analytics, or business users. Data virtualization integrates data from disparate sources, locations, and formats, without replicating the data, to create a single "virtual" data layer that delivers unified data services to support multiple applications and users. The result is faster access to all data, less replication and cost, and more agility to change.
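As a rough illustration of the idea, the sketch below joins two disparate sources at query time instead of copying them into a new store. It is a toy model only, not how any particular product works: the in-memory SQLite database, the CSV text, and the names `make_orders_db`, `CUSTOMERS_CSV`, and `customer_totals` are all assumptions made up for this example.

```python
import csv
import io
import sqlite3


def make_orders_db():
    """Hypothetical source 1: a relational database (in-memory SQLite stands in for it)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 120.0), (2, 75.5), (1, 30.0)],
    )
    return conn


# Hypothetical source 2: a flat file, held here as CSV text.
CUSTOMERS_CSV = "customer_id,name\n1,Acme\n2,Globex\n"


def customer_totals(orders_conn, customers_csv):
    """A 'virtual view': both sources are read when the view is queried.

    The joined rows are computed on demand and returned to the caller;
    nothing is written to an intermediate store.
    """
    reader = csv.DictReader(io.StringIO(customers_csv))
    names = {int(row["customer_id"]): row["name"] for row in reader}
    rows = orders_conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    )
    return [{"name": names.get(cid), "total": total} for cid, total in rows]


if __name__ == "__main__":
    conn = make_orders_db()
    print(customer_totals(conn, CUSTOMERS_CSV))
```

Because the view is evaluated on each call, a row added to the `orders` table shows up in the next result without any reload step.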

Do you want more information about data virtualization? Visit this page to see how it works in three simple steps.

Data virtualization is modern data integration. It performs many of the same transformation and quality functions as traditional data integration (Extract-Transform-Load (ETL), data replication, data federation, Enterprise Service Bus (ESB), etc.) but leverages modern technology to deliver real-time data integration at lower cost, with more speed and agility. In many cases it can replace traditional data integration and reduce the need for replicated data marts and data warehouses, but it does not eliminate them entirely.

Data virtualization is also an abstraction layer and a data services layer. In this sense it is highly complementary: it sits between original and derived data sources, ETL, ESB and other middleware, applications, and devices, whether on-premise or cloud-based, and provides flexibility between layers of information and business technology.

 

5 Key Capabilities Data Virtualization Delivers:

  1. Logical abstraction and decoupling - Disparate data sources, middleware, and consuming applications that use or expect specific platforms and interfaces, formats, schema, security protocols, query paradigms and other idiosyncrasies can now interact easily through data virtualization.
  2. Data federation on steroids - Data federation is a subset of data virtualization, now enhanced with more intelligent real-time query optimization, caching, in-memory and hybrid strategies that are chosen automatically (or manually) based on source constraints, application needs, and network awareness.
  3. Semantic integration of structured & unstructured - Data virtualization is one of the few technologies that bridge the semantic understanding of unstructured and web data with the schema-based understanding of structured data to enable integration and data quality improvements.
  4. Agile data services provisioning - Data virtualization promotes the API economy. Any primary, derived, integrated or virtual data source can be made accessible in a different format or protocol than the original, with controlled access in a matter of minutes.
  5. Unified data governance & security - All data is made easy to discover and integrate through a single virtual layer, which exposes redundancy and quality issues faster. While these are being addressed, data virtualization imposes data model governance and security from source to output data services, and consistency in integration and data quality rules.
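The caching strategy mentioned in capability 2 can be sketched as a time-to-live (TTL) cache in front of a source query. This is only a minimal illustration under stated assumptions: the class name `CachedView`, its `fetch` callback, and the `source_hits` counter are hypothetical and do not describe any product's actual cache.

```python
import time


class CachedView:
    """Hypothetical virtual view that caches source results for a fixed TTL."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that queries the underlying source
        self._ttl = ttl_seconds      # how long cached rows stay valid
        self._clock = clock          # injectable clock, handy for testing
        self._rows = None
        self._loaded_at = None
        self.source_hits = 0         # number of times the source was actually queried

    def query(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Cache is empty or stale: go back to the source and refresh.
            self._rows = self._fetch()
            self._loaded_at = now
            self.source_hits += 1
        return self._rows


if __name__ == "__main__":
    view = CachedView(lambda: [("east", 10.0), ("west", 20.0)], ttl_seconds=60)
    view.query()
    view.query()
    print(view.source_hits)  # the second call is served from the cache
```

A short TTL keeps results close to real time at the cost of more load on the source; a long TTL does the opposite. That trade-off is what the list above means by choosing a strategy based on source constraints and application needs.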

These capabilities are not to be found together in any other integration middleware. While they can be pieced together or custom coded, that destroys any agility or speed advantage you seek.

5 Flavors of Data Virtualization - from "Feature" to "Enterprise Platform"

As data virtualization gains in popularity, some of its features are being included in other products, either built in or as an add-on module. This can be a good thing, particularly if it is included in the cost of the other product.
However, being able to tell the difference between an add-on or built-in data virtualization product and an enterprise data virtualization platform is important for several reasons:

  • Breadth of capabilities may be very limited, particularly in sources, logical modeling, performance, security and governance.
  • Optimized to play an adjunct function to the main product of the vendor - such as prototyping for an ETL / data warehousing or Master Data Management (MDM) project or tool vendor; or provide a semantic layer for a BI tool. Thus the product is defocused from being a true, high-performance enterprise data virtualization layer supporting widely heterogeneous sources, consumers, and solution patterns.
  • Vendor lock-in requiring pre-requisite products or add-ons from the same vendor to get the most value out of the data virtualization product.

The following list describes the forms in which data virtualization is offered:

  1. Data blending - This is often included as part of a business intelligence (BI) tool semantic universe layer or is a new module offered by a predominantly BI vendor. Data blending is able to combine multiple sources (limited list of structured or big data) to feed the BI tool, but the output is only available for this tool and cannot be accessed from any other external application for consumption.
  2. Data services module - Typically these are offered for additional cost by Data Integration Suite (ETL / MDM / Data Quality) or Data Warehouse vendors. The suite is usually very strong in other areas. When it comes to data virtualization, some features shared with the suite, such as modeling, transformation, and quality functions, are very robust, but the data virtualization engine, query optimization, caching, virtual security layers, flexibility of data model for unstructured sources, and overall performance are weak. This is because the product is designed to prototype ETL or MDM and not to compete with it in production use.
  3. SQLification products - This is an emerging offering, particularly among Big Data and Hadoop vendors. These products "virtualize" the underlying big data technologies and allow them to be combined with relational data sources and flat files and queried using standard SQL. This can be good for projects focused on that particular big data stack, but not beyond.
  4. Cloud data services - These products are often deployed in the cloud and have pre-packaged integrations to SaaS and cloud applications, cloud databases, and a few desktop and on-premise tools like Excel. Rather than being a true data virtualization product with tiered views and delegatable query execution, these products expose normalized APIs across cloud sources for easy data exchange in projects of medium volume. Projects involving big data analytics, major enterprise systems, mainframes, large databases, flat files and unstructured data are out of scope.
  5. Data virtualization platform - Built from the ground up to provide data virtualization capabilities for the enterprise in a many-to-many fashion through a unified "virtual" data layer. It is designed for agility and speed in a wide range of use cases, is agnostic to sources and consumers, and both competes and collaborates with other, less efficient middleware.


6 Things Data Virtualization Is NOT:

The description of data virtualization above is consistent with definitions by leading industry analysts. However, some vendors use similar buzzwords to market other products and capitalize on the popularity of data virtualization. This list helps dispel the confusion.

Data virtualization ...

  1. is not data visualization. It sounds similar, but visualization refers to the graphical display of data to end users as charts, graphs, maps, reports, etc. Data virtualization is middleware that provides data services to data visualization tools and other applications. While it has some data visualization for users and developers, that is not its main use.
  2. is not a replicated data store. Data virtualization does not normally persist or replicate data from source systems to itself. It only stores metadata for the virtual views and integration logic. If caching is enabled, it stores some data temporarily in a cache or in-memory database. Virtual data can be persisted if desired by simply using the virtual views as a source for ETL. Thus data virtualization is a powerful, yet very lightweight and agile solution.
  3. is not a Logical Data Warehouse. Logical DWH is an architectural concept and not a platform. Data Virtualization is an essential technology used in creating a logical DWH by combining multiple data sources, data warehouses and big data stores like Hadoop.
  4. is not data federation. TDWI teaches a course on data virtualization which puts it this way: "While all data federation is data virtualization, not all data virtualization is data federation." Thus data virtualization is a superset of capabilities that includes advanced data federation.
  5. is not virtualized data storage. Some companies and products use the exact same term "data virtualization" to describe virtualized database software or storage hardware virtualization solutions. They do not provide real-time data integration and data services across disparate structured and unstructured data sources.
  6. is not virtualization. When the term "virtualization" is used alone it typically refers to hardware virtualization -- servers, storage disks, networks, etc.
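Item 2 above, the difference between storing only view metadata and persisting a copy, can be illustrated with a small sketch. It uses a single in-memory SQLite database purely as a stand-in, so it does not show cross-source integration; the table and view names (`sales`, `sales_by_region`, `sales_snapshot`) and the helper `totals` are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 20.0)],
)

# A view stores only its definition (metadata); rows are computed when queried.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# Persisting the virtual data is a separate, explicit step: here the view is
# used as the source of a copy, much like an ETL job reading from it.
conn.execute("CREATE TABLE sales_snapshot AS SELECT * FROM sales_by_region")

# The source changes after the snapshot was taken.
conn.execute("INSERT INTO sales VALUES ('east', 5.0)")


def totals(relation):
    """Return {region: total} for a view or table (names are fixed literals here)."""
    return dict(conn.execute(f"SELECT region, total FROM {relation}"))


if __name__ == "__main__":
    print(totals("sales_by_region"))  # reflects the new row
    print(totals("sales_snapshot"))   # still shows the values at copy time
```

The view follows the source automatically, while the persisted copy stays as it was until it is rebuilt.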