National Institutes of Health
Two of the National Institutes of Health (NIH), the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), collaborated on a project known as The Cancer Genome Atlas (TCGA). The TCGA mission is to catalog the genetic mutations responsible for cancer using genome sequencing and bioinformatics. To make the TCGA genome data available to researchers and scientists around the world, it had to be converted to a different format, that of the International Cancer Genome Consortium (ICGC). The challenges faced in this process were:
- Not scalable - High data volumes (hundreds of millions of rows across 25 variations) made the conversion difficult to accomplish with custom scripts
- High costs - Modifying and maintaining the scripts whenever the source or target formats changed proved both expensive and time-consuming
- Inaccurate - Limited connectivity to data sources led to redundant copies of data, slower processes and a greater chance of errors
The solution had three parts:
- Denodo Data Virtualization was deployed to connect to the different sources of the genome data, apply transformations, and produce the final data sets
- Data is extracted from XML files, an Oracle database and a MySQL database
- The scheduler within Denodo runs an FTP transfer once per quarter to move the files to the target ICGC servers
The benefits were:
- Increased scalability - Larger genome data sets can be included, thanks to replicable generic workflows and the platform’s advanced performance capabilities
- Increased efficiency - TCGA-to-ICGC transformation processes are faster to develop and modify because of the platform’s diverse connectivity and publishing capabilities
- Increased accuracy - Minimized replication and manual intervention mean the most current versions of data and processes are used to create the output files, improving the accuracy of the final data
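
To make the idea of a replicable, generic conversion workflow more concrete, here is a minimal Python sketch of a mapping-driven conversion from XML records to tab-separated rows. It is only an illustration and does not show how Denodo performs the TCGA-to-ICGC transformation: the tag names, the target column names, the sample records and the `convert` helper are all hypothetical, and the real TCGA and ICGC schemas are not reproduced here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from source XML tags to target column names.
# Neither side reflects the real TCGA or ICGC schemas.
FIELD_MAP = {
    "patient_id": "donor_id",
    "gene": "gene_affected",
    "variant": "mutation_type",
}


def convert(xml_text, field_map):
    """Turn every <record> element into one tab-separated row.

    The header row uses the target names from field_map; a source tag
    that is missing from a record becomes an empty field.
    """
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(list(field_map.values()))
    for record in root.iter("record"):
        row = [(record.findtext(tag) or "").strip() for tag in field_map]
        writer.writerow(row)
    return out.getvalue()


# Made-up sample input; the second record deliberately lacks <variant>.
SAMPLE = """<records>
  <record><patient_id>P-001</patient_id><gene>TP53</gene><variant>missense</variant></record>
  <record><patient_id>P-002</patient_id><gene>KRAS</gene></record>
</records>"""

if __name__ == "__main__":
    print(convert(SAMPLE, FIELD_MAP), end="")
```

Because the field mapping is plain data, a change in the source or target format means editing `FIELD_MAP` rather than rewriting the conversion logic, which is the kind of maintenance saving the case study attributes to generic workflows.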
Curing advanced data ailments with data virtualization to aid the worldwide war on cancer.