Model Characterization and Data Standardization
Clinical and Molecular Data Types Collected
HCMI models and associated tumor and normal tissues are characterized using standard operating procedures for clinical and molecular data.
For clinical data, cancer type-specific clinical data working groups generate case report forms (CRFs). The working groups are composed of disease experts and pathologists who contribute their specialized knowledge to the clinical data collection process for HCMI. These CRFs standardize the clinical data collected across the many HCMI sites.
For molecular data, the models and their associated normal and tumor tissues are characterized at various Genome Characterization Centers. Characterization data include:
- 150x whole exome (WXS)
- 15x whole genome (WGS)
- 120 million read RNA-sequencing
- Infinium MethylationEPIC DNA Array
To help standardize molecular characterization, the data are aligned and processed by NCI's Genomic Data Commons (GDC). Somatic variants, copy number variants, and structural rearrangement data files are also generated. All clinical and molecular data are made available through the GDC.
Variant Calling and Filtering
Somatic variants for each case are generated by the GDC. As part of the GDC's data harmonization process, potential germline mutations are filtered out using variants identified from DNA sequencing of normal tissues, so that only somatic variants remain. These filtered lists of somatic mutations, also known as “masked somatic mutations,” are available to users as open-access “masked somatic MAF” files for each model. Once cases are released at the GDC, the available masked somatic MAF data are shown as “Research Somatic Variants” on the HCMI Searchable Catalog. Users may search the catalog for models with gene mutations of interest.
Because open-access masked somatic MAF data are heavily filtered to remove potential germline variants, some files lack low-frequency somatic variants or contain no variants at all (e.g., ensemble calls for case IDs HCM-BROD-0036-41 and HCM-CSHL-0093-C25). Tumor mutational frequency is affected by a number of factors, including cancer type, tissue status, and patient age. More information about the GDC’s process for generating and filtering MAF files is available at the GDC.
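For users working with a downloaded masked somatic MAF file, the following is a minimal sketch of how one might list the variants in a gene of interest and handle a file that contains no variants. It assumes the file is tab-delimited with "#" comment lines and includes the standard MAF columns Hugo_Symbol and Variant_Classification. The helper names (read_maf, variants_in_gene) and the sample rows are hypothetical and are not real HCMI data.

```python
import csv
import io


def read_maf(handle):
    """Return the variant rows of a MAF file as a list of dicts.

    MAF files are tab-delimited; lines starting with '#' are treated
    as comments and skipped before the header row is parsed.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


def variants_in_gene(rows, gene):
    """Keep only the rows whose Hugo_Symbol matches the gene of interest."""
    return [row for row in rows if row["Hugo_Symbol"] == gene]


# Made-up example content for illustration only (not real HCMI data).
EXAMPLE_MAF = (
    "#example comment line\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "KRAS\tMissense_Mutation\tSAMPLE-A\n"
    "TP53\tNonsense_Mutation\tSAMPLE-A\n"
    "KRAS\tSilent\tSAMPLE-A\n"
)

# A header-only file, as can occur after heavy filtering.
EMPTY_MAF = (
    "#example comment line\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
)

rows = read_maf(io.StringIO(EXAMPLE_MAF))
kras = variants_in_gene(rows, "KRAS")
empty_rows = read_maf(io.StringIO(EMPTY_MAF))

print(f"{len(rows)} variants in total, {len(kras)} in KRAS")
if not empty_rows:
    print("No variants in the filtered file; consider other data types.")
```

With a real file, the same functions could be called on an open file handle, for example `read_maf(open(path, newline=""))`, where `path` points to an uncompressed MAF.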
If omission of true-positive somatic mutations is a concern, users may access controlled-access MAFs through the GDC, which requires user certification through dbGaP. Information about alterations may also be gleaned from copy number and structural variant data.
Both clinical and molecular characterization data may be accessed regardless of whether a user has obtained an HCMI model. Available open- and controlled-access model-associated data can be found at the GDC.
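As an illustration of locating model-associated files programmatically, the sketch below builds query parameters for the GDC API files endpoint. Several details are assumptions to verify against the GDC documentation before use: the project identifier "HCMI-CMDC", the data type label "Masked Somatic Mutation", and the chosen field names. The function build_maf_query is a hypothetical helper, and the code only constructs the request parameters without sending them.

```python
import json

# Public GDC API endpoint for file metadata searches.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_maf_query(project_id, access="open", size=100):
    """Build query parameters for a GDC file search.

    project_id and the data type label below are assumptions; confirm
    the exact values in the GDC Data Portal before relying on them.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "data_type",
                         "value": ["Masked Somatic Mutation"]}},
            {"op": "in",
             "content": {"field": "access", "value": [access]}},
        ],
    }
    return {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,cases.submitter_id",
        "format": "JSON",
        "size": str(size),
    }


# "HCMI-CMDC" is assumed to be the GDC project ID for HCMI models.
params = build_maf_query("HCMI-CMDC")
print(GDC_FILES_ENDPOINT)
print(json.dumps(json.loads(params["filters"]), indent=2))
```

The resulting parameters could be sent with an HTTP client of the user's choice as a GET request to the endpoint. Passing access="controlled" would list controlled-access files, although downloading them still requires the dbGaP authorization described above.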