Data Types Collected by TCGA
The Cancer Genome Atlas (TCGA) collected many types of data for each of over 20,000 tumor and normal samples. Each step in the Genome Characterization Pipeline generated numerous data points, such as:
- clinical information (e.g., smoking status)
- molecular analyte metadata (e.g., sample portion weight)
- molecular characterization data (e.g., gene expression values)
Below is supporting information and documentation for the different steps of molecular characterization.
Case Enrollment Documentation
Documents on case enrollment, follow-up, and other forms related to the intake of samples and clinical data are available from the Biospecimen Core Resource.
Sample Processing Documentation
TCGA used a compendium of standard operating procedures for processing tissues and other biological samples into molecular analytes for molecular characterization. These protocols are available from NCI's Biospecimen Research Database.
Summary of Data Types Collected
The data collected for a specific case in TCGA may have differed according to sample quality and quantity, cancer type, or technology available at the time of analysis. Below is a general summary of the clinical, molecular characterization, and other data that may have been generated for the different cancer types studied.
Raw data (e.g., BAMs), germline and non-validated mutations, and genotypes are under controlled access. Derived data are available open access (exceptions are noted in the table below).
All data collected and processed by the program is available at the Genomic Data Commons (GDC), including TCGA publication supplemental and associated data files. Questions about locating or accessing data should be directed to the GDC support team. Resources for TCGA users and TCGA FAQs are available.
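As an illustration of locating open-access TCGA files at the GDC, the following is a minimal sketch that only builds a query URL and does not send it. It assumes the publicly documented GDC files endpoint and its JSON filter syntax; the field names (`cases.project.program.name`, `access`, `data_category`, `data_format`) and the helper names `build_filters` and `build_query_url` are assumptions for illustration and should be checked against the current GDC API documentation.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the GDC files API; verify against current GDC documentation.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_filters(program, access):
    """Build a GDC-style filter selecting files by program and access level."""
    return {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.program.name",  # assumed field name
                    "value": [program],
                },
            },
            {
                "op": "in",
                "content": {"field": "access", "value": [access]},  # "open" or "controlled"
            },
        ],
    }


def build_query_url(program="TCGA", access="open", size=10):
    """Return a query URL for the files endpoint; no request is made here."""
    params = {
        "filters": json.dumps(build_filters(program, access)),
        "fields": "file_name,data_category,data_format,access",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


query_url = build_query_url()
print(query_url)
```

Changing `access` to `"controlled"` would list controlled-access files in the same way, although downloading them requires authorization that this sketch does not cover.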
Experimental protocols for each platform can be found in individual publications.
Type | Subtype | Cancer Types Applicable | Description | Format | Notes |
---|---|---|---|---|---|
Clinical | Clinical data | All | Available clinical information (may include demographic information, treatment information, survival data, etc.) | XML (per patient), tab-delimited TXT (grouped "biotab" per cancer type) | Clinical data forms used by the TSS; additional information in the Clinical Data Elements (CDE) Browser |
Clinical | Biospecimen data | All | Information on how samples were processed by the Biospecimen Core Resource Center | XML (per patient), tab-delimited TXT (grouped "biotab" per cancer type) | Protocols used by the BCR for processing of samples |
Clinical | Pathology Reports | All | Pathology reports (for select cases) | | |
Copy Number | SNP microarray | All | | CEL, IDAT, tab-delimited TXT (raw values per SNP, copy number, and loss of heterozygosity), tab-delimited TXT (normalized values and purity/ploidy) | Probe information contained in array design files for each platform |
Copy Number | Copy number microarray | GBM, OV, LUSC | | tab-delimited TXT (raw signals per probe), tab-delimited TSV (normalized values per aggregated region), MAT | Probe information contained in array design files for each platform |
Copy Number | Low-Pass DNA Sequencing | Some tumor types | Low-pass whole genome sequencing of tumor and normal matched samples and analysis of differences in read counts between tumor and normal | BAM, VCF, tab-delimited TSV (normal vs. tumor calls) | |
DNA | Whole exome | All | Whole exome sequencing of tumor and normal matched samples | BAM, VCF, MAF (mutation calls) | Germline mutation calls and unvalidated non-coding somatic variants are controlled-access |
DNA | Whole genome | All | Whole genome sequencing of tumor and normal matched samples (for select cases) | BAM, VCF, MAF (mutation calls) | Germline mutation calls and unvalidated non-coding somatic variants are controlled-access |
DNA | SNP microarray | All | | CEL, IDAT, tab-delimited TXT (raw values per SNP), tab-delimited TXT (genotypes per SNP) | Germline mutation calls and unvalidated non-coding somatic variants are controlled-access |
DNA | Sequence traces | GBM, OV | Raw output from capillary sequencing technology | SCF, TR | May be available at NCBI Trace Archive |
Imaging | Diagnostic image | All | Whole slide images of tissue used to diagnose participant | SVS | Available at the GDC, open access |
Imaging | Tissue image | All | Whole slide images of tissue samples from each participant that were used for TCGA analyses | SVS | Available at the GDC, open access |
Imaging | Radiological image | Some | Pre-surgical radiological imaging (e.g., MRI, CT, PET) (for select cases) | DCM | Available at The Cancer Imaging Archive, open access |
Methylation | Bisulfite sequencing | Some tumor types | Whole genome sequencing performed after bisulfite treatment of tumor samples | BAM, VCF (methylation and mutation calls), BED (methylation calls per CpG site) | |
Methylation | Bead array | All | | tab-delimited TXT (raw signal values, beta values, beta values mapped to genome), IDAT | Probe information contained in array design files for each platform |
Microsatellite Instability | | COAD, READ, UCEC | Markers indicating presence or absence of an MSI shift, allele homozygosity/heterozygosity, and loss of heterozygosity observed in tumor samples | FSA, TXT (summary of trace file) | MSI classifications within clinical biotab files |
miRNA | miRNA Sequencing | All except GBM | miRNA sequencing of tumor samples | BAM, tab-delimited TXT (normalized expression values per miRNA or isoform) | |
miRNA | Array-based | GBM, OV | | TXT (raw signals per probe, normalized expression values per probe, gene, or exon) | Probe information contained in array design files for each platform |
mRNA Expression | mRNA sequencing | All | mRNA sequencing of tumor samples using a poly(A) enrichment RNA preparation | BAM, TXT (normalized expression values per gene, isoform, exon, or splice junction) | May be labeled as RNASeqV1 and RNASeqv2 |
mRNA Expression | Total RNA Sequencing | Some tumor types | mRNA sequencing of tumor samples using ribosomal depletion RNA preparation | BAM, TXT (normalized expression values per gene, isoform, exon, or splice junction) | May be labeled as TotalRNASeqV2 |
mRNA Expression | Microarray | BRCA, COAD, GBM, KIRC, KIRP, LAML, LGG, LUAD, LUSC, OV, READ, UCEC | | CEL (raw signals per probe), TXT (raw signals per probe, normalized expression values per probe, gene, or exon) | Probe information contained in array design files for each platform |
Protein Expression | Reverse-Phase Protein Array | All | High resolution images of protein array slides (up to 1000 participant tumor samples per slide) and raw signals per slide | TIFF, tab-delimited TXT (signal values, dilution curves, normalized expression values) | |
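
Many of the formats listed above are tab-delimited text. As a rough sketch of how such a file might be read, the example below parses a small MAF-like snippet and counts mutation calls per gene. The sample rows and sample barcodes are invented for illustration, the helper `read_tab_delimited` is hypothetical, and real files may carry additional header or comment lines that need their own handling.

```python
import csv
import io
from collections import Counter


def read_tab_delimited(handle):
    """Yield one dict per data row, skipping lines that start with '#'."""
    data_lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(data_lines, delimiter="\t")


# Invented example content; not real TCGA data.
SAMPLE = (
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tSAMPLE-A\n"
    "KRAS\tMissense_Mutation\tSAMPLE-A\n"
    "TP53\tNonsense_Mutation\tSAMPLE-B\n"
)

rows = list(read_tab_delimited(io.StringIO(SAMPLE)))

# Count how many mutation calls were reported for each gene symbol.
calls_per_gene = Counter(row["Hugo_Symbol"] for row in rows)
print(calls_per_gene.most_common())
```

The same reader can be pointed at a file opened with `open(path, newline="")` in place of the in-memory `io.StringIO` object used here.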