Loading of VCF files
This page describes the process that an operator follows to load their VCF files into an OpenCGA Variant Store.
Introduction
This page describes the typical process that an operator will follow to load their VCF files into OpenCGA using the OpenCGA command-line tools. VCF files can be loaded either before (the usual case) or after the loading of sample and clinical metadata.
It typically takes a few minutes to load a VCF file from a single exome, but it can take several days (or even weeks) to load many thousands of whole genomes. For more information on data load times, see [Data Load Benchmarks].
The process is divided into 5 steps:
"Register" the source VCF files with OpenCGA; this creates basic File, Sample and Individual entries in Catalog.
"Index" each VCF file; this loads data into the Variant Store
"Annotate" all newly-created variants against the CellBase knowledge base.
"Summarise"; re-calculate all variant statistics.
"Secondary Index" to include annotations and summaries.
Prerequisites
This document assumes that:
The source VCF files are accessible (e.g. via shared filesystem) on the target OpenCGA server.
The operator has access to a workstation with network access to the web services on the OpenCGA server.
Compatible OpenCGA client software is installed on the workstation. Instructions on how to install the client software can be found here.
The destination Study has been created on the OpenCGA server. Instructions for creating Projects and Studies can be found here.
The operator has login credentials on the OpenCGA server with appropriate permissions; i.e. write access to the destination Study.
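As a quick check of the last prerequisite, the operator can authenticate with the command-line client; a minimal sketch (the username is a placeholder) is:
```bash
# Log in with the OpenCGA command-line client; the session token is stored
# by the client and reused by subsequent commands. "demo" is a placeholder user.
./opencga.sh users login -u demo
```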
Catalog file register
This step presents the data to OpenCGA and registers the new files in the system. Samples will be created automatically when a file is linked, by reading the VCF header. This step can be further extended with extra annotations, by defining individuals, or by creating cohorts or even families.
It is important to note that this step is a synchronous operation that does not upload the genomic data (e.g. VCFs) into OpenCGA; instead, the files are only "linked" (registered) with OpenCGA. Therefore, the files to link must be in a location that is accessible to the OpenCGA server (the REST servers and the Master service).
Catalog Path Structure
Internally, the Catalog metadata holds a logical tree view of the linked files that can easily be explored or listed. Try using:
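A minimal sketch of listing the tree with the CLI (the study ID is a placeholder, and subcommand names can differ slightly between OpenCGA versions):
```bash
# List the Catalog file tree of a study; check ./opencga.sh files --help
# for the exact subcommand and flags available in your version.
./opencga.sh files tree --study <study>
```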
New folders can be created with this command:
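A sketch of the folder-creation call, assuming the files subcommand of your client version supports creating Catalog folders (check ./opencga.sh files --help):
```bash
# Illustrative only: create a new logical folder in Catalog.
# The exact subcommand and flags are version-dependent.
./opencga.sh files create --study <study> --path <catalog-logical-path>
```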
where <catalog-logical-path> is the directory that you would like to create within Catalog.
Linking files synchronously vs. asynchronously
Note that for VCF files with more than 5000 samples, linking should be launched as an asynchronous job.
There are two different commands depending on the type of VCF that needs to be loaded. Aggregated VCF files with many samples need to be linked by launching an asynchronous job.
- Linking files synchronously (fewer than ~5000 samples)
Files are registered into OpenCGA Catalog using this command line:
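A representative invocation (input path, study ID and Catalog path are placeholders):
```bash
# Link (register) a VCF file that is already visible to the OpenCGA server.
./opencga.sh files link --input /data/vcfs/sample1.vcf.gz --study <study> --path data/vcfs/
```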
Multiple files can be linked with the same command by passing several input files separated by spaces or commas.
- Linking files asynchronously (more than 5000 samples)
For VCFs containing more than 5000 samples, the linking step needs to be performed as an asynchronous job. In this case, a different command needs to be run:
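A sketch, assuming the asynchronous link operation is exposed as files link-run in your client version (check ./opencga.sh files --help):
```bash
# Launch linking as an asynchronous job for a large aggregated VCF.
# Subcommand name is version-dependent; file path and study are placeholders.
./opencga.sh files link-run --input /data/vcfs/aggregated.vcf.gz --study <study> --path data/vcfs/
```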
Full example: this example creates a new directory and links a VCF file into the new path.
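One possible end-to-end sketch, with placeholder paths and study ID (subcommand names may vary by version):
```bash
# 1. Create a destination folder in Catalog (exact subcommand may vary by version).
./opencga.sh files create --study <study> --path data/vcfs/new_batch/
# 2. Link the VCF file into the new Catalog path.
./opencga.sh files link --input /data/vcfs/sample1.vcf.gz --study <study> --path data/vcfs/new_batch/
```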
Variant storage index
This operation will read the content of the file, run some simple validations to detect possible errors or data corruption, and ingest the variants into the Hadoop system, building high-performance indexes.
Each file index operation will be run by an asynchronous job, to be executed by the OpenCGA Master service.
Unlike the Catalog File Register step, only one file should be provided as input to each Variant storage index command; this will create a separate asynchronous indexing job for each file, which is important in order to avoid job failures.
Use this command to launch a variant index job:
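A sketch of the index command (the subcommand has moved between OpenCGA versions, so check ./opencga.sh variant --help; file name and study are placeholders):
```bash
# Queue an asynchronous variant-index job for a single linked file.
./opencga.sh variant index --file sample1.vcf.gz --study <study>
```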
All the jobs, along with their current status, can be inspected either from IVA or by running this command line:
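For example, assuming the jobs subcommands of the standard client (the study ID is a placeholder):
```bash
# Live, top-like view of jobs in the study.
./opencga.sh jobs top --study <study>
# Or list jobs and their status.
./opencga.sh jobs search --study <study>
```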
Special scenarios
Note: Be aware that misuse of the parameters described below may lead to data corruption. Please ask for support at support@zettagenomics.com or create a ticket in the Zetta Service Desk if you are not sure which option applies to your dataset.
Sample data split by chromosome or region
By default, OpenCGA doesn't allow you to index a VCF file if any of its samples is already indexed as part of another VCF file. This restriction exists to avoid accidental data duplication. If a dataset is split by chromosome or region, the restriction can be bypassed by adding the param --load-split-data <chromosome|region> to the variant index command line.
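For example, for a dataset split by chromosome (file name and study ID are placeholders):
```bash
# Index a per-chromosome file whose samples are already indexed from other files.
./opencga.sh variant index --file chr22.vcf.gz --study <study> --load-split-data chromosome
```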
Multiple files for the same samples
Similarly to the previous scenario, a dataset may contain multiple files for the same set of samples that need to be indexed together, for example when using multiple variant callers for the same sample. In this case, you can bypass the restriction by adding the param --load-multi-file-data.
Family or Somatic callers
When using special callers it is important to specify this in the command line with either --family or --somatic.
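As a sketch, indexing a second somatic VCF from a different caller for an already-indexed sample could combine both flags (file name and study ID are placeholders):
```bash
# Second caller's somatic VCF for samples already indexed from another file.
./opencga.sh variant index --file tumour_caller2.vcf.gz --study <study> \
    --load-multi-file-data --somatic
```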
Variant Annotation
Once all the data is loaded, we need to run the Variant Annotation. This is a key enrichment operation that attaches CellBase Variant Annotations to the loaded data, allowing filtering by a large number of fields.
Find more information at: http://docs.opencb.org/display/cellbase/Variant+Annotation
The Variant Storage Engine will run the annotation only for the new variants, reusing the existing annotations to save time and disk usage. This operation is executed at the project level, so variants shared between studies won't need to be annotated twice.
Similar to the variant-index process, this command line will queue an asynchronous job to be executed by the OpenCGA Master service.
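A sketch, assuming the annotation operation is exposed as operations variant-annotation-index in your client version (the project ID is a placeholder):
```bash
# Queue a variant-annotation job for the whole project.
./opencga.sh operations variant-annotation-index --project <project>
```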
Variant Statistics calculation
The second enrichment operation is the Variant Statistics Calculation. After defining a cohort, you might decide to compute the Variant Stats for that cohort. These statistics include the most typical values, such as allele and genotype frequencies, MAF, average QUAL and FILTER counts.
To update the stats of all the cohorts, or when there are no cohorts in the study apart from the default ALL cohort:
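A sketch, assuming the stats operation is exposed as operations variant-stats-index in your client version (the study ID is a placeholder):
```bash
# Recompute cohort variant stats; ALL is the default cohort containing every sample.
./opencga.sh operations variant-stats-index --study <study> --cohort ALL
```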
Aggregated VCFs
This section is under current development.
If pre-computed stats are encoded in the INFO column of a VCF using standard or non-standard keys, these values can be converted into VariantStats models and used for filtering.
To extract the statistics, you need to create a mapping file between the INFO keys containing the information and their meaning. Each line has this format: <COHORT>.<KNOWN_KEY>=<INFO_KEY>
Then this file needs to be linked in Catalog and referenced when computing the stats.
OpenCGA supports three different ways of codifying this information, known as "aggregation methods". Some of these are named after the public studies that started using them. Each one defines a set of known keys that will be used to parse the statistics.
BASIC: Using standard VCF-spec INFO keys.
AN: Total number of alleles in called genotypes
AC: Total number of alternate alleles in called genotypes
AF: Allele Frequency, for each ALT allele, in the same order as listed
EXAC
HET: Count of HET genotypes. For multi-allelic variants, the genotype order is 0/1, 0/2, 0/3, 0/4... 1/2, 1/3, 1/4... 2/3, 2/4... 3/4...
HOM: Count of HOM genotypes. For multi-allelic variants, the genotype order is 1/1, 2/2, ...
EVS
GTS: List of genotypes
GTC: Genotype counts, ordered according to "GTS"
GROUPS_ORDER: Order of cohorts for key "MAF"
MAF: Minor allele frequency value for each cohort, ordered according to "GROUPS_ORDER"
e.g. Single cohort variant stats custom_mapping.properties:
ALL.AC=AC
ALL.AN=AN
ALL.AF=AF
ALL.HET=AC_Het
ALL.HOM=AC_Hom/2
# Key "HEMI" is not supported
# ALL.HEMI=AC_Hemi
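A sketch of how the linked mapping file might then be referenced when computing the stats; the aggregation method, flag names and file path are illustrative and version-dependent (check the CLI help):
```bash
# Illustrative: compute stats for an aggregated (EVS-style) study using the
# mapping file previously linked in Catalog. Flag names may differ by version.
./opencga.sh operations variant-stats-index --study <study> --cohort ALL \
    --aggregated EVS --aggregation-mapping-file custom_mapping.properties
```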
Variant Secondary Index Build
Secondary indexes are built using the Apache Solr search engine to improve the performance of some queries and aggregations, allowing full-text search and faceted queries against the Variant database.
This secondary index will include the Variant Annotation and all computed Variant Stats. Therefore, this step should be executed only after all annotations and statistics have finished.
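A sketch, assuming the operation is exposed as operations variant-secondary-index in your client version (the project ID is a placeholder):
```bash
# Build or update the Solr secondary index for the whole project.
./opencga.sh operations variant-secondary-index --project <project>
```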
Enrichment Operations
These steps are optional operations that can be computed and indexed to enrich the data displayed in the IVA web application:
Sample Variant Stats
Sample Variant Stats will contain a set of aggregated statistics values for each sample.
These aggregated values can be computed across all variants from each sample, or for a subset of variants selected with a variant filter query, e.g.:
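A sketch, assuming the analysis is exposed as variant sample-stats-run in your client version (study ID and sample IDs are placeholders):
```bash
# Compute sample variant stats for selected samples.
./opencga.sh variant sample-stats-run --study <study> --sample NA12877,NA12878
```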
By default, this analysis will produce a file; optionally, the result can be indexed in the Catalog metadata store under a given ID.
The ID ALL can only be used when no variant query filter is applied.
Cohort Variant Stats
This section is under current development.
Family Index
This section is under current development.