ClinVar — a technical view
ClinVar is one of the largest public repositories of genomic variant interpretation data. Although the ClinVar data is available for download, additional processing steps are necessary to make it useful for variant interpretation in other software tools.
Medical device manufacturers like Limbus must be extra cautious, since they are responsible for ensuring that importing the data into their platform does not alter the data from ClinVar. In our previous post we provided a short overview of ClinVar. It is now time to start a more technical discussion about what kind and what quality of information ClinVar provides.
First of all, these are the ClinVar variant annotations that we provide in varvis™:
ClinVSig: the clinical significance as determined from all related ClinVar submissions in the current ClinVar release.
ClinVarStatus: the ClinVar review status of a variant. This is shown as a star rating in the Variation Report.
ClinVAcc: link with ID to access the ClinVar Variation Report on the ClinVar website, represented by the rsID from dbSNP or, if the rsID is not available, the internal ClinVar variation ID.
Executive summary
varvis generally provides the following standardized values of clinical significance from ClinVar data:
Benign
Benign/Likely_benign
Likely_benign
Conflicting_interpretations_of_pathogenicity
Uncertain_significance
Likely_pathogenic
Pathogenic/Likely_pathogenic
Pathogenic
Note that Benign/Likely_benign is a separate, valid category in Variation Reports, whereas the same combination counts as a conflict in RCVs. The same applies to Pathogenic/Likely_pathogenic.
If a specific variant has been reported as part of a compound variant, we indicate this with the prefix Compound: and display the classification of the compound variant. We thereby ensure that a variant that may be pathogenic in combination with another variant is not missed.
We follow the ClinVar recommendations and also display non-standardized terms such as risk_factor, association, drug_response, protective, other, not_provided, Affects.
Note that the review status of a Variation Report may not reflect the most recent state of clinical evidence. See this example, which shows a variant reviewed as benign by an expert panel where new evidence was submitted by GeneReviews as pathogenic. It appears that the status is not reset automatically when new clinical evidence emerges, and may therefore be outdated.
For those of you who would like to dive a little deeper into the issue, here is a description of the most common technical pitfalls that we ran into:
Introduction: RCVs, SCVs, variation IDs
There is a lot of documentation on ClinVar, but this can also be a little overwhelming.
ClinVar has accessions for submissions (e.g., SCV000012345) and variant/phenotype relationships (e.g., RCV000012345). Additionally, ClinVar aggregates all data for a variant (or a set of variants, e.g. for compound-heterozygous variants) and associates this record with a so-called variation ID.
Since we want to annotate individual variants with the data, these IDs are what we are most interested in: IDs that can be used on the ClinVar website to access all information on the given variant.
For example, RCV000012345 belongs to the variation ID 11588, which is based on one submission, SCV000032579. This submission does not have a separate page, but you can see lists of submissions at the bottom of the RCV and variation report pages.
Issues with ClinVar data
In the following, we highlight a few examples of issues that we encountered when writing software to regularly import ClinVar data.
eXpensive Markup Language
ClinVar comes as a huge XML file, e.g. /ClinVarFullRelease_2017-08.xml.gz: this ought to be "[t]he most detailed comprehensive file [...]" (from the documentation), so we went with this 5 GB XML file and the 40-odd pages of a 'data dictionary' documenting its structure.
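A file of this size is best processed as a stream rather than loaded into memory at once. The following is a minimal sketch using Python's standard library; the record element name `ClinVarSet` is an assumption about the release layout, and the functions `iter_records` and `open_release` are hypothetical helpers, not part of our import code.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_records(stream, record_tag="ClinVarSet"):
    """Yield one record element at a time from a large XML stream.

    record_tag is an assumption about the file layout; adjust as needed.
    """
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            # Runs once the consumer asks for the next record:
            # drop processed children so memory use stays bounded.
            root.clear()


def open_release(path):
    """Open a gzip-compressed release file for streaming (path is caller-supplied)."""
    return gzip.open(path, "rb")


# Tiny in-memory stand-in for the real file, for illustration only.
SAMPLE = b"""<ReleaseSet>
  <ClinVarSet ID="1"><Title>first</Title></ClinVarSet>
  <ClinVarSet ID="2"><Title>second</Title></ClinVarSet>
</ReleaseSet>"""

ids = [rec.get("ID") for rec in iter_records(io.BytesIO(SAMPLE))]
```

With a real release, `iter_records(open_release(path))` would be used in place of the in-memory sample. Each yielded element is only valid until the next one is requested, so anything needed later has to be copied out first.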
Documentation
The data dictionary for the XML structure seems to be outdated (although the release date suggests otherwise). For example, the XML element MeasureTrait is mentioned quite a bit in the documentation (over 30 times) but is nowhere to be found.
Variant representation
ClinVar right-shifts variants, not only in their HGVS cDNA and protein change notation but also in their genomic coordinates. Per HGVS recommendation, cDNA and protein changes should always be right-shifted, so this is also done in varvis. However, we keep the genomic coordinates left-shifted so that we can still annotate each variant as well as possible (via its minimal representation). Otherwise, those variants could not be found in other data sources like gnomAD, dbSNP etc. and would thus not be annotated correctly. Moreover, Yen et al. note that ClinVar also does not right-shift properly in all cases, as it often seems to ignore intron/exon boundaries when doing so.
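To illustrate what left-shifting to a minimal representation means, here is a sketch of the commonly used left-alignment procedure for a single REF/ALT pair. It is not the varvis implementation; the function name `left_align` and the toy reference sequence are assumptions made for this example.

```python
def left_align(seq, pos, ref, alt):
    """Left-align and trim one variant against a reference sequence.

    seq: reference sequence as a string
    pos: 1-based position of the first base of ref within seq
    Returns the normalized (pos, ref, alt).
    """
    if ref == alt:
        raise ValueError("ref and alt are identical; nothing to normalize")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            base = seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat; a deletion of one "CA" unit is reported
# at the right end of the repeat and moved to the left end.
shifted = left_align("GGCACACATT", 6, "ACA", "A")
```

In the toy example the deletion moves from position 6 to position 2, where it becomes `GCA>G`. Both representations describe the same resulting sequence, which is why a lookup in another data source only works if both sides use the same convention.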
Variant uniqueness
After left-shifting and normalization, ClinVar’s variation IDs are not necessarily unique anymore. For example, chr1@76226858:G>GCTAGAATGAGTTA (hg19, 1-based coordinates) refers to both c.1012_1013insTAGAATGAGTTAC and c.999_1011dupTAGAATGAGTTAC. Each variation ID can in turn have multiple RCV records (and each, in turn, multiple SCV records) associated. To resolve this issue, we sort variations by pathogenicity and review status, and then choose the first.
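The selection step can be sketched as follows. The exact ordering is an assumption made for this example (most pathogenic first, then highest star rating, then lowest variation ID as a deterministic tie-breaker), and the class `VariationSummary`, the function `pick_representative` and the variation IDs are hypothetical.

```python
from dataclasses import dataclass

# Assumed ranking, from least to most pathogenic.
SIGNIFICANCE_ORDER = [
    "Benign",
    "Benign/Likely_benign",
    "Likely_benign",
    "Conflicting_interpretations_of_pathogenicity",
    "Uncertain_significance",
    "Likely_pathogenic",
    "Pathogenic/Likely_pathogenic",
    "Pathogenic",
]
SIGNIFICANCE_RANK = {name: rank for rank, name in enumerate(SIGNIFICANCE_ORDER)}


@dataclass(frozen=True)
class VariationSummary:
    variation_id: int
    significance: str
    review_stars: int  # star rating of the review status, 0 to 4


def pick_representative(candidates):
    """Choose one variation among several that normalize to the same variant."""
    if not candidates:
        raise ValueError("no candidates given")
    ranked = sorted(
        candidates,
        key=lambda v: (
            -SIGNIFICANCE_RANK.get(v.significance, -1),  # unknown terms sort last
            -v.review_stars,
            v.variation_id,
        ),
    )
    return ranked[0]


# Hypothetical variation IDs that collapse onto the same normalized variant.
chosen = pick_representative([
    VariationSummary(1001, "Likely_pathogenic", 2),
    VariationSummary(1002, "Pathogenic", 1),
    VariationSummary(1003, "Pathogenic", 2),
])
```

Putting pathogenicity first means a pathogenic classification is never hidden behind a better-reviewed benign one, at the cost of occasionally surfacing a weaker submission.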
Variation 404
We also found that not all variation IDs have valid ‘variation report’ pages on the ClinVar website (although their RCV records do, and those link to them). For example, RCV000022411 links to the page of the corresponding variation ID 1388, which does not exist.
Determination of clinical significance
Another issue of confusing documentation, but quite critical for variant annotation and thus deserving a special mention, is the determination of clinical significance.
It is well documented by the ClinVar team that the clinical significance displayed for RCVs differs from that of the variation reports, mostly because the latter use a more lenient definition of what counts as conflicting data (e.g. the supported ACMG-recommended significances Benign and Likely benign would not be regarded as a conflict in a variation report, but they would be marked as a conflict in an RCV record).
This means that it is not enough to consider all RCV records per variation ID; all SCV records and their clinical significance need to be considered instead. The relevant data cannot be retrieved from the clinical significances associated with the RCVs.
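The difference between the two conflict definitions can be sketched as a function over the significances of all SCV records of a variation. This is a deliberately simplified illustration: it ignores review status and the non-standardized terms, and the function name `aggregate_significance` and its `lenient` flag are assumptions for this example rather than ClinVar's actual aggregation logic.

```python
CONFLICT = "Conflicting_interpretations_of_pathogenicity"


def aggregate_significance(scv_significances, lenient=True):
    """Combine the significances of all SCVs into one value.

    lenient=True mimics the variation report view described above,
    lenient=False the stricter RCV view.
    """
    distinct = set(scv_significances)
    if not distinct:
        raise ValueError("at least one submission is required")
    if len(distinct) == 1:
        return next(iter(distinct))
    if lenient:
        # Neighbouring categories are merged instead of flagged as a conflict.
        if distinct == {"Benign", "Likely_benign"}:
            return "Benign/Likely_benign"
        if distinct == {"Pathogenic", "Likely_pathogenic"}:
            return "Pathogenic/Likely_pathogenic"
    return CONFLICT


submissions = ["Benign", "Likely_benign", "Benign"]
report_view = aggregate_significance(submissions, lenient=True)
rcv_view = aggregate_significance(submissions, lenient=False)
```

The same three submissions yield `Benign/Likely_benign` in the lenient view and a conflict in the strict view, which is why the value has to be computed from the SCV records themselves.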
About varvis and allexes
We address the issues of public variation data repositories with our data network allexes™ which is integrated into our varvis™ platform. In allexes, we collect complete variant and phenotype data in a standardized, de-identified and automated process across many organizations. allexes is scalable and provides transparency about data quality.
About varvis: the clinical decision support system varvis, designed by Limbus Medical Technologies GmbH, Rostock, Germany, is a cloud-based software system to filter and evaluate genetic sequencing data. The varvis platform significantly increases the efficiency and diagnostic yield in clinical genetics. varvis is a Class I medical device compliant with EU regulations. For more information about varvis, visit http://www.varvis.com.