Loading variants into database

When calling make initialize, a series of steps is performed to initialize the MongoDB database by loading variant information. This process is independent from the data generation step, but this information is used to convert genotype files to the same format and to produce the final genotype dataset. Variant information needs to be loaded for both SMARTER species (goat and sheep) in order to perform this data conversion. Some accessory information is also required, for example rs_id, since the same SNP can be called with different names.

The variant collections

The variations are mainly modelled around the Illumina variants described in Illumina chips, since the majority of genotype files are produced using this technology. However, these data also need to support the Affymetrix manufacturer, and even data produced by Whole Genome Sequencing (WGS). To accomplish this, variants have additional fields, as described by the VariantSpecies class, in order to support different sources of information. Variants derived from Affymetrix technology or from WGS are retrieved and replaced with the proper Illumina variants when possible, in order to make comparisons between samples derived from different technologies possible. When genotypes are processed during the data import step, variants are retrieved relying on their names or on attributes like genomic positions or rs_id; genotypes are then checked against the database and converted to the Illumina TOP coding convention.
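
As an illustration, the sketch below retrieves a variant by name, falling back to rs_id, and shows the strand complement step that underlies TOP conversion. It is a minimal sketch assuming pymongo, a local "smarter" database and a "variantSheep" collection: all names, fields and identifiers here are assumptions made for the example, not the project's confirmed API.

from pymongo import MongoClient

# Minimal sketch: the database and collection names ("smarter", "variantSheep")
# and the identifiers below are assumptions, not the project's actual API
client = MongoClient("mongodb://localhost:27017/")
db = client["smarter"]

snp_name, rs_id = "example_snp", "rs0000000"  # hypothetical identifiers

# Retrieve the variant by name first, then fall back to rs_id
variant = db.variantSheep.find_one({"name": snp_name})
if variant is None:
    variant = db.variantSheep.find_one({"rs_id": rs_id})

# Converting a genotype to Illumina TOP may require flipping it to the
# opposite strand; a simplified illustration of that complement step:
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(genotype):
    """Return the same genotype read on the opposite strand."""
    return [COMPLEMENT[allele] for allele in genotype]

print(complement(["A", "G"]))  # ['T', 'C']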

About supported assemblies

Received genotype data can come from different chips which rely on different assemblies. Data generated a long time ago could refer to a very old or deprecated genome assembly, and even data generated with the same chip could have different positions, since probe mapping is a process continuously under revision. Considering this, we can’t trust genomic positions when processing genotype files: the only stable thing in a genotype file is the SNP name, which is the reason why we chose to use SNP names to update genomic positions. We upload evidences for different assemblies and chips in order to represent every SNP processed in genotype files, but we convert genomic positions and genotypes relying on only one evidence: this could introduce some errors that may not be present in the latest assemblies; however, this genome assembly is consistent between genotype files, and this makes it possible to compare genotypes across different datasets produced by different platforms.

When we first upload a SNP, we assign an initial Location object with version and imported_from attributes, which track the genome assembly version and the source of information. This lets us further update the same location if the assembly and the source are the same (for example, with a more recent manifest file), or store another Location object to manage a new genomic position from a different evidence. This also lets us switch from one assembly to another, since all the available genomic locations are stored within the SNP itself.
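
For instance, a variant document could collect multiple Location entries as in the following sketch. Only the version and imported_from attributes come from the description above; all other field names and values ("locations", "chrom", "position", the source labels) are illustrative assumptions:

{
    "name": "example_snp",
    "rs_id": "rs0000000",
    "locations": [
        {
            "version": "Oar_v3.1",
            "imported_from": "manifest",
            "chrom": "1",
            "position": 123456
        },
        {
            "version": "Oar_v4.0",
            "imported_from": "SNPchiMp",
            "chrom": "1",
            "position": 123789
        }
    ]
}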

At this time, the genome assemblies we support are OAR3 for the sheep and ARS1 for the goat genome: they are not the latest assembly versions, but they are supported by genome browsers like Ensembl and UCSC. We plan to support more recent assemblies in the future to facilitate data sharing.

Upload the supported chips

The first step in database initialization is loading the supported chips into SupportedChip documents. You need to prepare a JSON file in which at least the chip name and the species are specified: this chip name will be assigned to the VariantSpecies objects defined within the chip, and also to SampleSpecies and Dataset. Here is an example of such a JSON file:

[
    {
        "name": "IlluminaOvineSNP50",
        "species": "Sheep",
        "manifacturer": "illumina",
        "n_of_snps": 0
    },
    {
        "name": "WholeGenomeSequencing",
        "species": "Sheep"
    }
]

Next, you can upload the chip names using import_snpchips.py:

python src/data/import_snpchips.py --chip_file data/raw/chip_names.json
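
Once loaded, a quick sanity check can confirm that the chips are in the database. The snippet below is a sketch assuming pymongo and a "supportedChips" collection in a "smarter" database; both names are assumptions, not the project's confirmed schema.

from pymongo import MongoClient

# Sanity check sketch: count the chips loaded for each species
# ("smarter" and "supportedChips" names are assumptions)
client = MongoClient("mongodb://localhost:27017/")
db = client["smarter"]

for species in ("Sheep", "Goat"):
    count = db.supportedChips.count_documents({"species": species})
    print(f"{species}: {count} supported chips")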

For more information, see the import_snpchips.py manual page.

Import SNPs from manifest files

In order to define a VariantSpecies object, you need to load the SNP from a manifest file and specify the source of its location. After a SNP object is created, you can add additional location evidences, or update the same genomic location using a more recent manifest file. Since this database is modelled starting from Illumina chips, it’s better to define all the Illumina SNPs first: after that, if an Affymetrix chip has a correspondence with a SNP already present, the new location source can be integrated with the Illumina genotype. To upload SNPs from an Illumina manifest file, simply type:

python src/data/import_manifest.py --species_class sheep \
    --manifest data/external/SHE/ILLUMINA/ovinesnp50-genome-assembly-oar-v3-1.csv.gz \
    --chip_name IlluminaOvineSNP50 --version Oar_v3.1 --sender AGR_BS

where --species_class must be one of sheep or goat, and --manifest, --chip_name and --version specify the manifest file location, a SupportedChip.name already loaded into the database, and the assembly version, respectively. To upload data from an Affymetrix manifest file, there’s another script:

python src/data/import_affymetrix.py --species_class sheep \
    --manifest data/external/SHE/AFFYMETRIX/Axiom_BGovis2_Annotation.r1.csv.gz \
    --chip_name AffymetrixAxiomBGovis2 --version Oar_v3.1

where the required parameters are similar to those of the Illumina import process. For more information, see the import_manifest.py and import_affymetrix.py manual pages.

Import locations from SNPchiMp

Another useful source of information comes from the SNPchiMp database, a project in which SNPs belonging to the Affymetrix and Illumina manufacturers were loaded together with their genome alignments from the dbSNP database: this makes it possible to convert coordinates and genotypes between different genome assemblies. Unfortunately, after dbSNP release 151, SNPs from animals like sheep and goat are no longer managed by NCBI but were transferred to EBI EVA. This implies updating the importing scripts and databases like SNPchiMp. At the moment, SNPchiMp is the main data source for the OAR3, OAR4 and CHI1 assemblies, while the ARS1 assembly is currently managed from the manifest file (which is more recent than SNPchiMp). We plan to re-map the probes and to integrate data with EVA, in order to resolve genomic locations for all the SNPs and to have the latest evidences and cross-reference ids like rs_id. To upload data from SNPchiMp, simply download the entire datafile for a certain assembly and chip, then call the following program:

python src/data/import_snpchimp.py --species_class sheep \
    --snpchimp data/external/SHE/SNPCHIMP/SNPchimp_SHE_SNP50v1_oar3.1.csv.gz \
    --version Oar_v3.1

See the import_snpchimp.py manual page for additional information.

Import locations from genome projects

The last source of evidence modelled by the SMARTER-database comes from sheep and goat genome initiatives like Sheep HapMap and VarGoats, which re-map chips on the latest genome assemblies. However, this mapping process can have some issues (see here, here and here for example), so this source of evidence needs to be revised with the sheep and goat genomic projects. To upload this type of information into the database, you can do as follows:

python src/data/import_isgc.py \
    --datafile data/external/SHE/CONSORTIUM/OvineSNP50_B.csv_v3.1_pos_20190513.csv.gz \
    --version Oar_v3.1
python src/data/import_iggc.py \
    --datafile data/external/GOA/CONSORTIUM/capri4dbsnp-base-CHI-ARS-OAR-UMD.csv.gz \
    --version ARS1 --date "06 Mar 2018" --chrom_column ars1_chr \
    --pos_column ars1_pos --strand_column ars1_strand

Please refer to the import_isgc.py and import_iggc.py manual pages for additional information.