The Data Import Process 

When you call the make data step of the SMARTER-database data generation, a series of steps is performed to process the raw data and generate the final dataset. This document describes how the data import process works and how to add new data to the SMARTER-database.

To add a new dataset to the SMARTER-database, you need to call the following scripts with specific options in the data section of the Makefile. The order in which the import scripts are called matters: importing a sample into the SMARTER-database means generating a unique smarter_id, which needs to be stable in order to track the same object when updating the database or across different releases. Scripts are written to be idempotent: calling the same script twice with the same parameters will produce the same final result.
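The two properties above (order-dependent identifiers and idempotent calls) can be illustrated with a toy sketch. This is not the SMARTER-database code: the SampleRegistry class, its get_or_create method and the XXOA-BRD prefix are hypothetical, and the zero-padded counter only mimics the shape of identifiers such as UYOA-CRR-000003890 shown later in this document.

```python
from itertools import count


class SampleRegistry:
    """Hypothetical in-memory stand-in for the samples collection."""

    def __init__(self, prefix="XXOA-BRD"):
        self.prefix = prefix
        self._ids = {}
        self._counter = count(1)

    def get_or_create(self, dataset, original_id):
        # A sample is identified by its dataset and its original id
        key = (dataset, original_id)
        if key not in self._ids:
            # A new identifier is generated only the first time a sample is seen
            self._ids[key] = f"{self.prefix}-{next(self._counter):09d}"
        return self._ids[key]


registry = SampleRegistry()
first = registry.get_or_create("dataset.zip", "sample_1")
again = registry.get_or_create("dataset.zip", "sample_1")  # same call, same id
other = registry.get_or_create("dataset.zip", "sample_2")

# Importing in a different order on a fresh database assigns different ids
swapped = SampleRegistry()
swapped_other = swapped.get_or_create("dataset.zip", "sample_2")
```

Repeating a call returns the identifier already assigned, while changing the import order on an empty registry changes which identifier each sample receives: this is why the order of the import scripts should not be altered between releases.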

Defining a new dataset 

The data import process starts by defining a dataset as a .zip archive, which can contain genotype or phenotype information (or other metadata). Datasets are also classified as foreground or background, depending on whether they were generated in the context of the SMARTER project or before it. According to the data source type and provenance, you have to define a record in the proper .csv file in the data/raw folder, like the following:

#;File;Uploader;Size;Partner;Country;Species;Breed;N of Individuals;Gene Array;Chip Name
3;ADAPTmap_genotypeTOP_20161201.zip;smarterdatabase-admin;43.68MB;AUTH;36 Countries;Goat;144 breeds;4653;Genotyping data in plink binary format;IlluminaGoatSNP50
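As a quick sanity check before running the import, a record like the one above can be read with the Python standard library. This is only a sketch: the read_dataset_records helper is a hypothetical name, and the only assumption about the file is the semicolon delimiter and the header row shown above.

```python
import csv
import io

RAW = """\
#;File;Uploader;Size;Partner;Country;Species;Breed;N of Individuals;Gene Array;Chip Name
3;ADAPTmap_genotypeTOP_20161201.zip;smarterdatabase-admin;43.68MB;AUTH;36 Countries;Goat;144 breeds;4653;Genotyping data in plink binary format;IlluminaGoatSNP50
"""


def read_dataset_records(handle):
    """Return one dictionary per dataset record, keyed by the header names."""
    return list(csv.DictReader(handle, delimiter=";"))


records = read_dataset_records(io.StringIO(RAW))

for record in records:
    print(record["File"], record["Species"], record["Chip Name"])
```

With a real file, the io.StringIO object can be replaced by an open file handle, for example one opened on data/raw/genotypes-bg.csv.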

Next, the dataset needs to be imported by calling src/data/import_datasets.py with the proper dataset types and input file, as in the following example:

python src/data/import_datasets.py \
    --types genotypes background \
    data/raw/genotypes-bg.csv

This command will add the dataset as a new Dataset object in the SMARTER-database and will unpack its content into a folder named after its MongoDB ObjectID inside the data/interim folder. This lets you analyze and process the dataset content using the SMARTER-database src code. For more information, see the import_datasets.py help.
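The unpacking step can be pictured with the following sketch, which extracts an archive into a folder named after an identifier. It is not the code of import_datasets.py: the unpack_dataset function is hypothetical, the 24-character identifier is a made-up placeholder for a MongoDB ObjectID, and a temporary directory stands in for data/interim.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack_dataset(archive: Path, interim_dir: Path, dataset_id: str) -> Path:
    """Extract archive into interim_dir/dataset_id and return that folder."""
    target = interim_dir / dataset_id
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as handle:
        handle.extractall(target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Build a tiny archive so that the example is self-contained
    archive = tmp_path / "example.zip"
    with zipfile.ZipFile(archive, "w") as handle:
        handle.writestr("example/genotypes.ped", "placeholder\n")

    working_dir = unpack_dataset(
        archive, tmp_path / "interim", "0123456789abcdef01234567"
    )
    extracted = sorted(
        path.relative_to(working_dir).as_posix()
        for path in working_dir.rglob("*")
        if path.is_file()
    )
```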

Exploring data with Jupyter Lab 

Before importing genotypes and samples into the SMARTER-database, there is an additional data exploration step using Jupyter Lab: this step requires manual intervention to understand whether the data can be imported as it is or whether some fixing steps are required. Common issues in datasets are different breeds sharing the same code, or different codes being used for the same breed. Sample names within the genotype file may also differ from the ones used in the metadata: in all those cases, you have to define a new metadata file that maps each value used in the source data to the value to be inserted into the SMARTER-database.
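A check for the two breed code issues described above could look like the following sketch, which can be pasted into a notebook cell. The find_code_conflicts function and the breed/code pairs are made up for illustration and do not come from any SMARTER dataset.

```python
from collections import defaultdict


def find_code_conflicts(rows):
    """Find codes shared by several breeds and breeds with several codes.

    rows is an iterable of (breed name, breed code) pairs.
    """
    breeds_by_code = defaultdict(set)
    codes_by_breed = defaultdict(set)

    for breed, code in rows:
        breeds_by_code[code].add(breed)
        codes_by_breed[breed].add(code)

    shared_codes = {
        code: sorted(breeds)
        for code, breeds in breeds_by_code.items()
        if len(breeds) > 1
    }
    split_breeds = {
        breed: sorted(codes)
        for breed, codes in codes_by_breed.items()
        if len(codes) > 1
    }
    return shared_codes, split_breeds


# Made-up pairs, as they could be read from a metadata file
rows = [
    ("Texel", "TEX"),
    ("Texel", "TXL"),
    ("Merino", "MER"),
    ("Merinizzata", "MER"),
]
shared_codes, split_breeds = find_code_conflicts(rows)
```

Every entry reported by such a check is a candidate for a row in the fixed metadata file.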

In this data exploration step, you can also check the coding format of genotypes by calling the proper SmarterMixin derived class. Metadata can also be integrated with external data sources, which can help fix metadata issues. Start Jupyter Lab (in an activated conda environment) with:

jupyter lab

then create a new notebook according to your needs. These notebooks can also be used after data ingestion to produce reports about the SMARTER-database status. Please see the notebook section in the Cookiecutter Data Science project for more information.

Adding breeds to the database 

Before uploading samples into the SMARTER-database, you have to register a Breed first. If the dataset has one or a few breeds, you can define a new breed object by calling add_breed.py like this:

python src/data/add_breed.py --species_class sheep \
    --name Texel --code TEX --alias TEXEL_UY \
    --dataset TEXEL_INIA_UY.zip

where the --species_class parameter specifies the source species (goat or sheep), --name and --code specify the breed name and code used in the SMARTER-database respectively, --alias specifies the FID (the code) used in the genotype file and --dataset specifies the source dataset of the samples we want to add. If you have to manage many different breeds in the same dataset, it’s better to create breeds from a metadata file. In this case, you can create your new breeds with a different script:

python src/data/import_breeds.py --species_class Sheep \
    --src_dataset=ovine_SNP50HapMap_data.zip \
    --datafile ovine_SNP50HapMap_data/kijas2012_dataset_fix.xlsx \
    --code_column code --breed_column Breed \
    --fid_column Breed --country_column country

In this case, the --src_dataset and --dst_dataset options let you specify the dataset from which the metadata information is retrieved (using the --datafile option) and the dataset to which this information will be applied; the same parameters are accepted by other import scripts that rely on a metadata file and one or two distinct datasets. The other parameters let you specify which columns of the metadata file will be used when defining a new breed. See the import_breeds.py documentation for more information.

Note

Breed name and code are unique within the same species (enforced by MongoDB): if you have the same breed in two different datasets, you need to call these commands twice. The first call will create a new Breed object with the alias used in the first dataset; every other call on the same breed will update the same object to also support the alias used in the other dataset.
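The behaviour described in this note can be sketched with an in-memory registry. This is an illustration only, not the MongoDB model used by the SMARTER-database: the BreedRegistry class, the layout of its records, the other_dataset.zip name and the TXL alias are all hypothetical.

```python
class BreedRegistry:
    """Hypothetical stand-in for the breeds collection."""

    def __init__(self):
        self._breeds = {}

    def add_breed(self, species, name, code, alias, dataset):
        # Name and code are unique within a species: reuse the existing record
        key = (species, code)
        breed = self._breeds.setdefault(
            key, {"name": name, "code": code, "aliases": []}
        )
        if breed["name"] != name:
            raise ValueError(f"code {code} is already used by {breed['name']}")

        # Record the alias only once, so that repeated calls change nothing
        entry = {"fid": alias, "dataset": dataset}
        if entry not in breed["aliases"]:
            breed["aliases"].append(entry)
        return breed


registry = BreedRegistry()
registry.add_breed("sheep", "Texel", "TEX", "TEXEL_UY", "TEXEL_INIA_UY.zip")
registry.add_breed("sheep", "Texel", "TEX", "TXL", "other_dataset.zip")
texel = registry.add_breed("sheep", "Texel", "TEX", "TXL", "other_dataset.zip")
```

After the three calls there is still a single Texel record, holding one alias for each of the two datasets.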

Adding samples to the database 

Samples can be added in two ways: the first is while converting data from genotype files, the second is by processing metadata information. The first approach should be used when you have a single breed in the whole genotype file, and the breed code in the genotype file already has a Breed instance in the SMARTER-database: this is the simplest case, where all the data belongs to the same country and breed. In this situation, you can create samples while processing the genotype file simply by adding the --create_samples flag to the appropriate import script (for more information, see the Processing PLINK-like files, Processing Illumina report files and Processing Affymetrix files sections).

The second approach needs to be used when you have different breeds in your genotype file, or when there is additional information that can’t be derived from the genotype file, like the country of origin, the sample name or breed codes that differ from the values stored in the genotype file. Other scenarios are Illumina report or Affymetrix report files, which don’t track the FID or any other information besides sample names and genotypes. Another case is when your genotype file contains more samples than the metadata file, for example when you want to track only a few samples in the SMARTER-database. In all these cases, samples need to be created before processing genotypes, using the import_samples.py script:

python src/data/import_samples.py --src_dataset Affymetrix_data_Plate_652_660.zip \
    --datafile Affymetrix_data_Plate_652_660/Uruguay_Corriedale_ID_GenotypedAnimals_fix.xlsx \
    --code_all CRR --id_column "Sample Name" \
    --chip_name AffymetrixAxiomOviCan --country_all Uruguay \
    --alias_column "Sample Filename"

Like import_breeds.py, this script takes --src_dataset and --datafile to indicate where the metadata file is located; if the genotype file is located in the same dataset as the metadata, the --dst_dataset parameter can be omitted. Breed code and country can be set to the same value for all samples with the --code_all or --country_all parameters, or can be read from the metadata file, as in the following example:

python src/data/import_samples.py --src_dataset greece_foreground_sheep.zip \
    --dst_dataset AUTH_OVN50KV2_CHIOS_FRIZARTA.zip \
    --datafile greece_foreground_sheep/AUTH_OVN50KV2_CHIOS_FRIZARTA.xlsx \
    --code_column breed_code --id_column sample_name \
    --chip_name IlluminaOvineSNP50 --country_column Country

Please look at the import_samples.py help page for more information about the sample creation process.

Note

Samples are always related to their source dataset, so you can have several samples with the same original_id in the SMARTER-database. However, samples need to be unique within the same dataset, otherwise the genotype conversion step will not work. If your dataset contains two or more samples with the same original_id, you can specify an additional column (like the alias) to identify your samples within genotype files.
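Before creating samples, the uniqueness constraint described in this note can be checked on the metadata with a few lines of Python. This is a sketch under assumptions: the duplicated_ids function, the dictionary layout and the dataset and sample names are hypothetical.

```python
from collections import Counter


def duplicated_ids(samples, dataset):
    """Return the original_id values used more than once within a dataset."""
    counts = Counter(
        sample["original_id"]
        for sample in samples
        if sample["dataset"] == dataset
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up records: the same original_id is fine across datasets,
# but not within the same dataset
samples = [
    {"dataset": "first.zip", "original_id": "animal_1", "alias": "plate1_A1"},
    {"dataset": "first.zip", "original_id": "animal_1", "alias": "plate2_A1"},
    {"dataset": "first.zip", "original_id": "animal_2", "alias": "plate1_A2"},
    {"dataset": "second.zip", "original_id": "animal_2", "alias": "plate1_B1"},
]

duplicates = duplicated_ids(samples, "first.zip")
aliases = [s["alias"] for s in samples if s["dataset"] == "first.zip"]
```

Here animal_1 is duplicated within first.zip, while the alias values are all distinct and could be used to identify the samples instead.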

Processing genotype files 

Genotype data is not added into the SMARTER-database; however, it is validated against the SMARTER-database, which tracks information on SNPs. Genotype data may have been produced a long time ago and with different technologies, so assemblies don’t match and genotype calls need to be standardized in order to be compared. This is particularly true when genotypes are expressed relative to the genomic sequence: since the chip probes can be aligned to the forward or reverse strand, the same SNP can have different genotypes in different assembly versions. For this reason, variants need to be converted in order to compare datasets produced at different times with different approaches. To accomplish this, variants need to be loaded into the database from the manifest, and supplementary information needs to be added to the SMARTER-database: all those steps are managed through the Makefile by calling:

make initialize

before importing datasets into the SMARTER-database. For more information, see the Loading variants into database section of this documentation.

Converting genotypes to Illumina TOP 

All the received genotypes are converted into Illumina TOP format: this coding convention was introduced by Illumina, and its main feature is that SNP orientation is determined from the sequence around the SNP itself. This seems complex, but has the advantage that the SNP coding remains the same even if the SNP database or the genome assembly changes. In detail, Illumina defines as unambiguous a SNP in which exactly one of the two alleles is A or T: SNPs like A/G or A/C are TOP SNPs, while T/C and T/G are BOTTOM SNPs. The remaining, ambiguous cases (A/T and C/G) are resolved using the sequence walking method: starting from the SNP itself, take one letter before and one after it and check whether the resulting pair is ambiguous or not. If the pair is unambiguous, the SNP can be classified as TOP or BOTTOM. If the pair is ambiguous, take the second letter before and after the SNP and check the resulting pair, and so on until a TOP/BOTTOM coding can be assigned to the SNP.
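The classification rules can be sketched in a few lines of Python. This is not the SMARTER-database implementation: the classify_snp function and its flank5/flank3 arguments are hypothetical names, and the rule used to resolve a walked pair (TOP when the A or T of the first unambiguous pair lies on the 5' side, BOT when it lies on the 3' side) is taken from the Illumina technical note rather than from this document, so it should be checked against that note.

```python
AT = {"A", "T"}
CG = {"C", "G"}


def is_unambiguous(first, second):
    """True when exactly one letter is A/T and the other one is C/G."""
    return (first in AT and second in CG) or (first in CG and second in AT)


def classify_snp(alleles, flank5="", flank3=""):
    """Classify a SNP as TOP or BOT.

    flank5 and flank3 are the sequences before and after the SNP,
    both written 5' -> 3' (only needed for A/T and C/G SNPs).
    """
    first, second = (allele.upper() for allele in alleles)

    if is_unambiguous(first, second):
        # A/C and A/G are TOP, T/C and T/G are BOT
        return "TOP" if "A" in (first, second) else "BOT"

    # Sequence walking: compare the i-th letter before and after the SNP
    for before, after in zip(reversed(flank5.upper()), flank3.upper()):
        if is_unambiguous(before, after):
            return "TOP" if before in AT else "BOT"

    raise ValueError("flanking sequence too short to classify this SNP")


print(classify_snp(("A", "G")))                    # unambiguous, no walking
print(classify_snp(("A", "T"), "ACGTA", "CGTTT"))  # resolved at the first pair
print(classify_snp(("A", "T"), "CA", "TA"))        # resolved at the second pair
```

In the last call the first pair (A before, T after) is itself ambiguous, so the walk moves one position further out on both sides before a decision can be taken.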

Figure: sequence walking example for an A/T SNP. Credits: Illumina technical notes

In this example, A/T is ambiguous even though it is composed of A and T. The first pair taken (T/C) is unambiguous, so this example SNP is in BOTTOM orientation. SMARTER genotypes are converted into Illumina TOP: a SNP that is already in TOP coding is used as it is, while all the other cases need to be converted into Illumina TOP. The following is an example of coding conversion for the DU186191_327.1 SNP:

DU186191_327.1 A/G (unambiguous SNP)
smarter_id	source version	source coding	source genotype	top genotype
UYOA-CRR-000003890	OAR4	forward	T C	A G
UYOA-CRL-000000382	OAR3	A/B	A B	A G
NAOA-ADP-000001020	OAR3	top	G A	G A
GROA-CHI-000004137	OAR3	forward	T T	A A
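The rows of this table can be reproduced with a short sketch. It is valid only for unambiguous SNPs like this one and is not the SMARTER-database code: the to_top function and the "ab" label for the A/B coding are hypothetical, and mapping A and B to the first and second TOP allele is an assumption that matches the table above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_top(genotype, coding, top_alleles=("A", "G")):
    """Convert a genotype of an unambiguous SNP into Illumina TOP."""
    if coding == "top":
        # Already in TOP coding: use it as it is
        return tuple(genotype)

    if coding == "ab":
        # A/B coding: A is the first TOP allele, B the second one
        mapping = {"A": top_alleles[0], "B": top_alleles[1]}
        return tuple(mapping[allele] for allele in genotype)

    if coding == "forward":
        # A letter which is not a TOP allele comes from the opposite strand
        return tuple(
            allele if allele in top_alleles else COMPLEMENT[allele]
            for allele in genotype
        )

    raise ValueError(f"unsupported coding: {coding}")


print(to_top(("T", "C"), "forward"))
print(to_top(("A", "B"), "ab"))
print(to_top(("G", "A"), "top"))
print(to_top(("T", "T"), "forward"))
```

The shortcut used for the forward coding works because, for an A/G SNP, the letters T and C can only come from the complementary strand.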

In the previous example it is easy to convert a SNP into Illumina TOP: given that the TOP genotype is A/G, every time a letter T or C is found it needs to be complemented (T becomes A, C becomes G). But how can the SNP OAR1_103790218.1 be converted?

OAR1_103790218.1 C/G (ambiguous SNP)
smarter_id	source version	source coding	source genotype	top genotype
UYOA-CRR-000003890	OAR4	forward	G G	G G
UYOA-CRL-000000382	OAR3	A/B	B B	G G
NAOA-ADP-000001020	OAR3	top	G G	G G
GROA-CHI-000004137	OAR3	forward	C C	G G

This case is more complex: since C is the complement of G, you can’t determine the coding of this genotype from the alleles alone. The only way to determine the genotype coding of this SNP is to check the coding of the other SNPs in the same dataset. The other piece of information required is the orientation of the probe relative to the reference genome. Consider samples UYOA-CRR-000003890 and GROA-CHI-000004137: they have the same TOP genotype but different forward genotypes, because the probe is aligned to different strands in the OAR3 and OAR4 assemblies, so only one of the two genotypes needs to be complemented to get the TOP genotype. All the information about SNP position and strand orientation is stored in Variants and Locations objects, and can be accessed using the proper methods. The genotype conversion is managed by the proper SmarterMixin derived class method, called by the proper import script.
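A sketch of a strand-aware conversion for this SNP follows. It is not how Variants and Locations objects are actually queried: the forward_to_top function and the TOP_STRAND dictionary are hypothetical, and the strand values are assumed for illustration, chosen to be consistent with the two forward-coded rows of the table above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Assumed orientation of the TOP strand relative to the forward strand
# of each assembly for OAR1_103790218.1 (illustrative values only)
TOP_STRAND = {"OAR3": "reverse", "OAR4": "forward"}


def forward_to_top(genotype, assembly, top_strand=TOP_STRAND):
    """Convert a forward-coded genotype into TOP using strand information."""
    if top_strand[assembly] == "forward":
        # TOP and forward strands coincide: nothing to do
        return tuple(genotype)
    # TOP is on the reverse strand: complement each allele
    return tuple(allele.translate(COMPLEMENT) for allele in genotype)


print(forward_to_top(("G", "G"), "OAR4"))  # UYOA-CRR-000003890
print(forward_to_top(("C", "C"), "OAR3"))  # GROA-CHI-000004137
```

Unlike the unambiguous case, the alleles alone carry no hint here: the decision to complement depends entirely on the strand recorded for the SNP in the source assembly.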

So why convert genotypes into Illumina TOP? Because Illumina TOP SNPs are identical in different genome assemblies: if you have a new genome version, you don’t need to convert the genotypes, you only need to update the genomic positions of the SNPs. For this reason, each genotype import script has a --coding option which lets you specify the genotype coding of the source file. The source coding is checked against the SMARTER-database variant information in order to be converted into Illumina TOP coding.

To read more about the Illumina TOP/BOTTOM coding convention, please see the Illumina technical notes documentation, as well as Simple guidelines for identifying top/bottom (TOP/BOT) strand and A/B allele and How to interpret DNA strand and allele information for Infinium genotyping array data.

Processing PLINK-like files 

Genotypes provided as PLINK files (either text or binary) can be imported using the import_from_plink.py script, as in the following example:

python src/data/import_from_plink.py --bfile AUTH_OVN50KV2_CHIOS_FRIZARTA/AUTH_OVN50KV2_CHI_FRI \
    --dataset AUTH_OVN50KV2_CHIOS_FRIZARTA.zip --coding forward \
    --chip_name IlluminaOvineSNP50 --assembly OAR3

The --bfile/--file options (mutually exclusive) let you specify a file prefix (like PLINK does) for a binary or text file respectively. The --dataset option lets you specify which dataset contains the genotype file; the --coding option lets you specify the source coding (if the provided coding does not match the database data, the import process will fail). The --assembly parameter sets the destination assembly version of the converted genotypes. There are also other parameters, for example for source genotypes identified by rs_id or for a source assembly that differs from the destination assembly. For a full list of such options, take a look at the import_from_plink.py help page.

Processing Illumina report files 

Genotypes provided as Illumina reports need to be processed using another script:

python src/data/import_from_illumina.py --report JCM2357_UGY_FinalReport1.txt \
    --snpfile OvineHDSNPList.txt --dataset CREOLE_INIA_UY.zip --breed_code CRL \
    --chip_name IlluminaOvineHDSNP --assembly OAR3 --create_samples

In this case the Illumina report file needs to be specified with the --report option, while the SNP information file needs to be specified with the --snpfile option. This command, like import_from_plink.py and import_from_affymetrix.py, lets you create samples while reading genotypes by using the --create_samples flag. Since Illumina report files don’t track information about FID, the breed code needs to be specified using the --breed_code parameter, which works only for files with samples from a single breed: files with multiple breeds can’t be imported like this, and samples need to be created beforehand with import_samples.py so that the correct information can be retrieved from the SMARTER-database. Please see the import_from_illumina.py manual pages for more information about this program.

Processing Affymetrix files 

Affymetrix genotypes can be provided in report format or in a PLINK-like format (which lacks some columns of standard PLINK files). In this case too, there is a proper script to call and custom parameters to specify:

python src/data/import_from_affymetrix.py \
    --prefix Affymetrix_data_Plate_652_660/Affymetrix_data_Plate_652/Affymetrix_data_Plate_652 \
    --dataset Affymetrix_data_Plate_652_660.zip --breed_code CRR --chip_name AffymetrixAxiomOviCan \
    --assembly OAR3 --sample_field alias --src_version Oar_v4.0 --src_imported_from affymetrix

In this example, the --prefix parameter means that data is loaded from a PLINK-like file. The other input source type can be specified with the --report option. The other parameters have already been described for the other import scripts, with the exception of --sample_field, which lets you search samples using a different attribute, and the source of the assembly (both --src_version and --src_imported_from), which is required to convert genotypes into Illumina TOP. For more information, please see the import_from_affymetrix.py help page.

Adding metadata information 

The next step in the data import pipeline is importing metadata into the SMARTER-database: this data can’t be provided in the final genotype file, so it is made available through the SMARTER-backend with the help of the r-smarter-api R package and the SMARTER-frontend. There are two main scripts to import metadata: import_metadata.py and import_phenotypes.py. import_metadata.py should be used to import GPS coordinates and other generic metadata fields, while import_phenotypes.py should be used to import phenotypes. Both scripts can apply information either to all the samples belonging to the same breed or to each sample belonging to the same dataset, relying on metadata defined for each breed group or for each distinct sample. For example, to load data with GPS coordinates and additional columns you can call import_metadata.py like this:

python src/data/import_metadata.py --src_dataset "High density genotypes of French Sheep populations.zip" \
    --datafile Populations_infos_fix.xlsx --breed_column "Population Name" \
    --latitude_column Latitude --longitude_column Longitude --metadata_column Link \
    --metadata_column POP_GROUP_CODE --metadata_column POP_GROUP_NAME

In this example, metadata are applied by breed using the --breed_column parameter. Parameters like --src_dataset/--dst_dataset and --dataset have the same behavior described for import_samples.py. Additional metadata columns can be loaded by passing the --metadata_column parameter multiple times, each time with the name of the desired column in the metadata file. The same applies to import_phenotypes.py, as shown in the following example:

python src/data/import_phenotypes.py --src_dataset ADAPTmap_phenotype_20161201.zip \
    --dst_dataset ADAPTmap_genotypeTOP_20161201.zip \
    --datafile ADAPTmap_phenotype_20161201/ADAPTmap_InfoSample_20161201_fix.xlsx --id_column ADAPTmap_code \
    --chest_girth_column ChestGirth --height_column Height --length_column Length \
    --additional_column FAMACHA --additional_column WidthOfPinBones

This time, phenotype metadata are loaded for each sample, as indicated by the --id_column parameter. Some parameters describe a single phenotype trait, like --height_column or --length_column, while additional phenotype traits not described by the Phenotype class can be loaded with the --additional_column parameter, which can be specified multiple times.
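How the column options relate to a phenotype record can be pictured with the sketch below. It does not reproduce the Phenotype class: the build_phenotype function, the trait names used as dictionary keys and the values in the example row are hypothetical, and only the column names come from the command above.

```python
def build_phenotype(row, trait_columns, additional_columns=()):
    """Collect phenotype values for one sample from a metadata row.

    trait_columns maps a trait name to the column holding its value;
    additional_columns are stored under their own column names.
    """
    phenotype = {trait: row[column] for trait, column in trait_columns.items()}
    phenotype.update({column: row[column] for column in additional_columns})
    return phenotype


# Made-up row, using the column names of the example command
row = {
    "ADAPTmap_code": "SAMPLE_1",
    "ChestGirth": 80.0,
    "Height": 70.0,
    "Length": 75.0,
    "FAMACHA": 2,
    "WidthOfPinBones": 15.0,
}

phenotype = build_phenotype(
    row,
    trait_columns={
        "chest_girth": "ChestGirth",
        "height": "Height",
        "length": "Length",
    },
    additional_columns=["FAMACHA", "WidthOfPinBones"],
)
sample_id = row["ADAPTmap_code"]  # the value matched through --id_column
```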

Merging datasets together 

The last step of the data import is merging all the processed genotype files into one dataset per species and assembly. You can do it by calling merge_datasets.py like this:

python src/data/merge_datasets.py --species_class sheep --assembly OAR3

This script will search all the processed genotype files for the same species and assembly, and will merge all the genotypes into one file. The final genotype file will be placed in a new directory named after the desired assembly under the data/processed directory.