src.features.smarterdb
Created on Tue Feb 23 16:21:35 2021
@author: Paolo Cozzi <paolo.cozzi@ibba.cnr.it>
Classes:
Breed |
BreedAlias | Required to describe the breed and code used in a certain dataset in order to resolve the final breed to be used in SMARTER-database
Consequence | A class to manage SNP consequences.
Counter | A class to deal with counter collection (created when initializing smarter database) and used to define SMARTER IDs
Country | A helper class to deal with country objects.
Dataset | Describe a dataset instance with fields owned by data types
Location | A class to deal with a SNP location (i.e. position in an assembly for a certain chip or data source)
Phenotype | A class to deal with phenotypes.
Probeset | A class to deal with different affymetrix probesets
SAMPLETYPE | A simple Enum object to define sample type (background or foreground)
SEX | An enum object to manage Sample sex in the same way as plink does
SampleGoat | A class specific for Goat samples
SampleSheep | A class specific for Sheep samples
SampleSpecies | A generic class used to manage Goat or Sheep samples
SmarterInfo | A class to track database status information
SupportedChip | A class to deal with SMARTER-database managed chips
VariantGoat | A class to deal with Goat variations (SNP)
VariantSheep | A class to deal with Sheep variations (SNP)
VariantSpecies | Generic class to deal with Variant (SNP) objects
Exceptions:
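Functions:
complement | Return reverse complement for a base call
getNextSequenceValue | Read from Counter collection and determine the next sequence number to be used for the SMARTER ID
getSmarterId | Generate a new SMARTER ID object using the internal counter collections
get_or_create_breed | Get a Breed instance or create a new one (or update a breed adding a new BreedAlias)
get_or_create_sample | Get or create a sample providing attributes (search for original_id in provided dataset)
get_sample_type | Test if foreground or background dataset
global_connection | Establish a connection to the SMARTER database.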
- class src.features.smarterdb.Breed(*args, **values)[source]
Bases: Document
Miscellaneous:
Attributes:
aliases | A list of BreedAlias objects.
code | The breed code
id | A field wrapper around MongoDB's ObjectIds.
n_individuals | How many samples belong to this breed
name | The breed name
objects([q_obj]) |
species | The breed species.
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- aliases
A list of BreedAlias objects. Required to determine the SMARTER-database breed from the genotype file (which can use a different breed name or code)
- code
The breed code
- id
A field wrapper around MongoDB’s ObjectIds.
- n_individuals
How many samples belong to this breed
- name
The breed name
- objects(q_obj=None, **query) = []
- species
The breed species. Should be one of Goat or Sheep
- class src.features.smarterdb.BreedAlias(*args, **kwargs)[source]
Bases: EmbeddedDocument
Required to describe the breed and code used in a certain dataset in order to resolve the final breed to be used in SMARTER-database
Attributes:
country | The country of the breed in the dataset.
dataset | The dataset ObjectID in which this BreedAlias is used
fid | The breed Family ID used in genotype file
- country
The country of the breed in the dataset. Used in multi country datasets
- dataset
The dataset ObjectID in which this BreedAlias is used
- fid
The breed Family ID used in genotype file
- class src.features.smarterdb.Consequence(*args, **kwargs)[source]
Bases: EmbeddedDocument
A class to manage SNP consequences. Not yet implemented
- class src.features.smarterdb.Counter(*args, **values)[source]
Bases: Document
A class to deal with counter collection (created when initializing smarter database) and used to define SMARTER IDs
Miscellaneous:
Attributes:
id | A unicode string field.
objects([q_obj]) |
sequence_value | 32-bit integer field.
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- id
A unicode string field.
- objects(q_obj=None, **query) = []
- sequence_value
32-bit integer field.
- class src.features.smarterdb.Country(name: Optional[str] = None, *args, **kwargs)[source]
Bases: Document
A helper class to deal with country objects. Each record is created after data import, when database status is updated
Miscellaneous:
Methods:
__init__([name]) | Initialise a document or an embedded document.
Attributes:
alpha_2 | Country 2 letter code (used to derive SMARTER IDs)
alpha_3 | Country 3 letter code
id | A field wrapper around MongoDB's ObjectIds.
name | The Country name
numeric | The country numeric code
objects([q_obj]) |
official_name | Country official name
species | The sample species found within this country
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- __init__(name: Optional[str] = None, *args, **kwargs)[source]
Initialise a document or an embedded document.
- Parameters
values – A dictionary of keys and values for the document. It may contain additional reserved keywords, e.g. “__auto_convert”.
__auto_convert – If True, supplied values will be converted to Python-type values via each field’s to_python method.
_created – Indicates whether this is a brand new document or whether it’s already been persisted before. Defaults to true.
- alpha_2
Country 2 letter code (used to derive SMARTER IDs)
- alpha_3
Country 3 letter code
- id
A field wrapper around MongoDB’s ObjectIds.
- name
The Country name
- numeric
The country numeric code
- objects(q_obj=None, **query) = []
- official_name
Country official name
- species
The sample species found within this country
- class src.features.smarterdb.Dataset(*args, **values)[source]
Bases: Document
Describe a dataset instance with fields owned by data types
Miscellaneous:
Attributes:
breed | The breed of the dataset.
chip_name | The SupportedChip.name attribute of the technology used
contents | Dataset contents as a list
country | The country where the data come from.
doi | The publication DOI of this dataset
file | The source dataset file
gene_array | The technology used to generate data specified by the partner
id | A field wrapper around MongoDB's ObjectIds.
n_of_individuals | Number of individuals in the dataset
n_of_records | Number of records in the phenotype file
objects([q_obj]) |
partner | The partner which owns the dataset
result_dir | Returns the location of the dataset processed directory.
size_ | The file size
species | The species of the data.
trait | Trait described in phenotype file
type_ | Dataset type.
uploader | The partner which uploaded this dataset
working_dir | Returns the location of the dataset working directory.
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- breed
The breed of the dataset. Could have many values
- chip_name
The SupportedChip.name attribute of the technology used
- contents
Dataset contents as a list
- country
The country where the data come from. Could have many values
- doi
The publication DOI of this dataset
- file
The source dataset file
- gene_array
The technology used to generate data specified by the partner
- id
A field wrapper around MongoDB’s ObjectIds.
- n_of_individuals
Number of individuals in the dataset
- n_of_records
Number of records in the phenotype file
- objects(q_obj=None, **query) = []
- partner
The partner which owns the dataset
- property result_dir: PosixPath
Returns the location of the dataset's processed directory, which may or may not exist
- Returns
a subdirectory in /data/processed/
- Return type
PosixPath
- size_
The file size
- species
The species of the data. Could be ‘Sheep’ or ‘Goat’
- trait
Trait described in phenotype file
- type_
Dataset type. Must contain one value from ['genotypes', 'phenotypes'] and one from ['background', 'foreground']
- uploader
The partner which uploaded this dataset
- class src.features.smarterdb.Location(*args, **kwargs)[source]
Bases: EmbeddedDocument
A class to deal with a SNP location (i.e. position in an assembly for a certain chip or data source)
Methods:
__init__(*args, **kwargs) | Initialise a document or an embedded document.
ab2top(genotype[, missing]) | Convert an illumina ab SNP into an illumina top SNP
affy2top(genotype[, missing]) | Convert an affymetrix SNP into an illumina top SNP
forward2top(genotype[, missing]) | Convert an illumina forward SNP into an illumina top SNP
illumina2top(genotype[, missing]) | Convert an illumina SNP into an illumina top SNP
is_ab(genotype[, missing]) | Return True if genotype is compatible with illumina AB coding
is_affymetrix(genotype[, missing]) | Return True if genotype is compatible with affymetrix coding
is_forward(genotype[, missing]) | Return True if genotype is compatible with illumina FORWARD coding
is_illumina(genotype[, missing]) | Return True if genotype is compatible with illumina coding (as it's recorded in manifest)
is_top(genotype[, missing]) | Return True if genotype is compatible with illumina TOP coding
Attributes:
affymetrix_ab | The SNP code read as it is from affymetrix data
alleles | The dbSNP alleles of this SNP
chrom | The chromosome where this SNP is located
consequences | A list of SNP consequences (not yet implemented)
date | Track manufacture date or when this data was last updated
illumina | The SNP code read as it is from illumina data
illumina_forward | The SNP code in illumina forward coding
illumina_strand | The probe orientation in alignment
illumina_top | Return genotype in illumina top format
imported_from | The source of the SNP data
position | The SNP position
ss_id | The SNP submission ID
strand | The strand orientation in alignment
version | The assembly version where this SNP is placed
- __init__(*args, **kwargs)[source]
Initialise a document or an embedded document.
- Parameters
values – A dictionary of keys and values for the document. It may contain additional reserved keywords, e.g. “__auto_convert”.
__auto_convert – If True, supplied values will be converted to Python-type values via each field’s to_python method.
_created – Indicates whether this is a brand new document or whether it’s already been persisted before. Defaults to true.
- ab2top(genotype: list, missing: list = ['0', '-']) → list [source]
Convert an illumina ab SNP into an illumina top SNP
- affy2top(genotype: list, missing: list = ['0', '-']) → list [source]
Convert an affymetrix SNP into an illumina top SNP
- affymetrix_ab
The SNP code read as it is from affymetrix data
- alleles
The dbSNP alleles of this SNP
- chrom
The chromosome where this SNP is located
- consequences
A list of SNP consequences (not yet implemented)
- date
Track manufacture date or when this data was last updated
- forward2top(genotype: list, missing: list = ['0', '-']) → list [source]
Convert an illumina forward SNP into an illumina top SNP
- illumina
The SNP code read as it is from illumina data
- illumina2top(genotype: list, missing: list = ['0', '-']) → list [source]
Convert an illumina SNP into an illumina top SNP
- illumina_forward
The SNP code in illumina forward coding
- illumina_strand
The probe orientation in alignment
- property illumina_top
Return genotype in illumina top format
- imported_from
The source of the SNP data
- is_ab(genotype: list, missing: list = ['0', '-']) → bool [source]
Return True if genotype is compatible with illumina AB coding
- is_affymetrix(genotype: list, missing: list = ['0', '-']) → bool [source]
Return True if genotype is compatible with affymetrix coding
- is_forward(genotype: list, missing: list = ['0', '-']) → bool [source]
Return True if genotype is compatible with illumina FORWARD coding
- is_illumina(genotype: list, missing: list = ['0', '-']) → bool [source]
Return True if genotype is compatible with illumina coding (as it’s recorded in manifest)
- is_top(genotype: list, missing: list = ['0', '-']) → bool [source]
Return True if genotype is compatible with illumina TOP coding
- position
The SNP position
- ss_id
The SNP submission ID
- strand
The strand orientation in alignment
- version
The assembly version where this SNP is placed
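The coding checks and conversions listed for Location can be illustrated with a small standalone sketch. It is not the class's implementation: it assumes the TOP alleles are available as an 'X/Y' string (passed here as a hypothetical illumina_top argument), that AB calls map to the first and second allele respectively, and that missing calls are normalised to '0'.

```python
from typing import List, Sequence

# symbols treated as missing calls, mirroring the documented default
MISSING = ("0", "-")


def is_ab(genotype: Sequence[str], missing: Sequence[str] = MISSING) -> bool:
    """Return True if every call is 'A', 'B' or a missing symbol."""
    return all(call in ("A", "B") or call in missing for call in genotype)


def ab2top(genotype: Sequence[str], illumina_top: str,
           missing: Sequence[str] = MISSING) -> List[str]:
    """Translate AB calls into the alleles of an 'X/Y' TOP string."""
    if not is_ab(genotype, missing):
        raise ValueError(f"{list(genotype)} is not in AB coding")
    first, second = illumina_top.split("/")
    lookup = {"A": first, "B": second}
    # missing calls are normalised to '0' (an assumption of this sketch)
    return [lookup.get(call, "0") for call in genotype]
```

The other converters (forward2top, affy2top, illumina2top) would follow the same check-then-translate shape, each with its own lookup built from the corresponding Location attribute.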
- class src.features.smarterdb.Phenotype(*args, **kwargs)[source]
Bases: DynamicEmbeddedDocument
A class to deal with phenotypes. This is a dynamic document rather than a generic DictField, since some attributes may be constrained to certain values; all other attributes can be set without any assumptions
Attributes:
chest_girth | Floating point number field.
height | Floating point number field.
length | Floating point number field.
purpose | A unicode string field.
- chest_girth
Floating point number field.
- height
Floating point number field.
- length
Floating point number field.
- purpose
A unicode string field.
- class src.features.smarterdb.Probeset(*args, **kwargs)[source]
Bases: EmbeddedDocument
A class to deal with different affymetrix probesets
Attributes:
chip_name | The chip name where this affymetrix probeset comes from
probeset_id | A list of probesets assigned to the same SNP
- chip_name
the chip name where this affymetrix probeset comes from
- probeset_id
A list of probesets assigned to the same SNP
- class src.features.smarterdb.SAMPLETYPE(value)[source]
Bases: Enum
A simple Enum object to define sample type (background or foreground)
Attributes:
- BACKGROUND = 'background'
- FOREGROUND = 'foreground'
- class src.features.smarterdb.SEX(value)[source]
-
An enum object to manage Sample sex in the same way as plink does
Attributes:
Methods:
from_string(value) | Get proper type relying on input string
- FEMALE = 2
- MALE = 1
- UNKNOWN = 0
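As an illustration of the PLINK-style coding above, here is a minimal standalone sketch of such an enum. The class name Sex and the set of labels accepted by from_string are assumptions made for this example; the documented SEX.from_string may accept different inputs.

```python
from enum import Enum


class Sex(Enum):
    """Sample sex using the integer codes listed above."""

    UNKNOWN = 0
    MALE = 1
    FEMALE = 2

    @classmethod
    def from_string(cls, value: str) -> "Sex":
        """Map a free-text label to a member (accepted labels are assumed)."""
        label = str(value).strip().lower()
        if label in ("1", "m", "male"):
            return cls.MALE
        if label in ("2", "f", "female"):
            return cls.FEMALE
        # anything else is treated as unknown in this sketch
        return cls.UNKNOWN
```

Storing the integer value keeps the sample record directly compatible with the sex column of PLINK files.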
- class src.features.smarterdb.SampleGoat(*args, **values)[source]
Bases: SampleSpecies
A class specific for Goat samples
Miscellaneous:
Attributes:
father_id | The father (SIRE) of this animal.
id | A field wrapper around MongoDB's ObjectIds.
mother_id | The mother (DAM) of this animal.
objects([q_obj]) |
species | The species name.
species_class | The generic species class
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- father_id
The father (SIRE) of this animal. Is a reference to another SampleGoat instance
- id
A field wrapper around MongoDB’s ObjectIds.
- mother_id
The mother (DAM) of this animal. Is a reference to another SampleGoat instance
- objects(q_obj=None, **query) = []
- species
The species name. Could be something different from Capra hircus
- species_class = 'Goat'
The generic species class
- class src.features.smarterdb.SampleSheep(*args, **values)[source]
Bases: SampleSpecies
A class specific for Sheep samples
Miscellaneous:
Attributes:
father_id | The father (SIRE) of this animal.
id | A field wrapper around MongoDB's ObjectIds.
mother_id | The mother (DAM) of this animal.
objects([q_obj]) |
species | The species name.
species_class | The generic species class
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- father_id
The father (SIRE) of this animal. Is a reference to another SampleSheep instance
- id
A field wrapper around MongoDB’s ObjectIds.
- mother_id
The mother (DAM) of this animal. Is a reference to another SampleSheep instance
- objects(q_obj=None, **query) = []
- species
The species name. Could be something different from Ovis aries
- species_class = 'Sheep'
The generic species class
- class src.features.smarterdb.SampleSpecies(*args, **values)[source]
Bases: Document
A generic class used to manage Goat or Sheep samples
Attributes:
alias | This is a sample alias, mainly the name used in the genotype file, which can be different from the name specified in the metadata file
breed | The breed full name
breed_code | The breed code
chip_name | The chip name used to define this sample
country | Where this sample comes from
dataset | The dataset where this sample comes from
locations | The sample GPS location as a Point (X, Y -> longitude, latitude).
metadata | Additional metadata (not managed via ORM)
original_id | The sample original ID in the source dataset
phenotype | A Phenotype instance
sex | A SEX instance.
smarter_id | A SMARTER unique and stable identifier
species_class | A generic species (Sheep or Goat).
type_ | A SAMPLETYPE instance (i.e. background or foreground)
Methods:
save(*args, **kwargs) | Custom save method.
- alias
This is a sample alias, mainly the name used in the genotype file, which can be different from the name specified in the metadata file
- breed
The breed full name
- breed_code
The breed code
- chip_name
The chip name used to define this sample
- country
Where this sample comes from
- dataset
The dataset where this sample comes from
- locations
The sample GPS location as a Point (X, Y -> longitude, latitude). Mind that a location is specified in latitude and longitude coordinates. Specifying coordinates header in general is useful to avoid errors
- metadata
Additional metadata (not managed via ORM)
- original_id
The sample original ID in the source dataset
- smarter_id
A SMARTER unique and stable identifier
- species_class = None
A generic species (Sheep or Goat). Used to determine specific methods and to identify the proper data from the database
- type_
A SAMPLETYPE instance (i.e. background or foreground)
- class src.features.smarterdb.SmarterInfo(*args, **values)[source]
Bases: Document
A class to track database status information
Miscellaneous:
Attributes:
id | A unicode string field.
last_updated | When the SMARTER-database was updated for the last time
objects([q_obj]) |
plink_specie_opt | The plink parameters used to generate the final genotype dataset
version | The SMARTER-database version
working_assemblies | A dictionary in which managed assemblies are tracked
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- id
A unicode string field.
- last_updated
When the SMARTER-database was updated for the last time
- objects(q_obj=None, **query) = []
- plink_specie_opt
The plink parameters used to generate the final genotype dataset
- version
The SMARTER-database version
- working_assemblies
A dictionary in which managed assemblies are tracked
- class src.features.smarterdb.SupportedChip(*args, **values)[source]
Bases: Document
A class to deal with SMARTER-database managed chips
Miscellaneous:
Attributes:
id | A field wrapper around MongoDB's ObjectIds.
manifacturer | Who created the chip
n_of_snps | How many SNPs are described within this chip
name | The chip identifier
objects([q_obj]) |
species | The species for which a chip is defined
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- id
A field wrapper around MongoDB’s ObjectIds.
- manifacturer
Who created the chip
- n_of_snps
How many SNPs are described within this chip
- name
The chip identifier
- objects(q_obj=None, **query) = []
- species
The species for which a chip is defined
- class src.features.smarterdb.VariantGoat(*args, **values)[source]
Bases: VariantSpecies
A class to deal with Goat variations (SNP)
Miscellaneous:
Attributes:
id | A field wrapper around MongoDB's ObjectIds.
objects([q_obj]) |
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- id
A field wrapper around MongoDB’s ObjectIds.
- objects(q_obj=None, **query) = []
- class src.features.smarterdb.VariantSheep(*args, **values)[source]
Bases: VariantSpecies
A class to deal with Sheep variations (SNP)
Miscellaneous:
Attributes:
id | A field wrapper around MongoDB's ObjectIds.
objects([q_obj]) |
- exception DoesNotExist
Bases:
DoesNotExist
- exception MultipleObjectsReturned
Bases:
MultipleObjectsReturned
- id
A field wrapper around MongoDB’s ObjectIds.
- objects(q_obj=None, **query) = []
- class src.features.smarterdb.VariantSpecies(*args, **values)[source]
Bases: Document
Generic class to deal with Variant (SNP) objects
Attributes:
affy_snp_id | The affymetrix SNP id
chip_name | The chip names where this SNP could be found
cust_id | The affymetrix customer id (which is the illumina name)
illumina_top | Illumina TOP variant (which is the same independently of locations)
locations | A list of Location objects
name | The name of the SNP.
probesets | A list of Probeset objects
rs_id | The SNP rsID
sender | Who provided this SNP probe
sequence | A dictionary where keys are chip_name, and values are their probe sequences
Methods:
get_location(version[, imported_from]) | Returns location for assembly version and imported source
get_location_index(version[, imported_from]) | Returns location index for assembly version and imported source
save(*args, **kwargs) | Custom save method.
- affy_snp_id
The affymetrix SNP id
- chip_name
The chip names where this SNP could be found
- cust_id
The affymetrix customer id (which is the illumina name)
- get_location(version: str, imported_from='SNPchiMp v.3')[source]
Returns location for assembly version and imported source
- get_location_index(version: str, imported_from='SNPchiMp v.3')[source]
Returns location index for assembly version and imported source
- illumina_top
Illumina TOP variant (which is the same independently of locations)
- name
The name of the SNP. Could be an illumina name or an affymetrix name
- rs_id
The SNP rsID
- sender
Who provided this SNP probe
- sequence
A dictionary where keys are chip_name, and values are their probe sequences
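The lookup performed by get_location and get_location_index can be sketched in plain Python as a search over a list of locations keyed by assembly version and data source. This is only an illustration: the Location dataclass here is a stripped-down stand-in, the functions take the list explicitly instead of being methods, and raising ValueError when zero or several locations match is an assumption, since the documented error behaviour is not stated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Location:
    """Stand-in holding only the fields needed for the lookup."""
    version: str
    imported_from: str
    chrom: str
    position: int


def get_location_index(locations: List[Location], version: str,
                       imported_from: str = "SNPchiMp v.3") -> int:
    """Return the index of the single location matching both keys."""
    matches = [
        index for index, location in enumerate(locations)
        if location.version == version
        and location.imported_from == imported_from
    ]
    if len(matches) != 1:
        # raising here is an assumption of this sketch
        raise ValueError(f"expected one location, found {len(matches)}")
    return matches[0]


def get_location(locations: List[Location], version: str,
                 imported_from: str = "SNPchiMp v.3") -> Location:
    """Return the matching location itself instead of its index."""
    return locations[get_location_index(locations, version, imported_from)]
```

Returning the index as well as the object is convenient when a caller needs to replace one entry of the list in place.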
- src.features.smarterdb.complement(genotype: str) → str [source]
Return reverse complement for a base call
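A minimal sketch of base complementation follows. It is not the module's code: it complements each base in place and leaves any other character (such as a '/' separator) untouched. For a single base call the complement and the reverse complement coincide; whether the documented function also reverses longer strings is not stated here, so this sketch does not reverse.

```python
# complement table for the four DNA bases
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(genotype: str) -> str:
    """Complement every base, leaving other characters unchanged."""
    return "".join(COMPLEMENT.get(char, char) for char in genotype)
```

Applying the function twice returns the original string, which makes it easy to check in a test.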
- src.features.smarterdb.getNextSequenceValue(sequence_name: str, mongodb: Database)[source]
Read from Counter collection and determine the next sequence number to be used for the SMARTER ID
- src.features.smarterdb.getSmarterId(species_class: str, country: str, breed: str) → str [source]
Generate a new SMARTER ID object using the internal counter collections
- Parameters
species_class (str)
country (str)
breed (str)
- Raises
SmarterDBException – Raised when a wrong species, or none, is passed.
- Returns
A new smarter_id.
- Return type
str
- src.features.smarterdb.get_or_create_breed(species_class: str, name: str, code: str, aliases: list = []) → [src.features.smarterdb.Breed, bool] [source]
Get a Breed instance or create a new one (or update a breed adding a new BreedAlias)
- Parameters
species_class (str) – The class of the species (should be ‘Goat’ or ‘Sheep’)
name (str) – The breed full name.
code (str) – The breed code (unique in Sheep and Goats collections).
aliases (list, optional) – A list of BreedAlias objects. The default is [].
- Raises
SmarterDBException – Raised if the breed is not unique.
- Returns
breed (Breed) – A Breed instance.
modified (bool) – True if breed is created (or alias updated).
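The get-or-create behaviour described above (return the existing breed, create it if missing, and report whether anything changed) can be sketched without a database. Everything below is illustrative: the dataclasses are simplified stand-ins for the documented classes, the store dictionary keyed by (species, code) replaces the MongoDB collection, and the breed values used in any call are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class BreedAlias:
    fid: str
    dataset: str  # stands in for the dataset ObjectID
    country: Optional[str] = None


@dataclass
class Breed:
    species: str
    name: str
    code: str
    aliases: List[BreedAlias] = field(default_factory=list)


def get_or_create_breed(store: Dict[Tuple[str, str], Breed],
                        species_class: str, name: str, code: str,
                        aliases: Optional[List[BreedAlias]] = None
                        ) -> Tuple[Breed, bool]:
    """Return (breed, modified) using a dict keyed by (species, code)."""
    modified = False
    key = (species_class, code)
    breed = store.get(key)
    if breed is None:
        # no breed with this species and code yet: create it
        breed = store[key] = Breed(species_class, name, code)
        modified = True
    for alias in aliases or []:
        # only unseen aliases count as a modification
        if alias not in breed.aliases:
            breed.aliases.append(alias)
            modified = True
    return breed, modified
```

The sketch uses None instead of a mutable default for aliases, a common Python idiom that avoids sharing one list between calls.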
- src.features.smarterdb.get_or_create_sample(SampleSpecies: Union[SampleGoat, SampleSheep], original_id: str, dataset: Dataset, type_: str, breed: Breed, country: str, species: Optional[str] = None, chip_name: Optional[str] = None, sex: Optional[SEX] = None, alias: Optional[str] = None) → list[Union[src.features.smarterdb.SampleGoat, src.features.smarterdb.SampleSheep], bool] [source]
Get or create a sample providing attributes (search for original_id in provided dataset)
- Parameters
SampleSpecies (Union[SampleGoat, SampleSheep]) – the class required for insert/update.
original_id (str) – the original_id in the dataset.
dataset (Dataset) – the dataset instance used to register sample.
type_ (str) – sample type. “background” or “foreground” are the only values accepted
country (str) – the country where the sample comes from.
species (str, optional) – The sample species. If None, the default species_class attribute will be used
chip_name (str, optional) – the chip name. The default is None.
alias (str, optional) – an original_id alias. Could be the name used in the genotype file, which could be different from the original_id. The default is None.
- Raises
SmarterDBException – Raised if multiple samples are returned (should never happen).
- Returns
Union[SampleGoat, SampleSheep] – a SampleSpecies instance.
created (bool) – True if the sample is created.
- src.features.smarterdb.get_sample_type(dataset: Dataset)[source]
Test whether a dataset is foreground or background
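One way to picture this check is as a scan of the Dataset.type_ list for exactly one of the two sample types. The sketch below works on a plain list instead of a Dataset instance and returns a string; both choices, and the ValueError raised when the type cannot be determined, are assumptions made for illustration.

```python
from typing import Sequence

# the two sample types a dataset can declare
SAMPLE_TYPES = ("background", "foreground")


def get_sample_type(type_: Sequence[str]) -> str:
    """Return the single sample type found in a type_ style list."""
    found = [value for value in SAMPLE_TYPES if value in type_]
    if len(found) != 1:
        # neither or both types present: the dataset is ambiguous
        raise ValueError(f"cannot determine sample type from {list(type_)}")
    return found[0]
```

A list such as ['genotypes', 'background'] yields 'background', while a list naming neither or both sample types is rejected.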
- src.features.smarterdb.global_connection(database_name: str = 'smarter') → MongoClient [source]
Establish a connection to the SMARTER database. Reads environment parameters using load_dotenv(), returns a MongoClient object.
- Parameters
database_name (str, optional) – The smarter database. The default is ‘smarter’.
- Returns
CLIENT – a mongoclient instance.
- Return type
MongoClient