Getting started
The SMARTER-database project
The SMARTER-database project is a repository where partners of Work Package 4 (WP4) of the SMARTER project can share their genotype and phenotype data. The main objective of this WP is to quantify the genetic diversity in hardy and underutilized breeds and to identify signatures of selection related to specific breed adaptation to geo-climatic environments. New and available data on R&E phenotypic and genotypic information on different breeds from partners, from previous projects and from other WPs will be used to develop strategies to combine such heterogeneous data. To accomplish this task, data need to be standardized, merged and then linked to their metadata.
The SMARTER-database project is a collection of scripts and code to standardize and integrate information in a single place available to WP4 partners and, later, to the whole community. Processed genotype data will be available through FTP, while metadata will be available through the SMARTER-backend with the help of the r-smarter-api R package and the SMARTER-frontend.
This project is structured as described by the Cookiecutter Data Science documentation: the key idea is to structure a data science project in a standardized way. Every folder within the project has a precise scope, which is described in both the Cookiecutter Data Science documentation and in README.md. All data produced within this project is reproducible, and the structure imposed by this project lets people find the data or code of interest and get information on a certain element without needing a full understanding of every script, module or data file inside this project.
There are two major distinct areas regarding the SMARTER data. The first is the database related folder, which keeps information regarding the SMARTER MongoDB instance and is managed using docker-compose. This database needs to be up and running in order to work properly with SMARTER data. Moreover, the database needs to be populated with data like SNP coordinates, which come from SNPchiMp, Ensembl or EVA. There is also the need to upload data coming from custom chips, in order to have a more precise picture of all the variants. Data also need to be integrated with additional information like breeds and their codes. All those steps are managed by python scripts, which are stored in the second area: those scripts interact with data relying on the MongoDB instance and transform the content of the data/raw folder into the data/processed output, which is the final output generated by the SMARTER-database project.
SMARTER-database requirements
SMARTER-database is managed through a conda environment, in which the python executable and other non-python dependencies are specified. Moreover, python dependencies are managed through a requirements.txt file. Dependencies and environment setup are managed through the GNU make command. The MongoDB instance of this project is managed by docker and docker-compose; however, you can configure an environment variable to set up a connection with an external MongoDB instance. See Configure environment variables for more information.
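Before going further, it can help to confirm that the command line tools named in this documentation are reachable. The following is a minimal sketch, not part of the project: the helper name missing_tools and the tool list are assumptions collected from the text of this page.

```python
import shutil

# Tools mentioned in this documentation (assumed list, adjust as needed).
REQUIRED_TOOLS = ("git", "make", "conda", "docker", "docker-compose")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH")
```

Note that conda is often exposed as a shell function, so a missing conda entry here does not necessarily mean that conda is not installed.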
Installation and configuration
Clone this project with GIT
To install the SMARTER-database project, you need to clone it from GitHub using git:
git clone https://github.com/cnr-ibba/SMARTER-database.git
Now enter the cloned directory: from now on, and in the rest of this documentation, this SMARTER-database directory will be referred to as the project home directory:
cd SMARTER-database
export PROJECT_DIR=$PWD
Note
If you plan to install this project in a shared folder, first take a look at Shared folders and permissions, and in particular at the Setting permissions section, in the BIOINFO Guidelines documentation.
Tip
In order to better share this project with other users on the same machine, it's better to clone this project inside a directory with the SGID special permission (see Using SGID for more information).
Warning
Every file you create in a SGID directory will have the correct permissions and ownership; however, if you copy a file through scp or rsync, or you move a file from a non-SGID directory, the permissions will be the standard ones defined for your user. You should check that permissions are correct after moving or copying files, in particular for the data directory. To add the SGID permission on the current directory and its subfolders, you could do something like this:
find . -user $USER -type d -exec chmod g+s {} \;
This command should be called inside an interactive bash login session, since bash will otherwise ignore commands that try to set the SGID permission.
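To check the result after copying or moving files, one option is to list the directories that still lack the SGID bit. This is a minimal sketch using only the Python standard library; the helper name dirs_missing_sgid is hypothetical and not part of the project.

```python
import os
import stat


def dirs_missing_sgid(root):
    """Return directories under root (root included) without the setgid bit."""
    missing = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.stat(dirpath).st_mode
        # stat.S_ISGID is the set-group-ID bit that "chmod g+s" turns on
        if not mode & stat.S_ISGID:
            missing.append(dirpath)
    return missing


if __name__ == "__main__":
    for path in dirs_missing_sgid("."):
        print(path)
```

An empty output means every directory below the current one already carries the SGID permission.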
Configure environment variables
To work properly, SMARTER-database needs some environment variables defined in two environment files. Those files must not be tracked with git for security reasons, and should be defined before you start working with this project.
The first .env file is located inside the database folder and is required in order to start the MongoDB and mongoexpress images and to set up the required collections and validation constraints. Edit the $PROJECT_DIR/database/.env file by setting these four variables:
MONGODB_ROOT_USER=<smarter root database username>
MONGODB_ROOT_PASS=<smarter root database password>
MONGOEXPRESS_USER=<smarter mongoexpress username>
MONGOEXPRESS_PASS=<smarter mongoexpress password>
The second .env file needs to be located in the project home directory and needs to define the credentials required to access the MongoDB instance using a new smarter user (a user allowed to fill up the database and to retrieve information to process the genotype files). Start from this template and set your credentials properly in the $PROJECT_DIR/.env file:
# Environment variables go here, can be read by `python-dotenv` package:
#
# `src/script.py`
# ----------------------------------------------------------------
# import os
# import dotenv
#
# project_dir = os.path.join(os.path.dirname(__file__), os.pardir)
# dotenv_path = os.path.join(project_dir, '.env')
# dotenv.load_dotenv(dotenv_path)
# ----------------------------------------------------------------
#
# DO NOT ADD THIS FILE TO VERSION CONTROL!
MONGODB_SMARTER_USER=<smarter username>
MONGODB_SMARTER_PASS=<smarter password>
MONGODB_SMARTER_HOST=localhost
MONGODB_SMARTER_PORT=27017
Hint
You can configure the MongoDB instance on a different host, or call the import process from another location, by setting the proper MONGODB_SMARTER_HOST and MONGODB_SMARTER_PORT values in the environment file.
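To illustrate how these four variables fit together, here is a minimal sketch that reads a .env file with the standard library only and assembles a MongoDB connection URI. The helper names read_env_file and build_mongodb_uri are hypothetical, the project itself may read the file differently (for example through python-dotenv), and the authSource=admin parameter assumes that the smarter user is created in the admin database.

```python
from pathlib import Path
from urllib.parse import quote_plus


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def build_mongodb_uri(env, database="smarter", auth_source="admin"):
    """Build a connection URI; credentials are percent-encoded."""
    user = quote_plus(env["MONGODB_SMARTER_USER"])
    password = quote_plus(env["MONGODB_SMARTER_PASS"])
    # Defaults mirror the values shown in the template above
    host = env.get("MONGODB_SMARTER_HOST", "localhost")
    port = env.get("MONGODB_SMARTER_PORT", "27017")
    return (
        f"mongodb://{user}:{password}@{host}:{port}/"
        f"{database}?authSource={auth_source}"
    )
```

Percent-encoding the credentials matters when a password contains characters such as @ or /, which would otherwise break the URI.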
Start the MongoDB instance
The MongoDB instance is managed using docker-compose: the database will be created and configured when you start the docker container for the first time. Local files are written in the $PROJECT_DIR/database/mongodb-data folder, which will persist even when stopping and destroying the docker containers. First check that the $PROJECT_DIR/database/.env file is configured correctly, as described in the previous section. Next, in order to avoid annoying messages when saving your mongo-client history, set the sticky bit permission on the mongodb-home directory:
cd $PROJECT_DIR/database
chmod o+wt mongodb-home/
This lets you save and see the mongodb history using a different user than the one used inside the MongoDB docker container. Moreover, this folder can be used to import/export a SMARTER-database dump. Next download, build and initialize the SMARTER-database containers with:
docker-compose pull
docker-compose build
docker-compose up -d
Now it is time to create the smarter user with the same credentials used in your $PROJECT_DIR/.env environment file. You could do this using docker-compose commands:
docker-compose run --rm --user mongodb mongo sh -c 'mongo --host mongo \
--username="${MONGO_INITDB_ROOT_USERNAME}" \
--password="${MONGO_INITDB_ROOT_PASSWORD}"'
Then, from the mongodb terminal, create the smarter user using the values of the $MONGODB_SMARTER_USER and $MONGODB_SMARTER_PASS variables. You need both read and write privileges to update and retrieve smarter data:
use admin
db.createUser({
user: "<user>",
pwd: "<password>",
roles: [{
role: "readWrite",
db: "smarter"
}]
})
For more information on the smarter MongoDB database usage, please refer to the README.md documentation in the $PROJECT_DIR/database folder.
Setting up python environment
In order to install all the conda requirements and libraries, move into $PROJECT_DIR (which is the SMARTER-database folder cloned using git) and then install dependencies using make:
cd $PROJECT_DIR
make create_environment
This will create a SMARTER-database conda environment and will install all the required software (like plink, vcftools, tabix, …). Then you need to manually activate the SMARTER-database environment before installing all the required python dependencies:
conda activate SMARTER-database
make requirements
Note
All project dependencies will be installed in the SMARTER-database conda environment. You will need to activate this environment every time you need to use a SMARTER-database script or dependency.
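A script can guard against running outside the expected environment by inspecting CONDA_DEFAULT_ENV, the variable that conda sets to the name of the active environment. This is a minimal sketch under that assumption; the helper name in_smarter_env is hypothetical.

```python
import os

# Environment name used throughout this documentation
ENV_NAME = "SMARTER-database"


def in_smarter_env(environ=None):
    """Return True when the active conda environment is SMARTER-database."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV") == ENV_NAME


if __name__ == "__main__":
    if not in_smarter_env():
        print(f"Please run 'conda activate {ENV_NAME}' first")
```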
Initialize and populate SMARTER database
In order to populate the SMARTER-database with data, you need to collect the data provided by the partners from the SMARTER repository. Moreover, you have to retrieve and collect information from databases like SNPchiMp, Ensembl or EVA. You will also need information from Illumina or Affymetrix manifest files in order to deal with different types of genotype files. Raw unprocessed files and external source files need to be placed in their proper folders: all data received from the SMARTER partners need to be placed in the data/raw folder in the SMARTER $PROJECT_DIR directory, in a foreground or background folder depending on whether the data was produced in the context of the SMARTER project or is available outside this project. External source files, like manifests, database dumps and other support files, need to be placed in the data/external directory.
Within this project, external support files are organized by species (GOA and SHE for goat and sheep respectively) and by data source (i.e., SNPCHIMP, ILLUMINA, AFFYMETRIX, etc.). Those data files are not shipped with this GitHub project: you need to ask the developer and the SMARTER WP4 coordinators to have access to this data.
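The folder layout described above can be sketched as follows. This is only an illustration: the helper names are hypothetical, and nesting the data source folders inside the species folders (for example data/external/SHE/SNPCHIMP), as well as giving every species every source, are assumptions that should be checked against the actual project.

```python
from pathlib import Path

SPECIES = ("GOA", "SHE")  # goat and sheep
SOURCES = ("SNPCHIMP", "ILLUMINA", "AFFYMETRIX")  # the sources named above


def expected_data_dirs(project_dir):
    """List the raw and external folders described in this section."""
    data = Path(project_dir) / "data"
    dirs = [data / "raw" / "foreground", data / "raw" / "background"]
    # Assumed nesting: data/external/<species>/<source>
    dirs += [
        data / "external" / species / source
        for species in SPECIES
        for source in SOURCES
    ]
    return dirs


def create_data_dirs(project_dir):
    """Create any missing folder and return the full list."""
    created = expected_data_dirs(project_dir)
    for directory in created:
        directory.mkdir(parents=True, exist_ok=True)
    return created
```

Calling create_data_dirs on the project home directory only prepares empty folders; the data files themselves still have to be obtained as described above.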
Process raw data and create the final dataset
In order to process raw data, insert data into the SMARTER database, generate the SMARTER IDs and create the final genotype dataset files, there are mainly two steps, both managed using the make command. In the first step, you will upload all the external information into the database: simply type (inside the SMARTER-database conda environment):
make initialize
to upload all the external information on variants into the database. This step is described in detail in the Loading variants into database section.
In the next step, you will process each sample by generating a SMARTER ID, and you will insert phenotypes and other sample related metadata into the SMARTER database. The final output of this step will be the generation of the final genotype files. Like before, simply type:
make data
Output data will be placed in folders depending on the assembly version used, with all the genotypes in the same format and using the same reference system. Those folders will be placed in the data/processed folder. For more detailed information about all the processes called within this step, please see The Data Import Process documentation.
The last step in data generation is made available with:
make publish
which will pack your genotype files in order to be shared with other partners using the SMARTER FTP repository.
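The three make targets above are meant to run in order. As a minimal sketch, a small wrapper could run them in sequence and stop at the first failure; the function name run_targets and the injectable runner argument are hypothetical and only serve the illustration.

```python
import subprocess
from pathlib import Path

# Targets named in this documentation, in the order they are meant to run
TARGETS = ("initialize", "data", "publish")


def run_targets(project_dir, targets=TARGETS, runner=subprocess.run):
    """Run each make target in order and return the completed ones."""
    done = []
    for target in targets:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later targets are not attempted after a failure
        runner(["make", target], cwd=Path(project_dir), check=True)
        done.append(target)
    return done
```

As with the manual commands, this has to be launched from within the activated SMARTER-database conda environment.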
Database management through docker-compose
The SMARTER MongoDB docker-compose image in the database folder bind-mounts the database/mongodb-home/ folder, in which you can put files to be imported into or exported from the database. This means that you can place here a file to be imported into the database, or you can export a collection outside SMARTER-database. The following sections describe how to dump and restore a full SMARTER-database instance:
Restore SMARTER database from a mongodump file
In order to restore the SMARTER database from a dump file:
docker-compose run --rm --user mongodb mongo sh -c 'mongorestore --host mongo \
--username="${MONGO_INITDB_ROOT_USERNAME}" \
--password="${MONGO_INITDB_ROOT_PASSWORD}" --authenticationDatabase admin \
--db=smarter --drop --preserveUUID --gzip \
--archive=/home/mongodb/smarter.archive.gz'
After that, you can log in to the smarter database by calling the mongodb client like this:
docker-compose run --rm --user mongodb mongo sh -c 'mongo --host mongo \
--username="${MONGO_INITDB_ROOT_USERNAME}" --password="${MONGO_INITDB_ROOT_PASSWORD}" \
--authenticationDatabase=admin smarter'
Dump SMARTER-database
In order to dump the SMARTER database to a file:
docker-compose run --rm --user mongodb mongo sh -c 'mongodump --host mongo \
--username="${MONGO_INITDB_ROOT_USERNAME}" \
--password="${MONGO_INITDB_ROOT_PASSWORD}" --authenticationDatabase admin \
--db=smarter --gzip --archive=/home/mongodb/smarter.archive.gz'
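When dumps are taken regularly, it may be convenient to put a date in the archive name. The sketch below only assembles the same docker-compose command shown above, with a different archive file name; the helper names and the dated naming scheme are assumptions, not project conventions.

```python
import shlex
from datetime import date


def dated_archive_name(day):
    """Build an archive name such as smarter.2024-01-31.archive.gz."""
    return f"smarter.{day.isoformat()}.archive.gz"


def mongodump_command(archive_name):
    """Return the docker-compose argument list for a gzipped dump."""
    archive = f"/home/mongodb/{archive_name}"
    # The ${...} variables are left untouched on purpose: they are
    # expanded by the shell running inside the container
    inner = (
        "mongodump --host mongo "
        '--username="${MONGO_INITDB_ROOT_USERNAME}" '
        '--password="${MONGO_INITDB_ROOT_PASSWORD}" '
        "--authenticationDatabase admin "
        f"--db=smarter --gzip --archive={shlex.quote(archive)}"
    )
    return ["docker-compose", "run", "--rm", "--user", "mongodb",
            "mongo", "sh", "-c", inner]


if __name__ == "__main__":
    print(shlex.join(mongodump_command(dated_archive_name(date.today()))))
```

The returned list can be passed to subprocess.run with cwd set to the $PROJECT_DIR/database folder; the archive then appears in the database/mongodb-home/ folder on the host.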