Getting started

The SMARTER-database project

The SMARTER-database project is a repository where partners of Work Package 4 (WP4) of the SMARTER project can share their genotype and phenotype data. The main objective of this WP is to quantify the genetic diversity in hardy and underutilized breeds and to identify signatures of selection related to specific breed adaptation to geo-climatic environments. New and available R&E phenotypic and genotypic data on different breeds, coming from partners, from previous projects and from other WPs, will be used to develop strategies to combine such heterogeneous data. To accomplish this task, data need to be standardized, merged and then linked to their metadata.

The SMARTER-database project is a collection of scripts and code to standardize and integrate information in a single place, available first to WP4 partners and later to the whole community. Processed genotype data will be available through FTP, while metadata will be available through the SMARTER-backend with the help of the r-smarter-api R package and the SMARTER-frontend.

This project is structured as described by the Cookiecutter Data Science documentation: the key idea is to structure a data science project in a standardized way. Every folder within the project has a precise scope, which is described both in the Cookiecutter Data Science documentation and in README.md. All data produced within this project is reproducible, and the structure imposed by this project lets people understand where to find the data or code of interest, so they can get information on a certain element without a full understanding of every script, module or data file inside this project.

There are two major distinct areas regarding the SMARTER data. The first is the database-related folder, which keeps information regarding the SMARTER MongoDB instance and is managed using docker-compose. This database needs to be up and running in order to work properly with SMARTER data. Moreover, the database needs to be populated with data like SNP coordinates, which come from SNPchiMp, Ensembl or EVA. There is also the need to upload data coming from custom chips, in order to have a more precise picture of all the variants. Data also need to be integrated with additional information like breeds and their codes. All those steps are managed by python scripts, which are stored in the second area: those scripts interact with the data through the MongoDB instance and transform the content of the data/raw folder into the data/processed output, which is the final output generated by the SMARTER-database project.

SMARTER-database requirements

SMARTER-database is managed through a conda environment, in which the python executable and other non-python dependencies are specified. Python dependencies, in turn, are managed through a requirements.txt file. Dependencies and environment setup are managed through the GNU make command. The MongoDB instance of this project is managed by docker and docker-compose; however, you can configure environment variables to set up a connection with an external MongoDB instance. See Configure environment variables for more information.

Installation and configuration

Clone this project with GIT

In order to install the SMARTER-database project, you need to clone it from GitHub using git:

git clone https://github.com/cnr-ibba/SMARTER-database.git

Now enter the cloned directory. From now on, and in the rest of this documentation, this SMARTER-database directory will be referred to as the project home directory:

cd SMARTER-database
export PROJECT_DIR=$PWD

Note

If you plan to install this project in a shared folder, first take a look at Shared folders and permissions, and in particular at the Setting permissions section, in the BIOINFO Guidelines documentation.

Tip

In order to better share this project with other users on the same machine, it's better to clone this project inside a directory with the SGID special permission (see Using SGID for more information).

Warning

Every file you create in an SGID directory will have the correct permissions and ownership. However, if you copy a file through scp or rsync, or you move a file from a non-SGID directory, the permissions will be the standard ones defined for your user. You should check that permissions are correct after moving or copying files, in particular for the data directory. To add the SGID permission to the current directory and its subfolders, you can run:

find . -user $USER -type d -exec chmod g+s {} \;

Run this command inside an interactive bash login session, since bash may otherwise ignore commands which try to set the SGID permission.
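To verify the result of the command above, the following minimal Python sketch lists the directories that still lack the SGID bit. The function names has_sgid and directories_missing_sgid are hypothetical helpers written for this example; they are not part of the SMARTER-database code base.

```python
from __future__ import annotations

import os
import stat
from pathlib import Path


def has_sgid(mode: int) -> bool:
    """Return True if the SGID bit is set in a st_mode value."""
    return bool(mode & stat.S_ISGID)


def directories_missing_sgid(root: str) -> list[Path]:
    """Walk root (included) and collect every directory without the SGID bit."""
    missing = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        if not has_sgid(os.stat(dirpath).st_mode):
            missing.append(Path(dirpath))
    return missing
```

Calling directories_missing_sgid(os.environ["PROJECT_DIR"]) after copying or moving files should return an empty list when every directory has the expected permission.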

Configure environment variables

In order to work properly, SMARTER-database needs some environment variables defined in two environment files. Those files must not be tracked with Git for security reasons, and should be defined before you start working with this project.

The first .env file is located inside the database folder and is required in order to start the MongoDB and mongoexpress images and to set up the required collections and validation constraints. Edit the $PROJECT_DIR/database/.env file by setting these four variables:

MONGODB_ROOT_USER=<smarter root database username>
MONGODB_ROOT_PASS=<smarter root database password>
MONGOEXPRESS_USER=<smarter mongoexpress username>
MONGOEXPRESS_PASS=<smarter mongoexpress password>

The second .env file needs to be located in the project home directory and needs to define the credentials required to access the MongoDB instance as a new smarter user (a user allowed to fill up the database and to retrieve information to process the genotype files). Start from this template and set your credentials properly in the $PROJECT_DIR/.env file:

# Environment variables go here, can be read by `python-dotenv` package:
#
#   `src/script.py`
#   ----------------------------------------------------------------
#    import os
#    import dotenv
#
#    project_dir = os.path.join(os.path.dirname(__file__), os.pardir)
#    dotenv_path = os.path.join(project_dir, '.env')
#    dotenv.load_dotenv(dotenv_path)
#   ----------------------------------------------------------------
#
# DO NOT ADD THIS FILE TO VERSION CONTROL!
MONGODB_SMARTER_USER=<smarter username>
MONGODB_SMARTER_PASS=<smarter password>
MONGODB_SMARTER_HOST=localhost
MONGODB_SMARTER_PORT=27017

Hint

You can configure the MongoDB instance on a different host, or call the import process from another location by setting the proper MONGODB_SMARTER_HOST and MONGODB_SMARTER_PORT values in the environment file.
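As an illustration of how these four variables fit together, here is a minimal sketch that assembles a MongoDB connection URI from them. The build_mongodb_uri helper is hypothetical, and the way the project scripts actually open their connection may differ. The authSource=admin default is an assumption, based on the smarter user being created in the admin database later in this guide.

```python
import os
from urllib.parse import quote_plus


def build_mongodb_uri(env=None, auth_source="admin"):
    """Build a MongoDB URI from the MONGODB_SMARTER_* variables.

    env is any mapping with the variables (defaults to os.environ).
    auth_source="admin" is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    # Credentials must be percent-encoded to be valid inside a URI
    user = quote_plus(env["MONGODB_SMARTER_USER"])
    password = quote_plus(env["MONGODB_SMARTER_PASS"])
    # Fall back to the values shown in the template above
    host = env.get("MONGODB_SMARTER_HOST", "localhost")
    port = int(env.get("MONGODB_SMARTER_PORT", "27017"))
    return f"mongodb://{user}:{password}@{host}:{port}/?authSource={auth_source}"
```

After loading $PROJECT_DIR/.env (for example with python-dotenv, as in the template comment), build_mongodb_uri() returns a string that a MongoDB client library can accept as its connection URI.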

Start the MongoDB instance

The MongoDB instance is managed using docker-compose: the database will be created and configured when you start the docker container for the first time. Local files are written to the $PROJECT_DIR/database/mongodb-data folder, which persists even after the docker containers are stopped and destroyed. First check that the $PROJECT_DIR/database/.env file is configured correctly, as described in the previous section. Next, in order to avoid annoying messages when saving your mongo-client history, make the mongodb-home directory world-writable with the sticky bit set:

cd $PROJECT_DIR/database
chmod o+wt mongodb-home/

This lets you save and read the mongodb history as a user different from the one used inside the MongoDB docker container. Moreover, this folder can be used to import/export a SMARTER-database dump. Next, download, build and initialize the SMARTER-database containers with:

docker-compose pull
docker-compose build
docker-compose up -d

Now it is time to create the smarter user with the same credentials used in your $PROJECT_DIR/.env environment file. You can do this by opening a mongo shell through docker-compose:

docker-compose run --rm --user mongodb mongo sh -c 'mongo --host mongo \
  --username="${MONGO_INITDB_ROOT_USERNAME}" \
  --password="${MONGO_INITDB_ROOT_PASSWORD}"'

Then, from the mongo shell, create the smarter user using the values of the $MONGODB_SMARTER_USER and $MONGODB_SMARTER_PASS variables. This user requires both read and write privileges (the readWrite role) to update and retrieve smarter data:

use admin
db.createUser({
  user: "<user>",
  pwd: "<password>",
  roles: [{
    role: "readWrite",
    db: "smarter"
  }]
})

For more information on the smarter MongoDB database usage, please refer to the README.md documentation in the $PROJECT_DIR/database folder.

Setting up python environment

In order to install all the conda requirements and libraries, move into the $PROJECT_DIR (which is the SMARTER-database folder cloned using git) and then install dependencies using make:

cd $PROJECT_DIR
make create_environment

This will create a SMARTER-database conda environment and will install all the required software (like plink, vcftools, tabix, …). Then you need to manually activate the SMARTER-database environment before installing all the required python dependencies:

conda activate SMARTER-database
make requirements

Note

All project dependencies will be installed in the SMARTER-database conda environment. You will need to activate this environment every time you need to use a SMARTER-database script or dependency.

Initialize and populate SMARTER database

In order to populate the SMARTER-database with data, you need to collect the data provided by the partners from the SMARTER repository. Moreover, you have to retrieve and collect information from databases like SNPchiMp, Ensembl or EVA. You will also need information from Illumina or Affymetrix manifest files in order to deal with different types of genotype files. Raw unprocessed files and external source files need to be placed in their proper folders: all data received from the SMARTER partners need to be placed in the data/raw folder in the SMARTER $PROJECT_DIR directory, in a foreground or background folder depending on whether the data was produced in the context of the SMARTER project or is available outside this project. External source files, like manifests, database dumps and other support files, need to be placed in the data/external directory. Within this project, external support files are organized by species (GOA and SHE for goat and sheep respectively) and by data source (i.e., SNPCHIMP, ILLUMINA, AFFYMETRIX, etc.). Those data files are not shipped with this GitHub project: you need to ask the developers and the SMARTER WP4 coordinators for access to this data.
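The folder layout described above can be sketched with a couple of path helpers. This is only an illustration: raw_dir and external_dir are hypothetical names, and nesting the data source folder inside the species folder is an assumption of this sketch, since the text only states that files are organized by species and by data source.

```python
from pathlib import Path

# Species codes mentioned in this documentation
SPECIES_CODES = {"goat": "GOA", "sheep": "SHE"}


def raw_dir(project_dir, foreground: bool) -> Path:
    """Folder for partner data: foreground if produced within SMARTER."""
    kind = "foreground" if foreground else "background"
    return Path(project_dir) / "data" / "raw" / kind


def external_dir(project_dir, species: str, source: str) -> Path:
    """Folder for support files of a species and data source.

    The species/source nesting order is an assumption of this sketch.
    """
    code = SPECIES_CODES[species.lower()]
    return Path(project_dir) / "data" / "external" / code / source.upper()
```

For example, external_dir(project_dir, "sheep", "snpchimp") points to data/external/SHE/SNPCHIMP under the project home directory.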

Process raw data and create the final dataset

In order to process raw data, insert data into the SMARTER database, generate the SMARTER IDs and create the final genotype dataset files, there are mainly two steps, both managed using the make command. In the first step, you will upload all the external information into the database: simply type (inside the SMARTER-database conda environment):

make initialize

to upload all the external information on variants in the database. This step is described in detail in the Loading variants into database section.

In the next step, you will process each sample by generating a SMARTER ID, and you will insert phenotypes and other sample related metadata into the SMARTER database. The final output of this step will be the generation of the final genotype files. Like before, simply type:

make data

Output data will be placed in folders that depend on the assembly version used, with all the genotypes in the same format and using the same reference system. Those folders will be placed in the data/processed folder. For more detailed information about all the processes called within this step, please see The Data Import Process documentation. The last step in data generation is made available with:

make publish

which will pack your genotype files in order to be shared with other partners using the SMARTER FTP repository.
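The three make targets above are meant to be run in order. As a minimal sketch, assuming you want to drive them from a script, the hypothetical run_pipeline helper below calls each target in sequence and stops at the first failure. The runner parameter exists only to make the function easy to test and is not part of the project.

```python
import subprocess

# Order taken from this documentation: initialize, then data, then publish
PIPELINE_TARGETS = ("initialize", "data", "publish")


def run_pipeline(project_dir, targets=PIPELINE_TARGETS, runner=subprocess.run):
    """Run each make target in project_dir, stopping at the first failure."""
    completed = []
    for target in targets:
        # With subprocess.run, check=True raises CalledProcessError
        # when make exits with a non-zero status
        runner(["make", target], cwd=project_dir, check=True)
        completed.append(target)
    return completed
```

The script has to be launched from within the activated SMARTER-database conda environment, so that make finds the expected executables.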

Database management through docker-compose

The SMARTER MongoDB docker-compose configuration in the database folder bind-mounts the database/mongodb-home/ folder, in which you can put files to be inserted into, or retrieved from, the database. This means that you can place here a file to be imported into the database, or you can export a collection outside the SMARTER-database. The following sections describe how to dump and restore a full SMARTER-database instance:

Restore SMARTER database from a mongodump file

In order to restore the SMARTER database from a dump file:

docker-compose run --rm --user mongodb mongo sh -c 'mongorestore --host mongo \
  --username="${MONGO_INITDB_ROOT_USERNAME}" \
  --password="${MONGO_INITDB_ROOT_PASSWORD}" --authenticationDatabase admin \
  --db=smarter --drop --preserveUUID --gzip \
  --archive=/home/mongodb/smarter.archive.gz'

After that, you can log in to the smarter database by calling the mongodb client like this:

docker-compose run --rm --user mongodb mongo sh -c 'mongo --host mongo \
  --username="${MONGO_INITDB_ROOT_USERNAME}" --password="${MONGO_INITDB_ROOT_PASSWORD}" \
  --authenticationDatabase=admin smarter'

Dump SMARTER-database

In order to dump the SMARTER database to a file:

docker-compose run --rm --user mongodb mongo sh -c 'mongodump --host mongo \
  --username="${MONGO_INITDB_ROOT_USERNAME}" \
  --password="${MONGO_INITDB_ROOT_PASSWORD}" --authenticationDatabase admin \
  --db=smarter --gzip --archive=/home/mongodb/smarter.archive.gz'
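If you need to script the dump, for instance to write archives with different names, the command above can be assembled as an argument list. This is a minimal sketch: build_dump_command is a hypothetical helper, and the archive_name parameter is an addition of this example. The inner command is kept as a single string so that the MONGO_INITDB_* variables are expanded inside the container, as in the shell version.

```python
import shlex


def build_dump_command(archive_name="smarter.archive.gz", database="smarter"):
    """Return the docker-compose mongodump command as an argv list."""
    # Quote the archive path so unusual file names cannot break the inner shell
    archive = shlex.quote(f"/home/mongodb/{archive_name}")
    inner = (
        "mongodump --host mongo "
        '--username="${MONGO_INITDB_ROOT_USERNAME}" '
        '--password="${MONGO_INITDB_ROOT_PASSWORD}" '
        "--authenticationDatabase admin "
        f"--db={database} --gzip --archive={archive}"
    )
    return ["docker-compose", "run", "--rm", "--user", "mongodb",
            "mongo", "sh", "-c", inner]
```

The resulting list can be passed to subprocess.run with cwd set to the $PROJECT_DIR/database folder, where the docker-compose configuration lives.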