Getting started
===============
.. contents:: Table of Contents
The SMARTER-database project
----------------------------
The SMARTER-database project is a repository where partners of Work Package 4 (WP4)
of the `SMARTER project `__ can share their genotype
and phenotype data. The main objective of this WP is to quantify the genetic diversity
in hardy and underutilized breeds and identify signatures of selection
related to specific breed adaptation to geo-climatic environments.
New and available data on R&E phenotypic and genotypic information on different
breeds from partners, from previous projects and from other WPs will be
used to develop strategies to combine such heterogeneous data. To accomplish this
task, data need to be standardized, merged and then referred to their metadata.
The `SMARTER-database `__ project
is a collection of scripts and code to standardize and integrate information
in a single place available to WP4 partners and, later, to the whole
community. Processed genotype data will be available through FTP, while
metadata will be available through the
`SMARTER-backend `__
with the help of the `r-smarter-api `__
R package and `SMARTER-frontend `__.
This project is structured as described by `Cookiecutter Data Science`_
documentation: the key idea is to structure a data science project in a standardized
way. Every folder within the project has a precise scope which is described in both `Cookiecutter Data Science`_
documentation and in `README.md `__.
All data produced within this project are reproducible, and the
structure imposed by this project lets people understand where to find
data or code of interest in order to get information on a certain element without
needing a full understanding of every script, module or data file inside this project.
There are two major distinct areas regarding the SMARTER data. The first is the database
related folder, which keeps information regarding the SMARTER `MongoDB`_ instance and is
managed using `docker-compose`_. This database needs to be up and running in order to
work properly with SMARTER data. Moreover, the database needs to be populated with data
like SNP coordinates, which come from `SNPchiMp`_, `Ensembl`_ or `EVA`_.
Data coming from custom chips also need to be uploaded, in order to have
a more precise picture of all the variants, and data need to be integrated with
additional information like breeds and their codes. All those steps are managed
by python scripts, which are stored in the second area: those scripts interact
with data relying on the `MongoDB`_ instance and transform the content of the ``data/raw``
folder into the ``data/processed`` output, which is the final output generated by
the *SMARTER-database* project.
SMARTER-database requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*SMARTER-database* is managed through a `conda `__
environment, in which the python executable
and other non-python dependencies are specified. Python dependencies are
managed using `poetry `__. Dependencies and environment setup
are managed through the GNU ``make`` command. The MongoDB instance of this project is managed
by `docker `__ and `docker-compose `__;
however, you can configure an environment variable to set up a connection with
an external `MongoDB`_ instance. See :ref:`Configure environment variables` for more
information.
Installation and configuration
------------------------------
Clone this project with GIT
^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to install *SMARTER-database* project, you need to clone it
`from GitHub `__ using git:

.. code-block:: bash

   git clone https://github.com/cnr-ibba/SMARTER-database.git

Now enter the cloned directory. From now on, and in the rest of this
documentation, this ``SMARTER-database`` directory will be
referred to as **the project home directory**:

.. code-block:: bash

   cd SMARTER-database
   export PROJECT_DIR=$PWD

.. note::

   If you plan to install this project in a shared folder, first take a look at
   `Shared folders and permissions `__
   and in particular at the `Setting permissions `__
   section in the `BIOINFO Guidelines `__
   documentation.

.. tip::

   In order to better share this project with other users on the same machine, it is
   better to clone this project inside a directory with the **SGID** special permission
   (see `Using SGID `__
   for more information).

.. warning::

   Every file you create in a **SGID** directory will have the correct permissions
   and ownership. However, if you **copy** a file through ``scp`` or ``rsync``, or you
   move a file from a non-**SGID** directory, the permissions will be the standard
   ones defined for your user. You should check that permissions are correct after
   *moving* or *copying* files, in particular for the ``data`` directory. To add the
   **SGID** permission to the current directory and its subfolders, you could run:

   .. code-block:: bash

      find . -user $USER -type d -exec chmod g+s {} \;

   This command should be called inside an *interactive bash login session*, since
   bash will otherwise ignore commands which try to set the **SGID** permission.

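
After copying or moving files, it can help to list the directories that still
lack the **SGID** bit. The following is a minimal sketch using only the Python
standard library; the helper name ``dirs_missing_sgid`` is hypothetical and is
not part of the *SMARTER-database* code base:

.. code-block:: python

   import os
   import stat


   def dirs_missing_sgid(root):
       """Return the directories under ``root`` (root included) without the SGID bit."""
       missing = []
       # os.walk does not follow symbolic links by default
       for dirpath, _dirnames, _filenames in os.walk(root):
           mode = os.stat(dirpath).st_mode
           if not mode & stat.S_ISGID:
               missing.append(dirpath)
       return sorted(missing)

Calling ``dirs_missing_sgid("data")`` from the project home directory would return
the sub-directories of ``data`` on which ``chmod g+s`` still needs to be applied.
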
Configure environment variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to work properly, *SMARTER-database* needs some environment variables defined
in two environment files. Those files **must not be tracked with git** for security
reasons, and should be defined **before** you start working with this project.
The first ``.env`` file is located inside the ``database`` folder and is required
in order to start the `MongoDB `__
and `mongoexpress `__ images
and to set up the required collections and validation constraints.
Edit the ``$PROJECT_DIR/database/.env`` file by setting these four variables::

   MONGODB_ROOT_USER=
   MONGODB_ROOT_PASS=
   MONGOEXPRESS_USER=
   MONGOEXPRESS_PASS=

The second ``.env`` file needs to be located in the **project home directory** and
needs to define the credentials required to access the MongoDB instance using a
new *smarter* user (a user allowed to fill up the database and to retrieve information
to process the genotype files). Start from this template and set your credentials
properly in the ``$PROJECT_DIR/.env`` file::

   # Environment variables go here, can be read by `python-dotenv` package:
   #
   #   `src/script.py`
   #   ----------------------------------------------------------------
   #    import os
   #    import dotenv
   #
   #    project_dir = os.path.join(os.path.dirname(__file__), os.pardir)
   #    dotenv_path = os.path.join(project_dir, '.env')
   #    dotenv.load_dotenv(dotenv_path)
   #   ----------------------------------------------------------------
   #
   # DO NOT ADD THIS FILE TO VERSION CONTROL!
   MONGODB_SMARTER_USER=
   MONGODB_SMARTER_PASS=
   MONGODB_SMARTER_HOST=localhost
   MONGODB_SMARTER_PORT=27017

.. hint::

   You can configure the MongoDB instance on a different host, or call the import
   process from another location by setting the proper ``MONGODB_SMARTER_HOST``
   and ``MONGODB_SMARTER_PORT`` values in the environment file.

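
To illustrate how these four variables fit together, here is a minimal sketch
that assembles a MongoDB connection URI from them. The function name
``build_mongo_uri`` is hypothetical, and the ``authSource=admin`` option is an
assumption based on the *smarter* user being created in the ``admin`` database;
the project scripts may build their connection differently:

.. code-block:: python

   import os
   from urllib.parse import quote_plus


   def build_mongo_uri(env=os.environ, database="smarter", auth_source="admin"):
       """Build a MongoDB URI from the MONGODB_SMARTER_* variables."""
       # credentials are percent-encoded, since they may contain ':' '/' or '@'
       user = quote_plus(env["MONGODB_SMARTER_USER"])
       password = quote_plus(env["MONGODB_SMARTER_PASS"])
       # host and port fall back to the defaults shown in the template
       host = env.get("MONGODB_SMARTER_HOST", "localhost")
       port = env.get("MONGODB_SMARTER_PORT", "27017")
       return (
           f"mongodb://{user}:{password}@{host}:{port}/"
           f"{database}?authSource={auth_source}"
       )

Passing a plain dictionary instead of ``os.environ`` makes the function easy to
try out without touching the real environment.
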
Start the MongoDB instance
^^^^^^^^^^^^^^^^^^^^^^^^^^
The *MongoDB* instance is managed using ``docker-compose``: the database will
be created and configured when you start the docker container for the first time.
Local files are written to the ``$PROJECT_DIR/database/mongodb-data`` directory, which
persists even after the docker containers are stopped and destroyed. First check
that the ``$PROJECT_DIR/database/.env`` file is configured correctly, as described in the
:ref:`Configure environment variables` section. Next, in order to avoid annoying
messages when saving your mongo-client history, make the ``mongodb-home`` directory
writable by other users and set its *sticky bit*:

.. code-block:: bash

   cd $PROJECT_DIR/database
   chmod o+wt mongodb-home/

This lets you save and view the mongodb history as a user different from the
one used inside the MongoDB docker container. Moreover, this folder can be used
to import/export a *SMARTER-database* dump.
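
The effect of ``chmod o+wt`` can be verified from Python. The following is a small
sketch (the helper name ``is_world_writable_sticky`` is hypothetical) that checks
both the write permission for other users and the sticky bit:

.. code-block:: python

   import os
   import stat


   def is_world_writable_sticky(path):
       """Return True if ``path`` is writable by others and has the sticky bit set."""
       mode = os.stat(path).st_mode
       return bool(mode & stat.S_IWOTH) and bool(mode & stat.S_ISVTX)
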
Next download, build and initialize the *SMARTER-database* containers with:

.. code-block:: bash

   docker-compose pull
   docker-compose build
   docker-compose up -d

Now it is time to create the *smarter* user with the same credentials used in
your ``$PROJECT_DIR/.env`` environment file. You can do this by opening a mongo
client through *docker-compose*:

.. code-block:: bash

   docker-compose run --rm --user mongodb mongo sh -c 'mongo --host mongo \
     --username="${MONGO_INITDB_ROOT_USERNAME}" \
     --password="${MONGO_INITDB_ROOT_PASSWORD}"'

Then, from the mongodb terminal, create the *smarter* user using the values
of the ``$MONGODB_SMARTER_USER`` and ``$MONGODB_SMARTER_PASS`` variables as ``user`` and ``pwd``.
This user needs both *read* and *write* privileges to update and retrieve smarter data:

.. code-block:: javascript

   use admin
   db.createUser({
     user: "",
     pwd: "",
     roles: [{
       role: "readWrite",
       db: "smarter"
     }]
   })

For more information on the smarter *MongoDB* database usage, please refer to the
`README.md `__
documentation in the ``$PROJECT_DIR/database`` folder.
Setting up python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to install all the conda requirements and libraries, move into the ``$PROJECT_DIR``
(which is the *SMARTER-database* folder cloned using git) and then install dependencies
using make:

.. code-block:: bash

   cd $PROJECT_DIR
   make create_environment

This will create a ``SMARTER-database`` conda environment and will install all the
required software (like `plink `__,
`vcftools `__,
`tabix `__, ...).
Then you need to manually activate the ``SMARTER-database`` environment before installing all
the required python dependencies:

.. code-block:: bash

   conda activate SMARTER-database
   make requirements

.. note::

   All project dependencies will be installed in the ``SMARTER-database`` conda
   environment. You will need to activate this environment every time you need
   to use a *SMARTER-database* script or dependency.

Initialize and populate SMARTER database
----------------------------------------
In order to populate the *SMARTER-database* with data, you need to collect the data
provided by the partners from the `SMARTER repository `__.
Moreover, you have to retrieve and collect information from databases like
`SNPchiMp`_, `Ensembl`_ or `EVA`_. You will also need information from
*Illumina* or *Affymetrix* manifest files in order to deal with different types
of genotype files. *Raw unprocessed files* and external *source files* need to be placed
in their proper folders: all data received from the SMARTER partners need to be placed
in the ``data/raw`` folder of the SMARTER ``$PROJECT_DIR`` directory, in a ``foreground``
or ``background`` folder depending on whether the data were produced in the context of the
SMARTER project or are available outside this project. External source files, like manifests,
database dumps and other support files, need to be placed in the ``data/external`` directory.
Within this project, external support files are organized by species (``GOA`` and ``SHE``
for *goat* and *sheep* respectively) and by data source (e.g. ``SNPCHIMP``, ``ILLUMINA``,
``AFFYMETRIX``). Those data files are not shipped with this GitHub project:
you need to ask the developer and the SMARTER WP4 coordinators for access to these data.
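
Before running the import, it can be useful to check that the folders described
above exist. The sketch below is only an illustration: the helper names are
hypothetical, and the exact nesting ``data/external/<species>/<source>`` is an
assumption derived from the description above, not a layout confirmed by the
project:

.. code-block:: python

   from pathlib import Path

   SPECIES = ("GOA", "SHE")  # goat and sheep
   SOURCES = ("SNPCHIMP", "ILLUMINA", "AFFYMETRIX")  # examples of data sources


   def expected_data_dirs(project_dir):
       """List the data folders described in this section (assumed nesting)."""
       root = Path(project_dir) / "data"
       dirs = [root / "raw" / "foreground", root / "raw" / "background"]
       dirs += [root / "external" / sp / src for sp in SPECIES for src in SOURCES]
       return dirs


   def missing_data_dirs(project_dir):
       """Return the expected folders that do not exist yet."""
       return [d for d in expected_data_dirs(project_dir) if not d.is_dir()]

An empty list returned by ``missing_data_dirs`` means that every folder assumed
here is in place; the files themselves still have to be obtained from the
partners.
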
Process raw data and create the final dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to process raw data, insert data into the SMARTER database, generate the SMARTER ids
and create the final genotype dataset files, there are mainly two steps, both
managed using the ``make`` command. In the first step, you
will upload all the external information on *variants* into the database: simply type (inside
the ``SMARTER-database`` conda environment):

.. code-block:: bash

   make initialize

This step is described in detail in the :ref:`Loading variants into database` section.
In the next step, you will process each sample by generating a *SMARTER ID*,
and you will insert phenotypes and other sample related metadata into the SMARTER
database. The output of this step is the final set of genotype
files. Like before, simply type:

.. code-block:: bash

   make data

Output data will be placed in folders that depend on the assembly version used,
with all the genotypes in the same format and using the same reference system.
Those folders will be placed in the ``data/processed`` folder. For more detailed information
about all the processes called within this step, please see
:ref:`The Data Import Process` documentation.
The last step in data generation is:

.. code-block:: bash

   make publish

which will pack your genotype files in order to share them with other partners using
the SMARTER FTP repository.
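
The three ``make`` targets above have to be run in order, and a later step makes
little sense if an earlier one failed. As a rough sketch (the function name
``run_make_targets`` is hypothetical, and it assumes the script is launched from
within the activated ``SMARTER-database`` conda environment), the sequence could
be automated like this:

.. code-block:: python

   import subprocess

   TARGETS = ("initialize", "data", "publish")


   def run_make_targets(project_dir, targets=TARGETS, run=subprocess.run):
       """Run each make target in order, stopping at the first failure."""
       completed = []
       for target in targets:
           # check=True raises CalledProcessError on a non-zero exit status,
           # so the following targets are not executed
           run(["make", target], cwd=project_dir, check=True)
           completed.append(target)
       return completed

The ``run`` parameter only exists to allow replacing ``subprocess.run`` with a
stub while experimenting.
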
Database management through docker-compose
------------------------------------------
The SMARTER MongoDB docker-compose image in the ``database`` folder *bind
mounts* the ``database/mongodb-home/`` folder, in which you can put files to be
inserted into or retrieved from the database. This means that you can place here a file
to be imported into the database, or you can export a collection outside *SMARTER-database*.
The following sections describe how to dump and restore a full *SMARTER-database* instance.
Restore SMARTER database from a *mongodump* file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to restore the SMARTER database from a dump file:

.. code-block:: bash

   docker-compose run --rm --user mongodb mongo sh -c 'mongorestore --host mongo \
     --username="${MONGO_INITDB_ROOT_USERNAME}" \
     --password="${MONGO_INITDB_ROOT_PASSWORD}" --authenticationDatabase admin \
     --db=smarter --drop --preserveUUID --gzip \
     --archive=/home/mongodb/smarter.archive.gz'

After that, you can log in to the *smarter* database by calling the mongodb
client like this:

.. code-block:: bash

   docker-compose run --rm --user mongodb mongo sh -c 'mongo --host mongo \
     --username="${MONGO_INITDB_ROOT_USERNAME}" --password="${MONGO_INITDB_ROOT_PASSWORD}" \
     --authenticationDatabase=admin smarter'

Dump SMARTER-database
^^^^^^^^^^^^^^^^^^^^^
In order to dump the SMARTER database to a file:

.. code-block:: bash

   docker-compose run --rm --user mongodb mongo sh -c 'mongodump --host mongo \
     --username="${MONGO_INITDB_ROOT_USERNAME}" \
     --password="${MONGO_INITDB_ROOT_PASSWORD}" --authenticationDatabase admin \
     --db=smarter --gzip --archive=/home/mongodb/smarter.archive.gz'

.. _`Cookiecutter Data Science`: https://drivendata.github.io/cookiecutter-data-science/
.. _`MongoDB`: https://www.mongodb.com/
.. _`docker-compose`: https://docs.docker.com/compose/
.. _`SNPchiMp`: http://webserver.ibba.cnr.it/SNPchimp/
.. _`Ensembl`: https://www.ensembl.org/index.html
.. _`EVA`: https://www.ebi.ac.uk/eva/