Getting started
===============

.. This is where you describe how to get set up on a clean install, including the
   commands necessary to get the raw data (using the `sync_data_from_s3` command,
   for example), and then how to make the cleaned, final data sets.

.. contents:: Table of Contents

The SMARTER-database project
----------------------------

The SMARTER-database project is a repository where partners of Work Package 4
(WP4) of the `SMARTER project `__ can share their genotype and phenotype data.
The main objective of this WP is to quantify the genetic diversity in hardy and
underutilized breeds and identify signatures of selection related to specific
breed adaptation to geo-climatic environments. New and available data on R&E
phenotypic and genotypic information on different breeds from partners, from
previous projects and from other WPs will be used to develop strategies to
combine such heterogeneous data. To accomplish this task, data need to be
standardized, merged and then linked to their metadata. The
`SMARTER-database `__ project is a collection of scripts and code to
standardize and integrate information in a unique place available to WP4
partners and later to the whole community. Processed genotype data will be
available through FTP, while metadata will be available through the
`SMARTER-backend `__ with the help of the `r-smarter-api `__ R package and
`SMARTER-frontend `__.

This project is structured as described by the `Cookiecutter Data Science`_
documentation: the key idea is to structure a data science project in a
standardized way. Every folder within the project has a precise scope, which is
described in both the `Cookiecutter Data Science`_ documentation and in
`README.md `__.
All data produced within this project is reproducible, and the structure imposed
by this project lets people understand where to find data or code of interest,
so that they can get information on a certain element without a full
understanding of every script, module or data file inside this project.

There are two major distinct areas regarding the SMARTER data. The first is the
database-related folder, which keeps information regarding the SMARTER
`MongoDB`_ instance and is managed using `docker-compose`_. This database needs
to be up and running in order to work properly with SMARTER data. Moreover, the
database needs to be populated with data like SNP coordinates, which come from
`SNPchiMp`_, `Ensembl`_ or `EVA`_. There is also the need to upload data coming
from custom chips, in order to have a more precise picture of all the variants.
Data also need to be integrated with additional information like breeds and
their codes. All those steps are managed by Python scripts, which are stored in
the second area: those scripts interact with the data relying on the `MongoDB`_
instance and transform the content of the ``data/raw`` folder into the
``data/processed`` output, which is the final output generated by the
*SMARTER-database* project.

SMARTER-database requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*SMARTER-database* is managed through a `conda `__ environment, in which the
Python executable and other non-Python dependencies are specified. Moreover,
Python dependencies are managed using `poetry `__. Dependencies and environment
setup are managed through the GNU ``make`` command. The MongoDB instance of
this project is managed by `docker `__ and `docker-compose `__; however, you can
configure an environment variable to set up a connection with an external
`MongoDB`_ instance. See :ref:`Configure environment variables` for more
information.
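
As a quick sanity check before installing, the executables named in the
requirements above can be looked up on the ``PATH``. The following is only a
minimal sketch and not part of the project: ``REQUIRED_TOOLS`` and
``find_missing_tools`` are hypothetical names introduced here for illustration,
and the list of tools is an assumption taken from the paragraph above.

```python
import shutil

# Assumption: these are the executables mentioned in the requirements above.
# Adjust the tuple to match your own setup.
REQUIRED_TOOLS = ("conda", "poetry", "make", "docker", "docker-compose")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All required executables were found on PATH")
```

The ``which`` parameter exists only so the lookup can be replaced in tests; in
normal use the default ``shutil.which`` is enough.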

Installation and configuration
------------------------------

Clone this project with GIT
^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to install the *SMARTER-database* project, you need to clone it
`from GitHub `__ using git:

.. code-block:: bash

   git clone https://github.com/cnr-ibba/SMARTER-database.git

Now enter the cloned directory: from now on, and in the rest of this
documentation, this ``SMARTER-database`` directory will be referred to as
**the project home directory**:

.. code-block:: bash

   cd SMARTER-database
   export PROJECT_DIR=$PWD

.. note::

   If you plan to install this project in a shared folder, first take a look at
   `Shared folders and permissions `__ and in particular at the
   `Setting permissions `__ section in the `BIOINFO Guidelines `__
   documentation.

.. tip::

   In order to better share this project with other users on the same machine,
   it is better to clone this project inside a directory with the **SGID**
   special permission (see `Using SGID `__ for more information).

.. warning::

   Every file you create in a **SGID** directory will have the correct
   permissions and ownership. However, if you **copy** a file through ``scp``
   or ``rsync``, or you move a file from a non-**SGID** directory, the
   permissions will be the standard ones defined for your user. You should
   check that permissions are correct after *moving* or *copying* files, in
   particular for the ``data`` directory.

To add the **SGID** permission to the current directory and its subfolders, you
can run:

.. code-block:: bash

   find . -user $USER -type d -exec chmod g+s {} \;

This command should be called inside an *interactive bash login session*, since
bash will otherwise ignore commands which try to set the **SGID** permission.

Configure environment variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to work properly, *SMARTER-database* needs some environment variables
defined in two environment files.
Those files **must not be tracked with GIT** for security reasons, and should be
defined **before** you start working with this project.

The first ``.env`` file is located inside the ``database`` folder and is
required in order to start the `MongoDB `__ and `mongoexpress `__ images and to
set up the required collections and validation constraints. So edit the
``$PROJECT_DIR/database/.env`` file by setting these four variables::

   MONGODB_ROOT_USER=
   MONGODB_ROOT_PASS=
   MONGOEXPRESS_USER=
   MONGOEXPRESS_PASS=

The second ``.env`` file needs to be located in the **project HOME directory**
and needs to define the credentials required to access the MongoDB instance
using a new *smarter* user (a user allowed to fill up the database and to
retrieve the information needed to process the genotype files). Start from this
template and set your credentials properly in the ``$PROJECT_DIR/.env`` file::

   # Environment variables go here, can be read by `python-dotenv` package:
   #
   #   `src/script.py`
   #   ----------------------------------------------------------------
   #    import os
   #    import dotenv
   #
   #    project_dir = os.path.join(os.path.dirname(__file__), os.pardir)
   #    dotenv_path = os.path.join(project_dir, '.env')
   #    dotenv.load_dotenv(dotenv_path)
   #   ----------------------------------------------------------------
   #
   # DO NOT ADD THIS FILE TO VERSION CONTROL!
   MONGODB_SMARTER_USER=
   MONGODB_SMARTER_PASS=
   MONGODB_SMARTER_HOST=localhost
   MONGODB_SMARTER_PORT=27017

.. hint::

   You can configure the MongoDB instance on a different host, or call the
   import process from another location, by setting the proper
   ``MONGODB_SMARTER_HOST`` and ``MONGODB_SMARTER_PORT`` values in the
   environment file.

Start the MongoDB instance
^^^^^^^^^^^^^^^^^^^^^^^^^^

The *MongoDB* instance is managed using ``docker-compose``: the database will be
created and configured when you start the docker container for the first time.
Local files are written to the ``$PROJECT_DIR/database/mongodb-data`` folder,
which persists even after stopping and destroying the docker containers.
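
Since the database is configured the first time the containers start, it can
help to confirm beforehand that no variable in ``database/.env`` was left
empty. The following is a minimal sketch under stated assumptions, not a script
shipped with the project: ``parse_env`` and ``unset_keys`` are hypothetical
helpers, and the parser only understands plain ``KEY=VALUE`` lines.

```python
from pathlib import Path

# Variable names listed above for ``database/.env``.
REQUIRED_KEYS = (
    "MONGODB_ROOT_USER",
    "MONGODB_ROOT_PASS",
    "MONGOEXPRESS_USER",
    "MONGOEXPRESS_PASS",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def unset_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are missing or have an empty value."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_file = Path("database") / ".env"  # relative to $PROJECT_DIR
    if env_file.exists():
        print("Unset variables:", unset_keys(env_file.read_text()) or "none")
    else:
        print(f"{env_file} not found")
```

The same check can be pointed at ``$PROJECT_DIR/.env`` by passing the
``MONGODB_SMARTER_*`` names as the ``required`` argument.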

First check that the ``$PROJECT_DIR/database/.env`` file is configured correctly
as described by the section :ref:`before `. Next, in order to avoid annoying
messages when saving your mongo-client history, set the *sticky dir* permission
on ``mongodb-home``:

.. code-block:: bash

   cd $PROJECT_DIR/database
   chmod o+wt mongodb-home/

This lets you save and see the mongodb history using a different user than the
one used inside the MongoDB docker container. Moreover, this folder can be used
to import/export a *SMARTER-database* dump. Next download, build and initialize
the *SMARTER-database* containers with:

.. code-block:: bash

   docker-compose pull
   docker-compose build
   docker-compose up -d

Now it is time to create the *smarter* user with the same credentials used in
your ``$PROJECT_DIR/.env`` environment file. You can do this using
*docker-compose* commands:

.. code-block:: bash

   docker-compose run --rm --user mongodb mongo sh -c 'mongo --host mongo \
     --username="${MONGO_INITDB_ROOT_USERNAME}" \
     --password="${MONGO_INITDB_ROOT_PASSWORD}"'

Then, from the mongodb terminal, create the *smarter* user using the values of
the ``$MONGODB_SMARTER_USER`` and ``$MONGODB_SMARTER_PASS`` variables. This
user needs both *read* and *write* privileges in order to update and retrieve
smarter data:

.. code-block:: javascript

   use admin
   db.createUser({
     user: "",
     pwd: "",
     roles: [{ role: "readWrite", db: "smarter" }]
   })

For more information on the smarter *MongoDB* database usage, please refer to
the `README.md `__ documentation in the ``$PROJECT_DIR/database`` folder.

Setting up python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to install all the conda requirements and libraries, move into
``$PROJECT_DIR`` (which is the *SMARTER-database* folder cloned using git) and
then install dependencies using make:

.. code-block:: bash

   cd $PROJECT_DIR
   make create_environment

This will create a ``SMARTER-database`` conda environment and install all the
required software (like `plink `__, `vcftools `__, `tabix `__, ...). Then you
need to manually activate the ``SMARTER-database`` environment before
installing all the required Python dependencies:

.. code-block:: bash

   conda activate SMARTER-database
   make requirements

.. note::

   All project dependencies will be installed in the ``SMARTER-database`` conda
   environment. You will need to activate this environment every time you need
   to use a *SMARTER-database* script or dependency.

Initialize and populate SMARTER database
----------------------------------------

In order to populate the *SMARTER-database* with data, you need to collect the
data provided by the partners from the `SMARTER repository `__. Moreover, you
have to retrieve and collect information from databases like `SNPchiMp`_,
`Ensembl`_ or `EVA`_. You will also need information from *Illumina* or
*Affymetrix* manifest files in order to deal with different types of genotype
files.

*Raw unprocessed files* and external *source files* need to be placed in their
proper folders. All data received from the SMARTER partners need to be placed
in the ``data/raw`` folder of the SMARTER ``$PROJECT_DIR`` directory, in a
``foreground`` or ``background`` subfolder depending on whether the data were
produced in the context of the SMARTER project or are available outside it.
External source files, like manifests, database dumps and other support files,
need to be placed in the ``data/external`` directory. Within this project,
external support files are organized by species (``GOA`` and ``SHE`` for *goat*
and *sheep* respectively) and by data source (e.g. ``SNPCHIMP``, ``ILLUMINA``,
``AFFYMETRIX``, etc.). Those data files are not shipped with this GitHub
project: you need to ask the developers and the SMARTER WP4 coordinators for
access to these data.
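
The folder layout described above can be prepared before copying any file into
place. The following is a rough sketch and not part of the project: the
``expected_folders`` and ``create_layout`` helpers are hypothetical, the list of
data sources is incomplete (the text ends it with "etc."), and nesting the data
source below the species folder is an assumption about how "by species and by
data source" is laid out on disk.

```python
from pathlib import Path

SPECIES = ("GOA", "SHE")  # goat and sheep, as named in the text
# Assumption: only the sources named in the text; the real list is longer.
SOURCES = ("SNPCHIMP", "ILLUMINA", "AFFYMETRIX")


def expected_folders(project_dir):
    """List the raw and external data folders under ``project_dir``."""
    root = Path(project_dir)
    folders = [root / "data" / "raw" / kind for kind in ("foreground", "background")]
    # Assumption: external files are stored as data/external/<species>/<source>.
    folders += [
        root / "data" / "external" / species / source
        for species in SPECIES
        for source in SOURCES
    ]
    return folders


def create_layout(project_dir):
    """Create the folders if they are missing and return their paths."""
    folders = expected_folders(project_dir)
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling ``create_layout(os.environ["PROJECT_DIR"])`` (after ``import os``) would
create any missing folder and leave existing ones untouched.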

Process raw data and create the final dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to process raw data, insert data into the SMARTER database, generate
the SMARTER IDs and create the final genotype dataset files, there are mainly
two steps, both managed using the ``make`` command. In the first step, you
upload all the external information on *variants* into the database: simply
type (inside the ``SMARTER-database`` conda environment):

.. code-block:: bash

   make initialize

This step is described in detail in the :ref:`Loading variants into database`
section. In the next step, you will process each sample by generating a
*SMARTER ID*, and you will insert phenotypes and other sample-related metadata
into the SMARTER database. The final output of this step is the set of final
genotype files. Like before, simply type:

.. code-block:: bash

   make data

Output data will be placed in folders that depend on the assembly version used,
with all the genotypes in the same format and using the same reference system.
Those folders will be placed in the ``data/processed`` folder. For more
detailed information about all the processes called within this step, please
see :ref:`The Data Import Process` documentation. The last step in data
generation is run with:

.. code-block:: bash

   make publish

This packs your genotype files so that they can be shared with other partners
using the SMARTER FTP repository.

Database management through docker-compose
------------------------------------------

The SMARTER MongoDB docker-composed image in the ``database`` folder *bind
mounts* the ``database/mongodb-home/`` folder, in which you can put files to be
inserted into or retrieved from the database. This means that you can place
here a file to be imported into the database, or you can export a collection
outside *SMARTER-database*.
The following sections describe how to dump and restore a full
*SMARTER-database* instance.

Restore SMARTER database from a *mongodump* file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to restore the SMARTER database from a dump file:

.. code-block:: bash

   docker-compose run --rm --user mongodb mongo sh -c 'mongorestore --host mongo \
     --username="${MONGO_INITDB_ROOT_USERNAME}" \
     --password="${MONGO_INITDB_ROOT_PASSWORD}" --authenticationDatabase admin \
     --db=smarter --drop --preserveUUID --gzip \
     --archive=/home/mongodb/smarter.archive.gz'

After that, you can log in to the *smarter* database by calling the mongodb
client like this:

.. code-block:: bash

   docker-compose run --rm --user mongodb mongo sh -c 'mongo --host mongo \
     --username="${MONGO_INITDB_ROOT_USERNAME}" --password="${MONGO_INITDB_ROOT_PASSWORD}" \
     --authenticationDatabase=admin smarter'

Dump SMARTER-database
^^^^^^^^^^^^^^^^^^^^^

In order to dump the SMARTER database to a file:

.. code-block:: bash

   docker-compose run --rm --user mongodb mongo sh -c 'mongodump --host mongo \
     --username="${MONGO_INITDB_ROOT_USERNAME}" \
     --password="${MONGO_INITDB_ROOT_PASSWORD}" --authenticationDatabase admin \
     --db=smarter --gzip --archive=/home/mongodb/smarter.archive.gz'

.. _`Cookiecutter Data Science`: https://drivendata.github.io/cookiecutter-data-science/
.. _`MongoDB`: https://www.mongodb.com/
.. _`docker-compose`: https://docs.docker.com/compose/
.. _`SNPchiMp`: http://webserver.ibba.cnr.it/SNPchimp/
.. _`Ensembl`: https://www.ensembl.org/index.html
.. _`EVA`: https://www.ebi.ac.uk/eva/