Testing DRIVE 
====================
Once DRIVE is installed, the user can use the provided test data to better understand how to run the program and to see examples of the inputs. This test data can be found in the tests folder on GitHub `(test data location) <https://github.com/belowlab/drive/tree/main/tests>`_. At the moment, only integration tests have been implemented for DRIVE. This section of the documentation will only provide information relevant to the integration tests. 

The following two sections describe how we simulated the IBD data used for the integration tests and how the tests folder is structured. If you only wish to know how to run the test data, you can skip to the "Commands to run the test data" section.

Simulating IBD Data:
--------------------
To provide test data, we needed to simulate pairwise IBD segments as inputs. To accomplish this, we followed a procedure similar to the one described in the paper `Open-source benchmarking of IBD segment detection methods for biobank-scale cohorts <https://doi.org/10.1093/gigascience/giac111>`_ by Tang et al. In this paper, the authors wished to compare several IBD detection programs, so they developed a Python pipeline using `msprime <https://tskit.dev/msprime/docs/stable/intro.html>`_, which can simulate pairwise IBD segments. We used their script `msprime_simulation.py <https://github.com/ZhiGroup/IBD_benchmark/blob/main/Simulation/msprime_simulation.py>`_, which generates a VCF file mimicking sparse array data. For simplicity, we used the same genetic map file and simulated data for the same chromosome (chromosome 20). The following command was used to run that script:

.. code::

    python3 ./msprime_simulation.py 20 ./genetic_map_GRCh37_chr20.txt 1.38e-8 1234 1000 17197 5000 1.0 ../msprime_output/

We specified the following parameters for the analysis:

* a recombination rate of 1.38e-8
* a random seed of 1234
* 1000 samples
* 17197 markers to be in the output data
* a distance of 5,000 bases between each spot that was sampled in the trees generated by msprime

After we generated the VCF file, we indexed it using tabix and then ran `SHAPEIT4 <http://odelaneau.github.io/shapeit4/>`_. We first reformatted the genetic map file used for msprime into the PLINK format required by SHAPEIT4. Once the inputs were prepared, we ran the following command for SHAPEIT4:

.. code::

    shapeit4 -I ooa1000.chr20.vcf.gz -M shapeit4_recombination_map.txt  -R 20 --thread 8  -O output_filename > logfile.log

After we phased the data, we converted it back to a VCF file and then used hap-IBD to detect IBD segments within the data. We once again had to reformat the recombination map file so that it had an additional id column. After that file was created, we used the following command to run hap-IBD and generate our simulated data:

.. code::

    java -jar ~/bin/hap-ibd.jar gt=./phased_chr20_simulated_v2.vcf.gz  map=hapibd_map.txt out=simulated_ibd_test_data_v2_chr20 min-markers=75
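
The map reformatting step mentioned above could look something like the following sketch. It assumes the input map is a whitespace-delimited HapMap-style file with a header row and the columns chromosome, position (bp), rate (cM/Mb), and map position (cM), and that the output should be a PLINK-style map with the columns chromosome, id, cM, and bp. The function name, the assumed input layout, and the example rows are illustrative only; this is not the exact script used to build the test data.

.. code:: python

    def hapmap_to_plink_map(lines):
        """Convert HapMap-style map lines into PLINK-style map lines.

        Assumed input columns: chromosome, position (bp), rate (cM/Mb), map (cM).
        Output columns: chromosome, id (placeholder "."), cM, bp.
        """
        converted = []
        for line in lines:
            fields = line.split()
            # Skip blank lines and the header row (its position field is not numeric)
            if len(fields) < 4 or not fields[1].isdigit():
                continue
            chrom, pos_bp, _rate, pos_cm = fields[:4]
            # Drop a leading "chr" so the name matches a VCF that uses "20"
            if chrom.startswith("chr"):
                chrom = chrom[3:]
            converted.append(f"{chrom}\t.\t{pos_cm}\t{pos_bp}")
        return converted


    # Made-up rows in the assumed input layout
    example = [
        "Chromosome\tPosition(bp)\tRate(cM/Mb)\tMap(cM)",
        "chr20\t61098\t0.0\t0.0",
        "chr20\t61270\t0.3647\t0.000063",
    ]
    print("\n".join(hapmap_to_plink_map(example)))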

DRIVE uses the \*.ibd.gz file generated from hap-IBD, so this file was placed in the subdirectory called "test_inputs".
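
For orientation, a hap-IBD \*.ibd.gz file can be inspected with a few lines of Python. The sketch below assumes the eight-column hap-IBD output layout (sample 1, haplotype 1, sample 2, haplotype 2, chromosome, start, end, length in cM). The class and function names are hypothetical, the example line is made up, and this is not how DRIVE itself parses the file.

.. code:: python

    import gzip
    from typing import Iterator, NamedTuple


    class Segment(NamedTuple):
        id1: str
        id2: str
        chrom: str
        start: int
        end: int
        length_cm: float


    def parse_hapibd_line(line: str) -> Segment:
        """Parse one line, assuming the eight-column hap-IBD layout."""
        f = line.split()
        return Segment(f[0], f[2], f[4], int(f[5]), int(f[6]), float(f[7]))


    def read_hapibd(path: str) -> Iterator[Segment]:
        """Yield one Segment per non-empty line of a gzipped hap-IBD file."""
        with gzip.open(path, "rt") as handle:
            for line in handle:
                if line.strip():
                    yield parse_hapibd_line(line)


    # Made-up line in the assumed layout
    example = "sample_1\t1\tsample_2\t2\t20\t4600000\t4700000\t3.52"
    print(parse_hapibd_line(example))

To look at the real test data, pass the path of the file in "test_inputs" to ``read_hapibd`` and iterate over the result.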

Files/Directories found within the tests directory:
---------------------------------------------------
The test data directory has a subdirectory called "test_inputs" and a Python script called "test_integration.py" that are essential to running the integration tests. They are described in further detail below.

**test_inputs directory**:
Within this directory there are two files of interest:

* simulated_ibd_test_data_v2_chr20.ibd.gz - This file contains the pairwise shared IBD segments generated from hap-IBD
* test_phenotype_file_withNAs.txt - This file has simulated case/control data for 3 phenotypes indicated by 1/0. The first column is "GRID" and then the following columns are "phenoA", "phenoB", "phenoC".
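
To make the phenotype layout concrete, the sketch below tallies cases, controls, and missing values for each phenotype column. It assumes the file is tab-delimited and treats any value other than 1 or 0 as missing (the file name suggests "NA" is used, but the exact missing-value string is an assumption). The function name and the example rows are hypothetical.

.. code:: python

    import csv
    import io
    from collections import Counter


    def summarize_phenotypes(handle, id_column="GRID", delimiter="\t"):
        """Count cases (1), controls (0), and missing values per phenotype."""
        reader = csv.DictReader(handle, delimiter=delimiter)
        labels = {"1": "case", "0": "control"}
        counts = {}
        for row in reader:
            for name, value in row.items():
                if name == id_column:
                    continue
                # Anything that is not exactly 1 or 0 is counted as missing
                label = labels.get((value or "").strip(), "missing")
                counts.setdefault(name, Counter())[label] += 1
        return counts


    # Made-up rows using the column names of the test phenotype file
    example = io.StringIO(
        "GRID\tphenoA\tphenoB\tphenoC\n"
        "id1\t1\t0\tNA\n"
        "id2\t0\t0\t1\n"
        "id3\tNA\t1\t1\n"
    )
    for name, tally in summarize_phenotypes(example).items():
        print(name, dict(tally))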

**test_integration.py**:
pytest uses the test_integration.py script to run the integration test. This script contains 5 integration tests which are described below.

* *test_drive_full_run_no_phenotypes*: test the behavior when only the network identification algorithm of DRIVE is used 
* *test_drive_full_run_with_phenotypes*: test the behavior of the enrichment plugin by also providing phenotype information.
* *test_drive_dendrogram_single_network*: test to make sure a dendrogram is formed for a network. This test uses the inputs from the 1st test.
* *test_pull_samples_success*: test to make sure that DRIVE correctly forms an output file when trying to pull specific samples.
* *test_for_correct_samples*: test to make sure that DRIVE pulled the correct samples from the right network.

Commands to run the test data:
------------------------------
Depending on how DRIVE was installed, the commands to run the test data can be different.

.. tab-set:: 
   :sync-group: run-test

   .. tab-item:: PIP 
      :sync: key1

      As of v3.0.2, the Python testing framework "Pytest" is bundled with DRIVE so that users can run the test data directly from the DRIVE CLI. If DRIVE was installed directly from PyPI into either a virtualenv or a conda environment, then you can run the test data with the following command:

      .. code:: bash

        drive utilities test

   .. tab-item:: PDM 
      :sync: key2

      PDM has the ability to run commands within the created virtual environment if the prefix "pdm run" is used. The following command will run the test data using pdm, ensuring that the correct dependencies are being used.
      
      .. code::

         pdm run pytest -v ./tests/test_integration.py

   .. tab-item:: GitHub 
      :sync: key3

      If the user cloned the GitHub repository and then installed all the dependencies using the conda .yaml file or the pyproject.toml file, then either of the following commands can be used to run the integration tests:

      .. code::

         pytest -v ./tests/test_integration.py

         # or

         python -m drive.drive utilities test  

      .. note::

         If you receive an error saying "pytest not found" but DRIVE seems to be installed correctly, then the pytest dependency may not have installed correctly. You can remedy this by running the following command:

         .. code::

            pip install pytest
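
      Before reinstalling anything, it can help to confirm what is actually missing from the active environment. The following is a minimal sketch using only the Python standard library; the function name is hypothetical and it is not part of DRIVE.

      .. code:: python

         import importlib.util
         import shutil


         def check_test_prerequisites():
             """Report which pieces needed to run the tests are available."""
             return {
                 # True if the pytest package can be imported in this environment
                 "pytest": importlib.util.find_spec("pytest") is not None,
                 # True if an executable named "drive" is found on PATH
                 "drive": shutil.which("drive") is not None,
             }


         print(check_test_prerequisites())

      If "pytest" is reported as ``False``, the ``pip install pytest`` command above should be run in the same environment in which DRIVE is installed.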

   .. tab-item:: Docker
      :sync: key4

      Once the user has pulled the Docker image from DockerHub, they can use the following command to run the test data. This command also works for other container software such as Podman; simply replace "docker" with "podman".
      
      .. code::
        
        docker run -it --rm drive-image-tag drive utilities test
        

      Singularity images use a read-only file system for security, so the commands to run the test data are different. The built-in testing framework cannot be run directly from the image because it does not have permission to write to the file system inside the image. Instead, first build a writable "sandbox" from the image (the name "singularity-sandbox" can be replaced with a name of your choosing), and then run the tests inside that sandbox. All other commands can use the normal singularity image (not the sandbox).

      .. code::

        # Using singularity to make a sandbox
        singularity build --sandbox singularity-sandbox singularity-image-path.sif
        # Now you can run the integration tests using the sandbox image
        singularity exec -w --no-home singularity-sandbox drive utilities test
        
   .. tab-item:: Manual
      :sync: key5
 
      Users can also manually run the test data. This option is usually only useful if the other testing options are not working or if you wish to compare new versions of DRIVE to older versions of DRIVE. 

      DRIVE underwent major changes between v1 and v3, including additional command flags that are not present in the earlier versions. For this reason, the section below describes how to run the test data for DRIVE v3 and then how to run the legacy commands for DRIVE v1 and DRIVE v2. These commands assume that you have either cloned the GitHub repository or downloaded the "tests" directory from GitHub.

      **Command to run DRIVE (v3.0.0+) on simulated Data:**

      *Running the clustering subcommand*:

      The test data in the tests/test_inputs folder illustrates how the inputs for the phenotype file and the segment data should be formatted for DRIVE. The following command will run those files and show what the DRIVE output file should look like:

      .. code::

         drive cluster -i tests/test_inputs/simulated_ibd_test_data_v2_chr20.ibd.gz  -f hapibd -t 20:4666882-4682236 -o test --recluster --min-cm 3 --log-to-console

      
      For this example, we used the gene *PRNP* on chromosome 20. We used gnomAD v2.1.1 to get the position of this gene because the simulated data is in build GRCh37. Variants within this gene have been implicated in Fatal Familial Insomnia, Gerstmann-Straussler Disease, and Huntington Disease-Like 1.
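
      Conceptually, the ``-t`` and ``--min-cm`` options restrict the analysis to IBD segments around the target region that meet a minimum length. The sketch below illustrates that idea under the assumption that a segment is kept when it overlaps the region at all and is at least the minimum length. It is not DRIVE's implementation, and the function names and example segments are hypothetical.

      .. code:: python

         def parse_region(text):
             """Split a 'chrom:start-end' string into (chrom, start, end)."""
             chrom, span = text.split(":")
             start, end = (int(x) for x in span.split("-"))
             if start > end:
                 raise ValueError(f"start is after end in region {text!r}")
             return chrom, start, end


         def keep_segment(seg_chrom, seg_start, seg_end, seg_cm, region, min_cm):
             """Assumed rule: same chromosome, any overlap, and long enough."""
             chrom, start, end = region
             overlaps = seg_chrom == chrom and seg_start <= end and seg_end >= start
             return overlaps and seg_cm >= min_cm


         region = parse_region("20:4666882-4682236")
         # Made-up segment that spans the target region
         print(keep_segment("20", 4600000, 4700000, 3.5, region, 3.0))  # True
         # Made-up segment elsewhere on chromosome 20
         print(keep_segment("20", 1000000, 2000000, 5.0, region, 3.0))  # False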
      
      *Running the dendrogram subcommand*:
      
      This step expects that the user has already run the cluster command above and generated an output file called test.drive_networks.txt in their current directory. This file is used as input to the dendrogram subcommand. The following command will generate a dendrogram for network 0.

      .. code::

         drive dendrogram -i test.drive_networks.txt --ibd tests/test_inputs/simulated_ibd_test_data_v2_chr20.ibd.gz -f hapibd -t 20:4666882-4682236 --min-cm 3 -n 0 --log-to-console


      **Command to test legacy versions of DRIVE (before v3.0.0):**

      Prior to DRIVE v3, the tool went through 2 stages involving significant changes to the CLI structure and functionality (v1 & v2). Both versions can still use the test data to run the clustering algorithm. The commands to test each version can be found below.

      *Running example data for DRIVE v1:*

      DRIVE v1 was an initial implementation of DRIVE that only performed clustering and had more limited runtime options, but it can still run the test data using the following command:

      .. code::

         drive -i tests/test_inputs/simulated_ibd_test_data_v2_chr20.ibd.gz -f hapIBD -t 20:4666882-4682236 -o ./test -m 3

      If successful, this command will produce an output file called test.drive_networks.txt.


      *Running example data for DRIVE v2:*

      Although DRIVE v2 was only a development version and was never formally released for external use, it is still available on PyPI (although it may contain bugs that were only fixed in later development). There were no subcommands in the CLI, so only the clustering and phenome-wide enrichment functionality is available. You can still run the test data using the following command:

      .. code::

         drive -i tests/test_inputs/simulated_ibd_test_data_v2_chr20.ibd.gz  -f hapibd -t 20:4666882-4682236 -o test --recluster --min-cm 3 --log-to-console

      If successful, this command will generate an output file called test.drive_networks.txt.

.. important::

   The commands for running the tests using "PDM" and "GitHub" assume that you are running them from the root directory of the drive repository.
 