Testing DRIVE 
====================
Once DRIVE is installed, the user can use the provided test data to better understand how to run the program and to see examples of the inputs. This test data can be found in the tests folder on GitHub `(test data location) <https://github.com/belowlab/drive/tree/main/tests>`_. At the moment, only integration tests have been implemented for DRIVE, so this documentation only describes the files pertaining to the integration tests.

Simulating IBD Data:
--------------------
In order to provide test data, we needed to simulate pairwise IBD segments as inputs. To accomplish this, we followed a procedure similar to the one described in the paper `Open-source benchmarking of IBD segment detection methods for biobank-scale cohorts <https://doi.org/10.1093/gigascience/giac111>`_ by Tang et al. In this paper, the authors wished to compare several IBD detection programs, so they developed a Python pipeline using `msprime <https://tskit.dev/msprime/docs/stable/intro.html>`_, which can simulate pairwise IBD segments. We used their script `msprime_simulation.py <https://github.com/ZhiGroup/IBD_benchmark/blob/main/Simulation/msprime_simulation.py>`_, which generates a VCF file mimicking sparse array data. For simplicity, we used the same genetic map file and simulated data for the same chromosome (chromosome 20). The following command was used to run that script:

.. code::

    python3 ./msprime_simulation.py 20 ./genetic_map_GRCh37_chr20.txt 1.38e-8 1234 1000 17197 5000 1.0 ../msprime_output/

We specified the following parameters for the analysis:

* a recombination rate of 1.38e-8
* a random seed of 1234
* 1000 samples
* 17197 markers to be in the output data
* a distance of 5,000 bases between each position that was sampled in the trees generated by msprime

After we generated the VCF file, we indexed it using tabix and then ran `SHAPEIT4 <http://odelaneau.github.io/shapeit4/>`_. We first reformatted the genetic map file used for msprime into the PLINK format required by SHAPEIT4. Once the inputs were made, we ran the following command for SHAPEIT4:

.. code::

    shapeit4 -I ooa1000.chr20.vcf.gz -M shapeit4_recombination_map.txt  -R 20 --thread 8  -O output_filename > logfile.log

After we phased the data, we converted it back to a VCF file and then used hap-IBD to detect IBD segments within the data. We had to once again reformat the recombination map file so that it had an additional ID column. After that file was created, we used the following command to run hap-IBD and generate our simulated data:

.. code::

    java -jar ~/bin/hap-ibd.jar gt=./phased_chr20_simulated_v2.vcf.gz  map=hapibd_map.txt out=simulated_ibd_test_data_v2_chr20 min-markers=75
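
The map reformatting for hap-IBD mentioned above can be sketched in Python. This is an illustration rather than the script used to build the test data: it assumes the input map has three whitespace-separated columns in the order position, chromosome, cM (with an optional header line), and it emits the four-column PLINK map layout (chromosome, marker ID, cM, position) using ``.`` as a placeholder ID. The function name ``to_plink_map`` and the example rows are hypothetical.

.. code:: python

    import io


    def to_plink_map(lines, marker_id="."):
        """Convert 'pos chr cM' rows into PLINK-style 'chr id cM pos' rows.

        The input column order is an assumption; adjust it to match your map file.
        """
        rows = []
        for line in lines:
            fields = line.split()
            # Skip blank lines and a header row (first field is not an integer position).
            if len(fields) < 3 or not fields[0].isdigit():
                continue
            pos, chrom, cm = fields[:3]
            rows.append(f"{chrom}\t{marker_id}\t{cm}\t{pos}")
        return rows


    # Made-up rows standing in for a real three-column genetic map.
    example = io.StringIO("pos\tchr\tcM\n61795\t20\t0.0\n63231\t20\t0.003565\n")
    for row in to_plink_map(example):
        print(row)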

DRIVE uses the \*.ibd.gz file generated from hap-IBD, so this file was placed in the subdirectory called "test_inputs".
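
To inspect the segment file outside of DRIVE, a short Python reader can help. The sketch below assumes the standard hap-IBD output layout of eight whitespace-separated columns (first sample ID, first haplotype index, second sample ID, second haplotype index, chromosome, start position, end position, and segment length in cM). The names ``Segment``, ``parse_segment``, and ``read_segments`` are hypothetical helpers and are not part of DRIVE.

.. code:: python

    import gzip
    from typing import Iterator, NamedTuple


    class Segment(NamedTuple):
        id1: str
        hap1: int
        id2: str
        hap2: int
        chrom: str
        start: int
        end: int
        cm: float


    def parse_segment(line: str) -> Segment:
        """Parse one hap-IBD output row into a typed record."""
        f = line.split()
        return Segment(f[0], int(f[1]), f[2], int(f[3]), f[4],
                       int(f[5]), int(f[6]), float(f[7]))


    def read_segments(path: str) -> Iterator[Segment]:
        """Yield segments from a gzip-compressed hap-IBD file, skipping blank lines."""
        with gzip.open(path, "rt") as handle:
            for line in handle:
                if line.strip():
                    yield parse_segment(line)


    # Made-up row used only to show the parsed fields.
    print(parse_segment("sample_1\t1\tsample_2\t2\t20\t100000\t2500000\t3.41"))

Calling ``read_segments("tests/test_inputs/simulated_ibd_test_data_v2_chr20.ibd.gz")`` from the drive parent directory would then iterate over the simulated segments.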

Files/Directories:
------------------
The test data directory has a subdirectory called "test_inputs" and a Python script called "test_integration.py" that are essential to running the integration tests. They are described in further detail below.

**test_inputs directory**:
Within this directory there are two files of interest:

* simulated_ibd_test_data_v2_chr20.ibd.gz - This file contains the pairwise shared IBD segments generated from hap-IBD
* test_phenotype_file_withNAs.txt - This file has simulated case/control data for 3 phenotypes indicated by 1/0. The first column is "GRID" and then the following columns are "phenoA", "phenoB", "phenoC".
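
As a quick check of the phenotype file layout, the following Python sketch tallies the values in each phenotype column. It assumes the file is tab-delimited and that missing values are written as ``NA``; both are assumptions based on the file name and column description, so adjust them if your file differs. The function name ``summarize_phenotypes`` and the example rows are hypothetical.

.. code:: python

    import csv
    import io
    from collections import Counter


    def summarize_phenotypes(handle, id_column="GRID", delimiter="\t"):
        """Count how often each value (e.g. 1, 0, NA) occurs per phenotype column."""
        counts = {}
        for row in csv.DictReader(handle, delimiter=delimiter):
            for name, value in row.items():
                if name == id_column:
                    continue  # the ID column is not a phenotype
                counts.setdefault(name, Counter())[value] += 1
        return counts


    # Made-up rows that mimic the described header and 1/0 coding.
    example = io.StringIO(
        "GRID\tphenoA\tphenoB\tphenoC\n"
        "id_1\t1\t0\tNA\n"
        "id_2\t0\t0\t1\n"
        "id_3\tNA\t1\t1\n"
    )
    for pheno, tally in summarize_phenotypes(example).items():
        print(pheno, dict(tally))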

**test_integration.py**:
pytest uses the test_integration.py script to run the integration tests. This script contains the integration tests "test_drive_full_run_no_phenotypes" and "test_drive_full_run_with_phenotypes". The first tests the behavior when only the network identification algorithm of DRIVE is used, and the second tests the behavior when the enrichment plugin is also used because phenotype information is provided.

Commands to test successful installation:
-----------------------------------------
If DRIVE was installed using PDM, you can run pytest using the following command:

.. code::

    pdm run pytest -v ./tests/test_integration.py

If DRIVE was installed with pip or from GitHub, then it is assumed that it was installed into a virtualenv, venv, or conda environment. The following command can then be used to run the integration tests:

.. code::

    pytest -v ./tests/test_integration.py

.. hint::

    All of these commands make the assumption that you are running them from the drive parent directory.
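
If you prefer to launch the tests from Python, the sketch below wraps the same pytest call and first checks that the test script is visible from the chosen directory. It is a convenience wrapper written for this documentation, not part of DRIVE; the function names ``build_pytest_command`` and ``run_integration_tests`` are hypothetical, and pytest must be installed in the active environment.

.. code:: python

    import subprocess
    import sys
    from pathlib import Path

    TEST_FILE = Path("tests") / "test_integration.py"


    def build_pytest_command(test_file=TEST_FILE):
        """Return the pytest invocation as an argument list."""
        return [sys.executable, "-m", "pytest", "-v", str(test_file)]


    def run_integration_tests(root="."):
        """Run the integration tests from ``root`` and return pytest's exit code."""
        test_path = Path(root) / TEST_FILE
        if not test_path.is_file():
            raise FileNotFoundError(
                f"{test_path} not found; run this from the drive parent directory"
            )
        return subprocess.run(build_pytest_command(), cwd=root).returncode


    # Show the command that would be executed.
    print(build_pytest_command())

Calling ``run_integration_tests()`` from the drive parent directory returns ``0`` when all tests pass.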

Command to run DRIVE on simulated Data:
---------------------------------------
The test data in the tests/test_inputs folder illustrates how the phenotype file and the segment data should be formatted for DRIVE. The following command runs DRIVE on the simulated segment data and shows what the DRIVE output file should look like:

.. code::

    drive -i tests/test_inputs/simulated_ibd_test_data_v2_chr20.ibd.gz  -f hapibd -t 20:4666882-4682236 --recluster --min-cm 3 --log-to-console

For this example, we used the gene *PRNP* on chromosome 20. We used gnomAD v2.1.1 to get the position of this gene because the simulated data is in build GRCh37. Variants within this gene have been implicated in Fatal Familial Insomnia, Gerstmann-Straussler Disease, and Huntington Disease-Like 1.
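
To make the target region and the minimum centimorgan threshold concrete, the Python sketch below parses a ``chromosome:start-end`` string and keeps segments that overlap the region and are at least 3 cM long. The overlap rule is an assumption made for illustration and may not match DRIVE's internal filtering exactly. The function names and example segments are hypothetical.

.. code:: python

    def parse_region(target):
        """Split 'chrom:start-end' into (chrom, start, end)."""
        chrom, span = target.split(":")
        start, end = span.split("-")
        return chrom, int(start), int(end)


    def overlaps_target(seg_chrom, seg_start, seg_end, seg_cm, target, min_cm):
        """True if the segment overlaps the target region and meets the cM threshold."""
        chrom, start, end = parse_region(target)
        return (
            seg_chrom == chrom
            and seg_start <= end
            and seg_end >= start
            and seg_cm >= min_cm
        )


    TARGET = "20:4666882-4682236"

    # Made-up segments: (chromosome, start, end, length in cM).
    segments = [
        ("20", 4000000, 5200000, 3.4),  # spans the region and is long enough
        ("20", 4000000, 4500000, 3.1),  # ends before the region starts
        ("20", 4600000, 4700000, 1.2),  # overlaps the region but is too short
    ]

    kept = [s for s in segments if overlaps_target(*s, TARGET, 3.0)]
    print(kept)  # only the first segment is kept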

.. hint::

    This example code assumes that you cloned the tests subdirectory from github or that you created a similar directory structure.

More information about the output from DRIVE can be found in the outputs section.