--- title: "A2BCovid: example analysis" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{A2BCovid: example analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, fig.width = 8, out.extra='keepaspectratio') ``` ## a2bcovid: introduction `a2bcovid` is a tool to estimate the likelihood that an infection was transmitted between particular individuals, and then to identify clusters of infections. It uses data on either genome sequences or the locations of infected individuals, and uses a statistical and evolutionary model. It is currently designed to apply to COVID-19 infection dynamics on hospital wards. It is available as both an R package and a web app. An `a2bcovid` analysis using the R package is shown here for a simple example. The data files for this example are provided with the installed package, and can be read into R as follows. For the CSV files, the first couple of rows are shown here, to illustrate the format required. Full details of the format are specified in the R help page `help(a2bcovid)`. The file names, e.g. `pat_file`, below, should be specified as full path names. If the file is in your current working directory (for example `myfile.csv`) you can construct this path with code such as `file.path(getwd(),"myfile.csv")`. ### Times of symptom onset ```{r} pat_file <- system.file("extdata", "Example_genetic_temporal_data.csv", package="a2bcovid") head(read.csv(pat_file),2) ``` ### Genome sequence alignment data The example sequences file has been constructed so that all of the variants in the sequence appear in the first 10 positions in the genome. Looking at it with an alignment viewer will give a simple idea of how the sequences relate to one another. ```{r} ali_file <- system.file("extdata", "Example_sequences.fa", package="a2bcovid") ``` ### Patient location data Two alternative formats are accepted for the patient location data file, as illustrated here. The format can be automatically detected: if a variable called "start_date" is supplied then long format is assumed, or if there is a variable called "StartDate_0", then wide format is assumed. **Wide format: one row per patient** ```{r} pat_loc_file <- system.file("extdata", "Example_pat_loc_file.csv", package="a2bcovid") head(read.csv(pat_loc_file),2) ``` **Long format: one row per ward stay** ```{r} pat_loc_file_long <- system.file("extdata", "Example_pat_loc_file_long.csv", package="a2bcovid") head(read.csv(pat_loc_file_long),2) ``` ### Staff location data ```{r} hcw_loc_file <- system.file("extdata", "Example_hcw_loc_file.csv", package="a2bcovid") head(read.csv(hcw_loc_file),2) ``` # Example analysis ### Using just the symptom onset times ```{r} library(a2bcovid) a <- a2bcovid(pat_file = pat_file) plot_a2bcovid(a) ``` Individual 1008 does not seem to infect anyone else, but otherwise most of the cases appear to be connected to each other. Sequences with lower numbers are generally more likely to infect individuals with higher numbers. ### Using symptom onset times and sequence data ```{r} a <- a2bcovid(pat_file = pat_file, ali_file = ali_file) plot_a2bcovid(a) ``` With the addition of sequence information, individuals 1009 and 1010 appear more separate. They may have been infected by 1001 or 1002 but most of the links to them at type 0 have disappeared. There seems to be a cluster of individuals 1001 to 1008. ### Using symptom onset times, sequence data and patient location data Suppose that individuals 1004 to 1006 are healthcare workers, and the remaining individuals are patients in the ward. Suppose we have location data for all patients, but not the healthcare workers. ```{r} a <- a2bcovid(pat_file = pat_file, ali_file = ali_file, pat_loc_file = pat_loc_file) plot_a2bcovid(a) ``` The links from 1001 and 1002 to 1009 and 1010 are now gone, with these last two individuals being seen as clearly separate from the remaining cases. The plot suggests that 1007 infected 1008 in the absence of other known cases. ### Using symptom onset times, sequence data, patient location and staff location data Finally, we add in location data for the healthcare workers, individuals 1004 to 1006. ```{r} a <- a2bcovid(pat_file = pat_file, ali_file = ali_file, pat_loc_file = pat_loc_file, hcw_loc_file = hcw_loc_file) plot_a2bcovid(a) ``` More resolution is now seen around these individuals, with for example 1006 not having infected anyone else, but possibly having been infected by 1002 or 1003. We note that generally the question of who infected who is not resolved for the cluster of cases at the top right of the plot, but a cluster linking the individuals 1001 to 1008, and a second linking 1009 and 1010, could be identified from this plot for further investigation. Note that in the current version of `a2bcovid`, two different files and formats are used for location of patients and location of healthcare workers. However this is not necessary for the calculation. This is just a legacy of the original setting where the package was used. In a future version, the data format might be standardised. ### Additional plotting options By default, the individuals in the plot are sorted in a way that highlights potential clusters of infections. To sort them in the order that they were provided in the original data, specify `cluster=FALSE`. ```{r} plot_a2bcovid(a, cluster=FALSE) ``` By default the colours in the plot indicate _ranges_ of significance levels for a test of the hypothesis that transmission occurred between a pair of individuals. Three ranges are shown, corresponding to different ranges of the test p-value: * (p > 0.05) Data "Consistent" with transmission * (0.01 < p < 0.05) Data of "Borderline" significance * (p < 0.01) Data "Unlikely" to have arisen from a transmission event A smoother plot can be obtained by specifying `continuous=TRUE`. Here the colours vary smoothly with the p-value. ```{r} plot_a2bcovid(a, continuous=TRUE) ```