a2bcovid is a tool to estimate the likelihood that an infection was transmitted between particular individuals, and then to identify clusters of infections.
It uses data on either genome sequences or the locations of infected individuals, together with a statistical and evolutionary model.
It is currently designed to apply to COVID-19 infection dynamics on hospital wards.
It is available as both an R package and a web app.
A simple example of an a2bcovid analysis using the R package is shown here.
The data files for this example are provided with the installed package, and can be read into R as follows. For the CSV files, the first two rows are shown to illustrate the required format. Full details of the format are given in the R help page help(a2bcovid).
The file names, e.g. pat_file below, should be specified as full path names. If the file is in your current working directory (for example myfile.csv), you can construct this path with code such as file.path(getwd(),"myfile.csv").
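As a minimal sketch of the path construction described above, the following builds a full path for a file in the working directory and checks that something exists there before the path is passed on. The name myfile.csv is the placeholder used in the text, not a file shipped with the package.

```r
# "myfile.csv" is a placeholder name; substitute your own data file.
csv_name <- "myfile.csv"

# Build the full path from the current working directory.
full_path <- file.path(getwd(), csv_name)

# Report early if nothing exists at that location.
if (!file.exists(full_path)) {
  message("No file found at: ", full_path)
}
```

Checking the path up front makes a mistyped file name easier to spot than an error raised later during the analysis.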
pat_file <- system.file("extdata", "Example_genetic_temporal_data.csv", package="a2bcovid")
head(read.csv(pat_file),2)
## patient_study_id onset_date onset_date_source infection_type sequence_id
## 1 CAMP001001 01/05/2020 1 2 CAM00001
## 2 CAMP001002 05/05/2020 1 2 CAM00002
## sample_collection_date sample_received_date
## 1 04/05/2020 04/05/2020
## 2 08/05/2020 08/05/2020
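The dates in the example file appear to be written as day/month/year. When inspecting such data in R, it can help to parse them explicitly, since as.Date() does not assume this layout by default. The sketch below uses values copied from the two rows printed above; the variable names are illustrative only.

```r
# Dates copied from the first two rows of the example patient file.
onset_date <- c("01/05/2020", "05/05/2020")
sample_collection_date <- c("04/05/2020", "08/05/2020")

# Parse explicitly as day/month/year.
onset <- as.Date(onset_date, format = "%d/%m/%Y")
collected <- as.Date(sample_collection_date, format = "%d/%m/%Y")

# Days from symptom onset to sample collection for each patient.
delay_days <- as.numeric(collected - onset)
delay_days
```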
The example sequences file has been constructed so that all of the variants in the sequences appear in the first 10 positions of the genome. Viewing it in an alignment viewer gives a simple idea of how the sequences relate to one another.
Two alternative formats are accepted for the patient location data file, as illustrated here. The format can be automatically detected: if a variable called “start_date” is supplied then long format is assumed, or if there is a variable called “StartDate_0”, then wide format is assumed.
Wide format: one row per patient
pat_loc_file <- system.file("extdata", "Example_pat_loc_file.csv", package="a2bcovid")
head(read.csv(pat_loc_file),2)
## patient_study_id ward_cluster_network hcw_status
## 1 CAMP001001 A patient
## 2 CAMP001002 A patient
## patient_movement_data_available LocationName_0 StartDate_0 EndDate_0
## 1 patient_moves_available WARD_01 01/05/2020 07/05/2020
## 2 patient_moves_available WARD_01 04/05/2020 09/05/2020
## LocationName_1 StartDate_1 EndDate_1
## 1 WARD_04 07/05/2020 09/05/2020
## 2 WARD_05 09/05/2020 12/05/2020
Long format: one row per ward stay
pat_loc_file_long <- system.file("extdata", "Example_pat_loc_file_long.csv", package="a2bcovid")
head(read.csv(pat_loc_file_long),2)
## patient_study_id from_ward start_date to_ward end_date
## 1 CAMP001001 WARD_01 01/05/2020 Discharge 07/05/2020
## 2 CAMP001001 WARD_04 07/05/2020 Discharge 09/05/2020
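To make the relationship between the two layouts concrete, here is a rough sketch in base R. It applies the detection rule stated above (a start_date column means long format, a StartDate_0 column means wide format) and stacks a wide table into one row per ward stay. The helper names detect_loc_format and wide_to_long are hypothetical and are not part of a2bcovid; the to_ward column is left out because its meaning is not described here.

```r
# Hypothetical helper: apply the column-name rule described in the text.
detect_loc_format <- function(df) {
  if ("start_date" %in% names(df)) {
    "long"
  } else if ("StartDate_0" %in% names(df)) {
    "wide"
  } else {
    stop("Neither 'start_date' nor 'StartDate_0' found")
  }
}

# Hypothetical helper: stack LocationName_k / StartDate_k / EndDate_k blocks.
wide_to_long <- function(wide) {
  # Stay indices ("0", "1", ...) taken from the column names.
  stay_idx <- sub("^StartDate_", "",
                  grep("^StartDate_", names(wide), value = TRUE))
  long <- do.call(rbind, lapply(stay_idx, function(k) {
    data.frame(
      patient_study_id = wide$patient_study_id,
      from_ward = wide[[paste0("LocationName_", k)]],
      start_date = wide[[paste0("StartDate_", k)]],
      end_date = wide[[paste0("EndDate_", k)]],
      stringsAsFactors = FALSE
    )
  }))
  # Drop empty stays, then order by patient and start date.
  long <- long[!is.na(long$from_ward) & long$from_ward != "", ]
  ord <- order(long$patient_study_id,
               as.Date(long$start_date, format = "%d/%m/%Y"))
  long <- long[ord, ]
  rownames(long) <- NULL
  long
}

# Toy wide table mirroring the two rows printed above.
wide <- data.frame(
  patient_study_id = c("CAMP001001", "CAMP001002"),
  LocationName_0 = c("WARD_01", "WARD_01"),
  StartDate_0 = c("01/05/2020", "04/05/2020"),
  EndDate_0 = c("07/05/2020", "09/05/2020"),
  LocationName_1 = c("WARD_04", "WARD_05"),
  StartDate_1 = c("07/05/2020", "09/05/2020"),
  EndDate_1 = c("09/05/2020", "12/05/2020"),
  stringsAsFactors = FALSE
)

detect_loc_format(wide)
long <- wide_to_long(wide)
long
```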
Healthcare worker location data: one row per healthcare worker, one column per date
hcw_loc_file <- system.file("extdata", "Example_hcw_loc_file.csv", package="a2bcovid")
head(read.csv(hcw_loc_file),2)
## patient_study_id ward_cluster X05.05.2020 X06.05.2020 X07.05.2020 X08.05.2020
## 1 CAMP001004 WARD_01 Y Y N N
## 2 CAMP001005 WARD_01 N N Y N
## X09.05.2020 X10.05.2020 X11.05.2020 X12.05.2020 X13.05.2020 X14.05.2020
## 1 N N N Y N N
## 2 N N N N N Y
## X15.05.2020 X16.05.2020 X17.05.2020 X18.05.2020 X19.05.2020 X20.05.2020
## 1 N N N N N N
## 2 N Y N N N N
## X21.05.2020
## 1 N
## 2 N
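The column names in the output above (X05.05.2020 and so on) are what read.csv() produces when a header begins with a digit. The sketch below shows one way to recover the dates and list the flagged days for each healthcare worker. It uses a small inline table rather than the packaged file, and it assumes that the raw headers are written as dd.mm.yyyy and that Y marks a day on which the worker was on the ward; neither is confirmed here.

```r
# Toy table mirroring the first three date columns printed above.
# Assumption: the raw headers look like 05.05.2020.
hcw <- read.csv(text = "patient_study_id,ward_cluster,05.05.2020,06.05.2020,07.05.2020
CAMP001004,WARD_01,Y,Y,N
CAMP001005,WARD_01,N,N,Y", stringsAsFactors = FALSE)

# read.csv() prefixes names starting with a digit with "X"; find those columns.
date_cols <- grep("^X[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}$", names(hcw), value = TRUE)
dates <- as.Date(sub("^X", "", date_cols), format = "%d.%m.%Y")

# Logical matrix of flagged days (assumed meaning: Y = on the ward).
flagged <- hcw[, date_cols] == "Y"
idx <- which(flagged, arr.ind = TRUE)

# One row per healthcare worker per flagged day.
shifts <- data.frame(
  patient_study_id = hcw$patient_study_id[idx[, "row"]],
  ward = hcw$ward_cluster[idx[, "row"]],
  date = dates[idx[, "col"]],
  stringsAsFactors = FALSE
)
shifts <- shifts[order(shifts$patient_study_id, shifts$date), ]
rownames(shifts) <- NULL
shifts
```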
Individual 1008 does not seem to have infected anyone else, but otherwise most of the cases appear to be connected to each other. Individuals with lower numbers are generally more likely to have infected individuals with higher numbers.
## Now here
## Check case of sequence data
## CheckBaseCase complete
## Now here
## Go to Incorporate
With the addition of sequence information, individuals 1009 and 1010 appear more separate. They may have been infected by 1001 or 1002 but most of the links to them at type 0 have disappeared. There seems to be a cluster of individuals 1001 to 1008.
Suppose that individuals 1004 to 1006 are healthcare workers, and the remaining individuals are patients in the ward. Suppose we have location data for all patients, but not the healthcare workers.
## Now here
## Check case of sequence data
## CheckBaseCase complete
## Now here
## Go to Incorporate
## Read ward file
The links from 1001 and 1002 to 1009 and 1010 are now gone, with these last two individuals being seen as clearly separate from the remaining cases. The plot suggests that 1007 infected 1008 in the absence of other known cases.
Finally, we add in location data for the healthcare workers, individuals 1004 to 1006.
a <- a2bcovid(pat_file = pat_file, ali_file = ali_file, pat_loc_file = pat_loc_file, hcw_loc_file = hcw_loc_file)
## Now here
## Check case of sequence data
## CheckBaseCase complete
## Now here
## Go to Incorporate
## Read ward file
More resolution is now seen around these individuals: for example, 1006 does not appear to have infected anyone else, but may have been infected by 1002 or 1003. In general, the question of who infected whom is not resolved for the cluster of cases at the top right of the plot. However, a cluster linking individuals 1001 to 1008, and a second linking 1009 and 1010, could be identified from this plot for further investigation.
Note that in the current version of a2bcovid, two different files and formats are used for the locations of patients and of healthcare workers. This is not necessary for the calculation; it is a legacy of the original setting in which the package was used. In a future version, the data format might be standardised.
By default, the individuals in the plot are sorted in a way that highlights potential clusters of infections. To sort them in the order in which they were provided in the original data, specify cluster=FALSE.
By default, the colours in the plot indicate ranges of significance levels for a test of the hypothesis that transmission occurred between a pair of individuals. Three colours are shown, each corresponding to a different range of the test p-value.
A smoother plot can be obtained by specifying continuous=TRUE. Here the colours vary smoothly with the p-value.