Title: | Inferring COVID-19 Transmission Events from Sequence and Location Data |
---|---|
Description: | A tool which combines genome sequence and the locations of infected individuals, using a statistical and evolutionary model, to estimate the likelihood that transmission occurred between particular individuals, and then to identify clusters of infections. It is currently designed to apply to COVID-19 infection dynamics on hospital wards. |
Authors: | Chris Illingworth [aut, cre], Chris Jackson [aut] |
Maintainer: | Chris Illingworth |
License: | GPL-3 |
Version: | 0.1.0 |
Built: | 2024-11-26 04:23:08 UTC |
Source: | https://github.com/chjackson/a2bcovid |
A tool which combines genome sequence and the locations of infected individuals, using a statistical and evolutionary model, to estimate the likelihood that transmission occurred between particular individuals, and then to identify clusters of infections. It is currently designed to apply to COVID-19 infection dynamics on hospital wards.
a2bcovid( pat_file, hcw_loc_file = "", ali_file = "", pat_loc_file = "", strain = "default", ucta = 2.59321520957074, uctb = 3.77600606639754, ucto = 3.11208004146092, uct_mean = 6.67992, evo_rate = 8e-04, seq_noise = 0.41369, chat = 0.5, max_n = 10, min_qual = 0.8, diagnostic = FALSE, hcw_default = 0.5714286, pat_default = 1, use_all_seqs = 0, symptom_uncertainty_calc = 0 )
pat_file |
(Required) A character string with the path to a file containing the basic data for each individual. This should be a comma separated (.csv) file with data in columns:
1. Individual ID: a code or identifier corresponding to the individual.
2. Onset date: the date at which the individual first experienced symptoms. Date format should be dd/mm/yyyy.
3. Onset date source: equal to 1 if the date of onset is known. Equal to 2 if the infection was asymptomatic; in this case the onset date is the date on which the first positive swab was collected. Equal to 3 if data is missing or unknown; in this case the onset date is also the date on which the first positive swab was collected. If the onset date source is anything other than 1, the true onset date is estimated by the code using data collected from Cambridge University Hospitals (argument
4. Infection type: equal to 1 if the individual is a patient and a community case, i.e. who could not have been infected by others in the dataset but who could potentially transmit the virus to others. This was defined as being positive for the virus 48 hours before admission to hospital, with no healthcare contact in the 14 days prior to admission. Equal to 2 if the individual is a patient and not a community case, i.e. who could potentially transmit and receive infection. Equal to 3 if the individual is a healthcare worker. Whether or not an individual is a healthcare worker is set by this parameter.
5. Sequence ID: a code used to link the individual to genome sequence information. This should match the header of the sequence corresponding to the individual in the accompanying .fasta file (argument
6. Date of sample collection: used in evolutionary calculations. Date format should be dd/mm/yyyy.
7. Sample received date: currently not used in the calculation, but necessary. If this seventh column is missing, A2B-Covid has been reported to crash R completely.
An example is given with the installed package. The path to the example file can be shown by the R command |
hcw_loc_file |
A character string with the path to a file of data describing when specific health care workers were on the ward in question. If this argument is omitted or set to an empty string, then this kind of data is not used in the calculation. The first line is a header line with column names. The first two of these are labels, while those from the third column onwards describe dates, specified in dd.mm.yyyy format. After the first line, the data is specified in columns as follows:
1. Individual ID (same as for
2. Cluster ID, e.g. the name of the ward in question.
3 onwards. Presence/absence data. A 'Y' indicates that the health care worker was on the ward on the date specified for that column in the first row. An 'N' indicates that the health care worker was not present on the ward on that date. Either 'Y' or 'N' should be specified for each date.
An example is given with the installed package. The path to the example file can be shown by the R command |
ali_file |
A character string with the path to a file in FASTA format
containing genome sequence alignments. This file must contain all required
sequences, specified by the sequence ID in the data of An example is given with the installed package. The path to the example
file can be shown by the R command |
pat_loc_file |
A character string with the path to a file containing the location of patients over time. If this argument is omitted or set to an empty string, then this kind of data is not used in the calculation. This should be a comma separated (.csv) file. Two alternative formats are accepted, "wide" and "long". These are based on the formats in use in the hospital setting where the package was developed. The first line of the file should be a header with variable names. If there is a variable called
The names don't matter, but the columns should appear in the specified order.
In wide format, each row represents a different patient. The file should have the following columns:
1. Individual ID (same as for
2. Cluster ID, e.g. the name of the ward being studied.
3. Infection type, e.g. 'patient' or 'HCW' for health care worker.
4. Availability of data, e.g. 'patient_moves_available'.
5 onwards. Data on the location of a patient, in sets of three columns. These specify in turn: i) the name of the location of the individual, e.g. WARD_01; ii) the start date of the individual being in that location; iii) the end date of the individual being in that location.
In practice only the first column, and columns from 5 onwards, are used. An example is given with the installed package. The path to the example file can be shown by the R command
In long format, each row represents a single stay on a specific ward for a specific patient. The file should have the following columns:
1. Individual ID.
2. Cluster ID, typically the name of the ward.
3. Start date/time for the ward stay, in d/m/Y format (optionally in d/m/Y H:M format, but the time is currently ignored).
4. Name of the ward the patient went to next (or "Discharge" if they were discharged).
5. End date/time for the ward stay, in d/m/Y format (optionally in d/m/Y H:M format, but the time is currently ignored).
An example is given with the installed package. The path to the example file can be shown by the R command |
strain |
Specification of parameters describing transmission dynamics. |
ucta |
Alpha parameter for a gamma distribution of the times between becoming symptomatic and testing positive. Currently not used. |
uctb |
Beta parameter for a gamma distribution of the times between becoming symptomatic and testing positive. Currently not used. |
ucto |
Offset parameter for a gamma distribution of the times between becoming symptomatic and testing positive. Currently not used. |
uct_mean |
Mean time between an individual becoming symptomatic for coronavirus infection and testing positive. This value is used to estimate the time at which an individual became symptomatic when no symptom dates are available. |
evo_rate |
Rate of evolution of the virus, specified in nucleotide substitutions per locus per year. |
seq_noise |
An estimate of the number of mutations separating two genome sequences that arises from sequencing noise. The default parameter was estimated from data collected by Cambridge University Hospitals within single hosts, using the criteria that at least 90% of the reported nucleotides were unambiguous. |
chat |
Prior estimate of the probability of any two individuals being in contact on any given day, conditional on transmission between the two individuals having taken place. |
max_n |
Maximum number of ambiguous nucleotides tolerated in a sequence, counted at positions in the sequence data for which there is a polymorphism. This parameter deals with the case of a sequence of generally high quality in which the missing coverage of the genome is all at critical sites. |
min_qual |
Minimum sequence quality for a sequence to be included, measured as a fraction of genome coverage (e.g. 0.8 would indicate that at least 80% of the genome must have been specified by a sequence). |
diagnostic |
Binary flag to enable extensive diagnostic output from the function. |
hcw_default |
Default probability of a health care worker being present on the ward on a given day if no location information is specified for that individual. Default is 4/7. |
pat_default |
Default probability of a patient being present on the ward on a given day if no location information is specified for that individual. Default 1. |
use_all_seqs |
Binary flag to use multiple sequences from an individual, rather than simply the first collected. Reports the maximum likelihood calculated across all sequences from an individual. |
symptom_uncertainty_calc |
Binary flag to use a complete offset gamma distribution, specified by the parameters ucta, uctb, and ucto, to model the uncertainty in the date of onset of symptoms. |
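The numeric defaults in the usage line can be cross-checked against the argument descriptions with a few lines of base R. This is only a sketch under stated assumptions: it reads ucta as a gamma shape, uctb as a gamma scale, and ucto as an offset that is subtracted, none of which is stated in the text above. Under that reading the mean of the offset gamma agrees with the default uct_mean. The genome length used to turn evo_rate into a per-genome rate is the length of the SARS-CoV-2 reference genome and is not taken from the package.

```r
# Defaults copied from the a2bcovid() usage line.
ucta     <- 2.59321520957074  # "alpha": read here as a gamma shape (assumption)
uctb     <- 3.77600606639754  # "beta": read here as a gamma scale (assumption)
ucto     <- 3.11208004146092  # offset: read here as subtracted (assumption)
uct_mean <- 6.67992

# Mean of an offset gamma under those assumptions: shape * scale - offset.
offset_gamma_mean <- ucta * uctb - ucto
mean_matches <- isTRUE(all.equal(offset_gamma_mean, uct_mean, tolerance = 1e-5))

# hcw_default is documented as 4/7; compare with the rounded default.
hcw_default <- 0.5714286
hcw_matches <- abs(hcw_default - 4 / 7) < 1e-7

# evo_rate is given per locus per year, so a per-genome rate needs a length.
evo_rate      <- 8e-04
genome_length <- 29903  # SARS-CoV-2 reference genome length, not from the package
subs_per_genome_year <- evo_rate * genome_length
subs_per_genome_day  <- subs_per_genome_year / 365

print(c(mean_matches = mean_matches, hcw_matches = hcw_matches))
print(subs_per_genome_year)
```

Both checks should come out TRUE, and the per-genome figure gives a rough sense of how many substitutions per year the default rate implies. How the package itself uses these quantities internally is not shown here.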
A data frame with the following columns
from
to
hcw_from
hcw_to
ordered_i
ordered_j
likelihood
consistency
under_threshold
Chris Illingworth, Chris Jackson.
"A2B-Covid: A method for evaluating potential Covid-19 transmission events". Illingworth C., Hamilton W., Jackson C. et al. Under preparation.
## Example data supplied with the package
pat_file <- system.file("extdata", "Example_genetic_temporal_data.csv", package="a2bcovid")
hcw_loc_file <- system.file("extdata", "Example_movement_file.csv", package="a2bcovid")
ali_file <- system.file("extdata", "Example_sequences.fa", package="a2bcovid")
pat_loc_file <- system.file("extdata", "Example_pat_loc_file.csv", package="a2bcovid")
res <- a2bcovid(pat_file = pat_file, hcw_loc_file = hcw_loc_file, ali_file = ali_file, pat_loc_file = pat_loc_file)
plot_a2bcovid(res, hi_from="from_hcw", hi_to="to_hcw")
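To run the function on data other than the packaged examples, the pat_file and hcw_loc_file layouts described in the arguments can be produced with base R. The following is a minimal sketch, not the authors' workflow. All IDs, dates and column names are invented for illustration; the documentation fixes the column order for pat_file but does not say whether a header row is expected, so the header written here is an assumption that should be checked against the packaged example file.

```r
# pat_file: seven columns in the documented order (names are hypothetical).
pat <- data.frame(
  id             = c("P001", "P002", "H001"),
  onset_date     = c("02/04/2020", "05/04/2020", "07/04/2020"),  # dd/mm/yyyy
  onset_source   = c(1, 2, 3),   # 1 known, 2 asymptomatic, 3 missing/unknown
  infection_type = c(1, 2, 3),   # 1 community patient, 2 other patient, 3 HCW
  sequence_id    = c("seq_P001", "seq_P002", "seq_H001"),
  sample_date    = c("03/04/2020", "05/04/2020", "07/04/2020"),
  received_date  = c("04/04/2020", "06/04/2020", "08/04/2020"),  # unused but required
  stringsAsFactors = FALSE
)
pat_path <- tempfile(fileext = ".csv")
write.csv(pat, pat_path, row.names = FALSE, quote = FALSE)

# hcw_loc_file: two label columns, then one column per date (dd.mm.yyyy header).
dates    <- format(seq(as.Date("2020-04-01"), by = "day", length.out = 5), "%d.%m.%Y")
presence <- c("Y", "Y", "N", "N", "Y")  # one Y/N per date for this worker
hcw <- data.frame(id = "H001", cluster = "WARD_01", t(presence),
                  stringsAsFactors = FALSE, check.names = FALSE)
names(hcw)[-(1:2)] <- dates
hcw_path <- tempfile(fileext = ".csv")
write.csv(hcw, hcw_path, row.names = FALSE, quote = FALSE)

# Read both files back to confirm the layout survived the round trip.
pat_check <- read.csv(pat_path, stringsAsFactors = FALSE)
hcw_check <- read.csv(hcw_path, stringsAsFactors = FALSE, check.names = FALSE)
print(ncol(pat_check))
print(names(hcw_check))
```

The resulting paths could then be passed as pat_file and hcw_loc_file, together with a FASTA alignment whose sequence headers match the sequence_id column.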
Web app interface to a2bcovid
a2bcovid_app(rstudio = FALSE)
rstudio |
Set to |
Convert patient location data for an a2bcovid analysis from long to wide format
long_to_wide(long_file)
long_file |
A path name to a CSV file in long format. |
A path name to a temporary file containing the equivalent data in wide format. This can be read with read.csv.
The names of the columns in the wide and long formats are both documented in the a2bcovid help page, argument pat_loc_file.
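The long format documented under pat_loc_file accepts start and end values either as d/m/Y or as d/m/Y H:M, with the time ignored. Before converting a file, it can help to check that the dates parse and that each stay ends on or after the day it starts. The sketch below does this in base R on invented data; the column names are placeholders, and it only illustrates the date handling, not the package's own parsing.

```r
# Invented ward stays; start and end mix the two accepted date layouts.
stays <- data.frame(
  id    = c("P001", "P001"),
  start = c("01/04/2020 09:30", "05/04/2020"),
  end   = c("05/04/2020 14:00", "09/04/2020"),
  stringsAsFactors = FALSE
)

# as.Date() with an explicit format ignores trailing characters,
# so any H:M part is dropped and only the day is kept.
parse_dmy <- function(x) as.Date(x, format = "%d/%m/%Y")

stays$start_date <- parse_dmy(stays$start)
stays$end_date   <- parse_dmy(stays$end)

# Flag rows that failed to parse or that end before they start.
bad <- is.na(stays$start_date) | is.na(stays$end_date) |
  stays$end_date < stays$start_date
print(stays[, c("id", "start_date", "end_date")])
print(any(bad))
```

Rows flagged in bad would need correcting by hand before the file is passed to long_to_wide or a2bcovid.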
Plots a grid of colours indicating likelihood of transmission paths between each pair of individuals.
plot_a2bcovid( x, cluster = TRUE, hi_from = "from_hcw", hi_to = "to_hcw", hi_col = "red", hi_lab = NULL, palette = NULL, continuous = FALSE, direction = 1 )
x |
Data frame returned by |
cluster |
If |
hi_from |
Character string, naming a variable in the dataframe indicating "from" individual IDs to be highlighted in the plot. If not supplied, then no IDs will be highlighted. |
hi_to |
Character string indicating "to" individual IDs to be highlighted, similarly. |
hi_col |
Colour to use to highlight individual IDs. |
hi_lab |
Legend to describe which individuals are highlighted. By default this is "Healthcare workers". |
palette |
Colour palette, passed to
|
continuous |
If |
direction |
Direction of colours in the brewer palettes. Defaults to 1. Change to -1 to reverse the order of colours. |
A ggplot2 plot object.
Convert patient location data for an a2bcovid analysis from wide to long format
wide_to_long(wide_file)
wide_file |
A path name to a CSV file in wide format. |
A path name to a temporary file containing the equivalent data in long format. This can be read with read.csv.
The names of the columns in the wide and long formats are both documented in the a2bcovid help page, argument pat_loc_file.