What is the benefit of using GPAS?
GPAS provides a rapid, standardised and easy-to-use method of analysing genomic sequences. Results can be downloaded or viewed in dashboards for understanding individual sequence data and trends. It can be part of an early warning system for pandemic response in a sovereign nation, and globally for those who wish to share their data. By standardising the genome assembly process, GPAS enables data to be compared directly between different laboratories and across borders.
What are the main technology components of GPAS?
- user interface (client) for data upload (available as command-line tool and graphical user interface)
- technology for assembling the genetic reads and analysing the individual sequence (the Viridian assembly pipeline, plus NextClade, PangoLEARN/Scorpio, aln2type and other tools, orchestrated by the Scalable Pathogen Pipeline Platform (SP3))
- technology for rapidly assessing sequence relatedness (FindNeighbour4 (FN4))
all of which are hosted on the Oracle Cloud Infrastructure.
What is the charge associated with the GPAS service?
GPAS is offered free of charge for analysis of SARS-CoV-2 sequences for academic and research labs in low- and middle-income countries for the next 10 years (until April 2031).
What are the hidden costs of GPAS for academic and research labs in low- and middle-income countries?
GPAS is a free tool/service for academic and research labs in low- and middle-income countries, run by a not-for-profit entity. The computing power and data storage that drive the service are provided for free under the 10-year donation from Oracle.
There are costs associated with development and maintenance of the tool/service, some of which have been covered by philanthropic donations from global institutions, including the Ellison Institute for Transformative Medicine and Oracle. Additional operational costs associated with running the service will be paid for through cost recovery charges/fees/contributions from government and private sector contracts and users in high- and middle-income countries.
We are actively seeking grants from major health organisations so that we do not have to charge governments and organisations in low- and middle-income countries.
Is GPAS a sequence database or repository? How does this differ from repositories like GISAID or EBI?
GPAS is not a sequence database or a repository of data. It is a service that processes and analyses sequences in a standardised manner, which can then be stored in any repository that the user chooses.
Data that is processed by GPAS can be stored in repositories like GISAID, EBI or ENA, or shared to Pathogenwatch. We are developing tools to enable automated upload to repositories from GPAS (for users that wish to do this).
How long does it take to process a genomic sequence on GPAS?
Processing a single sequence on GPAS takes 15-20 minutes (multiple samples are processed in parallel). Uploading, processing and downloading a large batch of sequences takes 2-4 hours over a moderate internet connection.
What data is being uploaded to the cloud? Is any personally identifiable information (PII) being uploaded?
GPAS does not collect clinical PII. PII about the sample (including the local laboratory reference number, and any genetic material from the patient from whom the sample was collected) stays on the local computer of the GPAS user. GPAS does not have access to any PII held on the local computer.
The upload client creates a lookup table, stored on the user’s computer, to enable users to link anonymised GPAS results back to their local records.
The GPAS upload client removes human genetic reads, then sends the genomic sequence to the cloud, along with relevant metadata that captures the provenance of the data: the date the sample was collected, the method of sample collection, the approximate ("fuzzy") location where the sample was collected, who uploaded it to GPAS, etc.
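The lookup table described above can be pictured with a minimal sketch. This is not the upload client's actual code: the function names, the CSV layout and the use of random UUIDs as anonymised identifiers are all assumptions made for illustration.

```python
import csv
import io
import uuid


def build_lookup(local_ids):
    """Map each local sample ID to a random identifier (hypothetical scheme)."""
    return {local_id: uuid.uuid4().hex for local_id in local_ids}


def write_lookup(lookup, handle):
    """Write the mapping as CSV; this file would stay on the user's computer."""
    writer = csv.writer(handle)
    writer.writerow(["local_id", "upload_id"])
    for local_id, upload_id in lookup.items():
        writer.writerow([local_id, upload_id])


def relink(results, lookup):
    """Re-key results returned under upload IDs back to the local IDs."""
    reverse = {upload_id: local_id for local_id, upload_id in lookup.items()}
    return {reverse[upload_id]: value for upload_id, value in results.items()}
```

Only the random identifiers would leave the local machine in this sketch; the local IDs are recovered by calling `relink` on downloaded results.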
What sequencing technology do I need to use GPAS? What other infrastructure?
GPAS V1.0 works with data collected from the Oxford Nanopore and Illumina sequencing platforms. We expect to support other sequencing platforms in future releases.
To send data to GPAS, users download and install the upload client, and use the client to upload the data to the GPAS cloud. Users can choose a command-line interface, or a user-friendly graphical interface. The graphical interface is designed to be intuitive and not require specialist training, and will work with common operating systems (Linux Ubuntu, Windows, iOS).
Uploading data to the cloud requires an active internet connection. Once data has been uploaded and processed, users may view their results in a web browser using our online dashboard, or they can download their results for local analysis.
What technical support is available for using GPAS?
As part of this free service, technical user support will be available to all labs. There is a dedicated helpline that will operate the same way a helpline for a commercially available software service would operate.
How does GPAS create consensus genomes?
GPAS uses the Viridian workflow. This is an amplicon-aware assembly, with careful attention to primers. It is a reference-mapped “genome polisher” approach, correcting the assembly for consistent deviations. Each read is identified as a specific amplicon, and the primers are removed before the remaining sequence is mapped to the reference genome.
The Viridian workflow applies quality control metrics at the amplicon and base level, as follows:
- Reads are aligned to the reference (default MN908947.3). This removes reads that do not sufficiently match this sequence.
- Individual reads must be sufficiently long. For Illumina reads, the minimum length is set by illumina_read_lengths, default 50.
- The start and end positions of reads (read-pairs) are used to infer which amplicon they belong to (e.g. using nCoV-artic-v3.bed). “off-target” reads (pairs) are discarded.
- Enough reads (pairs) must cover an amplicon or else all calls inside the region covered solely by that amplicon (i.e. not overlapping adjacent amplicons) are voided with Ns: min_depth, default 50.
- Enough reads (pairs) must span at least 75% of the amplicon they belong to, excluding internal indels. The fraction spanned by a read is (alignment_end - alignment_start) / amplicon_length. The proportion of reads that must reach 75% is controlled by min_template_coverage_75, default 80%. This filters out short fragments caused by PCR artefacts. If the check fails, the entire amplicon is discarded.
- After the assembly is constructed the original reads are mapped to it. Individual bases are voided with N if there is less than freq_threshold agreement. Default is 80%.
- After assembly, Viridian does mild filtering: 70% of reads at a position need to support the consensus (FRS>0.7), and depth should be at least 10.
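Three of the checks above (amplicon depth, template coverage and base-level agreement) can be sketched as follows. This is an illustration under stated assumptions, not Viridian's implementation: the function names, the `Alignment` type, the order of the checks and the returned status strings are hypothetical. Only the default thresholds are taken from the list.

```python
from dataclasses import dataclass

# Defaults quoted in the list above; all names here are hypothetical.
MIN_DEPTH = 50                # reads (pairs) required per amplicon
SPAN_FRACTION = 0.75          # a read must span at least 75% of its amplicon
MIN_TEMPLATE_COVERAGE = 0.80  # share of reads that must reach SPAN_FRACTION
FREQ_THRESHOLD = 0.80         # agreement required to keep a base call


@dataclass
class Alignment:
    start: int  # alignment start on the reference
    end: int    # alignment end on the reference


def check_amplicon(alignments, amplicon_length):
    """Return 'pass', 'mask_unique_region' (too shallow) or 'discard'."""
    if len(alignments) < MIN_DEPTH:
        # Too few reads: calls unique to this amplicon would become Ns.
        return "mask_unique_region"
    spanning = sum(
        1 for a in alignments
        if (a.end - a.start) / amplicon_length >= SPAN_FRACTION
    )
    if spanning / len(alignments) < MIN_TEMPLATE_COVERAGE:
        # Mostly short fragments: the whole amplicon is dropped.
        return "discard"
    return "pass"


def call_base(base_counts):
    """Return the majority base, or 'N' if agreement is below the threshold."""
    total = sum(base_counts.values())
    if total == 0:
        return "N"
    base, count = max(base_counts.items(), key=lambda item: item[1])
    return base if count / total >= FREQ_THRESHOLD else "N"
```

For example, `call_base({"A": 90, "G": 10})` keeps the call `A`, while `call_base({"A": 70, "G": 30})` is voided to `N` because agreement is below 80%.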
Which primer sets does GPAS support?
GPAS currently supports the Artic3 primer set, and SISPA (Sequence-Independent, Single-Primer Amplification).
We are working to support other primer sets (starting with Artic4 and Midnight). Because the Viridian assembly process specifies primers in a modular way, we expect that future releases will be able to accommodate custom primers defined in a BED file, but this option is not currently available.
How does GPAS handle human read removal?
The GPAS upload client removes human genetic reads from users’ samples, prior to uploading them to GPAS, to ensure patient Personally-Identifiable Information (PII) is not sent to the GPAS cloud.
This is done using read-it-and-keep, which positively selects SARS-CoV-2 reads and discards all non-SARS-CoV-2 reads. The process takes only about one minute per sample, requires a relatively small amount of RAM (10-286 MB) and could run on any multi-core laptop.
ReadItAndKeep is published as a preprint (https://www.biorxiv.org/content/10.1101/2022.01.21.477194v1) and the code is freely available under an MIT license.
The preprint describes how read-it-and-keep was compared with Dehumanizer on 246 Illumina and 189 Oxford Nanopore samples deposited in the ENA by COG-UK. These samples were chosen from a much larger dataset to maximise the genetic diversity they contain: the Illumina and Nanopore sets contained 61 and 52 pangolin lineages respectively, including Alpha and Delta. As described in Table 1 of the preprint, read-it-and-keep removed 0.106% and 0.008% of reads from the Illumina and Nanopore datasets, respectively. Applying read-it-and-keep to synthetic Wuhan and Omicron ARTICv4 samples shows that it removes up to 0.007% of reads when there is a 4% error rate in the genetic reads.
What happens if my file upload process is interrupted?
Currently, any interrupted or failed uploads will resume from the point after the last fully uploaded sample. For example, uploading a batch of 10 samples: if samples 1-9 were complete, and sample 10 was interrupted, the upload will resume from the beginning of sample 10. We are working to improve this for future releases.
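The resume behaviour described above amounts to restarting at the first sample that did not finish uploading. A minimal sketch, assuming the client keeps a record of completed samples (the function and variable names are hypothetical, not the client's API):

```python
def resume_point(batch, completed):
    """Return the samples still to upload, starting at the first incomplete one."""
    for index, sample in enumerate(batch):
        if sample not in completed:
            # Everything from here on is re-sent from the start of this sample.
            return batch[index:]
    return []


batch = [f"sample{n}" for n in range(1, 11)]
completed = set(batch[:9])  # samples 1-9 finished, sample 10 was interrupted
remaining = resume_point(batch, completed)
```

Here `remaining` contains only `sample10`, which is uploaded again from its beginning, matching the example in the answer.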
How does GPAS handle re-classification of lineages over time?
SARS-CoV-2 samples can be named in different ways. https://covariants.org/ is a helpful resource to understand the relationships between nomenclatures.
GPAS currently provides automatic lineage calls for all samples, following:
- PANGOLin (e.g. B.1.1.7)
- WHO Variants of Concern list (e.g. Alpha, Beta, Delta)
- UKHSA Variants of Concern list (e.g. VOC-20DEC-01)
GPAS uses NextClade to perform viral genome alignment, mutation calling and quality checks, but does not retain NextClade lineage calls (e.g. 20I/501Y.V1).
Lineage is a complex topic and these lineage definitions are frequently updated by their respective organisations. GPAS updates all of these components to the latest version once a week, to apply the latest definitions.
Currently, samples are assigned lineage names at the point of initial processing. GPAS plans to periodically update all samples to the latest lineage definitions: the process will be developed in consultation with the GPAS User Reference Group, to ensure that local lab workflows are supported.
Data Sharing and Ownership
Who can access data uploaded to GPAS?
Each organisation has full control over what data is uploaded to GPAS, and whether that data is shared with the community of “sharing” organisations.
When you join GPAS, your organisation will choose whether to share data or not.
By granting full control over data sharing to each lab, GPAS respects the autonomy of each lab, and the rights of the sovereign nation where the sequence originated, to decide what to do with its data. At the same time, the system incentivises data sharing by allowing users to see shared data only if they share their own data.
Data will only be shared with other labs and relevant local authorities; it will not be shared with or sold to private sector companies. As we develop future versions of GPAS with input from users, we will create additional sharing functionality to facilitate research partnerships and public health response to outbreaks.
Oxford and Oracle do not have priority access to any data on the GPAS platform: we have access to shared data on the same terms as all other users. Data which is not shared is not visible to Oracle or Oxford.
We hope to facilitate relationships between users, for those who wish to collaborate on analysis or publications, but neither Oxford nor Oracle expects to be included in any such activities by default.
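The visibility rule in this answer (an organisation always sees its own data, and sees other organisations' shared data only if it shares its own) can be written as a short sketch. It assumes sharing is a single per-organisation choice, as the answer suggests; the function name and data layout are hypothetical and do not describe the platform's actual access control.

```python
def visible_samples(viewer, samples, sharing_orgs):
    """List the sample IDs an organisation can see.

    samples: iterable of (owner, sample_id) pairs.
    sharing_orgs: set of organisations that chose to share their data.
    """
    visible = []
    for owner, sample_id in samples:
        if owner == viewer:
            # An organisation always sees its own data.
            visible.append(sample_id)
        elif viewer in sharing_orgs and owner in sharing_orgs:
            # Shared data is visible only to organisations that also share.
            visible.append(sample_id)
    return visible
```

With `sharing_orgs = {"lab_a", "lab_b"}`, a non-sharing `lab_c` sees only its own samples, while `lab_a` sees its own samples plus those of `lab_b`.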
Who has ownership of the data processed by GPAS?
GPAS users will retain full ownership of all sample data (including metadata) they upload to the platform, and of the results generated from that data. GPAS processes data according to users’ instructions.
Users may choose to share their data with other “sharing” users of the platform, or instruct GPAS to submit sequences to repositories such as ENA, EBI or GISAID, but this will only be done with explicit permission from the data owner.
Institutional partnerships and legal arrangements
Are there opportunities for collaboration, academic partnerships, etc.?
Yes. These are being developed on a case-by-case basis, given the wide range of interests from different labs, public health agencies, and governments around the world.
There are opportunities for collaboration in areas such as academic research/publications, capacity-building for genomic sequencing labs, and providing input to GPAS for development of subsequent versions. We plan to build a robust community of practice to facilitate research collaborations, data sharing as appropriate, and networks that can facilitate rapid scientific research and public health response.
Oxford University/Oracle/GPAS will not have preferential access to data, and GPAS is not intended as a vehicle to drive University of Oxford research.
Is GPAS a legal entity?
GPAS is currently an initiative between researchers at the University of Oxford and Oracle. In a future stage, GPAS will be its own legal entity, run on a not-for-profit basis.
What legal arrangements are required to use GPAS?