Virus Expression Database

Methods

Virus Nomenclature

Viruses are specified at the species level, as defined in the ICTV Master Species List, version 2018b.v2. Human-infecting viruses were identified from annotations in ViralZone, UniProt, the Virus-Host DB, Taylor et al. (2001), and Woolhouse et al. (2012). Virus aliases and synonyms were derived from the same sources.

GEO Study Identification

GEO study descriptions were downloaded in SOFT format. Only samples annotated as human and run on an Affymetrix microarray or Illumina RNA-seq platform were considered. An in-house tool was used to search these records for instances of any virus name or alias. Candidate samples were manually curated.
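
The in-house search tool is not described further, so the following is only a minimal sketch of how an alias search of this kind could work. The function names and the alias-to-species mapping are hypothetical; the sketch assumes case-insensitive matching on whole tokens, so that a short alias such as "HIV" does not match inside an unrelated word.

```python
import re


def build_alias_pattern(aliases):
    """Compile one case-insensitive pattern matching any alias as a whole token."""
    # Longest aliases first, so "HIV-1" is preferred over "HIV" at the same position.
    ordered = sorted(aliases, key=len, reverse=True)
    body = "|".join(re.escape(alias) for alias in ordered)
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)


def find_virus_mentions(text, alias_to_species):
    """Return the sorted species names whose aliases occur in the text.

    alias_to_species is a hypothetical mapping such as {"ZIKV": "Zika virus"}.
    """
    if not alias_to_species:
        return []
    pattern = build_alias_pattern(alias_to_species)
    lookup = {alias.lower(): species for alias, species in alias_to_species.items()}
    return sorted({lookup[m.group(0).lower()] for m in pattern.finditer(text)})
```

Hits from a search like this are only candidates; as stated above, the candidate samples were curated manually.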

Microarray Processing

Microarrays were processed using the Affymetrix Power Tools (v1.15.1) to perform Robust Multi-array Average (RMA) normalization with the BrainArray Custom CDF (v25, Ensembl genes). All samples using the same CDF (as specified in the CEL file header) were processed together, using the command line:

apt-probeset-summarize -d <CDF file> --chip-type <BrainArray name> --chip-type <Affymetrix name> -a rma-sketch <CEL files>

Following QC (see below), the passing samples were rerun with the same command line to yield the final values. Thus, the final microarray values are RMA-normalized, in log2 space, and use Ensembl genes as reference rather than the original Affymetrix probesets.
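
As an illustration of the batching described above, the sketch below groups CEL files by chip type and assembles the argument list for one run per group. It uses only the flags given in the command line above. The function names are hypothetical, and the mapping from CEL file to chip type is assumed to have been read from the CEL headers already (header parsing is not shown).

```python
from collections import defaultdict


def group_cels_by_chip(cel_chip_types):
    """Group CEL file paths by the chip type named in their header.

    cel_chip_types is a hypothetical dict of {CEL path: chip type}.
    """
    groups = defaultdict(list)
    for cel_path, chip_type in sorted(cel_chip_types.items()):
        groups[chip_type].append(cel_path)
    return dict(groups)


def apt_rma_args(cdf_file, brainarray_name, affymetrix_name, cel_files):
    """Build the apt-probeset-summarize argument list for one chip type."""
    return [
        "apt-probeset-summarize",
        "-d", cdf_file,
        "--chip-type", brainarray_name,
        "--chip-type", affymetrix_name,
        "-a", "rma-sketch",
        *cel_files,
    ]
```

The resulting list could be passed to subprocess.run, once for each chip type before QC and once more for the passing samples.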

RNA-Seq Processing

FASTQ files for all samples were downloaded from SRA using fasterq-dump from version 2.10.8 of SRA Tools. FASTQs were then quantified using kallisto v0.46.0 and the Ensembl 102 cDNA definitions. This version of Ensembl was used to match the version used for the BrainArray microarray annotations. The precise command line differed based on the Library Layout as annotated in SRA:

For Single-End libraries:

kallisto quant -i <kallisto index> --single --bias -l 300 -s 30 -o <output directory> <FASTQ files>

For Paired-End libraries:

kallisto quant -i <kallisto index> --bias -o <output directory> <FASTQ files>
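
The choice between the two command lines can be sketched as a small helper that builds the argument list from the library layout. The function name is hypothetical, and the layout strings "SINGLE" and "PAIRED" are an assumption about how the SRA annotation is spelled; the flags themselves are exactly those shown above.

```python
def kallisto_quant_args(index, out_dir, fastq_files, layout):
    """Build the kallisto quant argument list for one sample."""
    layout = layout.upper()
    args = ["kallisto", "quant", "-i", index]
    if layout == "SINGLE":
        # Single-end runs need an explicit fragment length mean (-l) and SD (-s).
        args += ["--single", "--bias", "-l", "300", "-s", "30"]
    elif layout == "PAIRED":
        args += ["--bias"]
    else:
        raise ValueError(f"unrecognized library layout: {layout!r}")
    args += ["-o", out_dir, *fastq_files]
    return args
```

For single-end data, kallisto cannot estimate the fragment length distribution from the reads, which is why the single-end command supplies -l and -s explicitly.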

Transcripts per Million (TPM) quantities were used for differential expression. Note that these units are linear, unlike the log2 values used for microarrays.

Quality Control

Microarrays were removed if they had mad_residual_mean > 0.80 or pm_mean < 65. The first metric quantifies the error in the RMA model applied to the probesets, while the second is an indicator of the overall "brightness" of the array. These filters removed ~3% of eligible samples.
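
The microarray filter can be written as a single predicate. This is a sketch: the function name is hypothetical, and the two metric values are assumed to have been read from the report produced by the summarization run.

```python
MAX_MAD_RESIDUAL_MEAN = 0.80
MIN_PM_MEAN = 65


def microarray_passes_qc(mad_residual_mean, pm_mean):
    """Return True if the array is kept, False if either threshold is violated."""
    too_noisy = mad_residual_mean > MAX_MAD_RESIDUAL_MEAN
    too_dim = pm_mean < MIN_PM_MEAN
    return not (too_noisy or too_dim)
```

Both comparisons are strict, so an array sitting exactly on a threshold is kept.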

RNA-seq samples were removed if they had either a low absolute number of usable reads (pseudo-aligned reads < 1.5 million) or a low fraction of reads that could be pseudo-aligned (< 18% of total reads).
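
A corresponding sketch for the RNA-seq filter is shown here. The function name is hypothetical, and the two read counts are assumed to come from the run summary that kallisto writes for each sample.

```python
MIN_PSEUDOALIGNED_READS = 1_500_000
MIN_PSEUDOALIGNED_PERCENT = 18


def rnaseq_passes_qc(n_processed, n_pseudoaligned):
    """Return True if the sample is kept under both read-count criteria."""
    if n_processed <= 0:
        return False
    too_few_reads = n_pseudoaligned < MIN_PSEUDOALIGNED_READS
    # Integer arithmetic avoids floating-point edge cases at exactly 18%.
    low_fraction = n_pseudoaligned * 100 < MIN_PSEUDOALIGNED_PERCENT * n_processed
    return not (too_few_reads or low_fraction)
```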

If 40% or more of the samples from a study were unavailable (either because they failed QC or because the raw data were not available), the study was removed.
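
The study-level rule reduces to a comparison of counts, sketched below with a hypothetical function name. A sample counts as usable only if its raw data were available and it passed QC.

```python
MAX_UNAVAILABLE_PERCENT = 40


def study_is_retained(n_samples, n_usable):
    """Return False when 40% or more of a study's samples are unavailable."""
    if n_samples <= 0:
        return False
    n_unavailable = n_samples - n_usable
    # Compare as integers so that exactly 40% is reliably treated as a removal.
    return n_unavailable * 100 < MAX_UNAVAILABLE_PERCENT * n_samples
```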

Differential Analyses

All differential analyses used a two-sided Welch's t-test (unequal variance) with Benjamini-Hochberg correction. Comparisons were performed within a single study, cell type, and platform, and were only performed when at least two samples were present for both the infected and uninfected conditions.
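
For reference, the two statistical steps can be sketched with the Python standard library alone. This is not the code used to build the database, and the function names are hypothetical. The first function computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom for one gene; converting these to a two-sided p-value requires the Student's t distribution from a statistics library and is not shown. The second function applies the Benjamini-Hochberg adjustment to a list of p-values.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(infected, uninfected):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance test."""
    n1, n2 = len(infected), len(uninfected)
    if n1 < 2 or n2 < 2:
        raise ValueError("at least two samples are required per condition")
    # Squared standard error contributed by each group (sample variance / n).
    se1, se2 = variance(infected) / n1, variance(uninfected) / n2
    if se1 + se2 == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(infected) - mean(uninfected)) / sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The ValueError for fewer than two samples per condition mirrors the eligibility rule stated above.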