MetaAtlas - Documentation

An initial list of all Drosophila melanogaster transcriptomic datasets was constructed by searching NCBI’s Sequence Read Archive (SRA) for organism=“Drosophila melanogaster” and source=“RNA”. This gave a set of 93,372 individual samples at the time of writing. These samples were then narrowed down to only those from bulk RNA-seq experiments, which make up the full set of experiments included in our database.

Of the applicable experiments, we only seek to analyse those which are appropriately annotated with metadata, so as to prevent poorly defined experiments from altering gene expression profiles. As such, experiments within our database are defined either as “Active”, denoting cases which have been fully annotated and analysed, or as “Inactive”, denoting cases which lack full annotation and are not included in our dataset.

Experiments can be searched using our “Browse Datasets” facility, available here.

Each individual sample on SRA is associated with a larger project, grouping related samples. As such, each sample within our database is associated with an overarching project from NCBI BioSamples. Where the information was available, these projects were also associated with authors – either from NCBI itself or populated from linked publications.

Projects and authors can also be searched using “Browse Datasets” here.

Metadata for each sample was initially populated directly from NCBI using the Entrez E-Utilities (please see here).
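As a rough illustration of how such a search and metadata lookup could be expressed against the E-Utilities, the sketch below only builds the request URLs and does not contact NCBI. The search term is an assumption: the field tags `[Organism]` and `"biomol rna"[Properties]` are our best guess at how the organism and source filters are written in an Entrez query, not the exact query used for MetaAtlas. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez E-Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# ASSUMPTION: this term approximates organism="Drosophila melanogaster"
# and source="RNA"; the exact query used for MetaAtlas is not documented here.
ASSUMED_TERM = '"Drosophila melanogaster"[Organism] AND "biomol rna"[Properties]'


def build_esearch_url(term, retmax=100, retstart=0):
    """Return an esearch URL listing SRA record IDs that match `term`."""
    params = {
        "db": "sra",
        "term": term,
        "retmax": retmax,      # number of IDs per page
        "retstart": retstart,  # offset of the first ID returned
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(uids):
    """Return an efetch URL retrieving the full SRA records for `uids`."""
    params = {"db": "sra", "id": ",".join(str(u) for u in uids)}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url(ASSUMED_TERM))
    print(build_efetch_url([101, 102]))  # placeholder IDs, not real records
```

Paging through a large result set would mean repeating the esearch request with an increasing `retstart`, while staying within NCBI's request rate limits.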

With this basic metadata in place, extensive manual curation was carried out to standardise and control the vocabulary used across all distinguishing metrics between samples. This was not a one-size-fits-all process – in some cases, this was as straightforward as formatting the information already available, while in many others this required deep mining of the publication associated with the sample.

To explore our controlled vocabulary and how it applies to samples within our database, please use the “Explore Controlled Vocabulary” facility provided here.

MetaAtlas in many places takes the information available on SRA at face value. In cases where you feel information is missing or incorrect, please get in touch with our admin by email at andrew.gillen@glasgow.ac.uk

Our basic analytical pipeline was based on MassiveQC, as described in this paper and publicly available at https://github.com/shimw6828/MassiveQC. That said, we made two major changes to this pipeline:

-A ribosomal RNA removal step was added prior to sequence alignment. rRNA quantities can vary strongly between samples and bias gene expression estimates (see here), particularly in cases such as this, where differing library preparation methods are to be compared. As such, we employed BBDuk, part of the BBMap suite, to remove reads matching rRNA sequences.

-To lower the computational costs associated with data reprocessing, we employed kallisto pseudoalignment instead of HISAT2 alignment. kallisto performs exceptionally well in a majority of cases, and runs markedly faster than HISAT2 followed by featureCounts (see here).
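The two changes above might look something like the following when expressed as command lines. This is a minimal sketch that only assembles the argument lists and runs nothing. File names, the k-mer size, and the fragment length settings are placeholders rather than the parameters used for MetaAtlas, and the function names are hypothetical. The BBDuk example handles a single input file only.

```python
def bbduk_rrna_filter_cmd(reads_in, reads_out, rrna_fasta, kmer=31):
    """Build a BBDuk command that drops reads sharing k-mers with rRNA references.

    ASSUMPTION: k=31 is an illustrative value, not the MetaAtlas setting.
    """
    return [
        "bbduk.sh",
        f"in={reads_in}",     # raw reads
        f"out={reads_out}",   # reads that did NOT match the rRNA reference
        f"ref={rrna_fasta}",  # FASTA file of rRNA sequences
        f"k={kmer}",
    ]


def kallisto_quant_cmd(index, out_dir, reads, single_end=False,
                       frag_len=200, frag_sd=20, threads=1):
    """Build a `kallisto quant` command for filtered reads.

    Single-end libraries need an estimated fragment length and standard
    deviation; the defaults here are placeholders.
    """
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir, "-t", str(threads)]
    if single_end:
        cmd += ["--single", "-l", str(frag_len), "-s", str(frag_sd)]
    return cmd + list(reads)


if __name__ == "__main__":
    print(" ".join(bbduk_rrna_filter_cmd("raw.fastq.gz", "clean.fastq.gz", "rrna.fa")))
    print(" ".join(kallisto_quant_cmd("dmel.idx", "quant_out", ["clean.fastq.gz"],
                                      single_end=True)))
```

Each list could be passed to `subprocess.run` once the tools are installed and the placeholder paths are replaced with real files.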

To ensure results were comparable between datasets, we standardised our data using the steps described here.

edgeR was used to generate TMM-normalised library sizes across all samples, and these sizes were then used to calculate Transcripts Per Million (TPM) for each sample.
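The exact formula combining TMM factors with TPM is not spelled out here, so the sketch below shows only one plausible reading, and it is an assumption throughout. It treats the TMM factor for a sample (as would be produced by edgeR) as a multiplier on that sample's length-normalised library size. With a factor of 1 it reduces to ordinary TPM. The function name and inputs are hypothetical.

```python
def tmm_scaled_tpm(counts, eff_lengths, norm_factor=1.0):
    """Length-normalise counts and scale by a TMM-adjusted library size.

    counts      : estimated read counts per transcript for one sample
    eff_lengths : effective transcript lengths, in the same order
    norm_factor : TMM normalisation factor for this sample (ASSUMED to be
                  computed elsewhere, e.g. by edgeR); 1.0 gives plain TPM
    """
    # Reads per base of effective transcript length.
    rates = [c / length for c, length in zip(counts, eff_lengths)]
    # ASSUMPTION: the effective library size is the summed rate times the factor.
    effective_total = sum(rates) * norm_factor
    if effective_total <= 0:
        raise ValueError("sample has no usable counts")
    return [rate / effective_total * 1e6 for rate in rates]


if __name__ == "__main__":
    plain = tmm_scaled_tpm([10, 20], [1000, 1000])
    scaled = tmm_scaled_tpm([10, 20], [1000, 1000], norm_factor=2.0)
    print(plain, scaled)
```

Note that values scaled this way no longer sum to exactly one million per sample unless the factor is 1, which is the price of making samples comparable to one another.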

Log-transformed TPMs were used as input for ComBat batch correction to minimise the impact of inter-dataset technical differences. The more recent ComBat-seq could not be used, as it does not (yet!) allow for batch sizes of one, which is the case for some of our datasets.

Resulting batch-corrected TPMs were returned to a non-log scale, allowing for straightforward biological interpretation. This is what can be found on our gene summary pages as normalised TPM – nTPM.
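The log transform and the return to a non-log scale can be sketched as a simple round trip. The log base (2), the pseudocount (1), and the clipping of negative back-transformed values to zero are all assumptions made for illustration; the settings actually used for nTPM are not stated here. ComBat itself is not implemented in this sketch.

```python
import math


def to_log_scale(tpm_values, pseudocount=1.0):
    """Log2-transform TPMs; the pseudocount keeps zero expression defined."""
    return [math.log2(v + pseudocount) for v in tpm_values]


def from_log_scale(log_values, pseudocount=1.0):
    """Invert to_log_scale after batch correction.

    ASSUMPTION: corrected values that fall below zero on the linear scale
    are clipped to zero, since negative expression has no meaning.
    """
    return [max(0.0, 2.0 ** v - pseudocount) for v in log_values]


if __name__ == "__main__":
    tpm = [0.0, 1.0, 99.0]
    logged = to_log_scale(tpm)
    # A batch-correction step (e.g. ComBat) would adjust `logged` here.
    print(from_log_scale(logged))
```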

We can take absolutely no credit for this – graphs are made using the fantastic facilities provided by plotly

Our current funding for MetaAtlas has ended, limiting the further analysis we can perform. If you have an exciting new idea for how we can use our data, please don’t hesitate to get in touch by email at andrew.gillen@glasgow.ac.uk

That said, we do have plans for smaller scale improvements to our system, which will be rolled out as and when development is complete. As a ROUGH outline, please expect:

-Mobile functionality: April 2025

-Expanded analytical functionality: August 2025

The data underlying MetaAtlas is, by definition, publicly available. As such, for any single experiment, raw data can be downloaded from SRA.

If you’d like the full MySQL database underpinning MetaAtlas, this can be arranged, but be aware that it’s very large (~30 GB). If you’d still like it, please get in touch by email at andrew.gillen@glasgow.ac.uk. We cannot statically host this on the website due to data storage/transfer limits.

Please let us know what else you need to know – odds are other people are also wondering! Just get in touch by email at andrew.gillen@glasgow.ac.uk