Yuk Yee Leung

Software

SparkINFERNO: a scalable high-throughput pipeline for inferring molecular mechanisms of non-coding genetic variants

Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants (SparkINFERNO) is a scalable bioinformatics pipeline for characterizing non-coding GWAS association findings. SparkINFERNO prioritizes causal variants underlying GWAS association signals and reports the relevant regulatory elements, tissue contexts and plausible target genes they affect. To achieve this, the SparkINFERNO algorithm integrates GWAS summary statistics with a large-scale collection of functional genomics datasets spanning enhancer activity, transcription factor binding, expression quantitative trait loci and other functional data across more than 400 tissues and cell types. Scalability is achieved by an underlying API implemented using Apache Spark and Giggle-based genomic indexing. We evaluated SparkINFERNO on large GWASs and show that it is more than 60 times more efficient and scales with data size and the amount of computational resources. This work is described in Kuksa et al. (Bioinformatics, 2020). SparkINFERNO runs on clusters or on a single server with an Apache Spark environment, and is available here or here.
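The core idea of overlapping GWAS variants with functional annotations can be illustrated with a minimal sketch. This is not SparkINFERNO's implementation (which relies on Apache Spark and Giggle indexing); it is a plain-Python toy that assumes 0-based, half-open BED-style intervals. The function names, variant IDs and annotation labels are hypothetical.

```python
from collections import defaultdict


def index_annotations(annotations):
    """Group (chrom, start, end, label) intervals by chromosome.

    Intervals are assumed to be 0-based and half-open, as in BED files.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, label in annotations:
        by_chrom[chrom].append((start, end, label))
    return by_chrom


def annotate_variants(variants, by_chrom):
    """Map each variant ID to the sorted labels of intervals containing it.

    Each variant is (variant_id, chrom, pos) with a 0-based position.
    A linear scan per chromosome is used here for clarity; a real pipeline
    would use an interval index instead.
    """
    hits = {}
    for variant_id, chrom, pos in variants:
        hits[variant_id] = sorted(
            label
            for start, end, label in by_chrom.get(chrom, [])
            if start <= pos < end
        )
    return hits


# Hypothetical toy data: two enhancers and one transcription factor binding site.
annotations = [
    ("chr1", 100, 200, "enhancer:brain"),
    ("chr1", 140, 160, "TFBS:CTCF"),
    ("chr2", 50, 60, "enhancer:liver"),
]
variants = [("rs1", "chr1", 150), ("rs2", "chr1", 500), ("rs3", "chr2", 10)]

result = annotate_variants(variants, index_annotations(annotations))
for variant_id, labels in result.items():
    print(variant_id, labels)
```

In this toy data only rs1 falls inside any annotated interval; the other two variants receive empty lists.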

HiPR: High-throughput probabilistic RNA structure inference

HiPR is a novel method for RNA structure prediction at single-nucleotide resolution that combines high-throughput structure probing data (DMS-seq, DMS-MaPseq) with a novel probabilistic folding algorithm. On validation data spanning a variety of RNA classes, HiPR often increases accuracy for predicting RNA structures, giving researchers new tools to study RNA structure. This work is described in Kuksa et al. (Comput Struct Biotechnol J., 2020). The webserver and complete instructions can be found here or here.

VCPA: genomic variant calling pipeline and data management tool for Alzheimer’s Disease Sequencing Project

VCPA is the official SNP/Indel Variant Calling Pipeline and data management tool used for the analysis of whole genome and exome sequencing (WGS/WES) data for the Alzheimer’s Disease Sequencing Project. VCPA consists of two independent but linkable components: a pipeline and a tracking database. The pipeline, implemented using the Workflow Description Language and optimized for the Amazon Elastic Compute Cloud environment, covers all steps from aligning raw sequence reads to variant calling using GATK. The tracking database allows users to view job running status in real time and visualize >100 quality metrics per genome. VCPA is functionally equivalent to the CCDG/TOPMed pipeline. Users can use the pipeline and the dockerized database to process large WGS/WES datasets on the Amazon cloud with minimal configuration. This work is described in Leung et al. (Bioinformatics, 2019). VCPA is freely available here.

SPAR: small RNA-seq portal for analysis of sequencing experiments

Small RNA-seq Portal for Analysis of sequencing expeRiments (SPAR) is a user-friendly web server for interactive processing, analysis, annotation and visualization of small RNA sequencing data. SPAR supports sequencing data generated from various experimental protocols, including smRNA-seq, short total RNA sequencing, microRNA-seq, and single-cell small RNA-seq. Additionally, SPAR includes publicly available reference sncRNA datasets from our DASHR database and from ENCODE across 185 human tissues and cell types to produce highly informative small RNA annotations across all major small RNA types, along with other features such as co-localization with various genomic features, precursor transcript cleavage patterns, and conservation. SPAR allows the user to compare the input experiment against reference ENCODE/DASHR datasets. SPAR currently supports analyses of human (hg19, hg38) and mouse (mm10) sequencing data. This work is described in Kuksa et al. (Nucleic Acids Research Web Server Issue, 2018). SPAR is freely available here. If you prefer to run SPAR at your own site, please download the stand-alone, offline version here.

DASHR2 - DASHR 2.0: integrated database of human small non-coding RNA genes and mature products

The DASHR v2.0 database is the first to integrate human sncRNA gene and mature product profiles obtained from multiple RNA-seq protocols. Altogether, DASHR integrates sncRNA annotations for 185 tissues/cell types and >800 curated experiments from ENCODE and GEO/SRA across multiple RNA-seq protocols, for both the GRCh38/hg38 and GRCh37/hg19 assemblies. Moreover, DASHR is the first to contain both known and novel, previously unannotated sncRNA loci identified by unsupervised segmentation (13 times more loci, with 1 678 800 in total). Additionally, DASHR v2.0 adds >3 200 000 annotations for non-small RNA genes and other genomic features (long non-coding RNAs, mRNAs, promoters, repeats). This work is described in Kuksa et al. (Bioinformatics, 2018). The DASHR database and complete instructions can be found here.

DASHR - Database of small human noncoding RNA

The DASHR database provides information about small non-coding RNAs (sncRNAs) and their expression in different human tissues and cell types. The content of this database derives from curation, annotation, and computational analysis of small RNA sequencing datasets from multiple sources. Currently the database contains information about more than 46,000 sncRNAs in 42 normal human tissues and cell types from over 30 independent studies. This work is described in Leung et al. (Nucleic Acids Research Database Issue, 2016). The DASHR database and complete instructions can be found here. [Source code: DASHR]

CoRAL - Classification of RNAs by Analysis of Length

CoRAL is a machine learning package that predicts the precursor class of small non-coding RNAs present in a high-throughput RNA-sequencing dataset. In addition to classification, it also reports which features are most important for discriminating between different populations of small non-coding RNAs. This work is described in Leung et al. (Nucleic Acids Research, 2013). Complete instructions and documentation can be found here. [Source code: CoRAL]

HAMR - High-throughput Annotation of Modified Ribonucleotides

HAMR (High-throughput Annotation of Modified Ribonucleotides) is a web application for detecting and classifying modified nucleotides in RNA-seq data. HAMR scans RNA-sequencing data for sites showing potential signatures of nucleotide modification. Users can supply particular genomic regions of interest (in BED file format), and HAMR outputs a table listing the sites whose nucleotide patterns deviate from expectation at a statistically significant rate. This work is described in Ryvkin et al. (RNA, 2013). The webserver and complete instructions can be found here. [Source code: HAMR]
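To give a sense of what "deviating from expectation at a statistically significant rate" can mean, here is a minimal sketch of one possible approach: a one-sided binomial test of the mismatch count at each site against an assumed sequencing error rate, restricted to BED-style regions of interest. This is an illustration only and should not be read as HAMR's actual statistical model; the function names, the error rate of 0.01 and the threshold of 0.05 are all assumptions.

```python
from math import comb


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def in_regions(chrom, pos, regions):
    """True if a 0-based position lies in any half-open BED interval."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def flag_sites(sites, regions, error_rate=0.01, alpha=0.05):
    """Flag sites whose mismatch count exceeds what sequencing error explains.

    Each site is (chrom, pos, depth, mismatches). The error rate and the
    significance threshold are illustrative assumptions, and no multiple
    testing correction is applied in this sketch.
    """
    flagged = []
    for chrom, pos, depth, mismatches in sites:
        if not in_regions(chrom, pos, regions):
            continue
        p_value = binom_tail(mismatches, depth, error_rate)
        if p_value < alpha:
            flagged.append((chrom, pos, p_value))
    return flagged


# Hypothetical toy data: one region of interest and three candidate sites.
regions = [("chr1", 100, 200)]
sites = [
    ("chr1", 120, 100, 30),  # many mismatches, inside the region
    ("chr1", 130, 100, 1),   # consistent with sequencing error
    ("chr2", 5, 100, 50),    # outside the region of interest
]

flagged = flag_sites(sites, regions)
for chrom, pos, p_value in flagged:
    print(chrom, pos, f"{p_value:.3g}")
```

A real analysis would also need to correct for multiple testing across sites and to distinguish modifications from genetic variants, which this sketch does not attempt.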

Contact Information

3700 Hamilton Walk
D101 Richards Medical Research Laboratories
Perelman School of Medicine
University of Pennsylvania
Philadelphia PA 19104