Dr. Culhane is a research scientist in Biostatistics and Computational Biology at Dana-Farber Cancer Institute and at the Harvard T.H. Chan School of Public Health, where she develops and applies multivariate statistics and machine learning to cancer genomics and genetics data. She has a PhD from the University of Manchester, UK, and has published over 50 peer-reviewed research articles. She is an R/Bioconductor developer, has written Bioconductor packages for biclustering and exploratory data analysis of big data in genomics, and is a member of the technical advisory board to Bioconductor. Bioconductor is an open-source, open-development software project, primarily in R, with over 1,200 packages for the analysis of genomics data. She is a founding member of the Boston R/Bioconductor for Genomics Meetup. She taught Bio503, Introduction to programming and statistical modeling in R, at the Harvard T.H. Chan School of Public Health (2008-2016).
Supervised learning is among the most powerful tools in data science, but it requires a training dataset in which the class of each input is known a priori. For example, a classification algorithm learns the identity of animals by training on a dataset of images that are labeled with the species of each animal. Unsupervised learning is applied when data are unlabeled, the classes are unknown, or one seeks to discover new groups or features that best characterize the data. I will provide an overview of unsupervised learning algorithms, including dimension reduction and matrix factorization approaches that learn low-dimensional mathematical representations from high-dimensional data. There are numerous computational techniques within the class of matrix factorization, each of which provides a unique interpretation of the processes in high-dimensional data. I will describe and do my best to demystify matrix factorization approaches, including principal component analysis, correspondence analysis and non-negative matrix factorization, in addition to newer approaches including t-SNE and autoencoders. Extensions to these approaches can be applied to simultaneously learn the structure and features in multiple datasets. Methods such as canonical correlation analysis and multiple factor analysis extract the linear relationships that best explain the correlated structure across datasets. I will describe how we apply these approaches to tens of thousands of tumors to advance precision medicine in oncology. My talk will draw on our recent review articles: Meng & Zeleznik et al., 2016 (Briefings in Bioinformatics, 17:628, https://doi.org/10.1093/bib/bbv108) and Stein-O'Brien et al., 2017 (bioRxiv 196915; https://doi.org/10.1101/196915).