Title: Identifying molecular subtypes of breast cancer using single cell RNA-seq data integration and random forest classification

Authors: Peter Jerome Ishmael V. Paulino; Muhammad Sufyan

Addresses: Department of Biology, Vanguard University, Costa Mesa, CA, USA; Department of Bioinformatics and Biotechnology, Government College University, Faisalabad, Pakistan

Abstract: Single-cell RNA sequencing (scRNA-seq) has been invaluable in advancing our understanding of many cancers, including breast cancer. In this study, scRNA-seq data from multiple independent breast cancer studies were integrated into a single-cell gene expression atlas encompassing over 60,000 cells. Dimensionality reduction and unsupervised clustering methods, including t-SNE and UMAP, together with random forest classification, were applied to identify molecular subtypes and classify new tumour samples. Integrated analysis identified six major breast cancer subtypes consistent with known luminal, HER2-enriched, and basal-like classifications. Random forest classification using a panel of discriminative genes achieved over 90% accuracy in classifying held-out tumour samples into known subtypes. Further substructure within subtypes revealed novel candidate cell states. The study also demonstrated the feasibility and advantages of integrating multiple scRNA-seq datasets to generate a comprehensive breast cancer atlas. The results provide insights into breast cancer biology with potential applications in precision oncology.
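
The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the expression matrix is synthetic, the three subtype labels and the marker-gene layout are hypothetical, and the split is by cell rather than by tumour sample. It assumes NumPy and scikit-learn are installed and only shows the general shape of training a random forest on log-transformed counts, scoring it on held-out data, and ranking genes by importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 subtypes, 100 cells each, 50 genes.
subtypes = ["luminal", "HER2-enriched", "basal-like"]
n_cells_per_subtype, n_genes = 100, 50

X_parts, y_parts = [], []
for i, name in enumerate(subtypes):
    # Baseline Poisson rate for every gene, with 5 assumed "marker"
    # genes per subtype given a higher rate (genes 0-4, 5-9, 10-14).
    rates = np.full(n_genes, 2.0)
    rates[i * 5:(i + 1) * 5] = 10.0
    X_parts.append(rng.poisson(rates, size=(n_cells_per_subtype, n_genes)))
    y_parts.extend([name] * n_cells_per_subtype)

X = np.log1p(np.vstack(X_parts))  # log(1 + count) transform
y = np.array(y_parts)

# Hold out 25% of cells, stratified so each subtype is represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))

# Rank genes by impurity-based importance; the highest-ranked genes
# form a candidate discriminative panel.
top_genes = np.argsort(clf.feature_importances_)[::-1][:15]

print(f"held-out accuracy: {accuracy:.3f}")
print("top-ranked gene indices:", sorted(top_genes.tolist()))
```

On real data, cells from the same tumour are correlated, so holding out whole tumours or whole studies (for example with scikit-learn's `GroupShuffleSplit`) would give a more honest estimate than the per-cell split used here.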

Keywords: breast cancer; single-cell RNA-seq; scRNA-seq; data integration; molecular subtypes; random forest; tumour heterogeneity.

DOI: 10.1504/IJBRA.2024.141751

International Journal of Bioinformatics Research and Applications, 2024 Vol.20 No.5, pp.468 - 494

Received: 10 Jan 2024
Accepted: 01 Mar 2024

Published online: 01 Oct 2024
