Decoding the grammar of DNA using Natural Language Processing
The Twelfth International Conference on Knowledge Capture workshop
Description:
This workshop is jointly hosted by the Sonika Tyagi Lab and Australian BioCommons.
Abstract: DNA is the blueprint defining all living organisms. Therefore, understanding the nature and function of DNA is at the core of all biological studies. Rapid advances in DNA sequencing and computing technologies over the past few decades have resulted in large quantities of DNA sequence data generated across diverse experiments, with growth exceeding that of all major social media platforms and astronomy data combined [1]. However, biological data is both complex and high-dimensional, and is difficult to analyse with conventional methods.
Machine learning is naturally well suited to problems involving large and complex data [2]. In particular, applying Natural Language Processing (NLP) to the genome is intuitive, since DNA can be viewed as a natural language. However, Genome-NLP poses unique challenges compared with conventional natural languages, including the difficulty of word segmentation and corpus comparison.
To tackle these challenges, we developed the first automated and open-source genomeNLP workflow that enables efficient and accurate knowledge extraction from biological data [1], automating and abstracting preprocessing steps unique to biology. This lowers the barrier to knowledge extraction for both machine learning practitioners and computational biologists. In this tutorial, we will demonstrate how our workflow can be used to address the above challenges, with implications for fields such as personalised medicine [3-4].
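The word segmentation challenge mentioned in the abstract can be illustrated with a minimal sketch. DNA has no spaces or punctuation, so one common approach is to slide a fixed-size window along the sequence and treat each k-mer as a "word". The function name `kmer_tokenise`, its parameters, and the example sequence are hypothetical and chosen for illustration only; this is not the genomeNLP workflow's own API, and the workflow may segment sequences differently.

```python
from typing import List


def kmer_tokenise(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mers (hypothetical helper, for illustration).

    A stride of 1 gives overlapping k-mers; a stride equal to k gives
    non-overlapping k-mers. Trailing bases that do not fill a whole
    window are dropped.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


# Overlapping 3-mers from a toy sequence (assumed example input).
tokens = kmer_tokenise("ATGCGTAC", k=3)
print(" ".join(tokens))
```

The choice of `k` and `stride` changes both the vocabulary size and the number of tokens per sequence, which is one reason segmentation is less straightforward for genomes than for text with natural word boundaries.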
[1] [preprint] Chen, T., Tyagi, N., Chauhan, S., Peleg, A.Y. and Tyagi, S., 2023. genomicBERT and data-free deep-learning model evaluation. bioRxiv, pp.2023-05. https://doi.org/10.1101/2023.05.31.542682 (This article is a preprint and has not been certified by peer review)
[2] Chen, T., Tyagi, S. Integrative computational epigenomics to build data-driven gene regulation hypotheses, GigaScience, Volume 9, Issue 6, June 2020, giaa064, https://doi.org/10.1093/gigascience/giaa064
[3] Chen, T., Philip, M., Lê Cao, K-A., Tyagi, S. A multi-modal data harmonisation approach for discovery of COVID-19 drug targets, Briefings in Bioinformatics, Volume 22, Issue 6, November 2021, bbab185, https://doi.org/10.1093/bib/bbab185
[4] Mu, A., Klare, W.P., Baines, S.L. et al. Integrative omics identifies conserved and pathogen-specific responses of sepsis-causing bacteria. Nat Commun 14, 1530 (2023). https://doi.org/10.1038/s41467-023-37200-w
Start: Tuesday, 05 December 2023 @ 09:00
End: Thursday, 07 December 2023 @ 17:00
Duration: 4 hours
Timezone: Eastern Time (US & Canada)
Venue: Pensacola
City: Pensacola
Country: United States of America
Prerequisites:
- [required] CLI (e.g. bash shell) usage
- [optional] Connecting and working on a remote server (e.g. ssh)
- [optional] Basic knowledge of machine learning
- [optional] Machine learning dashboards (e.g. tensorboard, wandb)
- [optional] Package/environment managers (e.g. conda, mamba)
Learning objectives:
- Describe the unique challenges in biological NLP compared to conventional NLP
- Understand common representations of biological data
- Understand common biological data preprocessing steps
- Investigate biological sequence data for use in machine learning
- Perform a hyperparameter sweep, training and cross-validation
- Identify what the model is focusing on
- Compare trained model performances to each other
Eligibility:
- By invitation
Organiser: The Twelfth International Conference on Knowledge Capture
Contact: tyrone.chen@monash.edu, sonika.tyagi@rmit.edu.au, christina.hall@biocommons.org.au, melissa.burke@biocommons.org.au
Host institution: ACM Special Interest Group on Artificial Intelligence, Florida Institute for Human & Machine Cognition (IHMC)
Keywords: Deep learning, Machine learning, Natural language processing, Bioinformatics, Computational Biology
Fields: Bioinformatics, Natural Language Processing, Bioinformatics Software
Target audience:
- Bioinformaticians
- Life scientists
- Data scientists
Capacity: 40
Event type:
- Workshop
- Conference
Technical requirements: Laptop with internet connection
Cost Basis: Cost incurred by all