National Preprint Platform
Comprehensive evaluation of gene sequence encoding methods in deep learning

Background: The prediction of genomic structure has become a hot spot in genome research. At present, the prediction method based on deep learning is more effective and accurate than other machine learning algorithms. Since gene sequence data cannot directly enter the deep learning model, the original data need to be encoded and converted into numerical features before model prediction. As a result, different encoding methods may affect final accuracy.

Methods: In order to explore the performance of different encoding methods, we compared ten strategies in six deep learning models. We also compared the performance of all methods on independent datasets and models from our laboratory. For all models, we used their original parameters.

Results: Dummy encoding, hash encoding, and one-hot encoding perform best in various models. In addition, dummy encoding and one-hot encoding are the best for processing RNA data, while hash encoding is superior to other methods for processing promoter data. Also, when processing part- or full-sequence data, the performance of dummy encoding, hash encoding, and one-hot encoding is similar. Besides that, in sisRNA datasets and prediction models of Arabidopsis and rice, dummy encoding and one-hot encoding achieve higher prediction accuracy.

Conclusions: We conclude that the best encoding method varies when the data set changes. One-hot encoding, dummy encoding, and hash encoding are the three best methods for six models. This study fills the gap on sequence encoding methods in deep learning and can provide a valuable reference for the community.
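For readers unfamiliar with the three encodings the abstract singles out, the following is a minimal illustrative sketch, not the authors' implementation. The function names, the `ACGT` alphabet, the choice of the last symbol as the dummy-coding reference level, and the k-mer hashing scheme (`k`, `n_buckets`, MD5) are all assumptions made for illustration; the abstract does not specify how the study defines or parameterizes any of these methods.

```python
import hashlib

import numpy as np

ALPHABET = "ACGT"  # assumed alphabet; RNA data would use "ACGU"


def one_hot_encode(seq, alphabet=ALPHABET):
    """Return a (len(seq), len(alphabet)) matrix with a single 1 per known base.

    Symbols outside the alphabet (e.g. N) are left as all-zero rows.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    out = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            out[pos, index[base]] = 1.0
    return out


def dummy_encode(seq, alphabet=ALPHABET):
    """Dummy coding: one-hot with the last symbol's column dropped.

    The dropped symbol acts as the reference level and is encoded as all
    zeros, so in this sketch it is indistinguishable from an unknown symbol.
    """
    return one_hot_encode(seq, alphabet)[:, :-1]


def hash_encode(seq, k=3, n_buckets=16):
    """Hashing trick (assumed variant): count overlapping k-mers per bucket.

    MD5 is used only to get a hash that is stable across Python sessions.
    """
    out = np.zeros(n_buckets, dtype=np.float32)
    s = seq.upper()
    for i in range(len(s) - k + 1):
        digest = hashlib.md5(s[i:i + k].encode("ascii")).digest()
        out[int.from_bytes(digest[:4], "big") % n_buckets] += 1.0
    return out


if __name__ == "__main__":
    example = "ACGTAC"  # hypothetical input sequence
    print(one_hot_encode(example).shape)  # one row per base, four columns
    print(dummy_encode(example).shape)    # one row per base, three columns
    print(hash_encode(example).shape)     # fixed-length vector of bucket counts
```

One-hot and dummy encoding keep positional information at the cost of a matrix whose size grows with sequence length, whereas the hashed k-mer vector has a fixed length regardless of the input, which may be one reason the methods behave differently across datasets.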

10.12074/202302.00053V1

Biological science research methods and techniques; Computing and computer technology; Molecular biology

deep learning; RNA; promoter; encoding methods

Comprehensive evaluation of gene sequence encoding methods in deep learning[EB/OL]. (2023-02-09)[2025-08-02]. https://chinaxiv.org/abs/202302.00053.
