Identification of nucleosome positioning using support vector machine method based on comprehensive DNA sequence feature
Abstract: We propose a model of nucleosome DNA sequences based on Z-curve theory and the position weight matrix (PWM). The model transforms a set of nucleosome DNA sequences into three-dimensional coordinates and computes the PWM of the same set to obtain a similarity weight score. Integrating the two yields a comprehensive sequence feature model, named CSeqFM. The Euclidean distances from candidate nucleosome sequences and from linker sequences to CSeqFM serve as the feature set, which is used to train and test a support vector machine (SVM) under ten-fold cross-validation. For S. cerevisiae, the sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) of nucleosome positioning identification were 97.1%, 96.9%, 94.2% and 0.89, respectively, and the area under the receiver operating characteristic (ROC) curve (AUC) reached 0.9801. Compared with another Z-curve-based method, CSeqFM performed better on every evaluation metric. CSeqFM was also applied to nucleosome positioning in three other species, C. elegans, H. sapiens and D. melanogaster, where the AUCs were all higher than 0.90. Compared with the iNuc-STNC and iNuc-PseKNC methods, CSeqFM also showed good stability and effectiveness, which further indicates that the method is reliable and identifies nucleosome positioning well.
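The feature construction summarized in the abstract can be illustrated with a minimal sketch. The Z-curve coordinates follow the standard definition; everything else is an assumption made for illustration, since the abstract does not specify it: the log-odds PWM with a uniform background and pseudocount, the way the two parts are combined into one vector, the use of the mean vector as the model, and all function names and toy sequences. The sketch assumes equal-length, upper-case sequences over A, C, G, T and stops before the SVM step.

```python
import math

BASES = "ACGT"

def z_curve(seq):
    """Cumulative Z-curve points (x_n, y_n, z_n) for n = 1..len(seq)."""
    counts = dict.fromkeys(BASES, 0)
    points = []
    for base in seq:
        counts[base] += 1
        a, c, g, t = (counts[b] for b in BASES)
        points.append((
            (a + g) - (c + t),  # x: purine vs pyrimidine
            (a + c) - (g + t),  # y: amino vs keto
            (a + t) - (c + g),  # z: weak vs strong hydrogen bonding
        ))
    return points

def build_pwm(seqs, pseudocount=1.0):
    """Log-odds PWM (assumed form): one {base: score} dict per position."""
    total = len(seqs) + 4 * pseudocount
    pwm = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / 0.25)
            for b in BASES
        })
    return pwm

def pwm_score(seq, pwm):
    """Sum of the per-position log-odds scores of seq."""
    return sum(col[base] for base, col in zip(seq, pwm))

def features(seq, pwm):
    """Flattened Z-curve coordinates followed by the PWM score (assumed layout)."""
    vec = [v for point in z_curve(seq) for v in point]
    vec.append(pwm_score(seq, pwm))
    return vec

def centroid(vectors):
    """Component-wise mean, used here as a stand-in for the sequence model."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def distance_to_model(seq, model, pwm):
    """Euclidean distance from a sequence's feature vector to the model."""
    return math.dist(features(seq, pwm), model)

# Toy data, invented for illustration only.
nucleosomes = ["AATTGC", "AATTGG", "AATAGC"]
pwm = build_pwm(nucleosomes)
model = centroid([features(s, pwm) for s in nucleosomes])

print(distance_to_model("AATTGC", model, pwm))  # sequence from the model set
print(distance_to_model("GCGCGC", model, pwm))  # dissimilar, linker-like sequence
```

In a full pipeline, distances of this kind would form the feature set passed to an SVM under ten-fold cross-validation; that step is not shown here.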
Key words:
- sequence feature
- support vector machine
- nucleosome
- Z-curve
- position weight matrix
- Euclidean distance
Table 1. Nucleosome positioning identification results on two S. cerevisiae datasets

| Dataset | Model | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|
| S1 | CSeqFM | 97.1% | 96.9% | 94.2% | 0.89 |
| S1 | Wu's model | 88.2% | 88.2% | 88.3% | 0.77 |
| S2 | CSeqFM | 92.4% | 93.9% | 93.1% | 0.86 |
| S2 | Wu's model | 88.7% | 89.1% | 88.9% | 0.77 |

Table 2. Comparison of experimental results between CSeqFM and other methods

| Species | Method | Sn | Sp | Acc | MCC | AUC |
|---|---|---|---|---|---|---|
| C. elegans | iNuc-STNC | 91.6% | 86.7% | 88.6% | 0.77 | − |
| C. elegans | iNuc-PseKNC | 90.3% | 83.6% | 86.9% | 0.74 | 0.9350 |
| C. elegans | CSeqFM | 81.4% | 86.8% | 83.9% | 0.68 | 0.9052 |
| H. sapiens | iNuc-STNC | 89.3% | 85.9% | 87.6% | 0.75 | − |
| H. sapiens | iNuc-PseKNC | 87.9% | 84.7% | 86.3% | 0.73 | 0.9250 |
| H. sapiens | CSeqFM | 90.1% | 80.5% | 84.6% | 0.70 | 0.9087 |
| D. melanogaster | iNuc-STNC | 79.8% | 83.6% | 81.7% | 0.63 | − |
| D. melanogaster | iNuc-PseKNC | 78.3% | 81.7% | 80.0% | 0.60 | 0.8740 |
| D. melanogaster | CSeqFM | 79.9% | 92.3% | 84.8% | 0.71 | 0.9019 |
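The Sn, Sp, Acc and MCC columns are not defined on this page. The sketch below uses the standard confusion-matrix definitions, which are assumed to match the ones behind the tables. The function name and the counts in the example are hypothetical and are not taken from the paper.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: nucleosome sequences predicted correctly/incorrectly
    tn/fp: linker sequences predicted correctly/incorrectly
    """
    sn = tp / (tp + fn)                    # sensitivity
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)  # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when a marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print({name: round(value, 3) for name, value in metrics.items()})
```

Under ten-fold cross-validation, these counts would typically be accumulated over the ten held-out folds before the metrics are computed.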