Identification of nucleosome positioning using support vector machine method based on comprehensive DNA sequence feature

Ying CUI, Zelong XU, Jianzhong LI

doi: 10.7507/1001-5515.201911064

Corresponding author: LI Jianzhong, Email: lijzh@hit.edu.cn

Publication history
- Received: 2019-11-23
- Revised: 2020-02-22
- Published: 2020-03-17

Abstract

Based on Z-curve theory and the position weight matrix (PWM), this article proposes a model for nucleosome DNA sequences. The nucleosome DNA sequence set is transformed into three-dimensional coordinates, and the PWM of the same set is computed to obtain a similarity weight score. Integrating the two yields a comprehensive sequence feature model (CSeqFM). The Euclidean distances from candidate nucleosome sequences and from linker sequences to the CSeqFM model are then computed and used as the feature set, which is fed to a support vector machine (SVM) for training and testing, with performance evaluated by ten-fold cross-validation. For S. cerevisiae, the sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) of nucleosome positioning identification were 97.1%, 96.9%, 94.2% and 0.89, respectively, and the area under the receiver operating characteristic (ROC) curve (AUC) reached 0.9801. Compared with another Z-curve-based method, CSeqFM performed better on every evaluation metric. CSeqFM was also applied to nucleosome positioning in three other species, C. elegans, H. sapiens and D. melanogaster, and the AUC exceeded 0.90 for all three. Compared with iNuc-STNC and iNuc-PseKNC, CSeqFM also showed comparatively good stability and effectiveness, further indicating that the method is reliable and identifies nucleosome positioning well.

Keywords:
- sequence feature
- support vector machine
- nucleosome
- Z-curve
- position weight matrix
- Euclidean distance
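The modelling steps described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the abstract does not specify how the Z-curve coordinates and the PWM score are combined into CSeqFM, so the sketch assumes the standard cumulative Z-curve, a log-odds PWM with a uniform background and a pseudocount, and a reference curve taken as the mean over the training sequences. All function names and the toy sequences are hypothetical.

```python
import math

BASES = "ACGT"

def z_curve(seq):
    """Cumulative Z-curve points (x_n, y_n, z_n) for n = 1..len(seq).

    Assumes seq contains only A, C, G, T (any other symbol raises KeyError).
    """
    counts = {b: 0 for b in BASES}
    points = []
    for base in seq.upper():
        counts[base] += 1
        a, c, g, t = (counts[b] for b in BASES)
        points.append((
            (a + g) - (c + t),  # purine vs pyrimidine
            (a + c) - (g + t),  # amino vs keto
            (a + t) - (c + g),  # weak vs strong hydrogen bonding
        ))
    return points

def position_weight_matrix(seqs, pseudocount=1.0, background=0.25):
    """Log2-odds PWM for equal-length sequences: one dict per position."""
    total = len(seqs) + 4 * pseudocount
    pwm = []
    for i in range(len(seqs[0])):
        column = [s[i].upper() for s in seqs]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def pwm_score(seq, pwm):
    """Sum of per-position log-odds weights for seq."""
    return sum(pwm[i][base] for i, base in enumerate(seq.upper()))

def mean_curve(seqs):
    """Position-wise mean of the Z-curves of equal-length sequences."""
    curves = [z_curve(s) for s in seqs]
    n = len(curves)
    return [tuple(sum(c[i][k] for c in curves) / n for k in range(3))
            for i in range(len(curves[0]))]

def curve_distance(seq, reference):
    """Euclidean distance between the Z-curve of seq and a reference curve."""
    return math.sqrt(sum((p[k] - r[k]) ** 2
                         for p, r in zip(z_curve(seq), reference)
                         for k in range(3)))

def features(seq, reference, pwm):
    """Assumed feature pair for one candidate: (distance, PWM score)."""
    return (curve_distance(seq, reference), pwm_score(seq, pwm))

# Toy, made-up sequences purely for illustration; real datasets use far longer ones.
nucleosome_set = ["ACGTAC", "ACGTTC", "ACGAAC"]
reference = mean_curve(nucleosome_set)
pwm = position_weight_matrix(nucleosome_set)
print(features("ACGTAC", reference, pwm))
```

In a full pipeline, feature vectors of this kind for nucleosome and linker sequences would be passed to an SVM classifier. The choice of kernel and parameters is not stated in the abstract.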

Figure 1. Four performance measures, AUC distribution and ROC curves for S. cerevisiae dataset S1

Figure 2. Experimental results for C. elegans, H. sapiens and D. melanogaster
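The AUC values reported with the ROC curves can be understood through the rank interpretation of AUC: the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper. The function name and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5.

    labels: iterable of 1 (nucleosome) or 0 (linker); scores: classifier outputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores: one positive (0.3) is ranked below one negative (0.8).
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a sketch. A sort-based rank computation gives the same value more efficiently on large datasets.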

Table 1. Nucleosome positioning identification results on two S. cerevisiae datasets

Dataset	Model	Sn	Sp	Acc	MCC
S1	CSeqFM	97.1%	96.9%	94.2%	0.89
S1	Wu's model	88.2%	88.2%	88.3%	0.77
S2	CSeqFM	92.4%	93.9%	93.1%	0.86
S2	Wu's model	88.7%	89.1%	88.9%	0.77
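The four columns Sn, Sp, Acc and MCC follow from the confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions, which this page does not spell out, so treating them as the paper's exact formulas is an assumption. The function name and the example counts are hypothetical and are not taken from the paper's data.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    tp/fn: nucleosome sequences predicted correctly/incorrectly;
    tn/fp: linker sequences predicted correctly/incorrectly.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Made-up counts for illustration only.
print(classification_metrics(tp=90, fn=10, tn=80, fp=20))
```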

Table 2. Comparison of experimental results between CSeqFM and other methods

Species	Method	Sn	Sp	Acc	MCC	AUC
C. elegans	iNuc-STNC	91.6%	86.7%	88.6%	0.77	−
C. elegans	iNuc-PseKNC	90.3%	83.6%	86.9%	0.74	0.9350
C. elegans	CSeqFM	81.4%	86.8%	83.9%	0.68	0.9052
H. sapiens	iNuc-STNC	89.3%	85.9%	87.6%	0.75	−
H. sapiens	iNuc-PseKNC	87.9%	84.7%	86.3%	0.73	0.9250
H. sapiens	CSeqFM	90.1%	80.5%	84.6%	0.70	0.9087
D. melanogaster	iNuc-STNC	79.8%	83.6%	81.7%	0.63	−
D. melanogaster	iNuc-PseKNC	78.3%	81.7%	80.0%	0.60	0.8740
D. melanogaster	CSeqFM	79.9%	92.3%	84.8%	0.71	0.9019
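The CSeqFM results in the tables were obtained with ten-fold cross-validation. As a rough illustration of that protocol, the sketch below partitions example indices into ten folds and yields train/test index lists. Whether the authors stratified the folds by class is not stated, so the stratification here is an assumption, and the function names are hypothetical.

```python
import random

def stratified_k_folds(labels, k=10, seed=0):
    """Assign indices to k folds, spreading each class evenly (assumed stratification)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices), holding out each fold once."""
    folds = stratified_k_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds)
                       if f != held_out for i in fold)
        yield train, test

# Toy labels: 50 nucleosome (1) and 50 linker (0) examples.
labels = [1] * 50 + [0] * 50
for train, test in train_test_splits(labels):
    pass  # fit the classifier on train, evaluate on test, then average the metrics
```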

References (24)

[1]	Maskell D P, Renault L, Serrao E, et al. Structural basis for retroviral integration into nucleosomes. Nature, 2015, 523(7560): 366-369. doi: 10.1038/nature14495
[2]	Taberlay P C, Statham A L, Kelly T K, et al. Reconfiguration of nucleosome-depleted regions at distal regulatory elements accompanies DNA methylation of enhancers and insulators in cancer. Genome Res, 2014, 24(9): 1421. doi: 10.1101/gr.163485.113
[3]	Cole H A, Cui F, Ocampo J, et al. Novel nucleosomal particles containing core histones and linker DNA but no histone H1. Nucleic Acids Res, 2016, 44(2): 573-581. doi: 10.1093/nar/gkv943
[4]	Buckwalter J M, Norouzi D, Harutyunyan A, et al. Regulation of chromatin folding by conformational variations of nucleosome linker DNA. Nucleic Acids Res, 2017, 45(16): 9372. doi: 10.1093/nar/gkx562
[5]	Murugan R. Theory of site-specific DNA-protein interactions in the presence of nucleosome roadblocks. Biophys J, 2018, 114(11): 2516. doi: 10.1016/j.bpj.2018.04.039
[6]	Nocetti N, Whitehouse I. Nucleosome repositioning underlies dynamic gene expression. Genes Dev, 2016, 30(6): 660. doi: 10.1101/gad.274910.115
[7]	Bai L, Morozov A V. Gene regulation by nucleosome positioning. Trends Genet, 2010, 26(11): 476-483. doi: 10.1016/j.tig.2010.08.003
[8]	Eaton M L, Kyriaki G, Sukhyun K, et al. Conserved nucleosome positioning defines replication origins. Genes Dev, 2010, 24(8): 748-753. doi: 10.1101/gad.1913210
[9]	Hua Y, Epps J, Williams R, et al. Evidence that localized variation in primate sequence divergence arises from an influence of nucleosome placement on DNA repair. Mol Biol Evol, 2010, 27(3): 637-649. doi: 10.1093/molbev/msp253
[10]	Bevington S, Boyes J. Transcription-coupled eviction of histones H2A/H2B governs V(D)J recombination. EMBO J, 2013, 32(10): 1381-1392. doi: 10.1038/emboj.2013.42
[11]	Xing Y Q, Liu G Q, Zhao X J, et al. An analysis and prediction of nucleosome positioning based on information content. Chromosome Res, 2013, 21(1): 63-74. doi: 10.1007/s10577-013-9338-z
[12]	Lieleg C, Krietenstein N, Walker M, et al. Nucleosome positioning in yeasts: methods, maps, and mechanisms. Chromosoma, 2015, 124(2): 131-151. doi: 10.1007/s00412-014-0501-x
[13]	Zhang J, Peng W, Wang L, et al. LeNup: Learning nucleosome positioning from DNA sequences with improved convolutional neural networks. Bioinformatics, 2018, 34(10): 1705-1712. doi: 10.1093/bioinformatics/bty003
[14]	Huang X, Mehrkanoon S, Suykens J A K. Support vector machines with piecewise linear feature mapping. Neurocomputing, 2013, 117: 118-127. doi: 10.1016/j.neucom.2013.01.023
[15]	Lee W, Tillo D, Bray N, et al. A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet, 2007, 39(10): 1235-1244.
[16]	Tahir M, Hayat M. iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou’s PseAAC. Mol Biosyst, 2016, 12(8): 2587-2593. doi: 10.1039/C6MB00221H
[17]	Chen W, Feng P, Ding H, et al. Using deformation energy to analyze nucleosome positioning in genomes. Genomics, 2016, 107: 69-75. doi: 10.1016/j.ygeno.2015.12.005
[18]	Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 2012, 28(23): 3150-3152. doi: 10.1093/bioinformatics/bts565
[19]	Guo S H, Deng E Z, Xu L Q, et al. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics, 2014, 30(11): 1522-1529. doi: 10.1093/bioinformatics/btu083
[20]	Zhang R, Zhang C T. A brief review: The Z-curve theory and its application in genome analysis. Curr Genomics, 2014, 15(2): 78-94. doi: 10.2174/1389202915999140328162433
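[21]	Cui Y. Identification of transcription factor binding sites based on Z-curve theory (in Chinese). Changchun: Northeast Normal University, 2008.
[22]	Sui P P, Xing X D, Wang H, et al. Nucleosome identification and functional analysis based on position weight matrix (in Chinese). Chinese Journal of Bioinformatics, 2016, 14(1): 1-6. doi: 10.3969/j.issn.1672-5565.2016.01.01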
[23]	Alencar J, Bonates T, Lavor C, et al. An algorithm for realizing Euclidean distance matrices. Electron Notes Discrete Math, 2015, 50: 397-402. doi: 10.1016/j.endm.2015.07.066
[24]	Wu X, Liu H, Liu H, et al. Z curve theory-based analysis of the dynamic nature of nucleosome positioning in Saccharomyces cerevisiae. Gene, 2013, 530(1): 8-18. doi: 10.1016/j.gene.2013.08.018