National Preprint Platform

Non-negative Principal Component Analysis of Histogram Symbolic Data Based on Quantile Functions


Existing principal component analysis (PCA) methods for symbolic data mostly replace each symbolic observation with a few representative values, discarding part of its information. To address this shortcoming, a PCA method for histogram symbolic data is proposed. Histogram data describe symbolic data as probability distributions and therefore represent them more completely and accurately. Each histogram is represented by its quantile function, and the Wasserstein distance, which fully accounts for the probability distribution of the histogram data, is used to compute the covariance matrix of the histogram variables, on which PCA is then performed. However, the eigenvectors corresponding to the largest eigenvalues obtained in this way are not necessarily non-negative, so a principal component expressed as a combination of quantile functions is not guaranteed to be a quantile function itself. To resolve this, the method draws on the DSD (distribution and symmetric distribution) regression model of Dias et al. [1]: a corresponding symmetric distribution variable is defined for each histogram variable, and all principal components with non-negative coefficients are obtained from the generalized covariance matrix under the Wasserstein distance. Experiments demonstrate the effectiveness of the algorithm. The method also overcomes the drawback that the histogram PCA coefficients in [2] may be negative, and it retains more of the information in the original data.
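The two ingredients named in the abstract, the quantile-function representation of a histogram and the Wasserstein distance computed from it, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`quantile_function`, `wasserstein2`, `combine`), the assumption that values are uniformly distributed within each bin, and the midpoint-rule approximation of the integral are all choices made here for illustration. The last part shows why non-negative coefficients matter: a non-negative combination of quantile functions stays non-decreasing, while a negative coefficient can break monotonicity.

```python
import numpy as np

def quantile_function(edges, probs, t):
    """Quantile function of a histogram at levels t in [0, 1].

    Assumes values are uniform within each bin and every bin has
    positive probability, so the function is piecewise linear.
    """
    edges = np.asarray(edges, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(probs)))
    cum /= cum[-1]  # normalise in case probs do not sum exactly to 1
    return np.interp(t, cum, edges)

def wasserstein2(h1, h2, n_grid=10_000):
    """L2 Wasserstein distance sqrt(integral of (Q1 - Q2)^2 over [0, 1]),
    approximated with the midpoint rule on n_grid levels."""
    t = (np.arange(n_grid) + 0.5) / n_grid
    q1 = quantile_function(*h1, t)
    q2 = quantile_function(*h2, t)
    return np.sqrt(np.mean((q1 - q2) ** 2))

def combine(weights, hists, t):
    """Linear combination of the quantile functions of several histograms."""
    return sum(w * quantile_function(*h, t) for w, h in zip(weights, hists))

def is_nondecreasing(q, tol=1e-12):
    """A valid quantile function must be non-decreasing in t."""
    return bool(np.all(np.diff(q) >= -tol))

# Hypothetical example histograms, given as (bin edges, bin probabilities).
h_a = ([0.0, 0.5, 1.0], [0.5, 0.5])   # uniform on [0, 1]
h_b = ([2.0, 2.5, 3.0], [0.5, 0.5])   # the same shape shifted by 2
h_c = ([0.0, 1.0, 2.0], [0.5, 0.5])   # uniform on [0, 2]

t = np.linspace(0.0, 1.0, 101)
d_shift = wasserstein2(h_a, h_b)
d_scale = wasserstein2(h_a, h_c)
ok_nonneg = is_nondecreasing(combine([0.5, 0.5], [h_a, h_c], t))
ok_signed = is_nondecreasing(combine([1.0, -1.0], [h_a, h_c], t))
```

For a pure shift the distance equals the size of the shift, which gives a quick sanity check on `wasserstein2`. The combination with weights `[1.0, -1.0]` is decreasing in `t`, so it is not a quantile function, which is the situation the non-negative coefficients are meant to rule out.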

Sun Huiqiang, Chen Xiuhong, Li Zhuting

10.12074/201805.00473V1

Mathematics; Computing Technology and Computer Technology

principal component analysis; histogram data; quantile function; Wasserstein distance; covariance matrix

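Sun Huiqiang, Chen Xiuhong, Li Zhuting. Non-negative Principal Component Analysis of Histogram Symbolic Data Based on Quantile Functions [EB/OL]. (2018-05-24) [2025-08-24]. https://chinaxiv.org/abs/201805.00473.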
