Danyang Huang, Yilin Luo, Yingqiu Zhu. Clustering Algorithm Based on Integrating Distribution Features for Unstructured Big Data of Third-Party Payment Platforms[J]. Quarterly Journal of Economics and Management, 2023, 2(3): 179-208.

Clustering Algorithm Based on Integrating Distribution Features for Unstructured Big Data of Third-Party Payment Platforms

  • With the rapid development of big data technology and the popularization of third-party payment services, more and more transactions are conducted digitally and recorded in databases. The massive transaction data, which contain merchants' behavior logs, are a valuable resource for mining merchants' behavioral information. Segmenting merchants into groups according to their behavior patterns supports precise and personalized decision-making for marketing, risk control, and many other merchant-related management issues. This is of great importance to the management and revenue of third-party payment platforms, as well as to the sustainable development of the real economy. For the segmentation of individuals using transaction data, previous methods are typically based on feature engineering: a large number of transaction records are compressed into low-dimensional dense feature vectors, and clustering methods are then applied to those vectors to output a partition of all merchants. However, feature-based methods make limited use of the data and inevitably lose information, which may greatly restrict the effectiveness of the segmentation results.

    To make full use of transaction data, this paper investigates the empirical distributions of transactions for a better understanding of merchants' behaviors. Compared with low-dimensional feature vectors, the empirical distributions of transactions are much more informative. Nevertheless, conducting clustering analysis on empirical distributions is a challenging task. Traditional clustering algorithms, which are typically designed for structured data, can hardly be applied directly to empirical distributions. To address this problem, this paper proposes a novel clustering algorithm for merchant segmentation based on the multivariate distribution functions of transactions. Firstly, a Gaussian Mixture Model (GMM) is fitted to the distribution of the whole dataset. Combinations of the Gaussian components within the GMM are used to describe meaningful patterns of transaction behavior. With all transactions modelled by the GMM, the relations between each transaction and the Gaussian components are estimated simultaneously. The relations between a merchant and the Gaussian components can then be inferred by aggregating the results of that merchant's transactions. Secondly, based on the estimated GMM, the Wasserstein distance is used to measure the dissimilarities among merchants' distributions. Specifically, we apply the sliced Wasserstein distance for computational efficiency. Finally, we develop an iterative algorithm, called the K-means Clustering algorithm based on GMM and Wasserstein Distance (GWKC), to cluster all merchants according to the dissimilarities among their distributions. By fully taking the empirical distributions of transactions into consideration, our method provides a reasonable solution for the segmentation of merchants. For the hyperparameters of our method, we also provide an information criterion as a reference for real applications.
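
    The aggregation and distance steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `merchant_weights`, `wasserstein_1d`, and `sliced_wasserstein` are hypothetical, the per-transaction responsibilities are assumed to come from an already fitted GMM, simple averaging is only one possible way of aggregating them, and the sliced distance is computed here directly on raw transaction samples using a quantile approximation of the 1-D Wasserstein-1 distance.

```python
import math
import random


def merchant_weights(responsibilities, merchant_ids):
    """Average per-transaction component posteriors within each merchant.

    Assumption: responsibilities[t][k] is the posterior probability that
    transaction t belongs to Gaussian component k of a fitted GMM.
    """
    sums, counts = {}, {}
    for resp, m in zip(responsibilities, merchant_ids):
        acc = sums.setdefault(m, [0.0] * len(resp))
        for k, r in enumerate(resp):
            acc[k] += r
        counts[m] = counts.get(m, 0) + 1
    return {m: [s / counts[m] for s in acc] for m, acc in sums.items()}


def wasserstein_1d(u, v, n_quantiles=100):
    """Approximate the 1-D Wasserstein-1 distance by comparing empirical quantiles."""
    u, v = sorted(u), sorted(v)
    total = 0.0
    for i in range(n_quantiles):
        q = (i + 0.5) / n_quantiles
        total += abs(u[int(q * len(u))] - v[int(q * len(v))])
    return total / n_quantiles


def sliced_wasserstein(xs, ys, n_projections=50, seed=0):
    """Average the 1-D distance over random projection directions."""
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Random unit direction obtained by normalizing a Gaussian draw.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        px = [sum(a * b for a, b in zip(p, direction)) for p in xs]
        py = [sum(a * b for a, b in zip(p, direction)) for p in ys]
        total += wasserstein_1d(px, py)
    return total / n_projections


# Toy data (hypothetical): three transactions, two merchants, two components.
weights = merchant_weights(
    [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]],
    ["m1", "m1", "m2"],
)

# Two synthetic merchants whose 2-D transaction samples differ by a shift.
data_rng = random.Random(1)
merchant_a = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
merchant_b = [(x + 3.0, y + 4.0) for x, y in merchant_a]

d_aa = sliced_wasserstein(merchant_a, merchant_a)
d_ab = sliced_wasserstein(merchant_a, merchant_b)
```

    In this toy setting, a merchant compared with itself gives a distance of zero, while the shifted merchant gives a positive distance. In a GWKC-style procedure, pairwise dissimilarities of this kind would drive the K-means-type iterations.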

    The GWKC algorithm described above clusters merchants according to the differences in their transaction distributions. To further improve clustering performance, this paper integrates additional transaction-related covariate features to boost the GWKC algorithm. These covariate features, e.g., average transaction amount, average number of transactions, and suspected cash-out transactions, serve as supplementary information to assist and adjust the results of GWKC. The improved clustering algorithm is called GWKC With Weighted Covariates (GWKC+WCov). This version covers information from both feature vectors and empirical distributions. It allows distribution-based clustering methods to be integrated with feature engineering, incorporating highly personalized and complex features that involve expert experience and business knowledge into the clustering process. Notably, when distributions and covariates are combined to measure the differences among merchants, the weights of the different parts must be determined, i.e., the weight of the measure based on distributions and that of the measure based on feature vectors. To obtain appropriate weights, this paper proposes an adaptive approach that iteratively searches for the weights that optimize clustering performance. GWKC+WCov is thus able to integrate multiple structural features for comprehensive clustering.
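
    The idea of weighting a distribution-based dissimilarity against a covariate-based one can be sketched as follows. This is only an illustration under stated assumptions, not the GWKC+WCov procedure itself: the two distance matrices are taken as precomputed, a small k-medoids routine stands in for the paper's K-means-type iterations because it works directly on distance matrices, the adaptive search is replaced by a plain grid search, and the silhouette score is an assumed stand-in for the clustering criterion. All function names are hypothetical.

```python
def combine(dist_matrix, cov_matrix, w):
    """Convex combination: w * distribution distance + (1 - w) * covariate distance."""
    n = len(dist_matrix)
    return [[w * dist_matrix[i][j] + (1.0 - w) * cov_matrix[i][j]
             for j in range(n)] for i in range(n)]


def assign(d, medoids):
    """Assign every point to its nearest medoid (ties go to the lower index)."""
    return [min(range(len(medoids)), key=lambda c: d[i][medoids[c]])
            for i in range(len(d))]


def k_medoids(d, k, n_iter=20):
    """Tiny k-medoids with a deterministic start from the first k points."""
    medoids = list(range(k))
    for _ in range(n_iter):
        labels = assign(d, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            new_medoids.append(min(members, key=lambda m: sum(d[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(d, medoids)


def silhouette(d, labels):
    """Mean silhouette score computed from a distance matrix."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return 0.0
    n = len(d)
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(d[i][j] for j in own) / len(own)
        b = min(sum(d[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
                for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


def search_weight(dist_matrix, cov_matrix, k, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid search over w, keeping the weight with the highest silhouette score."""
    best = None
    for w in grid:
        d = combine(dist_matrix, cov_matrix, w)
        labels = k_medoids(d, k)
        score = silhouette(d, labels)
        if best is None or score > best[0]:
            best = (score, w, labels)
    return best


# Toy input (hypothetical): four merchants; the distribution distances
# separate {0, 1} from {2, 3}, while the covariate distances are uninformative.
dist_matrix = [[0, 1, 5, 5], [1, 0, 5, 5], [5, 5, 0, 1], [5, 5, 1, 0]]
cov_matrix = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
best_score, best_w, best_labels = search_weight(dist_matrix, cov_matrix, k=2)
```

    Because the covariate distances in this toy input carry no cluster structure, the search is expected to favor a large weight on the distribution-based part. With real data, both matrices would typically need to be put on comparable scales before being combined.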

    Both simulation and real data analysis show that the proposed algorithm significantly outperforms previous methods. With the structural information of distributions taken into account, GWKC performs much better than methods based on feature vectors. Moreover, visualizing the results of GWKC intuitively illustrates the behavior patterns of cash-out merchants, thus providing decision support for risk detection and differentiated management on payment platforms. Among the various methods, GWKC+WCov achieves the best performance. Since it adaptively integrates multiple kinds of structural information within transactions, it is expected to be a promising solution for real applications of merchant segmentation.

    Possible directions for future work are also discussed in this paper. Firstly, the proposed methods can be extended by integrating more unstructured data, e.g., network data or text data, so that the clustering results become more informative as more data sources are incorporated. Secondly, regarding the modelling of empirical distributions, different finite mixture models can be applied. For datasets with arbitrary distributions, non-Gaussian components, e.g., Gamma components, or nonparametric estimation may also be useful. Building on the proposed methods, more flexible versions can be derived to further optimize clustering performance.