A Large-Scale Multi-Layer Academic Network Dataset for Statistics: Network Construction, Descriptive Analysis, and Applications

Gao Tianchen (高天辰); Zhang Yan (张妍); Pan Rui (潘蕊)

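Abstract

Multi-layer academic networks describe the diverse relationships among academic entities and support both the study of academic development and the prediction of future research directions. However, existing datasets usually contain only a single network, and large-scale datasets that cover several layers of academic networks at once are lacking. This paper provides a high-quality, large-scale multi-layer academic network dataset (the LMANStat dataset) comprising eight academic networks: the collaboration network, co-institution network, citation network, journal citation network, author citation network, co-citation network, author-paper network, and keyword co-occurrence network. Each layer of the constructed multi-layer network changes dynamically over time. The paper also provides node attributes, such as authors' research interests, productivity, regions, and institutions. The dataset is currently among the larger and more comprehensive academic network datasets in statistics: it covers multiple network types, spans a wide time range with time-varying structure, and carries rich node attribute information, filling a gap in existing datasets. With this dataset, the development and evolution of statistics as a discipline can be studied from multiple angles, and the dataset offers a rich basis for research on complex systems with multi-layer structure.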

Abstract: Relational structures consisting of different types of interactions among several groups of entities are now very common. As a useful tool for analyzing this type of data, multi-layer networks have gained increasing attention in recent years because they can capture the complexity of real-world systems. Multi-layer academic networks are a specific type of multi-layer network consisting of multiple layers of relationships among academic entities, such as researchers, institutions, papers, or journals. Typical examples include the collaboration network, which represents co-authorship relationships among researchers; the citation network, which represents citation relationships among papers; and the journal citation network, which represents citation relationships among journals. These networks have been used for various purposes, such as identifying research areas, evaluating research impact, predicting scientific trends, studying the diffusion of scientific knowledge, and supporting science policy and decision-making. Overall, multi-layer academic networks provide a powerful tool for understanding and analyzing the complex relationships that underlie academic communities and their impact on the production and dissemination of scientific knowledge.
In this work, we collect data from the Web of Science (www.webofscience.com) on papers published between 1981 and 2021 in 42 statistical journals. Our LMANStat dataset includes basic information on 97,436 papers: title, abstract, keywords, publisher, publication date, volume and pages, document type, citation counts, author information (name, ORCID, address, region, and institution), and reference lists. Based on this information, we construct multi-layer academic networks comprising the collaboration network, co-institution network, citation network, co-citation network, journal citation network, author citation network, author-paper network, and keyword co-occurrence network. These networks change over time, which supports dynamic analysis. We also include rich nodal attributes of authors, such as their research interests, to make the dataset more useful. The LMANStat dataset is publicly available on GitHub and can be accessed directly at https://github.com/Gaotianchen97/LMANStat.
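As a rough illustration of how layers such as the collaboration and citation networks can be derived from paper records, the following minimal Python sketch works on a toy list of records. The field names (`id`, `authors`, `refs`) and the sample records are assumptions made for illustration; they are not the actual schema of the LMANStat files.

```python
from collections import Counter
from itertools import combinations

# Toy paper records; field names and values are hypothetical.
papers = [
    {"id": "p1", "authors": ["A", "B"], "refs": []},
    {"id": "p2", "authors": ["B", "C", "D"], "refs": ["p1"]},
    {"id": "p3", "authors": ["A", "B"], "refs": ["p1", "p2", "x9"]},
]


def build_collaboration_network(records):
    """Undirected weighted edges: number of papers each author pair co-wrote."""
    edges = Counter()
    for paper in records:
        for u, v in combinations(sorted(set(paper["authors"])), 2):
            edges[(u, v)] += 1
    return edges


def build_citation_network(records):
    """Directed edges (citing, cited), keeping only references inside the dataset."""
    known = {paper["id"] for paper in records}
    return {
        (paper["id"], ref)
        for paper in records
        for ref in paper["refs"]
        if ref in known
    }


collaboration = build_collaboration_network(papers)
citation = build_citation_network(papers)

# In-degree of a paper = number of citations it receives within the network.
in_degree = Counter(cited for _, cited in citation)
```

In this sketch, references to papers outside the collected set (such as `x9`) are dropped, so the citation layer only counts citations within the network. Whether the dataset applies the same filter is an assumption here.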
We present a comprehensive overview of our methodology, covering the complete workflow from data collection and data cleaning to the construction of the multi-layer academic networks. We then explain in detail author and paper identification, the extraction of author attributes, and network construction. Next, we validate the dataset through several scenarios for exploring and analyzing the multi-layer academic networks. To show the usability of the dataset, we also report key characteristics of the data, which agree with earlier research findings and with the consensus among statisticians. In addition, our research team has used the LMANStat dataset extensively, which further supports its usability. Because the collaboration network and the citation network are the most commonly used of the networks, we use them for verification. We also use the journal citation network, with journals as nodes, and the keyword co-occurrence network, with keywords as nodes, to validate the LMANStat dataset.
For the collaboration network, a log-log plot of the degree distribution reveals the scale-free phenomenon, consistent with earlier research on collaboration networks. The average number of authors per paper increases year by year, indicating a trend toward collaboration in statistical research. By visualizing a sub-network of the collaboration network, we identify the four authors with the highest degrees, whose innovative methods and insights have had a profound impact on the field of statistics. For the citation network, the in-degree of a paper is crucial because it is the number of times the paper has been cited within the network. Within our dataset, the average in-degree per paper is 5.31, which is broadly in line with the Impact Factor (IF) of the selected journals.
In addition, journal citation networks are often used to rank journals, and such rankings are considered an important indicator of the quality and impact of publications in a research field. We therefore validate the usability of the journal citation network through journal ranking. Calculating the PageRank centrality of each node (journal) allows us to rank journals by importance. The resulting ranking closely matches the expectations and intuitions of statisticians, which suggests that the PageRank-based approach agrees well with the perceptions of experts in the field.
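To make the PageRank-based ranking concrete, here is a minimal sketch of PageRank by power iteration on a weighted, directed journal citation network. The journal names, citation counts, damping factor of 0.85, and stopping rule are illustrative assumptions, not values or settings taken from the paper.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """PageRank by power iteration on weighted directed edges {(src, dst): weight}."""
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (src, _), w in edges.items():
        out_weight[src] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing citations is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each journal passes its rank to the journals it cites,
        # in proportion to the citation counts.
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        converged = sum(abs(new_rank[n] - rank[n]) for n in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy journal citation counts: (citing journal, cited journal) -> citations.
# Journal names and counts are made up for illustration.
journal_citations = {
    ("J1", "J2"): 30,
    ("J3", "J2"): 20,
    ("J2", "J1"): 10,
    ("J3", "J1"): 5,
}

scores = pagerank(journal_citations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy network, `J3` is cited by no other journal, so it receives only the baseline score and ranks last, while `J2`, which attracts most of the citations, ranks first.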
The multi-layer academic networks presented in this paper are all dynamic. We illustrate this with the citation network, visualizing snapshots for three time periods: 1980-2006, 1980-2010, and 1980-2020. The visualization shows that the citation network has a community structure that changes over time. One community, associated with variable selection, has grown continuously over the years, as indicated by its increasing number of papers. Because every network in the dataset is dynamic, the dataset supports the study of how these networks evolve.
In conclusion, this paper uses statistical publication data collected from the Web of Science to provide a large-scale, high-quality multi-layer academic network dataset (the LMANStat dataset). It validates the quality and usability of the constructed networks from multiple perspectives. It also discusses feasible research directions and application scenarios, including but not limited to exploring community structures within academic networks, tracing the development and evolution of research topics, investigating the mechanisms behind paper citation counts, examining the impact of international and inter-institutional collaboration, exploring the career planning and development of researchers, and establishing more diversified journal ranking systems.
