A Citation Count Prediction Model for STEM Publishing Domains

Goals

I tackle the task of citation count prediction using both existing and new features. Looking at multiple domains, I identify differences both in how well citation counts can be predicted and in the nature of the features that contribute to the prediction. For instance, the phenomenon of famous authors attracting more citations is more apparent in Biology and Medicine than in other domains. Additionally, while the popularity of a paper’s references is predictive of the paper’s success in most domains, this is clearly not the case in Engineering and Physics. The following is a model that can be used to predict citation counts five years into the future (using data from 2005 …

Table 1. Domain-specific Statistics
Domain        Affiliations   Papers (2005)   Papers (2015)   Authors per paper (2005)   Authors per paper (2015)
CS            4,851          59,116          110,506         2.43                       2.75
Biology       2,082          59,395          93,792          3.58                       4.04
Chemistry     811            26,496          50,381          3.56                       3.99
Medicine      5,524          125,113         214,854         3.52                       3.67
Engineering   2,589          43,440          77,664          3.20                       3.53
Mathematics   581            11,057          17,317          1.75                       1.90
Physics       688            25,393          42,955          4.41                       5.05

Methods & Techniques

Feature Engineering - I consider four groups of features: Authors, Institutions, Affiliations, and References Network. The first three (group 1) describe the reputation of the paper’s venue, of its authors, and of its authors’ institutions. I start by calculating the following features for each venue, author, and institution in the dataset: the sum of the citation counts of the papers published by the entity, the mean citation count over those papers, and the maximum citation count, i.e. the citation count of the entity’s most cited work. I also calculate the h-index and g-index of these entities. The h-index is defined as the largest h such that at least h papers by the entity have received at least h citations each. The g-index is defined as the largest g such that the top g papers by the entity have together received at least g² citations. Both the h-index and the g-index are easily calculable using the capabilities of the Scopus database. For each paper I aggregate the features of the entities (authors, institutions and
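
The per-entity reputation features described above can be illustrated with a minimal Python sketch. It is not the code used for the model. The function names (`h_index`, `g_index`, `entity_features`) and the input format (a plain list of citation counts for one venue, author, or institution) are assumptions made for illustration, and the g-index here is capped at the entity's number of papers, which is one common convention.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g^2 citations.

    Assumption: g is capped at the number of papers the entity has published.
    """
    ranked = sorted(citations, reverse=True)
    g = 0
    running_total = 0
    for rank, count in enumerate(ranked, start=1):
        running_total += count
        if running_total >= rank * rank:
            g = rank
    return g


def entity_features(citations):
    """Reputation features for one entity (hypothetical helper for illustration)."""
    n = len(citations)
    return {
        "sum": sum(citations),
        "mean": sum(citations) / n if n else 0.0,
        "max": max(citations, default=0),
        "h_index": h_index(citations),
        "g_index": g_index(citations),
    }
```

For an entity whose papers were cited 10, 8, 5, 4, and 3 times, `entity_features([10, 8, 5, 4, 3])` gives a sum of 30, a mean of 6.0, a max of 10, an h-index of 4, and a g-index of 5.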
