The CiteSeerX 4217 dataset

The dataset was compiled to facilitate the evaluation of metadata extraction from scholarly articles.

The dataset was compiled between September 5 and 7, 2016, using the OAI2 (OAI-PMH) interface of CiteSeerX to retrieve the metadata. The data was collected as follows:

The articles were downloaded and then filtered using PDFMiner to check whether text could be extracted from them.

Additional information about an article can be retrieved by taking the doi value from its key and using it in the identifier field of the query below (after the oai:CiteSeerX.psu: prefix):

http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.10.8012
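
The query above can also be built programmatically from a dataset key. The following is a minimal sketch, assuming that every key has the summary-URL form shown in the example record (with a doi query parameter). The function names doi_from_key and get_record_url are hypothetical helpers introduced for illustration and are not part of the dataset.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Endpoint taken from the query shown above.
OAI_ENDPOINT = "http://citeseerx.ist.psu.edu/oai2"


def doi_from_key(key: str) -> str:
    """Extract the doi value from a dataset key (assumed to be a summary URL)."""
    query = parse_qs(urlparse(key).query)
    return query["doi"][0]


def get_record_url(key: str) -> str:
    """Build the GetRecord query URL for the article identified by `key`."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": "oai_dc",
        "identifier": "oai:CiteSeerX.psu:" + doi_from_key(key),
    }
    # safe=":" keeps the colons in the identifier unescaped, as in the query above.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":")


key = "http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.8012"
print(get_record_url(key))
```

The resulting URL can then be fetched with any HTTP client. The response is XML in the oai_dc format requested by the metadataPrefix parameter.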

Dataset fields (the key is the CiteSeerX summary URL of the article, which contains its doi):

Example:

{"http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.5389": {
    "source": "http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf", 
    "author": [
        "Ting-fan Wu", 
        " Chih-Jen Lin", 
        " Ruby C. Weng"
    ], 
    "title": "Probability Estimates for Multi-class Classification by Pairwise Coupling"
}}
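
Records of this shape can be read with any JSON parser. The sketch below parses the example record from an inline string, because the name of the dataset file is not given here. It also strips the leading spaces that appear in some author names. The normalize_record helper is a hypothetical name introduced for illustration.

```python
import json

# The example record from above, inlined so the sketch is self-contained.
SAMPLE = """
{"http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.5389": {
    "source": "http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf",
    "author": ["Ting-fan Wu", " Chih-Jen Lin", " Ruby C. Weng"],
    "title": "Probability Estimates for Multi-class Classification by Pairwise Coupling"
}}
"""


def normalize_record(record: dict) -> dict:
    """Return a copy of a record with surrounding whitespace stripped."""
    return {
        "source": record["source"].strip(),
        "title": record["title"].strip(),
        "author": [name.strip() for name in record["author"]],
    }


dataset = json.loads(SAMPLE)
cleaned = {key: normalize_record(rec) for key, rec in dataset.items()}

for key, rec in cleaned.items():
    print(key)
    print("  title:  ", rec["title"])
    print("  authors:", "; ".join(rec["author"]))
```

Stripping whitespace before comparing extracted author names against the dataset avoids mismatches caused only by the leading spaces.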