A Hybrid Approach for Scholarly Information Extraction
Abstract
Metadata extraction from documents forms an essential part of web or desktop search systems.
Similarly, digital libraries that index scholarly literature need to find and extract the title, the list of authors, and other publication-related information from an article.
We present a hybrid approach to metadata extraction that combines classification and clustering to extract the desired information without the need for a conventional labeled training dataset.
An important asset of the proposed method is that the resulting clustering parameters can be reused for other problems, such as document layout analysis.
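To make the idea of combining clustering with classification concrete, the following is a minimal sketch and not the method evaluated in this work. The abstract does not fix the features, the clustering algorithm, or the classifier, so everything here is an assumption made for illustration: the `Line` record, the single `font_size` feature, the small 1-D k-means routine, and the rank-based `classify_clusters` rule that stands in for a trained classifier. Text lines are first grouped by a layout feature, and each cluster is then assigned a metadata label; the returned centroids play the role of clustering parameters that could be reused elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float  # assumed layout feature, e.g. obtained from a PDF parser


def kmeans_1d(values, k, iterations=20):
    """Small deterministic 1-D k-means; returns (assignments, centroids)."""
    distinct = sorted(set(values))
    if k < 2 or len(distinct) < k:
        raise ValueError("need k >= 2 and at least k distinct values")
    # Initialise centroids spread evenly over the sorted distinct values.
    centroids = [distinct[i * (len(distinct) - 1) // (k - 1)] for i in range(k)]
    assignments = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every value.
        assignments = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: mean of each non-empty cluster.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assignments, centroids


def classify_clusters(centroids):
    """Stand-in for a classifier (assumption): rank clusters by mean font size."""
    order = sorted(range(len(centroids)), key=lambda c: centroids[c], reverse=True)
    names = ["title", "authors", "body"]
    return {
        cluster: names[min(rank, len(names) - 1)]
        for rank, cluster in enumerate(order)
    }


def extract_metadata(lines, k=3):
    """Cluster lines by font size, then label each cluster."""
    assignments, centroids = kmeans_1d([line.font_size for line in lines], k)
    labels = classify_clusters(centroids)
    fields = {}
    for line, cluster in zip(lines, assignments):
        fields.setdefault(labels[cluster], []).append(line.text)
    return fields, centroids


# Hypothetical first page: font sizes are made up for the example.
sample = [
    Line("A Hybrid Approach for Scholarly Information Extraction", 18.0),
    Line("A. Author and B. Author", 12.0),
    Line("Metadata extraction from documents forms an essential part", 10.0),
    Line("of web or desktop search systems.", 10.0),
]
fields, centroids = extract_metadata(sample)
print(fields["title"], fields["authors"])
```

In this toy setting no labeled training lines are needed, because the labels come from a rule applied to whole clusters. A practical system would replace that rule with a learned classifier and use richer layout and text features.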

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Transfer of copyright agreement: When the article is accepted for publication, I, as the author and the representative of the coauthors, hereby agree to transfer to Studia Universitatis Babes-Bolyai, Series Informatica, all rights, including those pertaining to electronic forms and transmissions, under existing copyright laws, except for the following, which the authors specifically retain: the authors can use the material however they want as long as it fits the NC ND terms of the license. The authors have all rights for reuse according to the below license.