A Hybrid Approach for Scholarly Information Extraction

  • Zalan Bodo, Department of Computer Science, Faculty of Mathematics and Computer Science, Babeș-Bolyai University, Cluj-Napoca, Romania
  • Lehel Csato, Department of Computer Science, Faculty of Mathematics and Computer Science, Babeș-Bolyai University, Cluj-Napoca, Romania


Metadata extraction from documents forms an essential part of web or desktop search systems.
Similarly, digital libraries that index scholarly literature need to find and extract the title, the list of authors, and other publication-related information from an article.
We present a hybrid approach for metadata extraction that combines classification and clustering to extract the desired information without the need for a conventional labeled training dataset.
An important asset of the proposed method is that the resulting clustering parameters can be used in other problems, e.g. document layout analysis.


How to Cite
BODO, Zalan; CSATO, Lehel. A Hybrid Approach for Scholarly Information Extraction. Studia Universitatis Babeș-Bolyai Informatica, [S.l.], v. 62, n. 2, p. 5-16, dec. 2017. ISSN 2065-9601. Available at: <https://www.cs.ubbcluj.ro/~studia-i/journal/journal/article/view/10>. Date accessed: 12 june 2024. doi: https://doi.org/10.24193/subbi.2017.2.01.