A Hybrid Approach for Scholarly Information Extraction
Abstract
Metadata extraction from documents forms an essential part of web or desktop search systems.
Similarly, digital libraries that index scholarly literature need to find and extract the title, the list of authors, and other publication-related information from each article.
We present a hybrid approach to metadata extraction that combines classification and clustering to extract the desired information without the need for a conventional labeled training dataset.
An important asset of the proposed method is that the resulting clustering parameters can be reused for other problems, such as document layout analysis.
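The abstract does not say which features, clustering algorithm, or classifier the method uses, so the following is only a minimal sketch of the general idea of pairing clustering with classification when no labeled data is available. Everything in it is an assumption made for illustration: font size as the only feature, a plain 1-D k-means, a nearest-centroid classifier, the mapping from cluster order to field names, and the sample lines themselves.

```python
from statistics import mean


def kmeans_1d(values, k, iterations=20):
    """Plain 1-D k-means (assumes k >= 2); returns the centroids in ascending order."""
    distinct = sorted(set(values))
    # Spread the initial centroids across the distinct observed values.
    centroids = [distinct[i * (len(distinct) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Keep the previous centroid when a group ends up empty.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return sorted(centroids)


def classify(font_size, centroids, labels):
    """Nearest-centroid classifier that reuses the clustering result."""
    nearest = min(range(len(centroids)), key=lambda c: abs(font_size - centroids[c]))
    return labels[nearest]


# Hypothetical (text, font size) pairs from the first page of an article.
lines = [
    ("Sample Article Title", 18.0),
    ("First Author, Second Author", 12.0),
    ("First body line", 10.0),
    ("Second body line", 10.0),
    ("Third body line", 10.0),
]

# Step 1: cluster the unlabeled lines by font size.
centroids = kmeans_1d([size for _, size in lines], k=3)

# Step 2: name the clusters with an assumed rule (smallest to largest font).
labels = ["body", "authors", "title"]

# Step 3: classify each line by its nearest centroid.
predictions = [(text, classify(size, centroids, labels)) for text, size in lines]
```

In this sketch the learned centroids are the only parameters, which loosely mirrors the point about reuse: the same values could be applied to lines from another page, for example `classify(17.0, centroids, labels)`, without any hand-labeled examples.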
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.