PyDsBuilder - A Dataset Builder Written in Python Django
Abstract
Mining and analyzing open-source projects has become crucial in recent research, driven by the large volume of data available across many programming domains. This paper has two objectives: first, to present an experience report on designing a data mining tool for software quality, and second, to provide an open-source solution, PyDs, that facilitates the creation of datasets aimed at analyzing software quality attributes. Built with Python and the Django framework, PyDs covers the full workflow for researchers: extracting data from repositories, applying software analysis tools, and consolidating the results into a coherent format suited to in-depth experimentation and analysis. The tool addresses the need for effective data mining capabilities in evaluating software quality, allowing the research community to make fuller use of the resources offered by open-source software projects.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Transfer of copyright agreement: When the article is accepted for publication, I, as the author and the representative of the coauthors, hereby agree to transfer to Studia Universitatis Babes-Bolyai, Series Informatica, all rights, including those pertaining to electronic forms and transmissions, under existing copyright laws, except for the following, which the authors specifically retain: the authors can use the material however they want as long as it fits the NC ND terms of the license. The authors have all rights for reuse according to the below license.