About a year ago, generating code with a Large Language Model (LLM) seemed out of reach. With recent advances in Artificial Intelligence, LLMs are now used successfully to generate software code, and automatic code generation has streamlined many real-world programming tasks. Alongside the wide adoption of code LLMs by developers, however, questions have been raised about the source code used as training data for these models. The models learn from training examples that may include open-source code released under restrictive licenses, which has raised concerns among developers who never intended for their code to be used to train language models.
The BigCode project, a collaboration between ServiceNow and Hugging Face, has released The Stack, a 3.1 TB dataset of permissively licensed source code spanning 30 programming languages. At a time when the use of open-source repositories as training data is under debate, BigCode has released the dataset to promote transparency around pre-training data.
The main idea is to let people choose whether their code is used to train and evaluate Machine Learning models. The Hugging Face website at https://huggingface.co/spaces/bigcode/in-the-stack allows people to conveniently opt out of having their repositories included in The Stack. By entering their GitHub username, developers can check whether any of their repositories appear in The Stack and, if so, request that the data be removed from all future versions of the dataset.
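For readers who prefer to query the dataset programmatically, the sketch below streams a single language subset of The Stack with the Hugging Face datasets library and scans for repositories owned by a given GitHub user. The dataset name ("bigcode/the-stack"), the per-language directory layout, and the repository-name column ("max_stars_repo_name") are assumptions about how the dataset is hosted on the Hub; the opt-out page above remains the authoritative way to check and request removal.

```python
# Minimal sketch, assuming The Stack is published as "bigcode/the-stack"
# with one directory per language and a "max_stars_repo_name" column.
from datasets import load_dataset

USERNAME = "octocat"  # hypothetical GitHub username to look for

# Stream one language subset so the multi-terabyte dataset is not downloaded in full.
stack_python = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",  # assumed per-language layout
    split="train",
    streaming=True,
)

# Scan a bounded sample of records and report repositories owned by USERNAME.
for i, record in enumerate(stack_python):
    repo = record.get("max_stars_repo_name", "")
    if repo.startswith(f"{USERNAME}/"):
        print("Found:", repo)
    if i >= 100_000:  # stop after a fixed number of records for this sketch
        break
```

Streaming keeps memory use low because records are fetched lazily, which matters for a dataset of this size.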
In their recently published paper, "The Stack: 3 TB of Permissively Licensed Source Code," the ServiceNow and Hugging Face team describe several contributions:
To obtain license details for the 137.36M GitHub repositories that make up the dataset, the team used GHArchive and the go-license-detector. The most common licenses were MIT and Apache 2.0. The paper also compares the size of The Stack with CodeParrot, one of the most popular code datasets, finding The Stack to be more than three times larger. In addition, The Stack is compared with the code datasets used for AlphaCode, CodeGen, and PolyCoder.
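To illustrate how a downstream user might rely on the released license metadata, the sketch below filters streamed records to files whose repository carries one of the two most common licenses named above. The license column name ("max_stars_repo_licenses") and its list-of-identifiers format are assumptions about the released schema, not something stated in the article.

```python
# Minimal sketch, assuming each record in "bigcode/the-stack" carries its
# detected licenses in a list-valued column (assumed name: "max_stars_repo_licenses").
from datasets import load_dataset

ALLOWED = {"MIT", "Apache-2.0"}  # the two most common licenses per the paper

stack_java = load_dataset(
    "bigcode/the-stack",
    data_dir="data/java",  # assumed per-language directory layout
    split="train",
    streaming=True,
)

kept = total = 0
for record in stack_java:
    total += 1
    licenses = set(record.get("max_stars_repo_licenses") or [])
    if licenses & ALLOWED:  # keep files from repos with an allowed license
        kept += 1
    if total >= 10_000:  # bounded scan for this sketch
        break

print(f"{kept} of {total} sampled files carry an MIT or Apache-2.0 license")
```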
A lack of transparency around training data has long been an obstacle to responsible model development. By releasing this enormous dataset and sharing the entire data curation process, ServiceNow Research and Hugging Face have taken a clear step toward transparency in code LLMs.
Check out the paper. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.