Synced
AI Technology & Industry Review
With ChatGPT racking up more than a million users in less than a week, large language models (LLMs) have captured the public imagination much as image generation models did last year. Countless social media posts have showcased the OpenAI model’s conversational abilities, while tech-oriented users have focused on its impressive code generation results. Although effective code LLMs promise to significantly simplify programming tasks, progress in this area has been hindered by a lack of transparency regarding the license terms of their training data.
In the new paper The Stack: 3 TB of Permissively Licensed Source Code, a team from ServiceNow Research and Hugging Face advances open and responsible research on code LLMs by releasing The Stack, a 3.1 TB dataset of permissively licensed source code in 30 programming languages. The researchers train 350M-parameter decoder-only transformers on various Python subsets to demonstrate The Stack’s effectiveness and robustness on text2code benchmarks.
The team’s main contributions span the construction of the dataset itself, a comparison with existing code datasets, and training experiments that validate its quality.
The team built their dataset from 137.36M GitHub repositories in the GH Archive, extracting available license information for each repository and running the go-license-detector tool on repositories where that information was missing. MIT and Apache 2.0 were the most frequently detected licenses (9.6% and 2.7% of the total repositories, respectively). The team then applied near-deduplication techniques to remove files judged near-duplicates of other files, producing the final Stack dataset of source code under permissive licenses (defined as those with “minimal restrictions on how the software can be copied, modified, and redistributed”).
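The paper does not spell out its near-deduplication pipeline in this article, but the core idea — treating two files as duplicates when their token-shingle sets are highly similar — can be illustrated with a minimal, pure-Python sketch. The function names (`shingles`, `jaccard`, `near_dedup`), the shingle size, and the 0.85 threshold are illustrative assumptions, not the paper's actual parameters; production pipelines typically use MinHash signatures instead of exact Jaccard to scale to millions of files.

```python
def shingles(code: str, k: int = 5) -> set:
    """Split source code on whitespace and collect k-token shingles."""
    toks = code.split()
    return {" ".join(toks[i:i + k]) for i in range(max(1, len(toks) - k + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_dedup(files: dict, threshold: float = 0.85) -> dict:
    """Greedily keep a file only if it is not a near-duplicate
    (Jaccard >= threshold) of any already-kept file."""
    kept_sigs = {}
    for name, code in files.items():
        sig = shingles(code)
        if all(jaccard(sig, other) < threshold for other in kept_sigs.values()):
            kept_sigs[name] = sig
    return {name: files[name] for name in kept_sigs}
```

For example, passing in two identical files and one unrelated file keeps only one copy of the pair plus the unrelated file. The exact-Jaccard comparison here is quadratic in the number of kept files, which is why large-scale pipelines replace it with locality-sensitive hashing.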
The team compared The Stack with the popular code datasets CodeParrot, AlphaCode, CodeGen, and PolyCoder, noting that while The Stack and CodeParrot both provide source code in 30 programming languages, the others cover at most 12. The Stack is also larger than CodeParrot in each of the 30 programming languages and 3x larger in total.
To evaluate The Stack’s quality, the team trained 350M-parameter decoder-only transformers on its Python subsets. The results show that near-deduplicating the data significantly boosts performance, and that it is possible to match Codex and CodeGen text2code performance on the HumanEval benchmark using only permissively licensed data.
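Text2code results on HumanEval are conventionally reported as pass@k — the probability that at least one of k sampled completions passes the unit tests. The standard unbiased estimator (introduced in the Codex paper, and the usual way such scores are computed) can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n completions were sampled per problem and c of them
    passed the problem's unit tests."""
    if n - c < k:
        # Fewer failures than k: some correct sample is always drawn.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with 10 samples of which 5 pass, pass@1 is 0.5; the benchmark score is this quantity averaged over all HumanEval problems. Whether the paper uses exactly this estimator is an assumption based on standard practice for HumanEval reporting.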
The team plans to further improve The Stack in the future and hopes it will become a helpful resource for open and responsible research on code LLMs.
The Stack dataset is available on the Hugging Face website. The paper The Stack: 3 TB of Permissively Licensed Source Code is on arXiv.
Author: Hecate He | Editor: Michael Sarazen