Building StarCoder, an Open Source LLM Alternative
A challenge with proprietary large language models, particularly for regulated industries, is that they lack transparency in how they are developed.
This is not a trivial issue. For instance, amid all the hullabaloo around AI assistants, it’s easy to forget that OpenAI, Microsoft and GitHub still face a lawsuit over the coding assistant Copilot. Indeed, last month a judge allowed the lawsuit to move forward despite an attempt to have it dismissed, which, to be fair, is a standard move in lawsuits. It’s also worth noting that concerns about the use of personal information led Italy to temporarily ban ChatGPT and then launch an ongoing investigation into OpenAI’s compliance with the European Union’s General Data Protection Regulation (GDPR).
Why Create an Open Source Model
StarCoder: May the Source Be With You, a Cornell-published paper about the project, explained why creating the open source model was necessary. It noted that while OpenAI and other AI startups have made their LLMs available for use to the general public through a paid API, they have not shared all the details regarding the development process.
“While API access allows researchers to experiment with these models, it limits their ability to research LLM safety and alignment and inspect the models’ inner workings,” the paper noted. “Additionally, the high development costs make it nearly impossible for academic institutions to develop these models from scratch, which has created anxiety among academic researchers about whether they can meaningfully contribute to new AI breakthroughs.”
Another drawback of proprietary systems is the inability to adapt them to your own domain or codebase, the StarCoder team noted in a recent blog post about how developers can create their own coding assistant with the LLM.
The model isn’t just for code completion, either, said Leandro von Werra, a machine learning engineer at Hugging Face and co-lead on the project. It was trained not only on raw code but also on GitHub commits and issues, which taught it a lot about chat.
“The model can also respond, for example, to GitHub issues,” he said. “One thing that was quite interesting that we found is if we just showed the model a lot of examples of conversations about coding problems, like a conversation between a human and a hypothetical assistant, the model would also be able to answer questions. So we were able to use it as a tech assistant, where you can say, ‘I have this error in Python. What should I do?’ It would try to help you, which was a little bit surprising because it was primarily trained on code, not to chat.”
Training it a bit more explicitly for chat yields better results, he said, adding that the Big Code team has created an alpha version of a chat model called StarChat.
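The few-shot approach von Werra describes, showing the model example conversations and then appending a new question, can be sketched as plain prompt construction. This is a minimal illustration only: the `Human:` and `Assistant:` labels, the helper name `build_assistant_prompt`, and the example dialogues are assumptions made for this sketch, not the format the Big Code team actually used.

```python
def build_assistant_prompt(examples, question):
    """Build a few-shot prompt from (human, assistant) example pairs.

    ASSUMPTION: the "Human:"/"Assistant:" turn labels are hypothetical;
    the article does not say which labels the team used.
    """
    turns = [f"Human: {human}\nAssistant: {assistant}" for human, assistant in examples]
    # End on an open "Assistant:" turn so the model continues with an answer.
    turns.append(f"Human: {question}\nAssistant:")
    return "\n\n".join(turns)


# Hypothetical example conversations about coding problems.
examples = [
    ("How do I reverse a list in Python?", "Use my_list[::-1] or my_list.reverse()."),
    ("What does a KeyError mean?", "The dictionary has no entry for the key you requested."),
]

prompt = build_assistant_prompt(examples, "I have this error in Python. What should I do?")
print(prompt)
```

The resulting string would be passed to the model as ordinary text to complete; nothing here depends on a particular inference library.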
The Challenge in Creating Open Source LLMs
Big Code recently released its LLM, StarCoderBase, which was trained on 1 trillion tokens (“words”) in 80 programming languages from the dataset The Stack, a collection of source code in over 300 languages. The team then further trained StarCoderBase for 34 billion tokens on the Python subset of the dataset to create a second LLM called StarCoder.
StarCoder is not the only open source LLM for code, but it is the most recent and most performant one, von Werra claimed. There’s also Salesforce’s CodeGen Mono 16B for Python and Replit’s 3B parameter model trained on 20 programming languages.
One of the barriers to creating open source LLMs is that training on such datasets requires a lot of compute power. That’s not something most open source projects can afford. In September 2022, Hugging Face and ServiceNow Research launched Big Code, an open science collaboration. Hugging Face is a large open source community that builds tools for machine learning models based on open source code and technologies. ServiceNow Research is an enterprise AI company. Both companies made their compute clusters available for the large-scale training of Big Code’s StarCoder and StarCoderBase. Since its launch, 600 more members from academic institutes and industry labs have joined the Big Code effort.
StarCoder is trained using only “permissively licensed code on GitHub,” explained von Werra. The 15.5B parameter model is trained on one trillion tokens sourced from 80+ programming languages, GitHub issues, Git commits, and Jupyter notebooks.
The models can copy verbatim from the pretraining data, and even permissively licensed code still requires attribution, von Werra added. In the VS Code extension, there is a quick test to see whether code generated by the model was in the pretraining data, and a full-text search to find exactly where the code came from and how it is licensed, he explained.
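To give a feel for what such a check involves, here is a rough sketch of an overlap test: it indexes short token windows from a corpus and reports which sources share a window with generated code. This is not how the extension is implemented, which the article does not describe. The whitespace tokenizer, the window size, the function names and the two-file corpus are all assumptions made for illustration.

```python
def ngrams(text, n):
    """Return the set of n-token windows in text (whitespace tokenization)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus, n):
    """Map every n-token window to the set of sources that contain it."""
    index = {}
    for source, text in corpus.items():
        for gram in ngrams(text, n):
            index.setdefault(gram, set()).add(source)
    return index


def find_overlaps(generated, index, n):
    """Return the sources that share at least one n-token window with generated."""
    hits = set()
    for gram in ngrams(generated, n):
        hits |= index.get(gram, set())
    return hits


# Hypothetical two-file "training corpus"; keys carry the license for attribution.
corpus = {
    "repo_a/utils.py (MIT)": "def add(a, b):\n    return a + b",
    "repo_b/math.py (Apache-2.0)": "def mul(a, b):\n    return a * b",
}
index = build_index(corpus, n=4)

generated = "x = 1\ndef add(a, b):\n    return a + b"
print(find_overlaps(generated, index, n=4))
```

A production system working over a dataset the size of The Stack would need a far more compact structure than a Python dictionary, but the principle of matching generated spans against indexed training text is the same.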
“If you have a 15 billion parameter model, you have 15 billion things that you can adjust and optimize during training,” von Werra said. “You need a lot of GPUs and a lot of data. That’s the main thing. Training StarCoder required roughly 500 GPUs for almost a month, 24 days of training. That’s quite expensive.”
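The scale von Werra describes is easier to grasp as arithmetic. The sketch below turns the quoted figures (roughly 500 GPUs, 24 days, 15.5 billion parameters, one trillion tokens) into GPU-hours and a rough training-compute estimate. The "6 × parameters × tokens" formula is a widely used rule of thumb and an assumption here, not a figure from the Big Code team.

```python
# Figures quoted in the article; both the GPU count and duration are approximate.
NUM_GPUS = 500
TRAINING_DAYS = 24
NUM_PARAMETERS = 15.5e9
NUM_TOKENS = 1e12


def gpu_hours(num_gpus, days):
    """Total GPU-hours if every GPU runs for the whole period."""
    return num_gpus * days * 24


def approx_training_flops(params, tokens):
    """Rough training compute.

    ASSUMPTION: uses the common rule of thumb of about 6 floating-point
    operations per parameter per training token; not a reported number.
    """
    return 6 * params * tokens


total_gpu_hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
total_flops = approx_training_flops(NUM_PARAMETERS, NUM_TOKENS)

print(f"{total_gpu_hours:,} GPU-hours")
print(f"{total_flops:.2e} FLOPs (rule-of-thumb estimate)")
```

Under these assumptions the run comes to 288,000 GPU-hours, which is why von Werra calls it "quite expensive" even before hardware prices are considered.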
By comparison, GPT is rumored to have a trillion parameters, but bigger is not always better, Ori Goshen, AI21 Labs co-founder and co-CEO, told The New Stack’s Senior Editor Richard MacManus in March.
LLM size “plays a factor, but it’s not the only factor,” said Goshen. “So we’ve stopped referring to the size because it can be misleading about the actual performance of the model.”
Ethically Sourced Training Data
Beyond using only GitHub material that was permissively licensed, Big Code took other steps to ensure its training data is “ethically sourced.” First, it stripped out personally identifiable information (PII), such as names, email addresses, and passwords that might be in the code.
“One thing that you can quite easily do with these language models is you can prompt them to generate PII if it was trained on such information,” von Werra said. “You could, for example, input to the model ‘password equals’ and then the model would generate a password that it has seen during pre-training. We created a dataset, an annotated data set where we know if there was PII and we trained a model to detect and then we applied that to the whole data set to remove this information such that you can’t easily abuse the model to create a big data set of personal information.”
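Big Code trained a model to detect PII, and the article does not describe that model. As a much simpler stand-in, the sketch below uses regular expressions to show the general "detect, then replace" step on source code. The patterns, the placeholder strings and the function name `redact_pii` are assumptions made for illustration, and regexes like these would miss many cases that a trained detector catches.

```python
import re

# ASSUMPTION: deliberately simple patterns; not Big Code's detection method.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches assignments such as: password = "..." or pwd='...'
SECRET_RE = re.compile(r"""(password|passwd|pwd)(\s*=\s*)(["'])(.*?)\3""", re.IGNORECASE)


def redact_pii(source):
    """Replace email addresses and quoted password values with placeholders."""
    source = EMAIL_RE.sub("<EMAIL>", source)
    # Keep the variable name, spacing and quotes; swap only the secret value.
    source = SECRET_RE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}<PASSWORD>{m.group(3)}",
        source,
    )
    return source


snippet = 'db_password = "hunter2"\ncontact = "jane.doe@example.com"\n'
print(redact_pii(snippet))
```

Replacing the value rather than deleting the line keeps the surrounding code intact, so the file is still usable as training data after redaction.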
Second, Big Code added an opt-out process. Developers can look up whether their code was used to train the model and then, by completing a form, opt out of having it used for future model training.
StarCoder Compared to Copilot
How does it compare to Copilot? One of the first OpenAI models presumed to power Copilot was called Cushman, von Werra said. StarCoder either matched or outperformed Cushman on the HumanEval benchmark, he said.
“We found that on this HumanEval benchmark, they’re either the same performance or better depending on the language (we train on many languages and we evaluate many languages), but in general, we match the performance of the first iteration of Copilot,” von Werra said. “We also outperform it on some other benchmarks that are more related to data science coding tasks; there’s a DS-1000 benchmark that we’re pretty good at with StarCoder.”
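HumanEval scores are usually reported as pass@k: the estimated chance that at least one of k generated samples for a problem passes its unit tests. The article does not say which k was used for the comparison, so the sketch below only shows how the commonly used unbiased estimator is computed for a single problem. The sample counts in the example are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the unit tests
    k: number of samples the user is assumed to draw
    """
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a passing one.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical result: 200 samples for one problem, 50 of them correct.
print(pass_at_k(200, 50, 1))
print(pass_at_k(200, 50, 10))
```

A benchmark score is the average of this value over all problems, so two models can only be compared fairly when n and k are the same for both.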