The Dark History Behind GitHub Copilot's Success

Posted under Programming, Technology by James Steward

Over the past few years, we have seen considerable advances in large language models (LLMs), with parameter counts growing exponentially. However, simply scaling these models up does not make them viable for real-world adoption.
One of the key verticals where LLMs have been deployed is AI-powered auto-coders. These models take natural language prompts and automatically write code snippets in the syntax of a given programming language.
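As a purely illustrative sketch (not actual Copilot or Codex output), the exchange typically looks like a plain-language comment followed by a model-generated completion:

```python
# Natural-language prompt given to the auto-coder, written as a comment:
# "Return the n most frequent words in a text file."

# The kind of completion such a tool might suggest:
from collections import Counter


def top_words(path, n=10):
    """Read a text file and return the n most common words with their counts."""
    with open(path, encoding="utf-8") as f:
        words = f.read().lower().split()
    return Counter(words).most_common(n)
```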
Real-world adoption of an auto-coder depends on a variety of factors, and one need look no further than the current best coding LLM on the market, GitHub Copilot. This so-called AI pair programmer can suggest code snippets and entire functions to a programmer as they edit, and has found widespread adoption and success in the developer community.
Mike Krieger, the co-founder of Instagram, had this to say about GitHub Copilot: “This is the single most mind-blowing application of machine learning I’ve ever seen.”
However, other companies that looked to enter this vertical over the past few years have not found similar success; some have even failed. Meanwhile, Copilot’s rise has been dotted with controversies that have stained the reputation of an otherwise spotless tool.
We can identify a variety of reasons why Copilot succeeded while others failed. While it gets its programming chops from OpenAI’s Codex LLM, a deeper look into Copilot’s runaway success shows that it was the right product released at the right time. 
Derived from GPT-3, Codex is a specialized version of the general LLM focused on translating natural language to code. Even before Codex was released to the public, OpenAI collaborated with Microsoft to create Copilot.
Codex not only starts from the parameters GPT-3 was trained on, but was also fine-tuned on billions of lines of source code from public GitHub repositories. This allowed it to learn code syntax as well as the contextual information needed for problem-solving tasks. Moreover, fine-tuning the model for coding-specific tasks made it fast and light on resources while maintaining a high degree of accuracy.
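For readers curious about how Codex was exposed to developers outside of Copilot, the snippet below is a minimal sketch of calling OpenAI's completions endpoint with a Codex-family model. The model name, the older `openai` Python client interface, and its continued availability are assumptions here; OpenAI has since retired the standalone Codex models.

```python
# Minimal sketch (not an official example): asking a Codex-style model to
# write code from a natural-language prompt via the legacy OpenAI Python client.
# "code-davinci-002" was the historical Codex model name and may no longer be served.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, not a real key

response = openai.Completion.create(
    model="code-davinci-002",
    prompt="# Python function that checks whether a string is a palindrome\n",
    max_tokens=64,
    temperature=0,
)
print(response.choices[0].text)
```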
Kite was one of the companies that failed, as it was unable to create a model that could complete code on par with Copilot. Apart from the technology not being mature enough at the time, Kite did not have the resources required to build a state-of-the-art model like Codex. The company estimated that building such a model would cost around $100 million, owing to the computing resources required for training and inference.
Microsoft not only acquired an exclusive license for GPT-3, but also worked closely with OpenAI to create Codex. Moreover, it has the near-infinite scalability of Microsoft Azure to train and deploy these models, affording it a sizable advantage over its competitors.
Microsoft’s goals for the developer market go far beyond Copilot, which represents just one piece of the puzzle. With Azure, Visual Studio, VS Code, and GitHub, Microsoft is already one of the most prominent companies in the developer space, and Copilot builds on that already powerful portfolio.
To begin with, Microsoft’s acquisition of GitHub solidified its position as a leader in developer tooling. For the tech stack, it partnered with OpenAI to license GPT-3. Microsoft then developed Codex along with the OpenAI team and trained it on open-source repositories hosted on the platform, giving it one of the best datasets available for the task.
Even with so many reasons for Copilot to be a good product, the infrastructure behind it is equally important. Microsoft Azure is not only scalable but also offers cloud services optimized for training and deploying machine learning models. This is the backbone of Copilot: a globally available, scalable hardware pipeline that can be accessed on demand.
It is simply not viable for other companies to access a dataset like the one Microsoft used to train Codex, as seen with TabNine. Even though TabNine is a close competitor to Copilot, many developers still prefer Microsoft’s product. With a smaller dataset and a less capable base model (GPT-2), TabNine does not perform as well as Copilot, producing messier code with a higher tendency to make mistakes and cause errors.
Even though Copilot may seem like the answer to every coding problem, it is not without its own host of issues. The product’s origins show a more dangerous side of the auto-coding market.
Large language models are not an easy technology to access and deploy. Even with many companies fielding competing models, the ones with the deepest pockets and the most cloud computing resources will win out.
Copilot has succeeded not because it is a good product, but because of Microsoft’s backing. From Azure to OpenAI to the huge cost of training and running the model for millions of developers, Microsoft has footed the bill for Copilot in the hope that it becomes a money-making product sometime in the future.
Beyond the broader concern that LLMs work against open access for all, GitHub Copilot has its own share of blots. A class-action lawsuit has been filed against the company on the grounds that Microsoft violated the rights of the vast number of creators whose code was used to train the model. This dataset, one of the main reasons for Copilot’s accuracy, was scraped from the hard work of thousands of developers. Meanwhile, Replit’s Ghostwriter, which competes in the same field with responsibly sourced datasets, is struggling to capture market share.
Considering these factors, it is likely that other companies will also jump on the auto-coding bandwagon as an application of LLMs. As bigger players enter the field, Copilot’s unregulated use of open-source code and raw cloud computing power will become the norm, raising the barrier to entry for companies that want to do things the right way. Competing against the never-ending coffers of tech giants, smaller companies simply cannot create a product with comparable latency, cost, and usability.
We are already seeing this pattern, with Amazon Web Services releasing a competing product called CodeWhisperer. However, it still misses out on Copilot’s silver bullet of a dataset: code from GitHub repositories. This is an advantage no company apart from Microsoft will ever have, and it sets a dangerous precedent for the future of auto-coding platforms.
While the market for LLM-generated code looks set to consolidate, smaller companies doing things the right way might come out on top after all.