This Week in Numbers: Uniqueness Is Rare on GitHub

OK, I admit it. I rely on Adrian Colyer to read dense computer science articles that are loaded with math beyond my comprehension. On his blog, Colyer promises to review “an interesting/influential/important paper from the world of computer science” each weekday (Whew! He must have a long commute to his day job, as a Venture Partner at London’s Accel).
A recent edition caught the interest of many people — a paper asserting that most files on GitHub are not original. At the heart of many developers’ open source world, GitHub enables collaboration within a version control system. It turns out that most collaboration is building on top of the work of others. According to the authors of the paper Colyer studied, DéjàVu: A Map of Code Duplicates on GitHub, eighty-two percent of files in non-forked projects written in Java, C++, Python or JavaScript are found in another project’s code base.
Java has the fewest duplicated files, but even there about half of the remaining files can be considered similar to files in other projects. These were likely cloned from another repository and only slightly modified, for example by adding comments, moving code around or adding a few extra lines. JavaScript’s tendency to use many smaller files skews the numbers somewhat. More significantly, many projects include libraries available through npm. This is a problem because when library components are committed as application code, upstream changes in frameworks and libraries are less likely to be picked up.
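To make the difference between an exact duplicate and a "similar" file concrete, here is a minimal Python sketch. It is not the method used in the DéjàVu paper. The function names (file_hash, token_fingerprint, similarity) and the sample snippets are hypothetical, and the comment stripping is deliberately naive: it assumes "#" always starts a comment.

```python
import hashlib
import re


def file_hash(source: str) -> str:
    """Hash of the raw text: equal only for character-for-character copies."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def token_fingerprint(source: str) -> frozenset:
    """Set of identifier and number tokens, with '#' comments dropped.

    Naive on purpose: a '#' inside a string literal is treated as a comment.
    Whitespace, comment text and statement order are all ignored.
    """
    without_comments = re.sub(r"#[^\n]*", "", source)
    return frozenset(re.findall(r"[A-Za-z_]\w*|\d+", without_comments))


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets, from 0.0 to 1.0."""
    tokens_a, tokens_b = token_fingerprint(a), token_fingerprint(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical sample files, for illustration only.
original = "def area(w, h):\n    return w * h\n"
copied = "# borrowed from another repo\ndef area(w, h):\n    return w * h\n"
unrelated = "print('hello')\n"

# The added comment changes the raw hash, so this is not an exact duplicate...
exact_duplicate = file_hash(original) == file_hash(copied)
# ...but the token sets still match, so the files count as similar.
copied_score = similarity(original, copied)
unrelated_score = similarity(original, unrelated)
```

A raw hash only catches verbatim copies. Comparing tokens is one simple way to catch files that were cloned and lightly edited, and a real analysis would need a proper tokenizer for each language and a similarity threshold chosen with care.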
By its very nature, open source proves that imitation is a form of flattery, but has this gone too far? Of course not. Long live copycats. Yet the prevalence of dependencies creates unique challenges for security and software quality. There are ways to address these issues. GitHub has created tools to identify dependencies. Along with many security companies, Libraries.io has created tools to check your repositories’ components against their original source in the software supply chain. From a metrics perspective, we are still building consensus on just how to track these types of ecosystem dependencies. Stay tuned.