Microsoft accidentally exposes 38TB of internal data via GitHub repository

Microsoft Corp. has accidentally made 38 terabytes of internal data, including passwords, publicly accessible through a GitHub repository.

The data leak was detailed today by researchers from venture-backed cloud security startup Wiz Inc. The company originally discovered the issue on June 22 and reported it to Microsoft shortly thereafter. The software giant fixed the issue on June 24.

According to Wiz, the data leak affected a GitHub repository that Microsoft’s artificial intelligence research group uses to host open-source projects. The repository contains image recognition models and training datasets that can be used to build new neural networks. The information leak was caused by one of the training data files.

The file in question was hosted in an Azure Storage account, Wiz’s researchers determined. Azure Storage is a product portfolio that includes Azure’s object, file and block storage services along with a number of related offerings. Microsoft had meant to share publicly only an AI training dataset, but accidentally opened access to the entire Azure Storage account that contained the dataset.

“The Microsoft developers used an Azure mechanism called ‘SAS tokens,’ which allows you to create a shareable link granting access to an Azure Storage account’s data — while upon inspection, the storage account would still seem completely private,” Wiz researchers wrote in a blog post today. 

The misconfigured account exposed 38 terabytes’ worth of internal Microsoft files. Among those files were backups of two employee workstations. Wiz says that the backups contained over 30,000 internal Microsoft Teams messages from 359 staffers along with passwords, encryption keys and other sensitive files.

According to Wiz, the AI training dataset at the heart of the incident also created other cybersecurity problems. It could have enabled hackers not only to steal internal Microsoft files but also to launch cyberattacks against users of the GitHub repository through which the dataset was made accessible.

The latter issue was caused by two separate security weaknesses, Wiz detailed.

The first weak point was in the Azure Storage account that hosted the AI training dataset. Hackers could not only download the 38 terabytes of data in the account, according to Wiz, but also change or delete existing files.
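To illustrate how a shareable link can carry more access than intended, here is a minimal Python sketch that reads the query string of a SAS-style URL and reports what it appears to allow. The URL is hypothetical and is not the link from the incident. The field names (`sp` for permissions, `se` for expiry, `srt` for account-level resource types) reflect the SAS query format as we understand it, and the set of permissions treated as risky is an assumption made for this example.

```python
from urllib.parse import urlsplit, parse_qs

# Permission letters in the `sp` field that allow changing stored data
# (assumed subset, chosen for this illustration).
WRITE_LIKE = {"w": "write", "d": "delete", "a": "add", "c": "create"}


def describe_sas_url(url: str) -> dict:
    """Summarize what a SAS link appears to allow, based only on its query string."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    permissions = params.get("sp", "")
    return {
        "host": parts.netloc,
        # Account-level tokens carry `srt`; service-level tokens carry `sr` instead.
        "account_wide": "srt" in params,
        "expires": params.get("se"),
        "risky_permissions": sorted(
            WRITE_LIKE[letter] for letter in permissions if letter in WRITE_LIKE
        ),
    }


# Hypothetical link used only for illustration.
example = (
    "https://exampleaccount.blob.core.windows.net/?sv=2022-11-02&ss=b&srt=sco"
    "&sp=rwdlac&se=2030-01-01T00:00:00Z&sig=REDACTED"
)
report = describe_sas_url(example)
print(report)
```

A check like this could run before a link is published: a link that is account-wide or lists any write-like permission would be a reason to stop and issue a narrower, read-only one.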

The second issue that could have made cyberattacks possible has to do with the AI training dataset itself. Microsoft packaged the dataset into a file format called ckpt using pickle, an open-source serialization module for Python. Loading a pickle file can trigger arbitrary code execution, meaning hackers who modify such a file can plant malicious code that runs on the machine of anyone who later loads it.

Had the Azure Storage account not allowed users to change files it contained, the arbitrary code execution vulnerability would have been impossible to exploit. But because file changes weren’t blocked, it was theoretically possible for hackers to launch cyberattacks before the issue was fixed. 
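The risk can be shown with a short, harmless Python sketch. It is not based on Microsoft's files: the `Payload` class and the choice of `len` as the callable are stand-ins invented for this example. The first half shows that loading a pickle runs a call chosen by whoever wrote the file. The second half follows the allowlist pattern from the Python documentation on restricting globals, which refuses to resolve anything not explicitly permitted.

```python
import io
import pickle


class Payload:
    """Stand-in for attacker-controlled content inside a pickle file."""

    def __reduce__(self):
        # pickle stores "call len(...)" and replays that call when the data is loaded.
        # A real attacker would pick a far more dangerous callable.
        return (len, ("attacker chose this call",))


blob = pickle.dumps(Payload())

# The stored call runs here, during loading; the result is not a Payload object.
result = pickle.loads(blob)
print(type(result).__name__, result)


class AllowlistUnpickler(pickle.Unpickler):
    """Refuse to resolve any global that is not explicitly allowed."""

    ALLOWED = {("builtins", "dict"), ("builtins", "list")}

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)


try:
    AllowlistUnpickler(io.BytesIO(blob)).load()
    blocked = False
except pickle.UnpicklingError as error:
    blocked = True
    print(error)
```

An allowlist narrows the problem but does not make untrusted pickle files safe in general, so files in this format are best loaded only from sources that outsiders cannot modify.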

“AI unlocks huge potential for tech companies,” said Wiz co-founder and Chief Technology Officer Ami Luttwak. “However, as data scientists and engineers race to bring new AI solutions to production, the massive amounts of data they handle require additional security checks and safeguards.” 

