NASA and IBM to Speed AI Creation with New Foundation Models
NASA and IBM are working together to create foundation models based on NASA’s data sets — including geospatial data — with the goal of accelerating the creation of AI models.
Foundation models are trained on large, broad data sets, then adapted into other AI models using smaller, targeted data sets. Foundation models can be used for different tasks and can apply information learned in one situation to another. One real-world example of a foundation model at work is ChatGPT, which was built on the foundation model GPT-3.
Priya Nagpurkar, who oversees IBM’s hybrid cloud platform and developer productivity research strategy, said this approach accelerates the creation of AI models.
“We are excited about this being a key proof point and really being the first time we at IBM are applying foundation model technology towards sciences and to this scale of data in particular,” Nagpurkar said. “Foundation models are part of a big push at IBM Research and what excites us about foundation models is it’s this emerging AI technology, which can ingest large amounts of unlabeled data and transfer — learn in one area and apply it to others — [which] significantly simplifies downstream tasks and AI applications, and also removes the need for large amounts of labeled data.”
Two AI Model Goals for NASA and IBM
Foundation models can be powerful: It originally took IBM seven years to train Watson in 12 languages. By using a foundation model, IBM accelerated Watson’s language abilities to 25 languages in approximately one year.
“The key thing here is it will augment and accelerate the scientific process in terms of building and solving specific science problems,” said Rahul Ramachandran, senior research scientist at NASA’s Marshall Space Flight Center in Huntsville, Alabama. “Instead of people having to build their own individual machine learning pipelines starting from collecting the large volumes of training data, you can start with the foundation models, and with a few limited or well-curated training samples, you should be able to build your applications that would meet your scientific or application needs.”
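Ramachandran's point about starting from a foundation model with a few well-curated samples is, in spirit, the transfer-learning pattern: keep the pretrained feature extractor frozen and fit only a small task-specific head. The sketch below is a toy illustration of that pattern in plain Python, not the actual NASA/IBM pipeline; the frozen weights and samples are invented for the example.

```python
import math

# Stand-in for a "pretrained" backbone: weights learned elsewhere on
# large unlabeled data, kept frozen during downstream fine-tuning.
FROZEN_W = [[0.9, -0.2], [0.1, 1.1]]

def extract_features(x):
    """Frozen feature extractor: a fixed linear map standing in for a
    foundation model's learned representation."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in FROZEN_W]

def train_head(samples, labels, lr=0.5, epochs=200):
    """Fit only a tiny logistic-regression head on a few labeled
    samples; the backbone weights are never updated."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = extract_features(x)
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss with respect to z
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g

    def predict(x):
        f = extract_features(x)
        return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
    return predict

# Only four labeled examples are needed once features are reused.
predict = train_head([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]],
                     [0, 0, 1, 1])
```

Because the expensive representation learning happened once, upstream, each downstream task only has to train the small head — which is the productivity gain Ramachandran describes.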
The goal of this joint work is to advance scientific understanding, as well as the response to Earth and climate-related issues such as natural disasters and warming temperatures, the joint press release stated. The collaboration will apply foundation models in two areas:
- The first model will be trained on over 300,000 earth science publications to thematically organize the literature and make it easier to search and discover new knowledge.
- The second model will be trained on USGS and NASA’s popular Harmonized Landsat Sentinel-2 (HLS2) satellite dataset, with uses ranging from detecting natural hazards to tracking changes in vegetation and wildlife habitats. For example, it could be used to estimate tornado damage by detecting damaged roofs, determine the boundaries of floods, help with grassland management and estimate biomass.
Technical Challenges with Geospatial Data
In 2020, NASA held a workshop about incorporating AI and machine learning, where NASA identified two challenges: First, the lack of training data sets required to train deep learning models — which Ramachandran called “a major scientific bottleneck.” Second, the existing AI models do not generalize across space and time.
While there is already a proof-of-concept for the first language model project — which IBM and NASA speculated could be ready by mid-year — the second goal faces technical challenges, IBM and NASA officials acknowledged during a press conference Tuesday.
“We’re looking at new, innovative solutions that can address these problems […] I think that there is a potential for the foundation models to address these challenges,” Ramachandran said.
Raghu Ganti, principal researcher at IBM, further explained that the HLS2 data set’s remote-sensing observations create unique challenges because the data is geospatial, carrying both time and space information.
“The kind of transformer technology on which foundation models are built will have to change in order to train a model on top of such data,” he said. “Those are the questions that we are exploring.”
Transformer technology is a deep learning model used primarily in the fields of natural language processing and computer vision. Transformers are designed to process sequential input data, such as natural language, for tasks like translation and summarization. Unlike older approaches, such as recurrent neural networks, transformers process the entire input all at once — so rather than digesting one word at a time, transformer technology can process a whole sentence. This approach reduces training time.
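The whole-sequence behavior comes from self-attention, in which every position attends to every other position in a single pass, with no recurrence. As a rough illustration (a minimal scaled dot-product attention over a toy three-token sequence, not IBM's training code):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: every position looks at every
    other position in one pass, rather than one step at a time."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this position to every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output is a weighted mix of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 3-token sequence with 2-dimensional embeddings.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outs, weights = attention(seq, seq, seq)
```

Since each position's scores are independent of the others, all of them can be computed in parallel — which is what lets transformers drop the token-by-token processing of recurrent networks.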
There’s also the matter of scale: NASA’s archive currently holds 70 petabytes of data and is projected to grow to 250 petabytes within a few years with the launch of high-data-rate missions such as SWOT, launched in December, and NISAR, planned for 2024.
“All our data is openly available, we support 7 billion users worldwide who access our data for research and applications,” Ramachandran said. “Our goal is to make our data, the NASA data — which is really valuable to the scientific community — discoverable, accessible and usable for broad scientific use and applications worldwide.”
One goal of the projects will be to lower the barriers of entry for end users to put that data to work, he said.
NASA and IBM to Use PyTorch
The models will be open and available to the public, and will leverage PyTorch, Ganti said.
“Our training platform leverages PyTorch, we solely rely upon PyTorch for training all our foundation models, and we have partnered with PyTorch as well, to drive the training of all these models,” Ganti said. “PyTorch is the go-to deep learning framework for all the developers in open source and we just want to make sure all our foundation models, the technology for training, the models that we train on are all in PyTorch and contributed back to the community.”
The foundation model platform is built on Red Hat OpenShift, which supports running on any hyperscaler in a public or private cloud. Red Hat OpenShift will “let you train these models with recipes out of the box much faster,” Ganti said.
For example, for a model already built with NASA data, the team used roughly a billion tokens. Tokenization is splitting the input data into a sequence of meaningful parts, according to ML engineer and AI blogger Vaclav Kosar.
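As a rough illustration of that splitting step, here is a toy word-level tokenizer in plain Python (production foundation models typically use subword schemes such as BPE, and the sample corpus below is invented for the example):

```python
def build_vocab(corpus):
    """Assign each unique whitespace-delimited word an integer id,
    reserving id 0 for unknown words."""
    vocab = {"<unk>": 0}
    for text in corpus:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(text, vocab):
    """Split the input into meaningful parts and map each to its id."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

vocab = build_vocab(["flood boundaries detected", "roof damage detected"])
ids = tokenize("flood damage detected", vocab)
```

Counts like the “billion tokens” Ganti cites refer to how many such ids a model consumes during training.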
“To give you a comparison, an open source model, which is a very high-quality model, has 50 billion tokens,” Ganti said. “This particular one takes…maybe it’s around six hours on 32 GPUs — that’s pretty much what it is doing right now. So you see significant speed improvement because of all the streamlining of the training approach that we are taking.”
That’s a big improvement to developer productivity, he noted, adding that putting that into the hands of developers is something “we are strategically interested in.”