
Mnemonic: Memory Management for Big Data

8 Jan 2018 3:00am, by

Mnemonic, one of the latest projects to achieve top-level status with the Apache Software Foundation, is designed to address the garbage collection problems that Big Data applications encounter.

It provides a generic in-memory persistence object model to address performance problems including long pauses or heavy resource usage for interactive analytics.

Java Virtual Machines (JVMs) have no application awareness, whereas the user knows when the program can pause briefly to clean up garbage and when a critical session cannot, explained Yanping “Lynn” Wang, co-creator of the project, which originated at Intel.

“We wanted to give users more flexibility and a better model than JVM garbage collection,” she said. “If you have some critical data you do not want to be bothered by JVM garbage collection, let us handle it.”

With the Mnemonic model, all users need to do is declare their objects non-volatile.

The Java-based project includes a unified framework, a durable object model and computing model, an extensible focal point for optimization, and integration with Big Data projects like Apache Hadoop and Apache Spark.

Hardware + Software Improvements

Yanping Wang and Gang “Gary” Wang worked together on how to improve memory management for Big Data, ultimately writing 6,000 lines of code.

The software provides Java programmers a way to define objects and structures to be managed by Mnemonic. In testing their work on Spark, they were able to cut pause time in half and achieve 2x to 3x gains in speed, she said. Companies such as Cisco and Cloudera have joined the project at ASF, she said, while Intel’s artificial intelligence and client computing groups are furthering the work.

Yanping Wang, meanwhile, has left Intel to focus the work on financial trading use cases, where no pauses are acceptable.

Storage continues to evolve, with NAND and 3D XPoint filling the speed and capacity gap between DRAM and hard disk drives, she said.

“We have a lot of non-volatile memory these days. In the old days, there was a memory disk and cache, which left speed gaps. Then software developed a lot of algorithms for caching — if they didn’t do the caching, some of the data they wanted to use was swapped back to disk. Then it would take an extremely long time to access. Now hardware developers are starting to put fast memory to fill in the gaps between cache and disk, so a large amount of memory is available. That chunk of memory is not being used very effectively so far on Spark, Hadoop and others on the software side,” she said.

In-Place Computing

Mnemonic is a project to extend in-memory computing to in-place computing using next-generation NVM storage media.

For Big Data computation, it keeps massive data objects and their schemas on storage, with no need to serialize or deserialize the objects.
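To make the idea concrete, here is a minimal sketch of working on data where it lives, without a serialize/deserialize step. It does not use the Mnemonic API; it uses only the standard Java memory-mapped file classes as a stand-in for non-volatile memory, and the class name `InPlaceRecords` and its fixed record layout (an 8-byte id followed by an 8-byte value) are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: a memory-mapped file stands in for non-volatile memory.
// This is NOT the Mnemonic API; the names and layout are hypothetical.
public class InPlaceRecords {
    // Assumed fixed layout per record: 8-byte id, then 8-byte value.
    public static final int RECORD_SIZE = Long.BYTES + Double.BYTES;

    // Map the backing file so records can be read and written in place.
    public static MappedByteBuffer map(Path file, int records) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) records * RECORD_SIZE);
        }
    }

    // Write one record's fields directly at its offset; no object is serialized.
    public static void put(MappedByteBuffer buf, int index, long id, double value) {
        int base = index * RECORD_SIZE;
        buf.putLong(base, id);
        buf.putDouble(base + Long.BYTES, value);
    }

    // Compute over the stored records by reading each field where it sits.
    public static double sumValues(MappedByteBuffer buf, int records) {
        double total = 0;
        for (int i = 0; i < records; i++) {
            total += buf.getDouble(i * RECORD_SIZE + Long.BYTES);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".bin");
        file.toFile().deleteOnExit();

        MappedByteBuffer buf = map(file, 3);
        put(buf, 0, 1L, 1.5);
        put(buf, 1, 2L, 2.5);
        put(buf, 2, 3L, 4.0);
        buf.force(); // flush the mapped changes to the backing file

        // Map the same file again: the records are usable immediately,
        // with no deserialization pass to rebuild them.
        MappedByteBuffer reopened = map(file, 3);
        System.out.println(sumValues(reopened, 3));
    }
}
```

The point of the sketch is the access pattern: the second mapping reads fields straight from their stored offsets, so the cost of rebuilding objects from a byte stream never arises.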

“In-place computing means doing calculations where the data is,” she explained. “You do not need to move the data. We work with CPU — it could be GPU, on a card — memory can have a processing unit pretty much anywhere. Take the data, it can do sorting, do partition, when some critical work is being done, processing is done in the database itself, very close to the processing unit. In the old days, the computer could not do that because it was too expensive and the memory was so far away from the cache. This is based on hardware innovation as well.”

Actually, Mnemonic is a project somewhere between software and hardware, she said.

“The optimization is to ensure the best usage of hardware while the software has flexibility and a fast usage model. The software can run seamlessly, but underneath … we make sure hardware works the best, but also makes software happy,” she said.

Mnemonic provides a mechanism to communicate with native code directly through in-place object data update to avoid complex object data-type conversion and stack marshaling.

Using Apache Mnemonic, objects also can be directly accessed by other computing languages such as C/C++. The durable object model and durable computing model might also lead to new cache-less and SerDe-less (serializer and deserializer-less) architecture for high-performance applications and frameworks.
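The native-code path described above can be illustrated, loosely, with a direct byte buffer. This is a sketch of the general technique rather than Mnemonic's own mechanism: the class `SharedStats`, its two-field layout and its offsets are hypothetical, and only the Java side is shown. The assumption is that C or C++ code would reach the same memory through the JNI function `GetDirectBufferAddress` and read the fields as a plain struct, so nothing is converted or copied between the two sides.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: a direct buffer whose fixed layout could be read by
// native code as a C struct. NOT the Mnemonic API; names are hypothetical.
public class SharedStats {
    // Assumed layout, matching a C struct { int32_t count; double total; }
    // with the double aligned on an 8-byte boundary.
    public static final int COUNT_OFFSET = 0;
    public static final int TOTAL_OFFSET = 8;
    public static final int SIZE = 16;

    // Direct buffers live outside the Java heap and start zero-filled.
    // Native byte order keeps the bytes identical to what C code expects.
    public static ByteBuffer allocate() {
        return ByteBuffer.allocateDirect(SIZE).order(ByteOrder.nativeOrder());
    }

    // Update both fields in place; no intermediate object is created.
    public static void record(ByteBuffer buf, double sample) {
        buf.putInt(COUNT_OFFSET, buf.getInt(COUNT_OFFSET) + 1);
        buf.putDouble(TOTAL_OFFSET, buf.getDouble(TOTAL_OFFSET) + sample);
    }

    // Read the fields back from the same memory.
    public static double mean(ByteBuffer buf) {
        int n = buf.getInt(COUNT_OFFSET);
        return n == 0 ? 0.0 : buf.getDouble(TOTAL_OFFSET) / n;
    }

    public static void main(String[] args) {
        ByteBuffer buf = allocate();
        record(buf, 2.0);
        record(buf, 4.0);
        System.out.println(mean(buf));
    }
}
```

Because both sides agree on offsets and byte order, an update made by either one is visible to the other without data-type conversion, which is the property the article attributes to in-place object data update.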

The project is continuing to develop the library, she said, and is working on a pure Java memory service, durable object vectorization and durable query service features.

“Apache Mnemonic provides a unified interface for memory management,” said Yanhui Zhao, Apache Mnemonic committer. “It is playing a significant role in reshaping the memory management in current computer architecture along with the developments of large capacity NVMs, making a smooth transition from present mechanical-based storage to flash-based storage with the minimum cost.”

Feature Image: “The Flame of Memory” by Maxim Peremojnii, licensed under CC BY-SA 2.0.
