Alibaba Cloud has developed a new cluster management system called Eigen+ that achieved a 36% improvement in memory allocation efficiency while eliminating Out of Memory (OOM) errors in production database environments, according to research presented at the recent SIGMOD conference.
The system addresses a fundamental challenge facing cloud providers: how to maximize memory utilization to reduce costs while avoiding catastrophic OOM errors that can crash critical applications and violate Service Level Objectives (SLOs).
The development, detailed in a research paper titled “Eigen+: Memory Over-Subscription for Alibaba Cloud Databases,” represents a significant departure from traditional memory over-subscription approaches used by major cloud providers, including AWS, Microsoft Azure, and Google Cloud Platform.
The system has been deployed in Alibaba Cloud’s production environment. The research paper claimed that in online MySQL clusters, Eigen+ “improves the memory allocation ratio of an online MySQL cluster by 36.21% (from 75.67% to 111.88%) on average, while maintaining SLO compliance with no OOM occurrences.”
For enterprise IT leaders, these numbers can translate into significant cost savings and improved reliability. A 36-point jump in the memory allocation ratio means organizations can run more database instances on the same hardware while reducing, rather than increasing, the risk of outages.
The technology is currently deployed across thousands of database instances in Alibaba Cloud’s production environment, supporting both online transaction processing (OLTP) workloads using MySQL and online analytical processing (OLAP) workloads using AnalyticDB for PostgreSQL, according to Alibaba researchers.
The memory over-subscription risk
Memory over-subscription — allocating more memory to virtual machines than physically exists — has become standard practice among cloud providers because VMs rarely use their full allocated memory simultaneously. However, this practice creates a dangerous balancing act for enterprises running mission-critical databases.
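The arithmetic behind over-subscription is straightforward, and explains how an allocation ratio can exceed 100% (as in the 111.88% figure reported for Eigen+). A minimal sketch, with hypothetical instance sizes:

```python
# Hypothetical illustration: an over-subscribed host promises instances
# more memory in total than it physically has.
physical_mem_gb = 512

# Memory allocated to each database instance on the host (illustrative sizes).
instance_allocs_gb = [64, 64, 128, 128, 96, 96]

allocation_ratio = sum(instance_allocs_gb) / physical_mem_gb
print(f"allocation ratio: {allocation_ratio:.2%}")  # 576/512 -> 112.50%
```

The ratio stays safe only as long as the instances' *actual* combined usage stays under the physical capacity, which is exactly the balancing act described above.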
“[While] memory over-subscription enhances resource utilization by allowing more instances per machine, it increases the risk of Out of Memory (OOM) errors, potentially compromising service availability and violating Service Level Objectives (SLOs),” the researchers noted in their paper.
The stakes are particularly high for enterprise databases. Referring to availability data presented in the paper, the researchers wrote that “service availability declines significantly, often falling below the SLO threshold as the number of OOM events increases.”
Traditional approaches attempt to predict future memory usage based on historical data, then use complex algorithms to pack database instances onto servers. But these prediction-based methods often fail catastrophically when workloads spike unexpectedly.
The Pareto Principle solution
Rather than trying to predict the unpredictable, Alibaba Cloud’s research team discovered that database OOM errors follow the Pareto Principle—also known as the 80/20 rule. “Database instances with memory utilization changes exceeding 5% within a week constitute no more than 5% of all instances, yet these instances lead to more than 90% of OOM errors,” the team said in the paper.
Instead of trying to forecast memory usage patterns, Eigen+ simply identifies which database instances are “transient” (prone to unpredictable memory spikes) and excludes them from over-subscription policies.
“By identifying transient instances, we can convert the complex problem of prediction into a more straightforward binary classification task,” the researchers said in the paper.
Eigen+ employs machine learning classifiers trained on both runtime metrics (memory utilization, queries per second, CPU usage) and operational metadata (instance specifications, customer tier, application types) to identify potentially problematic database instances.
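The shape of that classification step can be sketched with a simple logistic scorer over the kinds of features the article lists. The feature names, weights, and threshold below are illustrative assumptions, not the paper's actual model:

```python
import math

# Hypothetical sketch of transient-instance classification.
# Feature names and weights are illustrative, not from the paper.
def transient_score(features, weights, bias):
    """Logistic score in [0, 1]; higher means more likely transient."""
    z = bias + sum(w * features[name] for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative runtime metrics and metadata for one instance.
features = {
    "weekly_mem_util_change": 0.08,  # 8% swing in memory utilization
    "qps_volatility": 0.6,
    "cpu_util_mean": 0.4,
    "is_premium_tier": 1.0,
}
weights = {
    "weekly_mem_util_change": 30.0,  # large swings dominate the score
    "qps_volatility": 1.5,
    "cpu_util_mean": 0.5,
    "is_premium_tier": 0.8,
}
score = transient_score(features, weights, bias=-4.0)
is_transient = score > 0.5  # if True, exclude from over-subscription
```

An instance flagged transient is simply left out of over-subscription, which is why a binary decision suffices where a precise usage forecast would be fragile.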
The system uses a sophisticated approach that includes Markov chain state transition models to account for temporal dependencies in database behavior. “This allows it to achieve high accuracy in identifying transient instances that could cause OOM errors,” the paper added.
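To make the Markov-chain idea concrete: temporal dependency means the probability of a memory spike depends on the instance's recent states, not just its average. A minimal sketch with made-up states and transition probabilities:

```python
# Hypothetical sketch: a Markov chain over coarse memory-usage states,
# capturing temporal dependencies. States and probabilities are
# illustrative, not from the paper.
states = ["low", "medium", "high"]
transition = {
    "low":    {"low": 0.90, "medium": 0.09, "high": 0.01},
    "medium": {"low": 0.10, "medium": 0.80, "high": 0.10},
    "high":   {"low": 0.05, "medium": 0.25, "high": 0.70},
}

def step(dist):
    """Advance a state distribution one time step."""
    out = {s: 0.0 for s in states}
    for s, p in dist.items():
        for t, q in transition[s].items():
            out[t] += p * q
    return out

# Probability of reaching the "high" state within 3 steps starting
# from "low" -- the kind of feature a classifier could consume.
dist = {"low": 1.0, "medium": 0.0, "high": 0.0}
for _ in range(3):
    dist = step(dist)
print(round(dist["high"], 4))
```

An instance whose chain assigns non-trivial mass to the high-usage state even from a calm start is a candidate for the transient label.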
For steady instances deemed safe for over-subscription, the system employs multiple estimation methods, including percentile analysis, stochastic bin packing, and time series forecasting, depending on each instance’s specific usage patterns.
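Percentile analysis, the simplest of those estimation methods, amounts to reserving memory at a high percentile of an instance's recent usage. A stdlib-only sketch with hypothetical usage history (the paper's exact percentile and window are not given here):

```python
# Hypothetical sketch of percentile-based estimation for a "steady"
# instance: reserve memory at a high percentile of recent usage.
def percentile(samples, p):
    """Nearest-rank percentile (simple, stdlib-only)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

usage_gb = [30, 31, 29, 32, 30, 33, 31, 30, 34, 32]  # illustrative history
reservation = percentile(usage_gb, 95)  # reserve the P95 of observed usage
```

Stochastic bin packing and time-series forecasting refine this per instance; the common thread is that steady instances are predictable enough for any of these estimators to be safe.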
Quantitative SLO modeling
Perhaps most importantly for enterprise environments, Eigen+ includes a quantitative model for understanding how memory over-subscription affects service availability. Using quadratic logistic regression, the system can determine precise memory utilization thresholds that maintain target SLO compliance levels.
“Using the quadratic logistic regression model, we solve for the machine-level memory utilization (X) corresponding to the desired P_target,” the paper said.
This gives enterprise administrators concrete guidance on safe over-subscription levels rather than relying on guesswork or overly conservative estimates.
Recognizing that no classification system is perfect, Eigen+ includes reactive live migration capabilities as a fallback mechanism. When memory utilization approaches dangerous levels, the system automatically migrates database instances to less loaded servers.
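The fallback logic can be sketched as a simple control loop: watch per-machine utilization, and when it crosses a danger line, move an instance to the least-loaded peer. Threshold, policy, and data below are illustrative assumptions:

```python
# Hypothetical sketch of the reactive fallback: when a machine's memory
# utilization crosses a danger threshold, migrate its largest instance
# to the least-loaded machine. Threshold and data are illustrative.
DANGER_THRESHOLD = 0.90

def plan_migrations(machines):
    """machines: {name: {"capacity_gb": .., "instances": {id: used_gb}}}"""
    moves = []
    for name, m in machines.items():
        used = sum(m["instances"].values())
        if used / m["capacity_gb"] > DANGER_THRESHOLD:
            # Evict the biggest instance to relieve the most pressure.
            victim = max(m["instances"], key=m["instances"].get)
            target = min(
                (t for t in machines if t != name),
                key=lambda t: sum(machines[t]["instances"].values())
                / machines[t]["capacity_gb"],
            )
            moves.append((victim, name, target))
    return moves

machines = {
    "host-a": {"capacity_gb": 256, "instances": {"db1": 120, "db2": 118}},
    "host-b": {"capacity_gb": 256, "instances": {"db3": 64}},
}
print(plan_migrations(machines))  # db1 moves from host-a to host-b
```

Because classification catches most risky instances up front, this reactive path is expected to fire rarely, which is consistent with the migration counts reported below.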
During production testing, the researchers reported: “Over the final two days, only five live migrations were initiated, including mirror databases. These tasks, which minimally impact operational systems, underscore the efficacy of Eigen+ in maintaining performance stability without diminishing user experience.”
Industry implications
The research suggests that cloud providers have been approaching memory over-subscription with unnecessarily complex prediction models when simpler classification approaches may be more effective. The paper stated that approaches used by Google Autopilot, AWS Aurora, and Microsoft Azure all rely on prediction-based methods that can fail under high utilization scenarios.
For enterprise IT teams evaluating cloud database services, Eigen+ represents a potential competitive advantage for Alibaba Cloud in markets where database reliability and efficient resource utilization are critical factors.