Artificial Intelligence (AI) is one of the fastest-emerging technologies and also one of the most talked about today. Be it healthcare and drug discovery, the consumer market, stock price forecasting, satellite-based navigation or autonomous vehicles, one of the most crucial success factors is the infrastructure operations supporting these critical businesses. We are now far beyond discussing these in terms of data and system availability alone. Today the conversation is about efficiency of operations, recoverability from faults and failures, forecasting changes, efficient resource and workload planning and provisioning, and above all, how the scale of operations will itself reshape the entire workflow.
Before AI showed its worth, these activities were almost entirely driven by humans, and infrastructure management almost always had to catch up with business operations. Today, AI-powered analytics drive us toward autonomous, real-time operations. Traditional data centers are slowly expanding their footprints into edge devices and sensors, and of course into the cloud, propelled by IoT and other smart technologies.
Here are five essential constituents of today's autonomous data centers:
- Forecasting resource demands: Compute, storage and networking are critical to the continuity of business operations. Forecasting resource utilization and planning upgrades in advance helps maintain consistent performance. Using machine learning techniques, and more specifically feature engineering, growth in demand can be predicted more accurately. Feature engineering reduces parameter clutter and helps attribute an increase in load to demand for one or more specific resources in the data center, which enables planning ahead of time.
- Optimum workload mix for operational efficiency: This has two aspects. The first uses AI to identify workloads and place them optimally so as to meet application service level objectives (SLOs). By looking at past application deployments and their corresponding performance profiles, one can recommend co-locating one or more applications on the same storage cluster. The second aspect leans toward reducing operational costs. By observing the performance profiles of applications, based on their signatures over a sustained period, the infrastructure management software can recommend a downgrade or upgrade to a different SLO level. This delivers guaranteed performance at optimal operational cost.
- Configuration management: The term "community wisdom" is not new. For instance, large storage enterprises moving toward autonomous management for their customers also hold information on which configurations and deployment models are popular across their customer base. They know which infrastructure configurations, be it storage, network or compute, work well for customers running applications of a particular type. Again, ML helps gather these tidbits of wisdom into AI models that capture the popular configurations.
- Load balancing: Despite foolproof provisioning, bottlenecks do develop as demands fluctuate. Demand itself can be forecasted so that resource allocation is realigned, which makes load balancing through software configuration changes almost inevitable. Dynamic load balancing is served by ML algorithms working with optimization strategies that evaluate various reallocation options in real time. The mitigation options might involve data tiering, resource sharing, data pre-fetching, or at times selective demand throttling, all while guaranteeing SLO adherence.
- Proactive failure prediction, troubleshooting and automated fixes: Many AI-based deep learning (DL) applications are already in action predicting failures ahead of time, based on sequences of observed events and performance problems in infrastructure telemetry data. An autonomous management system is also expected to enforce corrective actions: using ML-based recommendation systems, it can apply the most appropriate fix on its own.
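To make the first constituent, demand forecasting, concrete: at its simplest it amounts to fitting a trend to historical utilization samples and extrapolating a few periods ahead. The sketch below (all numbers are illustrative, not from any real deployment) uses a plain least-squares line; production systems would use richer feature-engineered models, but the prediction step has the same shape.

```python
def forecast_utilization(history, steps_ahead):
    """Fit a least-squares linear trend to historical utilization
    samples and extrapolate it steps_ahead periods into the future."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    # Ordinary least squares: slope = cov(x, y) / var(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)

# Storage utilization (%) sampled weekly; forecast four weeks out.
usage = [52, 55, 57, 61, 63, 66, 70, 72]
print(round(forecast_utilization(usage, 4), 1))  # ≈ 83.8
```

A forecast crossing a capacity threshold (say 85%) would then trigger an upgrade recommendation well before performance degrades.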
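The SLO upgrade/downgrade recommendation from the workload-mix discussion can be sketched as a simple tier selection: pick the cheapest tier that still covers the latency the workload has been observed to need, with some safety headroom. The tier names, latency ceilings and costs below are entirely hypothetical.

```python
# Hypothetical SLO tiers: (name, p95 latency ceiling in ms, monthly cost)
TIERS = [("gold", 2.0, 900), ("silver", 10.0, 400), ("bronze", 50.0, 150)]

def recommend_tier(required_p95_ms, headroom=0.8):
    """Recommend the cheapest tier whose latency ceiling still sits
    safely (with headroom) under the workload's observed requirement."""
    fits = [t for t in TIERS if t[1] <= required_p95_ms * headroom]
    if not fits:  # nothing is fast enough: fall back to the tightest tier
        return min(TIERS, key=lambda t: t[1])[0]
    return min(fits, key=lambda t: t[2])[0]

print(recommend_tier(15.0))  # a 15 ms requirement fits the silver tier
print(recommend_tier(3.0))   # a 3 ms requirement needs gold
```

An app paying for gold but showing a relaxed latency requirement over a sustained period would be flagged for a cost-saving downgrade, and vice versa.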
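The "community wisdom" idea from the configuration-management bullet boils down to counting which configurations recur for a given application type across the installed base. A toy sketch, with made-up application types and configuration labels:

```python
from collections import Counter

# Hypothetical fleet telemetry: (application type, deployed configuration)
deployments = [
    ("oltp-db", "all-flash/raid-dp"),
    ("oltp-db", "all-flash/raid-dp"),
    ("oltp-db", "hybrid/raid-6"),
    ("backup",  "capacity-hdd/raid-6"),
    ("backup",  "capacity-hdd/raid-6"),
]

def popular_config(app_type):
    """Return the configuration most often deployed for this
    application type across the installed base."""
    counts = Counter(cfg for app, cfg in deployments if app == app_type)
    return counts.most_common(1)[0][0] if counts else None

print(popular_config("oltp-db"))  # all-flash/raid-dp
```

Real systems would of course weight this by observed health and performance outcomes, not raw popularity alone.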
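One of the simplest reallocation options a dynamic load balancer might evaluate is a greedy shift of load from the hottest node toward the coolest until the spread is acceptable. A minimal illustration (real systems would also weigh tiering, pre-fetching and throttling, and respect per-workload SLOs):

```python
def rebalance(loads, tolerance=0.1):
    """Greedily move load units from the hottest node to the coolest
    until the spread falls within tolerance of the mean load."""
    loads = dict(loads)
    mean = sum(loads.values()) / len(loads)
    moves = []
    while True:
        hot = max(loads, key=loads.get)
        cool = min(loads, key=loads.get)
        if loads[hot] - loads[cool] <= 2 * tolerance * mean:
            return loads, moves
        # Move only as much as brings one of the two nodes to the mean.
        shift = min(loads[hot] - mean, mean - loads[cool])
        loads[hot] -= shift
        loads[cool] += shift
        moves.append((hot, cool, shift))

final, moves = rebalance({"a": 90, "b": 30, "c": 60})
print(final, moves)
```

Each suggested move would then be scored against migration cost before being enforced.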
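Finally, failure prediction from telemetry can be as elaborate as a DL sequence model, but the underlying idea can be shown much more simply: watch the event stream with a sliding window and fire an early warning when precursor events cluster. The event names and thresholds here are invented for illustration.

```python
from collections import deque

# Hypothetical event types that tend to precede a component failure
PRECURSORS = {"media_error", "timeout", "relocated_sector"}

def failure_risk(events, window=10, threshold=3):
    """Scan a telemetry event stream with a sliding window; flag the
    component as at-risk once enough precursor events cluster together.
    Returns the stream index where the warning fires, or -1 if never."""
    recent = deque(maxlen=window)
    for i, event in enumerate(events):
        recent.append(event)
        hits = sum(1 for e in recent if e in PRECURSORS)
        if hits >= threshold:
            return i
    return -1

stream = ["ok"] * 8 + ["timeout", "ok", "media_error", "relocated_sector", "ok"]
print(failure_risk(stream))  # warning fires at index 11
```

In an autonomous pipeline the warning would feed a recommendation system that picks and enforces the fix, such as preemptively migrating data off the suspect device.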
There are many more components of an autonomous management pipeline, from power consumption efficiency to system downtime prediction and automated system upgrades using patches that provide configuration fixes. In the next article we shall focus on how the same objectives can be met in the face of new challenges such as data lakes, multi-gigabyte data ingests and heterogeneous data formats, with infrastructure footprints extending to ruggedized edges and agile clouds. There, the time to react is even more crucial and every inaccurate prediction can cost the business. Stay tuned!
(The author is a data scientist/researcher at the Advanced Technology Group (ATG) of NetApp in Bangalore)