What is a data lake?
A data lake is a repository where data is ingested in its original form without alteration. It is most useful when it is part of a larger data management platform and integrates well with existing data and tools, which enables more powerful analytics. The goal is to uncover insights and trends while keeping the platform secure, scalable, and flexible.
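To make "ingested in its original form" concrete, here is a minimal sketch in Python. It assumes a toy in-memory store; the `RawStore` class, the source names, and the sample payloads are all hypothetical and do not describe any vendor's implementation. The idea it illustrates is often called schema-on-read: payloads are kept exactly as received, and a parser is applied only when the data is consumed.

```python
import json
import time


class RawStore:
    """Toy in-memory stand-in for a data lake's raw storage (hypothetical)."""

    def __init__(self):
        self._records = []

    def ingest(self, source, payload):
        # Keep the payload exactly as received; only wrap it in metadata.
        self._records.append(
            {"source": source, "received_at": time.time(), "payload": payload}
        )

    def read(self, source, parser):
        # Schema-on-read: the parser runs only when the data is consumed.
        return [parser(r["payload"]) for r in self._records if r["source"] == source]


store = RawStore()
store.ingest("ap-telemetry", '{"ap": "ap-1", "rssi": -61}')  # JSON text
store.ingest("syslog", "<134>Jan 1 00:00:00 sw1 link up")    # plain text
store.ingest("ap-telemetry", '{"ap": "ap-2", "rssi": -70}')

readings = store.read("ap-telemetry", json.loads)
print([r["rssi"] for r in readings])
```

Because nothing is transformed at ingest time, the same stored records can later be read with a different parser if an analysis needs other fields.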
Data lakes explained
- All types of data, including unstructured data (often associated with big data) and semi-structured data, can be stored, which is critical for today’s machine learning and advanced analytics use cases.
- In the networking space, think of infrastructure and endpoint telemetry being used as descriptors or classifiers that feed AI/ML models and algorithms to identify baselines and anomalies.
- As a customer, your infrastructure and endpoint clients feed the data lake, and your networking vendor maintains it to deliver AI-based tools that help IT operate your network more efficiently.
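The idea of using telemetry to identify baselines and anomalies can be sketched with a simple rolling statistic. This is an illustrative assumption, not the method any vendor actually uses: production AI/ML models are far more sophisticated. The function name `find_anomalies`, the window size, the threshold, and the latency values are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from a rolling baseline.

    The baseline for each sample is the mean of the previous `window`
    samples. A sample is flagged when it lies more than `threshold`
    standard deviations away from that mean.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        # Skip flat histories to avoid dividing by zero.
        if spread == 0:
            continue
        if abs(samples[i] - baseline) / spread > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency readings (ms) from one access point.
latency_ms = [20, 21, 19, 22, 20, 21, 95, 20, 22, 21]
print(find_anomalies(latency_ms))
```

The baseline here is dynamic in the sense that it is recomputed from recent history for every sample, rather than being a fixed threshold set by hand.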
What is stored in a data lake?
A data lake in the networking space is made up of network telemetry (from infrastructure and endpoints) from each customer that uses a vendor’s cloud management solution. The vendor is responsible for managing and securing the data lake, and for creating customer-facing tools. Customers and IT do not have to perform special tasks related to the data lake. Cloud-managed networking infrastructure is already designed to forward management-related data to the cloud, so it was a simple progression to extract telemetry and use it to baseline a network’s performance and detect deviations.
Data lake requirements include:
- Lots of data – In fact, for machine learning, variety is key. You don’t need a data lake for a single data set.
- Machine learning framework – This includes libraries, data science, and other tools used by networking vendors to perform various types of analysis ranging from variance to causal analysis and the prediction of outcomes.
Benefits of a data lake
Data lake customer benefits include:
- Dynamic baselines for their site’s network performance without manually setting service level expectations (SLEs).
- Comparisons that highlight where similar sites are seeing issues based on their own data.
- Optimization tips based on performance data from similar customer sites.
- Continuous retraining of AI/ML models as new technology, infrastructure, and endpoints emerge.
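A peer comparison like the one described in the list above might look roughly like the following sketch. It assumes that "similar" sites are simply those sharing a site type and that the comparison uses a median; both are assumptions made for illustration, and the field names, site names, and values are hypothetical.

```python
from statistics import median


def compare_to_peers(site, peers, metric):
    """Compare one site's metric to the median of its peer sites."""
    peer_median = median(p[metric] for p in peers)
    delta = site[metric] - peer_median
    return {"site": site["name"], "peer_median": peer_median, "delta": delta}


# Hypothetical per-site summaries; field names are illustrative only.
sites = [
    {"name": "store-01", "type": "retail", "connect_time_s": 2.1},
    {"name": "store-02", "type": "retail", "connect_time_s": 2.4},
    {"name": "store-03", "type": "retail", "connect_time_s": 5.0},
    {"name": "hq", "type": "office", "connect_time_s": 1.2},
]

target = sites[2]
# Peers are the other sites of the same type as the target.
peers = [s for s in sites if s["type"] == target["type"] and s is not target]
result = compare_to_peers(target, peers, "connect_time_s")
print(result)
```

A positive `delta` means the site is slower than its peers on this metric, which is the kind of signal that could drive an optimization tip.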
How do cloud vs. on-prem data lakes differ?
Data lake attribute | Cloud | On-premises |
---|---|---|
Data security | Cloud provider expertise / best practices | Air gapping and manual configuration |
Memory | On-demand | Requires more appliances |
CPU | On-demand | Requires more appliances |
Storage | On-demand | Requires more appliances |
Configuration recommendations | Allows for insights across multiple tenant sites | Limited to one customer’s data / configuration |
Baseline peer comparisons | Available for each customer site and similar “peer” sites | Limited to one customer’s data / sites |
Retraining and use of AIOps models | Automatic and instantly usable from cloud-managed GUI | Requires manual software upgrades to management GUI |