What is a data lake?

A data lake is a repository where data is ingested in its original form without alteration. It is most useful when it is part of a larger data management platform and integrates well with existing data and tools for more powerful analytics. The goal is to uncover insights and trends while keeping the platform secure, scalable, and flexible.

Data Lakes Explained

A data lake holds a large amount of data in its native, raw format in a central location, typically the cloud. Because it builds on inexpensive object storage, open formats, and cloud scalability, a variety of applications can take advantage of the wealth of data it contains.
  • All types of qualitative data, including unstructured data (often called big data) and semi-structured data, can be stored, which is critical for today’s machine learning and advanced analytics use cases.
  • In the networking space, think of infrastructure and endpoint telemetry being used as descriptors or classifiers that feed AI/ML models and algorithms to identify baselines and anomalies.
  • As a customer, your infrastructure and endpoint clients feed the data lake, and your networking vendor maintains it to deliver AI-based tools that help IT operate your network more efficiently.
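
To make the idea of storing data in its native, raw format concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's implementation: the function name ingest_raw, the source=/date= folder layout, and the use of a local directory in place of cloud object storage are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(payload: bytes, source: str, root: Path, received_at: datetime) -> Path:
    """Store one telemetry payload byte-for-byte under a source/date partition.

    Assumption: `root` is a local directory standing in for an object store.
    """
    # Name the object after a hash of its content so identical payloads
    # written twice land on the same file instead of piling up.
    digest = hashlib.sha256(payload).hexdigest()[:16]
    partition = root / f"source={source}" / f"date={received_at:%Y-%m-%d}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / f"{digest}.raw"
    # No parsing, no schema, no transformation: the bytes are kept as received.
    target.write_bytes(payload)
    return target
```

Calling ingest_raw(b'{"ap": "ap-01", "rssi": -61}', "wifi", Path("lake"), datetime.now(timezone.utc)) would write the payload unchanged under lake/source=wifi/date=<today>/. Any structure is applied later, when an application reads the data.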

What is stored in a data lake?

A data lake in the networking space is made up of network telemetry (infrastructure and endpoints) from each customer that uses a vendor’s cloud management solution. The vendor is responsible for managing and securing the data lake and for creating customer-facing tools. Customers and their IT teams do not have to perform special tasks related to the data lake. Because cloud-managed networking infrastructure is already designed to forward management-related data to the cloud, it was a simple progression to also extract telemetry for baselining a network’s performance and detecting deviations from that baseline.

Data lake requirements include:

  • Lots of data – In fact, for machine learning, variety is key. You don’t need a data lake for a single data set.
  • Machine learning framework – This includes libraries, data science, and other tools used by networking vendors to perform various types of analysis ranging from variance to causal analysis and the prediction of outcomes.
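
As a rough illustration of the variance end of that analysis range, the sketch below flags telemetry samples that stray from a rolling baseline. It is a simplified stand-in, not the method any vendor actually uses: the function name find_anomalies, the window size, and the threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(samples: list[float], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Return the indexes of samples that deviate from a rolling baseline.

    The baseline for each sample is the mean of the `window` samples before
    it; a sample is flagged when it sits more than `threshold` standard
    deviations away from that mean.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        if spread == 0:
            # A perfectly flat history has no spread to scale by,
            # so treat any change at all as a deviation.
            if samples[i] != baseline:
                anomalies.append(i)
            continue
        if abs(samples[i] - baseline) / spread > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency readings in milliseconds with one obvious spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(find_anomalies(latency_ms))  # [7]
```

Because the baseline is recomputed from recent history at every step, it adapts as the network changes and no fixed limit has to be configured by hand.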

Benefits of a data lake

Data lake customer benefits include:

  • Dynamic baselines for each site’s network performance without manually setting service level expectations (SLEs).
  • Comparisons that highlight where similar sites are seeing issues based on their own data.
  • Optimization tips based on performance data from similar customer sites.
  • Constant retraining of AI/ML models as new technology, infrastructure, and endpoints emerge.
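
The peer comparison benefit can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name compare_to_peers, the choice of the median as the peer reference, and the sample values are hypothetical, and how "similar" sites are selected is left out entirely.

```python
from statistics import median


def compare_to_peers(site_value: float, peer_values: list[float]) -> dict:
    """Compare one site's metric with the same metric from similar sites."""
    peer_median = median(peer_values)
    at_or_below = sum(value <= site_value for value in peer_values)
    return {
        "peer_median": peer_median,
        # Positive delta means the site's value is higher than its peers'.
        "delta": site_value - peer_median,
        # Share of peers whose value is at or below this site's value.
        "peers_at_or_below_pct": 100 * at_or_below / len(peer_values),
    }


# Hypothetical time-to-connect values in seconds for one site and four peers.
print(compare_to_peers(4.2, [1.8, 2.0, 2.5, 3.0]))
```

A site whose value sits far above the peer median, as in this example, is a natural candidate for an optimization tip. Such a comparison only works when data from many sites is held in one place.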

How do cloud vs. on-prem data lakes differ?

Data lake attribute | Cloud | On-premises
------------------- | ----- | -----------
Data security | Cloud provider expertise / best practices | Air gapping and manual configuration
Memory | On-demand | Requires more appliances
CPU | On-demand | Requires more appliances
Storage | On-demand | Requires more appliances
Configuration recommendations | Allows for insights across multiple tenant sites | Limited to one customer’s data / configuration
Baseline peer comparisons | Available for each customer site and similar “peer” sites | Limited to one customer’s data / sites
Retraining and use of AIOps models | Automatic and instantly usable from cloud-managed GUI | Requires manual software upgrades to management GUI
