Telco Cloud Platform RAN Reference Architecture Guide 1.0

VMware Telco Cloud Platform RAN 1.0


You can find the most up-to-date technical documentation on the VMware website at:

https://docs.vmware.com/

VMware, Inc.
3401 Hillview Ave.
Palo Alto, CA 94304
www.vmware.com

Copyright © 2021 VMware, Inc. All rights reserved. Copyright and trademark information.


Contents

1 About the Telco Cloud Platform RAN Reference Architecture Guide

2 Overview of the Telco Cloud Platform RAN Reference Architecture
    Telco Cloud Platform Overview
    Telco Cloud Platform RAN Overview
        End-to-End Architecture with Telco Cloud Platform 5G Core and RAN
        Telco Cloud Platform 5G Core and RAN Comparison
        Connectivity Requirements for 5G Core and RAN
        vRAN Architecture Overview
    Physical Infrastructure
    Virtual Platform Infrastructure
    Telco Cloud Automation Overview
    Tanzu Basic for RAN Overview
    Operational Management Overview

3 Telco Cloud Platform RAN Solution Design
    Deployment Architecture of Telco Cloud Platform RAN
    Services Design
    Physical Design
        Physical ESXi Host Design
        Physical Network Design
        Physical Storage Design
    RAN Virtualization Design
        vCenter Server Design
        Workload Domains and vSphere Cluster Design
        Network Virtualization Design
    Telco Cloud Automation Design
    Tanzu Kubernetes Cluster Design
        Tanzu Basic for RAN Deployment Model
        CNF Design
    Operations Management Design


1 About the Telco Cloud Platform RAN Reference Architecture Guide

This reference architecture guide provides guidance for designing and deploying a RAN solution based on VMware Telco Cloud Platform™ RAN.

Intended Audience

This guide is intended for telecommunications and solution architects, sales engineers, field consultants, advanced services specialists, and customers who are responsible for designing the Virtualized Network Functions (VNFs), Cloud Native Network Functions (CNFs), and the RAN environment in which they run.

Acronyms and Definitions

The following table lists the Telco Cloud Platform acronyms used frequently in this guide.

Acronym    Definition

3GPP 3rd Generation Partnership Project

AMF Access and Mobility Management Function

AUSF Authentication Server Function

BC Boundary Clock

CBRS Citizens Broadband Radio Service

CNTT Common NFVI Telco Task Force

CSP Communications Service Provider

CU Centralized Unit

DHCP Dynamic Host Configuration Protocol

DPDK Data Plane Development Kit, an Intel-led packet processing acceleration technology

DU Distributed Unit

ETSI European Telecommunications Standards Institute

GM Grandmaster

gNB Next Generation NodeB

GNSS Global Navigation Satellite System


LCM Life Cycle Management

NFV Network Functions Virtualization

NRF NF Repository Function

PCIe Peripheral Component Interconnect Express

PCF Policy Control Function

PCRF Policy and Charging Rule Function

PRTC Primary Reference Time Clock

PTP Precision Time Protocol

QCI Quality of Service Class Identifier

RAN Radio Access Network

RRU Remote Radio Unit

SMF Session Management Function

SBA Service-Based Architecture

SBI Service-Based Interface

SR-IOV Single Root Input/Output Virtualization

STP Spanning Tree Protocol

SVI Switched Virtual Interface

ToR Switch Top-of-Rack Switch

UDM Unified Data Management

UDR Unified Data Repository

VDS vSphere Distributed Switch

VNF Virtual Network Function

The following table lists the Cloud Native acronyms used frequently in this guide:

Acronym    Definition

CAPV Cluster API Provider vSphere

CMK CPU Manager for Kubernetes

CNI Container Network Interface

CNCF Cloud Native Computing Foundation, a Linux Foundation project designed to help advance container technology

CNF Cloud Native Network Function executing within a Kubernetes environment

CSAR Cloud Service Archive


CSI Container Storage Interface. VMware vSphere® CSI exposes vSphere storage to containerized workloads on container orchestrators such as Kubernetes. It enables the use of vSAN and other types of vSphere storage.

CNS Cloud Native Storage, a storage solution that provides comprehensive data management for Kubernetes stateful applications.

K8s Kubernetes

PSP Pod Security Policy

TCA-CP Telco Cloud Automation-Control Plane


2 Overview of the Telco Cloud Platform RAN Reference Architecture

This section provides an architecture overview, including the high-level physical and virtual infrastructure, networking, and storage elements of the Telco Cloud Platform RAN solution.

This chapter includes the following topics:

n Telco Cloud Platform Overview

n Telco Cloud Platform RAN Overview

n Physical Infrastructure

n Virtual Platform Infrastructure

n Telco Cloud Automation Overview

n Tanzu Basic for RAN Overview

n Operational Management Overview

Telco Cloud Platform Overview

VMware Telco Cloud Platform™ is a modernization solution that deploys cloud-native and virtual network functions consistently, at web-scale speed, and without disruption.

VMware Telco Cloud Platform is a cloud-native platform that empowers CSPs to manage VNFs and CNFs across the core, far edge (RAN), enterprise edge, and cloud with efficiency, scalability, and agility.

Telco Cloud Platform provides the framework to deploy and manage VNFs and CNFs quickly and efficiently across distributed 5G networks. You can run VNFs and CNFs from dozens of vendors, on any cloud, with holistic visibility, orchestration, and operational consistency.

For more information, see the VMware Telco Cloud Platform 5G Edition documentation.

Telco Cloud Platform RAN Overview

VMware Telco Cloud Platform RAN is a cloud-native RAN solution that is designed specifically for running RAN functions. It provides the RAN modernization path, evolving from legacy Radio Access Network (RAN) to virtualized RAN (vRAN) to OpenRAN. It transforms the RAN into a 5G multi-services hub “mini cloud”, enabling Communications Service Providers (CSPs) to monetize their RAN investments.

Telco Cloud Platform RAN is designed to meet the performance and latency requirements inherent to RAN workloads:

n Enables CSPs to run virtualized Baseband functions that include virtualized Distributed Units (vDUs) and virtualized Central Units (vCUs).

n Simplifies CSPs’ operations consistently across distributed vRAN sites with centralized cloud-first automation, while reducing the Operating Expense (OpEx).

n Provides operational consistency by removing business uncertainties and reducing the ballooning costs associated with 5G deployment.

n Enables CSPs to accelerate innovation speed, deploy 5G services fast, and scale the services as customers’ demands increase.

Telco Cloud Platform RAN is powered by field-proven virtualization, carrier-grade Container-as-a-Service (CaaS), and multi-layer automation that are consistent with its 5G core and edge offerings. This end-to-end consistency achieved by the coherent underpinning platform across 5G networks enables CSPs to provision 5G services customized for different enterprise and consumer markets while providing unparalleled operational efficiency.

End-to-End Architecture with Telco Cloud Platform 5G Core and RAN

The 5G network must be dynamic and programmable to meet the defined business objectives.

To handle the massive amount of data traffic, 5G is designed to separate the user plane from the control plane and to distribute the user plane as close to the device as possible. As the user traffic increases, an operator can add more user plane services without changing the control plane capacity. This distributed architecture can be realized by constructing the data center and network infrastructure based on hierarchical layers.

A hierarchical model allows Telcos to scale their 5G deployment based on application requirements and user load. Modern Telco architecture consists of four levels of hierarchy. 5G subscriber databases, data repositories, resource orchestration, and service assurance are typically hosted in the national data centers. The national data centers also serve as peering points and as lawful intercept points. For added redundancy, a pair of national data centers must be deployed in geographically diverse sites.


Figure 2-1. End-to-End Architecture with 5G Core and RAN


Aggregation Data Centers:

Subtending from the national data centers are regional data centers. Regional data centers host the 5G core user plane function, voice services functions, and non-call-processing infrastructure such as IPAM, DNS, and NTP servers. Inbound and outbound roaming traffic can also be routed from the regional data center.

To support new applications and devices that require ultra-low latency and high throughput networks, CSPs have an opportunity to push 5G user-plane closer to the application edge. At the same time, RAN disaggregation enables efficient hardware utilization, pooling gain, and increases deployment flexibility while reducing the Capital Expenditure (CAPEX) / Operational Expenditure (OPEX) of Radio Access.

Caution Aggregation data centers such as national, regional, and near edge data centers can be architected by following Telco Cloud Platform 5G Core recommendations, while the cell site architecture and implementation align with VMware Telco Cloud Platform RAN.

Telco Cloud Platform 5G Core and RAN Comparison

This section describes the relationship and assumptions of Telco Cloud Platform RAN in comparison to Telco Cloud Platform 5G Edition.


VMware Telco Cloud Platform is a common platform from Core to RAN. This common platform self-tunes automatically depending on the workload deployed through VMware Telco Cloud Automation™. To deploy all VNFs and CNFs from the 5G Core to the RAN, the same automation platform, operational tools, and CaaS layer based on VMware Tanzu® Basic for RAN are used.

Figure 2-2. Relationship between Telco Cloud Platform 5G and Telco Cloud Platform RAN


Telco Cloud Platform RAN is a new compute workload domain that spans from the Central or Regional Data Center to the Cell Sites. The management components of this compute workload domain reside in the management cluster inside the RDC management domain. Within this compute workload domain, VMware ESXi™ hosts are managed as single-node hosts and distributed across thousands of Cell Sites. This distributed architecture also applies to Kubernetes: Kubernetes cluster management is centralized, and workload VMs are distributed to the respective Cell Sites.

Category                      Telco Cloud Platform 5G Edition   Telco Cloud Platform RAN
Kubernetes Cluster            Single-sited                      Stretched
Management Domain             Same                              Same
Workload Domain composition   vSphere clusters                  ESXi hosts
Storage                       vSAN/Shared storage               Local storage
Networking                    Overlay/VLAN                      VLAN


Connectivity Requirements for 5G Core and RAN

This section describes core and edge connectivity requirements to support different deployment models of 5G RAN.

n Core and Edge connectivity: Core and Edge connectivity can have a significant impact on the 5G core deployment and on meeting application-specific SLAs. The type of radio spectrum, connectivity, and the available bandwidth can have a great influence on the placement of CNFs.

n WAN connectivity: In the centralized deployment model, the WAN connectivity must be reliable between the sites. Any unexpected WAN outage prevents 5G user sessions from being established because all 5G control traffic travels from the edge to the core.

n Components deployment in Cell Site: Due to the physical constraints of remote Cell Site locations, place only the required functions at the Cell Site and deploy the remaining components centrally. For example, platform monitoring and logging are often deployed centrally to provide universal visibility and control without replicating the core data center at the remote edge Cell Site locations. Non-latency-sensitive user metrics are often forwarded centrally for processing.

n Available WAN bandwidth: The available WAN bandwidth between the Cell Site and central Core sites must be sized to meet the worst-case bandwidth demand (see the sizing sketch after this list). Also, when multiple application classes share the WAN, proper network QoS is critical.

n Fully distributed 5G core stack: A fully distributed 5G core stack is ideal for private 5G use cases, where the edge data center must be self-contained. It survives extended outages that impact connectivity to the core data center. The Enterprise edge can be the aggregation point for 5G Core control plane, UPF, distributed radio sites, and selective mobile edge applications. A fully distributed 5GC reduces the dependency on WAN, but it increases the compute and storage requirements.

n Network Routing in Cell Site: Each Cell Site can locally route the user plane traffic and all the Internet traffic through the local Internet gateways, while the management and non-real time sensitive applications leverage the core for device communication.
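
The following is a minimal sizing sketch for the worst-case WAN bandwidth calculation described above. The traffic classes, per-site rates, and headroom factor are hypothetical placeholders, not values from this guide.

```python
# Hedged sketch: estimate worst-case WAN bandwidth between a Cell Site and the
# central core sites. All numbers below are hypothetical and must be replaced
# with measured or vendor-provided figures.

# Peak demand per traffic class in Mbps (hypothetical values).
peak_demand_mbps = {
    "midhaul_user_plane": 800,
    "oam_and_telemetry": 50,
    "platform_management": 20,
    "software_updates": 100,
}

HEADROOM = 1.25  # 25% safety margin for bursts and protocol overhead (assumption)

def worst_case_wan_mbps(demands: dict, headroom: float = HEADROOM) -> float:
    """Sum all class peaks and apply a headroom factor.

    Sizing to the sum of peaks is conservative; with QoS in place, lower-priority
    classes can be throttled instead, but the link must still carry the
    latency-sensitive classes at their peak.
    """
    return sum(demands.values()) * headroom

if __name__ == "__main__":
    required = worst_case_wan_mbps(peak_demand_mbps)
    print(f"Provision at least {required:.0f} Mbps of WAN bandwidth per Cell Site")
```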

vRAN Architecture Overview

RAN virtualization relocates baseband radio functions from custom-built nodes to vendor-agnostic Commercial Off-the-Shelf (COTS) hardware.

In 3GPP R15, the division of the upper and lower sections of the RAN was standardized. The higher-layer split is specified with a well-defined interface (F1) between the Centralized Unit (gNB-CU) and the Distributed Unit (gNB-DU). The CU and its functions sit farther from the radio, have less stringent processing requirements, and are more virtualization-friendly than the DU and its functions. The enhanced Common Public Radio Interface (eCPRI) links the DU to the radio.


The benefits of a fully virtualized RAN (vRAN) are as follows:

n A single uniform hardware platform is used across the core network, RAN, and edge. This simplifies network management while lowering operational and maintenance costs.

n The network functions and computing hardware are decoupled in a completely virtualized RAN. The network functions of the RAN can be performed on the same hardware, giving the service provider more versatility. The functionality and capacity of a vRAN can be easily implemented where and when they are required, giving it more flexibility.

The following figure shows vRAN (also called Centralized RAN) and the terminologies that are used to define various legs of the transport network:

Figure 2-3. vRAN Transport Network Terminology

(The figure shows a gNB-CU, gNB-DU, and gNB-RRU connected to the 5G Core Network. Backhaul is the connectivity between the 5GC and the CU, midhaul is the connectivity between the CU and the DU over the F1 interface, and fronthaul is the connectivity between the DU and the RRU over eCPRI.)

n Centralized Unit (CU) provides non-real-time processing and access control. It manages higher-layer protocols including Radio Resource Control (RRC) from the Control Plane, and Service Data Adaptation Protocol (SDAP) and Packet Data Convergence Protocol (PDCP) from the User Plane. The CU is connected between the 5G core network and the DUs. One CU can be connected to multiple DUs.

n Distributed Unit (DU) provides real-time processing and coordinates lower-layer protocols including the Physical Layer, Radio Link Control (RLC), and Media Access Control (MAC).

n Remote Radio Unit (RRU) does the physical layer transmission and reception, supporting technologies such as Multiple Input Multiple Output (MIMO).

vRAN Design Approaches

The different design approaches of vRAN are as follows:

Co-located CU and DU:

In the non-centralized approach, the CU and DU functions are co-located, while the RRU is physically separated.

Note Only fronthaul and backhaul connectivity is required in this design.


Figure 2-4. Co-Located CU and DU


Centralized Processing:

In the centralized approach, all functional elements of the gNB are physically separated. A single CU is responsible for several DUs. The Next-Generation RAN (NG-RAN) design imposes specific transport network specifications to meet the required distances. This design model requires fronthaul, midhaul, and backhaul connectivity.

Figure 2-5. vRAN Design - Centralized Processing

(The figure shows a fronthaul distance of 0-20 kilometers between the gNB-RRUs and the gNB-DUs, a midhaul of tens of kilometers between the gNB-DUs and the gNB-CU, and the backhaul toward the 5G Core Network. If the CU is virtualized, it can be part of the 5GC NFV deployment, so the backhaul is potentially very short.)

DU and RRU Co-Located:

In this vRAN design approach, the DU and RRU are co-located such that they are directly connected without a fronthaul transport network. This connection is fiber-based and may span hundreds of meters, supporting scenarios where the DU and RRU are within the same building. This design approach requires midhaul and backhaul connectivity as shown in the following figure.


Figure 2-6. vRAN Design- DU and RRU Co-located


The centralized CU vRAN design introduces several advantages:

n Cost Reduction: Centralized processing capability reduces the cost of the DU function.

n Energy Efficiency, Power and Cost Reduction: Reducing the hardware at the cell site reduces the power consumption and air conditioning requirements of that site. The cost saving can be significant when you deploy tens or hundreds of cell sites.

n Flexibility: Flexible hardware deployment leads to a highly scalable and cost-effective RAN solution. Also, the functional split of the protocol stack has an effect on the transport network.

n Higher Performance: Better performance is achieved as a result of improved load management, cell coordination, and the future deployment of Radio interference mitigation algorithms.

n Improved offload and content delivery: Aggregation of processing at the CU provides an optimal place in the network for data offload and MEC application delivery.

Physical Infrastructure

The Physical Tier represents compute hardware, storage, and physical networking resources for the Telco Cloud Platform RAN solution.

Physical ESXi Host

The configuration and assembly process of each ESXi host in Telco Cloud Platform RAN must be standardized, with all components installed in the same manner on each ESXi host. Standardizing the physical configuration of the ESXi hosts at Cell Sites removes variability, so you can operate an easily manageable and supportable infrastructure. Deploy ESXi hosts with identical configurations across all Cell Site locations, including storage and networking configurations, on VMware supported platforms.

For the list of supported platforms, see the VMware Compatibility Guide.


Physical Storage

VMware vSAN is used in the compute workload domains in the Regional Data Center to provide highly available shared storage to the clusters. However, the hosts used for Cell Sites must have local disks to provide the storage service to the RAN applications.

The ESXi host deployed at the Cell Site uses local disk storage, and all the disks must be listed in the VMware Compatibility Guide.

The following table lists the storage type required at each site:

Site Storage Types

Regional Data Center (RDC) vSAN

Cell Site Local Storage

Physical Network Interfaces

Consider the following for the Telco Cloud Platform RAN deployment with Regional Data Center and Cell Site combination:

n For Regional Data Center (RDC), the ESXi hosts must contain four or more physical NICs of the same speed. Use two physical NICs for ESXi host traffic types and two physical NICs for user workloads.

n For Cell Site, the ESXi hosts must contain a minimum of three physical NICs of the same speed. Use two physical NICs for ESXi management and user workloads and dedicate one for PTP time synchronization.

n For each ESXi host in combination with a pair of ToR switches, use a minimum of two 10 GbE connections (25 GbE or larger connections are recommended). 802.1Q network trunks carry the required VLANs.

Note In case of a dual-socket host with NUMA placement, use six physical NICs. Each socket can be mapped with two physical NICs for user workload and one physical NIC for PTP time synchronization.

Virtual Platform Infrastructure

The virtual infrastructure is the foundation of the platform. It contains the software-defined infrastructure, software-defined networking, and software-defined storage.


Compute Workload Domain

With the Telco Cloud Platform RAN architecture, you can deploy and extend RAN applications into the Regional Data Center and Cell Sites. The compute workload domain that you deploy on the Regional Data Center (RDC) through Telco Cloud Automation can contain multiple Cell Site vSphere hosts, each running the user workloads to enable the RAN solution. A Cell Site can be as small as one ESXi host and can scale depending on the resource and availability requirements of the solution being deployed.

Each Cell Site compute workload domain can support up to 128 Cell Site locations due to the 128 VDS limit per vCenter. Telco Cloud Automation dedicates a VDS to each Cell Site host.

Also, when designing the Cell Sites for the Telco Cloud Platform RAN solution, review the vCenter maximum supported configurations for your scaling requirements.

Some of the relevant vCenter configuration maximums are listed below (a scaling sketch follows the table):

Type Maximum

Hosts per vCenter 2500

VDS per vCenter 128

Hosts per VDS 2000
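
As a rough planning aid, the sketch below applies the maximums from the table above (128 VDS per vCenter, one VDS per Cell Site host, 2,500 hosts per vCenter) to estimate how many compute workload domain vCenter Server instances a deployment needs. The target site count is a hypothetical input.

```python
import math

# Limits taken from the vCenter maximums listed above.
VDS_PER_VCENTER = 128       # Telco Cloud Automation dedicates one VDS per Cell Site host
HOSTS_PER_VCENTER = 2500

def compute_vcenters_needed(cell_sites: int, hosts_per_site: int = 1) -> int:
    """Return the number of compute workload domain vCenter instances required."""
    by_vds = math.ceil(cell_sites / VDS_PER_VCENTER)
    by_hosts = math.ceil(cell_sites * hosts_per_site / HOSTS_PER_VCENTER)
    return max(by_vds, by_hosts)

if __name__ == "__main__":
    # Hypothetical example: 1,000 single-host Cell Sites.
    print(compute_vcenters_needed(1000))  # -> 8, driven by the 128 VDS per vCenter limit
```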

Telco Cloud Automation Overview

VMware Telco Cloud Automation (TCA) is a unified orchestrator. It onboards and orchestrates workloads seamlessly from VM and container-based infrastructures. It distributes workloads from the core to the edge and from private to public clouds for unified orchestration.

Communications service providers (CSPs) are transitioning from physical to cloud networks to gain operational agility, network resiliency, and low operating costs. This shift marks a radical departure from the traditional single‑purpose hardware appliance model, especially as CSPs must now design and operate services across a web of data centers—bridging physical and virtual ecosystems—while enabling interoperability across competing vendors.

Due to the complexity of coordinating network functions and managing multiple services, CSPs require an automated approach that removes complexity and error‑prone manual processes. To address these challenges and improve operational efficiency, CSPs are turning to VMware Telco Cloud Automation.


Figure 2-7. Telco Cloud Automation Overview


Telco Cloud Automation accelerates time-to-market for network functions and services while igniting operational agility through unified automation—across any network and any cloud. It applies an automated, cloud‑first approach that streamlines the CSP’s orchestration journey with native integration to VMware Telco Cloud Infrastructure.

Telco Cloud Automation enables multi‑cloud placement, easing workload instantiation and mobility from the network core to edge and from private to public clouds. It also offers standards‑driven modular components to integrate any multi‑vendor MANO architecture. VMware further enhances interoperability by expanding partner network function certification using the VMware Ready for Telco Cloud program. With simplified and certified interoperability, CSPs can now leverage best‑of‑breed solutions and reduce their risks.

Key Benefits of Telco Cloud Automation

n Accelerates time-to-market of network functions and services.

n Integrates 5G network capabilities alongside existing architecture.

n Gains operational efficiencies and avoids error‑prone manual tasks.

n Enhances the service experience through workload mobility, dynamic scalability, closed‑loop healing, and improved resilience.

n Optimizes cloud resource utilization through VMware xNF Manager (G-xNFM), NFVO, and VIM/CaaS/NFVI integrations.


n Evolves to cloud native with Kubernetes upstream compliance, cloud-native patterns, and CaaS automation.

n Avoids costly integration fees, maximizes current VMware investments, innovates faster, reduces project complexity, and enables faster deployment with pre‑built VMware integrations.

n Reduces the time to provision new network sites or to expand capacity into existing ones. Leverages best‑of‑breed network functions and benefits from a healthy and thriving multi‑vendor ecosystem.

n Minimizes version validation efforts.

n Improves service quality through AI-driven workflows that are integrated with VMware Telco Cloud Operations.

Tanzu Basic for RAN Overview

VMware Telco Cloud Platform RAN is based on VMware Tanzu Basic for RAN. Tanzu Basic for RAN is responsible for the life cycle management of Kubernetes clusters on top of the Telco Cloud Platform RAN architecture. A Tanzu Kubernetes cluster is an opinionated installation of Kubernetes open-source software that is built and supported by VMware.

The following diagram shows different hosts and components of the Tanzu Basic for RAN architecture:

Figure 2-8. Tanzu Basic for RAN Architecture

(A Tanzu Kubernetes management cluster pushes the desired state configuration to the Tanzu Kubernetes workload clusters that it manages.)


Tanzu Basic for RAN Control Plane

The Kubernetes control plane runs as pods on the Kubernetes control plane node, which comprises the following components (a brief API health-check sketch follows the list):

n Etcd: Etcd is a simple, distributed key-value store that stores the Kubernetes cluster configuration, data, API objects, and service discovery details. For security reasons, etcd must be accessible only from the Kubernetes API server.

n Kube-APIServer: The Kubernetes API server is the central management entity that receives all API requests for managing Kubernetes objects and resources. The API server serves as the frontend to the cluster and is the only cluster component that communicates with the etcd key-value store.

For added redundancy and availability, place a load balancer in front of the control plane nodes. The load balancer performs health checks of the API server to ensure that external clients such as kubectl connect to a healthy API server even during cluster degradation.

n Kube-Controller-Manager: The Kubernetes controller manager is a daemon that embeds the core control loops shipped with Kubernetes. A control loop is a non-terminating loop that regulates the state of the system. In Kubernetes, a controller is a control loop that watches the shared state of the cluster through the API server and moves the current state to the desired state.

n Kube-Scheduler: Kubernetes schedulers know the total resources available in a Kubernetes cluster and the workload allocated on each worker node in the cluster. The API server invokes the scheduler every time there is a need to modify a Kubernetes pod. Based on the operational service requirements, the scheduler assigns the workload on a node that best fits the resource requirements.
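
As a minimal illustration of the health checks that a load balancer (or an operator) can run against the Kubernetes API server, the sketch below probes the /healthz endpoint of a cluster API endpoint. The endpoint address is a hypothetical placeholder, the third-party requests package is assumed to be installed, and the default RBAC that allows unauthenticated access to /healthz is assumed to be in place.

```python
# Hedged sketch: probe the Kubernetes API server health endpoint, similar to what a
# load balancer health check does. Assumes the default RBAC role that permits
# unauthenticated access to /healthz has not been removed.
import requests

API_ENDPOINT = "https://192.0.2.10:6443"   # hypothetical cluster API VIP

def api_server_healthy(endpoint: str = API_ENDPOINT, timeout: float = 2.0) -> bool:
    try:
        # verify=False only for illustration; in production validate the cluster CA.
        resp = requests.get(f"{endpoint}/healthz", verify=False, timeout=timeout)
        return resp.status_code == 200 and resp.text.strip() == "ok"
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("API server healthy:", api_server_healthy())
```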

Tanzu Basic for RAN Data Plane

5G RAN workloads such as CNFs run on worker nodes. Worker nodes run as VMs. A worker node requires the container runtime, kube-proxy, and kubelet daemon to function as a member of the Kubernetes cluster. Depending on the type of Telco workloads, worker nodes may require advanced network features such as multiple network interfaces, Container Networking Interface (CNI), SR-IOV, exclusive CPU core assignment, and NUMA pinning.
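
To make the worker-node requirements concrete, the following is a hedged sketch of a pod manifest (built as a Python dictionary and printed as JSON, which Kubernetes also accepts) that requests a secondary interface through a Multus network-attachment annotation and integer CPUs with equal requests and limits, so the pod lands in the Guaranteed QoS class for exclusive core assignment. The image name, network-attachment name ("du-fronthaul"), and SR-IOV resource name ("intel.com/sriov_fh") are hypothetical and depend on the CNF vendor and the device-plugin configuration.

```python
# Hedged sketch: a DU-like pod spec expressed as a Python dict. The network attachment
# and SR-IOV resource names below are hypothetical and must match the cluster's
# CNI and device-plugin configuration.
import json

du_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "vdu-example",
        "annotations": {
            # Multus secondary interface for fronthaul traffic (hypothetical name).
            "k8s.v1.cni.cncf.io/networks": "du-fronthaul",
        },
    },
    "spec": {
        "containers": [
            {
                "name": "vdu",
                "image": "registry.example.com/vendor/vdu:1.0",  # placeholder image
                "resources": {
                    # Equal integer CPU requests/limits -> Guaranteed QoS, enabling
                    # exclusive cores with the static CPU manager policy.
                    "requests": {"cpu": "8", "memory": "16Gi", "intel.com/sriov_fh": "1"},
                    "limits": {"cpu": "8", "memory": "16Gi", "intel.com/sriov_fh": "1"},
                },
            }
        ],
    },
}

print(json.dumps(du_pod, indent=2))
```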

Operational Management Overview

The Operations management architecture includes VMware vRealize® Log Insight™ and VMware vRealize® Operations Manager™ to provide real-time monitoring and logging for the infrastructure and compute workloads in the Telco Cloud Platform RAN solution.


vRealize Log Insight

vRealize Log Insight collects data from ESXi hosts using the syslog protocol. vRealize Log Insight has the following capabilities:

n Connects to other VMware products such as VMware vCenter Server® to collect events, tasks, and alarm data.

n Integrates with vRealize Operations Manager to send notification events and enable launch in context.

n Functions as a collection and analysis point for any system that sends syslog data.

To collect additional logs, you can install an ingestion agent on Linux or Windows servers or use the preinstalled agent on specific VMware products. Preinstalled agents are useful for custom application logs and operating systems such as Windows that do not natively support the syslog protocol.

As Kubernetes and container adoption increases in Telco 5G, vRealize Log Insight can also serve as the centralized log management platform for Tanzu Kubernetes clusters. Cloud administrators can easily configure container logs to be forwarded to vRLI using industry-standard open-source log agents such as Fluentd and Fluent Bit. Any logs written to standard output (stdout) by the container pod are sent to vRLI by the log agent, with no changes to the CNF itself.
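
Because vRealize Log Insight acts as a syslog collection point, any component that can emit syslog can forward into it. The short sketch below uses Python's standard SysLogHandler to send a test event to a vRLI collector; the collector FQDN is a hypothetical placeholder, and UDP port 514 is assumed to be reachable.

```python
# Hedged sketch: forward a log event to a vRealize Log Insight collector over syslog.
# "vrli.example.com" is a hypothetical FQDN; vRLI listens for syslog on port 514 by default.
import logging
import logging.handlers

logger = logging.getLogger("cell-site-agent")
logger.setLevel(logging.INFO)

syslog = logging.handlers.SysLogHandler(address=("vrli.example.com", 514))
syslog.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
logger.addHandler(syslog)

logger.info("PTP offset within threshold on host cell-site-01")  # example event
```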

vRealize Operations Manager

vRealize Operations Manager tracks and analyzes the operation of multiple data sources by using specialized analytic algorithms. These algorithms help vRealize Operations Manager learn and predict the behavior of every object it monitors. Users access this information by using views, reports, and dashboards.

Note For VMware Telco Cloud Platform RAN, vRealize Operations and vRealize Log Insight are optional components.


3 Telco Cloud Platform RAN Solution Design

This section describes the design and deployment of the Telco Cloud Platform RAN solution.

This design leverages Telco Cloud Platform 5G. It highlights various integration points between the Telco Cloud Platform 5G Core and RAN design and the dependency services with the virtual infrastructure to enable a scalable and fault-tolerant Telco Cloud Platform RAN solution design.

This chapter includes the following topics:

n Deployment Architecture of Telco Cloud Platform RAN

n Services Design

n Physical Design

n RAN Virtualization Design

n Telco Cloud Automation Design

n Tanzu Kubernetes Cluster Design

n Operations Management Design

Deployment Architecture of Telco Cloud Platform RAN

This section describes the deployment architecture of the Telco Cloud Platform 5G and RAN and the potential benefits of virtualization.

The architecture of the Telco Cloud Platform RAN solution shows the management and compute workload placement between the Regional Data Center site and the Cell Site. This guide focuses on the Telco Cloud Platform RAN solution. However, you must understand how a Cell Site host and application work in conjunction with a Regional Data Center and what the dependencies are for onboarding a Cell Site host. The following figure shows an end-to-end deployment model of the Telco Cloud Platform 5G core and RAN to illustrate how they work together.

In the Regional Data Center (RDC) site, vSphere administrators deploy and configure one management workload domain and one or more Cell Site compute workload domains. The Cell Site compute workload domain dedicates a vCenter Server to manage both the Regional Data Center (RDC) and Cell Site vSphere hosts and workloads.


Figure 3-1. Telco Cloud Platform RAN Deployment Model


Note NSX, vRNI, vROPS, and vRLI are optional components in the Telco Cloud Platform RAN design, and they are shown at RDC.

This deployment model includes one Regional Data Center (RDC) and one Cell Site as part of the VMware Telco Cloud Platform RAN solution architecture.

n Regional Data Center (RDC) consists of one Management Workload Domain and one Cell Site Compute Workload Domain.

n Management workload domain hosts a dedicated vCenter and manages all the SDDC management and operational management components such as vCenter Server, NSX, vRealize Operations, vRealize Log Insight, and so on.


n VMware NSX is deployed as part of the 5G Core components. It is used only for workloads that are specific to telco management at the RDC. NSX is not used for the Cell Site Compute Workload Domain.

n Compute Workload Domain hosts a dedicated Compute vCenter Server with one vSphere cluster such as WLD1-Cluster1 and multiple Cell Site hosts.

n In the Compute vCenter Server, WLD1-Cluster1 is hosted at Regional Data Center and standalone hosts are deployed at Cell Site locations.

n A dedicated vSphere Distributed Switch (VDS) is used for both the RDC vSphere cluster (WLD1-Cluster1) and the Cell Site host.

n Kubernetes cluster components such as control plane nodes are deployed to a vSphere cluster such as WLD1-Cluster1 in the RDC.

n Kubernetes Worker nodes are deployed at both Regional Data Center and Cell Site locations to support the telco CNF workloads such as vCU and vDU in a geographically distributed manner.

Services Design

This section describes common external services such as DNS, DHCP, NTP, and PTP that are required for the Telco Cloud Platform RAN solution deployment.

Various external services are required for the deployment of the Telco Cloud Platform RAN components and Tanzu Kubernetes clusters. If you deploy the Telco Cloud Platform RAN solution in a greenfield environment, you must first deploy your Regional Data Center site and then onboard the Cell Sites to the Telco Cloud Platform RAN solution.

The following table lists the required external services and dependencies for Regional Data Center site and Cell Site locations:

Service Purpose

Domain Name Services (DNS) Provides name resolution for various components of the Telco Cloud Platform RAN solution.

Dynamic Host Configuration Protocol (DHCP) Provides automated IP address allocation for Tanzu Kubernetes clusters at Regional Data Center and Cell Site locations.

Note: Ensure that the DHCP service is available local to each site.

Network Time Protocol (NTP) Performs time synchronization between various Telco Core management components at Central Data Center or Regional Data Center.

Precision Time Protocol (PTP) Distributes accurate time and frequency over telecom mobile networks and to the ESXi hosts at Cell Site locations.


DNS

When you deploy the Telco Cloud Platform RAN solution, provide the DNS domain information for configuring various components of the solution. DNS resolution must be available for all the components in the solution, including servers, Virtual Machines (VMs), and virtual IPs. Before you deploy the Telco Cloud Platform RAN management components or create any workload domains, ensure that both forward and reverse DNS resolutions are functional for each component.
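
A simple way to verify the forward and reverse resolution requirement before deployment is to script the lookups, as in the hedged sketch below. The host names are hypothetical placeholders.

```python
# Hedged sketch: verify forward and reverse DNS resolution for planned components.
# The FQDNs below are hypothetical placeholders.
import socket

components = [
    "vcenter-wld1.example.com",
    "tca-cp2.example.com",
    "esxi-cellsite-01.example.com",
]

for fqdn in components:
    try:
        ip = socket.gethostbyname(fqdn)                # forward lookup
        reverse_name = socket.gethostbyaddr(ip)[0]     # reverse lookup
        status = "OK" if reverse_name.rstrip(".").lower() == fqdn.lower() else "MISMATCH"
    except socket.gaierror as err:
        ip, status = "-", f"FORWARD FAILED ({err})"
    except socket.herror as err:
        status = f"NO PTR RECORD ({err})"
    print(f"{fqdn:40} {ip:15} {status}")
```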

DHCP

Telco Cloud Platform RAN uses Dynamic Host Configuration Protocol (DHCP) to automatically configure Tanzu Kubernetes cluster nodes with IPv4 addresses at the Regional Data Center and the Cell Site locations. Each RDC and Cell Site must have a dedicated DHCP service locally, and the DHCP scope must be defined and made available for this purpose. The defined scope must be able to accommodate all the initial and future Kubernetes workloads used in the Telco Cloud Platform RAN solution.

The following figure shows the deployment architecture of the DHCP service for a Regional Data Center and Cell Site location:

Figure 3-2. DHCP Design for Regional Data Center and Cell Site Locations


Note While deploying the Tanzu Kubernetes Control Plane, dedicate a static IP for the Kubernetes API endpoint.
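
The sketch below is one way to sanity-check that a per-site DHCP scope can accommodate the initial and future Kubernetes node VMs while keeping a static address outside the scope for the Kubernetes API endpoint. The subnet, node counts, and reserved-address count are hypothetical placeholders.

```python
# Hedged sketch: check that a Cell Site DHCP scope can hold current and future
# Kubernetes node VMs, keeping the API endpoint IP static and outside the scope.
# The subnet and counts are hypothetical placeholders.
import ipaddress

site_subnet = ipaddress.ip_network("172.16.10.0/26")   # hypothetical per-site network
api_endpoint_ip = ipaddress.ip_address("172.16.10.2")  # static, excluded from DHCP

reserved_static = 6          # gateway, API endpoint, and other fixed addresses (assumption)
initial_nodes = 4            # node VMs deployed on day one
growth_nodes = 8             # planned future node VMs

usable = site_subnet.num_addresses - 2                 # minus network and broadcast
dhcp_capacity = usable - reserved_static
required = initial_nodes + growth_nodes

assert api_endpoint_ip in site_subnet, "API endpoint must be in the site subnet"
print(f"DHCP scope capacity: {dhcp_capacity}, required leases: {required}")
print("Scope is", "sufficient" if dhcp_capacity >= required else "too small")
```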


NTP

All the management components of Telco Cloud Platform RAN must be synchronized against a common time by using the Network Time Protocol (NTP). The Telco Cloud Platform RAN components such as vCenter Single Sign-On (SSO) are sensitive to a time drift between distributed components. The synchronized time between various components also assists troubleshooting efforts.

The following guidelines apply for the NTP sources:

n The IP addresses of NTP sources can be provided during the initial deployment of Telco Cloud platform RAN.

n The NTP sources must be reachable by all the components in the Telco Cloud Platform RAN solution.

n Time skew between NTP sources must be less than 5 minutes.
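
The following hedged sketch checks the last guideline by querying each NTP source and comparing their offsets; the source addresses are hypothetical, and the third-party ntplib package is assumed to be available.

```python
# Hedged sketch: verify that the skew between configured NTP sources is under
# 5 minutes. Requires the third-party "ntplib" package; source names are placeholders.
import ntplib

NTP_SOURCES = ["ntp1.example.com", "ntp2.example.com"]  # hypothetical sources
MAX_SKEW_SECONDS = 5 * 60

client = ntplib.NTPClient()
offsets = {}
for source in NTP_SOURCES:
    response = client.request(source, version=3, timeout=2)
    offsets[source] = response.offset  # seconds relative to the local clock

skew = max(offsets.values()) - min(offsets.values())
print(f"Offsets: {offsets}")
print(f"Skew between sources: {skew:.3f} s "
      f"({'OK' if skew <= MAX_SKEW_SECONDS else 'EXCEEDS 5-minute limit'})")
```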

PTP

Precision Time Protocol (PTP) delivers time synchronization in various Telco applications and environments. It is defined in the IEEE 1588-2008 standard. PTP helps distribute accurate time and frequency over telecommunication mobile networks. Precise timekeeping is a key attribute for telco applications: it allows these applications to accurately reconstruct the precise sequence of events that occurred or occur in real time. So, each ESXi node in the Telco Cloud Platform RAN solution must be time-synchronized.

Note The precision of a clock describes how consistent its time and frequency are relative to a reference time source, when measured repeatedly. The distinction between precision and accuracy is subtle but important.

PTP profiles: PTP allows various profiles to be defined to amend PTP for use in different scenarios. A profile is a set of specific PTP configuration options that are selected to meet the requirements of telco RAN applications.

PTP traffic: If the network carrying PTP comprises a non-PTP-aware switch in the pathway between the Grandmaster and Follower clocks, the switch handles PTP as any other data traffic, affecting the PTP accuracy. In this case, use a proper Quality-of-Service (QoS) configuration for network delivery to prioritize PTP traffic over all other traffic.

PTP grandmaster clocks: When networks are distributed geographically across different locations with Central Data Center, Regional Data Center, and Cell Site and they are connected over Wide Area Networks (WAN), varying latency across WAN link can compromise PTP accuracy. In that case, use different PTP Grandmaster clocks in each site and do not extend PTP across these sites.


The following guidelines apply for the PTP sources:

n The ESXi host must have a third physical NIC dedicated for PTP synchronization.

n The PTP sources such as Telco Grandmaster Clock (T-GM) must be reachable by all the components in the Telco Cloud Platform RAN solution.

n Use the G.8275.1 PTP profile for accurate time synchronization for RAN applications. ITU-T G.8275.1 defines the PTP profile for network scenarios with full timing support, which means that all the intermediate switches support Boundary Clock (BC) functionality.

n For the Cell Site locations, each host must have a minimum of three dedicated physical NICs: one NIC for PTP time synchronization and two NICs for workloads.

The following figure shows the physical NIC placement for PTP on an ESXi host:

Figure 3-3. PTP Design for ESX Host at Cell Site

(The ESXi host at the Cell Site connects pNIC1 and pNIC2 to the VDS, serving VMXNET3 adapters for workload traffic, and dedicates pNIC3 to PTP synchronization with the PTP Grandmaster through a PTP-aware switch.)

Note In case of a dual-socket host with NUMA placement, use six physical NICs. Each socket must be mapped to two physical NICs for user workload and one physical NIC for PTP time synchronization.

Physical Design

The physical design includes the physical ESXi host, storage, and network design for Telco Cloud Platform RAN.

Physical ESXi Host Design

Ensure that the physical specifications of the ESXi hosts allow for successful deployment and operation of the physical ESXi host design for RAN.

Physical Design Specification Fundamentals

The physical design specifications of the ESXi host determine the characteristics of the ESXi hosts that you use to deploy the Telco Cloud Platform RAN solution. For example, consistent PCI card slot placement, especially for network controllers, is essential for accurate alignment of physical to virtual I/O resources. By using identical configurations, you can balance the VM storage components across storage and compute resources.


ESXi Host Memory

The amount of memory required for a vSphere compute host for RAN varies according to the workloads running on the host. Ensure that at least 8% of the resources are available for ESXi host operations.
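
A quick way to apply the 8% guideline when sizing a host is shown in the hedged sketch below; the 192 GB host size matches the recommendation later in this section and is only a starting point.

```python
# Hedged sketch: memory available to RAN workloads after reserving roughly 8%
# of host RAM for ESXi operations (per the guideline above).
ESXI_OVERHEAD_FRACTION = 0.08

def workload_memory_gb(host_memory_gb: float) -> float:
    return host_memory_gb * (1 - ESXI_OVERHEAD_FRACTION)

print(f"{workload_memory_gb(192):.1f} GB usable on a 192 GB Cell Site host")  # ~176.6 GB
```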

ESXi Boot Device

The following considerations apply when you select a boot device type and size for local storage. Select all ESXi host hardware, including boot devices, by referring to the VMware Compatibility Guide.

The device types that are supported as ESXi boot devices are as follows:

n USB or SD embedded devices. The USB or SD flash drive must be at least 8 GB.

n SATADOM devices. The size of the boot device per host must be at least 16 GB.

Recommended Physical ESXi Host Design

Design decision: Use hardware based on the VMware Compatibility Guide.
Design justification:
n Ensures full compatibility with vSphere.
n Allows flexibility and ease of management of both RDC and Cell Site hosts.
Design implication: Hardware choices might be limited.

Design decision: Ensure that all ESXi hosts have a uniform configuration across the Cell Site vSphere hosts.
Design justification: Ease of management and maintenance across the Cell Sites.
Design implication: None.

Design decision: Onboard a Cell Site with a minimum of one ESXi host.
Design justification: Cell Sites are limited in space, and only a few telco workloads can be run.
Design implication: Additional ESXi host resources might be required for redundancy and maintenance.

Design decision: Set up each ESXi host with a minimum of three physical NICs.
Design justification:
n Ensures full redundancy for the required two physical NICs for workloads.
n One physical NIC is dedicated for PTP time synchronization.
Design implication: As more critical workloads are added, more physical NICs must be added to the host to ensure PTP redundancy.

Design decision: Set up each ESXi host in the Cell Site with a minimum of two disks: an ESXi boot drive and local storage for workloads.
Design justification: Local storage is the primary storage solution for the Cell Site. Note: The disk size must be considered based on telco workloads.
Design implication: The local disk must be sized appropriately. Note: Local storage does not support sharing across multiple hosts.

Design decision: Set up each ESXi host in the Cell Site location with a minimum of 192 GB RAM.
Design justification:
n A good starting point for most workloads.
n Allows for ESXi and other management overhead.
Design implication: Additional memory might be required based on vendor workload and sizing requirements.


Physical Network Design

The physical network design for RAN includes defining the network topology for connecting physical switches and the ESXi hosts, determining switch port settings for VLANs, and designing routing.

Top-of-Rack Physical Switches

When configuring Top-of-Rack (ToR) switches, consider the following best practices:

n Configure redundant physical switches to enhance availability.

n Configure switch ports that connect to ESXi hosts manually as trunk ports. Virtual switches are passive devices and do not support trunking protocols, such as Dynamic Trunking Protocol (DTP).

n Modify the Spanning Tree Protocol (STP) on any port that is connected to an ESXi NIC to reduce the time it takes to transition ports over to the forwarding state, for example, using the Trunk PortFast feature on a Cisco physical switch.

n Configure jumbo frames on all switch ports, Inter-Switch Link (ISL), and Switched Virtual Interfaces (SVIs).

n Configure PTP time synchronization on supported ToR switches.

Top-of-Rack Connectivity and Network Settings

Each ESXi host is connected redundantly to the network fabric ToR switches by using a minimum of two 10 GbE ports (25 GbE or faster ports are recommended). Configure the ToR switches to provide all necessary VLANs through an 802.1Q trunk. These redundant connections use the features of vSphere Distributed Switch to guarantee that the physical interface is not overrun and redundant paths are used if they are available.

n Spanning Tree Protocol (STP): Although this design does not use the STP, switches usually include STP configured by default. Designate the ports connected to ESXi hosts as trunk PortFast.

n Trunking: Configure the switch ports as members of an 802.1Q trunk.

n MTU: Set MTU for all switch ports, VLANs, and SVIs to jumbo frames for consistency.

Jumbo Frames

IP storage throughput can benefit from the configuration of jumbo frames. Increasing the per-frame payload from 1500 bytes to the jumbo frame setting improves the efficiency of data transfer. Jumbo frames must be configured end-to-end. When you enable jumbo frames on an ESXi host, select an MTU size that matches the MTU size of the physical switch ports.

The workload determines whether to configure jumbo frames on a VM. Configure jumbo frames, if necessary, if the workload regularly transfers large volumes of network data. Also, ensure that both the VM operating system and the VM NICs support jumbo frames. Jumbo frames also improve the performance of vSphere vMotion.
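
End-to-end jumbo frame configuration can be validated with a do-not-fragment ping sized just under the 9000-byte MTU (9000 bytes minus 20 bytes of IP and 8 bytes of ICMP headers). The hedged sketch below wraps the Linux ping utility; on an ESXi host the equivalent check is vmkping -d -s 8972 <target>. The target address is a hypothetical placeholder.

```python
# Hedged sketch: verify that jumbo frames pass end to end by sending a
# do-not-fragment ICMP payload of 8972 bytes (9000 - 20 IP - 8 ICMP headers).
# Uses the Linux "ping" flags; on ESXi use: vmkping -d -s 8972 <target>
import subprocess

def jumbo_path_ok(target: str, payload: int = 8972) -> bool:
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "3", "-s", str(payload), target],
        capture_output=True, text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    print(jumbo_path_ok("192.0.2.50"))  # hypothetical peer VMkernel/SVI address
```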


Recommended Physical Network Design

Design decision: Use a layer 3 transport.
Design justification:
n You can select layer 3 switches from different vendors for the physical switching fabric.
n You can mix switches from different vendors because of the general interoperability between the implementations of routing protocols.
n This approach is cost-effective because it uses only the basic functionality of the physical switches.
Design implication: None.

Design decision: Implement the following physical network architecture:
n A minimum of two 10 GbE ports (two 25 GbE ports recommended) on each ToR switch for ESXi host uplinks.
n No EtherChannel (LAG/vPC) configuration for ESXi host uplinks.
Design justification:
n Guarantees availability during a switch failure.
n Provides compatibility with vSphere host profiles because they do not store link-aggregation settings.
Design implication: Hardware choices might be limited.

Design decision: Use two ToR switches for each Cell Site location for network high availability.
Design justification:
n This design uses a minimum of two 10 GbE links (two 25 GbE links recommended) to each ESXi host.
n Provides redundancy and reduces the overall design complexity.
Design implication: Two ToR switches per Cell Tower can increase costs.

Design decision: Use VLANs to segment physical network functions.
Design justification: Supports physical network connectivity without requiring many NICs.
Design implication: Requires uniform configuration and presentation on all the switch ports made available to the ESXi hosts.

Design decision: Assign static IP addresses to all management components.
Design justification: Ensures that interfaces such as management and storage always have the same IP address. This way, you provide support for continuous management of ESXi hosts using vCenter Server and for provisioning IP storage by storage administrators.
Design implication: Requires precise IP address management.

Design decision: Create DNS records for all ESXi hosts and management VMs to enable forward, reverse, short, and FQDN resolution.
Design justification: Ensures consistent resolution of management components using both IP address (reverse lookup) and name resolution.
Design implication: Adds administrative overhead.

Design decision: Configure the MTU size to 9000 bytes (jumbo frames) on the physical switch ports, VLANs, SVIs, vSphere Distributed Switches, and VMkernel ports.
Design justification: Improves traffic throughput.
Design implication: When you adjust the MTU size, you must also configure the entire network path (VMkernel port, distributed switch, physical switches, and routers) to support the same MTU size.

PTP Time Synchronization

RAN maintains network timing distribution as the preferred method for PTP time synchronization. For the RAN to operate effectively, the RU, DU, and CU must be time and phase synchronized. Delayed synchronization can have a negative impact on network applications, for example, low throughput, poor attachment success rate, and poor delivery success rate.

The accuracy of time synchronization depends mostly on the implementation of network connectivity and PTP protocol distribution, for example, the timestamping near the interfaces and the number of hops. The O-RAN.WG4 fronthaul specifications define the following synchronization topologies for telco deployment:

n LLS-C1 Configuration

n LLS-C2 Configuration

n LLS-C3 Configuration

n LLS-C4 Configuration

Note Consider PTP time synchronization based on these designs. However, Telco Cloud Platform RAN 1.0 supports LLS-C3 configuration only.

LLS-C1 Configuration

This configuration is based on a point-to-point connection between the DU and RU using the network timing option. LLS-C1 is simple to configure. In this configuration, the DU operates as a PTP Boundary Clock (BC): it derives the time signal from the Grandmaster and communicates directly with the RU to synchronize it.

Figure 3-4. LLS-C1


LLS-C2 Configuration

In this configuration, the DU acts as a PTP BC to distribute network timing towards the RU. One or more PTP-supported switches can be installed between the DU and RU.

Figure 3-5. LLS-C2


LLS-C3 Configuration

In this configuration, the PTP Grandmaster distributes network timing to both the DU and the RU at the Cell Sites. One or more PTP switches are allowed in the Fronthaul network to support network timing distribution. This architecture is widely adopted because introducing the PTP Grandmaster and PTP switch provides an ideal solution for network timing distribution.

Figure 3-6. LLS-C3


LLS-C4 Configuration

In this configuration, a PRTC (usually a GNSS receiver) is used locally to provide timing for the RU. The PRTC does not depend on the Fronthaul transport network for timing and synchronization.

Figure 3-7. LLS-C4


RAN Split and Fronthaul Network

3GPP defined eight functional split options for Fronthaul networks. Options 2 and 7.x are the most commonly adopted radio splits.

n Option 2: A high-level CU and DU split. With the Option 2 split, the CU handles Service Data Adaptation Protocol (SDAP) or Packet Data Convergence Protocol (PDCP) with Radio Resource Control (RRC), while L2/L1 Ethernet functions reside in the DU. Before the data is sent across the midhaul network, aggregation and statistical multiplexing of the data are done in the DU. So, the amount of data transmitted across the interface for each radio antenna appliance is reduced. PTP time synchronization is not mandatory for the Option 2 split.

n Option 7.x: A low-level DU and RU split. With the Option 7 split, the DU handles the RRC/PDCP/Radio Link Control (RLC)/MAC and higher Physical (PHY) functions. The RU handles the lower PHY and RF functions. Mostly, a single DU is co-located with multiple RUs, offloading resource-intensive processing from multiple RUs. The CU can be centrally located across the WAN, aggregating multiple DUs. Option 7.x lets operators simplify the deployment of the DU and RU, leading to a cost-effective solution and an ideal option for a distributed RAN deployment. Use LLS-C3 for PTP synchronization between the RU and DU.

Mobile operators require the flexibility to choose different splits based on the server hardware and fronthaul availability. Higher-layer functional splits are required for dense urban areas and for scenarios where a low bit rate is required on the fronthaul interface. With Option 7.x, more functions are shifted to the DU, enabling greater virtualization gains. Hence, the Option 7.x split is more cost-effective than the Option 2 DU split.

Physical Storage Design

The physical storage design for RAN uses the ESXi local disk as the primary storage type at the Cell Site.

Local storage at a Cell Site can consist of internal hard disks located inside the ESXi host. Local storage does not support sharing across multiple hosts: only one host has access to a datastore on a local storage device. As a result, although you can use local storage to create VMs, you cannot use VMware features that require shared storage, such as HA and vMotion.

ESXi supports various local storage devices, including SCSI, IDE, SATA, USB, SAS, flash, and NVMe devices.

Another option is Network Attached Storage (NAS), which stores VM files on remote file servers accessed over a standard TCP/IP network. The NFS client built into ESXi uses Network File System (NFS) protocol versions 3 and 4.1 to communicate with the NAS/NFS servers. For network connectivity, the host requires a standard network adapter.

You can mount an NFS volume directly on the ESXi host. You can use the NFS datastore to store and manage VMs in the same way that you use the VMFS datastores.


In an NFS storage configuration, a VM uses the NFS datastore to store its files. In this configuration, the host connects to the NAS server, which stores the virtual disk files, through a regular network adapter.

Important Although a local storage configuration is possible, it is not recommended. A single connection between the storage device and the host creates a Single Point of Failure (SPOF) that can cause interruptions when the connection is unreliable or fails. Because most local storage devices do not support multiple connections, you cannot use multiple paths to access local storage.

RAN Virtualization Design

The Telco Cloud Platform RAN solution architecture consists of a centralized management domain along with compute workload domains to support the required RAN workloads in a Regional Data Center and a Cell Site.

Before you deploy a Cell Site host, deploy a management domain and a compute workload domain at your Regional Data Center (RDC) using Telco Cloud Automation. The following sections describe the VMware SDDC design within a construct of Management workload domain and Compute workload domain, and how they are related in this RAN design.

Management Domain

The management domain contains a single vSphere cluster called the management cluster. The management cluster hosts the VMs that manage the solution. This cluster is crucial for the management and monitoring of the solution. Its high availability deployment ensures that the management and monitoring services are always available centrally at Regional Data Center.

Management Domain Components and Descriptions:

n Management vCenter Server: Manages the Management Domain.

n Compute vCenter Server: Manages the Compute Workload Domain.

n NSX Manager: Three instances of NSX Manager are used in a cluster. Note: NSX is not used at RAN sites such as Cell Site locations.

n vRealize Suite Standard: Includes vRealize Log Insight and vRealize Operations Manager.

n vRealize Network Insight: Communicates with the vCenter Server and NSX Manager instances to collect metrics that are presented through various dashboards and views.

n vRealize Orchestrator: A workflow engine, fully integrated with Telco Cloud Automation.

n Telco Cloud Automation: Includes the TCA Manager and TCA-CP nodes.

n VMware Tanzu Basic for RAN: Creates workload clusters in the compute workload domain.

Note In the Telco Cloud Platform RAN solution, the Management Domain is hosted at Regional Data Center (RDC).

Compute Workload Domain

The compute workload domain can contain multiple vSphere clusters at the RDC. These clusters can contain a minimum of three ESXi hosts and a maximum of 96 hosts (64 hosts when using vSAN), depending on the resource and availability requirements of the solution being deployed at the RDC.

The Cell Site host, which is designed to onboard CNFs on a single-node ESXi host, is part of the compute workload domain. This host provides the Kubernetes workload cluster capacity where CNFs such as vCU and vDU workloads are placed.

Each compute workload domain can support a maximum of 2,500 ESXi hosts and 45,000 VMs across the RDC clusters and Cell Site hosts combined. Depending on the management and monitoring tools in use, the actual number of ESXi hosts and VMs per workload domain might be less than these vCenter Server maximums.

vCenter Server Design

The vCenter Server design for RAN includes the design for all the vCenter Server instances. For this design, determine the number of instances, their sizes, networking configuration, vSphere cluster layout, redundancy, and security configuration.

vCenter Server is deployed at Regional Data Center and it manages all the Cell Site hosts. So, it is critical to design vCenter appropriately before onboarding the Cell Site hosts and RAN applications.

A vCenter Server deployment can consist of two or more vCenter Server instances according to the scale, number of VMs, and continuity requirements for your environment.

You must protect the vCenter Server system as it is the central point of management and monitoring. You can protect vCenter Server according to the maximum downtime tolerated. Use the following methods to protect the vCenter Server instances:

n Automated protection using vSphere HA

n Automated protection using vCenter Server HA

vCenter Server Sizing

You can size the resources and storage for the Management vCenter Server Appliance and the Compute vCenter Server Appliance according to the expected number of Hosts and VMs in the environment.


Table 3-1. Recommended Sizing for the Management vCenter Server

n Appliance Size: Small (up to 100 hosts or 1,000 VMs)
n Number of vCPUs: 4
n Memory: 19 GB
n Disk Space: 528 GB

The following table lists different deployment sizes for Compute vCenter Server. Choose the appropriate size based on your scaling requirements such as the number of Cell Site hosts or workloads.

Table 3-2. Deployment Sizes for Compute vCenter Servers

n Tiny: Deploys an appliance with 2 vCPUs and 12 GB of memory. Suitable for environments with up to 10 hosts or 100 VMs.
n Small: Deploys an appliance with 4 vCPUs and 19 GB of memory. Suitable for environments with up to 100 hosts or 1,000 VMs.
n Medium: Deploys an appliance with 8 vCPUs and 24 GB of memory. Suitable for environments with up to 400 hosts or 4,000 VMs.
n Large: Deploys an appliance with 16 vCPUs and 37 GB of memory. Suitable for environments with up to 1,000 hosts or 10,000 VMs.
n X-Large: Deploys an appliance with 24 vCPUs and 56 GB of memory. Suitable for environments with up to 2,500 hosts or 45,000 VMs.

For more information, see the VMware vSphere documentation.

Important Ensure that the Compute vCenter Server is dedicated to your Cell Site hosts and RAN applications.

TLS Certificates in vCenter Server

By default, vSphere uses TLS/SSL certificates that are signed by VMware Certificate Authority (VMCA). These certificates are not trusted by end-user devices or browsers.

As a security best practice, replace at least all user-facing certificates with certificates that are signed by a third-party or enterprise Certificate Authority (CA).


Recommended vCenter Server Design

Design Decision: Deploy two vCenter Server systems: one vCenter Server supports the management workloads, and another vCenter Server supports the compute workloads.
Design Justification:
n Isolates vCenter Server failures to management or compute workloads.
n Isolates vCenter Server operations between management and compute workloads.
n Supports a scalable vSphere cluster design where you might reuse the management components as more compute workload domains are added.
n Simplifies capacity planning for compute workloads because you do not consider management workloads for the Compute vCenter Server.
n Improves the ability to upgrade the vSphere environment and related components by separating maintenance windows.
n Supports separation of roles and responsibilities to ensure that only administrators with proper authorization can attend to the management workloads.
n Facilitates quicker troubleshooting and problem resolution.
Design Implication: Requires licenses for each vCenter Server instance.

Design Decision: Protect all vCenter Server instances by using vSphere HA.
Design Justification: Supports the availability objectives for vCenter Server without requiring manual intervention during a failure event.
Design Implication: vCenter Server becomes unavailable during the vSphere HA failover.

Design Decision: Replace the vCenter Server machine certificate with a certificate signed by a third-party Public Key Infrastructure.
Design Justification: Infrastructure administrators connect to the vCenter Server instances using a Web browser to perform configuration, management, and troubleshooting, and the default certificate results in certificate warning messages.
Design Implication: Replacing and managing certificates is an operational overhead.

Design Decision: Use an SHA-2 or higher algorithm when signing certificates.
Design Justification: The SHA-1 algorithm is considered less secure and is deprecated.
Design Implication: Not all certificate authorities support SHA-2.

Important In the Telco Cloud Platform RAN solution design, both the management vCenter and Compute vCenter are deployed at Regional Data Center. Compute vCenter manages all the Cell Site hosts.


Workload Domains and vSphere Cluster Design

The vCenter Server functionality is distributed across a minimum of two workload domains and two vSphere clusters. This solution uses two vCenter Server instances: one for the management domain and another for the first compute workload domain. The compute workload domain can contain multiple Cell Site vSphere Hosts.

The cluster design at RDC must consider the workloads that the cluster handles. Different cluster types in this design have different characteristics. When you design the cluster layout in vSphere, consider the following guidelines:

n Use a few large-sized ESXi hosts or more small-sized ESXi hosts for Regional Data Center (RDC)

n A scale-up cluster has few large-sized ESXi hosts.

n A scale-out cluster has more small-sized ESXi hosts.

n Use the ESXi hosts that are sized appropriately for your Cell Site locations.

n Consider the total number of ESXi hosts and cluster limits as per vCenter Maximums.

vSphere High Availability

VMware vSphere High Availability (vSphere HA) protects your VMs in case of an ESXi host failure by restarting VMs on other hosts in the cluster. During the cluster configuration, the ESXi hosts elect a primary ESXi host. The primary ESXi host communicates with the vCenter Server system and monitors the VMs and secondary ESXi hosts in the cluster.

The primary ESXi host detects different types of failure:

n ESXi host failure, for example, an unexpected power failure.

n ESXi host network isolation or connectivity failure.

n Loss of storage connectivity.

n Problems with the virtual machine OS availability.

The vSphere HA Admission Control Policy allows an administrator to configure how the cluster determines available resources. In a small vSphere HA cluster, a large proportion of the cluster resources is reserved to accommodate ESXi host failures, based on the selected policy.

However, with the Regional Data Center and Cell Site construct in the Telco Cloud Platform RAN deployment, you need to enable vSphere High Availability only on your workload cluster at the Regional Data Center. vSphere HA is not required on the Cell Site host because it is managed as a standalone host.

Recommended vSphere HA design for Cell Site Host:

Cluster Operation: vSphere HA
Location: Cell Site
Action: No HA
Justification: Hosts are deployed as standalone hosts at the Cell Site without a vSphere cluster.

vSphere Distributed Resource Scheduler

The distribution and usage of CPU and memory resources for all hosts and VMs in the cluster are continuously monitored. The vSphere Distributed Resource Scheduler (DRS) compares these metrics to an ideal resource usage given the attributes of the cluster’s resource pools and VMs, the current demand, and the imbalance target. DRS then provides recommendations or performs VM migrations accordingly.

Recommended vSphere DRS design for Cell Site Host:

Operation: DRS
Location: Cell Site
Action: No DRS
Justification: Hosts are deployed as standalone hosts at the Cell Site without a vSphere cluster.

Network Virtualization Design

The network virtualization design for RAN uses the vSphere Distributed Switch (VDS) along with VLAN requirements.

Network Segments and VLANs

Separate the different types of traffic for access security and to reduce contention and latency.

Depending on the application or service, high latency on specific VM networks can negatively affect performance. Determine which workloads and networks are sensitive to high latency by using the information gathered from the current state analysis and by interviewing key stakeholders and SMEs.

The following table lists the network segments and VLANs for a Cell Site host configuration.

Table 3-3. Cell Site VLAN

n ESXi Management: vSphere management network
n Workload: RAN workload network

vSphere Distributed Switch (VDS) Design for RAN

VMware vSphere Distributed Switch (VDS) provides a centralized interface from which you can configure, monitor, and administer VM access switching for all Cell Site locations. The VDS extends the features and capabilities of virtual networks while simplifying provisioning and the ongoing configuration, monitoring, and management processes.

For Cell Site ESXi hosts, create a single virtual switch per Cell Site group. The virtual switch manages each type of network traffic; configure a port group for each traffic type to simplify configuration and monitoring. Cell Site ESXi hosts are added to the data center object of the vCenter Server.

The VDS eases this management burden by treating the network as an aggregated resource. Individual host-level virtual switches are abstracted into one large VDS spanning multiple hosts. In this design, the data plane remains local to each VDS, but the management plane is centralized.

The following figure shows a dedicated VDS at the Regional Data Center that manages the Kubernetes cluster and worker nodes along with the vCU. Another VDS is configured to manage all hosts in a Cell Site group. Both VDS switches are managed by the Compute vCenter Server, which is hosted at the Regional Data Center.

Important Each vCenter Server instance can support up to 128 vSphere Distributed Switches, and each VDS can manage up to 2,000 hosts. Plan your Cell Site scaling accordingly.

Figure 3-8. VDS Design for Cell Site Groups (one VDS at the Regional Data Center serves the WLD-1 Cluster-1 hosts running the Kubernetes control plane and worker nodes; a second VDS, managed by the same RDC Compute vCenter Server, serves the Cell Site group hosts running vDU worker nodes, with DHCP available at each site)

Cell Site VDS

Use a vSphere Distributed Switch (VDS) for Cell Site hosts based on the number of Cell Site hosts, the scaling options, and the ease of network management in each Cell Site group. The network latency between the vCenter Server and an ESXi host must be 150 ms or less.

VDS: A shared VDS is used across Cell Sites in a Cell Site group.
Limitation: 128 vSphere Distributed Switches are supported per vCenter Server.

Figure 3-9. Dedicated VDS for each Cell Site Group (the RDC Compute vCenter Server manages a separate VDS for each Cell Site group, for example SFO, Palo Alto, and San Jose, with the Cell Site hosts in each group attached to their group's VDS)

SR-IOV

SR-IOV is a specification that allows a single Peripheral Component Interconnect Express (PCIe) physical device under a single root port to appear as multiple separate physical devices to the hypervisor or the guest operating system.

SR-IOV uses Physical Functions (PFs) and Virtual Functions (VFs) to manage global functions for the SR-IOV devices. PFs are full PCIe functions that can configure and manage the SR-IOV functionality. VFs are lightweight PCIe functions that support data flow but have a restricted set of configuration resources. The number of VFs provided to the hypervisor or the guest operating system depends on the device. SR-IOV enabled PCIe devices require appropriate BIOS, hardware, and SR-IOV support in the guest operating system driver or hypervisor instance.

In vSphere, a VM can use an SR-IOV virtual function for networking. The VM and the physical adapter exchange data directly without using the VMkernel stack as an intermediary. Bypassing the VMkernel for networking reduces the latency and improves the CPU efficiency for high data transfer performance.
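To make such a virtual function consumable by a CNF Pod, the SR-IOV resource is typically exposed through a Multus NetworkAttachmentDefinition. The following is a minimal sketch only; the resource name, VLAN, and IPAM range are hypothetical placeholders and depend on the SR-IOV device plugin and CNI configuration actually in use:

```yaml
# Hypothetical NetworkAttachmentDefinition for an SR-IOV backed secondary interface
# (resource name, VLAN, and IPAM range are placeholders)
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: midhaul-sriov
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_netdevice   # must match the SR-IOV device plugin resource
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "sriov",
      "vlan": 100,
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24"
      }
    }'
```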

Figure 3-10. SR-IOV Logical View (VNF components with VF drivers attach directly to Virtual Functions on the physical NICs of the ESXi host for data plane traffic, with a port group association on the vSphere Distributed Switch and uplinks to the ToR switch)

Recommended Network Virtualization Design

Design Recommendation: Use two physical NICs in the Cell Site ESXi host for workloads.
Design Justification: Provides redundancy to all port groups.
Design Implication: None.

Design Recommendation: Use a minimum of one physical NIC (two recommended) in the Cell Site ESXi host for PTP time synchronization.
Design Justification: Provides the time synchronization service.
Design Implication: None.

Design Recommendation: Use vSphere Distributed Switches.
Design Justification: Simplifies the management of the virtual network.
Design Implication: Migration from a standard switch to a distributed switch requires a minimum of two physical NICs to maintain redundancy.

Design Recommendation: Use a single vSphere Distributed Switch per Cell Site group.
Design Justification: Reduces the complexity of the network design and provides a more scalable architecture for Cell Site locations.
Design Implication: Increases the number of vSphere Distributed Switches that must be managed.

Design Recommendation: Use ephemeral port binding for the management port group.
Design Justification: Provides the recovery option for the vCenter Server instance that manages the distributed switch.
Design Implication: Port-level permissions and controls are lost across power cycles, and no historical context is saved.

Design Recommendation: Use static port binding for all non-management port groups.
Design Justification: Ensures that a VM connects to the same port on the vSphere Distributed Switch, which allows for historical data and port-level monitoring.
Design Implication: None.

Design Recommendation: Enable health check on all vSphere Distributed Switches.
Design Justification: Verifies that all VLANs are trunked to all ESXi hosts attached to the vSphere Distributed Switch and that the MTU sizes match the physical network.
Design Implication: You must have a minimum of two physical uplinks to use this feature.

Design Recommendation: Use the Route Based on Physical NIC Load teaming algorithm for all port groups.
Design Justification: Reduces the complexity of the network design and increases resiliency and performance.
Design Implication: None.

Design Recommendation: Enable Network I/O Control on all distributed switches.
Design Justification: Increases the resiliency and performance of the network.
Design Implication: If configured incorrectly, Network I/O Control might impact the network performance for critical traffic types.

Telco Cloud Automation Design

This section outlines the design best practices of the Telco Cloud Automation (TCA) components including TCA Manager, TCA-Control Plane, NodeConfig Operator, Container registry, and CNF designer.

VMware Telco Cloud Automation with infrastructure automation provides a universal 5G Core and RAN deployment experience to service providers. Infrastructure automation for 5G Core and RAN allows telco providers and administrators to deliver a virtually zero IT touch, virtually zero infrastructure onboarding experience. The Telco Cloud Automation appliance automates the bring-up of the entire software-defined stack for the 5G Core and RAN site, and automates its configuration and provisioning.

Network operations administrators can provision new telco cloud resources, monitor changes to the RDC and Cell Sites, and manage other operational activities. VMware Telco Cloud Automation enables consistent, secure infrastructure and operations across the Central Data Center, RDC, and Cell Sites with increased enterprise agility and flexibility.

Telco Cloud Automation Components

Telco Cloud Automation is a domain orchestrator that provides life cycle management of VNF, CNFs, and infrastructure on which they run. Telco Cloud Automation consists of two major components:

n Telco Cloud Automation Manager (TCA Manager) provides orchestration and management services for Telco clouds.

n Telco Cloud Automation Control Plane (TCA-CP) is responsible for multi-VIM/CaaS registration, synchronizing multi-cloud inventories, and collecting faults and performance logs from the infrastructure and network functions.


TCA-CP and TCA Manager components work together to provide Telco Cloud Automation services. TCA Manager connects with TCA-CP nodes through site pairing. TCA manager relies on the inventory information captured from TCA-CP to deploy and scale Tanzu Kubernetes clusters. TCA manager does not communicate with the VIM directly. Workflows are always posted by the TCA manager to the VIM through TCA-CP.

The Kubernetes cluster bootstrapping environment is completely abstracted into TCA-CP. The binaries and cluster plans required to bootstrap the Kubernetes clusters are pre-bundled into the TCA-CP appliance. After the base OS image templates are imported into the respective vCenter Servers, Kubernetes admins can log into the TCA manager and start deploying Kubernetes clusters directly from the TCA manager console.

Design Recommendation: Integrate the Management vCenter SSO with LDAP/AD for TCA user onboarding.
Design Justification: TCA-CP SSO integrates with vCenter SSO, and LDAP enables centralized and consistent user management.
Design Implication: Requires additional components to manage in the Management cluster.

Design Recommendation: Deploy a single instance of the TCA manager to manage all TCA-CP endpoints.
Design Justification: Provides a single point of entry into CaaS and simplifies inventory control, user onboarding, and CNF onboarding.
Design Implication: None.

Design Recommendation: Register the TCA manager with the management vCenter Server.
Design Justification: The management vCenter Server is used for TCA user onboarding.
Design Implication: None.

Design Recommendation: Deploy a dedicated TCA-CP node to control the Tanzu Kubernetes management cluster.
Design Justification: Required for the deployment of the Tanzu Kubernetes management cluster.
Design Implication: TCA-CP requires additional CPU and memory in the management cluster.

Design Recommendation: Each TCA-CP node controls a single vCenter Server. Multiple vCenter Servers in one location require multiple TCA-CP nodes.
Design Justification: The TCA-CP to vCenter Server mapping cannot be distributed.
Design Implication: Each time a new vCenter Server is deployed, a new TCA-CP node is required. To minimize recovery time in case of a TCA-CP failure, each TCA-CP node must be backed up independently, along with the TCA manager.

Design Recommendation: Deploy the TCA manager and TCA-CP on a shared LAN segment used by the VIM for management communication.
Design Justification: Simplifies connectivity between the Telco Cloud Platform management components; the TCA manager, TCA-CP, and VIM share the same security trust domain; and the single NIC design simplifies host routing setup across the Telco Cloud Platform management components.
Design Implication: None.

Design Recommendation: Share VMware vRealize® Orchestrator™ deployments across all TCA-CP and vCenter Server pairings.
Design Justification: A consolidated vRO deployment reduces the number of vRO nodes to deploy and manage.
Design Implication: Requires vRO to be highly available if multiple TCA-CP endpoints depend on a shared deployment.

Design Recommendation: Deploy the vRO cluster using three nodes.
Design Justification: A highly available cluster ensures that vRO is available for all TCA-CP endpoints.
Design Implication: vRO redundancy requires an external load balancer.

Design Recommendation: Schedule TCA manager and TCA-CP backups at around the same time as the SDDC infrastructure components to minimize database synchronization issues upon restore. Note: Your backup frequency and schedule might vary based on your business needs and operational procedures.
Design Justification: Proper backup of all TCA and SDDC components is crucial to restore the system to its working state in the event of a failure, and time-consistent backups taken across all components require less time and effort upon restore.
Design Implication: Backups are scheduled manually. The TCA admin must log in to each component and configure a backup schedule and frequency.

CaaS Infrastructure

The Kubernetes cluster automation in Telco Cloud Automation starts with Kubernetes templates that capture the deployment configuration for a Kubernetes cluster. The cluster templates are a blueprint for Kubernetes cluster deployments and are intended to minimize repetitive tasks, enforce best practices, and define guard rails for infrastructure management.

A policy engine honors the SLA required for each template profile by mapping the Telco Cloud Infrastructure resources to the cluster templates. Policies can be defined based on the tags assigned to the underlying VIM or based on the role and role-permission bindings. In this way, the appropriate VIM resources are exposed to a set of users, automating the path from the SDDC to Kubernetes cluster creation.

The CaaS Infrastructure automation in Telco Cloud Automation consists of the following components:

n TCA Kubernetes Cluster Template Designer: The TCA admin uses the TCA Kubernetes Cluster designer to create Kubernetes cluster templates that help deploy Kubernetes clusters. A Kubernetes cluster template defines the composition of the Kubernetes cluster. Attributes such as the number and size of control plane and worker nodes, the Kubernetes CNI, the Kubernetes storage interface, and the Helm version make up a typical Kubernetes cluster template. The TCA Kubernetes Cluster template designer does not capture CNF-specific Kubernetes attributes but instead leverages the VMware NodeConfig Operator through late binding. For late binding details, see TCA VM and Node Config Automation Design.

n SDDC Profile and Inventory Discovery: The Inventory management component of Telco Cloud Automation can discover the underlying infrastructure for each VIM associated with a TCA-CP appliance. Hardware characteristics of the vSphere node and vSphere cluster are discovered using the TCA inventory service. The platform inventory data is made available by the discovery service to the Cluster Automation Policy engine to assist the Kubernetes cluster placement. TCA admin can add tags to the infrastructure inventory to provide additional business logic on top of the discovered data.

n Cluster Automation Policy: The Cluster Automation policy defines the mapping of the TCA Kubernetes Cluster template to infrastructure. VMware Telco Cloud Platform allows TCA admins to map the resources using a Cluster Automation Policy to identify and group the infrastructure to assist users in deploying higher-level components on them. The Cluster Automation Policy indicates the intended usage of the infrastructure. During the cluster creation time, TCA validates whether the Kubernetes template requirements are met by the underlying infrastructure resources.

n K8s Bootstrapper: When the deployment requirements are met, TCA generates a deployment specification. The K8s Bootstrapper uses the Kubernetes cluster APIs to create the Cluster based on the deployment specification. Bootstrapper is a component of the TCA-CP.

Design Recommendation: Create unique Kubernetes cluster templates for each Telco Cloud Platform RAN system profile. For more information, see the Workload Profile and Cluster Sizing section.
Design Justification: Cluster templates serve as a blueprint for Kubernetes cluster deployments and are intended to minimize repetitive tasks, enforce best practices, and define guard rails for infrastructure management.
Design Implication: Kubernetes templates must be maintained to align with the latest CNF requirements.

Design Recommendation: When creating the Tanzu Kubernetes management cluster template, define a single network label for all nodes across the cluster.
Design Justification: Tanzu Kubernetes management cluster nodes require only a single NIC per node.
Design Implication: None.

Design Recommendation: When creating workload cluster templates, define only the network labels required for Tanzu Kubernetes management and CNF OAM.
Design Justification: Network labels are used to create vNICs on each node. Data plane vNICs that require SR-IOV are added as part of the node customization during CNF deployment. Late binding of vNICs saves resource consumption on the SDDC infrastructure because resources are allocated only during CNF instantiation.
Design Implication: None.

Design Recommendation: When creating workload cluster templates, enable the Multus CNI for clusters that host Pods requiring multiple NICs.
Design Justification: Multus CNI enables the attachment of multiple network interfaces to a Pod. Multus acts as a "meta-plugin", a CNI plugin that can call multiple other CNI plugins.
Design Implication: Multus is an upstream plugin and follows the community support model.

Design Recommendation: When creating workload cluster templates, enable whereabouts if cluster-wide IPAM is required for secondary Pod NICs.
Design Justification: Simplifies IP address assignment for secondary Pod NICs. Whereabouts is cluster-wide, compared to the default IPAM that comes with most CNIs such as macvlan.
Design Implication: Whereabouts is an upstream plugin and follows the community support model.

Design Recommendation: When defining workload cluster templates, enable the nfs_client CSI for multi-access read and write support (see the sketch after this table).
Design Justification: Some CNF vendors require support for many read/write persistent volumes. The NFS provider supports Kubernetes RWX persistent volume types.
Design Implication: The NFS backend must be onboarded separately, outside of Telco Cloud Automation.

Design Recommendation: When defining workload and management Kubernetes templates, enable a Taint on all Control Plane nodes.
Design Justification: Improves the security, stability, and management of the control plane.
Design Implication: None.

Design Recommendation: When defining workload cluster and management templates, do not enable multiple node pools for the Kubernetes Control node.
Design Justification: Telco Cloud Platform supports only a single Control node group per cluster.
Design Implication: None.

Design Recommendation: When defining a workload cluster template, if a cluster is designed to host CNFs with different performance profiles, create a separate node pool for each profile and define unique node labels to distinguish node members between node pools.
Design Justification: Node labels can be used with the Kubernetes scheduler for CNF placement logic, and node pools simplify the CNF placement logic when a cluster is shared between CNFs with different placement requirements.
Design Implication: Too many node pools might lead to resource underutilization.

Design Recommendation: Pre-define a set of infrastructure tags and apply the tags to SDDC infrastructure resources based on the CNF and Kubernetes resource requirements.
Design Justification: Tags simplify the grouping of infrastructure components. Tags can be based on hardware attributes or business logic.
Design Implication: Infrastructure tag mapping requires administrative-level visibility into the infrastructure composition.

Design Recommendation: Pre-define a set of CaaS tags and apply the tags to each Kubernetes cluster template defined by the TCA admin.
Design Justification: Tags simplify the grouping of Kubernetes templates. Tags can be based on hardware attributes or business logic.
Design Implication: Kubernetes template tag mapping requires advanced knowledge of CNF requirements. Kubernetes template mapping can be performed by the TCA admin with assistance from Kubernetes admins.

Design Recommendation: Pre-define a set of CNF tags and apply the tags to each CSAR file uploaded to the CNF catalog.
Design Justification: Tags simplify the searching of CaaS resources.
Design Implication: None.
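As referenced in the nfs_client recommendation above, the following is a minimal sketch of a ReadWriteMany PersistentVolumeClaim backed by an NFS-based StorageClass; the claim name, StorageClass name, and size are placeholders:

```yaml
# Illustrative RWX PersistentVolumeClaim backed by an NFS provisioner (names and size are placeholders)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-config
spec:
  accessModes:
  - ReadWriteMany              # RWX requires a backend such as NFS; block-backed vSphere volumes are typically RWO
  storageClassName: nfs-client # placeholder StorageClass exposed by the NFS client provisioner
  resources:
    requests:
      storage: 10Gi
```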


Important After deploying the resources with Telco Cloud Automation, you cannot rename infrastructure objects such as Datastores or Resource Pools.

Core Capabilities of Telco Cloud Automation

n xNF management (G‑xNFM) unifies and standardizes the network function management across VM‑ and container‑based infrastructures.

n Domain Orchestration (NFVO) simplifies the design and management of centralized or distributed multi‑vendor network services.

n Multi‑Cloud Infrastructure and CaaS Automation eases multi-cloud registration (VIM/Kubernetes), enables centralized CaaS management, synchronizes multi‑cloud inventories/resources, and collects faults and performance logs from infrastructure to network functions.

n Policy and Placement Engine enables intent‑based and multi‑cloud workload/ policy placements from the network core to edge, and from private to public clouds.

Tanzu Kubernetes Cluster Design

The Tanzu Kubernetes clusters are deployed in the compute workload domains.

Telco Cloud Platform RAN consumes resources from the compute workload domain. Resource pools provide guaranteed resource availability to workloads. Resource pools are elastic; more resources can be added as their capacity grows. Each Kubernetes cluster can be mapped to a resource pool. A resource pool can be dedicated to a Kubernetes cluster or shared across multiple clusters.

In a RAN deployment design with a Regional Data Center and Cell Sites, the Kubernetes control plane nodes can be placed on a vSphere cluster at the Regional Data Center, and worker nodes can be placed on a vSphere host at the Cell Site to support CNF workloads. Both the vSphere cluster and the vSphere host are managed by a vCenter Server in the compute workload domain.


Recommended Resource Workload Domain Design

Design Recommendation: Map the Tanzu Kubernetes clusters to vSphere resource pools in the compute workload domain.
Design Justification: Enables resource guarantees and resource isolation.
Design Implication: During resource contention, workloads can be starved for resources and can experience performance degradation. Note: You must proactively perform monitoring and capacity management, and add capacity before contention occurs.

Design Recommendation: Create dedicated DHCP IP subnet pools for the Tanzu Kubernetes cluster management network, and dedicate a static IP address for the Kubernetes API endpoint.
Design Justification: Simplifies the IP address assignment to Kubernetes clusters. Use static reservations in the DHCP pool to reserve the Kube-Vip address.
Design Implication: DHCP servers must be monitored for availability, and you must ensure that address scopes do not overlap with IP addresses that are already in use.

Design Recommendation: Place the Kubernetes cluster management network on a virtual network that is routable to the management network for vSphere, Harbor, and the repository mirror.
Design Justification: Provides connectivity to the vSphere infrastructure, simplifies the network design, and reduces the network complexity.
Design Implication: Increases the network address management overhead and requires additional security configuration to allow traffic between the resource and management domains.

When you allocate resource pools to Kubernetes clusters, consider the following guidelines:

n Enable 1:1 Kubernetes cluster to resource pool mapping for data plane intensive workloads. Reduced resource contention leads to better performance, and this mapping provides better resource isolation, resource guarantees, and reporting.

n Enable N:1 Kubernetes cluster to resource pool mapping for control plane workloads where resources are shared. This mapping enables efficient use of the server resources and high workload density.

n Use vRealize Operations Manager to provide recommendations on the required resources by analyzing performance statistics.

n Consider the total number of ESXi hosts and Kubernetes cluster limits.


Management and Workload Kubernetes Clusters

A Kubernetes cluster in Telco Cloud Platform RAN consists of etcd and the Kubernetes control and data planes.

n Etcd: Etcd must run in the cluster mode with an odd number of cluster members to establish a quorum. A 3-node cluster tolerates the loss of a single member, while a 5-node cluster tolerates the loss of two members. In a stacked mode deployment, etcd availability determines the number of Kubernetes Control nodes.

n Control Plane node: The Kubernetes control plane must run in redundant mode to avoid a single point of failure. To improve API availability, Kube-Vip is placed in front of the Control Plane nodes.

Component Availability:

n API Server: Active/Active
n Kube-controller-manager: Active/Passive
n Kube-scheduler: Active/Passive

Important Do not place CNF workloads on the control plane nodes.
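One common way this separation is enforced is the standard Kubernetes control plane taint, which only tolerating system Pods can bypass. A minimal sketch is shown below; the node name is a placeholder, and the taint key can differ between Kubernetes versions and distributions:

```yaml
# Standard control plane taint (node name is a placeholder; the key varies by Kubernetes version/distribution)
apiVersion: v1
kind: Node
metadata:
  name: control-plane-01
spec:
  taints:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule
```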

Worker Node

5G RAN workloads are classified based on their performance characteristics. Generic workloads such as web services, lightweight databases, and monitoring dashboards are supported adequately using standard configurations on Kubernetes nodes. In addition to the recommendations outlined in the Tuning vCloud NFV for Data Plane Intensive Workloads white paper, the data plane workload performance can benefit from further tuning in the following areas:

n NUMA Topology

n CPU Core Affinity

n Huge Pages

NUMA Topology: When deploying Kubernetes worker nodes that host high data bandwidth applications, ensure that the processor, memory, and vNIC are vertically aligned and remain within a single NUMA boundary.

Figure 3-11. NUMA and CPU Affinity (CNF vCPUs and memory pages are aligned within a single NUMA node; the worker node VMs use the advanced settings sched.mem.lpage.enable1GPage=TRUE and sched.cpu.latencySensitivity=HIGH)

The Topology Manager is a component of the kubelet that provides NUMA awareness to Kubernetes at Pod admission time. The Topology Manager determines the best locality of resources by pulling topology hints from the Device Manager and the CPU Manager. Pods are then placed based on the topology information to ensure optimal performance.

Note Topology Manager is optional if the NUMA placement best practices are followed during Kubernetes cluster creation.

CPU Core Affinity: CPU pinning can be achieved in different ways. The Kubernetes built-in CPU Manager is the most common; its implementation is based on cpuset. When a worker node VM initializes, the host CPU resources are assigned to a shared CPU pool. All non-exclusive CPU containers run on the CPUs in the shared pool. When the kubelet creates a container requesting a guaranteed CPU, the CPUs for that container are removed from the shared pool and assigned exclusively for the life cycle of the container. When a container with exclusive CPUs is terminated, its CPUs are added back to the shared CPU pool.

The CPU manager includes the following two policies:

n None: Default policy. The kubelet uses the CFS quota to enforce pod CPU limits. The workload can move between different CPU cores depending on the load on the Pod and the available capacity on the worker node.

n Static: With the static policy enabled, a container with an integer CPU request in a Guaranteed Pod is allocated whole CPUs exclusively, and no other container can be scheduled on those CPUs.

Note For data plane intensive workloads, the CPU manager policy must be set to static to guarantee an exclusive CPU core on the worker node.

CPU Manager for Kubernetes (CMK) is another tool used by selective CNF vendors to assign the core and NUMA affinity for data plane workloads. Unlike the built-in CPU manager, CMK is not bundled with Kubernetes binaries and it requires separate download and installation. CMK must be used over the built-in CPU manager if required by the CNF vendor.
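For reference, the following is a minimal sketch of the kubelet settings that correspond to these CPU and NUMA recommendations. In Telco Cloud Platform RAN, these values are applied through the cluster template and the NodeConfig Operator rather than edited by hand, and the reserved CPU list shown here is a placeholder:

```yaml
# Illustrative KubeletConfiguration fragment (values are examples only)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                 # exclusive cores for Guaranteed Pods with integer CPU requests
topologyManagerPolicy: single-numa-node  # align CPU and device (for example, SR-IOV VF) allocations to one NUMA node
reservedSystemCPUs: "0,1"                # keep cores for the kubelet and system daemons; size depends on pod density
```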


Huge Pages: For Telco workloads, the default huge page size can be 2 MB or 1 GB. To report its huge page capacity, the worker node determines the supported huge page sizes by parsing the /sys/kernel/mm/hugepages/hugepages-{size}kB directory on the host. Huge pages must be pre-allocated for maximum performance. Pre-allocated huge pages reduce the amount of available memory on a worker node, and a node can pre-allocate huge pages only for the default size. Transparent Huge Pages must be disabled.

Container workloads requiring huge pages use hugepages-<hugepagesize> in the Pod specification. As of Kubernetes 1.18, multiple huge page sizes are supported per Pod. Huge page allocation occurs at the Pod level.
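The following is a minimal sketch of a data plane Pod specification that requests 1 GB huge pages together with exclusive CPUs (Guaranteed QoS); the Pod name, image, and resource quantities are placeholders:

```yaml
# Illustrative Pod requesting 1 GB huge pages and whole CPUs (name, image, and quantities are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: dataplane-example
spec:
  containers:
  - name: dataplane
    image: registry.example.com/cnf/dataplane:1.0
    resources:
      requests:
        cpu: "4"                  # integer request equal to the limit => exclusive cores under the static CPU manager
        memory: 8Gi
        hugepages-1Gi: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi
        hugepages-1Gi: 4Gi
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
```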

Recommended tuning details:

Design Recommendation: Use three Control Plane nodes per Kubernetes cluster to ensure full redundancy.
Design Justification: A 3-node cluster tolerates the loss of a single member.
Design Implication: Each Control Plane node requires CPU and memory resources, and the CPU/memory overhead is high for small Kubernetes cluster sizes.

Design Recommendation: Install and activate the PTP clock synchronization service.
Design Justification: Kubernetes and its components rely on the system clock to track events, logs, state, and so on.
Design Implication: None.

Design Recommendation: Disable swap on all Kubernetes cluster nodes.
Design Justification: Swap causes a decrease in the overall performance of the cluster.
Design Implication: None.

Design Recommendation: Vertically align the processor, memory, and vNIC and keep them within a single NUMA boundary for data plane intensive workloads.
Design Justification: Higher packet throughput can be maintained for data transfer across vNICs within the same NUMA zone than across different NUMA zones. Latency sensitivity must be enabled for best-effort NUMA placement.
Design Implication: Requires an extra configuration step on vCenter to ensure NUMA alignment. Note: This is not required for generic workloads such as web services, lightweight databases, monitoring dashboards, and so on.

Design Recommendation: Set the CPU manager policy to static for data plane intensive workloads.
Design Justification: When the CPU manager is used for CPU affinity, the static mode is required to guarantee exclusive CPU cores on the worker node for data-intensive workloads.
Design Implication: Requires an extra configuration step for the CPU Manager through the NodeConfig Operator. Note: This is not required for generic workloads such as web services, lightweight databases, monitoring dashboards, and so on.

Design Recommendation: When enabling the static CPU manager policy, set aside sufficient CPU resources for the kubelet operation.
Design Justification: The kubelet requires a CPU reservation to ensure that the shared CPU pool is not exhausted under load. The amount of CPU to reserve depends on the pod density per node.
Design Implication: Requires an extra configuration step for the CPU Manager, and an insufficient CPU reservation can impact the Kubernetes cluster stability. Note: This is not required for generic workloads such as web services, lightweight databases, monitoring dashboards, and so on.

Design Recommendation: Enable huge page allocation at boot time.
Design Justification: Huge pages reduce TLB misses, and allocation at boot time prevents memory from becoming unavailable later due to fragmentation. Update the VM settings of the worker nodes for 1 GB huge pages, and enable IOMMU to protect system memory between I/O devices.
Design Implication: Pre-allocated huge pages reduce the amount of available memory on a worker node, an extra configuration step is required in the worker node VM GRUB configuration, and enabling huge pages requires a VM reboot. Note: This is not required for generic workloads such as web services, lightweight databases, monitoring dashboards, and so on.

Design Recommendation: Set the default huge page size to 1 GB and set the overcommit size to 0.
Design Justification: For 64-bit applications, use 1 GB huge pages if the platform supports them. The overcommit size defaults to 0, so no action is required.
Design Implication: For 1 GB pages, the huge page memory cannot be reserved after the system boot. Note: This is not required for generic workloads such as web services, lightweight databases, monitoring dashboards, and so on.

Design Recommendation: Mount a file system of type hugetlbfs on the root file system.
Design Justification: The hugetlbfs file system type is required by the mmap system call. Create an entry in fstab so that the mount point persists after a reboot.
Design Implication: Requires an extra configuration step in the worker node VM configuration. Note: This is not required for generic workloads such as web services, lightweight databases, monitoring dashboards, and so on.

Storage Considerations for RAN

In Kubernetes, a Volume is a directory on a disk that is accessible to the containers inside a pod. Kubernetes supports many types of volumes. However, for the Telco Cloud Platform RAN deployment with a Regional Data Center and Cell Site location, follow these recommendations when configuring your storage.

n Regional Data Center: vSAN. The vSphere cluster at the Regional Data Center can have three or more ESXi hosts to support vSAN storage.

n Cell Site: Local storage. The Cell Site may have a single-host configuration, so the local disk is the primary choice. You can also use any NFS storage that is available locally.

vSAN Storage Policies

If vSAN storage is used in the Regional Data Center (RDC), you must define and follow vSAN storage policies.


vSAN storage policies define the storage requirements for your StorageClasses. A cloud native Persistent Volume (PV) inherits the performance and availability characteristics made available by the vSAN storage policy. These policies determine how storage objects are provisioned and allocated within the datastore to guarantee the required level of service. A Kubernetes StorageClass is how the Cloud Admin describes the "classes" of storage available to a Tanzu Kubernetes cluster. Different StorageClasses map to different vSAN storage policies.
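The following is a minimal sketch of a StorageClass that maps to a vSAN storage policy through the vSphere CSI driver; the class name and policy name are placeholders chosen by the Cloud Admin:

```yaml
# Illustrative StorageClass mapped to a vSAN storage policy (class and policy names are placeholders)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsan-gold
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "RDC-Gold-Policy"   # vSAN storage policy defined in the Compute vCenter Server
```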

For more information about Cloud Native Storage Design for 5G Core, see the Telco Cloud Platform - 5G Edition Reference Architecture Guide 2.0.

Tanzu Basic for RAN Deployment Model

This section describes the Tanzu Basic for RAN deployment architecture and placement of its components in the Telco Cloud Platform RAN design.

Tanzu Kubernetes Management Cluster is a Kubernetes cluster that functions as the primary management and operational center for the Tanzu Basic for RAN instance. In this management cluster, the Cluster API runs to create Tanzu Kubernetes clusters and you configure the shared and in-cluster services that the clusters use.

Tanzu Kubernetes Workload Cluster is a Kubernetes cluster that is deployed from the Tanzu Kubernetes management cluster. Tanzu Kubernetes clusters can run different versions of Kubernetes, depending on the CNF workload requirements. Tanzu Kubernetes clusters support multiple types of CNIs for Pod-to-Pod networking, with Antrea as the default CNI, and use the vSphere CSI provider for storage by default. When deployed through Telco Cloud Automation, the VMware NodeConfig Operator is bundled into every workload cluster to handle the node Operating System (OS) configuration, performance tuning, and OS upgrades required for the various types of Telco CNF workloads for RAN.

Figure 3-12. Tanzu Basic for RAN Deployment Model (the Management Domain at the Regional Data Center hosts Telco Cloud Automation, TCA-CP, vCenter Server, NSX, vRO, vROps, and vRLI; the Compute Workload Domain WLD-1 at the RDC hosts the Kubernetes control plane nodes and worker nodes running the vCU; additional worker nodes running the vDU are placed on Cell Site hosts, supported by DHCP, DNS, NTP, and a PTP Grandmaster clock)

n In this design, Kubernetes control plane nodes are deployed at Regional Data Center (RDC) and Kubernetes worker node is deployed at Cell Site host.

n Telco Cloud Automation onboards the Cell Site host and orchestrates the deployment of Tanzu Kubernetes clusters.

n A dedicated DHCP server is available locally at RDC and Cell Site to support the DHCP service offering for Kubernetes clusters.

n Kubernetes Worker nodes are deployed at Regional Data Center and are extended to Cell Site locations to support the telco CNF workloads such as vCU and vDU in a geographically distributed manner.

CNF Design

This section outlines the CNF requirements and how CNFs are onboarded and instantiated in Telco Cloud Platform RAN.


HELM Charts

Helm is the de facto package manager for Kubernetes, and it is widely leveraged by CNF vendors to simplify container packaging. With Helm charts, dependencies between CNFs are handled in the formats agreed upon by the upstream community. This allows Telco operators to consume CNF packages in a declarative and easy-to-operate manner. With proper version management, Helm charts also simplify workload updates and inventory control.

A Helm repository is a required component in Telco Cloud Platform RAN. Production CNF Helm charts must be stored centrally and be accessible to the Tanzu Kubernetes clusters. To reduce the number of management endpoints, the Helm repository must work seamlessly with container images, so the container registry must be capable of supporting both container images and Helm charts.
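For illustration, the following is a minimal Chart.yaml for a hypothetical RAN CNF package; the chart name, versions, dependency, and repository URL are placeholders and are not tied to any specific vendor package:

```yaml
# Illustrative Chart.yaml for a hypothetical CNF package (all names, versions, and URLs are placeholders)
apiVersion: v2
name: vdu-example
description: Example packaging of a vDU network function
version: 1.0.0            # chart version tracked for updates and inventory control
appVersion: "21.07"       # CNF software version
dependencies:
- name: common-config     # shared sub-chart resolved through the Helm dependency mechanism
  version: 0.2.0
  repository: "https://charts.example.com"
```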

CSAR Design

Network Function (NF) Helm charts are uploaded as a catalog offering wrapped in an ETSI-compliant TOSCA YAML (CSAR) descriptor file. The descriptor file includes the structure and composition of the NF and supporting artifacts such as the Helm chart version, the provider, and a set of pre-instantiation jobs. RAN Network Functions have prerequisite configurations on the underlying Kubernetes cluster; those requirements are also defined in the Network Function CSAR. The features supported by the CSAR extensions include:

n SR-IOV Interface configuration and addition, along with DPDK binding

n NUMA Alignment of vCPUs and Virtual Functions

n Latency Sensitivity

n Custom Operating system package installations

n Full GRUB configuration

The following table outlines those CSAR extensions:

node_components

n kernel_type: The type and version of the Linux kernel. Note: Based on the kernel version and type, Telco Cloud Automation downloads and installs the kernel from the VMware Photon Linux repository during Kubernetes node customization.

n kernel_args: Kernel boot parameters, required for CPU isolation and so on. Parameters are free-form text strings with the following syntax: Key is the name of the parameter, and Value is the value corresponding to the key. Note: The Value field is optional for kernel parameters that do not require a value.

n kernel_modules: Kernel modules are specific to DPDK. When DPDK host binding is required, the name of the DPDK module and the relevant version are required.

n custom_packages: Custom packages include lxcfs, tuned, and pci-utils. Note: Telco Cloud Automation downloads and installs them from the VMware Photon Linux repository during node customization.

network

n deviceType: The type of network device, for example, SR-IOV.

n resourceName: The label in the Network Attachment Definition (NAD).

n dpdkBinding: The PCI driver that this device must use. Specify "igb_uio" or "vfio" for DPDK, or an equivalent driver depending on the vendor.

n count: The number of adapters required.

caas_components

n The CaaS components define the CaaS CNI, CSI, and Helm components for the Kubernetes cluster.
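The authoritative CSAR schema is documented in the Telco Cloud Automation User Guide. Purely as an illustration of how the extensions described above group together, a hypothetical fragment follows; every field value is a placeholder and this is not the exact TCA schema:

```yaml
# Hypothetical illustration of the CSAR infrastructure-requirement extensions (not the authoritative TCA schema)
node_components:
  kernel_type:
    name: linux-rt              # placeholder real-time kernel type
    version: 4.19-rt            # placeholder version
  kernel_args:
  - key: isolcpus               # CPU isolation example
    value: "2-15"
  - key: nosoftlockup           # example of a parameter without a value
  kernel_modules:
  - name: vfio                  # DPDK host binding example
    version: "1.0"
  custom_packages:
  - tuned
  - pci-utils
network:
- deviceType: sriov
  resourceName: midhaul-sriov   # label referenced by the Network Attachment Definition
  dpdkBinding: vfio
  count: 2
```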

CSAR files can be updated to reflect changes in CNF requirements or in the deployment model. CNF developers can update the CSAR package directly within the TCA designer or leverage an external CI/CD process to maintain and build newer versions of the CSAR package.


Design Recommendation: Deploy all containers using the TCA Manager interface.
Design Justification: Direct access to the Kubernetes cluster outside of the Kubernetes cluster admin is not supported.
Design Implication: Some containers might not be available with a CSAR package that can be deployed through Telco Cloud Automation.

Design Recommendation: Define all CNF infrastructure requirements using the TOSCA CSAR extension.
Design Justification: When infrastructure requirements are bundled into the CSAR package, Telco Cloud Automation provides placement assistance to locate a Kubernetes cluster that meets the CNF requirements. If Telco Cloud Automation cannot place the CNF workload due to the lack of resources, it leverages the node Operator to update an existing cluster with the required hardware and software based on the CSAR file definition.
Design Implication: CNF developers must work closely with CNF vendors to ensure that the infrastructure requirements are captured correctly in the CSAR file.

Design Recommendation: Store the TOSCA-compliant CSAR files in a Git repository.
Design Justification: A Git repository is ideal for centralized change control. CSAR package versioning, traceability, and peer review are built into Git when a proper git-flow is implemented and followed.
Design Implication: Git and git-flow are outside the scope of this reference architecture guide.

For more information about configuring a CSAR package, see the Telco Cloud Automation User Guide.

Note VMware offers a certification program for CNF and VNF vendors to create certified and validated CSAR packages.

Operations Management Design

The operations management design includes components such as vRealize Operations Manager and vRealize Log Insight that form the operations management layer in the Telco Cloud Platform RAN solution. This section provides guidance on the main design elements such as sizing, networking, and diagnostics.

Note vRealize Operations Manager and vRealize Log Insight are optional components for Telco Cloud Platform RAN.

vRealize Log Insight Design

The vRealize Log Insight cluster consists of one primary node and two worker nodes behind a load balancer. The vRealize Log Insight Cluster is located in the management cluster at Regional Data Center, and all logging traffic is sent across the site-to-site link.


Enable the Integrated Load Balancer (ILB) on the three-node cluster so that all log sources can address the cluster by its ILB. By using the ILB, you do not need to reconfigure log sources with a new destination address in case of a scale-out or node failure. The ILB also guarantees that vRealize Log Insight accepts all incoming ingestion traffic.

The ILB address is required for users to connect to vRealize Log Insight using either the Web UI or API. It is also required for clients to ingest logs using syslog or the Ingestion API. A vRealize Log Insight cluster can scale out to 12 nodes: 1 primary and 11 worker nodes.

Table 3-4. Recommended vRealize Log Insight Design

Design Recommendation: Configure the enterprise edge sites to forward syslog data to the centralized vRealize Log Insight cluster.
Design Justification: Provides a central logging infrastructure for all Cell Sites.
Design Implication: All logging traffic is sent over the site-to-site link.

vRealize Operations Manager Design

The vRealize Operations Manager deployment is a single instance of a 3-node analytics cluster that is deployed in the management cluster along with a two-node remote collector group.

The remote collectors collect data from the compute vCenter Servers in the management cluster. Deploying remote collectors into enterprise edge sites would generate unnecessary traffic and occupy compute resources.

Table 3-5. Recommended vRealize Operations Manager Design

Design Recommendation: Configure vRealize Operations Manager to collect metrics from the compute vCenter Server.
Design Justification: Provides an operations management infrastructure for all Cell Sites.
Design Implication: As Cell Sites are added, more data nodes and remote collectors must be added.

For more information about the vRealize Operations Manager and vRealize Log Insight design at Regional Data Center, see the Telco Cloud Platform 2.0 Reference Architecture guide.
