ccamp Y. Tan, Ed. Internet-Draft Y. Zheng Intended status: Informational China Unicom Expires: 24 April 2026 Q. Hu, Ed. W. Wang J. Zhang Y. Zhao Beijing University of Posts and Telecommunications 21 October 2025 Unified Optical Networks and AI Computing Orchestration (UONACO) Problem Statement, Use Cases and Requirements draft-tan-ccamp-uonaco-problem-statement-00 Abstract Distributed artificial intelligence (AI) computing is increasingly deployed across geographically dispersed AI data centers (AIDCs) to meet the scale and performance demands of modern AI workloads. In such environments, the efficiency of distributed training, inference, and remote service access depends critically on tight coordination between optical transport networks and compute orchestration systems. However, today's infrastructure operates with isolated control planes: optical networks lack awareness of dynamic compute requirements, while compute schedulers have no visibility into real- time network conditions such as latency, bandwidth, or congestion. This decoupling leads to suboptimal resource utilization, degraded job performance, and inefficient scaling. This document presents the problem statement, outlines three representative use cases—distributed AI training, distributed AI inference, and remote AI service access—and specifies the requirements for Unified Optical Networks and AI Computing Orchestration (UONACO). The goal is to enable bidirectional awareness, joint resource abstraction, and synchronized control across the compute-optical boundary, thereby supporting intent- driven, end-to-end provisioning of AI services over wide-area optical infrastructures. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Tan, et al. Expires 24 April 2026 [Page 1] Internet-Draft UONACO: Problem, Use Cases, Requirements October 2025 Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). 
Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on 24 April 2026. Copyright Notice Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/ license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 2. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 3 2.1. Isolated Control and Management . . . . . . . . . . . . . 4 2.2. Independent Resource Efficiency Evaluation . . . . . . . 5 3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.1. Distributed AI Training . . . . . . . . . . . . . . . . . 6 3.2. Distributed AI Inference . . . . . . . . . . . . . . . . 6 3.3. Accessing Remote AI Service . . . . . . . . . . . . . . . 6 4. Requirements . . . . . . . . . . . . . . . . . . . . . . . . 7 4.1. Integrated Control and Management Architecture . . . . . 7 4.2. Unified Abstraction of Computing and Network Resources . 7 4.3. Joint Orchestration of Computing and Network Resources . 7 5. 
IANA Considerations . . . . . . . . . . . . . . . . . . . . . 8 6. Security Considerations . . . . . . . . . . . . . . . . . . . 8 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 8 7.1. Normative References . . . . . . . . . . . . . . . . . . 8 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 8

1.  Introduction

The rapid proliferation of large-scale AI applications, particularly those involving distributed training and inference across geographically dispersed AIDCs, has exposed a critical gap in today's infrastructure: the lack of coordination between optical transport networks and compute orchestration systems. While optical networks provide the high-bandwidth, low-latency, and deterministic connectivity required for wide-area AI computing collaboration, their control planes remain largely agnostic to the dynamic, heterogeneous demands of AI workloads. Conversely, compute schedulers operate without visibility into the underlying network's real-time state, such as path latency, available bandwidth, or congestion levels.

This decoupling leads to suboptimal resource utilization, degraded job performance, and inefficient scaling of distributed AI jobs. For instance, a training job may be scheduled across distant AIDCs with abundant GPU capacity but poor optical connectivity, resulting in prolonged synchronization phases and significant compute efficiency loss. Similarly, inference services with strict latency requirements may be routed through paths that meet compute criteria but violate network service-level objectives.

To address these challenges, UONACO enables bidirectional awareness, joint resource abstraction, and synchronized control across the compute-optical boundary. This document describes sample usage scenarios that drive the UONACO requirements and will help to identify candidate solution architectures.
1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2.  Problem Statement

      +------------------------------------------+
      |        AI Compute Service Request        |
      |     (e.g., Distributed Training Job)     |
      +--------------------+---------------------+
                           |
                +----------v----------+
                |                     |
                v                     v
      +------------------+   +---------------------+
      |     Compute      |   |   Optical Network   |
      |    Scheduler     |   |     Controller      |
      +------------------+   +---------------------+
               |                                  |
               v                                  v
      +------------------+            +-----------------+
      |      AIDC-A      |<--Low-BW-->|     AIDC-B      |
      |    GPU: High     |(e.g., 100G)|    GPU: High    |
      |    Load: Low     |            |    Load: Low    |
      +--------+---------+            +---------+-------+
               |                                |
               |     High-BW (e.g., 400G)       |
               +<------------------------------>+
               |                                |
               |     High-BW (e.g., 400G)       |
               +<------------------------------>+
               |                                |
      +--------v---------+                      |
      |      AIDC-C      |<---------------------+
      |     GPU: Low     |
      |    Load: High    |
      +------------------+

     Figure 1: Suboptimal Resource Allocation under Decoupled Control

2.1.  Isolated Control and Management

The primary challenge lies in the management and control isolation between the computing domain (e.g., AI training clusters, cloud pools) and the optical transport network. This separation creates a "chasm" that prevents holistic resource optimization.

The optical network control plane operates without awareness of the real-time characteristics and requirements of the compute jobs it carries. It cannot perceive critical parameters such as the bandwidth intensity of a distributed training job, the strict latency budget of an inference request, or the fluctuating resource demands of a compute job. Consequently, it is unable to proactively establish, adjust, or tear down optical paths (e.g., OTN circuits, OXC switched paths) to optimally serve the compute workload, leading to suboptimal network configurations and underutilized bandwidth.

Conversely, the compute orchestration layer (e.g., Kubernetes schedulers, AI job managers) makes resource allocation and job placement decisions based primarily on local compute and storage metrics (e.g., GPU availability, memory). It lacks visibility into the underlying network's state, including path latency, available bandwidth, or congestion levels between candidate data centers. This results in compute jobs being scheduled across locations with poor network connectivity, causing significant communication bottlenecks and degraded overall job performance (i.e., "compute efficiency loss").

This bidirectional lack of awareness creates a fundamental mismatch, where neither domain can adapt to the needs of the other, severely limiting the potential of wide-area collaborative computing.

2.2.  Independent Resource Efficiency Evaluation

Compounding the control isolation is the absence of a unified framework for evaluating the joint efficiency of compute and network resources. Today, network performance is evaluated using traditional metrics such as bandwidth utilization, latency, and packet loss, while compute performance is assessed through metrics such as FLOPS (floating-point operations per second), job completion time, and resource utilization (CPU/GPU/memory). These evaluation systems operate in silos. There is no standardized method to quantify the combined cost and benefit of a joint compute-and-network resource allocation decision.
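As an illustration of what such a common evaluation language could look like, the sketch below scores a candidate (compute, path) allocation with a single figure of merit. All names, numbers, and the scoring formula are hypothetical, chosen only to show compute power and network quality traded off in one framework rather than evaluated in silos; a real model would account for overlap, congestion, and loss.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One (compute, path) allocation option.  All fields are illustrative."""
    name: str
    flops: float           # sustained compute rate, FLOP/s
    bandwidth_bps: float   # available optical path bandwidth, bit/s
    rtt_s: float           # round-trip path latency, seconds

def estimated_iteration_time(c: Candidate, work_flop: float,
                             sync_bits: float, sync_rounds: int) -> float:
    """Joint metric: one training iteration = compute phase plus a
    parameter-synchronization phase carried over the optical path."""
    compute_s = work_flop / c.flops
    network_s = sync_bits / c.bandwidth_bps + sync_rounds * c.rtt_s
    return compute_s + network_s

# The question posed by Figure 1: a distant but more powerful GPU pool
# versus a local but weaker one, for the same training job.
distant = Candidate("remote AIDC", flops=2.0e15, bandwidth_bps=100e9, rtt_s=0.030)
local   = Candidate("local AIDC",  flops=1.0e15, bandwidth_bps=400e9, rtt_s=0.002)

work = 1.0e15       # FLOP per iteration (hypothetical workload)
sync = 8 * 100e9    # 100 GB of gradients per iteration, in bits
best = min((distant, local),
           key=lambda c: estimated_iteration_time(c, work, sync, sync_rounds=1))
print(best.name)    # here, 4x the path bandwidth outweighs half the FLOPS
```

With these numbers the local candidate wins (about 3.0 s versus 8.5 s per iteration) even though it has half the raw compute, which is exactly the kind of trade-off that siloed metrics cannot express.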
For instance, as shown in Figure 1, it is difficult to answer whether allocating a more powerful but distant GPU (with higher network latency) is more efficient than a less powerful but local one for a specific AI job. This lack of a common evaluation language prevents the development of truly optimal co-scheduling algorithms that balance compute power against network quality.

3.  Use Cases

The growing scale and distribution of AI workloads have created new demands on wide-area optical infrastructure, particularly in scenarios that span multiple AIDCs. Three representative use cases illustrate the need for tighter integration between optical transport networks and compute orchestration systems.

3.1.  Distributed AI Training

In distributed AI training, large models are increasingly trained across geographically separated AIDCs due to physical and operational constraints within any single site. This requires frequent synchronization of model parameters over long-haul optical links, where performance is highly sensitive to network latency, bandwidth availability, and packet loss. Without awareness of the underlying compute job characteristics, optical networks cannot provision connections that match the dynamic communication patterns of training iterations, leading to significant inefficiencies in GPU utilization and extended training times.

3.2.  Distributed AI Inference

Distributed AI inference presents a complementary challenge: workloads exhibit diverse service-level requirements, ranging from strict latency bounds for interactive applications to high-throughput processing for batch analytics. To meet these varied objectives, inference jobs are often deployed across a hierarchical compute fabric that includes cloud, regional, and edge AIDCs.
Effective placement of these jobs depends not only on local compute capacity but also on the quality of optical connectivity to the user or upstream service. However, current inference orchestrators lack visibility into optical path conditions such as available bandwidth or propagation delay, while optical networks remain oblivious to the latency and throughput expectations of the inference jobs they carry. This disconnect can result in suboptimal job placement and violations of service-level agreements, highlighting the need for bidirectional signaling between compute and network control planes.

3.3.  Accessing Remote AI Service

The third use case, accessing remote AI inference services, reflects the growing trend of enterprises and end users consuming AI capabilities on demand via APIs. In this scenario, users request access to remote inference or training resources through high-level APIs, expecting predictable performance and reliability. Traditional best-effort Internet transport is inadequate for such services, which instead require deterministic, isolated, and dynamically adjustable optical channels. Fulfilling these requests demands an integrated control framework capable of jointly evaluating compute availability and optical path feasibility, then orchestrating end-to-end resource allocation across both domains. This scenario underscores the need for standardized protocols that can carry both compute intent and network constraints, enabling automated, intent-driven provisioning of cross-domain AI services.

4.  Requirements

4.1.  Integrated Control and Management Architecture

A new, integrated control architecture is required to break down the management silos. This architecture must facilitate bidirectional fusion between the compute and network control planes.
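As a sketch of what such bidirectional fusion could look like at the interface level, the fragment below pairs two abstractions: the optical controller exposes abstracted path state to the compute orchestrator, and the orchestrator exposes abstracted job intent to the controller. Every class, field, and method name here is hypothetical and illustrative only; the point is the direction of information flow across the compute-optical boundary, not a concrete API.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class PathState:
    """Optical network state abstracted for the compute orchestrator."""
    src_aidc: str
    dst_aidc: str
    available_bw_gbps: float
    latency_ms: float
    reliability: float   # e.g., availability over a measurement window

@dataclass
class JobRequirement:
    """Compute-job intent abstracted for the optical controller."""
    job_id: str
    min_bw_gbps: float    # e.g., gradient-synchronization demand
    max_latency_ms: float # e.g., inference latency budget

class OpticalController(Protocol):
    """Northbound view the network offers to the compute orchestrator."""
    def get_path_state(self, src: str, dst: str) -> PathState: ...
    def provision_path(self, req: JobRequirement, src: str, dst: str) -> bool: ...
    def release_path(self, job_id: str) -> None: ...

class ComputeOrchestrator(Protocol):
    """View the compute domain offers to the optical network controller."""
    def get_job_requirements(self) -> list[JobRequirement]: ...
    def notify_degradation(self, state: PathState) -> None: ...
```

Under this sketch, a placement decision would consult get_path_state() before selecting AIDCs, while the controller could call notify_degradation() to trigger rescheduling, so either domain can initiate adaptation.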
It should allow the optical network to dynamically create, adjust, and tear down connections (e.g., leveraging OXC, ODUk, and fgOTN flexible scheduling mechanisms) based on explicit instructions or implicit signals derived from compute job requirements. Conversely, it should enable the compute orchestration system to perform resource planning and online tuning based on the real-time quality and state of the underlying optical network. The goal is to compress the response time for provisioning a compute-and-network service (e.g., "dynamic compute entry") from the current hour level down to the minute level.

4.2.  Unified Abstraction of Computing and Network Resources

A common language is needed for both domains to understand each other's capabilities and state. This requires a unified resource abstraction model. The optical network's heterogeneous resources (e.g., at the OTN, optical, and Ethernet layers; or physical ports, spectrum, time slots) must be abstracted into a model that can be understood by the compute orchestrator. This model should expose key attributes such as available bandwidth, latency, and reliability. Similarly, heterogeneous compute resources (CPU, GPU, memory, storage) must be abstracted into a model that conveys their real-time capabilities (e.g., FLOPS, available memory/VRAM) and state to the network controller. This unified abstraction is the foundation for any joint decision-making process.

4.3.  Joint Orchestration of Computing and Network Resources

Building on the integrated architecture and unified abstraction, a joint orchestration mechanism is required to make intelligent, end-to-end resource allocation decisions. This mechanism should include compute-aware optical network dynamic adjustment algorithms that can perform application-driven optical path reconfiguration and adaptive bandwidth allocation.
It should also include network-aware compute joint orchestration algorithms that can perform dynamic compute resource optimization and contention-aware job scheduling.

The orchestration should be driven by a unified evaluation system that can assess the joint efficiency of a scheduling decision by considering both compute performance metrics (e.g., job completion time) and network performance metrics (e.g., bandwidth cost, latency) in a single, cohesive framework.

5.  IANA Considerations

TBD

6.  Security Considerations

TBD

7.  References

7.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.

Authors' Addresses

Yanxia Tan (editor)
China Unicom
Email: tanyx11@chinaunicom.cn

Yanlei Zheng
China Unicom
Email: zhengyanlei@chinaunicom.cn

Qiaojun Hu (editor)
Beijing University of Posts and Telecommunications
Email: qiaoj475@bupt.edu.cn

Wei Wang
Beijing University of Posts and Telecommunications
Email: weiw@bupt.edu.cn

Jie Zhang
Beijing University of Posts and Telecommunications
Email: jie.zhang@bupt.edu.cn

Yongli Zhao
Beijing University of Posts and Telecommunications
Email: yonglizhao@bupt.edu.cn