Client API Reference, v1.0.2

app_common.proto

TrialArtifacts

List of artifacts produced by a single trial run and saved to user configured storage.

Field | Type | Label | Description
configuration_files string repeated

A list containing the trial configuration in yaml format

summary_data_files string repeated

Trial summary statistics

port_metrics_data_files string repeated

Trial port metrics

error_msg_files string repeated

Trial error log

flow_metrics_data_files string repeated

Trial flow metrics

data_chunk_data_files string repeated

Trial data chunk transfer metrics

ArtifactType

Name | Number | Description
ART_UNSPECIFIED 0

Unspecified

ART_CONFIGURATION 1

Configuration artifact

ART_SUMMARY_DATA 2

Summary metrics artifact

ART_PORT_METRICS_DATA 5

Port metrics artifact

ART_ERROR_MSG 9

Trial error log artifact

ART_DATA_CHUNK_DATA 10

Data chunk transfer metrics artifact

ART_FLOW_METRICS_DATA 11

Flow metrics artifact
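
The ArtifactType values appear to line up one-to-one with the repeated file-list fields of TrialArtifacts. The sketch below illustrates that correspondence in plain Python. The `ArtifactType` class, the `FIELD_FOR_TYPE` table and the `files_for` helper are hypothetical stand-ins written for this example, not the generated protobuf classes, and the type-to-field mapping is inferred from the names rather than stated by the reference.

```python
from enum import IntEnum
from typing import Dict, List


class ArtifactType(IntEnum):
    """Hypothetical mirror of the ArtifactType numbers listed above."""
    ART_UNSPECIFIED = 0
    ART_CONFIGURATION = 1
    ART_SUMMARY_DATA = 2
    ART_PORT_METRICS_DATA = 5
    ART_ERROR_MSG = 9
    ART_DATA_CHUNK_DATA = 10
    ART_FLOW_METRICS_DATA = 11


# Assumed mapping from artifact type to TrialArtifacts field name (inferred from naming).
FIELD_FOR_TYPE: Dict[ArtifactType, str] = {
    ArtifactType.ART_CONFIGURATION: "configuration_files",
    ArtifactType.ART_SUMMARY_DATA: "summary_data_files",
    ArtifactType.ART_PORT_METRICS_DATA: "port_metrics_data_files",
    ArtifactType.ART_ERROR_MSG: "error_msg_files",
    ArtifactType.ART_DATA_CHUNK_DATA: "data_chunk_data_files",
    ArtifactType.ART_FLOW_METRICS_DATA: "flow_metrics_data_files",
}


def files_for(artifacts: Dict[str, List[str]], kind: ArtifactType) -> List[str]:
    """Return the file list for one artifact type from a dict-shaped TrialArtifacts."""
    field = FIELD_FOR_TYPE.get(kind)
    if field is None:
        # ART_UNSPECIFIED has no corresponding field.
        return []
    return list(artifacts.get(field, []))
```

Note that the enum numbers are not contiguous (3, 4 and 6 to 8 are unused), so code should look values up by name or number rather than iterate over a numeric range.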

dse.proto

dse.proto

A data model and APIs for describing a Keysight AI Data Center Builder trial and operations.

AbortTrialRequest

Input for AbortTrial API

AbortTrialResponse

Contains information about a trial abort

Field | Type | Label | Description
state AbortState

State of abort operation

message string optional

A message containing relevant information for the abort.

CollectiveImplementations

Wrapper message for gRPC response.

Field | Type | Label | Description
collective_implementations common.CollectiveImplementation repeated

CreateBindingRequest

Field | Type | Label | Description
infrastructure_profile profiles.InfraProfile

Infrastructure profile

prev_binding bind.Binding optional

Previous binding (if any), useful for incremental changes to existing bindings

platform_regions bind.PlatformRegion repeated

Assigns platforms to different regions of the infrastructure. Leave empty for current use cases: the DSE server raises an error if this field is non-empty and the feature flag 'onearm' is not in use.

platform common.PlatformType

Platform Type

CreateBindingResponse

Field | Type | Label | Description
binding bind.Binding

Binding

GetDiagnosticFileRequest

Field | Type | Label | Description
trial_timestamps google.protobuf.Timestamp repeated

Timestamps of trial reports

result_ids string repeated

Result IDs of trial reports

GetDiagnosticFileResponse

Field | Type | Label | Description
filepath string

File path to the archived log files (zip, tar, etc.)

url string

NA

ListTrialReportsRequest

Field | Type | Label | Description
workspace string

Trial workspace to filter by

tags string repeated

Trial tags to filter by

retrieve_custom_files bool

Whether to include custom files from the trial report

ListTrialReportsResponse

Field | Type | Label | Description
trial_reports TrialReport repeated

List of trial reports

RunLogs

Message in a stream of updates returned while running a trial.

The client can print the log messages returned by each update to have a live indication of progress.

Field | Type | Label | Description
log_messages string

Log output

timestamp google.protobuf.Timestamp

Log timestamp

severity_level SeverityLevel

Log severity

component_name string

Log source
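
As an illustration of printing RunLogs updates for live progress, the sketch below formats each update into one line and filters by severity. The `RunLogs` dataclass and the `SeverityLevel` enum are hypothetical plain-Python stand-ins for the generated message classes, and `format_update` and `print_progress` are helper names invented for this example. A real client would iterate over the gRPC response stream instead of a list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import IntEnum
from typing import Iterable, List


class SeverityLevel(IntEnum):
    """Hypothetical mirror of the SeverityLevel enum in this reference."""
    LEVEL_UNSPECIFIED = 0
    LEVEL_DEBUG = 1
    LEVEL_INFO = 2
    LEVEL_WARNING = 3
    LEVEL_ERROR = 4
    LEVEL_CRITICAL = 5


@dataclass
class RunLogs:
    """Hypothetical stand-in for one RunLogs stream message."""
    log_messages: str
    timestamp: datetime
    severity_level: SeverityLevel
    component_name: str


def format_update(update: RunLogs) -> str:
    """Render one update as 'HH:MM:SS [LEVEL] component: message'."""
    level = update.severity_level.name.replace("LEVEL_", "", 1)
    stamp = update.timestamp.strftime("%H:%M:%S")
    return f"{stamp} [{level}] {update.component_name}: {update.log_messages}"


def print_progress(stream: Iterable[RunLogs],
                   min_level: SeverityLevel = SeverityLevel.LEVEL_INFO) -> List[str]:
    """Print every update at or above min_level and return the printed lines."""
    shown = []
    for update in stream:
        if update.severity_level >= min_level:
            line = format_update(update)
            print(line)
            shown.append(line)
    return shown
```

Filtering on the numeric enum value works here because the severity numbers increase from DEBUG to CRITICAL.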

SetTagsRequest

Set tags for a workspace or trial.

Replaces previous set of tags.

Field | Type | Label | Description
workspace string

Workspace of trial report

trial_report TrialReport

Trial report

tags string repeated

Tags to set to a trial report

Trial

Trial configuration.

Field | Type | Label | Description
workspace WorkspaceSpec

The workspace that the trial will be stored under. If the workspace does not exist, it will be created.

tags string repeated

A list of tags associated with this trial.

platform common.PlatformType

Platform for infrastructure involved in the trial

nccl_config common.NcclConfig

Configuration settings specific to nccl-tests trials

rocev2 common.Rocev2Transport

RoCEv2 transport settings

kccb kccb.Config

Keysight Collective Communications Application

binding bind.Binding optional

Bindings of logical infrastructure to physical resources, typically obtained from CreateBinding()

trial_meta TrialMeta

Trial metadata, like version info

TrialMeta

Holds the model version.

Field | Type | Label | Description
model_version string

API data model version

is_readonly bool

Marks whether a trial configuration is read only or can be run

TrialReport

Contains information about a trial that has run, is running, or has yet to be started.

Field | Type | Label | Description
timestamp google.protobuf.Timestamp

The timestamp signifies the time at which a trial run was started.

workspace string

Workspace name

path string

The storage path of the trial directory.

tags string repeated

User provided tags associated with the trial.

description string

Trial description

state TrialState

Stores the current state of the trial.

system_tags string repeated

System provided tags associated with the trial.

end_timestamp google.protobuf.Timestamp

The timestamp signifies the time at which a trial run completed

kccb_summary app_common.SummaryTable

Trial summary statistics

kccb_artifacts app_common.TrialArtifacts

A collection of artifacts generated by the trial that are saved in storage.

WorkspaceSpec

Describes a workspace to be created/updated.

If the workspace already exists, the list of tags will be appended to that workspace if they are not already attached to that workspace.

Field | Type | Label | Description
name string

Workspace name

tags string repeated

Tags
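
The reference describes two different tag behaviours: SetTagsRequest replaces the previous set of tags, while WorkspaceSpec appends tags that are not already attached to an existing workspace. The sketch below contrasts the two on plain lists. Both function names are hypothetical, and preserving the original tag order is an assumption, since the reference does not say how the server orders tags.

```python
from typing import List


def set_tags(existing: List[str], new: List[str]) -> List[str]:
    """SetTagsRequest semantics: the new list replaces the previous tags."""
    return list(new)


def append_workspace_tags(existing: List[str], new: List[str]) -> List[str]:
    """WorkspaceSpec semantics: append only the tags not already attached."""
    merged = list(existing)
    for tag in new:
        if tag not in merged:
            merged.append(tag)
    return merged
```

For example, starting from `["a", "b"]`, setting `["b", "c"]` yields `["b", "c"]`, while appending `["b", "c"]` yields `["a", "b", "c"]`.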

AbortState

Name | Number | Description
ABORT_UNDEFINED 0

State Unspecified

ABORT_INITIATED 1

Abort initiated

ABORT_ERROR 2

Abort errored

SeverityLevel

Name | Number | Description
LEVEL_UNSPECIFIED 0

LEVEL_DEBUG 1

LEVEL_INFO 2

LEVEL_WARNING 3

LEVEL_ERROR 4

LEVEL_CRITICAL 5

TrialState

Name | Number | Description
UNSPECIFIED 0

UNCONFIGURED 1

CONFIGURATION_IN_PROGRESS 2

CONFIGURATION_SUCCESSFUL 3

RUN_IN_PROGRESS 4

RUN_SUCCESSFUL 5

ERROR 6

ABORTED 7

TERMINATED 8

ABORT_IN_PROGRESS 9

DseService

Method Name | Request Type | Response Type | Description
CreateBinding CreateBindingRequest CreateBindingResponse

Creates rank and physical bindings for use with a Trial object in ConfigureTrial/RunTrial.

ConfigureTrial Trial RunLogs stream

ConfigureTrial sets up the trial based on the parameters provided. If the low-level config is not included, the server will use the high-level spec to generate the corresponding low-level config, which is returned as part of the response.

AbortTrial AbortTrialRequest AbortTrialResponse

Aborts the currently running trial and returns information about the abort. Returns an error if no Trial has been configured.

RunTrial .google.protobuf.Empty RunLogs stream

Run the currently configured trial and receive streaming updates. Returns an error if no Trial has been configured

GetTrial .google.protobuf.Empty Trial

Returns the trial that is currently configured. If no trial is currently configured, the returned object will be uninitialized.

GetTrialReport .google.protobuf.Empty TrialReport

TrialReport contains state information (not started, in progress, successful, error). The pattern is modeled after https://cloud.google.com/apis/design/design_patterns#long_running_operations, although it does not match completely.

SetTags SetTagsRequest .google.protobuf.Empty

Set tags for a workspace or trial. Replaces previous set of tags.

GetCollectiveImplementations .google.protobuf.Empty CollectiveImplementations

Returns a list of all (CC operation, algorithmic implementation) pairs available on the DSE server.

ListTrialReports ListTrialReportsRequest ListTrialReportsResponse

Returns a list of all TrialReports in the server's storage.

GetDiagnosticFile GetDiagnosticFileRequest GetDiagnosticFileResponse
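
The method descriptions above imply a call order: CreateBinding, then ConfigureTrial, then RunTrial, with GetTrialReport available for state. The sketch below models only that ordering with an in-memory fake. `FakeDseService`, its snake_case method names and the dict-shaped messages are hypothetical and are not the generated gRPC stub. The state transitions it performs are assumptions based on the TrialState names.

```python
from typing import Iterator, List, Optional, Tuple


class FakeDseService:
    """Hypothetical in-memory stand-in that models call ordering only."""

    def __init__(self) -> None:
        self._trial: Optional[dict] = None
        self.state = "UNCONFIGURED"  # assumed initial TrialState

    def create_binding(self, infrastructure_profile: dict) -> dict:
        # The real server computes rank, NIC and physical bindings here.
        return {"infrastructure_profile": infrastructure_profile}

    def configure_trial(self, trial: dict) -> Iterator[str]:
        # Streams log lines while configuring, like the RunLogs stream.
        self.state = "CONFIGURATION_IN_PROGRESS"
        yield "configuring"
        self._trial = trial
        self.state = "CONFIGURATION_SUCCESSFUL"

    def run_trial(self) -> Iterator[str]:
        # As a generator, the error surfaces when the stream is first read.
        if self._trial is None:
            raise RuntimeError("no Trial has been configured")
        self.state = "RUN_IN_PROGRESS"
        yield "running"
        self.state = "RUN_SUCCESSFUL"

    def get_trial_report(self) -> dict:
        return {"state": self.state}


def run_workflow(svc: FakeDseService, profile: dict) -> Tuple[List[str], dict]:
    """Bind, configure, run, then fetch the report, collecting log lines."""
    binding = svc.create_binding(profile)
    trial = {"workspace": {"name": "demo"}, "binding": binding}
    logs = list(svc.configure_trial(trial))
    logs += list(svc.run_trial())
    return logs, svc.get_trial_report()
```

With a real client, the two streaming calls would be consumed as they arrive rather than collected into a list, and GetTrialReport could be polled from another thread while the run is in progress.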

bind.proto

binding.proto

Data model definitions for binding logical infrastructure elements to ranks, NICs and physical resources.

Binding

Binds various settings to logical infrastructure elements.

The type of bound information depends on the type of binding.

Field | Type | Label | Description
custom_binding CustomBinding

Binding type

infrastructure_profile profiles.InfraProfile

The Binding keeps a copy of the infrastructure profile so that Trials can be re-run with the original versions of (potentially modified) profiles.InfraProfile

infrastructure keysight_chakra.infra.Infrastructure

Low-level Chakra model of the infrastructure matching infrastructure_profile

infra_annotations keysight_chakra.infra.Annotation repeated

Annotations for low-level Chakra infrastructure.

platform_regions PlatformRegion repeated

Assigns platforms to different regions of the infrastructure. Should be populated by the server and be a copy of the platform_regions sent in CreateBindingRequest.

CustomBinding

Binds:

- Ranks to Logical NPUs

- NIC settings to Logical NICs

- Assigned test resources

Field | Type | Label | Description
rank_bindings RankBinding repeated

Ranks to Logical NPUs binding

nic_bindings NicBinding repeated

NIC settings to Logical NICs binding

physical_bindings PhysicalBinding repeated

Assigned test resources binding

InfraRef

Reference to a logical infrastructure element

Field | Type | Label | Description
device_instance_name string

Name of the logical infrastructure device (e.g. the value dse_infra.GenericHost.name was set to)

device_index int32

0-based device index

component_name string

Name of the logical infrastructure component

component_index int32

0-based component index

InfraRegion

Field | Type | Label | Description
boundary_refs InfraRef repeated

These components mark the boundary of a region within the infrastructure.

NicBinding

Logical NIC Settings

Field | Type | Label | Description
infra_ref InfraRef

Reference to a NIC defined in the Chakra infrastructure

nic_settings common.NicSettings

Auto-populated, can be overridden by the user

associated_physical_bindings InfraRef repeated

Test ports that may carry flows from the corresponding NIC reference

PhysicalBinding

Field | Type | Label | Description
infra_ref InfraRef

Reference to the logical infrastructure element that the physical resource represents

platform common.PlatformType

Platform type

chassis_location common.ChassisInfo

Platform location, if Keysight Hardware platforms

server_location common.ServerInfo

Platform location, if NCCL-tests or Keysight Software platforms

layer1 common.Layer1

Layer 1 settings of the test port

PlatformRegion

Assigns a platform to a region in the infrastructure. Currently for internal use only.

Field | Type | Label | Description
region InfraRegion

platform common.PlatformType optional

RankBinding

Assign Ranks to Logical NPUs

Field | Type | Label | Description
infra_ref InfraRef

Reference to NPU used by this rank

rank_id int32

Rank ID

nic_refs InfraRef repeated

List of NICs available to NPU (in the same host)
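
To make the shape of a rank binding concrete, the sketch below builds RankBinding-like records for a set of identical hosts. Bindings are typically obtained from CreateBinding(), so this is only an illustration of the data layout. The dataclasses are hypothetical stand-ins for the generated messages, the component names "npu" and "nic" are invented, and the host-major rank numbering is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class InfraRef:
    """Hypothetical stand-in for the InfraRef message."""
    device_instance_name: str
    device_index: int       # 0-based device index
    component_name: str
    component_index: int    # 0-based component index


@dataclass
class RankBinding:
    """Hypothetical stand-in for the RankBinding message."""
    infra_ref: InfraRef
    rank_id: int
    nic_refs: List[InfraRef] = field(default_factory=list)


def build_rank_bindings(host_name: str, host_count: int, npu_count: int) -> List[RankBinding]:
    """Assign one rank per NPU, numbering ranks host by host (assumed order)."""
    bindings = []
    for host in range(host_count):
        # NICs in the same host are the ones available to each of its NPUs.
        nics = [InfraRef(host_name, host, "nic", i) for i in range(npu_count)]
        for npu in range(npu_count):
            npu_ref = InfraRef(host_name, host, "npu", npu)
            bindings.append(RankBinding(npu_ref, host * npu_count + npu, list(nics)))
    return bindings
```

Calling `build_rank_bindings("host", 2, 2)` produces four bindings with rank IDs 0 to 3, where ranks 2 and 3 reference device index 1.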

common.proto

common.proto

Common data models

Algorithm

Algorithm message specifies a choice of system provided Expanders or a user provided custom implementation

Field | Type | Label | Description
system AlgorithmType

custom string

ChassisInfo

Field | Type | Label | Description
address string

Chassis IP Address or FQDN

port string

Chassis port. Formats <front-panel-port> or <front-panel-port>.<fanout>. Examples: '1' or '1.4'
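
The two port formats can be told apart by splitting on the dot. The sketch below parses a ChassisInfo port string into its parts. `ChassisPort` and `parse_chassis_port` are hypothetical names, and treating both parts as decimal integers is an assumption based on the examples '1' and '1.4'.

```python
from typing import NamedTuple, Optional


class ChassisPort(NamedTuple):
    front_panel_port: int
    fanout: Optional[int]  # None when the port string has no fanout part


def parse_chassis_port(port: str) -> ChassisPort:
    """Parse '<front-panel-port>' or '<front-panel-port>.<fanout>'."""
    parts = port.split(".")
    if len(parts) not in (1, 2) or not all(p.isdigit() for p in parts):
        raise ValueError(f"unsupported chassis port format: {port!r}")
    front = int(parts[0])
    fanout = int(parts[1]) if len(parts) == 2 else None
    return ChassisPort(front, fanout)
```

With this helper, `parse_chassis_port("1.4")` gives front-panel port 1 with fanout 4, and `parse_chassis_port("1")` gives port 1 with no fanout.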

CollectiveImplementation

Used to specify a (collective type, collective algorithm) pair.

The collective algorithm determines how a collective communication operation is expanded into a set of peer-to-peer operations.

Field | Type | Label | Description
type keysight_chakra.mlcommons.CollectiveCommType

algorithm Algorithm

CongestionControl

Congestion control mechanisms

Field | Type | Label | Description
ecn ExplicitCongestionNotifications

ECN configuration

pfc PriorityFlowControl

PFC configuration

dcqcn_rate_control DCQCNRateControl

DCQCN rate control configuration

DCQCNRateControl

Data Center Quantized Congestion Notification rate control settings

Field | Type | Label | Description
enabled bool optional

alpha_factor int32 optional

alpha_interval int32

initial_alpha int32

rate_after_first_cnp int32

rate_decrement_factor float optional

min_rate_limit int32

rate_decrement_coefficient int32

rate_decrement_interval int32

clamp_target_rate bool optional

rate_increment_interval int32

rate_increment_byte_counter int32

rate_increment_threshold int32

additive_rate_increment int32

hyper_rate_increment int32

time_between_cnps int32

ExplicitCongestionNotifications

ECN configuration

Field | Type | Label | Description
cnp_dscp int32 optional

DSCP of CNP packets

data_ecn_bits EcnBits optional

Configures the ECN bits for data packets

control_ecn_bits EcnBits optional

Configures the ECN bits for control packets, e.g. RoCEv2 ACKs

cnp_ecn_bits EcnBits optional

Configures the ECN bits for CNP packets

Ipv4Addressing

Field | Type | Label | Description
ip_address string

IP address

ip_prefix uint32

IP prefix

ip_gateway_address string

Gateway address

Ipv6Addressing

Field | Type | Label | Description
ip_address string

IP address

ip_prefix uint32

IP prefix

ip_gateway_address string

Gateway address
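
Ipv4Addressing and Ipv6Addressing share the same three fields, so one client-side sanity check can cover both. The sketch below uses Python's standard `ipaddress` module. `check_addressing` is a hypothetical helper, and requiring the gateway to sit inside the interface's subnet is an assumption, not a rule stated by this reference.

```python
import ipaddress


def check_addressing(ip_address: str, ip_prefix: int, ip_gateway_address: str) -> bool:
    """Return True if the gateway has the same IP version and is on the same subnet.

    Raises ValueError if the address, prefix or gateway cannot be parsed.
    """
    iface = ipaddress.ip_interface(f"{ip_address}/{ip_prefix}")
    gateway = ipaddress.ip_address(ip_gateway_address)
    return gateway.version == iface.version and gateway in iface.network
```

For example, address `10.0.0.2` with prefix 24 and gateway `10.0.0.1` passes, while gateway `10.0.1.1` does not.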

Layer1

Layer 1 configuration

Field | Type | Label | Description
speed_mode SpeedMode

Speed and port mode (speed, modulation, FEC)

auto_negotiate bool

Auto negotiation

link_training bool

Link Training

ieee_defaults bool

IEEE Defaults settings

NcclConfig

Field | Type | Label | Description
custom_env_vars NcclConfig.CustomEnvVarsEntry repeated

NcclConfig.CustomEnvVarsEntry

Field | Type | Label | Description
key string

value string

NicSettings

Field | Type | Label | Description
ethernet_mtu int32

Ethernet Maximum transmission unit

ipv4_addressing Ipv4Addressing

IP addressing

ipv6_addressing Ipv6Addressing

IP addressing

congestion_control CongestionControl optional

Congestion control

mac_address string optional

MAC address

vlan Vlan optional

VLAN Tags

roce_transport_settings Rocev2TransportSettings

RoCEv2 transport settings

PriorityFlowControl

PFC configuration

Field | Type | Label | Description
enabled bool optional

Enable/disable PFC

priorities int32 repeated

PFC priorities

Rocev2Transport

Field | Type | Label | Description
rdma_message_size int32

(Maximum) RDMA message size in bytes

qps_per_rankpair int32

Number of Queue Pairs per rank pair

qp_negotiation RoCEv2QPNegotiationMethod

Queue Pair Negotiation method

verb RDMAVerb

RDMA verb to use for data transfers in collective communication operations

tcp_store_host string

TCPStore hostname or IP address (only applicable when qp_negotiation is METHOD_TCP_STORE)

tcp_store_port uint32

TCPStore port number (only applicable when qp_negotiation is METHOD_TCP_STORE)

retx_retry_interval_ms int32 optional

Configure the AckTimeout for RoCEv2 protocol

retx_retry_count int32 optional

Configure the RetransRetryCount for RoCEv2 protocol

max_retry_on_rnr_nak int32 optional

ack_request_interval int32 optional

Rocev2TransportSettings

Field | Type | Label | Description
ack_dscp int32 optional

Configures ACK DSCP value for RoCEv2 protocol

nack_dscp int32 optional

Configures NACK DSCP value for RoCEv2 protocol

data_dscp int32

Configures DSCP for all data traffic

ServerInfo

Information about servers being used for emulation

Field | Type | Label | Description
address string

Server IP address or FQDN

nic_interface string

Server NIC interface

Vlan

Field | Type | Label | Description
enabled bool

Enable/disable VLANs

vlan_tags VlanTag repeated

VLAN tag. Currently one VLAN tag is required.

VlanTag

VLAN Configuration

Field | Type | Label | Description
priority int32

Priority

vlan_id int32

VLAN ID

AlgorithmType

Algorithm message specifies a choice of system provided Expanders or a user provided custom implementation

Name | Number | Description
ALGO_UNSPECIFIED 0

ALGO_ALL_TO_ALL_PARALLEL 1

ALGO_ALL_TO_ALL_PXN 2

ALGO_ALL_REDUCE_UNIDIRECTIONAL_RING 10

ALGO_ALL_REDUCE_BIDIRECTIONAL_RING 11

ALGO_ALL_REDUCE_VECTOR_HALVING_DOUBLING 12

ALGO_ALL_REDUCE_DOUBLE_BINARY_TREE 13

ALGO_ALL_GATHER_RING 20

ALGO_REDUCE_SCATTER_RING 30

ALGO_BROADCAST_PARALLEL 40

ALGO_GATHER_PARALLEL 50

EcnBits

ECN field within an IP packet

Name | Number | Description
ECN_UNSPECIFIED 0

Unspecified

ECN_DISABLED 1

ECN disabled

ECN_ECT1 2

ECN capable

ECN_ECT0 3

ECN capable

PlatformType

Name | Number | Description
PLATFORM_UNSPECIFIED 0

UNSPECIFIED

PLATFORM_KEYS_NCCL_TEST 1

Keysight Software Solution for NCCL-tests orchestration on on-prem servers

PLATFORM_KEYS_SW_AGENT 2

Keysight Software Solution for on-prem servers

PLATFORM_KEYS_HW 3

Keysight Hardware Platform

RDMAVerb

Name | Number | Description
VERB_UNSPECIFIED 0

VERB_WRITE 1

RDMA WRITE

VERB_SEND 2

RDMA SEND

RoCEv2RNRTimeout

RoCEv2 RNR Timeout. Relevant only for RC QPs

Name | Number | Description
TIMEOUT_655360_MU 0

TIMEOUT_10_MU 1

TIMEOUT_20_MU 2

TIMEOUT_30_MU 3

TIMEOUT_40_MU 4

TIMEOUT_60_MU 5

TIMEOUT_80_MU 6

TIMEOUT_120_MU 7

TIMEOUT_160_MU 8

TIMEOUT_240_MU 9

TIMEOUT_320_MU 10

TIMEOUT_480_MU 11

TIMEOUT_640_MU 12

TIMEOUT_960_MU 13

TIMEOUT_1280_MU 14

TIMEOUT_1920_MU 15

TIMEOUT_2560_MU 16

TIMEOUT_3840_MU 17

TIMEOUT_5120_MU 18

TIMEOUT_7680_MU 19

TIMEOUT_10240_MU 20

TIMEOUT_15360_MU 21

TIMEOUT_20480_MU 22

TIMEOUT_30720_MU 23

TIMEOUT_40960_MU 24

TIMEOUT_61440_MU 25

TIMEOUT_81920_MU 26

TIMEOUT_122880_MU 27

TIMEOUT_163840_MU 28

TIMEOUT_245760_MU 29

TIMEOUT_327680_MU 30

TIMEOUT_491520_MU 31

SpeedMode

A tuple of speed, modulation and FEC mode configuration

Name | Number | Description
UNSPECIFIED 0

MODE_100GE_NRZ_RS_FEC 9

MODE_100GE_NRZ_NO_FEC 10

MODE_100GE_PAM4_53G_KP4_FEC 8

MODE_100GE_PAM4_106G_RS_FEC 7

MODE_100GE_PAM4_106G_KP4_FEC 6

MODE_200GE_PAM4_53G_KP4_FEC 5

MODE_200GE_PAM4_106G_KP4_FEC 4

MODE_400GE_PAM4_53G_KP4_FEC 3

MODE_400GE_PAM4_106G_KP4_FEC 2

MODE_800GE_PAM4_106G_KP4_FEC 1
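
Each SpeedMode name encodes the tuple of speed, modulation and FEC described above. The sketch below splits a name into those parts with a regular expression. `SpeedModeParts` and `parse_speed_mode` are hypothetical names, and reading the optional `53G` or `106G` token as a signaling rate is an assumption, since the reference does not define it.

```python
import re
from typing import NamedTuple, Optional

# MODE_<speed>GE_<modulation>[_<rate>G]_<fec>_FEC
_SPEED_MODE = re.compile(r"^MODE_(\d+)GE_(NRZ|PAM4)(?:_(\d+)G)?_(RS|NO|KP4)_FEC$")


class SpeedModeParts(NamedTuple):
    speed_ge: int                 # e.g. 400 for 400GE
    modulation: str               # "NRZ" or "PAM4"
    signaling_g: Optional[int]    # 53 or 106 when present (assumed meaning)
    fec: str                      # "RS", "KP4" or "NO"


def parse_speed_mode(name: str) -> SpeedModeParts:
    """Split a SpeedMode enum name into its components."""
    match = _SPEED_MODE.match(name)
    if match is None:
        raise ValueError(f"not a recognised SpeedMode name: {name!r}")
    speed, modulation, rate, fec = match.groups()
    return SpeedModeParts(int(speed), modulation, int(rate) if rate else None, fec)
```

The `UNSPECIFIED` value does not follow the pattern and is rejected with a `ValueError`.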

SpeedType

Name | Number | Description
SPEED_UNSPECIFIED 0

SPEED_100G 1

100 Gigabit Ethernet

SPEED_200G 2

200 Gigabit Ethernet

SPEED_400G 3

400 Gigabit Ethernet

SPEED_800G 4

800 Gigabit Ethernet

kccb.proto

Data model for describing the Collective Benchmark trial

Benchmark

Benchmark message is a high-level abstraction of a Chakra workload.

The message MUST be converted into a list of Chakra et_def.proto Node messages, and the dse.Experiment.workload field should be populated with the converted list.

Field | Type | Label | Description
collective_algorithm common.Algorithm

Communication collective algorithm

datasize Datasize

Data size definition as a factor

datasize_list DatasizeList

Custom data size definition

iterations int32

Number of repeated benchmark runs

Config

Collective Benchmark Application configuration

Field | Type | Label | Description
benchmark Benchmark

Benchmark definition

Datasize

Datasize is a container for specifying the data sizes for each Collective benchmark run

Field | Type | Label | Description
start uint64

Data size to start the benchmark collective operation

step uint32

Factor to increase the data size for each run

end uint64

Maximum data size after which the benchmark completes
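
To show how a Datasize might translate into the sizes used by each run, the sketch below expands start, step and end into a list. `expand_datasize` is a hypothetical helper. It assumes that step is a multiplicative factor of at least 2 and that end is inclusive. The reference only describes step as a "factor", so the server's exact behaviour may differ.

```python
from typing import List


def expand_datasize(start: int, step: int, end: int) -> List[int]:
    """Return start, start*step, start*step^2, ... up to and including end."""
    if start <= 0 or step < 2:
        # Guard against a sequence that would never grow past end.
        raise ValueError("start must be positive and step must be at least 2")
    sizes = []
    size = start
    while size <= end:
        sizes.append(size)
        size *= step
    return sizes
```

Under these assumptions, start 1024, step 2 and end 8192 give four runs: 1024, 2048, 4096 and 8192 bytes. A DatasizeList can express sequences that do not fit this pattern.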

DatasizeList

DatasizeList allows for specifying a custom list of data sizes

Field | Type | Label | Description
size_bytes uint64 repeated

The size of data in bytes

dse_infra.proto

Data models for logical infrastructure used in the trial

BlackboxFabric

Field | Type | Label | Description
name string optional

Fabric name

Fabric

Field | Type | Label | Description
blackbox BlackboxFabric

Non-detailed generic network fabric (a blackbox)

GenericHost

Contains the input parameters for the generic host builder

Field | Type | Label | Description
name string optional

Host name

npu_count uint32

Number of NPUs. Generic hosts have a 1:1 ratio of NICs and NPUs

custom_bandwidth_gbps uint32

NPU Interconnect bandwidth

Host

High level definition of hosts in logical infrastructure

Field | Type | Label | Description
count uint32

Number of hosts

generic GenericHost

Generic host

use_npu_interconnect bool

Emulate NPU interconnects

Infrastructure

Infrastructure configuration comprising hosts and the network fabric

Field | Type | Label | Description
host Host

Hosts

fabric Fabric optional

Network fabric

Scalar Value Types

.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby
double |  | double | double | float | float64 | double | float | Float
float |  | float | float | float | float32 | float | float | Float
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required)
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required)
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required)
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum
bool |  | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8)
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT)