List of artifacts produced by a single trial run and saved to user configured storage.
Field | Type | Label | Description |
configuration_files | string | repeated | A list containing the trial configuration in yaml format |
summary_data_files | string | repeated | Trial summary statistics |
port_metrics_data_files | string | repeated | Trial port metrics |
error_msg_files | string | repeated | Trial error log |
flow_metrics_data_files | string | repeated | Trial flow metrics |
data_chunk_data_files | string | repeated | Trial data chunk transfer metrics |
Name | Number | Description |
ART_UNSPECIFIED | 0 | Unspecified |
ART_CONFIGURATION | 1 | Configuration artifact |
ART_SUMMARY_DATA | 2 | Summary metrics artifact |
ART_PORT_METRICS_DATA | 5 | Port metrics artifact |
ART_ERROR_MSG | 9 | Trial error log artifact |
ART_DATA_CHUNK_DATA | 10 | Data chunk transfer metrics artifact |
ART_FLOW_METRICS_DATA | 11 | Flow metrics artifact |
dse.proto
A data model and APIs for describing a Keysight AI Data Center Builder trial and operations.
Input for AbortTrial API
Contains information about a trial abort
Field | Type | Label | Description |
state | AbortState | | State of abort operation |
message | string | optional | A message containing relevant information for the abort. |
Wrapper message for gRPC response.
Field | Type | Label | Description |
collective_implementations | common.CollectiveImplementation | repeated | |
Field | Type | Label | Description |
infrastructure_profile | profiles.InfraProfile | | Infrastructure profile |
prev_binding | bind.Binding | optional | Previous binding (if any), useful for incremental changes to existing bindings |
platform_regions | bind.PlatformRegion | repeated | Assigns platforms to different regions of the infrastructure. Left empty for current use cases. The DSE server raises an error if this is non-empty and the feature flag 'onearm' is not in use. |
platform | common.PlatformType | | Platform Type |
Field | Type | Label | Description |
binding | bind.Binding | | Binding |
Field | Type | Label | Description |
trial_timestamps | google.protobuf.Timestamp | repeated | Timestamps of trial reports |
result_ids | string | repeated | Result IDs of trial reports |
Field | Type | Label | Description |
filepath | string | | File path to the archived log files (zip, tar, etc.) |
url | string | | NA |
Field | Type | Label | Description |
workspace | string | | Trial workspace to filter by |
tags | string | repeated | Trial tags to filter by |
retrieve_custom_files | bool | | Whether to include custom files from the trial report |
Field | Type | Label | Description |
trial_reports | TrialReport | repeated | List of trial reports |
Message in a stream of updates returned while running a trial.
The client can print the log messages returned by each update to have a live indication of progress.
Field | Type | Label | Description |
log_messages | string | | Log output |
timestamp | google.protobuf.Timestamp | | Log timestamp |
severity_level | SeverityLevel | | Log severity |
component_name | string | | Log source |
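As a minimal sketch of the "print each update" idea described above, the snippet below formats one update as a single log line. The `RunLogsUpdate` dataclass is a hypothetical stand-in that only mirrors the four fields in the table; it is not the generated protobuf class, and the output layout is an arbitrary choice.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Severity names copied from the SeverityLevel enum in this document.
SEVERITY_NAMES = {
    0: "LEVEL_UNSPECIFIED",
    1: "LEVEL_DEBUG",
    2: "LEVEL_INFO",
    3: "LEVEL_WARNING",
    4: "LEVEL_ERROR",
    5: "LEVEL_CRITICAL",
}


@dataclass
class RunLogsUpdate:
    """Hypothetical stand-in for one streamed update (not the generated class)."""
    log_messages: str
    timestamp: datetime
    severity_level: int
    component_name: str


def format_update(update: RunLogsUpdate) -> str:
    """Render one update as a single printable line."""
    level = SEVERITY_NAMES.get(update.severity_level, "LEVEL_UNSPECIFIED")
    stamp = update.timestamp.isoformat()
    return f"{stamp} [{level}] {update.component_name}: {update.log_messages}"


update = RunLogsUpdate(
    log_messages="configuring ports",
    timestamp=datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    severity_level=2,
    component_name="dse",
)
print(format_update(update))
```

A real client would call a formatter like this once per message received from the ConfigureTrial or RunTrial stream.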
Set tags for a workspace or trial.
Replaces previous set of tags.
Field | Type | Label | Description |
workspace | string | | Workspace of trial report |
trial_report | TrialReport | | Trial report |
tags | string | repeated | Tags to set on a trial report |
Trial configuration.
Field | Type | Label | Description |
workspace | WorkspaceSpec | | The workspace that the trial will be stored under. If the workspace does not exist, it will be created. |
tags | string | repeated | A list of tags associated with this trial. |
platform | common.PlatformType | | Platform for infrastructure involved in the trial |
nccl_config | common.NcclConfig | | Configuration settings specific to nccl-tests trials |
rocev2 | common.Rocev2Transport | | RoCEv2 transport settings |
kccb | kccb.Config | | Keysight Collective Communications Application |
binding | bind.Binding | optional | Bindings of logical infrastructure to physical resources, typically obtained from CreateBinding() |
trial_meta | TrialMeta | | Trial metadata, like version info |
Holds the model version.
Field | Type | Label | Description |
model_version | string | | API data model version |
is_readonly | bool | | Marks whether a trial configuration is read only or can be run |
Contains information about a trial that has run, is running, or has yet to be started.
Field | Type | Label | Description |
timestamp | google.protobuf.Timestamp | | The timestamp signifies the time at which a trial run was started. |
workspace | string | | Workspace name |
path | string | | The storage path of the trial directory. |
tags | string | repeated | User provided tags associated with the trial. |
description | string | | Trial description |
state | TrialState | | Stores the current state of the trial. |
system_tags | string | repeated | System provided tags associated with the trial. |
end_timestamp | google.protobuf.Timestamp | | The timestamp signifies the time at which a trial run completed |
kccb_summary | app_common.SummaryTable | | Trial summary statistics |
kccb_artifacts | app_common.TrialArtifacts | | A collection of artifacts generated by the trial that are saved in storage. |
Describes a workspace to be created/updated.
If the workspace already exists, any tags in the list that are not already attached to it are appended.
Field | Type | Label | Description |
name | string | | Workspace name |
tags | string | repeated | Tags |
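The two tag behaviours described in this file differ: a workspace spec appends tags that are not yet attached, while SetTags replaces the previous set. The sketch below contrasts them with plain Python lists. Both function names are hypothetical, and preserving the original tag order is an assumption rather than documented server behaviour.

```python
from typing import Iterable, List


def append_workspace_tags(existing: Iterable[str], new: Iterable[str]) -> List[str]:
    """Append only the tags that are not already attached (assumed order-preserving)."""
    merged = list(existing)
    for tag in new:
        if tag not in merged:
            merged.append(tag)
    return merged


def replace_tags(existing: Iterable[str], new: Iterable[str]) -> List[str]:
    """Discard the previous tags and keep only the new ones, as SetTags is described."""
    return list(new)


print(append_workspace_tags(["lab", "nightly"], ["nightly", "800g"]))
print(replace_tags(["lab", "nightly"], ["800g"]))
```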
Name | Number | Description |
ABORT_UNDEFINED | 0 | State Unspecified |
ABORT_INITIATED | 1 | Abort initiated |
ABORT_ERROR | 2 | Abort errored |
Name | Number | Description |
LEVEL_UNSPECIFIED | 0 | |
LEVEL_DEBUG | 1 | |
LEVEL_INFO | 2 | |
LEVEL_WARNING | 3 | |
LEVEL_ERROR | 4 | |
LEVEL_CRITICAL | 5 |
Name | Number | Description |
UNSPECIFIED | 0 | |
UNCONFIGURED | 1 | |
CONFIGURATION_IN_PROGRESS | 2 | |
CONFIGURATION_SUCCESSFUL | 3 | |
RUN_IN_PROGRESS | 4 | |
RUN_SUCCESSFUL | 5 | |
ERROR | 6 | |
ABORTED | 7 | |
TERMINATED | 8 | |
ABORT_IN_PROGRESS | 9 |
Method Name | Request Type | Response Type | Description |
CreateBinding | CreateBindingRequest | CreateBindingResponse | Creates rank and physical bindings for use with a Trial object in ConfigureTrial/RunTrial. |
ConfigureTrial | Trial | RunLogs stream | ConfigureTrial sets up the trial based on the parameters provided. If the low-level config is not included, the server will use the high-level spec to generate the corresponding low-level config, which is returned as part of the response. |
AbortTrial | AbortTrialRequest | AbortTrialResponse | Abort the currently running trial and receive the report for the aborted trial. Returns an error if no Trial has been configured. |
RunTrial | .google.protobuf.Empty | RunLogs stream | Run the currently configured trial and receive streaming updates. Returns an error if no Trial has been configured. |
GetTrial | .google.protobuf.Empty | Trial | Returns the trial that is currently configured. If no trial is currently configured, the returned object will be uninitialized. |
GetTrialReport | .google.protobuf.Empty | TrialReport | TrialReport contains state information (not started, in progress, successful, error). The pattern is modeled after https://cloud.google.com/apis/design/design_patterns#long_running_operations, although it does not match it completely. |
SetTags | SetTagsRequest | .google.protobuf.Empty | Set tags for a workspace or trial. Replaces previous set of tags. |
GetCollectiveImplementations | .google.protobuf.Empty | CollectiveImplementations | Returns a list of all (CC operation, algorithmic implementation) pairs available on the DSE server. |
ListTrialReports | ListTrialReportsRequest | ListTrialReportsResponse | Returns a list of all TrialReports in the server's storage. |
GetDiagnosticFile | GetDiagnosticFileRequest | GetDiagnosticFileResponse | |
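The ordering rule in the table above (RunTrial fails unless a trial was configured first) can be illustrated with a tiny in-memory stand-in. `FakeDseSession` is entirely hypothetical: it is not the DSE client or server, it makes no gRPC calls, and the state strings it sets are an assumed, simplified subset of the TrialState enum.

```python
class FakeDseSession:
    """Hypothetical in-memory stand-in; it is not the real DSE client or server."""

    def __init__(self):
        self.trial = None
        self.state = "UNCONFIGURED"

    def configure_trial(self, trial):
        # Assumed happy path: configuration always succeeds in this sketch.
        self.trial = trial
        self.state = "CONFIGURATION_SUCCESSFUL"

    def run_trial(self):
        # Mirrors the documented rule: error if no Trial has been configured.
        if self.trial is None:
            raise RuntimeError("no Trial has been configured")
        self.state = "RUN_SUCCESSFUL"
        return self.state


session = FakeDseSession()
try:
    session.run_trial()
except RuntimeError as err:
    print("RunTrial rejected:", err)

session.configure_trial({"workspace": "demo"})
print(session.run_trial())
```

With the real service, the same sequence would typically be CreateBinding, then ConfigureTrial, then RunTrial, followed by GetTrialReport.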
binding.proto
Data model definitions for binding logical infrastructure elements to ranks, NICs and physical resources.
Binds various settings to logical infrastructure elements.
The type of bound information depends on the type of binding.
Field | Type | Label | Description |
custom_binding | CustomBinding | | Binding type |
infrastructure_profile | profiles.InfraProfile | | The Binding keeps a copy of the infrastructure profile so that Trials can be re-run with the original versions of (potentially modified) profiles.InfraProfile |
infrastructure | keysight_chakra.infra.Infrastructure | | Low-level Chakra model of the infrastructure matching infrastructure_profile |
infra_annotations | keysight_chakra.infra.Annotation | repeated | Annotations for low-level Chakra infrastructure. |
platform_regions | PlatformRegion | repeated | Assigns platforms to different regions of the infrastructure. Should be populated by the server and be a copy of the platform_regions sent in CreateBindingRequest. |
Binds:
- Ranks to Logical NPUs
- NIC settings to Logical NICs
- Assigned test resources
Field | Type | Label | Description |
rank_bindings | RankBinding | repeated | Ranks to Logical NPUs binding |
nic_bindings | NicBinding | repeated | NIC settings to Logical NICs binding |
physical_bindings | PhysicalBinding | repeated | Assigned test resources binding |
Reference to a logical infrastructure element
Field | Type | Label | Description |
device_instance_name | string | | Name of the logical infrastructure device (e.g. the value dse_infra.GenericHost.name was set to) |
device_index | int32 | | 0-based device index |
component_name | string | | Name of the logical infrastructure component |
component_index | int32 | | 0-based component index |
Field | Type | Label | Description |
boundary_refs | InfraRef | repeated | These components mark the boundary of a region within the infrastructure. |
Logical NIC Settings
Field | Type | Label | Description |
infra_ref | InfraRef | | Reference to a NIC defined in the Chakra infrastructure |
nic_settings | common.NicSettings | | Auto-populated, can be overridden by user |
associated_physical_bindings | InfraRef | repeated | Test ports which may carry flows from the corresponding NIC reference |
Field | Type | Label | Description |
infra_ref | InfraRef | | Reference to the logical infrastructure element that the physical resource represents |
platform | common.PlatformType | | Platform type |
chassis_location | common.ChassisInfo | | Platform location, if Keysight Hardware platforms |
server_location | common.ServerInfo | | Platform location, if NCCL-tests or Keysight Software platforms |
layer1 | common.Layer1 | | Layer 1 settings of the test port |
Assigns a platform to a region in the infrastructure. Currently for internal use only.
Field | Type | Label | Description |
region | InfraRegion | | |
platform | common.PlatformType | optional | |
Assign Ranks to Logical NPUs
Field | Type | Label | Description |
infra_ref | InfraRef | | Reference to NPU used by this rank |
rank_id | int32 | | Rank ID |
nic_refs | InfraRef | repeated | List of NICs available to NPU (in the same host) |
common.proto
Common data models
Algorithm message specifies a choice of system provided Expanders or a user provided custom implementation
Field | Type | Label | Description |
system | AlgorithmType | | |
custom | string | | |
Field | Type | Label | Description |
address | string | | Chassis IP Address or FQDN |
port | string | | Chassis port. Formats <front-panel-port> or <front-panel-port>.<fanout>. Examples: '1' or '1.4' |
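The two port formats listed above can be split with a small parser. This is a sketch only: `parse_chassis_port` is a hypothetical helper, and it assumes both the front-panel port and the fan-out are plain decimal numbers, which matches the examples '1' and '1.4' but is not otherwise stated here.

```python
import re
from typing import Optional, Tuple

# <front-panel-port> or <front-panel-port>.<fanout>; digits only is an assumption.
_PORT_RE = re.compile(r"(\d+)(?:\.(\d+))?")


def parse_chassis_port(port: str) -> Tuple[int, Optional[int]]:
    """Split a chassis port string into (front_panel_port, fanout or None)."""
    match = _PORT_RE.fullmatch(port)
    if match is None:
        raise ValueError(f"invalid chassis port: {port!r}")
    front_panel = int(match.group(1))
    fanout = int(match.group(2)) if match.group(2) is not None else None
    return front_panel, fanout


print(parse_chassis_port("1"))    # (1, None)
print(parse_chassis_port("1.4"))  # (1, 4)
```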
Used to specify a (collective type, collective algorithm) pair.
The collective algorithm determines how a collective communication operation is expanded into
a set of peer-to-peer operations.
Field | Type | Label | Description |
type | keysight_chakra.mlcommons.CollectiveCommType | | |
algorithm | Algorithm | | |
Congestion control mechanisms
Field | Type | Label | Description |
ecn | ExplicitCongestionNotifications | | ECN configuration |
pfc | PriorityFlowControl | | PFC configuration |
dcqcn_rate_control | DCQCNRateControl | | DCQCN rate control configuration |
Data Center Quantized Congestion Notification rate control settings
Field | Type | Label | Description |
enabled | bool | optional | |
alpha_factor | int32 | optional | |
alpha_interval | int32 | | |
initial_alpha | int32 | | |
rate_after_first_cnp | int32 | | |
rate_decrement_factor | float | optional | |
min_rate_limit | int32 | | |
rate_decrement_coefficient | int32 | | |
rate_decrement_interval | int32 | | |
clamp_target_rate | bool | optional | |
rate_increment_interval | int32 | | |
rate_increment_byte_counter | int32 | | |
rate_increment_threshold | int32 | | |
additive_rate_increment | int32 | | |
hyper_rate_increment | int32 | | |
time_between_cnps | int32 | | |
ECN configuration
Field | Type | Label | Description |
cnp_dscp | int32 | optional | DSCP of CNP packets |
data_ecn_bits | EcnBits | optional | Configures the ECN bits for data packets |
control_ecn_bits | EcnBits | optional | Configures the ECN bits for control packets, e.g. RoCEv2 ACKs |
cnp_ecn_bits | EcnBits | optional | Configures the ECN bits for CNP packets |
Field | Type | Label | Description |
ip_address | string | | IP address |
ip_prefix | uint32 | | IP prefix |
ip_gateway_address | string | | Gateway address |
Field | Type | Label | Description |
ip_address | string | | IP address |
ip_prefix | uint32 | | IP prefix |
ip_gateway_address | string | | Gateway address |
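The two addressing tables above share the same three fields, so one client-side sanity check can cover both IPv4 and IPv6. The sketch below uses Python's standard `ipaddress` module. `check_addressing` is a hypothetical helper, and requiring the gateway to sit inside the interface's subnet is an assumption, not a rule stated in this document.

```python
import ipaddress


def check_addressing(ip_address: str, ip_prefix: int, ip_gateway_address: str) -> bool:
    """Return True if the gateway lies in the network implied by address/prefix.

    Raises ValueError if any of the inputs is not a valid address or prefix.
    """
    interface = ipaddress.ip_interface(f"{ip_address}/{ip_prefix}")
    gateway = ipaddress.ip_address(ip_gateway_address)
    # Guard against mixing IPv4 and IPv6 before testing membership.
    return gateway.version == interface.version and gateway in interface.network


print(check_addressing("10.0.0.5", 24, "10.0.0.1"))       # True
print(check_addressing("10.0.0.5", 24, "10.0.1.1"))       # False
print(check_addressing("2001:db8::5", 64, "2001:db8::1"))  # True
```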
Layer 1 configuration
Field | Type | Label | Description |
speed_mode | SpeedMode | | Speed and port mode (speed, modulation, FEC) |
auto_negotiate | bool | | Auto negotiation |
link_training | bool | | Link Training |
ieee_defaults | bool | | IEEE Defaults settings |
Field | Type | Label | Description |
custom_env_vars | NcclConfig.CustomEnvVarsEntry | repeated | |
Field | Type | Label | Description |
key | string | | |
value | string | | |
Field | Type | Label | Description |
ethernet_mtu | int32 | | Ethernet Maximum transmission unit |
ipv4_addressing | Ipv4Addressing | | IP addressing |
ipv6_addressing | Ipv6Addressing | | IP addressing |
congestion_control | CongestionControl | optional | Congestion control |
mac_address | string | optional | MAC address |
vlan | Vlan | optional | VLAN Tags |
roce_transport_settings | Rocev2TransportSettings | | RoCEv2 transport settings |
PFC configuration
Field | Type | Label | Description |
enabled | bool | optional | Enable/disable PFC |
priorities | int32 | repeated | PFC priorities |
Field | Type | Label | Description |
rdma_message_size | int32 | | (Maximum) RDMA message size in bytes |
qps_per_rankpair | int32 | | Number of Queue Pairs per rank pair |
qp_negotiation | RoCEv2QPNegotiationMethod | | Queue Pair Negotiation method |
verb | RDMAVerb | | RDMA verb to use for data transfers in collective communication operations |
tcp_store_host | string | | TCPStore hostname or IP address (only applicable when qp_negotiation is METHOD_TCP_STORE) |
tcp_store_port | uint32 | | TCPStore port number (only applicable when qp_negotiation is METHOD_TCP_STORE) |
retx_retry_interval_ms | int32 | optional | Configure the AckTimeout for RoCEv2 protocol |
retx_retry_count | int32 | optional | Configure the RetransRetryCount for RoCEv2 protocol |
max_retry_on_rnr_nak | int32 | optional | |
ack_request_interval | int32 | optional | |
Field | Type | Label | Description |
ack_dscp | int32 | optional | Configures ACK DSCP value for RoCEv2 protocol |
nack_dscp | int32 | optional | Configures NACK DSCP value for RoCEv2 protocol |
data_dscp | int32 | | Configures DSCP for all data traffic |
Information about servers being used for emulation
Field | Type | Label | Description |
address | string | | Server IP address or FQDN |
nic_interface | string | | Server NIC interface |
Field | Type | Label | Description |
enabled | bool | | Enable/disable VLANs |
vlan_tags | VlanTag | repeated | VLAN tag. Currently one VLAN tag is required. |
VLAN Configuration
Field | Type | Label | Description |
priority | int32 | | Priority |
vlan_id | int32 | | VLAN ID |
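Both VLAN tag fields are declared as int32, but an IEEE 802.1Q tag carries a 3-bit priority and a 12-bit VLAN ID. A client-side range check might look like the sketch below. `validate_vlan_tag` is a hypothetical helper; whether the server accepts the reserved IDs 0 and 4095 is not stated here, so this sketch checks only the bit widths.

```python
def validate_vlan_tag(priority: int, vlan_id: int) -> None:
    """Raise ValueError if either field is outside its 802.1Q bit width."""
    if not 0 <= priority <= 7:      # 3-bit priority code point
        raise ValueError(f"priority out of range 0..7: {priority}")
    if not 0 <= vlan_id <= 4095:    # 12-bit VLAN identifier
        raise ValueError(f"vlan_id out of range 0..4095: {vlan_id}")


validate_vlan_tag(priority=3, vlan_id=100)  # valid, returns None

try:
    validate_vlan_tag(priority=8, vlan_id=100)
except ValueError as err:
    print(err)
```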
Algorithm message specifies a choice of system provided Expanders or a user provided custom implementation
Name | Number | Description |
ALGO_UNSPECIFIED | 0 | |
ALGO_ALL_TO_ALL_PARALLEL | 1 | |
ALGO_ALL_TO_ALL_PXN | 2 | |
ALGO_ALL_REDUCE_UNIDIRECTIONAL_RING | 10 | |
ALGO_ALL_REDUCE_BIDIRECTIONAL_RING | 11 | |
ALGO_ALL_REDUCE_VECTOR_HALVING_DOUBLING | 12 | |
ALGO_ALL_REDUCE_DOUBLE_BINARY_TREE | 13 | |
ALGO_ALL_GATHER_RING | 20 | |
ALGO_REDUCE_SCATTER_RING | 30 | |
ALGO_BROADCAST_PARALLEL | 40 | |
ALGO_GATHER_PARALLEL | 50 |
ECN field within an IP packet
Name | Number | Description |
ECN_UNSPECIFIED | 0 | Unspecified |
ECN_DISABLED | 1 | ECN disabled |
ECN_ECT1 | 2 | ECN capable |
ECN_ECT0 | 3 | ECN capable |
Name | Number | Description |
PLATFORM_UNSPECIFIED | 0 | UNSPECIFIED |
PLATFORM_KEYS_NCCL_TEST | 1 | Keysight Software Solution for NCCL-tests orchestration on on-prem servers |
PLATFORM_KEYS_SW_AGENT | 2 | Keysight Software Solution for on-prem servers |
PLATFORM_KEYS_HW | 3 | Keysight Hardware Platform |
Name | Number | Description |
VERB_UNSPECIFIED | 0 | |
VERB_WRITE | 1 | RDMA WRITE |
VERB_SEND | 2 | RDMA SEND |
RoCEv2 RNR Timeout. Relevant only for RC QPs
Name | Number | Description |
TIMEOUT_655360_MU | 0 | |
TIMEOUT_10_MU | 1 | |
TIMEOUT_20_MU | 2 | |
TIMEOUT_30_MU | 3 | |
TIMEOUT_40_MU | 4 | |
TIMEOUT_60_MU | 5 | |
TIMEOUT_80_MU | 6 | |
TIMEOUT_120_MU | 7 | |
TIMEOUT_160_MU | 8 | |
TIMEOUT_240_MU | 9 | |
TIMEOUT_320_MU | 10 | |
TIMEOUT_480_MU | 11 | |
TIMEOUT_640_MU | 12 | |
TIMEOUT_960_MU | 13 | |
TIMEOUT_1280_MU | 14 | |
TIMEOUT_1920_MU | 15 | |
TIMEOUT_2560_MU | 16 | |
TIMEOUT_3840_MU | 17 | |
TIMEOUT_5120_MU | 18 | |
TIMEOUT_7680_MU | 19 | |
TIMEOUT_10240_MU | 20 | |
TIMEOUT_15360_MU | 21 | |
TIMEOUT_20480_MU | 22 | |
TIMEOUT_30720_MU | 23 | |
TIMEOUT_40960_MU | 24 | |
TIMEOUT_61440_MU | 25 | |
TIMEOUT_81920_MU | 26 | |
TIMEOUT_122880_MU | 27 | |
TIMEOUT_163840_MU | 28 | |
TIMEOUT_245760_MU | 29 | |
TIMEOUT_327680_MU | 30 | |
TIMEOUT_491520_MU | 31 |
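The enum names above encode the timeout value directly. Assuming the `_MU` suffix means microseconds (an assumption, though the values line up with the InfiniBand RNR NAK timer encoding, where code 0 is 655.36 ms and code 1 is 0.01 ms), a name can be converted to milliseconds as sketched below. `rnr_timeout_ms` is a hypothetical helper.

```python
def rnr_timeout_ms(enum_name: str) -> float:
    """Convert a name such as 'TIMEOUT_10_MU' to milliseconds.

    Assumes the number embedded in the name is in microseconds.
    """
    prefix, suffix = "TIMEOUT_", "_MU"
    if not (enum_name.startswith(prefix) and enum_name.endswith(suffix)):
        raise ValueError(f"unexpected RNR timeout name: {enum_name!r}")
    microseconds = int(enum_name[len(prefix):-len(suffix)])
    return microseconds / 1000.0


print(rnr_timeout_ms("TIMEOUT_10_MU"))      # 0.01
print(rnr_timeout_ms("TIMEOUT_655360_MU"))  # 655.36
```

Note that the enum number is not monotonic in the timeout: value 0 is the longest timeout in the list, not the shortest.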
A tuple of speed, modulation and FEC mode configuration
Name | Number | Description |
UNSPECIFIED | 0 | |
MODE_100GE_NRZ_RS_FEC | 9 | |
MODE_100GE_NRZ_NO_FEC | 10 | |
MODE_100GE_PAM4_53G_KP4_FEC | 8 | |
MODE_100GE_PAM4_106G_RS_FEC | 7 | |
MODE_100GE_PAM4_106G_KP4_FEC | 6 | |
MODE_200GE_PAM4_53G_KP4_FEC | 5 | |
MODE_200GE_PAM4_106G_KP4_FEC | 4 | |
MODE_400GE_PAM4_53G_KP4_FEC | 3 | |
MODE_400GE_PAM4_106G_KP4_FEC | 2 | |
MODE_800GE_PAM4_106G_KP4_FEC | 1 |
Name | Number | Description |
SPEED_UNSPECIFIED | 0 | |
SPEED_100G | 1 | 100 Gigabit Ethernet |
SPEED_200G | 2 | 200 Gigabit Ethernet |
SPEED_400G | 3 | 400 Gigabit Ethernet |
SPEED_800G | 4 | 800 Gigabit Ethernet |
Data model for describing the Collective Benchmark trial
Benchmark message is a high-level abstraction of a Chakra workload.
The message MUST be converted into a list of Chakra et_def.proto Node messages,
and the dse.Experiment.workload field should be populated with the
converted list.
Field | Type | Label | Description |
collective_algorithm | common.Algorithm | | Communication collective algorithm |
datasize | Datasize | | Data size definition as a factor |
datasize_list | DatasizeList | | Custom data size definition |
iterations | int32 | | Number of repeated benchmark runs |
Collective Benchmark Application configuration
Field | Type | Label | Description |
benchmark | Benchmark | | Benchmark definition |
Datasize is a container for specifying the data sizes for each Collective benchmark run
Field | Type | Label | Description |
start | uint64 | | Data size to start the benchmark collective operation |
step | uint32 | | Factor to increase the data size for each run |
end | uint64 | | Maximum data size, after which the benchmark completes |
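To make the start/step/end fields concrete, the sketch below expands them into the list of data sizes a benchmark would sweep. It assumes that `step` is a multiplicative factor (as the word "Factor" suggests) and that `end` is inclusive. Neither detail is spelled out in this document, and `expand_datasizes` is a hypothetical helper.

```python
from typing import List


def expand_datasizes(start: int, step: int, end: int) -> List[int]:
    """List the sizes start, start*step, start*step^2, ... not exceeding end."""
    if start <= 0 or end < start:
        raise ValueError("need 0 < start <= end")
    if step < 2:
        # A factor below 2 would never grow the size, so the loop would not end.
        raise ValueError("step must be at least 2")
    sizes = []
    size = start
    while size <= end:
        sizes.append(size)
        size *= step
    return sizes


print(expand_datasizes(1024, 2, 8192))  # [1024, 2048, 4096, 8192]
```

An irregular sweep that does not fit this pattern would use DatasizeList.size_bytes instead.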
DatasizeList allows for specifying a custom list of data sizes
Field | Type | Label | Description |
size_bytes | uint64 | repeated | The size of data in bytes |
Data models for logical infrastructure used in the trial
Field | Type | Label | Description |
name | string | optional | Fabric name |
Field | Type | Label | Description |
blackbox | BlackboxFabric | | Non-detailed generic network fabric (a blackbox) |
Contains the input parameters for the generic host builder
Field | Type | Label | Description |
name | string | optional | Host name |
npu_count | uint32 | | Number of NPUs. Generic hosts have a 1:1 ratio of NICs and NPUs |
custom_bandwidth_gbps | uint32 | | NPU Interconnect bandwidth |
High level definition of hosts in logical infrastructure
Field | Type | Label | Description |
count | uint32 | | Number of hosts |
generic | GenericHost | | Generic host |
use_npu_interconnect | bool | | Emulate NPU interconnects |
Infrastructure configuration comprising hosts and the network fabric
Field | Type | Label | Description |
host | Host | | Hosts |
fabric | Fabric | optional | Network fabric |
.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
double | | double | double | float | float64 | double | float | Float |
float | | float | float | float | float32 | float | float | Float |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |