Feature CCX-137: Insights Advisor: Insights Advisor for OpenShift as standalone app

Feature Overview
Insights Advisor for OpenShift is integrated within OpenShift Cluster Manager (OCM). This integration limits the addition of new features and the sharing of a codebase between RHEL Advisor and the OCM Insights Advisor tab. Insights Advisor for OpenShift lacks certain features of the RHEL UI, and the codebase is not a 1:1 clone.
As a customer of Insights, I will have the same or a very similar user experience with Insights for OpenShift and Insights for RHEL. The workflows will share the main concepts, the UI elements will be the same, and features introduced to Advisor will automatically be considered for all supported platforms.
As an OpenShift user, I will still see integrations of Insights Advisor within OpenShift Cluster Manager that show aggregated information for the customer account and a single-cluster view of Advisor data. These integrations will point to the new Insights Advisor for OpenShift app, which will be tightly integrated into OpenShift Cluster Manager.

Note: The application will reuse the codebase but will run as a separate app for OpenShift. There's no intent to merge RHEL and OpenShift workflows into a single app.

Goals

Q2CY21: Explore the possibility of unifying the codebase between RHEL Advisor and the OCM Insights Advisor tab. Identify architecture misalignments and create UI mockups to merge the two existing UIs.
Q3CY21: Integrate OpenShift into the Advisor codebase, stand up the Insights Advisor for OpenShift application, and change the integration in OpenShift Cluster Manager to point at the new app.
Q4CY21: Deliver the missing screens of Insights Advisor for OpenShift (Systems and Recommendations views).

Requirements

UX overview of UI elements in both UIs - Marie Doruskova
Architecture overview/misalignments for both UIs - Jan Zeleny [~fjansen]

Benefits

Feature parity between RHEL and OpenShift
Adopting new features developed by RHEL Advisor team quicker
Smaller maintenance cost

Questions to answer...

Possible deviations between OpenShift and RHEL
Remediation workflow different between OpenShift and RHEL

Out of Scope

A single app that combines RHEL hosts and OpenShift clusters. The goal is still to differentiate between platforms and offer a view for only a single platform.
Direct/Supervised remediations and integration of remediations with Advanced Cluster Manager (as a Service)

Background, and strategic fit

Insights Advisor for OpenShift follows the goal to introduce multiple applications that add value for OpenShift customers under the Insights brand. The current UI and integration of Advisor into OpenShift Cluster Manager doesn't follow the pattern that other Insights for OpenShift applications can/will follow.

Documentation Considerations

OCM documentation is impacted. Existing workflows described in OCM documentation will persist, but the placement of the application within OCM will be different.

Epic CCXDEV-6500: OCP Advisor (frontend, CY22Q1)

TBA

Task CCXDEV-7039: Redirect OCP WebConsole users to Advisor through the links in the widget

OCP WebConsole, in the main dashboard, has an Insights Advisor widget, which has been redirecting users to OCM. Because the Insights Advisor tab in OCM is being decommissioned, the links should point to Advisor instead.

4.10 code freeze = 28 January (marking the task as urgent)

https://github.com/openshift/console/pull/10875

Feature OCPPLAN-5714: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Epic AUTH-6: Logs should contain login and login failure details

Summary (PM+lead)

Configure audit logging to capture login, ~~logout~~ and login failure details

Motivation (PM+lead)

TODO(PM): update this

A customer needs login, ~~logout~~ and login failure details inside OpenShift Container Platform.
I have checked for this on my test cluster, but the audit logs do not contain any user name specifying login or ~~logout~~ details. For logins or ~~logout~~, both the CLI and the OpenShift console show 'Login successful' or 'Invalid credentials'.

Expected results: Login, ~~logout~~ and login failures should be captured in audit logging.

Goals (lead)

Non-Goals (lead)

Don't attempt to log login failures in the IdP login flow that goes beyond timeout, if the information is not available in explicit oauth-server requests (e.g. GitHub password login error).
Logout does not involve oauth-server (but is a simple API object deletion in oauth-apiserver). Hence, the audit log discussed here won't include logout.

Deliverables

Changes to oauth-server to log into /var/log/oauth-server/audit.log on the master node.
Documentation

Proposal (lead)

The apiserver pods today have `/var/log/<kube|oauth|openshift>-apiserver` mounted from the host and create audit files there using the upstream audit event format (JSON lines following https://github.com/kubernetes/apiserver/blob/92392ef22153d75b3645b0ae339f89c12767fb52/pkg/apis/audit/v1/types.go#L72). These events are apiserver-specific, but because oauth authentication flow events are also requests, we can use the apiserver event format to log logins, login failures and logouts. Hence, we propose to make oauth-server create /var/log/oauth-server/audit.log files on the master nodes using that format.

When the login flow does not finish within a certain time (e.g. 10min), we can artificially create an event to show a login failure in the audit logs.
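
As a rough illustration of the proposal above, the following Python sketch builds one JSON line shaped like an upstream audit.k8s.io/v1 event for a login attempt. The top-level field names follow the upstream Event type. The default request URI, the response codes, the annotation key `example.openshift.io/login-decision`, and the helper names are hypothetical and are not taken from oauth-server.

```python
import json
import uuid
from datetime import datetime, timezone


def login_audit_event(username, source_ip, succeeded, request_uri="/oauth/authorize"):
    """Build one audit event (a dict) for a login attempt.

    Field names mirror the upstream audit.k8s.io/v1 Event type. The
    annotation key, response codes and default request URI are assumptions.
    """
    now = datetime.now(timezone.utc).isoformat()
    return {
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "level": "Metadata",
        "auditID": str(uuid.uuid4()),
        "stage": "ResponseComplete",
        "requestURI": request_uri,
        "verb": "get",
        "user": {"username": username},
        "sourceIPs": [source_ip],
        "responseStatus": {"code": 302 if succeeded else 401},
        "requestReceivedTimestamp": now,
        "stageTimestamp": now,
        # Hypothetical annotation used to mark the login outcome.
        "annotations": {
            "example.openshift.io/login-decision": "allow" if succeeded else "deny"
        },
    }


def to_json_line(event):
    """Serialize an event as a single JSON line, one event per log line."""
    return json.dumps(event, separators=(",", ":"))


print(to_json_line(login_audit_event("alice", "10.0.0.5", succeeded=False)))
```

An artificial event for a login flow that timed out could be produced the same way by calling the helper with `succeeded=False` once the timeout expires.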

User Stories (PM)

Dependencies (internal and external, lead)

Previous Work (lead)

Open questions (lead)

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story AUTH-100: Request Header flow: hook up oauth-server to create audit logs

Right now there's no way to generate audit logs from this.

https://github.com/openshift/oauth-server/pull/106

Story AUTH-66: PW Based flow: hook up oauth-server to create audit logs

Right now there's no way to generate audit logs from this.

https://github.com/openshift/oauth-server/pull/92

Story AUTH-89: CAO: leverage policy config for oauth-server

🏆 What

Let the Cluster Authentication Operator deliver the policy to OAuthServer.

💖 Why

In order to know if authn events should be logged, OAuthServer needs to be aware of it.

🗒 Notes

Create an observer to deliver the audit policy to the oauth server

Make the authentication-operator react to the new audit field in the oauth.config/cluster object. Write an observer watching this field, such an observer will translate the top-level configuration into oauth-server config and add it to the rest of the observed config.

* Stanislav Láznička

https://github.com/openshift/cluster-authentication-operator/pull/563

Feature OCPPLAN-6839: Single replica control plane topology expansion

OCP/Telco Definition of Done
Feature Template descriptions and documentation.

Feature Overview.

Early customer feedback is that they see SNO as a great solution covering smaller footprint deployments, but they are wondering what evolution story OpenShift is going to provide when more capacity or high availability is needed in the future.

While migration tooling (moving workload/config to a new cluster) could be a mid-term solution, customers prefer not to involve extra hardware in this process.

For Telecommunications Providers, at the Far Edge they intend to start small and then grow. Many of these operators will start with a SNO-based DU deployment as an initial investment, but as DUs evolve, different segments of the radio spectrum are added, various radio hardware is provisioned and features delivered to the Far Edge, the Telecommunication Providers desire the ability for their Far Edge deployments to scale up from 1 node to 2 nodes to n nodes. On the opposite side of the spectrum from SNO is MMIMO where there is a robust cluster and workloads use HPA.

Goals

Provide the capability to expand a single replica control plane topology to host more workload capacity by adding workers
Provide the capability to expand a single replica control plane to be a highly available control plane
To satisfy MMIMO, telecommunications providers will want the ability to scale a SNO to a multi-node cluster that can support HPA.
Telecommunications providers do not want workload (DU specifically) downtime when migrating from SNO to a multi-node cluster.
Telecommunications providers wish to be able to scale from one to two or more nodes to support a variety of radio hardware.
Support CP scaling (CP HA) for 2 node cluster, 3 node cluster and n node cluster. As the number of nodes in the cluster increases so does the failure domain of the cluster. The cluster is now supporting more cell sectors and therefore has more of a need for HA and resiliency including the cluster CP.

Requirements

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

(Optional) Use Cases

This Section:

Main success scenarios - high-level user stories
Alternate flow/scenarios - high-level user stories
...

Questions to answer…

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

Customer Considerations

Documentation Considerations

Questions to be addressed:

What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
Does this feature have doc impact?
New Content, Updates to existing content, Release Note, or No Doc Impact
If unsure and no Technical Writer is available, please contact Content Strategy.
What concepts do customers need to understand to be successful in [action]?
How do we expect customers will use the feature? For what purpose(s)?
What reference material might a customer want/need to complete [action]?
Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic MGMT-8414: Single Node OpenShift worker node expansion

Epic Goal

Documented and supported flow for adding 1, 2, 3 or more workers to a Single Node OpenShift (SNO) deployment without requiring cluster downtime and the understanding that this action will not make the cluster itself highly available.

Why is this important?

Telecommunications and Edge scenarios where HA is handled via failover to another site but single site capacity may vary or need to be expanded over time.
Similar scenarios exist for some ISV vendors where OpenShift is an implementation detail of how they deliver their solution on top of another platform (e.g. VMware).

Scenarios

Adding a worker to a single node openshift cluster.
Adding a second worker to a single node openshift cluster.
Adding a third worker to a single node openshift cluster.
Removing a worker node from a single node openshift cluster that has had 1 or more workers added.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
Customer facing documentation of the add worker flow for SNO.

Dependencies (internal and external)

Previous Work (Optional):

~~MGMT-6606~~

Open questions:

Presumably there is a scale limit on how many workers could be added to an SNO control plane, and it is lower than the limit for a "normal" 3 node control plane. It is not anticipated that this limit will be established in this epic. Intent is to focus on small scale sites where adding 1-3 worker nodes would be beneficial.

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task MGMT-9797: Single node + workers enhancement implementation

This is a ticket meant to track all the OCP PRs that are involved in the implementation of the SNO + workers enhancement.

Feature RHDP-291: Maintain existing portfolio priorities

Feature Overview

This Feature is a general "catch all" for the time being. There are a number of existing priorities from Q1 that should be aligned with existing priorities below but if not, assign to this feature as needed.

Goals

In order to get a better overall portfolio view, we'll leverage this Feature to gather work that doesn't fall into other existing priorities on this board. As this list grows, the portfolio priority grooming team will look to split out or handle appropriately.

Requirements

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

requirement	Notes	isMvp

(Optional) Use Cases

< How will the user interact with this feature? >

< Which users will use this and when will they use it? >

< Is this feature used as part of current user interface? >

Out of Scope

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>

<What concepts do customers need to understand to be successful in [action]?>
<How do we expect customers will use the feature? For what purpose(s)?>
<What reference material might a customer want/need to complete [action]?>
<Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
<What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Questions

Question	Outcome

Epic ODC-6351: Dynamic Plugins - Round 3

Problem:

Console provides supporting UI for operators, which is dynamically enabled when the operator is installed by using feature flags against the presence of CRDs. However, operators have their own release cadence, separate from OpenShift's, which makes aligning the UI with the API difficult. As new features are released for the operator, the UI becomes out of sync with the APIs, and customers must wait until the following OpenShift release to get any new UI.

Goal:

Create an extensibility mechanism which allows Red Hat operators to build and package their own UI that extends the console.
Make console extensible in areas required to support the needs of contributing plugins.

Why is it important?

Allows an operator to maintain their own UI and release at their own cadence.
Alleviates the pressure on console to deliver UI features for multiple operators within a release.

Use cases:

Serverless / Pipelines / Helm to contribute resource details pages, import flows, topology visuals etc...

Acceptance criteria:

Red Hat Operator can build their own UI which is deployed alongside the operator and extend the dev-console
1. objective is to get to a point where it is possible to accomplish this however code will not be moved to a separate repository, nor deployed by an operator
New extensions for console to allow operators to extend the various areas of console needed in order to provide the proper user experience.
Enable operators to override the static built in support, and supply their own UI

Dependencies (External/Internal):

Design Artifacts:

Console extensions:
https://docs.google.com/document/d/1HW5_cl6cOX5P14PQN-1_8c60o9dMY6HbFDRftH6aTno/edit

Dynamic Plugins:
https://docs.google.com/document/d/19BAFo_8BtMZVvKsU-bE61bZpSydeYONkCMWntMU9NgE/edit

Enhancement proposal:
https://github.com/openshift/enhancements/pull/441

Exploration:

Note:

plugin framework covered by another epic
out of scope:
- moving plugins to separate git repository

Story ODC-6219: Override static plugin contribution with dynamic plugin contribution

Description

As a developer, I want to be able to contribute a dynamic plugin extension and override the same extension contributed by static plugin.

Acceptance Criteria

Should replace static plugin contribution of same name by dynamic plugin contribution
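
The override rule in the acceptance criteria can be pictured with a small Python sketch. This only illustrates the "same name wins for the dynamic plugin" idea. The console itself is not written in Python, and the `name` key and the function `resolve_extensions` are hypothetical.

```python
def resolve_extensions(static_exts, dynamic_exts):
    """Merge extensions so that a dynamic contribution replaces a static
    contribution of the same name.

    Each extension is a dict with a hypothetical "name" key. Static
    extensions keep their original position. Dynamic extensions with new
    names are appended at the end.
    """
    merged = {ext["name"]: ext for ext in static_exts}
    merged.update({ext["name"]: ext for ext in dynamic_exts})
    return list(merged.values())


static = [
    {"name": "topology", "source": "static"},
    {"name": "import-flow", "source": "static"},
]
dynamic = [{"name": "topology", "source": "dynamic"}]
print(resolve_extensions(static, dynamic))
```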

Additional Details:

https://github.com/openshift/console/pull/9744

Feature CCX-218: Insights Advisor: Post-GA updates

Problem:

Certain Insights Advisor features differ between RHEL and OCP Advisor.

Goal:

Address top priority UI misalignments between RHEL and OCP Advisor. Address UI features dropped from Insights Advisor for OCP GA.

Scope:

Specific tasks and their priorities are tracked in https://issues.redhat.com/browse/CCXDEV-7432

Epic CCXDEV-7980: Insights Advisor widget in Webconsole, 4.11 OCP release

This contains all the Insights Advisor widget deliverables for the OCP release 4.11.

Scope
It covers only minor bug fixes and improvements:

better error handling during internal outages in data processing
add "last refresh" timestamp in the Advisor widget

Task CCXDEV-5788: Provide the "last refresh" timestamp in the Advisor widget

Scenario: Check if the Insights Advisor widget in the OCP WebConsole UI shows the time of the last data analysis
Given: OCP WebConsole UI and the cluster dashboard is accessible
And: CCX external data pipeline is in a working state
And: administrator A1 has access to his cluster's dashboard
And: Insights Operator for this cluster is sending archives
When: administrator A1 clicks on the Insights Advisor widget
Then: the results of the last analysis are shown in the Insights Advisor widget
And: the time of the last analysis is shown in the Insights Advisor widget

Acceptance criteria:

The time of the last analysis is shown in the Insights Advisor widget for the scenario above
The way it is presented is defined within the scope of https://issues.redhat.com/browse/CCXDEV-5869 (mockup task)
The source of this timestamp must be a result of running the Prometheus metric (last archive upload time):
```
max_over_time(timestamp(changes(insightsclient_request_send_total{status_code="202"}[1m]) > 0)[24h:1m])
```
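
To show how a consumer might turn the result of the query above into a "last refresh" time, here is a minimal Python sketch. It assumes the standard JSON shape returned by a Prometheus instant query (/api/v1/query). The function name and the sample response are made up for illustration and do not come from the console code.

```python
from datetime import datetime, timezone

# PromQL from the acceptance criteria above (last archive upload time).
LAST_UPLOAD_QUERY = (
    "max_over_time(timestamp(changes("
    'insightsclient_request_send_total{status_code="202"}[1m]) > 0)[24h:1m])'
)


def last_refresh_from_response(response):
    """Return the last upload time as an aware datetime, or None.

    `response` is the decoded JSON body of a Prometheus instant query.
    The sample value is a Unix timestamp in seconds.
    """
    if response.get("status") != "success":
        return None
    results = response.get("data", {}).get("result", [])
    if not results:
        return None  # no successful upload seen in the queried window
    # Each vector sample is [evaluation_time, "value-as-string"].
    latest = max(float(sample["value"][1]) for sample in results)
    return datetime.fromtimestamp(latest, tz=timezone.utc)


# Hand-written sample response, not real cluster data.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1650000100.0, "1650000000"]}],
    },
}
print(last_refresh_from_response(sample))
```

Returning `None` for an empty result lets the caller decide how to present "no upload in the last 24 hours" instead of showing a misleading time.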

https://github.com/openshift/console/pull/11391

Task CCXDEV-7974: Show error message if UploadDegraded and Degraded conditions set to true

Show the error message (mocked in ~~CCXDEV-5868~~) if the Prometheus metrics `cluster_operator_conditions{name="insights"}` contain two true conditions: UploadDegraded and Degraded at the same time. This state occurs if there was an IO archive upload error = problems with the pipeline.
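
The check described above could look roughly like the following Python sketch. It assumes that each `cluster_operator_conditions` series carries a `condition` label and that a sample value of 1 means the condition is true. The function name is hypothetical.

```python
def upload_pipeline_degraded(samples):
    """Return True when both Degraded and UploadDegraded are true.

    `samples` is a list of Prometheus vector samples for
    cluster_operator_conditions{name="insights"}. Assumption: the
    condition type is in the `condition` label and a value of 1 means
    the condition is currently true.
    """
    active = {
        s["metric"].get("condition")
        for s in samples
        if float(s["value"][1]) == 1.0
    }
    return {"Degraded", "UploadDegraded"} <= active


# Hand-written samples for illustration only.
samples = [
    {"metric": {"name": "insights", "condition": "Degraded"}, "value": [0, "1"]},
    {"metric": {"name": "insights", "condition": "UploadDegraded"}, "value": [0, "1"]},
    {"metric": {"name": "insights", "condition": "Available"}, "value": [0, "0"]},
]
print(upload_pipeline_degraded(samples))
```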

Expected for 4.11 OCP release.

https://github.com/openshift/console/pull/11399

Feature OBSDA-2: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic MON-1985: Allow admin users to create new alerting rules based on platform metrics

Epic Goal

Allow admin user to create new alerting rules, targeting metrics in any namespace
Allow cloning of existing rules to simplify rule creation
Allow creation of silences for existing alert rules

Why is this important?

Currently, any platform-related metrics (exposed in an openshift-*, kube-* or default namespace) cannot be used to form a new alerting rule. That makes it very difficult for administrators to enrich our out of the box experience for the OpenShift Container Platform with new rules that may be specific to their environments.

Additionally, we had requests from customers to allow modifications of our existing, out of the box alerting rules (for instance tweaking the alert expression or changing the severity label). Unfortunately, that is not easy since most rules come from several open source projects, or other OpenShift components, and any modifications would make a seamless upgrade not really seamless anymore. Imagine K8s changes metrics again (see 1.14) and we have to update our rules. We would not know what modifications have been done (even just the threshold might be difficult if upstream changes that as well) and we would not be able to upgrade these rules.

Scenarios

I'd like to modify the query expression of an existing rule (because the threshold value doesn't match with my environment).

Cloning the existing rule should end up with a new rule in the same namespace.
Modifications can now be done to the new rule.
(Optional) You can silence the existing rule.

I'd like to create a new rule based on a metric only available to an openshift-* namespace

Create a new PrometheusRule object inside the namespace that includes the metrics you need to form the alerting rule.

I'd like to update the label of an existing rule.
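
For the "create a new PrometheusRule object" scenario above, a manifest could be assembled as in the Python sketch below. The `apiVersion`, `kind`, and the `spec.groups[].rules[]` layout follow the prometheus-operator PrometheusRule resource. The rule name, namespace, metric, and threshold are invented placeholders, and the helper function is hypothetical.

```python
import json


def prometheus_rule(name, namespace, alert, expr, severity, for_="10m"):
    """Return a PrometheusRule manifest (monitoring.coreos.com/v1) as a dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": f"{name}.rules",
                    "rules": [
                        {
                            "alert": alert,
                            "expr": expr,
                            "for": for_,
                            "labels": {"severity": severity},
                        }
                    ],
                }
            ]
        },
    }


# Placeholder values: none of these names come from the epic.
manifest = prometheus_rule(
    name="example-custom-alert",
    namespace="openshift-example",
    alert="ExampleMetricTooHigh",
    expr="example_platform_metric > 100",
    severity="warning",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied as-is, since Kubernetes accepts JSON manifests as well as YAML.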

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
Ability to distinguish between rules deployed by us (CMO) and user created rules

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Distinguish between operator-created rules and user-created rules
Currently no such mechanism exists. This will need to be added to prometheus-operator or cluster-monitoring-operator.

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task MON-2091: Implement controller for user defined alert-relabel-config CRD

CMO should reconcile the platform Prometheus configuration with the alert-relabel-config resources.

DoD

Alerts changed via alert-relabel-configs are evaluated by the Platform monitoring stack.
Product alerts which are overridden aren't sent to Alertmanager.

https://github.com/openshift/cluster-monitoring-operator/pull/1676

Task MON-2552: Implement controller for AlertingRule CRD

CMO should reconcile the platform Prometheus configuration with the AlertingRule resources.

DoD

Alerts added via AlertingRule resources are evaluated by the Platform monitoring stack.

https://github.com/openshift/cluster-monitoring-operator/pull/1675

Feature OBSDA-27: Enable prometheus retention.size via CMO

Managing PVs at scale for a fleet creates difficulties where "one size does not fit all". The ability for SRE to deploy Prometheus with PVs and have retention based on a desired size would enable easier management of these volumes across the fleet.

The prometheus-operator exposes retentionSize.

Field	Description
retentionSize	Maximum amount of disk space used by blocks. Supported units: B, KB, MB, GB, TB, PB, EB. Ex: 512MB.

This is a feature request to enable this configuration option via CMO cluster-monitoring-config ConfigMap.
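
A value for `retentionSize` could be sanity-checked before it is written to the ConfigMap, as in this Python sketch. It accepts only a whole number followed by one of the units listed in the field description above. The real validation in prometheus-operator or CMO may accept more forms or fewer, so treat the pattern as an assumption.

```python
import re

# Units listed in the retentionSize field description.
_UNITS = ("B", "KB", "MB", "GB", "TB", "PB", "EB")
_PATTERN = re.compile(r"(0|[1-9][0-9]*)(" + "|".join(_UNITS) + r")")


def is_valid_retention_size(value):
    """Return True if `value` looks like '<integer><unit>', e.g. '512MB'."""
    return isinstance(value, str) and _PATTERN.fullmatch(value) is not None


for candidate in ("512MB", "10GB", "512", "ten GB"):
    print(candidate, is_valid_retention_size(candidate))
```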

cc Simon Pasquier

Epic MON-2193: Size-based retention

Epic Goal

Cluster admins want to configure the retention size for their metrics.

Why is this important?

While it is possible to define how long metrics should be retained on disk, it's not possible to tell the cluster monitoring operator how much data it should keep. For OSD/ROSA in particular, it would facilitate the management of the fleet if the retention size could be configured based on the persistent volume size because it would avoid issues with the storage getting full and monitoring being down when too many metrics are produced.

Scenarios

As a cluster admin, I want to define the maximum amount of data to be retained on the persistent volume.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
The cluster-monitoring-config and user-workload-monitoring-config ConfigMaps allow configuring the retention size for
- Prometheus (Platform and UWM)
- Thanos Ruler (to be confirmed)
Proper validation is in place preventing bad user inputs from breaking the stack.

Dependencies (internal and external)

Thanos ruler doesn't support retention size (only retention time).

Previous Work (Optional):

None

Open questions:

None

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Feature OBSDA-3: Allow non-admin users to configure individual notification settings

Problem Alignment

The Problem

Today, all individual settings (for example, routing configuration) are made via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, uses different systems to notify teams in case of an issue, then someone needs to file a request with an admin to add the required settings.

That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.

We would like to introduce a more self-service approach whereby individual teams can create their own configuration for their needs without the administrator's involvement.

Last but not least, since Monitoring is deployed as a core service of OpenShift, there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction concerns the ability of customers to use the central Alertmanager that is owned and managed by the SRE team: due to security concerns, the SRE team can't give users access to the centrally managed secret, so users cannot add their own routing information.

High-Level Approach

Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.

Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.

Goal & Success

Allow users to deploy individual configurations that allow setting up Alertmanager for their needs without an administrator.

Solution Alignment

Key Capabilities

As an OpenShift administrator, I want to control who can CRUD individual configuration so that I can make sure that no unknown third party can touch the central Alertmanager instance shipped within OpenShift Monitoring.
As a team owner, I want to deploy a routing configuration to push notifications for alerts to my system of choice.

Key Flows

Team A wants to send all their important notifications to a specific Slack channel.

Administrator gives permission to Team A to allow creating a new configuration CR in their individual namespace.
Team A creates a new configuration CR.
Team A configures what alerts should go into their Slack channel.
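
The configuration CR that Team A creates in this flow might resemble the manifest built by the Python sketch below. It uses the AlertmanagerConfig kind from prometheus-operator with a Slack receiver. The resource name, receiver name, namespace, channel, and secret reference are placeholders, and the exact schema should be checked against the CRD version installed in the cluster.

```python
import json


def slack_alertmanager_config(namespace, channel, secret_name, secret_key="url"):
    """Return an AlertmanagerConfig manifest that routes alerts to Slack.

    The Slack webhook URL is read from a Secret in the same namespace,
    referenced by name and key. All names are illustrative.
    """
    receiver = "team-slack"
    return {
        "apiVersion": "monitoring.coreos.com/v1beta1",
        "kind": "AlertmanagerConfig",
        "metadata": {"name": "slack-routing", "namespace": namespace},
        "spec": {
            "route": {"receiver": receiver},
            "receivers": [
                {
                    "name": receiver,
                    "slackConfigs": [
                        {
                            "channel": channel,
                            "apiURL": {"name": secret_name, "key": secret_key},
                        }
                    ],
                }
            ],
        },
    }


config = slack_alertmanager_config("team-a", "#team-a-alerts", "slack-webhook")
print(json.dumps(config, indent=2))
```

Keeping the webhook URL in a Secret means the CR itself contains no credentials, which fits the goal of letting teams manage routing without access to the central Alertmanager secret.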

Open Questions & Key Decisions (optional)

Do we want to improve anything inside the developer console to allow configuration?

Epic MON-2168: Make Alertmanager configuration for user defined monitoring generally available

Epic Goal

Allow users to manage Alertmanager for user-defined alerts and have the feature fully supported.

Why is this important?

Users want to configure alert notifications without admin intervention.
The feature is currently Tech Preview, it should be generally available to benefit a bigger audience.

Scenarios

As a cluster admin, I can deploy an Alertmanager service dedicated for user-defined alerts (e.g. separated from the existing Alertmanager already used for platform alerts).
As an application developer, I can silence alerts from the OCP console.
As an application developer, I'm not allowed to configure invalid AlertmanagerConfig objects.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
The AlertmanagerConfig CRD is v1beta1
The validating webhook service checking AlertmanagerConfig resources is highly-available.

Dependencies (internal and external)

Prometheus operator upstream should migrate the AlertmanagerConfig CRD from v1alpha1 to v1beta1
Console enhancements likely to be involved (see below).

Previous Work (Optional):

Part of the feature is available as Tech Preview (~~MON-880~~).

Open questions:

Coordination with the console team to support the Alertmanager service dedicated for user-defined alerts.
Migration steps for users that are already using the v1alpha1 CRD.

Done Checklist

* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>

Story MON-2567: Expose AlertmanagerConfig v1beta1 in CMO

View the Description View the linked PRs

Now that upstream supports AlertmanagerConfig v1beta1 (see ~~MON-2290~~ and https://github.com/prometheus-operator/prometheus-operator/pull/4709), it should be deployed by CMO.

DoD:

Kubernetes API exposes and supports the v1beta1 version for AlertmanagerConfig CRD (in addition to v1alpha1).
Users can manage AlertmanagerConfig v1beta1 objects seamlessly.
AlertmanagerConfig v1beta1 objects are reconciled in the generated Alertmanager configuration.

https://github.com/openshift/cluster-monitoring-operator/pull/1682

Task MON-2222: Enable validating webhook for AlertmanagerConfig custom resources

View the linked PRs

https://github.com/openshift/cluster-monitoring-operator/pull/1567

Epic MON-880: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Task MON-2089: Add external label of origin to platform alerts

View the Description View the linked PRs

As described in https://github.com/openshift/enhancements/blob/ba3dc219eecc7799f8216e1d0234fd846522e88f/enhancements/monitoring/multi-tenant-alerting.md#distinction-between-platform-and-user-alerts, cluster admins want to distinguish platform alerts from user alerts. For this purpose, CMO should provision an external label (openshift_io_alert_source="platform") on prometheus-k8s instances.
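A minimal sketch of how a consumer of alerts could use this label to tell the two sources apart. The label name and value come from the description above; the sample alert names and the assumption that user alerts lack the label (or carry another value) are illustrative only.

```python
# Label provisioned by CMO on prometheus-k8s instances (from the task description).
PLATFORM_LABEL = "openshift_io_alert_source"


def is_platform_alert(labels: dict) -> bool:
    """Return True when the alert carries the platform origin label."""
    return labels.get(PLATFORM_LABEL) == "platform"


def split_alerts(alerts: list) -> tuple:
    """Partition alerts into (platform, user) lists based on the origin label."""
    platform = [a for a in alerts if is_platform_alert(a["labels"])]
    user = [a for a in alerts if not is_platform_alert(a["labels"])]
    return platform, user
```

Anything without the label is treated as a user alert here, which is an assumption of this sketch and not a rule stated in the enhancement.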

https://github.com/openshift/cluster-monitoring-operator/pull/1508

Feature OBSDA-36: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic MON-2194: Federation for UWM metrics

View the Description

Epic Goal

The goal is to support metrics federation for user-defined monitoring via the /federate Prometheus endpoint (both from within and outside of the cluster).

Why is this important?

It is already possible to configure remote write for user-defined monitoring to push metrics outside of the cluster but in some cases, the network flow can only go from the outside to the cluster and not the opposite. This makes it impossible to leverage remote write.
It is already possible to use the /federate endpoint for the platform Prometheus (via the internal service or via the OpenShift route), so not supporting it for UWM makes the experience inconsistent.
If we don't expose the /federate endpoint for the UWM Prometheus, users would have no supported way to store and query application metrics from a central location.

Scenarios

As a cluster admin, I want to federate user-defined metrics using the Prometheus /federate endpoint.
As a cluster admin, I want the /federate endpoint for UWM to be accessible via an OpenShift route.
As a cluster admin, I want access to the /federate endpoint for UWM to require authentication (with bearer token only) and authorization (the required permissions should match the permissions on the /federate endpoint of the Platform Prometheus).

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
Documentation - information about the recommendations and limitations/caveats of the federation approach.
User can federate user-defined metrics from within the cluster
User can federate user-defined metrics from the outside via the OpenShift route.

Dependencies (internal and external)

None

Previous Work (Optional):

None

Open questions:

None

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task MON-2213: Expose the /federate endpoint of UWM Prometheus as a route

View the Description View the linked PRs

DoD

User can federate UWM metrics from outside of the cluster via the OpenShift route.
E2E test added to the CMO test suite.

https://github.com/openshift/cluster-monitoring-operator/pull/1633

Task MON-2212: Expose the /federate endpoint of UWM Prometheus as a service

View the Description View the linked PRs

DoD

User can federate UWM metrics within the cluster from the prometheus-user-workload.openshift-user-workload-monitoring.svc:9092 service
The service requires authentication via bearer token and authorization (same permissions as for federating platform metrics)
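A minimal sketch of an in-cluster federation request against this service. The host, port, path and bearer-token requirement come from the DoD above; the https scheme, the example matcher and the function name are assumptions. The `match[]` query parameter is the standard Prometheus federation selector.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Host and port are taken from the DoD; the https scheme is an assumption.
UWM_FEDERATE_URL = (
    "https://prometheus-user-workload.openshift-user-workload-monitoring.svc:9092/federate"
)


def build_federate_request(token: str, matchers: list, base_url: str = UWM_FEDERATE_URL) -> Request:
    """Build (but do not send) a federation request authenticated by bearer token."""
    # Prometheus federation selects series through repeated match[] parameters.
    query = urlencode([("match[]", m) for m in matchers])
    return Request(f"{base_url}?{query}", headers={"Authorization": f"Bearer {token}"})
```

The request is only constructed, not sent; sending it from a pod would also require trusting the service CA and a token with the same permissions as for federating platform metrics.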

https://github.com/openshift/cluster-monitoring-operator/pull/1601

Feature OBSDA-39: Support Sigv4 authentication for remote write in OCP monitoring

View the Description

Copy/paste from https://github.com/openshift-cs/managed-openshift/issues/60

Which service is this feature request for?
OpenShift Dedicated and Red Hat OpenShift Service on AWS

What are you trying to do?
Allow ROSA/OSD to integrate with AWS Managed Prometheus.

Describe the solution you'd like
Remote-write of metrics is supported in OpenShift but it does not work with AWS Managed Prometheus since AWS Managed Prometheus requires AWS SigV4 auth.

Note that Prometheus supports AWS SigV4 since v2.26 and OpenShift 4.9 uses v2.29.

Describe alternatives you've considered
There is the workaround to use the "AWS SigV4 Proxy" but I'd think this is not properly supported by RH.
https://mobb.ninja/docs/rosa/cluster-metrics-to-aws-prometheus/

Additional context
The customer wants to use an open and portable solution to centralize metrics storage and analysis. If they also deploy to other clouds, they don't want to have to re-configure. Since most clouds offer a Prometheus service (or it's easy to self-manage Prometheus), app migration should be simplified.

Epic MON-2160: Support additional auth section in remote_write

View the Description

Epic Goal

The cluster monitoring operator should allow OpenShift customers to configure remote write with all authentication methods supported by upstream Prometheus.

We will extend CMO's configuration API to support the following authentications with remote write:

Sigv4

Authorization

OAuth2

Why is this important?

Customers want to send metrics to AWS Managed Prometheus that require sigv4 authentication (see https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-secure-metric-ingestion.html#AMP-secure-auth).

Scenarios

As a cluster admin, I want to forward platform/user metrics to remote write systems requiring Sigv4 authentication.
As a cluster admin, I want to forward platform/user metrics to remote write systems requiring OAuth2 authentication.
As a cluster admin, I want to forward platform/user metrics to remote write systems requiring custom Authorization header for authentication (e.g. API key).

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
It is possible for a cluster admin to configure any authentication method that is supported by Prometheus upstream for remote write (both platform and user-defined metrics):

- Sigv4
- Authorization
- OAuth2

Dependencies (internal and external)

In theory none because everything is already supported by the Prometheus operator upstream. We may discover bugs in the upstream implementation though that may require upstream involvement.

Previous Work

After CMO started exposing the RemoteWrite specification in ~~MON-1069~~, additional authentication options were added to Prometheus and prometheus-operator, but CMO didn't catch up on these.

Open Questions

None

Task MON-2206: Expose sigv4 settings for remote write in the CMO configuration

View the Description View the linked PRs

Prometheus and Prometheus operator already support sigv4 authentication for remote write. It should be possible to configure the same in the CMO configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint"
        sigv4:
          accessKey:
            name: aws-credentials
            key: access
          secretKey:
            name: aws-credentials
            key: secret
          profile: "SomeProfile"
          roleArn: "SomeRoleArn"

DoD:

Ability to configure sigv4 authentication for remote write in the openshift-monitoring/cluster-monitoring-config configmap
Ability to configure sigv4 authentication for remote write in the openshift-user-workload-monitoring/user-workload-monitoring-config configmap
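A minimal sketch of generating the manifest for either ConfigMap named in the DoD. The namespaces, ConfigMap names, the `prometheusK8s` key and the sigv4 field names come from this task; the `prometheus` key for the user-workload ConfigMap, the function name and the choice to embed the configuration as JSON (which YAML parsers generally accept) are assumptions of this sketch.

```python
import json

# (namespace, ConfigMap name, top-level component key).
# The "prometheus" key for user workload monitoring is an assumption.
TARGETS = {
    "platform": ("openshift-monitoring", "cluster-monitoring-config", "prometheusK8s"),
    "user": ("openshift-user-workload-monitoring", "user-workload-monitoring-config", "prometheus"),
}


def sigv4_configmap(target: str, url: str, secret: str) -> dict:
    """Return a ConfigMap manifest enabling sigv4 remote write for one target."""
    namespace, name, component = TARGETS[target]
    remote_write = {
        "url": url,
        "sigv4": {
            # Both credentials are read from keys of the same Secret.
            "accessKey": {"name": secret, "key": "access"},
            "secretKey": {"name": secret, "key": "secret"},
        },
    }
    config = {component: {"remoteWrite": [remote_write]}}
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # Serialized as JSON so that only the standard library is needed.
        "data": {"config.yaml": json.dumps(config, indent=2)},
    }
```

Printing `json.dumps(sigv4_configmap("user", "https://remote-write.endpoint", "aws-credentials"))` gives a document that could be reviewed before being applied.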

https://github.com/openshift/cluster-monitoring-operator/pull/1638

Task MON-2207: Expose Authorization settings for remote write in the CMO configuration

View the Description View the linked PRs

Prometheus and Prometheus operator already support custom Authorization for remote write. It should be possible to configure the same in the CMO configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint"
        authorization:
          type: Bearer
          credentials:
            name: credentials
            key: token

DoD:

Ability to configure custom Authorization for remote write in the openshift-monitoring/cluster-monitoring-config configmap
Ability to configure custom Authorization for remote write in the openshift-user-workload-monitoring/user-workload-monitoring-config configmap

https://github.com/openshift/cluster-monitoring-operator/pull/1598

Feature OCPPLAN-3604: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

View the Description

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Epic WINC-505: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story WINC-708: update must gather script with containerd logging info

View the Description View the linked PRs

Description

As a WMCO user, I want to make sure containerd logging information has been updated in documents and scripts.

Acceptance Criteria

update must-gather to collect containerd logs
Internal/Customer Documents and log collecting scripts must have containerd specific information (ex: location of logs).

https://github.com/openshift/must-gather/pull/290

Feature OCPPLAN-5652: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

View the Description

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Epic NETOBSERV-26: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story NETOBSERV-15: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/9953

Feature OCPPLAN-6007: OpenShift Core Networking Improvements

View the Description

Feature Overview

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.

Goals

Feature enhancements (performance, scale, configuration, UX, ...)
Modernization (incorporation and productization of new technologies)

Requirements

Core Networking Stability
Core Networking Performance and Scale
Core Networking Extensibility (Multus CNIs)
Core Networking UX (Observability)
Core Networking Security and Compliance

In Scope

Network Edge (ingress, DNS, LB)
SDN (CNI plugins, openshift-sdn, OVN, network policy, egressIP, egress Router, ...)
Networking Observability

Out of Scope

There are definitely grey areas, but in general:

CNV
Service Mesh
CNF

Documentation Considerations

Questions to be addressed:

What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
Does this feature have doc impact?
New Content, Updates to existing content, Release Note, or No Doc Impact
If unsure and no Technical Writer is available, please contact Content Strategy.
What concepts do customers need to understand to be successful in [action]?
How do we expect customers will use the feature? For what purpose(s)?
What reference material might a customer want/need to complete [action]?
Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic NE-577: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-ingress-operator/pull/735

Epic NE-357: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Task NE-408: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-ingress-operator/pull/451

Epic NE-683: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story NE-781: Implement configurable router probe timeouts in openshift/cluster-ingress-operator

View the Description View the linked PRs

Create a PR in openshift/cluster-ingress-operator to implement configurable router probe timeouts.

The PR should include the following:

Changes to the ingress operator's ingress controller to allow the user to configure the readiness and liveness probes' timeoutSeconds values.
Changes to existing unit tests to verify that the new functionality works properly.
Write E2E test to verify that the new functionality works properly.

https://github.com/openshift/cluster-ingress-operator/pull/736

Epic NE-729: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story NE-860: As a project admin, I want to specify a destination CA certificate on an ingress object, so that it gets injected into the generated route

View the linked PRs

https://github.com/openshift/openshift-controller-manager/pull/218

Epic NE-585: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-ingress-operator/pull/712

Epic NE-703: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story NE-755: Implement changes to UpstreamResolvers and ForwardPlugin APIs to allow configuration of DNS-over-TLS for upstream resolvers

View the Description View the linked PRs

User Story: As a customer in a highly regulated environment, I need the ability to secure DNS traffic when forwarding requests to upstream resolvers so that I can ensure additional DNS traffic and data privacy.

https://github.com/openshift/cluster-dns-operator/pull/314

Epic NE-700: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Task NE-882: Implement route-subdomain enhancement

View the linked PRs

Task NE-883: Write E2E test for the route-subdomain enhancement

View the linked PRs

https://github.com/openshift/origin/pull/27030

Feature OCPPLAN-7878: NetEdge - Maintainability and Debugability & Tech Backlog

View the Description

tldr: three basic claims, the rest is explanation and one example

We cannot improve long term maintainability solely by fixing bugs.
Teams should be asked to produce designs for improving maintainability/debugability.
Specific maintenance items (or investigation of maintenance items) should be placed into planning as peers to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.

Relevant links:

Documentation:
- Edge Diagnostics Scratchpad, our team's internal diagnostic guide.
- Troubleshooting OCP networking issues - The complete guide, the SDN team's diagnostic guide.
- Linux Performance, Brendan Gregg's guide to analyzing Linux performance issues.
- RFC: A proper feedback loop on Alerts.
- OpenShift Router Reload Technical Overview on Access.
- Performance Scaling HAProxy with OpenShift on Access.
- How to collect worker metrics to troubleshoot CPU load, memory pressure and interrupt issues and networking on worker nodes in OCP 4 on Access.
- OpenShift Performance and Scale Knowledge Base on Mojo, results from OpenShift scalability testing.
- Scalability and performance, OCP 4.5 documentation about the router's currently known scalability limits.
- Scaling OpenShift Container Platform HAProxy Router, OCP 3.11 documentation about the manual performance configuration that was possible in OCP 3.
- Timing web requests with cURL and Chrome from the Cloudflare blog.
- tcpdump advanced filters, some useful tcpdump commands.
- OpenShift SDN - Networking, OCP 3.11 documentation on the SDN (useful background reading).
- Ingress Operator and Controller Status Conditions, design document for improved status condition reporting.
- Observability tips for HAProxy, a slide deck by Willy Tarreau.
- Interesting Traces - Out of Order versus Retransmissions, analysis using tshark.
- The PCP Book: A Complete Documentation of Performance Co-Pilot, by Yogesh Babar.
- Debugging kernel networking bug, brief guide to using SystemTap on RHCOS.
- Troubleshooting throughput issues from the OCP 4.5 documentation.
- Troubleshooting OpenShift Clusters and Workloads.
- Red Hat Enterprise Linux Network Performance Tuning Guide (PDF).
- openshift/enhancements#289 stability: point to point network check, a diagnostic built into the kube-apiserver operator.
Diagnostic tools:
- dropwatch to watch for packet drops.
- ethtool to check NIC configuration.
- iovisor/bcc: BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more to trace and diagnose various issues in the networking stack.
- r-curler to gather timing information about HTTP/HTTPS connections.
- route-monitor, to monitor routes for reachability.
- hping(3), a programmable packet generator.
- OpenTracing / Jaeger in OpenShift.
- node-problem-detector, a possible integration point for new diagnostics.
- Using SystemTap by Brendan Gregg.
- DTrace SystemTap cheatsheet (PDF).
Visualization and more sophisticated diagnostic tools:
- eldadru/ksniff, kubectl plugin for tcpdump & Wireshark.
- ironcladlou/ditm, Dan's "Dan in the Middle" tool.
- Skydive, network diagnostic and visualization tool.
- ali, a "load testing tool capable of performing real-time analysis" with visualization.
Testing tools:
- stress-ng, a general stress-loading tool (CPU, filesystem, network, ...).
- mb, the networking benchmarking tool written and used by Jiri Mencak from our Perf+Scale team.
Case studies:
- BZ1763206 is an example of diagnosing DNS latency/timeouts.
- BZ1829779 Investigation details the diagnosis of route latency.
- BZ1845545 is an example of diagnosing misconfigured DNS for an external LB.
- Debugging network stalls on Kubernetes, from the GitHub Blog, about diagnosing Kubernetes performance issues related to ksoftirqd.

Epic NE-367: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-dns-operator/pull/307

Epic NE-626: [Testing & CI] CI coverage to ensure DNS resolution works with major client libraries

View the Description View the linked PRs

Per the 4.6.30 Monitoring DNS Post Mortem, we should add E2E tests to openshift/cluster-dns-operator to reduce the risk that changes to our CoreDNS configuration break DNS resolution for clients.

To begin with, we add E2E DNS testing for 2 or 3 client libraries to establish a framework for testing DNS resolvers; the work of adding additional client libraries to this framework can be left for follow-up stories. Two common libraries are Go's resolver and glibc's resolver. A somewhat common library that is known to have quirks is musl libc's resolver, which uses a shorter timeout value than glibc's resolver and reportedly has issues with the EDNS0 protocol extension. It would also make sense to test Java or other popular languages or runtimes that have their own resolvers.

Additionally, as discussed in our DNS Issue Retro & Testing Coverage meeting on Feb 28th 2024, we decided to add a test that sends a non-EDNS0 query for a record larger than 512 bytes, which was once an issue in bug ~~OCPBUGS-27397~~.
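The non-EDNS0 case can be sketched at the wire level. This is an illustration and not the actual E2E test: it builds a DNS query that carries no OPT record (so no EDNS0) and checks the TC (truncated) flag that a server sets when a UDP answer exceeds the classic 512-byte limit. The function names and the TXT default are assumptions.

```python
import struct

FLAG_RD = 0x0100  # recursion desired
FLAG_TC = 0x0200  # response was truncated


def build_plain_query(name: str, qtype: int = 16, query_id: int = 0x1234) -> bytes:
    """Build a DNS query without an OPT record, i.e. without EDNS0 (qtype 16 = TXT)."""
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    # ARCOUNT=0 means no additional records, hence no EDNS0 OPT pseudo-record.
    header = struct.pack("!HHHHHH", query_id, FLAG_RD, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def is_truncated(response: bytes) -> bool:
    """Return True when the TC bit is set in a DNS response header."""
    flags = struct.unpack("!H", response[2:4])[0]
    return bool(flags & FLAG_TC)
```

A client that receives a truncated response is expected to retry over TCP; a test along these lines would send the query over UDP and verify that the full record is eventually obtained.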

The ultimate goal is that the test will inform us when a change to OpenShift's DNS or networking has an effect that may impact end-user applications.

https://github.com/openshift/origin/pull/26957

Epic NE-709: [Tech Debt] [Perf+Scale] Investigate and improve memory performance of backend server weights with random

View the Description

In OCP 4.8 the router was changed to use the "random" balancing algorithm for non-passthrough routes by default. It was previously "leastconn".

Bug https://bugzilla.redhat.com/show_bug.cgi?id=2007581 shows that using "random" by default incurs significant memory overhead for each backend that uses it.

PR https://github.com/openshift/cluster-ingress-operator/pull/663 reverted the change and made "leastconn" the default again (OCP 4.8 onwards).

The analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2007581#c40 shows that the default haproxy behaviour is to multiply the weight (specified in the route CR) by 16 as it builds its data structures for each backend. If no weight is specified then openshift-router sets the weight to 256. If you have many, many thousands of routes then this balloons quickly and leads to a significant increase in memory usage, as highlighted by customer cases attached to BZ#2007581.
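The arithmetic behind this can be sketched as follows. The factor 16 and the default weight 256 come from the analysis above; the route count and the function name are illustrative, and the result is a relative count of scaled weight units, not a byte figure (the per-unit memory cost is not stated here).

```python
HAPROXY_WEIGHT_SCALE = 16    # multiplier reported in the Bugzilla analysis
ROUTER_DEFAULT_WEIGHT = 256  # weight openshift-router sets when none is specified


def scaled_weight_units(routes: int, servers_per_route: int = 1,
                        weight: int = ROUTER_DEFAULT_WEIGHT) -> int:
    """Total scaled weight across all backends: routes x servers x weight x 16."""
    return routes * servers_per_route * weight * HAPROXY_WEIGHT_SCALE
```

For 10,000 single-server routes, the default weight gives 40,960,000 units, while a weight of 1 gives 160,000, a factor of 256 lower. This is why lowering or unsetting the default is worth measuring.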

The purpose of this issue is to both explore changing the openshift-router default weight (i.e., 256) to something smaller, or indeed unset (assuming no explicit weight has been requested), and to measure the memory usage within the context of the existing perf&scale tests that we use for vetting new haproxy releases.

It may be that the low-hanging change is to not default to weight=256 for backends that only have one pod replica (i.e., if no value specified, and there is only 1 pod replica, then don't default to 256 for that single server entry).

Outcome: does changing the [default] weight value make it feasible to switch back to "random" as the default balancing algorithm for a future OCP release?

Story NE-825: Update router to default to "random" balancing algorithm in 4.11

View the Description View the linked PRs

Revert router to using "random" once again in 4.11 once analysis is done on impact of weight and static memory allocation.

https://github.com/openshift/cluster-ingress-operator/pull/727

Feature OCPPLAN-8029: Console: Dynamic Plugin Framework

View the Description

Feature Overview

Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced into the OCP Console release timelines.

The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:

Extend the Console
Deliver UI code with their Operator
Work in their own git Repo
Deliver at their own cadence

Goals

- Operators can deliver console plugins separate from the console image and update plugins when the operator updates.
- The dynamic plugin API is similar to the static plugin API to ease migration.
- Plugins can use shared console components such as list and details page components.
- Shared components from core will be part of a well-defined plugin API.
- Plugins can use Patternfly 4 components.
- Cluster admins control what plugins are enabled.
- Misbehaving plugins should not break console.
- Existing static plugins are not affected and will continue to work as expected.

Out of Scope

- Initially we don't plan to make this a public API. The target use is for Red Hat operators. We might reevaluate later when dynamic plugins are more mature.
- We can't avoid breaking changes in console dependencies such as Patternfly even if we don't break the console plugin API itself. We'll need a way for plugins to declare compatibility.
- Plugins won't be sandboxed. They will have full JavaScript access to the DOM and network. Plugins won't be enabled by default, however. A cluster admin will need to enable the plugin.
- This proposal does not cover allowing plugins to contribute backend console endpoints.

Requirements

Requirement	Notes	isMvp?
UI to enable and disable plugins		YES
Dynamic Plugin Framework in place		YES
Testing Infra up and running		YES
Docs and read me for creating and testing Plugins		YES
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

Documentation Considerations

Questions to be addressed:

What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?

Does this feature have doc impact?

New Content, Updates to existing content, Release Note, or No Doc Impact

If unsure and no Technical Writer is available, please contact Content Strategy.

What concepts do customers need to understand to be successful in [action]?

How do we expect customers will use the feature? For what purpose(s)?

What reference material might a customer want/need to complete [action]?

Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.

What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic CONSOLE-2907: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story CONSOLE-2381: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/9679

Story CONSOLE-2946: Expose all of core PatternFly for dynamic plugin use

View the Description View the linked PRs

Currently, webpack tree shakes PatternFly and only includes the components used by console in its vendor bundle. We need to expose all of the core PatternFly components for use in dynamic plugins, which means we have to disable tree shaking for PatternFly. We should expose this as a separate bundle. This will allow browsers to cache more efficiently and only need to load the PF bundle again when we upgrade PatternFly.

Open Questions

What parts of PatternFly do we consider core?

Acceptance Criteria

All PatternFly core components are exposed to dynamic plugins
PatternFly is exposed as a separate bundle that is not part of the main vendor bundle

cc Christian Vogt Vojtech Szocs Joseph Caiani James Talton

https://github.com/openshift/console/pull/9882

Feature OCPPLAN-8030: Console: Customer Happiness (RFEs) for 4.8-4.12

View the Description

Feature Overview

This Section: High-level description of the feature, i.e. Executive Summary

Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

Goals

This Section: Provide high-level goal statement, providing user context and expected user outcome(s) for this feature

Requirements

This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

(Optional) Use Cases

This Section:

Main success scenarios - high-level user stories

Alternate flow/scenarios - high-level user stories

Questions to answer…

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

Customer Considerations

Documentation Considerations

Questions to be addressed:

What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?

Does this feature have doc impact?

New Content, Updates to existing content, Release Note, or No Doc Impact

If unsure and no Technical Writer is available, please contact Content Strategy.

What concepts do customers need to understand to be successful in [action]?

How do we expect customers will use the feature? For what purpose(s)?

What reference material might a customer want/need to complete [action]?

Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.

What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic CONSOLE-2893: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story CONSOLE-2967: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/9956

Story CONSOLE-922: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/10137

Story CONSOLE-2360: Run Pod in Debug mode

As a user, I want the ability to run a pod in debug mode.

This should be the equivalent of running: oc debug pod

Acceptance Criteria for MVP

Build off of the crash-loop back-off popover from https://github.com/openshift/console/pull/7302 to include a description of what crash-loop back-off is, a link to view logs, a link to view events, and a link to debug (container-name) in terminal. If more than one container is crash-looping, list them individually.
Create a debug container page that includes breadcrumbs as well as the terminal to debug. Add an informational alert at the top to make it clear that this is a temporary pod and that closing this page will delete it.
Add debug in terminal as an action to the logs toolbar. Only enable the action when the crash-loop back-off status occurs for the selected container. Add a tooltip to explain when the action is disabled.

Assets
Designs (WIP): https://docs.google.com/document/d/1b2n9Ox4xDNJ6AkVsQkXc5HyG8DXJIzU8tF6IsJCiowo/edit#

https://github.com/openshift/console/pull/9578

Epic CONSOLE-3051: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story CONSOLE-2321: Allow operators installed globally to display operand instances for all managed namespaces in their details

When viewing the Installed Operators list set to 'All projects' and selecting an operator that is available in 'All namespaces' (globally installed), the user is taken to the details of that operator in its install namespace (the project selector switches to the install namespace).

This can be disorienting: the lists of custom resource instances all appear blank, because they show instances only in the currently selected project (the install namespace) and not across all namespaces the operator is available in.

It is likely that making use of the new Operator resource will improve this experience (CONSOLE-2240), though that may still be some releases away. It should be considered whether a "short term" fix is worthwhile in the meantime.

Note: The informational alert was not implemented. It was decided that since "All namespaces" is displayed in the radio button, the alert was not needed.

https://github.com/openshift/console/pull/8930

Story CONSOLE-3063: [RFE] PDB for console operands to avoid going too many replicas down

During master nodes upgrade when nodes are getting drained there's currently no protection from two or more operands going down. If your component is required to be available during upgrade or other voluntary disruptions, please consider deploying PDB to protect your operands.

The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.

Example:

https://github.com/openshift/cluster-authentication-operator/pull/476/files

https://github.com/openshift/cluster-authentication-operator/pull/514/files

Acceptance Criteria:
1. Create PDB controller in console-operator for both console and downloads pods
2. Add e2e tests for PDB in single node and multi node cluster
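
As a rough illustration of the protection a PDB gives during a node drain, the arithmetic can be sketched as follows. This is a simplified model and not the console-operator or Kubernetes implementation: it assumes integer (not percentage) values, the function name `allowed_disruptions` is hypothetical, and the replica counts and the `min_available=1` budget are example values rather than what the operator actually configures.

```python
from typing import Optional


def allowed_disruptions(
    healthy: int,
    expected: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Return how many pods may be evicted voluntarily right now."""
    # A budget is expressed with exactly one of the two fields.
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    # Never report a negative number of allowed disruptions.
    return max(0, healthy - desired_healthy)


# Two replicas with an assumed budget of minAvailable=1.
print(allowed_disruptions(healthy=2, expected=2, min_available=1))
# Once one replica is down, a second voluntary eviction is refused.
print(allowed_disruptions(healthy=1, expected=2, min_available=1))
```

Under these assumptions a drain can take down one operand at a time but never both, which is the behaviour the story asks for.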

Note: We should consider backporting this to 4.10

https://github.com/openshift/console-operator/pull/655

Story CONSOLE-2936: Add support for PDB

Goal
Add support for PDB (Pod Disruption Budget) to the console.

Requirements:

Add a list, detail, and YAML view (with samples) for PDBs. In addition, update the workloads page to support PDBs as well.
For the PDB list page, include a table with name, namespace, selector, availability, allowed disruptions, and created. In addition to the table, provide the main call to action to create a PDB.
For the PDB details page, provide Details, YAML, and Pods tabs. The Pods tab will include a list of pods associated with the PDB; make sure to surface the owner column.
When users create a PDB from the list page, take them to the YAML and provide samples to enhance the creation experience. Sample 1: Set max unavailable to 0, Sample 2: Set min available to 25% (confirming samples with stakeholders). If a PDB has already been applied, warn users that adding another is not recommended. Also cover use cases that keep users from creating poor policies, for example setting the minimum available to zero.
Add the ability to add/edit/view PDBs on a workload. If users edit a PDB applied to multiple workloads, warn them that the change will affect all of those workloads and not only the one they are currently editing. When a PDB has been applied, add a new field to the details page with a link to the PDB and policy.
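
To make the "keep users from creating poor policies" requirement concrete, here is a minimal sketch of a check over a PDB manifest held as a Python dict. The function name `pdb_policy_warnings` and the example names (`example-pdb`, `example`) are hypothetical; the field names `minAvailable`, `maxUnavailable`, and `selector` are those of the Kubernetes `policy/v1` PodDisruptionBudget. It is an illustration only, not the console's validation logic.

```python
from typing import Any, Dict, List


def pdb_policy_warnings(spec: Dict[str, Any]) -> List[str]:
    """Return human-readable warnings for a PodDisruptionBudget spec."""
    warnings: List[str] = []
    has_min = "minAvailable" in spec
    has_max = "maxUnavailable" in spec
    if has_min and has_max:
        # Kubernetes accepts only one of the two fields in a single PDB.
        warnings.append("minAvailable and maxUnavailable are mutually exclusive")
    if has_min and spec["minAvailable"] in (0, "0%"):
        # The poor policy called out in the requirements above.
        warnings.append("minAvailable of zero gives the workload no protection")
    return warnings


# Sample 1 from the requirements: max unavailable set to 0.
sample_1 = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "example-pdb", "namespace": "example"},
    "spec": {
        "maxUnavailable": 0,
        "selector": {"matchLabels": {"app": "example"}},
    },
}

print(pdb_policy_warnings(sample_1["spec"]))
print(pdb_policy_warnings({"minAvailable": 0}))
```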

Designs:

Exploratory designs (by Chris): https://www.sketch.com/s/a2668252-07fe-4472-a96f-d3bf94423959/p/AF1CAEE1-DB56-40B7-ADE8-970EA4D1F9
Final designs (by Thi): Marvel | Doc

Samuel Padgett Colleen Hart

https://github.com/openshift/console/pull/10445

Feature OCPPLAN-8108: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Epic OCPBUILD-30: Build Rebases OCP 4.11

Epic Goal

Rebase OpenShift components to k8s v1.24

Why is this important?

Rebasing ensures components work with the upcoming release of Kubernetes
Address tech debt related to upstream deprecations and removals.

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

k8s 1.24 release

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story BUILD-418: Rebase openshift-controller-manager-operator to k8s 1.24

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/242

Story BUILD-416: Rebase openshift/builder to k8s 1.24

Rebase openshift/builder to k8s 1.24

https://github.com/openshift/builder/pull/299

Feature OCPPLAN-8150: Agent-Based Installer GA

Feature Overview

As an infrastructure owner, I want a repeatable method to quickly deploy the initial OpenShift cluster.

As an infrastructure owner, I want to install the first (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters.

Goals

Enable customers and partners to successfully deploy a single “first” cluster in disconnected, on-premises settings

Requirements

4.11 MVP Requirements

Customers and partners need to be able to download the installer
Enable customers and partners to deploy a single “first” cluster (cluster 0) using single node, compact, or highly available topologies in disconnected, on-premises settings
Installer must support advanced network settings such as static IP assignments, VLANs and NIC bonding for on-premises metal use cases, as well as DHCP and PXE provisioning environments.
Installer needs to support automation, including integration with third-party deployment tools, as well as user-driven deployments.
In the MVP automation has higher priority than interactive, user-driven deployments.
For bare metal deployments, we cannot assume that users will provide us the credentials to manage hosts via their BMCs.
Installer should prioritize support for platforms None, baremetal, and VMware.
The installer will focus on a single version of OpenShift, and a different build artifact will be produced for each different version.
The installer must not depend on a connected registry; however, the installer can optionally use a previously mirrored registry within the disconnected environment.

Use Cases

As a Telco partner engineer (Site Engineer, Specialist, Field Engineer), I want to deploy an OpenShift cluster in production with limited or no additional hardware and don’t intend to deploy more OpenShift clusters [Isolated edge experience].
As an Enterprise infrastructure owner, I want to manage the lifecycle of multiple clusters in 1 or more sites by first installing the first (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters [Cluster before your cluster].
As a Partner, I want to package OpenShift for large scale and/or distributed topology with my own software and/or hardware solution.
As a large enterprise customer or Service Provider, I want to install a “HyperShift Tugboat” OpenShift cluster in order to offer a hosted OpenShift control plane at scale to my consumers (DevOps Engineers, tenants) that allows for fleet-level provisioning for low CAPEX and OPEX, much like AKS or GKE [Hypershift].
As a new, novice to intermediate user (Enterprise Admin/Consumer, Telco Partner integrator, RH Solution Architect), I want to quickly deploy a small OpenShift cluster for PoC/Demo/Research purposes.

Questions to answer…

Out of Scope

Out of scope use cases (that are part of the Kubeframe/factory project):

As a Partner (OEMs, ISVs), I want to install and pre-configure OpenShift with my hardware/software in my disconnected factory, while allowing further (minimal) reconfiguration of a subset of capabilities later at a different site by different set of users (end customer) [Embedded OpenShift].
As an Infrastructure Admin at an Enterprise customer with multiple remote sites, I want to pre-provision OpenShift centrally prior to shipping and activating the clusters in remote sites.

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

The user has access only to the target nodes that will form the cluster and will boot them with the image presented locally via a USB stick. This scenario is common in sites with restricted access, such as government infrastructure, where only users with security clearance can interact with the installation and where software is allowed onto the premises (on a USB stick, DVD, SD card, etc.) but never allowed to come back out. Users can't bring in supporting devices such as laptops or phones.
The user has remote access to the BMCs of the target nodes (e.g. iDRAC, iLO) and can map an image as virtual media from their computer. This scenario is common in data centers where the customer provides network access to the BMCs of the target nodes.
We cannot assume that we will have access to a computer to run an installer or installer helper software.

Customer Considerations

Documentation Considerations

Questions to be addressed:

What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
Does this feature have doc impact?
New Content, Updates to existing content, Release Note, or No Doc Impact
If unsure and no Technical Writer is available, please contact Content Strategy.
What concepts do customers need to understand to be successful in [action]?
How do we expect customers will use the feature? For what purpose(s)?
What reference material might a customer want/need to complete [action]?
Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
What is the doc impact (New Content, Updates to existing content, or Release Note)?

References

Bootable Ephemeral Installer Image for Disconnected Cluster 0 Deployments

Epic AGENT-10: CLI based automated deployment

Epic Goal

As an OpenShift infrastructure owner, I need to be able to integrate the installation of my first on-premises OpenShift cluster with my automation flows and tools.
As an OpenShift infrastructure owner, I must be able to provide the CLI tool with manifests that contain the definition of the cluster I want to deploy
As an OpenShift Infrastructure owner, I must be able to get the validation errors in a programmatic way
As an OpenShift Infrastructure owner, I must be able to get the events and progress of the installation in a programmatic way
As an OpenShift Infrastructure owner, I must be able to retrieve the kubeconfig and OpenShift Console URL in a programmatic way

Why is this important?

When deploying clusters with a large number of hosts, or when deploying many clusters, it is common to need to automate the installations.
Customers and partners usually use third party tools of their own to orchestrate the installation.
For Telco RAN deployments, Telco partners need to repeatably deploy multiple OpenShift clusters in parallel to multiple sites at-scale, with no human intervention.

Scenarios

Monitoring flow:
1. Generate all the manifests for the cluster
2. Call the CLI tool, pointing to the manifests path
3. Obtain the installation image from the nodes
4. Use my infrastructure capabilities to boot the image on the target nodes
5. Use the tool to connect to assisted service to get validation status and events
6. Use the tool to retrieve credentials and URL for the deployed cluster

Acceptance Criteria

Backward compatibility between OCP releases with automation manifests (they can be applied to a newer version of OCP).
Installation progress and events can be tracked programmatically
Validation errors can be obtained programmatically
Kubeconfig and console URL can be obtained programmatically
CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

References

Bootable Ephemeral Installer Image for Disconnected Cluster 0 Deployments

Story AGENT-40: Get the cluster credentials

User Story:

As a deployer, I want to be able to:

Get the credentials for the cluster that is going to be deployed

so that I can achieve

Checking the installed cluster for installation completion
Connect and administer the cluster that gets installed

Currently the Assisted Service generates the credentials by running the ignition generation step of the openshift-installer. This is why the credentials are only retrievable from the REST API towards the end of the installation.

In the BILLI usage, which takes down the assisted service before the installation is complete, there is no obvious point at which to alert the user that they should retrieve the credentials. This means that we need to do one of the following:

Allow the user to pass the admin key that will then get signed by the generated CA and replace the key that is made by openshift-installer (would mean new functionality in AI)
Allow the key to be retrieved by SSH with the fleeting command from the node0 (after it has generated). The command should be able to wait until it is possible
Have the possibility to POST it somewhere
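
The "wait until it is possible" behaviour mentioned in the second option can be sketched as a generic polling loop. This is only an illustration: `wait_for` and the `fetch` callable are hypothetical stand-ins for whatever mechanism ends up retrieving the key from node0, and nothing here reflects the actual installer code.

```python
import time
from typing import Callable, Optional


def wait_for(
    fetch: Callable[[], Optional[str]],
    timeout_s: float,
    interval_s: float = 5.0,
) -> str:
    """Poll fetch() until it returns a value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("credentials were not available before the timeout")
        time.sleep(interval_s)


# Hypothetical fetcher: the key only becomes available on the third attempt.
attempts = {"count": 0}


def fake_fetch() -> Optional[str]:
    attempts["count"] += 1
    return "kubeconfig-contents" if attempts["count"] >= 3 else None


print(wait_for(fake_fetch, timeout_s=10, interval_s=0))
```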

Acceptance Criteria:

The admin key is generated and usable to check for installation completeness

https://github.com/openshift/installer/pull/5872

Feature OCPPLAN-8154: Apply user defined tags to all resources created by OpenShift (AWS)

Feature Overview

The AWS-specific code added in ~~OCPPLAN-6006~~ needs to become GA and with this we want to introduce a couple of Day2 improvements.
Currently the AWS tags are defined and applied at installation time only and saved in the infrastructure CRD's status field for further operator use; the operators in turn just add the tags during resource creation.

Saving in the status field means it's not included in Velero backups, which is a crucial feature for customers and Day2.
Thus the status.resourceTags field should be deprecated in favour of a newly created spec.resourceTags with the same content. The installer should only populate the spec, consumers of the infrastructure CRD must favour the spec over the status definition if both are supplied, otherwise the status should be honored and a warning shall be issued.
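
The precedence rule described here (spec wins, status is the fallback and triggers a warning) can be sketched as below. This is a simplified illustration: the infrastructure object is modelled as a plain dict using the field paths named in this description, the tag entries are assumed to be key/value pairs, and `effective_resource_tags` is a hypothetical helper, not an existing API.

```python
import warnings
from typing import Any, Dict, List

Tag = Dict[str, str]


def effective_resource_tags(infra: Dict[str, Any]) -> List[Tag]:
    """Pick the tags a consumer should honor: spec first, then status."""
    spec_tags = (infra.get("spec") or {}).get("resourceTags")
    status_tags = (infra.get("status") or {}).get("resourceTags")
    if spec_tags:
        # spec is favoured whenever it is supplied, even if status is also set.
        return spec_tags
    if status_tags:
        warnings.warn("status.resourceTags is deprecated; use spec.resourceTags")
        return status_tags
    return []


both = {
    "spec": {"resourceTags": [{"key": "team", "value": "a"}]},
    "status": {"resourceTags": [{"key": "team", "value": "old"}]},
}
print(effective_resource_tags(both))
```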

Being part of the spec, the behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the AWS resources should be updated accordingly.

On AWS this can be done without re-creating any resources (the behaviour is basically an upsert by tag key) and is possible without service interruption as it is a metadata operation.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
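
The "upsert by tag key, never delete" behaviour from the paragraphs above can be illustrated with a small reconciliation helper. It is a sketch under the assumption that tags are compared as plain key/value dicts; `tags_to_upsert` is a hypothetical name and no AWS API is called.

```python
from typing import Dict


def tags_to_upsert(current: Dict[str, str], desired: Dict[str, str]) -> Dict[str, str]:
    """Return only the tags that must be created or overwritten on a resource."""
    return {key: value for key, value in desired.items() if current.get(key) != value}


# "custom" was added by the customer and is not in the desired set.
current = {"owner": "team-a", "custom": "keep-me"}
desired = {"owner": "team-b", "env": "prod"}

changes = tags_to_upsert(current, desired)
print(changes)

# Applying the changes leaves the customer's own tag untouched.
print({**current, **changes})
```

Because the helper only ever returns keys from the desired set, tags that exist on the resource but not in the infrastructure CRD are left alone, matching the statement that deletes are out of scope.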

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

After that, we can remove the experimental flag and make this a GA feature.

Goals

Inclusion in the cluster backups
Flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

List any affected packages or components.

Installer
Cluster Infrastructure
Storage
Node
NetworkEdge
Internal Registry
CCO

Epic CFE-69: User defined tags for AWS Resources GA

~~RFE-1101~~ described user defined tags for AWS resources provisioned by an OCP cluster. Currently, users can define tags that are added to the resources during creation. These tags cannot be updated subsequently. The propagation of the tags is controlled using an experimental flag. Before this feature goes GA, we should define and implement a mechanism to exclude any experimental flags. Day2 operations and deletion of tags are not in scope.

Story CFE-68: Make user defined resource tags on EC2 instances updatable

~~RFE-2012~~ aims to make the user-defined resource tags feature GA. This means that user defined tags should be updatable.

Currently the user-defined tags during install are passed directly as parameters of the Machine and Machineset resources for the master and worker. As a result these tags cannot be updated by consulting the Infrastructure resource of the cluster where the user defined tags are written.

The MCO should be changed such that during provisioning the MCO looks up the values of the tags in the Infrastructure resource and adds the tags during creation of the EC2 resources. The MCO should also watch the infrastructure resource for changes and when the resource tags are updated it should update the tags on the EC2 instances without restarts.

Acceptance Criteria:

~~e2e test where the ResourceTags are updated and then the test verifies that the tags on the ec2 instances are updated without restarts.~~ now moved to ~~CFE-179~~

https://github.com/openshift/machine-api-provider-aws/pull/15

Feature OCPSTRAT-180: Improve upgrades - phase 1

Feature Overview

Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.

Goals

Have an option to do upgrades in more discrete steps under admin control. Specifically, these steps are:
- Control plane upgrade
- Worker nodes upgrade
- Workload enabling upgrade (i.e. Router, other components) or infra nodes
Better visibility into any errors during the upgrades and documentation of what the errors mean and how to recover.
A user experience around an end-to-end backup and restore after a failed upgrade
~~OTA-810~~ - Better Documentation:
- Backup procedures before upgrades.
- More control over worker upgrades (with tagged pools between user Vs admin)
- The kinds of pre-upgrade tests that are run, the errors that are flagged and what they mean and how to address them.
- Better explanation of each discrete step in upgrades, and what each CVO Operator is doing and potential errors, troubleshooting and mitigating actions.

References

Epic CONSOLE-2927: Add Control Plane Upgrade to Web Console

Epic Goal

Provide a one click option to perform an upgrade which pauses all non master pools

Why is this important?

Customers are increasingly asking that the overall upgrade is broken up into more digestible pieces
This is the limit of what's possible today
- R&D work will be done in the future to allow for further bucketing of upgrades into Control Plane, Worker Nodes, and Workload Enabling components (ie: router) That will however take much more consideration and rearchitecting

Scenarios

An admin selecting their upgrade is offered two options "Upgrade Cluster" and "Upgrade Control Plane"

1. If the admin selects Upgrade Cluster they get the pre 4.10 behavior
2. If the admin selects Upgrade Control Plane all non master pools are paused and an upgrade is initiated
A tooltip should clarify what the difference between the two is
The pool progress bars should indicate pause/unpaused status, non master pools should allow for unpausing
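
The "Upgrade Control Plane" scenario above amounts to pausing every pool except master before the update starts. A minimal sketch of that selection follows, assuming each pool is represented as a dict whose `name` and `paused` keys mirror a MachineConfigPool's metadata.name and spec.paused; the function name `pools_to_pause` is hypothetical and this is not the console's code.

```python
from typing import Any, Dict, List


def pools_to_pause(pools: List[Dict[str, Any]], control_plane_only: bool) -> List[str]:
    """Return the names of pools that must be paused before starting the update."""
    if not control_plane_only:
        # "Upgrade Cluster": keep the existing behaviour and pause nothing.
        return []
    return [
        pool["name"]
        for pool in pools
        if pool["name"] != "master" and not pool.get("paused", False)
    ]


pools = [
    {"name": "master", "paused": False},
    {"name": "worker", "paused": False},
    {"name": "infra", "paused": True},  # already paused, nothing to do
]
print(pools_to_pause(pools, control_plane_only=True))
print(pools_to_pause(pools, control_plane_only=False))
```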

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

While this epic doesn't specifically target upgrading from 4.N to 4.N+1 to 4.N+2 with non master pools paused, it would fundamentally enable that, and it would simplify the UX described in Paused Worker Pool Upgrades

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story CONSOLE-2976: Add the ability to update control plane only to Cluster Settings

Goal
Add the ability to choose between a full cluster upgrade (which exists today) or control plane upgrade (which will pause all worker pools) in the console.

Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.

Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.

Requirements

Changes to the Update modal:
1. Add the ability to choose between a cluster upgrade and a control plane upgrade (the design does not default to a selection but rather disables the update button to force the user to make a conscious decision)
2. link out to documentation to learn more about update strategies
Changes to the in progress check list:
1. Add a status above the worker pool section to let users know that all worker pools are paused and an action to resume all updates
2. Add a "resume update" button for each worker pool entry
Changes to the update status:
1. When all master pools are updated successfully, change the status from what we have today "Up to date" to something like "Control plane up to date - all worker pools paused"
Add an inline alert that lets users know there is a 60 day window to update all worker pools. In the alert, include the sentiment that worker pools can remain paused as long as is normally safe, which means until certificate rotation becomes critical, at about 60 days. The admin would be advised to unpause them in order to complete the full upgrade. If the MCPs are paused, certificate rotation does not happen, which causes the cluster to become degraded and causes failures in multiple 'oc' commands, including but not limited to 'oc debug', 'oc logs', 'oc exec' and 'oc attach'. (Are we missing anything else here?) Inline alert logic:
1. From day 60 to day 10 use the default alert.
2. From day 10 to day 3 use the warning alert.
3. From day 3 to 0 use the critical alert and continue to persist until resolved.
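
The inline alert logic above maps the days left in the window to an alert level. A minimal sketch follows; because the listed ranges overlap at day 10 and day 3, it assumes that a boundary day takes the more severe level, and `alert_severity` is a hypothetical name rather than a console function.

```python
def alert_severity(days_left: int) -> str:
    """Map days left in the 60 day window to the alert level to display."""
    if days_left > 10:
        return "default"
    if days_left > 3:
        # Assumption: day 10 itself already shows the warning alert.
        return "warning"
    # Assumption: day 3 itself is critical; the alert persists at and below
    # zero until the pools are unpaused.
    return "critical"


for days in (60, 11, 10, 4, 3, 0):
    print(days, alert_severity(days))
```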

Design deliverables:

Doc
Marvel

https://github.com/openshift/console/pull/11053

Story CONSOLE-2977: Improve MachineConfigPool page to support ability to update control plane only

Goal
Improve the UX on the machine config pool page to reflect the new enhancements on the cluster settings page that allow users to update the control plane only.

Requirements

Changes to the table:
1. Remove the "Updated", "Updating", and "Paused" columns. We could also consider adding column management to this table and hiding those columns by default.
2. Add "Update status" as a column, and surface the same status as on cluster settings. Show updating, paused, or up to date rather than true or false values.
3. Surface the update action in the table row.
Add an inline alert that lets users know there is a 60 day window to update all worker pools. In the alert, include the sentiment that worker pools can remain paused as long as is normally safe, which means until certificate rotation becomes critical, at about 60 days. The admin would be advised to unpause them in order to complete the full upgrade. If the MCPs are paused, certificate rotation does not happen, which causes the cluster to become degraded and causes failures in multiple 'oc' commands, including but not limited to 'oc debug', 'oc logs', 'oc exec' and 'oc attach'. (Are we missing anything else here?) Add the same alert logic to this page as the cluster settings:
1. From day 60 to day 10 use the default inline alert.
2. From day 10 to day 3 use the warning inline alert.
3. From day 3 to 0 use the critical alert and continue to persist until resolved.
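
The single "Update status" column requested in the table changes above replaces three true/false columns with one label. One way to collapse them is sketched below; the precedence (paused first, then updating, then up to date) and the "Unknown" fallback are assumptions for illustration, and `pool_update_status` is a hypothetical helper.

```python
def pool_update_status(paused: bool, updated: bool, updating: bool) -> str:
    """Collapse the three boolean pool conditions into one status label."""
    if paused:
        # Assumption: a paused pool is reported as paused regardless of the rest.
        return "Paused"
    if updating:
        return "Updating"
    if updated:
        return "Up to date"
    # Assumption: fallback label for combinations the requirements do not cover.
    return "Unknown"


print(pool_update_status(paused=True, updated=False, updating=False))
print(pool_update_status(paused=False, updated=False, updating=True))
print(pool_update_status(paused=False, updated=True, updating=False))
```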

Design deliverables:

Doc
Marvel

https://github.com/openshift/console/pull/11502

Feature OCPSTRAT-469: Install and upgrade OpenShift with GCP Workload Identity

Feature Overview

Connect OpenShift workloads to Google services with Google Workload Identity

Enable customers to access Google services from workloads on OpenShift clusters using Google Workload Identity (aka WIF)
https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity

Goals

Customers want to be able to manage and operate OpenShift on Google Cloud Platform with workload identity, much like they do with AWS + STS or Azure + workload identity.
Customers want to be able to manage and operate operators and customer workloads on top of OCP on GCP with workload identity.

Requirements

Add support to CCO for the Installation and Upgrade using both UPI and IPI methods with GCP workload identity.
Support install and upgrades for connected and disconnected/restricted environments.
Support the use of Operators with GCP workload identity with minimal friction.
Support for HyperShift and non-HyperShift clusters.
This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

(Optional) Use Cases

This Section:

Main success scenarios - high-level user stories
Alternate flow/scenarios - high-level user stories
...

Questions to answer…

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

Customer Considerations

Documentation Considerations

Questions to be addressed:

What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
Does this feature have doc impact?
New Content, Updates to existing content, Release Note, or No Doc Impact
If unsure and no Technical Writer is available, please contact Content Strategy.
What concepts do customers need to understand to be successful in [action]?
How do we expect customers will use the feature? For what purpose(s)?
What reference material might a customer want/need to complete [action]?
Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic CCO-114: Support GCP workload identity

View the Description

Epic Goal

Complete the implementation for GCP workload identity, including support and documentation.

Why is this important?

Many customers want to follow best security practices for handling credentials.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story CCO-123: Update openshift operators to consume new 'external_account' type credentials

View the Description

We need to ensure the following in the OpenShift operators:

1) Make sure the operator uses v0.0.0-20210218202405-ba52d332ba99 or a later version of the golang.org/x/oauth2 module.

2) Mount the OIDC token in the operator pod; this needs to go in the deployment. We have done it for cluster-image-registry-operator here

3) For workload identity to work, the GCP credentials that the operator pod uses should be of the external_account type (not service_account). Credentials of the external_account type contain the path to the OIDC token and the URL of the service account to impersonate, along with other details. Credentials of this type can be generated from the GCP console or programmatically (supported by ccoctl). The operator pod can then consume them from a kube secret. Make appropriate code changes to the operators so that they can consume these new credentials.
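
To illustrate the difference between the two credential types in point 3, here is a minimal sketch of an external_account credentials file and a check an operator could run before using it. The field names follow the Google external account credential format; the project number, pool, provider, service account email, and token path are hypothetical placeholders, and describe_credentials is an illustrative helper, not code from any operator.

```python
import json

# Hypothetical placeholder values; real ones come from the GCP project
# and from the credentials generated by the GCP console or ccoctl.
EXAMPLE_CREDENTIALS = json.dumps({
    "type": "external_account",
    "audience": "//iam.googleapis.com/projects/123456789/locations/global/"
                "workloadIdentityPools/example-pool/providers/example-provider",
    "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
    "token_url": "https://sts.googleapis.com/v1/token",
    "service_account_impersonation_url":
        "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/"
        "example-sa@example-project.iam.gserviceaccount.com:generateAccessToken",
    "credential_source": {
        # Assumed mount path of the OIDC token inside the operator pod.
        "file": "/var/run/secrets/openshift/serviceaccount/token",
        "format": {"type": "text"},
    },
})


def describe_credentials(raw: str) -> dict:
    """Report the credential type and, for external_account, where the
    OIDC token is read from and which service account is impersonated."""
    config = json.loads(raw)
    cred_type = config.get("type")
    if cred_type != "external_account":
        # For example a long-lived service_account key: no workload identity.
        return {"type": cred_type, "workload_identity": False}
    return {
        "type": cred_type,
        "workload_identity": True,
        "token_file": config["credential_source"]["file"],
        "impersonation_url": config["service_account_impersonation_url"],
    }
```

The token_file value has to match the path where the deployment mounts the OIDC token (point 2), otherwise the token exchange cannot find the token.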

The following repos need one or more of the above changes:

Sub-task CCO-135: Update image registry to consume new 'external_account' type credentials

View the Description View the linked PRs

repo link: https://github.com/openshift/image-registry

https://github.com/openshift/image-registry/pull/283

Feature OCPSTRAT-475: Enable sharing ConfigMaps and Secrets across namespaces [Tech Preview]

View the Description

Feature Overview

Enable sharing ConfigMap and Secret across namespaces

Requirements

Requirement	Notes	isMvp?
Secrets and ConfigMaps can get shared across namespaces		YES

Questions to answer…

Out of Scope

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL subscription manager) entitlement model. To provide an experience sufficiently similar to OCP 3, the entitlement certificates that are made available on the cluster (~~OCPBU-93~~) should be shared across namespaces, so that cluster admins do not need to copy these entitlements into each namespace, which creates additional operational challenges for updating and refreshing them.

Documentation Considerations

Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic BUILD-293: Tech Preview Shared Resource CSI Driver

View the Description

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Deliver the Projected Resources CSI driver via the OpenShift Payload

Why is this important?

Projected resource shares will be a core feature of OpenShift. The share and CSI driver have multiple use cases that are important to users and cluster administrators.
The use of projected resources will be critical to distributing Simple Content Access (SCA) certificates to workloads, such as Deployments, DaemonSets, and OpenShift Builds.

Scenarios

As a developer using OpenShift
I want to mount a Simple Content Access certificate into my build
So that I can access RHEL content within a Docker strategy build.

As an application developer or administrator
I want to share credentials across namespaces
So that I don't need to copy credentials to every workspace

Acceptance Criteria

OCP conformance suite must ensure that the projected resource CSI driver is installed on every OpenShift deployment.
OCP build suite tests that projected resource CSI driver volumes can be added to builds, but only if builds support inline CSI volumes.
Release Technical Enablement - Docs and demos on how to create a Projected Resource share and add it as a volume to workloads. A special use case for adding RHEL entitlements to builds should be included.

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story BUILD-284: Integrate Shared Resources Operator with Cluster Storage Operator

View the Description View the linked PRs

User Story

As a cluster admin
I want the cluster storage operator to install the shared resources CSI driver
So that I can test the shared resources CSI driver on my cluster

Acceptance Criteria

Cluster storage operator uses image references to resolve the csi-driver-shared-resource-operator and all images needed to deploy the csi driver.
Shared resources CSI driver is installed when the cluster enables the CSIDriverSharedResources feature gate, OR
Shared resource CSI driver is installed when the cluster enables the TechPreviewNoUpgrade feature set
CI ensures that if the TechPreviewNoUpgrade feature set is enabled on the cluster, the shared resource CSI driver is deployed and functions correctly.
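
The two install conditions above can be sketched as a single predicate. This is an illustration only, assuming the operator has already read the cluster's feature set name and its list of individually enabled gates; the function name and parameters are hypothetical and do not come from the cluster storage operator.

```python
def should_install_shared_resource_driver(feature_set: str, enabled_gates: set) -> bool:
    """Install the driver if the tech preview feature set is on, OR if the
    CSIDriverSharedResources gate is enabled on its own."""
    return (
        feature_set == "TechPreviewNoUpgrade"
        or "CSIDriverSharedResources" in enabled_gates
    )
```

A CI check for the last criterion would then assert that the driver is deployed whenever this predicate is true for the cluster under test.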

Docs Impact

Docs will need to identify how to install the shared resources CSI driver (by enabling the tech preview feature set)

Notes

Tasks:

Add the Share APIs (SharedSecret, SharedConfigMap) to openshift/api
Generate clients in openshift/client-go for Share APIs
Update the CSI driver name used in the enum for the ClusterCSIDriver custom resource.
Generate custom resource definitions and include it in the deployment YAMLs for the shared resource operator
Add YAML deployment manifests for the shared resource operator to the cluster storage operator (include necessary RBAC)
Ensure cluster storage operator has permission to create custom resource definitions
Enhance the cluster storage operator to install the shared resource CSI driver only when the cluster enables the CSIDriverSharedResources feature gate

Note that to be able to test all of this on any cloud provider, we need ~~STOR-616~~ to be implemented. We can work around this by making the CSI driver installable on AWS or GCP for testing purposes.

The cluster storage operator has cluster-admin permissions. However, no other CSI driver managed by the operator includes a CRD for its API.

See https://issues.redhat.com/browse/BUILD-159?focusedCommentId=16360509&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16360509

https://github.com/openshift/cluster-storage-operator/pull/198

Story BUILD-345: Expose CSI driver metrics to Telemetry

View the Description View the linked PRs

User Story

As an OpenShift engineer
I want to know which clusters are using the Shared Resource CSI Driver
So that I can be proactive in supporting customers who are using this tech preview feature

Acceptance Criteria

Key metrics for the shared resource CSI driver are exported to Telemeter via the cluster monitoring operator.

Docs Impact

None - metrics exported to telemetry are not formally documented.

QE Impact

QE can verify that the query/recording rule for cluster monitoring operator returns data if the cluster has the Shared Resource CSI driver installed and utilizes a SharedSecret or SharedConfigMap in a pod/workload.

PX Impact

Insights rules can potentially be created off of these exported metrics. This would allow CEE to identify which clusters are using SharedSecrets or SharedConfigMaps, especially if we are exporting mount failure metrics.

Notes

To implement, a prometheus query/recording rule needs to be added to the cluster monitoring operator. Once approved by the monitoring team, the metric data will be available on DataHub once 4.10 clusters are installed with the updated version of the monitoring operator.

https://github.com/openshift/cluster-monitoring-operator/pull/1477

Feature OCPSTRAT-526: Cloud Controller Managers: Final Testing and GA tasks - Phase 1

View the Description

Feature Overview (aka. Goal Summary)

Upstream Kubernetes is following other SIGs by moving its in-tree cloud providers to an out-of-tree plugin format, Cloud Controller Manager, at some point in a future Kubernetes release. OpenShift needs to be ready to act on this change.

Goals (aka. expected user outcomes)

Bring together all the cloud controller managers (AWS, GCP, Azure), complete testing and prepare for final GA

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic OCPCLOUD-1224: Prepare CCCMO for General Availability

View the Description

Epic Goal

Prepare the Cluster Cloud Controller Manager Operator (CCCMO) component, introduced in 4.9, for GA.

Why is this important?

We must ensure that the component is stable before we can declare the product GA

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story OCPCLOUD-1189: CCCMO: isolate provider specific logic within operator

View the Description View the linked PRs

Initial work was started here: https://github.com/lobziik/cluster-cloud-controller-manager-operator/pull/1/files

We need to isolate provider-specific code in the respective packages and introduce an interface to leverage it (regular and bootstrap manifest rendering should be there at the moment).

DoD:

Introduce templating logic to replace existing substitution mixture

Isolate templating logic so that this is transparent to the core of the CCCMO
Improve testing of the substitution

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/110

Feature RHDP-320: Integrated CI/CD experience with OpenShift platform

View the Description

Goal

Increase integration of Shipwright, Tekton, Argo CD in OpenShift GitOps with OpenShift platform and related products such as ACM.

Epic ODC-4981: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/11641

Epic AUTH-133: Pod Security Admission integration in OpenShift

View the Description View the linked PRs

Summary (PM+lead)

https://issues.redhat.com/browse/AUTH-2 revealed that, in principle, it is possible to integrate Pod Security Admission into OpenShift while retaining SCC functionality.

This epic is about the concrete steps to enable Pod Security Admission by default in OpenShift

Motivation (PM+lead)

Goals (lead)

Enable Pod Security Admission at the "restricted" policy level by default
Migrate existing core workloads to comply with the "restricted" pod security policy level

Non-Goals (lead)

Other OpenShift workloads must be migrated by the individual responsible teams.

Deliverables

Proposal (lead)

Enhancement - https://github.com/openshift/enhancements/pull/1010

User Stories (PM)

Dependencies (internal and external, lead)

Previous Work (lead)

Open questions (lead)

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task AUTH-184: Pod Security compliance: ingress-operator

View the Description View the linked PRs

ingress-operator must comply with pod security. The current audit warning is:

{ "objectRef": "openshift-ingress-operator/deployments/ingress-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }
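
The four violations named in the audit warning can be reproduced with a small check over a pod spec. This is a simplified sketch covering only those four rules, not the full "restricted" profile and not the upstream Pod Security Admission implementation; restricted_violations is a hypothetical helper and the pod spec is assumed to be a plain dictionary as loaded from YAML.

```python
def restricted_violations(pod_spec: dict) -> list:
    """Return the subset of 'restricted' violations seen in the audit
    warning, one entry per container and rule."""
    violations = []
    pod_ctx = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}

        # Must be set to false explicitly on every container.
        if ctx.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation != false")

        # Every container must drop ALL capabilities.
        dropped = (ctx.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            violations.append(f"{name}: unrestricted capabilities")

        # May be set on the container or inherited from the pod.
        if ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot")) is not True:
            violations.append(f"{name}: runAsNonRoot != true")

        # May be set on the container or inherited from the pod.
        seccomp = ctx.get("seccompProfile") or pod_ctx.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            violations.append(f"{name}: seccompProfile")
    return violations
```

An empty result for the operator's deployment template means the four settings listed in the warning are in place for every container.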

https://github.com/openshift/cluster-ingress-operator/pull/749

Task AUTH-182: Pod Security compliance: dns-operator

View the Description View the linked PRs

dns-operator must comply with the restricted pod security level. The current audit warning is:

{ "objectRef": "openshift-dns-operator/deployments/dns-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }

https://github.com/openshift/cluster-dns-operator/pull/319

Epic CONSOLE-2065: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story CONSOLE-2280: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/10551

Epic CONSOLE-2966: 4.10 Console Dependencies & Tech Debt

View the Description

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

Story CONSOLE-2979: Upgrade Cypress to 8.5.0

View the Description View the linked PRs

Update console from Cypress 6.0.0 to 8.5.0. Changes that impact us:

cypress run is headless by default
cy.intercept URL matching is more strict
Uncaught exception and unhandled promise rejection checks are more strict

https://docs.cypress.io/guides/references/migration-guide#Migrating-to-Cypress-8-0

https://github.com/openshift/console/pull/10164

Task CONSOLE-2964: Dynamic Plugin needs to be externally consumable - Update ts-node

View the Description View the linked PRs

As an adopter of the @openshift-console/dynamic-plugin-sdk I want to easily integrate into my development pipeline so that I can extend the OCP console.

Trying to pull in the dynamic-plugin-sdk into ACM is proving to be problematic. We would have to move to older dependencies. Integrating with webpack and typescript requires a very specific setup.

The dynamic-plugin-sdk has only really been used internally by OCP and is strongly tied to the setup and dependencies of OCP. For the dynamic-plugin-sdk to be externally consumable by adopters, it should be as easy to use as other webpack plugins such as HtmlWebpackPlugin or CompressionPlugin.

Acceptance Criteria

Uses up-to-date dependencies - not tied to the specific versions the OCP console uses
Includes its own dependencies - does not require adopters to include those dependencies
The dynamic demo plugin should be updated to use newer dependencies and use the plugin without a bunch of tweaks to tsconfig paths.

Currently

requires old dependencies
- ts-node 5.0.1 → 10.2.1

https://github.com/openshift/console/pull/10014

Story CONSOLE-2985: Replace all instances of old variables controlling global grid widths and breakpoints with Patternfly variables for more consistency of spacing between elements and behaviors

View the Description View the linked PRs

The console has many instances of old variables, $grid-float-breakpoint and $grid-gutter-width, controlling margins/padding and responsive breakpoints throughout the Admin and Dev Console. These do not provide spacing and behaviors consistent with Patternfly components, which use their own variables: $pf-global-gutter-md, $pf-global-gutter, and $pf-global-breakpoint-{size}. By replacing these, the intent is to bring the console closer to a pure Patternfly structure and behavior, requiring fewer overrides and customizations.

https://github.com/openshift/console/pull/10332

Story CONSOLE-2972: Upgrade webpack 4.x dependencies

View the Description View the linked PRs

Update webpack to the latest 4.x and update webpack loaders. This will help prepare us to move to webpack 5.

https://webpack.js.org/migrate/5/

https://github.com/openshift/console/pull/10080

Epic CONSOLE-2981: Ensure compatibility of OpenShift console for HyperShift Provisioned Clusters

View the Description

Epic Goal

HyperShift provisions OpenShift clusters with externally managed control-planes. It follows a slightly different process for provisioning clusters. For example, HyperShift uses cluster API as a backend and moves all the machine management bits to the management cluster.

Why is this important?

Showing machine management/cluster auto-scaling tabs in the console is likely to confuse users and cause unnecessary side effects.

Definition of Done

MachineConfig and MachineConfigPool should not be present; they should be either removed or hidden when the cluster is spawned using HyperShift.
Cluster Settings should say the control plane is externally managed and be read-only.
Cluster Settings -> Configuration resources should be read-only; maybe hide the tab
Some resources should go in an allowlist. Most will be hidden
Review getting started steps

See Design Doc: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

Setup / Testing

The driving factor here is the SERVER_FLAG controlPlaneTopology being set to External; this can be done in one of two ways:

Locally via a Bridge Variable, export BRIDGE_CONTROL_PLANE_TOPOLOGY_MODE="External"
Locally / OnCluster via modifying the window.SERVER_FLAGS.controlPlaneTopology to External in the dev tools

To test work related to the cluster upgrade process, use a 4.10.3 cluster set on the candidate-4.10 upgrade channel using 4.11 frontend code.

Story CONSOLE-3163: kubeadmin notifier changes

View the Description View the linked PRs

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. The console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend the kubeadmin notifier from the global notifications, since it contains a link for updating the cluster OAuth configuration (see attachment).

https://github.com/openshift/console/pull/11578

Story CONSOLE-3072: Cluster overview page changes

View the Description View the linked PRs

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. The console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to remove the ability to “Add identity providers” under “Set up your Cluster”. In addition to the getting started card, we should remove the ability to update a cluster on the details card when applicable (anything that changes a cluster version should be read only).

Summary of changes to the overview page:

Remove the ability to “Add identity providers” under “Set up your Cluster”
Remove cluster update CTA from the details card
Remove update alerts from the status card

Check section 03 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

https://github.com/openshift/console/pull/11366

Story CONSOLE-3074: Notification changes

View the Description View the linked PRs

cluster upgrade notifications
new channel available notifications

For these we will need to check `ControlPlaneTopology` to see whether it is set to 'External', and also check whether the user can edit the cluster version (either by creating a hook or an RBAC call, e.g. `canEditClusterVersion`).
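
One possible reading of the two checks above, written as a language-neutral sketch rather than console code: the upgrade and new-channel notifications are shown only when the control plane is not externally managed and the user is allowed to edit the cluster version. The function name and both parameters are hypothetical; in the console the values would come from the controlPlaneTopology server flag and from a hook or RBAC call such as `canEditClusterVersion`.

```python
def show_update_notifications(control_plane_topology: str,
                              can_edit_cluster_version: bool) -> bool:
    """Decide whether cluster upgrade / new channel notifications are shown."""
    if control_plane_topology == "External":
        # Externally managed control plane: updates are not driven from here.
        return False
    # Otherwise only users who can edit the ClusterVersion see the notifications.
    return can_edit_cluster_version
```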

Check section 05 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

https://github.com/openshift/console/pull/11375

Story CONSOLE-3071: Cluster Settings page changes

View the Description View the linked PRs

Remove update button
Make channel read only
Link out to read only CV details page
Remove the ability to edit upstream configuration
Remove the cluster autoscaler field
Add an alert to the page so that users know the control plane is externally managed

In general, anything that changes a cluster version should be read only.

Check section 02 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

https://github.com/openshift/console/pull/11363

Story CONSOLE-3141: Disable IBMCloud provider

View the Description View the linked PRs

Based on Cesar's comment, we should remove the `Control Plane` section if infrastructure.status.controlplanetopology is "External".

https://github.com/openshift/console/pull/11398

Epic CONSOLE-3048: Dark mode for OCP admin console

View the Description

Epic Goal

Ensure that OCP Admin console can enable dark mode
Mechanism for switching between light/dark mode, based on user preferences or User Settings in OCP
Identify changes needed for OCP console's dynamic plugins
Status as we work through the stories: https://docs.google.com/document/d/18ib0A9ZtSYo1e2aOtM2LtZcsNFs1aRCHXx6NAPsIAso/edit#

Why is this important?

So the UX satisfies current trends, where dark mode is becoming a standard for modern services.

Acceptance Criteria

OCP admin console must be rendered in a preferred mode based on `prefers-color-scheme` media query
OCP admin console must be rendered in a preferred mode selected in the User Setting page
Create a follow-up epic/story for listing and tracking changes needed in OCP console's dynamic plugins

Dependencies (internal and external)

PatternFly - Dark mode PF variables

Previous Work (Optional):

Mike Coker has worked on a POC from the PF point of view on both the admin and dev console, and the screenshot results are listed below along with the repo branch. Also listed is a document covering some of the common issues found when putting together the admin console POC. https://github.com/mcoker/console/tree/dark-theme
Background POC work completed for reference:

PatternFly Dark Theme Handbook: https://docs.google.com/document/d/1mRYEfUoOjTsSt7hiqjbeplqhfo3_rVDO0QqMj2p67pw/edit

Admin Console -> Workloads & Pods

Screenshots: https://docs.google.com/presentation/d/1BoOpXpX96_uNUhVJiEqGUCZ-JMYh1A78fXJhk3H86K8/edit#slide=id.p
Github branch: https://github.com/mcoker/console/tree/dark-theme
Github PR for Dark Theme WIP by Vikram: https://github.com/openshift/console/pull/10947

Dev Console -> Gotcha pages: Observe Dashboard and Metrics, Add, Pipelines: builder, list, log, and run

Screenshots: https://docs.google.com/presentation/d/13jagAf_JEu82hd_KpEHRmj25xCEqwImWpNufzVopuUs/edit#slide=id.p
Github branch: https://github.com/openshift/console/pull/10815
Evaluation: https://docs.google.com/spreadsheets/d/1fBCPPb4Z1sqUDKgQkpueGe8VQZo4pBUA9PV4vnFCnDw/edit#gid=0

Open questions:

Who should be responsible for updating DynamicPlugins to be able to render in dark mode?

Story CONSOLE-3090: Second pass integration of PatternFly's dark mode

View the Description View the linked PRs

As a developer, I want to be able to fix remaining issues from the spreadsheet of issues generated after the initial pass and spike of adding dark theme to the console. As such, I need to either complete all remaining issues from the spreadsheet, or create a bug or future story for any remaining issues in these two documents.

Acceptance criteria:

burn down the list of issues in the slide deck: https://docs.google.com/presentation/d/1ZeJUVKPput7g6w9kHl3NmGbjggdJ80WH1pmZPe_hWHQ/edit#slide=id.g5c9f6cd93e_0_286
burn down the list of issues in the spreadsheet: https://docs.google.com/spreadsheets/d/1fBCPPb4Z1sqUDKgQkpueGe8VQZo4pBUA9PV4vnFCnDw/edit#gid=0
Create bugs / stories for anything not addressed in the above two documents

Story CONSOLE-3081: Initial integration of PatternFly's dark mode in admin console

View the Description View the linked PRs

As a developer, I want to be able to scope the changes needed to enable dark mode for the admin console. As such, I need to investigate how much of the console will display dark mode using PF variables and also define a list of gotcha pages/components which will need special casing above and beyond PF variable settings.

Acceptance criteria:

integrate the PF Release Version with dark mode
add overarching style updates to fix borders and background color variables
re-evaluate Serena's slide deck: https://docs.google.com/presentation/d/1ZeJUVKPput7g6w9kHl3NmGbjggdJ80WH1pmZPe_hWHQ/edit#slide=id.g5c9f6cd93e_0_286
re-evaluate spreadsheet of issues: https://docs.google.com/spreadsheets/d/1fBCPPb4Z1sqUDKgQkpueGe8VQZo4pBUA9PV4vnFCnDw/edit#gid=0
Create styling to support the Events sections with dark mode enabled
Create styling to support all modals with dark mode enabled

Epic CONSOLE-3053: 4.11 Console Dependencies & Tech Debt

View the Description

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

Story CONSOLE-3124: Migrate cluster-dashboard.scenario.ts to Cypress

View the Description View the linked PRs

The Cluster Dashboard Details Card Protractor integration test was failing at a high rate and, despite multiple attempts to fix it, was never fully resolved, so it was disabled as a way to fix https://bugzilla.redhat.com/show_bug.cgi?id=2068594. Migrating this entire file to Cypress should give us better debugging capability, which is what was done to fix a similarly problematic project dashboard Protractor test.

https://github.com/openshift/console/pull/11332

Epic CONSOLE-3059: OCP 4.11 - Dynamic Plugins Epic

View the Description

This epic contains all the Dynamic Plugins related stories for OCP release-4.11

Epic Goal

Track all the stories under a single epic

Acceptance Criteria

Story CONSOLE-3162: Implement check for the new i18n annotation for dynamic plugins

View the Description View the linked PRs

In the 4.11 release, a console.openshift.io/default-i18next-namespace annotation is being introduced. The annotation indicates whether the ConsolePlugin contains localization resources. If the annotation is set to "true", the localization resources from the i18n namespace named after the dynamic plugin (e.g. plugin__kubevirt), are loaded. If the annotation is set to any other value or is missing on the ConsolePlugin resource, localization resources are not loaded.

In case these resources are not present in the dynamic plugin, the initial console load will be slowed down. For more info check BZ#2015654

AC:

console-operator should be checking for the new console.openshift.io/use-i18n annotation, update the console-config.yaml accordingly and redeploy the console server
console server should pick up the changes in the console-config.yaml and only load the i18n namespaces that are available
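
The annotation rule described above can be sketched in a few lines. This is only an illustration, not the console-operator implementation: the function name and the input shape (plugin name mapped to its annotations) are assumptions, and the annotation key is passed in by the caller because the story text mentions two different keys.

```python
from typing import Dict, List


def i18n_namespaces(plugins: Dict[str, Dict[str, str]], annotation_key: str) -> List[str]:
    """Return the i18n namespaces to load for the given ConsolePlugins.

    `plugins` maps a plugin name to its annotations (hypothetical input shape).
    """
    namespaces = []
    for name, annotations in sorted(plugins.items()):
        # Only the literal string "true" enables loading; any other value,
        # or a missing annotation, means no localization resources are loaded.
        if annotations.get(annotation_key) == "true":
            # The namespace is named after the plugin, e.g. plugin__kubevirt.
            namespaces.append(f"plugin__{name}")
    return namespaces
```

With this rule, a plugin that does not ship localization resources is never asked for them, which is what avoids the slower initial console load.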

Follow up of https://issues.redhat.com/browse/CONSOLE-3159

Story CONSOLE-2925: Add initial integration tests for dynamic plugins

We need to provide a base for running integration tests using the dynamic plugins. The tests should initially:

Create a deployment and service to run the dynamic demo plugin
Update the console operator config to enable the plugin
Wait for the plugin to be available
Test at least one extension point used by the plugin (such as adding items to the nav)
Disable the plugin when done

Once the basic framework is in place, we can update the demo plugin and add new integration tests when we add new extension points.

https://github.com/openshift/console/tree/master/frontend/dynamic-demo-plugin

https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md

https://github.com/openshift/console/tree/master/frontend/packages/console-plugin-sdk

https://github.com/openshift/console/pull/10644

Story CONSOLE-3153: Expose APIs for displaying date/time

We have a Timestamp component for consistent display of dates and times that we should expose through the SDK. We might also consider a hook that formats dates and times for places where you don't want to or can't use the component, e.g. times on a chart.

This will become important when we add a user preference for dates so that plugins show consistent dates and times as console. If I set my user preference to UTC dates, console should show UTC dates everywhere.

AC:

Expose the Timestamp component inside the SDK.
Replace the connect with useSelector hook
Keep the original component and proxy it to the new one in the SDK

cc Jakub Hadvig Sho Weimer

https://github.com/openshift/console/pull/11693

Story CONSOLE-3062: Improve discoverability of the console plugins page

Currently, you need to navigate to

Cluster Settings ->
Global configuration ->
Console (operator) config ->
Console plugins

to see and manage plugins. This takes a lot of clicks and is not discoverable. We should look at surfacing plugin details where they're easier to find, perhaps on the Cluster Settings page, or at least provide a more convenient link somewhere in the UI.

AC: Add the Dynamic Plugins section to the Status Card in the overview that will contain:

count of active and non-active plugins
link to the ConsolePlugins instances page
status of the loaded plugins and breakout error

cc Ali Mobrem Robb Hamilton

https://github.com/openshift/console/pull/11664

Story CONSOLE-3061: Reporting errors for plugins that don't load in the notification drawer

Currently, enabled plugins can fail to load for a variety of reasons. For instance, plugins don't load if the plugin name in the manifest doesn't match the ConsolePlugin name or the plugin has an invalid codeRef. There is no indication in the UI that something has gone wrong. We should explore ways to report this problem in the UI to cluster admins. Depending on the nature of the issue, an admin might be able to resolve the issue or at least report a bug against the plugin.

The message about failing could appear in the notification drawer and/or console plugins tab on the operator config. We could also explore creating an alert if a plugin is failing.

AC:

Add a notification to the Notification Drawer when a Dynamic Plugin errors out during load.
Render these errors in the status card, notification section, as well.
For each failed plugin we should create a separate notification.
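
A minimal sketch of the "one notification per failed plugin" criterion, assuming a hypothetical PluginLoadResult record. The field names and message wording are illustrative only and are not taken from the console code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PluginLoadResult:
    console_plugin_name: str          # name of the ConsolePlugin resource
    manifest_name: str                # name declared in the plugin manifest
    load_error: Optional[str] = None  # any other failure, e.g. an invalid codeRef


def failure_notifications(results: List[PluginLoadResult]) -> List[str]:
    """Build one notification message per plugin that failed to load."""
    notifications = []
    for result in results:
        if result.manifest_name != result.console_plugin_name:
            # A name mismatch is one of the failure causes named in the story.
            notifications.append(
                f"Plugin {result.console_plugin_name} failed to load: "
                f"manifest name {result.manifest_name!r} does not match"
            )
        elif result.load_error is not None:
            notifications.append(
                f"Plugin {result.console_plugin_name} failed to load: {result.load_error}"
            )
    return notifications
```

The same list could feed both the notification drawer and the status card, so the two views never disagree about which plugins failed.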

https://github.com/openshift/console/pull/11732

Epic CONSOLE-3094: Support Conditional Updates (a.k.a. Targeted Edge Blocking)

Goal

Add the ability for users to select supported but not recommended updates.
Refine workflow when both "upgradeable=false" and "supported-but-not-recommended" updates occur

Background
RFE: for 4.10, Cincinnati and the cluster-version operator are adding conditional updates (a.k.a. targeted edge blocking): https://issues.redhat.com/browse/OTA-267

High-level plans in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#update-client-support-for-the-enhanced-schema

Example of what the oc adm upgrade UX will be in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#cluster-administrator.

The oc implementation landed via https://github.com/openshift/oc/pull/961.

Design

Use case 01: "supported but not recommended" occurs to the latest version:
- Add an info icon next to the version on update path with a pop-over to explain about why updating to this version is supported, but not recommended and a link to known risks
- Identify the difference in "recommended" versions, "supported but not recommended" versions, and "blocked" versions (upgradeable=false) in the + more modal.
- The latest version is pre-selected in the dropdown in the update modal with an inline alert to inform users about the supported-but-not-recommended version, with a link to known risks. Users can choose to update to another recommended version, update to a supported-but-not-recommended one, or wait.
- The "recommended" and "supported but not recommended" updates are separated in the dropdown.
- If a user selects a "recommended" update, the inline alert disappears.
Use case 02: When both "upgradeable=false" and "supported but not recommended" occur:
- Add an alert banner to explain why users shouldn’t update to the latest version and link to how to resolve on the cluster settings details page. Users have the options to resolve the issue, update to a patch version, or wait.
- If users open the update modal without resolving the "upgradeable=false" issue, the next recommended version is pre-selected. An expandable link "View blocked versions (#)" is included under the dropdown to show "upgradeable=false" versions with resolve link.
- If users resolve the "upgradeable=false" issue, the cluster settings page will change to use case 01

- Question: Priority on changing the upgradeable=false alert banner in update modal and blocked versions in dropdown

See design doc: https://docs.google.com/document/d/1Nja4whdsI5dKmQNS_rXyN8IGtRXDJ8gXuU_eSxBLMIY/edit#

See marvel: https://marvelapp.com/prototype/h3ehaa4/screen/86077932

Story CONSOLE-3138: Support Conditional Updates - Cluster update "Update Version" modal

The "Update Version" modal on the cluster settings page should be updated to give users information about recommended, not recommended, and blocked update versions.

When the modal is opened, the latest recommended update version should be pre-selected in the version dropdown.
Blocked versions should no longer be displayed in the version dropdown, and should instead be displayed in a collapsible field below the dropdown.
When blocked versions are present, a link should be provided to the cluster operator tab. The version dropdown itself should have two labeled sections: "Recommended" and "Supported but not recommended".
When the user selects a "Supported but not recommended" item from the version dropdown, an inline info alert should appear below the version selection field and should provide a link to known risks associated with the selected version. This is an external link provided through the ClusterVersion API.
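
The modal behavior described above can be sketched as plain data handling. This is an illustration under assumptions, not the console implementation: the function names, the dictionary layout, and the idea that "not recommended" versions arrive as a version-to-risk-URL mapping are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn "4.10.12" into (4, 10, 12) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def build_update_modal(
    recommended: List[str],
    not_recommended: Dict[str, str],  # version -> known-risks URL (assumed shape)
    blocked: List[str],
) -> dict:
    """Group versions the way the modal presents them."""
    def newest_first(versions):
        return sorted(versions, key=version_key, reverse=True)

    recommended_sorted = newest_first(recommended)
    return {
        # Two labeled dropdown sections; blocked versions are kept out of them.
        "dropdown": {
            "Recommended": recommended_sorted,
            "Supported but not recommended": newest_first(not_recommended),
        },
        # The latest recommended version is pre-selected, if there is one.
        "preselected": recommended_sorted[0] if recommended_sorted else None,
        # Blocked versions go into the collapsible field below the dropdown.
        "blocked": newest_first(blocked),
    }


def risk_link(selected: str, not_recommended: Dict[str, str]) -> Optional[str]:
    """Return the known-risks link if the inline alert should be shown."""
    return not_recommended.get(selected)
```

Sorting on a numeric tuple rather than on the raw string matters here: as strings, "4.10.9" would sort after "4.10.12".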

https://github.com/openshift/console/pull/11424

Story CONSOLE-3136: Support Conditional Updates - Cluster settings page

Update the cluster settings page to inform the user when the latest available update is supported but not recommended. Add an informational popover to the latest version in update path visualization.

https://github.com/openshift/console/pull/11445

Epic IR-167: Feature usage telemetry

Epic Goal

Add telemetry so that we know how image stream features are used.

Why is this important?

We have a long standing epic to create image streams v2. We need to better understand how image streams are used today.

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story IR-120: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-image-registry-operator/pull/768

Epic IR-208: Continuous Improvement of Maintainability 4.10

Epic Goal

Improve CI testing of the image registry components.

Why is this important?

The image registry, image API and the image pruner had a lot of tests removed during the transition to 4.0. This may make the platform less stable and/or slow down the team.

Scenarios

Acceptance Criteria

CI - tests should be more stable and have broader coverage

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.

Story IR-104: Use library-go in image-registry

In the image-registry, we have packages origin-common and kubernetes-common. The problem is that this code doesn't get updates. We can replace them with more supported library-go.

https://github.com/openshift/image-registry/pull/295

Epic IR-210: Update k8s to 1.23

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

Why is this important?

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story IR-211: Bump k8s to 1.23 in image-registry repo

As an OpenShift engineer
I want image-registry to use the latest k8s libraries
so that image-registry can benefit from new upstream features.

Acceptance criteria

image-registry uses k8s.io/api v1.23.z
image-registry uses latest openshift/api, openshift/library-go, openshift/client-go

https://github.com/openshift/image-registry/pull/302

Epic IR-87: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story IR-234: Custom certificate authorities for S3

As an OpenShift administrator
I want to provide the registry operator with a custom certificate authority for S3 storage
so that I can use a third-party S3 storage provider.

Acceptance criteria

Users can specify a configmap name (from openshift-config) in config.imageregistry/cluster's spec.storage.s3.
The operator uses CA from this configmap to check S3 bucket.
The image registry pod uses CA from this configmap to access the S3 bucket.
When a custom CA is defined, the operator/image-registry should still trust certificate authorities that are used by Amazon S3 and other well-known CAs.
An end-to-end test that runs minio and checks the image registry becomes healthy with it.
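
The requirement that a custom CA is added to, rather than replaces, the well-known CAs can be illustrated with Python's standard ssl module. This is a sketch of the trust-store idea only; the operator and registry are not written in Python, and the function name is hypothetical.

```python
import ssl
from typing import Optional


def build_tls_context(custom_ca_pem: Optional[str] = None) -> ssl.SSLContext:
    """Create a TLS context that trusts the system CAs plus an optional custom CA."""
    # create_default_context() loads the system's default trusted CAs and
    # enables certificate and hostname verification.
    context = ssl.create_default_context()
    if custom_ca_pem:
        # Adding the custom CA extends the trust store; the default CAs
        # loaded above remain trusted.
        context.load_verify_locations(cadata=custom_ca_pem)
    return context
```

In this sketch the PEM text would come from the configmap referenced in spec.storage.s3; how it is read is left out.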

https://github.com/openshift/cluster-image-registry-operator/pull/759

Epic JKNS-260: Remove Jenkins from OCP Payload

Goal

Remove Jenkins from the OCP Payload.

Problem

Jenkins images are non-trivial in size and impact the experience around OCP payloads
Security advisories cannot be handled once, but against all actively supported OCP releases, adding to response time for handling said advisories
Some customers may not want to upgrade Jenkins as OCP upgrades (making this configurable is more ideal)

Why is this important

This is an engineering motivated item to reduce costs so we have more cycles for strategic work
Aside from the team itself, top level OCP architects want this to reduce the image size, improve general OCP upgrade experience
Sends a mixed message with respect to what is strategic CI/CD when Jenkins is baked into OCP, but Tekton/Pipelines is an add-on, day 2 install sort of thing

Dependencies (internal and external)

See epic linking - need alternative non payload image available to provide relatively seamless migration

Also, the EP for this is approved and merged at https://github.com/openshift/enhancements/blob/master/enhancements/builds/remove-jenkins-payload.md

Estimate (xs, s, m, l, xl, xxl):

Questions:

in addition to needing the CPaaS Image available first, have we confirmed the "deprecate first, then remove" requirements? Per grooming we've assigned Rob Gormley the task https://issues.redhat.com/browse/JKNS-274 for tracking this down. I will be reaching out to Ben Parees (he is currently on vacation) to confirm from an OCP staff engineer / architect perspective if his approval of https://github.com/openshift/enhancements/pull/841 is sufficient signoff from that end.

PARTIAL ANSWER ^^: confirmed with Ben Parees in https://coreos.slack.com/archives/C014MHHKUSF/p1646683621293839 that EP merging is currently sufficient OCP "technical leadership" approval.

Previous work

Customers

assuming none

Story JKNS-267: Run Jenkins CI Tests without Payload Image

User Stories

As maintainers of the OpenShift Jenkins component, we need to run Jenkins CI for PR testing against openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin, openshift/jenkins-openshift-login-plugin, using images built in the CI pipeline but not injected into CI test clusters via sample operator overriding the jenkins sample imagestream with the jenkins payload image.

As maintainers of the OpenShift Jenkins component, we need Jenkins periodics for the client and sync plugins to run against the latest non payload, CPaas image, promoted to CI's image locations on quay.io, for the current release in development.

As maintainers of the OpenShift Jenkins component, we need Jenkins related tests outside of very basic Jenkins Pipeline Strategy Build Config verification, removed from openshift-tests in OpenShift Origin, using a non-payload, CPaas image pertinent to the branch in question.

Acceptance criteria

all PR CI Tests do not utilize samples operator manipulation of the jenkins imagestream with the in payload image, but rather images including the PRs changes
all periodic CI Tests do not utilize samples operator manipulation of the jenkins imagestream with the in payload image, but rather CI promoted images for the current release pushed to quay.io

High Level, we ideally want to vet the new CPaas image via CI and periodics BEFORE we start changing the samples operator so that it does not manipulate the jenkins imagestream (our tests will override the samples operator override)

QE Impact

NONE ... QE should wait until JKNS-254

Docs Impact

NONE

PX Impact

NONE

Launch Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

Notes

Our CSI shared resource experience will help us here
but the old IMAGE_FORMAT stuff is deprecated, and does not work well with step registry stuff
instead, we need to use https://docs.ci.openshift.org/docs/architecture/ci-operator/#dependency-overrides
Makefile level logic will use `oc tag` to update the jenkins imagestream created as part of samples to override the use of the in payload image with the image built by the PR, or for periodics, with what has been promoted to quay.io
Ultimately, a CI step registry step that captures the `oc tag` imagestream update logic is probably the end goal
JKNS-268 might change how we do periodics, but the current thought is to get existing periodics working with the CPaas image first

Possible staging

1) before CPaas is available, we can validate images generated by PRs to openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin by taking the image built by the PR (where the info needed to get the right image from the CI registry is in the IMAGE_FORMAT env var) and then doing an `oc tag --source=docker <PR image ref> openshift/jenkins:2` to replace the use of the payload image in the jenkins imagestream in the openshift namespace with the PR's image

2) insert 1) in https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/sync-plugin/e2e/jenkins-sync-plugin-e2e-commands.sh and https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/client-plugin/tests/jenkins-client-plugin-tests-commands.sh where you test for IMAGE_FORMAT being set

3) or instead of 2) you update the Makefiles for the plugins to call a script that does the same sort of thing, see what is in IMAGE_FORMAT, and if it has something, do the `oc tag`
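
As a rough sketch of the staging options above, the following builds the `oc tag` invocation only when IMAGE_FORMAT is set. The function name is hypothetical, and how the PR image reference is derived from IMAGE_FORMAT is not spelled out in these notes, so the reference is simply passed in by the caller.

```python
from typing import List, Mapping, Optional


def jenkins_tag_command(env: Mapping[str, str], pr_image_ref: str) -> Optional[List[str]]:
    """Return the `oc tag` command that points the jenkins imagestream at a PR image.

    Returns None when IMAGE_FORMAT is unset or empty, in which case the
    payload image stays in place.
    """
    if not env.get("IMAGE_FORMAT"):
        return None
    # Mirrors the command quoted in the notes:
    #   oc tag --source=docker <PR image ref> openshift/jenkins:2
    return ["oc", "tag", "--source=docker", pr_image_ref, "openshift/jenkins:2"]
```

A caller could pass os.environ as env and hand the returned list to subprocess.run when it is not None.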

https://github.com/openshift/release/pull/26979 is a prototype of how to stick the image built from a PR and conceivably the periodics to get the image built from it and tag it into the jenkins imagestream in the openshift namespace in the test cluster

https://github.com/openshift/origin/pull/26914

Epic MON-1961: Remove Prometheus UI

Epic Goal

Remove this UI from our stack that we cannot support.

Why is this important?

Reduce support burden.
Remove Bugzilla burden of addressing continuous CVEs found in this project.

Acceptance Criteria

All Prometheus upstream UI links are removed
Related documentation is updated
Ports/routes etc configured to expose access to this UI are removed such that no configuration we provide enables access to this UI or its codepaths.
There is no reason any CVEs found in this UI would ever require intervention by the Monitoring Team.

Dependencies (internal and external)

Make the Prometheus Targets information available in Console UI (https://issues.redhat.com/browse/MON-1079)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story MON-1631: Remove UI access from the Prometheus route

After installing or upgrading to the latest OCP version, the existing OpenShift route to the prometheus-k8s service is updated to be a path-based route to '/api/v1'.

DoD:

It is not possible to access the Prometheus UI via the OpenShift route
Using a bearer token with sufficient permissions, it is possible to access the /api/v1/* endpoints via the OpenShift route.
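
A small sketch of what API access through the path-based route could look like from a client. The host name and token are placeholders, and the helper name is hypothetical; /api/v1/query is the standard Prometheus instant-query endpoint. The function only builds the request and does not send it.

```python
import urllib.parse
import urllib.request


def prometheus_api_request(route_host: str, token: str, promql: str) -> urllib.request.Request:
    """Build an authenticated request against the /api/v1/query endpoint."""
    query = urllib.parse.urlencode({"query": promql})
    # Only paths under /api/v1 are served by the route; the UI is not reachable.
    url = f"https://{route_host}/api/v1/query?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
```

Passing the result to urllib.request.urlopen would perform the call, given a token with sufficient permissions.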

https://github.com/openshift/cluster-monitoring-operator/pull/1532

Epic MON-1988: Enable audit and query logging for all prometheus read paths

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

As a CFE team, we would like to enable query logging for all Prometheus read paths
As part of this, we would like to enable audit & query logging for Prometheus Adapter(aggregated server audit log), Prometheus(query log) and ThanosQuerier(query log)

Why is this important?

This would help all parties(customers, app-sres, CCX, monitoring team,..) to debug an overloaded Prometheus instance.

Scenarios

When a customer faces high CPU consumption in any of the Prometheus instances, they can enable audit logging in Prometheus Adapter to see which component is calling the metrics API
When a customer faces high CPU consumption in any of the Prometheus instances, they can enable query logging in all Prometheus instances (PM & UWM) and ThanosQuerier to see which query is frequently executed
https://bugzilla.redhat.com/show_bug.cgi?id=1982302

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
Prometheus Adapter audit logs must be enabled by default
Prometheus Adapter audit logs must be preserved after each CI run

Open questions::

Should we enable ThanosRuler query logs?

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task MON-1786: Allow OpenShift users to configure audit logs for prometheus-adapter

After investigating a complex Bugzilla involving many applications making queries to prometheus-adapter, we've noticed that we were lacking insights on the requests made to prometheus-adapter. To have such information for an aggregated API, the best would be to have audit logs for prometheus-adapter. This wasn't configurable before, but with https://github.com/kubernetes-sigs/custom-metrics-apiserver/pull/92, upstream users should now be able to configure it.

Since this would greatly help in investigating prometheus-adapter Bugzilla in the future, it would be great if we allowed OpenShift users to configure the audit logs so that they could provide them to us.

Note for the assignee: as of the time of the creation of this ticket, the upstream PR hasn't been merged in custom-metrics-apiserver and thus wasn't synced in prometheus-adapter. So we will have to wait a bit before starting looking into this ticket.

DoD:

Allow OpenShift users to configure audit logs for prometheus-adapter
Integrate with must-gather
Document how to configure audit logs in the official OpenShift documentation
Upstream jsonnet patch that enables this feature through a configuration

https://github.com/openshift/must-gather/pull/266

Epic MON-2195: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Task MON-1708: Enforce label scrape limits for UWM

Following up on https://issues.redhat.com/browse/MON-1320, we added three new CLI flags to Prometheus to apply different limits on the samples' labels. These new flags are available starting from Prometheus v2.27.0, which will most likely be shipped in OpenShift 4.9.

The limits that we want to look into for OCP are the following ones:

# Per-scrape limit on number of labels that will be accepted for a sample. If
# more than this number of labels are present post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels name that will be accepted for a sample.
# If a label name is longer than this number post metric-relabeling, the entire
# scrape will be treated as failed. 0 means no limit.
[ label_name_length_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels value that will be accepted for a sample.
# If a label value is longer than this number post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_value_length_limit: <int> | default = 0 ]

We could benefit from them by setting relatively high values that only targets inducing unbounded cardinality would reach, and thus reject those targets completely if they happen to breach our constraints.
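
The semantics of the three limits quoted above can be sketched as a single check over the samples of one scrape. This is an illustration of the documented behavior (0 means no limit, any violation fails the entire scrape), not Prometheus code; the function name and the samples-as-dictionaries input are assumptions.

```python
from typing import Dict, List


def scrape_within_limits(
    samples: List[Dict[str, str]],
    label_limit: int = 0,
    label_name_length_limit: int = 0,
    label_value_length_limit: int = 0,
) -> bool:
    """Return False if any sample breaches a limit, failing the whole scrape."""
    for labels in samples:
        # A limit of 0 disables the corresponding check.
        if label_limit and len(labels) > label_limit:
            return False
        for name, value in labels.items():
            if label_name_length_limit and len(name) > label_name_length_limit:
                return False
            if label_value_length_limit and len(value) > label_value_length_limit:
                return False
    return True
```

Because one offending sample rejects the target's whole scrape, generous limits act as a guard against runaway labels without affecting well-behaved targets.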

DoD:

Being able to configure label scrape limits for UWM

https://github.com/openshift/cluster-monitoring-operator/pull/1350

Epic MON-2235: Option to add cluster ID to off-cluster integrations

Epic Goal

When users configure CMO to interact with systems outside of an OpenShift cluster, we want to provide an easy way to add the cluster ID to the data sent.

Why is this important?

Technically this can be achieved today, by adding an identifying label to the remote_write configuration for a given cluster. The operator adding the remote_write integration needs to take care that the label is unique over the managed fleet of clusters. This however adds management complexity. Any given cluster already has a pseudo-unique datum, that can be used for this purpose.

Starting in 4.9 we support the Prometheus remote_write feature to send metric data to a storage integration outside of the cluster similar to our own Telemetry service.
In Telemetry we already use the cluster ID to distinguish the various clusters.
For users of remote_write this could add an easy way to add such distinguishing information.

Scenarios

An organisation with multiple OpenShift clusters wants to store its metric data centrally in a dedicated system and uses remote_write in all its clusters to send this data. When querying the centralized storage, metadata (here a label) is needed to separate the data of the various clusters.
Service providers who manage multiple clusters for multiple customers via a centralized storage system need distinguishing metadata too. See https://issues.redhat.com/browse/OSD-6573 for example

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
Document how to use this feature

Dependencies (internal and external)

none

Previous Work (Optional):

none

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Implementation proposal:

Expose a flag in the CMO configuration that is false by default (keeps backward compatibility) and, when set to true, will add the _id label to a remote_write configuration. More specifically, it will be added to the top of a remote_write relabel_config list via the replace action. This will add the label as expected, but additionally a user could alter this label in a later relabel config to suit any specific requirements (say rename the label or add additional information to the value).
The location of this flag is the remote_write Spec, so this can be set for individual remote_write configurations.

Task MON-2245: Add option to add cluster id to CMOs remote write config

~~Add an optional boolean flag to CMOs definition of RemoteWriteSpec that if true adds an entry in the specs WriteRelabelConfigs list.~~

I went with adding the relabel config to all user-supplied remote_write configurations. This path has no risk for backwards compatibility (unless users use the tmp_openshift_cluster_id label, which seems unlikely) and reduces overall complexity, as well as documentation complexity.

The entry should look like what is already added to the telemetry remote write config and it should be added as the first entry in the list, before any user supplied relabel configs.
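
A minimal sketch of "added as the first entry in the list" follows. The shape of the relabel entry here (target_label, replacement, action) is an assumption based on generic Prometheus relabeling; the actual entry CMO adds is the one already used for the telemetry remote write config, which is not reproduced in these notes. The function name and the label name passed by the caller are hypothetical.

```python
from typing import Any, Dict, List


def with_cluster_id_relabel(
    remote_write: List[Dict[str, Any]], cluster_id: str, label: str
) -> List[Dict[str, Any]]:
    """Return copies of the remote_write configs with a cluster ID relabel entry first."""
    result = []
    for config in remote_write:
        # Assumed entry shape; a fresh dict per config avoids shared state.
        entry = {"target_label": label, "replacement": cluster_id, "action": "replace"}
        updated = dict(config)
        # Prepend, so user-supplied relabel configs run afterwards and can
        # still rename or modify the label.
        updated["write_relabel_configs"] = [entry] + list(
            config.get("write_relabel_configs", [])
        )
        result.append(updated)
    return result
```

Putting the entry first is what lets a later user rule alter the label, as the implementation proposal describes.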

https://github.com/openshift/cluster-monitoring-operator/pull/1578

Task MON-2269: Use prometheus as remote_write target in e2e test

We currently use a sample app to e2e test remote write in CMO.
In order to test the addition of the cluster_id relabel config, we need to confirm that the metrics sent actually have the expected label.
For this test we should use Prometheus as the remote_write target. This allows us to query the metrics sent via remote write and confirm they have the expected label.

https://github.com/openshift/cluster-monitoring-operator/pull/1602

Epic MON-2384: Double scrape_interval for CMO controlled ServiceMonitors for SNO

Epic Goal

Offer the option to double the scrape intervals for CMO controlled ServiceMonitors in single node deployments
Alternatively automatically double the same scrape intervals if CMO detects an SNO setup

The potential target ServiceMonitors are:

kubelet
kube-state-metrics
node-exporter
etcd
openshift-state-metrics

Why is this important?

Reduce CPU usage in SNO setups
Specifically doubling the scrape interval is important because:

we are confident that this will have the least chance to interfere with existing rules. We typically have rate queries over the last 2 minutes (no shorter time window). With 30 second scrape intervals (the current default) this gives us 4 samples in any 2 minute window. rate needs at least 2 samples to work, we want another 2 for failure tolerance. Doubling the scrape interval will still give us 2 samples in most 2 minute windows. If a scrape fails, a few rule evaluations might fail intermittently.
We expect a measurable reduction of CPU resources (see previous work)
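
The sample-count reasoning above is simple arithmetic and can be checked directly. The helper names are hypothetical, and the count is the nominal number of samples per window; actual alignment of scrapes against the window can shift it by one.

```python
def samples_per_window(window_seconds: int, scrape_interval_seconds: int) -> int:
    """Nominal number of scrapes that land in a rate() window."""
    return window_seconds // scrape_interval_seconds


def tolerates_one_failed_scrape(window_seconds: int, scrape_interval_seconds: int) -> bool:
    """rate() needs at least 2 samples, so one lost scrape must still leave 2."""
    return samples_per_window(window_seconds, scrape_interval_seconds) - 1 >= 2
```

For a 2 minute window, samples_per_window(120, 30) gives 4 and samples_per_window(120, 60) gives 2, so the doubled interval still satisfies rate() but no longer tolerates a failed scrape.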

Scenarios

RAN deployments (Telco Edge) are SNO deployments. In these setups a full CMO deployment is often not needed and the default setup consumes too many resources. OpenShift as a whole has only very limited CPU cycles available and too many cycles are spent on Monitoring

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Previous Work (Optional):

https://issues.redhat.com/browse/MON-1569

Open questions:

Whether doubling some scrape intervals reduces CPU usage to fit into the assigned budget

Non goals

Allow arbitrarily long scrape intervals. This will interfere with alert and recording rules
Implement a global override to scrape intervals.

Task MON-2480: Implement code to double scrape intervals for CMO controlled service monitors

Based on ~~MON-2478~~ and ~~MON-2479~~, add the needed code to CMO's asset deployment.

https://github.com/openshift/cluster-monitoring-operator/pull/1652

Epic NETOBSERV-16: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story NETOBSERV-31: Expose CNI type features as a config-map

The console needs to know the network type capabilities to show/hide some Network Policy form fields.

As a result of https://issues.redhat.com/browse/NETOBSERV-27, this logic is implemented as a features document inside the console code. The console fetches the network type from the network operator and checks the supported features against this document.

However, this limits the feature to admin users, as other logged-in users do not have permissions to fetch the network type.

This task aims to modify the current Cluster Network Operator to expose the network capabilities as an `sdn-public` Config Map, writeable only by the SDN, readable by any `system:authenticated` user.

Enhancement Proposal PR: https://github.com/openshift/enhancements/pull/875

https://github.com/openshift/cluster-network-operator/pull/1204

Epic OCPCLOUD-1256: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story OCPCLOUD-1252: Set values for validation webhook for GuestAccelerators field in GCPProviderSpec

We want to configure 'default' and 'allowed' values in the validation webhook for the Guest Accelerators field in GCPProviderSpec. Also revendor it to include the newly added Guest Accelerators field.

This can be done after https://github.com/openshift/cluster-api-provider-gcp/pull/172 is merged.

DoD:

Make sure that validations return errors on issues with GPU configuration
Ensure the unit tests for the webhooks are updated

https://github.com/openshift/machine-api-operator/pull/927

Epic OCPRHV-594: [Refactor] Migrate OCP on RHV subprojects to go-ovirt-client, go-ovirt-client-log and k8sOVirtCredentialsMonitor

Description:

OpenShift on RHV is composed of the following subprojects, which the team maintains:

Each of those projects currently uses the generated oVirt API project go-ovirt.

This leads to a number of issues:

Duplicated code between the subprojects: Because go-ovirt is a thin layer around the API, much of the code that interacts with oVirt is duplicated between the projects. This leads to all the classic duplication problems, such as maintenance burden and a lack of clear conventions.
Bad error handling and unclear errors:
1. Because go-ovirt is a thin layer, a lot of error handling and checking needs to be done by the caller. Often it looks like a certain error can be ignored, so it is never checked, which can lead to unexpected situations.
2. The errors returned from the oVirt Engine are sometimes unclear, so when we return them to users or log them, it is hard to understand what the actual issue is.
Lack of retries: Sometimes an operation takes time because some condition needs to be met, or an operation fails due to infrastructure issues. The go-ovirt library doesn't contain any retry logic, which means each client needs to implement its own. This is not done at the moment and would cause more duplicated code.
Poor logging: The current go-ovirt library doesn't log anything, and all the logs come from the subprojects. This leads to:
1. Inconsistent logging between the projects.
2. Lack of logs.
Almost no test coverage:
1. It's very hard to mock and write tests with go-ovirt since there are so many calls; it will be much easier to mock and write tests with go-ovirt-client.
2. go-ovirt only has rudimentary tests.

Then came go-ovirt-client, go-ovirt-client-log, go-ovirt-client-log-klog and k8sOVirtCredentialsMonitor to the rescue!

The go-ovirt-client is a wrapper around the go-ovirt which contains all the error handling/retry logic/logs/tests needed to provide a decent user experience and an easy-to-use API to the oVirt engine.

go-ovirt-client-log is a library to unify the logging logic between the projects, it is used by go-ovirt-client and should be used by all the sub-projects.

go-ovirt-client-log-klog is a companion library to go-ovirt-client-log enabling logging via the Kubernetes "klog" facility.

k8sOVirtCredentialsMonitor is a utility for monitoring the oVirt credentials secret, which will automatically update the oVirt credentials if they are changed.

We aim to move all projects which are using the go-ovirt to use go-ovirt-client, go-ovirt-client-log and k8sOVirtCredentialsMonitor instead.

Benefits for the eng:

Possible to write unit tests.
Easier to maintain since less code duplication - reduce the amount of code.
Test coverage exists on the ovirt-client as well.
No (or fewer) bugs regarding operations that need retry or polling logic.
Solves a number of existing bugs

Benefits for the customers:

Clearer error messages and logs.
Fewer bugs.

Acceptance criteria:

Sub-projects no longer use go-ovirt directly - at least 90% of the calls to go-ovirt should be migrated to go-ovirt-client.
All sub-projects should use the corresponding go-ovirt-client-log for logging.
The csi-driver and the cluster provider use k8sOVirtCredentialsMonitor.
CI tests are green for all components.

How to test:

QE regression - make sure all flows are still working.
Green CI on all jobs.
Keep an eye out for log messages that might confuse customers.

Task OCPRHV-596: Migrate ovirt-csi-driver to go-ovirt-client

Description:

Identify all the communication between ovirt-csi-driver and the go-ovirt.
Port all the logic to go-ovirt-client.
Port all calls on ovirt-csi-driver to go-ovirt-client.

Acceptance:

ovirt-csi-driver uses go-ovirt-client for 95% of all oVirt related logic.

https://github.com/openshift/ovirt-csi-driver/pull/88

Epic ODC-4944: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story ODC-6660: Render topology differently based on zoom level

Description

As a user, I want the topology view to be less cluttered as I zoom out, showing only information that I can discern, and still be able to get a feel for the status of my project.

Acceptance Criteria

When zoomed to 50% scale, all labels & decorators will be hidden. Labels are shown when hovering over the node
When zoomed to 30% scale, all labels, decorators, pod rings & icons will be hidden. Node shape remains the same, and background is either white, yellow or red. Background color is determined based on aggregate status of pods, alerts, builds and pipelines. Tooltip is available showing node name as well as the "things" which are contributing to the warning/error status.
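
The zoom rules above can be sketched as a small decision function. This is only an illustration, not the console's TypeScript implementation: the element names are hypothetical, and it assumes the thresholds apply at or below the stated scale and that yellow means warning and red means error.

```python
# Full set of node details drawn at normal zoom (names are illustrative).
ALL_ELEMENTS = {"shape", "labels", "decorators", "pod_rings", "icons"}


def visible_elements(scale: float) -> set:
    """Return which parts of a topology node are rendered at a given zoom scale."""
    if scale <= 0.3:
        # Only the node shape with a status-colored background remains.
        return {"shape", "status_background"}
    if scale <= 0.5:
        # Labels and decorators are hidden (labels still appear on hover).
        return ALL_ELEMENTS - {"labels", "decorators"}
    return set(ALL_ELEMENTS)


def background_color(statuses: list) -> str:
    """Aggregate the statuses of pods, alerts, builds and pipelines into one color."""
    if "error" in statuses:
        return "red"
    if "warning" in statuses:
        return "yellow"
    return "white"
```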

Additional Details:

https://github.com/openshift/console/pull/11698

Story ODC-6694: Show Service Binding errors in topology graph and sidebar

Description

As a user, I want to understand which service bindings connected a service to a component successfully or not. Currently it's really difficult to understand and needs inspection into each ServiceBinding resource (yaml).

Acceptance Criteria

Show a status badge on the SB details page
Show a Status field in the right column of the SB details page
Show the Status field in the right column of the Topology side panel when a SB is selected
Show an indicator in the Topology view which will help to differentiate when the service binding is in error state
Define the available statuses & associated icons 🥴
1. Connected
2. Error
Error states defined by the SB conditions … if any of these 3 are not True, the status will be displayed as Error
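
The status rule above reduces to "Connected only if every relevant condition is True". A minimal sketch, assuming conditions are dictionaries shaped like Kubernetes conditions with `type` and `status` fields; the condition type names in the usage example are placeholders, since the source does not list them here, and treating an empty list as Error is also an assumption.

```python
def service_binding_status(conditions: list) -> str:
    """Derive the displayed status from a ServiceBinding's conditions."""
    if not conditions:
        # Assumption: a binding without reported conditions is not yet connected.
        return "Error"
    all_true = all(c.get("status") == "True" for c in conditions)
    return "Connected" if all_true else "Error"


if __name__ == "__main__":
    # "ConditionA" etc. are hypothetical names, not the real condition types.
    example = [
        {"type": "ConditionA", "status": "True"},
        {"type": "ConditionB", "status": "True"},
        {"type": "ConditionC", "status": "False"},
    ]
    print(service_binding_status(example))
```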

Additional Details:

https://github.com/openshift/console/pull/11671

Epic ODC-6266: Improve DevExp for front end developers

T-shirt size: M

Goal:

Provide an easy and successful experience for front end developers to build and deploy their applications

Why is it important?

Currently, the front end dev experience is not positive. It's much easier for them to use other platforms. Improving the front end dev experience will enable us to gain more market share

Use cases:

Need to be able to override the npm command when using Node Builder Image
Need to expose target port
Need access to the URL to access my application

Although we provide the ability for 2 & 3 today, the current journey does not match with the mental model of the front end developer

Acceptance criteria:

When importing an app, I should be able to easily provide the npm build and run commands
When opting in to create a route, the target port should be exposed without having to open any Advanced Options
After importing my app, if a route is exposed, I should be able to access/copy that URL

Dependencies (External/Internal):

Design Artifacts:

Desired UX experience

enable user to provide the *Build Command* when Node Builder image is being used
enable user to provide the *Run Command* when Node Builder image is being used

expose the Target Port under *Create a route to the Application* rather than inside Show advanced Routing options

NEED TO FINALIZE HOW TO PROVIDE THE ROUTE TO EASILY COPY – Inline Notification maybe? As well as side panel?

Note:

Story ODC-6443: Add an option to add additional labels for just the Route and move the target port before the route checkbox

Description

As a user, I want to have the option to add additional labels to a Route, as I could do in OCP3. See ~~RFE-622~~

The additional labels should only be added to the route, not the service or other components. The advanced option "Labels" should not be touched; those labels are still added to all components.

As a small addition, we should also always show the "Target port", since it also defines the Service port. To make this clearer, the "Target port" should be shown before the "Create a route to the Application" checkbox.

Acceptance Criteria

The following changes should be applied to the Import flow (from Git, from Container, ...) and to the Edit page as well:

Move the option "Target port" before the checkbox "Create a route to the Application" and do not hide the "Target port" when the checkbox is disabled
Add a new "Additional route labels" option, with a label input field to the "Advanced Routing options"
Save (Import) and update (Edit) the labels to the Route resource. When editing a Deployment with a Route the route labels should not show the shared labels.
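
The label handling in the last criterion can be sketched as follows. This is an illustrative model only, not the console code: the function names are hypothetical, and it assumes labels are plain string maps and that a route label counts as "shared" when the same key and value also appear in the shared set.

```python
def labels_for_route(shared_labels: dict, additional_route_labels: dict) -> dict:
    """On save: the Route gets the shared labels plus its own additional labels."""
    return {**shared_labels, **additional_route_labels}


def route_only_labels(route_labels: dict, shared_labels: dict) -> dict:
    """On edit: show only the labels that are specific to the Route."""
    return {
        key: value
        for key, value in route_labels.items()
        if shared_labels.get(key) != value
    }
```

Round-tripping through both functions returns the additional labels unchanged, as long as none of them duplicates a shared key and value.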

Additional Details:

https://github.com/openshift/console/pull/10663

Epic ODC-6322: Automation Test plan for 4.10 Release

Problem:

This epic is mainly focused on the 4.10 Release QE activities

Goal:

1. Identify the scenarios for automation
2. Segregate the test scenarios into smoke, regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
4. Work closely with dev team for epic automation
5. Create the automation scripts using Cypress
6. Implement CI for nightly builds
7. Execute scripts on sprint basis

Why is it important?

To track the QE progress in one place, in the 4.10 Release Confluence page

Use cases:

<case>

Acceptance criteria:

<criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Task ODC-6455: Add page tests should use latest UI labels like "Import from Git" instead of mapping "From Devfile" strings

Several places in the code map the old action items "From Git", "From Dockerfile" and "From Devfile" to the new action "Import from Git".

We should avoid mapping different strings to the new version and instead update our tests so that the feature and page object files match the latest frontend code.

Code areas I found are marked with

      // TODO (ODC-6455): Tests should use latest UI labels like "Import from Git" instead of mapping strings

https://github.com/openshift/console/pull/10864

Task ODC-5671: Spike-CI on nightly jobs

Acceptance criteria:

Execute the automation scripts on ODC nightly builds in OpenShift CI (prow) periodically
provide a separate job for each "plugin" (like pipelines, knative, etc.)

https://github.com/openshift/console/pull/10927

Task ODC-6453: Enhance the after all hook to handle deletion of more than one namespace created in a feature file

https://github.com/openshift/console/pull/10859

Epic ODC-6452: Dev Console Usability Improvements [4.11]

Goal:

This epic covers a number of customer requests(RFEs) as well as increases usability.

Why is it important?

Customer satisfaction as well as improved usability.

Acceptance Criteria

Allow user to re-arrange the resources which have been added to nav by the user
Improved user experience (form based experience)
1. Form based editing of Routes
2. Form based creation and editing of Config Maps
3. Form based creation of Deployments
Improved discovery
1. Include Share my project on the Add page to increase discoverability
2. NS Helm Chart Repo
  1. Add tile to Add page for discoverability
  2. Provide a form driven creation experience
  3. User should be able to switch back and forth from Form/YAML
  4. change the intro text to the below & have the link in the intro text bring up the full page form
    1. Browse for charts that help manage complex installations and upgrades. Cluster administrators can customize the content made available in the catalog. Alternatively, developers can try to configure their own custom Helm Chart repository.

Dependencies (External/Internal):

None

Exploration:

Miro board from Epic Exploration

Story ODC-6645: Convert the ProjectHelmChartRepository create form into a form-yaml switcher

Description

As a user, I should be able to switch between the form and yaml editor while creating the ProjectHelmChartRepository CR.

Acceptance Criteria

Convert the create form into a form-yaml switcher
Display this form-yaml view in Search -> ProjectHelmChartRepositories in both perspectives

Additional Details:

Form component https://github.com/openshift/console/pull/11227

https://github.com/openshift/console/pull/11440

Story ODC-6497: Form based experience for creating Deployments

Description

As a user, I want to use a form to create Deployments

Acceptance Criteria

Use existing edit Deployment form component for creating Deployments
Display the form when clicked on `Create Deployment` in the Deployments Search page in the Dev perspective
The `Create Deployment` button in the Deployments list page & the search page in the Admin perspective should have a similar experience.

Additional Details:

Edit deployment form ~~ODC-5007~~

https://github.com/openshift/console/pull/11598

Epic ODC-6462: Improve console telemetry

Problem:

Currently we are only able to get limited telemetry from the Dev Sandbox, but not from any of our managed clusters or on prem clusters.

Goals:

Enable gathering segment telemetry whenever cluster telemetry is enabled on OSD clusters
Have our OSD clusters opt into telemetry by default
Work with PM & UX to identify additional metrics to capture in addition to what we have enabled currently on Sandbox.
Ability to get a single report from woopra across all of our Sandbox and OSD clusters.
Be able to generate a report including metrics of a single cluster or all clusters of a certain type ( sandbox, or OSD)

Why is it important?

In order to properly analyze usage and improve the user experience, we need to be able to gather as much data as possible.

Story ODC-6670: Provide telemetry configuration as SERVER_FLAGS in console backend/bridge

Acceptance Criteria

Extend console backend (bridge) to provide configuration as SERVER_FLAGS
```
// JS type
telemetry?: Record<string, string>
```
1. Read the annotation of the cluster ConfigMap for telemetry data and pass them into the internal serverconfig.
2. Pass through this internal serverconfig and export it as SERVER_FLAGS.
3. Add a new --telemetry CLI option so that the telemetry options could be tested in a dev environment:
```
./bin/bridge --telemetry SEGMENT_API_KEY=a-key-123-xzy
./bin/bridge --telemetry CONSOLE_LOG=debug
```
TBD: In best case the new annotation could be read from the cluster ConfigMap...
1. Otherwise update the console-operator to pass the annotation from the console cluster configuration to the console ConfigMap.
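
How repeated `--telemetry KEY=VALUE` options could collapse into the `Record<string, string>` shape shown above can be sketched like this. It is a hypothetical model of the parsing step only, not the bridge's Go implementation; the function names and the error message are assumptions.

```python
def parse_telemetry_options(values: list) -> dict:
    """Turn ["KEY=VALUE", ...] into a flat string-to-string mapping."""
    telemetry = {}
    for value in values:
        key, separator, option = value.partition("=")
        if not separator or not key:
            raise ValueError(f"expected KEY=VALUE, got {value!r}")
        # A later option with the same key overrides an earlier one.
        telemetry[key] = option
    return telemetry


def server_flags(telemetry: dict) -> dict:
    """Expose the parsed options under the 'telemetry' key of SERVER_FLAGS."""
    return {"telemetry": dict(telemetry)}


if __name__ == "__main__":
    options = ["SEGMENT_API_KEY=a-key-123-xzy", "CONSOLE_LOG=debug"]
    print(server_flags(parse_telemetry_options(options)))
```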

Additional Details:

More information about the integration with the backend can be found in the Telemetry on OSD clusters Google Doc

Epic WRKLDS-389: Add heterogeneous architecture support to oc

Goal:
Enhance oc adm release new (and related verbs info, extract, mirror) with heterogeneous architecture support

Story WRKLDS-370: oc adm release - add heterogeneous architecture support

tl;dr

oc adm release new (and related verbs info, extract, mirror) would be enhanced to optionally allow the creation of manifest list release payloads. The manifest list flow would be triggered whenever the CVO image in an imagestream was a manifest list. If the CVO image is a standard manifest, the generated release payload will also be a manifest. If the CVO image is a manifest list, the generated release payload would be a manifest list (containing a manifest for each arch possessed by the CVO manifest list).

In either case, oc adm release new would permit non-CVO component images to be manifest or manifest lists and pass them through directly to the resultant release manifest(s).

If a manifest list release payload is generated, each architecture specific release payload manifest will reference the same pullspecs provided in the input imagestream.

More details in Option 1 of https://docs.google.com/document/d/1BOlPrmPhuGboZbLZWApXszxuJ1eish92NlOeb03XEdE/edit#heading=h.eldc1ppinjjh
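
The decision flow described above can be modeled in a few lines. This is a conceptual sketch only, not the `oc` implementation: the dictionary shapes, field names and the function name are hypothetical, chosen to show that the payload kind follows the CVO image and that every per-architecture manifest references the same input pullspecs.

```python
def plan_release_payload(cvo_image: dict, component_pullspecs: dict) -> dict:
    """Decide the shape of the release payload from the CVO image.

    cvo_image is {"kind": "manifest"} or
    {"kind": "manifest_list", "architectures": ["amd64", ...]}.
    """
    if cvo_image["kind"] == "manifest_list":
        return {
            "kind": "manifest_list",
            "manifests": [
                # Each architecture-specific manifest references the same
                # pullspecs that were provided in the input imagestream.
                {"architecture": arch, "references": dict(component_pullspecs)}
                for arch in cvo_image["architectures"]
            ],
        }
    # A standard CVO manifest yields a standard single-manifest payload.
    return {"kind": "manifest", "references": dict(component_pullspecs)}
```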

https://github.com/openshift/oc/pull/1120

Epic CONSOLE-2848: Port all Protractor tests to Cypress

Epic Goal

Port all remaining Protractor tests to Cypress

Why is this important?

Protractor is very hard to debug when tests fail/flake
Once all protractor tests are ported we can remove all Protractor dependencies, scripts, and configuration files.
Cypress has better debugging, plug-ins, and reporting tools

Acceptance Criteria

CI - MUST be running successfully with tests automated

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story CONSOLE-2867: Cypress: port protractor OAuth tests

Please read: migrating-protractor-tests-to-cypress

Protractor test to migrate: `frontend/integration-tests/tests/oauth.scenario.ts`
Large but straightforward

47) OAuth

   48) BasicAuth IDP
      ✔ creates a Basic Authentication IDP
      ✔ shows the BasicAuth IDP on the OAuth settings page

   49) GitHub IDP
      ✔ creates a GitHub IDP
      ✔ shows the GitHub IDP on the OAuth settings page

   50) GitLab IDP
      ✔ creates a GitLab IDP
      ✔ shows the GitLab IDP on the OAuth settings page

   51) Google IDP
      ✔ creates a Google IDP
      ✔ shows the Google IDP on the OAuth settings page

   52) Keystone IDP
      ✔ creates a Keystone IDP
      ✔ shows the Keystone IDP on the OAuth settings page

   53) LDAP IDP
      ✔ creates a LDAP IDP
      ✔ shows the LDAP IDP on the OAuth settings page

   54) OpenID IDP
      ✔ creates a OpenID IDP
      ✔ shows the OpenID IDP on the OAuth settings page

Acceptance Criteria

Protractor test ported to cypress
Remove any unused legacy `data-test-id`s
Protractor test deleted, and no longer referenced in `frontend/integration-tests/protractor.conf.ts`

Sub-task CONSOLE-2870: - delete -

https://github.com/openshift/console/pull/10226

Epic IR-228: Spread registry across multiple zones

Epic Goal

Make the image registry distributed across availability zones.

Why is this important?

The registry should be highly available and zone failsafe.

Scenarios

As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Pod's topologySpreadConstraints

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: https://github.com/openshift/cluster-image-registry-operator/pull/730
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story IR-195: Image registry is resilient against zone failures

Story: As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.

Background: The image registry currently uses affinity/anti-affinity rules to spread registry pods across different hosts. However, this can leave all pods on hosts in a single zone, leading to a long recovery time for the registry if that zone is lost entirely. Because of past problems with adherence to the preferred anti-affinity setting, the configuration was changed to required, and the rules became hard constraints. With zones as hard constraints, the internal registry would no longer deploy in environments with a single zone, e.g. the internal CI environment. Pod topology spread constraints are a newer API supported in OCP that can also relax constraints when they cannot be satisfied. Details here: https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html

Acceptance criteria:

by default the internal registry is deployed with at least two replicas
by default the topology constraints should be on a zone-basis, so that by default one registry pod is scheduled in each zone
when constraints can't be satisfied the registry should deploy anyway
~~we should not do this in SNO environments~~
the registry should still work on SNO environments
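
A zone-based constraint that still lets the registry deploy when it cannot be satisfied could look like the sketch below. The field names (`maxSkew`, `topologyKey`, `whenUnsatisfiable`, `labelSelector`) are the standard Kubernetes pod `topologySpreadConstraints` fields; the concrete values, the label and the helper names are assumptions and not necessarily what the operator sets.

```python
def zone_spread_constraint(match_labels: dict, max_skew: int = 1) -> dict:
    """Build a soft, zone-based topology spread constraint for a pod spec."""
    return {
        "maxSkew": max_skew,
        "topologyKey": "topology.kubernetes.io/zone",
        # ScheduleAnyway keeps the constraint soft, so single-zone and SNO
        # clusters can still schedule the registry pods.
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": dict(match_labels)},
    }


def zone_skew(pods_per_zone: dict) -> int:
    """Difference between the most and least populated zones."""
    if not pods_per_zone:
        return 0
    counts = list(pods_per_zone.values())
    return max(counts) - min(counts)
```

With two replicas in two zones, `zone_skew({"zone-a": 1, "zone-b": 1})` is 0, whereas both pods landing in one zone gives a skew of 2, which the scheduler would try to avoid but still tolerate under `ScheduleAnyway`.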

Open Questions:

what happens in environments where the storage is zone dependent?

https://github.com/openshift/cluster-image-registry-operator/pull/730

Epic IR-229: Update k8s to 1.24

Epic Goal

Update image registry dependencies (Kubernetes and OpenShift) to the latest versions.

Why is this important?

New versions usually bring improvements that are needed by the registry and help with getting updates for z-stream.

Scenarios

As an OpenShift engineer, I want my components to use the latest versions of dependencies, so that they get fixes for known issues and can be easily updated in z-stream.

Acceptance Criteria

CI - MUST be running successfully with tests automated

Dependencies (internal and external)

Kubernetes 1.24

Previous Work (Optional):

~~IR-210~~

Done Checklist

CI - CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
QE - Test plans in Polarion: <link or reference to Polarion>

Task IR-232: Bump k8s to 1.24 in image-registry repo

As an OpenShift engineer
I want image-registry to use the latest k8s libraries
so that image-registry can benefit from new upstream features.

Acceptance criteria

image-registry uses k8s.io/api v1.24.z
image-registry uses latest openshift/api, openshift/library-go, openshift/client-go

https://github.com/openshift/image-registry/pull/328

Epic MGMT-9078: OpenShift Console - NVIDIA GPU Admin Dashboard

Epic Goal

Provide a dedicated dashboard for NVIDIA GPU usage visualization in the OpenShift Console.

Why is this important?

Customers that use GPUs in their clusters usually have GPU workloads as the main purpose of their cluster. As such, it makes much more sense to show details about their usage of GPGPU resources AND CPU/RAM rather than just CPU/RAM

Scenarios

As an admin of a cluster dedicated to data science, I want to quickly find out how much of my very costly resources are currently in use and if things are getting queued due to lack of resources

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

The NVIDIA GPU Operator must export the relevant data to Prometheus

Open questions:

Will NVIDIA agree to these extra data exports in their GPU Operator?

I asked Zvonko Kaiser and he seemed open to it. I need to confirm with Shiva Merla

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story MGMT-9862: Implement GPU Provider on Details card

Rename Provider to Infrastructure Provider

Add GPU Provider

https://miro.com/app/board/uXjVOeUB2B4=/?moveToWidget=3458764514332229879&cot=14

https://github.com/openshift/console/pull/11272

Epic OCPBUILD-44: Dev Preview - User Namespace Builds

Epic Goal

Run OpenShift builds that do not execute as the "root" user on the host node.

Why is this important?

OpenShift builds require an elevated set of capabilities to build a container image
Builds currently run as root to maintain adequate performance
Container workloads should run as non-root from the host's perspective. Containers running as root are a known security risk.
Builds currently run as root and require a privileged container. See ~~BUILD-225~~ for removing the privileged container requirement.

Scenarios

Run BuildConfigs in a multi-tenant environment
Run BuildConfigs in a heightened security environment/deployment

Acceptance Criteria

Developers can opt into running builds in a cri-o user namespace by providing an environment variable with a specific value.
When the correct environment variable is provided, builds run in a cri-o user namespace, and the build pod does not require the "privileged: true" security context.
User namespace builds can pass basic test scenarios for the Docker and Source strategy build.
Steps to run unprivileged builds are documented.

Dependencies (internal and external)

Buildah supports running inside a non-privileged container
CRI-O allows workloads to opt into running containers in user namespaces.

Previous Work (Optional):

~~BUILD-225~~ - remove privileged requirement for builds.

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story BUILD-433: Run Unprivileged Builds with Environment Variable

User Story

As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root from the host's perspective with elevated privileges

Acceptance Criteria

Developers can provide an environment variable to indicate the build should not use privileged containers
When the correct env var + value is specified, builds run in a user namespace (non-root on the host)

QE Impact

No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.

Docs Impact

We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.

PX Impact

This likely warrants an OpenShift blog post, potentially?

Notes

https://github.com/openshift/builder/pull/291

Epic OCPCLOUD-737: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story OCPCLOUD-1263: Integrate openshift/API machine definitions into components

Background

As a follow up to ~~OCPCLOUD-693~~, once all of the API definitions are present in openshift/api, we need to migrate the existing code bases to use the new API locations.

This will include:

Machine API Operator
Cluster Machine Approver
Cluster API Provider AWS|Azure|GCP|IBM|Alibaba|OpenStack|Kubevirt
Cluster API actuator pkg
Installer
WMCO
MCO
Hive
Grep OpenShift for other references to our old APIs

Steps

Replace the Machine API imports with the new openshift/API MAPI locations

Stakeholders

Cluster Infra
Owners of the repos listed above

Definition of Done

The openshift/API definitions are used across components in the MAPI ecosystem

Docs

Generated docs for API types should now come from openshift/API

Testing

Regular regression testing should be sufficient. This is a copy paste for the most part, and we expect the code won't compile if we break this

Sub-task OCPCLOUD-1267: Migrate cluster-api-provider-gcp to new API definitions

https://github.com/openshift/machine-api-provider-gcp/pull/3

Epic ODC-6381: 4.9 Epics Automation stories tech debt

Problem:

Complete all the 4.9 epic feature automation user stories and merge them to the master branch.

Goal:

4.9 epics automation completion

Why is it important?

Tech debt should be completed

Use cases:

<case>

Acceptance criteria:

Create the PRs for the 4.9 epic user stories automation
Review them
Merge them to the 4.10 master branch and the 4.9 master branch

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Story ODC-6364: Epic Automation for ODC-5149 "Pipeline as Code"

Description

As a user, I want to store my delivery pipelines in a Git repository as the source of truth and execute the pipeline on OpenShift on Git events, so that I can version and trace changes to the delivery pipelines in Git.

Use Cases

Developer can see the list of Git repositories that are added to the namespace for pipeline-as-code execution
Developer can navigate from the Console to the Git repository on the Git provider
For each Git repository, developer can see the details of the last pipeline execution and the commit id that triggered it, with the possibility of navigating to the Git commit in the Git provider
Developer can see the list of pipelinerun executions related to a Git repository in a chronological order and the commit id that triggered each

Acceptance Criteria

As a user, looking at the Pipelines page in the Developer Console, I should be able to see a list of (a) Git repositories that are added to the namespace for PAC execution AND (b) all pipelines in the namespace
As a user, I should be able to navigate to a details page of the git repo.
1. This details page should provide access to (a) details of the git repo and (b) a list of pipeline runs.
2. This PLR tab should show more information than the typical PLR List view, including SHA (commit id), commit message, branch & trigger type
As a user, when I look at a Pipeline Run Details page for a run associated with a git repo (PAC), the page should:
1. Indicate that it's from a specific git repo rather than a PL resource
2. Include the SHA (commit id), commit message, branch & trigger type

https://github.com/openshift/console/pull/10521

Story CONSOLE-2975: Migrate from Node Sass to Dart Sass

Node Sass is deprecated. See https://github.com/sass/node-sass

https://github.com/openshift/console/pull/10149

Task IR-224: Bump openshift/api package

Acceptance criteria:

All tests (including e2e) pass
No regressions are introduced
openshift/api points to a recent commit on the master branch

https://github.com/openshift/cluster-image-registry-operator/pull/728

Story OCPCLOUD-1278: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-autoscaler-operator/pull/226

Task MON-1656: Add a Makefile rule in CMO for verifications and checks

Add a Makefile rule in CMO to execute all the different rules that are used for verification and validation. Currently, some of them might not be in the right place; for example, `check-assets` is part of `generate` despite not being responsible for any generation. https://github.com/openshift/cluster-monitoring-operator/pull/1151/files#r629371735

DoD:

Add a new rule in CMO to handle verification
Add a CI job for this rule

Task MON-975: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-monitoring-operator/pull/1338

Story CONSOLE-2999: Update OCP branding

[Updated story request]

The decision is to always display the Red Hat OpenShift logo for OCP rather than conditionally, and also to update the OCP login, errors, and providers templates. https://openshift.github.io/oauth-templates/

Related note in comments.

[Original request]

If the ACM or the ACS dynamic plugin is enabled and there is not a custom branding set, then the default "Red Hat Openshift" branding should be shown.

This was identified as an issue during the Hybrid Console Scrum on 11/15/20201

PRs associated with this change

https://github.com/openshift/console/pull/10940 [merged]

https://github.com/openshift/oauth-templates/pull/20 [merged]

https://github.com/openshift/cluster-authentication-operator/pull/540 [merged]

https://github.com/openshift/console/pull/10940

Story MON-1679: use static authorizer feature of kube-rbac-proxy

The static authorizer feature has landed in upstream kube-rbac-proxy. Let's use it by configuring a static authorizer for all requests that hit a /metrics endpoint.

DoD:

Downstream kube-rbac-proxy is synced.
All CMO operands are configured with static authorization.
Bugzillas created for all non-monitoring components using kube-rbac-proxy for metrics authn/authz.

https://github.com/openshift/cluster-monitoring-operator/pull/1318

Task MON-1890: update openshift/kube-state-metric to 2.2.0

New release https://github.com/kubernetes/kube-state-metrics/releases

https://github.com/openshift/kube-state-metrics/pull/61

Task MON-1873: Tag all resources created by CMO e2e tests

The CMO e2e tests create a bunch of resources. These should be cleaned up on a successful run. However:

Some test failures leave the created resources behind, which then have to be cleaned up before a re-run.
There have been developer reports that even successful runs don't tidy up everything.

In a CI context this is rarely a problem. However, running the tests locally can become quite awkward, especially with repeated runs on the same cluster.

We should tag all resources created by the e2e tests with a label (app.kubernetes.io/created-by: cmo-e2e-test).
This will allow easy cleanup by deleting all resources with that label and will allow for checking proper clean-up.
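
The cleanup idea can be sketched as follows. This is a minimal illustration in Python, not the CMO test code: resources are modelled as plain dictionaries shaped like Kubernetes object metadata, and the helper names (`tag_resource`, `leftovers`) are hypothetical. Only the label key and value come from the description above.

```python
# Label taken from the task description; everything else is illustrative.
E2E_LABEL_KEY = "app.kubernetes.io/created-by"
E2E_LABEL_VALUE = "cmo-e2e-test"


def tag_resource(resource):
    """Return a copy of a manifest-like dict with the e2e label set."""
    tagged = dict(resource)
    metadata = dict(tagged.get("metadata", {}))
    labels = dict(metadata.get("labels", {}))
    labels[E2E_LABEL_KEY] = E2E_LABEL_VALUE
    metadata["labels"] = labels
    tagged["metadata"] = metadata
    return tagged


def leftovers(resources):
    """List the resources that still carry the e2e label."""
    return [
        r for r in resources
        if r.get("metadata", {}).get("labels", {}).get(E2E_LABEL_KEY) == E2E_LABEL_VALUE
    ]


# Hypothetical cluster state: one test-created object and one unrelated object.
cluster = [
    tag_resource({"kind": "ConfigMap", "metadata": {"name": "test-cm"}}),
    {"kind": "ConfigMap", "metadata": {"name": "user-cm", "labels": {"team": "a"}}},
]
names = [r["metadata"]["name"] for r in leftovers(cluster)]
print(names)
```

An empty result after a test run would indicate proper clean-up. On a real cluster the same selection would presumably be expressed as a label selector (`app.kubernetes.io/created-by=cmo-e2e-test`) rather than an in-memory filter.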

DoD:
All e2e resources get properly tagged.
It is straightforward to ensure that future code changes don't skip adding this tag.

https://github.com/openshift/cluster-monitoring-operator/pull/1397

Task MON-1218: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-monitoring-operator/pull/1379

Task MON-1659: set relatedObjects in ClusterOperator manifests

As mentioned in [1], the cluster monitoring operator doesn't define the relatedObjects field in the ClusterOperator manifest which is initially deployed by CVO [2].
If the CMO pod fails to start, the must-gather might miss information from the monitoring namespace. Note that once CMO runs, it will update the initial ClusterOperator object with the proper information [3].

[1] http://mailman-int.corp.redhat.com/archives/aos-devel/2021-May/msg00139.html
[2] https://github.com/openshift/cluster-monitoring-operator/blob/master/manifests/0000_50_cluster-monitoring-operator_06-clusteroperator.yaml
[3] https://github.com/openshift/cluster-monitoring-operator/blob/a6bc9824035ceb8dbfe7c53cf0c138bfb2ec5643/pkg/client/status_reporter.go#L49-L63

https://github.com/openshift/cluster-monitoring-operator/pull/1483

Story CONSOLE-2892: Allow dynamic plugins to proxy to services on the cluster

Goal

We have several use cases where dynamic plugins need to proxy to another service on the cluster. One example is the Helm plugin. We would like to move the backend code for Helm to a separate service on the cluster, and the Helm plugin could proxy to that service for its requests. This is required to make Helm a dynamic plugin. Similarly if we want to have ACM contribute any views through dynamic plugins, we will need a way for ACM to proxy to its services (e.g., for Search).

It's possible for plugins to make requests to services exposed through routes today, but that has several problems:

It requires that the service be exposed outside the cluster, which is not always desired.
It requires the service support CORS headers for the console.
There is no way to specify a CA file for the route if it's not trusted by the browser.
Plugins will not have access to the user's access token on the client, which means that there is no simple way to handle auth.

Plugins need a way to declare in-cluster services that they need to connect to. The console backend will need to set up proxies to those services on console load. This also requires that the console operator be updated to pass the configuration to the console backend.

This work will apply only to single clusters.

Open Questions

What happens when a multitenant isolated network policy is configured on the cluster?

https://docs.openshift.com/container-platform/4.7/networking/network_policy/multitenant-network-policy.html

How do we (and can we?) support this for multi-cluster where console is running on a different hub cluster?
Do we need to auth for all requests?

Acceptance Criteria

Plugins can declare a service to proxy to in the ConsolePlugin resource
Plugins can specify a CA cert for the service
Console falls back to the service signing CA if none is specified
Plugins have a way of specifying whether the user's authentication token is included in requests through the service proxy
Dynamic plugin enhancement is updated with the implementation details
Support for server-side events (SSE) for ACM
Add support, or a flag, if auth is needed for each request.

cc Ali Mobrem [~christianmvogt]

Task MON-1872: Use upstream kube-thanos in cluster-monitoring-operator jsonnet

As per [1], the jsonnet code for managing thanos-ruler resources should reuse the upstream kube-thanos project.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/399c84dbca596b611b0c30a0d2df63a5d2b0b8cc/jsonnet/components/thanos-ruler.libsonnet#L1

https://github.com/openshift/cluster-monitoring-operator/pull/1478

Bug OCPBUGS-1: Test Bug

Test description

https://github.com/openshift/driver-toolkit/pull/77

Story SPLAT-246: [vsphere] Ensure clear user agent strings set for components calling to vSphere API

*USER STORY:*

As a customer or OpenShift engineer, I want to see the user agent for anything calling from OpenShift -> vSphere to eliminate troubleshooting guesswork.

*DESCRIPTION:*

A question in #forum-vmware was raised where we identified that the user-agent may not be configured for all OpenShift components calling to vSphere API.

https://coreos.slack.com/archives/CH06KMDRV/p1627368902058800

*Required:*

Audit of OpenShift components calling to vSphere API to make sure user agent strings are set appropriately.

*Nice to have:*

How can this be prevented in the future? How can we minimize maintenance costs added by new PRs/bugs reported from this spike?

*ACCEPTANCE CRITERIA:*

New PRs or bug reports for each affected component.

Task MGMT-9440: Fix single-node serial tests API crash

See this thread for more information: https://coreos.slack.com/archives/G01F05P2PTL/p1645982017061749?thread_ts=1645970469.871559&cid=G01F05P2PTL

Story MON-1913: Expose field in CMO configmap to configure the retention period of Thanos Ruler

Users currently can't configure the retention period for Thanos Ruler; the default value is 24h (from the prometheus operator).

https://github.com/openshift/cluster-monitoring-operator/pull/1651

Task OSDOCS-3257: Add content type to oc CLI doc generation

The two modules that are auto generated for the CLI docs need to add ":_content-type: REFERENCE" to the top of the files. Update the doc generation templates to add these.
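
A minimal sketch of the intended effect, written in Python for illustration only (the actual generator lives in the oc repository and is not shown here). The function name `add_content_type` and the sample module text are hypothetical; the attribute line is the one quoted above.

```python
CONTENT_TYPE_LINE = ":_content-type: REFERENCE"


def add_content_type(module_text):
    """Prepend the content-type attribute unless the module already starts with it."""
    if module_text.startswith(CONTENT_TYPE_LINE):
        return module_text
    return CONTENT_TYPE_LINE + "\n" + module_text


# Hypothetical generated module content.
module = "// NOTE: hypothetical generated module\n= Example CLI commands\n"
updated = add_content_type(module)
print(updated.splitlines()[0])
```

Making the helper idempotent means regenerating the docs would not stack the attribute line more than once.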

https://github.com/openshift/oc/pull/1072

Story CONSOLE-2768: console-operator should use bindata instead of inlining manifests

The console-operator codebase contains a lot of inline manifests. Instead, we should put those manifests into a `/bindata` folder, from which they will be read and then updated as needed.

https://github.com/openshift/console-operator/pull/550

Task MON-1964: Make Telemeter receive endpoint request limit configurable

Currently, Telemeter has no configurable request limit for the receive endpoint (for full context see: https://github.com/openshift/cluster-monitoring-operator/pull/1416). It uses the default limit defined in the code base, but this limit might not be suitable for our usage.

As part of this ticket, we should:

1) Understand what the appropriate request size limit is for our use cases

2) Make the limit configurable in Telemeter via a flag

3) Deploy the changes, initially to the staging environment, to enable our team to test them.

Story OADP-22: Send Telemetry metrics on OADP

We will want to establish some basic metrics we can report back to Telemetry.

Let's consider:

Operator installs
Backups: created, success, error
Restores: created, success, error

Below is some background info from MTC when we added Telemetry support that may help

See: https://github.com/konveyor/metrics-queries/blob/master/README.md

Design/Development info:

OpenShift Monitoring Integration Guide

Sending metrics via telemetry

Monitoring integration with OLM operators

https://www.openshift.com/blog/observability-superpower-correlation

Source Code:

https://github.com/konveyor/mig-controller/blob/master/pkg/controller/migmigration/metrics.go

https://github.com/openshift/cluster-monitoring-operator/pull/1536

Task IR-227: Remove legacy code for platformStatus

Before platformStatus, the operator used to get information about AWS and GCP from the install-config config map. This code can be removed.

https://github.com/openshift/cluster-image-registry-operator/pull/739

Feature Request RFE-2703: OCP should alarm/alert when the etcd container memory consumption goes beyond 90%

1. Proposed title of this feature request
--> Alert generation when the etcd container memory consumption goes beyond 90%

2. What is the nature and description of the request?
--> When the etcd database grows rapidly because an application or workload generates a large number of objects such as secrets, events, or configmaps, the memory and CPU consumption of the API server and etcd containers (control plane components) spikes. Eventually the control plane nodes hang, become unresponsive, or crash with out-of-memory errors as critical processes and services running on the master nodes get killed. We therefore request an alert when the etcd container's memory consumption goes beyond 90%, so that the cluster administrator can take action before the cluster or its nodes become unresponsive.

We already have an etcdExcessiveDatabaseGrowth Prometheus rule, which fires when a surge in etcd writes leads to a 50% increase in database size over the past four hours on an etcd instance. However, it does not consider memory consumption:

$ oc get prometheusrules etcd-prometheus-rules -o yaml|grep -i etcdExcessiveDatabaseGrowth -A 9

alert: etcdExcessiveDatabaseGrowth
annotations:
  description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
    leading to 50% increase in database size over the past four hours on etcd
    instance {{ $labels.instance }}, please check as it might be disruptive.'
expr: |
  increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
for: 10m
labels:
  severity: warning
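
To make the existing rule's arithmetic concrete, here is a simplified Python reading of the expression. It treats `increase(...)` over the subquery as the difference between the last and first sample, which ignores how Prometheus extrapolates and handles resets. The quota and database sizes are invented for illustration.

```python
def quota_usage_percent(db_size_bytes, quota_bytes):
    """Database size as a percentage of the backend quota."""
    return db_size_bytes / quota_bytes * 100


def growth_exceeds_threshold(samples, quota_bytes, threshold=50):
    """Approximate `increase(ratio[240m:1m]) > 50` as last minus first sample."""
    usage = [quota_usage_percent(s, quota_bytes) for s in samples]
    return (usage[-1] - usage[0]) > threshold


GIB = 1024 ** 3
quota = 8 * GIB  # hypothetical quota

# Hypothetical database sizes sampled across the four-hour window.
slow_growth = [1 * GIB, 2 * GIB, 3 * GIB]  # 12.5% -> 37.5% of quota
fast_growth = [1 * GIB, 4 * GIB, 6 * GIB]  # 12.5% -> 75% of quota

print(growth_exceeds_threshold(slow_growth, quota))
print(growth_exceeds_threshold(fast_growth, quota))
```

Under these assumptions only the second series crosses the threshold. The rule looks solely at database size relative to quota; it says nothing about container memory, which is the gap this request describes.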

3. Why does the customer need this? (List the business requirements here)
--> etcd is a system-critical container. Once its memory consumption goes beyond 90-95% of total RAM, the OCP cluster becomes unresponsive, causing revenue loss to the business and impacting the productivity of users of the OpenShift cluster.

4. List any affected packages or components.
--> etcd

https://github.com/openshift/machine-config-operator/pull/3124

Story MON-1949: Improve prometheus-adapter consistency

The current integration of prometheus-adapter in OpenShift uses the platform Prometheus as a backend to get metrics. The problem with this design is that metrics come from 2 different Prometheus instances which don't have replicated data, so two queries sent at the same time to prometheus-adapter might yield different results because the underlying PromQL queries might be executed on different Prometheus servers. As a consequence, we end up with inconsistent data across multiple autoscaling requests.

This can be easily tested by running:

$ while true ; do date; oc adm top pod -n openshift-monitoring  prometheus-k8s-0 ; echo; sleep 1 ;done 

Mon Jul 26 03:55:07 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:08 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi          

Mon Jul 26 03:55:09 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:10 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi

This isn't a bug in itself since it was designed that way, but we could do better by using thanos-querier as a backend instead of the platform Prometheus, because it deduplicates the metrics from both instances and serves one consistent result based on the data it gets from the Prometheus servers.
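
The difference between the two backends can be illustrated with a toy model. The sketch below is Python, not prometheus-adapter or Thanos code. The replica names are hypothetical, and the CPU values mirror the output above. Round-robin selection stands in for whatever load balancing sits in front of the two Prometheus servers. The merge rule (always prefer the first replica that has data) is a deliberate simplification of what thanos-querier does.

```python
import itertools

# CPU (millicores) for the same pod as seen by each replica; names are hypothetical.
REPLICA_SAMPLES = {
    "replica-a": 208,
    "replica-b": 246,
}


def query_single_backend(replica_cycle):
    """Each request lands on whichever replica is next, so answers can alternate."""
    replica = next(replica_cycle)
    return REPLICA_SAMPLES[replica]


def query_deduplicated(samples):
    """Merge replicas into one series using a fixed preference order."""
    for replica in sorted(samples):
        if samples[replica] is not None:
            return samples[replica]
    return None


cycle = itertools.cycle(sorted(REPLICA_SAMPLES))
direct = [query_single_backend(cycle) for _ in range(4)]
deduped = [query_deduplicated(REPLICA_SAMPLES) for _ in range(4)]
print(direct)
print(deduped)
```

In this model the direct queries alternate between the two readings, while the merged view returns the same value on every request, which is the consistency the DoD below is after.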

DoD:

Use thanos-querier as a backend for prometheus-adapter

https://github.com/openshift/cluster-monitoring-operator/pull/1417

Bug CONSOLE-3087: Fix ActionContext type warning in components/actions/types.ts

When running yarn dev, type warnings can be seen in the console and in the dev overlay UI. These need to be resolved.

https://github.com/openshift/console/pull/11128

4.11.1-multi

Changes from 4.9.0-0.nightly-multi-2021-12-15-190302
