Back to index

4.11.59-multi

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.9.0-0.nightly-multi-2021-12-15-190302

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Feature Overview
Insights Advisor for OpenShift is integrated within OpenShift Cluster Manager. This has some limitations for adding new features and also for sharing codebase between RHEL Advisor and OCM Insights Advisor tab. Insights Advisor for OpenShift lacks certain features from the RHEL UI, the codebase is not 1:1 clone.
As a customer of Insights I will have same/very similar user experience with Insights for OpenShift and Insights for RHEL. The workflows will share the main concepts, the UI elements will be same and features introduced to Advisor will be automatically considered for both all supported platforms.
As OpenShift users I will still see integrations of Insights Advisor within OpenShift Cluster Manager that shows aggregated information for customer account and single cluster view on Advisor data. These integration will point to new Insights Advisor for OpenShift app that will be tightly integrated into OpenShift Cluster Manager.

  • Note: The application will be reusing the codebase but will run as a separate app for OpenShift. THere's no intent to merge RHEL and OpenShift workflows into a single app.

Goals

  • Q2CY21: Explore possibility to unify codebase between RHEL Advisor and OCM Insights Advisor tab. Identify architecture misalignments, create UI mockups to merge the two existing UIs.
  • Q3CY21: Integrate OpenShift into Advisor codebase, standup the Insights Advisor for OpenShift application and change integration in OpenShift Cluster manager to point at the new app
  • Q4CY21: Deliver missing screen of Insights Advisor for OpenShift (Systems and Recommendations views)

Requirements

  • UX overview of UI elements in both UIs - Marie Doruskova
  • Architecture overview/misalignments for both UIs - Jan Zeleny [~fjansen]

Benefits

  • Feature parity between RHEL and OpenShift
  • Adopting new features developed by RHEL Advisor team quicker
  • Smaller maintenance cost

Questions to answer...

  • Possible deviations between OpenShift and RHEL
  • Remediation workflow different between OpenShift and RHEL

Out of Scope

  • Single app that combines RHEL hosts and OpenShift clusters. Goal is still to differentiate between platforms and offer view only for a single platform.
  • Direct/Supervised remediations and integration of remediations with Advanced Cluster Manager (as a Service)

Background, and strategic fit

  • Insights Advisor for OpenShift follows the goal to introduce multiple applications that add value for OpenShift customers under the Insights brand. The current UI and integration of Advisor into OpenShift cluster manager doesn't follow pattern that other Insights for OpenShift applications can/will follow.

Documentation Considerations

  • OCM documentation is impacted, existing workflows described in OCM documentation will persist. The placement of the application within OCM will be different.

 

OCP WebConsole, in the main dashboard, has an Insights Advisor widget, which has been redirecting users to OCM. Due to the Insights Advisor tab decommission in OCM, the links should point to Advisor instead.

4.10 code freeze = 28 January (marking the task as urgent)

Problem:

Certain Insights Advisor features differentiate between RHEL and OCP advisor

Goal:

Address top priority UI misalignments between RHEL and OCP advisor. Address UI features dropped from Insights ADvisor for OCP GA.

 

Scope:

Specific tasks and priority of them tracked in https://issues.redhat.com/browse/CCXDEV-7432

 
 
 
 

 

This contains all the Insights Advisor widget deliverables for the OCP release 4.11.

Scope
It covers only minor bug fixes and improvements:

  • better error handling during internal outages in data processing
  • add "last refresh" timestamp in the Advisor widget

Show the error message (mocked in CCXDEV-5868) if the Prometheus metrics `cluster_operator_conditions{name="insights"}` contain two true conditions: UploadDegraded and Degraded at the same time. This state occurs if there was an IO archive upload error = problems with the pipeline.

Expected for 4.11 OCP release.

Scenario: Check if the Insights Advisor widget in the OCP WebConsole UI shows the time of the last data analysis
Given: OCP WebConsole UI and the cluster dashboard is accessible
And: CCX external data pipeline is in a working state
And: administrator A1 has access to his cluster's dashboard
And: Insights Operator for this cluster is sending archives
When: administrator A1 clicks on the Insights Advisor widget
Then: the results of the last analysis are showed in the Insights Advisor widget
And: the time of the last analysis is shown in the Insights Advisor widget 

Acceptance criteria:

  1. The time of the last analysis is shown in the Insights Advisor widget for the scenario above
  2. The way it is presented is defined within the scope of https://issues.redhat.com/browse/CCXDEV-5869 (mockup task)
  3. The source of this timestamp must be a result of running the Prometheus metric (last archive upload time):
    max_over_time(timestamp(changes(insightsclient_request_send_total\{status_code="202"}[1m]) > 0)[24h:1m])
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic Goal

  • Allow admin user to create new alerting rules, targeting metrics in any namespace
  • Allow cloning of existing rules to simplify rule creation
  • Allow creation of silences for existing alert rules

Why is this important?

  • Currently, any platform-related metrics (exposed in a openshift-, kube- and default namespace) cannot be used to form a new alerting rule. That makes it very difficult for administrators to enrich our out of the box experience for the OpenShift Container Platform with new rules that may be specific to their environments.
  • Additionally, we had requests from customer to allow modifications of our existing, out of the box alerting rules (for instance tweaking the alert expression or changing the severity label). Unfortunately, that is not easy since most rules come from several open source projects, or other OpenShift components, and any modifications would make a seamless upgrade not really seamless anymore. Imagine K8s changes metrics again (see 1.14) and we have to update our rules. We would not know what modifications have been done (even just the threshold might be difficult if upstream changes that as well) and we would not be able to upgrade these rules.

Scenarios

  • I'd like to modify the query expression of an existing rule (because the threshold value doesn't match with my environment).

Cloning the existing rule should end up with a new rule in the same namespace.
Modifications can now be done to the new rule.
(Optional) You can silence the existing rule.

  • I'd like to create a new rule based on a metric only available to an openshift-* namespace

Create a new PrometheusRule object inside the namespace that includes the metrics you need to form the alerting rule.

  • I'd like to update the label of an existing rule.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ability to distinguish between rules deployed by us (CMO) and user created rules

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

  1. Distinguish between operator-created rules and user-created rules
    Currently no such mechanism exists. This will need to be added to prometheus-operator or cluster-monitoring-operator.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

CMO should reconcile the platform Prometheus configuration with the alert-relabel-config resources.

 

DoD

  • Alerts changed via alert-relabel-configs are evaluated by the Platform monitoring stack.
  • Product alerts which are overriden aren't sent to Alertmanager

CMO should reconcile the platform Prometheus configuration with the AlertingRule resources.

 

DoD

  • Alerts added via AlertingRule resources are evaluated by the Platform monitoring stack.

Managing PVs at scale for a fleet creates difficulties where "one size does not fit all". The ability for SRE to deploy prometheus with PVs and have retention based an on a desired size would enable easier management of these volumes across the fleet. 

 

The prometheus-operator exposes retentionSize.

Field Description
retentionSize Maximum amount of disk space used by blocks. Supported units: B, KB, MB, GB, TB, PB, EB. Ex: 512MB.

This is a feature request to enable this configuration option via CMO cluster-monitoring-config ConfigMap.

 

cc Simon Pasquier  

Epic Goal

  • Cluster admins want to configure the retention size for their metrics.

Why is this important?

  • While it is possible to define how long metrics should be retained on disk, it's not possible to tell the cluster monitoring operator how much data it should keep. For OSD/ROSA in particular, it would facilitate the management of the fleet if the retention size could be configured based on the persistent volume size because it would avoid issues with the storage getting full and monitoring being down when too many metrics are produced.

Scenarios

  • As a cluster admin, I want to define the maximum amount of data to be retained on the persistent volume.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The cluster-monitoring-config config and the user-workload-monitoring-config configmap allow to configure the retention size for
    • Prometheus (Platform and UWM)
    • Thanos Ruler (to be confirmed)
  • Proper validation is in place preventing bad user inputs from breaking the stack.

Dependencies (internal and external)

  1. Thanos ruler doesn't support retention size (only retention time).

Previous Work (Optional):

  1. None

Open questions::

  1. None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Problem Alignment

The Problem

Today, all configuration for setting individual, for example, routing configuration is done via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, has different systems that they are using to notify teams in case of an issue, then someone needs to file a request w/ an admin to add the required settings.

That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.

We would like to introduce a more self service approach whereas individual teams can create their own configuration for their needs w/o the administrators involvement.

Last but not least, since Monitoring is deployed as a Core service of OpenShift there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction is the ability for customers to use the central Alertmanager that is owned and managed by the SRE team. They can't give access to the central managed secret due to security concerns so that users can add their own routing information.

High-Level Approach

Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.

Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.

Goal & Success

  • Allow users to deploy individual configurations that allow setting up Alertmanager for their needs without an administrator.

Solution Alignment

Key Capabilities

  • As an OpenShift administrator, I want to control who can CRUD individual configuration so that I can make sure that any unknown third person can touch the central Alertmanager instance shipped within OpenShift Monitoring.
  • As a team owner, I want to deploy a routing configuration to push notifications for alerts to my system of choice.

Key Flows

Team A wants to send all their important notifications to a specific Slack channel.

  • Administrator gives permission to Team A to allow creating a new configuration CR in their individual namespace.
  • Team A creates a new configuration CR.
  • Team A configures what alerts should go into their Slack channel.
  • Open Questions & Key Decisions (optional)
  • Do we want to improve anything inside the developer console to allow configuration?

Epic Goal

  • Allow users to manage Alertmanager for user-defined alerts and have the feature being fully supported.

Why is this important?

  • Users want to configure alert notifications without admin intervention.
  • The feature is currently Tech Preview, it should be generally available to benefit a bigger audience.

Scenarios

  1. As a cluster admin, I can deploy an Alertmanager service dedicated for user-defined alerts (e.g. separated from the existing  Alertmanager already used for platform alerts).
  2. As an application developer, I can silence alerts from the OCP console.
  3. As an application developer, I'm not allowed to configure invalid AlertmanagerConfig objects.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The AlertmanagerConfig CRD is v1beta1
  • The validating webhook service checking AlertmanagerConfig resources is highly-available.

Dependencies (internal and external)

  1. Prometheus operator upstream should migrate the AlertmanagerConfig CRD from v1alpha1 to v1beta1
  2. Console enhancements likely to be involved (see below).

Previous Work (Optional):

  1. Part of the feature is available as Tech Preview (MON-880).

Open questions:

  1. Coordination with the console team to support the Alertmanager service dedicated for user-defined alerts.
  2. Migration steps for users that are already using the v1alpha1 CRD.

Done Checklist

 * CI - CI is running, tests are automated and merged.
 * Release Enablement <link to Feature Enablement Presentation>
 * DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
 * DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
 * DEV - Downstream build attached to advisory: <link to errata>
 * QE - Test plans in Polarion: <link or reference to Polarion>
 * QE - Automated tests merged: <link or reference to automated tests>
 * DOC - Downstream documentation merged: <link to meaningful PR> 

 

Now that upstream supports AlertmanagerConfig v1beta1 (see MON-2290 and https://github.com/prometheus-operator/prometheus-operator/pull/4709), it should be deployed by CMO.

DoD:

  • Kubernetes API exposes and supports the v1beta1 version for AlertmanagerConfig CRD (in addition to v1alpha1).
  • Users can manage AlertmanagerConfig v1beta1 objects seamlessly.
  • AlertmanagerConfig v1beta1 objects are reconciled in the generated Alertmanager configuration.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic Goal

  • The goal is to support metrics federation for user-defined monitoring via the /federate Prometheus endpoint (both from within and outside of the cluster).

Why is this important?

  • It is already possible to configure remote write for user-defined monitoring to push metrics outside of the cluster but in some cases, the network flow can only go from the outside to the cluster and not the opposite. This makes it impossible to leverage remote write.
  • It is already possible to use the /federate endpoint for the platform Prometheus (via the internal service or via the OpenShift route) so not supporting for UWM doesn't provide a consistent experience.
  • If we don't expose the /federate endpoint for the UWM Prometheus, users would have no supported way to store and query application metrics from a central location.

Scenarios

  1. As a cluster admin, I want to federate user-defined metrics using the Prometheus /federate endpoint.
  2. As a cluster admin, I want that the /federate endpoint to UWM is accessible via an OpenShift route.
  3. As a cluster admin, I want that the access to the /federate endpoint to UWM requires authentication (with bearer token only) & authorization (the required permissions should match the permissions on the /federate endpoint of the Platform Prometheus).

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Documentation - information about the recommendations and limitations/caveats of the federation approach.
  • User can federate user-defined metrics from within the cluster
  • User can federate user-defined metrics from the outside via the OpenShift route.

Dependencies (internal and external)

  1. None

Previous Work (Optional):

  1. None

Open questions:

  1. None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

DoD

  • User can federate UWM metrics from outside of the cluster via the OpenShift route.
  • E2E test added to the CMO test suite.

DoD

  • User can federate UWM metrics within the cluster from the prometheus-user-workload.openshift-user-workload-monitoring.svc:9092 service
  • The service requires authentication via bearer token and authorization (same permissions as for federating platform metrics)

Copy/paste from [_https://github.com/openshift-cs/managed-openshift/issues/60_]

Which service is this feature request for?
OpenShift Dedicated and Red Hat OpenShift Service on AWS

What are you trying to do?
Allow ROSA/OSD to integrate with AWS Managed Prometheus.

Describe the solution you'd like
Remote-write of metrics is supported in OpenShift but it does not work with AWS Managed Prometheus since AWS Managed Prometheus requires AWS SigV4 auth.

  • Note that Prometheus supports AWS SigV4 since v2.26 and OpenShift 4.9 uses v2.29.

Describe alternatives you've considered
There is the workaround to use the "AWS SigV4 Proxy" but I'd think this is not properly supported by RH.
https://mobb.ninja/docs/rosa/cluster-metrics-to-aws-prometheus/

Additional context
The customer wants to use an open and portable solution to centralize metrics storage and analysis. If they also deploy to other clouds, they don't want to have to re-configure. Since most clouds offer a Prometheus service (or it's easy to self-manage Prometheus), app migration should be simplified.

Epic Goal

The cluster monitoring operator should allow OpenShift customers to configure remote write with all authentication methods supported by upstream Prometheus.

We will extend CMO's configuration API to support the following authentications with remote write:

  • Sigv4
  • Authorization
  • OAuth2

Why is this important?

Customers want to send metrics to AWS Managed Prometheus that require sigv4 authentication (see https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-secure-metric-ingestion.html#AMP-secure-auth).

Scenarios

  1. As a cluster admin, I want to forward platform/user metrics to remote write systems requiring Sigv4 authentication.
  2. As a cluster admin, I want to forward platform/user metrics to remote write systems requiring OAuth2 authentication.
  3. As a cluster admin, I want to forward platform/user metrics to remote write systems requiring custom Authorization header for authentication (e.g. API key).

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • It is possible for a cluster admin to configure any authentication method that is supported by Prometheus upstream for remote write (both platform and user-defined metrics):
    • Sigv4
    • Authorization
    • OAuth2

Dependencies (internal and external)

  • In theory none because everything is already supported by the Prometheus operator upstream. We may discover bugs in the upstream implementation though that may require upstream involvement.

Previous Work

  • After CMO started exposing the RemoteWrite specification in MON-1069, additional authentication options where added to prometheus and prometheus-operator but CMO didn't catch up on these.

Open Questions

  • None

Prometheus and Prometheus operator already support sigv4 authentication for remote write. This should be possible to configure the same in the CMO configuration:

 

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint"
        sigv4:
          accessKey:
            name: aws-credentialss
            key: access
          secretKey:
            name: aws-credentials
            key: secret

          profile: "SomeProfile"

          roleArn: "SomeRoleArn"

DoD:

  • Ability to configure sigv4 authentication for remote write in the openshift-monitoring/cluster-monitoring-config configmap
  • Ability to configure sigv4 authentication for remote write in the openshift-user-workload-monitoring/user-workload-monitoring-config configmap

Prometheus and Prometheus operator already support custom Authorization for remote write. This should be possible to configure the same in the CMO configuration:

 

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint"
        Authorization:
          type: Bearer
          credentials:
            name: credentials
            key: token

DoD:

  • Ability to configure custom Authorization for remote write in the openshift-monitoring/cluster-monitoring-config configmap
  • Ability to configure custom Authorization for remote write in the openshift-user-workload-monitoring/user-workload-monitoring-config configmap
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

As WMCO user, I want to make sure containerd logging information has been updated in documents and scripts.

Acceptance Criteria

  • update must-gather to collect containerd logs
  • Internal/Customer Documents and log collecting scripts must have containerd specific information (ex: location of logs). 

Summary (PM+lead)

Configure audit logging to capture login, logout and login failure details

Motivation (PM+lead)

TODO(PM): update this

Customer who needs login, logout and login failure details inside the openshift container platform.
I have checked for this on my test cluster but the audit logs do not contain any user name specifying login or logout details. For successful logins or logout, on CLI and openshift console as well we can see 'Login successful' or 'Invalid credentials'.

Expected results: Login, logout and login failures should be captured in audit logging.

Goals (lead)

  1. Login, logout and login failures should be captured in audit logs

Non-Goals (lead)

  1. Don't attempt to log login failures in the IdP login flow that goes beyond timeout, if it the information is not available in explicit oauth-server requests (e.g. github password login error).
  2. Logout does not involve oauth-server (but is a simple API object deletion in oauth-apiserver). Hence, the audit log discussed here won't include logout.

Deliverables

  1. Changes to oauth-server to log into /varLog/oauth-server/audit.log on the master node.
  2. Documentation

Proposal (lead)

The apiserver pods today have ´/var/log/<kube|oauth|openshift>-apiserver` mounted from the host and create audit files there using the upstream audit event format (JSON lines following https://github.com/kubernetes/apiserver/blob/92392ef22153d75b3645b0ae339f89c12767fb52/pkg/apis/audit/v1/types.go#L72). These events are apiserver specific, but as oauth authentication flow events are also requests, we can use the apiserver event format to log logins, login failures and logouts. Hence, we propose to make oauth-server to create /var/log/oauth-server/audit.log files on the master nodes using that format.

When the login flow does not finish within a certain time (e.g. 10min), we can artificially create an event to show a login failure in the audit logs.

User Stories (PM)

Dependencies (internal and external, lead)

Previous Work (lead)

Open questions (lead)

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

🏆 What

Let the Cluster Authentication Operator deliver the policy to OAuthServer.

💖 Why

In order to know if authn events should be logged, OAuthServer needs to be aware of it.

🗒 Notes

Create an observer to deliver the audit policy to the oauth server

Make the authentication-operator react to the new audit field in the oauth.config/cluster object. Write an observer watching this field, such an observer will translate the top-level configuration into oauth-server config and add it to the rest of the observed config.

* Stanislav Láznička

OCP/Telco Definition of Done
Feature Template descriptions and documentation.

Feature Overview.

Early customer feedback is that they see SNO as a great solution covering smaller footprint deployment, but are wondering what is the evolution story OpenShift is going to provide where more capacity or high availability are needed in the future.

While migration tooling (moving workload/config to new cluster) could be a mid-term solution, customer desire is not to include extra hardware to be involved in this process.

 For Telecommunications Providers, at the Far Edge they intend to start small and then grow. Many of these operators will start with a SNO-based DU deployment as an initial investment, but as DUs evolve, different segments of the radio spectrum are added, various radio hardware is provisioned and features delivered to the Far Edge, the Telecommunication Providers desire the ability for their Far Edge deployments to scale up from 1 node to 2 nodes to n nodes. On the opposite side of the spectrum from SNO is MMIMO where there is a robust cluster and workloads use HPA.

Goals

  • Provide the capability to expand a single replica control plane topology to host more workloads capacity - add worker
  • Provide the capability to expand a single replica control plane to be a highly available control plane
  • To satisfy MMIMO Telecommunications providers will want the ability to scale a SNO to a multi-node cluster that can support HPA.
  • Telecommunications providers do not want workload (DU specifically) downtime when migrating from SNO to a multi-node cluster.
  • Telecommunications providers wish to be able to scale from one to two or more nodes to support a variety of radio hardware.
  • Support CP scaling (CP HA) for 2 node cluster, 3 node cluster and n node cluster. As the number of nodes in the cluster increases so does the failure domain of the cluster. The cluster is now supporting more cell sectors and therefore has more of a need for HA and resiliency including the cluster CP.

Requirements

  • TBD
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal

  • Documented and supported flow for adding 1, 2, 3 or more workers to a Single Node OpenShift (SNO) deployment without requiring cluster downtime and the understanding that this action will not make the cluster itself highly available.

Why is this important?

  • Telecommunications and Edge scenarios where HA is handled via failover to another site but single site capacity may vary or need to be expanded over time.
  • Similar scenarios exist for some ISV vendors where OpenShift is an implementation detail of how they deliver their solution on top of another platform (e.g. VMware).

Scenarios

  1. Adding a worker to a single node openshift cluster.
  2. Adding a second worker to a single node openshift cluster.
  3. Adding a third worker to a single node openshift cluster.
  4. Removing a worker node from a single node openshift cluster that has had 1 or more workers added.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Customer facing documentation of the add worker flow for SNO.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. Presumably there is a scale limit on how many workers could be added to an SNO control plane, and it is lower than the limit for a "normal" 3 node control plane. It is not anticipated that this limit will be established in this epic. Intent is to focus on small scale sites where adding 1-3 worker nodes would be beneficial.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Rebase OpenShift components to k8s v1.24

Why is this important?

  • Rebasing ensures components work with the upcoming release of Kubernetes
  • Address tech debt related to upstream deprecations and removals.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. k8s 1.24 release

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

  • As an infrastructure owner, I want a repeatable method to quickly deploy the initial OpenShift cluster.
  • As an infrastructure owner, I want to install the first (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters.

Goals

  • Enable customers and partners to successfully deploy a single “first” cluster in disconnected, on-premises settings

Requirements

4.11 MVP Requirements

  • Customers and partners needs to be able to download the installer
  • Enable customers and partners to deploy a single “first” cluster (cluster 0) using single node, compact, or highly available topologies in disconnected, on-premises settings
  • Installer must support advanced network settings such as static IP assignments, VLANs and NIC bonding for on-premises metal use cases, as well as DHCP and PXE provisioning environments.
  • Installer needs to support automation, including integration with third-party deployment tools, as well as user-driven deployments.
  • In the MVP automation has higher priority than interactive, user-driven deployments.
  • For bare metal deployments, we cannot assume that users will provide us the credentials to manage hosts via their BMCs.
  • Installer should prioritize support for platforms None, baremetal, and VMware.
  • The installer will focus on a single version of OpenShift, and a different build artifact will be produced for each different version.
  • The installer must not depend on a connected registry; however, the installer can optionally use a previously mirrored registry within the disconnected environment.

Use Cases

  • As a Telco partner engineer (Site Engineer, Specialist, Field Engineer), I want to deploy an OpenShift cluster in production with limited or no additional hardware and don’t intend to deploy more OpenShift clusters [Isolated edge experience].
  • As a Enterprise infrastructure owner, I want to manage the lifecycle of multiple clusters in 1 or more sites by first installing the first  (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters [Cluster before your cluster].
  • As a Partner, I want to package OpenShift for large scale and/or distributed topology with my own software and/or hardware solution.
  • As a large enterprise customer or Service Provider, I want to install a “HyperShift Tugboat” OpenShift cluster in order to offer a hosted OpenShift control plane at scale to my consumers (DevOps Engineers, tenants) that allows for fleet-level provisioning for low CAPEX and OPEX, much like AKS or GKE [Hypershift].
  • As a new, novice to intermediate user (Enterprise Admin/Consumer, Telco Partner integrator, RH Solution Architect), I want to quickly deploy a small OpenShift cluster for Poc/Demo/Research purposes.

Questions to answer…

  •  

Out of Scope

Out of scope use cases (that are part of the Kubeframe/factory project):

  • As a Partner (OEMs, ISVs), I want to install and pre-configure OpenShift with my hardware/software in my disconnected factory, while allowing further (minimal) reconfiguration of a subset of capabilities later at a different site by different set of users (end customer) [Embedded OpenShift].
  • As an Infrastructure Admin at an Enterprise customer with multiple remote sites, I want to pre-provision OpenShift centrally prior to shipping and activating the clusters in remote sites.

Background, and strategic fit

  • This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  1. The user has only access to the target nodes that will form the cluster and will boot them with the image presented locally via a USB stick. This scenario is common in sites with restricted access such as government infra where only users with security clearance can interact with the installation, where software is allowed to enter in the premises (in a USB, DVD, SD card, etc.) but never allowed to come back out. Users can't enter supporting devices such as laptops or phones.
  2. The user has access to the target nodes remotely to their BMCs (e.g. iDrac, iLo) and can map an image as virtual media from their computer. This scenario is common in data centers where the customer provides network access to the BMCs of the target nodes.
  3. We cannot assume that we will have access to a computer to run an installer or installer helper software.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

References

 

 

Epic Goal

  • As an OpenShift infrastructure owner, I need to be able to integrate the installation of my first on-premises OpenShift cluster with my automation flows and tools.
  • As an OpenShift infrastructure owner, I must be able to provide the CLI tool with manifests that contain the definition of the cluster I want to deploy
  • As an OpenShift Infrastructure owner, I must be able to get the validation errors in a programmatic way
  • As an OpenShift Infrastructure owner, I must be able to get the events and progress of the installation in a programmatic way
  • As an OpenShift Infrastructure owner, I must be able to retrieve the kubeconfig and OpenShift Console URL in a programmatic way

Why is this important?

  • When deploying clusters with a large number of hosts and when deploying many clusters, it is common to require to automate the installations.
  • Customers and partners usually use third party tools of their own to orchestrate the installation.
  • For Telco RAN deployments, Telco partners need to repeatably deploy multiple OpenShift clusters in parallel to multiple sites at-scale, with no human intervention.

Scenarios

  1. Monitoring flow:
    1. I generate all the manifests for the cluster,
    2. call the CLI tool pointint to the manifests path,
    3. Obtain the installation image from the nodes
    4. Use my infrastructure capabilities to boot the image on the target nodes
    5. Use the tool to connect to assisted service to get validation status and events
    6. Use the tool to retrieve credentials and URL for the deployed cluster

Acceptance Criteria

  • Backward compatibility between OCP releases with automation manifests (they can be applied to a newer version of OCP).
  • Installation progress and events can be tracked programatically
  • Validation errors can be obtained programatically
  • Kubeconfig and console URL can be obtained programatically
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

References

User Story:

As a deployer, I want to be able to:

  • Get the credentials for the cluster that is going to be deployed

so that I can achieve

  • Checking the installed cluster for installation completion
  • Connect and administer the cluster that gets installed

 

Currently the Assisted Service generates the credentials by running the ignition generation step of the oepnshift-installer. This is why the credentials are only retrievable from the REST API towards the end of the installation.

In the BILLI usage, which takes down assisted service before the installation is complete there is no obvious point at which to alert the user that they should retrieve the credentials. This means that we either need to:

  • Allow the user to pass the admin key that will then get signed by the generated CA and replace the key that is made by openshift-installer (would mean new functionality in AI)
  • Allow the key to be retrieved by SSH with the fleeting command from the node0 (after it has generated). The command should be able to wait until it is possible
  • Have the possibility to POST it somewhere

Acceptance Criteria:

  • The admin key is generated and usable to check for installation completeness

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview

The AWS-specific code added in OCPPLAN-6006 needs to become GA and with this we want to introduce a couple of Day2 improvements.
Currently the AWS tags are defined and applied at installation time only and saved in the infrastructure CRD's status field for further operator use, which in turn just add the tags during creation.

Saving in the status field means it's not included in Velero backups, which is a crucial feature for customers and Day2.
Thus the status.resourceTags field should be deprecated in favour of a newly created spec.resourceTags with the same content. The installer should only populate the spec, consumers of the infrastructure CRD must favour the spec over the status definition if both are supplied, otherwise the status should be honored and a warning shall be issued.

Being part of the spec, the behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the AWS resources should be updated accordingly.

On AWS this can be done without re-creating any resources (the behaviour is basically an upsert by tag key) and is possible without service interruption as it is a metadata operation.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

After that, we can remove the experimental flag and make this a GA feature.

Goals

  • Inclusion in the cluster backups
  • Flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

RFE-1101 described user defined tags for AWS resources provisioned by an OCP cluster. Currently user can define tags which are added to the resources during creation. These tags cannot be updated subsequently. The propagation of the tags is controlled using experimental flag. Before this feature goes GA we should define and implement a mechanism to exclude any experimental flags. Day2 operations and deletion of tags is not in the scope.

RFE-2012 aims to make the user-defined resource tags feature GA. This means that user defined tags should be updatable.

Currently the user-defined tags during install are passed directly as parameters of the Machine and Machineset resources for the master and worker. As a result these tags cannot be updated by consulting the Infrastructure resource of the cluster where the user defined tags are written.

The MCO should be changed such that during provisioning the MCO looks up the values of the tags in the Infrastructure resource and adds the tags during creation of the EC2 resources. The MCO should also watch the infrastructure resource for changes and when the resource tags are updated it should update the tags on the EC2 instances without restarts.

Acceptance Criteria:

  • e2e test where the ResourceTags are updated and then the test verifies that the tags on the ec2 instances are updated without restarts. now moved to CFE-179

Feature Overview  

Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in well-defined way to support this flow.

Goals:

Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for increased security posture. Enable OLM-managed operators to implement support for this in well-defined pattern.

Requirements:

  • CCO gets a new mode in which it can reconcile STS credential request for OLM-managed operators
  • A standardized flow is leveraged to guide users in discovering and preparing their AWS IAM policies and roles with permissions that are required for OLM-managed operators 
  • A standardized flow is defined in which users can configure OLM-managed operators to leverage AWS STS
  • An example operator is used to demonstrate the end2end functionality
  • Clear instructions and documentation for operator development teams to implement the required interaction with the CloudCredentialOperator to support this flow

Use Cases:

See Operators & STS slide deck.

 

Out of Scope:

  • handling OLM-managed operator updates in which AWS IAM permission requirements might change from one version to another (which requires user awareness and intervention)

 

Background:

The CloudCredentialsOperator already provides a powerful API for OpenShift's cluster core operator to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today is cumbersome to none-existent based on the operator in question and seen as an adoption blocker of OpenShift on AWS.

 

Customer Considerations

This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.

Documentation Considerations

  • Internal documentation needs to exists to guide Red Hat operator developer teams on the requirements and proposed implementation of integration with CCO and the proposed flow
  • External documentation needs to exist to guide users on:
    • how to become aware that the cluster is in STS mode
    • how to become aware of operators that support STS and the proposed CCO flow
    • how to become aware of the IAM permissions requirements of these operators
    • how to configure an operator in the proposed flow to interact with CCO

Interoperability Considerations

  • this needs to work with ROSA
  • this needs to work with self-managed OCP on AWS

Market Problem

This Section: High-Level description of the Market Problem ie: Executive Summary

  • As a customer of OpenShift layered products, I need to be able to fluidly, reliably and consistently install and use OpenShift layered product Kubernetes Operators into my ROSA STS clusters, while keeping a STS workflow throughout.
  •  
  • As a customer of OpenShift on the big cloud providers, overall I expect OpenShift as a platform to function equally well with tokenized cloud auth as it does with "mint-mode" IAM credentials. I expect the same from the Kubernetes Operators under the Red Hat brand (that need to reach cloud APIs) in that tokenized workflows are equally integrated and workable as with "mint-mode" IAM credentials.
  •  
  • As the managed services, including Hypershift teams, offering a downstream opinionated, supported and managed lifecycle of OpenShift (in the forms of ROSA, ARO, OSD on GCP, Hypershift, etc), the OpenShift platform should have as close as possible, native integration with core platform operators when clusters use tokenized cloud auth, driving the use of layered products.
  • .
  • As the Hypershift team, where the only credential mode for clusters/customers is STS (on AWS) , the Red Hat branded Operators that must reach the AWS API, should be enabled to work with STS credentials in a consistent, and automated fashion that allows customer to use those operators as easily as possible, driving the use of layered products.

Why it Matters

  • Adding consistent, automated layered product integrations to OpenShift would provide great added value to OpenShift as a platform, and its downstream offerings in Managed Cloud Services and related offerings.
  • Enabling Kuberenetes Operators (at first, Red Hat ones) on OpenShift for the "big3" cloud providers is a key differentiation and security requirement that our customers have been and continue to demand.
  • HyperShift is an STS-only architecture, which means that if our layered offerings via Operators cannot easily work with STS, then it would be blocking us from our broad product adoption goals.

Illustrative User Stories or Scenarios

  1. Main success scenario - high-level user story
    1. customer creates a ROSA STS or Hypershift cluster (AWS)
    2. customer wants basic (table-stakes) features such as AWS EFS or RHODS or Logging
    3. customer sees necessary tasks for preparing for the operator in OperatorHub from their cluster
    4. customer prepares AWS IAM/STS roles/policies in anticipation of the Operator they want, using what they get from OperatorHub
    5. customer's provides a very minimal set of parameters (AWS ARN of role(s) with policy) to the Operator's OperatorHub page
    6. The cluster can automatically setup the Operator, using the provided tokenized credentials and the Operator functions as expected
    7. Cluster and Operator upgrades are taken into account and automated
    8. The above steps 1-7 should apply similarly for Google Cloud and Microsoft Azure Cloud, with their respective token-based workload identity systems.
  2. Alternate flow/scenarios - high-level user stories
    1. The same as above, but the ROSA CLI would assist with AWS role/policy management
    2. The same as above, but the oc CLI would assist with cloud role/policy management (per respective cloud provider for the cluster)
  3. ...

Expected Outcomes

This Section: Articulates and defines the value proposition from a users point of view

  • See SDE-1868 as an example of what is needed, including design proposed, for current-day ROSA STS and by extension Hypershift.
  • Further research is required to accomodate the AWS STS equivalent systems of GCP and Azure
  • Order of priority at this time is
    • 1. AWS STS for ROSA and ROSA via HyperShift
    • 2. Microsoft Azure for ARO
    • 3. Google Cloud for OpenShift Dedicated on GCP

Effect

This Section: Effect is the expected outcome within the market. There are two dimensions of outcomes; growth or retention. This represents part of the “why” statement for a feature.

  • Growth is the acquisition of net new usage of the platform. This can be new workloads not previously able to be supported, new markets not previously considered, or new end users not previously served.
  • Retention is maintaining and expanding existing use of the platform. This can be more effective use of tools, competitive pressures, and ease of use improvements.
  • Both of growth and retention are the effect of this effort.
    • Customers have strict requirements around using only token-based cloud credential systems for workloads in their cloud accounts, which include OpenShift clusters in all forms.
      • We gain new customers from both those that have waited for token-based auth/auth from OpenShift and from those that are new to OpenShift, with strict requirements around cloud account access
      • We retain customers that are going thru both cloud-native and hybrid-cloud journeys that all inevitably see security requirements driving them towards token-based auth/auth.
      •  

References

As an engineer I want the capability to implement CI test cases that run at different intervals, be it daily, weekly so as to ensure downstream operators that are dependent on certain capabilities are not negatively impacted if changes in systems CCO interacts with change behavior.

Acceptance Criteria:

Create a stubbed out e2e test path in CCO and matching e2e calling code in release such that there exists a path to tests that verify working in an AWS STS workflow.

Feature Overview

Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.  

Goals

  1. Have an option to do upgrades in more discrete steps under admin control. Specifically, these steps are: 
    • Control plane upgrade
    • Worker nodes upgrade
    • Workload enabling upgrade (i..e. Router, other components) or infra nodes
  2. Better visibility into any errors during the upgrades and documentation of what they error means and how to recover. 
  3. An user experience around an end-2-end back-up and restore after a failed upgrade 
  4. OTA-810  - Better Documentation: 
    • Backup procedures before upgrades. 
    • More control over worker upgrades (with tagged pools between user Vs admin)
    • The kinds of pre-upgrade tests that are run, the errors that are flagged and what they mean and how to address them. 
    • Better explanation of each discrete step in upgrades, and what each CVO Operator is doing and potential errors, troubleshooting and mitigating actions.

References

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Provide a one click option to perform an upgrade which pauses all non master pools

Why is this important?

  • Customers are increasingly asking that the overall upgrade is broken up into more digestible pieces
  • This is the limit of what's possible today
    • R&D work will be done in the future to allow for further bucketing of upgrades into Control Plane, Worker Nodes, and Workload Enabling components (ie: router) That will however take much more consideration and rearchitecting

Scenarios

  1. An admin selecting their upgrade is offered two options "Upgrade Cluster" and "Upgrade Control Plane"
    1. If the admin selects Upgrade Cluster they get the pre 4.10 behavior
    2. If the admin selects Upgrade Control Plane all non master pools are paused and an upgrade is initiated
  1. A tooltip should clarify what the difference between the two are
  2. The pool progress bars should indicate pause/unpaused status, non master pools should allow for unpausing

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. While this epic doesn't specifically target upgrading from 4.N to 4.N+1 to 4.N+2 with non master pools paused it would fundamentally enable that and it would simplify the UX described in Paused Worker Pool Upgrades

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal
Improve the UX on the machine config pool page to reflect the new enhancements on the cluster settings that allows users to select the ability to update the control plane only.

Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes. 

Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc.  It is all at a pool level, they will not be able to choose specific hosts.

Requirements

  1. Changes to the table:
    1. Remove "Updated, updating and paused" columns. We could also consider adding column management to this table and hide those columns by default.
    2. Add "Update status" as a column, and surface the same status on cluster settings. Not true or false values but instead updating, paused, and up to date.
    3. Surface the update action in the table row.
  2. Add an inline alert that lets users know there is a 60 day window to update all worker pools. In the alert, include the sentiment that worker pools can remain paused as long as is normally safe, which means until certificate rotation becomes critical which is at about 60 days. The admin would be advised to unpause them in order to complete the full upgrade. If the MCPs are paused, the certification rotation does not happen, which causes the cluster to become degraded and causes failure in multiple 'oc' commands, including but not limited to 'oc debug', 'oc logs', 'oc exec' and 'oc attach'. (Are we missing anything else here?) Add the same alert logic to this page as the cluster settings:
    1. From day 60 to day 10 use the default inline alert.
    2. From day 10 to day 3 use the warning inline alert.
    3. From day 3 to 0 use the critical alert and continue to persist until resolved.

Design deliverables: 

Goal
Add the ability to choose between a full cluster upgrade (which exists today) or control plane upgrade (which will pause all worker pools) in the console.

Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes. 

Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc.  It is all at a pool level, they will not be able to choose specific hosts.

Requirements

  1. Changes to the Update modal:
    1. Add the ability to choose between a cluster upgrade and a control plane upgrade (the design does not default to a selection but rather disables the update button to force the user to make a conscious decision)
    2. link out to documentation to learn more about update strategies
  2. Changes to the in progress check list:
    1. Add a status above the worker pool section to let users know that all worker pools are paused and an action to resume all updates
    2. Add a "resume update" button for each worker pool entry
  3. Changes to the update status:
    1. When all master pools are updated successfully, change the status from what we have today "Up to date" to something like "Control plane up to date - all worker pools paused"
  4. Add an inline alert that lets users know there is a 60 day window to update all worker pools. In the alert, include the sentiment that worker pools can remain paused as long as is normally safe, which means until certificate rotation becomes critical which is at about 60 days. The admin would be advised to unpause them in order to complete the full upgrade. If the MCPs are paused, the certification rotation does not happen, which causes the cluster to become degraded and causes failure in multiple 'oc' commands, including but not limited to 'oc debug', 'oc logs', 'oc exec' and 'oc attach'. (Are we missing anything else here?) Inline alert logic:
    1. From day 60 to day 10 use the default alert.
    2. From day 10 to day 3 use the warning alert.
    3. From day 3 to 0 use the critical alert and continue to persist until resolved.

Design deliverables: 

OCP/Telco Definition of Done
Feature Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Feature --->
<--- Remove the descriptive text as appropriate --->

Feature Overview

  • As RH OpenShift Product Owners, we want to enable new providers/platforms/service with varying levels of capabilities and integration with minimal reliance on OpenShift Engineering.
  • As a new provider/platform partner, I want to enable my solution (hardware and/or software) with OpenShift with minimal effort.

 

Problem

  • It is currently challenging for us to enable new platforms / providers without taking the heavy burden on doing the platform specific development ourselves.

Goals

  • We want to enable the long-tail new platforms/providers to expand our reach into new markets and/or support new use cases.
  • We want to remove strict dependencies we have on Engineering teams to review, support and test new providers.
  • We want to lower the effort required for onboarding new platforms/providers.
  • We want to enable new platform/providers to self-certify.
  • We want to define tiered model for provider/platform integration that delineates ownership and responsibilities throughout new provider/platform development lifecycle and support model.
  • We want to reduce time to onboard new provider/platform – ideally to a single release.
  • We want to maintain consistent customer experience across all providers/platforms.

Requirements

  • Step-by-step guide on how to add a new platform/provider for each tier
  • Certification tool for partner to self-certify
  • Certification tool results for (at least) each Y/minor release submitted by partner to Red Hat for acknowledgement
  • DCI program to enable partners to run CI with OpenShift on their platform
  • Well documented, accessible, and up-to-date test suites for providing the test coverage of the partner
  • CI includes upgrade testing of OpenShift with partner's components
  • Partner component upgrade failure should not block OpenShift upgrade
  • Partner code is available in repositories in the openshift org on github with an open source license compatible with OpenShift

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

References

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Running the OPCT with the latest version (v0.1.0) on OCP 4.11.0, the openshift-tests is reporting an incorrect counter for the "total" field.

In the example below, after the 1127th test, the total follows the same counter of executed. I also would assume that the total is incorrect before that point as the test continues the execution increases both counters.

 

openshift-tests output format: [failed/executed/total]

started: (0/1126/1127) "[sig-storage] PersistentVolumes-expansion  loopback local block volume should support online expansion on node [Suite:openshift/conformance/parallel] [Suite:k8s]"

passed: (38s) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s]"

started: (0/1127/1127) "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]"

passed: (6.6s) 2022-08-09T17:12:21 "[sig-storage] Downward API volume should provide container's memory request [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"

started: (0/1128/1128) "[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (immediate binding)] topology should fail to schedule a pod which has topologies that conflict with AllowedTopologies [Suite:openshift/conformance/parallel] [Suite:k8s]"

skip [k8s.io/kubernetes@v1.24.0/test/e2e/storage/framework/testsuite.go:116]: Driver local doesn't support GenericEphemeralVolume -- skipping
Ginkgo exit error 3: exit with code 3

skipped: (400ms) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]"

started: (0/1129/1129) "[sig-storage] In-tree Volumes [Driver: emptydir] [Testpattern: Dynamic PV (default fs)] capacity provides storage capacity information [Suite:openshift/conformance/parallel] [Suite:k8s]" 

 

OPCT output format [executed/total (failed failures)]

Tue, 09 Aug 2022 14:12:13 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1112/1127 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...                     
Tue, 09 Aug 2022 14:12:23 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1120/1127 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...                     
Tue, 09 Aug 2022 14:12:33 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1139/1139 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...                     
Tue, 09 Aug 2022 14:12:43 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1185/1185 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...                     
Tue, 09 Aug 2022 14:12:53 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1188/1188 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...      

 

 

 

 

OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Feature Overview

  • Connect OpenShift workloads to Google services with Google Workload Identity

Goals

  • Customers want to be able to manage and operate OpenShift on Google Cloud Platform with workload identity, much like they do with AWS + STS or Azure + workload identity.
  • Customers want to be able to manage and operate operators and customer workloads on top of OCP on GCP with workload identity.

Requirements

  • Add support to CCO for the Installation and Upgrade using both UPI and IPI methods with GCP workload identity.
  • Support install and upgrades for connected and disconnected/restriction environments.
  • Support the use of Operators with GCP workload identity with minimal friction.
  • Support for HyperShift and non-HyperShift clusters.
  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

Epic Goal

  • Complete the implementation for GCP  workload identity, including support and documentation.

Why is this important?

  • Many customers want to follow best security practices for handling credentials.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We need to ensure following things in the openshift operators

1)  Make sure to operator uses v0.0.0-20210218202405-ba52d332ba99 or later version of the golang.org/x/oauth2 module

2) Mount the oidc token in the operator pod, this needs to go in the deployment. We have done it for cluster-image-registry-operator here

3) For workload identity to work, gco credentials that the operator pod uses should be of external_account type (not service_account). The external_account credentials type have path to oidc token along, url of the service account to impersonate along with other details. These type of credentials can be generated from gcp console or programmatically (supported by ccoctl). The operator pod can then consume it from a kube secret. Make appropriate code changes to the operators so that can consume these new credentials 

 

Following repos need one or more of above changes

Feature Overview

Enable sharing ConfigMap and Secret across namespaces

Requirements

Requirement Notes isMvp?
Secrets and ConfigMaps can get shared across namespaces   YES

Questions to answer…

NA

Out of Scope

NA

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement mode. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admin to copy these entitlements in each namespace which leads to additional operational challenges for updating and refreshing them. 

Documentation Considerations

Questions to be addressed:
 * What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
 * Does this feature have doc impact?
 * New Content, Updates to existing content, Release Note, or No Doc Impact
 * If unsure and no Technical Writer is available, please contact Content Strategy.
 * What concepts do customers need to understand to be successful in [action]?
 * How do we expect customers will use the feature? For what purpose(s)?
 * What reference material might a customer want/need to complete [action]?
 * Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
 * What is the doc impact (New Content, Updates to existing content, or Release Note)?

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Deliver the Projected Resources CSI driver via the OpenShift Payload

Why is this important?

  • Projected resource shares will be a core feature of OpenShift. The share and CSI driver have multiple use cases that are important to users and cluster administrators.
  • The use of projected resources will be critical to distributing Simple Content Access (SCA) certificates to workloads, such as Deployments, DaemonSets, and OpenShift Builds.

Scenarios

As a developer using OpenShift
I want to mount a Simple Content Access certificate into my build
So that I can access RHEL content within a Docker strategy build.

As a application developer or administrator
I want to share credentials across namespaces
So that I don't need to copy credentials to every workspace

Acceptance Criteria

  • OCP conformance suite must ensure that the projected resource CSI driver is installed on every OpenShift deployment.
  • OCP build suite tests that projected resource CSI driver volumes can be added to builds. Only if builds support inline CSI volumes.
  • Release Technical Enablement - Docs and demos on how to create a Projected Resource share and add it as a volume to workloads. A special use case for adding RHEL entitlements to builds should be included.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As an OpenShift engineer
I want to know which clusters are using the Shared Resource CSI Driver
So that I can be proactive in supporting customers who are using this tech preview feature

Acceptance Criteria

  • Key metrics for the shared resource CSI driver are exported to Telemeter via the cluster monitoring operator.

Docs Impact

None - metrics exported to telemetry are not formally documented.

QE Impact

QE can verify that the query/recording rule for cluster monitoring operator returns data if the cluster has the Shared Resource CSI driver installed and utilizes a SharedSecret or SharedConfigMap in a pod/workload.

PX Impact

Insights rules can potentially be created off of these exported metrics. This would allow CEE to identify which clusters are using SharedSecrets or SharedConfigMaps, especially if we are exporting mount failure metrics.

Notes

To implement, a prometheus query/recording rule needs to be added to the cluster monitoring operator. Once approved by the monitoring team, the metric data will be available on DataHub once 4.10 clusters are installed with the updated version of the monitoring operator.

User Story

As a cluster admin
I want the cluster storage operator to install the shared resources CSI driver
So that I can test the shared resources CSI driver on my cluster

Acceptance Criteria

  • Cluster storage operator uses image references to resolve the csi-driver-shared-resource-operator and all images needed to deploy the csi driver.
  • Shared resources CSI driver is installed when the cluster enables the CSIDriverSharedResources feature gate, OR
  • Shared resource CSI driver is installed when the cluster enables the TechPreviewNoUpgrade feature set
  • CI ensures that if the TechPreviewNoUpgrade feature set is enabled on the cluster, the shared resource CSI driver is deployed and functions correctly.

Docs Impact

Docs will need to identify how to install the shared resources CSI driver (by enabling the tech preview feature set)

Notes

Tasks:

  • Add the Share APIs (SharedSecret, SharedConfigMap) to openshift/api
  • Generate clients in openshift/client-go for Share APIs
  • Update the CSI driver name used in the enum for the ClusterCSIDriver custom resource.
  • Generate custom resource definitions and include it in the deployment YAMLs for the shared resource operator
  • Add YAML deployment manifests for the shared resource operator to the cluster storage operator (include necessary RBAC)
  • Ensure cluster storage operator has permission to create custom resource definitions
  • Enhance the cluster storage operator to install the shared resource CSI driver only when the cluster enables the CSIDriverSharedResources feature gate

Note that to be able to test all of this on any cloud provider, we need STOR-616 to be implemented. We can work around this by making the CSI driver installable on AWS or GCP for testing purposes.

The cluster storage operator has cluster-admin permissions. However, no other CSI driver managed by the operator includes a CRD for its API.

See https://issues.redhat.com/browse/BUILD-159?focusedCommentId=16360509&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16360509

Feature Overview (aka. Goal Summary)  

Upstream Kuberenetes is following other SIGs by moving it's intree cloud providers to an out of tree plugin format, Cloud Controller Manager, at some point in a future Kubernetes release. OpenShift needs to be ready to action this change  

Goals (aka. expected user outcomes)

Bring together all the cloud controller managers (AWS, GCP, Azure), complete testing and prepare for final GA

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Prepare the Cluster Cloud Controller Manager Operator (CCCMO) component, introduced in 4.9 for GA

Why is this important?

  • We must ensure that the component is stable before we can declare the product GA

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Initial work was started there: https://github.com/lobziik/cluster-cloud-controller-manager-operator/pull/1/files

Need to isolate provider specific code in respective packages and introduce interface to leverage it (regular and bootstrap manifests rendering should be there atm)

DoD:

  • Introduce templating logic to replace existing substitution mixture
  • Isolate templating logic so that this is transparent to the core of the CCCMO
  • Improve testing of the substitution

Feature Overview

This Feature is a general "catch all" for the time being. There are a number of existing priorities from Q1 that should be aligned with existing priorities below but if not, assign to this feature as needed.

Goals

In order to get a better overall portfolio view, we'll leverage this Feature to gather work that doesn't fall into other existing priorities on this board. As this list grows, the portfolio priority grooming team will look to split out or handle appropriately.

Requirements

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

 

requirement                                                                        Notes                                                              isMvp
     
     
     
     

 

 

(Optional) Use Cases

< How will the user interact with this feature? >

< Which users will use this and when will they use it? >

< Is this feature used as part of current user interface? >

Out of Scope

 

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

<What does success look like?>

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact?>

 <If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Questions

Question Outcome
   

 

Problem:

Console provides support UI for operators which is dynamically enabled when the operator is installed; by using feature flags against presence of CRDs. While operators have their own release cadence separately from OpenShift which makes for alignment of UI to API difficult. As new features are released for the operator, the UI becomes out of sync with APIs and customers must wait till the following OpenShift release to get any new UI.

Goal:

  • Create an extensibility mechanism which allows Red Hat operators to build and package their own UI that extends the console.
  • Make console extensible in areas required to support the needs of contributing plugins.

Why is it important?

  • Allows an operator to maintain their own UI and release at their own cadence.
  • Alleviates the pressure on console to deliver UI features for multiple operators within a release.

Use cases:

  1. Serverless / Pipelines / Helm to contribute resource details pages, import flows, topology visuals etc...

Acceptance criteria:

  1. Red Hat Operator can build their own UI which is deployed alongside the operator and extend the dev-console
    1. objective is to get to a point where it is possible to accomplish this however code will not be moved to a separate repository, nor deployed by an operator
  2. New extensions for console to allow operators to extend the various areas of console needed in order to provide the proper user experience.
  3. Enable operators to override the static built in support, and supply their own UI

Dependencies (External/Internal):

Design Artifacts:

Console extensions:
https://docs.google.com/document/d/1HW5_cl6cOX5P14PQN-1_8c60o9dMY6HbFDRftH6aTno/edit

Dynamic Plugins:
https://docs.google.com/document/d/19BAFo_8BtMZVvKsU-bE61bZpSydeYONkCMWntMU9NgE/edit

Enhancement proposal:
https://github.com/openshift/enhancements/pull/441

Exploration:

Note:

  • plugin framework covered by another epic
  • out of scope:
    • moving plugins to separate git repository

Description

As a developer, I want to be able to contribute a dynamic plugin extension and override the same extension contributed by static plugin.

Acceptance Criteria

  1. Should replace static plugin contribution of same name by dynamic plugin contribution

Additional Details:

Goal

Increase integration of Shipwright, Tekton, Argo CD in OpenShift GitOps with OpenShift platform and related products such as ACM.

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

Feature Overview

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.

Goals

  • Feature enhancements (performance, scale, configuration, UX, ...)
  • Modernization (incorporation and productization of new technologies)

Requirements

  • Core Networking Stability
  • Core Networking Performance and Scale
  • Core Neworking Extensibility (Multus CNIs)
  • Core Networking UX (Observability)
  • Core Networking Security and Compliance

In Scope

  • Network Edge (ingress, DNS, LB)
  • SDN (CNI plugins, openshift-sdn, OVN, network policy, egressIP, egress Router, ...)
  • Networking Observability

Out of Scope

There are definitely grey areas, but in general:

  • CNV
  • Service Mesh
  • CNF

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story: As a customer in a highly regulated environment, I need the ability to secure DNS traffic when forwarding requests to upstream resolvers so that I can ensure additional DNS traffic and data privacy.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Create a PR in openshift/cluster-ingress-operator to implement configurable router probe timeouts.

The PR should include the following:

  • Changes to the ingress operator's ingress controller to allow the user to configure the readiness and liveness probe's timeoutSeconds values.
  • Changes to existing unit tests to verify that the new functionality works properly.
  • Write E2E test to verify that the new functionality works properly.

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard to diagnose problem across the stack. The alternative is to create a point-to-point network connectivity capability. this would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

In OCP 4.8 the router was changed to use the "random" balancing algorithm for non-passthrough routes by default. It was previously "leastconn".

Bug https://bugzilla.redhat.com/show_bug.cgi?id=2007581 shows that using "random" by default incurs significant memory overhead for each backend that uses it.

PR https://github.com/openshift/cluster-ingress-operator/pull/663
reverted the change and made "leastconn" the default again (OCP 4.8 onwards).

The analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2007581#c40 shows that the default haproxy behaviour is to multiply the weight (specified in the route CR) by 16 as it builds its data structures for each backend. If no weight is specified then openshift-router sets the weight to 256. If you have many, many thousands of routes then this balloons quickly and leads to a significant increase in memory usage, as highlighted by customer cases attached to BZ#2007581.

The purpose of this issue is to both explore changing the openshift-router default weight (i.e., 256) to something smaller, or indeed unset (assuming no explicit weight has been requested), and to measure the memory usage within the context of the existing perf&scale tests that we use for vetting new haproxy releases.

It may be that the low-hanging change is to not default to weight=256 for backends that only have one pod replica (i.e., if no value specified, and there is only 1 pod replica, then don't default to 256 for that single server entry).

Outcome: does changing the [default] weight value make it feasible to switch back to "random" as the default balancing algorithm for a future OCP release.

Revert router to using "random" once again in 4.11 once analysis is done on impact of weight and static memory allocation.

Per the 4.6.30 Monitoring DNS Post Mortem, we should add E2E tests to openshift/cluster-dns-operator to reduce the risk that changes to our CoreDNS configuration break DNS resolution for clients.  

To begin with, we add E2E DNS testing for 2 or 3 client libraries to establish a framework for testing DNS resolvers; the work of adding additional client libraries to this framework can be left for follow-up stories.  Two common libraries are Go's resolver and glibc's resolver.  A somewhat common library that is known to have quirks is musl libc's resolver, which uses a shorter timeout value than glibc's resolver and reportedly has issues with the EDNS0 protocol extension.  It would also make sense to test Java or other popular languages or runtimes that have their own resolvers. 

Additionally, as talked about in our DNS Issue Retro & Testing Coverage meeting on Feb 28th 2024, we also decided to add a test for testing a non-EDNS0 query for a larger than 512 byte record, as once was an issue in bug OCPBUGS-27397.   

The ultimate goal is that the test will inform us when a change to OpenShift's DNS or networking has an effect that may impact end-user applications. 

Feature Overview

Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced in to the OCP Console release timelines.

The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:

  • Extend the Console
  • Deliver UI code with their Operator
  • Work in their own git Repo
  • Deliver at their own cadence

Goals

    • Operators can deliver console plugins separate from the console image and update plugins when the operator updates.
    • The dynamic plugin API is similar to the static plugin API to ease migration.
    • Plugins can use shared console components such as list and details page components.
    • Shared components from core will be part of a well-defined plugin API.
    • Plugins can use Patternfly 4 components.
    • Cluster admins control what plugins are enabled.
    • Misbehaving plugins should not break console.
    • Existing static plugins are not affected and will continue to work as expected.

Out of Scope

    • Initially we don't plan to make this a public API. The target use is for Red Hat operators. We might reevaluate later when dynamic plugins are more mature.
    • We can't avoid breaking changes in console dependencies such as Patternfly even if we don't break the console plugin API itself. We'll need a way for plugins to declare compatibility.
    • Plugins won't be sandboxed. They will have full JavaScript access to the DOM and network. Plugins won't be enabled by default, however. A cluster admin will need to enable the plugin.
    • This proposal does not cover allowing plugins to contribute backend console endpoints.

 

Requirements

 

Requirement Notes isMvp?
 UI to enable and disable plugins    YES 
 Dynamic Plugin Framework in place    YES 
Testing Infra up and running   YES 
 Docs and read me for creating and testing Plugins    YES 
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 
 Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Currently, webpack tree shakes PatternFly and only includes the components used by console in its vendor bundle. We need to expose all of the core PatternFly components for use in dynamic plugin, which means we have to disable tree shaking for PatternFly. We should expose this as a separate bundle. This will allow browsers to cache more efficiently and only need to load the PF bundle again when we upgrade PatternFly.

Open Questions

What parts of PatternFly do we consider core?

Acceptance Criteria

  • All PatternFly core components are exposed to dynamic plugins
  • PatternFly is exposed as a separate bundle that is not part of the main vendor bundle

cc Christian Vogt Vojtech Szocs Joseph Caiani James Talton

Feature Overview

  • This Section:* High-Level description of the feature ie: Executive Summary
  • Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

 

Goals

  • This Section:* Provide high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As a user, I want the ability to run a pod in debug mode.

This should be the equivalent of running:  oc debug pod

Acceptance Criteria for MVP

  • Build off of the crash-loop back off popover from https://github.com/openshift/console/pull/7302 to include a description of what crash-loop back off is, a link to view logs, a link to view events and a link to debug (container-name) in terminal. If more than one container is crash-looping list them individually.
  • Create a debug container page that includes breadcrumbs as well as the terminal to debug. Add an informational alert at the top to make it clear that this is a temporary Pod and closing this page will delete the temporary pod.
  • Add debug in terminal as an action to the logs tool bar. Only enable the action when the crash-loop back off status occurs for the selected container. Add a tool tip to explain when the action is disabled.

Assets
Designs (WIP): https://docs.google.com/document/d/1b2n9Ox4xDNJ6AkVsQkXc5HyG8DXJIzU8tF6IsJCiowo/edit#

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Goal
Add support for PDB (Pod Disruption Budget) to the console.

Requirements:

  • Add a list, detail, and yaml view (with samples) for PDBs. In addition, update the workloads page to support PDBs as well.
  • For the PBD list page include a table with name, namespace, selector, availability, allowed disruptions and created. In addition, to the table provide the main call to action to create a PDB.
  • For the PDB details page provide a Details, YAML and Pods tab. The Pods tab will include a list pods associated with the PBD - make sure to surface the owner column.
  • When users create a PDB from the list page, take them to the YAML and provide samples to enhance the creation experience. Sample 1: Set max unavailable to 0, Sample 2: Set min unavailable to 25% (confirming samples with stakeholders). In the case that a PDB has already been applied, warn users that it is not recommended to add another. Cover use cases as well that keep users from creating poor policies - for example, setting the minimum available to zero.
  • Add the ability to add/edit/view PBDs on a workload. If we edit a PDB applied to multiple workloads, warn users that this change will affect all workloads and not only the one they are currently editing. When a PDB has been applied, add a new filed to the details page with a link to the PDB and policy.

Designs:

Samuel Padgett Colleen Hart

During master nodes upgrade when nodes are getting drained there's currently no protection from two or more operands going down. If your component is required to be available during upgrade or other voluntary disruptions, please consider deploying PDB to protect your operands.

The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.

Example:

 

Acceptance Criteria:
1. Create PDB controller in console-operator for both console and downloads pods
2. Add e2e tests for PDB in single node and multi node cluster

 

Note: We should consider to backport this to 4.10

When viewing the Installed Operators list set to 'All projects' and then selecting an operator that is available in 'All namespaces' (globally installed,) upon clicking the operator to view its details the user is taken into the details of that operator in installed namespace (project selector will switch to the install namespace.)

This can be disorienting then to look at the lists of custom resource instances and see them all blank, since the lists are showing instances only in the currently selected project (the install namespace) and not across all namespaces the operator is available in.

It is likely that making use of the new Operator resource will improve this experience (CONSOLE-2240,) though that may still be some releases away. it should be considered if it's worth a "short term" fix in the meantime.

Note: The informational alert was not implemented. It was decided that since "All namespaces" is displayed in the radio button, the alert was not needed.

Feature Overview (aka. Goal Summary)  

Customers can trust the metadata in our operators catalogs to reason about infrastructure compatibility and interoperability. Similar to OCPPLAN-7983 the requirement is that this data is present for every layered product and Red Hat-release operator and ideally also ISV operators.

Today it is hard to validate the presence of this data due to the metadata format. This features tracks introducing a new format, implementing the appropriate validation and enforcement of presence as well as defining a grace period in which both formats are acceptable.

Goals (aka. expected user outcomes)

Customers can rely on the operator metadata as the single source of truth for capability and interoperability information instead of having to look up product-specific documentation. They can use this data to filter in on-cluster and public catalog displays as well as in their pipelines or custom workflows.

Red Hat Operators are required to provide this data and we aim for near 100% coverage in our catalogs.

Absence of this data can reliably be detected and will subsequently lead to gating in the release process.

Requirements (aka. Acceptance Criteria):

  • discrete annotations per feature that can be checked for presence as well as positive and negative values (see PORTEANBLE-525)
  • support in the OCP console and RHEC to support both the new and the older metadata annotations format
  • enforcement in ISV and RHT operator release pipelines
    • first with non-fatal warnings
    • later with blocking behavior if annotations are missing
    • the presence of ALL annotations needs to be checked in all pipelines / catalogs

Questions to Answer:

  • when can we rollout the pipeline tests?
    • only when there is support for visualization in the OCP Console and catalog.redhat.com
  • should operator authors use both, old and new annotations at the same time?
    • they can, but there is no requirement to do that, once the support in console and RHEC is there, the pipelines will only check for the new annotations
  • what happens to older OCP releases that don't support the new annotations yet?
    • the only piece in OCP that is aware of the annotations is the console, and we plan to backport the changes all the way to 4.10

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

  • we first need internal documentation for RHT and ISV teams that need to implement the change
  • when RHEC and Console are ready, we will update the external documentation and and can point to that as the official source of truth

 

Interoperability Considerations

  • OCP Console will have to support the new format (see CONSOLE-3688) in parallel to the old format (as fallback) in all currently supported OCP versions

Epic Goal

  • Transparently support old and new infrastructure annotations format delivered by OLM-packaged operators

Why is this important?

  • As part of part of OCPSTRAT-288 we are looking to improve the metadata quality of Red Hat operators in OpenShift
  • via PORTENABLE-525 we are defining a new metadata format that supports the aforementioned initiative with more robust detection of individual infrastructure features via boolean data types

Scenarios

  1. A user can use the OCP console to browse through the OperatorHub catalog and filter for all the existing and new annotations defined in PORTENABLE-525
  2. A user reviewing an operator's detail can see the supported infrastructures transparently regardless if the operator uses the new or the existing annotations format

Acceptance Criteria

  • the new annotation format is supported in operatorhub filtering and operator details pages
  • the old annotation format keeps being supported in operatorhub filtering and operator details pages
  • the console will respect both the old and the new annotations format
  • when for a particular feature both the operator denotes data in both the old and new annotation format, the annotations in the newer format take precedence
  • the newer infrastructure features from PORTENABLE-525 tls-profiles and token-auth/* do not have equivalents in the old annotation format and evaluation doesn't need to fall back as described in the previous point

Dependencies (internal and external)

  1. none

Open Questions

  1. due to the non-intrusive nature of this feature, can we ship it in a 4.14.z patch release?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

Summary (PM+lead)

https://issues.redhat.com/browse/AUTH-2 revealed that, in prinicipal, Pod Security Admission is possible to integrate into OpenShift while retaining SCC functionality.

 

This epic is about the concrete steps to enable Pod Security Admission by default in OpenShift

Motivation (PM+lead)

Goals (lead)

  • Enable Pod Security Admission in "restricted" policy level by default
  • Migrate existing core workloads to comply to the "restricted" pod security policy level

Non-Goals (lead)

  • Other OpenShift workloads must be migrated by the individual responsible teams.

Deliverables

Proposal (lead)

Enhancement - https://github.com/openshift/enhancements/pull/1010

User Stories (PM)

Dependencies (internal and external, lead)

Previous Work (lead)

Open questions (lead)

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

dns-operator must comply to restricted pod security level. The current audit warning is:

{   "objectRef": "openshift-dns-operator/deployments/dns-operator",   "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unre stricted capabilities (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.runAsNonRoot=tr ue), seccompProfile (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }

ingress-operator must comply to pod security. The current audit warning is:

 

{   "objectRef": "openshift-ingress-operator/deployments/ingress-operator",   "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.run AsNonRoot=true), seccompProfile (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

As an adopter of the @openshift-console/dynamic-plugin-sdk I want to easily integrate into my development pipeline so that I can extend the OCP console.

Trying to pull in the dynamic-plugin-sdk into ACM is proving to be problematic. We would have to move to older dependencies. Integrating with webpack and typescript requires a very specific setup.

The dynamic-plugin-sdk has only really been used internally by OCP and is strongly tied to the setup and dependencies of OCP. For the dynamic-plugin-sdk to be externally consumable by adopters, it should be as easy to use as other webpack plugins such as HtmlWebpackPlugin or CompressionPlugin.

Acceptance Criteria

  • Uses up to date dependencies - not tied to specific versions OCP console uses
  • Includes it's own dependencies - does not require adopters to include those dependencies
  • The dynamic demo plugin should be updated to use newer dependencies and use the plugin without a bunch of tweaks to tsconfig paths. 

Currently

  • requires old dependencies 
    • ts-node 5.0.1 → 10.2.1

 

Update console from Cypress 6.0.0 to 8.5.0. Changes that impact us:

  • cypress run is headless by default
  • cy.intercept URL matching is more strict
  • Uncaught exception and unhandled promise rejection checks are more strict

https://docs.cypress.io/guides/references/migration-guide#Migrating-to-Cypress-8-0

The console has many instances of old variables, $grid-float-breakpoint and $grid-gutter-width, controlling margins/padding and responsive breakpoints throughout the Admin and Dev Console. These do not provide spacing and behaviors consistent with Patternfly components which use their own variables, $pf-global-gutter-md, $pf-global-gutter, and $pf-global-breakpoint-{size}. By replacing these, the intent it to bring the console closer to a pure Patternfly structure and behavior, requiring less overrides and customizations.

Epic Goal

HyperShift provisions OpenShift clusters with externally managed control-planes. It follows a slightly different process for provisioning clusters. For example, HyperShift uses cluster API as a backend and moves all the machine management bits to the management cluster.  

Why is this important?

showing machine management/cluster auto-scaling tabs in the console is likely to confuse users and cause unnecessary side effects. 

Definition of Done

  • MachineConfig and MachineConfigPool should not be present, they should be either removed or hidden when the cluster is spawned using HyperShift. 
  • Cluster Settings show say the control plane is externally managed and be read-only.
  • Cluster Settings -> Configuration resources should be read-only, maybe hide the tab
  • Some resources should go in an allowlist. Most will be hidden
  • Review getting started steps

See Design Doc: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

 

Setup / Testing

It's based on the SERVER_FLAG controlPlaneTopology being set to External is really the driving factor here; this can be done in one of two ways:

  • Locally via a Bridge Variable, export BRIDGE_CONTROL_PLANE_TOPOLOGY_MODE="External"
  • Locally / OnCluster via modifying the window.SERVER_FLAGS.controlPlaneTopology to External in the dev tools

To test work related to cluster upgrade process, use a 4.10.3 cluster set on the candidate-4.10 upgrade channel using 4.11 frontend code.

Based on Cesar's comment we should be removing the `Control Plane` section, if the infrastructure.status.controlplanetopology being "External".

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to remove the ability to “Add identity providers” under “Set up your Cluster”. In addition to the getting started card, we should remove the ability to update a cluster on the details card when applicable (anything that changes a cluster version should be read only).

Summary of changes to the overview page:

  • Remove the ability to “Add identify providers” under “Set up your Cluster”
  • Remove cluster update CTA from the details card
  • Remove update alerts from the status card

Check section 03 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend these notifications:

  • cluster upgrade notifications
  • new channel available notifications

For these we will need to check `ControlPlaneTopology`, if it's set to 'External' and also check if the user can edit cluster version(either by creating a hook or an RBAC call, eg. `canEditClusterVersion`)

 

Check section 05 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need surface a message that the control plane is externally managed and add following changes:

  • Remove update button
  • Make channel read only
  • Link out to read only CV details page
  • Remove the ability to edit upstream configuration
  • Remove the cluster autoscaler field
  • Add an alert to the page so that users know the control plane is externally managed

In general, anything that changes a cluster version should be read only.

Check section 02 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

 

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend kubeadmin notifier, from the global notifications, since it contain link for updating the cluster OAuth configuration (see attachment).

 

 

Epic Goal

Why is this important?

  • So the UX satisfies the current trands, where dark mode is becoming a standard for modern services.  

Acceptance Criteria

  • OCP admin console must be rendered in a preferred mode based on `prefers-color-scheme` media query
  • OCP admin console must be rendered in a preferred mode selected in the User Setting page
  • Create an followup epic/story for and listing and tracking changes needed in OCP console's dynamic plugins

Dependencies (internal and external)

  1. PatternFly - Dark mode PF variables

Previous Work (Optional):

  1. Mike Coker has worked on a POC from the PF point of view on both the admin and dev console, and the screenshot results are listed below along with the repo branch. Also listed is a document covering some of the common issues found when putting together the admin console POC. https://github.com/mcoker/console/tree/dark-theme
    Background POC work completed for reference:

PatternFly Dark Theme Handbookhttps://docs.google.com/document/d/1mRYEfUoOjTsSt7hiqjbeplqhfo3_rVDO0QqMj2p67pw/edit

Admin Console -> Workloads & Pods

Dev Console -> Gotcha pages: Observe Dashboard and Metrics, Add, Pipelines: builder, list, log, and run

Open questions::

  1. Who should be responsible for updating DynamicPlugins to be able to render in dark mode?

As a developer, I want to be able to scope the changes needed to enable dark mode for the admin console. As such, I need to investigate how much of the console will display dark mode using PF variables and also define a list of gotcha pages/components which will need special casing above and beyond PF variable settings.

 

Acceptance criteria:

As a developer, I want to be able to fix remaining issues from the spreadsheet of issues generated after the initial pass and spike of adding dark theme to the console.. As such, I need to make sure to either complete all remaining issues for the spreadsheet, or, create a bug or future story for any remaining issues in these two documents.

 

Acceptance criteria:

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

The Cluster Dashboard Details Card Protractor integration test was failing at high rate, and despite multiple attempts to fix, was never fully resolved, so it was disabled as a way to fix https://bugzilla.redhat.com/show_bug.cgi?id=2068594. Migrating this entire file to Cypress should give us better debugging capability, which is what was done to fix a similarly problematic project dashboard Protractor test.

This epic contains all the Dynamic Plugins related stories for OCP release-4.11 

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

  •  

We have a Timestamp component for consistent display of dates and times that we should expose through the SDK. We might also consider a hook that formats dates and times for places were you don't want or cant use the component, eg. times on a chart. 

This will become important when we add a user preference for dates so that plugins show consistent dates and times as console. If I set my user preference to UTC dates, console should show UTC dates everywhere.

 

AC:

  • Expose the Timestamp component inside the SDK. 
  • Replace the connect with useSelector hook
  • Keep the original component and proxy it to the new one in the SDK

 

 

 

cc Jakub Hadvig Sho Weimer 

Currently, enabled plugins can fail to load for a variety of reasons. For instance, plugins don't load if the plugin name in the manifest doesn't match the ConsolePlugin name or the plugin has an invalid codeRef. There is no indication in the UI that something has gone wrong. We should explore ways to report this problem in the UI to cluster admins. Depending on the nature of the issue, an admin might be able to resolve the issue or at least report a bug against the plugin.

The message about failing could appear in the notification drawer and/or console plugins tab on the operator config. We could also explore creating an alert if a plugin is failing.

 

AC:

  • Add notification into the Notification Drawer in case a Dynamic Plugin will error out during load.
  • Render these errors in the status card, notification section, as well.
  • For each failed plugin we should create a separate notification.

In the 4.11 release, a console.openshift.io/default-i18next-namespace annotation is being introduced. The annotation indicates whether the ConsolePlugin contains localization resources. If the annotation is set to "true", the localization resources from the i18n namespace named after the dynamic plugin (e.g. plugin__kubevirt), are loaded. If the annotation is set to any other value or is missing on the ConsolePlugin resource, localization resources are not loaded. 

 

In case these resources are not present in the dynamic plugin, the initial console load will be slowed down. For more info check BZ#2015654

 

AC:

  • console-operator should be checking for the new console.openshift.io/use-i18n annotation, update the console-config.yaml accordingly and redeploy the console server
  • console server should pick up the changes in the console-config.yaml and only load the i18n namespace that are available

 

Follow up of https://issues.redhat.com/browse/CONSOLE-3159

 

 

Currently, you need to navigate to

Cluster Settings ->
Global configuration ->
Console (operator) config ->
Console plugins

to see and managed plugins. This takes a lot of clicks and is not discoverable. We should look at surfacing plugin details where they're easier to find – perhaps on the Cluster Settings page – or at least provide a more convenient link somewhere in the UI.

AC: Add the Dynamic Plugins section to the Status Card in the overview that will contain:

  • count of active and non-active plugins
  • link to the ConsolePlugins instances page
  • status of the loaded plugins and breakout error

cc Ali Mobrem Robb Hamilton

We need to provide a base for running integration tests using the dynamic plugins. The tests should initially

  • Create a deployment and service to run the dynamic demo plugin
  • Update the console operator config to enable the plugin
  • Wait for the plugin to be available
  • Test at least one extension point used by the plugin (such as adding items to the nav)
  • Disable the plugin when done

Once the basic framework is in place, we can update the demo plugin and add new integration tests when we add new extension points.

https://github.com/openshift/console/tree/master/frontend/dynamic-demo-plugin

 

https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md

 

https://github.com/openshift/console/tree/master/frontend/packages/console-plugin-sdk

Goal

  • Add the ability for users to select supported but not recommended updates.
  • Refine workflow when both "upgradeable=false" and "supported-but-not-recommended" updates occur

Background
RFE: for 4.10, Cincinnati and the cluster-version operator are adding conditional updates (a.k.a. targeted edge blocking): https://issues.redhat.com/browse/OTA-267

High-level plans in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#update-client-support-for-the-enhanced-schema

Example of what the oc adm upgrade UX will be in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#cluster-administrator.

The oc implementation landed via https://github.com/openshift/oc/pull/961.

Design

  • Use case 01: "supported but not recommended" occurs to the latest version:
    • Add an info icon next to the version on update path with a pop-over to explain about why updating to this version is supported, but not recommended and a link to known risks
    • Identify the difference in "recommended" versions, "supported but not recommended" versions, and "blocked" versions (upgradeable=false) in the + more modal.
    • The latest version is pre-selected in the dropdown in the update modal with an inline alert to inform users about supported-but-not-recommended version with link to known risks. Users can choose to update to another recommended versions, update to a supported-but-not-recommended one, or wait.
    • The "recommended" and "supported but not recommended" updates are separated in the dropdown.
    • If a user selects a "recommended" update, the inline alert disappears.
  • Use case 02: When both "upgradeable=false" and "supported but not recommended" occur:
    • Add an alert banner to explain why users shouldn’t update to the latest version and link to how to resolve on the cluster settings details page. Users have the options to resolve the issue, update to a patch version, or wait.
    • If users open the update modal without resolving the "upgradeable=false" issue, the next recommended version is pre-selected. An expandable link "View blocked versions (#)" is included under the dropdown to show "upgradeable=false" versions with resolve link.
    • If users resolve the "upgradeable=false" issue, the cluster settings page will change to use case 01
    • Question: Priority on changing the upgradeable=false alert banner in update modal and blocked versions in dropdown

See design doc: https://docs.google.com/document/d/1Nja4whdsI5dKmQNS_rXyN8IGtRXDJ8gXuU_eSxBLMIY/edit#

See marvel: https://marvelapp.com/prototype/h3ehaa4/screen/86077932

The "Update Version" modal on the cluster settings page should be updated to give users information about recommended, not recommended, and blocked update versions.

  • When the modal is opened, the latest recommended update version should be pre-selected in the version dropdown.
  • Blocked versions should no longer be displayed in the version dropdown, and should instead be displayed in a collapsible field below the dropdown.
  • When blocked versions are present, a link should be provided to the cluster operator tab. The version dropdown itself should have two labeled sections: "Recommended" and "Supported but not recommended".
  • When the user selects a "Supported but not recommended" item from the version dropdown, an inline info alert should appear below the version selection field and should provide a link to known risks associated with the selected version. This is an external link provided through the ClusterVersion API.

Update the cluster settings page to inform the user when the latest available update is supported but not recommended. Add an informational popover to the latest version in  update path visualization.

Epic Goal

  • Add telemetry so that we know how image stream features are used.

Why is this important?

  • We have a long standing epic to create image streams v2. We need to better understand how image streams are used today.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Improve CI testing of the image registry components.

Why is this important?

  • The image registry, image API and the image pruner had a lot of tests removed during transition 4.0. This may make the platform less stable and/or slow down the team.

Scenarios

  1. ...

Acceptance Criteria

  • CI - tests should be more stable and have broader coverage

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.

In the image-registry, we have packages origin-common and kubernetes-common. The problem is that this code doesn't get updates. We can replace them with more supported library-go.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As a OpenShift engineer
I want image-registry to use the latest k8s libraries
so that image-registry can benefit from new upstream features.

Acceptance criteria

  • image-registry uses k8s.io/api v1.23.z
  • image-registry uses latest openshift/api, openshift/library-go, openshift/client-go

Epic Goal

  • Make the image registry distributed across availability zones.

Why is this important?

  • The registry should be highly available and zone failsafe.

Scenarios

  1. As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. Pod's topologySpreadConstraints

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: https://github.com/openshift/cluster-image-registry-operator/pull/730
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Story: As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.

Background: The image registry currently uses affinity/anti-affinity rules to spread registry pods across different hosts. However this might cause situations in which all pods end up on hosts of a single zone, leading to a long recovery time of the registry if that zone is lost entirely. However due to problems in the past with the preferred setting of anti-affinity rule adherence the configuration was forced instead with required and the rules became constraints. With zones as constraints the internal registry would not have deployed anymore in environments with a single zone, e.g. internal CI environment. Pod topology constraints is a new API that is supported in OCP which can also relax constraints in case they cannot be satisfied. Details here: https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html

Acceptance criteria:

  • by default the internal registry is deployed with at least two replica
  • by default the topology constraints should be on a zone-basis, so that by defaults one registry pod is scheduled in each zone
  • when constraints can't be satisfied the registry should deploy anyway
  • we should not do this in SNO environments
  • the registry should still work on SNO environments

Open Questions:

  • what happens in environments where the storage is zone dependent?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As an OpenShift administrator
I want to provide the registry operator with a custom certificate authority for S3 storage
so that I can use a third-party S3 storage provider.

Acceptance criteria

  1. Users can specify a configmap name (from openshift-config) in config.imageregistry/cluster's spec.storage.s3.
  2. The operator uses CA from this configmap to check S3 bucket.
  3. The image registry pod uses CA from this configmap to access the S3 bucket.
  4. When a custom CA is defined, the operator/image-registry should still trust certificate authorities that are used by Amazon S3 and other well-known CAs.
  5. An end-to-end test that runs minio and checks the image registry becomes healthy with it.

Goal

Remove Jenkins from the OCP Payload.

Problem

  • Jenkins images are "non-trival in size, impact experience around OCP payloads
  • Security advisories cannot be handled once, but against all actively supported OCP releases, adding to response time for handling said advisories
  • Some customers may now want to upgrade Jenkins as OCP upgrades (making this configurable is more ideal)

Why is this important

  • This is an engineering motivated item to reduce costs so we have more cycles for strategic work
  • Aside from the team itself, top level OCP architects want this to reduce the image size, improve general OCP upgrade experience
  • Sends a mix message with respect to what is startegic CI/CI when Jenkins is baked into OCP, but Tekton/Pipelines is an add-on, day 2 install sort of thing

Dependencies (internal and external)

See epic linking - need alternative non payload image available to provide relatively seamless migration

 

Also, the EP for this is approved and merged at https://github.com/openshift/enhancements/blob/master/enhancements/builds/remove-jenkins-payload.md

Estimate (xs, s, m, l, xl, xxl):

Questions:

       PARTIAL ANSWER ^^:  confirmed with Ben Parees in https://coreos.slack.com/archives/C014MHHKUSF/p1646683621293839 that EP merging is currently sufficient OCP "technical leadership" approval.

 

Previous work

 

Customers

assuming none

User Stories

 

As maintainers of the OpenShift jenkins component, we need run Jenkins CI for PR testing against openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin, openshift/jenkins-openshift-login-plugin, using images built in the CI pipeline but not injected into CI test clusters via sample operator overriding the jenkins sample imagestream with the jenkins payload image.

 

As maintainers of the OpenShift Jenkins component, we need Jenkins periodics for the client and sync plugins to run against the latest non payload, CPaas image, promoted to CI's image locations on quay.io, for the current release in development.

 

As maintainers of the OpenShift Jenkins component, we need Jenkins related tests outside of very basic Jenkins Pipieline Strategy Build Config verification, removed from openshift-tests in OpenShift Origin, using a non-payload, CPaas image pertinent to the branch in question.

Acceptance criteria

  • all PR CI Tests do not utilize samples operator manipulation of the jenkins imagestream with the in payload image, but rather images including the PRs changes
  • all periodic CI Tests do not utilize samples operator manipulation of the jenkins imagestream with the in payload image, but rather CI promoted images for the current release pushed to quay.io

High Level, we ideally want to vet the new CPaas image via CI and periodics BEFORE we start changing the samples operator so that it does not manipulate the jenkins imagestream (our tests will override the samples operator override)

QE Impact

NONE ... QE should wait until JNKS-254

Docs Impact

NONE

PX Impact

 

NONE

Launch Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

Notes

  • Our CSI shared resource experience will help us here
  • but the old IMAGE_FORMAT stuff is deprecated, and does not work well with step registry stuff
  • instead, we need to use https://docs.ci.openshift.org/docs/architecture/ci-operator/#dependency-overrides
  • Makefile level logic will use `oc tag` to update the jenkins imagestream created as part of samples to override the use of the in payload image with the image build by the PR, or for periodics, with what has been promoted to quay.io
  • Ultimately, CI step registry for capturing the `oc tag` update the imagestream logic is the probably end goal
  • JNKS-268 might change how we do periodics, but the current thought is to get existing periodics working with the CPaas image first

Possible staging

1) before CPaas is available, we can validate images generated by PRs to openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin by taking the image built by the image (where the info needed to get the right image from the CI registry is in the IMAGE_FORMAT env var) and then doing an `oc tag --source=docker <PR image ref> openshift/jenkins:2` to replace the use of the payload image in the jenkins imagestream in the openshift namespace with the PRs image

2) insert 1) in https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/sync-plugin/e2e/jenkins-sync-plugin-e2e-commands.sh and https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/client-plugin/tests/jenkins-client-plugin-tests-commands.sh where you test for IMAGE_FORMAT being set

3) or instead of 2) you update the Makefiles for the plugins to call a script that does the same sort of thing, see what is in IMAGE_FORMAT, and if it has something, do the `oc tag`

 

https://github.com/openshift/release/pull/26979 is a prototype of how to stick the image built from a PR and conceivably the periodics to get the image built from it and tag it into the jenkins imagestream in the openshift namespace in the test cluster

 

Epic Goal

  • Provide a dedicated dashboard for NVIDIA GPU usage visualization in the OpenShift Console.

Why is this important?

  • Customers that use GPUs in their clusters usually have the GPU workloads as the main purpose of their cluster. As such, it makes much more sense to have the details about the usage they are doing of GPGPU resources AND CPU/RAM rather than just CPU/RAM

Scenarios

  1. As an admin of a cluster dedicated to data science, I want to quickly find out how much of my very costly resources are currently in use and if things are getting queued due to lack of resources

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. The NVIDIA GPU Operator must export to prometheus the relevant data

Open questions::

  1. Will NVIDIA agree to these extra data exports in their GPU Operator?

I asked Zvonko Kaiser and he seemed open to it. I need to confirm with Shiva Merla

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Remove this UI from our stack that we cannot support.

Why is this important?

  • Reduce support burden.
  • Remove Bugzilla burden of addressing continuous CVEs found in this project.

Acceptance Criteria

  • All Prometheus upstream UI links are removed
  • Related documentation is updated
  • Ports/routes etc configured to expose access to this UI are removed such that no configuration we provide enables access to this UI or its codepaths.
  • There is no reason any CVEs found in this UI would ever require intervention by the Monitoring Team.

Dependencies (internal and external)

  1. Make the Prometheus Targets information available in Console UI (https://issues.redhat.com/browse/MON-1079)

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

After installing or upgrading to the latest OCP version, the existing OpenShift route to the prometheus-k8s service is updated to be a path-based route to '/api/v1'.

DoD:

  • It is not possible to access the Prometheus UI via the OpenShift route
  • Using a bearer token with sufficient permissions, it is possible to access the /api/v1/* endpoints via the OpenShift route.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • As a CFE team, we would like to enable query logging for all Prometheus read paths
  • As part of this, we would like to enable audit & query logging for Prometheus Adapter(aggregated server audit log), Prometheus(query log) and ThanosQuerier(query log)

Why is this important?

  • This would help all parties(customers, app-sres, CCX, monitoring team,..) to debug an overloaded Prometheus instance.

Scenarios

  1. When a customer faces a high cpu consumption in any of the Prometheus instance, they can enable audit logging in Prometheus Adapter to see which component is calling metrics API
  2. When a customer faces a high cpu consumption in any of the Prometheus instance, they can enable query logging in all Prometheus instances(PM & UWM) and ThanosQuerier to see which query is frequently executed
  3. https://bugzilla.redhat.com/show_bug.cgi?id=1982302

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Prometheus Adapter audit logs must be enabled by default
  • Prometheus Adapter audit logs must be preserved after each CI run

Open questions::

  1. Should we enable ThanosRuler query logs?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

After investigating a complex Bugzilla involving many applications making queries to prometheus-adapter, we've noticed that we were lacking insights on the requests made to prometheus-adapter. To have such information for an aggregated API, the best would be to have audit logs for prometheus-adapter. This wasn't configurable before, but with https://github.com/kubernetes-sigs/custom-metrics-apiserver/pull/92, upstream users should now be able to configure it.

Since this would greatly help in investigating prometheus-adapter Bugzilla in the future, it would be great if we allowed OpenShift users to configure the audit logs so that they could provide them to us.

Note for the assignee: as of the time of the creation of this ticket, the upstream PR hasn't been merged in custom-metrics-apiserver and thus wasn't synced in prometheus-adapter. So we will have to wait a bit before starting looking into this ticket.

 

DoD:

  • Allow OpenShift users to configure audit logs for prometheus-adapter
  • Integrate with must-gather
  • Document how to configure audit logs in the official OpenShift documentation
  • Upstream jsonnet patch that enables this feature through a configuration
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Following up on https://issues.redhat.com/browse/MON-1320, we added three new CLI flags to Prometheus to apply different limits on the samples' labels. These new flags are available starting from Prometheus v2.27.0, which will most likely be shipped in OpenShift 4.9.

The limits that we want to look into for OCP are the following ones:

# Per-scrape limit on number of labels that will be accepted for a sample. If
# more than this number of labels are present post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels name that will be accepted for a sample.
# If a label name is longer than this number post metric-relabeling, the entire
# scrape will be treated as failed. 0 means no limit.
[ label_name_length_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels value that will be accepted for a sample.
# If a label value is longer than this number post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_value_length_limit: <int> | default = 0 ]

We could benefit from them by setting relatively high values that could only induce unbound cardinality and thus reject the targets completely if they happened to breach our constrainst.

DoD:

  • Being able to configure label scrape limits for UWM

Epic Goal

When users configure CMO to interact with systems outside of an OpenShift cluster, we want to provide an easy way to add the cluster ID to the data send.

Why is this important?

Technically this can be achieved today, by adding an identifying label to the remote_write configuration for a given cluster. The operator adding the remote_write integration needs to take care that the label is unique over the managed fleet of clusters. This however adds management complexity. Any given cluster already has a pseudo-unique datum, that can be used for this purpose.

  • Starting in 4.9 we support the Prometheus remote_write feature to send metric data to a storage integration outside of the cluster similar to our own Telemetry service.
  • In Telemetry we already use the cluster ID to distinguish the various clusters.
  • For users of remote_write this could add an easy way to add such distinguishing information.

Scenarios

  1. An organisation with multiple OpenShift clusters want to store their metric data centralized in a dedicated system and use remote_write in all their clusters to send this data. When querying their centralized storage, metadata (here a label) is needed to separate the data of the various clusters.
  2. Service providers who manage multiple clusters for multiple customers via a centralized storage system need distinguishing metadata too. See https://issues.redhat.com/browse/OSD-6573 for example

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Document how to use this feature

Dependencies (internal and external)

  1. none

Previous Work (Optional):

  1. none

Open questions::

  1.  

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Implementation proposal:

 

Expose a flag in the CMO configuration, that is false by default (keeps backward compatibility) and when set to true will add the _id label to a remote_write configuration. More specifically it will be added to the top of a remote_write relabel_config list via the replace action. This will add the label as expect, but additionally a user could alter this label in a later relabel config to suit any specific requirements (say rename the label or add additional information to the value).
The location of this flag is the remote_write Spec, so this can be set for individual remote_write configurations.

We currently use a sample app to e2e test remote write in CMO.
In order to test the addition of the cluster_id relabel config, we need to confirm that the metrics send actually have the expected label.
For this test we should use Prometheus as the remote_write target. This allows us to query the metrics send via remote write and confirm they have the expected label.

Add an optional boolean flag to CMOs definition of RemoteWriteSpec that if true adds an entry in the specs WriteRelabelConfigs list.

I went with adding the relabel config to all user-supplied remote_write configurations. This path has no risk for backwards compatibility (unless users use the {}tmp_openshift_cluster_id{} label, seems unlikely) and reduces overall complexity, as well as documentation complexity.

The entry should look like what is already added to the telemetry remote write config and it should be added as the first entry in the list, before any user supplied relabel configs.

Epic Goal

  • Offer the option to double the scrape intervals for CMO controlled ServiceMonitors in single node deployments
  • Alternatively automatically double the same scrape intervals if CMO detects an SNO setup

The potential target ServiceMonitors are:

  • kubelet
  • kube-state-metrics
  • node-exporter
  • etcd
  • openshift-state-metrics

Why is this important?

  • Reduce CPU usage in SNO setups
  • Specifically doubling the scrape interval is important because:
  1. we are confident that this will have the least chance to interfere with existing rules. We typically have rate queries over the last 2 minutes (no shorter time window). With 30 second scrape intervals (the current default) this gives us 4 samples in any 2 minute window. rate needs at least 2 samples to work, we want another 2 for failure tolerance. Doubling the scrape interval will still give us 2 samples in most 2 minute windows. If a scrape fails, a few rule evaluations might fail intermittently.
  2. We expect a measureable reduction of CPU resources (see previous work)

Scenarios

  1. RAN deployments (Telco Edge) are SNO deployments. In these setups a full CMO deployment is often not needed and the default setup consumes too many resources. OpenShift as a whole has only very limited CPU cycles available and too many cycles are spend on Monitoring

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/MON-1569

Open questions:

  1. Whether doubling some scrape intervals reduces CPU usage to fit into the assigned budget

Non goals

  • Allow arbitrarily long scrape intervals. This will interfere with alert and recoring rules
  • Implement a global override to scrape intervals.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The console requires to know the network type capabilities to show/hide some Network Policy form fields.

As a result of https://issues.redhat.com/browse/NETOBSERV-27, this logic is implemented as a features document inside the console code. The console fetches the network type from the network operator and checks the supported features towards this document.

However, this limits the feature to admin users, as other logged-in users do not have permissions to fetch the network type.

This task aims to modify the current Cluster Network Operator to expose the network capabilities as an `sdn-public` Config Map, writeable only by the SDN, readable by any `system:authenticated` user.

Enhancement Proposal PR: https://github.com/openshift/enhancements/pull/875

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Run OpenShift builds that do not execute as the "root" user on the host node.

Why is this important?

  • OpenShift builds require an elevated set of capabilities to build a container image
  • Builds currently run as root to maintain adequate performance
  • Container workloads should run as non-root from the host's perspective. Containers running as root are a known security risk.
  • Builds currently run as root and require a privileged container. See BUILD-225 for removing the privileged container requirement.

Scenarios

  1. Run BuildConfigs in a multi-tenant environment
  2. Run BuildConfigs in a heightened security environment/deployment

Acceptance Criteria

  • Developers can opt into running builds in a cri-o user namespace by providing an environment variable with a specific value.
  • When the correct environment variable is provided, builds run in a cri-o user namespace, and the build pod does not require the "privileged: true" security context.
  • User namespace builds can pass basic test scenarios for the Docker and Source strategy build.
  • Steps to run unprivileged builds are documented.

Dependencies (internal and external)

  1. Buildah supports running inside a non-privileged container
  2. CRI-O allows workloads to opt into running containers in user namespaces.

Previous Work (Optional):

  1. BUILD-225 - remove privileged requirement for builds.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root from the host's perspective with elevated privileges

Acceptance Criteria

  • Developers can provide an environment variable to indicate the build should not use privileged containers
  • When the correct env var + value is specified, builds run in a user namespace (non-root on the host)

QE Impact

No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.

Docs Impact

We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.

PX Impact

This likely warrants an OpenShift blog post, potentially?

Notes

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

We want to configure 'default' and 'allowed' values in validation webhook for Guest Accelerators field in GCPProviderSpec. Also revendor it to include newly added Guest Accelerators field.

This can be done after https://github.com/openshift/cluster-api-provider-gcp/pull/172  is merged.

DoD:

  • Make sure that validations return errors on issues with GPU configuration
  • Ensure the unit tests for the webhooks are updated

Description:

Openshift on RHV is composed of the following subproject the team maintains:

Each of those projects currently uses the generated oVirt API project go-ovirt.

This leads to a number of issues:

  1. Duplicated code between the subprojects: Since the go-ovirt is a thin layer around the API then a lot of the code which interacts with oVirt is duplicated between the projects, which leads to all the classic duplication problems such as maintaining the project, lack of clear conventions, and so on.
  2. Bad error handling and unclear errors:
    1. Since the go-ovirt is a thin layer there is a lot of error handling and checking which needs to be done, since a lot of the times it looks like a certain error should be ignored, it is never checked which could lead to unexpected situations.
    2. Since the errors which are returned from the oVirt Engine are sometimes unclear, when we return those errors to the users or log them is hard to understand what is the actual issue.
  3. Lack of retries: sometimes an operation can take some time due to some condition that needs to be met, or an operation can fail due to infrastructure issues, the go-ovirt library doesn't contain any retry logic which means each client needs to implement its own retry logic which is not done at the moment and will cause more duplicated code.
  4. Poor logging: The current go-ovirt library doesn't log anything, and all the logs come from the subprojects, this leads to:
    1. Inconsistent logging between the projects.
    2. Lack of logs.
  5. Almost no test coverage:
    1. It's very hard to mock and write tests with go-ovirt since there are so many calls, but will be much easier to mock and write tests with go-ovirt-clent.
    2. go-ovirt only has rudimentary tests.

Then came go-ovirt-client, go-ovirt-client-log, go-ovirt-client-log-klog and k8sOVirtCredentialsMonitor to the rescue!

The go-ovirt-client is a wrapper around the go-ovirt which contains all the error handling/retry logic/logs/tests needed to provide a decent user experience and an easy-to-use API to the oVirt engine.

go-ovirt-client-log is a library to unify the logging logic between the projects, it is used by go-ovirt-client and should be used by all the sub-projects.

go-ovirt-client-log-klog is a companion library to go-ovirt-client-log enabling logging via the Kubernetes "klog" facility.

k8sOVirtCredentialsMonitor is a utility for monitoring the oVirt credentials secret, which will automatically update the ovirt credentials is they are changed. 

We aim to move all projects which are using the go-ovirt to use go-ovirt-client, go-ovirt-client-log and k8sOVirtCredentialsMonitor instead.

Benefits for the eng:

  • Possible to write unit tests.
  • Easier to maintain since less code duplication - reduce the amount of code.
  • Test coverage exists on the ovirt-client as well.
  • No(Less) bugs regarding operations that needed a retry or polling logic.
  • Solves a number of existing bugs

Benefits for the customers:

  • Clearer error messages and logs.
  • Fewer bugs.

Acceptance criteria:

  1. All sub-projects are not using go-ovirt directly - at least 90% of the calls to go-ovirt should be migrated to go-ovirt-client.
  2. All sub-projects should use the corresponding go-ovirt-client-log for logging.
  3. All csi-driver and cluster provide use k8sOVirtCredentialsMonitor.
  4. CI tests are green for all components.

How to test:

  1. QE regression - make sure all flows are still working.
  2. Green CI on all jobs.
  3. Keep an eye out for log messages that might confuse customers.

Description:

  1. Identify all the communication between ovirt-csi-driver and the go-ovirt.
  2. Port all the logic to go-ovirt-client.
  3. Port all calls on ovirt-csi-driver to go-ovirt-client.

Acceptance:

ovirt-csi-driver uses go-ovirt-client for 95% percent of all oVirt related logic.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

As a user, I want the topology view to be less cluttered as I doom out showing only information that I can discern and still be able to get a feel for the status of my project.

Acceptance Criteria

  1. When zoomed to 50% scale, all labels & decorators will be hidden. Label are shown when hovering over the node
  2. When zoomed to 30% scale, all labels, decorators, pod rings & icons will be hidden. Node shape remains the same, and background is either white, yellow or red. Background color is determined based on aggregate status of pods, alerts, builds and pipelines. Tooltip is available showing node name as well as the "things" which are attributing to the warning/error status.

Additional Details:

Description

As a user, I want to understand which service bindings connected a service to a component successfully or not. Currently it's really difficult to understand and needs inspection into each ServiceBinding resource (yaml).

Acceptance Criteria

  1. Show a status badge on the SB details page
  2. Show a Status field in the right column of the SB details page
  3. Show the Status field in the right column of the Topology side panel when a SB is selected
  4. Show an indicator in the Topology view which will help to differentiate when the service binding is in error state
  5. Define the available statuses & associated icons 🥴
    1. Connected
    2. Error
  6. Error states defined by the SB conditions … if any of these 3 are not True, the status will be displayed as Error

Additional Details:

See also https://docs.google.com/document/d/1OzE74z2RGO5LPjtDoJeUgYBQXBSVmD5tCC7xfJotE00/edit

T-shirt size: M

Goal:

Provide an easy and successful experience for front end developers to build and deploy their applications

Why is it important?

Currently, the front end dev experience is not positive. It's much easier for them to use other platforms. Improving the front end dev experience will enable us to gain more marketshare

Use cases:

  1. Need to be able to override the npm command when using Node Builder Image
  2. Need to expose target port
  3. Need access to the URL to access my application

Although we provide the ability for 2 & 3 today, the current journey does not match with the mental model of the front end developer

Acceptance criteria:

  1. When importing an app, I should be able to easily provide the npm build and run commands
  2. When opting in to create a route, the target port should be exposed without having to open any Advanced Options
  3. After importing my app, if a route is exposed, I should be able to access/copy that URL

Dependencies (External/Internal):

Design Artifacts:

Desired UX experience

  • enable user to provide the *Build Command* when Node Builder image is being used
  • enable user to provide the *Run Command* when Node Builder image is being used
  • expose the Target Port under the *Create a route to the Application *rather than inside Show advanced Routing options
  • NEED TO FINALIZE HOW TO PROVIDE THE ROUTE TO EASILY COPY – Inline Notification maybe? As well as side panel?

Note:

Description

As a user, I want have the option to add additional labels to a Route, as I could do in OCP3. See RFE-622

The additional labels should only be added to the route, not the service or other components. The advanced option "Labels" should not be touched and these labels are added to all components.

As an small additional we should also show always the "Target port" since it also defines the Service port and to make this more clear, the "Target port" should be shown before the "Create a route to the Application" checkbox.

Acceptance Criteria

The following changes should be applied to the Import flow (from Git, from Container, ...) and to the Edit page as well:

  1. Move the option "Target port" before the checkbox "Create a route to the Application" and do not hide the "Target port" when the checkbox is disabled
  2. Add a new "Additional route labels" option, with a label input field to the "Advanced Routing options"
  3. Save (Import) and update (Edit) the labels to the Route resource. When editing a Deployment with a Route the route labels should not show the shared labels.

Additional Details:

Problem:

This epic is mainly focused on the 4.10 Release QE activities

Goal:

1. Identify the scenarios for automation
2. Segregate the test Scenarios into smoke, Regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
3. Work closely with dev team for epic automation
4. Create the automation scripts using cypress
5. Implement CI for nightly builds
6. Execute scripts on sprint basis

Why is it important?

To the track the QE progress at one place in 4.10 Release Confluence page

Use cases:

  1. <case>

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

There are different code spots which maps the old action items "From Git", "From Dockerfile" and "From Devfile" to the new action "Import from Git".

We should avoid mapping different strings to the new version and instead update our tests so that the feature and page object files matches the latest frontend code.

Code areas I found are marked with

      // TODO (ODC-6455): Tests should use latest UI labels like "Import from Git" instead of mapping strings

Acceptance criteria:

  1. Execute the automation scripts on ODC nightly builds in OpenShift CI (prow) periodically
  2. provide a separate job for each "plugin" (like pipelines, knative, etc.)

Goal:

This epic covers a number of customer requests(RFEs) as well as increases usability.

Why is it important?

Customer satisfaction as well as improved usability.

Acceptance Criteria

  1. Allow user to re-arrange the resources which have ben added to nav by the user
  2. Improved user experience (form based experience)
    1. Form based editing of Routes
    2. Form based creation and editing of Config Maps
    3. Form base creation of Deployments
  3. Improved discovery
    1. Include Share my project on the Add page to increase discoverability
    2. NS Helm Chart Repo
      1. Add tile to Add page for discoverability
      2. Provide a form driven creation experience
      3. User should be able to switch back and forth from Form/YAML
      4. change the intro text to the below & have the link in the intro text bring up the full page form
        1. Browse for charts that help manage complex installations and upgrades. Cluster administrators can customize the content made available in the catalog. Alternatively, developers can try to configure their own custom Helm Chart repository.

Dependencies (External/Internal):

None

Exploration:

Miro board from Epic Exploration

Description

As a user, I should be able to switch between the form and yaml editor while creating the ProjectHelmChartRepository CR.

Acceptance Criteria

  1. Convert the create form into a form-yaml switcher
  2. Display this form-yaml view in Search -> ProjectHelmChartRepositories in both perspectives

Additional Details:

Form component https://github.com/openshift/console/pull/11227

Description

As a user, I want to use a form to create Deployments

Acceptance Criteria

  1. Use existing edit Deployment form component for creating Deployments
  2. Display the form when clicked on `Create Deployment` in the Deployments Search page in the Dev perspective
  3. The `Create Deployment` button in the Deployments list page & the search page in the Admin perspective should have a similar experience.

Additional Details:

Edit deployment form ODC-5007

Problem:

Currently we are only able to get limited telemetry from the Dev Sandbox, but not from any of our managed clusters or on prem clusters.

Goals:

  1. Enable gathering segment telemetry whenever cluster telemetry is enabled on OSD clusters
  2. Have our OSD clusters opt into telemetry by default
  3. Work with PM & UX to identify additional metrics to capture in addition to what we have enabled currently on Sandbox.
  4. Ability to get a single report from woopra across all of our Sandbox and OSD clusters.
  5. Be able to generate a report including metrics of a single cluster or all clusters of a certain type ( sandbox, or OSD)

Why is it important?

In order to improve properly analyze usage and the user experience, we need to be able to gather as much data as possible.

Acceptance Criteria

  1. Extend console backend (bridge) to provide configuration as SERVER_FLAGS
    // JS type
    telemetry?: Record<string, string>
    
    1. Read the annotation of the cluster ConfigMap for telemetry data and pass them into the internal serverconfig.
    2. Pass through this internal serverconfig and export it as SERVER_FLAGS.
    3. Add a new --telemetry CLI option so that the telemetry options could be tested in a dev environment:
      ./bin/bridge --telemetry SEGMENT_API_KEY=a-key-123-xzy
      ./bin/bridge --telemetry CONSOLE_LOG=debug
      
  2. TBD: In best case the new annotation could be read from the cluster ConfigMap...
    1. Otherwise update the console-operator to pass the annotation from the console cluster configuration to the console ConfigMap.

Additional Details:

  1. More information about the integration with the backend could be found in the Telemetry on OSD clusters Google Doc

Goal:
Enhance oc adm release new (and related verbs info, extract, mirror) with heterogeneous architecture support

tl;dr

oc adm release new (and related verbs info, extract, mirror) would be enhanced to optionally allow the creation of manifest list release payloads. The manifest list flow would be triggered whenever the CVO image in an imagestream was a manifest list. If the CVO image is a standard manifest, the generated release payload will also be a manifest. If the CVO image is a manifest list, the generated release payload would be a manifest list (containing a manifest for each arch possessed by the CVO manifest list).

In either case, oc adm release new would permit non-CVO component images to be manifest or manifest lists and pass them through directly to the resultant release manifest(s).

If a manifest list release payload is generated, each architecture specific release payload manifest will reference the same pullspecs provided in the input imagestream.

 

More details in Option 1 of https://docs.google.com/document/d/1BOlPrmPhuGboZbLZWApXszxuJ1eish92NlOeb03XEdE/edit#heading=h.eldc1ppinjjh

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

  • Port all remaining Protractor tests to Cypress

Why is this important?

  • Protractor is very hard to debug when tests fail/flake
  • Once all protractor tests are ported we can remove all Protractor dependencies, scripts, and configuration files.
  • Cypress has better debugging, plug-ins, and reporting tools

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Please read: migrating-protractor-tests-to-cypress

Protractor test to migrate:  `frontend/integration-tests/tests/oauth.scenario.ts`
Large but straight forward

47) OAuth

   48) BasicAuth IDP
      ✔ creates a Basic Authentication IDP
      ✔ shows the BasicAuth IDP on the OAuth settings page

   49) GitHub IDP
      ✔ creates a GitHub IDP
      ✔ shows the GitHub IDP on the OAuth settings page

   50) GitLab IDP
      ✔ creates a GitLab IDP
      ✔ shows the GitLab IDP on the OAuth settings page

   51) Google IDP
      ✔ creates a Google IDP
      ✔ shows the Google IDP on the OAuth settings page

   52) Keystone IDP
      ✔ creates a Keystone IDP
      ✔ shows the Keystone IDP on the OAuth settings page

   53) LDAP IDP
      ✔ creates a LDAP IDP
      ✔ shows the LDAP IDP on the OAuth settings page

   54) OpenID IDP
      ✔ creates a OpenID IDP
      ✔ shows the OpenID IDP on the OAuth settings page

 Accpetance Criteria

  • Protractor test ported to cypress
  • Remove any unused legacy data-test-id`s
  • Protractor test deleted, and non longer referenced in `frontend/integration-tests/protractor.conf.ts`

Epic Goal

  • Update image registry dependencies (Kubernetes and OpenShift) to the latest versions.

Why is this important?

  • New versions usually bring improvements that are needed by the registry and help with getting updates for z-stream.

Scenarios

  1. As an OpenShift engineer, I want my components to use the versions of dependencies, so that they get fixes for known issues and can be easily updated in z-stream.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. Kubernetes 1.24

Previous Work (Optional):

  1. IR-210

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>

As a OpenShift engineer
I want image-registry to use the latest k8s libraries
so that image-registry can benefit from new upstream features.

Acceptance criteria

  • image-registry uses k8s.io/api v1.24.z
  • image-registry uses latest openshift/api, openshift/library-go, openshift/client-go
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

 

Background

As a follow up to OCPCLOUD-693, we need to, once all of the API definitions are present in openshift/api, migrate the existing code bases to use the new API locations.

 

This will include:

  • Machine API Operator
  • Cluster Machine Approver
  • Cluster API Provider AWS|Azure|GCP|IBM|Alibaba|OpenStack|Kubevirt
  • Cluster API actuator pkg
  • Installer
  • WMCO
  • MCO
  • Hive
  • Grep OpenShift for other references to our old APIs

Steps

  • Replace the Machine API imports with the new openshift/API MAPI locations

Stakeholders

  • Cluster Infra
  • Owners of the repos listed above

Definition of Done

  • The openshift/API defintions are used across components in the MAPI ecosystem
  • Docs
  • Generated docs for API types should now come from openshift/API
  • Testing
  • Regular regression testing should be sufficient, this is a copy paste for the most part and we expect the code won't compile if we break this

Problem:

Complete all the 4.9 epic features automation user stories and merge it to master branch.

Goal:

4.9 epics automation completion

Why is it important?

Tech debt should be completed

Use cases:

  1. <case>

Acceptance criteria:

Create the pr's for 4.9 epic user stories automation
Review it
Merge it to 4.10 master branch and 4.9 master branch

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, I want to store my delivery pipelines in a Git repository as the source of truth and execute the pipeline on OpenShift on Git events, so that I can version and trace changes to the delivery pipelines in Git.

Use Cases

  • Developer can see the list of Git repositories that are added to the namespace for pipeline-as-code execution
  • Developer can navigate from the Console to the Git repository on the Git provider
  • For each Git repository, developer can see the details of the last pipeline execution and the commit id that triggered it with possibility to navigating to the Git commit in the Git provider
  • Developer can see the list of pipelinerun executions related to a Git repository in a chronological order and the commit id that triggered each

Acceptance Criteria

  1. As a user, looking at the Pipelines page in the Developer Console, I should be able to see a list of (a) Git repositories that are added to the namespace for PAC execution AND (b) all pipelines in the namespace
  2. As a user, I should be able to navigate to a details page of the git repo.
    1. This details page should provide access to (a) details of the git repo and (b) a list of pipeline runs.
    2. This PLR tab should show additional information than the typical PLR List view, including SHA (commit id), commit message, branch & trigger type
  3. As a user, when looking at a Pipeline Run Details page, if associate with a git repo (PAC),
    1. Indicate that it's from a specific git repo rather than a PL resource
    2. Include the SHA (commit id), commit message, branch & trigger type

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

This is a clone of issue OCPBUGS-7633. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1125. The following is the description of the original issue:

(originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=1983200)

test:
[sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Serial]

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-etcd%5C%5D%5C%5BFeature%3ADisasterRecovery%5C%5D%5C%5BDisruptive%5C%5D+%5C%5BFeature%3AEtcdRecovery%5C%5D+Cluster+should+restore+itself+after+quorum+loss+%5C%5BSerial%5C%5D

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1413625606435770368
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1415075413717159936

some brief triaging from Thomas Jungblut on:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.11/1568747321334697984

it seems the last guard pod doesn't come up, etcd operator installs this properly and the revision installer also does not spout any errors. It just doesn't progress to the latest revision. At first glance doesn't look like an issue with etcd itself, but needs to be taken a closer look at for sure.

This is a clone of issue OCPBUGS-1717. The following is the description of the original issue:

Description of problem:

Image registry pods panic while deploying OCP in me-central-1 AWS region

Version-Release number of selected component (if applicable):

4.11.2

How reproducible:

Deploy OCP in AWS me-central-1 region

Steps to Reproduce:

Deploy OCP in AWS me-central-1 region 

Actual results:

panic: Invalid region provided: me-central-1

Expected results:

Image registry pods should come up with no errors

Additional info:

 

Description of problem:

Add telemetry metrics for Red Hat Advanced Cluster Security (RHACS) to OpenShift Telemeter.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Look up telemetry-config in openshift-monitoring.

Actual results:

telemetry-config in openshift-monitoring does not contain RHACS metrics.

Expected results:

telemetry-config in openshift-monitoring does contain RHACS metrics.

Additional info:

 

Description of problem:

See https://bugzilla.redhat.com/show_bug.cgi?id=2104275

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The IPI installation in some regions got bootstrap failure, and without any node available/ready.

Version-Release number of selected component (if applicable):

12-22 16:22:27.970  ./openshift-install 4.12.0-0.nightly-2022-12-21-202045
12-22 16:22:27.970  built from commit 3f9c38a5717c638f952df82349c45c7d6964fcd9
12-22 16:22:27.970  release image registry.ci.openshift.org/ocp/release@sha256:2d910488f25e2638b6d61cda2fb2ca5de06eee5882c0b77e6ed08aa7fe680270
12-22 16:22:27.971  release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. try the IPI installation in the problem regions (so far tried and failed with ap-southeast-2, ap-south-1, eu-west-1, ap-southeast-6, ap-southeast-3, ap-southeast-5, eu-central-1, cn-shanghai, cn-hangzhou and cn-beijing) 

Actual results:

Bootstrap failed to complete

Expected results:

Installation in those regions should succeed.

Additional info:

FYI the QE flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/166672/

No any node available/ready, and no any operator available.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          30m     Unable to apply 4.12.0-0.nightly-2022-12-21-202045: an unknown error has occurred: MultipleErrors
$ oc get nodes
No resources found
$ oc get machines -n openshift-machine-api -o wide
NAME                         PHASE   TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
jiwei-1222f-v729x-master-0                                  30m                       
jiwei-1222f-v729x-master-1                                  30m                       
jiwei-1222f-v729x-master-2                                  30m                       
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication
baremetal
cloud-controller-manager                                                                          
cloud-credential                                                                                  
cluster-autoscaler                                                                                
config-operator                                                                                   
console                                                                                           
control-plane-machine-set                                                                         
csi-snapshot-controller                                                                           
dns                                                                                               
etcd                                                                                              
image-registry                                                                                    
ingress                                                                                           
insights                                                                                          
kube-apiserver                                                                                    
kube-controller-manager                                                                           
kube-scheduler                                                                                    
kube-storage-version-migrator                                                                     
machine-api                                                                                       
machine-approver                                                                                  
machine-config                                                                                    
marketplace                                                                                       
monitoring                                                                                        
network                                                                                           
node-tuning                                                                                       
openshift-apiserver                                                                               
openshift-controller-manager                                                                      
openshift-samples                                                                                 
operator-lifecycle-manager                                                                        
operator-lifecycle-manager-catalog                                                                
operator-lifecycle-manager-packageserver
service-ca
storage
$

Mater nodes don't run for example kubelet and crio services.
[core@jiwei-1222f-v729x-master-0 ~]$ sudo crictl ps
FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory" 
[core@jiwei-1222f-v729x-master-0 ~]$ 

The machine-config-daemon firstboot tells "failed to update OS".
[jiwei@jiwei log-bundle-20221222085846]$ grep -Ei 'error|failed' control-plane/10.0.187.123/journals/journal.log 
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:18 localhost ignition[867]: failed to fetch config: resource requires networking
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <info>  [1671726259.0329] hostname: hostname: hostnamed not used as proxy creation failed with: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <warn>  [1671726259.0464] sleep-monitor-sd: failed to acquire D-Bus proxy: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain ignition[891]: GET error: Get "https://api-int.jiwei-1222f.alicloud-qe.devcluster.openshift.com:22623/config/master": dial tcp 10.0.187.120:22623: connect: connection refused
...repeated logs omitted...
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-ctl[1888]: 2022-12-22T16:27:46Z|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-vswitchd[1888]: ovs|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 dbus-daemon[1669]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1924]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1937]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[2037]: Error: Device '' not found.
Dec 22 08:35:32 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:35:32.477770    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-910221290 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 rpm-ostree[2288]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: W1222 08:56:06.785425    2181 firstboot_complete_machineconfig.go:46] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511: Warning: The unit file, source configuration file or drop-ins of rpm-ostreed.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: error: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:57:31 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:57:31.244684    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-4021566291 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:59:20 jiwei-1222f-v729x-master-0 systemd[2353]: /usr/lib/systemd/user/podman-kube@.service:10: Failed to parse service restart specifier, ignoring: never
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2437]: Error: open default: no such file or directory
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2450]: Error: failed to start API service: accept unixgram @00026: accept4: operation not supported
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman-kube@default.service: Failed with result 'exit-code'.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: Failed to start A template for running K8s workloads via podman-play-kube.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman.service: Failed with result 'exit-code'.
[jiwei@jiwei log-bundle-20221222085846]$ 

 

This is a clone of issue OCPBUGS-3680. The following is the description of the original issue:

Description of problem:

OCP upgrade blocks because of cluster operator csi-snapshot-controller fails to start its deployment with a fatal message of read-only filesystem

Version-Release number of selected component (if applicable):

Red Hat OpenShift 4.11
rhacs-operator.v3.72.1

How reproducible:

At least once in user's cluster while upgrading 

Steps to Reproduce:

1. Have a OCP 4.11 installed
2. Install ACS on top of the OCP cluster
3. Upgrade OCP to the next z-stream version

Actual results:

Upgrade gets blocked: waiting on csi-snapshot-controller

Expected results:

Upgrade should succeed

Additional info:

stackrox SCCs (stackrox-admission-control, stackrox-collector and stackrox-sensor) contain the `readOnlyRootFilesystem` set to `true`, if not explicitly defined/requested, other Pods might receive this SCC which will make the deployment to fail with a `read-only filesystem` message

Description of problem:

The test local-test is failing on openshift/thanos when upgrading golang version to 1.18 on the branch release-4.11.

Please refer to this test log for details:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_thanos/82/pull-ci-openshift-thanos-release-4.11-test-local/1541516614497734656

Version-Release number of selected component (if applicable):

4.11

How reproducible:

See local-test job on pull request on the repository Openshift/Thanos

Steps to Reproduce:


Actual results:

local-test fails on the following error:
level=error ts=2022-06-27T20:28:12.306Z caller=web.go:99 component=web msg="panic while serving request" client=127.0.0.1:37064 url=/api/v1/metadata err="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 278 [running]:\ngithub.com/prometheus/prometheus/web.withStackTracer.func1.1()\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:98 +0x99\npanic({0x1c34760, 0x308ad40})\n\t/usr/lib/golang/src/runtime/panic.go:838 +0x207\nreflect.mapiternext(0xc000458540?)\n\t/usr/lib/golang/src/runtime/map.go:1378 +0x19\ngithub.com/modern-go/reflect2.(*UnsafeMapIterator).UnsafeNext(0x1bd62e0?)\n\t/go/pkg/mod/github.com/modern-go/reflect2@v1.0.1/unsafe_map.go:136 +0x32\ngithub.com/json-iterator/go.(*sortKeysMapEncoder).Encode(0xc000949d10, 0xc0002966b0, 0xc0006c7740)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_map.go:297 +0x31a\ngithub.com/json-iterator/go.(*onePtrEncoder).Encode(0xc0008cb120, 0xc000948fc0, 0xc0001139c0?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:219 +0x82\ngithub.com/json-iterator/go.(*Stream).WriteVal(0xc0006c7740, {0x1c16da0, 0xc000948fc0})\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:98 +0x158\ngithub.com/json-iterator/go.(*dynamicEncoder).Encode(0xc00094cd58?, 0xfa9a07?, 0xc0006c7758?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_dynamic.go:15 +0x39\ngithub.com/json-iterator/go.(*structFieldEncoder).Encode(0xc000949620, 0x1a4aaba?, 0xc0006c7740)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_struct_encoder.go:110 +0x56\ngithub.com/json-iterator/go.(*structEncoder).Encode(0xc000949740, 0x0?, 0xc0006c7740)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_struct_encoder.go:158 +0x652\ngithub.com/json-iterator/go.(*OptionalEncoder).Encode(0xc0001afd60?, 0x0?, 0x0?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_optional.go:74 +0xa4\ngithub.com/json-iterator/go.(*onePtrEncoder).Encode(0xc0008cad40, 0xc0006c76e0, 0xc000949020?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:219 +0x82\ngithub.com/json-iterator/go.(*Stream).WriteVal(0xc0006c7740, {0x1ac56e0, 0xc0006c76e0})\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:98 +0x158\ngithub.com/json-iterator/go.(*frozenConfig).Marshal(0xc0001afd60, {0x1ac56e0, 0xc0006c76e0})\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/config.go:299 +0xc9\ngithub.com/prometheus/prometheus/web/api/v1.(*API).respond(0xc0002d7a40, {0x229a448, 0xc00022bd60}, {0x1c16da0?, 0xc000948fc0}, {0x0?, 0x7fe5a05a5b20?, 0x20?})\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/api/v1/api.go:1437 +0x162\ngithub.com/prometheus/prometheus/web/api/v1.(*API).Register.func1.1({0x229a448, 0xc00022bd60}, 0x7fe5982c5300?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/api/v1/api.go:273 +0x20b\nnet/http.HandlerFunc.ServeHTTP(0x7fe5982c5300?, {0x229a448?, 0xc00022bd60?}, 0xc00072b270?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP({{0x2290780?, 0xc000856288?}}, {0x7fe5982c5300?, 0xc00072b270?}, 0x228fb20?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/util/httputil/compression.go:90 +0x69\ngithub.com/prometheus/prometheus/web.(*Handler).testReady.func1({0x7fe5982c5300?, 0xc00072b270?}, 0x7fe5982c5300?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:499 +0x39\nnet/http.HandlerFunc.ServeHTTP(0x7fe5982c5300?, {0x7fe5982c5300?, 0xc00072b270?}, 0x50?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1({0x7fe5982c5300?, 0xc00072b220?}, 0xc000250c00)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/promhttp/instrument_server.go:196 +0xa5\nnet/http.HandlerFunc.ServeHTTP(0x228fb80?, {0x7fe5982c5300?, 0xc00072b220?}, 0xc000948ed0?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2({0x7fe5982c5300, 0xc00072b220}, 0xc000250c00)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/promhttp/instrument_server.go:76 +0xa2\nnet/http.HandlerFunc.ServeHTTP(0x22a4a68?, {0x7fe5982c5300?, 0xc00072b220?}, 0x0?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1({0x22a4a68?, 0xc00072b1d0?}, 0xc000250c00)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/promhttp/instrument_server.go:100 +0x94\ngithub.com/prometheus/prometheus/web.setPathWithPrefix.func1.1({0x22a4a68, 0xc00072b1d0}, 0xc000250b00)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:1142 +0x290\ngithub.com/prometheus/common/route.(*Router).handle.func1({0x22a4a68, 0xc00072b1d0}, 0xc000250a00, {0x0, 0x0, 0xc00022c364?})\n\t/go/pkg/mod/github.com/prometheus/common@v0.10.0/route/route.go:83 +0x2ae\ngithub.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0001cc780, {0x22a4a68, 0xc00072b1d0}, 0xc000250a00)\n\t/go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0x82b\ngithub.com/prometheus/common/route.(*Router).ServeHTTP(0x8?, {0x22a4a68?, 0xc00072b1d0?}, 0x203000?)\n\t/go/pkg/mod/github.com/prometheus/common@v0.10.0/route/route.go:121 +0x26\nnet/http.StripPrefix.func1({0x22a4a68, 0xc00072b1d0}, 0xc000250900)\n\t/usr/lib/golang/src/net/http/server.go:2127 +0x330\nnet/http.HandlerFunc.ServeHTTP(0x10?, {0x22a4a68?, 0xc00072b1d0?}, 0x7fe5c8423f18?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\nnet/http.(*ServeMux).ServeHTTP(0x413d87?, {0x22a4a68, 0xc00072b1d0}, 0xc000250900)\n\t/usr/lib/golang/src/net/http/server.go:2462 +0x149\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5({0x22a3808?, 0xc000a282a0}, 0xc000250200)\n\t/go/pkg/mod/github.com/opentracing-contrib/go-stdlib@v0.0.0-20190519235532-cf7a6c988dc9/nethttp/server.go:140 +0x662\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x22a3808?, 0xc000a282a0?}, 0xffffffffffffffff?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/prometheus/web.withStackTracer.func1({0x22a3808?, 0xc000a282a0?}, 0xc0008ca850?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:103 +0x97\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x22a3808?, 0xc000a282a0?}, 0xc000100000?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\nnet/http.serverHandler.ServeHTTP({0xc000c55380?}, {0x22a3808, 0xc000a282a0}, 0xc000250200)\n\t/usr/lib/golang/src/net/http/server.go:2916 +0x43b\nnet/http.(*conn).serve(0xc0000d1540, {0x22a4e18, 0xc00061a0c0})\n\t/usr/lib/golang/src/net/http/server.go:1966 +0x5d7\ncreated by net/http.(*Server).Serve\n\t/usr/lib/golang/src/net/http/server.go:3071 +0x4db\n"
level=error ts=2022-06-27T20:28:12.306Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:37064" msg="runtime error: invalid memory address or nil pointer dereference"

Expected results:

local-test does no fail on the error above.

Additional info:


Following the trail
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1139
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1175
https://github.com/openshift/aws-ebs-csi-driver/pull/206
 
Looks like the fix should be in 4.12, but it still see it being 39 vs ~24 on an m6i instance type.

It seems that the kubelet applies this capacity to the node in 4.11 and earlier and, thus, unlikely to receive this fix for attachable volumes in the upstream CSI driver. 4.12 behavior is currently unknown but it seems that the kubelet might still be setting this capacity.

The actual issue is that kube scheduler schedules pods that require PVs to nodes where those PVs can not be attached.

This is a clone of issue OCPBUGS-8339. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5287. The following is the description of the original issue:

Description of problem:

See https://issues.redhat.com/browse/THREESCALE-9015.  A problem with the Red Hat Integration - 3scale - Managed Application Services operator prevents it from installing correctly, which results in the failure of operator-install-single-namespace.spec.ts integration test.

This is a clone of issue OCPBUGS-14415. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14315. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4501. The following is the description of the original issue:

Description of problem:

IPV6 interface and IP is missing in all pods created in OCP 4.12 EC-2.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Every time

Steps to Reproduce:

We create network-attachment-definitions.k8s.cni.cncf.io in OCP cluster at namespace scope for our software pods to get IPV6 IPs. 

Actual results:

Pods do not receive IPv6 addresses

Expected results:

Pods receive IPv6 addresses

Additional info:

This has been working flawlessly till OCP 4.10. 21 however we are trying same code in OCP 4.12-ec2 and we notice all our pods are missing ipv6 address and we have to restart pods couple times for them to get ipv6 address.

This is a clone of issue OCPBUGS-18305. The following is the description of the original issue:

Description of problem:

It appears it may be possible to have invalid CSV entries in the resolver cache, resulting in the inability to reinstall an Operator.

The situation:
--------------
A customer has removed the CSV, InstallPlan and Subscription for the GitOps Operator from the cluster but upon attempting to reinstall the Operator, the OLM was providing a conflict with existing CSV.

This CSV was not in the ETCD instance and was removed previously. Upon deleting the `operator-catalog` and `operator-lifecycle-manager` Pods, the collision was resolved and the Operator was able to installed again.
~~~
'Warning' reason: 'ResolutionFailed' constraints not satisfiable: subscription openshift-gitops-operator exists, subscription openshift-gitops-operator requires redhat-operators/openshift-marketplace/stable/openshift-gitops-operator.v1.5.8, redhat-operators/openshift-marketplace/stable/openshift-gitops-operator.v1.5.8 and @existing/openshift-operators//openshift-gitops-operator.v1.5.6-0.1664915551.p originate from package openshift-gitops-operator, clusterserviceversion openshift-gitops-operator.v1.5.6-0.1664915551.p exists and is not referenced by a subscription
~~~

Version-Release number of selected component (if applicable):

4.9.31

How reproducible:

Very intermittent, however once the issue has occurred it was impossible to avoid without deleting the Pods.

Steps to Reproduce:

1. Add Operator with manual approval InstallPlan
2. Remove Operator (Subscription, CSV, InstallPlan)
3. Attempt to reinstall Operator 

Actual results:

Very intermittent failure

Expected results:

Operators do not have conflicts with CSVs which have already been removed.

Additional info:

Briefly reviewing the OLM code, it appears an internal resolver cache is populated and used for checking constraints when an operator is installed. If there are stale entries in the cache, this would result in the described issue.
The cache appears to have been rearchitected (moved to a dedicated object) since OCP 4.9.31. Due to the nature of this issue, the request does not have clear reproduction steps to reproduce so if the issue is unable to reproduced, I would like instructions on how to dump the contents of the cache if the issue is to arise again.

This is a clone of issue OCPBUGS-855. The following is the description of the original issue:

Description of problem:

When setting the allowedregistries like the example below, the openshift-samples operator is degraded:

oc get image.config.openshift.io/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2020-12-16T15:48:20Z"
  generation: 2
  name: cluster
  resourceVersion: "422284920"
  uid: d406d5a0-c452-4a84-b6b3-763abb51d7a5
spec:
  additionalTrustedCA:
    name: registry-ca
  allowedRegistriesForImport:
  - domainName: quay.io
    insecure: false
  - domainName: registry.redhat.io
    insecure: false
  - domainName: registry.access.redhat.com
    insecure: false
  - domainName: registry.redhat.io/redhat/redhat-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/redhat-marketplace-index
    insecure: true
  - domainName: registry.redhat.io/redhat/certified-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/community-operator-index
    insecure: true
  registrySources:
    allowedRegistries:
    - quay.io
    - registry.redhat.io
    - registry.rijksapps.nl
    - registry.access.redhat.com
    - registry.redhat.io/redhat/redhat-operator-index
    - registry.redhat.io/redhat/redhat-marketplace-index
    - registry.redhat.io/redhat/certified-operator-index
    - registry.redhat.io/redhat/community-operator-index


oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.21   True        False         False      5d13h   
baremetal                                  4.10.21   True        False         False      450d    
cloud-controller-manager                   4.10.21   True        False         False      94d     
cloud-credential                           4.10.21   True        False         False      624d    
cluster-autoscaler                         4.10.21   True        False         False      624d    
config-operator                            4.10.21   True        False         False      624d    
console                                    4.10.21   True        False         False      42d     
csi-snapshot-controller                    4.10.21   True        False         False      31d     
dns                                        4.10.21   True        False         False      217d    
etcd                                       4.10.21   True        False         False      624d    
image-registry                             4.10.21   True        False         False      94d     
ingress                                    4.10.21   True        False         False      94d     
insights                                   4.10.21   True        False         False      104s    
kube-apiserver                             4.10.21   True        False         False      624d    
kube-controller-manager                    4.10.21   True        False         False      624d    
kube-scheduler                             4.10.21   True        False         False      624d    
kube-storage-version-migrator              4.10.21   True        False         False      31d     
machine-api                                4.10.21   True        False         False      624d    
machine-approver                           4.10.21   True        False         False      624d    
machine-config                             4.10.21   True        False         False      17d     
marketplace                                4.10.21   True        False         False      258d    
monitoring                                 4.10.21   True        False         False      161d    
network                                    4.10.21   True        False         False      624d    
node-tuning                                4.10.21   True        False         False      31d     
openshift-apiserver                        4.10.21   True        False         False      42d     
openshift-controller-manager               4.10.21   True        False         False      22d     
openshift-samples                          4.10.21   True        True          True       31d     Samples installation in error at 4.10.21: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}
operator-lifecycle-manager                 4.10.21   True        False         False      624d    
operator-lifecycle-manager-catalog         4.10.21   True        False         False      624d    
operator-lifecycle-manager-packageserver   4.10.21   True        False         False      31d     
service-ca                                 4.10.21   True        False         False      624d    
storage                                    4.10.21   True        False         False      113d  


After applying the fix as described here(  https://access.redhat.com/solutions/6547281 ) it is resolved:
oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec": {"samplesRegistry": "registry.redhat.io"}}'

But according the the BZ this should be fixed in 4.10.3 https://bugzilla.redhat.com/show_bug.cgi?id=2027745 but the issue is still occur in our 4.10.21 cluster:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.21   True        False         31d     Error while reconciling 4.10.21: the cluster operator openshift-samples is degraded

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-5092. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3314. The following is the description of the original issue:

Description of problem:

triggers[].gitlab.secretReference[1] disappears when a 'buildconfig' is edited on ‘From View’

Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

Always

Steps to Reproduce:

1. Configure triggers[].gitlab.secretReference[1] as below 

~~~
spec:
 .. 
  triggers:
    - type: ConfigChange
    - type: GitLab
      gitlab:
        secretReference:
          name: m24s40-githook
~~~
2. Open ‘Edit BuildConfig’ buildconfig  with ‘From’ View:
 - Buildconfigs -> Actions -> Edit Buildconfig

3. Click ‘YAML view’ on top. 

Actual results:

The 'secretReference' configured earlier has disappeared. You can click [Reload] button which will bring the configuration back.

Expected results:

'secretReference' configured in buildconfigs do not disappear. 

Additional info:


[1]https://docs.openshift.com/container-platform/4.10/rest_api/workloads_apis/buildconfig-build-openshift-io-v1.html#spec-triggers-gitlab-secretreference

 

This is a clone of issue OCPBUGS-13013. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12854. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11550. The following is the description of the original issue:

Description of problem:

`cluster-reader` ClusterRole should have ["get", "list", "watch"] permissions for a number of privileged CRs, but lacks them for the API Group "k8s.ovn.org", which includes CRs such as EgressFirewalls, EgressIPs, etc.

Version-Release number of selected component (if applicable):

OCP 4.10 - 4.12 OVN

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster with OVN components, e.g. EgressFirewall
2. Check permissions of ClusterRole `cluster-reader`

Actual results:

No permissions for OVN resources 

Expected results:

Get, list, and watch verb permissions for OVN resources

Additional info:

Looks like a similar bug was opened for "network-attachment-definitions" in OCPBUGS-6959 (whose closure is being contested).

Description of problem:

The metal3 Pod of openshift-machine-api is in CrashLoopBackOff status.

Version-Release number of selected component (if applicable):

4.10.31

How reproducible:

Always reproduce in IPv6

Steps to Reproduce:

1. Preparing to configure ipi on the provisioning node
   - RHEL 8 ( haproxy, named, mirror registry, rhcos_cache_server ..)

2. configuring the install-config.yaml (attached)
   - provisioningNetwork: disabled
   - machine network: only IPv6
   - disconnected installation
   
3. deploy the cluster 

Actual results:

It is not possible to add worker nodes because metal3 does not start normally.

Expected results:

metal3 starts normally in IPv6 environment

Additional info:

1. attached must-gather
https://drive.google.com/file/d/1GKxj3syROIMnURx_PYzOYhJdEuXBNXVW/view?usp=sharing

2. pod status
[kni@prov ~]$ oc get pod
NAME                                           READY   STATUS             RESTARTS          AGE
cluster-autoscaler-operator-6656bfd7b9-bt4j8   2/2     Running            0                 35h
cluster-baremetal-operator-6bbdd6758-rmxgq     2/2     Running            0                 35h
machine-api-controllers-55fb545b56-kl5sj       7/7     Running            0                 34h
machine-api-operator-845b6cf855-q7gdd          2/2     Running            0                 35h
metal3-574876cfdb-98fmz                        6/7     CrashLoopBackOff   179 (3m17s ago)   14h
metal3-image-cache-5mq2w                       1/1     Running            0                 14h
metal3-image-cache-nftpj                       1/1     Running            0                 14h
metal3-image-cache-t7whh                       1/1     Running            0                 14h
metal3-image-customization-68d4d6d99b-dbqgn    1/1     Running            0                 15h

[kni@prov ~]$ oc logs metal3-574876cfdb-98fmz -c metal3-httpd
AH00526: Syntax error on line 8 of /etc/httpd/conf.d/vmedia.conf:
The port number "2001:feed:101::102" is outside the appropriate range (i.e., 1..65535).

This bug is a backport clone of [Bugzilla Bug 2089950](https://bugzilla.redhat.com/show_bug.cgi?id=2089950). The following is the description of the original bug:

Description of problem: Some upgrades failed during scale testing with messages indicating the console operator is not available. In total 5 out of 2200 clusters failed with this pattern.

These clusters are all configured with the Console operator disabled in order to reduce overall OCP cpu use in the Telecom environment. The following CR is applied:
apiVersion: operator.openshift.io/v1
kind: Console
metadata:
annotations:
include.release.openshift.io/ibm-cloud-managed: "false"
include.release.openshift.io/self-managed-high-availability: "false"
include.release.openshift.io/single-node-developer: "false"
release.openshift.io/create-only: "true"
ran.openshift.io/ztp-deploy-wave: "10"
name: cluster
spec:
logLevel: Normal
managementState: Removed
operatorLogLevel: Normal

From one cluster (sno01175) the ClusterVersion conditions show:

  1. oc get clusterversion version -o jsonpath=' {.status.conditions}

    ' | jq
    [

    { "lastTransitionTime": "2022-05-19T01:44:13Z", "message": "Done applying 4.9.26", "status": "True", "type": "Available" }

    ,

    { "lastTransitionTime": "2022-05-24T14:57:50Z", "message": "Cluster operator console is degraded", "reason": "ClusterOperatorDegraded", "status": "True", "type": "Failing" }

    ,

    { "lastTransitionTime": "2022-05-24T13:49:43Z", "message": "Unable to apply 4.10.13: wait has exceeded 40 minutes for these operators: console", "reason": "ClusterOperatorDegraded", "status": "True", "type": "Progressing" }

    ,

    { "lastTransitionTime": "2022-05-21T02:07:06Z", "status": "True", "type": "RetrievedUpdates" }

    ,

    { "lastTransitionTime": "2022-05-24T13:53:05Z", "message": "Payload loaded version=\"4.10.13\" image=\"quay.io/openshift-release-dev/ocp-release@sha256:4f516616baed3cf84585e753359f7ef2153ae139c2e80e0191902fbd073c4143\"", "reason": "PayloadLoaded", "status": "True", "type": "ReleaseAccepted" }

    ,

    { "lastTransitionTime": "2022-05-24T13:57:05Z", "message": "Cluster operator kube-apiserver should not be upgraded between minor versions: KubeletMinorVersionUpgradeable: Kubelet minor version (1.22.5+5c84e52) on node sno01175 will not be supported in the next OpenShift minor version upgrade.", "reason": "KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade", "status": "False", "type": "Upgradeable" }

    ]

Another cluster (sno01959) has very similar conditions with slight variation in the Failing and Progressing messages:

{ "lastTransitionTime": "2022-05-24T14:32:42Z", "message": "Cluster operator console is not available", "reason": "ClusterOperatorNotAvailable", "status": "True", "type": "Failing" }

,

{ "lastTransitionTime": "2022-05-24T13:52:04Z", "message": "Unable to apply 4.10.13: the cluster operator console has not yet successfully rolled out", "reason": "ClusterOperatorNotAvailable", "status": "True", "type": "Progressing" }

,

Version-Release number of selected component (if applicable): 4.9.26 upgrade to 4.10.13

How reproducible: 5 out of 2200

Steps to Reproduce:
1. Disable console with managementState: Removed
2. Starting OCP version 4.9.26
3. Initiate upgrade to 4.10.13 via ClusterVersion CR

Actual results: Cluster upgrade is stuck (no longer progressing) for 5+ hours

Expected results: Cluster upgrade completes

Additional info:

Description of problem:

Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"    

Version-Release number of selected component (if applicable): 4.10.22

How reproducible:  Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"

Steps to Reproduce:

Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"

Actual results: master nodes do come up.

Expected results: master nodes should come up despite that the text "master" is not in their hostname.

Additional info:

Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"    

The code for the cluster-baremetal-operator at the following link: 

https://github.com/openshift/cluster-baremetal-operator/blob/49d7b249c5dcef8228f206eff4530a25f03b201f/controllers/provisioning_controller.go#L441

The following condition is concerning:

if strings.Contains(bmh.Name, "master") && len(bmh.Spec.BootMACAddress) > 0

The packages reveal that bmh.Name references the name inside the metadata of the BMH object. 

Should a customer have masters with names that do not include the text "master", the above condition can never become true, and so, the following slice is never created :

macs = append(macs, bmh.Spec.BootMACAddress)

 

 

This is a clone of issue OCPBUGS-212. The following is the description of the original issue:

Description of problem:

oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         45h     Error while reconciling 4.12.0-ec.1: the cluster operator kube-controller-manager is degraded

oc --context build02 get co kube-controller-manager
NAME                      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-ec.1   True        False         True       2y87d   GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

build02 is a build farm cluster in CI production.
I can provide credentials to access the cluster if needed.

Description of problem:

A cluster installation (4.11.36) ultimately failed because an alertmanager pod could not start, and remained in a ContainerCreating state.

The namespace events show:

LAST SEEN   TYPE      REASON                   OBJECT                    MESSAGE                 
3m10s       Warning   FailedCreatePodSandBox   pod/alertmanager-main-0   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-0_openshift-monitor
ing_ead22ae2-c67d-4e3f-a3c2-73a87a564e6d_0(6105cad796e2b51bed66b5515bf42939694dfa920395ebc72aec21cd076eab85): error adding pod openshift-monitoring_alertmanager-main-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-net
work" failed (add): [openshift-monitoring/alertmanager-main-0/ead22ae2-c67d-4e3f-a3c2-73a87a564e6d:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-0
 6105cad796e2b51bed66b5515bf42939694dfa920395ebc72aec21cd076eab85] [openshift-monitoring/alertmanager-main-0 6105cad796e2b51bed66b5515bf42939694dfa920395ebc72aec21cd076eab85] failed to get pod annotation: timed out waiting for annotations: contex
t deadline exceeded...                 

Manually deleting the pod caused it to immediately recreate and run successfully.

Version-Release number of selected component (if applicable):

4.12.10

How reproducible:

Unknown

Actual results:

The monitoring cluster operator remains in a non-available state due to the lack of the alertmanager pod being present. The alertmanager pod never runs.

Expected results:

The alertmanager pod should run without needing manual intervention.

Additional info:

 

This is a clone of issue OCPBUGS-19907. The following is the description of the original issue:

This is a clone of issue OCPBUGS-18764. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6513. The following is the description of the original issue:

Description of problem:

Using the web console on the RH Developer Sandbox, created the most basic Knative Service (KSVC) using the default suggested, ie image openshift/hello-openshift.

Then tried to change the displayed icon using the web UI and an error about Probes was displayed. See attached images.

The error has no relevance to the item changed.

Version-Release number of selected component (if applicable):

whatever the RH sandbox uses, this value is not displayed to users

How reproducible:

very

Steps to Reproduce:

Using the web console on the RH Developer Sandbox, created the most basic Knative Service (KSVC) using the default image openshift/hello-openshift.

Then used the webUi to edit the KSVC sample to change the icon used from an OpenShift logo to a 3Scale logo for instance.

When saving from this form an error was reported: admission webhook 'validation webhook.serving.knative.dev' denied the request: validation failed: must not set the field(s): spec.template.spec.containers[0].readiness.Probe




Actual results:

 

Expected results:

Either a failure message related to changing the icon, or the icon change to take effect

Additional info:

KSVC details as provided by the web console.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sample
  namespace: agroom-dev
spec:
  template:
    spec:
      containers:
        - image: openshift/hello-openshift

This is a clone of issue OCPBUGS-10622. The following is the description of the original issue:

Description of problem:

Unit test failing 

=== RUN   TestNewAppRunAll/app_generation_using_context_dir
    newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby'
    --- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)


Version-Release number of selected component (if applicable):

 

How reproducible:

100

Steps to Reproduce:

see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648 

Actual results:

unit tests fail

Expected results:

TestNewAppRunAll unit test should pass

Additional info:

 

This bug is a backport clone of [Bugzilla Bug 2034883](https://bugzilla.redhat.com/show_bug.cgi?id=2034883). The following is the description of the original bug:

Description of problem:

Situation (starting point):

  • There is an ongoing change to the machine-config-daemon daemonset being applied by the machine-config-operator pod. It is waiting for the daemonset to roll out.
  • There are some nodes not ready, so daemonset rollout never ends and waiting on that ends in timeout error.

Problem:

  • Machine-config-operator pods stops trying to reconcile stuff whenever it finds timeout error in waiting for the machine-config-daemon rollout
  • This implies that the `spec.kubeAPIServerServingCAData` field of controllerconfig/machine-config-controller object is not updated when the kube-apiserver-operator updates kube-apiserver-to-kubelet-client-ca configmap.
  • Without that field updated, a kube-apiserver-to-kubelet-client-ca change is never rolled out to the nodes.
  • That ultimately leads to cluster-wide unavailability of "oc logs", "oc rsh" etc. commands when the kube-apiserver-operator starts using a client cert signed by the new kube-apiserver-to-kubelet-client-ca to access kubelet ports.

Version-Release number of MCO (Machine Config Operator) (if applicable):

4.7.21

Platform (AWS, VSphere, Metal, etc.): (not relevant)

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:

Always if the said conditions are met.

Steps to Reproduce:
1. Have some nodes not ready
2. Force a change that requires machine-config-daemon daemonset rollout (I think that changing proxy settings would work for this)
3. Wait until a new kube-apiserver-to-kubelet-client-ca is rolled out by kube-apiserver-operator

Actual results:

New kube-apiserver-to-kubelet-client-ca not forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca not deployed on nodes

Expected results:

kube-apiserver-to-kubelet-client-ca forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca deployed to nodes.

Additional info:

In comments

This is a clone of issue OCPBUGS-2438. The following is the description of the original issue:

Description of problem:

On the alert details page and alerting rule details page, clicking on a field that has a popover help throws an uncaught JavaScript error.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Go to Observe > Alerting pages
2. Click on an alert (or go to the rules tab then click on a rule)
3. Click on one of the underlined fields (those that have a popover help)

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-2508. The following is the description of the original issue:

Description of problem:

Installer fails due to Neutron policy error when creating Openstack servers for OCP master nodes.

$ oc get machines -A
NAMESPACE               NAME                          PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-kwtf8-master-0         Running                               23h
openshift-machine-api   ostest-kwtf8-master-1         Running                               23h
openshift-machine-api   ostest-kwtf8-master-2         Running                               23h
openshift-machine-api   ostest-kwtf8-worker-0-g7nrw   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-lrkvb   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-vwrsk   Provisioning                          23h

$ oc -n openshift-machine-api logs machine-api-controllers-7454f5d65b-8fqx2 -c machine-controller
[...]
E1018 10:51:49.355143       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Failed to create port err: Request forbidden: [POST https://overcloud.redhat.local:13696/v2.0/ports], error message: {\"NeutronError\": {\"type\": \"PolicyNotAuthorized\", \"message\": \"(rule:create_port and (rule:create_port:allowed_address_pairs and (rule:create_port:allowed_address_pairs:ip_address and rule:create_port:allowed_address_pairs:ip_address))) is disallowed by policy\", \"detail\": \"\"}}" "name"="ostest-kwtf8-worker-0-lrkvb" "namespace"="openshift-machine-api"

Version-Release number of selected component (if applicable):

4.10.0-0.nightly-2022-10-14-023020

How reproducible:

Always

Steps to Reproduce:

1. Install 4.10 within provider networks (in primary or secondary interface)

Actual results:

Installation failure:
4.10.0-0.nightly-2022-10-14-023020: some cluster operators have not yet rolled out

Expected results:

Successful installation

Additional info:

Please find must-gather for installation on primary interface link here and for installation on secondary interface link here.

 

In order to delete the correct GCP cloud resources, the "--credentials-requests-dir" parameter must be passed to "ccoctl gcp delete". This was fixed for 4.12 as part of https://github.com/openshift/cloud-credential-operator/pull/489 but must be backported for previous releases. See https://github.com/openshift/cloud-credential-operator/pull/489#issuecomment-1248733205 for discussion regarding this bug.

To reproduce, create GCP infrastructure with a name parameter that is a subset of another set of GCP infrastructure's name parameter. I will "ccoctl gcp create all" with "name=abutcher-gcp" and "name=abutcher-gcp1".

$ ./ccoctl gcp create-all \
--name=abutcher-gcp \
--region=us-central1 \
--project=openshift-hive-dev \
--credentials-requests-dir=./credrequests

$ ./ccoctl gcp create-all \
--name=abutcher-gcp1 \
--region=us-central1 \
--project=openshift-hive-dev \
--credentials-requests-dir=./credrequests

Running "ccoctl gcp delete --name=abutcher-gcp" will result in GCP infrastructure for both "abutcher-gcp" and "abutcher-gcp1" being deleted. 

$ ./ccoctl gcp delete --name abutcher-gcp --project openshift-hive-dev
2022/10/24 11:30:06 Credentials loaded from file "/home/abutcher/.gcp/osServiceAccount.json"
2022/10/24 11:30:06 Deleted object .well-known/openid-configuration from bucket abutcher-gcp-oidc
2022/10/24 11:30:07 Deleted object keys.json from bucket abutcher-gcp-oidc
2022/10/24 11:30:07 OIDC bucket abutcher-gcp-oidc deleted
2022/10/24 11:30:09 IAM Service account abutcher-gcp-openshift-image-registry-gcs deleted
2022/10/24 11:30:10 IAM Service account abutcher-gcp-openshift-gcp-ccm deleted
2022/10/24 11:30:11 IAM Service account abutcher-gcp1-openshift-cloud-network-config-controller-gcp deleted
2022/10/24 11:30:12 IAM Service account abutcher-gcp-openshift-machine-api-gcp deleted
2022/10/24 11:30:13 IAM Service account abutcher-gcp-openshift-ingress-gcp deleted
2022/10/24 11:30:15 IAM Service account abutcher-gcp-openshift-gcp-pd-csi-driver-operator deleted
2022/10/24 11:30:16 IAM Service account abutcher-gcp1-openshift-ingress-gcp deleted
2022/10/24 11:30:17 IAM Service account abutcher-gcp1-openshift-image-registry-gcs deleted
2022/10/24 11:30:19 IAM Service account abutcher-gcp-cloud-credential-operator-gcp-ro-creds deleted
2022/10/24 11:30:20 IAM Service account abutcher-gcp1-openshift-gcp-pd-csi-driver-operator deleted
2022/10/24 11:30:21 IAM Service account abutcher-gcp1-openshift-gcp-ccm deleted
2022/10/24 11:30:22 IAM Service account abutcher-gcp1-cloud-credential-operator-gcp-ro-creds deleted
2022/10/24 11:30:24 IAM Service account abutcher-gcp1-openshift-machine-api-gcp deleted
2022/10/24 11:30:25 IAM Service account abutcher-gcp-openshift-cloud-network-config-controller-gcp deleted
2022/10/24 11:30:25 Workload identity pool abutcher-gcp deleted

 

This is a clone of issue OCPBUGS-501. The following is the description of the original issue:

Description of problem: 

Version-Release number of selected component (if applicable): 4.10.16

How reproducible: Always

Steps to Reproduce:
1. Edit the apiserver resource and add spec.audit.customRules field

$ oc get apiserver cluster -o yaml
spec:
audit:
customRules:

  • group: system:authenticated:oauth
    profile: AllRequestBodies
  • group: system:authenticated
    profile: AllRequestBodies
    profile: Default

2. Allow the kube-apiserver pods to rollout new revision.
3. Once the kube-apiserver pods are in new revision execute $ oc get dc

Actual results:

Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)

Expected results: The command "oc get dc" should display the deploymentconfig without any error.

Additional info:

Description of problem:

This is just a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2105570 for purposes of cherry-picking.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginX bug report). In such cases, the Installer errors with:

level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []

Listing container object suffers from the same issue as listing the containers and this one isn't fixed in latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and fixing it with https://github.com/gophercloud/gophercloud/issues/2510, however we likely won't be able to backport the bump to gophercloud master back to release-4.8 so we'll have to look for alternatives.

I'm setting the priority to critical as it's causing all our jobs to fail in master.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-7445. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7207. The following is the description of the original issue:

At some point in the mtu-migration development a configuration file was generated at /etc/cno/mtu-migration/config which was used as a flag to indicate to configure-ovs that a migration procedure was in progress. When that file was missing, it was assumed the migration procedure was over and configure-ovs did some cleaning on behalf of it.

But that changed and /etc/cno/mtu-migration/config is never set. That causes configure-ovs to remove mtu-migration information when the procedure is still in progress making it to use incorrect MTU values and either causing nodes to be tainted with "ovn.k8s.org/mtu-too-small" blocking the procedure itself or causing network disruption until the procedure is over.

However, this was not a problem for the CI job as it doesn't use the migration procedure as documented for the sake of saving limited time available to run CI jobs. The CI merges two steps of the procedure into one so that there is never a reboot while the procedure is in progress and hiding this issue.

This was probably not detected in QE as well for the same reason as CI.

Description of problem:

In a 4.11 cluster with only openshift-samples enabled, the 4.12 introduced optional COs console and insights are installed. While upgrading to 4.12, CVO considers them to be disabled explicitly and skips reconciling them. So these COs are not upgraded to 4.12.

Installed COs cannot be disabled, so CVO is supposed to implicitly enable them.


$ oc get clusterversion -oyaml
{
  "apiVersion": "config.openshift.io/v1",
     "kind": "ClusterVersion",
     "metadata": {
         "creationTimestamp": "2022-09-30T05:02:31Z",
         "generation": 3,
         "name": "version",
         "resourceVersion": "134808",
         "uid": "bd95473f-ffda-402d-8fe3-74f852a9d6eb"
     },
     "spec": {
         "capabilities": {
             "additionalEnabledCapabilities": [
                 "openshift-samples"
             ],
             "baselineCapabilitySet": "None"
         },
         "channel": "stable-4.11",
         "clusterID": "8eda5167-a730-4b39-be1d-214a80506d34",
         "desiredUpdate": {
             "force": true,
             "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
             "version": ""
         }
     },
     "status": {
         "availableUpdates": null,
         "capabilities": {
             "enabledCapabilities": [
                 "openshift-samples"
             ],
             "knownCapabilities": [
                 "Console",
                 "Insights",
                 "Storage",
                 "baremetal",
                 "marketplace",
                 "openshift-samples"
             ]
         },
         "conditions": [
             {
                 "lastTransitionTime": "2022-09-30T05:02:33Z",
                 "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-09-28-204419 not found in the \"stable-4.11\" channel",
                 "reason": "VersionNotFound",
                 "status": "False",
                 "type": "RetrievedUpdates"
             },
             {
                 "lastTransitionTime": "2022-09-30T05:02:33Z",
                 "message": "Capabilities match configured spec",
                 "reason": "AsExpected",
                 "status": "False",
                 "type": "ImplicitlyEnabledCapabilities"
             },
             {
                 "lastTransitionTime": "2022-09-30T05:02:33Z",
                 "message": "Payload loaded version=\"4.12.0-0.nightly-2022-09-28-204419\" image=\"registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc\" architecture=\"amd64\"",
                 "reason": "PayloadLoaded",
                 "status": "True",
                 "type": "ReleaseAccepted"
             },
             {
                 "lastTransitionTime": "2022-09-30T05:23:18Z",
                 "message": "Done applying 4.12.0-0.nightly-2022-09-28-204419",
                 "status": "True",
                 "type": "Available"
             },
             {
                 "lastTransitionTime": "2022-09-30T07:05:42Z",
                 "status": "False",
                 "type": "Failing"
             },
             {
                 "lastTransitionTime": "2022-09-30T07:41:53Z",
                 "message": "Cluster version is 4.12.0-0.nightly-2022-09-28-204419",
                 "status": "False",
                 "type": "Progressing"
             }
         ],
         "desired": {
             "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
             "version": "4.12.0-0.nightly-2022-09-28-204419"
         },
         "history": [
             {
                 "completionTime": "2022-09-30T07:41:53Z",
                 "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
                 "startedTime": "2022-09-30T06:42:01Z",
                 "state": "Completed",
                 "verified": false,
                 "version": "4.12.0-0.nightly-2022-09-28-204419"
             },
             {
                 "completionTime": "2022-09-30T05:23:18Z",
                 "image": "registry.ci.openshift.org/ocp/release@sha256:5a6f6d1bf5c752c75d7554aa927c06b5ea0880b51909e83387ee4d3bca424631",
                 "startedTime": "2022-09-30T05:02:33Z",
                 "state": "Completed",
                 "verified": false,
                 "version": "4.11.0-0.nightly-2022-09-29-191451"
             }
         ],
         "observedGeneration": 3,
         "versionHash": "CSCJ2fxM_2o="
     }
 }

$ oc get co
 NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      93m     
cloud-controller-manager                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h56m   
cloud-credential                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h59m   
cluster-autoscaler                         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h53m   
config-operator                            4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
console                                    4.11.0-0.nightly-2022-09-29-191451   True        False         False      3h45m   
control-plane-machine-set                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      117m    
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
dns                                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h53m   
etcd                                       4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h52m   
image-registry                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m   
ingress                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      151m    
insights                                   4.11.0-0.nightly-2022-09-29-191451   True        False         False      3h48m   
kube-apiserver                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h50m   
kube-controller-manager                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h51m   
kube-scheduler                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h51m   
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-28-204419   True        False         False      91m     
machine-api                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h50m   
machine-approver                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
machine-config                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h52m   
monitoring                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h44m   
network                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h55m   
node-tuning                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      113m    
openshift-apiserver                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m   
openshift-controller-manager               4.12.0-0.nightly-2022-09-28-204419   True        False         False      113m    
openshift-samples                          4.12.0-0.nightly-2022-09-28-204419   True        False         False      116m    
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m   
service-ca                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
storage                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-28-204419

How reproducible:

Always

Steps to Reproduce:

1. Install a 4.11 cluster with only openshift-samples enabled
2. Upgrade to 4.12
3.

Actual results:

The 4.12 introduced optional CO console and insights are not upgraded to 4.12

Expected results:

All the installed COs get upgraded

Additional info:

 

Description of problem:

When adding new nodes to the existing cluster, the newly allocated node-subnet can be overlapped with the existing node.

Version-Release number of selected component (if applicable):

openshift 4.10.30

How reproducible:

It's quite hard to reproduce but  there is a possibility it can happen any time. 

Steps to Reproduce:

1. Create a OVN dual-stack cluster
2. add nodes to the existing cluster
3. check the allocated node subnet 

Actual results:

Some newly added nodes have the same node-subnet and ovn-k8s-mp0 IP as some existing nodes.

Expected results:

Should have duplicated node-subnet and ovn-k8s-mp0 IP

Additional info:

Additional info can be found at the case 03329155 and the must-gather attached(comment #1) 

% omg logs ovnkube-master-v8crc -n openshift-ovn-kubernetes -c ovnkube-master | grep '2022-09-30T06:42:50.857'
2022-09-30T06:42:50.857031565Z W0930 06:42:50.857020       1 master.go:1422] Did not find any logical switches with other-config
2022-09-30T06:42:50.857112441Z I0930 06:42:50.857099       1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node worker01.ss1.samsung.local
2022-09-30T06:42:50.857122455Z I0930 06:42:50.857105       1 master.go:1003] Allocated Subnets [10.129.4.0/23 fd02:0:0:a::/64] on Node oam04.ss1.samsung.local
2022-09-30T06:42:50.857130289Z I0930 06:42:50.857122       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node worker01.ss1.samsung.local
2022-09-30T06:42:50.857140773Z I0930 06:42:50.857132       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.129.4.0/23","fd02:0:0:a::/64"]}] on node oam04.ss1.samsung.local
2022-09-30T06:42:50.857166726Z I0930 06:42:50.857156       1 master.go:1003] Allocated Subnets [10.128.2.0/23 fd02:0:0:5::/64] on Node oam01.ss1.samsung.local
2022-09-30T06:42:50.857176132Z I0930 06:42:50.857157       1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node rhel01.ss1.samsung.local
2022-09-30T06:42:50.857176132Z I0930 06:42:50.857167       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.2.0/23","fd02:0:0:5::/64"]}] on node oam01.ss1.samsung.local
2022-09-30T06:42:50.857185257Z I0930 06:42:50.857157       1 master.go:1003] Allocated Subnets [10.128.6.0/23 fd02:0:0:d::/64] on Node call03.ss1.samsung.local
2022-09-30T06:42:50.857192996Z I0930 06:42:50.857183       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node rhel01.ss1.samsung.local
2022-09-30T06:42:50.857200017Z I0930 06:42:50.857190       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.6.0/23","fd02:0:0:d::/64"]}] on node call03.ss1.samsung.local
2022-09-30T06:42:50.857282717Z I0930 06:42:50.857258       1 master.go:1003] Allocated Subnets [10.130.2.0/23 fd02:0:0:7::/64] on Node call01.ss1.samsung.local
2022-09-30T06:42:50.857304886Z I0930 06:42:50.857293       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.130.2.0/23","fd02:0:0:7::/64"]}] on node call01.ss1.samsung.local
2022-09-30T06:42:50.857338896Z I0930 06:42:50.857314       1 master.go:1003] Allocated Subnets [10.128.4.0/23 fd02:0:0:9::/64] on Node f501.ss1.samsung.local
2022-09-30T06:42:50.857349485Z I0930 06:42:50.857329       1 master.go:1003] Allocated Subnets [10.131.2.0/23 fd02:0:0:8::/64] on Node call02.ss1.samsung.local
2022-09-30T06:42:50.857371344Z I0930 06:42:50.857354       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.4.0/23","fd02:0:0:9::/64"]}] on node f501.ss1.samsung.local
2022-09-30T06:42:50.857371344Z I0930 06:42:50.857361       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.2.0/23","fd02:0:0:8::/64"]}] on node call02.ss1.samsung.local

*USER STORY:*

 As a customer or OpenShift engineer, I want to see the user agent for anything calling from OpenShift -> vSphere to eliminate troubleshooting guesswork.

*DESCRIPTION:*

A question in #forum-vmware was raised where we identified that the user-agent may not be configured for all OpenShift components calling to vSphere API.

https://coreos.slack.com/archives/CH06KMDRV/p1627368902058800

*Required:*

Audit of OpenShift components calling to vSphere API to make sure user agent strings are set appropriately.

*Nice to have:*

How can this be prevented in the future? How can we minimize maintenance costs added by new PRs/bugs reported from this spike?

*ACCEPTANCE CRITERIA:*

New PRs or bug reports for each effected component.

This is a clone of issue OCPBUGS-11993. The following is the description of the original issue:

Description of problem:

There's argument number mismatch on release_vif() call while reverting
port association.

Version-Release number of selected component (if applicable):

 

How reproducible:

It's clear in the code, no need to reproduce this.

Steps to Reproduce:

1.
2.
3.

Actual results:

TypeError

Expected results:

KuryrPort released

Additional info:

 

OCPBUGS-1251 landed an admin-ack gate in 4.11.z to help admins prepare for Kubernetes 1.25 API removals which are coming in OpenShift 4.12. Poking around in a 4.12.0-ec.2 cluster where APIRemovedInNextReleaseInUse is firing:

$ oc --as system:admin adm must-gather -- /usr/bin/gather_audit_logs
$ zgrep -h v1beta1/poddisruptionbudget must-gather.local.1378724704026451055/quay*/audit_logs/kube-apiserver/*.log.gz | jq -r '.verb + " " + (.user | .username + " " + (.extra["authentication.kubernetes.io/pod-name"] | tostr
ing))' | sort | uniq -c
parse error: Invalid numeric literal at line 29, column 6
     28 watch system:serviceaccount:openshift-machine-api:cluster-autoscaler ["cluster-autoscaler-default-5cf997b8d6-ptgg7"]

Finding the source for that container:

$ oc --as system:admin -n openshift-machine-api get -o json pod cluster-autoscaler-default-5cf997b8d6-ptgg7 | jq -r '.status.containerStatuses[].image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed | grep github
             SOURCE_GIT_URL=https://github.com/openshift/kubernetes-autoscaler
             io.openshift.build.commit.url=https://github.com/openshift/kubernetes-autoscaler/commit/1dac0311b9842958ec630273428b74703d51c1c9
             io.openshift.build.source-location=https://github.com/openshift/kubernetes-autoscaler

Poking about in the source:

$ git clone --depth 30 --branch master https://github.com/openshift/kubernetes-autoscaler.git
$ cd kubernetes-autoscaler
$ find . -name vendor
./addon-resizer/vendor
./cluster-autoscaler/vendor
./vertical-pod-autoscaler/e2e/vendor
./vertical-pod-autoscaler/vendor

Lots of vendoring. I haven't checked to see how new the client code is in the various vendor packages. But the main issue seems to be the v1beta1 in:

$ git grep policy cluster-autoscaler/core cluster-autoscaler/utils | grep policy.*v1beta1
cluster-autoscaler/core/scaledown/actuation/actuator_test.go:   policyv1beta1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/actuation/actuator_test.go:                                   eviction := createAction.GetObject().(*policyv1beta1.Eviction)
cluster-autoscaler/core/scaledown/actuation/drain.go:   policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/actuation/drain_test.go:      policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/legacy/legacy.go:     policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/legacy/wrapper.go:    policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/scaledown.go: policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/static_autoscaler_test.go:      policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/drain/drain.go:        policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/drain/drain_test.go:   policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/kubernetes/listers.go: policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/kubernetes/listers.go: v1policylister "k8s.io/client-go/listers/policy/v1beta1"

The main change from v1beta1 to v1 involves spec.selector; I dunno if that's relevant to the autoscaler use-case or not.

Do we run autoscaler CI? I was poking around a bit, but did not find a 4.12 periodic excercising the autoscaler that might have turned up this alert and issue.

While running a PerfScale test we noticed that the hosted ovnkube-master pods always initially error on deployment. They eventually succeed on retry however. 

This is running quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters and the hypershift operator is quay.io/hypershift/hypershift-operator:4.11 on a 4.11.9 management cluster.

An example of the error in the ovnkube-master container:

```

F1102 13:27:51.935600       1 ovnkube.go:133] error when trying to initialize libovsdb SB client: unable to connect to any endpoints: failed to connect to ssl:ovnkube-master-0.ovnkube-master-internal.clusters-perf-pqd-0021.svc.cluster.local:9642: failed to open connection: dial tcp 10.131.8.25:9642: connect: connection refused. failed to connect to ssl:ovnkube-master-1.ovnkube-master-internal.clusters-perf-pqd-0021.svc.cluste

```

Description of problem:
Cannot scale up worker node have deploying OCP 4.11.1 cluster via UPI on Azure

5h2m Warning FailedCreate machine/pokus-2knkh-worker-northeurope1-f6kc4 InvalidConfiguration: failed to reconcile machine "pokus-2knkh-worker-northeurope1-f6kc4": failed to create vm pokus-2knkh-worker-northeurope1-f6kc4: failure sending request for machine pokus-2knkh-worker-northeurope1-f6kc4: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=404 - Original Error: Code="NotFound" Message="The Image '/subscriptions/e639e479-2737-4b3d-b338-f1928f6429a1/resourceGroups/mlpipe-2163-azpln-rg/providers/Microsoft.Compute/images/pokus-2knkh-gen2' cannot be found in 'northeurope' region."

Customer would like to have the installer create machineset from the inital installation, therefore Kubernetes manifest files that define the worker machines were not removed during the installation.

Highlights:
Can I please let help verifying if these are the correct steps to have the initial installation created and manage the worker machines?Is there an explanation on how changing the image to -gen2 in [concat(parameters('baseName'),'-gen2')] from the 02_storage.json template can resolve the problem?
Version-Release number of selected component (if applicable):

Environment:
OCP 4.11.1 UPI install on Azure using ARM
VM size:
bootstrap: Standard_D4s_v3
master: Standard_D4s_v3

How reproducible:
Always

Steps to Reproduce:
Following the step described in the document: Installing a cluster on Azure using ARM templates .

In the install-config.yaml, worker replicas was set to 0

compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3   
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

After creating the manifests described in this step: Creating the Kubernetes manifest and Ignition config files only control plane machines manifests were removed, worker machines manifests remain untouchedAfter three masters and three worker nodes were created by ARM templates, additional worker were added using machine sets via command

oc scale --replicas=1 machineset cluster-g7rzv-worker-francecentral1 -n openshift-machine-api` 

Actual results:
No addition node visible from `oc get nodes` and the following error occur:

5h2m Warning FailedCreate machine/pokus-2knkh-worker-northeurope1-f6kc4 InvalidConfiguration: failed to reconcile machine "pokus-2knkh-worker-northeurope1-f6kc4": failed to create vm pokus-2knkh-worker-northeurope1-f6kc4: failure sending request for machine pokus-2knkh-worker-northeurope1-f6kc4: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=404 - Original Error: Code="NotFound" Message="The Image '/subscriptions/e639e479-2737-4b3d-b338-f1928f6429a1/resourceGroups/mlpipe-2163-azpln-rg/providers/Microsoft.Compute/images/pokus-2knkh-gen2' cannot be found in 'northeurope' region."

The customer found out that this can be resolved if changing the -image to -gen2 in [concat(parameters('baseName'),'-gen2')] from the 02_storage.json template

Expected results:
The installer should be able to create and manage machineset

Additional info:
SFDC case #03304526

Slack discussion, might due to MAO not able to support UPI in Azure Thread1, Thread2
 

 

 

 

 

 

 

 

 

This is a clone of issue OCPBUGS-4460. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3164. The following is the description of the original issue:

During first bootstrap boot we need crio and kubelet on the disk, so we start release-image-pivot systemd task. However, its not blocking bootkube, so these two run in parallel.

release-image-pivot restarts the node to apply new OS image, which may leave bootkube in an inconsistent state. This task should run before bootkube

Description of problem:

When scaling down the machineSet for worker nodes, a PV(vmdk) file got deleted.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

N/A

Steps to Reproduce:

1. Scale down worker nodes
2. Check VMware logs and VM gets deleted with vmdk still attached

Actual results:

After scaling down nodes, volumes still attached to the VM get deleted alongside the VM

Expected results:

Worker nodes scaled down without any accidental deletion

Additional info:

 

Description of problem:

 

During ocp multinode spoke cluster creation agent provisioning is stuck on "configuring" because machineConfig service is crashing on the node.
After restarting the service still fails with 

Can't read link "/var/lib/containers/storage/overlay/l/V2OP2CCVMKSOHK2XICC546DUCG" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption. 

Version-Release number of selected component (if applicable):

Podman 4.0.2 + 

How reproducible:

sometimes

Steps to Reproduce:

1. deploy multinode spoke (ipxe + boot order )
2.
3.

Actual results:

4 agents in done state and 1 is in "configuring"

 

Expected results:

all agents are in "done" state

Additional info:

issue mentioned in https://github.com/containers/podman/issues/14003

 

Fix: https://github.com/containers/storage/issues/1136

 

 

 

We are in the process of moving our bug tracking to JIRA. We should update the report bug link in the help menu to use JIRA instead of Bugzilla for new bugs. Opening as a medium severity bug since this only impacts prerelease OpenShift versions. For release versions, we have users open customer cases.

The static authorizer feature has landed in upstream kube-rbac-proxy. Lets use it by configuring a static authorizer for all requests that hit a /metrics endpoint.

DoD:

  • Downstream kube-rbac-proxy is synced.
  • All CMO operands are configured with static authorization.
  • Bugzillas created for all non-monitoring components using kube-rbac-proxy for metrics authn/authz.

We have created a fix in 4.12 that fetches instance type information from Azure API instead of updating the lists. We feel that backporting that fix is too risky, but agreed to update the list in older versions.

Description of problem:

Add the following instance types to azure_instance_types list[1]:

  • Standard_D8s_v5
  • Standard_E8s_v5
  • Standard_E16s_v5

Version-Release number of selected component (if applicable):
OCP 4.8

Steps to Reproduce:
1. Migrate worker/infra nodes to above mentioned (missing) v5 instance types
2. "Failed to set autoscaling from zero annotations, instance type unknown"

Actual results:

  • "Failed to set autoscaling from zero annotations, instance type unknown"
  • New v5 instance types not officially tested/supported

Expected results:
The new instance types are available in the azure_instance_types list[1] and no errors/warnings are observed after migrating:

  • Standard_D8s_v5
  • Standard_E8s_v5
  • Standard_E16s_v5

Additional info:

The related v4 instance types are already available[1] - I suspect adding the mentioned v5 instance types is a minor update:

  • Standard_D8s_v4
  • Standard_E8s_v4
  • Standard_E16s_v4

1) azure_instance_types.go
https://github.com/openshift/cluster-api-provider-azure/blob/release-4.8/pkg/cloud/azure/actuators/machineset/azure_instance_types.go

Description of problem:

After setting the appsDomain flag in the OpenShift Ingress configuration, all Routes that are generated are based off this domain.

This includes OpenShift-core Routes, which can result in failing health-checks, such as seen with the OpenShift Web Console.

According to the documentation[0], the appsDomain value should only be used by user-created Routes and is not intended to be seen when Openshift components are reconciled.

Version-Release number of selected component (if applicable):


OpenShift 4.12

How reproducible:

Everytime

Steps to Reproduce:

1. Provision a new cluster
2. Configure appsDomain value
3. Delete the OpenShift Console Route:
     oc delete routes -n openshift-console console
4. The Route is automatically reconciled

Actual results:

The re-created Route uses the appsDomain value, resulting in failed health-checks throughout the cluster.

Expected results:

The Console and all other OpenShift-core components should use the default domain.

Additional info:

Reviewing the appsDomain code, if the appsDomain value is configured this takes precedent for all Routes. [3][1][2]


Resources:
[0] https://docs.openshift.com/container-platform/4.9/networking/ingress-operator.html#nw-ingress-configuring-application-domain_configuring-ingress
[1] Sets Route Subdomain to appsdomain if it's present, otherwise uses the default domain
    https://github.com/openshift/cluster-openshift-apiserver-operator/blob/d2182396d647839f61b027602578ac31c9437e7d/pkg/operator/configobservation/ingresses/observe_ingresses.go#L16-L60
[2] Sets up the Route generator
    https://github.com/openshift/openshift-apiserver/blob/9c381fc873b35c074f7c0c485893e038828c1405/pkg/cmd/openshift-apiserver/openshiftapiserver/config.go#L217-L218
[3] Where Route.spec.host is generated:
    https://github.com/openshift/openshift-apiserver/blob/9c381fc873b35c074f7c0c485893e038828c1405/vendor/github.com/openshift/library-go/pkg/route/hostassignment/plugin.go#L22-L47


This is a clone of issue OCPBUGS-11998. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10678. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10655. The following is the description of the original issue:

Description of problem:
The dev console shows a list of samples. The user can create a sample based on a git repository. But some of these samples doesn't include a git repository reference and could not be created.

Version-Release number of selected component (if applicable):
Tested different frontend versions against a 4.11 cluster and all (oldest tested frontend was 4.8) show the sample without git repository.

But the result also depends on the installed samples operator and installed ImageStreams.

How reproducible:
Always

Steps to Reproduce:

  1. Switch to the Developer perspective
  2. Navigate to Add > All Samples
  3. Search for Jboss
  4. Click on "JBoss EAP XP 4.0 with OpenJDK 11" (for example)

Actual results:
The git repository is not filled and the create button is disabled.

Expected results:
Samples without git repositories should not be displayed in the list.

Additional info:
The Git repository is saved as "sampleRepo" in the ImageStream tag section.

This is a clone of issue OCPBUGS-12914. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12878. The following is the description of the original issue:

We want to add the dual-stack tests to the CNI plugin conformance test suite, for the currently supported releases.

(This has no impact on OpenShift itself. We're just modifying a test suite that OCP does not use.)

Description of problem:

When a pod runs to a completed state, we typically rely on the update event that will indicate to us that this pod is completed. At that point the pod IP is released and the port configuration is removed in OVN. The subsequent delete event for this pod will be ignored because it should have been cleaned up in the previous update.

However, there can be cases where the update event is missed with pod completed. In this case we will only receive a delete with pod completed event, and ignore tearing down the pod. The end result is the pod is not cleaned up in OVN and the IP address remains allocated, reducing the amount of address range available to launch another pod. This can lead to exhausting all IP addresses available for pod allocation on a node.

Version-Release number of selected component (if applicable):

4.10.24

How reproducible:

Not sure how to reproduce this. I'm guessing some lag in kapi updates can cause the completed update event and the final delete event to be combined into a single event.

Steps to Reproduce:

1.
2.
3.

Actual results:

Port still exists in OVN, IP remains allocated for a deleted pod.

Expected results:

IP should be freed, port should be removed from OVN.

Additional info:

 

Description of problem:

Data race seen in unit tests:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1448/pull-ci-openshift-ovn-kubernetes-release-4.11-unit/1604898712423763968/artifacts/test/build-log.txt
 

Description of problem:

Customer is facing issue similar to https://github.com/devfile/api/issues/897

Version-Release number of selected component (if applicable):

OCP 4.10.17

How reproducible:
N/A
Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Tried working around it with ALL_PROXY but it did not help. Note because the console operator reverts changes pretty quickly testing this was a bit of a PITA

This is a clone of issue OCPBUGS-676. The following is the description of the original issue:

the machine approver isn't recognizing hostnames that use capital letters as valid even though DNS is case-insensitive

an example of this is in OHSS-14709:

I0822 19:04:51.587266       1 controller.go:114] Reconciling CSR: csr-vdtpv
I0822 19:04:51.600941       1 csr_check.go:156] csr-vdtpv: CSR does not appear to be client csr
I0822 19:04:51.603648       1 csr_check.go:542] retrieving serving cert from ip-100-66-119-117.ec2.internal (100.66.119.117:10250)
I0822 19:04:51.604003       1 csr_check.go:181] Failed to retrieve current serving cert: dial tcp 100.66.119.117:10250: connect: connection refused
I0822 19:04:51.604017       1 csr_check.go:201] Falling back to machine-api authorization for ip-100-66-119-117.ec2.internal
E0822 19:04:51.604024       1 csr_check.go:392] csr-vdtpv: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com
I0822 19:04:51.604033       1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com
I0822 19:04:51.606777       1 controller.go:199] csr-vdtpv: CSR not authorized

This can be worked around by manually approving the CSR

The relevant line in the machine approver appears to be here: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L378

Description of problem:

The ovn-kubernetes ovnkube-master containers are continuously crashlooping since we updated to 4.11.0-0.okd-2022-10-15-073651.

Log Excerpt:

] [] []  [{kubectl-client-side-apply Update networking.k8s.io/v1 2022-09-12 12:25:06 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:kubectl.kubernetes.io/last-applied-configuration":{}}},"f:spec":{"f:ingress":{},"f:policyTypes":{}}} }]},Spec:NetworkPolicySpec{PodSelector:{map[] []},Ingress:[]NetworkPolicyIngressRule{NetworkPolicyIngressRule{Ports:[]NetworkPolicyPort{},From:[]NetworkPolicyPeer{NetworkPolicyPeer{PodSelector:&v1.LabelSelector{MatchLabels:map[string]string{access: true,},MatchExpressions:[]LabelSelectorRequirement{},},NamespaceSelector:nil,IPBlock:nil,},},},},Egress:[]NetworkPolicyEgressRule{},PolicyTypes:[Ingress],},} &NetworkPolicy{ObjectMeta:{allow-from-openshift-ingress  compsci-gradcentral  a405f843-c250-40d7-8dd4-a759f764f091 217304038 1 2022-09-22 14:36:38 +0000 UTC <nil> <nil> map[] map[] [] []  [{openshift-apiserver Update networking.k8s.io/v1 2022-09-22 14:36:38 +0000 UTC FieldsV1 {"f:spec":{"f:ingress":{},"f:policyTypes":{}}} }]},Spec:NetworkPolicySpec{PodSelector:{map[] []},Ingress:[]NetworkPolicyIngressRule{NetworkPolicyIngressRule{Ports:[]NetworkPolicyPort{},From:[]NetworkPolicyPeer{NetworkPolicyPeer{PodSelector:nil,NamespaceSelector:&v1.LabelSelector{MatchLabels:map[string]string{policy-group.network.openshift.io/ingress: ,},MatchExpressions:[]LabelSelectorRequirement{},},IPBlock:nil,},},},},Egress:[]NetworkPolicyEgressRule{},PolicyTypes:[Ingress],},}]: cannot clean up egress default deny ACL name: error in transact with ops [{Op:mutate Table:Port_Group Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6}]}}] Timeout:<nil> Where:[where column _uuid == {ccdd01bf-3009-42fb-9672-e1df38190cd7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Port_Group Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6}]}}] Timeout:<nil> Where:[where column _uuid == {10bbf229-8c1b-4c62-b36e-4ba0097722db}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:ACL Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {7b55ba0c-150f-4a63-9601-cfde25f29408}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:ACL Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {60cb946a-46e9-4623-9ba4-3cb35f018ed6}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete ACL row 7b55ba0c-150f-4a63-9601-cfde25f29408 because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete ACL row 7b55ba0c-150f-4a63-9601-cfde25f29408 because of 1 remaining reference(s)

Additional info:

https://github.com/okd-project/okd/issues/1372

Issue persisted through update to 4.11.0-0.okd-2022-10-28-153352

must-gather: https://nbc9-snips.cloud.duke.edu/snips/must-gather.local.2859117512952590880.zip

Description of problem:

Remove the self-provisioner role for the system authenticated users as per  https://access.redhat.com/solutions/4040541 to stop users from having the ability to create new projects, but the customer has found this is only partially working. It appears that when you use cluster Web UI Administrator view, the "Create Project" button is not available but switching to the default Developer view default user can create a project

Version-Release number of selected component (if applicable):

 

How reproducible:

Follow https://access.redhat.com/solutions/1529893

Steps to Reproduce:

1. oc adm policy remove-cluster-role-from-group self-provisioner system:authenticated:oauth
2. log back in as user and switch between admin/Dev view
3. User still has link showing in Dev console

Actual results:

Create new project link still exists

Expected results:

Create new project link should be removed, similar to Admin Console

 

Additional info:

Although the loink still exists, the user get's a correct permission denied message.

Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/72

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-78. The following is the description of the original issue:

Copied from an upstream issue: https://github.com/operator-framework/operator-lifecycle-manager/issues/2830

What did you do?

When attempting to reinstall an operator that uses conversion webhooks by

  • Deleting the operator subscription and any CSVs associated with it
  • Recreating the operator subscription

The resulting InstallPlan enters a failed state with message similar to

error validating existing CRs against new CRD's schema for "devworkspaces.workspace.devfile.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"workspace.devfile.io", Version:"v1alpha1", Resource:"devworkspaces"}: conversion webhook for workspace.devfile.io/v1alpha2, Kind=DevWorkspace failed: Post "https://devworkspace-controller-manager-service.test-namespace.svc:443/convert?timeout=30s": service "devworkspace-controller-manager-service" not found

When the original CSVs are deleted, the operator's main deployment and service are removed, but CRDs are left in-cluster. However, since the service/CA bundle/deployment that serve the conversion webhook are removed, conversion webhooks are broken at that point. Eventually this impacts garbage collection on the cluster as well.

This can be reproduced by installing the DevWorkspace Operator from the Red Hat catalog. (I can provide yamls/upstream images that reproduce as well, if that's helpful). It may be necessary to create a DevWorkspace in the cluster before deletion, e.g. by oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/main/samples/plain.yaml

What did you expect to see?
Operator is able to be reinstalled without removing CRDs and all instances.

What did you see instead? Under which circumstances?
It's necessary to completely remove the operator including CRDs. For our operator (DevWorkspace), this also makes uninstall especially complicated as finalizers are used (so CRDs cannot be deleted if the controller is removed, and the controller cannot be restored by reinstalling)

Environment

operator-lifecycle-manager version: 4.10.24

Kubernetes version information: Kubernetes Version: v1.23.5+012e945 (OpenShift 4.10.24)

Kubernetes cluster kind: OpenShift

This is a 4.11.z backport.  By not backporting this change originally to 4.11 we introduced a mandatory flag for `opm serve` command without at least a 2-OCP-version runway as optional.   This is impacting customers on 4.11 clusters as well as dependent peer projects like oc-mirror.

 

-----------------------------

Description of problem:

opm serve fails with message:

Error: compute digest: compute hash: write tar: stat .: os: DirFS with empty root

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

(The easiest reproducer involves serving an empty catalog)

1. mkdir /tmp/catalog

2. using Dockerfile /tmp/catalog.Dockerfile based on 4.12 docs (https://access.redhat.com/documentation/en-us/openshift_container_platform/4.12/html-single/operators/index#olm-creating-fb-catalog-image_olm-managing-custom-catalogs
# The base image is expected to contain
# /bin/opm (with a serve subcommand) and /bin/grpc_health_probe
FROM registry.redhat.io/openshift4/ose-operator-registry:v4.12

# Configure the entrypoint and command
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs"]

# Copy declarative config root into image at /configs
ADD catalog /configs

# Set DC-specific label for the location of the DC root directory
# in the image
LABEL operators.operatorframework.io.index.configs.v1=/configs

3. build the image `cd /tmp/ && docker build -f catalog.Dockerfile .`

4. execute an instance of the container in docker/podman `docker run --name cat-run [image-file]`

5. error

Using a dockerfile generated from opm (`opm generate dockerfile [dir]`) works, but includes precache and cachedir options to opm.

 

Actual results:

Error: compute digest: compute hash: write tar: stat .: os: DirFS with empty root

Expected results:

opm generates cache in default /tmp/cache location and serves without error

Additional info:

 

 

Description of problem:

release-4.11 of openshift/cloud-provider-openstack is missing some commits that were backported in upstream project into the release-1.24 branch.
We should import them in our downstream fork.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

prometheus-k8s-0 ends in CrashLoopBackOff with evel=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0" on SNO after hard reboot tests

Version-Release number of selected component (if applicable):

4.11.6

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Hard reboot node via out of band interface
3. oc -n openshift-monitoring get pods prometheus-k8s-0 

Actual results:

NAME               READY   STATUS             RESTARTS          AGE
prometheus-k8s-0   5/6     CrashLoopBackOff   125 (4m57s ago)   5h28m

Expected results:

Running

Additional info:

Attaching must-gather.

The pod recovers successfully after deleting/re-creating.


[kni@registry.kni-qe-0 ~]$ oc -n openshift-monitoring logs prometheus-k8s-0
ts=2022-09-26T14:54:01.919Z caller=main.go:552 level=info msg="Starting Prometheus Server" mode=server version="(version=2.36.2, branch=rhaos-4.11-rhel-8, revision=0d81ba04ce410df37ca2c0b1ec619e1bc02e19ef)"
ts=2022-09-26T14:54:01.919Z caller=main.go:557 level=info build_context="(go=go1.18.4, user=root@371541f17026, date=20220916-14:15:37)"
ts=2022-09-26T14:54:01.919Z caller=main.go:558 level=info host_details="(Linux 4.18.0-372.26.1.rt7.183.el8_6.x86_64 #1 SMP PREEMPT_RT Sat Aug 27 22:04:33 EDT 2022 x86_64 prometheus-k8s-0 (none))"
ts=2022-09-26T14:54:01.919Z caller=main.go:559 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-09-26T14:54:01.919Z caller=main.go:560 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-09-26T14:54:01.921Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=127.0.0.1:9090
ts=2022-09-26T14:54:01.922Z caller=main.go:989 level=info msg="Starting TSDB ..."
ts=2022-09-26T14:54:01.924Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-09-26T14:54:01.926Z caller=main.go:848 level=info msg="Stopping scrape discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:862 level=info msg="Stopping notify discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:899 level=info msg="Stopping scrape manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:858 level=info msg="Notify discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:891 level=info msg="Scrape manager stopped"
ts=2022-09-26T14:54:01.926Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:844 level=info msg="Scrape discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:1120 level=info msg="Notifier manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:1129 level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0"

This is a clone of issue OCPBUGS-4238. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3883. The following is the description of the original issue:

While doing a PerfScale test of we noticed that the ovnkube pods are not being spread out evenly among the available workers. Instead they are all stacking on a few until they fill up the available allocatable ebs volumes (25 in the case of m5 instances that we see here).

An example from partway through our 80 hosted cluster test when there were ~30 hosted clusters created/in progress

There are 24 workers available:

```

$ for i in `oc get nodes l node-role.kubernetes.io/worker=,node-role.kubernetes.io/infra!=,node-role.kubernetes.io/workload!= | egrep -v "NAME" | awk '{ print $1 }'`;    do  echo $i `oc describe node $i | grep -v openshift | grep ovnkube -c`; done
ip-10-0-129-227.us-west-2.compute.internal 0
ip-10-0-136-22.us-west-2.compute.internal 25
ip-10-0-136-29.us-west-2.compute.internal 0
ip-10-0-147-248.us-west-2.compute.internal 0
ip-10-0-150-147.us-west-2.compute.internal 0
ip-10-0-154-207.us-west-2.compute.internal 0
ip-10-0-156-0.us-west-2.compute.internal 0
ip-10-0-157-1.us-west-2.compute.internal 4
ip-10-0-160-253.us-west-2.compute.internal 0
ip-10-0-161-30.us-west-2.compute.internal 0
ip-10-0-164-98.us-west-2.compute.internal 0
ip-10-0-168-245.us-west-2.compute.internal 0
ip-10-0-170-103.us-west-2.compute.internal 0
ip-10-0-188-169.us-west-2.compute.internal 25
ip-10-0-188-194.us-west-2.compute.internal 0
ip-10-0-191-51.us-west-2.compute.internal 5
ip-10-0-192-10.us-west-2.compute.internal 0
ip-10-0-193-200.us-west-2.compute.internal 0
ip-10-0-193-27.us-west-2.compute.internal 7
ip-10-0-199-1.us-west-2.compute.internal 0
ip-10-0-203-161.us-west-2.compute.internal 0
ip-10-0-204-40.us-west-2.compute.internal 23
ip-10-0-220-164.us-west-2.compute.internal 0
ip-10-0-222-59.us-west-2.compute.internal 0

```

This is running quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters and the hypershift operator is quay.io/hypershift/hypershift-operator:4.11 on a 4.11.9 management cluster

Description of problem:

The reconciler removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources whether the pod is alive or not. 

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create pods and check the overlappingrangeipreservations.whereabouts.cni.cncf.io resources:

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
NAMESPACE          NAME                      AGE
openshift-multus   2001-1b70-820d-4b04--13   4m53s
openshift-multus   2001-1b70-820d-4b05--13   4m49s

2.  Verify that when the ip-reconciler cronjob removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources when run:

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             4d13h

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
No resources found

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5s              4d13h

 

Actual results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io resources are removed for each created pod by the ip-reconciler cronjob.
The "overlapping ranges" are not used. 

Expected results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io should not be removed regardless of if a pod has used an IP in the overlapping ranges.

Additional info:

 

This is a clone of issue OCPBUGS-3111. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2992. The following is the description of the original issue:

Description of problem:

The metal3-ironic container image in OKD fails during steps in configure-ironic.sh that look for additional Oslo configuration entries as environment variables to configure the Ironic instance. The mechanism by which it fails in OKD but not OpenShift is that the image for OpenShift happens to have unrelated variables set which match the regex, because it is based on the builder image, but the OKD image is based only on a stream8 image without these unrelated OS_ prefixed variables set.

The metal3 pod created in response to even a provisioningNetwork: Disabled Provisioning object will therefore crashloop indefinitely.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always

Steps to Reproduce:

1. Deploy OKD to a bare metal cluster using the assisted-service, with the OKD ConfigMap applied to podman play kube, as in :https://github.com/openshift/assisted-service/tree/master/deploy/podman#okd-configuration
2. Observe the state of the metal3 pod in the openshift-machine-api namespace.

Actual results:

The metal3-ironic container repeatedly exits with nonzero, with the logs ending here:

++ export IRONIC_URL_HOST=10.1.1.21
++ IRONIC_URL_HOST=10.1.1.21
++ export IRONIC_BASE_URL=https://10.1.1.21:6385
++ IRONIC_BASE_URL=https://10.1.1.21:6385
++ export IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050
++ IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050
++ '[' '!' -z '' ']'
++ '[' -f /etc/ironic/ironic.conf ']'
++ cp /etc/ironic/ironic.conf /etc/ironic/ironic.conf_orig
++ tee /etc/ironic/ironic.extra
# Options set from Environment variables
++ echo '# Options set from Environment variables'
++ env
++ grep '^OS_'
++ tee -a /etc/ironic/ironic.extra

Expected results:

The metal3-ironic container starts and the metal3 pod is reported as ready.

Additional info:

This is the PR that introduced pipefail to the downstream ironic-image, which is not yet accepted in the upstream:
https://github.com/openshift/ironic-image/pull/267/files#diff-ab2b20df06f98d48f232d90f0b7aa464704257224862780635ec45b0ce8a26d4R3

This is the line that's failing:
https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/scripts/configure-ironic.sh#L57

This is the image base that OpenShift uses for ironic-image (before rewriting in ci-operator):
https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.ocp#L9

Here is where the relevant environment variables are set in the builder images for OCP:
https://github.com/openshift/builder/blob/973602e0e576d7eccef4fc5810ba511405cd3064/hack/lib/build/version.sh#L87

Here is the final FROM line in the OKD image build (just stream8):
https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.okd#L9

This results in the following differences between the two images:
$ podman run --rm -it --entrypoint bash quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:519ac06836d972047f311de5e57914cf842716e22a1d916a771f02499e0f235c -c 'env | grep ^OS_'
OS_GIT_MINOR=11
OS_GIT_TREE_STATE=clean
OS_GIT_COMMIT=97530a7
OS_GIT_VERSION=4.11.0-202210061001.p0.g97530a7.assembly.stream-97530a7
OS_GIT_MAJOR=4
OS_GIT_PATCH=0
$ podman run --rm -it --entrypoint bash quay.io/openshift/okd-content@sha256:6b8401f8d84c4838cf0e7c598b126fdd920b6391c07c9409b1f2f17be6d6d5cb -c 'env | grep ^OS_'

Here is what the OS_ prefixed variables should be used for:
https://github.com/metal3-io/ironic-image/blob/807a120b4ce5e1675a79ebf3ee0bb817cfb1f010/README.md?plain=1#L36
https://opendev.org/openstack/oslo.config/src/commit/84478d83f87e9993625044de5cd8b4a18dfcaf5d/oslo_config/sources/_environment.py

It's worth noting that ironic.extra is not consumed anywhere, and is simply being used here to save off the variables that Oslo _might_ be consuming (it won't consume the variables that are present in the OCP builder image, though they do get caught by this regex).

With pipefail set, grep returns non-zero when it fails to find an environment variable that matches the regex, as in the case of the OKD ironic-image builds.

 

This is a clone of issue OCPBUGS-224. The following is the description of the original issue:

Description of problem:
OCP v4.9.31 cluster didn't have the $search domain in /etc/resolv.conf, which was there in the v4.8.29 OCP cluster. This was observed in all the nodes of the v4.9.31 cluster.
~~~
OpenShift 4.9.31
sh-4.4# cat /etc/resolv.conf

  1. Generated by KNI resolv prepender NM dispatcher script
    nameserver 172.xx.xx.xx
    nameserver 10.xx.xx.xx
    nameserver 10.xx.xx.xx
  2. nameserver 10.xx.xx.xx

OpenShift 4.8.29

  1. Generated by KNI resolv prepender NM dispatcher script
    search sepia.lab.iad2.dc.paas.redhat.com
    nameserver 172.xx.xx.xx
    nameserver 10.xx.xx.xx
    nameserver 10.xx.xx.xx
  2. nameserver 10.xx.xx.xx
    ~~~

ENV: OpenStack IAD2, IPI installation. Connected cluster.

Version-Release number of selected component (if applicable):
OCP v4.9.31

How reproducible:
Always

Steps to Reproduce:
1. Install IPI cluster on OpenStack IAD2 platform having cluster version 4.9.31
2. Debug to any of the node(master/worker)
3. Check and confirm the missing search domain on all nodes of the cluster.

Actual results:
The search domain was missing when checked in `/etc/resolv.conf` file on all nodes of the cluster causing serious issues in the cluster.

Expected results:
The installer should embed the search domain in /etc/resolv.conf file on all nodes of the cluster.

Additional info:

  • Cu was trying to deploy secure Kerberos on the CoreOS nodes and it failed when the IPA-client install command failed. This is when the customer noticed this unusual behavior. They did not manually update the resolv.conf file to include the $search domain. They instead added the script below to /etc/NetworkManager/dispatcher.d/ and restarted NetworkManager on the node to fix this issue and installation was successful.
    ~~~
    #!/bin/bash

set -eo pipefail

DISPATCHER_FILE="/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
DOMAINS="$(grep -E '\s*DOMAINS=.*iad2.dc.paas.redhat.com' $DISPATCHER_FILE \

grep -oE '[a-z0-9]*.dev.iad2.dc.paas.redhat.com' \
tr '\n' ' ')"

>&2 echo "IT-PaaS: overwriting search domains in /etc/resolv.conf with: $DOMAINS"

sed -e "/^search/d" \
-e "/Generated by/c# Generated by KNI resolv prepender NM dispatcher script \nsearch $DOMAINS" \
/etc/resolv.conf > /etc/resolv.tmp

mv /etc/resolv.tmp /etc/resolv.conf
~~~

  • Cu confirms that the $search domain was missing since the cluster was freshly installed/ They even confirmed this with a fresh new cluster as well that it was missing.
  • The fresh cluster was initially installed at v4.9.31 but was updated afterward to v4.9.43 (the latest z-stream) to see if the updates fixed anything but it didn't make any difference. The cluster is currently running v4.9.43 and shows the $search domain missing in the /etc/resolv.conf file on all nodes.

Description of problem:

Pipeline list page fetches all the pipelineruns to find the last pipeline run and which results in more load time. This performance issue needs to be addressed in all the pieplines list pages wherever applicable.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
Always

Steps to Reproduce:
1. Create 10+ pipelines in a namespace
2. Create more number of pipelineruns under each pipeline
3. navigate to piplines list page.

Actual results:
Pipelines list will take a long time to load the list.

Expected results:

Pipeline list should not take more time to load the list.

Additional info:

Reduce the amount to data fetched to find the last pipelinerun, maybe use PartialMetadata to find the latest pipeline run and to improve the performance.

This is a clone of issue OCPBUGS-1704. The following is the description of the original issue:

Description of problem:

According to OCP 4.11 doc (https://docs.openshift.com/container-platform/4.11/installing/installing_gcp/installing-gcp-account.html#installation-gcp-enabling-api-services_installing-gcp-account), the Service Usage API (serviceusage.googleapis.com) is an optional API service to be enabled. But, the installation cannot succeed if this API is disabled.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-25-071630

How reproducible:

Always, if the Service Usage API is disabled in the GCP project.

Steps to Reproduce:

1. Make sure the Service Usage API (serviceusage.googleapis.com) is disabled in the GCP project.
2. Try IPI installation in the GCP project. 

Actual results:

The installation would fail finally, without any worker machines launched.

Expected results:

Installation should succeed, or the OCP doc should be updated.

Additional info:

Please see the attached must-gather logs (http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/jiwei-0926-03-cnxn5/) and the sanity check results. 
FYI if enabling the API, and without changing anything else, the installation could succeed. 

This is a clone of issue OCPBUGS-17703. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17365. The following is the description of the original issue:

When we update a Secret referenced in the BareMetalHost, an immediate reconcile of the corresponding BMH is not triggered. In most states we requeue each CR after a timeout, so we should eventually see the changes.

In the case of BMC Secrets, this has been broken since the fix for OCPBUGS-1080 in 4.12.

This is a clone of issue OCPBUGS-17823. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17813. The following is the description of the original issue:

Description of problem:

GCP bootimage override is not available in 4.13, 4.12 or 4.11

Feature CORS-2445

Version-Release number of selected component (if applicable):{code:none}


How reproducible: Always

Steps to Reproduce:{code:none}
1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-18582. The following is the description of the original issue:

This is a clone of issue OCPBUGS-18257. The following is the description of the original issue:

Description of problem:

The fix for https://issues.redhat.com/browse/OCPBUGS-15947 seems to have introduced a problem in our keepalived-monitor logic. What I'm seeing is that at some point all of the apiservers became unavailable, which caused haproxy-monitor to drop the redirect firewall rule since it wasn't able to reach the API and we normally want to fall back to direct, un-loadbalanced API connectivity in that case.

However, due to the fix linked above we now short-circuit the keepalived-monitor update loop if we're unable to retrieve the node list, which is what will happen if the node holding the VIP has neither a local apiserver nor the HAProxy firewall rule. Because of this we will also skip updating the status of the firewall rule and thus the keepalived priority for the node won't be dropped appropriately.

Version-Release number of selected component (if applicable):

We backported the fix linked above to 4.11 so I expect this goes back at least that far.

How reproducible:

Unsure. It's clearly not happening every time, but I have a local dev cluster in this state so it can happen.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

I think the solution here is just to move the firewall rule check earlier in the update loop so it will have run before we try to retrieve nodes. There's no dependency on the ordering of those two steps so I don't foresee any major issues.

To workaround this I believe we can just bounce keepalived on the affected node until the VIP ends up on the node with a local apiserver.

This is a clone of issue OCPBUGS-1522. The following is the description of the original issue:

Description of problem:

Normal user cannot open the debug container from the pods(crashLoopbackoff) they created, And would be got error message:
pods "<pod name>" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-20-040107, 4.11.z, 4.10.z

How reproducible:

Always

Steps to Reproduce:

1. Login OCP as a normal user
   eg: flexy-htpasswd-provider
2. Create a project, go to Developer prespective -> +Add page
3. Click "Import from Git", and provide below data to get a Pods with CrashLoopBackOff state
   Git Repo URL: https://github.com/sclorg/nodejs-ex.git
   Name: nodejs-ex-git
   Run command: star a wktw
4. Navigate to /k8s/ns/<project name>/pods page, find the pod with CrashLoopBackOff status, and go to it details page -> Logs Tab
5. Click the link of "Debug container"
6. Check if the Debug container can be opened

Actual results:

6. Error message would be shown on page, user cannot open debug container via UI
   pods "nodejs-ex-git-6dd986d8bd-9h2wj-debug-tkqk2" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>

Expected results:

6. Normal user could use debug container without any error message

Additional info:

The debug container could be created for the normal user successfully via CommandLine
 $ oc debug <crashloopbackoff pod name> -n <project name>

This is a clone of issue OCPBUGS-1237. The following is the description of the original issue:

job=pull-ci-openshift-origin-master-e2e-gcp-builds=all

This test has started permafailing on e2e-gcp-builds:

[sig-builds][Feature:Builds][Slow] s2i build with environment file in sources Building from a template should create a image from "test-env-build.json" template and run it in a pod [apigroup:build.openshift.io][apigroup:image.openshift.io]

The error in the test says

Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:21 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8"
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Successfully pulled image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" in 1.763914719s
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Created: Created container test
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Started: Started container test
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:24 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Container image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" already present on machine
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:25 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Unhealthy: Readiness probe failed: Get "http://10.129.2.63:8080/": dial tcp 10.129.2.63:8080: connect: connection refused
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:26 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} BackOff: Back-off restarting failed container

Description of problem:

To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'

Version-Release number of selected component (if applicable):

4.11

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Already merged in https://github.com/openshift/cluster-kube-apiserver-operator/pull/1398

This is a clone of issue OCPBUGS-4897. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2500. The following is the description of the original issue:

Description of problem:

When the Ux switches to the Dev console the topology is always blank in a Project that has a large number of components.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always occurs

Steps to Reproduce:

1.Create a project with at least 12 components (Apps, Operators, knative Brokers)
2. Go to the Administrator Viewpoint
3. Switch to Developer Viewpoint/Topology
4. No components displayed
5. Click on 'fit to screen'
6. All components appear

Actual results:

Topology renders with all controls but no components visible (see screenshot 1)

Expected results:

All components should be visible

Additional info:

 

Acceptance criteria:

  • All tests (including e2e) pass
  • No regressions are introduced
  • openshift/api points to a recent commit on the master branch

This is a clone of issue OCPBUGS-3021. The following is the description of the original issue:

Description of problem:

me-west1 is not listed in the survey

Steps to Reproduce:

1. run survey (openshift-install create install-config without an install config file)
2. go through prompts until regions
3.

Actual results:

me-west1 region is missing

Expected results:

me-west1 region is listed (and install succeeds in the region)

 

During initial backporting, due to a number of other colliding commits in upstream, the cobra commands facilitating caching did not get downstreamed. 

This is to downstream those two lines.

 

This is a clone of issue OCPBUGS-13739. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13692. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13549. The following is the description of the original issue:

Description of problem:

Incorrect AWS ARN [1] is used for GovCloud and AWS China regions, which will cause the command `ccoctl aws create-all` to fail:

Failed to create Identity provider: failed to apply public access policy to the bucket ci-op-bb5dgq54-77753-oidc: MalformedPolicy: Policy has invalid resource
	status code: 400, request id: VNBZ3NYDH6YXWFZ3, host id: pHF8v7C3vr9YJdD9HWamFmRbMaOPRbHSNIDaXUuUyrgy0gKCO9DDFU/Xy8ZPmY2LCjfLQnUDmtQ=

Correct AWS ARN prefix:
GovCloud (us-gov-east-1 and us-gov-west-1): arn:aws-us-gov
AWS China (cn-north-1 and cn-northwest-1): arn:aws-cn

[1] https://github.com/openshift/cloud-credential-operator/pull/526/files#diff-1909afc64595b92551779d9be99de733f8b694cfb6e599e49454b380afc58876R211


 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-05-11-024616

How reproducible:

Always
 

Steps to Reproduce:

1. Run command: `aws create-all --name="${infra_name}" --region="${REGION}" --credentials-requests-dir="/tmp/credrequests" --output-dir="/tmp"` on GovCloud regions
2.
3.

Actual results:

Failed to create Identity provider
 

Expected results:

Create resources successfully.
 

Additional info:

Related PRs:
4.10: https://github.com/openshift/cloud-credential-operator/pull/531
4.11: https://github.com/openshift/cloud-credential-operator/pull/530
4.12: https://github.com/openshift/cloud-credential-operator/pull/529
4.13: https://github.com/openshift/cloud-credential-operator/pull/528
4.14: https://github.com/openshift/cloud-credential-operator/pull/526
 

Description of problem:

When queried dns hostname from certain pod on the certain node, responded from random coredns pod, not prefer local one. Is it expected result ?

# In OCP v4.8.13 case
// Ran dig command on the certain node which is running the following test-7cc4488d48-tqc4m pod.
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
:
07:16:33 :172.217.175.238
07:16:34 :172.217.175.238 <--- Refreshed the upstream result
07:16:36 :142.250.207.46
07:16:37 :142.250.207.46

// The dig results is matched with the running node one as you can see the above one.
$ oc rsh  test-7cc4488d48-tqc4m bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
:
07:16:35 :172.217.175.238 
07:16:36 :172.217.175.238 <--- At the same time, the pod dig result is also refreshed.
07:16:37 :142.250.207.46
07:16:38 :142.250.207.46


But in v4.10 case, in contrast, the dns query result is various and responded randomly regardless local dns results on the node as follows.

# In OCP v4.10.23 case, pod's response from DNS services are not consistent.
$ oc rsh test-848fcf8ddb-zrcbx  bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
07:23:00 :142.250.199.110
07:23:01 :142.250.207.46
07:23:02 :142.250.207.46
07:23:03 :142.250.199.110
07:23:04 :142.250.199.110
07:23:05 :172.217.161.78

# Even though the node which is running the pod keep responding the same IP...
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
07:23:00 :172.217.161.78
07:23:01 :172.217.161.78
07:23:02 :172.217.161.78
07:23:03 :172.217.161.78
07:23:04 :172.217.161.78
07:23:05 :172.217.161.78

Version-Release number of selected component (if applicable):

v4.10.23 (ROSA)
SDN: OpenShiftSDN

How reproducible:

You can always reproduce this issue using "dig google.com" from both any pod and the node the pod running according to the above "Description" details.

Steps to Reproduce:

1. Run any usual pod, and check which node the pod is running on.
2. Run dig google.com on the pod and the node.
3. Check the IP is consistent with the running node each other. 

Actual results:

The response IPs are not consistent and random IP is responded.

Expected results:

The response IP is kind of consistent, and aware of prefer local dns.

Additional info:

This issue affects EgressNetworkPolicy dnsName feature.

This is a clone of issue OCPBUGS-6850. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6503. The following is the description of the original issue:

Description of problem:

While looking into OCPBUGS-5505 I discovered that some 4.10->4.11 upgrade job runs perform an Admin Ack check, while some do not. 4.11 has a ack-4.11-kube-1.25-api-removals-in-4.12 gate, so these upgrade jobs sometimes test that Upgradeable goes false after the ugprade, and sometimes they do not. This is only determined by the polling race condition: the check is executed once per 10 minutes, and we cancel the polling after upgrade is completed. This means that in some cases we are lucky and manage to run one check before the cancel, and sometimes we are not and only check while still on the base version.

Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Jan  6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired ...

Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'

Version-Release number of selected component (if applicable):

4.11+ openshift-tests

How reproducible:

nondeterministic, wild guess is ~30% of upgrade jobs

Steps to Reproduce:

1. Inspect the E2E test log of an upgrade jobs and compare the time of the update ("Completed upgrade") with the time of the last check ( "Skipping admin ack", "Gate .* not applicable to current version", "Admin Ack verified') done by the admin ack test

Actual results:

Jan 23 00:47:43.842: INFO: Admin Ack verified
Jan 23 00:57:43.836: INFO: Admin Ack verified
Jan 23 01:07:43.839: INFO: Admin Ack verified
Jan 23 01:17:33.474: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z09ll8fw/release@sha256:322cf67dc00dd6fa4fdd25c3530e4e75800f6306bd86c4ad1418c92770d58ab8

No check done after the upgrade

Expected results:

Jan 23 00:57:37.894: INFO: Admin Ack verified
Jan 23 01:07:37.894: INFO: Admin Ack verified
Jan 23 01:16:43.618: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z8h5x1c5/release@sha256:9c4c732a0b4c2ae887c73b35685e52146518e5d2b06726465d99e6a83ccfee8d
Jan 23 01:17:57.937: INFO: Admin Ack verified

One or more checks done after upgrade

This is a clone of issue OCPBUGS-7732. The following is the description of the original issue:

Description of problem:

When services are deleted, the services controller cache should also remove the service from its top level cache to avoid growing forever.

While this is not an issue in 4.13 once the lb_cache rework merges [1], the 4.12 and older branches have this problem because that rework is meant for 4.13 only.

[1]: https://github.com/ovn-org/ovn-kubernetes/pull/3387

This is the location where alreadyApplied is not deleting the removal: 
https://github.com/openshift/ovn-kubernetes/blob/cf9fb51510e1870961bf3a0f064b73536757a4f8/go-controller/pkg/ovn/controller/services/services_controller.go#L269

It should do the similar changes depicted here (currently merged upstream):
https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/services_controller.go#L322-L324

 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. create service -- use unique name
2. remove service
3. notice how alreadyApplied grows and never gets smaller
4. repeat

Actual results:

^^

Expected results:

alreadyApplied should not grow forever

Additional info:

 

This is a clone of issue OCPBUGS-2495. The following is the description of the original issue:

Failures like:

$ oc login --token=...

Logged into "https://api..." as "..." using the token provided.

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get projects.project.openshift.io)

break login, which tries to gather information before saving the configuration, including a giant project list.

Ideally login would be able to save the successful login credentials, even when the informative gathering had difficulties. And possibly the informative gathering could be made conditional (--quiet or similar?) so expensive gathering could be skipped in use-cases where the context was not needed.

This is a clone of issue OCPBUGS-23037. The following is the description of the original issue:

This is a clone of issue OCPBUGS-23021. The following is the description of the original issue:

This is a clone of issue OCPBUGS-23006. The following is the description of the original issue:

This is a clone of issue OCPBUGS-22497. The following is the description of the original issue:

While trying to develop a demo for a Java application, that first builds using the source-to-image strategy and then uses the resulting image to copy artefacts from the s2i-builder+compiled sources-image to a slimmer runtime image using an inline Dockerfile build strategy on OpenShift, the deployment then fails since the inline Dockerfile hooks doesn't preserve the modification time of the file that gets copied. This is different to how 'docker' itself does it with a multi-stage build.

Version-Release number of selected component (if applicable):

4.12.14

How reproducible:

Always

Steps to Reproduce:

1. git clone https://github.com/jerboaa/quarkus-quickstarts
2. cd quarkus-quickstarts && git checkout ocp-bug-inline-docker
3. oc new-project quarkus-appcds-nok
4. oc process -f rest-json-quickstart/openshift/quarkus_runtime_appcds_template.yaml | oc create -f -

Actual results:

$ oc logs quarkus-rest-json-appcds-4-xc47z
INFO exec -a "java" java -XX:MaxRAMPercentage=80.0 -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -Xshare:on -XX:SharedArchiveFile=/deployments/app-cds.jsa -Dquarkus.http.host=0.0.0.0 -cp "." -jar /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar 
INFO running in /deployments
Error occurred during initialization of VM
Unable to use shared archive.
An error has occurred while processing the shared archive file.
A jar file is not the one used while building the shared archive file: rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar

Expected results:

Starting the Java application using /opt/jboss/container/java/run/run-java.sh ...
INFO exec -a "java" java -XX:MaxRAMPercentage=80.0 -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -Xshare:on -XX:SharedArchiveFile=/deployments/app-cds.jsa -Dquarkus.http.host=0.0.0.0 -cp "." -jar /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar 
INFO running in /deployments
__  ____  __  _____   ___  __ ____  ______ 
 --/ __ \/ / / / _ | / _ \/ //_/ / / / __/ 
 -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\ \   
--\___\_\____/_/ |_/_/|_/_/|_|\____/___/   
2023-10-27 18:13:01,866 INFO  [io.quarkus] (main) rest-json-quickstart 1.0.0-SNAPSHOT on JVM (powered by Quarkus 3.4.3) started in 0.966s. Listening on: http://0.0.0.0:8080
2023-10-27 18:13:01,867 INFO  [io.quarkus] (main) Profile prod activated. 
2023-10-27 18:13:01,867 INFO  [io.quarkus] (main) Installed features: [cdi, resteasy-reactive, resteasy-reactive-jackson, smallrye-context-propagation, vertx]

Additional info:

When deploying with AppCDS turned on, then we can get the pods to start and when we then look at the modified file time of the offending file we notice that these differ from the original s2i-merge-image (A) and the runtime image (B):

(A)
$ oc rsh quarkus-rest-json-appcds-s2i-1-x5hct stat /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
  File: /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
  Size: 16057039  	Blocks: 31368      IO Block: 4096   regular file
Device: 200001h/2097153d	Inode: 60146490    Links: 1
Access: (0664/-rw-rw-r--)  Uid: (  185/ default)   Gid: (    0/    root)
Access: 2023-10-27 18:11:22.000000000 +0000
Modify: 2023-10-27 18:11:22.000000000 +0000
Change: 2023-10-27 18:11:41.555586774 +0000
 Birth: 2023-10-27 18:11:41.491586774 +0000

(B)
$ oc rsh quarkus-rest-json-appcds-1-l7xw2 stat /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
  File: /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
  Size: 16057039  	Blocks: 31368      IO Block: 4096   regular file
Device: 2000a3h/2097315d	Inode: 71601163    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-10-27 18:11:44.000000000 +0000
Modify: 2023-10-27 18:11:44.000000000 +0000
Change: 2023-10-27 18:12:12.169087346 +0000
 Birth: 2023-10-27 18:12:12.114087346 +0000

Both should have 'Modify: 2023-10-27 18:11:22.000000000 +0000'.

When I perform a local s2i build of the same application sources and then use this multi-stage Dockerfile, the modify time of the files remain the same.

FROM quarkus-app-uberjar:ubi9 as s2iimg

FROM registry.access.redhat.com/ubi9/openjdk-17-runtime as final
COPY --from=s2iimg /deployments/* /deployments/
ENV JAVA_OPTS_APPEND="-XX:+UseCompressedClassPointers -XX:+UseCompressedOops -Xshare:on -XX:SharedArchiveFile=app-cds.jsa"

as shown here:

$ sudo docker run --rm -ti --entrypoint /bin/bash quarkus-app-uberjar:ubi9 -c 'stat /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar'
  File: /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
  Size: 16057020  	Blocks: 31368      IO Block: 4096   regular file
Device: 6fh/111d	Inode: 276781319   Links: 1
Access: (0664/-rw-rw-r--)  Uid: (  185/ default)   Gid: (    0/    root)
Access: 2023-10-27 15:52:28.000000000 +0000
Modify: 2023-10-27 15:52:28.000000000 +0000
Change: 2023-10-27 15:52:37.352926632 +0000
 Birth: 2023-10-27 15:52:37.288926109 +0000
$ sudo docker run --rm -ti --entrypoint /bin/bash quarkus-cds-app -c 'stat /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar'
  File: /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
  Size: 16057020  	Blocks: 31368      IO Block: 4096   regular file
Device: 6fh/111d	Inode: 14916403    Links: 1
Access: (0664/-rw-rw-r--)  Uid: (  185/ default)   Gid: (    0/    root)
Access: 2023-10-27 15:52:28.000000000 +0000
Modify: 2023-10-27 15:52:28.000000000 +0000
Change: 2023-10-27 15:53:04.408147760 +0000
 Birth: 2023-10-27 15:53:04.346147253 +0000

Both have a modified file time of 2023-10-27 15:52:28.000000000 +0000

This is a clone of issue OCPBUGS-5185. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5165. The following is the description of the original issue:

Currently, the Dev Sandbox clusters sends the clusterType "OSD" instead of "DEVSANDBOX" because the configuration annotations of the console config are automatically overridden by some SyncSets.

Open Dev Sandbox and browser console and inspect window.SERVER_FLAGS.telemetry

This is a clone of issue OCPBUGS-7960. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7780. The following is the description of the original issue:

Description of problem:

4.9 and 4.10 oc calls to oc adm upgrade channel ... for 4.11+ clusters would clear spec.capabilities. Not all that many clusters try to restrict capabilities, but folks will need to bump their channel for at least every other minor (if their using EUS channels), and while we recommend folks use an oc from the 4.y they're heading towards, we don't have anything in place to enforce that.

Version-Release number of selected component (if applicable):

4.9 and 4.10 oc are exposed vs. the new-in-4.11 spec.capabilities. Newer oc could theoretically be exposed vs. any new ClusterVersion spec capabilities.

How reproducible:

100%

Steps to Reproduce:

1. Install a 4.11+ cluster with None capabilities.
2. Set the channel with a 4.10.51 oc, like oc adm upgrade channel fast-4.11.
3. Check the capabilities with oc get -o json clusterversion version | jq -c .spec.capabilities.

Actual results:

null

Expected results:

{"baselineCapabilitySet":"None"}

This is a clone of issue OCPBUGS-15482. The following is the description of the original issue:

The 4.12 builds fail all the time. Last successfully build was from May 31.

Error:

# Root Suite.Entire pipeline flow from Builder page "before all" hook for "Background Steps"
AssertionError: Timed out retrying after 80000ms: Expected to find element: `[data-test-id="PipelineResource"]`, but never found it.

Full error:

  Running:  e2e/pipeline-ci.feature                                                         (1 of 1)
Couldn't determine Mocha version


  Logging in as kubeadmin
      Installing operator: "Red Hat OpenShift Pipelines"
      Operator Red Hat OpenShift Pipelines was not yet installed.
      Performing Pipelines post-installation steps
      Verify the CRD's for the "Red Hat OpenShift Pipelines"
  1) "before all" hook for "Background Steps"
      Deleting "" namespace

  0 passing (3m)
  1 failing

  1) Entire pipeline flow from Builder page
       "before all" hook for "Background Steps":
     AssertionError: Timed out retrying after 80000ms: Expected to find element: `[data-test-id="PipelineResource"]`, but never found it.

Because this error occurred during a `before all` hook we are skipping all of the remaining tests.
      at ../../dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts.exports.waitForCRDs (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:17156:77)
      at performPostInstallationSteps (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:17242:21)
      at ../../dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts.exports.verifyAndInstallOperator (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:17268:5)
      at ../../dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts.exports.verifyAndInstallPipelinesOperator (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:17272:13)
      at Context.eval (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:20848:13)



[mochawesome] Report JSON saved to /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress_report_pipelines.json


  (Results)

  ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Tests:        13                                                                               │
  │ Passing:      0                                                                                │
  │ Failing:      1                                                                                │
  │ Pending:      0                                                                                │
  │ Skipped:      12                                                                               │
  │ Screenshots:  1                                                                                │
  │ Video:        true                                                                             │
  │ Duration:     2 minutes, 58 seconds                                                            │
  │ Spec Ran:     e2e/pipeline-ci.feature                                                          │
  └────────────────────────────────────────────────────────────────────────────────────────────────┘


  (Screenshots)

  -  /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree     (1280x720)
     nshots/e2e/pipeline-ci.feature/Background Steps -- before all hook (failed).png                


  (Video)

  -  Started processing:  Compressing to 32 CRF                                                     
  -  Finished processing: /go/src/github.com/openshift/console/frontend/gui_test_scre   (16 seconds)
                          enshots/cypress/videos/e2e/pipeline-ci.feature.mp4                        

    Compression progress:  100%

====================================================================================================

  (Run Finished)


       Spec                                              Tests  Passing  Failing  Pending  Skipped  
  ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ ✖  e2e/pipeline-ci.feature                  02:58       13        -        1        -       12 │
  └────────────────────────────────────────────────────────────────────────────────────────────────┘
    ✖  1 of 1 failed (100%)                     02:58       13        -        1        -       12  

See also

  1. https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-console-release-4.12-e2e-gcp-console
  2. https://search.ci.openshift.org/?search=Expected+to+find+element&maxAge=336h&context=1&type=all&name=pull-ci-openshift-console-release-4.12-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job (not exact match, but couldn't create a better filter)

Description of problem:

TO address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'

Version-Release number of selected component (if applicable):

4.11

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-14152. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14127. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14125. The following is the description of the original issue:

Description of problem:

Since registry.centos.org is closed, tests relying on this registry in e2e-agnostic-ovn-cmd job are failing.

Version-Release number of selected component (if applicable):

all

How reproducible:

Trigger e2e-agnostic-ovn-cmd job

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When creating a pod with an additional network that contains a `spec.config.ipam.exclude` range, any address within the excluded range is still iterated while searching for a suitable IP candidate. As a result, pod creation times out when large exclude ranges are used.

Version-Release number of selected component (if applicable):

 

How reproducible:

with big exclude ranges, 100%

Steps to Reproduce:

1. create network-attachment-definition with a large range:

$ cat <<EOF| oc apply -f -       
apiVersion: k8s.cni.cncf.io/v1                                            
kind: NetworkAttachmentDefinition
metadata:
  name: nad-w-excludes
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "macvlan-net",
      "type": "macvlan",
      "master": "ens3",
      "mode": "bridge",
      "ipam": {
         "type": "whereabouts",
         "range": "fd43:01f1:3daa:0baa::/64",
         "exclude": [ "fd43:01f1:3daa:0baa::/100" ],
         "log_file": "/tmp/whereabouts.log",
         "log_level" : "debug"
      }
    }
EOF
2. create a pod with the network attached:

$ cat <<EOF|oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-exclude-range
  annotations:
    k8s.v1.cni.cncf.io/networks: nad-w-excludes
spec:
  containers:
  - name: pod-1
    image: openshift/hello-openshift
EOF

3. check pod status, event log and whereabouts logs after a while: 

$ oc get pods
NAME                        READY   STATUS              RESTARTS   AGE
pod-with-exclude-range      0/1     ContainerCreating   0          2m23s

$ oc get events
<...>
6m39s       Normal    Scheduled                                    pod/pod-with-exclude-range                   Successfully assigned default/pod-with-exclude-range to <worker-node>
6m37s       Normal    AddedInterface                               pod/pod-with-exclude-range                   Add eth0 [10.129.2.49/23] from openshift-sdn
2m39s       Warning   FailedCreatePodSandBox                       pod/pod-with-exclude-range                   Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

$ oc debug node/<worker-node> - tail /host/tmp/whereabouts.log
Starting pod/<worker-node>-debug ...
To use host binaries, run `chroot /host`
2022-10-27T14:14:50Z [debug] Finished leader election
2022-10-27T14:14:50Z [debug] IPManagement: {fd43:1f1:3daa:baa::1 ffffffffffffffff0000000000000000} , <nil>
2022-10-27T14:14:59Z [debug] Used defaults from parsed flat file config @ /etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.conf
2022-10-27T14:14:59Z [debug] ADD - IPAM configuration successfully read: {Name:macvlan-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[fd43:01f1:3daa:0baa::/80] DNS: {Nameservers:[] Domain: Search:[] Options:[]} Range:fd43:1f1:3daa:baa::/64 RangeStart:fd43:1f1:3daa:baa:: RangeEnd:<nil> GatewayStr: EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LeaderLeaseDuration:1500 LeaderRenewDeadline:1000 LeaderRetryPeriod:500 LogFile:/tmp/whereabouts.log LogLevel:debug OverlappingRanges:true SleepForRace:0 Gateway:<nil> Kubernetes: {KubeConfigPath:/etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:PodName:pod-with-exclude-range PodNamespace:default} 
2022-10-27T14:14:59Z [debug] Beginning IPAM for ContainerID: f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82
2022-10-27T14:14:59Z [debug] Started leader election
2022-10-27T14:14:59Z [debug] OnStartedLeading() called
2022-10-27T14:14:59Z [debug] Elected as leader, do processing
2022-10-27T14:14:59Z [debug] IPManagement - mode: 0 / containerID:f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 / podRef: default/pod-with-exclude-range
2022-10-27T14:14:59Z [debug] IterateForAssignment input >> ip: fd43:1f1:3daa:baa:: | ipnet: {fd43:1f1:3daa:baa:: ffffffffffffffff0000000000000000} | first IP: fd43:1f1:3daa:baa::1 | last IP: fd43:1f1:3daa:baa:ffff:ffff:ffff:ffff

Actual results:

Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Expected results:

additional network gets attached to the pod

Additional info:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

This is a clone of issue OCPBUGS-18273. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17876. The following is the description of the original issue:

This is a clone of issue OCPBUGS-16374. The following is the description of the original issue:

Description of problem:

The topology page is crashed 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Visit developer console
2. Topology view
3.

Actual results:

Error message:
TypeError
Description:
e is null
Component trace:
f@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~app/code-refs/actions~delete-revision~dev-console-add~dev-console-deployImage~dev-console-ed~cf101ec3-chunk-5018ae746e2320e4e737.min.js:26:14244
5363/t.a@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:177913
u@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:275718
8248/t.a<@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:475504
i@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:470135
withFallback()
5174/t.default@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:78258
s@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:237096
[...]
ne<@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1592411
r@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:36:125397
t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:58042
t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:60087
t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:54647
re@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1592722
t.a@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:791129
t.a@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1062384
s@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:613567
t.a@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:141:244663

Expected results:

No error should be there

Additional info:

Cloud Pak Operator is installed 

Description of problem:

When running node-density (245 pods/node) on a 120 node cluster, we see that there is a huge spike (~22s) in Avg pod-latency. When the spike occurs we see all the ovnkube-master pods go through a restart. 

The restart happens because of (ovnkube-master pods)

2022-08-10T04:04:44.494945179Z panic: reflect: call of reflect.Value.Len on ptr Value

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-09-114621

How reproducible:

Steps to Reproduce:
1. Run node-density on a 120 node cluster

Actual results:

Spike observed in pod-latency graph ~22s

Expected results:

Steady pod-latency graph ~4s

Additional info:

This is a clone of OCPBUGS-853.

Description of problem:

Large OpenShift Container Platform 4.10.24 - Cluster is failing to update router-certs secret in openshift-config-managed namespace as the given secret is too big.

2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266  Reconciler error  {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"}

The OpenShift Container Platform 4 - Cluster has 180 IngressController configured with endpointPublishingStrategy set to private.

Now the default certificate needs to be replaced but is not properly replicated to openshift-authentication namespace and potentially other location because of the problem mentioned (since the required secret can not be updated)

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.10.24

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.10
2. Create 180 IngressController with specific certificates
3. Check openshift-ingress-operator logs to see how it fails to update/create the necessary secret in openshift-config-managed

Actual results:

2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266  Reconciler error  {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"}

Expected results:

No matter how many IngressController is created, secret management taken care by Operators need to work, even if data exceed 1 MB size limitation. In that case an approach needs to exist to split data into multiple secrets or handle it otherwise.

Additional info:

 

This is a clone of issue OCPBUGS-11404. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11333. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10690. The following is the description of the original issue:

Description of problem:

according to PR: https://github.com/openshift/cluster-monitoring-operator/pull/1824, startupProbe for UWM prometheus/platform prometheus should be 1 hour, but startupProbe for UWM prometheus is still 15m after enabled UWM, platform promethues does not have issue, startupProbe is increased to 1 hour

$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 60
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
...

$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 240
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3

 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-19-052243

How reproducible:

always

Steps to Reproduce:

1. enable UWM, check startupProbe for UWM prometheus/platform prometheus
2.
3.

Actual results:

startupProbe for UWM prometheus is still 15m

Expected results:

startupProbe for UWM prometheus should be 1 hour

Additional info:

since startupProbe for platform prometheus is increased to 1 hour, and no similar bug for UWM prometheus, won't fix the issue is OK.

Description of problem:

Whenever one runs ovnkube-trace from an in-cluster pod to a pod in the host network that is in different node, the following spurious error appears despite of the underlying ovn-trace being correct:

ovn-trace indicates failure from ingress-canary-7zhxs to router-default-6758fb465c-s66rv - output to "k8s-worker-0.example.redhat.com" not matched

This is caused because as per[1], if the destination pod is in host network, the outport is expected to be of the form "k8s-${NODE_NAME}", which is true only if either in local gateway or if the source pod is in the same node than the destination pod.

This is already fixed in the master branch[2], but we would need this to be backported to previous releases. 

Version-Release number of selected component (if applicable):

4.11.4

How reproducible:

Always

Steps to Reproduce:

1. ovnkube-trace from pod in the SDN to pod in host network
2.
3.

Actual results:

Wrong error

Expected results:

No wrong error

Additional info:

References:
[1] - https://github.com/openshift/ovn-kubernetes/blob/release-4.11/go-controller/cmd/ovnkube-trace/ovnkube-trace.go#L771-L777
[2] - https://github.com/openshift/ovn-kubernetes/blob/master/go-controller/cmd/ovnkube-trace/ovnkube-trace.go#L755-L769

As discussed previously by email, customer support case 03211616 requests a means to use the latest patch version of a given X.Y golang via imagestreams, with the blocking issue being the lack of X.Y tags for the go-toolset containers on RHCC.

The latter has now been fixed with the latest version also getting a :1.17 tag, and the imagestream source has been modified accordingly, which will get picked up in 4.12. We can now fix this in 4.11.z by backporting this to the imagestream files bundled in cluster-samples-operator.

/cc Ian Watson Feny Mehta

Description of problem:

The path used by --rotated-pod-logs to gather the rotated pod logs from /var/log/pods node folder via /api/v1/nodes/${NODE}/proxy/logs/${LOG_PATH} is only valid for regular pods but not for static pods.

The main problem is that, while normal pods have their rotated logs at this /var/log/pods/${POD_NAME}_${POD_UID_IN_API}/${CONTAINER_NAME}, static pods have them at /var/log/pods/${POD_NAME}_${CONFIG_HASH}/${CONTAINER_NAME} because the UID cannot be known at the time that the static pod is born (because static pods are created by kubelet before registering them in the kube-apiserver, and UID is assigned by the kube-apiserver).

The visible results of that are:

  • Spurious errors of not found resources related to the pods.
  • Rotated pod logs are not gathered even if present.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Always if there are static pods.

Steps to Reproduce:

1. oc adm inspect --rotated-pod-logs ns/openshift-etcd (or any other project with static pods).

Actual results:

  • Rotated pods not gathered.
  • Errors like these
    error: errors occurred while gathering data:
        one or more errors occurred while gathering pod-specific data for namespace: openshift-etcd
    
        [one or more errors occurred while gathering container data for pod etcd-master-0.example.net:
    
        the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-1.example.net:
    
        the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-2.example.net:
    
        the server could not find the requested resource]
    

Expected results:

No errors like the ones above and rotated pod logs to be gathered, if present.

Additional info:

Despite being marked as experimental, this --rotated-pod-logs is used in must-gather, so this issue can be easily reproduced by just running a default must-gather. I focused on bare oc adm inspect reproducers for simplicity.

This is a clone of issue OCPBUGS-13811. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10816. The following is the description of the original issue:

Description of problem:

We have observed a situation where:
- A workload mounting multiple EBS volumes gets stuck in a Terminating state when it finishes.
- The node that the workload ran on eventually gets stuck draining, because it gets stuck on unmounting one of the volumes from that workload, despite no containers from the workload now running on the node.

What we observe via the node logs is that the volume seems to unmount successfully. Then it attempts to unmount a second time, unsuccessfully. This unmount attempt then repeats and holds up the node.

Specific examples from the node's logs to illustrate this will be included in a private comment. 

Version-Release number of selected component (if applicable):

4.11.5

How reproducible:

Has occurred on four separate nodes on one specific cluster, but the mechanism to reproduce it is not known.

Steps to Reproduce:

1.
2.
3.

Actual results:

A volume gets stuck unmounting, holding up removal of the node and completed deletion of the pod.

Expected results:

The volume should not get stuck unmounting.

Additional info:

 

An RW mutex was introduced to the project auth cache with https://github.com/openshift/openshift-apiserver/pull/267, taking exclusive access during cache syncs. On clusters with extremely high object counts for namespaces and RBAC, syncs appear to be extremely slow (on the order of several minutes). The project LIST handler acquires the same mutex in shared mode as part of its critical path.

This is a clone of issue OCPBUGS-533. The following is the description of the original issue:

Description of problem:

customer is using Azure AD as openid provider and groups synchronization from the provider.

The scenario is the following:

1)

  • user A login.
    groups are created with the membership.
    User A is member of a group with admin rights and it's cluster-admin

2)

  • user B login:
    groups are updated with membership
    UserB is also member of the group with admin rights and it's cluster admin

3)

  • user A login:
    groups are identical as in the former step.
    user A has no administration rights.

The groups memberships are the same in step 2 and 3.
The cluster role bindings of the groups have never changed.

the only way to have user A again the admin rights is to delete the membership from the group and have user A login again.

I have not managed to reproduce this using RH SSO. Neither Azure AD.

But my configuration is not exactly the same yet.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Description of problem:

To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'

Version-Release number of selected component (if applicable):

4.11

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Already merged in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/660

Clone of https://bugzilla.redhat.com/show_bug.cgi?id=2106803 to backport the e2e fix to 4.11 and 4.10.

Description of problem: E2E: intermittent failure is seen on tests for devfile due to network call to devfile registry

Deploy git workload with devfile from topology page: A-04-TC01

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/11726/pull-ci-openshift-console-master-e2e-gcp-console/1547046629775773696/artifacts/e2e-gcp-console/test/artifacts/gui_test_screenshots/cypress/screenshots/e2e/add-flow-ci.feature/Deploy%20git%20workload%20with%20devfile%20from%20topology%20page%20A-04-TC01%20--%20after%20each%20hook%20(failed).png

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/11768/pull-ci-openshift-console-master-e2e-gcp-console/1547061730478133248

Version-Release number of selected component (if applicable):

How reproducible: Intermittent

Steps to Reproduce:
1. Run test for add-flow-ci.feature to test Deploy git workload with devfile from topology page: A-04-TC01

Actual results:

Expected results: Show always pass

Additional info:

The sample templates in oc don't use the correct namespace which makes the following build test fail in CI:

[sig-builds][Feature:JenkinsRHELImagesOnly][Feature:Jenkins][Feature:Builds][sig-devex][Slow] openshift pipeline build jenkins pipeline build config strategy using a jenkins instance launched with the ephemeral template

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/400/pull-ci-openshift-openshift-apiserver-release-4.11-e2e-aws-builds/1727635797814808576

When a thin provisioned COW format disk is created on OCP on RHV via CSI driver (a PVC -
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml

  • with for example requested storage 100 GB), the go-ovirt-client behaviour makes it so that the created disk has virtual size 100 GB and it's actual size is 110 GB.

But this is thin provisioned disk, so the initial size of the disk should be default of the engine and then grow as needed, it shouldn't be this big.

This causes all the disks created this way to be functionally preallocated (since it eats all that space), which is a real waste of space.

How reproducible: 100%

Steps to Reproduce:
1. Create a storage claim (PVC) in Openshift (
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml
) using the default storage class (or any other storage class with thinProvisioning: "true") and with requested storage i.e. 100Gi

$ oc create -f storage-claim.yaml

2. In the RHV web console navigate to Storage -> Disks and check Virtual size and Actual size of the created disk (PVC)

Actual results:
Disk from our example with requested storage 100GB reports virtual size 100GB and actual size 110 GB.

Expected results:
Thin provisioned disks should start with small initial size and then grow as needed, so its actual size should be considerably smaller (the default initial size set by the engine should be 2.5 GB if I'm not mistaken).

Note: The extra 10GB in the actual size are caused by overhead for the qcow2 disk format, which is 10%, and this was tracked here as a separate issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2097139

Description of problem:

Upgrade OCP 4.11 --> 4.12 fails with one 'NotReady,SchedulingDisabled' node and MachineConfigDaemonFailed.

Version-Release number of selected component (if applicable):

Upgrade from OCP 4.11.0-0.nightly-2022-09-19-214532 on top of OSP RHOS-16.2-RHEL-8-20220804.n.1 to 4.12.0-0.nightly-2022-09-20-040107.

Network Type: OVNKubernetes

How reproducible:

Twice out of two attempts.

Steps to Reproduce:

1. Install OCP 4.11.0-0.nightly-2022-09-19-214532 (IPI) on top of OSP RHOS-16.2-RHEL-8-20220804.n.1.
   The cluster is up and running with three workers:
   $ oc get clusterversion
   NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
   version   4.11.0-0.nightly-2022-09-19-214532   True        False         51m     Cluster version is 4.11.0-0.nightly-2022-09-19-214532

2. Run the OC command to upgrade to 4.12.0-0.nightly-2022-09-20-040107:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 --allow-explicit-upgrade --force=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 

3. The upgrade is not succeeds: [0]
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-19-214532   True        True          17h     Unable to apply 4.12.0-0.nightly-2022-09-20-040107: wait has exceeded 40 minutes for these operators: network

One node degrided to 'NotReady,SchedulingDisabled' status:
$ oc get nodes
NAME                          STATUS                        ROLES    AGE   VERSION
ostest-9vllk-master-0         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-1         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-2         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-worker-0-4x4pt   NotReady,SchedulingDisabled   worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-h6kcs   Ready                         worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-xhz9b   Ready                         worker   18h   v1.24.0+3882f8f

$ oc get pods -A | grep -v -e Completed -e Running
NAMESPACE                                          NAME                                                         READY   STATUS      RESTARTS       AGE
openshift-openstack-infra                          coredns-ostest-9vllk-worker-0-4x4pt                          0/2     Init:0/1    0              18h
 
$ oc get events
LAST SEEN   TYPE      REASON                                        OBJECT            MESSAGE
7m15s       Warning   OperatorDegraded: MachineConfigDaemonFailed   /machine-config   Unable to apply 4.12.0-0.nightly-2022-09-20-040107: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
7m15s       Warning   MachineConfigDaemonFailed                     /machine-config   Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
baremetal                                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cloud-controller-manager                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cloud-credential                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cluster-autoscaler                         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
config-operator                            4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
console                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
control-plane-machine-set                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
dns                                        4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd                                       4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
image-registry                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     Progressing: The registry is ready...
ingress                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
insights                                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-apiserver                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     NodeInstallerProgressing: 1 nodes are at revision 11; 2 nodes are at revision 13
kube-controller-manager                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-scheduler                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-api                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-approver                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-config                             4.11.0-0.nightly-2022-09-19-214532   False       True          True       16h     Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
monitoring                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
network                                    4.12.0-0.nightly-2022-09-20-040107   True        True          True       19h     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-09-20T14:16:13Z...
node-tuning                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
openshift-apiserver                        4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
openshift-controller-manager               4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
openshift-samples                          4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
service-ca                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
storage                                    4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     ManilaCSIDriverOperatorCRProgressing: ManilaDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...

[0] http://pastebin.test.redhat.com/1074531

Actual results:

OCP 4.11 --> 4.12 upgrade fails.

Expected results:

OCP 4.11 --> 4.12 upgrade success.

Additional info:

Attached logs of the NotReady node - [^journalctl_ostest-9vllk-worker-0-4x4pt.log.tar.gz]

This is a clone of issue OCPBUGS-3889. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3744. The following is the description of the original issue:

Description of problem:

Egress router POD creation on Openshift 4.11 is failing with below error.
~~~
Nov 15 21:51:29 pltocpwn03 hyperkube[3237]: E1115 21:51:29.467436    3237 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\\\": rpc error: code = Unknown desc = failed to create pod network sandbox k8s_stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy_c965a287-28aa-47b6-9e79-0cc0e209fcf2_0(72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344): error adding pod stage-wfe-proxy_stage-wfe-proxy-ext-qrhjw to CNI network \\\"multus-cni-network\\\": plugin type=\\\"multus\\\" name=\\\"multus-cni-network\\\" failed (add): [stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw/c965a287-28aa-47b6-9e79-0cc0e209fcf2:openshift-sdn]: error adding container to network \\\"openshift-sdn\\\": CNI request failed with status 400: 'could not open netns \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": unknown FS magic on \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": 1021994\\n'\"" pod="stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw" podUID=c965a287-28aa-47b6-9e79-0cc0e209fcf2
~~~

I have checked SDN POD log from node where egress router POD is failing and I could see below error message.

~~~
2022-11-15T21:51:29.283002590Z W1115 21:51:29.282954  181720 pod.go:296] CNI_ADD stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw failed: could not open netns "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": unknown FS magic on "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": 1021994
~~~

Crio is logging below event and looking at the log it seems the namespace has been created on node.

~~~
Nov 15 21:51:29 pltocpwn03 crio[3150]: time="2022-11-15 21:51:29.307184956Z" level=info msg="Got pod network &{Name:stage-wfe-proxy-ext-qrhjw Namespace:stage-wfe-proxy ID:72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344 UID:c965a287-28aa-47b6-9e79-0cc0e209fcf2 NetNS:/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
~~~

Version-Release number of selected component (if applicable):

4.11.12

How reproducible:

Not Sure

Steps to Reproduce:

1.
2.
3.

Actual results:

Egress router POD is failing to create. Sample application could be created without any issue.

Expected results:

Egress router POD should get created

Additional info:

Egress router POD is created following below document and it does contain pod.network.openshift.io/assign-macvlan: "true" annotation.

https://docs.openshift.com/container-platform/4.11/networking/openshift_sdn/deploying-egress-router-layer3-redirection.html#nw-egress-router-pod_deploying-egress-router-layer3-redirection

This is a clone of issue OCPBUGS-4250. The following is the description of the original issue:

Description of problem:

Added a script to collect PodNetworkConnectivityChecks to able to view the overall status of the pod network connectivity.

Current must-gather collects the contents of `openshift-network-diagnostics` but does not collect the PodNetworkConnectivityCheck.

Version-Release number of selected component (if applicable):

4.12, 4.11, 4.10

Description of problem:

Whereabouts doesn't allow the use of network interface names that are not preceded by the prefix "net", see https://github.com/k8snetworkplumbingwg/whereabouts/issues/130.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Define two Pods, one with the interface name 'port1' and the other with 'net-port1':

test-ip-removal-port1:	
              k8s.v1.cni.cncf.io/networks:
                [
                  {
                    "name": "test-sriovnd",
                    "interface": "port1",
                    "namespace": "default"
                  }
                ]

test-ip-removal-net-port1:
              k8s.v1.cni.cncf.io/networks:
                [
                  {
                    "name": "test-sriovnd",
                    "interface": "net-port1",
                    "namespace": "default"
                  }
                ]

2. IP allocated in the IPPool:

kind: IPPool
...
spec:
  allocations:
    "16":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-port1
    "17":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-net-port1

3. When the ip-reconciler job is run, the allocation for the port with the interface name 'port1' is removed:

[13:29][]$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             11d

[13:29][]$ oc get ippools.whereabouts.cni.cncf.io -n openshift-multus   2001-1b70-820d-2610---64 -o yaml
apiVersion: whereabouts.cni.cncf.io/v1alpha1
kind: IPPool
metadata:
...
spec:
  allocations:
    "17":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-net-port1
  range: 2001:1b70:820d:2610::/64

[13:30][]$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        9s              11d

 

Actual results:

The network interface with a name that doesn't have a 'net' prefix is removed from the ip-reconciler cronjob.

Expected results:

The network interface must not be removed, regardless of the name.

Additional info:

Upstream PR @ https://github.com/k8snetworkplumbingwg/whereabouts/pull/147 master PR @ https://github.com/openshift/whereabouts-cni/pull/94

This is a clone of issue OCPBUGS-2075. The following is the description of the original issue:

Description of problem:

We got a feedback from the support team that it is confusing to see switch in the Notifications column for the Alerting rule which have no alerts associated to it as user can not silence the Alerting rule. 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. oc apply -f https://gist.githubusercontent.com/vikram-raj/727629797eb9d9bfcfa2721cae2ade86/raw/7c2305e14115a1a4f4f88ebb74cdad32cbec4132/Alerting%2520rule%2520without%2520alert 
2. navigate to the Developer perspective Observe -> Alerts
3. Try to silence the VersionAlert alerting rule, nothing will happen 

Actual results:

Silence the alerting rule using the switch will do nothing

Expected results:

No switch for silence the alerting rule should be visible if no alerts are associated to the alerting rule.

Additional info:

 

This is a clone of issue OCPBUGS-11208. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11054. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11038. The following is the description of the original issue:

Description of problem:

Backport support starting in 4.12.z to a new GCP region europe-west12

Version-Release number of selected component (if applicable):

4.12.z and 4.13.z

How reproducible:

Always

Steps to Reproduce:

1. Use openhift-install to deploy OCP in europe-west12

Actual results:

europe-west12 is not available as a supported region in the user survey

Expected results:

europe-west12 to be available as a supported region in the user survey

Additional info:

 

Description of problem:

On the Machinesets, we configured the Azure tags, that should be assigned to the newly created nodes. VMs and disks have that tags assigned, while NICs - don't have configured Azsure tags assigned to them.

Version-Release number of selected component (if applicable):


OCP 4.11

How reproducible:


It can be reproducible

Steps to Reproduce:


1. We need to acquire Azure tags
2. Create machine set configs with Azure tags configured
3. Create VMs through the machine set

Actual results:


NICs, created by the Machinesets don't have Azure tags, configured on the Machineset.

Expected results:


NiCs should automatically pick up these tags.

Additional info:


As in Azure NICs can be treated as separate resources. there is a possibility if we assign the tags for the NICs in the main machine config file. it may work.

This is a clone of issue OCPBUGS-6018. The following is the description of the original issue:

This is a public clone of OCPBUGS-3821

The MCO can sometimes render a rendered-config in the middle of an upgrade with old MCs, e.g.:

  1. the containerruntimeconfigcontroller creates a new containerruntimeconfig due to the update
  2. the template controller finishes re-creating the base configs
  3. the kubeletconfig errors long enough and doesn't finish until after 2

This will cause the render controller to create a new rendered MC that uses the OLD kubeletconfig-MC, which at best is a double reboot for 1 node, and at worst block the update and break maxUnavailable nodes per pool.

Tracker issue for bootimage bump in 4.11. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-3362.

This is a clone of issue OCPBUGS-11972. The following is the description of the original issue:

This is a clone of issue OCPBUGS-9956. The following is the description of the original issue:

Description of problem:

PipelineRun default template name has been updated in the backend in Pipeline operator 1.10, So we need to update the name in the UI code as well.

 

https://github.com/openshift/console/blob/master/frontend/packages/pipelines-plugin/src/components/pac/const.ts#L9

 

Description of problem:

AWS tagging - when applying user defined tags you cannot add more than 10

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

During an upgrade from 4.10.0-0.nightly-2022-09-06-081345 -> 4.11.0-0.nightly-2022-09-06-074353

DNS operator stuck in progressing state with following error in dns-defalt pod

2022-09-07T00:47:35.959931289Z [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout

must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather-136172-050448657.tar.gz

Version-Release number of selected component (if applicable):

4.11.0-0.nightly-2022-09-06-074353

How reproducible:

Intermittent

Steps to Reproduce:

1. Perform upgrade from 4.10->4.11
2.
3.

Actual results:

Upgrade unsuccessful due to API svc unreachability

Expected results:

Upgrade should be successful

Additional info:

$ omg get co | grep -v "True.*False.*False"
NAME                                      VERSION                             AVAILABLE  PROGRESSING  DEGRADED  SINCE
dns                                       4.10.0-0.nightly-2022-09-06-081345  True       True         False     3h4m
network                                   4.11.0-0.nightly-2022-09-06-074353  True       True         False     3h5m

$ omg logs dns-default-5mqz4 -c dns -n openshift-dns
/home/anusaxen/Downloads/must-gather-136208-318613238/must-gather.local.4033693973186012455/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d995e1608ba83a1f71b5a1c49705aa3cef38618e0674ee256332d1b0e2cb0e23/namespaces/openshift-dns/pods/dns-default-5mqz4/dns/dns/logs/current.log
2022-09-07T00:45:36.355161941Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:36.854403897Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:37.354367197Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:37.861427861Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:38.355095996Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:38.855063626Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:39.355170472Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:39.855296624Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:40.355481413Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:40.855455283Z [WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
2022-09-07T00:45:40.856166731Z .:5353
2022-09-07T00:45:40.856166731Z [INFO] plugin/reload: Running configuration SHA512 = 7c3db3182389e270fe3af971fef9e7157a34bde01eb750d58c2d4895a965ecb2e2d00df34594b6a819653088b6efa2590e748ce23f9e16cbb0f44d74a4fc3587
2022-09-07T00:45:40.856166731Z CoreDNS-1.9.2
2022-09-07T00:45:40.856166731Z linux/amd64, go1.18.4, 
2022-09-07T00:46:05.957503195Z [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout
2022-09-07T00:47:25.444876534Z [INFO] SIGTERM: Shutting down servers then terminating
2022-09-07T00:47:25.444984367Z [INFO] plugin/health: Going into lameduck mode for 20s
2022-09-07T00:47:35.959931289Z [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout

Description of problem:

Stop option for pipelinerun is not working

Version-Release number of selected component (if applicable):

Openshift Pipelines 1.9.x

How reproducible:

Always

Steps to Reproduce:

1. Create a pipeline and start it
2. From Actions dropdown select  stop option

Actual results:

Pipelinerun is not getting cancelled

Expected results:

Pipelinerun should get cancelled

Additional info:

 

 

Description of problem:
-----------------------
On dualstack baremetal IPI cluster next error message is present in ovnkube logs:

oc logs -n openshift-ovn-kubernetes ovnkube-node-rvggh -c ovnkube-node
...

E0810 02:12:46.343460 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:13:16.347603 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:13:46.351108 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:14:16.355047 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:14:46.358950 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:15:13.313945 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Service total 9 items received
E0810 02:15:16.362737 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:15:46.366490 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:16:16.369963 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:16:24.306561 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Endpoints total 560 items received
E0810 02:16:46.373482 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:17:16.377497 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:17:46.380726 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:18:15.325871 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 50 items received
E0810 02:18:16.384732 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:18:38.299738 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 9 items received
E0810 02:18:46.388162 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:19:16.391669 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCP-4.10.26

ovn-2021-21.12.0-58.el8fdp.x86_64
ovn-2021-host-21.12.0-58.el8fdp.x86_64
ovn-2021-central-21.12.0-58.el8fdp.x86_64
ovn-2021-vtep-21.12.0-58.el8fdp.x86_64

How reproducible:
-----------------
so far spotted on 2 different clusters

Steps to Reproduce:
-------------------
1. Deploy dualstack baremetal IPI cluster with OVNKubernetesHybrid network(add next to cluster's config before running cluster install):

defaultNetwork:
type: OVNKubernetes
ovnKubernetesConfig:
hybridOverlayConfig:
hybridClusterNetwork: []

Actual results:
---------------
Error message in logs

Expected results:
-----------------
No error message in logs

Additional info:
----------------
Baremetal dualstack setup with 3 masters and 4 workers, bonding configured for baremetal network on masters and workers

Description of problem: defined in https://bugzilla.redhat.com/show_bug.cgi?id=2051533 

When adding remote worker node using ZTP the agent finishes the installation and is marked as done. 
oc get agent -o wide
NAME                                   CLUSTER   APPROVED   ROLE     STAGE   HOSTNAME                                      REQUESTED HOSTNAME
0277804e-2a7c-4d95-9d0f-e22a190d582a   spoke-0   true       worker   Done    spoke-worker-0-0.spoke-0.qe.lab.redhat.com    spoke-worker-0-0
12efa520-5b99-4474-805d-931e46ad43f7   spoke-0   true       master   Done    spoke-master-0-2.spoke-0.qe.lab.redhat.com    spoke-master-0-2
3b8eec89-f26f-4896-8f71-8a810894c560   spoke-0   true       master   Done    spoke-master-0-0.spoke-0.qe.lab.redhat.com    spoke-master-0-0
3fb3749e-c132-4258-ad1a-08a0445c9022   spoke-0   true       worker   Done    spoke-worker-0-1.spoke-0.qe.lab.redhat.com    spoke-worker-0-1
728559e9-5543-41d9-adb0-e58196f765af   spoke-0   true       master   Done    spoke-master-0-1.spoke-0.qe.lab.redhat.com    spoke-master-0-1
982e1ff6-6e83-4800-b061-8cdfd0b844fb   spoke-0   true       worker   Done    spoke-rwn-0-1.spoke-rwn-0.qe.lab.redhat.com   spoke-rwn-0-1
a76eaa6a-b351-429f-bfa1-e53a70503573   spoke-0   true       worker   Done    spoke-rwn-0-0.spoke-rwn-0.qe.lab.redhat.com   spoke-rwn-0-0



Logging into the spoke cluster the bmh and machine resources are created and the node resource is not:
oc get bmh -n openshift-machine-api
NAME                STATE                    CONSUMER                       ONLINE   ERROR                            AGE
spoke-master-0-0    unmanaged                spoke-0-pxbfh-master-0         true                                      3h32m
spoke-master-0-1    unmanaged                spoke-0-pxbfh-master-1         true                                      3h32m
spoke-master-0-2    unmanaged                spoke-0-pxbfh-master-2         true                                      3h32m
spoke-rwn-0-0-bmh   externally provisioned   spoke-0-spoke-rwn-0-0-bmh      true     provisioned registration error   168m
spoke-rwn-0-1-bmh   externally provisioned   spoke-0-spoke-rwn-0-1-bmh      true     provisioned registration error   168m
spoke-worker-0-0    unmanaged                spoke-0-pxbfh-worker-0-65mrb   true                                      3h32m
spoke-worker-0-1    unmanaged                spoke-0-pxbfh-worker-0-nnmcq   true                                      3h32m     

 oc get machine -n openshift-machine-api
NAME                           PHASE         TYPE   REGION   ZONE   AGE
spoke-0-pxbfh-master-0         Running                              3h33m
spoke-0-pxbfh-master-1         Running                              3h33m
spoke-0-pxbfh-master-2         Running                              3h33m
spoke-0-pxbfh-worker-0-65mrb   Running                              3h19m
spoke-0-pxbfh-worker-0-nnmcq   Running                              3h20m
spoke-0-spoke-rwn-0-0-bmh      Provisioned                          169m
spoke-0-spoke-rwn-0-1-bmh      Provisioned                          169m

Note: bmh is in error state:
Normal  ProvisionedRegistrationError  30m   metal3-baremetal-controller  Host adoption failed: Error while attempting to adopt node 529b3e75-5d04-4486-9296-269081d0ec02: Error validating Redfish virtual media. Some parameters were missing in node's driver_info. Missing are: ['deploy_kernel', 'deploy_ramdisk'].

oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
spoke-master-0-0.spoke-0.qe.lab.redhat.com   Ready    master   72m   v1.22.3+2cb6068
spoke-master-0-1.spoke-0.qe.lab.redhat.com   Ready    master   50m   v1.22.3+2cb6068
spoke-master-0-2.spoke-0.qe.lab.redhat.com   Ready    master   72m   v1.22.3+2cb6068
spoke-worker-0-0.spoke-0.qe.lab.redhat.com   Ready    worker   51m   v1.22.3+2cb6068
spoke-worker-0-1.spoke-0.qe.lab.redhat.com   Ready    worker   51m   v1.22.3+2cb6068


node-bootstrapper CSR is created but not auto-approved; periodically another node-strapper csr is created until it is manually approved:

oc get csr | grep Pending
csr-5ll2g                                        9m9s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending
csr-f8vbl                                        8m24s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending

Version-Release number of selected component (if applicable):

assisted-service master at revision af0bafb3f7f629932f8c3dc31ccddedfe6984926
ocp version: 4.10.0-rc.1

How reproducible:

1. Install remote worker node using ztp

2. Wait for node resource to be created

Steps to Reproduce:

1. Install remote worker node using ztp

2. Wait for node resource to be created
 

Actual results:

node-bootstrapper and node CSR are not auto-approved and node resource is not created.  The bmh resource remains in registration error

Expected results:

node-bootstrapper and node CSR should be auto-approved and node resource created.  The bmh resource should not be in registration error

Additional info:

 

This is a clone of issue OCPBUGS-17808. The following is the description of the original issue:

This is a clone of issue OCPBUGS-16804. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15327. The following is the description of the original issue:

Description of problem:

On OpenShift Container Platform, the etcd Pod is showing messages like the following:

2023-06-19T09:10:30.817918145Z {"level":"warn","ts":"2023-06-19T09:10:30.817Z","caller":"fileutil/purge.go:72","msg":"failed to lock file","path":"/var/lib/etcd/member/wal/000000000000bc4b-00000000183620a4.wal","error":"fileutil: file already locked"}


This is described in KCS https://access.redhat.com/solutions/7000327

Version-Release number of selected component (if applicable):

any currently supported version (> 4.10) running with 3.5.x

How reproducible:

always

Steps to Reproduce:

happens after running etcd for a while

 

This has been discussed in https://github.com/etcd-io/etcd/issues/15360

It's not a harmful error message, it merely indicates that some WALs have not been included in snapshots yet.

This was caused by changing default numbers: https://github.com/etcd-io/etcd/issues/13889

This was fixed in https://github.com/etcd-io/etcd/pull/15408/files but never backported to 3.5.

To mitigate that error and stop confusing people, we should also supply that argument when starting etcd in: https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L170-L187

That way we're not surprised by changes of the default values upstream.

This is a clone of issue OCPBUGS-22689. The following is the description of the original issue:

This is a clone of issue OCPBUGS-22210. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14010. The following is the description of the original issue:

Extended api-server disruption periods during upgrades on Azure, detected by TRT analysis, led to the finding that non-fatal etcd delays were causing an api-server 503s.

https://github.com/openshift/cluster-openshift-apiserver-operator/blob/00f7e4cc95063ba5aba1992568088d924cfbf516/bindata/v3.11.0/openshift-apiserver/deploy.yaml#L137 shows the current readiness check permits only one failure after 1 second. Suggesting we backport OCPBUGS-14010 to make this 10 seconds and more forgiving of temporary etcd issues.

 

This prowjob shows the behavior:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1706399977418264576/artifacts/e2e-azure-sdn-upgrade/openshift-e2e-test/artifacts/junit/e2e-timelines_everything_20230925-210342.html

This is a clone of issue OCPBUGS-268. The following is the description of the original issue:

The linux kernel was updated:
https://lkml.org/lkml/2020/3/20/1030
to include steal

{time,clock}

accounting

This would greatly assist in troubleshooting vSphere performance issues
caused by over-provisioned ESXi hosts.

This is a clone of issue OCPBUGS-9955. The following is the description of the original issue:

Description of problem:

OCP cluster installation (SNO) using assisted installer running on ACM hub cluster. 
Hub cluster is OCP 4.10.33
ACM is 2.5.4

When a cluster fails to install we remove the installation CRs and cluster namespace from the hub cluster (to eventually redeploy). The termination of the namespace hangs indefinitely (14+ hours) with finalizers remaining. 

To resolve the hang we can remove the finalizers by editing both the secret pointed to by BareMetalHost .spec.bmc.credentialsName and BareMetalHost CR. When these finalizers are removed the namespace termination completes within a few seconds.

Version-Release number of selected component (if applicable):

OCP 4.10.33
ACM 2.5.4

How reproducible:

Always

Steps to Reproduce:

1. Generate installation CRs (AgentClusterInstall, BMH, ClusterDeployment, InfraEnv, NMStateConfig, ...) with an invalid configuration parameter. Two scenarios validated to hit this issue:
  a. Invalid rootDeviceHint in BareMetalHost CR
  b. Invalid credentials in the secret referenced by BareMetalHost.spec.bmc.credentialsName
2. Apply installation CRs to hub cluster
3. Wait for cluster installation to fail
4. Remove cluster installation CRs and namespace

Actual results:

Cluster namespace remains in terminating state indefinitely:
$ oc get ns cnfocto1
NAME       STATUS        AGE    
cnfocto1   Terminating   17h

Expected results:

Cluster namespace (and all installation CRs in it) are successfully removed.

Additional info:

The installation CRs are applied to and removed from the hub cluster using argocd. The CRs have the following waves applied to them which affects the creation order (lowest to highest) and removal order (highest to lowest):
Namespace: 0
AgentClusterInstall: 1
ClusterDeployment: 1
NMStateConfig: 1
InfraEnv: 1
BareMetalHost: 1
HostFirmwareSettings: 1
ConfigMap: 1 (extra manifests)
ManagedCluster: 2
KlusterletAddonConfig: 2

 

This is a clone of issue OCPBUGS-5019. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4941. The following is the description of the original issue:

Description of problem: This is a follow-up to OCPBUGS-3933.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses, and a container is empty.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-3114. The following is the description of the original issue:

Description of problem:

When running a Hosted Cluster on Hypershift the cluster-networking-operator never progressed to Available despite all the components being up and running

Version-Release number of selected component (if applicable):

quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters
hypershift operator is quay.io/hypershift/hypershift-operator:4.11
4.11.9 management cluster

How reproducible:

Happened once

Steps to Reproduce:

1.
2.
3.

Actual results:

oc get co network reports False availability

Expected results:

oc get co network reports True availability

Additional info:

 

Description of problem:

In order to understand what is going on with OCPBUGS-5379 we want to add more logs

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
This is a clone of https://issues.redhat.com/browse/OCPBUGS-469

 

Description of problem: Numerous erroreneous logs in OVN master

I0823 18:00:11.163491       1 obj_retry.go:1063] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163546       1 obj_retry.go:1096] Removing old object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163555       1 pods.go:124] Deleting pod: openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163631       1 obj_retry.go:1103] Retry delete failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k, will try again later: deleteLogicalPort failed for pod openshift-operator-lifecycle-manager_collect-profiles-27687900-hlp6k: unable to locate portUUID+nodeName for pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: error getting logical port <nil>: object not found
W0823 18:00:41.163633       1 obj_retry.go:1031] Dropping retry entry for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: exceeded number of failed attempts

Must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.2234927131259452300/

 

Version-Release number of selected component (if applicable): 4.12.0-0.nightly-2022-08-23-031342

How reproducible: Always

Steps to Reproduce:
1. Bring up OVN cluster on 4.12
2.
3.

Actual results: deleteLogicalPort failed for already gone object

Expected results: deleteLogicalPort should not keep retrying post object deletion

Additional info:

This is a clone of issue OCPBUGS-10225. The following is the description of the original issue:

Description of problem:
Pipeline Repository (Pipeline-as-code) list never shows an Event type.

Version-Release number of selected component (if applicable):
4.9+

How reproducible:
Always

Steps to Reproduce:

  1. Install Pipelines Operator and setup a Pipeline-as-code repository
  2. Trigger an event and a build

Actual results:
Pipeline Repository list shows a column Event type but no value.

Expected results:
Pipeline Repository list should show the Event type from the matching Pipeline Run.

Similar to the Pipeline Run Details page based on the label.

Additional info:
The list page packages/pipelines-plugin/src/components/repository/list-page/RepositoryRow.tsx renders obj.metadata.namespace as event type.

I believe we should show the Pipeline Run event type instead. packages/pipelines-plugin/src/components/repository/RepositoryLinkList.tsx uses

{plrLabels[RepositoryLabels[RepositoryFields.EVENT_TYPE]]}

to render it.

Also the Pipeline Repository details page tried to render the Branch and Event type from the Repository resource. My research says these properties doesn't exist on the Repository resource. The code should be removed from the Repository details page.

This is a clone of issue OCPBUGS-13335. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13334. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11753. The following is the description of the original issue:

Description of problem:

During the IPI bm provisioning process the deployment fails with the following error:

Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'Redfish exception occurred. Error: Failed to get network interface information on node 995431b0-14c8-4aae-9733-e02b42da29ac: The attribute EthernetInterfaces is missing from the resource /redfish/v1/Systems/<redacted>'

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Snippet from my install-config.yaml
      - name: ocp-test-2
        role: master
        bmc:
          address: redfish-virtualmedia://<redacted>/redfish/v1/Systems/<redacted>
          username: admin
          password: <redacted>
          disableCertificateVerification: True
        bootMACAddress: <redacted>

Steps to Reproduce:

see above

Actual results:

Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'Redfish exception occurred. Error: Failed to get network interface information on node 995431b0-14c8-4aae-9733-e02b42da29ac: The attribute EthernetInterfaces is missing from the resource /redfish/v1/Systems/<redacted>'

Expected results:

Because I already specified the bootMACAddress in the install-config.yaml I would like the installer to continue and don't fail when the Redfish API doesn't provide the EthernetInterfaces

Additional info:

The blades in this specific chassis don't have access to the NIC adapter.
Therefore they can't get the MAC address and populate this field in the Redfish API.
These blades are currently still selling and will need to be supported for another 5 years atleast.
Our new servers in the new chassis don't have this problem (X-Series).

Please keep the bootMACAddress as an optional field in the install-config.yaml to support these blades.

 

This is a clone of issue OCPBUGS-8286. The following is the description of the original issue:

Description of problem:

mapi_machinehealthcheck_short_circuit is not properly reconciling the state, when a MachineHealthCheck is failing because of unhealthy Machines but then is removed.

When doing two MachineSet (called blue and green and only one has running Machines at a specific point in time) with MachineAutoscaler and MachineHealthCheck, the mapi_machinehealthcheck_short_circuit will continue to report 1 for MachineHealth that actually was removed because of a switch from blue to green.

$ oc get machineset | egrep 'blue|green'
housiocp4-wvqbx-worker-blue-us-east-2a    0         0                             2d17h
housiocp4-wvqbx-worker-green-us-east-2a   1         1         1       1           2d17h

$ oc get machineautoscaler
NAME                      REF KIND     REF NAME                                   MIN   MAX   AGE
worker-green-us-east-1a   MachineSet   housiocp4-wvqbx-worker-green-us-east-2a   1     4     2d17h

$ oc get machinehealthcheck
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           0                  0
worker-green-us-east-1a           40%            1                  1

      {
        "name": "machine-health-check-unterminated-short-circuit",
        "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-machine-api-machine-api-operator-prometheus-rules-ccb650d9-6fc4-422b-90bb-70452f4aff8f.yaml",
        "rules": [
          { 
            "state": "firing",
            "name": "MachineHealthCheckUnterminatedShortCircuit",
            "query": "mapi_machinehealthcheck_short_circuit == 1",
            "duration": 1800,
            "labels": {
              "severity": "warning"
            },
            "annotations": {
              "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
              "summary": "machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"
            },
            "alerts": [
              { 
                "labels": {
                  "alertname": "MachineHealthCheckUnterminatedShortCircuit",
                  "container": "kube-rbac-proxy-mhc-mtrc",
                  "endpoint": "mhc-mtrc",
                  "exported_namespace": "openshift-machine-api",
                  "instance": "10.128.0.58:8444",
                  "job": "machine-api-controllers",
                  "name": "worker-blue-us-east-1a",
                  "namespace": "openshift-machine-api",
                  "pod": "machine-api-controllers-779dcb8769-8gcn6",
                  "service": "machine-api-controllers",
                  "severity": "warning"
                },
                "annotations": {
                  "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
                  "summary": "machine health check worker-blue-us-east-1a has been disabled by short circuit for more than 30 minutes"
                },
                "state": "firing",
                "activeAt": "2022-12-09T15:59:25.1287541Z",
                "value": "1e+00"
              }
            ],
            "health": "ok",
            "evaluationTime": 0.000648129,
            "lastEvaluation": "2022-12-12T09:35:55.140174009Z",
            "type": "alerting"
          }
        ],
        "interval": 30,
        "limit": 0,
        "evaluationTime": 0.000661589,
        "lastEvaluation": "2022-12-12T09:35:55.140165629Z"
      },

As we can see above, worker-blue-us-east-1a is no longer available and active but rather worker-green-us-east-1a. But worker-blue-us-east-1a was there before the switch to green has happen and was actuall reporting some unhealthy Machines. But since it's now gone, mapi_machinehealthcheck_short_circuit should properly reconcile as otherwise this is a false/positive alert.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.0-rc.3 (but is also seen on previous version)

How reproducible:

- Always

Steps to Reproduce:

1. Setup OpenShift Container Platform 4 on AWS for example
2. Create blue and green MachineSet with MachineAutoScaler and MachineHealthCheck
3. Have active Machines for blue only
4. Trigger unhealthy Machines in blue MachineSet
5. Switch to green MachineSet, by removing MachineHealthCheck, MachineAutoscaler and setting replicate of blue MachineSet to 0
6. Create green MachineHealthCheck, MachineAutoscaler and scale geen MachineSet to 1
7. Observe how mapi_machinehealthcheck_short_circuit continues to report unhealthy state for blue MachineHealthCheck which no longer exists.

Actual results:

mapi_machinehealthcheck_short_circuit reporting problematic MachineHealthCheck even though the faulty MachineHealthCheck does no longer exist.

Expected results:

mapi_machinehealthcheck_short_circuit to properly reconcile it's state and remove MachineHealthChecks that have been removed on OpenShift Container Platform level

Additional info:

It kind of looks like similar to the issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=2013528 respectively https://bugzilla.redhat.com/show_bug.cgi?id=2047702 (although https://bugzilla.redhat.com/show_bug.cgi?id=2047702 may not be super relevant)

This is a clone of issue OCPBUGS-7474. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6714. The following is the description of the original issue:

Description of problem:

Traffic from egress IPs was interrupted after Cluster patch to Openshift 4.10.46

a customer cluster was patched. It is an Openshift 4.10.46 cluster with SDN.

More description about issue is available in private comment below since it contains customer data.

This is a clone of issue OCPBUGS-1805. The following is the description of the original issue:

The vSphere CSI cloud.conf lists the single datacenter from platform workspace config but in a multi-zone setup (https://github.com/openshift/enhancements/pull/918 ) there may be more than the one datacenter.

This issue is resulting in PVs failing to attach because the virtual machines can't be find in any other datacenter. For example:

0s Warning FailedAttachVolume pod/image-registry-85b5d5db54-m78vp AttachVolume.Attach failed for volume "pvc-ab1a0611-cb3b-418d-bb3b-1e7bbe2a69ed" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"rbost-zonal-ghxp2-worker-3-xm7gw". Error: virtual machine wasn't found  

The machine above lives in datacenter-2 but the CSI cloud.conf is only aware of the datacenter IBMCloud.

$ oc get cm vsphere-csi-config -o yaml  -n openshift-cluster-csi-drivers | grep datacenters
    datacenters = "IBMCloud" 

 

This is a clone of issue OCPBUGS-6831. The following is the description of the original issue:

Description of problem:
The console crashes when it used with a user settings ConfigMap that is created with a 4.13+ console. This version saves "null" for the key "console.pinnedResources" which doesn't happen before and the old console version could not handle this well.

Version-Release number of selected component (if applicable):
4.8-4.12

How reproducible:
Always, but only in the edge case that someone used a newer console first and then downgraded.

This can happen only by manually applying the user settings ConfigMap or when downgrading a cluster.

Steps to Reproduce:
Open the user-settings ConfigMap and set "console.pinedResources" to "null" (with quotes as all ConfigMap values needs to be strings)

Or run this patch command:

oc patch -n openshift-console-user-settings configmaps user-settings-kubeadmin --type=merge --patch '{"data":{"console.pinnedResources":"null"}}'

Open console...

Actual results:
Console crashes

Expected results:
Console should not crash

This is a clone of issue OCPBUGS-1678. The following is the description of the original issue:

Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

OCPBUGS-1677 is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

This is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

This is a clone of issue OCPBUGS-15853. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8404. The following is the description of the original issue:

Description of problem:

If a custom API server certificate is added as per documentation[1], but the secret name is wrong and points to a non-existing secret, the following happens:
- The kube-apiserver config is rendered with some of the namedCertificates pointing to /etc/kubernetes/static-pod-certs/secrets/user-serving-cert-000/
- As the secret in apiserver/cluster object is wrong, no user-serving-cert-000 secret is generated, so the /etc/kubernetes/static-pod-certs/secrets/user-serving-cert-000/ does not exist (and may be automatically removed if manually created).
- The combination of the 2 points above causes kube-apiserver to start crash-looping because its config points to non-existent certificates.

This is a cluster-kube-apiserver-operator, because it should validate that the specified secret exists and degrade and do nothing if it doesn't, not render inconsistent configuration.

Version-Release number of selected component (if applicable):

First found in 4.11.13, but also reproduced in the latest nightly build.

How reproducible:

Always

Steps to Reproduce:

1. Setup a named certificate pointing to a secret that doesn't exist.
2.
3.

Actual results:

Inconsistent configuration that points to non-existing secret. Kube API server pod crash-loop.

Expected results:

Cluster Kube API Server Operator to detect that the secret is wrong, do nothing and only report itself as degraded with meaningful message so the user can fix. No Kube API server pod crash-looping.

Additional info:

Once the kube-apiserver is broken, even if the apiserver/cluster object is fixed, it is usually needed to apply a manual workaround in the crash-looping master. An example of workaround that works is[2], even though that KB article was written for another bug with different root cause. 

References:

[1] - https://docs.openshift.com/container-platform/4.11/security/certificates/api-server.html#api-server-certificates
[2] - https://access.redhat.com/solutions/4893641

This is a clone of issue OCPBUGS-6887. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3476. The following is the description of the original issue:

Description of problem:

When we detect a refs/heads/branchname we should show the label as what we have now:

- Branch: branchname

And when we detect a refs/tags/tagname we should instead show the label as:

- Tag: tagname

I haven't implemented this in cli but there is an old issue for that here openshift-pipelines/pipelines-as-code#181

Version-Release number of selected component (if applicable):

4.11.z

How reproducible:

 

Steps to Reproduce:

1. Create a repository
2. Trigger the pipelineruns by push or pull request event on the github  

Actual results:

We do not show tag name even is tag is present instead of branch

Expected results:

We should show tag if tag is detected and branch if branch is detedcted.

Additional info:

https://github.com/openshift/console/pull/12247#issuecomment-1306879310

This is a clone of issue OCPBUGS-16151. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13812. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13718. The following is the description of the original issue:

Description of problem:

IPI install on azure stack failed when setting platform.azure.osDiks.diskType as StandardSSD_LRS in install-config.yaml.

When setting controlPlane.platform.azure.osDisk.diskType as StandardSSD_LRS, get error in terraform log and some resources have been created.

level=error msg=Error: expected storage_os_disk.0.managed_disk_type to be one of [Premium_LRS Standard_LRS], got StandardSSD_LRS
level=error
level=error msg=  with azurestack_virtual_machine.bootstrap,
level=error msg=  on main.tf line 107, in resource "azurestack_virtual_machine" "bootstrap":
level=error msg= 107: resource "azurestack_virtual_machine" "bootstrap" {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: expected storage_os_disk.0.managed_disk_type to be one of [Premium_LRS Standard_LRS], got StandardSSD_LRS
level=error
level=error msg=  with azurestack_virtual_machine.bootstrap,
level=error msg=  on main.tf line 107, in resource "azurestack_virtual_machine" "bootstrap":
level=error msg= 107: resource "azurestack_virtual_machine" "bootstrap" {
level=error
level=error

When setting compute.platform.azure.osDisk.diskType as StandardSSD_LRS, fail to provision compute machines

$ oc get machine -n openshift-machine-api
NAME                                     PHASE     TYPE              REGION   ZONE   AGE
jima414ash03-xkq5x-master-0              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-master-1              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-master-2              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-worker-mtcazs-89mgn   Failed                                      52m
jima414ash03-xkq5x-worker-mtcazs-jl5kk   Failed                                      52m
jima414ash03-xkq5x-worker-mtcazs-p5kvw   Failed                                      52m

$ oc describe machine jima414ash03-xkq5x-worker-mtcazs-jl5kk -n openshift-machine-api
...
  Error Message:           failed to reconcile machine "jima414ash03-xkq5x-worker-mtcazs-jl5kk": failed to create vm jima414ash03-xkq5x-worker-mtcazs-jl5kk: failure sending request for machine jima414ash03-xkq5x-worker-mtcazs-jl5kk: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="Storage account type 'StandardSSD_LRS' is supported by Microsoft.Compute API version 2018-04-01 and above" Target="osDisk.managedDisk.storageAccountType"
...

Based on azure-stack doc[1], supported disk types on ASH are Premium SSD, Standard HDD. It's better to do validation for diskType on Azure Stack to avoid above errors.

[1]https://learn.microsoft.com/en-us/azure-stack/user/azure-stack-managed-disk-considerations?view=azs-2206&tabs=az1%2Caz2#cheat-sheet-managed-disk-differences

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-05-16-085836

How reproducible:

Always

Steps to Reproduce:

1. Prepare install-config.yaml, set platform.azure.osDiks.diskType as StandardSSD_LRS
2. Install IPI cluster on Azure Stack
3.

Actual results:

Installation failed

Expected results:

Installer validate diskType on AzureStack Cloud, and exit for unsupported disk type with error message

Additional info:

 

This is a clone of issue RHIBMCS-151. The following is the description of the original issue:

Error msg

type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: @existing/ibm-common-services//ibm-namespace-scope-operator.v2.0.0 and @existing/ibm-common-services//ibm-namespace-scope-operator.v1.15.0 provide NamespaceScope (operator.ibm.com/v1), subscription ibm-namespace-scope-operator requires @existing/ibm-common-services//ibm-namespace-scope-operator.v2.0.0, subscription ibm-namespace-scope-operator exists, clusterserviceversion ibm-namespace-scope-operator.v1.15.0 exists and is not referenced by a subscription

 

The issue happens  during the upgrade with and without channel switch. And several places which reports this issue

https://ibm-cloudplatform.slack.com/archives/CM95C10RK/p1662557747140069

PrivateCloud-analytics/CPD-Quality#5548

 

Current status

Issue opened in OLM community https://github.com/operator-framework/operator-lifecycle-manager/issues/2201

 

bugzilla ticket https://bugzilla.redhat.com/show_bug.cgi?id=1980755

 

Knowledge Base from Red Hat https://access.redhat.com/solutions/6603001

 

It is a known OLM issue and Bedrock also provides workarounds in documents https://www.ibm.com/docs/en/cpfs?topic=ii-olm-known-issue-updates-subscription-status-creates-csv-asynchronous

 

 

 

Usually when the second error msg happened not referenced by a subscription , it requires us to re-install the operator.

https://www.ibm.com/docs/en/cpfs?topic=ii-olm-known-issue-updates-subscription-status-creates-csv-asynchronous

 Or the mis synchronisation may be rectified by restarting the catalog and olm operators in some efforts.

 

 

 

This is a clone of issue OCPBUGS-11218. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10950. The following is the description of the original issue:

Description of problem: 

"pipelines-as-code-pipelinerun-go" configMap is not been used for the Go repository while creating Pipeline Repository. "pipelines-as-code-pipelinerun-generic" configMap has been used.

Prerequisites (if any, like setup, operators/versions):

Install Red Hat Pipeline operator

Steps to Reproduce

  1. Navigate to Create Repository form 
  2. Enter the Git URL `https://github.com/vikram-raj/hello-func-go`
  3. Click on Add

Actual results:

`pipelines-as-code-pipelinerun-generic` PipelineRun template has been shown on the overview page 

Expected results:

`pipelines-as-code-pipelinerun-go` PipelineRun template should show on the overview page

Reproducibility (Always/Intermittent/Only Once):

Build Details:

4.13

Workaround:

Additional info:

This bug is a backport clone of [Bugzilla Bug 2073220](https://bugzilla.redhat.com/show_bug.cgi?id=2073220). The following is the description of the original bug:

Description of problem:

https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config

Version-Release number of selected component (if applicable): 4.*

How reproducible: always

Steps to Reproduce:
1. Set audit profile to WriteRequestBodies
2. Wait for api server rollout to complete
3. tail -f /var/log/kube-apiserver/audit.log | grep routes/status

Actual results:

Write events to routes/status are recorded at the RequestResponse level, which often includes keys and certificates.

Expected results:

Events involving routes should always be recorded at the Metadata level, per the documentation at https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config

Additional info:

This is a clone of issue OCPBUGS-4072. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4026. The following is the description of the original issue:

Description of problem:
There is an endless re-render loop and a browser feels slow to stuck when opening the add page or the topology.

Saw also endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds

Version-Release number of selected component (if applicable):
1. Console UI 4.12-4.13 (master)
2. Service Binding Operator (tested with 1.3.1)

How reproducible:
Always with installed SBO

But the "stuck feeling" depends on the browser (Firefox feels more stuck) and your locale machine power

Steps to Reproduce:
1. Install Service Binding Operator
2. Create or update the BindableKinds resource "bindable-kinds"

apiVersion: binding.operators.coreos.com/v1alpha1
kind: BindableKinds
metadata:
  name: bindable-kinds

3. Open the browser console log
4. Open the console UI and navigate to the add page

Actual results:
1. Saw endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds
2. Browser feels slow and get stuck after some time
3. The page crashs after some time

Expected results:
1. The API call should be called just once
2. The add page should just work without feeling laggy
3. No crash

Additional info:
Get introduced after we watching the bindable-kinds resource with https://github.com/openshift/console/pull/11161

It looks like this happen only if the SBO is installed and the bindable-kinds resource exist, but doesn't contain any status.

The status list all available bindable resource types. I could not reproduce this by installing and uninstalling an operator, but you can manually create or update this resource as mentioned above.

Backport clone of https://issues.redhat.com/browse/OCPBUGSM-24281

openshift-4 tracking bug for telemeter-container: see the bugs linked in the "Blocks" field of this bug for full details of the security issue(s).

This bug is never intended to be made public, please put any public notes in the blocked bugs.

Impact: Moderate
Public Date: 11-Jan-2021
PM Fix/Wontfix Decision By: 04-May-2021
Resolve Bug By: 11-Jan-2022

In case the dates above are already past, please evaluate this bug in your next prioritization review and make a decision then. Remember to explicitly set CLOSED:WONTFIX if you decide not to fix this bug.

Please see the Security Errata Policy for further details: https://docs.engineering.redhat.com/x/9RBqB

This is a backport from https://issues.redhat.com/browse/OCPBUGS-1044

 

Description of problem:

https://github.com/prometheus/node_exporter/issues/2299

The node exporter pod when ran on a bare metal worker using an AMD EPYC CPU crashes and fails to start up and crashes with the following error message.    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   05.145Z caller=node_exporter.go:115 level=info collector=tapestats
ts=2022-09-07T20:25:05.145Z caller=node_exporter.go:115 level=info collector=textfile
ts=2022-09-07T20:25:05.145Z caller=node_exporter.go:115 level=info collector=thermal_zone
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=time
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=timex
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=udp_queues
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=uname
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=vmstat
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=xfs
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=zfs
ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:199 level=info msg="Listening on" address=127.0.0.1:9100
ts=2022-09-07T20:25:05.146Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
panic: "node_rapl_package-0-die-0_joules_total" is not a valid metric name

 Apparently this is a known issue (See Github link) and was fixed in a later upstream.

Version-Release number of selected component (if applicable):

4.11.0

How reproducible:

Every-time 

Steps to Reproduce:

1. Provision a bare metal node using an AMD EPYC CPU
2. Node-exporter pods that try to start on the nodes will crash with error message

Actual results:

Node-exporter pods cannot run on the new nodes 

Expected results:

Node exporter pods should be able to start up and run like on any other node

Additional info:

As mentioned above this issue was tracked and fixed in a later upstream of node-exporter

https://github.com/prometheus/node_exporter/issues/2299

Would we be able to get the fixed version pulled for 4.11?

terraform-provider-libvirt uses SPICE by default for its libvirtxml.DomainGraphic, but this has been deprecated/removed in RHEL9/CentOS9 Stream.

Check https://github.com/openshift/installer/pull/6062

Description of problem:

Jenkins and Jenkins Agent Base image versions needs to be updated to use the latest images to mitigate known CVEs in plugins and Jenkins versions.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-1130. The following is the description of the original issue:

The test results in sippy look really bad on our less common platforms, but still pretty unacceptable even on core clouds. It's reasonably often the only test that fails. We need to decide what to do here, and we're going to need input from the etcd team.

As of Sep 13th:

  • several vsphere and openstack variant combo's fail this test around 24-32% of the time
  • aws, amd64, ovn, upgrade, upgrade-micro, ha - fails 6% of the time
  • aws, amd64, ovn, upgrade, upgrade-minor, ha - fails 4% of the time
  • gcp, amd64, sdn, upgrade, upgrade-minor, ha - fails 8% of the time
  • globally across all jobs fails around 3% of the time.

Even on some major variant combos, a 4-8% failure rate is too high.
On Sep 13 arch call (no etcd present), Damien mentioned this might be an upstream alert that just isn't well suited for OpenShift's use cases, is this the case and it needs tuning?

Has the problem been getting worse?

I believe this link https://datastudio.google.com/s/urkKwmmzvgo indicates that this may be the case for 4.12, AWS and Azure are both getting worse in ways that I don't see if we change the release to 4.11 where it looks consistent. gcp seems fine on 4.12. We do not have data for vsphere for some reason.

This link shows the grpc_methods most commonly involved: https://search.ci.openshift.org/?search=etcdGRPCRequestsSlow+was+at+or+above&maxAge=48h&context=7&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

At a glance: LeaseGrant, MemberList, Txn, Status, Range.

Broken out of TRT-401
For linking with sippy:
[bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
[sig-arch][bz-etcd][Late] Alerts alert/etcdGRPCRequestsSlow should not be at or above info [Suite:openshift/conformance/parallel]

This is a clone of issue OCPBUGS-7373. The following is the description of the original issue:

Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

Under some circumstances the static pod machinery fails to populate the node status in time to generate the correct env variables for ETCD_URL_HOST, ETCD_NAME etc. The pods that come up will fail to accept those variables.

This is particularly pronounced in SNO topologies, leading to installation failures. 

The fix is to fail fast in the targetconfig/envvar controller to ensure the CEO goes degraded instead of silently failing on the rollout of an invalid static pod.

Description of problem:

go test -mod=vendor -test.v -race  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops
# github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops [github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.test]
pkg/libovsdbops/acl_test.go:98:15: undefined: FindACLs
pkg/libovsdbops/acl_test.go:105:15: undefined: UpdateACLsOps
FAIL    github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops [build failed]
FAIL

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Created two egressIP object, egressIPs in one egressIP object cannot be applied successfully 

Version-Release number of selected component (if applicable):

4.11.0-0.nightly-2022-11-27-164248

How reproducible:

Frequently happen in auto case

Steps to Reproduce:

1.  Label two nodes as egress nodes
  oc get nodes -o wide
NAME                                   STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
huirwang-1128a-s6j6t-master-0          Ready    master   154m   v1.24.6+5658434   10.0.0.8      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-master-1          Ready    master   154m   v1.24.6+5658434   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-master-2          Ready    master   153m   v1.24.6+5658434   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-worker-westus-1   Ready    worker   135m   v1.24.6+5658434   10.0.1.5      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-worker-westus-2   Ready    worker   136m   v1.24.6+5658434   10.0.1.4      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8

 % oc get node huirwang-1128a-s6j6t-worker-westus-1 --show-labels
NAME                                   STATUS   ROLES    AGE    VERSION           LABELS
huirwang-1128a-s6j6t-worker-westus-1   Ready    worker   136m   v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westus,failure-domain.beta.kubernetes.io/zone=0,k8s.ovn.org/egress-assignable=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=huirwang-1128a-s6j6t-worker-westus-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=,topology.kubernetes.io/region=westus,topology.kubernetes.io/zone=0

 % oc get node huirwang-1128a-s6j6t-worker-westus-2 --show-labels
NAME                                   STATUS   ROLES    AGE    VERSION           LABELS
huirwang-1128a-s6j6t-worker-westus-2   Ready    worker   136m   v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westus,failure-domain.beta.kubernetes.io/zone=0,k8s.ovn.org/egress-assignable=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=huirwang-1128a-s6j6t-worker-westus-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=,topology.kubernetes.io/region=westus,topology.kubernetes.io/zone=0
2. Created two egressIP objects
3.

Actual results:

egressip-47032 was not applied to any egress node
% oc get egressip
NAME             EGRESSIPS    ASSIGNED NODE                          ASSIGNED EGRESSIPS
egressip-47032   10.0.1.166                                          
egressip-47034   10.0.1.181   huirwang-1128a-s6j6t-worker-westus-1   10.0.1.181

%  oc get cloudprivateipconfig                      
NAME         AGE
10.0.1.130   6m25s
10.0.1.138   6m34s
10.0.1.166   6m34s
10.0.1.181   6m25s
%  oc get cloudprivateipconfig  10.0.1.166  -o yaml
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  annotations:
    k8s.ovn.org/egressip-owner-ref: egressip-47032
  creationTimestamp: "2022-11-28T10:27:37Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 1
  name: 10.0.1.166
  resourceVersion: "87528"
  uid: 5221075a-35d0-4670-a6a7-ddfc6cbc700b
spec:
  node: huirwang-1128a-s6j6t-worker-westus-1
status:
  conditions:
  - lastTransitionTime: "2022-11-28T10:33:29Z"
    message: 'Error processing cloud assignment request, err: <nil>'
    observedGeneration: 1
    reason: CloudResponseError
    status: "False"
    type: Assigned
  node: huirwang-1128a-s6j6t-worker-westus-1
%  oc get cloudprivateipconfig  10.0.1.138  -o yaml 
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  annotations:
    k8s.ovn.org/egressip-owner-ref: egressip-47032
  creationTimestamp: "2022-11-28T10:27:37Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 1
  name: 10.0.1.138
  resourceVersion: "87523"
  uid: e4604e76-64d8-4735-87a2-eb50d28854cc
spec:
  node: huirwang-1128a-s6j6t-worker-westus-2
status:
  conditions:
  - lastTransitionTime: "2022-11-28T10:33:29Z"
    message: 'Error processing cloud assignment request, err: <nil>'
    observedGeneration: 1
    reason: CloudResponseError
    status: "False"
    type: Assigned
  node: huirwang-1128a-s6j6t-worker-westus-2

oc logs cloud-network-config-controller-6f7b994ddc-vhtbp  -n openshift-cloud-network-config-controller

.......
E1128 10:30:43.590807       1 controller.go:165] error syncing '10.0.1.138': error assigning CloudPrivateIPConfig: "10.0.1.138" to node: "huirwang-1128a-s6j6t-worker-westus-2", err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="InvalidRequestFormat" Message="Cannot parse the request." Details=[{"code":"DuplicateResourceName","message":"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/networkInterfaces/ has two child resources with the same name (huirwang-1128a-s6j6t-worker-westus-2_10.0.1.138)."}], requeuing in cloud-private-ip-config workqueue
I1128 10:30:44.051422       1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.1.166" will be added to node: "huirwang-1128a-s6j6t-worker-westus-1"
E1128 10:30:44.301259       1 controller.go:165] error syncing '10.0.1.166': error assigning CloudPrivateIPConfig: "10.0.1.166" to node: "huirwang-1128a-s6j6t-worker-westus-1", err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="InvalidRequestFormat" Message="Cannot parse the request." Details=[{"code":"DuplicateResourceName","message":"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/networkInterfaces/ has two child resources with the same name (huirwang-1128a-s6j6t-worker-westus-1_10.0.1.166)."}], requeuing in cloud-private-ip-config workqueue
..........

Expected results:

EgressIP can be applied successfully.

Additional info:


This is a clone of issue OCPBUGS-4696. The following is the description of the original issue:

Description of problem:

metal3 pod does not come up on SNO when creating Provisioning with provisioningNetwork set to Disabled

The issue is that on SNO, there is no Machine, and no BareMetalHost, it is looking of Machine objects to populate the provisioningMacAddresses field. However, when provisioningNetwork is Disabled, provisioningMacAddresses is not used anyway.

You can work around this issue by populating provisioningMacAddresses with a dummy address, like this:

kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningMacAddresses:
  - aa:aa:aa:aa:aa:aa
  provisioningNetwork: Disabled
  watchAllNamespaces: true

Version-Release number of selected component (if applicable):

4.11.17

How reproducible:

Try to bring up Provisioning on SNO in 4.11.17 with provisioningNetwork set to Disabled

apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningNetwork: Disabled
  watchAllNamespaces: true

Steps to Reproduce:

1.
2.
3.

Actual results:

controller/provisioning "msg"="Reconciler error" "error"="machines with cluster-api-machine-role=master not found" "name"="provisioning-configuration" "namespace"="" "reconciler group"="metal3.io" "reconciler kind"="Provisioning"

Expected results:

metal3 pod should be deployed

Additional info:

This issue is a result of this change: https://github.com/openshift/cluster-baremetal-operator/pull/307
See this Slack thread: https://coreos.slack.com/archives/CFP6ST0A3/p1670530729168599

This is a clone of issue OCPBUGS-10977. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10890. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10649. The following is the description of the original issue:

Description of problem:

After a replace upgrade from OCP 4.14 image to another 4.14 image first node is in NotReady.

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
NAME                     STATUS   ROLES  AGE   VERSION
ip-10-0-128-175.us-east-2.compute.internal  Ready   worker  72m   v1.26.2+06e8c46
ip-10-0-134-164.us-east-2.compute.internal  Ready   worker  68m   v1.26.2+06e8c46
ip-10-0-137-194.us-east-2.compute.internal  Ready   worker  77m   v1.26.2+06e8c46
ip-10-0-141-231.us-east-2.compute.internal  NotReady  worker  9m54s  v1.26.2+06e8c46

- lastHeartbeatTime: "2023-03-21T19:48:46Z"
  lastTransitionTime: "2023-03-21T19:42:37Z"
  message: 'container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
   message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/.
   Has your network provider started?'
  reason: KubeletNotReady
  status: "False"
  type: Ready

Events:
 Type   Reason          Age         From          Message
 ----   ------          ----        ----          -------
 Normal  Starting         11m         kubelet        Starting kubelet.
 Normal  NodeHasSufficientMemory 11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientMemory
 Normal  NodeHasNoDiskPressure  11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
 Normal  NodeHasSufficientPID   11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientPID
 Normal  NodeAllocatableEnforced 11m         kubelet        Updated Node Allocatable limit across pods
 Normal  Synced          11m         cloud-node-controller Node synced successfully
 Normal  RegisteredNode      11m         node-controller    Node ip-10-0-141-231.us-east-2.compute.internal event: Registered Node ip-10-0-141-231.us-east-2.compute.internal in Controller
 Warning ErrorReconcilingNode   17s (x30 over 11m) controlplane      nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation

ovnkube-master log:

I0321 20:55:16.270197       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270273       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:17.851497       1 master.go:719] Adding or Updating Node "ip-10-0-137-194.us-east-2.compute.internal"
I0321 20:55:25.965132       1 master.go:719] Adding or Updating Node "ip-10-0-128-175.us-east-2.compute.internal"
I0321 20:55:45.928694       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432145 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:55:46.270129       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270154       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270164       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:55:46.270201       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270284       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:52.916512       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 5 items received
I0321 20:56:06.910669       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 12 items received
I0321 20:56:15.928505       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432175 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:56:16.269611       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269637       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269646       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:56:16.269688       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269697       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269724       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation

cluster-network-operator log:

I0321 21:03:38.487602       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.488312       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:38.499825       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.571013       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available
I0321 21:03:39.450912       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.450953       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.493206       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.494050       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:39.508538       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.684429       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. management cluster 4.13
2. bring up the hostedcluster and nodepool in 4.14.0-0.nightly-2023-03-19-234132
3. upgrade the hostedcluster to 4.14.0-0.nightly-2023-03-20-201450 
4. replace upgrade the nodepool to 4.14.0-0.nightly-2023-03-20-201450 

Actual results

First node is in NotReady

Expected results:

All nodes should be Ready

Additional info:

No issue with replace upgrade from 4.13 to 4.14

 

 

 

 

 

 

This is a clone of issue OCPBUGS-12839. The following is the description of the original issue:

Description

As a user, I would like to see the type of technology used by the samples on the samples view similar to the all services view. 

On the samples view:

It is showing different types of samples, e.g. devfile, helm and all showing as .NET. It is difficult for user to decide which .Net entry to select on the list. We'll need something like the all service view where it shows the type of technology on the top right of each card for users to differentiate between the entries:

Acceptance Criteria

  1. Add visible label as the all services view on each card to show the technology used by the sample on the samples view.

Additional Details:

Description of problem:

Requesting backport for the pull request  https://github.com/openshift/console/pull/12109 in RHOCP 4.11

Version-Release number of selected component (if applicable):

4.11.z

How reproducible:

NA

Steps to Reproduce:

1.
2.
3.

Actual results:

Developer view does not display Logs tab

Expected results:

Developer view should display Logs tab

Additional info:

Jira to be referred https://issues.redhat.com/browse/LOG-3388

#Description of problem:

Developer Console > +ADD > Develoeper Catalog > Service > select Types Templates > Initiate Template

Input values in Instantiate Template are disappeared randomly.

#Version-Release number of selected component (if applicable):

  • Customer ENV
  • OCP4.10.9
  • Developer Console
  • Edge 88x. / Edge 85.0.x / Chrome 97.x /Chrome 88.x
  • Internet Disconnected OCP cluster
  • quicklab test ENV
  • Developer Console
  • OCP4.10.12 
  • Chrome 100

#How reproducible:

I reproduced this issue in ocp410ovn shared cluster in the quicklab

Select Apache HTTP Server > Input name "test" in Application Hostname box
After several seconds, the value has disappeared in the web console.

#Steps to Reproduce:

0. Developer Console > +ADD > Develoeper Catalog > Service > select Types Templates > Initiate Template

1. Input values in the box of template menu.

2. The values are disappeared after several seconds later. (20s~ or randomly)

3. Many users have experienced this issue.

  • The web browser version of users experiencing this issue.
  • Customer: Edge 88x. / Edge 85.0.x / Chrome 97.x /Chrome 88.x
  • My browser: Chrome 102.x

==> the browser version doesn't matter.

#Actual results:

Input values in "Instantiate Template" are disappeared randomly.
Users can't use the Initiate Template feature in the Dev console.

#Expected results:
Input values remain in the web console and users creat the object by the "Instantiate Template"

#Additional info:

See "Application Name" has disappeared in the video I attached.

Description of problem:

We need to have admin-ack in 4.11 so that admins can check the deprecated APIs and approve when they move to 4.12.Refer https://access.redhat.com/articles/6955381 for  more information. As planned we want to add the admin-ack around 4.12 feature freeze. 

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always

Steps to Reproduce:

1. Install a cluster in 4.11. 
2. Run an application which uses the deprecated API. See https://access.redhat.com/articles/6955381 for more information.
3. Upgrade to 4.12

Actual results:

The upgrade happens without asking the admin to confirm that the worksloads do not use the deprecated APIs.

Expected results:

Upgrade should wait for the admin-ack.

Additional info:

We had admin-acks in the past too e.g. https://docs.openshift.com/container-platform/4.9/updating/updating-cluster-prepare.html#update-preparing-migrate_updating-cluster-prepare

This is a clone of issue OCPBUGS-16382. The following is the description of the original issue:

This is a clone of issue OCPBUGS-16124. The following is the description of the original issue:

This is a clone of issue OCPBUGS-9404. The following is the description of the original issue:

Version:

$ openshift-install version
./openshift-install 4.11.0-0.nightly-2022-07-13-131410
built from commit cdb9627de7efb43ad7af53e7804ddd3434b0dc58
release image registry.ci.openshift.org/ocp/release@sha256:c5413c0fdd0335e5b4063f19133328fee532cacbce74105711070398134bb433
release architecture amd64

Platform:

  • Azure IPI

What happened?
When one creates an IPI Azure cluster with an `internal` publishing method, it creates a standard load balancer with an empty definition. This load balancer doesn't serve a purpose as far as I can tell since the configuration is completely empty. Because it doesn't have a public IP address and backend pools it's not providing any outbound connectivity, and there are no frontend IP configurations for ingress connectivity to the cluster.

Below is the ARM template that is deployed by the installer (through terraform)

```
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"loadBalancers_mgahagan411_7p82n_name":

{ "defaultValue": "mgahagan411-7p82n", "type": "String" }

},
"variables": {},
"resources": [
{
"type": "Microsoft.Network/loadBalancers",
"apiVersion": "2020-11-01",
"name": "[parameters('loadBalancers_mgahagan411_7p82n_name')]",
"location": "northcentralus",
"sku":

{ "name": "Standard", "tier": "Regional" }

,
"properties":

{ "frontendIPConfigurations": [], "backendAddressPools": [], "loadBalancingRules": [], "probes": [], "inboundNatRules": [], "outboundRules": [], "inboundNatPools": [] }

}
]
}
```

What did you expect to happen?

  • Don't create the standard load balancer on an internal Azure IPI cluster (as it appears to serve no purpose)

How to reproduce it (as minimally and precisely as possible)?
1. Create an IPI cluster with the `publish` installation config set to `Internal` and the `outboundType` set to `UserDefinedRouting`.
```
apiVersion: v1
controlPlane:
architecture: amd64
hyperthreading: Enabled
name: master
platform:
azure: {}
replicas: 3
compute:

  • architecture: amd64
    hyperthreading: Enabled
    name: worker
    platform:
    azure: {}
    replicas: 3
    metadata:
    name: mgahaganpvt
    platform:
    azure:
    region: northcentralus
    baseDomainResourceGroupName: os4-common
    outboundType: UserDefinedRouting
    networkResourceGroupName: mgahaganpvt-rg
    virtualNetwork: mgahaganpvt-vnet
    controlPlaneSubnet: mgahaganpvt-master-subnet
    computeSubnet: mgahaganpvt-worker-subnet
    pullSecret: HIDDEN
    networking:
    clusterNetwork:
  • cidr: 10.128.0.0/14
    hostPrefix: 23
    serviceNetwork:
  • 172.30.0.0/16
    machineNetwork:
  • cidr: 10.0.0.0/16
    networkType: OpenShiftSDN
    publish: Internal
    proxy:
    httpProxy: http://proxy-user1:password@10.0.0.0:3128
    httpsProxy: http://proxy-user1:password@10.0.0.0:3128
    baseDomain: qe.azure.devcluster.openshift.com
    ```

2. Show the json content of the standard load balancer is completely empty
`az network lb show -g myResourceGroup -n myLbName`

```
{
"name": "mgahagan411-7p82n",
"id": "/subscriptions/00000000-0000-0000-00000000/resourceGroups/mgahagan411-7p82n-rg/providers/Microsoft.Network/loadBalancers/mgahagan411-7p82n",
"etag": "W/\"40468fd2-e56b-4429-b582-6852348b6a15\"",
"type": "Microsoft.Network/loadBalancers",
"location": "northcentralus",
"tags": {},
"properties":

{ "provisioningState": "Succeeded", "resourceGuid": "6fb11ec9-d89f-4c05-b201-a61ea8ed55fe", "frontendIPConfigurations": [], "backendAddressPools": [], "loadBalancingRules": [], "probes": [], "inboundNatRules": [], "inboundNatPools": [] }

,
"sku":

{ "name": "Standard" }

}
```

Description of problem:

Version-Release number of selected component (if applicable):
4.11

How reproducible:
Always

Steps to Reproduce:
1. Enable UWM + dedicated UWM Alertmanager
2. Deploy an application + service monitor + alerting rule which fires always
3. Go to the OCP dev console and silence the alert.

Actual results:
Nothing happens

Expected results:
The alert notification is muted.

Additional info:
Copied from https://bugzilla.redhat.com/show_bug.cgi?id=2100860

This is a clone of issue OCPBUGS-11636. The following is the description of the original issue:

Description of problem:

The ACLs are disabled for all newly created s3 buckets, this causes all OCP installs to fail: the bootstrap ignition can not be uploaded:

level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg=	status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg=  with aws_s3_bucket_acl.ignition,
level=error msg=  on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg=  62: resource "aws_s3_bucket_acl" ignition {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg=	status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg=  with aws_s3_bucket_acl.ignition,
level=error msg=  on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg=  62: resource "aws_s3_bucket_acl" ignition {


Version-Release number of selected component (if applicable):

4.11+
 

How reproducible:

Always
 

Steps to Reproduce:

1.Create a cluster via IPI

Actual results:

install fail
 

Expected results:

install succeed
 

Additional info:

Heads-Up: Amazon S3 Security Changes Are Coming in April of 2023 - https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-ownership-error-responses.html - After you apply the bucket owner enforced setting for Object Ownership, ACLs are disabled.

 

This is a clone of issue OCPBUGS-6661. The following is the description of the original issue: 

Description of problem:

CRL list is capped at 1MB due to configmap max size. If multiple public CRLs are needed for ingress controller the CRL pem file will be over 1MB. 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Create CRL configmap with the following distribution points: 

         Issuer: C=US, O=DigiCert Inc, CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
         Subject: SOME SIGNED CERT            X509v3 CRL Distribution Points: 
                Full Name:
                  URI:http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.cr  
       
      
# curl -o DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl
# openssl crl -in  DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl -inform DER -out  DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem 
# du -bsh DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem 
604K    DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem


I still need to find more intermediate CRLS to grow this. 

Actual results:

2023-01-25T13:45:01.443Z ERROR operator.init controller/controller.go:273 Reconciler error {"controller": "crl", "object": {"name":"custom","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "custom", "reconcileID": "d49d9b96-d509-4562-b3d9-d4fc315226c0", "error": "failed to ensure client CA CRL configmap for ingresscontroller openshift-ingress-operator/custom: failed to update configmap: ConfigMap \"router-client-ca-crl-custom\" is invalid: []: Too long: must have at most 1048576 bytes"}

Expected results:

First be able to create a configmap where data only accounted to the 1MB max (see additional info below for more details), second some way to compress or allow a large CRL list that would be larger than 1MB

Additional info:

Only using this CRL and it being only 600K still causes issue and it could be due to  the `last-applied-configuration` annotation on the configmap. This is added since we do an apply operation (update) on the configmap. I am not sure if this is counting towards the 1MB max. 

https://github.com/openshift/cluster-ingress-operator/blob/release-4.10/pkg/operator/controller/crl/crl_configmap.go#L295 

Not sure if we could just replace the configmap.   

 

Description of problem:

IPI installation failed against 4.11 nightly build on Azure Stack Hub WWT.

workers machines failed to be provisioned as using the wrong image which does not exist.
$ oc get machine -n openshift-machine-api
NAME                                  PHASE     TYPE              REGION   ZONE   AGE
jima411az-xvwvg-master-0              Running   Standard_DS4_v2   mtcazs          3h10m
jima411az-xvwvg-master-1              Running   Standard_DS4_v2   mtcazs          3h10m
jima411az-xvwvg-master-2              Running   Standard_DS4_v2   mtcazs          3h10m
jima411az-xvwvg-worker-mtcazs-4k6hs   Failed                                      3h3m
jima411az-xvwvg-worker-mtcazs-fc6s5   Failed                                      3h3m
jima411az-xvwvg-worker-mtcazs-sclgf   Failed                                      3h3m

$ oc describe machine jima411az-xvwvg-worker-mtcazs-4k6hs -n openshift-machine-api
...
Events:
  Type     Reason        Age   From              Message
  ----     ------        ----  ----              -------
  Warning  FailedCreate  3h4m  azure-controller  InvalidConfiguration: failed to reconcile machine "jima411az-xvwvg-worker-mtcazs-4k6hs": failed to create vm jima411az-xvwvg-worker-mtcazs-4k6hs: failure sending request for machine jima411az-xvwvg-worker-mtcazs-4k6hs: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="NotFound" Message="The Image '/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jima411az-xvwvg-rg/providers/Microsoft.Compute/images/jima411az-xvwvg-gen2' cannot be found in 'mtcazs' region."

The image created on ASH is `/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jima411az-xvwvg-rg/providers/Microsoft.Compute/images/jima411az-xvwvg`

# az resource list -g jima411az-xvwvg-rg --query "[?type=='Microsoft.Compute/images']"
[
  {
    "changedTime": "2023-08-03T01:08:03.369233+00:00",
    "createdTime": "2023-08-03T00:57:34.228352+00:00",
    "id": "/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jima411az-xvwvg-rg/providers/Microsoft.Compute/images/jima411az-xvwvg",
    "identity": null,
    "kind": null,
    "location": "mtcazs",
    "managedBy": null,
    "name": "jima411az-xvwvg",
    "plan": null,
    "properties": null,
    "provisioningState": "Succeeded",
    "resourceGroup": "jima411az-xvwvg-rg",
    "sku": null,
    "tags": {},
    "type": "Microsoft.Compute/images"
  }
]

Version-Release number of selected component (if applicable):

4.11.0-0.nightly-2023-07-29-013834

How reproducible:

Always on 4.11

Steps to Reproduce:

1. Deploy IPI with 4.11 on ASH WWT
2. 
3.

Actual results:

worker nodes failed to be provisioned

Expected results:

Installation is successful. 

Additional info:

Detect issue on 4.11, IPI installation works well on 4.12+.

Description of problem:

  intra namespace allow network policy doesn't work after applying ingress&egress deny all network policy

Version-Release number of selected component (if applicable):

  OpenShift 4.10.12

How reproducible:

Always

Steps to Reproduce:
  1. Define deny all network policy for egress an ingress in a namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

2. Define the following network policy to allow the traffic between the pods in the namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace-001
spec:
  egress:
  - to:
    - podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress 

3. Test the connectivity between two pods from the namespace.

Actual results:

   The connectivity is not allowed

Expected results:

  The connectivity should be allowed between pods from the same namespace.

Additional info:

  After performing a test and analyzing SDN flows for the namespace: 

sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376 
 cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
 cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]

We see that the following rule doesn't match because `reg1` hasn't been defined:

 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30 

 

Description of problem:

Data upload form in storage -> PVC -> Data upload form, does not support source Ref.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. storage -> PVC -> Data upload
2. click "Attach this data to a Virtual Machine operating system"
3. 

Actual results:

If the template using sourceRef ( and the source reference name is not identical to the PVC name ), the vm will use the sourceRef and not the uploaded data source.

Expected results:

a - do not allow to upload to specific template/s, ( use a different UI to manage sourceRef and importCron )
b - allow to upload to specific template/s, and make sure it works with sourceRef and importCron

Additional info:

this is a manual clone of [1]. Summary: The checkbox of "Attach this data to a Virtual Machine operating system" should go away, for more info follow comments on [1].
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2110256

 

This is a clone of issue OCPBUGS-19942. The following is the description of the original issue:

This is a clone of issue OCPBUGS-16735. The following is the description of the original issue:

Description of problem:

oc adm inspect generated files sometime have the leading "---" and some time do not. This depends on the order of objects collected. This by itself is not an issue.

However this becomes an issue when combined with multiple invocations of oc adm inspect and collecting data to the same directory like must-gather does.

If an object is collected multiple times then the second time oc might overwrite the original file improperly and leave 4 bytes of the original content behind.

This is happening when not writing the "---\n" in the second invocation as this makes the content 4B shorter and the original tailing 4B are left in the file intact.

This garbage confuses YAML parsers.

Version-Release number of selected component (if applicable):

4.14 nighly as of Jul 25 and before

How reproducible:

Always

Steps to Reproduce:

Run oc adm inspect twice with different order of objects:

[msivak@x openshift-must-gather]$ oc adm inspect performanceprofile,machineconfigs,nodes --dest-dir=inspect.dual --all-namespaces
[msivak@x openshift-must-gather]$ oc adm inspect nodes --dest-dir=inspect.dual --all-namespaces


And then check the alphabetically first node yaml file - it will have garbage at the end of the file.

Actual results:

Garbage at the end of the file.

Expected results:

No garbage.

Additional info:

I believe this is caused by the lack of Truncate mode here https://github.com/openshift/oc/blob/master/pkg/cli/admin/inspect/writer.go#L54


Collecting data multiple times cannot be easily avoided when multiple collect scripts are combined with relatedObjects requested by operators.

This is a clone of issue OCPBUGS-8491. The following is the description of the original issue:

Description of problem:

Image registry pods panic while deploying OCP in ap-southeast-4 AWS region

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Deploy OCP in AWS ap-southeast-4 region

Steps to Reproduce:

Deploy OCP in AWS ap-southeast-4 region 

Actual results:

panic: Invalid region provided: ap-southeast-4

Expected results:

Image registry pods should come up with no errors

Additional info:

 

 

 

 

Fixes rolled out for FIPS make it impossible for a binary compiled on RHELX to run on RHELY (in a FIPS compliant way) if the RHELY host is running with FIPS enabled.

This is because the binary will not find a compatible OpenSSL library and fall back to using internal Go crypto. Only OpenSSL is FIPS certified.

With the introduction of "FIPS or Die", binaries run in FIPS mode but which cannot use OpenSSL will exit with an error. Customers using RHEL7 with FIPS enabled, therefore, need a RHEL7 based oc.

This bug is a backport clone of [Bugzilla Bug 2094362](https://bugzilla.redhat.com/show_bug.cgi?id=2094362). The following is the description of the original bug:

Description of problem:
A change [1] was introduced to split the kube-apiserver SLO rules into 2 groups to reduce the load on Prometheus (see bug 2004585).

[1] https://github.com/openshift/cluster-kube-apiserver-operator/commit/4a1751ee86cda37f0d9ea520cac09f91ebc3abe9

Version-Release number of selected component (if applicable):
4.9 (because the change was backported to 4.9.z)

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 4.9
2. Retrieve kube-apiserver-slos*
oc get -n openshift-kube-apiserver prometheusrules kube-apiserver-slos -o yaml
oc get -n openshift-kube-apiserver prometheusrules kube-apiserver-slos-basic -o yaml

Actual results:

The KubeAPIErrorBudgetBurn alert with labels

{long="1h",namespace="openshift-kube-apiserver",severity="critical",short="5m"}

exists both in kube-apiserver-slos and kube-apiserver-slos-basic.

The alerting rules is evaluated twice. The same is true for recording rules like "apiserver_request:burnrate1h" and in this case, it can trigger warning logs in the Prometheus pods:

> level=warn component="rule manager" group=kube-apiserver.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=283

Expected results:

I presume that kube-apiserver-slos shouldn't exist since it's been replaced by kube-apiserver-slos-basic and kube-apiserver-slos-extended.

Additional info:

Discovered while investigating bug 2091902

This is a clone of issue OCPBUGS-5879. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5505. The following is the description of the original issue:

Description of problem:

The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, same as the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2minutes * (0..1 + 1). We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test only has 3 minutes timeout. Additionally, the non-determinism and longer throttling results makes UX worse by actions done in the cluster may have their observable effect delayed.

Version-Release number of selected component (if applicable):

discovered in 4.10 -> 4.11 upgrade jobs

How reproducible:

The test seems to flake ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs though which is a bit weird.

Steps to Reproduce:

Inspect the CVO log and E2E logs from failing jobs with the provided [^check-cvo.py] helper:

$ ./check-cvo.py cvo.log && echo PASS || echo FAIL

Preferably, inspect CVO logs of clusters that just underwent an upgrade (upgrades makes the original problematic behavior more likely to surface)

Actual results:

$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL
FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333
FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251
FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521
FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604
FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531
Upgradeability checks:           5
Upgradeability check cache hits: 12
FAIL

Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.

Expected result

$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL
Upgradeability checks:           12
Upgradeability check cache hits: 11
PASS

The actual numbers are not relevant (unless the upgradeabilily check count is zero, which means the test is not conclusive, the script warns about that), lack of failure is.

Additional info:

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go
...
I1227 06:50:59.023190       1 upgradeable.go:122] Cluster current version=4.10.46
I1227 06:50:59.042735       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:51:14.024345       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:53:23.080768       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:56:59.366010       1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12'
Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ...
Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur after a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.

I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With the period of 15m, the test passrate was 4 of 9. Wiht the period of 1m, the test did not fail at all.

Some context in Slack thread

This is a clone of BZ https://bugzilla.redhat.com/show_bug.cgi?id=2117374 which fixed 4.12

 
Description of problem:
OCP 4.11 introduced the `restricted-v2` SecurityContextConstraint as the default binding for all authenticated users. When a Pod that would have normally be admitted successfully using the `restricted` SCC, but cannot be admitted using a `restricted-v2` SCC, it's not clear to the customer that the problem is related to the default SCC changing from restricted to restricted-v2.

Suggestion is to generate an additional message for this specific use case that makes it clear to customers that they have a workload that is not compatible with restricted-v2.

Example message:
Error creating: pods "nginx-ingress-controller-75bffcfdf8-" is forbidden: the pod fails to validate against the default `restricted-v2` security context constraint, but would validate successfully against the `restricted` security context constraint.

The goal of such a message is to give the customer a breadcrumb that the `restricted-v2` SCC is activated and set as the default, which they may have not been aware of. This would give them a way to find alternatives, such as changing the global default, fixing the application or assigning the `restricted` SCC to the Namespace or ServiceAccount.

Version-Release number of selected component (if applicable):
4.11

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 4.11.0
2. Create a deployment with a podtemplate with .spec.containers[*].securityContext.allowPrivigeEscalation = true in a new namespace

Actual results:
The pod will fail to be admitted with an error such as:

Warning FailedCreate 27m (x25 over 76m) replicaset-controller Error creating: pods "nginx-ingress-controller-75bffcfdf8-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.containers[0].securityContext.allowPrivilegeEscalation: Invalid value: true: Allowing privilege escalation for containers is not allowed, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "noobaa": Forbidden: not usable by user or serviceaccount, provider "noobaa-endpoint": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "rook-ceph": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "rook-ceph-csi": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

Expected results:
Add an additional or replacement error message when there is no SCC assignment other than `restricted-v2` AND `restricted` would have worked:

Error creating: pods "nginx-ingress-controller-75bffcfdf8-" is forbidden: the pod fails to validate against the default `restricted-v2` security context constraint, but would validate successfully against the `restricted` security context constraint.

Additional info:

This is a clone of issue OCPBUGS-7885. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7617. The following is the description of the original issue:

Description of problem:

Azure Disk volume is taking time to attach/detach
Version-Release number of selected component (if applicable):

Openshift ARO 4.10.30
How reproducible:

While performing scaledown and scaleup of statefulset pod takes time to attach and detach volume from nodes.

Reviewed must-gather and test output will share my findings in comments.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Description of problem:
In a complete disconnected cluster, the dev catalog is taking too much time in loading

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. A complete disconnected cluster
2. In add page go to the All services page
3.

Actual results:
Taking too much time too load

Expected results:
Time taken should be reduced

Additional info:
Attached a gif for reference

The CMO e2e tests create a bunch of resources. These should be cleaned up on a successful run. However:

  • Some test failures leave the create resource behind, which have to be cleaned up before a re-run.
  • There have been developer reports that even successful runs don't tidy up everything.

In a CI context this is rarely a problem, however running the tests locally can be made quite awkward, especially repeated runs on the same cluster.

We should tag all resources created by the e2e tests with a label (app.kubernetes.io/created-by: cmo-e2e-test).
This will allow easy cleanup by deleting all resources with that label and will allow for checking proper clean-up.

DoD:
All e2e resources get properly tagged.
It is straight forward to ensure that future code changes don't skip adding this tag.

This is a clone of issue OCPBUGS-7437. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5547. The following is the description of the original issue:

Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390

When creating a Knative Service and delete it again with enabled option "Delete other resources created by console" (only available on 4.13+ with the PR above) the secret "$name-github-webhook-secret" is not deleted.

When the user tries to create the same Knative Service again this fails with an error:

An error occurred
secrets "nodeinfo-github-webhook-secret" already exists

Version-Release number of selected component (if applicable):
4.13

(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5548)

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Serverless operator (tested with 1.26.0)
  2. Create a new project
  3. Navigate to Add > Import from git and create an application
  4. In the topology select the Knative Service > "Delete Service" (not Delete App)

Actual results:
Deleted resources:

  1. Knative Service (tries it twice!) $name
  2. ImageStream $name
  3. BuildConfig $name
  4. Secret $name-generic-webhook-secret

Expected results:
Should also remove this resource

  1. Delete Knative Service should be called just once
  2. Secret $name-github-webhook-secret

Additional info:
When delete the whole application all the resources are deleted correctly (and just once)!

  1. Knative Service (just once!) $name
  2. ImageStream $name
  3. BuildConfig $name
  4. Secret $name-generic-webhook-secret
  5. Secret $name-github-webhook-secret

Description of problem:

The alertmanager pod is stuck on OCP 4.11 with OVN in container Creating State

From oc describe alertmanager pod:
...
Events:
  Type     Reason                  Age                  From     Message
  ----     ------                  ----                 ----     -------
  Warning  FailedCreatePodSandBox  16s (x459 over 17h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-managed-ocs-alertmanager-0_openshift-storage_3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc_0(88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78): error adding pod openshift-storage_alertmanager-managed-ocs-alertmanager-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/alertmanager-managed-ocs-alertmanager-0/3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-storage/alertmanager-managed-ocs-alertmanager-0 88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78] [openshift

Version-Release number of selected component (if applicable):

OCP 4.11 with OVN

How reproducible:

100%

Steps to Reproduce:

1. Terminate the node on which alertmanager pod is running
2. pod will get stuck in container Creating state
3.

Actual results:

AlertManager pod is stuck in container Creating state

Expected results:

Alertmanager pod is ready

Additional info:

The workaround would be to terminate the alertmanager pod

Description of problem:

Dummy bug that is needed to track backport of https://github.com/ovn-org/ovn-kubernetes/pull/2975/commits/816e30a1fbb5beb8b20fe3e96906285762dd8eb6 which is already merged in 4.12

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-20181. The following is the description of the original issue:

Description of problem:

unit test failures rates are high https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-oc-master-unit

TestNewAppRunAll/emptyDir_volumes is failing

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1557/pull-ci-openshift-oc-master-unit/1710206848667226112

Version-Release number of selected component (if applicable):

 

How reproducible:

Run local or in CI and see that unit test job is failing

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4504. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1557. The following is the description of the original issue:

Seen in an instance created recently by a 4.12.0-ec.2 GCP provider:

  "scheduling": {
    "automaticRestart": false,
    "onHostMaintenance": "MIGRATE",
    "preemptible": false,
    "provisioningModel": "STANDARD"
  },

From GCP's docs, they may stop instances on hardware failures and other causes, and we'd need automaticRestart: true to auto-recover from that. Also from GCP docs, the default for automaticRestart is true. And on the Go provider side, we doc:

If omitted, the platform chooses a default, which is subject to change over time, currently that default is "Always".

But the implementing code does not actually float the setting. Seems like a regression here, which is part of 4.10:

$ git clone https://github.com/openshift/machine-api-provider-gcp.git
$ cd machine-api-provider-gcp
$ git log --oneline origin/release-4.10 | grep 'migrate to openshift/api'
44f0f958 migrate to openshift/api

But that's not where the 4.9 and earlier code is located:

$ git branch -a | grep origin/release
  remotes/origin/release-4.10
  remotes/origin/release-4.11
  remotes/origin/release-4.12
  remotes/origin/release-4.13

Hunting for 4.9 code:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.48-x86_64 | grep gcp
  gcp-machine-controllers                        https://github.com/openshift/cluster-api-provider-gcp                       c955c03b2d05e3b8eb0d39d5b4927128e6d1c6c6
  gcp-pd-csi-driver                              https://github.com/openshift/gcp-pd-csi-driver                              48d49f7f9ef96a7a42a789e3304ead53f266f475
  gcp-pd-csi-driver-operator                     https://github.com/openshift/gcp-pd-csi-driver-operator                     d8a891de5ae9cf552d7d012ebe61c2abd395386e

So looking there:

$ git clone https://github.com/openshift/cluster-api-provider-gcp.git
$ cd cluster-api-provider-gcp
$ git log --oneline | grep 'migrate to openshift/api'
...no hits...
$ git grep -i automaticRestart origin/release-4.9  | grep -v '"description"\|compute-gen.go'
origin/release-4.9:vendor/google.golang.org/api/compute/v1/compute-api.json:        "automaticRestart": {

Not actually clear to me how that code is structured. So 4.10 and later GCP machine-API providers are impacted, and I'm unclear on 4.9 and earlier.

This is a clone of issue OCPBUGS-5100. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5068. The following is the description of the original issue:

Description of problem:

virtual media provisioning fails when iLO Ironic driver is used

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. attempt virtual media provisioning on a node configured with ilo-virtualmedia:// drivers
2.
3.

Actual results:

Provisioning fails with "An auth plugin is required to determine endpoint URL" error

Expected results:

Provisioning succeeds

Additional info:

Relevant log snippet:

3742 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector [None req-e58ac1f2-fac6-4d28-be9e-983fa900a19b - - - - - -] Unable to start managed inspection for node e4445d43-3458-4cee-9cbe-6da1de75      78cd: An auth plugin is required to determine endpoint URL: keystoneauth1.exceptions.auth_plugins.MissingAuthPlugin: An auth plugin is required to determine endpoint URL
 3743 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector Traceback (most recent call last):
 3744 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/inspector.py", line 210, in _start_managed_inspection
 3745 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     task.driver.boot.prepare_ramdisk(task, ramdisk_params=params)
 3746 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic_lib/metrics.py", line 59, in wrapped
 3747 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     result = f(*args, **kwargs)
 3748 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/ilo/boot.py", line 408, in prepare_ramdisk
 3749 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     iso = image_utils.prepare_deploy_iso(task, ramdisk_params,
 3750 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 624, in prepare_deploy_iso
 3751 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     return prepare_iso_image(inject_files=inject_files)
 3752 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 537, in _prepare_iso_image
 3753 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     image_url = img_handler.publish_image(
 3754 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 193, in publish_image
 3755 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     swift_api = swift.SwiftAPI()
 3756 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/common/swift.py", line 66, in __init__
 3757 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     endpoint = keystone.get_endpoint('swift', session=session)

Description of problem:

During restart egress firewall acls will be deleted and re-created from scratch, meaning that egress firewall rules won't be applied for some time during restart

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-10943. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10661. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10591. The following is the description of the original issue:

Description of problem:

Starting with 4.12.0-0.nightly-2023-03-13-172313, the machine API operator began receiving an invalid version tag either due to a missing or invalid VERSION_OVERRIDE(https://github.com/openshift/machine-api-operator/blob/release-4.12/hack/go-build.sh#L17-L20) value being passed tot he build.

This is resulting in all jobs invoked by the 4.12 nightlies failing to install.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-03-13-172313 and later

How reproducible:

consistently in 4.12 nightlies only(ci builds do not seem to be impacted).

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Example of failure https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-csi/1635331349046890496/artifacts/e2e-aws-csi/gather-extra/artifacts/pods/openshift-machine-api_machine-api-operator-866d7647bd-6lhl4_machine-api-operator.log

Tracker issue for bootimage bump in 4.11. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-7530.

This is a clone of issue OCPBUGS-17862. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17769. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17568. The following is the description of the original issue:

Description of problem:

 

Customer used Agent-based installer to install 4.13.8 on they CID env, but during install process, the bootstrap machine had oom issue, check sosreport find the init container had oom issue

NOTE: Issue is not see when testing with 4.13.6, per the customer

initContainers:

  • name: machine-config-controller
    image: .Images.MachineConfigOperator
    command: ["/usr/bin/machine-config-controller"]
    args:
  • "bootstrap"
  • "--manifest-dir=/etc/mcc/bootstrap"
  • "--dest-dir=/etc/mcs/bootstrap"
  • "--pull-secret=/etc/mcc/bootstrap/machineconfigcontroller-pull-secret"
  • "--payload-version=.ReleaseVersion"
    resources:
    limits:
    memory: 50Mi

we found the sosreport dmesg and crio logs had oom kill machine-config-controller container issue, the issue was cause by cgroup kill, so looks like the limit 50M is too small

The customer used a physical machine that had 100GB of memory

the customer had some network config in asstant install yaml file, maybe the issue is them had some nic config?

log files:
1. sosreport
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/b5501734-60be-4de4-adcf-da57e22cbb8e?usePresignedUrl=true

2. asstent installer yaml file
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/a32635cf-112d-49ed-828c-4501e95a0e7a?usePresignedUrl=true

3. bootstrap machine oom screenshot
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/eefe2e57-cd23-4abd-9e0b-dd45f20a34d2?usePresignedUrl=true

This is a clone of issue OCPBUGS-5761. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5458. The following is the description of the original issue:

reported in https://coreos.slack.com/archives/C027U68LP/p1673010878672479

Description of problem:

Hey guys, I have a openshift cluster that was upgraded to version 4.9.58 from version 4.8. After the upgrade was done, the etcd pod on master1 isn't coming up and is crashlooping. and it gives the following error: {"level":"fatal","ts":"2023-01-06T12:12:58.709Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded, recBytes: 13279, fileSize(313430016) - offset(313418480) - padBytes(1) = entryLimit(11535)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/main.go:40\nmain.main\n\t/remote-source/cachito-gomod-with-deps/app/server/main.go:32\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


ENV:
OCP 4.11.29
VSphere IPI

ISSUE:
If scaling up 3-5 machines the machines will transition to running, join the cluster.

If scaling up a large number of machines, 10-20+ some will transition to running and join the cluster.
The remaining will stay in provisioned/provisioning state and never power on
If those nodes are manually powered on they join the cluster and are healthy.

OBSERVATION:
There are ~16k tags shared across multiple VMWare Cloud Foundation data centers.
In the logs it is observed the reconciliation of tags occur and then some of the nodes will power on and transition to running.
Do not see in the logs where the reconciliation of the tags on the remaining nodes runs again or completes.

This is a clone of issue OCPBUGS-11989. The following is the description of the original issue:

Description of problem:

Customer reports that when trying to create an application using the "Import from Git" workflow, the "Create" button at the very bottom of the form stays inactive. You can observe the issue in the video shared via Google Drive here (timestamp 00:35): https://drive.google.com/file/d/1GEA_TF5vV_ai9YDMZ3uzwEwYKkp_CY8r/view?usp=sharing

The customer can work around the issue by selecting another Import Strategy than "Builder Image" and then switching back to "Builder Image" (timestamp 00:49).

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.11.25

How reproducible:

Always

Steps to Reproduce:

1. Click the "+Add" button on the left menu
2. Enter a Git repository URL
3. Select "Bitbucket" as Git type
3a. If necessary, select a "Source Secret"
4. For "Import Strategy", select "Builder Image" and select one of the available images
5. In "Application" select "Create application"
6. For "Application Name" and "Name" insert any valid value

Actual results:

"Create" button at the bottom of the form is inactive and cannot be clicked.

Changing the Import Strategy to something else and then back to "Builder Image" makes the button active.

Expected results:

Button is active after filling out all the required form fields

Additional info:

* Video of the issue provided: https://drive.google.com/file/d/1GEA_TF5vV_ai9YDMZ3uzwEwYKkp_CY8r/view?usp=sharing

This is a clone of issue OCPBUGS-23141. The following is the description of the original issue:

This is a clone of issue OCPBUGS-22945. The following is the description of the original issue:

This is a clone of issue OCPBUGS-22830. The following is the description of the original issue:

Description of problem:

google CLI deprecated Python 3.5-3.7 from 448.0.0 causing release ci jobs failed with ERROR: gcloud failed to load. You are running gcloud with Python 3.6, which is no longer supported by gcloud. . specified version to 447.0.0
job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-o[...]cp-upi-f28-destructive/1719562110486188032

Description of problem:

This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2074299 for backporting purposes.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Created attachment 1905034 [details]
Plugin page with error

Steps to reproduce:

1. Install a plugin with a page that has a runtime error. (Demo Plugin -> Dynamic Nav 1 currently has an error for me, but you can reproduce by editing any plugin and introducing an error.)
2. Observe the "something went wrong" error message.
3. Navigate to any other page (e.g. Workloads -> Pods)

Expected result:

The pods page is displayed.

Action result:

The error message persists. There is no way to clear except to refresh the browser.

Description of problem:

[OVN][OSP] After reboot egress node, egress IP cannot be applied anymore.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-07-181244

How reproducible:

Frequently happened in automation. But didn't reproduce it in manual.

Steps to Reproduce:

1. Label one node as egress node

2.
Config one egressIP object
STEP: Check  one EgressIP assigned in the object.

Nov  8 15:28:23.591: INFO: egressIPStatus: [{"egressIP":"192.168.54.72","node":"huirwang-1108c-pg2mt-worker-0-2fn6q"}]

3.
Reboot the node, wait for the node ready.


Actual results:

EgressIP cannot be applied anymore. Waited more than 1 hour.
 oc get egressip
NAME             EGRESSIPS       ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-47031   192.168.54.72    

Expected results:

The egressIP should be applied correctly.

Additional info:


Some logs
E1108 07:29:41.849149       1 egressip.go:1635] No assignable nodes found for EgressIP: egressip-47031 and requested IPs: [192.168.54.72]
I1108 07:29:41.849288       1 event.go:285] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egressip-47031", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'NoMatchingNodeFound' no assignable nodes for EgressIP: egressip-47031, please tag at least one node with label: k8s.ovn.org/egress-assignable


W1108 07:33:37.401149       1 egressip_healthcheck.go:162] Could not connect to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107): context deadline exceeded
I1108 07:33:37.401348       1 master.go:1364] Adding or Updating Node "huirwang-1108c-pg2mt-worker-0-2fn6q"
I1108 07:33:37.437465       1 egressip_healthcheck.go:168] Connected to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107)

After this log, seems like no logs related to "192.168.54.72" happened.

This is a clone of issue OCPBUGS-15933. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15892. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15835. The following is the description of the original issue:

Description of problem:

https://search.ci.openshift.org/?search=error%3A+tag+latest+failed%3A+Internal+error+occurred%3A+registry.centos.org&maxAge=48h&context=1&type=build-log&name=okd&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

all currently tested versions

How reproducible:

~ 9% of jobs fail on this test

 

 ! error: Import failed (InternalError): Internal error occurred: registry.centos.org/dotnet/dotnet-31-runtime-centos7:latest: Get "https://registry.centos.org/v2/": dial tcp: lookup registry.centos.org on 172.30.0.10:53: no such host   782 31 minutes ago 

 

This is a clone of issue OCPBUGS-15858. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15589. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13526. The following is the description of the original issue:

Description of problem:

During a fresh install of an operator with conversion webhooks enabled, `crd.spec.conversion.webhook.clientConfig` is dynamically updated initially, as expected, with the proper webhook ns, name, & caBundle. However, within a few seconds, those critical settings are overwritten with the bundle’s packaged CRD conversion settings. This breaks the operator and stops the installation from completing successfully.

Oddly though, if that same operator version is installed as part of an upgrade from a prior release... the dynamic clientConfig settings are retained and all works as expected.

 

Version-Release number of selected component (if applicable):

OCP 4.10.36
OCP 4.11.18

How reproducible:

Consistently

 

Steps to Reproduce:

1. oc apply -f https://gist.githubusercontent.com/tchughesiv/0951d40f58f2f49306cc4061887e8860/raw/3c7979b58705ab3a9e008b45a4ed4abc3ef21c2b/conversionIssuesFreshInstall.yaml
2. oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}' -w

 

Actual results:

Eventually, the clientConfig settings will revert to the following and stay that way.

$ oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}'
map[service:map[name:dbaas-operator-webhook-service namespace:openshift-dbaas-operator path:/convert port:443]]
 conversion:
   strategy: Webhook
   webhook:
     clientConfig:
       service:
         namespace: openshift-dbaas-operator
         name: dbaas-operator-webhook-service
         path: /convert
         port: 443
     conversionReviewVersions:
       - v1alpha1
       - v1beta1

 

Expected results:

The `crd.spec.conversion.webhook.clientConfig` should instead retain the following settings.

$ oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}'
map[caBundle:LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJpRENDQVMyZ0F3SUJBZ0lJUVA1b1ZtYTNqUG93Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TWpFeU1UWXhPVEEwTWpsYUZ3MHlOREV5TVRVeE9UQTBNamxhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBVGcxaEtPWW40MStnTC9PdmVKT21jbkx5MzZNWTBEdnRGcXF3cjJFdlZhUWt2WnEzWG9ZeWlrdlFlQ29DZ3QKZ2VLK0UyaXIxNndzSmRSZ2paYnFHc3pGbzJFd1h6QU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFCkZPMWNXNFBrbDZhcDdVTVR1UGNxZWhST1gzRHZNQW9HQ0NxR1NNNDlCQU1DQTBrQU1FWUNJUURxN0pkUjkxWlgKeWNKT0hyQTZrL0M0SG9sSjNwUUJ6bmx3V3FXektOd0xiZ0loQU5ObUd6RnBqaHd6WXpVY2RCQ3llU3lYYkp3SAphYllDUXFkSjBtUGFha28xCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K service:map[name:dbaas-operator-controller-manager-service namespace:redhat-dbaas-operator path:/convert port:443]]
 conversion:
   strategy: Webhook
   webhook:
     clientConfig:
       service:
         namespace: redhat-dbaas-operator
         name: dbaas-operator-controller-manager-service
         path: /convert
         port: 443
       caBundle: >-
         LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJoekNDQVMyZ0F3SUJBZ0lJZXdhVHNLS0hhbWd3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TWpFeU1UWXhPVEF5TURkYUZ3MHlOREV5TVRVeE9UQXlNRGRhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUVRFQm8zb1BWcjRLemF3ZkE4MWtmaTBZQTJuVGRzU2RpMyt4d081ZmpKQTczdDQ2WVhOblFzTjNCMVBHM04KSXJ6N1dKVkJmVFFWMWI3TXE1anpySndTbzJFd1h6QU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFCkZJemdWbC9ZWkFWNmltdHl5b0ZkNFRkLzd0L3BNQW9HQ0NxR1NNNDlCQU1DQTBnQU1FVUNJRUY3ZXZ0RS95OFAKRnVrTUtGVlM1VkQ3a09DRzRkdFVVOGUyc1dsSTZlNEdBaUVBZ29aNmMvYnNpNEwwcUNrRmZSeXZHVkJRa25SRwp5SW1WSXlrbjhWWnNYcHM9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K 

 

Additional info:

If the operator is, instead, installed as an upgrade... vs a fresh install... the webhook settings are properly/permanently set and everything works as expected. This can be tested in a fresh cluster like this.

1. oc apply -f https://gist.githubusercontent.com/tchughesiv/703109961f22ab379a45a401be0cf351/raw/2d0541b76876a468757269472e8e3a31b86b3c68/conversionWorksUpgrade.yaml
2. oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}' -w

Description of problem:

The 4.11 version of openshift-installer does not support the mon01 zone

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Our Prometheus alerts are inconsistent with both upstream and sometimes our own vendor folder. Let's do a clean update run before the next release is branched off.

This is a clone of issue OCPBUGS-5191. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5164. The following is the description of the original issue:

Description of problem:

It looks like the ODC doesn't register KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on KnativeServing and KnativeEventing CRs, but they are looking for v1alpha1 version of those: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, and that breaks that ODC discovery.

Version-Release number of selected component (if applicable):

Openshift 4.8, Serverless Operator 1.27

Additional info:

https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019

 

Description of problem:
When creating a incomplete ClusterServiceVersion resource the OLM details page crashes (on 4.11).

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: minimal-csv
  namespace: christoph
spec:
  apiservicedefinitions:
    owned:
      - group: A
        kind: A
        name: A
        version: v1
  customresourcedefinitions:
    owned:
      - kind: B
        name: B
        version: v1
  displayName: My minimal CSV
  install:
    strategy: ''

Version-Release number of selected component (if applicable):
Crashes on 4.8-4.11, work fine from 4.12 onwards.

How reproducible:
Alway

Steps to Reproduce:
1. Apply the ClusterServiceVersion YAML from above
2. Open the Admin perspective > Installed Operator > Operator detail page

Actual results:
Details page crashes on tab A and B.

Expected results:
Page should not crash

Additional info:
Thi is a follow up on https://bugzilla.redhat.com/show_bug.cgi?id=2084287

This is a clone of issue OCPBUGS-2895. The following is the description of the original issue:

Description of problem:

Current validation will not accept Resource Groups or DiskEncryptionSets which have upper-case letters.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Attempt to create a cluster/machineset using a DiskEncryptionSet with an RG or Name with upper-case letters

Steps to Reproduce:

1. Create cluster with DiskEncryptionSet with upper-case letters in DES name or in Resource Group name

Actual results:

See error message:

encountered error: [controlPlane.platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format, compute[0].platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format]

Expected results:

Create a cluster/machineset using the existing and valid DiskEncryptionSet

Additional info:

I have submitted a PR for this already, but it needs to be reviewed and backported to 4.11: https://github.com/openshift/installer/pull/6513

This is a clone of issue OCPBUGS-6755. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3316. The following is the description of the original issue:

Description of problem:

Branch name in repository pipelineruns list view should match the actual github branch name.

Version-Release number of selected component (if applicable):

4.11.z

How reproducible:

alwaus

Steps to Reproduce:

1. Create a repository
2. Trigger the pipelineruns by push or pull request event on the github 

Actual results:

Branch name contains "refs-heads-" prefix in front of the actual branch name eg: "refs-heads-cicd-demo" (cicd-demo is the branch name)

Expected results:

Branch name should be the acutal github branch name. just `cicd-demo`should be shown in the branch column.

 

Additional info:
Ref: https://coreos.slack.com/archives/CHG0KRB7G/p1667564311865459

This is a clone of issue OCPBUGS-20573. The following is the description of the original issue:

This is a clone of issue OCPBUGS-20571. The following is the description of the original issue:

This is a clone of issue OCPBUGS-20564. The following is the description of the original issue:

This is just to allow update and backport of DOWNSTREAM_OWNERS files. No verification is needed.

This is a clone of issue OCPBUGS-753. The following is the description of the original issue:

Description of problem:

The default dns-default pod is missing the "target.workload.openshift.io/management:" annotation. 

As a result when the workload partitioning feature is enabled on SNO, this pod resources will not get mutated and pinned to the reserved cpuset.

This is a regresion from 4.10. Pod spec from 4.10.17

Annotations:
...
   resources.workload.openshift.io/dns: {"cpushares": 51}
   resources.workload.openshift.io/kube-rbac-proxy: {"cpushares": 10}
   target.workload.openshift.io/management {"effect":"PreferredDuringScheduling"}

Version-Release number of selected component (if applicable):

4.11.0

How reproducible:

100%

Steps to Reproduce:

1. Install a SNO and check the annotation
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

This is a copy of Bugzilla bug 2117524 for backport to 4.11.z

Original Text:

Description of problem:
On routers configured with mTLS and CRL defined in the CA with a CDP ; new CRL is downloaded only when restarting the ingress-operator.

2022-07-20T23:36:26.943Z	INFO	operator.clientca_configmap_controller	controller/controller.go:298	reconciling	{"request": "openshift-ingress-operator/service-bdrc"}
2022-07-20T23:36:26.943Z	INFO	operator.crl	crl/crl_configmap.go:69	certificate revocation list has expired	{"subject key identifier": "6aa909992e9890457b2a8de5659a44cab8e867a8"}
2022-07-20T23:36:26.943Z	INFO	operator.crl	crl/crl_configmap.go:69	retrieving certificate revocation list	{"subject key identifier": "6aa909992e9890457b2a8de5659a44cab8e867a8"}
2022-07-20T23:36:26.943Z	INFO	operator.crl	crl/crl_configmap.go:169	retrieving CRL distribution point	{"distribution point": "http://crl.domain.com/der/CN=XXXX,OU=XXX,O=XXX,C=XXX"}

Version-Release number of selected component (if applicable):
4.9.33

How reproducible:
Enable mTLS with a CRL

Actual results:
CRL is not download when expired
Clients get "SSL client certificate not trusted" errors while accessing resources

Expected results:
ingress-operator triggers CRL download when approaching expiration date so that the configmap is updated without manual action

Description of problem:

Setting disableNetworkDiagnostics: true does not persist when network-operator pod gets re-created.

Version-Release number of selected component (if applicable):
4.11.0-rc.0

How reproducible:
100%

Steps to Reproduce:

1. $ oc patch network.operator.openshift.io/cluster --patch '{"spec":{"disableNetworkDiagnostics":true}}' --type=merge
network.operator.openshift.io/cluster patched

2. $ oc -n openshift-network-operator delete pods network-operator-9b68954c6-bclx6
pod "network-operator-9b68954c6-bclx6" deleted

3. $ oc get network.operator.openshift.io cluster -o json | jq .spec.disableNetworkDiagnostics
false

Actual results:
disableNetworkDiagnostics set to false, not set to true as configured in step 1

Expected results:
disableNetworkDiagnostics set to true

Additional info:

Attaching must-gather.

This is a clone of issue OCPBUGS-9464. The following is the description of the original issue:

Description of problem:

mtls connection is not working when using an intermetiate CA appart from the root CA, both with CRL defined.
The Intermediate CA Cert had a published CDP which directed to a CRL issued by the root CA.

The config map in the openshift-ingress namespace contains the CRL as issued by the root CA. The CRL issued by the Intermediate CA is not present since that CDP is in the user cert and so not in the bundle.

When attempting to connect using a user certificate issued by the Intermediate CA it fails with an error of unknown CA.

When attempting to connect using a user certificate issued by the to Root CA the connection is successful.

Version-Release number of selected component (if applicable):

4.10.24

How reproducible:
Always

Steps to Reproduce:

1. Configure CA and intermediate CA with CRL
2. Sign client certificate with the intermediate CA
3. Configure mtls in openshift-ingress

Actual results:

When attempting to connect using a user certificate issued by the Intermediate CA it fails with an error of unknown CA.
When attempting to connect using a user certificate issued by the to Root CA the connection is successful.

Expected results:

Be able to connect with client certificated signed by the intermediate CA

Additional info:

This is a clone of issue OCPBUGS-1677. The following is the description of the original issue:

Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

OCPBUGS-1678 is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

This is a clone of issue OCPBUGS-4851. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4850. The following is the description of the original issue:

Description of problem:

Kuryr might take a while to create Pods because it has to create Neutron ports for the pods. If a pod gets deleted while this is being processed, a
warning Event will be generated causing the "[sig-network] pods should successfully create sandboxes by adding pod to network" to fail.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-6913. The following is the description of the original issue:

This is a clone of issue OCPBUGS-186. The following is the description of the original issue:

Description of problem:
When resizing the browser window, the PipelineRun task status bar would overlap the status text that says "Succeeded" in the screenshot.

Actual results:
Status text is overlapped by the task status bar

Expected results:
Status text breaks to a newline or gets shortened by "..."

This is a clone of issue OCPBUGS-8261. The following is the description of the original issue:

Description of problem:

An update from 4.13.0-ec.2 to 4.13.0-ec.3 stuck on:

$ oc get clusteroperator machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.13.0-ec.2   True        True          True       30h     Unable to apply 4.13.0-ec.3: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 105, ready 105, updated: 105, unavailable: 0)]

The worker MachineConfigPool status included:

Unable to find source-code formatter for language: node. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml      type: NodeDegraded
    - lastTransitionTime: "2023-02-16T14:29:21Z"
      message: 'Failed to render configuration for pool worker: Ignoring MC 99-worker-generated-containerruntime
        generated by older version 8276d9c1f574481043d3661a1ace1f36cd8c3b62 (my version:
        c06601510c0917a48912cc2dda095d8414cc5182)'

Version-Release number of selected component (if applicable):

4.13.0-ec.3. The behavior was apparently introduced as part of OCPBUGS-6018, which has been backported, so the following update targets are expected to be vulnerable: 4.10.52+, 4.11.26+, 4.12.2+, and 4.13.0-ec.3.

How reproducible:

100%, when updating into a vulnerable release, if you happen to have leaked MachineConfig.

Steps to Reproduce:

1. 4.12.0-ec.1 dropped cleanUpDuplicatedMC. Run a later release, like 4.13.0-ec.2.
2. Create more than one KubeletConfig or ContainerRuntimeConfig targeting the worker pool (or any pool other than master). The number of clusters who have had redundant configuration objects like this is expected to be small.
3. (Optionally?) delete the extra KubeletConfig and ContainerRuntimeConfig.
4. Update to 4.13.0-ec.3.

Actual results:

Update sticks on the machine-config ClusterOperator, as described above.

Expected results:

Update completes without issues.

Description of problem:

Availability Set will be created when vmSize is invalid in a region which has zones, but Availability Set should only be created in a region which don’t have zones.

Version-Release number of selected component (if applicable):

4.11.0-0.nightly-2022-10-07-174524
4.10.0-0.nightly-2022-10-07-205844

How reproducible:

Always

Steps to Reproduce:

1.Set up a cluster in a region which has zones. 
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                     PHASE     TYPE              REGION   ZONE   AGE
huliu-az410-99qcm-master-0               Running   Standard_D8s_v3   eastus   2      34m
huliu-az410-99qcm-master-1               Running   Standard_D8s_v3   eastus   3      34m
huliu-az410-99qcm-master-2               Running   Standard_D8s_v3   eastus   1      34m
huliu-az410-99qcm-worker-eastus1-xld58   Running   Standard_D4s_v3   eastus   1      27m
huliu-az410-99qcm-worker-eastus2-chzg8   Running   Standard_D4s_v3   eastus   2      27m
huliu-az410-99qcm-worker-eastus3-7g2mw   Running   Standard_D4s_v3   eastus   3      27m

2.Create a machineset with invalid vmSize 
liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms4.yaml 
machineset.machine.openshift.io/huliu-az410-99qcm-1 created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine                                                       
NAME                                     PHASE     TYPE              REGION   ZONE   AGE
huliu-az410-99qcm-1-cfw6w                Failed                                      8s
huliu-az410-99qcm-master-0               Running   Standard_D8s_v3   eastus   2      35m
huliu-az410-99qcm-master-1               Running   Standard_D8s_v3   eastus   3      35m
huliu-az410-99qcm-master-2               Running   Standard_D8s_v3   eastus   1      35m
huliu-az410-99qcm-worker-eastus1-xld58   Running   Standard_D4s_v3   eastus   1      28m
huliu-az410-99qcm-worker-eastus2-chzg8   Running   Standard_D4s_v3   eastus   2      28m
huliu-az410-99qcm-worker-eastus3-7g2mw   Running   Standard_D4s_v3   eastus   3      28m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-az410-99qcm-1-cfw6w  -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: Unknown
  creationTimestamp: "2022-10-08T07:42:28Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: huliu-az410-99qcm-1-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-az410-99qcm
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: huliu-az410-99qcm-1
  name: huliu-az410-99qcm-1-cfw6w
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: huliu-az410-99qcm-1
    uid: bf8f7518-1fa9-4704-bdd7-6d0fde54e38e
  resourceVersion: "31287"
  uid: 303cf672-a2fa-44f3-8793-59801bb78902
spec:
  lifecycleHooks: {}
  metadata: {}
  providerSpec:
    value:
      apiVersion: machine.openshift.io/v1beta1
      credentialsSecret:
        name: azure-cloud-credentials
        namespace: openshift-machine-api
      image:
        offer: ""
        publisher: ""
        resourceID: /resourceGroups/huliu-az410-99qcm-rg/providers/Microsoft.Compute/images/huliu-az410-99qcm
        sku: ""
        version: ""
      kind: AzureMachineProviderSpec
      location: eastus
      managedIdentity: huliu-az410-99qcm-identity
      metadata:
        creationTimestamp: null
        name: huliu-az410-99qcm
      networkResourceGroup: huliu-az410-99qcm-rg
      osDisk:
        diskSettings: {}
        diskSizeGB: 128
        managedDisk:
          storageAccountType: Premium_LRS
        osType: Linux
      publicIP: false
      publicLoadBalancer: huliu-az410-99qcm
      resourceGroup: huliu-az410-99qcm-rg
      spotVMOptions: {}
      subnet: huliu-az410-99qcm-worker-subnet
      userDataSecret:
        name: worker-user-data
      vmSize: invalidStandard_D4s_v3
      vnet: huliu-az410-99qcm-vnet
      zone: "3"
status:
  conditions:
  - lastTransitionTime: "2022-10-08T07:42:28Z"
    status: "True"
    type: Drainable
  - lastTransitionTime: "2022-10-08T07:42:28Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  - lastTransitionTime: "2022-10-08T07:42:28Z"
    status: "True"
    type: Terminable
  errorMessage: 'failed to reconcile machine "huliu-az410-99qcm-1-cfw6w": failed to
    create vm huliu-az410-99qcm-1-cfw6w: failure sending request for machine huliu-az410-99qcm-1-cfw6w:
    cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending
    request: StatusCode=0 -- Original Error: Code="BadRequest" Message="Virtual Machine
    cannot be created because both Availability Zone and Availability Set were specified.
    Deploying an Availability Set to an Availability Zone isn’t supported."'
  errorReason: InvalidConfiguration
  lastUpdated: "2022-10-08T07:42:35Z"
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2022-10-08T07:42:35Z"
      lastTransitionTime: "2022-10-08T07:42:35Z"
      message: 'failed to create vm huliu-az410-99qcm-1-cfw6w: failure sending request
        for machine huliu-az410-99qcm-1-cfw6w: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate:
        Failure sending request: StatusCode=0 -- Original Error: Code="BadRequest"
        Message="Virtual Machine cannot be created because both Availability Zone
        and Availability Set were specified. Deploying an Availability Set to an Availability
        Zone isn’t supported."'
      reason: MachineCreationFailed
      status: "True"
      type: MachineCreated
    metadata: {}

Actual results:

Created Availability Set for it.

Expected results:

Should not create Availability Set, as the region has zones.

Additional info:

If provided correct vmSize, the machine get Running and will not create Availability Set for it. Not sure why it will create Availability Set for it when vmSize is invalid.

The issue can be reproduced both on 4.11 and 4.10 version, as Availability Set is introduced in 4.10. 
On 4.12, there is bug https://issues.redhat.com/browse/OCPBUGS-1871, will also check this on 4.12 when this bug get verified.

We're seeing a slight uptick in how long upgrades are taking[1][2]. We are not 100% sure of the cause, but it looks like it started with 4.11 rc.7. There's no obvious culprits in the diff[3].

Looking at some of the jobs, we are seeing the gaps between kube-scheduler being updated and then machine-api appear to take longer. Example job run[4] showing 10+ minutes waiting for it.

TRT had a debugging session, and we have two suggestions:

  • Adding logging around when CVO sees an operator version changed
  • Instead of a fixed polling interval at 5 minutes (which is what we think CVO is doing), would it be possible to trigger on the CO to know when to look again? We think there could be some substantial savings on upgrade time by doing this.

[1] https://search.ci.openshift.org/graph/metrics?metric=job%3Aduration%3Atotal%3Aseconds&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade
[2] https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20cluster%20upgrade%20should%20complete%20in%2075.00%20minutes
[3] https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.11.0-rc.7
[4] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-upgrade/1556865989923049472

This is a clone of issue OCPBUGS-18475. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17160. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14049. The following is the description of the original issue:

Description of problem:

After all cluster operators have reconciled after the password rotation, we can still see authentication failures in keystone (attached screenshot of splunk query)

Version-Release number of selected component (if applicable):

Environment:
- OpenShift 4.12.10 on OpenStack 16
- The cluster is managed via RHACM, but password rotation shall be done via "regular"  OpenShift means.

How reproducible:

Rotated the OpenStack credentials according to the documentation [1]

[1] https://docs.openshift.com/container-platform/4.12/authentication/managing_cloud_provider_credentials/cco-mode-passthrough.html#manually-rotating-cloud-creds_cco-mode-passthrough 

Additional info:

- we can't trace back where these authentication failures come from - they do disappear after a cluster upgrade (so when nodes are rebooted and all pods are restarted which indicates that there's still a component using the old credentials)
- The relevant technical integration points _seem_ to be working though (LBaaS, CSI, Machine API, Swift)

What is the business impact? Please also provide timeframe information.

- We cannot rely on splunk monitoring for authentication issues since it's currently constantly showing authentication errors - We cannot be entirely sure that everything works as expected since we don't know the component that doesn't seem to use the new credentials

 

This is a clone of issue OCPBUGS-6764. The following is the description of the original issue:

Description of problem:
The "Add Git Repository" has a "Show configuration options" expandable section that shows the required permissions for a webhook setup, and provides a link to "read more about setting up webhook".

But the permission section shows nothing when open this second expandable section, and the link doesn't do anything until the user enters a "supported" GitHub, GitLab or BitBucket URL.

Version-Release number of selected component (if applicable):
4.11-4.13

How reproducible:
Always

Steps to Reproduce:

  1. Install Pipelines operator
  2. Navigate to the Developer perspective > Pipelines
  3. Press "Create" and select "Repository"
  4. Click on "Show configuration options"
  5. Click on "See Git permissions"
  6. Click on "Read more about setting up webhook"

Actual results:

  1. The Git permission section shows no git permissions.
  2. The Read more link doesn't open any new page.

Expected results:

  1. The Git permission section should show some info or must not be disabled.
  2. The Read more link should open a page or must not be displayed as well.

Additional info:

  1. None

Description of problem:

Sometimes we see VMs fail to power on when the land on a host that does not have enough resources. The current power on does not retry or leverage DRS to power on the node on a suitable host.

https://github.com/vmware/govmomi/issues/1026

Our code is still making calls to PowerOnVM_Task which, according to the vsphere docs, is deprecated and we should use PowerOnMultiVM_Task instead.

PowerOnVM_Task does not return a DRS ClusterRecommendation, no vmotion nor host power operations will be done as part of a DRS-facilitated power on. To have DRS consider such operations use PowerOnMultiVM_Task.

https://vdc-download.vmware.com/vmwb-repository/dcr-public/b50dcbbf-051d-4204-a3e7-e1b618c1e384/538cf2ec-b34f-4bae-a332-3820ef9e7773/vim.VirtualMachine.html#powerOn:
As of vSphere API 5.1, use of this method with vCenter Server is deprecated; use PowerOnMultiVM_Task instead.

Version-Release number of selected component (if applicable):
4.8.x

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Sometimes powers on fails requiring manual intervention.

Expected results:
PowerOn should use DRS to ensure it's always successful.

Additional info:

This bug is a backport clone of [Bugzilla Bug 2094174](https://bugzilla.redhat.com/show_bug.cgi?id=2094174). The following is the description of the original bug:

Created attachment 1887340
CVO log file

Description of problem:
Clearing upgrade after signature verification fails, ReleaseAccepted=False keeps complaining about the update cannot be verified blah blah.

  1. oc get clusterversion/version -ojson | jq -r '.spec, .status.conditions'
    {
    "channel": "stable-4.11",
    "clusterID": "d740b8f3-bb49-40cf-86e8-5df4a755111a"
    }
    [
    { "lastTransitionTime": "2022-06-07T01:31:43Z", "message": "Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-06-06-025509 not found in the \"stable-4.11\" channel", "reason": "VersionNotFound", "status": "False", "type": "RetrievedUpdates" }

    ,

    { "lastTransitionTime": "2022-06-07T01:31:43Z", "message": "Capabilities match configured spec", "reason": "AsExpected", "status": "False", "type": "ImplicitlyEnabledCapabilities" }

    ,

    { "lastTransitionTime": "2022-06-07T02:44:54Z", "message": "Retrieving payload failed version=\"\" image=\"registry.ci.openshift.org/ocp/release@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100\" failure=The update cannot be verified: unable to verify sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100 against keyrings: verifier-public-key-redhat", "reason": "RetrievePayload", "status": "False", "type": "ReleaseAccepted" }

    ,

    { "lastTransitionTime": "2022-06-07T01:56:17Z", "message": "Done applying 4.11.0-0.nightly-2022-06-06-025509", "status": "True", "type": "Available" }

    ,

    { "lastTransitionTime": "2022-06-07T01:55:47Z", "status": "False", "type": "Failing" }

    ,

    { "lastTransitionTime": "2022-06-07T01:56:17Z", "message": "Cluster version is 4.11.0-0.nightly-2022-06-06-025509", "status": "False", "type": "Progressing" }

    ]

Version-Release number of the following components:
4.11.0-0.nightly-2022-06-06-025509

How reproducible:
1/1

Steps to Reproduce:
1. Upgrade to a fake release

  1. oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100 --allow-explicit-upgrade
    warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
    Requesting update to release image registry.ci.openshift.org/ocp/release@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100

2. Check ReleaseAccepted=False due to target image signature verification failure

  1. oc adm upgrade
    Cluster version is 4.11.0-0.nightly-2022-06-04-014713

ReleaseAccepted=False

Reason: RetrievePayload
Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100" failure=The update cannot be verified: unable to verify sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100 against keyrings: verifier-public-key-redhat

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
Reason: VersionNotFound
Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-06-04-014713 not found in the "stable-4.11" channel

3. Clear the upgrade

  1. oc adm upgrade --clear
    Cancelled requested upgrade to registry.ci.openshift.org/ocp/release@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100

4. Check oc adm upgrade info

  1. oc adm upgrade
    Cluster version is 4.11.0-0.nightly-2022-06-04-014713

ReleaseAccepted=False

Reason: RetrievePayload
Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100" failure=The update cannot be verified: unable to verify sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100 against keyrings: verifier-public-key-redhat

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
Reason: VersionNotFound
Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-06-04-014713 not found in the "stable-4.11" channel

Actual results:
After upgrade is cleared, cv condition ReleaseAccepted keeps to false with message The update cannot be verified

Expected results:
After upgrade is cleared, cv condition ReleaseAccepted should stop complaining about the target image

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

This was originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=2046335

Description of problem:

The issue reported here https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occur (tested on OCP 4.8.11, the CU also verified that the issue can happen even with OpenShift 4.7.30, 4.8.17 and 4.9.11)

How reproducible:

Attach a NIC to a master node will trigger the issue

Steps to Reproduce:
1. Deploy an OCP cluster (I've tested it IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal -o json | jq ".status.addresses"
[

{ "address": "10.0.178.163", "type": "InternalIP" }

,

{ "address": "10.0.187.247", "type": "InternalIP" }

,

{ "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "Hostname" }

,

{ "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS" }

]

$ oc get co etcd
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.8.11 True False True 31h

$ oc get co etcd -o json | jq ".status.conditions[0]"

{ "lastTransitionTime": "2022-01-26T15:47:42Z", "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]", "reason": "EtcdCertSignerController_Error", "status": "True", "type": "Degraded" }

~~~

Expected results:

To have the certificate valid also for the second IP (the newly created one "10.0.187.247")

Additional info:

Deleting the following secrets seems to solve the issue:
~~~
$ oc get secret n openshift-etcd | grep kubernetes.io/tls | grep ^etcd
etcd-client kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 58s

$ oc get secret n openshift-etcd | grep kubernetes.io/tls | grep ^etcd | awk '

{print $1}

' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"

{ "lastTransitionTime": "2022-01-26T15:52:21Z", "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found", "reason": "AsExpected", "status": "False", "type": "Degraded" }

~~~

This is a clone of issue OCPBUGS-2079. The following is the description of the original issue:

Description of problem:

The setting of systemReserved: ephemeral-storage in KubeletConfig is not working as expected. 

Version-Release number of selected component (if applicable):

4.10.z, may exist on other OCP versions as well. 

How reproducible:

always

Steps to Reproduce:

1. Create a KubeletConfig on the node:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 500Mi
      ephemeral-storage: 10Gi


2. Check node allocatable storage with command: oc describe node |grep -C 5 ephemeral-storage

Actual results:

The Allocatable:ephemeral-storage on the node is not capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)  

Expected results:

The Allocatable:ephemeral-storage on the node should be capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default) 

Additional info:

The root cause might be: process argument '--system-reserved=cpu=500m,memory=500Mi' overwrote the setting in /etc/kubernetes/kubelet.conf, one example:

root        6824       1 27 Sep30 ?        1-09:00:24 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.58.47 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --hostname-override= --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a7b6408460148cb73c59677dbc2c261076bc07226c43b0c9192cc70aef5ba62 --system-reserved=cpu=500m,memory=500Mi --v=2 --housekeeping-interval=30s


libovsdb builds transaction log messages for every transaction and then throws them away if the log level is not 4 or above. This wastes a bunch of CPU at scale and increases pod ready latency.

Description of problem:

 

When creating a ProjectHelmChartRepository (with or without the form) and setting a display name (as `spec.name`), this value is not used in the developer catalog / Helm Charts catalog filter sidebar.

It shows (and watches) the display names of `HelmChartRepository` resources.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always

Steps to Reproduce:

1. Switch to Developer Perspective
2. Navigate to Add > "Helm Chart repositories"
3. Enter "ibm-charts" as "Chart repository name"
4. Enter URL https://raw.githubusercontent.com/IBM/charts/master/repo/community/index.yaml as URL)
5. Press on create
6. Open the YAML editor and change the `spec.name` attribute to "IBM Charts"
7. Save the change
8. Navigate to Add > "Helm Chart" 

Actual results:

The filter navigation on the left side shows "Chart Repositories" "Ibm Chart". A camel case version of the resource name.

Expected results:

It should show the "spec.name" "IBM Charts" if defined and fallback to the current implementation if the optional spec.name is not defined.

Additional info:

There is a bug discussing that the display name could not be entered directly, https://bugzilla.redhat.com/show_bug.cgi?id=2106366. This bug here is only about the catalog output.

 

Currently, Telemeter is not equipped with configurable request limit for receive endpoint (for full context see: https://github.com/openshift/cluster-monitoring-operator/pull/1416). It is using the default limit defined in the code base, however it seems this limit might not be suitable for our usage.

As a part of this ticket, it should be:

1) Understood what is the appropriate limit for request size for our use cases

2) Make the limit configurable in Telemeter via a flag

3) Deploy the changes, initially to the staging environment, to enable our team to test it.

Description of problem:

When the user installs a helm chart, the dropdown to select a specific version is always disabled. Also for helm charts that can upgraded or downgraded after installation. For example the nodejs helm chart.

Version-Release number of selected component (if applicable):

At least 4.11, maybe all versions, but a backport to 4.11 is fine

How reproducible:

Always

Steps to Reproduce:

1. Switch to developer perspective
2. Navigate to Add > Helm chart
3. Select the Nodejs helm chart
4. Try to select another version
..
When the user installs the not selectable version and edit the helm chart there is another version to selec 

Actual results:

The version is not selectable.

Expected results:

The version should be selectable.

Additional info:

 

This is a clone of issue OCPBUGS-4311. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4305. The following is the description of the original issue:

Description of problem:

Please add an option to DISABLE debug in ironic-api. Presently it is enabled by default and there is no way to disable it or reduce log level

https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3


Version-Release number of selected component (if applicable): none

How reproducible: Every time

Steps to Reproduce:

Please check source code here: https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3

It is enabled by default and there is no way to disable it or reduce log level

Actual results:

Please check Case: 03371411, the log file grew to 409 GB

Expected results: Need a way to disable debug

Additional info: Case 03371411. A cluster must gather and log file can be found in the case.

Description of problem:

Looks like a regression was introduced and the Windows networking tests started to fail accross all 4.11 CI jobs for both Linux-to-Windows and Windows-to-Windows pods communication. 

https://github.com/openshift/ovn-kubernetes/pull/1415 merged without passing Windows networking tests.

https://github.com/openshift/windows-machine-config-operator/pull/1359  shows networking tests failing accross all CI jobs.





Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always 

Steps to Reproduce:

1. Trigger a job retest in any of the above mentioned PRs
 

Actual results:

Windows networking test fail

Expected results:

Windows networking test pass

Additional info:

Direct link to a release-4.11 aws-e2e-operator CI job showing the failed networking test

1. Proposed title of this feature request
--> Alert generation when the etcd container memory consumption goes beyond 90%

2. What is the nature and description of the request?
--> When the etcd database starts growing rapidly due to some high number of objects like secrets, events, or configmap generation by application/workload, the memory and CPU consumption of APIserver and etcd container (control plane component) spikes up and eventually the control plane nodes goes to hung/unresponsive or crash due to out of memory errors as some of the critical processes/services running on master nodes get killed. Hence we request an alert/alarm when the ETCD container's memory consumption goes beyond 90% so that the cluster administrator can take some action before the cluster/nodes go unresponsive.

I see we already have a etcdExcessiveDatabaseGrowth Prometheus rule which helps when the surge in etcd writes leading to a 50% increase in database size over the past four hours on etcd instance however it does not consider the memory consumption:

$ oc get prometheusrules etcd-prometheus-rules -o yaml|grep -i etcdExcessiveDatabaseGrowth -A 9

  • alert: etcdExcessiveDatabaseGrowth
    annotations:
    description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
    leading to 50% increase in database size over the past four hours on etcd
    instance {{ $labels.instance }}, please check as it might be disruptive.'
    expr: |
    increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
    for: 10m
    labels:
    severity: warning

3. Why does the customer need this? (List the business requirements here)
--> Once the etcd memory consumption goes beyond 90-95% of total ram as it's system critical container, the OCP cluster goes unresponsive causing revenue loss to business and impacting the productivity of users of the openshift cluster. 

 

4. List any affected packages or components.
--> etcd

This is a clone of issue OCPBUGS-1226. The following is the description of the original issue:

We added server groups for control plane and computes as part of OSASINFRA-2570, except for UPI that only creates server group for the control plane.

We need to update the UPI scripts to create server group for computes to be consistent with IPI and have the instruction at https://docs.openshift.com/container-platform/4.11/machine_management/creating_machinesets/creating-machineset-osp.html work out of the box in case customers want to create MachineSets on their UPI clusters.

Related to OCPCLOUD-1135.

Description of problem:
Switching the spec.endpointPublishingStrategy.loadBalancer.scope of the default ingresscontroller results in a degraded ingress operator. The routes using that endpoint like the console URL become inaccessible.
Degraded operators after scope change:

$ oc get co | grep -v ' True        False         False'
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.4    False       False         True       72m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.kartrosa.ukld.s1.devshift.org/healthz": EOF
console                                    4.11.4    False       False         False      72m     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.kartrosa.ukld.s1.devshift.org): Get "https://console-openshift-console.apps.kartrosa.ukld.s1.devshift.org": EOF
ingress                                    4.11.4    True        False         True       65m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

We have noticed that each time this happens the underlying AWS loadbalancer gets recreated which is as expected however the router pods probably do not get notified about the new loadbalancer. The instances in the new loadbalancer become 'outOfService'.

Restarting one of the router pods fixes the issue and brings back a couple of instances under the loadbalancer back to 'InService' which leads to the operators becoming happy again.

Version-Release number of selected component (if applicable):

ingress in 4.11.z however we suspect this issue to also apply to older versions

How reproducible:

Consistently reproducible

Steps to Reproduce:

1. Create a test OCP 4.11 cluster in AWS
2. Switch the spec.endpointPublishingStrategy.loadBalancer.scope of the default ingresscontroller in openshift-ingress-operator to Internal from External (or vice versa)
3. New Loadbalancer is created in AWS for the default router service, however the instances behind are not in service

Actual results:

ingress, authentication and console operators go into a degraded state. Console URL of the cluster is inaccessible

Expected results:

The ingresscontroller scope transition from internal->External (or vice versa) is smooth without any downtime or operators going into degraded state. The console is accessible.

 

[Updated story request]

Decision is to always display Red Hat OpenShift logo for OCP instead of conditionally. And also update the OCP login, errors, providers templates. https://openshift.github.io/oauth-templates/

Related note in comments.

 

[Original request]

If the ACM or the ACS dynamic plugin is enabled and there is not a custom branding set, then the default "Red Hat Openshift" branding should be shown.

This was identified as an issue during the Hybrid Console Scrum on 11/15/20201

 

PRs associated with this change

https://github.com/openshift/console/pull/10940 [merged]

https://github.com/openshift/oauth-templates/pull/20 [merged]

https://github.com/openshift/cluster-authentication-operator/pull/540 [merged]

 

 

Description of problem

CI is flaky because tests pull the "openshift/origin-node" image from Docker Hub and get rate-limited:

E0803 20:44:32.429877    2066 kuberuntime_image.go:53] "Failed to pull image" err="rpc error: code = Unknown desc = reading manifest latest in docker.io/openshift/origin-node: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" image="openshift/origin-node:latest"

This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/929/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/16871891662673059841687189166267305984. I don't know how to search for this failure using search.ci. I discovered the rate-limiting through Loki: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22PCEB727DF2F34084E%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fpull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator%2F1687189166267305984%5C%22%7D%20%7C%20unpack%20%7C~%20%5C%22pull%20rate%20limit%5C%22%22,%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221691086303449%22,%22to%22:%221691122303451%22%7D%7D.

Version-Release number of selected component (if applicable)

This happened on 4.14 CI job.

How reproducible

I have observed this once so far, but it is quite obscure.

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check Loki using the following query:

{...} {invoker="openshift-internal-ci/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/*"} | unpack | systemd_unit="kubelet.service" |~ "pull rate limit"

Actual results

CI pulls from Docker Hub and fails.

Expected results

CI passes, or fails on some other test failure. CI should never pull from Docker Hub.

Additional info

We have been using the "openshift/origin-node" image in multiple tests for years. I have no idea why it is suddenly pulling from Docker Hub, or how we failed to notice that it was pulling from Docker Hub if that's what it was doing all along.

Description of problem:

PROXY protocol cannot be enabled for the "Private" endpoint publishing strategy type.

Version-Release number of selected component (if applicable):

4.10.0+. PROXY protocol was made configurable for the "HostNetwork" and "NodePortService" endpoint publishing strategy types, but not for "Private", in this release.

How reproducible:

Always.

Steps to Reproduce:

1. Create an ingresscontroller with the "Private" endpoint publishing strategy type:

oc create -f - <<'EOF'
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: example
  namespace: openshift-ingress-operator
spec:
  domain: example.com
  endpointPublishingStrategy:
    type: Private
    private:
      protocol: PROXY
EOF

2. Check the ingresscontroller's status:

oc -n openshift-ingress-operator get ingresscontrollers/example -o 'jsonpath={.status.endpointPublishingStrategy}'

3. Check whether the resulting router deployment has PROXY protocol enabled.

oc -n openshift-ingress get deployments/router-example -o 'jsonpath={.spec.template.spec.containers[0].env[?(@.name=="ROUTER_USE_PROXY_PROTOCOL")]}'

Actual results:

The ingresscontroller is created:

ingresscontroller.operator.openshift.io/example created

The status shows that the spec.endpointPublishingStrategy.private.protocol setting was ignored:

{"private":{},"type":"Private"}

The deployment does not enable PROXY protocol; the oc get command prints no output.

Expected results:

The ingresscontroller's status should indicate that PROXY protocol is enabled:

{"private":{"protocol":"PROXY"},"type":"Private"}

The deployment should have PROXY protocol enabled:

{"name":"ROUTER_USE_PROXY_PROTOCOL","value":"true"}

Additional info:

This bug report duplicates https://bugzilla.redhat.com/show_bug.cgi?id=2104481 in order to facilitate backports.

This is a clone of issue OCPBUGS-3117. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3084. The following is the description of the original issue:

Upstream Issue: https://github.com/kubernetes/kubernetes/issues/77603

Long log lines get corrupted when using '--timestamps' by the Kubelet.

The root cause is that the buffer reads up to a new line. If the line is greater than 4096 bytes and '--timestamps' is turrned on the kubelet will write the timestamp and the partial log line. We will need to refactor the ReadLogs function to allow for a partial line read.

https://github.com/kubernetes/kubernetes/blob/f892ab1bd7fd97f1fcc2e296e85fdb8e3e8fb82d/pkg/kubelet/kuberuntime/logs/logs.go#L325

apiVersion: v1
kind: Pod
metadata:
  name: logs
spec:
  restartPolicy: Never
  containers:
  - name: logs
    image: fedora
    args:
    - bash
    - -c
    - 'for i in `seq 1 10000000`; do echo -n $i; done'
kubectl logs logs --timestamps

Description of problem:

For OVNK to become CNCF complaint, we need to support session affinity timeout feature and enable the e2e's on OpenShift side. This bug tracks the efforts to get this into 4.12 OCP.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-19045. The following is the description of the original issue:

This is a clone of issue OCPBUGS-18312. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17864. The following is the description of the original issue:

Description of problem:

Cluster recently upgraded to OCP 4.12.19 experiencing serious slowness issues with Project>Project access page.
The loading time of that page grows significantly faster than the number of entries, and is very noticeable even at a relatively low number of entries.

Version-Release number of selected component (if applicable):

4.12.19

How reproducible:

Easily 

Steps to Reproduce:

1. Create a namespace, and add RoleBindings for multiple users, for instance with :
$ oc -n test-namespace create rolebinding test-load --clusterrole=view --user=user01 --user=user02 --user=...
2. In Developer view of that namespace, navigate to "Project"->"Project access". The page will take a long time to load compared to the time an "oc get rolebinding" would take.

Actual results:

0 RB => instantaneous loading
40 RB => about 10 seconds until page loaded
100 RB => one try took 50 seconds, another 110 seconds
200 RB => nothing for 8 minutes, after which my web browser (Firefox) proposed to stop the page since it slowed the browser down, and after 10 minutes I stopped the attempt without ever seeing the page load. 

Expected results:

Page should load almost instantly with only a few hundred role bindings

Description of problem:

Upgrade to 4.10 is stuck looping in syncEgressFirewall

We see transacting operations with context deadline exceeded.

It looks to be trying to process 2.8 million records is one go.

2023-02-21T19:55:06.514097513Z I0221 19:55:06.435220 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:6a3ad543-a77d-4700-83b8-5ccae6b2d067} ID:1c5297ff-8588-467a-93f4-22f22d609563} {GoUUID:f6288ed3-3928-45a8-ae57-40ed94cfa249} {GoUUID:04bf90c2-fde1-4a10-baaa-6a3f1d8e2931} {GoUUID:c6609536-857c-48ae-9125-9505753180 a8} {GoUUID:c79b4398-d7cc-4dcf-8c1d-11484f318324} {GoUUID:4323ac2c-033e-43c3-885b-e951cd7a4159} {GoUUID:7b316a80-076f-4266-b7d2-bd69b1d4b874} {GoUUID:57dfecb2-2f94-4cd8-a277-8 b28205e1048} {GoUUID:2c039f15-ff11-4ceb-aa82-bcbe82fc86d1} {GoUUID:063c4121-73c3-4d53-a89d-1063e775146b} {GoUUID:25c788e3-6146-4571-98bf-61010100a22a} {GoUUID:3d3c150f-1296-4d 91-b334-506f28bff4bd}]}}] Timeout:<nil> Where:[where column _uuid == {ba9652de-5aae-4a74-a512-29f775e38c19}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded 

2023-02-21T19:55:18.739739417Z E0221 19:55:18.643127       1 master.go:1369] Failed (will retry) in syncing syncEgressFirewall: failed to remove reject acl from node logical switches: error while removing ACLS: [6a3ad543-a77d-4700-83b8-5ccae6b2d067 8e004991-0382-455f-9901-33ef724acbc2 

Everything is built into one operation via:
https://github.com/openshift/ovn-kubernetes/blob/release-4.10/go-controller/pkg/libovsdbops/switch.go#L243

TrandactAndCheck is being called with a 10s timeout and this operation never completes. 
 

Version-Release number of selected component (if applicable):

4.10.50

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Upgrade completes

Additional info:

 

This is a clone of issue OCPBUGS-4960. The following is the description of the original issue:

Description of problem:
This is a follow up on OCPBUGSM-47202 (https://bugzilla.redhat.com/show_bug.cgi?id=2110570)

While OCPBUGSM-47202 fixes the issue specific for Set Pod Count, many other actions aren't fixed. When the user updates a Deployment with one of this options, and selects the action again, the old values are still shown.

Version-Release number of selected component (if applicable)
4.8-4.12 as well as master with the changes of OCPBUGSM-47202

How reproducible:
Always

Steps to Reproduce:

  1. Import a deployment
  2. Select the deployment to open the topology sidebar
  3. Click on actions and one of the 4 options to update the deployment with a modal
    1. Edit labels
    2. Edit annotatations
    3. Edit update strategy
    4. Edit resource limits
  4. Click on the action again and check if the data in the modal reflects the changes from step 3

Actual results:
Old data (labels, annotations, etc.) was shown.

Expected results:
Latest data should be shown

Additional info:

This is a clone of issue OCPBUGS-5016. The following is the description of the original issue:

Description of problem:

When editing any pipeline in the openshift console, the correct content cannot be obtained (the obtained information is the initial information).

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> YAML view -> Cancel ->  Actions -> Edit Pipeline -> YAML view 

Actual results:

displayed content is incorrect.

Expected results:

Get the content of the current pipeline, not the "pipeline create" content.

Additional info:

If cancel or save in the "Pipeline Builder" interface after "Edit Pipeline", can get the expected content.
~
Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> Pipeline builder -> Cancel ->  Actions -> Edit Pipeline -> YAML view :Display resource content normally
~

This is a clone of issue OCPBUGS-10314. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8741. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5889. The following is the description of the original issue:

Description of problem:

Customer running a cluster with following config:
4.10.23
AWS/IPI
OVNKubernetes

Observed that in namespace with networkpolicy rules enabled, and a policy for allow-from-same namespace, pods will have different behaviors when calling service IP's hosted in that same namespace.

Example:
Deployment1 with two pods (A/B) exists in namespace <EXAMPLE>
Deployment2 with 1 pod hosting a service and route exists in same namespace
Pod A will unexpectedly stop being able to call service IP of deployment2; Pod B will never lose access to calling service IP of deployment2.

Pod A remains able to call out through br-ex interface, tag the ROUTE address, and reach deployment2 pod via haproxy (this never breaks)

Pod A remains able to reach the local gateway on the node

Host node for Pod A is able to reach the service IP of deployment2 and remains able to do so, even while pod A is impacted.

Issue can be mitigated by applying a label or annotation to pod A, which immediately allows it to reach internal service IPs again within the namespace.

I suspect that the issue is to do with the networkpolicy rules failing to stay updated on the pod object, and the pod needs to be 'refreshed' --> label appendation/other update, to force the pod to 'remember' that it is allowed to call peers within the namespace.

Additional relevant data:
- pods affects throughout cluster; no specific project/service/deployment/application
- pods ride on different nodes all the time (no one node affected)
- pods with fail condition are on same node with other pods without issue
- multiple namespaces see this problem
- all namespaces are using similar networkpolicy isolation and allow-from-same-namespace ruleset (which matches our documentation on syntax).



Version-Release number of selected component (if applicable):

4.10.23

How reproducible:

every time --> unclear what the trigger is that causes this; pods will be functional and several hours/days later, will stop being able to talk to peer services.

Steps to Reproduce:

1. deploy pod with at least two replicas in a namespace with allow-from same network policy
2. deploy a different service and route example httpd instance in same namespace
3. observe that one of the two pods may fail to reach service IP after some time
4. apply annotation to pod and it is immediately able to reach services again.

Actual results:

pods intermittently fail to reach internal service addresses, but are able to be interacted with otherwise, and can reach upstream/external addresses including routes on cluster. 

Expected results:

pods should not lose access to service network peers. 

Additional info:

see next comments for relevant uploads/sosreports and inspects.

Description of problem:
Follow-up of: https://issues.redhat.com/browse/SDN-2988

This failure is perma-failing in the e2e-metal-ipi-ovn-dualstack-local-gateway jobs.

Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway/1597574181430497280
Search CI: https://search.ci.openshift.org/?search=when+using+openshift+ovn-kubernetes+should+ensure+egressfirewall+is+created&maxAge=336h&context=1&type=junit&name=e2e-metal-ipi-ovn-dualstack-local-gateway&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway%22%7D%5D%7D

Version-Release number of selected component (if applicable):

4.12,4.13

How reproducible:

Every time

Steps to Reproduce:

1. Setup dualstack KinD cluster
2. Create egress fw policy with spec
Spec:
  Egress:
    To:
      Cidr Selector:  0.0.0.0/0
    Type:             Deny
3. create a pod and ping to 1.1.1.1

Actual results:

Egress policy does not block flows to external IP

Expected results:

Egress policy blocks flows to external IP

Additional info:

It seems mixing ip4 and ip6 operands in ACL matchs doesnt work

This is a clone of issue OCPBUGS-4499. The following is the description of the original issue:

This is a clone of issue OCPBUGS-860. The following is the description of the original issue:

Description of problem:

In GCP, once an external IP address is assigned to master/infra node through GCP console, numbers of pending CSR from kubernetes.io/kubelet-serving is increasing, and the following error are reported:

I0902 10:48:29.254427       1 controller.go:121] Reconciling CSR: csr-q7bwd
I0902 10:48:29.365774       1 csr_check.go:157] csr-q7bwd: CSR does not appear to be client csr
I0902 10:48:29.371827       1 csr_check.go:545] retrieving serving cert from build04-c92hb-master-1.c.openshift-ci-build-farm.internal (10.0.0.5:10250)
I0902 10:48:29.375052       1 csr_check.go:188] Found existing serving cert for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
I0902 10:48:29.375152       1 csr_check.go:192] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
I0902 10:48:29.375166       1 csr_check.go:193] Current SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5], CSR SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5 35.211.234.95]
I0902 10:48:29.375175       1 csr_check.go:202] Falling back to machine-api authorization for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
E0902 10:48:29.375184       1 csr_check.go:420] csr-q7bwd: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.375193       1 csr_check.go:205] Could not use Machine for serving cert authorization: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.379457       1 csr_check.go:218] Falling back to serving cert renewal with Egress IP checks
I0902 10:48:29.382668       1 csr_check.go:221] Could not use current serving cert and egress IPs for renewal: CSR Subject Alternate Names includes unknown IP addresses
I0902 10:48:29.382702       1 controller.go:233] csr-q7bwd: CSR not authorized

Version-Release number of selected component (if applicable):

4.11.2

Steps to Reproduce:

1. Assign external IPs to master/infra node in GCP
2. oc get csr | grep kubernetes.io/kubelet-serving

Actual results:

CSRs are not approved

Expected results:

CSRs are approved

Additional info:

This issue is only happen in GCP. Same OpenShift installations in AWS do not have this issue.

It looks like the CSR are created using external IP addresses once assigned.

Ref: https://coreos.slack.com/archives/C03KEQZC1L2/p1662122007083059

Tracker bug for bootimage bump in 4.11. This bug should block bugs which need a bootimage bump to fix.

Add a Makefile rule in CMO to execute all the different rule that are used for verification and validation. Currenctly, some of them might not be at the right place, for example `check-assets` which is part of `generate` despite not being responsible of any generation. https://github.com/openshift/cluster-monitoring-operator/pull/1151/files#r629371735

DoD:

  • Add a new rule in CMO to handle verification
  • Add a CI job for this rule

This is a clone of issue OCPBUGS-4805. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4101. The following is the description of the original issue:

Description of problem:

We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.

One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. It experienced an issue whereby 
   /etc/node-sizing.env 
on its master nodes contained an empty SYSTEM_RESERVED_ES value:

---
cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=5.36Gi
SYSTEM_RESERVED_CPU=0.11
SYSTEM_RESERVED_ES=
---

causing the kubelet to not start up. To restore service, this file was manually updated to set a value (1Gi), and kubelet was restarted.

We are uncertain what conditions led to this occuring on the clusterA master nodes as part of the upgrade.

A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. It experienced an issue whereby worker nodes were impacted by a similar problem, however this was because a custom node-sizing-enabled.env MachineConfig which did not set SYSTEM_RESERVED_ES

This caused existing worker nodes to go into a NotReady state after the ugprade, and additionally new nodes did not join the cluster as their kubelet would become impacted. 

For clusterB the conditions are more well-known of why the value is empty.

However, for both clusters, if SYSTEM_RESERVED_ES ends up as empty on a node it can cause the kubelet to not start. 

We have some asks as a result:
- Can MCO be made to recover from this situation if it occurs, perhaps  through application of a safe default if none exists, such that kubelet would start correctly?
- Can there possibly be alerting that could indicate and draw attention to the misconfiguration?

Version-Release number of selected component (if applicable):

4.11.17

How reproducible:

Have not been able to reproduce it on a fresh cluster upgrading from 4.11.16 to 4.11.17

Expected results:

If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env then a default should be applied and/or kubelet able to continue running.

Additional info:

 

The two modules that are auto generated for the CLI docs need to add ":_content-type: REFERENCE" to the top of the files. Update the doc generation templates to add these.

This is a clone of issue OCPBUGS-2083. The following is the description of the original issue:

Description of problem:
Currently we are running VMWare CSI Operator in OpenShift 4.10.33. After running vulnerability scans, the operator was discovered to be running a known weak cipher 3DES. We are attempting to upgrade or modify the operator to customize the ciphers available. We were looking at performing a manual upgrade via Quay.io but can't seem to pull the image and was trying to steer away from performing a custom install from scratch. Looking for any suggestions into mitigated the weak cipher in the kube-rbac-proxy under VMware CSI Operator.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Intended to backport the corresponding https://bugzilla.redhat.com/show_bug.cgi?id=2095852 which has been fixed already for this version.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 As mentioned in [1], the cluster monitoring operator doesn't define the relatedObjects field in the ClusterOperator manifest which is initially deployed by CVO [2].
If the CMO pod fails to start, the must-gather might miss information from the monitoring namespace. Note that once CMO runs, it will update the initial ClusterOperator object with the proper information [3].

[1] http://mailman-int.corp.redhat.com/archives/aos-devel/2021-May/msg00139.html
[2] https://github.com/openshift/cluster-monitoring-operator/blob/master/manifests/0000_50_cluster-monitoring-operator_06-clusteroperator.yaml
[3] https://github.com/openshift/cluster-monitoring-operator/blob/a6bc9824035ceb8dbfe7c53cf0c138bfb2ec5643/pkg/client/status_reporter.go#L49-L63

This is a clone of issue OCPBUGS-18135. The following is the description of the original issue:

Description of problem:

The target.workload.openshift.io/management annotation causes CNO operator pods to wait for nodes to appear. Eventually they give up waiting and they get scheduled. This annotation should not be set for the hosted control plane topology, given that we should not wait for nodes to exist for the CNO to be scheduled.

Version-Release number of selected component (if applicable):

4.14, 4.13

How reproducible:

always

Steps to Reproduce:

1. Create IBM ROKS cluster
2. Wait for cluster to come up
3.

Actual results:

Cluster takes a long time to come up because CNO pods take ~15 min to schedule.

Expected results:

Cluster comes up quickly

Additional info:

Note: Verification for the fix has already happened on the IBM Cloud side. All OCP QE needs to do is to make sure that the fix doesn't cause any regression to the regular OCP use case.

Description of problem:

For some reason, the LSP of a pod is not properly added to the port group where the ACL of a NetworkPolicy is applied. This results on the networkpolicy not being applied to the pod and communication not possible.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Always with a concrete pod at customer environment.

Steps to Reproduce:

(not known exactly yet)

Actual results:

LSP not in port group. ACL not applied. Netpol not in effect.

Expected results:

LSP in port group. ACL applied. Netpol in effect.

Additional info:

Details in private comments, as they involve sensitive data.

Deleting the pod does nothing, but it is possible that this has something to do with the pod being recreated with the same name (although the LSPs UUIDs are different in each incarnation).

This is a clone of issue OCPBUGS-13785. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13150. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12435. The following is the description of the original issue:

Description of problem:

If the user specifies a DNS name in an egressnetworkpolicy for which the upstream server returns a truncated DNS response, openshift-sdn does not fall back to TCP as expected but just take this as a failure.

Version-Release number of selected component (if applicable):

4.11 (originally reproduced on 4.9)

How reproducible:

Always

Steps to Reproduce:

1. Setup an EgressNetworkPolicy that points to a domain where a truncated response is returned while querying via UDP.
2.
3.

Actual results:

Error, DNS resolution not completed.

Expected results:

Request retried via TCP and succeeded.

Additional info:

In comments.

Description of problem:

Each LB created for a Service type LoadBalancer results in 1 client rule and <# of public subnets> health rules being created.  The rules per SG quota in AWS is quite small; 60 by default, and 200 hard max.  OCP has about 40 rules OOTB. Assuming an HA cluster in 3 AZs, that is 4 rules per LB.  With default AWS quota, only ~5 LBs can be create and with the hard max of 200, only ~40 LBs can be created.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1.  Create Service type LoadBalancer and observe increase in master-sg and worker-sg rules sets
2.
3.

Actual results:

4 rules are created

Expected results:

1 rules is created when the client rule is a superset of the per-subnet health rules

Additional info:

This ~4x the number of Services of type LoadBalancer.  This is required for Hypershift.

This is a clone of issue OCPBUGS-15099. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14943. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14668. The following is the description of the original issue:

Description of problem:

visiting global configurations page will return error after 'Red Hat OpenShift Serverless' is installed, the error persist even operator is uninstalled

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-06-212044

How reproducible:

Always

Steps to Reproduce:

1. Subscribe 'Red Hat OpenShift Serverless' from OperatorHub, wait for the operator to be successfully installed
2. Visit Administration -> Cluster Settings -> Configurations tab

Actual results:

react_devtools_backend_compact.js:2367 unhandled promise rejection: TypeError: Cannot read properties of undefined (reading 'apiGroup') 
    at r (main-chunk-e70ea3b3d562514df486.min.js:1:1)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
    at Array.map (<anonymous>)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
overrideMethod @ react_devtools_backend_compact.js:2367
window.onunhandledrejection @ main-chunk-e70ea3b3d562514df486.min.js:1

main-chunk-e70ea3b3d562514df486.min.js:1 Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'apiGroup')
    at r (main-chunk-e70ea3b3d562514df486.min.js:1:1)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
    at Array.map (<anonymous>)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1

 

Expected results:

no errors

Additional info:

 

This is a clone of issue OCPBUGS-451. The following is the description of the original issue:

Description of problem:

Git icon shown in the repository details page should be based on the git provider.

Version-Release number of selected component (if applicable):
4.11

How reproducible:
Always

Steps to Reproduce:
1. Create a Repository with gitlab repo url
2. Navigate to the detail page.

Actual results:

github icon is displayed for the gitlab url.

Expected results:

gitlab icon should be displayed for the gitlab url.

Additional info:

use `GitLabIcon` and `BitBucketIcon` from patternfly react-icons.

Just like kube proxy, ovnk should expose port 10256 on every node, so that cloud LBs can send health checks and know which nodes are available. This is relevant for services with externalTrafficPolicy=Cluster.

This is a clone of issue OCPBUGS-7830. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7729. The following is the description of the original issue:

Description of problem:

Etcd's liveliness probe should be removed. 

Version-Release number of selected component (if applicable):

4.11

Additional info:

When the Master Hosts hit CPU load this can cause a cascading restart loop for etcd and kube-api due to the etcd liveliness probes failing. Due to this loop load on the masters stays high because the api and controllers restarting over and over again..  

There is no reason for etcd to have a liveliness probe, we removed this probe in 3.11 due issues like this.  

This is a clone of issue OCPBUGS-13778. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13427. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12780. The following is the description of the original issue:

Description of problem:

023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health [-] Component KuryrPortHandler is dead. Last caught exception below: openstack.exceptions.InvalidRequest: Request requires an ID but none was found
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last):
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 169, in on_finalize
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     pod = self.k8s.get(f"{constants.K8S_API_NAMESPACES}"
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 121, in get
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._raise_from_response(response)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 99, in _raise_from_response
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     raise exc.K8sResourceNotFound(response.text)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \\"mygov-tuo-microservice-dev2-59fffbc58c-l5b79\\" not found","reason":"NotFound","details":{"name":"mygov-tuo-microservice-dev2-59fffbc58c-l5b79","kind":"pods"},"code":404}\n'
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health During handling of the above exception, another exception occurred:
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last):
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/logging.py", line 38, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._handler(event, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 85, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._handler(event, *args, retry_info=info, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 98, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self.on_finalize(obj, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 184, in on_finalize
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     pod = self._mock_cleanup_pod(kuryrport_crd)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 160, in _mock_cleanup_pod
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     host_ip = utils.get_parent_port_ip(port_id)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/utils.py", line 830, in get_parent_port_ip
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     parent_port = os_net.get_port(port_id)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 1987, in get_port
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     return self._get(_port.Port, port)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 48, in check
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     return method(self, expected, actual, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 513, in _get
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     resource_type=resource_type.__name__, value=value))
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1472, in fetch
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     base_path=base_path)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_base.py", line 26, in _prepare_request
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     base_path=base_path, params=params)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1156, in _prepare_request
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     "Request requires an ID but none was found")
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health openstack.exceptions.InvalidRequest: Request requires an ID but none was found
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.918 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping
2023-04-20 02:08:09.919 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworks'
2023-04-20 02:08:10.026 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/machine.openshift.io/v1beta1/machines'
2023-04-20 02:08:10.152 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/pods'
2023-04-20 02:08:10.174 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/networking.k8s.io/v1/networkpolicies'
2023-04-20 02:08:10.857 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/namespaces'
2023-04-20 02:08:10.877 1 WARNING kuryr_kubernetes.controller.drivers.utils [-] Namespace dev-health-air-ids not yet ready: kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnetworks.openstack.org \\"dev-health-air-ids\\" not found","reason":"NotFound","details":{"name":"dev-health-air-ids","group":"openstack.org","kind":"kuryrnetworks"},"code":404}\n'
2023-04-20 02:08:11.024 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/services'
2023-04-20 02:08:11.078 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/endpoints'
2023-04-20 02:08:11.170 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrports'
2023-04-20 02:08:11.344 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworkpolicies'
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrloadbalancers'
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] No remaining active watchers, Exiting...
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Create a pod.
2. Stop kuryr-controller.
3. Delete the pod and the finalizer on it.
4. Delete pod's subport.
5. Start the controller.

Actual results:

Crash

Expected results:

Port cleaned up normally.

Additional info:


This is a clone of issue OCPBUGS-4422. The following is the description of the original issue:

This bug is a backport clone of [Bugzilla Bug 2050230](https://bugzilla.redhat.com/show_bug.cgi?id=2050230). The following is the description of the original bug:

Description of problem:
In a large cluster, sdn daemonset can DoS the kube-apiserver with un-paginated LIST calls on high count resources.

Version-Release number of selected component (if applicable):

How reproducible:
NA

Steps to Reproduce:
NA

Actual results:
Kube API Server and Openshift API Server in one of the cluster keeps restarting, without proper exception. The cluster is not accessible.

Expected results:
Kube API Server and Openshift API Server should be stable.

Additional info:

This is a clone of issue OCPBUGS-4486. The following is the description of the original issue:

This is a clone of issue OCPBUGS-95. The following is the description of the original issue:

In an OpenShift cluster with OpenShiftSDN network plugin with egressIP and NMstate operator configured, there are some conditions when the egressIP is deconfigured from the network interface.

 

The bug is 100% reproducible.

Steps for reproducing the issue are:

1. Install a cluster with OpenShiftSDN network plugin.

2. Configure egressip for a project.

3. Install NMstate operator.

4. Create a NodeNetworkConfigurationPolicy.

5. Identify on which node the egressIP is present.

6. Restart the nmstate-handler pod running on the identified node.

7. Verify that the egressIP is no more present.

Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.

This issue has a high impact since any changes triggered for the NMstate operator will prevent application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.

The expectation is that NMstate operator should not interfere with SDN configuration.

Description of problem:

container_network* metrics stop reporting after a container restarts. Other container_* metrics continue to report for the same pod. 

How reproducible:

Issue can be reproduced by triggering a container restart 

Steps to Reproduce:

1.Restart container 
2.Check metrics and see container_network* not reporting

Additional info:
Ticket with more detailed debugging process OHSS-16739

This is a clone of issue OCPBUGS-1549. The following is the description of the original issue:

Description of problem:

The cluster-dns-operator does not reconcile the openshift-dns namespace, which has been exposed as an issue in 4.12 due to the requirement for the namespace to have pod-security labels.

If a cluster has been incrementally updated from a version less than or equal to 4.9, the openshift-dns namespace will most likely not contain the required pod-security labels since the namespace was statically created when the cluster was installed with old namespace configuration.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always if cluster originally installed with v4.9 or less

Steps to Reproduce:

1. Install v4.9
2. Upgrade to v4.12 (incrementally if required for upgrade path)
3. openshift-dns namespace will be missing pod-security labels

Actual results:

"oc get ns openshift-dns -o yaml" will show missing pod-security labels: 

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c15,c0
    openshift.io/sa.scc.supplemental-groups: 1000210000/10000
    openshift.io/sa.scc.uid-range: 1000210000/10000
  creationTimestamp: "2020-05-21T19:36:15Z"
  labels:
    kubernetes.io/metadata.name: openshift-dns
    olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a: ""
    openshift.io/cluster-monitoring: "true"
    openshift.io/run-level: "0"
  name: openshift-dns
  resourceVersion: "3127555382"
  uid: 0fb4571e-952f-4bea-bc45-461beec54369
spec:
  finalizers:
  - kubernetes

Expected results:

pod-security labels should exist:
 
 labels:
    kubernetes.io/metadata.name: openshift-dns
    olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a: ""
    openshift.io/cluster-monitoring: "true"
    openshift.io/run-level: "0"
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged

Additional info:

Issue found in CI during upgrade

https://coreos.slack.com/archives/C03G7REB4JV/p1663676443155839 

This is a clone of issue OCPBUGS-5093. The following is the description of the original issue:

Description of problem:

When users try to deploy an application from git method on dev console it throws warning message for specific public repos `URL is valid but cannot be reached. If this is a private repository, enter a source secret in Advanced Git Options.`. If we ignore the warning and go ahead the build will be successful although the warning message seems to be misleading.

Actual results:
Getting a warning for url while trying to deploy an application from git method on dev console from a public repo

Expected results:
It should show validated

Description of problem:

The following test fails using Openshift on Openstack

[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1496
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1496
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/test.go:61
[BeforeEach] [sig-auth][Feature:SCC][Early]
  github.com/openshift/origin/test/extended/util/client.go:153
STEP: Creating a kubernetes client
[It] should not have pod creation failures during install [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/authorization/scc.go:24
[AfterEach] [sig-auth][Feature:SCC][Early]
  github.com/openshift/origin/test/extended/util/client.go:151
[AfterEach] [sig-auth][Feature:SCC][Early]
  github.com/openshift/origin/test/extended/util/client.go:152
fail [github.com/openshift/origin/test/extended/authorization/scc.go:94]: 2 pods failed before test on SCC errors
Error creating: pods "csi-nodeplugin-nfsplugin-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/csi-nodeplugin-nfsplugin -n openshift-manila-csi-driver happened 10 times
Error creating: pods "openstack-manila-csi-nodeplugin-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[1].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[1].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/openstack-manila-csi-nodeplugin -n openshift-manila-csi-driver happened 11 times
Ginkgo exit error 1: exit with code 1

Version-Release number of selected component (if applicable):

OCP:4.11.27 OSP:RHOS-16.2-RHEL-8-20221201.n.1

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-1329. The following is the description of the original issue:

Description of problem:

etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Create multiple pods with local storage volumes attached(attaching yaml manifest)
3. Force delete and re-create pods 10 times

Actual results:

etcd and kube-apiserver pods get restarted, making to cluster unavailable for a period of time

Expected results:

etcd and kube-apiserver do not get restarted

Additional info:

Attaching must-gather.

Please let me know if any additional info is required. Thank you!

This is a clone of issue OCPBUGS-3022. The following is the description of the original issue:

Description of problem:

** europe-west8
** europe-west9
** europe-southwest1
** southamerica-west1
** us-east5
** us-south1
are not listed in the survey

Steps to Reproduce:

1. run survey (openshift-install create install-config without an install config file)
2. go through prompts until regions
3.

Actual results:

listed regions are missing

Expected results:

regions above are listed (and install succeeds in the region)

 

This is a clone of issue OCPBUGS-10496. The following is the description of the original issue:

Description of problem:

Customer is running machine learning (ML) tasks on OpenShift Container Platform, for which large models need to be embedded in the container image. When building a new container image with large container image layers (>=10GB) and pushing it to the internal image registry, this fails with the following error message:

error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid

In the image registry Pod we can see the following error message:

time="2023-01-30T14:12:22.315726147Z" level=error msg="upload resumed at wrong offest: 10485760000 != 10738341637" [..]
time="2023-01-30T14:12:22.338264863Z" level=error msg="response completed with error" err.code="blob upload invalid" err.message="blob upload invalid" [..]

Backend storage is AWS S3. We suspect that this could be the following upstream bug: https://github.com/distribution/distribution/issues/1698

Version-Release number of selected component (if applicable):

Customer encountered the issue on OCP 4.11.20. We reproduced the issue on OCP 4.11.21:

$  oc version
Client Version: 4.12.0
Kustomize Version: v4.5.7
Server Version: 4.11.21
Kubernetes Version: v1.24.6+5658434

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform cluster 4.11.21 on AWS
2. Confirm registry storage is on AWS S3
3. Create a new build including a 10GB file using the following command: `printf "FROM registry.fedoraproject.org/fedora:37\nRUN dd if=/dev/urandom of=/bigfile bs=1M count=10240" | oc new-build -D -`
4. Wait for some time for the build to run

Actual results:

Pushing the new build fails with the following error message:

error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid

Expected results:

Push of large container image layers succeeds

Additional info:

This is a clone of issue OCPBUGS-7650. The following is the description of the original issue:

This is a clone of issue OCPBUGS-672. The following is the description of the original issue:

Description of problem:

Redhat-operator part of the marketplace is failing regularly due to startup probe timing out connecting to registry-server container part of the same pod within 1 sec which in turn increases CPU/Mem usage on Master nodes:

62m         Normal    Scheduled                pod/redhat-operators-zb4j7                         Successfully assigned openshift-marketplace/redhat-operators-zb4j7 to ip-10-0-163-212.us-west-2.compute.internal by ip-10-0-149-93
62m         Normal    AddedInterface           pod/redhat-operators-zb4j7                         Add eth0 [10.129.1.112/23] from ovn-kubernetes
62m         Normal    Pulling                  pod/redhat-operators-zb4j7                         Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.11"
62m         Normal    Pulled                   pod/redhat-operators-zb4j7                         Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.11" in 498.834447ms
62m         Normal    Created                  pod/redhat-operators-zb4j7                         Created container registry-server
62m         Normal    Started                  pod/redhat-operators-zb4j7                         Started container registry-server
62m         Warning   Unhealthy                pod/redhat-operators-zb4j7                         Startup probe failed: timeout: failed to connect service ":50051" within 1s
62m         Normal    Killing                  pod/redhat-operators-zb4j7                         Stopping container registry-server


Increasing the threshold of the probe might fix the problem:
  livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5 

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install OSD cluster using 4.11.0-0.nightly-2022-08-26-162248 payload
2. Inspect redhat-operator pod in openshift-marketplace namespace
3. Observe the resource usage ( CPU and Memory ) of the pod 

Actual results:

Redhat-operator failing leading to increase to CPU and Mem usage on master nodes regularly during the startup

Expected results:

Redhat-operator startup probe succeeding and no spikes in resource on master nodes

Additional info:

Attached cpu, memory and event traces.

 

Description of problem:

There is a bug affecting verify steps functionality on iDRAC hardware in OpenShift 4.11 and 4.10. Original bug report has been made against 4.10:

https://issues.redhat.com/browse/OCPBUGS-1740

While I am not aware of this issue being reported against 4.11, due to the fact that the fix is only present in 4.12 codebase, 4.11 versions will also be affected by this issue.

This bug is created to meet automation requirements for backporting the fixes from 4.12 version to 4.11 (and then to 4.11 in the bug quoted above).

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
Users on a fully-disconnected cluster could not see Devfiles in the developer catalog or import a Devfiles. That's fine.

But the API calls /api/devfile/samples/ and /api/devfile/ takes 30 seconds until they fail with a 504 Gateway timeout error.

If possible they should fail immediately.

Version-Release number of selected component (if applicable):
This might happen since 4.8

Tested this yet only on 4.12.0-0.nightly-2022-09-07-112008

How reproducible:
Always

Steps to Reproduce:

  1. Start a disconnected cluster with a proxy
  2. Open the browser network inspector and filter for /api/devfile
  3. Switch to Developer perspective
    1. Navigate to Add > Developer Catalog (All Services) > Devfiles
    2. Or Add > Import from Git > and enter https://github.com/devfile-samples/devfile-sample-go-basic.git

Actual results:

  • Network call fails after 30 seconds
  • Import doesn't work

Expected results:

  • Network calls should fail immediately
  • We doesn't expect that the import will work

Additional info:
The console Pod log contains this error:

E0909 10:28:18.448680 1 devfile-handler.go:74] Failed to parse devfile: failed to populateAndParseDevfile: Get "https://registry.devfile.io/devfiles/go": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This is a clone of issue OCPBUGS-12956. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12910. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12904. The following is the description of the original issue:

Description of problem:

In order to test proxy installations, the CI base image for OpenShift on OpenStack needs netcat.

This is a clone of issue OCPBUGS-14403. The following is the description of the original issue:

Description of problem:

IngressVIP is getting attached to two node at once.

Version-Release number of selected component (if applicable):

4.11.39

How reproducible:

Always in customer cluster

Actual results:

IngressVIP is getting attached to two node at once.

Expected results:

IngressVIP should get attach to only one node.

Additional info:

 

This bug was initially created as a copy of
Bug #2096605
I am copying this bug because: the parent bug solved the validation aspect of diskType but now the description of diskType in
https://github.com/openshift/installer/blob/master/data/data/install.openshift.io_installconfigs.yaml#L2914-L2923
needs to be updated.

Version: 4.11.0-0.nightly-2022-06-06-201913

Platform: vSphere IPI

What happened?
1. If user inputs an invalid value for platform.vsphere.diskType in install-config.yaml file, there is no validation checking for diskType and doesn't exit with error, but continues the installation, which is not the same behavior as in 4.10.

After all vms are provisioned, I checked that the disk provision type is thick.

2. If user doesn't set platform.vsphere.diskType in install-config.yaml file, the default disk provision type is thick, but not the vSphere default storage policy. On VMC, the default policy is thin, so maybe the description of diskType should also need to be updated.

$ ./openshift-install explain installconfig.platform.vsphere.diskType
KIND: InstallConfig
VERSION: v1

RESOURCE: <string>
Valid Values: "","thin","thick","eagerZeroedThick"
DiskType is the name of the disk provisioning type, valid values are thin, thick, and eagerZeroedThick. When not specified, it will be set according to the default storage policy of vsphere.

What did you expect to happen?
validation for diskType

How to reproduce it (as minimally and precisely as possible)?
set diskType to invalid value in install-config.yaml and install the cluster

The kube-state-metric pod inside the openshift-monitoring namespace is not running as expected.

On checking the logs I am able to see that there is a memory panic

~~~
2022-11-22T09:57:17.901790234Z I1122 09:57:17.901768 1 main.go:199] Starting kube-state-metrics self metrics server: 127.0.0.1:8082
2022-11-22T09:57:17.901975837Z I1122 09:57:17.901951 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.902389844Z I1122 09:57:17.902291 1 main.go:210] Starting metrics server: 127.0.0.1:8081
2022-11-22T09:57:17.903191857Z I1122 09:57:17.903133 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.906272505Z I1122 09:57:17.906224 1 builder.go:191] Active resources: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
2022-11-22T09:57:17.917758187Z E1122 09:57:17.917560 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2022-11-22T09:57:17.917758187Z goroutine 24 [running]:
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.logPanic(

{0x1635600, 0x2696e10})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
2022-11-22T09:57:17.917758187Z panic({0x1635600, 0x2696e10}

)
2022-11-22T09:57:17.917758187Z /usr/lib/golang/src/runtime/panic.go:1038 +0x215
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.ingressMetricFamilies.func6(0x40)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:136 +0x189
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.wrapIngressFunc.func1(

{0x17fe520, 0xc00063b590})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:175 +0x49
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x17fe520, 0xc00063b590}

)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
~~~

Logs are attached to the support case

Description of problem:

Create Loadbalancer type service within the OCP 4.11.x OVNKubernetes cluster to expose the api server endpoint, the service does not response for normal oc request. 
But some of them are working, like "oc whoami", "oc get --raw /api"

Version-Release number of selected component (if applicable):

4.11.8 with OVNKubernetes

How reproducible:

always

Steps to Reproduce:

1. Setup openshift cluster 4.11 on AWS with OVNKubernetes as the default network
2. Create the following service under openshift-kube-apiserver namespace to expose the api
----
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1800"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  name: test-api
  namespace: openshift-kube-apiserver
spec:
  allocateLoadBalancerNodePorts: true
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerSourceRanges:
  - <my_ip>/32
  ports:
  - nodePort: 31248
    port: 6443
    protocol: TCP
    targetPort: 6443
  selector:
    apiserver: "true"
    app: openshift-kube-apiserver
  sessionAffinity: None
  type: LoadBalancer

3. Setup the DNS resolution for the access
xxx.mydomain.com ---> <elb-auto-generated-dns>

4. Try to access the cluster api via the service above by updating the kubeconfig to use the custom dns name

Actual results:

No response from the server side.

$ time oc get node -v8
I1025 08:29:10.284069  103974 loader.go:375] Config loaded from file:  bmeng.kubeconfig
I1025 08:29:10.294017  103974 round_trippers.go:420] GET https://rh-api.bmeng-ccs-ovn.3o13.s1.devshift.org:6443/api/v1/nodes?limit=500
I1025 08:29:10.294035  103974 round_trippers.go:427] Request Headers:
I1025 08:29:10.294043  103974 round_trippers.go:431]     Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json
I1025 08:29:10.294052  103974 round_trippers.go:431]     User-Agent: oc/openshift (linux/amd64) kubernetes/e40bd2d
I1025 08:29:10.365119  103974 round_trippers.go:446] Response Status: 200 OK in 71 milliseconds
I1025 08:29:10.365142  103974 round_trippers.go:449] Response Headers:
I1025 08:29:10.365148  103974 round_trippers.go:452]     Audit-Id: 83b9d8ae-05a4-4036-bff6-de371d5bec12
I1025 08:29:10.365155  103974 round_trippers.go:452]     Cache-Control: no-cache, private
I1025 08:29:10.365161  103974 round_trippers.go:452]     Content-Type: application/json
I1025 08:29:10.365167  103974 round_trippers.go:452]     X-Kubernetes-Pf-Flowschema-Uid: 2abc2e2d-ada3-4cb8-a86f-235df3a4e214
I1025 08:29:10.365173  103974 round_trippers.go:452]     X-Kubernetes-Pf-Prioritylevel-Uid: 02f7a188-43c7-4827-af58-5ebe861a1891
I1025 08:29:10.365179  103974 round_trippers.go:452]     Date: Tue, 25 Oct 2022 08:29:10 GMT
^C
real    17m4.840s
user    0m0.567s
sys    0m0.163s


However, it has the correct response if using --raw to request, eg:
$ oc get --raw /api/v1  --kubeconfig bmeng.kubeconfig 
{"kind":"APIResourceList","groupVersion":"v1","resources":[{"name":"bindings","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"componentstatuses","singularName":"","namespaced":false,"kind":"ComponentStatus","verbs":["get","list"],"shortNames":["cs"]},{"name":"configmaps","singularName":"","namespaced":true,"kind":"ConfigMap","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["cm"],"storageVersionHash":"qFsyl6wFWjQ="},{"name":"endpoints","singularName":"","namespaced":true,"kind":"Endpoints","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ep"],"storageVersionHash":"fWeeMqaN/OA="},{"name":"events","singularName":"","namespaced":true,"kind":"Event","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ev"],"storageVersionHash":"r2yiGXH7wu8="},{"name":"limitranges","singularName":"","namespaced":true,"kind":"LimitRange","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["limits"],"storageVersionHash":"EBKMFVe6cwo="},{"name":"namespaces","singularName":"","namespaced":false,"kind":"Namespace","verbs":["create","delete","get","list","patch","update","watch"],"shortNames":["ns"],"storageVersionHash":"Q3oi5N2YM8M="},{"name":"namespaces/finalize","singularName":"","namespaced":false,"kind":"Namespace","verbs":["update"]},{"name":"namespaces/status","singularName":"","namespaced":false,"kind":"Namespace","verbs":["get","patch","update"]},{"name":"nodes","singularName":"","namespaced":false,"kind":"Node","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["no"],"storageVersionHash":"XwShjMxG9Fs="},{"name":"nodes/proxy","singularName":"","namespaced":false,"kind":"NodeProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"nodes/status","singularName":"","namespaced":false,"kind":"Node","verbs":["get","patch","update"]},{"name":"persistentvolumeclaims","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pvc"],"storageVersionHash":"QWTyNDq0dC4="},{"name":"persistentvolumeclaims/status","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["get","patch","update"]},{"name":"persistentvolumes","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pv"],"storageVersionHash":"HN/zwEC+JgM="},{"name":"persistentvolumes/status","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["get","patch","update"]},{"name":"pods","singularName":"","namespaced":true,"kind":"Pod","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["po"],"categories":["all"],"storageVersionHash":"xPOwRZ+Yhw8="},{"name":"pods/attach","singularName":"","namespaced":true,"kind":"PodAttachOptions","verbs":["create","get"]},{"name":"pods/binding","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"pods/ephemeralcontainers","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"pods/eviction","singularName":"","namespaced":true,"group":"policy","version":"v1","kind":"Eviction","verbs":["create"]},{"name":"pods/exec","singularName":"","namespaced":true,"kind":"PodExecOptions","verbs":["create","get"]},{"name":"pods/log","singularName":"","namespaced":true,"kind":"Pod","verbs":["get"]},{"name":"pods/portforward","singularName":"","namespaced":true,"kind":"PodPortForwardOptions","verbs":["create","get"]},{"name":"pods/proxy","singularName":"","namespaced":true,"kind":"PodProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"pods/status","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"podtemplates","singularName":"","namespaced":true,"kind":"PodTemplate","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"LIXB2x4IFpk="},{"name":"replicationcontrollers","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["rc"],"categories":["all"],"storageVersionHash":"Jond2If31h0="},{"name":"replicationcontrollers/scale","singularName":"","namespaced":true,"group":"autoscaling","version":"v1","kind":"Scale","verbs":["get","patch","update"]},{"name":"replicationcontrollers/status","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["get","patch","update"]},{"name":"resourcequotas","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["quota"],"storageVersionHash":"8uhSgffRX6w="},{"name":"resourcequotas/status","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["get","patch","update"]},{"name":"secrets","singularName":"","namespaced":true,"kind":"Secret","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"S6u1pOWzb84="},{"name":"serviceaccounts","singularName":"","namespaced":true,"kind":"ServiceAccount","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["sa"],"storageVersionHash":"pbx9ZvyFpBE="},{"name":"serviceaccounts/token","singularName":"","namespaced":true,"group":"authentication.k8s.io","version":"v1","kind":"TokenRequest","verbs":["create"]},{"name":"services","singularName":"","namespaced":true,"kind":"Service","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["svc"],"categories":["all"],"storageVersionHash":"0/CO1lhkEBI="},{"name":"services/proxy","singularName":"","namespaced":true,"kind":"ServiceProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"services/status","singularName":"","namespaced":true,"kind":"Service","verbs":["get","patch","update"]}]}
 

Expected results:

The normal oc request should be working.

Additional info:

There is no such issue for clusters with openshift-sdn with the same OpenShift version and same LoadBalancer service.

We suspected that it might be related to the MTU setting, but this cannot explain why OpenShiftSDN works well.

Another thing might be related is that the OpenShiftSDN is using iptables for service loadbalancing and OVN is dealing that within the OVN services.

 

Please let me know if any debug log/info is needed.

This is a clone of issue OCPBUGS-10372. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10220. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7559. The following is the description of the original issue:

Description of problem:

When attempting to add nodes to a long-lived 4.12.3 cluster, net new nodes are not able to join the cluster. They are provisioned in the cloud provider (AWS), but never actually join as a node.

Version-Release number of selected component (if applicable):

4.12.3

How reproducible:

Consistent

Steps to Reproduce:

1. On a long lived cluster, add a new machineset

Actual results:

Machines reach "Provisioned" but don't join the cluster

Expected results:

Machines join cluster as nodes

Additional info:


Description of problem:

[4.11.z] Fix kubevirt-console tests

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-8044. The following is the description of the original issue:

Description of problem:

rhbz#2089199, backported to 4.11.5, shifted the etcd Grafana dashboard from the monitoring operator to the etcd operator. During the shift, the ConfigMap was renamed from grafana-dashboard-etcd to etcd-dashboard. However, we did not include logic for garbage-collecting the obsolete dasboard, so clusters that update from 4.11.1 and similar into 4.11.>=5 or 4.12+ currently end up with both the obsolete and new ConfigMaps. We should grow code to remove the obsolete ConfigMap.

Version-Release number of selected component (if applicable):

4.11.>=5 and 4.12+ are currently exposed.

How reproducible:

100%

Steps to Reproduce:

1. Install 4.11.1.
2. Update to a release that defines the etcd-dashboard ConfigMap.
3. Check for etcd dashboards with oc -n openshift-config-managed get configmaps | grep etcd.

Actual results:

Both etcd-dashboard and grafana-dashboard-etcd exist:

$ oc -n openshift-config-managed get configmaps | grep etcd
etcd-dashboard                                        1      196d
grafana-dashboard-etcd                                1      2y282d

Another example is 4.11.1 to 4.11.5 CI:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1570415394001260544/artifacts/e2e-aws-upgrade/configmaps.json | jq -r '.items[].metadata | select(.namespace == "openshift-config-managed" and (.name | contains("etcd"))) | .name'
etcd-dashboard
grafana-dashboard-etcd

Expected results:

Only etcd-dashboard still exists.

Additional info:

A new manifest for the outgoing ConfigMap that sets the release.openshift.io/delete: "true" annotation would ask the cluster-version operator to reap the obsolete ConfigMap.

Description of problem:

The alibabacloud client "aliyun" would be used when pre-configuring some resources (e.g. VPC, bastion host, etc.) before launching an OCP cluster with customization.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

This is a clone of issue OCPBUGS-1765. The following is the description of the original issue:

Description of problem:

If a customer creates a machine with a networks section like this

networks:
- filter: {}
  noAllowedAddressPairs: false
  subnets:
  - filter: {}
    uuid: primary-subnet-uuid
- filter: {}
  noAllowedAddressPairs: true
  subnets:
  - filter: {}
    uuid: other-subnet-uuid
primarySubnet: primary-subnet-uuid

Then all the ports are created without the allowed address pairs.

Doing some research in the source code, I have found that:
- For each entry on the networks: section, networks are filtered as per its filter: section[1]
- Then, if the subnets: section of the network entry is not empty, for each of the network IDs found above[2], 2 things are done that are relevant for this situatoin:
  - The net ID is saved on a netsWithoutAllowedAddressPairs[3]. That map is later checked while creating any port[4].
  - For each subnet entry that matches the network ID, a port is created[5].

So, the problematic behavior happens due to the following:

- Both entries in the networks array have empty filters. This means that both entries selected all the neutron networks.
- This configuration results in one port per subnet as expected because, in the later traversal of the subnets array of each entry[5], it is filtering by subnet and creating a single port as expected.
- However, the entry with "noAllowedAddressPairs: true" is selecting all the neutron networks, so it adds all of them to the netsWithoutAllowedAddressPairs map[3], regardless of the subnets filtering.
- As all the networks are in noAllowedAddressPairs: true array, all the ports created for the VM have their allowed address pairs removed[4].

Why do we consider this behavior undesired?

I understand that, if we create a port for a network that has no allowed pairs, we create all the other ports in the same networks without the pairs. However, it is surprising that a port in a network is removed the allowed address pairs due to a setting in an entry that yielded no port on that network. In other words, one would expect that the same subnet filtering that happens on each network entry in what regards yielding ports for the VM would also work for the noAllowedPairs parameter.

Version-Release number of selected component (if applicable):

4.10.30

How reproducible:

Always

Steps to Reproduce:

1. Create a machineset like in the description
2.
3.

Actual results:

All ports have no address pairs

Expected results:

Only the port on the secondary subnet has no address pairs.

Additional info:

A simple workaround would be to just fill the filter so that a single network is selected for each network entry.

References:
[1] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L576
[2] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L580
[3] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L581-L583
[4] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L658-L660
[5] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L610-L625

Description of problem:

NodePort port not accessible

Version-Release number of selected component (if applicable):

OCP 4.8.20

How reproducible:

$oc -n ui-nprd get services -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
docker-registry ClusterIP 10.201.219.240 <none> 5000/TCP 24d app=registry
docker-registry-lb LoadBalancer 10.201.252.253 internal-xxxxxx.xx-xxxx-1.elb.amazonaws.com 5000:30779/TCP 3d22h app=registry
docker-registry-np NodePort 10.201.216.26 <none> 5000:32428/TCP 3d16h app=registry

$oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxx.ca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -vz 10.81.23.96 32428
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.

In a new-created namespaces the same deployment works:

[RHEL7:> oc project
Using project "test-c1" on server "https://api.xx.xx.xxxx.xx.xx:6443".
[RHEL7:- ~/tmp]> oc port-forward service/docker-registry-np 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000

[1]+ Stopped oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> bg %1
[1]+ oc4 port-forward service/docker-registry-np 5000:5000 &
[RHEL7: ~/tmp]> nc -v localhost 5000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5000.
Handling connection for 5000

[RHEL7: ~/tmp]> kill %1
[RHEL7: ~/tmp]>
[1]+ Terminated oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> oc get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry-np NodePort 10.201.224.174 <none> 5000:31793/TCP 68s

[RHEL7: ~/tmp]> oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
registry-75b7c7fd94-rx29j 1/1 Running 0 7m5s 10.201.1.29 ip-xxx.ca-central-1.compute.internal <none> <none>
[RHEL7: ~/tmp]> oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxxca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.87
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -v 10.81.23.87 31793
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.81.23.87:31793.

Actual results:

  • Working on new created namespace
  • Not working on already created namespace

Expected results:

  • Suppose to work on all namespaces.

Additional info:

  • This cluster get upgrade from 4.7.x to 4.8 and then they manually enable OVN.
  • The issue was happening on all namespaces but after restarting the ovnkube-master-xxxx pods only the newly created namespaces work.

This is a clone of issue OCPBUGS-6816. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6799. The following is the description of the original issue:

Description of problem:
The pipelines -> repositories list view in Dev Console does not show the running pipelineline as the last pipelinerun in the table.

Original BugZilla Link: https://bugzilla.redhat.com/show_bug.cgi?id=2016006
OCPBUGSM: https://issues.redhat.com/browse/OCPBUGSM-36408

This bug is a backport clone of [Bugzilla Bug 2118318](https://bugzilla.redhat.com/show_bug.cgi?id=2118318). The following is the description of the original bug:

+++ This bug was initially created as a clone of Bug #2117569 +++

Description of problem:

The garbage collector resource quota controller must ignore ALL events; otherwise, if a rogue controller or a workload causes unbound event creation, performance will degrade as it has to process the events.

Fix: https://github.com/kubernetes/kubernetes/pull/110939

This bug is to track fix in master (4.12) and also allow to backport to 4.11.1

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

— Additional comment from Michal Fojtik on 2022-08-11 10:52:28 UTC —

I'm using FastFix here as we need to backport this to 4.11.1 to avoid support churn for busy clusters or clusters doing upgrades.

— Additional comment from ART BZ Bot on 2022-08-11 15:13:32 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from zhou ying on 2022-08-12 03:03:34 UTC —

checked the payload commit id , the payload 4.12.0-0.nightly-2022-08-11-191750 has container the fixed pr .

oc adm release info registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-08-11-191750 --commit-urls |grep hyperkube
Warning: the default reading order of registry auth file will be changed from "${HOME}/.docker/config.json" to podman registry config locations in the future version. "${HOME}/.docker/config.json" is deprecated, but can still be used for storing credentials as a fallback. See https://github.com/containers/image/blob/main/docs/containers-auth.json.5.md for the order of podman registry config locations.
hyperkube https://github.com/openshift/kubernetes/commit/da80cd038ee5c3b45ba36d4b48b42eb8a74439a3

commit da80cd038ee5c3b45ba36d4b48b42eb8a74439a3 (HEAD -> master, origin/release-4.13, origin/release-4.12, origin/master, origin/HEAD)
Merge: a9d6306a701 055b96e614a
Author: OpenShift Merge Robot <openshift-merge-robot@users.noreply.github.com>
Date: Thu Aug 11 15:13:05 2022 +0000

Merge pull request #1338 from benluddy/openshift-pick-110888

Bug 2117569: UPSTREAM: 110888: feat: fix a bug thaat not all event be ignored by gc controller

This is a clone of issue OCPBUGS-675. The following is the description of the original issue:

Description of problem:

A cluster hit a panic in etcd operator in bootstrap:
I0829 14:46:02.736582 1 controller_manager.go:54] StaticPodStateController controller terminated
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e940ab]

goroutine 2701 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x29374c0, 0xc00217d920}, 0xc0021fb110)
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:135 +0x34b
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x7f
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2ac
Version-Release number of selected component (if applicable):

 

How reproducible:

Pulled up a 4.12 cluster and hit panic during bootstrap

Steps to Reproduce:

1.
2.
3.

Actual results:

panic as above

Expected results:

no panic

Additional info:

 

This is a clone of issue OCPBUGS-7800. The following is the description of the original issue:

This is a clone of issue OCPBUGS-266. The following is the description of the original issue:

Description of problem: I am working with a customer who uses the web console.  From the Developer Perspective's Project Access tab, they cannot differentiate between users and groups and furthermore cannot add groups from this web console.  This has led to confusion whether existing resources were in fact users or groups, and furthermore they have added users when they intended to add groups instead.  What we really need is a third column in the Project Access tab that says whether a resource is a user or group.

 

Version-Release number of selected component (if applicable): This is an issue in OCP 4.10 and 4.11, and I presume future versions as well

How reproducible: Every time.  My customer is running on ROSA, but I have determined this issue to be general to OpenShift.

Steps to Reproduce:

From the oc cli, I create a group and add a user to it.

$ oc adm groups new techlead
group.user.openshift.io/techlead created
$ oc adm groups add-users techlead admin
group.user.openshift.io/techlead added: "admin"
$ oc get groups
NAME                                     USERS
cluster-admins                           
dedicated-admins                         admin
techlead   admin
I create a new namespace so that I can assign a group project level access:

$ oc new-project my-namespace

$ oc adm policy add-role-to-group edit techlead -n my-namespace
I then went to the web console -> Developer perspective -> Project -> Project Access.  I verified the rolebinding named 'edit' is bound to a group named 'techlead'.

$ oc get rolebinding
NAME                                                              ROLE                                   AGE
admin                                                             ClusterRole/admin                      15m
admin-dedicated-admins                                            ClusterRole/admin                      15m
admin-system:serviceaccounts:dedicated-admin                      ClusterRole/admin                      15m
dedicated-admins-project-dedicated-admins                         ClusterRole/dedicated-admins-project   15m
dedicated-admins-project-system:serviceaccounts:dedicated-admin   ClusterRole/dedicated-admins-project   15m
edit                                                              ClusterRole/edit                       2m18s
system:deployers                                                  ClusterRole/system:deployer            15m
system:image-builders                                             ClusterRole/system:image-builder       15m
system:image-pullers                                              ClusterRole/system:image-puller        15m

$ oc get rolebinding edit -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:16:56Z"
  name: edit
  namespace: my-namespace
  resourceVersion: "108357"
  uid: 4abca27d-08e8-43a3-b9d3-d20d5c294bbe
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:

  • apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: techlead
    Now, from the same Project Access tab in the web console, I added the developer with role "View".  From this web console, it is unclear whether developer and techlead are users or groups.

Now back to the CLI, I view the newly created rolebinding named 'developer-view-c15b720facbc8deb', and find that the "View" role is assigned to a user named 'developer', rather than a group.

$ oc get rolebinding                                                                      
NAME                                                              ROLE                                   AGE
admin                                                             ClusterRole/admin                      17m
admin-dedicated-admins                                            ClusterRole/admin                      17m
admin-system:serviceaccounts:dedicated-admin                      ClusterRole/admin                      17m
dedicated-admins-project-dedicated-admins                         ClusterRole/dedicated-admins-project   17m
dedicated-admins-project-system:serviceaccounts:dedicated-admin   ClusterRole/dedicated-admins-project   17m
edit                                                              ClusterRole/edit                       4m25s
developer-view-c15b720facbc8deb     ClusterRole/view                       90s
system:deployers                                                  ClusterRole/system:deployer            17m
system:image-builders                                             ClusterRole/system:image-builder       17m
system:image-pullers                                              ClusterRole/system:image-puller        17m
[10:21:21] kechung:~ $ oc get rolebinding developer-view-c15b720facbc8deb -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:19:51Z"
  name: developer-view-c15b720facbc8deb
  namespace: my-namespace
  resourceVersion: "113298"
  uid: cc2d1b37-922b-4e9b-8e96-bf5e1fa77779
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:

  • apiGroup: rbac.authorization.k8s.io
      kind: User
      name: developer

So in conclusion, from the Project Access tab, we're unable to add groups and unable to differentiate between users and groups.  This is in essence our ask for this RFE.

 

Actual results:

Developer perspective -> Project -> Project Access tab shows a list of resources which can be users or groups, but does not differentiate between them.  Furthermore, when we add resources, they are only users and there is no way to add a group from this tab in the web console.

 

Expected results:

Should have the ability to add groups and differentiate between users and groups.  Ideally, we're looking at a third column for user or group.

 

Additional info:

This is a clone of issue OCPBUGS-3235. The following is the description of the original issue:

Description of problem:

Frequently we see the loading state of the topology view, even when there aren't many resources in the project.

Including an example

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. load topology
  2. if it loads successfully, keep trying  until it fails to load

Actual results:

topology will sometimes hang with the loading indicator showing indefinitely

Expected results:

topology should load consistently without fail

Reproducibility (Always/Intermittent/Only Once):

intermittent

Build Details:

4.9

Additional info:

Description of problem:

The oc new-app command using a private Git repository no longer works with oc v4.10. Specifically, the private Git repository is authenticating over SSH and this occurs when no image stream is specified in the command so that the language detection step is required. Here is an example command:

oc new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app

Version-Release number of selected component (if applicable):
oc v4.10

How reproducible:

Easily reproducible.

Steps to Reproduce:

1. Download v4.10 of the oc tool and add to local executable path as 'oc':
https://access.redhat.com/downloads/content/290/ver=4.10/rhel---8/4.10.10/x86_64/product-software

➜ ~ oc version
Client Version: 4.10.9

2. Setup a private Github repository

3. Add SSH public key as a deploy key to the private Github repository

4. Push some empty test file like index.php to the private repository (used for language detection)

5. Create a new OpenShift project:
➜ ~ oc new-project test-oc-newapp

6. Create a secret to hold the private key of the SSH key pair
➜ ~ oc create secret generic github-repo-key --from-file=ssh-privatekey=/Users/daniel/test/github-repo --type=kubernetes.io/ssh-auth
secret/github-repo-key created

7. Enable access to the secret from the builder service account:
➜ ~ oc secrets link builder github-repo-key

8. Create a new application using the source secret:
oc new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app

Actual results:

➜ ~ oc new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app
warning: Cannot check if git requires authentication.
error: local file access failed with: stat git@github.com:scottishkiwi/test-oc-newapp.git: no such file or directory
error: unable to locate any images in image streams, templates loaded in accessible projects, template files, local docker images with name "git@github.com:scottishkiwi/test-oc-newapp.git"

Argument 'git@github.com:scottishkiwi/test-oc-newapp.git' was classified as an image, image~source, or loaded template reference.

The 'oc new-app' command will match arguments to the following types:

1. Images tagged into image streams in the current project or the 'openshift' project

  • if you don't specify a tag, we'll add ':latest'
    2. Images in the container storage, on remote registries, or on the local container engine
    3. Templates in the current project or the 'openshift' project
    4. Git repository URLs or local paths that point to Git repositories

--allow-missing-images can be used to point to an image that does not exist yet.

Expected results (with oc v4.8):

➜ ~ oc-4.8 version
Client Version: 4.8.37

➜ oc-4.8 new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app
warning: Cannot check if git requires authentication.
--> Found image 22f1bf3 (4 weeks old) in image stream "openshift/php" under tag "7.4-ubi8" for "php"

Apache 2.4 with PHP 7.4
-----------------------
PHP 7.4 available as container is a base platform for building and running various PHP 7.4 applications and frameworks. PHP is an HTML-embedded scripting language. PHP attempts to make it easy for developers to write dynamically generated web pages. PHP also offers built-in database integration for several commercial and non-commercial database management systems, so writing a database-enabled webpage with PHP is fairly simple. The most common use of PHP coding is probably as a replacement for CGI scripts.

Tags: builder, php, php74, php-74

  • The source repository appears to match: php
  • A source build using source code from git@github.com:scottishkiwi/test-oc-newapp.git will be created
  • The resulting image will be pushed to image stream tag "test-app:latest"
  • Use 'oc start-build' to trigger a new build
  • WARNING: this source repository may require credentials.
    Create a secret with your git credentials and use 'oc set build-secret' to assign it to the build config.

--> Creating resources ...
imagestream.image.openshift.io "test-app" created
buildconfig.build.openshift.io "test-app" created
deployment.apps "test-app" created
service "test-app" created
--> Success
Build scheduled, use 'oc logs -f buildconfig/test-app' to track its progress.
Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
'oc expose service/test-app'
Run 'oc status' to view your app.

Additional info:

Also tested with oc v4.9 and works as expected:

➜ ~ oc-4.9 version
Client Version: 4.9.29

➜ ~ oc-4.9 new-app git@github.com:scottishkiwi/test-oc-newapp.git --source-secret github-repo-key --name test-app
warning: Cannot check if git requires authentication.
--> Found image 22f1bf3 (4 weeks old) in image stream "openshift/php" under tag "7.4-ubi8" for "php"

Apache 2.4 with PHP 7.4
-----------------------
PHP 7.4 available as container is a base platform for building and running various PHP 7.4 applications and frameworks. PHP is an HTML-embedded scripting language. PHP attempts to make it easy for developers to write dynamically generated web pages. PHP also offers built-in database integration for several commercial and non-commercial database management systems, so writing a database-enabled webpage with PHP is fairly simple. The most common use of PHP coding is probably as a replacement for CGI scripts.

Tags: builder, php, php74, php-74

  • The source repository appears to match: php
  • A source build using source code from git@github.com:scottishkiwi/test-oc-newapp.git will be created
  • The resulting image will be pushed to image stream tag "test-app:latest"
  • Use 'oc start-build' to trigger a new build
  • WARNING: this source repository may require credentials.
    Create a secret with your git credentials and use 'oc set build-secret' to assign it to the build config.

--> Creating resources ...
buildconfig.build.openshift.io "test-app" created
deployment.apps "test-app" created
service "test-app" created
--> Success
Build scheduled, use 'oc logs -f buildconfig/test-app' to track its progress.
Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
'oc expose service/test-app'
Run 'oc status' to view your app.
➜ ~ oc status
In project dan-test-oc-newapp on server https://api.dsquirre.2b7w.p1.openshiftapps.com:6443

svc/test-app - 172.30.238.75 ports 8080, 8443
deployment/test-app deploys istag/test-app:latest <-
bc/test-app source builds git@github.com:scottishkiwi/test-oc-newapp.git on openshift/php:7.4-ubi8
deployment #3 running for 38 minutes - 1 pod
deployment #2 deployed 38 minutes ago
deployment #1 deployed 38 minutes ago

1 info identified, use 'oc status --suggest' to see details.

Description of problem:

Similar to OCPBUGS-11636 ccoctl needs to be updated to account for the s3 bucket changes described in https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

these changes have rolled out to us-east-2 and China regions as of today and will roll out to additional regions in the near future

See OCPBUGS-11636 for additional information

Version-Release number of selected component (if applicable):

 

How reproducible:

Reproducible in affected regions.

Steps to Reproduce:

1. Use "ccoctl aws create-all" flow to create STS infrastructure in an affected region like us-east-2. Notice that document upload fails because the s3 bucket is created in a state that does not allow usage of ACLs with the s3 bucket.

Actual results:

./ccoctl aws create-all --name abutchertestue2 --region us-east-2 --credentials-requests-dir ./credrequests --output-dir _output
2023/04/11 13:01:06 Using existing RSA keypair found at _output/serviceaccount-signer.private
2023/04/11 13:01:06 Copying signing key for use by installer
2023/04/11 13:01:07 Bucket abutchertestue2-oidc created
2023/04/11 13:01:07 Failed to create Identity provider: failed to upload discovery document in the S3 bucket abutchertestue2-oidc: AccessControlListNotSupported: The bucket does not allow ACLs
        status code: 400, request id: 2TJKZC6C909WVRK7, host id: zQckCPmozx+1yEhAj+lnJwvDY9rG14FwGXDnzKIs8nQd4fO4xLWJW3p9ejhFpDw3c0FE2Ggy1Yc=

Expected results:

"ccoctl aws create-all" successfully creates IAM and S3 infrastructure. OIDC discovery and JWKS documents are successfully uploaded to the S3 bucket and are publicly accessible.

Additional info:

 

This bug is a backport clone of [Bugzilla Bug 2111686](https://bugzilla.redhat.com/show_bug.cgi?id=2111686). The following is the description of the original bug:

Description of problem:
While using the console web UI with a nanokube cluster at our hack week in June 2022, we found different NPEs when the project or build status wasn't set.

Version-Release number of selected component (if applicable):
4.12 (but maybe all)

How reproducible:
Always, but only when running a nanokube cluster

Steps to Reproduce:
1. Clone and build https://github.com/RedHatInsights/nanokube

export KUBEBUILDER_ASSETS=`~/go/bin/setup-envtest use 1.21 -p path`
./nanokube

2. Run our console web UI against this cluster!

  1. do not call ./contrib/oc-environment.sh !!
    export BRIDGE_USER_AUTH="disabled"
    export BRIDGE_K8S_MODE="off-cluster"
    export BRIDGE_K8S_MODE_OFF_CLUSTER_ENDPOINT="http://127.0.0.1:8090/"
    export BRIDGE_K8S_AUTH="bearer-token"
    export BRIDGE_K8S_AUTH_BEARER_TOKEN="ignored-by-proxy"
    ./bin/bridge

Actual results:
Crashes while navigate to the project list or details page.
Crash on the build config list page.

Expected results:
No crashes, or at least not the described.

Additional info:

Description of problem:

Users search a resource (for example, Pod) with Name filter applied and input a text to the filter field then the search results filtered accordingly. 

Once the results are shown, when the user clear the value in one-shot (i.e. select whole filter text from the field and clear it using delete or backspace key) from the field, 

then the search result doesn't clear accordingly and the previous result stays on the page.

Version-Release number of selected components (if applicable):

4.11.0-0.nightly-2022-08-16-194731 & works fine with OCP 4.12 latest version.

How reproducible:

 Always

Steps to Reproduce:

  1. Login to OCP web console.
  2. Go to the Search section.
  3. Select the resource filter and choose a resource (for example , Pod)
  4. Change the filter option to "Name" and input a text( for example, installer) to the filter field.
  5. Once the results are shown, select the whole filter text ( i.e. installer).
  6. Clear the text using the Delete or Backspace key.
  7. View the behavior on search results

Actual results:

Search result doesn't clear when user clears name filter in one-shot for any resources.

Expected results:

Search results should clear when the user clears name filters in one-shot for any resources.

Additional info:

Reproduced in both chrome[103.0.5060.114 (Official Build) (64-bit)] and firefox[91.11.0esr (64-bit)] browsers.

Attached screen share for the same issue. SearchIssues.mp4

This is a clone of issue OCPBUGS-723. The following is the description of the original issue:

Description of problem:
I have a customer who created clusterquota for one of the namespace, it got created but the values were not reflecting under limits or not displaying namespace details.
~~~
$ oc describe AppliedClusterResourceQuota
Name: test-clusterquota
Created: 19 minutes ago
Labels: size=custom
Annotations: <none>
Namespace Selector: []
Label Selector:
AnnotationSelector: map[openshift.io/requester:system:serviceaccount:application-service-accounts:test-sa]
Scopes: NotTerminating
Resource Used Hard
-------- ---- ----
~~~

WORKAROUND: They recreated the clusterquota object (cache it off, delete it, create new) after which it displayed values as expected.

In the past, they saw similar behavior on their test cluster, there it was heavily utilized the etcd DB was much larger in size (>2.5Gi), and had many more objects (at that time, helm secrets were being cached for all deployments, and keeping a history of 10, so etcd was being bombarded).

This cluster the same "symptom" was noticed however etcd was nowhere near that in size nor the amount of etcd objects and/or helm cached secrets.

Version-Release number of selected component (if applicable): OCP 4.9

How reproducible: Occurred only twice(once in test and in current cluster)

Steps to Reproduce:
1. Create ClusterQuota
2. Check AppliedClusterResourceQuota
3. The values and namespace is empty

Actual results: ClusterQuota should display the values

Expected results: ClusterQuota not displaying values

This is a clone of issue OCPBUGS-6766. The following is the description of the original issue:

This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2083087 (OCPBUGSM-44070) to backport this issue.

Description of problem:
"Delete dependent objects of this resource" is a bit of confusing for some users because when creating the Application in Dev console not only the deployment but also IS, route, svc, secret objects will be created as well. When deleting the Application (in fact it is deployment), there is an option called "Delete dependent objects of this resource" and some users might think this means the IS, route, svc and any other objects which are created alongside with the deployment will be deleted as well

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Always

Steps to Reproduce:
1. Create Application in Dev console
2. Delete the deployment
3. Check "Delete dependent objects of this resource"

Actual results:
Only deployment will be deleted and IS, svc, route will not be deleted

Expected results:
We either change the description of this option, or we really delete IS, svc, route and any other objects created under this Application.

Additional info:

console-operator codebase contains a lot of inline manifests. Instead we should put those manifests into a `/bindata` folder, from which they will be read and then updated per purpose.

Description of problem:

Cluster running 4.10.52 had three aws-ebs-csi-driver-node pods begin to consume multiple GB of memory, causing heavy node memory pressure as the pods have no memory limit. 

All other aws-ebs-csi-driver-node pods were still in the 50-70MB range:

NAME                                            CPU(cores)   MEMORY(bytes)   
aws-ebs-csi-driver-controller-59867579b-d6s2q   0m           397Mi           
aws-ebs-csi-driver-controller-59867579b-t4wgq   0m           276Mi           
aws-ebs-csi-driver-node-4rmvk                   0m           53Mi            
aws-ebs-csi-driver-node-5799f                   0m           50Mi            
aws-ebs-csi-driver-node-6dpvg                   0m           59Mi            
aws-ebs-csi-driver-node-6ldzk                   0m           65Mi            
aws-ebs-csi-driver-node-6mbk5                   0m           54Mi            
aws-ebs-csi-driver-node-bkvsr                   0m           50Mi            
aws-ebs-csi-driver-node-c2fb2                   0m           62Mi            
aws-ebs-csi-driver-node-f422m                   0m           61Mi            
aws-ebs-csi-driver-node-lwzbb                   6m           1940Mi          
aws-ebs-csi-driver-node-mjznt                   0m           53Mi            
aws-ebs-csi-driver-node-pczsj                   0m           62Mi            
aws-ebs-csi-driver-node-pmskn                   0m           3493Mi          
aws-ebs-csi-driver-node-qft8w                   0m           68Mi            
aws-ebs-csi-driver-node-v5bpx                   11m          2076Mi          
aws-ebs-csi-driver-node-vn8km                   0m           84Mi            
aws-ebs-csi-driver-node-ws6hx                   0m           73Mi            
aws-ebs-csi-driver-node-xsk7k                   0m           59Mi            
aws-ebs-csi-driver-node-xzwlh                   0m           55Mi            
aws-ebs-csi-driver-operator-8c5ffb6d4-fk6zk     5m           88Mi            

Deleting the pods caused them to recreate, with normal memory consumption levels.

Version-Release number of selected component (if applicable):

4.10.52

How reproducible:

Unknown

Description of problem:

We observed that a dual stack cluster deployed with AI gui only fails.
This cluster is dhcp for ipv4, RA/RS autoconfiguration for ipv6.

It fails with error in the onvkube container

```
I0906 07:45:43.044090   87450 gateway_init.go:261] Initializing Gateway Functionality
I0906 07:45:43.046398   87450 gateway_localnet.go:152] Node local addresses initialized to: map[10.131.31.214:{10.131.31.208 fffffff0} 10.255.0.2:{10.255.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 2001:1b74:480:613a:f6e9:d4ff:fef1:6f26:{2001:1b74:480:613a:: ffffffffffffffff0000000000000000} ::1:{::1 ffffffffffffffffffffffffffffffff} fd01:0:0:1::2:{fd01:0:0:1:: ffffffffffffffff0000000000000000} fe80::8ce9:b4ff:fe1a:1208:{fe80:: ffffffffffffffff0000000000000000} fe80::c8ef:ecff:fee3:64c7:{fe80:: ffffffffffffffff0000000000000000} fe80::f6e9:d4ff:fef1:6f26:{fe80:: ffffffffffffffff0000000000000000}]
I0906 07:45:43.047759   87450 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 7
I0906 07:45:43.048045   87450 helper_linux.go:97] Found default gateway interface br-ex 10.131.31.209
I0906 07:45:43.048152   87450 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 7
F0906 07:45:43.048318   87450 ovnkube.go:133] failed to get default gateway interface
```

on the node we observed that there is multi-path entry during

```
default proto ra metric 48 pref medium
        nexthop via fe80::e2f6:2d01:ab14:ec71 dev br-ex weight 1
        nexthop via fe80::e2f6:2d01:ab11:c271 dev br-ex weight 1
```

I manually remove one of the entries (`ip route delete`) and then delete the ovnkube-node pod. Then the installation continues, container works.

Every time there is multiple entry, if the onvkube-node starts, it fails.


Version-Release number of selected component (if applicable):

4.10.30

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

There might a side issue: the interface of the node upon boot takes time to get the ipv6 autoconfiguration, no RS packets seemed to be sent out (observed zero on all routers).

This is a clone of issue OCPBUGS-669. The following is the description of the original issue:

Description of problem:

This is an OCP clone of https://bugzilla.redhat.com/show_bug.cgi?id=2099794

In summary, NetworkManager reports the network as being up before the ipv6 address of the primary interface is ready and crio fails to bind to it.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-6517. The following is the description of the original issue:

Description of problem:

When the cluster is configured with Proxy the swift client in the image registry operator is not using the proxy to authenticate with OpenStack, so it's unable to reach the OpenStack API. This issue became evident since recently the support was added to not fallback to cinder in case swift is available[1].

[1]https://github.com/openshift/cluster-image-registry-operator/pull/819

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Deploy a cluster with proxy and restricted installation
2. 
3.

Actual results:

 

Expected results:

 

Additional info:

 

Before platformStatus, the operator used to get information about AWS and GCP from the install-config config map. This code can be removed.

This is a clone of issue OCPBUGS-11844. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5548. The following is the description of the original issue:

Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390

When creating a Deployment, DeploymentConfig, or Knative Service with enabled Pipeline, and then deleting it again with the enabled option "Delete other resources created by console" (only available on 4.13+ with the PR above) the automatically created Pipeline is not deleted.

When the user tries to create the same resource with a Pipeline again this fails with an error:

An error occurred
secrets "nodeinfo-generic-webhook-secret" already exists

Version-Release number of selected component (if applicable):
4.13

(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5547)

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Pipelines operator (tested with 1.8.2)
  2. Create a new project
  3. Navigate to Add > Import from git and create an application
  4. Case 1: In the topology select the new resource and delete it
  5. Case 2: In the topology select the application group and delete the complete app

Actual results:
Case 1: Delete resources:

  1. Deployment (tries it twice!) $name
  2. Service $name
  3. Route $name
  4. ImageStream $name

Case 2: Delete application:

  1. Deployment (just once) $name
  2. Service $name
  3. Route $name
  4. ImageStream $name

Expected results:
Case 1: Delete resource:

  1. Delete Deployment $name should be called just once
  2. (Keep this deletion) Service $name
  3. (Keep this deletion) Route $name
  4. (Keep this deletion) ImageStream $name
  5. Missing deletion of the Tekton Pipeline $name
  6. Missing deletion of the Tekton TriggerTemplate with generated name trigger-template-$name-$random
  7. Missing deletion of the Secret $name-generic-webhook-secret
  8. Missing deletion of the Secret $name-github-webhook-secret

Case 2: Delete application:

  1. (Keep this deletion) Deployment $name
  2. (Keep this deletion) Service $name
  3. (Keep this deletion) Route $name
  4. (Keep this deletion) ImageStream $name
  5. Missing deletion of the Tekton Pipeline $name
  6. Missing deletion of the Tekton TriggerTemplate with generated name trigger-template-$name-$random
  7. Missing deletion of the Secret $name-generic-webhook-secret
  8. Missing deletion of the Secret $name-github-webhook-secret

Additional info:

Since 4.11 OCP comes with OperatorHub definition which declares a capability
and enables all catalog sources. For OKD we want to enable just community-operators
as users may not have Red Hat pull secret set.
This commit would ensure that OKD version of marketplace operator gets
its own OperatorHub manifest with a custom set of operator catalogs enabled

This is a clone of issue OCPBUGS-7458. The following is the description of the original issue:

Description of problem:

- After upgrading to OCP 4.10.41, thanos-ruler-user-workload-1 in the openshift-user-workload-monitoring namespace is consistently being created and deleted.
- We had to scale down the Prometheus operator multiple times so that the upgrade is considered as successful.
- This fix is temporary. After some time it appears again and Prometheus operator needs to be scaled down and up again.
- The issue is present on all clusters in this customer environment which are upgraded to 4.10.41.

Version-Release number of selected component (if applicable):

 

How reproducible:

N/A, I wasn't able to reproduce the issue.

Steps to Reproduce:

 

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Manual backport of 
* https://github.com/openshift/cluster-dns-operator/pull/336
* https://github.com/openshift/cluster-dns-operator/pull/339

Version-Release number of selected component (if applicable):

4.11

This is a clone of issue OCPBUGS-858. The following is the description of the original issue:

Description of problem:

In OCP 4.9, the package-server-manager was introduced to manage the packageserver CSV. However, when OCP 4.8 in upgraded to 4.9, the packageserver stays stuck in v0.17.0, which is the version in OCP 4.8, and v0.18.3 does not roll out, which is the version in OCP 4.9

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install OCP 4.8

2. Upgrade to OCP 4.9 

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-08-31-160214   True        True          50m     Working towards 4.9.47: 619 of 738 done (83% complete)

$ oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.47    True        False         4m26s   Cluster version is 4.9.47
 

Actual results:

Check packageserver CSV. It's in v0.17.0 

$ oc get csv  NAME            DISPLAY          VERSION   REPLACES   PHASE packageserver   Package Server   0.17.0               Succeeded 

Expected results:

packageserver CSV is at 0.18.3 

Additional info:

packageserver CSV version in 4.8: https://github.com/openshift/operator-framework-olm/blob/release-4.8/manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml#L12

packageserver CSV version in 4.9: https://github.com/openshift/operator-framework-olm/blob/release-4.9/pkg/manifests/csv.yaml#L8

This is a clone of issue OCPBUGS-5151. The following is the description of the original issue:

Description of problem:

Cx is not able to install new cluster OCP BM IPI. During the bootstrapping the provisioning interfaces from master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP IPI BareMetal install 

Please refer to following BUG --> https://issues.redhat.com/browse/OCPBUGS-872  The problem was solved by applying rd.net.timeout.carrier=30 to the kernel parameters of compute nodes via cluster-baremetal operator. The fix also need to be apply to the control-plane. 

  ref:// https://github.com/openshift/cluster-baremetal-operator/pull/286/files

 

Version-Release number of selected component (if applicable):

 

How reproducible:

Perform OCP 4.10.16 IPI BareMetal install.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Customer should be able to install the cluster without any issue.

Additional info:

 

Description of problem:

With every pod update we are executing a mutate operation to add the pod port to the port group or add the pod IP to an address set. This functionally doesn't hurt, since mutate will not add duplicate values to the same set. However, this is bad for performance. For example, with a 730 network policies affecting a pod, and issuing 7 pod updates would result in over 5k transactions.

Description of problem:

This issue exists to drive the backport process of https://github.com/openshift/api/pull/1313

According to the Kubernetes documentation, starting from Kubernetes 1.22, the service-account-issuer flag can be specified multiple times. The first value is then used to generate new tokens and other values are accepted. Using this field can prevent cluster disruptions and allows for smoother reconfiguration of this field.

see: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-token-volume-projection

The status field will allow us to keep track of "used" service account issuers and also expire/prune them.

this is a replacement for: #1309

xref: https://issues.redhat.com/browse/AUTH-309

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

If you set a services cluster IP to an IP with a leading zero (e.g. 192.168.0.011), ovn-k should normalise this and remove the leading zero before sending it to ovn.

This was seen by me on a CI run executing the k8 test here: test/e2e/network/funny_ips.go +75

you can reproduce using that above test.

Have a read of the text there:

 43 // What are funny IPs:  
 44 // The adjective is because of the curl blog that explains the history and the problem of liberal  
 45 // parsing of IP addresses and the consequences and security risks caused the lack of normalization,
 46 // mainly due to the use of different notations to abuse parsers misalignment to bypass filters.
 47 // xref: https://daniel.haxx.se/blog/2021/04/19/curl-those-funny-ipv4-addresses/   
 48 //     
 49 // Since golang 1.17, IPv4 addresses with leading zeros are rejected by the standard library.
 50 // xref: https://github.com/golang/go/issues/30999
 51 //     
 52 // Because this change on the parsers can cause that previous valid data become invalid, Kubernetes
 53 // forked the old parsers allowing leading zeros on IPv4 address to not break the compatibility.
 54 //     
 55 // Kubernetes interprets leading zeros on IPv4 addresses as decimal, users must not rely on parser
 56 // alignment to not being impacted by the associated security advisory: CVE-2021-29923 golang
 57 // standard library "net" - Improper Input Validation of octal literals in golang 1.16.2 and below
 58 // standard library "net" results in indeterminate SSRF & RFI vulnerabilities. xref:
 59 // https://nvd.nist.gov/vuln/detail/CVE-2021-29923                                                                                                     

northd is logging an error about this also:

|socket_util|ERR|172.30.0.011:7180: bad IP address "172.30.0.011" 
...
2022-08-23T14:14:21.968Z|01839|ovn_util|WARN|bad ip address or port for load balancer key 172.30.0.011:7180

 

Also, I see the error:

E0823 14:14:34.135115    3284 gateway_shared_intf.go:600] Failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip: failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip with svcVIP 172.30.0.011, svcPort 7180, protocol TCP: value "<nil>" passed to DeleteConntrack is not an IP address 

We should normalise the IPs before sending to OVN-k. I see also theres conntrack error when trying to set this bad IP.

 

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. See above k8 test

Actual results:

Leading zero IP sent to OVN

Expected results:

No leading zero IP sent to OVN

Additional info:

This is a clone of issue OCPBUGS-3824. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2598. The following is the description of the original issue:

Description of problem:

Liveness probe of ipsec pods fail with large clusters. Currently the command that is executed in the ipsec container is
ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
The problem is with command "ipsec/status". In clusters with high node count this command will return a list with all the node daemons of the cluster. This means that as the node count raises the completion time of the command raises too. 

This makes the main command 

ovs-appctl -t ovs-monitor-ipsec

To hang until the subcommand is finished.

As the liveness and readiness probe values are hardcoded in the manifest of the ipsec container herehttps//github.com/openshift/cluster-network-operator/blob/9c1181e34316d34db49d573698d2779b008bcc20/bindata/network/ovn-kubernetes/common/ipsec.yaml] the liveness timeout of the container probe of 60 seconds start to be  insufficient as the node count list is growing. This resulted in a cluster with 170 + nodes to have 15+ ipsec pods in a crashloopbackoff state.

Version-Release number of selected component (if applicable):

Openshift Container Platform 4.10 but i think the same will be visible to other versions too.

How reproducible:

I was not able to reproduce due to an extreamely high amount of resources are needed and i think that there is no point as we have spotted the issue.

Steps to Reproduce:

1. Install an Openshift cluster with IPSEC enabled
2. Scale to 170+ nodes or more
3. Notice that the ipsec pods will start getting in a Crashloopbackoff state with failed Liveness/Readiness probes.

Actual results:

Ip Sec pods are stuck in a Crashloopbackoff state

Expected results:

Ip Sec pods to work normally

Additional info:

We have provided a workaround where CVO and CNO operators are scaled to 0 replicas in order for us to be able to increase the liveness probe limit to a value of 600 that recovered the cluster. 
As a next step the customer will try to reduce the node count and restore the default liveness timeout value along with bringing the operators back to see if the cluster will stabilize.

 

Description of problem:

Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
the following KCS details steps to add a proxy.
The steps have been verified at 4.7 but do not work at 4.8, 4.9 or 4.10

https://access.redhat.com/solutions/6172402

When testing at 4.8, 4.9 and 4.10 the proxy setting where also nested under `telemeterClient`

which triggered a telemeter restart but the proxy setting do not get set in the deployment as they do in 4.7

Actual results:

4.8, 4.9 and 4.10 without the nested `telemeterClient`
does not trigger a restart of the telemeter pod

Expected results:

I think the proxy setting should be nested under telemeterClient
but should set the environment variables in the deployment

Additional info:

This is a backport of https://bugzilla.redhat.com/show_bug.cgi?id=2116382 from 4.12 to 4.11.z. Creating manually because as seen in https://github.com/openshift/cluster-monitoring-operator/pull/1743 `/cherry-pick` doesn't work for bugs originally created in bugzilla

This is a clone of issue OCPBUGS-10497. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10213. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8468. The following is the description of the original issue:

Description of problem:

RHCOS is being published to new AWS regions (https://github.com/openshift/installer/pull/6861) but aws-sdk-go need to be bumped to recognize those regions

Version-Release number of selected component (if applicable):

master/4.14

How reproducible:

always

Steps to Reproduce:

1. openshift-install create install-config
2. Try to select ap-south-2 as a region
3.

Actual results:

New regions are not found. New regions are: ap-south-2, ap-southeast-4, eu-central-2, eu-south-2, me-central-1.

Expected results:

Installer supports and displays the new regions in the Survey

Additional info:

See https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/regions.go#L13-L23

 

This is a clone of issue OCPBUGS-1565. The following is the description of the original issue:

Description of problem:

We've observed a split brain case for keepalived unicast, where two worker nodes were fighting for the ingress VIP. 
One of these nodes failed to register itself with the cluster, so it was missing from the output of the node list. That, in turn, caused it to be missing from the unicast_peer list in keepalived. This one node believed it was the master (not receiving VRRP from other nodes), and other nodes constantly re-electing a master.

This behavior was observed in a QE-deployed cluster on PSI. It caused constant VIP flapping and a huge load on OVN.

Version-Release number of selected component (if applicable):


How reproducible:

Not sure. We don't know why the worker node failed to register with the cluster (the cluster is gone now) or what the QE were testing at the time. 

Steps to Reproduce:

1.
2.
3.

Actual results:

The cluster was unhealthy due to the constant Ingress VIP failover. It was also putting a huge load on PSI cloud.

Expected results:

The flapping VIP can be very expensive for the underlying infrastructure. In no way we should allow OCP to bring the underlying infra down.

The node should not be able to claim the VIP when using keepalived in unicast mode unless they have correctly registered with the cluster and they appear in the node list.

Additional info:


Description of problem:

During the upgrade mutlijob for OCP starting from version 4.10 with OVNkubernetes network type on OSP 16.2, the upgrade process encountered an error when upgrading from version 4.11 to 4.12. The cluster operator etcd became unavailable. A specific node, ostest-ttvx4-master-2, is currently in SchedulingDisabled status. Examination of the openshift-etcd namespace reveals that the etcd-ostest-ttvx4-master-0 pod has been reporting errors. Log data suggests issues related to etcd members and their data directories.

Version-Release number of selected component (if applicable):

OCP 4.11.50 to 4.12.36
RHOS-16.2-RHEL-8-20230510.n.1

How reproducible:

Always

Steps to Reproduce:

1.Begin the OCP upgrade process starting from version 4.10
2.Upgrade from 4.10 to 4.11
3.Upgrade from 4.11 to 4.12

Actual results:

The upgrade process fails during the upgrading between versions 4.11 and 4.12, specifically pointing to issues with the etcd operator. The operator reports being unavailable and indicates problems with specific etcd members.

Expected results:

Smooth upgrade from 4.11 to 4.12 without any issues.

Additional info:

$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.12.36 True False True 4h17m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal 4.12.36 True False False 8h
...
...
csi-snapshot-controller 4.12.36 True False False 8h
dns 4.12.36 True False False 8h
etcd 4.12.36 False True True 4h33m EtcdMembersAvailable: 2 of 4 members are available, NAME-PENDING-172.17.5.228 has not started, ostest-ttvx4-master-0 is unhealthy
.....
machine-approver 4.12.36 True False False 8h
machine-config 4.11.50 True True True 6h14m Unable to apply 4.12.36: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]
marketplace 4.12.36 True False False 8h
monitoring 4.12.36 True False False 4h15m
network 4.12.36 True False False 8h
node-tuning 4.12.36 True False False 5h15m
openshift-apiserver 4.12.36 True False True 4h19m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
...
operator-lifecycle-manager-packageserver 4.12.36 True False False 8h
service-ca 4.12.36 True False False 8h
storage 4.12.36 True False False 8h 
$ oc get pods -n openshift-etcd
NAME READY STATUS RESTARTS AGE
etcd-guard-ostest-ttvx4-master-0 0/1 Running 0 4h33m
etcd-guard-ostest-ttvx4-master-1 1/1 Running 0 4h22m
etcd-guard-ostest-ttvx4-master-2 1/1 Running 0 5h40m
etcd-ostest-ttvx4-master-0 3/4 Error 58 (5m10s ago) 4h25m
etcd-ostest-ttvx4-master-1 4/4 Running 0 4h25m
etcd-ostest-ttvx4-master-2 4/4 Running 2 (4h28m ago) 4h48m
installer-25-ostest-ttvx4-master-0 0/1 Completed 0 4h47m
installer-26-ostest-ttvx4-master-0 0/1 Completed 0 4h44m
installer-27-ostest-ttvx4-master-0 0/1 Completed 0 4h34m
revision-pruner-25-ostest-ttvx4-master-0 0/1 Completed 0 4h47m
revision-pruner-26-ostest-ttvx4-master-0 0/1 Completed 0 4h44m
revision-pruner-26-ostest-ttvx4-master-1 0/1 Completed 0 4h34m
revision-pruner-27-ostest-ttvx4-master-0 0/1 Completed 0 4h34m
revision-pruner-27-ostest-ttvx4-master-1 0/1 Completed 0 4h34m
$ oc logs etcd-ostest-ttvx4-master-0 -n openshift-etcd
1a4f2630e5f2296f, unstarted, , https://172.17.5.228:2380, , true
2f6c4ca331daa2de, started, ostest-ttvx4-master-2, https://10.196.2.249:2380, https://10.196.2.249:2379, false
752ca6c9953eff21, started, ostest-ttvx4-master-1, https://10.196.1.187:2380, https://10.196.1.187:2379, false
a6d1d802202a55e3, started, ostest-ttvx4-master-0, https://10.196.2.93:2380, https://10.196.2.93:2379, false
#### attempt 0
      member={name="", peerURLs=[https://172.17.5.228:2380}, clientURLs=[]
      member={name="ostest-ttvx4-master-2", peerURLs=[https://10.196.2.249:2380}, clientURLs=[https://10.196.2.249:2379]
      member={name="ostest-ttvx4-master-1", peerURLs=[https://10.196.1.187:2380}, clientURLs=[https://10.196.1.187:2379]
      member={name="ostest-ttvx4-master-0", peerURLs=[https://10.196.2.93:2380}, clientURLs=[https://10.196.2.93:2379]
      target={name="ostest-ttvx4-master-0", peerURLs=[https://10.196.2.93:2380}, clientURLs=[https://10.196.2.93:2379]
member "https://10.196.2.93:2380" dataDir has been destroyed and must be removed from the cluster

This is a clone of issue OCPBUGS-19805. The following is the description of the original issue:

Description of problem:

While reviewing PRs in CoreDNS 1.11.0, we stumbled upon https://github.com/coredns/coredns/pull/6179, which describes an CoreDNS crash in the kubernetes plugin if you create an EndpointSlice object contains a port without a port number.

I reproduced this myself and was able to successfully bring down all of CoreDNS so that the cluster was put into a degraded state.

We've bumped to CoreDNS 1.11.1 in 4.15, so this is concern for < 4.15.

Version-Release number of selected component (if applicable):

Less than or equal to 4.14

How reproducible:

100%

Steps to Reproduce:

1. Create an endpointslice with a port with no port number:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-abc
addressType: IPv4
ports:
  - name: ""

2.Shortly after creating this object, all DNS pods continuously crash:
oc get -n openshift-dns pods
NAME                  READY   STATUS             RESTARTS     AGE
dns-default-57lmh     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-h6cvm     1/2     CrashLoopBackOff   1 (4s ago)   79m
dns-default-mn7qd     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-mxq5g     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-wdrff     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-zs7cd     1/2     CrashLoopBackOff   1 (3s ago)   79m

Actual results:

DNS Pods crash

Expected results:

DNS Pods should NOT crash

Additional info:

 

Description of problem: 

This repository is out of sync with the downstream product builds for this component.
One or more images differ from those being used by ART to create product builds. This
should be addressed to ensure that the component's CI testing is accurately
reflecting what customers will experience.

The information within the following ART component metadata is driving this alignment
request: ose-baremetal-installer.yml.

The vast majority of these PRs are opened because a different Golang version is being
used to build the downstream component. ART compiles most components with the version
of Golang being used by the control plane for a given OpenShift release. Exceptions
to this convention (i.e. you believe your component must be compiled with a Golang
version independent from the control plane) must be granted by the OpenShift
architecture team and communicated to the ART team.

Version-Release number of selected component (if applicable): 4.11

Additional info: This issue is needed for the bot-created PR to merge after the 4.11 GA.

This is a clone of issue OCPBUGS-20100. The following is the description of the original issue:

This is a clone of issue OCPBUGS-19378. The following is the description of the original issue:

Description of problem:

In 4.14 we have introduced a bunch of improvements to the backup/restore scripts along with many test cases and a full e2e suite. 

Since we've already rebased the new features for etcdctl into all supported branches, we can also backport the CEO tooling along with it to get into a consistent and supportable state again. 

Version-Release number of selected component (if applicable):

4.11->4.13

How reproducible:

na

Steps to Reproduce:

na

Actual results:

na

Expected results:

na

Additional info:

 

This is a clone of issue OCPBUGS-6671. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3228. The following is the description of the original issue:

While starting a Pipelinerun using UI, and in the process of providing the values on "Start Pipeline" , the IBM Power Customer (Deepak Shetty from IBM) has tried creating credentials under "Advanced options" with "Image Registry Credentials" (Authenticaion type). When the IBM Customer verified the credentials from  Secrets tab (in Workloads) , the secret was found in broken state. Screenshot of the broken secret is attached. 

The issue has been observed on OCP4.8, OCP4.9 and OCP4.10.

Goal

We have several use cases where dynamic plugins need to proxy to another service on the cluster. One example is the Helm plugin. We would like to move the backend code for Helm to a separate service on the cluster, and the Helm plugin could proxy to that service for its requests. This is required to make Helm a dynamic plugin. Similarly if we want to have ACM contribute any views through dynamic plugins, we will need a way for ACM to proxy to its services (e.g., for Search).

It's possible for plugins to make requests to services exposed through routes today, but that has several problems:

  1. It requires that the service be exposed outside the cluster, which is not always desired.
  2. It requires the service support CORS headers for the console.
  3. There is no way to specify a CA file for the route if it's not trusted by the browser.
  4. Plugins will not have access to the user's access token on the client, which means that there is no simple way to handle auth.

Plugins need a way to declare in-cluster services that they need to connect to. The console backend will need to set up proxies to those services on console load. This also requires that the console operator be updated to pass the configuration to the console backend.

 

This work will apply only to single clusters.

 

Open Questions

  • What happens when a multitenant isolated network policy is configured on the cluster?

https://docs.openshift.com/container-platform/4.7/networking/network_policy/multitenant-network-policy.html

  • How do we (and can we?) support this for multi-cluster where console is running on a different hub cluster?
  • Do we need to auth for all requests?

Acceptance Criteria

  • Plugins can declare a service to proxy to in the ConsolePlugin resource
  • Plugins can specify a CA cert for the service
  • Console falls back to the service signing CA if none is specified
  • Plugins have a way of specifying whether the user's authentication token is included in requests through the service proxy
  • Dynamic plugin enhancement is updated with the implementation details
  • Support for server-side events (SSE) for ACM
  • Add support, or a flag, if auth is needed for each request.  

cc Ali Mobrem [~christianmvogt]

The current integration of prometheus-adapter in OpenShift uses the platform Prometheus as a backend to get metrics. The problem with this design is that we are getting metrics from 2 different Prometheus instances which don't have replicated data, so two queries sent at the same time to prometheus-adapter might yield different results since the underlying promQL queries executed by prometheus-adapter might be on different Prometheus servers. The consequence is that we end up having inconsistent data across multiple autoscaling requests.

This can be easily tested by running:

$ while true ; do date; oc adm top pod -n openshift-monitoring  prometheus-k8s-0 ; echo; sleep 1 ;done 

Mon Jul 26 03:55:07 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:08 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi          

Mon Jul 26 03:55:09 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:10 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi          

This isn't a bug in itself since it was designed that way, but we could do better by using thanos-querier as a backend instead of the platform Prometheus because it will duplicate the metrics from both instances and serve one consistent result based on the data that it will get from the Prometheuses.

DoD:

  • Use thanos-querier as a backend for prometheus-adapter

This is a clone of issue OCPBUGS-10514. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10221. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5469. The following is the description of the original issue:

Description of problem:

When changing channels it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails` but when the channel is changed to a 4.11 channel multiple new risks require evaluation and the evaluation of new risks is throttled at one every 10 minutes. This means if there are three new risks it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended because most will not wait 30 minutes, they expect immediate feedback.

Version-Release number of selected component (if applicable):

4.10.z, 4.11.z, 4.12, 4.13

How reproducible:

100% 

Steps to Reproduce:

1. Install 4.10.34
2. Switch from stable-4.10 to stable-4.11
3. 

Actual results:

Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them

Expected results:

Risks are computed in a timely manner for an interactive UX, lets say < 10s

Additional info:

This was intentional in the design, we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack, however we didn't anticipate that we'd have long standing pile of risks and realize how confusing the user experience would be.

We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always` avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.

Discovered in the must gather kubelet_service.log from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1586093220087992320

It appears the guard pod names are too long, and being truncated down to where they will collide with those from the other masters.

From kubelet logs in this run:

❯ grep openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste kubelet_service.log
Oct 28 23:58:55.693391 ci-op-3hj6pnwf-4f6ab-lv57z-master-1 kubenswrapper[1657]: E1028 23:58:55.693346    1657 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-1" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:03.735726 ci-op-3hj6pnwf-4f6ab-lv57z-master-0 kubenswrapper[1670]: E1028 23:59:03.735671    1670 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-0" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:11.168082 ci-op-3hj6pnwf-4f6ab-lv57z-master-2 kubenswrapper[1667]: E1028 23:59:11.168041    1667 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-2" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"

This also looks to be happening for openshift-kube-scheduler-guard, kube-controller-manager-guard, possibly others.

Looks like they should be truncated further to make room for random suffixes in https://github.com/openshift/library-go/blame/bd9b0e19121022561dcd1d9823407cd58b2265d0/pkg/operator/staticpod/controller/guard/guard_controller.go#L97-L98

Unsure of the implications here, it looks a little scary.

Tracker issue for bootimage bump in 4.11. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-562.

Description of problem:

If we use a macvlan with the configuration...
spec:
  config: '{ "cniVersion": "0.3.1", "name": "ran-bh-macvlan-test", "plugins": [ {"type": "macvlan","master": "vlan306", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "2001:1b74:480:603d:0304:0403:000:0000-2001:1b74:480:603d:0304:0403:0000:0004/64","gateway": "2001:1b74:480:603d::1" } } ]}'

there is an error creating the pod:

  Warning  FailedCreatePodSandBox  17s (x3 over 55s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test31_test-ecoloma-01_a593bd0a-83e7-4d31-857e-0c31491e849e_0(5cf36bd99ffa532fd34735e68caecfbc69d820ba6cb04e348c9f9f168498022f): error adding pod test-ecoloma-01_test31 to CNI network "multus-cni-network": [test-ecoloma-01/test31:ran-bh-macvlan-test]: error adding container to network "ran-bh-macvlan-test": Error at storage engine: OverlappingRangeIPReservation.whereabouts.cni.cncf.io "2001-1b74-480-603d-304-403--" is invalid: metadata.name: Invalid value: "2001-1b74-480-603d-304-403--": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
  
  
If we change the start IP address to 2001:1b74:480:603d:0304:0403:000:0001, it works ok ok.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always reproducible

Steps to Reproduce:

1. See description of problem.

Actual results:

Unable to create pod

Expected results:

IP range should be valid and pod should get created

Additional info:

 

Description of problem:

When trying to add a Cisco UCS Rackmount server as a `baremetalhost` CR the following error comes up in the metal3 container log in the openshift-machine-api namespace.

'TransferProtocolType' property which is mandatory to complete the action is missing in the request body

Full log entry:

{"level":"info","ts":1677155695.061805,"logger":"provisioner.ironic","msg":"current provision state","host":"ucs-rackmounts~ocp-test-1","lastError":"Deploy step deploy.deploy failed with BadRequestError: HTTP POST https://10.5.4.78/redfish/v1/Managers/CIMC/VirtualMedia/0/Actions/VirtualMedia.InsertMedia returned code 400. Base.1.4.0.GeneralError: 'TransferProtocolType' property which is mandatory to complete the action is missing in the request body. Extended information: [{'@odata.type': 'Message.v1_0_6.Message', 'MessageId': 'Base.1.4.0.GeneralError', 'Message': "'TransferProtocolType' property which is mandatory to complete the action is missing in the request body.", 'MessageArgs': [], 'Severity': 'Critical'}].","current":"deploy failed","target":"active"}

Version-Release number of selected component (if applicable):

    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:30328143480d6598d0b52d41a6b755bb0f4dfe04c4b7aa7aefd02ea793a2c52b
    imagePullPolicy: IfNotPresent
    name: metal3-ironic

How reproducible:

Adding a Cisco UCS Rackmount with Redfish enabled as a baremetalhost to metal3

Steps to Reproduce:

1. The address to use: redfish-virtualmedia://10.5.4.78/redfish/v1/Systems/WZP22100SBV

Actual results:

[baelen@baelen-jumphost mce]$ oc get baremetalhosts.metal3.io  -n ucs-rackmounts  ocp-test-1
NAME         STATE          CONSUMER   ONLINE   ERROR                AGE
ocp-test-1   provisioning              true     provisioning error   23h

Expected results:

For the provisioning to be successfull.

Additional info:

 

Description of problem:

[sig-arch] events should not repeat pathologically is frequently failing in 4.11 upgrade jobs. The error is too many readiness probe errors.

Version-Release number of selected component (if applicable):

 

How reproducible:

Flakey

Steps to Reproduce:

1.
2.
3.

Actual results:

: [sig-arch] events should not repeat pathologically expand_less0s{  1 events happened too frequently

event happened 44 times, something is wrong: ns/openshift-monitoring pod/thanos-querier-d89745c9-xttvz node/ip-10-0-178-252.us-west-2.compute.internal - reason/ProbeError Readiness probe error: Get "https://10.128.2.13:9091/-/ready": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
body: 
}

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3265. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3172. The following is the description of the original issue:

Customer is trying to install the Logging operator, which appears to attempt to install a dynamic plugin. The operator installation fails in the console because permissions aren't available to "patch resource consoles".

We shouldn't block operator installation if permission issues prevent dynamic plugin installation.

This is an OSD cluster, presumably for a customer with "cluster-admin", although it may be a paired down permission set called "dedicated-admin".

See https://docs.google.com/document/d/1hYS-bm6aH7S6z7We76dn9XOFcpi9CGYcGoJys514YSY/edit for permissions investigation work on OSD

This is a clone of OCPBUGSM-47085
Version:

$ openshift-install version
4.11.0-rc2

Platform:

Nutanix

On `openshift-installer create manifests` stage a connection to Prism is made (see https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/nutanix/validation.go#L15-L36=)

This make generating manifests separately impossible, which breaks Assisted Installer flow. Instead of storing sensitive user information, Assisted Installer sets fake details in install-config.yaml and asks user to update these after installation has completed.

With validation happening on `openshift-install create manifests` phase installation process can't start with invalid credentials.

Please move this validation to ValidateForProvisioning, similar to vSphere

 

This is a clone of issue OCPBUGS-1428. The following is the description of the original issue:

Description of problem:

When using an OperatorGroup attached to a service account, AND if there is a secret present in the namespace, the operator installation will fail with the message:
the service account does not have any API secret sa=testx-ns/testx-sa
This issue seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=2094303 - which was resolved in 4.11.0 - however, the new element now, is that the presence of a secret in the namespace  is causing the issue.
The name of the secret seems significant - suggesting something somewhere is depending on the order that secrets are listed in. For example, If the secret in the namespace is called "asecret", the problem does not occur. If it is called "zsecret", the problem always occurs.
"zsecret" is not a "kubernetes.io/service-account-token". The issue I have raised here relates to Opaque secrets - zsecret is an Opaque secret. The issue may apply to other types of secrets, but specifically my issue is that when there is an opaque secret present in the namespace, the operator install fails as described. I aught to be allowed to have an opaque secret present in the namespace where I am installing the operator.

Version-Release number of selected component (if applicable):

4.11.0 & 4.11.1

How reproducible:

100% reproducible

Steps to Reproduce:

1.Create namespace: oc new-project testx-ns
2. oc apply -f api-secret-issue.yaml

Actual results:

 

Expected results:

 

Additional info:

API YAML:

cat api-secret-issue.yaml 
apiVersion: v1
kind: Secret
metadata:
  name: zsecret
  namespace: testx-ns
  annotations:
   kubernetes.io/service-account.name: testx-sa
type: Opaque
stringData:
  mykey: mypass

apiVersion: v1
kind: ServiceAccount
metadata:
  name: testx-sa
  namespace: testx-ns

kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: testx-og
  namespace: testx-ns
spec:
  serviceAccountName: "testx-sa"
  targetNamespaces:
  - testx-ns

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: testx-role
  namespace: testx-ns
rules:

  • apiGroups: ["*"]
      resources: ["*"]
      verbs: ["*"] 
      

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: testx-rolebinding
  namespace: testx-ns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: testx-role
subjects:

  • kind: ServiceAccount
      name: testx-sa
      namespace: testx-ns

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-operator
  namespace: testx-ns
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace

Description of problem:

The storageclass "thin-csi" is created by vsphere-CSI-Driver-Operator, after deleting it manually, it should be re-created immediately. 

Version-Release number of selected component (if applicable):

4.11.4

How reproducible:

Always

Steps to Reproduce:

1. Check storageclass in running cluster, thin-csi is present:
$ oc get sc 
NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
thin (default)   kubernetes.io/vsphere-volume   Delete          Immediate              false                  41m
thin-csi         csi.vsphere.vmware.com         Delete          WaitForFirstConsumer   true                   38m
2. Delete thin-csi storageclass:
$ oc delete sc thin-csi
storageclass.storage.k8s.io "thin-csi" deleted
3. Check storageclass again, thin-csi is not present:
$ oc get sc
NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
thin (default)   kubernetes.io/vsphere-volume   Delete          Immediate           false                  50m
4. Check vmware-vsphere-csi-driver-operator log:
......
I0909 03:47:42.172866       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1662695014\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1662695014\" (2022-09-09 02:43:34 +0000 UTC to 2023-09-09 02:43:34 +0000 UTC (now=2022-09-09 03:47:42.172853123 +0000 UTC))"I0909 03:49:38.294962       
1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOFI0909 03:49:38.295468       
1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOFI0909 03:49:38.295765       
1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF

5. Only first time creating in vmware-vsphere-csi-driver-operator log:
$ oc -n openshift-cluster-csi-drivers logs vmware-vsphere-csi-driver-operator-7cc6d44b5c-c8czw | grep -i "storageclass"I0909 03:46:31.865926   1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"9e0c3e2d-d403-40a1-bf69-191d7aec202b", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'StorageClassCreated' Created StorageClass.storage.k8s.io/thin-csi because it was missing 

Actual results:

The storageclass "thin-csi" could not be re-created after deleting

Expected results:

The storageclass "thin-csi" should be re-created after deleting

Additional info:

 

This is a clone of issue OCPBUGS-18768. The following is the description of the original issue:

This is a clone of issue OCPBUGS-18677. The following is the description of the original issue:

This is a clone of issue OCPBUGS-18608. The following is the description of the original issue:

Description of problem:

UPSTREAM: <carry>: Force using host go always and use host libriaries introduced a build failure for the Windows kubelet that is showing up only in release-4.11 for an unknown reason but could potentially occur on other releases too.

Version-Release number of selected component (if applicable):

WMCO version: 9.0.0 and below
 

How reproducible:

Always on release-4.11
 

Steps to Reproduce:

1. Clone the WMCO repo
2. Build the WMCO image

Actual results:

WMCO image build fails

Expected results:

 WMCO image build should succeed

This is a clone of issue OCPBUGS-8016. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1748. The following is the description of the original issue:

Description of problem:

PipelineRun templates are currently fetched from `openshift-pipelines` namespace. It has to be fetched from `openshift` namespace.

Version-Release number of selected component (if applicable):
4.11 and 1.8.1 OSP

Align with operator changes https://issues.redhat.com/browse/SRVKP-2413 in 1.8.1, UI has to update the code to fetch pipelinerun templates from openshift namespace.

Description of problem:

Currently when installing Openshift on the Openstack cluster name length limit is allowed to  14 characters.
Customer wants to know if is it possible to override this validation when installing Openshift on Openstack and create a cluster name that is greater than 14 characters.

Version : OCP 4.8.5 UPI Disconnected 
Environment : Openstack 16 

Issue:
User reports that they are getting error for OCP cluster in Openstack UPI, where the name of the cluster is > 14 characters.

Error events :
~~~
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/openshift-install", "create", "manifests", "--dir=/home/gitlab-runner/builds/WK8mkokN/0/CPE/SKS/pipelines/non-prod/ocp4-openstack-build/ocpinstaller/install-upi"], "delta": "0:00:00.311397", "end": "2022-09-03 21:38:41.974608", "msg": "non-zero return code", "rc": 1, "start": "2022-09-03 21:38:41.663211", "stderr": "level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters", "stderr_lines": ["level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters"], "stdout": "", "stdout_lines": []}
~~~

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

 

Actual results:

Users are getting error "cluster name is too long" when clustername contains more than 14 characters for OCP on Openstack

Expected results:

The 14 characters limit should be change for the OCP clustername on Openstack

Additional info:

 

Description of problem:

Provisioning interface on master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP 4.10.16 IPI BareMetal install.

Customer is performing an OCP 4.10.16 IPI BareMetal install and bootstrap node provisions just fine, but when master nodes are booted for provisioning, they are not getting an ipv4 address via dhcp. As such, the install is not moving forward at this point.

Version-Release number of selected component (if applicable):

OCP 4.10.16

How reproducible:

Perform OCP 4.10.16 IPI BareMetal install.

Actual results:

provisioning interface comes up (as evidenced by ipv6 address) but is not getting an ipv4 address via dhcp. OCP install / provisioning fails at this point.

Expected results:

provisioning interface successfully received an ipv4 ip address and successfully provisioned master nodes (and subsequently worker nodes as well.)

Additional info:

As a troubleshooting measure, manually adding an ipv4 ip address did allow the coreos image on the bootstrap node to be reached via curl.

Further, the kernel boot line for the first master node was updated for a static ip addresss assignment for further confirmation that the master node would successfully image this way which further confirming that the issue is the provisioning interface not receiving an ipv4 ip address from the dhcp server.

This is a clone of issue OCPBUGS-23486. The following is the description of the original issue:

This is a clone of issue OCPBUGS-23368. The following is the description of the original issue:

This is a clone of issue OCPBUGS-23150. The following is the description of the original issue:

This is a clone of issue OCPBUGS-21594. The following is the description of the original issue:

Description of problem:

The MAPI metric mapi_current_pending_csr fires even when there are no pending MAPI CSRs. However, there are non-MAPI CSRs present. It may not be appropriately scoping this metric to only it's CSRs.

Version-Release number of selected component (if applicable):

Observed in 4.11.25

How reproducible:

Consistent

Steps to Reproduce:

1. Install a component that uses CSRs (like ACM) but leave the CSRs in a pending state
2. Observe metric firing
3.

Actual results:

Metric is firing

Expected results:

Metric only fires if there are MAPI specific CSRs pending

Additional info:

This impacts SRE alerting

This is a clone of issue OCPBUGS-4489. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4168. The following is the description of the original issue:

Description of problem:

Prometheus continuously restarts due to slow WAL replay

Version-Release number of selected component (if applicable):

openshift - 4.11.13

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

OVS 2.17+ introduced an optimization of "weak references" to substantially speed up database snapshots. in some cases weak references may leak memory; to aforementioned commit fixes that and has been pulled into ovs2.17-62 and later.

This is a clone of issue OCPBUGS-7409. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7374. The following is the description of the original issue:

Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

The controllers sometimes get stuck on listing members in failure scenarios, this is known and can be mitigated by simply restarting the CEO. 

similar BZ 2093819 with stuck controllers was fixed slightly different in https://github.com/openshift/cluster-etcd-operator/commit/4816fab709e11e0681b760003be3f1de12c9c103

 

This fix was contributed by lance5890, thanks a lot!

 

This is a clone of issue OCPBUGS-6672. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4684. The following is the description of the original issue:

Description of problem:

In DeploymentConfig both the Form view and Yaml view are not in sync

Version-Release number of selected component (if applicable):

4.11.13

How reproducible:

Always

Steps to Reproduce:

1. Create a DC with selector and labels as given below
spec:
  replicas: 1
  selector:
    app: apigateway
    deploymentconfig: qa-apigateway
    environment: qa
  strategy:
    activeDeadlineSeconds: 21600
    resources: {}
    rollingParams:
      intervalSeconds: 1
      maxSurge: 25%
      maxUnavailable: 25%
      timeoutSeconds: 600
      updatePeriodSeconds: 1
    type: Rolling
  template:
    metadata:
      labels:
        app: apigateway
        deploymentconfig: qa-apigateway
        environment: qa

2. Now go to GUI--> Workloads--> DeploymentConfig --> Actions--> Edit DeploymentConfig, first go to Form view and now switch to Yaml view, the selector and labels shows as app: ubi8 while it should display app: apigateway

  selector:
    app: ubi8
    deploymentconfig: qa-apigateway
    environment: qa
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: ubi8
        deploymentconfig: qa-apigateway
        environment: qa

3. Now in yaml view just click reload and the value is displayed as it is when it was created (app: apigateway).

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-15643. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15606. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15497. The following is the description of the original issue:

I am using a BuildConfig with git source and the Docker strategy. The git repo contains a large zip file via LFS and that zip file is not getting downloaded. Instead just the ascii metadata is getting downloaded. I've created a simple reproducer (https://github.com/selrahal/buildconfig-git-lfs) on my personal github. If you clone the repo

git clone git@github.com:selrahal/buildconfig-git-lfs.git

and apply the bc.yaml file with

oc apply -f bc.yaml

Then start the build with

oc start-build test-git-lfs

You will see the build fails at the unzip step in the docker file

STEP 3/7: RUN unzip migrationtoolkit-mta-cli-5.3.0-offline.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.

I've attached the full build logs to this issue.

4.12 will have an option in cri-o: add_inheritable_capabilities which will allow a user to opt-out of dropping inheritable capabilities (which comes as a fix for CVE-2022-27652). We should add it by default as a drop-in in 4.11 so clusters that upgrade from it inherit the old behavior

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

The following tests are failing in OpenShift 4.11 after updating openshift-apiserver clients to 1.24 (https://github.com/openshift/openshift-apiserver/pull/400):

[sig-devex][Feature:Templates] templateinstance cross-namespace test should create and delete objects across namespaces [Suite:openshift/conformance/parallel]
[sig-devex][Feature:Templates] templateinstance readiness test  should report failed soon after an annotated objects has failed [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-devex][Feature:Templates] templateinstance readiness test  should report ready soon after all annotated objects are ready [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-devex][Feature:Templates] templateinstance security tests  should pass security tests [Suite:openshift/conformance/parallel]

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/400/pull-ci-openshift-openshift-apiserver-release-4.11-e2e-aws/1719006784351375360

This prevents us from merging HTTP/2 CVE fixes to 4.11.

This initial issue was fixed during the 1.25 rebase inhttps://github.com/openshift/openshift-controller-manager/pull/242, but since openshift-apiserver wasn't properly updated in 4.11, we didn't caught it before.

I already backported the library-go change to 4.11 (https://github.com/openshift/library-go/pull/1598), all there is to do now is to include that fix in openshift-controller-manager.

This is a clone of issue OCPBUGS-927. The following is the description of the original issue:

Description of problem:

We're seeing frequent private DNS zone creation failures in Azure CI jobs recent two days, the Azure CI jobs have been greatly affected.
https://search.ci.openshift.org/?search=error+creating%2Fupdating+Private+DNS+Zone+Virtual+network&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Such as the following error from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1566852244215697408

level=info msg=Consuming Openshift Manifests from target directory
level=info msg=Consuming Common Manifests from target directory
level=info msg=Credentials loaded from file "/var/run/secrets/ci.openshift.io/cluster-profile/osServicePrincipal.json"
level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error creating/updating Private DNS Zone Virtual network link "ci-op-1w80vs6f-7f65d-t2zlz-network-link" (Resource Group "ci-op-1w80vs6f-7f65d-t2zlz-rg"): privatedns.VirtualNetworkLinksClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="ParentResourceNotFound" Message="Can not perform requested operation on nested resource. Parent resource 'ci-op-1w80vs6f-7f65d.ci2.azure.devcluster.openshift.com' not found."
level=error
level=error msg=  with module.dns.azureprivatedns_zone_virtual_network_link.network,
level=error msg=  on dns/dns.tf line 13, in resource "azureprivatedns_zone_virtual_network_link" "network":
level=error msg=  13: resource "azureprivatedns_zone_virtual_network_link" "network" 

Version-Release number of selected component (if applicable):

All OCP versions

How reproducible:

https://search.ci.openshift.org/chart?name=e2e-azure&search=error+creating%2Fupdating+Private+DNS+Zone&maxAge=24h&type=build-log
shows 26% of the failed Azure jobs are related to "error creating/updating Private DNS Zone" in the past day. 
3/5 of the failed Azure jobs are caused by this in QE’s CI today. 

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:

 
No Azure outage was reported from https://status.azure.com/en-us/status.
No private zone or DNS records quota exceeded was observed.   

When running yarn dev, type warnings can be seen in the console and in the dev overlay UI. These need to be resolved.