4.14.0-rc.5

Changes from 4.13.53

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Design Doc:

https://docs.google.com/document/d/1m6OYdz696vg1v8591v0Ao0_r_iqgsWjjM2UjcR_tIrM/

Problem:

Goal

As a developer, I want to be able to test my serverless function after it's been deployed.

Why is it important?

Use cases:

  1. As a developer, I want to test my serverless function 

Acceptance criteria:

  1. This feature needs to work in ACM (multi-cluster environment, when the console is run on the hub cluster)

Dependencies (External/Internal):

Please add a spike to see if there are dependencies.

Design Artifacts:

Exploration:

Developers can use the kn func invoke CLI to accomplish this. According to Naina, there is an API, but it's in Go.

Note:

Description

As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.

Acceptance Criteria

  1. A backend proxy to invoke a serverless function (or a k8s service in general) from the frontend without a public route.
  2. The API endpoint should be only accessible to logged-in users.
  3. Should also work when the bridge is running off-cluster (as developers start them mostly for local development)

Additional Details:

This will be similar to the web terminal proxy, except that no auth headers will be passed to the underlying service.

We need something similar to:

POST /proxy/in-cluster

{
  endpoint: string
  # Or just service: string ?? tbd.

  headers: Record<string, string | string[]>
  body: string
  timeout: number
}
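
As a rough illustration only, the request body sketched above could be typed and assembled on the frontend as follows. The interface name `InClusterProxyRequest`, the helper `buildProxyRequest`, and the default timeout of 30 are hypothetical; the story leaves the final shape (for example `endpoint` versus `service`) and the unit of `timeout` open. Dropping a caller-supplied `Authorization` header here is an assumption that mirrors the note that no auth headers reach the underlying service; where the real proxy enforces this is not stated.

```typescript
// Hypothetical request shape mirroring the fields proposed above.
interface InClusterProxyRequest {
  endpoint: string;
  headers: Record<string, string | string[]>;
  body: string;
  timeout: number; // unit is not specified in the story
}

// Assemble the JSON payload for POST /proxy/in-cluster.
// Assumption: an Authorization header supplied by the caller is dropped,
// so that no auth header is forwarded to the underlying service.
function buildProxyRequest(
  endpoint: string,
  body: string,
  headers: Record<string, string | string[]> = {},
  timeout: number = 30
): InClusterProxyRequest {
  const forwarded: Record<string, string | string[]> = {};
  Object.keys(headers).forEach((name) => {
    const value = headers[name];
    if (value !== undefined && name.toLowerCase() !== "authorization") {
      forwarded[name] = value;
    }
  });
  return { endpoint, headers: forwarded, body, timeout };
}
```

The result can be serialized with `JSON.stringify` and sent as the POST body.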

Description

As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.

This story depends on ODC-7273, ODC-7274, and ODC-7288. This story should bring the backend proxy, and the frontend together and finalize the work.

Acceptance Criteria

  1. Write proper types if they are missed
  2. Connect the form and invoke a serverless function, consume and show the response
  3. Unit tests
  4. E2E tests

Additional Details:

Description

The current YAMLEditor also supports other languages such as JSON, so the component needs to be renamed.

Acceptance Criteria

  1. Rename all instances of YAMLEditor to CodeEditor

Additional Details:

Description

As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.

This story is to evaluate a good UI for this and check this with our PM (Serena) and the Serverless team (Naina and Lance).

Acceptance Criteria

  1. Add a new page with title "Invoke Serverless function {function-name}" and should be available via a new URL (/serverless/ns/:ns/invoke-function/:function-name/).
  2. Implement a form with formik to "invoke" (console.log for now) Serverless functions, without writing the network call for this already. Focus on the UI to get feedback as early as possible. Use reusable, well-named components anyway.
  3. The page should also be available as a modal. Add a new action to all Serverless Services with the label (tbd) to open this modal from the Topology graph or from the Serverless Service list view.
  4. The page should have two tabs or two panes for the request and response. Each of these tabs/panes should again have two tabs, "similar" to the browser network inspector. See below for what we know currently.
  5. Get confirmation from Christoph, Serena, Naina, and Lance.
  6. Disable the action until we implement the network communication in ODC-7275 with the serverless function.
  7. No e2e tests are needed for this story.

Additional Details:

Information the form should show:

  1. Request tab shows "Body" and "Options" tab
    1. Body is just a full size editor. We should reuse our code editor.
    2. Options contains:
      1. Auto complete text field “Content type” with placeholder “application/json”, that will be used when nothing is entered
      2. Dropdown “Format” with values “cloudevent” (default) and “http”
      3. Text field “Type” with placeholder text “boson.fn”, that will be used when nothing is entered
      4. Text field “Source” with placeholder “/boson/fn”, that will be used when nothing is entered
  2. Response tab shows Body and Info tab
    1. Body is a full size editor that shows the response. We should format a JSON string with JSON.stringify(data, null, 2)
    2. Info contains:
      1. Id (id)
      2. Type (type)
      3. Source (source)
      4. Time (time) (formatted)
      5. Content-Type: (datacontenttype)
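
The placeholder fallbacks and the response formatting listed above can be sketched as below. This is a minimal illustration, not the console's implementation: the names `InvokeOptions`, `withDefaults`, and `formatResponseBody` are hypothetical, and only the default values and the `JSON.stringify(data, null, 2)` call come from the list.

```typescript
// Field names follow the "Options" list above; type names are hypothetical.
type InvokeFormat = "cloudevent" | "http";

interface InvokeOptions {
  contentType?: string;
  format?: InvokeFormat;
  type?: string;
  source?: string;
}

// Fall back to the placeholder value when the user entered nothing
// (an empty string counts as "nothing entered").
function withDefaults(options: InvokeOptions): Required<InvokeOptions> {
  return {
    contentType: options.contentType || "application/json",
    format: options.format || "cloudevent",
    type: options.type || "boson.fn",
    source: options.source || "/boson/fn",
  };
}

// Pretty-print a JSON response body; return non-JSON bodies unchanged.
function formatResponseBody(raw: string): string {
  try {
    return JSON.stringify(JSON.parse(raw), null, 2);
  } catch (err) {
    return raw;
  }
}
```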

< High-Level description of the feature ie: Executive Summary >

Goals

Cluster administrators need an in-product experience to discover and install new Red Hat offerings that can add high value to developer workflows.

Requirements

Requirement                                                        Notes   Is MVP
Discover new offerings in Home Dashboard                                   Y
Access details outlining value of offerings                                Y
Access step-by-step guide to install offering                              N
Allow developers to easily find and use newly installed offerings          Y
Support air-gapped clusters                                                Y

(Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

Out of scope

Discovering solutions that are not available for installation on cluster

Dependencies

No known dependencies

Background, and strategic fit

 

Assumptions

None

 

Customer Considerations

 

Documentation Considerations

Quick Starts 

What does success look like?

 

QE Contact

 

Impact

 

Related Architecture/Technical Documents

 

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

Problem:

Developers using Dev Console need to be made aware of the RH developer tooling available to them.

Goal:

Provide awareness to developers using Dev Console of the RH developer tooling that is available to them, including:

Consider enhancing the +Add page and/or the Guided tour

Provide a Quick Start for installing the Cryostat Operator

Why is it important?

To increase usage of our RH portfolio

Acceptance criteria:

  1. Quick Start - Installing Cryostat Operator
  2.  Quick Start - Get started with JBoss EAP using a Helm Chart
  3. Discoverability of the IDE extensions from Create Serverless form
  4. Update Terminal step of the Guided Tour to indicate that odo CLI is accessible (link to https://developers.redhat.com/products/odo/overview)

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

This story is to add new Quick Start for installing the Cryostat Operator

Acceptance Criteria

  1. Create new Quick Start for installing the Cryostat Operator

Additional Details:

Description

Add below IDE extensions in create serverless form,

Acceptance Criteria

  1. In create serverless form add above IDE extensions
  2. On clicking a link, the user should be taken to the respective page
  3. Add e2e tests for that

Additional Details:

Description 

Add OpenShift Quickstart for JBoss EAP 7

Acceptance Criteria

  1. Add OpenShift Quickstart for JBoss EAP 7

Additional Details:

Description

Update Terminal step of the Guided Tour to indicate that odo CLI is accessible - https://developers.redhat.com/products/odo/overview

Acceptance Criteria

  1. Update Guided tour of Web Terminal to add odo CLI link
  2. On clicking the link, the user should be redirected to the respective page

Additional Details:

We are deprecating DeploymentConfig in favor of Deployment in OpenShift because Deployment is the recommended way to deploy applications. Deployment is a more flexible and powerful resource that allows you to control the deployment of your applications more precisely. DeploymentConfig is a legacy resource that is no longer necessary. We will continue to support DeploymentConfig for a period of time, but we encourage you to migrate to Deployment as soon as possible.

Here are some of the benefits of using Deployment over DeploymentConfig:

  • Deployment is more flexible. You can specify the number of replicas to deploy, the image to deploy, and the environment variables to use.
  • Deployment is more powerful. You can use Deployment to roll out changes to your applications in a controlled manner.
  • Deployment is the recommended way to deploy applications. OpenShift will continue to improve Deployment and make it the best way to deploy applications.

We hope that you will migrate to Deployment as soon as possible. If you have any questions, please contact us.

Epic Goal

  • Make it possible to disable the DeploymentConfig and BuildConfig APIs, and associated controller logic.

 

Given the nature of this component (embedded into a shared api server and controller manager), this will likely require adding logic within those shared components to not enable specific bits of function when the build or DeploymentConfig capability is disabled, and watching the enabled capability set so that the components enable the functionality when necessary.

I would not expect us to split the components out of their existing location as part of this, though that is theoretically an option.

 

Why is this important?

  • Reduces resource footprint and bug surface area for clusters that do not need to utilize the DeploymentConfig or BuildConfig functionality, such as SNO and OKE.

Acceptance Criteria (Mandatory)

  • CI - MUST be running successfully with tests automated (we have an existing CI job that runs a cluster with all optional capabilities disabled.  Passing that job will require disabling certain deploymentconfig tests when the cap is disabled)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Cluster install capabilities

Previous Work (Optional):

  1. The optional cap architecture and guidance for adding a new capability is described here: https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md

Open questions:

None

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

Make the list of enabled/disabled controllers in OAS reflect enabled/disabled capabilities.

Acceptance criteria:

  • OAS allows specifying a list of enabled/disabled APIs (e.g. watches, caches, ...)
  • OASO watches capabilities and generates the right configuration for OAS with enabled/disabled list of APIs
  • Documentation is properly updated

QE:

  • Enable/disable capabilities and validate that a given API (DC, Builds, ...) is/is not managed by the cluster:
  • checking the OAS logs do/do not log entries about affected API(s)
  • DC/Builds objects are created/fail to be created
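
The relationship between enabled capabilities and the APIs that OAS serves can be pictured with the sketch below. It is illustrative only: the capability-to-API-group mapping and the function name `disabledApiGroups` are assumptions made for this example, and the configuration that OASO actually generates is not described here.

```typescript
// Assumed mapping from an optional capability to the API group it gates.
// Both the capability names and the group names are illustrative.
const CAPABILITY_API_GROUPS: Record<string, string[]> = {
  Build: ["build.openshift.io"],
  DeploymentConfig: ["apps.openshift.io"],
};

// Given the capabilities currently enabled on the cluster, list the
// API groups that should stay disabled in the API server.
function disabledApiGroups(enabledCapabilities: string[]): string[] {
  const disabled: string[] = [];
  Object.keys(CAPABILITY_API_GROUPS).forEach((capability) => {
    if (enabledCapabilities.indexOf(capability) === -1) {
      const groups = CAPABILITY_API_GROUPS[capability] || [];
      groups.forEach((group) => disabled.push(group));
    }
  });
  return disabled.sort();
}
```

Because the operator watches the enabled capability set, a function like this would be re-evaluated whenever a capability is enabled, and the resulting list would shrink accordingly.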

Feature Overview

At the moment, HyperShift relies on an older etcd operator (i.e., the CoreOS etcd operator). However, this operator is basic and does not support HA as required.

Goals

Introduce a reliable component to operate etcd that:

  • Is backed by a stable operator
  • Supports images with a hash
  • Supports backups
  • Local-persistent volumes for persistent data?
  • Supports encryption
  • Supports HA and scalability

 

Following on from https://issues.redhat.com/browse/HOSTEDCP-444 we need to add the steps to enable migration of the Node/CAPI resources to enable workloads to continue running during controlplane migration.

This will be a manual process where controlplane downtime will occur.

 

A successful migration must satisfy the following criteria:

  • All HC conditions are positive.
  • All NodePool conditions are positive.
  • All service endpoints (kas, oauth, ignition server, ...) are reachable.
  • Ability to create/scale NodePools remains operational.

We need to validate and document this manually for starters.

Eventually this should be automated in the upcoming e2e test.

We could even have a job running conformance tests over a migrated cluster

Epic Goal

As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.

As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.

Why is this important?

Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, one of the reasons is the security implications of DHCP (customers report that, depending on configuration, DHCP may allow any device onto the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI customers would need automation software that, in secure environments, they would have to certify along with OpenShift.

Acceptance Criteria

  • I can specify static IPs for node VMs at install time with IPI

Previous Work

Bare metal related work:

CoreOS Afterburn:

https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28

https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

USER STORY:

As an OpenShift administrator, I want to apply an IP configuration so that I can adhere to my organization's security guidelines.

DESCRIPTION:

The vSphere machine controller needs to be modified to convert nmstate to `guestinfo.afterburn.initrd.network-kargs` upon cloning the template for a new machine.  An example of this is here: https://github.com/openshift/machine-api-operator/pull/1079
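
For orientation, the value placed in `guestinfo.afterburn.initrd.network-kargs` is understood here to be a dracut-style `ip=` kernel argument. A minimal sketch of producing such a string from a simplified static IPv4 configuration follows. The `StaticIPv4` type and both function names are hypothetical, the input is a stand-in rather than real nmstate, and the conversion the machine controller actually performs is not shown in this document.

```typescript
// Simplified stand-in for a static IP configuration (not real nmstate).
interface StaticIPv4 {
  address: string;      // e.g. "192.168.1.10"
  prefixLength: number; // e.g. 24
  gateway: string;
  hostname: string;
  iface: string;        // e.g. "ens192"
  nameservers: string[];
}

// Convert a CIDR prefix length into a dotted-quad netmask.
function prefixToNetmask(prefixLength: number): string {
  const octets: number[] = [];
  for (let i = 0; i < 4; i++) {
    const bits = Math.min(8, Math.max(0, prefixLength - i * 8));
    octets.push(256 - Math.pow(2, 8 - bits));
  }
  return octets.join(".");
}

// Build "ip=<ip>::<gateway>:<netmask>:<hostname>:<iface>:none" followed by
// one "nameserver=<ip>" argument per DNS server.
function buildNetworkKargs(cfg: StaticIPv4): string {
  const ip = [
    cfg.address,
    "", // peer address, left empty
    cfg.gateway,
    prefixToNetmask(cfg.prefixLength),
    cfg.hostname,
    cfg.iface,
    "none", // no autoconfiguration
  ].join(":");
  const dns = cfg.nameservers.map((ns) => "nameserver=" + ns);
  return ["ip=" + ip].concat(dns).join(" ");
}
```

With address `192.168.1.10/24`, gateway `192.168.1.1`, hostname `worker-0`, interface `ens192`, and one nameserver `192.168.1.2`, the function returns `ip=192.168.1.10::192.168.1.1:255.255.255.0:worker-0:ens192:none nameserver=192.168.1.2`.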

Required:

Nice to have:

ACCEPTANCE CRITERIA:

ENGINEERING DETAILS:

https://github.com/openshift/enhancements/pull/1267

Authentication-operator ignores noproxy settings defined in the cluster-wide proxy.

Expected outcome: When noproxy is set, Authentication operator should initialize connections through ingress instead of the cluster-wide proxy. 
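
To illustrate the expected outcome, a simplified check for whether a host falls under a `noProxy` list might look like the sketch below. It handles only exact hostnames and domain suffixes, ignores CIDR entries and wildcards, and is not the operator's actual logic. The function name and the sample hosts in the usage note are hypothetical.

```typescript
// Return true when `host` matches an entry of a comma-separated noProxy
// list, either exactly or as a subdomain of the entry.
function bypassesProxy(host: string, noProxy: string): boolean {
  const target = host.toLowerCase();
  return noProxy
    .split(",")
    .map((entry) => entry.trim().toLowerCase().replace(/^\./, ""))
    .filter((entry) => entry.length > 0)
    .some(
      (entry) =>
        target === entry ||
        target.slice(-(entry.length + 1)) === "." + entry
    );
}
```

For example, with `noProxy` set to `.apps.example.com,localhost`, the host `oauth-openshift.apps.example.com` would bypass the proxy, while `example.org` would still go through it.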

Background and Goal

Currently in OpenShift we do not support adding 3rd party agents and other software to cluster nodes. While rpm-ostree supports adding packages, we have no way today to do that in a sane, scalable way across machineconfigpools and clusters. Some customers may not be able to meet their IT policies due to this.

In addition to third party content, some customers may want to use the layering process as a point to inject configuration. The build process allows for simple copying of config files and the ability to run arbitrary scripts to set user config files (e.g. through an Ansible playbook). This should be a supported use case, except where it conflicts with OpenShift (for example, the MCO must continue to manage Cri-O and Kubelet configs).

Example Use Cases

  • Bare metal firmware update software that is packaged as an RPM
  • Host security monitors
  • Forensic tools
  • SIEM logging agents
  • SSH Key management
  • Device Drivers from OEM/ODM partners

Acceptance Criteria

  1. Administrators can deploy 3rd party repositories and packages to MachineConfigPools.
  2. Administrators can easily remove added packages and repository files.
  3. Administrators can manage system configuration files by copying files into the RHCOS build. [Note: if the same file is managed by the MCO, the MachineConfig version of the file is expected to "win" over the OS image version.]

Background

As part of enabling OCP CoreOS Layering for third party components, we will need to allow for package installation to /opt. Many OEMs and ISVs install to /opt and it would be difficult for them to make the change only for RHCOS. Meanwhile changing their RHEL target to a different target would also be problematic as their customers are expecting these tools to install in a certain way. Not having to worry about this path will provide the best ecosystem partner and customer experience.

Requirements

  • Document how 3rd party vendors can be compatible with our current offering.
  • Provide mechanism for 3rd party vendors or their customers to provide information for exceptions that require an RPM to install binaries to /opt as an install target path.

Feature Overview (aka. Goal Summary)  

Add support for custom security groups to be attached to control plane and compute nodes at installation time.

Goals (aka. expected user outcomes)

Allow the user to provide existing security groups to be attached to the control plane and compute node instances at installation time.

Requirements (aka. Acceptance Criteria):

The user will be able to provide a list of existing security groups to the install config manifest that will be used as additional custom security groups to be attached to the control plane and compute node instances at installation time.

Out of Scope

The installer won't be responsible for creating any custom security groups; these must be created by the user before the installation starts.

Background

We do have users/customers with specific requirements on adding additional network rules to every instance created in AWS. For OpenShift these additional rules need to be added on day-2 manually as the Installer doesn't provide the ability to add custom security groups to be attached to any instance at install time.

MachineSets already support adding a list of existing custom security groups, so this could already be automated at install time by manually editing each MachineSet manifest before starting the installation. Even in these cases, however, the Installer doesn't allow the user to provide this information so that the security groups are added to the MachineSet manifests.

Documentation Considerations

Documentation will be required to explain how this information needs to be provided to the install config manifest as any other supported field.

Epic Goal

  • Allow the user to provide existing security groups to be attached to the control plane and compute node instances at installation time.

Why is this important?

  • We do have users/customers with specific requirements on adding additional network rules to every instance created in AWS. For OpenShift these additional rules need to be added on day-2 manually as the Installer doesn't provide the ability to add custom security groups to be attached to any instance at install time.

    MachineSets already support adding a list of existing custom security groups, so this could already be automated at install time by manually editing each MachineSet manifest before starting the installation. Even in these cases, however, the Installer doesn't allow the user to provide this information so that the security groups are added to the MachineSet manifests.

Scenarios

  1. The user will be able to provide a list of existing security groups to the install config that will be used as additional custom security groups to be attached to the control plane and compute node instances at installation time.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Previous Work (Optional):

  1. Compute Nodes managed by MAPI already support this feature

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Add custom security groups for compute nodes
  • Add custom security groups for control plane nodes

so that I can achieve

  • Control Plane and Compute nodes can support operational specific security rules. For instance: specific traffic may be required for compute vs control plane nodes.

Acceptance Criteria:

Description of criteria:

  • The control plane and compute machine sections of the install config accept user input as additionalSecurityGroupIDs (when using the aws platform).

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  •  
    additionalSecurityGroupIDs:
      description: AdditionalSecurityGroupIDs contains IDs of
        additional security groups for machines, where each ID
        is presented in the format sg-xxxx.
      items:
        type: string
      type: array 
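
A minimal sketch of checking user input against the `sg-xxxx` format mentioned in the schema description is shown below. The function name and the exact pattern are assumptions for illustration; the validation the installer really performs is not described in this story.

```typescript
// Return the entries that do not look like "sg-" followed by at least
// one lowercase letter or digit. The pattern is an assumption based on
// the "sg-xxxx" format given in the field description.
function invalidSecurityGroupIDs(ids: string[]): string[] {
  const pattern = /^sg-[0-9a-z]+$/;
  return ids.filter((id) => !pattern.test(id));
}
```

An empty result means every entry in `additionalSecurityGroupIDs` at least has the expected shape; whether the groups exist in the target VPC would still need to be checked against AWS.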

 

This requires/does not require a design proposal.

Feature Overview (aka. Goal Summary)  

Pod scaling in OpenShift depends heavily on the customer's workload and hardware setup. Some workloads on certain hardware might not scale beyond 100 pods, while others might scale to 1000 pods.

As an OpenShift admin, I want to monitor metrics that indicate why I am not able to scale my pods. Think of a pressure gauge that tells the customer when it is green (can scale) and when it is red (cannot scale).

As a member of the OpenShift support team, if a customer calls in with a complaint about pod scaling, I should be able to check some metrics and tell them why they are not able to scale.

Goals (aka. expected user outcomes)

Metrics, alerts, and a dashboard

 

Requirements (aka. Acceptance Criteria):

These metrics and alerts can be integrated into a monitoring dashboard.

 

 

Epic Goal

  • To come up with a set of metrics that indicate optimal node resource usage.

Why is this important?

  • These metrics will help customers understand the capacity they have instead of restricting themselves to a hard-coded max pod limit.

Scenarios

  1. As an owner of an extremely high-capacity machine, I want to be able to deploy as many pods as my machine can handle.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. None

Previous Work (Optional):

  1. https://issues.redhat.com/browse/OCPNODE-1125

Open questions:

  1. The challenging part is to come up with a set of metrics that accurately indicate system resource usage.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We need an operator to inject the dashboard jsonnet. For example, the etcd team injects their dashboard jsonnet using their operator in the form of a config map.

https://redhat-internal.slack.com/archives/C027U68LP/p1683574004805639?thread_ts=1683573783.216759&cid=C027U68LP

 

We will need a similar approach for the node dashboard.

Feature Overview

Create a GCP cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any openshift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.
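
The create-and-update behaviour described above, with deletes left out of scope, can be sketched as a merge that never removes labels. The type alias `Labels` and the function names are hypothetical; each component would apply the result through its own GCP client, which is not shown.

```typescript
type Labels = Record<string, string>;

// Compute the labels a resource should end up with: labels already on
// the resource are kept, desired labels are added or overwritten, and
// nothing is removed (tag deletes are out of scope).
function reconcileLabels(existing: Labels, desired: Labels): Labels {
  const result: Labels = {};
  [existing, desired].forEach((source) => {
    Object.keys(source).forEach((key) => {
      const value = source[key];
      if (value !== undefined) {
        result[key] = value;
      }
    });
  });
  return result;
}

// True when at least one desired label is missing or has a stale value.
function needsLabelUpdate(existing: Labels, desired: Labels): boolean {
  return Object.keys(desired).some((key) => existing[key] !== desired[key]);
}
```

Checking `needsLabelUpdate` first avoids issuing an update call for resources that already carry the desired labels, including existing resources that were tagged in an earlier pass.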

 
Goals

  • Functionality on GCP Tech Preview
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement                                             Notes                                                        isMvp?
CI - MUST be running successfully with test automation  This is a requirement for ALL features.                      YES
Release Technical Enablement                            Provide necessary release enablement details and documents.  YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This epic covers the work to apply user-defined labels to GCP resources created for an OpenShift cluster, available as Tech Preview.

The user should be able to define GCP labels to be applied to the resources created during cluster creation by the installer and by the other operators that manage specific resources. The user defines the required tags/labels in install-config.yaml when preparing the inputs for cluster creation. These are then made available in the status sub-resource of the Infrastructure custom resource, which cannot be edited but is available for user reference, and are used by the in-cluster operators for labeling when resources are created.

Updating/deleting of labels added during cluster creation or adding new labels as Day-2 operation is out of scope of this epic.

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

Reference - https://issues.redhat.com/browse/RFE-2017

The enhancement proposed for GCP labels support in OCP requires the install-config CRD to be updated to include gcp userLabels for the user to configure. The installer refers to these to apply the list of labels to each resource it creates and also makes them available in the Infrastructure CR it creates.

Below is the snippet of the change required in the CRD

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata: 
  name: installconfigs.install.openshift.io
spec: 
  versions: 
  - name: v1
    schema: 
      openAPIV3Schema: 
        properties: 
          platform: 
            properties: 
              gcp: 
                properties: 
                  userLabels: 
                    additionalProperties: 
                      type: string
                    description: UserLabels additional keys and values that the installer
                      will add as labels to all resources that it creates. Resources
                      created by the cluster itself may not include these labels.
                  type: object

This change is required for testing the changes of the feature, and should ideally get merged first.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • User should be able to configure gcp user defined labels in the install-config.yaml
  • Fields descriptions

The enhancement proposed for GCP labels and tags support in OCP relies on the latest APIs available in the Terraform provider for Google, so the provider must be updated to a version that offers them.

Acceptance Criteria

  • Code linting, validation and best practices adhered to.

Enhancement proposed for GCP tags support in OCP, requires cluster-image-registry-operator to add gcp userTags available in the status sub resource of infrastructure CR, to the gcp storage resource created.

cluster-image-registry-operator uses the method createStorageAccount() to create storage resource which should be updated to add tags after resource creation.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

cluster-config-operator makes the Infrastructure CRD available to the installer. The CRD is included in its container image from the openshift/api package, so the package must be updated to carry the latest CRD.

The installer creates the GCP resources listed below during the create cluster phase. These resources should be labeled with the user-defined labels and the default OCP label kubernetes-io-cluster-<cluster_id>:owned

Resources List

Resource         Terraform API
VM Instance      google_compute_instance
Image            google_compute_image
Address          google_compute_address (beta)
ForwardingRule   google_compute_forwarding_rule (beta)
Zones            google_dns_managed_zone
Storage Bucket   google_storage_bucket

Acceptance Criteria:

  • Code linting, validation and best practices adhered to
  • The GCP resources created by the installer should have the user-defined labels as well as the default OCP label.
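
The labeling rule described in this story can be sketched in a few lines of Python. This is only an illustration: the function name and the sample cluster ID are hypothetical, and letting the default OCP label win over a user label with the same key is an assumption, not something the story specifies.

```python
def build_resource_labels(cluster_id, user_labels):
    """Combine user-defined labels with the default OCP ownership label.

    Hypothetical helper: the precedence of the default label over a
    conflicting user label is an assumption made for this sketch.
    """
    labels = dict(user_labels)  # copy so the caller's map is not mutated
    labels[f"kubernetes-io-cluster-{cluster_id}"] = "owned"
    return labels


if __name__ == "__main__":
    # "mycluster-x7k2p" is a made-up cluster ID.
    print(build_resource_labels("mycluster-x7k2p", {"team": "storage"}))
```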

Enhancement proposed for GCP labels support in OCP, requires cluster-image-registry-operator to add gcp userLabels available in the status sub resource of infrastructure CR, to the gcp storage resource created.

cluster-image-registry-operator uses the method createStorageAccount() to create storage resource which should be updated to add labels.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

The enhancement proposed for GCP labels support in OCP requires machine-api-provider-gcp to add the gcp userLabels available in the status sub-resource of the Infrastructure CR to the GCP virtual machine resources and the sub-resources it creates.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

The installer generates the Infrastructure CR in the manifests creation step of the cluster creation process, based on the user-provided input recorded in install-config.yaml. While generating the Infrastructure CR, platformStatus.gcp.resourceLabels should be populated with the user-provided labels (installconfig.platform.gcp.userLabels).

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • Infrastructure CR created by installer should have gcp user defined labels if any, in status field.
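
The mapping from install-config to the Infrastructure CR status can be pictured with the following Python sketch. It works on plain dictionaries, the function name is hypothetical, and representing resourceLabels as a list of key/value entries is an assumption about the shape of the status field rather than something this story states.

```python
def gcp_status_from_install_config(install_config):
    """Derive a platformStatus fragment from an install-config mapping.

    Hypothetical helper: the dictionary layout, and the list of
    {"key": ..., "value": ...} entries, are assumptions for illustration.
    """
    user_labels = (
        install_config.get("platform", {}).get("gcp", {}).get("userLabels", {})
    )
    gcp_status = {}
    if user_labels:
        # Sort by key so the generated status is deterministic.
        gcp_status["resourceLabels"] = [
            {"key": key, "value": value}
            for key, value in sorted(user_labels.items())
        ]
    return {"platformStatus": {"type": "GCP", "gcp": gcp_status}}


if __name__ == "__main__":
    config = {"platform": {"gcp": {"userLabels": {"team": "storage"}}}}
    print(gcp_status_from_install_config(config))
```

Omitting resourceLabels when no labels were provided matches the "if any" wording of the acceptance criteria.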

Feature Overview  

Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs, as opposed to insecure static, long-lived credentials. OLM-managed operators can integrate with the CloudCredentialOperator in a well-defined way to support this flow.

Goals:

Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for an increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.

Requirements:

  • CCO gets a new mode in which it can reconcile STS credential requests for OLM-managed operators
  • A standardized flow is leveraged to guide users in discovering and preparing their AWS IAM policies and roles with permissions that are required for OLM-managed operators 
  • A standardized flow is defined in which users can configure OLM-managed operators to leverage AWS STS
  • An example operator is used to demonstrate the end2end functionality
  • Clear instructions and documentation for operator development teams to implement the required interaction with the CloudCredentialOperator to support this flow

Use Cases:

See Operators & STS slide deck.

 

Out of Scope:

  • handling OLM-managed operator updates in which AWS IAM permission requirements might change from one version to another (which requires user awareness and intervention)

 

Background:

The CloudCredentialsOperator already provides a powerful API for OpenShift's core cluster operators to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today ranges from cumbersome to non-existent, depending on the operator in question, and is seen as an adoption blocker for OpenShift on AWS.

 

Customer Considerations

This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.

Documentation Considerations

  • Internal documentation needs to exist to guide Red Hat operator developer teams on the requirements and proposed implementation of integration with CCO and the proposed flow
  • External documentation needs to exist to guide users on:
    • how to become aware that the cluster is in STS mode
    • how to become aware of operators that support STS and the proposed CCO flow
    • how to become aware of the IAM permissions requirements of these operators
    • how to configure an operator in the proposed flow to interact with CCO

Interoperability Considerations

  • this needs to work with ROSA
  • this needs to work with self-managed OCP on AWS

Market Problem

This Section: High-Level description of the Market Problem ie: Executive Summary

  • As a customer of OpenShift layered products, I need to be able to fluidly, reliably and consistently install and use OpenShift layered product Kubernetes Operators in my ROSA STS clusters, while keeping an STS workflow throughout.
  • As a customer of OpenShift on the big cloud providers, overall I expect OpenShift as a platform to function equally well with tokenized cloud auth as it does with "mint-mode" IAM credentials. I expect the same from the Kubernetes Operators under the Red Hat brand (that need to reach cloud APIs) in that tokenized workflows are equally integrated and workable as with "mint-mode" IAM credentials.
  • As the managed services teams, including Hypershift, offering a downstream opinionated, supported and managed lifecycle of OpenShift (in the forms of ROSA, ARO, OSD on GCP, Hypershift, etc), we need the OpenShift platform to have integration that is as close to native as possible with core platform operators when clusters use tokenized cloud auth, driving the use of layered products.
  • As the Hypershift team, where the only credential mode for clusters/customers is STS (on AWS), we need the Red Hat branded Operators that must reach the AWS API to work with STS credentials in a consistent and automated fashion that allows customers to use those operators as easily as possible, driving the use of layered products.

Why it Matters

  • Adding consistent, automated layered product integrations to OpenShift would provide great added value to OpenShift as a platform, and its downstream offerings in Managed Cloud Services and related offerings.
  • Enabling Kubernetes Operators (at first, Red Hat ones) on OpenShift for the "big3" cloud providers is a key differentiation and security requirement that our customers have been demanding and continue to demand.
  • HyperShift is an STS-only architecture, which means that if our layered offerings via Operators cannot easily work with STS, then it would be blocking us from our broad product adoption goals.

Illustrative User Stories or Scenarios

  1. Main success scenario - high-level user story
    1. customer creates a ROSA STS or Hypershift cluster (AWS)
    2. customer wants basic (table-stakes) features such as AWS EFS or RHODS or Logging
    3. customer sees necessary tasks for preparing for the operator in OperatorHub from their cluster
    4. customer prepares AWS IAM/STS roles/policies in anticipation of the Operator they want, using what they get from OperatorHub
    5. customer provides a very minimal set of parameters (AWS ARN of role(s) with policy) to the Operator's OperatorHub page
    6. The cluster can automatically set up the Operator, using the provided tokenized credentials, and the Operator functions as expected
    7. Cluster and Operator upgrades are taken into account and automated
    8. The above steps 1-7 should apply similarly for Google Cloud and Microsoft Azure Cloud, with their respective token-based workload identity systems.
  2. Alternate flow/scenarios - high-level user stories
    1. The same as above, but the ROSA CLI would assist with AWS role/policy management
    2. The same as above, but the oc CLI would assist with cloud role/policy management (per respective cloud provider for the cluster)
  3. ...

Expected Outcomes

This Section: Articulates and defines the value proposition from a user's point of view

  • See SDE-1868 as an example of what is needed, including design proposed, for current-day ROSA STS and by extension Hypershift.
  • Further research is required to accommodate the AWS STS equivalent systems of GCP and Azure
  • Order of priority at this time is
    • 1. AWS STS for ROSA and ROSA via HyperShift
    • 2. Microsoft Azure for ARO
    • 3. Google Cloud for OpenShift Dedicated on GCP

Effect

This Section: Effect is the expected outcome within the market. There are two dimensions of outcomes; growth or retention. This represents part of the “why” statement for a feature.

  • Growth is the acquisition of net new usage of the platform. This can be new workloads not previously able to be supported, new markets not previously considered, or new end users not previously served.
  • Retention is maintaining and expanding existing use of the platform. This can be more effective use of tools, competitive pressures, and ease of use improvements.
  • Both growth and retention are the effect of this effort.
    • Customers have strict requirements around using only token-based cloud credential systems for workloads in their cloud accounts, which include OpenShift clusters in all forms.
      • We gain new customers from both those that have waited for token-based auth/auth from OpenShift and from those that are new to OpenShift, with strict requirements around cloud account access
      • We retain customers that are going through both cloud-native and hybrid-cloud journeys that all inevitably see security requirements driving them towards token-based auth/auth.

References

As an engineer, I want the capability to implement CI test cases that run at different intervals (for example daily or weekly), to ensure that downstream operators that depend on certain capabilities are not negatively impacted when the systems CCO interacts with change behavior.

Acceptance Criteria:

Create a stubbed out e2e test path in CCO and matching e2e calling code in release such that there exists a path to tests that verify working in an AWS STS workflow.

OC mirror is a GA product as of OpenShift 4.11.

The goal of this feature is to solve any future customer request for new features or capabilities in OC mirror 

In the 4.12 release, a new feature was introduced to oc-mirror allowing it to use OCI FBC catalogs as a starting point for mirroring operators.

Overview

As an oc-mirror user, I would like the OCI FBC feature to be stable
so that I can use it in a production ready environment
and to make the new feature and all existing features of oc-mirror seamless

Current Status

This feature is ring-fenced in the oc mirror repository. It uses the following flags so as not to cause any breaking changes in the current oc-mirror functionality.

  • --use-oci-feature
  • --oci-feature-action (copy or mirror)
  • --oci-registries-config

The OCI FBC (file-based catalog) format was delivered as Tech Preview in 4.12

Tech Enablement slides can be found here https://docs.google.com/presentation/d/1jossypQureBHGUyD-dezHM4JQoTWPYwiVCM3NlANxn0/edit#slide=id.g175a240206d_0_7

Design doc is in https://docs.google.com/document/d/1-TESqErOjxxWVPCbhQUfnT3XezG2898fEREuhGena5Q/edit#heading=h.r57m6kfc2cwt (also contains latest design discussions around the stories of this epic)

Link to previous working epic https://issues.redhat.com/browse/CFE-538

Contacts for the OCI FBC feature

 

Feature Overview (aka. Goal Summary)  

The OpenShift Assisted Installer is a user-friendly OpenShift installation solution for the various platforms, but focused on bare metal. This very useful functionality should be made available for the IBM zSystem platform.

 

Goals (aka. expected user outcomes)

Use of the OpenShift Assisted Installer to install OpenShift on an IBM zSystem

 

Requirements (aka. Acceptance Criteria):

Using the OpenShift Assisted Installer to install OpenShift on an IBM zSystem 

 

As a multi-arch development engineer, I would like to ensure that the Assisted Installer workflow is fully functional and supported for z/VM deployments.

Acceptance Criteria

  • Feature is implemented, tested, QE, documented, and technically enabled.
  • Stories closed.

Description of the problem:

Using FCP (multipath) devices for a zVM node with the parmline:

rd.neednet=1 console=ttysclp0 coreos.live.rootfs_url=http://172.23.236.156:8080/assisted-installer/rootfs.img ip=10.14.6.8::10.14.6.1:255.255.255.0:master-0:encbdd0:none nameserver=10.14.6.1 ip=[fd00::8]::[fd00::1]:64::encbdd0:none nameserver=[fd00::1] zfcp.allow_lun_scan=0 rd.znet=qeth,0.0.bdd0,0.0.bdd1,0.0.bdd2,layer2=1 rd.zfcp=0.0.8007,0x500507630400d1e3,0x4000401e00000000 rd.zfcp=0.0.8107,0x50050763040851e3,0x4000401e00000000 random.trust_cpu=on rd.luks.options=discard ignition.firstboot ignition.platform.id=metal console=tty1 console=ttyS1,115200n8

shows a disk limitation error in the UI.
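
The parmline above carries two rd.zfcp entries that reach the same LUN through different FCP adapters and WWPNs, which is what makes this a multipath setup. As a rough illustration, the Python sketch below extracts those entries and groups them by LUN. The function names are hypothetical, and the parser only handles the three-field form (device, WWPN, LUN) used in this report.

```python
from collections import defaultdict


def parse_zfcp_paths(parmline):
    """Extract rd.zfcp=<device>,<wwpn>,<lun> entries from a kernel parmline.

    Hypothetical helper: entries without exactly three comma-separated
    fields are skipped rather than interpreted.
    """
    prefix = "rd.zfcp="
    paths = []
    for token in parmline.split():
        if not token.startswith(prefix):
            continue
        fields = token[len(prefix):].split(",")
        if len(fields) != 3:
            continue
        device, wwpn, lun = fields
        paths.append({"device": device, "wwpn": wwpn, "lun": lun})
    return paths


def devices_per_lun(paths):
    """Group FCP device bus IDs by LUN; more than one device means multipath."""
    grouped = defaultdict(list)
    for path in paths:
        grouped[path["lun"]].append(path["device"])
    return dict(grouped)


if __name__ == "__main__":
    # Shortened copy of the parmline from this report.
    parmline = (
        "rd.neednet=1 zfcp.allow_lun_scan=0 "
        "rd.zfcp=0.0.8007,0x500507630400d1e3,0x4000401e00000000 "
        "rd.zfcp=0.0.8107,0x50050763040851e3,0x4000401e00000000"
    )
    print(devices_per_lun(parse_zfcp_paths(parmline)))
```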

<see attached image>

How reproducible:

Attach two FCP devices to a zVM node. Create a cluster and boot zVM node into discovery service. Host discovery panel shows an error for discovered host.

Steps to reproduce:

1. Attach two FCP devices to the zVM.

2. Create new cluster using the AI UI and configure discovery image

3. Boot zVM node 

4. Wait until the node shows up on the Host discovery panel.

5. FCP devices are not recognized as a valid option

Actual results:

FCP devices can't be used as an installable disk

Expected results:
FCP device can be used for installation (multipath must be activated after installation:
https://docs.openshift.com/container-platform/4.13/post_installation_configuration/ibmz-post-install.html#enabling-multipathing-fcp-luns_post-install-configure-additional-devices-ibmz)

Discovered a regression on staging: the default is set to minimal ISO, which prevents installation of OCP 4.13 for the s390x architecture.

See the following older bug, which appears to address the same issue:

  1. MGMT-14298

 

Description of the problem:

DASD devices are not recognized correctly when attached to and used for a zVM node.
<see attached screenshot>

Attach two DASD devices to a zVM node. Create a cluster and boot the zVM node into the discovery service. The Host discovery panel shows an error for the discovered host.

Steps to reproduce:

1. Attach two DASD devices to the zVM.

2. Create new cluster using the AI UI and configure discovery image

3. Boot zVM node 

4. Wait until the node shows up on the Host discovery panel.

5. DASD devices are not recognized as a valid option

Actual results:

DASD devices can't be used as an installable disk

Expected results:
DASD device can be used for installation. The user can choose which device AI will install to.

Epic Goal

As an OpenShift infrastructure owner I need to deploy OCP on OpenStack with the installer-provisioned infrastructure workflow and configure my own load balancers

Why is this important?

Customers want to use their own load balancers, and IPI comes with built-in LBs based on keepalived and haproxy.

Scenarios

  1. A large deployment routed across multiple failure domains without stretched L2 networks would require dynamically routing the control plane VIP traffic through load balancers capable of living in multiple L2 networks.
  2. Customers who want to use their existing LB appliances for the control plane.

Acceptance Criteria

  • Should we require the support of migration from internal to external LB?
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • QE - must be testing a scenario where we disable the internal LB and set up an external LB and the OCP deployment is running fine.
  • Documentation - we need to document all the gotchas regarding this type of deployment, even the specifics about the load-balancer itself (routing policy, dynamic routing, etc)

Dependencies (internal and external)

  1. Fixed IPs would be very interesting to support, already WIP by vsphere (need to Spike on this): https://issues.redhat.com/browse/OCPBU-179
  2. Confirm with customers that they are ok with external LB or they prefer a new internal LB that supports BGP

Previous Work:

vsphere has done the work already via https://issues.redhat.com/browse/SPLAT-409

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

Notes: https://github.com/EmilienM/ansible-role-routed-lb is an example of a LB that will be used for CI, can be used by QE and customers.

Feature Overview

Goals

  • Support OpenShift to be deployed from day-0 on AWS Local Zones
  • Support an existing OpenShift cluster to deploy compute Nodes on AWS Local Zones (day-2)

AWS Local Zones support - feature delivered in phases:

  • Phase 0 (OCPPLAN-9630): Document how to create compute nodes on AWS Local Zones in day-0 (SPLAT-635)
  • Phase 1 (OCPBU-2): Create edge compute pool to generate MachineSets for nodes with NoSchedule taints when installing a cluster in an existing VPC with AWS Local Zone subnets (SPLAT-636)
  • Phase 2 (OCPBU-351): Installer automates network resources creation on Local Zone based on the edge compute pool (SPLAT-657)

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement                                              Notes                                                         isMvp?
CI - MUST be running successfully with test automation   This is a requirement for ALL features.                       YES
Release Technical Enablement                             Provide necessary release enablement details and documents.   YES

 

Feature Overview

Testing is one of the main pillars of production-grade software. It helps validate changes and flag issues early, before the code is shipped into production environments. Code changes, no matter how small, might lead to bugs and outages. The best way to catch bugs is to write proper tests, and running those tests requires a foundation of test infrastructure. Finally, to close the circle, automating these tests and their corresponding builds helps reduce errors and saves a lot of time.

Goal(s)

  • How do we get infrastructure, what infrastructure accounts are required?
  • Build e2e integration with openshift-release on AWS.
  • Define MVP CI Jobs to validate (e.g., conformance). What tests are failing? Are we skipping any? Why?

Note: Sync with the Developer productivity teams might be required to understand infra requirements especially for our first HyperShift infrastructure backend, AWS.

Context:

This is a placeholder epic to capture all the e2e scenarios that we want to test in CI in the long term. Anything which is a TODO here should at minimum be validated by QE as it is developed.

DoD:

Every supported scenario is e2e CI tested.

Scenarios:

  • Hypershift deployment with services as routes.
  • Hypershift deployment with services as NodePorts.

 

DoD:

Refactor the E2E tests following new pattern with 1 HostedCluster and targeted NodePools:

  • nodepool_upgrade_test.go

 

Goal

Productize agent-installer-utils container from https://github.com/openshift/agent-installer-utils

Feature Description

In order to ship the network reconfiguration it would be useful to move the agent-tui to its own image instead of sharing the agent-installer-node-agent one.

Currently the `agent create image` command takes care of extracting the agent-tui binary (and required libs) from the `assisted-installer-agent` image (shipped in the release as `agent-installer-node-agent`).
Once the agent-tui is available from the `agent-installer-utils` image instead, the installer code must be updated accordingly (see https://github.com/openshift/installer/blob/56e85bee78490c18aaf33994e073cbc16181f66d/pkg/asset/agent/image/agentimage.go#L81)

agent-tui is currently built and shipped using the assisted-installer-agent repo. Since it will be moved into its own repository (agent-installer-utils), the previous code must be cleaned up.

Feature Overview

Allow users to interactively adjust the network configuration for a host after booting the agent ISO.

Goals

Configure network after host boots

The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.

Epic Goal

  • Allow users to interactively adjust the network configuration for a host after booting the agent ISO, before starting processes that pull container images.

Why is this important?

  • Configuring the network prior to booting a host is difficult and error-prone. Not only is the nmstate syntax fairly arcane, but the advent of 'predictable' interface names means that interfaces retain the same name across reboots but it is nearly impossible to predict what they will be. Applying configuration to the correct hosts requires correct knowledge and input of MAC addresses. All of these present opportunities for things to go wrong, and when they do the user is forced to return to the beginning of the process and generate a new ISO, then boot all of the hosts in the cluster with it again.

Scenarios

  1. The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.
  2. The user has Static IPs, VLANs, and/or bonds to configure, but makes an error entering the configuration in agent-config.yaml so that (at least) one host will not be able to pull container images from the release payload. They correct the configuration for that host via the text console before proceeding with the installation.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Currently the agent-tui always displays the additional checks (nslookup/ping/http get), even when the primary check (pull image) passes. This may confuse the user, because the additional checks do not prevent the agent-tui from completing successfully. They are purely informative and intended to help troubleshoot a failure, so they are not needed in the positive case.

The additional checks should then be shown only when the primary check fails for any reason.
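
The intended display rule can be sketched as follows. This is only an illustration of the logic described above, not the agent-tui implementation (which is not shown here); the names `CheckResult` and `checks_to_display` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str     # e.g. "pull image", "nslookup", "ping", "http get"
    passed: bool


def checks_to_display(primary: CheckResult,
                      additional: list[CheckResult]) -> list[CheckResult]:
    """Return the checks the UI should render.

    The additional checks are informative only, so they are shown
    solely when the primary check has failed.
    """
    if primary.passed:
        return [primary]
    return [primary, *additional]
```

With this rule, a passing pull-image check yields a single entry, while a failing one also lists every additional check regardless of its own result.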

When the UI is active in the console, event messages that are generated distort the interface and make it difficult for the user to view the configuration and select options. An example is shown in the attached screenshot.

When the agent-tui is shown during the initial host boot and the pull release image check fails, an additional checks box is shown along with a details text view.
The content of the details view is continuously updated with the details of the failed check, but the user cannot move the focus to the details box (using the arrow/tab keys) and therefore cannot scroll its content (using the up/down arrow keys).

The openshift-install agent create image command needs to fetch the agent-tui executable so that it can be embedded in the agent ISO. For this reason the agent-tui must be available in the release payload, so that it can be retrieved even when the command is invoked in a disconnected environment.

Epic Goal

Full support of North-South (cluster egress-ingress) IPsec that shares an encryption back-end with the current East-West implementation, allows for IPsec offload to capable SmartNICs, can be enabled and disabled at runtime, and allows for FIPS compliance (including install-time configuration and disabling of runtime configuration).

Why is this important?

  • Customers want end-to-end default encryption with external servers and/or clients.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Must allow for the possibility of offloading the IPsec encryption to a SmartNIC.
  •  

Dependencies (internal and external)

  1.  

Related:

  • ITUP-44 - OpenShift support for North-South OVN IPSec
  • HATSTRAT-33 - Encrypt All Traffic to/from Cluster (aka IPSec as a Service)

Previous Work (Optional):

  1. SDN-717 - Support IPSEC on ovn-kubernetes

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This is a clone of issue OCPBUGS-17380. The following is the description of the original issue:

Description of problem:

Enable IPSec pre/post install on OVN IC cluster

$ oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}'
network.operator.openshift.io/cluster patched
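
For readers who script this step, the merge patch passed to `oc patch` above can be built programmatically. The sketch below only reproduces the JSON body and argument list from the command shown; the helper names are hypothetical, and nothing here talks to a cluster.

```python
import json


def ipsec_enable_patch() -> str:
    """Build the merge patch body used above: an empty ipsecConfig object."""
    patch = {
        "spec": {
            "defaultNetwork": {
                "ovnKubernetesConfig": {"ipsecConfig": {}}
            }
        }
    }
    return json.dumps(patch)


def oc_patch_argv() -> list:
    """Argument list equivalent to the oc patch command in this report."""
    return [
        "oc", "patch", "networks.operator.openshift.io", "cluster",
        "--type=merge", "-p", ipsec_enable_patch(),
    ]
```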


ovn-ipsec containers complaining:

ovs-monitor-ipsec | ERR | Failed to import certificate into NSS.
b'certutil:  unable to open "/etc/openvswitch/keys/ipsec-cacert.pem" for reading (-5950, 2).\n'



$ oc rsh ovn-ipsec-d7rx9
Defaulted container "ovn-ipsec" out of: ovn-ipsec, ovn-keys (init)
sh-5.1# certutil -L -d /var/lib/ipsec/nss

Certificate Nickname                                         Trust Attributes
                                                             SSL,S/MIME,JAR/XPI

ovs_certkey_db961f9a-7de4-4f1d-a2fb-a8306d4079c5             u,u,u

sh-5.1# cat /var/log/openvswitch/libreswan.log
Aug  4 15:12:46.808394: Initializing NSS using read-write database "sql:/var/lib/ipsec/nss"
Aug  4 15:12:46.837350: FIPS Mode: NO
Aug  4 15:12:46.837370: NSS crypto library initialized
Aug  4 15:12:46.837387: FIPS mode disabled for pluto daemon
Aug  4 15:12:46.837390: FIPS HMAC integrity support [disabled]
Aug  4 15:12:46.837541: libcap-ng support [enabled]
Aug  4 15:12:46.837550: Linux audit support [enabled]
Aug  4 15:12:46.837576: Linux audit activated
Aug  4 15:12:46.837580: Starting Pluto (Libreswan Version 4.9 IKEv2 IKEv1 XFRM XFRMI esp-hw-offload FORK PTHREAD_SETSCHEDPRIO GCC_EXCEPTIONS NSS (IPsec profile) (NSS-KDF) DNSSEC SYSTEMD_WATCHDOG LABELED_IPSEC (SELINUX) SECCOMP LIBCAP_NG LINUX_AUDIT AUTH_PAM NETWORKMANAGER CURL(non-NSS) LDAP(non-NSS)) pid:147
Aug  4 15:12:46.837583: core dump dir: /run/pluto
Aug  4 15:12:46.837585: secrets file: /etc/ipsec.secrets
Aug  4 15:12:46.837587: leak-detective enabled
Aug  4 15:12:46.837589: NSS crypto [enabled]
Aug  4 15:12:46.837591: XAUTH PAM support [enabled]
Aug  4 15:12:46.837604: initializing libevent in pthreads mode: headers: 2.1.12-stable (2010c00); library: 2.1.12-stable (2010c00)
Aug  4 15:12:46.837664: NAT-Traversal support  [enabled]
Aug  4 15:12:46.837803: Encryption algorithms:
Aug  4 15:12:46.837814:   AES_CCM_16         {256,192,*128} IKEv1:     ESP     IKEv2:     ESP     FIPS              aes_ccm, aes_ccm_c
Aug  4 15:12:46.837820:   AES_CCM_12         {256,192,*128} IKEv1:     ESP     IKEv2:     ESP     FIPS              aes_ccm_b
Aug  4 15:12:46.837826:   AES_CCM_8          {256,192,*128} IKEv1:     ESP     IKEv2:     ESP     FIPS              aes_ccm_a
Aug  4 15:12:46.837831:   3DES_CBC           [*192]         IKEv1: IKE ESP     IKEv2: IKE ESP     FIPS NSS(CBC)     3des
Aug  4 15:12:46.837837:   CAMELLIA_CTR       {256,192,*128} IKEv1:     ESP     IKEv2:     ESP                      
Aug  4 15:12:46.837843:   CAMELLIA_CBC       {256,192,*128} IKEv1: IKE ESP     IKEv2: IKE ESP          NSS(CBC)     camellia
Aug  4 15:12:46.837849:   AES_GCM_16         {256,192,*128} IKEv1:     ESP     IKEv2: IKE ESP     FIPS NSS(GCM)     aes_gcm, aes_gcm_c
Aug  4 15:12:46.837855:   AES_GCM_12         {256,192,*128} IKEv1:     ESP     IKEv2: IKE ESP     FIPS NSS(GCM)     aes_gcm_b
Aug  4 15:12:46.837861:   AES_GCM_8          {256,192,*128} IKEv1:     ESP     IKEv2: IKE ESP     FIPS NSS(GCM)     aes_gcm_a
Aug  4 15:12:46.837867:   AES_CTR            {256,192,*128} IKEv1: IKE ESP     IKEv2: IKE ESP     FIPS NSS(CTR)     aesctr
Aug  4 15:12:46.837872:   AES_CBC            {256,192,*128} IKEv1: IKE ESP     IKEv2: IKE ESP     FIPS NSS(CBC)     aes
Aug  4 15:12:46.837878:   NULL_AUTH_AES_GMAC {256,192,*128} IKEv1:     ESP     IKEv2:     ESP     FIPS              aes_gmac
Aug  4 15:12:46.837883:   NULL               []             IKEv1:     ESP     IKEv2:     ESP                      
Aug  4 15:12:46.837889:   CHACHA20_POLY1305  [*256]         IKEv1:             IKEv2: IKE ESP          NSS(AEAD)    chacha20poly1305
Aug  4 15:12:46.837892: Hash algorithms:
Aug  4 15:12:46.837896:   MD5                               IKEv1: IKE         IKEv2:                  NSS         
Aug  4 15:12:46.837901:   SHA1                              IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha
Aug  4 15:12:46.837906:   SHA2_256                          IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha2, sha256
Aug  4 15:12:46.837910:   SHA2_384                          IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha384
Aug  4 15:12:46.837915:   SHA2_512                          IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha512
Aug  4 15:12:46.837919:   IDENTITY                          IKEv1:             IKEv2:             FIPS             
Aug  4 15:12:46.837922: PRF algorithms:
Aug  4 15:12:46.837927:   HMAC_MD5                          IKEv1: IKE         IKEv2: IKE              native(HMAC) md5
Aug  4 15:12:46.837931:   HMAC_SHA1                         IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha, sha1
Aug  4 15:12:46.837936:   HMAC_SHA2_256                     IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha2, sha256, sha2_256
Aug  4 15:12:46.837950:   HMAC_SHA2_384                     IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha384, sha2_384
Aug  4 15:12:46.837955:   HMAC_SHA2_512                     IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha512, sha2_512
Aug  4 15:12:46.837959:   AES_XCBC                          IKEv1:             IKEv2: IKE              native(XCBC) aes128_xcbc
Aug  4 15:12:46.837962: Integrity algorithms:
Aug  4 15:12:46.837966:   HMAC_MD5_96                       IKEv1: IKE ESP AH  IKEv2: IKE ESP AH       native(HMAC) md5, hmac_md5
Aug  4 15:12:46.837984:   HMAC_SHA1_96                      IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS          sha, sha1, sha1_96, hmac_sha1
Aug  4 15:12:46.837995:   HMAC_SHA2_512_256                 IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS          sha512, sha2_512, sha2_512_256, hmac_sha2_512
Aug  4 15:12:46.837999:   HMAC_SHA2_384_192                 IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS          sha384, sha2_384, sha2_384_192, hmac_sha2_384
Aug  4 15:12:46.838005:   HMAC_SHA2_256_128                 IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS          sha2, sha256, sha2_256, sha2_256_128, hmac_sha2_256
Aug  4 15:12:46.838008:   HMAC_SHA2_256_TRUNCBUG            IKEv1:     ESP AH  IKEv2:         AH                   
Aug  4 15:12:46.838014:   AES_XCBC_96                       IKEv1:     ESP AH  IKEv2: IKE ESP AH       native(XCBC) aes_xcbc, aes128_xcbc, aes128_xcbc_96
Aug  4 15:12:46.838018:   AES_CMAC_96                       IKEv1:     ESP AH  IKEv2:     ESP AH  FIPS              aes_cmac
Aug  4 15:12:46.838023:   NONE                              IKEv1:     ESP     IKEv2: IKE ESP     FIPS              null
Aug  4 15:12:46.838026: DH algorithms:
Aug  4 15:12:46.838031:   NONE                              IKEv1:             IKEv2: IKE ESP AH  FIPS NSS(MODP)    null, dh0
Aug  4 15:12:46.838035:   MODP1536                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH       NSS(MODP)    dh5
Aug  4 15:12:46.838039:   MODP2048                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh14
Aug  4 15:12:46.838044:   MODP3072                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh15
Aug  4 15:12:46.838048:   MODP4096                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh16
Aug  4 15:12:46.838053:   MODP6144                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh17
Aug  4 15:12:46.838057:   MODP8192                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh18
Aug  4 15:12:46.838061:   DH19                              IKEv1: IKE         IKEv2: IKE ESP AH  FIPS NSS(ECP)     ecp_256, ecp256
Aug  4 15:12:46.838066:   DH20                              IKEv1: IKE         IKEv2: IKE ESP AH  FIPS NSS(ECP)     ecp_384, ecp384
Aug  4 15:12:46.838070:   DH21                              IKEv1: IKE         IKEv2: IKE ESP AH  FIPS NSS(ECP)     ecp_521, ecp521
Aug  4 15:12:46.838074:   DH31                              IKEv1: IKE         IKEv2: IKE ESP AH       NSS(ECP)     curve25519
Aug  4 15:12:46.838077: IPCOMP algorithms:
Aug  4 15:12:46.838081:   DEFLATE                           IKEv1:     ESP AH  IKEv2:     ESP AH  FIPS             
Aug  4 15:12:46.838085:   LZS                               IKEv1:             IKEv2:     ESP AH  FIPS             
Aug  4 15:12:46.838089:   LZJH                              IKEv1:             IKEv2:     ESP AH  FIPS             
Aug  4 15:12:46.838093: testing CAMELLIA_CBC:
Aug  4 15:12:46.838096:   Camellia: 16 bytes with 128-bit key
Aug  4 15:12:46.838162:   Camellia: 16 bytes with 128-bit key
Aug  4 15:12:46.838201:   Camellia: 16 bytes with 256-bit key
Aug  4 15:12:46.838243:   Camellia: 16 bytes with 256-bit key
Aug  4 15:12:46.838280: testing AES_GCM_16:
Aug  4 15:12:46.838284:   empty string
Aug  4 15:12:46.838319:   one block
Aug  4 15:12:46.838352:   two blocks
Aug  4 15:12:46.838385:   two blocks with associated data
Aug  4 15:12:46.838424: testing AES_CTR:
Aug  4 15:12:46.838428:   Encrypting 16 octets using AES-CTR with 128-bit key
Aug  4 15:12:46.838464:   Encrypting 32 octets using AES-CTR with 128-bit key
Aug  4 15:12:46.838502:   Encrypting 36 octets using AES-CTR with 128-bit key
Aug  4 15:12:46.838541:   Encrypting 16 octets using AES-CTR with 192-bit key
Aug  4 15:12:46.838576:   Encrypting 32 octets using AES-CTR with 192-bit key
Aug  4 15:12:46.838613:   Encrypting 36 octets using AES-CTR with 192-bit key
Aug  4 15:12:46.838651:   Encrypting 16 octets using AES-CTR with 256-bit key
Aug  4 15:12:46.838687:   Encrypting 32 octets using AES-CTR with 256-bit key
Aug  4 15:12:46.838724:   Encrypting 36 octets using AES-CTR with 256-bit key
Aug  4 15:12:46.838763: testing AES_CBC:
Aug  4 15:12:46.838766:   Encrypting 16 bytes (1 block) using AES-CBC with 128-bit key
Aug  4 15:12:46.838801:   Encrypting 32 bytes (2 blocks) using AES-CBC with 128-bit key
Aug  4 15:12:46.838841:   Encrypting 48 bytes (3 blocks) using AES-CBC with 128-bit key
Aug  4 15:12:46.838881:   Encrypting 64 bytes (4 blocks) using AES-CBC with 128-bit key
Aug  4 15:12:46.838928: testing AES_XCBC:
Aug  4 15:12:46.838932:   RFC 3566 Test Case 1: AES-XCBC-MAC-96 with 0-byte input
Aug  4 15:12:46.839126:   RFC 3566 Test Case 2: AES-XCBC-MAC-96 with 3-byte input
Aug  4 15:12:46.839291:   RFC 3566 Test Case 3: AES-XCBC-MAC-96 with 16-byte input
Aug  4 15:12:46.839444:   RFC 3566 Test Case 4: AES-XCBC-MAC-96 with 20-byte input
Aug  4 15:12:46.839600:   RFC 3566 Test Case 5: AES-XCBC-MAC-96 with 32-byte input
Aug  4 15:12:46.839756:   RFC 3566 Test Case 6: AES-XCBC-MAC-96 with 34-byte input
Aug  4 15:12:46.839937:   RFC 3566 Test Case 7: AES-XCBC-MAC-96 with 1000-byte input
Aug  4 15:12:46.840373:   RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 16)
Aug  4 15:12:46.840529:   RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 10)
Aug  4 15:12:46.840698:   RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 18)
Aug  4 15:12:46.840990: testing HMAC_MD5:
Aug  4 15:12:46.840997:   RFC 2104: MD5_HMAC test 1
Aug  4 15:12:46.841200:   RFC 2104: MD5_HMAC test 2
Aug  4 15:12:46.841390:   RFC 2104: MD5_HMAC test 3
Aug  4 15:12:46.841582: testing HMAC_SHA1:
Aug  4 15:12:46.841585:   CAVP: IKEv2 key derivation with HMAC-SHA1
Aug  4 15:12:46.842055: 8 CPU cores online
Aug  4 15:12:46.842062: starting up 7 helper threads
Aug  4 15:12:46.842128: started thread for helper 0
Aug  4 15:12:46.842174: helper(1) seccomp security disabled for crypto helper 1
Aug  4 15:12:46.842188: started thread for helper 1
Aug  4 15:12:46.842219: helper(2) seccomp security disabled for crypto helper 2
Aug  4 15:12:46.842236: started thread for helper 2
Aug  4 15:12:46.842258: helper(3) seccomp security disabled for crypto helper 3
Aug  4 15:12:46.842269: started thread for helper 3
Aug  4 15:12:46.842296: helper(4) seccomp security disabled for crypto helper 4
Aug  4 15:12:46.842311: started thread for helper 4
Aug  4 15:12:46.842323: helper(5) seccomp security disabled for crypto helper 5
Aug  4 15:12:46.842346: started thread for helper 5
Aug  4 15:12:46.842369: helper(6) seccomp security disabled for crypto helper 6
Aug  4 15:12:46.842376: started thread for helper 6
Aug  4 15:12:46.842390: using Linux xfrm kernel support code on #1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023
Aug  4 15:12:46.842393: helper(7) seccomp security disabled for crypto helper 7
Aug  4 15:12:46.842707: selinux support is NOT enabled.
Aug  4 15:12:46.842728: systemd watchdog not enabled - not sending watchdog keepalives
Aug  4 15:12:46.843813: seccomp security disabled
Aug  4 15:12:46.848083: listening for IKE messages
Aug  4 15:12:46.848252: Kernel supports NIC esp-hw-offload
Aug  4 15:12:46.848534: adding UDP interface ovn-k8s-mp0 10.129.0.2:500
Aug  4 15:12:46.848624: adding UDP interface ovn-k8s-mp0 10.129.0.2:4500
Aug  4 15:12:46.848654: adding UDP interface br-ex 169.254.169.2:500
Aug  4 15:12:46.848681: adding UDP interface br-ex 169.254.169.2:4500
Aug  4 15:12:46.848713: adding UDP interface br-ex 10.0.0.8:500
Aug  4 15:12:46.848740: adding UDP interface br-ex 10.0.0.8:4500
Aug  4 15:12:46.848767: adding UDP interface lo 127.0.0.1:500
Aug  4 15:12:46.848793: adding UDP interface lo 127.0.0.1:4500
Aug  4 15:12:46.848824: adding UDP interface lo [::1]:500
Aug  4 15:12:46.848853: adding UDP interface lo [::1]:4500
Aug  4 15:12:46.851160: loading secrets from "/etc/ipsec.secrets"
Aug  4 15:12:46.851214: no secrets filename matched "/etc/ipsec.d/*.secrets"
Aug  4 15:12:47.053369: loading secrets from "/etc/ipsec.secrets"

sh-4.4# tcpdump -i any esp
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
^C
0 packets captured
sh-5.1# ovn-nbctl --no-leader-only get nb_global . ipsec
false
 

Version-Release number of selected component (if applicable):

openshift/cluster-network-operator#1874 

How reproducible:

Always

Steps to Reproduce:

1. Install OVN cluster and enable IPSec at runtime
2.
3.

Actual results:

no esp packets seen across the nodes

Expected results:

esp traffic should be seen across the nodes

Additional info:

 

OC mirror has been a GA product since OpenShift 4.11.

The goal of this feature is to solve any future customer request for new features or capabilities in OC mirror 

Overview

This epic is a simple tracker epic for the proposed work and analysis for 4.14 delivery

As an oc-mirror user, I would like mirrored operator catalogs to have valid caches that reflect the contents of the catalog (configs folder) based on the filtering done in the ImageSetConfig for that catalog

so that the catalog image starts efficiently in a cluster.

Tasks:

  • white-out /tmp on all manifests (per platform)
  • Recreate the cache under /tmp/cache using
    • extract the whole catalog
    • use the opm binary included in the extracted catalog to call (command line)
opm serve /configs --cache-dir /tmp/cache --cache-only
  • Create a new layer from /configs and /tmp/cache
    • the /tmp is compatible with all platforms
  • oc-mirror should not change the CMD nor ENTRYPOINT of the image
  • Rebuild catalog image up to the index (manifest list)

Acceptance criteria:

  • Run the catalog container with command opm serve <configDir> --cache-dir=<cacheDir> --cache-only --cache-enforce-integrity to verify the integrity of the cache
  • 4.14 catalogs mirrored with oc-mirror v4.14 run correctly in a cluster
    • when mirrored with mirrorToMirror workflow
    • when mirrored with mirrorToMirror workflow with --include-oci-local-catalogs
    • when mirrored with mirrorToDisk + diskToMirror workflow
  • 4.14 catalogs mirrored with oc-mirror v4.14 use the pre-computed cache (not sure how to test this)
  • catalogs <= 4.13 mirrored with oc-mirror v4.14 run correctly in a cluster (this is not something we publish as supported)
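
The two opm invocations named in the tasks and acceptance criteria above can be assembled as argument lists, for example to run them from a test harness. This is a minimal sketch: the flags are the ones quoted in this epic, while the function names and default paths are illustrative assumptions. Nothing is executed here.

```python
def opm_rebuild_cache_argv(config_dir: str = "/configs",
                           cache_dir: str = "/tmp/cache") -> list:
    """Command that regenerates the cache from the filtered catalog."""
    return ["opm", "serve", config_dir, "--cache-dir", cache_dir, "--cache-only"]


def opm_verify_cache_argv(config_dir: str, cache_dir: str) -> list:
    """Command from the acceptance criteria that checks cache integrity."""
    return [
        "opm", "serve", config_dir,
        f"--cache-dir={cache_dir}",
        "--cache-only",
        "--cache-enforce-integrity",
    ]
```

An argument list of this shape can be handed to `subprocess.run(argv, check=True)` inside the catalog container, so that a non-zero exit from opm surfaces as an exception.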

Description of problem:

Customers were able to limit the nested repository path with "oc adm catalog mirror" by using the "--max-components" argument, but the "oc-mirror" binary has no equivalent option, even though we recommend "oc-mirror" for mirroring. For example:
Mirroring works when the destination looks like this:
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy
Mirroring fails with 401 unauthorized when one more nested path level is added:
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz

Version-Release number of selected component (if applicable):

 

How reproducible:

The issue can be reproduced with any registry that does not support deeply nested repository paths.

Steps to Reproduce:

1. Create an imageset configuration to mirror any operator

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: ./oc-mirror-metadata
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
    packages:
    - name: local-storage-operator
      channels:
      - name: stable

2. Mirror to a registry that does not support deeply nested repository paths. Here it is GitLab, which does not support nesting beyond 3 levels deep.

oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz

This mirroring fails with a 401 unauthorized error.
 
3. If you mirror the same imageset with one path level removed, it works without any issues, as shown here:

oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy 

Actual results:

 

Expected results:

Need an alternative to the "--max-components" option to limit the nested path in "oc-mirror".
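
To make the request concrete, the sketch below counts the path components of a destination and shows one possible way of capping them. The flattening rule (joining the overflow components with a hyphen) is an assumption for illustration only; it is not confirmed oc-mirror behavior, and the image path in the usage note is hypothetical.

```python
def repo_components(dest: str):
    """Split 'docker://host/a/b' into (host, ['a', 'b'])."""
    prefix = "docker://"
    ref = dest[len(prefix):] if dest.startswith(prefix) else dest
    host, _, path = ref.partition("/")
    return host, [p for p in path.split("/") if p]


def limit_components(components: list, max_components: int) -> list:
    """Cap the number of path components (assumed rule: join overflow with '-')."""
    if max_components < 1:
        raise ValueError("max_components must be at least 1")
    if len(components) <= max_components:
        return list(components)
    head = components[:max_components - 1]
    tail = "-".join(components[max_components - 1:])
    return head + [tail]
```

For the failing destination above, `repo_components` yields three components (xxx, yyy, zzz). Appending a two-level image path such as openshift4/ose-local-storage-operator makes five, which a limit of 3 would collapse into xxx/yyy/zzz-openshift4-ose-local-storage-operator under the assumed rule.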

Additional info:

 

Proposed title of this feature request

Achieve feature parity for recently introduced functionality for all modes of operation

Nature and description of the request

Currently there are gaps in functionality within oc mirror that we would like addressed.

1. Support oci: references within mirror.operators[].catalog in an ImageSetConfiguration when running in all modes of operation with the full functionality provided by oc mirror.

Currently oci: references such as the following are allowed only in limited circumstances:

mirror:
   operators:
   - catalog: oci:///tmp/oci/ocp11840
   - catalog: icr.io/cpopen/ibm-operator-catalog
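
The two kinds of catalog reference in the snippet above can be told apart by their prefix. The following is a minimal sketch of that distinction, not the oc-mirror parser; `classify_catalog` is a hypothetical helper.

```python
OCI_PREFIX = "oci://"


def classify_catalog(ref: str):
    """Return ('oci', local_path) for oci: references, else ('docker', ref)."""
    if ref.startswith(OCI_PREFIX):
        # The remainder after the scheme is the on-disk OCI layout path.
        return ("oci", ref[len(OCI_PREFIX):])
    return ("docker", ref)
```

Applied to the two entries above, the first resolves to the local path /tmp/oci/ocp11840 and the second is treated as a registry image reference.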
 

Currently supported scenarios

  • Mirror to Mirror

In this mode of operation the images are fetched from the oci: reference rather than being pulled from a source docker image repository. These catalogs are processed through similar (yet different) mechanisms compared to docker image references. The end result in this scenario is that the catalog is potentially modified and images (i.e. catalog, bundle, related images, etc.) are pushed to their final docker image repository. This provides the full capabilities offered by oc mirror (e.g. catalog "filtering", image pruning, metadata manipulation to keep track of what has been mirrored, etc.)

Desired scenarios
In the following scenarios we would like oci: references to be processed in a similar way to how docker references are handled (as close as possible anyway given the different APIs involved). Ultimately we want oci: catalog references to provide the full set of functionality currently available for catalogs provided as a docker image reference. In other words we want full feature parity (e.g. catalog "filtering", image pruning, metadata manipulation to keep track of what has been mirrored, etc.)

  • Mirror to Disk

In this mode of operation the images are fetched from the oci: reference rather than being pulled from a docker image repository. These catalogs are processed through similar yet different mechanisms compared to docker image references. The end result of this scenario is that all mappings and catalogs are packaged into tar archives (i.e. the "imageset").

  • Disk to Mirror

In this mode of operation the tar archives (i.e. the "imageset") are processed via the "publish mechanism" which means unpacking the tar archives, processing the metadata, pruning images, rebuilding catalogs, and pushing images to their destination. In theory if the mirror-to-disk scenario is handled properly, then this mode should "just work".

Below the line was the original RFE for requesting the OCI feature and is only provided for reference.

 

Goal:
As a cluster administrator, I want OpenShift to include a recent HAProxy version, so that I have the latest available performance and security fixes.  

 Description:
We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release.  This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.   

For OpenShift 4.13, this means bumping to 2.6.  

As a cluster administrator, 

I want OpenShift to include a recent HAProxy version, 

so that I have the latest available performance and security fixes.  

 

We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release.  This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.   

For OpenShift 4.14, this means bumping to 2.6.  

Bump the HAProxy version in dist-git so that OCP 4.13 ships HAProxy 2.6.13, with this patch added on top: https://git.haproxy.org/?p=haproxy-2.6.git;a=commit;h=2b0aafdc92f691bc4b987300c9001a7cc3fb8d08. The patch fixes the segfault that was being tracked as OCPBUGS-13232.

This patch is in HAProxy 2.6.14, so we can stop carrying the patch once we bump to HAProxy 2.6.14 or newer in a subsequent OCP release.
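
The rule stated above (the carried patch can be dropped from 2.6.14 onward) amounts to a simple version comparison. The sketch below is illustrative; the helper names are hypothetical and it assumes plain dotted numeric version strings.

```python
def parse_version(version: str) -> tuple:
    """Turn '2.6.13' into (2, 6, 13) for ordered comparison."""
    return tuple(int(part) for part in version.split("."))


def can_drop_carried_patch(haproxy_version: str) -> bool:
    """True once the shipped HAProxy is 2.6.14 or newer, per the note above."""
    return parse_version(haproxy_version) >= (2, 6, 14)
```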

Feature Overview (aka. Goal Summary)  

Tang-enforced, network-bound disk encryption has been available in OpenShift for some time, but all intended Tang-endpoints contributing unique key material to the process must be reachable during RHEL CoreOS provisioning in order to complete deployment.

If a user wants to require that 3 of 6 Tang servers be reachable, then all 6 must be reachable during the provisioning process. This might not be possible due to maintenance, outage, or simply network policy during deployment. 

Enabling offline provisioning for first boot will help all of these scenarios.

 

Goals (aka. expected user outcomes)

The user can now provision a cluster with some or none of the Tang servers being reachable on first boot. Second boot, of course, will be subject to the Tang requirements being configured.
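
The difference between the old and new first-boot behavior can be modeled with two small predicates. This is a sketch of the rules as described in this feature, not of the Ignition or Clevis implementation; the function and parameter names are hypothetical.

```python
def can_provision(reachable: int, total: int, offline_provisioning: bool) -> bool:
    """First boot: without offline provisioning, every Tang server must answer."""
    if offline_provisioning:
        return True
    return reachable == total


def can_unlock(reachable: int, threshold: int) -> bool:
    """Later boots: the configured threshold of Tang servers must answer."""
    return reachable >= threshold
```

For the 3-of-6 example, `can_provision(5, 6, False)` is false because one server is down, while `can_provision(0, 6, True)` succeeds. On second boot, `can_unlock(3, 3)` holds and `can_unlock(2, 3)` does not.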

Done when:

  • Ignition spec default has been updated to 3.4
  • reconcile field (dependent on ignition 3.4)
  • consider Tang rotation? (write another epic)

This requires messy/complex work of grepping through for prior references to ignition and updating golang types that reference other versions.

Assumption that existing tests are sufficient to catch discrepancies. 

Goal

Allow the OpenShift installer to point to an existing OVA image stored in vSphere, replacing the current method that uploads the OVA template every time an OpenShift cluster is installed.

Why is this important?

This is an improvement that makes the installation more efficient by not having to upload an OVA from where openshift-install is running every time a cluster is installed, saving time and bandwidth. For example, if an administrator is installing over a VPN, the OVA is uploaded through it to the target cluster every time an OpenShift cluster is installed. Having a centralised OVA ready to use makes the administration process more efficient, because new clusters can be installed without uploading the OVA from where the installer is run.

Epic Goal

  • To allow the use of a pre-existing RHCOS virtual machine or template via the IPI installer.

Why is this important?

  • It is a very common workflow in vSphere to upload an OVA. In the disconnected scenario, the requirement of using a local web server, copying an OVA to that web server and then running the installer is a poor experience.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Goal

  • Enable platform=external to support onboarding new partners, e.g. Oracle Cloud Infrastructure and VCSP partners.
  • Create a new platform type, working name "External", that will signify when a cluster is deployed on a partner infrastructure where core cluster components have been replaced by the partner. “External” is different from our current platform types in that it will signal that the infrastructure is specifically not “None” or any of the known providers (eg AWS, GCP, etc). This will allow infrastructure partners to clearly designate when their OpenShift deployments contain components that replace the core Red Hat components.

This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.

To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).

OCPBU-5: Phase 1

  • Write platform “External” enhancement.
  • Evaluate changes to cluster capability annotations to ensure coverage for all replaceable components.
  • Meet with component teams to plan specific changes that will allow for supplement or replacement under platform "External".
  • Start implementing changes towards Phase 2.

OCPBU-510: Phase 2

  • Update OpenShift API with new platform and ensure all components have updated dependencies.
  • Update capabilities API to include coverage for all replaceable components.
  • Ensure all Red Hat operators tolerate the "External" platform and treat it the same as "None" platform.

OCPBU-329: Phase.Next

  • TBD

Why is this important?

  • As partners begin to supplement OpenShift's core functionality with their own platform specific components, having a way to recognize clusters that are in this state helps Red Hat created components to know when they should expect their functionality to be replaced or supplemented. Adding a new platform type is a significant data point that will allow Red Hat components to understand the cluster configuration and make any specific adjustments to their operation while a partner's component may be performing a similar duty.
  • The new platform type also helps with support to give a clear signal that a cluster has modifications to its core components that might require additional interaction with the partner instead of Red Hat. When combined with the cluster capabilities configuration, the platform "External" can be used to positively identify when a cluster is being supplemented by a partner, and which components are being supplemented or replaced.

Scenarios

  1. A partner wishes to replace the Machine controller with a custom version that they have written for their infrastructure. Setting the platform to "External" and advertising the Machine API capability gives a clear signal to the Red Hat created Machine API components that they should start the infrastructure generic controllers but not start a Machine controller.
  2. A partner wishes to add their own Cloud Controller Manager (CCM) written for their infrastructure. Setting the platform to "External" and advertising the CCM capability gives a clear signal to the Red Hat created CCM operator that the cluster should be configured for an external CCM that will be managed outside the operator. Although the Red Hat operator will not provide this functionality, it will configure the cluster to expect a CCM.

Acceptance Criteria

Phase 1

  • Partners can read "External" platform enhancement and plan for their platform integrations.
  • Teams can view jira cards for component changes and capability updates and plan their work as appropriate.

Phase 2

  • Components running in cluster can detect the “External” platform through the Infrastructure config API
  • Components running in cluster react to “External” platform as if it is “None” platform
  • Partners can disable any of the platform specific components through the capabilities API
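
The Phase 2 behaviour can be sketched as follows, in Python for illustration only. This is a minimal sketch, assuming the Infrastructure config object has already been fetched and decoded into a dictionary and that the platform type is read from status.platformStatus.type; the helper name effective_platform is hypothetical and is not part of any OpenShift library.

```python
def effective_platform(infrastructure: dict) -> str:
    """Return the platform type a component should act on.

    Assumption: the platform type lives at status.platformStatus.type.
    In Phase 2, "External" is handled exactly like "None".
    """
    status = infrastructure.get("status", {})
    platform = status.get("platformStatus", {}).get("type", "None")
    return "None" if platform == "External" else platform
```

A component that already has a code path for platform "None" would then need no further branching to tolerate "External".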

Phase 3

  • Components running in cluster react to the “External” platform based on their function.
    • for example, the Machine API Operator needs to run a set of controllers that are platform agnostic when running in platform “External” mode.
    • the specific component reactions are difficult to predict currently; these criteria could change based on the output of phase 1.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Identifying OpenShift Components for Install Flexibility

Open questions:

  1. Phase 1 requires talking with several component teams; the specific action needed will depend on the needs of each component. At the least, the components need to treat platform "External" as "None", but there could be more changes depending on the component (e.g., Machine API Operator running non-platform-specific controllers).

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Enable users of the External platform type to specify when they will run their own CCM

Why is this important?

  • For partners wishing to use components that require zonal awareness provided by the infrastructure (for example CSI drivers), they will need to exercise their own cloud controller managers. This epic is about adding the proper configuration to OpenShift to allow users of External platform types to run their own CCMs.

Scenarios

  1. As a Red Hat partner, I would like to deploy OpenShift with my own CSI driver. To do this I need my CCM deployed as well. Having a way to instruct OpenShift to expect an external CCM deployment would allow me to do this.

Acceptance Criteria

  • CI - A new periodic test based on the External platform test would be ideal
  • Release Technical Enablement - Provide necessary release enablement details and documents.
    • Update docs.ci.openshift.org with CCM docs

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://github.com/openshift/enhancements/blob/master/enhancements/cloud-integration/infrastructure-external-platform-type.md#api-extensions
  2. https://github.com/openshift/api/pull/1409

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a Red Hat Partner installing OpenShift using the External platform type, I would like to install my own Cloud Controller Manager (CCM). Having a field in the Infrastructure configuration object to signal that I will install my own CCM, and that Kubernetes should be configured to expect an external CCM, will allow me to run my own CCM on new OpenShift deployments.

Background

This work has been defined in the External platform enhancement, and had previously been part of openshift/api. The CCM API pieces were removed for the 4.13 release of OpenShift to ensure that we did not ship unused portions of the API.

In addition to the API changes, library-go will need an update to the IsCloudProviderExternal function to detect whether the External platform is selected and whether the CCM should be enabled for external mode.

We will also need to check the ObserveCloudVolumePlugin function to ensure that it is not affected by the external changes and that it continues to use the external volume plugin.

After updating openshift/library-go, it will need to be re-vendored into the MCO, KCMO, and CCCMO (although the last is less critical than the other two).
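
The External-platform branch of that check might look like the sketch below. It is written in Python for illustration only; the real function lives in openshift/library-go and is written in Go. The field path external.cloudControllerManager.state and the value "External" are assumptions taken from the enhancement's direction, and the helper name is hypothetical. Handling of every other platform is deliberately left out.

```python
def external_platform_expects_ccm(platform_status: dict) -> bool:
    """Report whether an External-platform cluster should expect a CCM.

    Assumptions: the new API field is read from
    platform_status["external"]["cloudControllerManager"]["state"], and the
    value "External" means the partner will run their own CCM.
    This sketch only covers the External platform; for any other platform
    type it returns False rather than reproducing the existing logic.
    """
    if platform_status.get("type") != "External":
        return False
    external = platform_status.get("external") or {}
    ccm = external.get("cloudControllerManager") or {}
    return ccm.get("state") == "External"
```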

Steps

  • update openshift/api with new CCM fields (re-revert #1409)
  • revendor api to library-go
  • update IsCloudProviderExternal in library-go to observe the new API fields
  • investigate ObserveCloudVolumePlugin to see if it requires changes
  • revendor library-go to MCO, KCMO, and CCCMO
  • update enhancement doc to reflect state

Stakeholders

  • openshift eng
  • oracle cloud install effort

Definition of Done

  • openshift can be installed with External platform type with kubelet, and related components, using the external cloud provider flags.
  • Docs
  • this will need to be documented in the API and as part of OCPCLOUD-1581
  • Testing
  • this will need validation through unit tests; integration testing may be difficult, as we will need a new e2e built off the external platform with a CCM

User Story

As a user I want to use the openshift installer to create clusters of platform type External so that I can use openshift more effectively on a partner provider platform.

Background

To fully support the External platform type for partners and users, it will be useful to be able to have the installer understand when it sees the external platform type in the install-config.yaml, and then to properly populate the resulting infrastructure config object with the external platform type and platform name.

As defined in https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L241 , the external platform type allows the user to specify a name for the platform. This card is about updating the installer so that a user can provide both the external type and a platform name that will be expressed in the infrastructure manifest.

Aside from this information, the installer should continue with a normal platform "None" installation.
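
As a rough illustration of the mapping described above, the following Python sketch turns a decoded install-config.yaml into the platform portion of an Infrastructure manifest. The install-config key names (platform.external.platformName), the "Unknown" fallback, and the function name are assumptions made for this example and are not taken from the installer source.

```python
def infrastructure_from_install_config(install_config: dict) -> dict:
    """Build a minimal Infrastructure manifest for the External platform.

    Assumptions: the install config carries platform.external.platformName,
    and a missing name falls back to "Unknown".
    """
    platform = install_config.get("platform", {})
    if "external" not in platform:
        raise ValueError("this sketch only covers the external platform")
    external = platform["external"] or {}
    name = external.get("platformName", "Unknown")
    return {
        "apiVersion": "config.openshift.io/v1",
        "kind": "Infrastructure",
        "metadata": {"name": "cluster"},
        "spec": {
            "platformSpec": {
                "type": "External",
                "external": {"platformName": name},
            }
        },
    }
```

Everything else in the installation would proceed as for platform "None".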

Steps

  • update installer to allow platform "External" specified in the install-config.yaml
  • update installer to allow the platform name to be specified as part of the External platform configuration

Stakeholders

  • openshift cloud infra team
  • openshift installer team
  • openshift assisted installer team

Definition of Done

  • user can specify external platform in the install-config.yaml and have a cluster with External platform type and a name for the platform.
  • cluster installs as expected for platform external (similar to none)
  • Docs
  • Testing
  • this feature should allow us to update our external platform tests to make the installation easier; tests should be updated to include this methodology

Feature Overview

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

 

MVP: bring the off-cluster build environment on-cluster

    • Repo control
      • rpm-ostree needs repo management commands
    • Entitlement management

In the context of the Machine Config Operator (MCO) in Red Hat OpenShift, on-cluster builds refer to the process of building an OS image directly on the OpenShift cluster, rather than building them outside the cluster (such as on a local machine or continuous integration (CI) pipeline) and then making a configuration change so that the cluster uses them. By doing this, we enable cluster administrators to have more control over the contents and configuration of their clusters’ OS image through a familiar interface (MachineConfigs and in the future, Dockerfiles).

This is the "consumption" side of the security – rpm-ostree needs to be able to retrieve images from the internal registry seamlessly.

This will involve setting up (or using some existing) pull secrets, and then getting them to the proper location on disk so that rpm-ostree can use them to pull images.

At the layering sync meeting on Thursday, August 10th, it was decided that for this to be considered ready for Dev / Tech Preview, cluster admins need a way to inject custom Dockerfiles into their on-cluster builds.

 

(Commentary: It was also decided 4 months ago that this was not an MVP requirement in https://docs.google.com/document/d/1QSsq0mCgOSUoKZ2TpCWjzrQpKfMUL9thUFBMaPxYSLY/edit#heading=h.jqagm7kwv0lg. And quite frankly, this requirement should have been known at that point in time as opposed to the week before tech preview.)

The first phase of the layering effort involved creating a BuildController, whose job is to start and manage builds using the OpenShift Build API. We can use the work done to create the BuildController as the basis for our MVP. However, what we need from BuildController right now is less than BuildController currently provides. With that in mind, we need to remove certain parts of BuildController to create a more streamlined and simpler implementation ideal for an MVP.

 

Done when a version of BuildController is landed which does the following things:

  • Listens for all MachineConfigPool events. If a MachineConfigPool has a specific label or annotation (e.g., machineconfiguration.openshift.io/layering-enabled), the BuildController should retrieve the latest rendered MachineConfig associated with the MachineConfigPool, generate a series of inputs to a builder backend (for now, the OpenShift Build API can be the first backend), then update the MachineConfigPool with the outcome of that action. In the case of a successful build, the MachineConfigPool should be updated with the image pullspec for the newly-built image. For now, this can come in the form of an annotation or a label (e.g., machineconfiguration.openshift.io/desired-os-image = "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/coreos@sha256:abcdef1234567890..."). But eventually, it should be a Status field on the MachineConfigPool object.
  • Reads from a ConfigMap which contains the following items (let's call it machine-os-builder-config for now):
    • Name of the base OS image pull secret.
    • Name of the final OS image push secret.
    • Target container registry and org / repo information for where to push the final OS image (e.g., image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/coreos).
  • All functionality around managing ImageStreams and OpenShift Builds is removed or decoupled. In the case of the OpenShift Build functionality, it will be decoupled instead of completely removed. Additionally, it should not use BuildConfigs. It should instead create and manage image Build objects directly.
  • Use contexts for handling shutdowns and timeouts.
  • Unit tests are written for the major BuildController functionalities using either FakeClient or EnvTest.
  • The modified BuildController and its tests are merged into the master branch of the MCO. Note: This does not mean that it will be immediately active in the MCO's execution path. However, tests will be executed in CI.
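
The opt-in check and the success update from the first bullet can be sketched as below. This is an illustration in Python, not the MCO's Go implementation; the MachineConfigPool is modelled as a plain dictionary, the two keys are the example names given in the text, and both function names are hypothetical.

```python
import copy

LAYERING_KEY = "machineconfiguration.openshift.io/layering-enabled"
DESIRED_IMAGE_KEY = "machineconfiguration.openshift.io/desired-os-image"


def is_layering_enabled(pool: dict) -> bool:
    """True when the pool carries the opt-in key as a label or an annotation."""
    meta = pool.get("metadata", {})
    return (LAYERING_KEY in meta.get("labels", {})
            or LAYERING_KEY in meta.get("annotations", {}))


def record_build_success(pool: dict, target_repo: str, digest: str) -> dict:
    """Return a copy of the pool annotated with the digest-pinned pullspec.

    target_repo stands in for the registry and org/repo value that the text
    says is read from the machine-os-builder-config ConfigMap.
    """
    updated = copy.deepcopy(pool)
    annotations = updated.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[DESIRED_IMAGE_KEY] = f"{target_repo}@{digest}"
    return updated
```

Returning a copy rather than mutating the input keeps the sketch easy to unit test, in the spirit of the FakeClient and EnvTest requirement above.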

The second phase of the layering effort involved creating a BuildController, whose job is to start and manage builds of OS images. While it should be able to perform those functions on its own, getting the built OS image onto each of the cluster nodes involves modifying other parts of the MCO to be layering-aware. To that end, there are three pieces involved, some of which will require modification:

Render Controller

Right now, the render controller listens for incoming MachineConfig changes. It generates the rendered config which is comprised of all of the MachineConfigs for a given MachineConfigPool. Once rendered, the Render Controller updates the MachineConfigPool to point to the new config. This portion of the MCO will not likely need any modification that I'm aware of at the moment.

Node Controller

The Node Controller listens for MachineConfigPool config changes. Whenever it identifies that a change has occurred, it applies the machineconfiguration.openshift.io/desiredConfig annotation to all the nodes in the targeted MachineConfigPool which causes the Machine Config Daemon (MCD) to apply the new configs. With this new layering mechanism, we'll need to add the additional annotation of machineconfiguration.openshift.io/desiredOSimage which will contain the fully-qualified pullspec for the new OS image (referenced by the image SHA256 sum). To be clear, we will not be replacing the desiredConfig annotation with the desiredOSimage annotation; both will still be used. This will allow Config Drift Monitor to continue to function the way it does with no modification required.

Machine Config Daemon

Right now, the MCD listens to Node objects for changes to the machineconfiguration.openshift.io/desiredConfig annotation. With the new desiredOSimage annotation being present, the MCD will need to skip the parts of the update loop which write files and systemd units to disk. Instead, it will skip directly to the rpm-ostree application phase (after making sure the correct pull secrets are in place, etc.).
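
The branching described for the Machine Config Daemon can be sketched as follows. This is a Python illustration under stated assumptions: node annotations are passed as a plain dictionary, and the step names returned are hypothetical labels for phases of the update loop, not real MCD function names.

```python
DESIRED_CONFIG = "machineconfiguration.openshift.io/desiredConfig"
DESIRED_OS_IMAGE = "machineconfiguration.openshift.io/desiredOSimage"


def plan_update(node_annotations: dict) -> list:
    """Return the ordered update steps for a node (step names are illustrative).

    When the desiredOSimage annotation is present, the file and systemd unit
    writing phase is skipped and the daemon goes straight to applying the
    image with rpm-ostree, after checking pull secrets.
    """
    if DESIRED_OS_IMAGE in node_annotations:
        return ["ensure-pull-secrets", "rpm-ostree-apply"]
    if DESIRED_CONFIG in node_annotations:
        return ["write-files-and-units", "rpm-ostree-apply"]
    return []
```

Because both annotations remain on a layered node, the desiredOSimage check comes first.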

 

Done When:

  • The above modifications are made.
  • Each modification has been done with appropriate unit tests where feasible.

To speed development for on-cluster builds and avoid a lot of complex code paths, the decision was made to put all functionality related to building OS images and managing internal registries into a separate binary within the MCO.

Eventually, this binary will be responsible for running the productionized BuildController and know how to respond to Machine OS Builder API objects. However, until the productionized BuildController and opt-in portions are ready, the first pass of this binary will be much simpler: For now, it can connect to the API server and print a "Hello World".

 

Done When:

  • We have a new binary under cmd/machine-os-builder. This binary will be built alongside the current MCO components and will be baked into the MCO image.
  • The Dockerfile, Makefile, and build scripts will need some modification so that they know how to build cmd/machine-os-builder.
  • A Deployment manifest is created under manifests/ which is set up to start up a single instance of the new binary though we don't want it to start up by default right now since it won't do anything useful.

Feature Overview

Goals

  • Support OpenShift to be deployed from day-0 on AWS Local Zones
  • Support an existing OpenShift cluster to deploy compute Nodes on AWS Local Zones (day-2)

AWS Local Zones support - feature delivered in phases:

  • Phase 0 (OCPPLAN-9630): Document how to create compute nodes on AWS Local Zones in day-0 (SPLAT-635)
  • Phase 1 (OCPBU-2): Create edge compute pool to generate MachineSets for nodes with NoSchedule taints when installing a cluster in an existing VPC with AWS Local Zone subnets (SPLAT-636)
  • Phase 2 (OCPBU-351): Installer automates network resources creation on Local Zone based on the edge compute pool (SPLAT-657)

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

Epic Goal

Fully automated installation creating subnets in AWS Local Zones when the zone names are added to the edge compute pool on install-config.yaml.

  • The installer should create the subnets on the Local Zones according to the configuration of the "edge" compute pool, provided on install-config.yaml 
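
To make the input concrete, the sketch below reads the Local Zone names from the "edge" compute pool of a decoded install-config.yaml. It is a Python illustration; the assumed layout (a compute list whose entries carry name and platform.aws.zones) and the helper name are assumptions for this example, and subnet creation itself is not modelled.

```python
def edge_zones(install_config: dict) -> list:
    """Return the zone names listed under the "edge" compute pool.

    Assumption: compute is a list of pools, each with a name and an optional
    platform.aws.zones list. Returns an empty list when no edge pool exists.
    """
    for pool in install_config.get("compute", []):
        if pool.get("name") == "edge":
            aws = (pool.get("platform") or {}).get("aws") or {}
            return list(aws.get("zones", []))
    return []
```

The installer would then create one subnet per returned zone.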

Why is this important?

  • Users can extend the presence of worker nodes closer to the metropolitan regions, where the users or on-premises workloads are running, decreasing the time to deliver their workloads to their clients.

Scenarios

  • As a cluster admin, I would like to install OpenShift clusters, extending the compute nodes to the Local Zones in my day-zero operations without needing to set up the network and compute dependencies, so I can speed up the edge adoption in my organization using OCP.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated.
  • CI - custom jobs should be added to test Local Zone provisioning
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The PR on the installer repo should be merged after being approved by the Installer team, QE, and docs
  • The product documentation has been created

Dependencies (internal and external)

  1. SPLAT-636 : install a cluster in existing VPC extending workers to Local Zones
  2. OCPBUGSM-46513 : Bug - Ingress Controller should not add Local Zones subnets to network routers/LBs (Classic/NLB)

Previous Work (Optional):

  1. Enhancement 1232
  2. SPLAT-636 : AWS Local Zones - Phase 1 IPI edge pool - Installer support to automatically create the MachineSets when installing in existing VPC

Open questions:

Done Checklist

Feature Overview

  • As a Cluster Administrator, I want to opt-out of certain operators at deployment time using any of the supported installation methods (UPI, IPI, Assisted Installer, Agent-based Installer) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a Cluster Administrator, I want to opt-in to previously-disabled operators (at deployment time) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a ROSA service administrator, I want to exclude/disable Cluster Monitoring when I deploy OpenShift with HyperShift, using any of the supported installation methods including the ROSA wizard in OCM and rosa cli, since I get cluster metrics from the control plane.  This configuration should be persisted not only through initial deployment but also through cluster lifecycle operations like upgrades.
  • As a ROSA service administrator, I want to exclude/disable Ingress Operator when I deploy OpenShift with HyperShift, using any of the supported installation methods including the ROSA wizard in OCM and rosa cli, as I want to use my preferred load balancer (i.e. AWS load balancer).  This configuration should be persisted not only through initial deployment but also through cluster lifecycle operations like upgrades.

Goals

  • Make it possible for customers and Red Hat teams producing OCP distributions/topologies/experiences to enable/disable some CVO components while still keeping their cluster supported.

Scenarios

  1. This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), supported topologies (standard HA, compact cluster, SNO), etc.
  2. Enabled/disabled configuration must persist throughout cluster lifecycle including upgrades.
  3. If there's any risk/impact of data loss or service unavailability (for Day 2 operations), the System must provide guidance on what the risks are and let the user decide whether the risk is worth taking.

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:

Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

  • CORS-1873 Installer to allow users to select OpenShift components to be included/excluded
  • OTA-555 Provide a way with CVO to allow disabling and enabling of operators
  • OLM-2415 Make the marketplace operator optional
  • SO-11 Make samples operator optional
  • METAL-162 Make cluster baremetal operator optional
  • OCPPLAN-8286 CI Job for disabled optional capabilities

Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

Phase 3 (OpenShift 4.13): OCPBU-117

  • OTA-554 Make oc aware of cluster capabilities
  • PSAP-741 Make Node Tuning Operator (including PAO controllers) optional

Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)

  • CCO-186 ccoctl support for credentialing optional capabilities
  • MCO-499 MCD should manage certificates via a separate, non-MC path (formerly IR-230 Make node-ca managed by CVO)
  • CNF-5642 Make cluster autoscaler optional
  • CNF-5643 - Make machine-api operator optional
  • WRKLDS-695 - Make DeploymentConfig API + controller optional
  • CNV-16274 OpenShift Virtualization on the Red Hat Application Cloud (not applicable)
  • CNF-9115 - Leverage Composable OpenShift feature to make control-plane-machine-set optional

Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly OCPBU-519)

  • OCPBU-352 Make Ingress Operator optional
  • BUILD-565 - Make Build v1 API + controller optional
  • OBSDA-242 Make Cluster Monitoring Operator optional
  • OCPVE-630 (formerly CNF-5647) Leverage Composable OpenShift feature to make image-registry optional (replaces IR-351 - Make Image Registry Operator optional)
  • CNF-9114 - Leverage Composable OpenShift feature to make olm optional
  • CNF-9118 - Leverage Composable OpenShift feature to make cloud-credential  optional
  • CNF-9119 - Leverage Composable OpenShift feature to make cloud-controller-manager optional

Phase 6 (OpenShift 4.16): OCPSTRAT-731

  • TBD

References

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

 

 

Per https://github.com/openshift/enhancements/pull/922 we need `oc adm release new` to parse the resource manifests for `capability` annotations and generate a yaml file that lists the valid capability names, to embed in the release image.

This file can be used by the installer to error or warn when the install config lists capabilities for enable/disable that are not valid capability names.
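
The two halves of this idea, collecting the valid names and checking an install config against them, can be sketched as below. This is a Python illustration rather than the oc or installer code; it assumes the annotation key is capability.openshift.io/name and that several names on one manifest are joined with "+", and both function names are hypothetical.

```python
CAPABILITY_ANNOTATION = "capability.openshift.io/name"


def known_capabilities(manifests: list) -> set:
    """Collect every capability name found in the manifests' annotations.

    Assumption: multiple capabilities on one manifest are joined with "+".
    """
    names = set()
    for manifest in manifests:
        annotations = manifest.get("metadata", {}).get("annotations", {})
        value = annotations.get(CAPABILITY_ANNOTATION, "")
        names.update(part for part in value.split("+") if part)
    return names


def unknown_capabilities(requested: list, manifests: list) -> list:
    """Return the requested names that no manifest declares, sorted."""
    return sorted(set(requested) - known_capabilities(manifests))
```

An installer could treat a non-empty result from unknown_capabilities as an error or a warning.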

 

Note: Moved a couple of cards from OTA-554 to this epic, as these cards are relatively lower priority for the 4.13 release and we could not mark them done.

oc adm release extract --included ... or some such, that only works when no release pullspec is given, where oc connects to the cluster to ask after the current release image (as it does today when you leave off a pullspec) but also collects FeatureGates and cluster profile and all that sort of stuff so it can write only the manifests it expects the CVO to be attempting to reconcile.

This would be narrowly useful for ccoctl (see CCO-178 and CCO-186), because with this extract option, ccoctl wouldn't need to try to reproduce "which of these CredentialsRequests manifests does the cluster actually want filled?" locally.

It also seems like it would be useful for anyone trying to get a better feel for what the CVO is up to in their cluster, for the same reason that it reduces distracting manifests that don't apply.

The downside is that if we screw up the inclusion logic, we could have oc diverging from the CVO, and end up increasing confusion instead of decreasing confusion. If we move the inclusion logic to library-go, that reduces the risk a bit, but there's always the possibility that users are using an oc that is older or newer than the cluster's CVO. Some way to have oc warn when the option is used but the version differs from the current CVO version would be useful, but possibly complicated to implement, unless we take shortcuts like assuming that the currently running CVO has a version matched to the ClusterVersion's status.desired target.
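
The shortcut mentioned at the end of that paragraph can be sketched as a simple comparison. This is a Python illustration only; it assumes the ClusterVersion object is available as a dictionary with status.desired.version, uses exact string equality as the skew test, and the function name and message wording are hypothetical.

```python
from typing import Optional


def skew_warning(oc_version: str, cluster_version: dict) -> Optional[str]:
    """Return a warning when oc and the cluster's desired version differ.

    Shortcut from the text: assume the running CVO matches the
    ClusterVersion's status.desired target. Returns None when the versions
    match or when the desired version cannot be determined.
    """
    desired = cluster_version.get("status", {}).get("desired", {}).get("version")
    if desired is None or desired == oc_version:
        return None
    return (f"warning: oc version {oc_version} differs from cluster version "
            f"{desired}; included manifests may not match what the CVO reconciles")
```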

Definition of done (more details in the OTA-692 spike comment):

  • Add a new --included flag to $ oc adm release extract --to <dir path> <pull-spec or version-number>. The --included flag filters extracted manifests to those that are expected to be included with the cluster. 
    • Move overrides handling here and here into library-go.

 

 here is a sketch of code which W. Trevor King suggested

While working on OTA-559, my oc#1237 broke JSON output, and needed a follow-up fix. To avoid destabilizing folks who consume the dev-tip oc, we should grow CI presubmits to exercise critical oc adm release ... pathways, to avoid that kind of accidental breakage.

Epic Goal

  • Add an optional capability that allows disabling the image registry operator entirely

Why is this important?

It is already possible to run a cluster with no instantiated image registry, but the image registry operator itself always runs.  This is an unnecessary use of resources for clusters that don't need/want a registry.  Making it possible to disable this will reduce the resource footprint as well as bug risks for clusters that don't need it, such as SNO and OKE.

 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated (we have an existing CI job that runs a cluster with all optional capabilities disabled.  Passing that job will require disabling certain image registry tests when the cap is disabled)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1.  MCO-499 must be completed first because we still need the CA management logic running even if the image registry operator is not running.

Previous Work (Optional):

  1. The optional cap architecture and guidance for adding a new capability is described here: https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To enable the MCO to replace the node-ca, the registry operator needs to provide its own CAs in isolation.

Currently, the registry provides its own CAs via the "image-registry-certificates" configmap. This configmap is a merge of the service ca, storage ca, and additionalTrustedCA (from images.config.openshift.io/cluster).

Because the MCO already has access to additionalTrustedCA, the new secret does not need to contain it.

 

ACCEPTANCE CRITERIA

TBD

  1. Proposed title of this feature request:

Update ETCD datastore encryption to use AES-GCM instead of AES-CBC

2. What is the nature and description of the request?

The current ETCD datastore encryption solution uses the aes-cbc cipher. This cipher is now considered "weak" and is susceptible to padding oracle attacks.  Upstream recommends using the AES-GCM cipher. AES-GCM will require automation to rotate secrets every 200k writes.

The cipher used is hard coded. 
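
The rotation requirement can be illustrated with a small write counter. This is a Python sketch under stated assumptions: the 200k threshold comes from the request text, while the class name, the idea of counting writes in-process, and the reset-on-rotation behaviour are hypothetical and say nothing about how the operators actually schedule key rotation.

```python
class KeyRotationTracker:
    """Count writes and signal when the encryption key should be rotated."""

    def __init__(self, limit: int = 200_000) -> None:
        self.limit = limit
        self.writes = 0

    def record_write(self) -> bool:
        """Count one write; return True when the limit is reached.

        The counter resets after signalling, on the assumption that the
        caller rotates the key immediately.
        """
        self.writes += 1
        if self.writes >= self.limit:
            self.writes = 0
            return True
        return False
```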

3. Why is this needed? (List the business requirements here).

Security conscious customers will not accept the presence and use of weak ciphers in an OpenShift cluster. Continuing to use the AES-CBC cipher will create friction in sales and, for existing customers, may result in OpenShift being blocked from being deployed in production. 

4. List any affected packages or components.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

The Kube APIserver is used to set the encryption of data stored in etcd. See https://docs.openshift.com/container-platform/4.11/security/encrypting-etcd.html

 

Today with OpenShift 4.11 or earlier, only aescbc is allowed as the encryption field type. 

 

RFE-3095 is asking that aesgcm (which is an updated and more recent type) be supported. Furthermore, RFE-3338 is asking for more customizability, which brings us to how we have implemented cipher customization with tlsSecurityProfile. See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html

 

 
Why is this important? (mandatory)

AES-CBC is considered a weak cipher.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

AES-GCM encryption was enabled in cluster-openshift-apiserver-operator and cluster-openshift-authentication-operator, but not in cluster-kube-apiserver-operator. When trying to enable aesgcm encryption in the apiserver config, the kas-operator produces an error saying that the aesgcm provider is not supported.

Feature Overview

Support the external platform type to allow agent-based installation on OCI, with a focus on https://www.oracle.com/cloud/cloud-at-customer/dedicated-region/faq/ for disconnected, on-prem environments.

Related / parent feature

OCPSTRAT-510 OpenShift on Oracle Cloud Infrastructure (OCI) with VMs

User Story:

As a user, I want to be able to:

  • generate the minimal ISO in the installer when the platform type is set to external/oci

so that I can achieve

  • successful cluster installation
  • custom agent features, such as the network TUI, remaining available when booting from the minimal ISO

Acceptance Criteria:

Description of criteria:

  • Upstream documentation

User Story:

As a user of the agent-based installer, I want to be able to:

  • validate the external platform type in the agent cluster install by providing the external platform type in the install-config.yaml

so that I can achieve

  • create agent artifacts (ISO, PXE files)

Acceptance Criteria:

Description of criteria:

  • install-config.yaml accepts the new platform type "external"
  • agent-based installer validates the supported platforms
  • agent ISO and PXE assets should be created successfully
  • Required k8s API support is added

User Story:

As a user of the agent-based installer, I want to be able to:

  • create agent ISO as well as PXE assets by providing the install-config.yaml

so that I can achieve

  • create a cluster for external cloud provider platform type (OCI)

Acceptance Criteria:

Description of criteria:

  • install-config.yaml accepts the new platform type "external"
  • validate install-config so that platformName can only be set to `oci` when platform is external 
  • agent-based installer validates the supported platforms
  • agent ISO and PXE assets should be created successfully
  • necessary unit tests and integration tests are added
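
The platformName criterion above can be sketched as a small validation function. This is a minimal illustration, not the installer's actual validation code (which is written in Go); the function name validate_platform and the error wording are hypothetical, and the install-config is assumed to be already parsed into a dict.

```python
def validate_platform(install_config: dict) -> list:
    """Return validation errors for the platform stanza (illustrative only)."""
    errors = []
    platform = install_config.get("platform") or {}
    if "external" in platform:
        # Criterion from the story: platformName may only be "oci"
        # when the platform type is external.
        name = (platform["external"] or {}).get("platformName")
        if name != "oci":
            errors.append(
                f"platform.external.platformName must be 'oci', got {name!r}"
            )
    return errors


print(validate_platform({"platform": {"external": {"platformName": "oci"}}}))
print(validate_platform({"platform": {"external": {"platformName": "other"}}}))
```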

Feature Overview (aka. Goal Summary)  

Support OpenShift installation in AWS Shared VPC [1] scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.

Goals (aka. expected user outcomes)

As a user, I need to use a Shared VPC [1] when installing OpenShift on AWS into an existing VPC. At a minimum, this requires using a preexisting Route53 hosted zone, because as a "participant" user of the shared VPC I am not allowed to automatically create Route53 private zones.

Requirements (aka. Acceptance Criteria):

The Installer is able to successfully deploy OpenShift on AWS with a Shared VPC [1], and the cluster is able to successfully pass osde2e testing. This includes at least the scenario where the private hosted zone belongs to a different account (Account A) than the cluster resources (Account B).

[1] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html

Epic Goal

  • Enable/confirm installation in AWS shared VPC scenario where Private Hosted Zone belongs to an account separate from the cluster installation target account

Why is this important?

  • AWS best practices suggest this setup

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

I want

  • the installer to check for appropriate permissions based on whether the installation is using an existing hosted zone and whether that hosted zone is in another account

so that I can

  • be sure that my credentials have sufficient and minimal permissions before beginning install

Acceptance Criteria:

Description of criteria:

  • When platform.aws.hostedZoneRole is specified, Route53:CreateHostedZone and Route53:DeleteHostedZone are not required
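
The criterion above can be sketched as a function that derives the required Route53 permissions from the install-config. This is only an illustration: the baseline permission set is a hypothetical placeholder (the real installer checks many more permissions), and the sts:AssumeRole requirement when a role is given is an assumption, not something stated in this story.

```python
from typing import Optional, Set

# Hypothetical baseline for illustration; not the installer's real list.
BASE_PERMISSIONS = {"route53:ListHostedZones", "route53:ChangeResourceRecordSets"}
# The two permissions named in the acceptance criterion.
ZONE_LIFECYCLE = {"route53:CreateHostedZone", "route53:DeleteHostedZone"}


def required_permissions(hosted_zone_role: Optional[str]) -> Set[str]:
    """Return the permissions to check before install (illustrative only)."""
    required = set(BASE_PERMISSIONS)
    if hosted_zone_role:
        # The zone already exists in another account, so the installer
        # neither creates nor deletes it. Needing sts:AssumeRole here
        # is an assumption of this sketch.
        required.add("sts:AssumeRole")
    else:
        required |= ZONE_LIFECYCLE
    return required


print(sorted(required_permissions(None)))
print(sorted(required_permissions("arn:aws:iam::111111111111:role/example")))
```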

Links:

Enhancement PR: https://github.com/openshift/enhancements/pull/1397 

API PR: https://github.com/openshift/api/pull/1460 

Ingress  Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/928 

Background

Feature Goal: Support OpenShift installation in AWS Shared VPC scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.

The ingress operator is responsible for creating DNS records in AWS Route53 for cluster ingress. Prior to the implementation of this epic, the ingress operator did not have the capability to add DNS records to an existing Route 53 hosted zone in the shared VPC.

Epic Goal

  • Add support to the ingress operator for creating DNS records in preexisting Route53 private hosted zones for Shared VPC clusters

Non-Goals

  • Ingress operator support for day-2 operations (i.e. changes to the AWS IAM Role value after installation)  
  • E2E testing (will be handled by the Installer Team) 

Design

As described in the WIP PR https://github.com/openshift/cluster-ingress-operator/pull/928, the ingress operator will consume a new API field that contains the IAM Role ARN for configuring DNS records in the private hosted zone. If this field is present, the ingress operator will use this role to create all private hosted zone records. The API fields will be described in the Enhancement PR.

The ingress operator code will accomplish this by defining a new provider implementation that wraps two other DNS providers, using one of them to publish records to the public zone and the other to publish records to the private zone.
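
The wrapping approach described above might look roughly like the following. This is a language-neutral sketch in Python, not the operator's Go code; the class names, the ensure method, and the zone identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingProvider:
    """Stand-in DNS provider that only records what it is asked to publish."""
    name: str
    records: list = field(default_factory=list)

    def ensure(self, record: str, zone_id: str) -> None:
        self.records.append((zone_id, record))


class SplitProvider:
    """Wraps two providers and routes each record by its target zone."""

    def __init__(self, public, private, private_zone_id: str):
        self.public = public
        self.private = private
        self.private_zone_id = private_zone_id

    def ensure(self, record: str, zone_id: str) -> None:
        # Records for the private zone go through the provider that uses
        # the shared-VPC role; everything else goes to the public provider.
        target = self.private if zone_id == self.private_zone_id else self.public
        target.ensure(record, zone_id)


public = RecordingProvider("public")
private = RecordingProvider("private-with-role")
dns = SplitProvider(public, private, private_zone_id="Z-PRIVATE")
dns.ensure("*.apps.example.com", "Z-PUBLIC")
dns.ensure("*.apps.example.com", "Z-PRIVATE")
print(public.records, private.records)
```

Callers keep using a single provider interface, so the role-based credentials stay confined to the private-zone provider.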

External DNS Operator Impact

See NE-1299

AWS Load Balancer Operator (ALBO) Impact

See NE-1299

Why is this important?

  • Without this ingress operator support, OpenShift users are unable to create DNS records in a preexisting Route53 private hosted zone, which means they can't share the Route53 component with a Shared VPC
  • Shared VPCs are considered an AWS best practice

Acceptance Criteria

  • Unit tests must be written and automatically run in CI (E2E tests will be handled by the Installer Team)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ingress Operator creates DNS Records in preexisting Route53 private hosted zones for shared VPC Clusters
  • Network Edge Team has reviewed all of the related enhancements and code changes for Route53 in Shared VPC Clusters

Dependencies (internal and external)

  1. Installer Team is adding the new API fields required for sharing Route53 within Shared VPCs in https://issues.redhat.com/browse/CORS-2613
  2. Testing this epic requires access to two AWS accounts

Previous Work (Optional):

  1. Significant discussion took place in this thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1681997102492889?thread_ts=1681837202.378159&cid=C68TNFWA2
  2. Slack channel #tmp-xcmbu-114

Feature Overview (aka. Goal Summary)  

Passing a token on the command line with oc login --token is insecure. The token is recorded in the bash history and is visible in "ps" output while the oc login command is running. Moreover, the token gets logged and is searchable by any sysadmin.

Customers and users would like either a "--web" option or an option that prompts for a token. There should be no way to pass a secret on the command line with the --token option.

For environments where no web browser is available, a "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.

 
Why is this important? (mandatory)

Passing the token on the command line with oc login --token is insecure.

 
Scenarios (mandatory) 

Customers and users would like a "--web" option. There should be no way to pass a secret on the command line with the --token option.

For environments where no web browser is available, a "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.

 

 

In order to secure token usage during oc login, we need to add the capability for oc to log in using the OAuth2 Authorization Code Grant Flow through a browser. This will be possible by providing a command line option to oc:

oc login --web

In order for the OAuth2 Authorization Code Grant Flow to work in oc browser login, we need a new OAuthClient that can obtain tokens through PKCE (https://datatracker.ietf.org/doc/html/rfc7636), as the existing clients do not have this capability. The new client will be called openshift-cli-client and will have the loopback addresses as valid Redirect URIs.

In order for the OAuth2 Authorization Code Grant Flow to work in oc browser login, the OSIN server must ignore any port used in the Redirect URIs of the flow when the URIs are the loopback addresses. This has already been added to OSIN; we need to update the oauth-server to use the latest version of OSIN in order to make use of this capability.
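
The port-ignoring rule for loopback Redirect URIs can be sketched as follows. This is an illustration of the rule as described here, not the OSIN code; the function names are hypothetical, and only literal loopback IP addresses are treated as loopback in this sketch.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_host(host: str) -> bool:
    """True for literal loopback IPs such as 127.0.0.1 or ::1."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # not an IP literal (e.g. a DNS name)


def redirect_uri_matches(registered: str, requested: str) -> bool:
    """Compare redirect URIs, ignoring the port only for loopback hosts."""
    reg, req = urlsplit(registered), urlsplit(requested)
    if reg.hostname and is_loopback_host(reg.hostname):
        return (reg.scheme, reg.hostname, reg.path) == (
            req.scheme, req.hostname, req.path)
    # Any other host must match exactly, port included.
    return registered == requested


print(redirect_uri_matches("http://127.0.0.1/callback",
                           "http://127.0.0.1:54321/callback"))
print(redirect_uri_matches("https://example.com/cb",
                           "https://example.com:8443/cb"))
```

Ignoring the port matters because a CLI typically listens on whatever ephemeral port is free at login time, which cannot be registered in advance.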

 

Review the OVN Interconnect proposal and figure out the work that needs to be done in ovn-kubernetes to move to this new OVN architecture.

This is phase 2 of the project, continuing what was delivered in the earlier release.

Why is this important?

OVN IC will be the model used in Hypershift. 

See https://docs.google.com/presentation/d/17wipFv5wNjn1KfFZBUaVHN3mAKVkMgGWgQYcvss2yQQ/edit#slide=id.g547716335e_0_220 

For interconnect upgrades, i.e. when moving from OCP 4.13 to OCP 4.14 where IC is enabled, we do a two-phase rollout of the ovnkube-master and ovnkube-node pods in the openshift-ovn-kubernetes namespace. This ensures minimum disruption, since major architectural components are being brought from the control plane down to the data plane.

Since it's a two-phase rollout with each phase taking approximately 10 minutes, we effectively double the time it takes for the OVNK component to upgrade, which requires increasing the timeout thresholds on AWS.

See https://redhat-internal.slack.com/archives/C050MC61LVA/p1689768779938889 for some more details.

See sample runs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1679589472833900544

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1679589451010936832

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1678480739743567872

I have noticed this happening once on GCP:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1680563737225859072

This has not happened on Azure, which has a 95-minute allowance. So this card tracks the work to increase the timers on AWS/GCP. This was brought up in the TRT team sync that happened yesterday (July 19th 2023), and Scott Dodson has agreed to approve this on the condition that we bring the timers back down to the current values in release 4.15.

SDN team is confident the time will drop back to normal for future upgrades going from 4.14 -> 4.15 and so on. This will be tracked via https://issues.redhat.com/browse/OTA-999 

In the non-IC world, we have a centralised DB, so running a trace is easy. In the IC world, we need the local DBs from every node to run a full pod2pod trace; otherwise we can only run half traces with one side's DB.

Goal of this card:

  • Open a PR against `oc` repo to get all dbs (minimum requirement)

Users want to create EFA instance MachineSets in the same AWS placement group to get the best network performance within that placement group.

Note: This Epic was previously connected to https://issues.redhat.com/browse/OCPPLAN-8106 and has been updated to OCPBU-327.

Scope

The scope of this Epic is only to support placement groups. Customers will create them.
The customer ask is that placement groups don't need to be created by OpenShift Container Platform.
OpenShift Container Platform only needs to be able to consume them and assign machines from a MachineSet to a specific placement group.

Background

In CAPI, the AWS provider supports the user supplying the name of a pre-existing placement group, which is then used when creating the instances.

https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4273

We need to add the same field to our API and then pass the information through in the same way, to allow users to leverage placement groups.

Steps

  • Review the upstream code linked above
  • Backport the feature
  • Drop old code for placement group controller that is currently disabled

Stakeholders

  • Cluster Infra

Definition of Done

  • Users may provide a pre-existing placement group name and have their instances created within that placement group

This epic contains all the OLM related stories for OCP release-4.14

Epic Goal

  • Track all the stories under a single epic

The console operator should build up a set of the cluster nodes' OS types and supply it to the console, so that the console renders only operators that can be installed on the cluster.

This will be needed when we support different OS types on the cluster.

We need to scan through the compute nodes and build a set of supported OS types from them. Each node on the cluster has a label for its operating system, e.g. kubernetes.io/os=linux.

 

AC:

  1. Implement logic in the console repo
    1. Add additional flag
    2. populate the supported OS types into SERVER_FLAGS
    3. update the filtering logic in the operator hub

1. Proposed title of this feature request

    Add a scroll bar for the resource list in the Uninstall Operator pop-up window
2. What is the nature and description of the request?

    Make it easy for users to check the list of all resources
3. Why does the customer need this? (List the business requirements here)

    One operator may have multiple resources; a scroll bar would make it easy for customers to check them all in the Uninstall Operator pop-up window
4. List any affected packages or components.

 

The console operator should build up a set of the cluster nodes' OS types and supply it to the console, so that the console renders only operators that can be installed on the cluster.

This will be needed when we support different OS types on the cluster.

We need to scan through the compute nodes and build a set of supported OS types from them. Each node on the cluster has a label for its operating system, e.g. kubernetes.io/os=linux.

 

AC:

  1. Implement logic in the console-operator that scans through all the nodes, builds a set of all the OS types that the cluster nodes run on, and passes it to console-config.yaml. This set of OS types will then be used by the console frontend.
  2. Add unit and e2e test cases in the console-operator repository.
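
The scanning step could be sketched as below. This is an illustration in Python rather than the console-operator's Go code; the function name node_os_types is hypothetical, and nodes are assumed to be plain dicts shaped like Kubernetes Node objects.

```python
OS_LABEL = "kubernetes.io/os"


def node_os_types(nodes: list) -> list:
    """Collect the distinct OS types from node labels, sorted for stable output."""
    os_types = set()
    for node in nodes:
        labels = node.get("metadata", {}).get("labels") or {}
        os_type = labels.get(OS_LABEL)
        if os_type:  # skip nodes that lack the label
            os_types.add(os_type)
    return sorted(os_types)


nodes = [
    {"metadata": {"name": "worker-0", "labels": {OS_LABEL: "linux"}}},
    {"metadata": {"name": "worker-1", "labels": {OS_LABEL: "linux"}}},
    {"metadata": {"name": "win-0", "labels": {OS_LABEL: "windows"}}},
]
print(node_os_types(nodes))  # ['linux', 'windows']
```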

Goal: OperatorHub/OLM users get a more intuitive UX around discovering and selecting Operator versions to install.

Problem statement: Today it's not possible to install an older version of an Operator unless the user knows the exact CSV semantic version. However, this is not exposed through any API. `packageserver` as of today only shows the latest version per channel.

Why is this important: There are many reasons why a user would choose not to install the latest version, whether it's lack of testing or known problems. It should be easy for a user to discover which versions of an Operator OLM has in its catalogs and update graphs, and this information should be exposed to the user in a consumable way.

Acceptance Criteria:

  • Users can choose from a list of "available versions" of an Operator based on the "selected channel" on the 'OperatorHub' page in the console.
  • Users can see/examine Operator metadata (e.g. descriptions, version, capability level, links, etc) per selected channel/version to confirm the exact version they are going to install on the OperatorHub page.
  • The selected channel/version info will be carried over from the 'OperatorHub' page to 'Install Operator' page in the console.
  • Note that "installing an older version" means "no automatic update"; hence, when users select a non-latest Operator version, this implies the "Update" field would be changed to "Manual".
  • Operator details sidebar data (`createdAt`, `containerImage`, and `capability level`) will update based on the selected channel.
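
The rule that selecting a non-latest version implies manual updates can be sketched as a function that builds the Subscription the console would create. This is a hedged illustration: the function name, the package and catalog names, and the assumption that the channel's CSV list is ordered oldest first are all hypothetical.

```python
def build_subscription(package, channel, selected_csv, channel_csvs,
                       source, source_namespace):
    """Build a Subscription dict; channel_csvs is assumed ordered oldest first."""
    if selected_csv not in channel_csvs:
        raise ValueError(f"{selected_csv!r} is not available in channel {channel!r}")
    latest = channel_csvs[-1]
    # Installing an older version means no automatic update.
    approval = "Automatic" if selected_csv == latest else "Manual"
    return {
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "Subscription",
        "metadata": {"name": package},
        "spec": {
            "name": package,
            "channel": channel,
            "source": source,
            "sourceNamespace": source_namespace,
            "startingCSV": selected_csv,
            "installPlanApproval": approval,
        },
    }


csvs = ["example-operator.v1.0.0", "example-operator.v1.1.0"]
sub = build_subscription("example-operator", "stable", csvs[0], csvs,
                         "example-catalog", "openshift-marketplace")
print(sub["spec"]["installPlanApproval"])  # Manual
```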

Out of scope:

  • provide a version selector for updates in case of existing installed operators

 

Related info

UX designs: http://openshift.github.io/openshift-origin-design/designs/administrator/olm/select-install-operator-version/
linked OLM jira: https://issues.redhat.com/browse/OPRUN-1399
where you can see the downstream PR: https://github.com/openshift/operator-framework-olm/pull/437/files
specifically: https://github.com/awgreene/operator-framework-olm/blob/f430b2fdea8bedd177550c95ec[…]r/pkg/package-server/apis/operators/v1/packagemanifest_types.go i.e., you can get a list of available versions in PackageChannel stanza from the packagemanifest API
You can reach out to OLM lead Alex Greene for any question regarding this too, thanks

 

 

Key Objective
Providing our customers with a single simplified User Experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of everything from managing the fleet to deep diving into a single cluster.
Why customers want this?

  1. Single interface to accomplish their tasks
  2. Consistent UX and patterns
  3. Easily accessible: One URL, one set of credentials

Why we want this?

  • Shared code -  improve the velocity of both teams and most importantly ensure consistency of the experience at the code level
  • Pre-built PF4 components
  • Accessibility & i18n
  • Remove barriers for enabling ACM

Phase 2 Goal: Productization of the united Console 

  1. Enable user to quickly change context from fleet view to single cluster view
    1. Add Cluster selector with “All Cluster” Option. “All Cluster” = ACM
    2. Shared SSO across the fleet
    3. Hub OCP Console can connect to remote clusters API
    4. When ACM is installed, the user starts from the fleet overview aka “All Clusters”
  2. Share UX between views
    1. ACM Search —> resource list across fleet -> resource details that are consistent with single cluster details view
    2. Add Cluster List to OCP —> Create Cluster

We need a way to show metrics for workloads running on spoke clusters. This depends on ACM-876, which lets the console discover the monitoring endpoints.

  • Console operator must discover the external URLs for monitoring
  • Console operator must pass the URLs and CA files as part of the cluster config to the console backend
  • Console backend must set up proxies for each endpoint (as it does for the API server endpoints)
  • Console frontend must include the cluster in metrics requests

Open Issues:

We will depend on ACM to create a route on each spoke cluster for the prometheus tenancy service, which is required for metrics for normal users.

 

The OpenShift console backend should proxy managed cluster monitoring requests through the MCE cluster proxy addon to prometheus services on the managed cluster. This depends on https://issues.redhat.com/browse/ACM-1188

 

BU Priority Overview

Initiative: Improve etcd disaster recovery experience (part1)

Goals

The current etcd backup and recovery process is described in our docs https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html

The current process leaves it up to the cluster-admin to figure out a way to do consistent backups following the documented procedure.

This feature is part of a progressive delivery to improve the cluster-admin experience for backup and restore of etcd clusters to a healthy state.

Scope of this feature:

  • etcd quorum loss (2 node failure) on a 3-node OCP control plane
  • etcd degradation (1 node failure) on a 3-node OCP control plane

Execution Plans

  • Improve etcd disaster recovery e2e test coverage
  • Design automated backup API. Initial target is local destination
  • Should provide a way (e.g. script or tool) for the cluster-admin to validate that backup files remain valid over time (e.g. account for disk failures corrupting the backup)
  • Should document updated manual steps to restore from local backup. These steps should be part of the e2e test coverage.
  • Should document manual steps to copy backup files to a destination outside the cluster (e.g. an ssh copy a cluster admin can use in a CronJob).

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Given that we have a controller that processes one-time etcd backup requests via the "operator.openshift.io/v1alpha1 EtcdBackup" CR, we need another controller that processes the "config.openshift.io/v1alpha1 Backup" CR so we can have periodic backups according to the schedule in the CR spec.

See https://github.com/openshift/api/pull/1482 for the APIs

The workflow for this controller should roughly be:

  • Watches the `config.openshift.io/v1alpha1 Backup` CR as created by an admin
  • Creates a CronJob for the specified schedule and timezone that would in turn create `operator.openshift.io/v1alpha1 EtcdBackup` CRs at the desired schedule
  • Updates the CronJob for any changes in the schedule or timezone

Along with this controller, we would also need to provide the workload or Go command for the pod that is created periodically by the CronJob. This cmd, e.g. "create-etcdbackup-cr", effectively creates a new `operator.openshift.io/v1alpha1 EtcdBackup` CR via the following workflow:

  • Read the Backup CR to get the pvcName (and anything else) required to populate an `EtcdBackup` CR
  • Create the `operator.openshift.io/v1alpha1 EtcdBackup` CR
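
The two steps above might be sketched as follows. This is only an illustration of the data flow, written in Python rather than as the Go command the story calls for; the field paths spec.etcd.pvcName (on the Backup CR) and spec.pvcName (on the EtcdBackup CR) are assumptions here, and the authoritative schema is in the API PR linked above.

```python
def etcd_backup_from(backup_cr: dict, name: str) -> dict:
    """Build an EtcdBackup CR dict from a Backup CR dict (illustrative only)."""
    # Assumed field path; see the API PR for the real schema.
    pvc_name = backup_cr["spec"]["etcd"]["pvcName"]
    return {
        "apiVersion": "operator.openshift.io/v1alpha1",
        "kind": "EtcdBackup",
        "metadata": {"name": name},
        "spec": {"pvcName": pvc_name},
    }


backup = {
    "apiVersion": "config.openshift.io/v1alpha1",
    "kind": "Backup",
    "metadata": {"name": "default"},
    "spec": {"etcd": {"schedule": "0 */6 * * *", "pvcName": "etcd-backup-pvc"}},
}
print(etcd_backup_from(backup, "periodic-backup-0001"))
```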

Lastly, to fulfill the retention policy (None, number of backups saved, or total size of backups), we can employ the following workflow:

  • Have another command, e.g. "prune-backups", that deletes existing backups per the retention policy.
  • This cmd runs before the "create-etcdbackup-cr" cmd. This could be done via an init container on the CronJob execution pod.
  • This would require the backup controller to populate the CronJob spec with the PVC name from the Backup spec, which would allow mounting the PV on the execution pod for pruning the backups in the init container.

Lastly, to fulfill the retention policy (None, number of backups saved, or total size of backups), we can employ the following workflow:

  • Have another command, e.g. "prune-backups", that deletes existing backups per the retention policy.
  • The retention policy type can either be read from the `config.openshift.io/v1alpha1 Backup` CR
    • Or, easier yet, the backup controller can set the retention policy arg in the CronJob template spec
  • This cmd runs before the "create-etcdbackup-cr" cmd. This could be done via an init container on the CronJob execution pod.
  • This would require the backup controller to populate the CronJob spec with the PVC name from the Backup spec, which would allow mounting the PV on the execution pod for pruning the backups in the init container.
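
The pruning decision for the three retention policies could be sketched as below. This is a minimal illustration, not the "prune-backups" command itself; the policy strings "none", "number", and "size", the BackupFile type, and the parameter names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackupFile:
    name: str
    size_bytes: int
    created: float  # seconds since the epoch


def backups_to_prune(backups, retention_type, max_number=None, max_size_bytes=None):
    """Return the backups to delete, keeping the newest ones within the limit."""
    newest_first = sorted(backups, key=lambda b: b.created, reverse=True)
    if retention_type == "none":
        return []
    if retention_type == "number":
        return newest_first[max_number:]
    if retention_type == "size":
        total = 0
        for index, backup in enumerate(newest_first):
            total += backup.size_bytes
            if total > max_size_bytes:
                # This backup and every older one exceed the budget.
                return newest_first[index:]
        return []
    raise ValueError(f"unknown retention type: {retention_type!r}")


backups = [
    BackupFile("backup-a", 10, created=1.0),
    BackupFile("backup-b", 10, created=2.0),
    BackupFile("backup-c", 10, created=3.0),
]
print([b.name for b in backups_to_prune(backups, "number", max_number=2)])
print([b.name for b in backups_to_prune(backups, "size", max_size_bytes=25)])
```

Because pruning runs before the new backup is taken, a real implementation may also want to leave room for the backup about to be created; that is not modeled here.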

See the parent story for more context.
As the first part to this story we need a controller with the following workflow:

  • Watches the `config.openshift.io/v1alpha1 Backup` CR as created by an admin
  • Creates a CronJob for the specified schedule and timezone that would ultimately create `operator.openshift.io/v1alpha1 EtcdBackup` CRs at the desired schedule
  • Updates the CronJob for any changes in the schedule or timezone

Since we also want to preserve a history of successful and failed backup attempts for the periodic config, the CronJob should utilize cronjob history limits to preserve successful and failed jobs.
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#jobs-history-limits

To begin with we can set this to a reasonable default of 5 successful and 10 failed jobs.
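
The CronJob the controller creates could look roughly like the manifest built below. This is a hedged sketch: the field names (`schedule`, `timeZone`, `successfulJobsHistoryLimit`, `failedJobsHistoryLimit`) are standard `batch/v1` CronJob fields, but the helper names, the container image and the `restartPolicy` are assumptions made for illustration, not taken from the operator.

```python
from typing import Any, Dict


def build_backup_cronjob(
    name: str,
    schedule: str,
    time_zone: str,
    successful_limit: int = 5,   # default proposed in the story
    failed_limit: int = 10,      # default proposed in the story
) -> Dict[str, Any]:
    """Build a batch/v1 CronJob manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "timeZone": time_zone,
            "successfulJobsHistoryLimit": successful_limit,
            "failedJobsHistoryLimit": failed_limit,
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "Never",  # assumption
                "containers": [{
                    "name": "create-etcdbackup-cr",
                    # Placeholder image, not a real reference.
                    "image": "example.invalid/backup-tool:latest",
                    "command": ["create-etcdbackup-cr"],
                }],
            }}}},
        },
    }


def needs_update(existing: Dict[str, Any], schedule: str, time_zone: str) -> bool:
    """True when the Backup CR's schedule or timezone differs from the CronJob's."""
    spec = existing["spec"]
    return spec["schedule"] != schedule or spec.get("timeZone") != time_zone
```

A reconcile loop would call `build_backup_cronjob` when no CronJob exists and `needs_update` on each change to the Backup CR.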

 

For testing the automated backups feature we will require an e2e test that validates the backups by ensuring the restore procedure works for a quorum loss disaster recovery scenario.

See the following doc for more background:
https://docs.google.com/document/d/1NkdOwo53mkNBCktV5tkUnbM4vi7bG4fO5rwMR0wGSw8/edit?usp=sharing

This story targets milestones 2, 3 and 4 of the restore test to ensure that the test has the ability to perform a backup and then restore from that backup in a disaster recovery scenario.

While the automated backups API is still in progress, the test will rely on the existing backup script to trigger a backup. Later on when we have a functional backup API behind a feature gate, the test can switch over to using that API to trigger backups.

We're starting with a basic crash-looping member restore first. The quorum loss scenario will be done in ETCD-423.

We should add some basic backup e2e tests into our operator:

  • one off backups can be run via API
  • periodic backups can be run (also multiple times in succession)
    • retention should work

The e2e workflow should be TechPreview enabled already. 

 

For testing the automated backups feature we will require an e2e test that validates the backups by ensuring the restore procedure works for a quorum loss disaster recovery scenario.

See the following doc for more background:
https://docs.google.com/document/d/1NkdOwo53mkNBCktV5tkUnbM4vi7bG4fO5rwMR0wGSw8/edit?usp=sharing

This story targets the first milestone of the restore test to ensure we have a platform agnostic way to be able to ssh access all masters in a test cluster so that we can perform the necessary backup, restore and validation workflows.

The suggested approach is to create a static pod that can do those ssh checks and actions from within the cluster but other alternatives can also be explored as part of this story. 

To fulfill one time backup requests there needs to be a new controller that reconciles an EtcdBackup CustomResource (CR) object and executes and saves a one time backup of the etcd cluster.
 
Similar to the upgradebackupcontroller the controller would be triggered to create a backup pod/job which would save the backup to the PersistentVolume specified by the spec of the EtcdBackup CR object.

The controller would also need to honor the retention policy specified by the EtcdBackup spec and update the status accordingly.

See the following enhancement and API PRs for more details and potential updates to the API and workflow for the one time backup:
https://github.com/openshift/enhancements/pull/1370
https://github.com/openshift/api/pull/1482

< High-Level description of the feature ie: Executive Summary >

Goals

< Who benefits from this feature, and how? What is the difference between today's current state and a world with this feature? >

Requirements

Requirements Notes IS MVP
     
    • (Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

Out of scope

<Defines what is not included in this story>

Dependencies

< Link or at least explain any known dependencies. >

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

What does success look like?

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>

QE Contact

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Impact

< If the feature is ordered with other work, state the impact of this feature on the other work>

Related Architecture/Technical Documents

<links>

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

What's the problem

Currently the pipeline builder in the dev console directly queries the Tekton Hub APIs when searching for tasks. As the upstream community and Red Hat are moving to Artifact Hub, we need to query the Artifact Hub API for searching tasks.

Acceptance criteria

  1. Update the pipeline builder code so that if the API to retrieve tasks is not available, there will be no errors in the UI.
  2. Perform a spike to estimate the amount of work it will take to have the pipeline builder use the artifact hub API to retrieve tasks, rather than using the tekton hub API.

Description

Hitting the Artifacthub.io search endpoint sometimes fails due to a CORS error, and the Version API endpoint always fails due to a CORS error. So, we need a proxy to hit the Artifacthub.io endpoint to get the data.

Acceptance Criteria

  1. Create a proxy to hit the Artifacthub.io endpoint.

Additional Details:

Search endpoint: https://artifacthub.io/docs/api/#/Packages/searchPackages

eg.: https://artifacthub.io/api/v1/packages/search?offset=0&limit=20&facets=false&ts_query_web=git&kind=7&deprecated=false&sort=relevance

Version endpoint: https://artifacthub.io/docs/api/#/Packages/getTektonTaskVersionDetails

eg: https://artifacthub.io/api/v1/packages/tekton-task/tekton-catalog-tasks/git-clone/0.9.0
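
For reference, the two example URLs above can be assembled as in the sketch below. The helper names are hypothetical and the query parameters (including `kind=7`) are copied from the examples rather than from the API specification; a proxy would forward requests of this shape on behalf of the console.

```python
from urllib.parse import urlencode

ARTIFACTHUB_API = "https://artifacthub.io/api/v1"


def search_url(query: str, offset: int = 0, limit: int = 20) -> str:
    """Build a package search URL with the parameters used in the example above."""
    params = {
        "offset": offset,
        "limit": limit,
        "facets": "false",
        "ts_query_web": query,
        "kind": 7,               # value taken from the example URL
        "deprecated": "false",
        "sort": "relevance",
    }
    return f"{ARTIFACTHUB_API}/packages/search?{urlencode(params)}"


def task_version_url(repo: str, task: str, version: str) -> str:
    """Build the version-details URL for a Tekton task package."""
    return f"{ARTIFACTHUB_API}/packages/tekton-task/{repo}/{task}/{version}"
```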

 

Feature Overview (aka. Goal Summary):

 

This feature will allow an x86 control plane to operate with compute nodes of type Arm in a HyperShift environment.

 

Goals (aka. expected user outcomes):

 

Enable an x86 control plane to operate with an Arm data-plane in a HyperShift environment.

 

Requirements (aka. Acceptance Criteria):

 

  • The feature must allow an x86 control plane and an Arm data-plane to be used together in a HyperShift environment.
  • The feature must provide documentation on how to set up and use the x86 control plane with an Arm data-plane in a HyperShift environment.
  • The feature must be tested and verified to work reliably and securely in a production environment.

 

Customer Considerations:

 

Customers who require a mix of x86 control plane and Arm data-plane for their HyperShift environment will benefit from this feature.

 

Documentation Considerations:

 

  • Documentation should include clear instructions on how to set up and use the x86 control plane with an Arm data-plane in a HyperShift environment.
  • Documentation will live on docs.openshift.com

 

Interoperability Considerations:

 

This feature should not impact other OpenShift layered products and versions in the portfolio.

Goal

Numerous partners are asking for ways to pre-image servers in some central location before shipping them to an edge site where they can be configured as an OpenShift cluster: OpenShift-based Appliance.

A number of these cases are a good fit for a solution based on writing an image equivalent to the agent ISO, but without the cluster configuration, to disk at the central location and then configuring and running the installation when the servers reach their final location. (Notably, some others are not a good fit, and will require OpenShift to be fully installed, using the Agent-based installer or another, at the central location.)

While each partner will require a different image, usually incorporating some of their own software to drive the process as well, some basic building blocks of the image pipeline will be widely shared across partners.

Extended documentation

OpenShift-based Appliance

Building Blocks for Agent-based Installer Partner Solutions

Interactive Workflow work (OCPBU-132)

This work must avoid conflict with the requirements for any future interactive workflow (see Interactive Agent Installer), and build towards it where the requirements coincide. This includes a graphical user interface (future assisted installer consistency).

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Allow the user to use the openshift-installer to generate a configuration ISO that they can attach to a server running the unconfigured agent ISO from AGENT-558. This would act as an alternative to the GUI, effectively leaving the interactive flow and rejoining the automation flow by doing an automatic installation using the configuration contained on the ISO.

Why is this important?

  • Helps standardise implementations of the automation flow where an agent ISO image is pre-installed on a physical disk.

Scenarios

  1. The user purchases hardware with a pre-installed unconfigured agent image. They use openshift-installer to generate a config ISO from an install config, and attach this ISO as virtual media to a group of servers to cause them to install OpenShift and form a cluster.
  2. The user has a pool of servers that share the same boot mechanism (e.g. PXE). Each server is booted from a common interactive agent image, and automation can install any subset of them as a cluster by attaching the same configuration ISO to each.
  3. A cloud user could boot a group of VMs using a publicly-available unconfigured agent image (e.g. an AMI), and install them as a cluster by attaching a configuration ISO to them.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. AGENT-556 - we'll need to block startup of services until configuration is provided
  2. AGENT-558 - this won't be useful without an unconfigured image to use it with
  3. AGENT-560 - enables AGENT-556 to block in an image generated with AGENT-558

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Add a new installer subcommand, openshift-install agent create config-image.

This should create a small ISO (i.e. not a CoreOS boot image) containing just the configuration files from the automation flow:

  • rendezvousIP config file
  • ClusterDeployment manifest
  • AgentPullSecret manifest
  • AgentClusterInstall manifest
  • TLS certs for admin kubeconfig
  • password hash for kubeadmin console password
  • NMStateConfig
  • extra manifests
  • hostnames
  • hostconfig (roles, root device hints)
  • ClusterImageSet manifest (for version verification)

The contents in the disk could be in any format, but should be optimised to make it simple for the service in AGENT-562 to read.

Implement a systemd service in the unconfigured agent ISO (AGENT-558) that watches for disks to be mounted, then searches them for agent installer configuration. If such configuration is found, then copy it to the relevant places in the running system.

The rendezvousIP must be copied last, as the presence of this is what will trigger the services to start (AGENT-556).

To the extent possible, the service should be agnostic as to the method by which the config disk was mounted (e.g. virtual media, USB stick, floppy disk, &c.). It may be possible to get systemd to trigger on volume mount, avoiding the need to poll anything.

The configuration drive must contain:

  • rendezvousIP config file
  • ClusterDeployment manifest
  • AgentPullSecret manifest
  • AgentClusterInstall manifest
  • TLS certs for admin kubeconfig
  • password hash for kubeadmin console password
  • ClusterImageSet manifest (for version verification)

it may optionally contain:

  • NMStateConfig
  • extra manifests
  • hostnames
  • hostconfig (roles, root device hints)

The ClusterImageSet manifest must match the one already present in the image for the config to be accepted.
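
The acceptance rules above can be sketched as follows. This is only an illustration under stated assumptions: the item identifiers are hypothetical labels for the entries in the lists above (not real file names), and the comparison of ClusterImageSet manifests is reduced to a simple equality check.

```python
from typing import Iterable, List

RENDEZVOUS = "rendezvous-ip"

# Hypothetical labels for the required and optional items listed above.
REQUIRED = (
    RENDEZVOUS,
    "cluster-deployment",
    "agent-pull-secret",
    "agent-cluster-install",
    "admin-kubeconfig-tls-certs",
    "kubeadmin-password-hash",
    "cluster-image-set",
)
OPTIONAL = ("nmstate-config", "extra-manifests", "hostnames", "hostconfig")


def missing_required(found: Iterable[str]) -> List[str]:
    """Return the required items that are absent from the config drive."""
    present = set(found)
    return [item for item in REQUIRED if item not in present]


def copy_order(found: Iterable[str]) -> List[str]:
    """Order the items for copying, with the rendezvousIP file always last."""
    present = set(found)
    ordered = [i for i in REQUIRED + OPTIONAL if i in present and i != RENDEZVOUS]
    if RENDEZVOUS in present:
        ordered.append(RENDEZVOUS)
    return ordered


def config_accepted(found: Iterable[str], drive_image_set: str, image_image_set: str) -> bool:
    """Accept the config only if it is complete and the ClusterImageSet matches."""
    return not missing_required(found) and drive_image_set == image_image_set
```

Copying the rendezvousIP file last matters because its presence is what triggers the services to start (AGENT-556).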

Support pd-balanced disk types for GCP deployments

OpenShift installer and Machine API should support creation and management of computing resources with disk type "pd-balanced"

Why does the customer need this?

  • pd-balanced disks are SSD disks with performance comparable to pd-ssd at a lower price

Epic Goal

  • Support pd-balanced disk types for GCP deployments

Why is this important?

  • Customers will be able to reduce costs on GCP while using `pd-balanced` disk types with a comparable performance to `pd-ssd` ones.

Scenarios

  1. Enable `pd-balanced` disk types when deploying a cluster in GCP. Right now only `pd-ssd` and `pd-standard` are supported.

Overview:

  • Enable support for pd-balanced disk types during cluster deployment on Google Cloud Platform (GCP) in the OpenShift Installer.
  • Currently, only pd-ssd and pd-standard disk types are supported.
  • `pd-balanced` disks on GCP will offer cost reduction and comparable performance to `pd-ssd` disks, providing increased flexibility and performance for deployments.

Acceptance Criteria:

  • The OpenShift Installer should be updated to include pd-balanced as a valid disk type option in the installer configuration process.
  • When pd-balanced disk type is selected during cluster deployment, the installer should handle the configuration of the disks accordingly.
  • CI (Continuous Integration) must be running successfully with tests automated.
  • Release Technical Enablement details and documents should be provided.
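
The validation change implied by the first criterion can be pictured as in the sketch below. It is not the installer's code (the installer is written in Go); the names are hypothetical and the set of disk types is taken only from this epic's text.

```python
# Disk types named in this epic as supported before the change.
PREVIOUSLY_SUPPORTED = frozenset({"pd-ssd", "pd-standard"})
# The epic adds pd-balanced as a valid option.
SUPPORTED = PREVIOUSLY_SUPPORTED | {"pd-balanced"}


def validate_disk_type(disk_type: str) -> str:
    """Return the disk type if it is supported, otherwise raise ValueError."""
    if disk_type not in SUPPORTED:
        allowed = ", ".join(sorted(SUPPORTED))
        raise ValueError(f"unsupported disk type {disk_type!r}; expected one of: {allowed}")
    return disk_type
```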

Done Checklist:

  • CI is running, tests are automated, and merged.
  • Release Enablement Presentation: [link to Feature Enablement Presentation].
  • Upstream code and tests merged: [link to meaningful PR or GitHub Issue].
  • Upstream documentation merged: [link to meaningful PR or GitHub Issue].
  • Downstream build attached to advisory: [link to errata].
  • Test plans in Polarion: [link or reference to Polarion].
  • Automated tests merged: [link or reference to automated tests].
  • Downstream documentation merged: [link to meaningful PR].

Dependencies:

  • Google Cloud Platform Account
  • Access to GCP ‘Installer’ Project
  • Any required permissions, authentication, access controls or CLI needed to provision pd-balanced disk types should be properly configured.

Testing:

  • Develop and conduct test cases and scenarios to verify the proper functioning of pd-balanced disk type implementation.
  • Address any bugs or issues identified during testing.

Documentation:

  • Update documentation to reflect the support for pd-balanced disk types in GCP deployments.

Success Metrics:

  • Successful deployment of OpenShift clusters using the pd-balanced disk type in GCP.
  • Minimal or no disruption to existing functionality and deployment options.

Feature Overview

  • Enable user custom RHCOS images location for Installer IPI provisioned OpenShift clusters on Google Cloud and Azure

Goals

  • The Installer accepts custom locations for RHCOS images while deploying OpenShift on Google Cloud and Azure, as is already supported for AWS via `platform.aws.amiID` for control plane and compute nodes.
  • As a user, I want to be able to specify a custom RHCOS image location to be used for control plane and compute nodes while deploying OpenShift on Google Cloud and Azure so that I can be compliant with my company's security policies.

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

Background, and strategic fit

Many enterprises have strict security policies where all the software must be pulled from a trusted or private source. For these scenarios the RHCOS image used to bootstrap the cluster is usually coming from shared public locations that some companies don't accept as a trusted source.

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Simplify ARO's workflow by allowing Azure marketplace images to be specified in the `install-config.yaml` for all nodes (compute, control plane, and bootstrap).

Why is this important?

  • ARO is a first party Azure service and has a number of requirements/restrictions. These requirements include the following: it must not request anything from outside of Azure and it must consume RHCOS VM images from a trusted source (marketplace).
  • At the same time upstream OCP does the following:
    1. It uses quay.io to get container images.
    2. Uses a random blob as a RHCOS VM image such as this. This VHD blob is then uploaded by the Installer to an Image Gallery in the user’s Storage Account where two boot images are created: a HyperV gen1 and a HyperV gen2. See here.
  • To meet the requirements, the ARO team currently does the following as part of the release process:
    1. Mirror container images from quay.io to Azure Container Registry to avoid leaving Azure boundaries.
    2. Copy VM image from the blob in someone else's Azure subscription into the blob on the subscription ARO team manages and then publish a VM image on Azure Marketplace (publisher: azureopenshift, offer: aro4. See az vm image list --publisher azureopenshift --all). ARO does not bill for these images.
  • ARO has to carry their own changes on top of the Installer code to allow them to specify their own images for the cluster deployment.

Scenarios

  1. ...

Acceptance Criteria

  • Custom RHCOS images can be specified in the install-config for compute, controlPlane and defaultMachinePlatform and they are used for the installation instead of the default RHCOS VHD.

Out of scope

  • A VHD blob will still be uploaded to the user's Storage Account even though it won't be used during installation. That cannot be changed for now.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

ARO needs to copy RHCOS image blobs to their own Azure Marketplace offering since, as a first party Azure service, they must not request anything from outside of Azure and must consume RHCOS VM images from a trusted source (marketplace).
To meet the requirements ARO team currently does the following as part of the release process:

 1. Mirror container images from quay.io to Azure Container Registry to avoid leaving Azure boundaries.
 2. Copy VM image from the blob in someone else's Azure subscription into the blob on the subscription ARO team manages and then we publish a VM image on Azure Marketplace (publisher: azureopenshift, offer: aro4. See az vm image list --publisher azureopenshift --all). We do not bill for these images.

The usage of Marketplace images in the installer was already implemented as part of CORS-1823. This single line [1] needs to be refactored to enable ARO from the installer code perspective: on ARO we don't need to set type to AzureImageTypeMarketplaceWithPlan.

However, in OCPPLAN-7556 and related CORS-1823 it was mentioned that using Marketplace images is out of scope for nodes other than compute. For ARO we need to be able to use marketplace images for all nodes.

[1] https://github.com/openshift/installer/blob/f912534f12491721e3874e2bf64f7fa8d44aa7f5/pkg/asset/machines/azure/machines.go#L107

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Set RHCOS image from Azure Marketplace in the installconfig
2. Deploy a cluster
3.

Actual results:

Only compute nodes use the Marketplace image.

Expected results:

All nodes created by the Installer use RHCOS image coming from Azure Marketplace.

Additional info:

 

 

Epic Goal

  • As a customer, I need to make sure that the RHCOS image I leverage is coming from a trusted source. 

Why is this important?

  • Customers who have very restrictive security policies imposed by their InfoSec teams need to be able to manually specify a custom location for the RHCOS image to use for the Cluster Nodes.

Scenarios

  1. As a customer, I want to specify a custom location for the RHCOS image to be used for the cluster Nodes

Acceptance Criteria

A user is able to specify a custom location in the Installer manifest for the RHCOS image to be used for bootstrap and cluster Nodes. This is similar to the approach we already support for AWS with the compute.platform.aws.amiID option.

Previous Work (Optional):

https://issues.redhat.com/browse/CORS-1103

 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

 

 

 

 

User Story:

Some background on the Licenses field:

https://github.com/openshift/installer/pull/3808#issuecomment-663153787

https://github.com/openshift/installer/pull/4696

So we do not want to allow licenses to be specified when pre-built images are specified (current behaviour); it's up to customers to create a custom image with licenses embedded and supply that to the Installer. Since we don't need to specify licenses for RHCOS images anymore, the Licenses field is useless and should be deprecated.

Acceptance Criteria:

Description of criteria:

  • License field deprecated
  • Any dev docs mentioning Licenses are updated.

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user, I want to be able to:

  • Specify a RHCOS image coming from a custom source in the install config to override the installer's internal choice of bootimage  

so that I can achieve

  • a custom location in the install config for the RHCOS image to use for the Cluster Nodes

Acceptance Criteria:

A user is able to specify a custom location in the Installer manifest for the RHCOS image to be used for bootstrap and cluster Nodes. This is similar to the approach we already support for AWS with the compute.platform.aws.amiID option.

(optional) Out of Scope:

 

Engineering Details:

  •  

Epic Goal

  • Enable the migration from a storage intree driver to a CSI based driver with minimal impact to the end user, applications and cluster
  • These migrations would include, but are not limited to:
    • CSI driver for Azure (file and disk)
    • CSI driver for VMware vSphere

Why is this important?

  • OpenShift needs to maintain its ability to enable PVCs and PVs of the main storage types
  • CSI Migration is getting close to GA, we need to have the feature fully tested and enabled in OpenShift
  • Upstream intree drivers are being deprecated to make way for the CSI drivers prior to intree driver removal

Scenarios

  1. User initiated move from intree to CSI driver
  2. Upgrade initiated move from intree to CSI driver
  3. Upgrade from EUS to EUS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

Kubernetes upstream has chosen to allow users to opt-out from CSI volume migration in Kubernetes 1.26 (1.27 PR, 1.26 backport). It is still GA there, but allows opt-out due to non-trivial risk with late CSI driver availability.

We want a similar capability in OCP - a cluster admin should be able to opt-in to CSI migration on vSphere in 4.13. Once they opt-in, they can't opt-out (at least in this epic).

Why is this important? (mandatory)

See an internal OCP doc if / how we should allow a similar opt-in/opt-out in OCP.

 
Scenarios (mandatory) 

Upgrade

  1. Admin upgrades 4.12 -> 4.13 as usual
  2. Storage CR has CSI migration disabled (or nil), in-tree volume plugin handles in-tree PVs.
  3. At the same time, external CCM runs, however, due to kubelet running with --cloud-provider=vsphere, it does not do kubelet’s job.
  1. Admin can opt-in to CSI migration by editing Storage CR. That enables OPENSHIFT_DO_VSPHERE_MIGRATION env. var. everywhere + runs kubelet with --cloud-provider=external.
    1. If we have time, it should not be hard to opt out, just remove the env. var + update kubelet cmdline. Storage / in-tree volume plugin will handle in-tree PVs again, not sure about implications on external CCM.
  2. Once opted-in, it’s not possible to opt out.
  1. Both with opt-in and without it, the cluster is Upgradeable=true. Admin can upgrade to 4.14, CSI migration will be forced there.
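
The 4.13 behaviour described in the upgrade scenario can be summarised in a small model. This is a sketch under stated assumptions: the type and function names are hypothetical, the Storage CR setting is reduced to an optional boolean, and only the 4.13 rules stated above are modelled (not the forced migration in 4.14).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeSettings:
    """Settings that follow from the Storage CR on a 4.13 vSphere cluster."""
    migration_env_set: bool        # OPENSHIFT_DO_VSPHERE_MIGRATION
    kubelet_cloud_provider: str    # value of kubelet --cloud-provider


def settings_for_4_13(storage_cr_migration: Optional[bool]) -> NodeSettings:
    """Disabled or nil keeps the in-tree setup; opt-in switches to external."""
    enabled = bool(storage_cr_migration)
    return NodeSettings(
        migration_env_set=enabled,
        kubelet_cloud_provider="external" if enabled else "vsphere",
    )


def change_opt_in(current: Optional[bool], requested: Optional[bool]) -> bool:
    """Apply a Storage CR edit; opting out after opting in is rejected."""
    if bool(current) and not bool(requested):
        raise ValueError("CSI migration cannot be disabled once it has been enabled")
    return bool(requested)
```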

 

New install

  1. Admin installs a new 4.13 vSphere cluster, with UPI, IPI, Assisted Installer, or Agent-based Installer.
  2. During installation, Storage CR is created with CSI migration enabled
  3. (We want to have it enabled for a new cluster to enable external CCM and have zonal.  This avoids new clusters from having in-tree as default and then having to go through migration later.)
  4. Resulting cluster has OPENSHIFT_DO_VSPHERE_MIGRATION env. var set + kubelet with --cloud-provider=external + topology support.
  5. Admin cannot opt-out after installation, we expect that they use CSI volumes for everything.
  1. If the admin really wants, they can opt-out before installation by adding a Storage install manifest with CSI migration disabled.

 

EUS to EUS (4.12 -> 4.14)

  • Will have CSI migration enabled once in 4.14
  • During the upgrade, a cluster will have 4.13 masters with CSI migration disabled (see regular upgrade to 4.13 above) + 4.12 kubelets.
  • Once the masters are 4.14, CSI migration is force-enabled there, still, 4.14 KCM + in-tree volume plugin in it will handle in-tree volume attachments required by kubelets that still have 4.12 (that’s what kcm --external-cloud-volume-plugin=vsphere does).
  • Once both masters + kubelets are 4.14, CSI migration is force enabled everywhere, in-tree volume plugin + cloud provider in KCM is still enabled by --external-cloud-volume-plugin, but it’s not used.
  • Keep in-tree storage class by default
  • A CSI storage class is already available since 4.10
  • Recommend to switch default to CSI
  • Can’t opt out from migration

Dependencies (internal and external) (mandatory)

  • We need a new FeatureSet in openshift/api that disables CSIMigrationvSphere feature gate.
  • We need kube-apiserver-operator, kube-controller-manager-operator, kube-scheduler-operator, and MCO to reconfigure their operands to use the in-tree vSphere cloud provider when they see the CSIMigrationvSphere FeatureGate disabled.
  • We need cloud controller manager operator to disable its operand when it sees CSIMigrationvSphere FeatureGate disabled.

Contributing Teams(and contacts) (mandatory) 

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

When CSIMigrationvSphere is disabled, cluster-storage-operator must re-create in-tree StorageClass.

vmware-vsphere-csi-driver-operator's StorageClass must not be marked as the default there (IMO we already have code for that).

This also means we need to fix the Disable SC e2e test to ignore StorageClasses for the in-tree driver. Otherwise we will reintroduce OCPBUGS-7623.

Feature Overview

  • Customers want to create and manage OpenShift clusters using managed identities for Azure resources for authentication.

Goals

  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.
  • As an administrator, I want to deploy OpenShift 4 and run Operators on Azure using access controls (IAM roles) with temporary, limited privilege credentials.

Requirements

  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • Support HyperShift and non-HyperShift clusters.
  • Support use of Operators with Azure managed identities.
  • Support in all Azure regions where Azure managed identity is available. Note: Federated credentials are associated with Azure Managed Identity, and federated credentials are not available in all Azure regions.

More details at ARO managed identity scope and impact.

 

This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

References

Epic Goal

  • Build a list of the specific permissions needed to run OpenShift on Azure - Components grant roles, but we need more granularity.
  • Determine and document the Azure roles and required permissions for Azure managed identity.

Why is this important?

  • Many of our customers have security policies in their organization that restrict credentials to only minimal permissions that conflict with the documented list of permissions needed for OpenShift. Customers need to know the explicit list of permissions minimally needed for deploying and running OpenShift and what they're used for so they can request the right permissions. Without this information, it can/will block adoption of OpenShift 4 in many cases.

Scenarios

  1. ...

Acceptance Criteria

  • Document explicit list of required credential permissions for installing (Day 1) OpenShift on Azure using the IPI and UPI deployment workflows and what each of the permissions are used for.
  • Document explicit list of required role and credential permissions for the operation (Day 2) of an OpenShift cluster on Azure and what each of the permissions are used for
  • Verify minimum list of permissions for Azure with IPI and UPI installation workflows
  • (Day 2) operations of OpenShift on Azure - MUST complete successfully with automated tests
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Installer [both UPI & IPI Workflows]
  2. Control Plane
    • Kube Controller Manager
  3. Compute [Managed Identity]
  4. Cloud API enabled components
    • Cloud Credential Operator
    • Machine API
    • Internal Registry
    • Ingress
  5. ?
  6.  

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

User Story

As a cluster admin, I want the CCM and Node manager to utilize credentials generated by CCO so that the permissions granted to the identity can be scoped with least privilege on clusters utilizing Azure AD Workload Identity.

Background

The Cloud Controller Manager Operator creates a CredentialsRequest as part of CVO manifests which describes credentials that should be created for the CCM and Node manager to utilize. CCM and the Node Manager do not use the credentials created as a product of the CredentialsRequest in existing "passthrough" based Azure clusters or within Azure AD Workload Identity based Azure clusters. CCM and the Node Manager instead use a system-assigned identity which is attached to the Azure cluster VMs.

The system-assigned identity attached to the VMs is granted the "Contributor" role within the cluster's Azure resource group. In order to use the system-assigned identity, a pod must have sufficient privilege to use the host network to contact the Azure instance metadata service (IMDS). 

For Azure AD Workload Identity based clusters, administrators must process the CredentialsRequests extracted from the release image which includes the CredentialsRequest from CCCMO manifests. This CredentialsRequest processing results in the creation of a user-assigned managed identity which is not utilized by the cluster. Additionally, the permissions granted to the identity are currently scoped broadly to grant the "Contributor" role within the cluster's Azure resource group. If the CCM and Node Manager were to utilize the identity then we could scope the permissions granted to the identity to be more granular. It may be confusing to administrators to need to create this unused user-assigned managed identity with broad permissions access.

Steps

  • Modify CCM and Node manager deployments to use the CCCMO's Azure credentials injector as an init-container to merge the provided CCO credentials secret with the /etc/kube/cloud.conf file used to configure cloud-provider-azure as used within CCM and the Node Manager. An example of the init-container can be found within the azure-file-csi-driver-operator.
  • Validate that the provided credentials are used by CCM and the Node Manager and that they continue to operate normally.
  • Scope permissions specified in the CCCMO CredentialsRequest to only those permissions needed for operation rather than "Contributor" within the Azure resource group.
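The first step above amounts to overlaying credential fields from the CCO secret onto an existing cloud config. A minimal Python sketch of that merge, assuming cloud.conf is JSON; the secret key names and config key names in `SECRET_TO_CONFIG`, and the `useManagedIdentityExtension` flag, are assumptions for illustration and are not confirmed by the text above. The actual injector lives in CCCMO and is not written in Python.

```python
import json

# Hypothetical mapping from CCO secret keys to cloud.conf keys (assumed names).
SECRET_TO_CONFIG = {
    "azure_client_id": "aadClientId",
    "azure_tenant_id": "tenantId",
    "azure_federated_token_file": "aadFederatedTokenFile",
}


def merge_credentials(cloud_conf_json: str, secret: dict) -> str:
    """Return cloud.conf content with credentials from the secret merged in."""
    config = json.loads(cloud_conf_json)
    for secret_key, config_key in SECRET_TO_CONFIG.items():
        if secret_key in secret:
            config[config_key] = secret[secret_key]
    # Assumed flag: stop relying on the system-assigned identity of the VM.
    config["useManagedIdentityExtension"] = False
    return json.dumps(config, indent=2, sort_keys=True)
```

Keys that are absent from the secret are left untouched, so the same function could serve both passthrough and workload identity clusters.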

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • CCM and Node Manager use credentials provided by CCO rather than the system-assigned identity attached to the VMs.
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • e2e tests validate that the CCM and Node manager operate normally with the credentials provided by CCO.

User Story

As a [user|developer|<other>] I want [some goal] so that [some reason]

<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it?>

Background

<Describes the context or background related to this story>

Steps

  • <Add steps to complete this card if appropriate>

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • <Add items that need to be completed for this card>
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Add actuator code to satisfy permissions specified in 'Permissions' API field. The implementation should create a new custom role with specified permissions and assign it to the generated user-assigned managed identity along with the predefined roles enumerated in CredReq.RoleBindings. The role we create for the CredentialsRequest should be discoverable so that it can be idempotently updated on re-invocation of ccoctl.

Questions to answer based on lessons learned from custom roles in GCP, assuming that we will create one custom role per identity,

  • Does Azure have soft/hard role deletion? ie. are custom roles retained for some period following deletion and if so do deleted roles count towards quota?
  • What is the default quota limitation for custom roles in Azure?
  • Does it make sense to create a custom role for each identity created based on quota limitations?
    • If it doesn't make sense, how can the roles be condensed to satisfy the quota limitations?

Add a new field (DataPermissions) to the Azure Credentials Request CR, and plumb it into the custom role assigned to the generated user-assigned managed identity's data actions.
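As a rough illustration of how the Permissions and DataPermissions fields could map onto one custom role, here is a minimal Python sketch. It assumes one custom role per CredentialsRequest and a role-definition body with `actions` and `dataActions` lists; the naming scheme, the UUID derivation, and the function name are hypothetical and not taken from ccoctl.

```python
import uuid


def custom_role_definition(cred_request_name, scope, permissions, data_permissions):
    """Build a role-definition body for one CredentialsRequest.

    The role name and ID are derived deterministically from the request name
    and scope, so a later invocation addresses the same role and can update
    it instead of creating a second one.
    """
    role_name = f"{cred_request_name}-custom-role"  # hypothetical naming scheme
    role_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{scope}/{role_name}"))
    return {
        "name": role_id,
        "properties": {
            "roleName": role_name,
            "type": "CustomRole",
            "assignableScopes": [scope],
            "permissions": [
                {
                    # Control-plane permissions from the Permissions field.
                    "actions": sorted(set(permissions)),
                    # Data-plane permissions from the DataPermissions field.
                    "dataActions": sorted(set(data_permissions)),
                }
            ],
        },
    }
```

Because the output depends only on the inputs, calling the function twice with the same request yields the same body, which is the property the idempotent update relies on.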

Epic Goal

  • CIRO can consume azure workload identity tokens
  • CIRO's Azure credential request uses new API field for requesting permissions

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This effort depends on the completion of CCO-187. Work in dependent modules is planned to be done by the CCO team unless individual repo owners can help. Operator owners/teams will be expected to review merge requests and complete the appropriate QE effort for an OpenShift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

 

ACCEPTANCE CRITERIA

  • image-registry uses latest openshift/docker-distribution
  • CIRO can detect when the creds it gets from CCO are for federated workload identity (the credentials secret will contain an "azure_federated_token_file" key)
  • when using federated workload identity, CIRO adds the "AZURE_FEDERATED_TOKEN_FILE" env var to the image-registry deployment
  • when using federated workload identity, CIRO does not add the "REGISTRY_STORAGE_AZURE_ACCOUNTKEY" env var to the image-registry deployment
  • the image-registry operates normally when using federated workload identity
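The detection rule in these criteria can be sketched as a small decision function. This is a minimal Python illustration, not CIRO's actual code (CIRO is not written in Python); the function name and the `account_key` parameter are hypothetical, while the secret key and environment variable names come from the criteria above.

```python
def registry_storage_env(secret: dict, account_key: str) -> dict:
    """Choose the Azure auth env vars for the image-registry deployment."""
    env = {}
    token_file = secret.get("azure_federated_token_file")
    if token_file:
        # Federated workload identity: pass the token file, never the key.
        env["AZURE_FEDERATED_TOKEN_FILE"] = token_file
    else:
        # No federated token in the secret: fall back to the account key.
        env["REGISTRY_STORAGE_AZURE_ACCOUNTKEY"] = account_key
    return env
```

The two variables are mutually exclusive in this sketch, which mirrors the third and fourth criteria.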

This effort depends on the completion of CCO-187. Work in dependent modules is planned to be done by the CCO team unless individual repo owners can help. Operator owners/teams will be expected to review merge requests and complete the appropriate QE effort for an OpenShift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

 

ACCEPTANCE CRITERIA

  • Upstream distribution/distribution uses azure identity sdk 1.3.0
  • openshift/docker-distribution uses the latest upstream distribution/distribution (after the above has merged)
  • Green CI
  • Every storage driver passes regression tests

OPEN QUESTIONS

  • Can DefaultAzureCredential be relied on to transparently use workload identities? (In this case the operator would need to export the environment variables that DefaultAzureCredential expects for workload identities.)
    • I have tested manually exporting the required env vars and DefaultAzureCredential correctly detects and attempts to authenticate using federated workload identity, so it works as expected.

This effort depends on the completion of CCO-187. Work in dependent modules is planned to be done by the CCO team unless individual repo owners can help. Operator owners/teams will be expected to review merge requests and complete the appropriate QE effort for an OpenShift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

 

ACCEPTANCE CRITERIA

  • CIRO should retrieve the "azure_resourcegroup" from the cluster Infrastructure object instead of the CCO created secret (this key will not be present when workload identity is in use)
  • CIRO's CredentialsRequest specifies the service account names (see the cluster-storage-operator for an example)
  • CIRO is able to create storage accounts and containers when configured with azure workload identity.
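The first criterion replaces a secret lookup with a lookup on the cluster Infrastructure object. A minimal Python sketch, assuming the object has already been fetched and decoded into a dictionary and that the resource group is exposed under `status.platformStatus.azure.resourceGroupName`; treat that field path and the function name as assumptions.

```python
def azure_resource_group(infrastructure: dict) -> str:
    """Read the Azure resource group from a decoded Infrastructure object."""
    status = infrastructure.get("status", {})
    azure = status.get("platformStatus", {}).get("azure") or {}
    group = azure.get("resourceGroupName")
    if not group:
        # Fail loudly rather than silently falling back to the secret.
        raise ValueError("Infrastructure object has no Azure resource group")
    return group
```

Raising an error when the field is missing keeps the operator from quietly depending on the "azure_resourcegroup" secret key, which is absent under workload identity.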

Epic Goal

  • Enable the OpenShift Installer to authenticate using authentication methods supported by both the azure sdk for go and the terraform azure provider
  • Future proofing to enable Terraform support for workload identity authentication when it is enabled upstream

Why is this important?

  • This ties in to the larger OpenShift goal of: as an infrastructure owner, I want to deploy OpenShift on Azure using Azure Managed Identities (vs. using Azure Service Principal) for authentication and authorization.
  • Customers want support for using Azure managed identities in lieu of using an Azure service principal. In the OpenShift documentation, we are directed to use an Azure Service Principal - "Azure offers the ability to create service accounts, which access, manage, or create components within Azure. The service account grants API access to specific services". However, Microsoft and the customer would prefer that we use User Managed Identities to keep from putting the Service Principal and principal password in clear text within the azure.conf file. 
  • See https://docs.microsoft.com/en-us/azure/active-directory/develop/workload-identity-federation for additional information.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a cluster admin I want to be able to:

  • use the managed identity from the installer host VM (running in Azure)

so that I can

  • install a cluster without copying credentials to the installer host

Acceptance Criteria:

Description of criteria:

  • Installer (azure sdk) & terraform authenticate using identity from host VM (not client secret in file ~/.azure/servicePrincipal.json)
  • Cluster credential is handled appropriately (presumably we force manual mode)

Engineering Details:

Epic Overview

  • Enable customers to create and manage OpenShift clusters using managed identities for Azure resources for authentication.
  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.

Epic Goal

  • A customer creates an OpenShift cluster ("az aro create") using Azure managed identity.
  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • After Azure failed to implement workable golang API changes after deprecation of their old API, we have removed mint mode and work entirely in passthrough mode. Azure has plans to implement pod/workload identity similar to how they have been implemented in AWS and GCP, and when this feature is available, we should implement permissions similar to AWS/GCP
  • This work cannot start until Azure has implemented this feature. As such, this Epic is a placeholder to track the effort when available.

Why is this important?

  • Microsoft and the customer would prefer that we use Managed Identities vs. Service Principal (which requires putting the Service Principal and principal password in clear text within the azure.conf file).

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

This effort depends on the completion of CCO-187. Work in dependent modules is planned to be done by the CCO team unless individual repo owners can help. Operator owners/teams will be expected to review merge requests and complete the appropriate QE effort for an OpenShift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

Create a config secret in the openshift-cloud-credential-operator namespace which contains the AZURE_TENANT_ID to be used for configuring the Azure AD pod identity webhook deployment.

These docs should cover:

  • A general overview of the feature, what changes are made to Azure credentials secrets and how to install a new cluster.
  • A usage guide of `ccoctl azure` commands to create/manage infra required for Azure workload identity.

See existing documentation for:

Feature Overview

RHEL CoreOS should be updated to RHEL 9.2 sources to take advantage of newer features, hardware support, and performance improvements.

 

Requirements

  • RHEL 9.x sources for RHCOS builds starting with OCP 4.13 and RHEL 9.2.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

  • 9.2 Preview via Layering: no longer necessary, assuming we stay the course of going all in on 9.2.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal

  • The Kernel API was updated for RHEL 9, so the old approach of setting the `sched_domain` in `/sys/kernel` is no longer available. Instead, cgroups have to be worked with directly.
  • Both CRI-O and PAO need to be updated to set the cpuset of containers and other processes correctly, as well as set the correct value for sched_load_balance

Why is this important?

  • CPU load balancing is a vital piece of real-time execution for processes that need exclusive access to a CPU. Without this work, CPU load balancing won't work on RHEL 9 with OpenShift 4.13.

Scenarios

  1. As a developer on OpenShift, I expect my pods to run with exclusive CPUs if I set the PAO configuration correctly

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Part of setting CPU load balancing on RHEL 9 involves disabling sched_load_balance on cgroups that contain a cpuset that should be exclusive. The PAO may need to be responsible for this piece.
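A minimal Python sketch of that step, assuming a cgroup v1 style cpuset hierarchy in which each cgroup directory holds a `cpuset.cpus` file and a `cpuset.sched_load_balance` file. The function names and the directory layout are assumptions for illustration; this is not the CRI-O or PAO implementation.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> set:
    """Parse a kernel CPU list such as '0-2,5' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def disable_load_balancing(cgroup_root: Path, exclusive_cpus: set) -> list:
    """Write 0 to sched_load_balance in every cgroup that spans an exclusive CPU."""
    changed = []
    for cpus_file in sorted(cgroup_root.rglob("cpuset.cpus")):
        if parse_cpu_list(cpus_file.read_text()) & exclusive_cpus:
            flag = cpus_file.with_name("cpuset.sched_load_balance")
            flag.write_text("0")
            changed.append(cpus_file.parent)
    return changed
```

The function returns the cgroup directories it modified, which makes it straightforward to restore the previous value once the exclusive workload exits.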

This is the Epic to track the work to add RHCOS 9 in OCP 4.13 and to make OCP use it by default.

 

CURRENT STATUS: Landed in 4.14 and 4.13

 

Testing with layering

 

Another option given an existing e.g. 4.12 cluster is to use layering.  First, get a digested pull spec for the current build:

$ skopeo inspect --format "{{.Name}}@{{.Digest}}" -n docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev:4.13-9.2
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4cc3995d5fc11e3b22140d8f2f91f78834e86a210325cbf0525a62725f8e099

Create a MachineConfig that looks like this:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-override
spec:
  osImageURL: <digested pull spec>

If you want to also override the control plane, create a similar one for the master role.
 
We don't yet have auto-generated release images. However, if you want one, you can ask cluster bot to e.g. "launch https://github.com/openshift/machine-config-operator/pull/3485" with options you want (e.g. "azure" etc.) or just "build https://github.com/openshift/machine-config-operator/pull/3485" to get a release image.

STATUS:  Code is merged for 4.13 and is believed to largely solve the problem.

 


 

Description of problem:

Upgrades from OpenShift 4.12 to 4.13 will also upgrade the underlying RHCOS from 8.6 to 9.2. As part of that, the names of the network interfaces may change. For example, `eno1` may be renamed to `eno1np0`. If a host is using NetworkManager configuration files that rely on those names, then the host will fail to apply its network configuration when it boots after the upgrade. For example, if the host had static IP addresses assigned, it will instead boot using IP addresses assigned via DHCP.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always.

Steps to Reproduce:

1. Select hardware (or VMs) that will have different network interface names in RHCOS 8 and RHCOS 9, for example `eno1` in RHCOS 8 and `eno1np0` in RHCOS 9.

2. Install a 4.12 cluster with static network configuration using the `interface-name` field of NetworkManager interface configuration files to match the configuration to the network interface.

3. Upgrade the cluster to 4.13.

Actual results:

The NetworkManager configuration files are ignored because they no longer match the NIC names. Instead, the NICs get new IP addresses from DHCP.

Expected results:

The NetworkManager configuration files are updated as part of the upgrade to use the new NIC names.

Additional info:

Note that this is a hypothetical scenario. We have detected this potential problem in a slightly different scenario where we install a 4.13 cluster with the assisted installer. During the discovery phase we use RHCOS 8 and we generate the NetworkManager configuration files. Then we reboot into RHCOS 9, and the configuration files are ignored due to the change in the NIC names. See MGMT-13970 for more details.
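The expected behavior, rewriting the `interface-name` field when a NIC is renamed, can be sketched as follows. This is a minimal Python illustration that assumes the configuration is a NetworkManager keyfile with the name stored under the `[connection]` section; the function name and the rename mapping are hypothetical, and this is not the mechanism the upgrade actually uses.

```python
import configparser
import io


def rename_interface(keyfile_text: str, renames: dict) -> str:
    """Return keyfile text with interface-name updated per the rename map."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case exactly as written
    parser.read_string(keyfile_text)
    if parser.has_option("connection", "interface-name"):
        old_name = parser.get("connection", "interface-name")
        if old_name in renames:
            parser.set("connection", "interface-name", renames[old_name])
    out = io.StringIO()
    # Write key=value without spaces, matching the usual keyfile style.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()
```

For the example in this report the rename map would be `{"eno1": "eno1np0"}`. Note that comment lines in the original file are dropped by `configparser`, so a production tool would need a parser that preserves them.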

Epic Goal

  • Create a new platform type, working name "External", that will signify when a cluster is deployed on a partner infrastructure where core cluster components have been replaced by the partner. “External” is different from our current platform types in that it will signal that the infrastructure is specifically not “None” or any of the known providers (eg AWS, GCP, etc). This will allow infrastructure partners to clearly designate when their OpenShift deployments contain components that replace the core Red Hat components.

This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.

To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).

Phase 1

  • Write platform “External” enhancement.
  • Evaluate changes to cluster capability annotations to ensure coverage for all replaceable components.
  • Meet with component teams to plan specific changes that will allow for supplement or replacement under platform "External".

Phase 2

  • Update OpenShift API with new platform and ensure all components have updated dependencies.
  • Update capabilities API to include coverage for all replaceable components.
  • Ensure all Red Hat operators tolerate the "External" platform and treat it the same as "None" platform.

Phase 3

  • Update components based on identified changes from phase 1
    • Update Machine API operator to run core controllers in platform "External" mode.

Why is this important?

  • As partners begin to supplement OpenShift's core functionality with their own platform specific components, having a way to recognize clusters that are in this state helps Red Hat created components to know when they should expect their functionality to be replaced or supplemented. Adding a new platform type is a significant data point that will allow Red Hat components to understand the cluster configuration and make any specific adjustments to their operation while a partner's component may be performing a similar duty.
  • The new platform type also helps with support to give a clear signal that a cluster has modifications to its core components that might require additional interaction with the partner instead of Red Hat. When combined with the cluster capabilities configuration, the platform "External" can be used to positively identify when a cluster is being supplemented by a partner, and which components are being supplemented or replaced.

Scenarios

  1. A partner wishes to replace the Machine controller with a custom version that they have written for their infrastructure. Setting the platform to "External" and advertising the Machine API capability gives a clear signal to the Red Hat created Machine API components that they should start the infrastructure generic controllers but not start a Machine controller.
  2. A partner wishes to add their own Cloud Controller Manager (CCM) written for their infrastructure. Setting the platform to "External" and advertising the CCM capability gives a clear signal to the Red Hat created CCM operator that the cluster should be configured for an external CCM that will be managed outside the operator. Although the Red Hat operator will not provide this functionality, it will configure the cluster to expect a CCM.

Acceptance Criteria

Phase 1

  • Partners can read the "External" platform enhancement and plan for their platform integrations.
  • Teams can view Jira cards for component changes and capability updates and plan their work as appropriate.

Phase 2

  • Components running in cluster can detect the “External” platform through the Infrastructure config API
  • Components running in cluster react to “External” platform as if it is “None” platform
  • Partners can disable any of the platform specific components through the capabilities API

Phase 3

  • Components running in cluster react to the “External” platform based on their function.
    • for example, the Machine API Operator needs to run a set of controllers that are platform agnostic when running in platform “External” mode.
    • the specific component reactions are difficult to predict currently; these criteria could change based on the output of phase 1.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Identifying OpenShift Components for Install Flexibility

Open questions:

  1. Phase 1 requires talking with several component teams; the specific action needed will depend on the needs of each component. At the least, the components need to treat platform "External" as "None", but there could be more changes depending on the component (eg Machine API Operator running non-platform specific controllers).

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • As defined in the parent (OCPBU-5), this epic is about adding the new "External" platform type and ensuring that the OpenShift operators which react to platform types treat the "External" platform as if it were a "None" platform.
  • Add an end-to-end test to exercise the "External" platform type

Why is this important?

  • This work lays the foundation for partners and users to customize OpenShift installations that might replace infrastructure level components.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Background

As described in the external platform enhancement, the cluster-cloud-controller-manager-operator should be modified to react to the external platform type in the same manner as platform none.

Steps

  • add an extra clause to the platform switch that will group "External" with "None"
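The step above can be sketched as follows. This is a minimal illustration in Python rather than the operator's actual code: the PlatformType enum is a hypothetical subset of the real platform types, and has_builtin_cloud_provider is an invented helper name. It only shows the idea of "External" sharing the clause that "None" already uses.

```python
from enum import Enum


class PlatformType(Enum):
    """Hypothetical subset of platform types, for illustration only."""
    AWS = "AWS"
    AZURE = "Azure"
    NONE = "None"
    EXTERNAL = "External"


def has_builtin_cloud_provider(platform: PlatformType) -> bool:
    """Return True when the operator should deploy its own provider components.

    This helper is an assumption made for the sketch, not an operator API.
    """
    # "External" is grouped with "None": nothing platform specific is deployed.
    if platform in (PlatformType.NONE, PlatformType.EXTERNAL):
        return False
    return True
```

With this grouping, a cluster reporting "External" takes exactly the same code path as one reporting "None", which is the behaviour the Definition of Done asks for.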

Stakeholders

  • openshift eng

Definition of Done

  • CCCMO behaves as if platform None when External is selected
  • Docs
  • developer docs for CCCMO should be updated
  • Testing

Background

As described in the external platform enhancement, the machine-api-operator should be modified to react to the external platform type in the same manner as platform none.

Steps

  • add an extra clause to the platform switch that will group "External" with "None"

Stakeholders

  • openshift eng

Definition of Done

  • MAO behaves as if platform None when External is selected
  • Docs
  • developer docs for MAO should be updated
  • Testing

Feature Overview

Create an Azure cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in Azure) on any OpenShift cloud resource that we create and manage. Existing resources that do not have the tags yet should also be tagged, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
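The create-and-update-but-never-delete behaviour described above can be sketched as a merge. This is a minimal illustration, not the implementation used by any component; the function names and the example tag keys are assumptions.

```python
from typing import Dict


def merge_resource_tags(existing: Dict[str, str], desired: Dict[str, str]) -> Dict[str, str]:
    """Apply the desired tags on top of the tags already on a resource.

    Tags that exist on the resource but are absent from `desired` are kept,
    because they may be custom tags applied by the customer.
    """
    merged = dict(existing)   # start from what is already on the resource
    merged.update(desired)    # create missing tags and update changed values
    return merged


def needs_update(existing: Dict[str, str], desired: Dict[str, str]) -> bool:
    """Return True when the resource is missing a desired tag or has a stale value."""
    return merge_resource_tags(existing, desired) != existing
```

For example, merging desired tags {"env": "prod"} into a resource carrying {"owner": "team-a", "env": "dev"} updates env and leaves owner untouched.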

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

 
Goals

  • Functionality on Azure Tech Preview
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This is a continuation of the CORS-2249 / CFE-671 work, where support for Azure tags was delivered as TechPreview in 4.13; the goal now is to make it GA in 4.14. This involves removing any reference to TechPreview in code and docs and incorporating any feedback received from users.

Remove the code references that mark Azure Tags as TechPreview in the files listed below:

  • installer/data/data/install.openshift.io_installconfigs.yaml (PR#6820)
  • installer/pkg/explain/printer_test.go (PR#6820)
  • installer/pkg/types/azure/platform.go (PR#6820)
  • installer/pkg/types/validation/installconfig.go (PR#6820)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Create an alert with severity warning to notify the admin that packet loss is occurring due to failed OVS vswitchd lookups. This may occur if vswitchd is CPU constrained and there are also numerous lookups.

Use the metric ovs_vswitchd_netlink_overflow, which shows netlink messages dropped by the vswitchd daemon due to buffer overflow in userspace.

For the kernel equivalent, use the metric ovs_vswitchd_dp_flows_lookup_lost. Both metrics usually have the same value but may differ if vswitchd restarts.

Both metrics should be aggregated into a single alert that fires if either value has increased recently.
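The aggregation described above can be sketched in Python. This is only an illustration of the decision logic, not the alerting rule itself: the metric names come from the text, while the shape of the input (a mapping from metric name to the samples seen in a recent window) and the function names are assumptions.

```python
from typing import Mapping, Sequence

# Metric names taken from the description above.
METRICS = ("ovs_vswitchd_netlink_overflow", "ovs_vswitchd_dp_flows_lookup_lost")


def increased(samples: Sequence[float]) -> bool:
    """True if a counter grew between the first and last sample of the window."""
    return len(samples) >= 2 and samples[-1] > samples[0]


def should_alert(window: Mapping[str, Sequence[float]]) -> bool:
    """Fire a single alert when either metric has increased in the window."""
    return any(increased(window.get(name, ())) for name in METRICS)
```

A counter reset caused by a vswitchd restart makes the last sample smaller than the first, so this simple comparison would not fire in that case; a real rule would need to account for resets.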

 

DoD: QE test case, code merged to CNO, metrics document updated ( https://docs.google.com/document/d/1lItYV0tTt5-ivX77izb1KuzN9S8-7YgO9ndlhATaVUg/edit )

Problem:

There's no way in the UI for the cluster admin to

  • change the default timeout period for the Web Terminal for all users
  • select an image from an image repository to be used as the default image for the Web Terminal for all users

Goal:

Expose in the UI the customization that cluster admins can already apply for all Web Terminal users through wtoctl.

Why is it important?

Acceptance criteria:

  1. Cluster admin should be able to change the default timeout period for all new instances of the Web Terminal (it won't change settings)
  2. Cluster admin should be able to provide a new image as the default image for all new instances of the Web Terminal (it won't change settings)

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Questions:

  • Where will this information be shared?
  • What CLI is used to accomplish this today? Get link to docs

Description

This is the follow-up story for PR https://github.com/openshift/console/pull/12718. A couple of tests that depend on YAML were added as manual tests. Proper automated tests need to be added for them.

Acceptance Criteria

  1. Add automated tests to replace the manual ones for PR https://github.com/openshift/console/pull/12718.
    After submitting, switch to the developer tab, come back to the web terminal tab and check the populated values. They should reflect the newly updated values.
  2. Remove the delays added (cy.wait) in customization-of-web-terminal.ts (https://github.com/openshift/console/pull/12718/files#diff-bea278ba2b0622e97023a25a89e91b2194e3ba73824d81ea4b08046558ba8718) and make sure all the tests are passing.
  3. Write test cases for the utils updatedWebTerminalExec and updatedWebTerminalTooling.

Additional Details:

Refer to PR https://github.com/openshift/console/pull/12718 for more details.

Description

Update the help texts on the Initialize Terminal page as follows:

1. "This Project will be used to initialize your command line terminal" to "Project used to initialize your command line terminal"
2. "Set timeout for the terminal." to "Pod timeout for your command line terminal"
3. "Set custom image for the terminal." to "Custom image used for your command line terminal"

Acceptance Criteria

Update the help texts on the Initialize Terminal page as follows:

1. "This Project will be used to initialize your command line terminal" to "Project used to initialize your command line terminal"
2. "Set timeout for the terminal." to "Pod timeout for your command line terminal"
3. "Set custom image for the terminal." to "Custom image used for your command line terminal"

Additional Details:

Description

Allow cluster admin to provide default image and/or timeout period for all cluster users

Acceptance Criteria

  1. Add a Web Terminal tab to the Cluster Configuration page under the Developer tab.
  2. This tab should be visible only to cluster admins.
  3. Add two fields: one to change the default timeout and one to change the default image.
  4. Default values should be pre-populated in these fields from:
     • Default Timeout - the value of the WEB_TERMINAL_IDLE_TIMEOUT environment variable in the web-terminal-exec DevWorkspaceTemplate
     • Default Image - the .spec.components[].container.image field in the web-terminal-tooling DevWorkspaceTemplate
  5. Once the user changes these values and saves, update the same resources (refer to the comment in epic https://issues.redhat.com/browse/ODC-7119 for more details).
  6. If the user has only read access to DevWorkspaceTemplate, the save button should not be enabled; if the user does not have read access to DevWorkspaceTemplate, do not show the Web Terminal tab on the configuration page.
  7. Add e2e tests.
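The lookup of the two default values can be sketched as follows. This is a hedged illustration in Python, not the console code: it assumes the templates are available as plain dictionaries and that environment variables sit under container.env as name/value pairs, which is an assumption about the resource layout. The function names are invented for the sketch.

```python
from typing import Optional


def default_timeout(exec_template: dict) -> Optional[str]:
    """Read WEB_TERMINAL_IDLE_TIMEOUT from a web-terminal-exec template dict."""
    for component in exec_template.get("spec", {}).get("components", []):
        # Assumed layout: env is a list of {"name": ..., "value": ...} entries.
        for env in component.get("container", {}).get("env", []):
            if env.get("name") == "WEB_TERMINAL_IDLE_TIMEOUT":
                return env.get("value")
    return None


def default_image(tooling_template: dict) -> Optional[str]:
    """Read the first container image from a web-terminal-tooling template dict."""
    for component in tooling_template.get("spec", {}).get("components", []):
        image = component.get("container", {}).get("image")
        if image:
            return image
    return None
```

Both helpers return None when the value is missing, so the form can fall back to an empty field instead of failing.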

Additional Details:

The Timeout and Image components should be similar to the web terminal components (attached in the ticket).
Refer to the comment in epic https://issues.redhat.com/browse/ODC-7119 for more details.

Overview 

HyperShift is being consumed by multiple providers. As a result, the need for documentation increases, especially around infrastructure, hardware and resource requirements, networking, etc.

Goal

Before the GA of Hosted Control Planes, we need to know/document:

  • What Infrastructure is managed by HCP?
  • What are the hardware requirements/prereqs for a hosted cluster?
  • What infrastructure resources get created by Kubernetes/OpenShift during hosted cluster lifecycle?
  • What networking requirements exist, e.g., open ports? 
  • What are storage requirements?
  • What default quota limits are there (e.g., EIP default limits per region)? Do we tell the user to increase them in production? 

DoD

The above questions are answered for all platforms we support, i.e., we need to answer for

  • [x] AWS 
  • [x] Baremetal via the agent
  • [x] KubeVirt 

Feature Overview (aka. Goal Summary)  

Add support for NAT Gateways in Azure when deploying OpenShift on this cloud to manage outbound network traffic, and make this the default option for new deployments.

Goals (aka. expected user outcomes)

When deploying OpenShift on Azure, the Installer will configure NAT Gateways as the default method for handling outbound network traffic, which prevents the SNAT port exhaustion issues associated with the outboundType currently configured by default.

Requirements (aka. Acceptance Criteria):

The installer will use the NAT Gateway object from Azure to manage the outbound traffic from OpenShift.

The installer will create a NAT Gateway object per AZ in Azure so the solution is HA.

Background

Using NAT Gateway for egress traffic is the recommended approach from Microsoft.

This is also a common ask from enterprise customers, because with the current solution used by OpenShift for outbound traffic management in Azure they are hitting SNAT port exhaustion issues.

Epic Goal

  • Control Plane hosts should allow NAT Gateway for Internet egress for purposes of pulling images etc

Why is this important?

Scenarios

  1. Install a new cluster, control plane hosts access the Internet via NAT Gateway rather than via the public load balancer
  2. Install a new cluster, with user defined routing, control plane hosts access Internet via previously available UDR
  3. Upgraded clusters maintain their existing architecture
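The three scenarios above amount to a small decision rule, sketched here in Python. The function name, its parameters and the returned labels ("nat-gateway", "user-defined-routing") are illustrative assumptions, not install-config values or installer code.

```python
from typing import Optional


def control_plane_egress(is_new_install: bool,
                         user_defined_routing: bool,
                         existing_egress: Optional[str] = None) -> str:
    """Pick the egress path for control plane hosts (labels are illustrative)."""
    if not is_new_install:
        # Scenario 3: upgraded clusters keep whatever they already use.
        if existing_egress is None:
            raise ValueError("an upgraded cluster must report its existing egress path")
        return existing_egress
    if user_defined_routing:
        # Scenario 2: the user supplies their own routing.
        return "user-defined-routing"
    # Scenario 1: new installs go through the NAT Gateway.
    return "nat-gateway"
```

Keeping the upgrade check first makes it explicit that no existing cluster is moved to the new architecture as a side effect.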

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Open questions:

  1. NAT Gateway is a must for control plane hosts, but we should likely just use it for all hosts; we need to understand the pros and cons of doing so.
  2. It'd be nice to understand what a potential migration of legacy clusters to the new architecture looks like and what options we have to automate it in a non-disruptive manner.

User Story:

As an administrator, I want to be able to:

  • Allow NAT Gateway as outboundType for clusters in Azure

so that I can achieve

  • Outbound access without exhausting SNAT ports

Acceptance Criteria:

Description of criteria:

  • NAT gateway as an outboundType in install-config

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)  

You can use the oc-mirror OpenShift CLI (oc) plugin to mirror all required OpenShift Container Platform content and other images to your mirror registry by using a single tool. It provides the following features:

  • Provides a centralized method to mirror OpenShift Container Platform releases, Operators, helm charts, and other images.
  • Maintains update paths for OpenShift Container Platform and Operators.
  • Uses a declarative image set configuration file to include only the OpenShift Container Platform releases, Operators, and images that your cluster needs.
  • Performs incremental mirroring, which reduces the size of future image sets.

This feature tracks bringing the oc mirror plugin to the IBM Power and IBM zSystems architectures.

Goals (aka. expected user outcomes)

Bring the oc mirror plugin to the IBM Power and IBM zSystems architectures.

 

Requirements (aka. Acceptance Criteria):

oc mirror plugin on IBM Power and IBM zSystems should behave exactly like it does on x86 platforms.

 

If this Epic is an RFE, please complete the following questions to the best of your ability:

Q1: Proposed title of this RFE

Support for oc mirror plugin (parity to x86)

Q2: What is the nature and description of the RFE?

The oc mirror plugin will be the tool used for mirroring.

Q3: Why does the customer need this? (List the business requirements here)

install disconnected cluster without having x86 nodes available to manage the disconnected installation

Q4: List any affected packages or components

https://docs.openshift.com/container-platform/4.12/installing/disconnected_install/installing-mirroring-disconnected.html 

 

Quay on the platform needs to be available for saving the images.

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.27
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions:

  1. ...

To align with the 4.14 release, dependencies need to be updated to Kubernetes 1.27. This should be done by rebasing or updating, as appropriate for the repository.

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update the driver to the latest upstream release. Notify QE and docs of any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs of any new features and important bugfixes that need testing or documentation.

This includes ibm-vpc-node-label-updater!

(Using separate cards for each driver because these updates can be more complicated)

Update all OCP and Kubernetes libraries in storage operators to the appropriate version for the OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • aws-ebs-csi-driver-operator 
  • aws-efs-csi-driver-operator
  • azure-disk-csi-driver-operator
  • azure-file-csi-driver-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • alibaba-disk-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator
  • ibm-powervs-block-csi-driver-operator

 

  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator

Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories

  • external-attacher
  • external-provisioner
  • external-resizer
  • external-snapshotter
  • node-driver-registrar
  • livenessprobe

Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.

This includes updating the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets and run go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.

Epic Goal

  • The goal of this epic is to upgrade all OpenShift and Kubernetes components that MCO uses to v1.27, which will keep it on par with the rest of the OpenShift components and the underlying cluster version.

Why is this important?

  • Uncover any possible issues with the openshift/kubernetes rebase before it merges.
  • MCO continues using the latest kubernetes/OpenShift libraries and the kubelet, kube-proxy components.
  • MCO e2e CI jobs pass on each of the supported platforms with the updated components.

Acceptance Criteria

  • All stories in this epic must be completed.
  • Go version is upgraded for MCO components.
  • CI is running successfully with the upgraded components against the 4.14/master branch.

Dependencies (internal and external)

  1. ART team creating the go 1.20 image for upgrade to go 1.20.
  2. OpenShift/kubernetes repository downstream rebase PR merge.

Open questions:

  1. Do we need a checklist for future upgrades as an outcome of this epic? Yes, see the checklist below.

Done Checklist

  • Step 1 - Upgrade the Go version to match the rest of the upgraded OpenShift and Kubernetes components.
  • Step 2 - Upgrade Kubernetes client and controller-runtime dependencies (can be done in parallel with step 3)
  • Step 3 - Upgrade OpenShift client and API dependencies
  • Step 4 - Update kubelet and kube-proxy submodules in MCO repository
  • Step 5 - CI is running successfully with the upgraded components and libraries against the master branch.

Please review the following PR: https://github.com/openshift/machine-config-operator/pull/3598

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.26
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository

Feature Overview

The Agent-based Installer currently requires the generated ISO to be booted manually on the target nodes. Support for PXE booting will allow customers to automate their installations via their DHCP/PXE infrastructure.

This feature allows generating installation ISOs ready to add to a customer-provided DHCP/PXE infrastructure.

Goals

As an OpenShift installation admin I want to PXE-boot the image generated by the openshift-install agent subcommand

Why is this important?

We have customers requesting this booting mechanism to make it easier to automate the booting of the nodes without having to actively place the generated image in a bootable device for each host.

Epic Goal

As an OpenShift installation admin I want to PXE-boot the image generated by the openshift-install agent subcommand

Why is this important?

We have customers requesting this booting mechanism to make it easier to automate the booting of the nodes without having to actively place the generated image in a bootable device for each host.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As a user of the Agent-based Installer (ABI), I want to be able to perform customizations via agent-tui when PXE booting so that I can modify network settings.

Implementation details:

Create a new baseImage asset that gets inherited by agentImage and agentpxefiles. The baseImage prepares the initrd along with the necessary ignition and the network TUI, which are now read by agentImage and agentpxefiles.

ARM kernels are compressed with gzip, but most versions of ipxe cannot handle this (it's not clear what happens with raw pxe). See https://github.com/coreos/fedora-coreos-tracker/issues/1019 for more info.

If the platform is aarch64 we'll need to decompress the kernel like we do in https://github.com/openshift/machine-os-images/commit/1ed36d657fa3db55fc649761275c1f89cd7e8abe
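The decompression step described above can be sketched as follows. This is a minimal illustration, not the code from the linked commit: the function name `maybe_decompress_kernel` and the way the architecture is passed in are assumptions. It detects gzip by its two-byte magic number and leaves any other input untouched.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_decompress_kernel(data: bytes, arch: str) -> bytes:
    """Return an uncompressed kernel for aarch64; pass other inputs through.

    Hypothetical helper: only aarch64 kernels that are actually
    gzip-compressed are decompressed.
    """
    if arch == "aarch64" and data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


if __name__ == "__main__":
    fake_kernel = b"placeholder bytes standing in for a kernel image"
    compressed = gzip.compress(fake_kernel)
    print(maybe_decompress_kernel(compressed, "aarch64") == fake_kernel)
    print(maybe_decompress_kernel(compressed, "x86_64") == compressed)
```

Checking the magic bytes rather than trusting the architecture alone keeps the helper safe to call on a kernel that is already uncompressed.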

The new command `agent create pxe-files` reads pxe-base-url from agent-config.yaml. The field is optional in the YAML file. If the URL is provided, the command generates an iPXE script specific to the given URL.
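As a rough sketch of what generating a URL-specific iPXE script could look like: the artifact file names (`agent-vmlinuz`, `agent-initrd.img`, `agent-rootfs.img`) and the function `render_ipxe_script` are hypothetical and chosen only for illustration. The actual output of the installer may differ.

```python
def render_ipxe_script(base_url: str, kernel_args: str = "") -> str:
    """Build an iPXE script that fetches boot artifacts from base_url.

    The artifact file names below are assumptions for this sketch.
    """
    base = base_url.rstrip("/")
    kernel_line = (
        f"kernel {base}/agent-vmlinuz initrd=initrd "
        f"coreos.live.rootfs_url={base}/agent-rootfs.img {kernel_args}"
    ).rstrip()
    lines = [
        "#!ipxe",
        f"initrd --name initrd {base}/agent-initrd.img",
        kernel_line,
        "boot",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ipxe_script("http://boot.example.com/pxe/", "ignition.firstboot"))
```

Normalizing the trailing slash up front means the same script is produced whether or not the user ends the base URL with `/`.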

Currently, we have the kernel parameters in the iPXE script statically defined from what Assisted Service generates. If the default parameters were to change in RHCOS that would be problematic. Thus, it would be much better if we were to extract them from the ISO.

 

The kernel parameters in the ISO are defined in EFI/redhat/grub.cfg (UEFI) and /isolinux/isolinux.cfg (legacy boot)
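A minimal sketch of extracting kernel parameters from the two files named above, assuming the files have already been read out of the ISO as text. The function names and the sample file contents are illustrative assumptions, not the real RHCOS configuration.

```python
def kernel_args_from_grub(cfg_text: str) -> list:
    """Return the arguments that follow the kernel path on the first linux line."""
    for line in cfg_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] in ("linux", "linuxefi"):
            return tokens[2:]  # drop the command and the kernel path
    return []


def kernel_args_from_isolinux(cfg_text: str) -> list:
    """Return the arguments on the first append line, excluding initrd=."""
    for line in cfg_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].lower() == "append":
            return [t for t in tokens[1:] if not t.startswith("initrd=")]
    return []


# Illustrative samples only; real files contain more entries and arguments.
GRUB_SAMPLE = """\
menuentry 'RHEL CoreOS (Live)' {
    linux /images/pxeboot/vmlinuz ignition.firstboot ignition.platform.id=metal
    initrd /images/pxeboot/initrd.img
}
"""

ISOLINUX_SAMPLE = """\
label linux
  kernel /images/pxeboot/vmlinuz
  append initrd=/images/pxeboot/initrd.img ignition.firstboot ignition.platform.id=metal
"""

if __name__ == "__main__":
    print(kernel_args_from_grub(GRUB_SAMPLE))
    print(kernel_args_from_isolinux(ISOLINUX_SAMPLE))
```

Reading the arguments from the ISO at generation time avoids the drift problem described above, where statically defined parameters fall out of sync with RHCOS defaults.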

Epic Goal

Support deploying multi-node clusters using platform none.

Why is this important?

As of Jan 2023 we have almost 5,000 clusters reported using platform none installed on-prem (metal, vmware or other hypervisors with no platform integration) out of a total of about 12,000 reported clusters installed on-prem.

Platform none is desired by users to be able to install clusters across different host platforms (e.g. mixing virtual and physical) where Kubernetes platform integration isn't a requirement. 

A goal of the Agent-Based Installer is to help users who currently can only deploy their topologies with UPI to be able to use the agent-based installer and get a simpler user experience while keeping all their flexibility.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Currently there are validation checks for platform None in OptionalInstallConfig that limit the None platform to 1 control plane replica, 0 compute replicas, and the NetworkType to OVNKubernetes.

These validations should be removed so that the None platform can be installed on clusters of any configuration.

Acceptance Criteria:

  • SNO cluster should still continue to work.
  • SNO validation should still check only OVNKubernetes network type is allowed
  • A compact or HA cluster can be installed with platform None, given the user has configured and deployed an external load balancer.
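The relaxed validation in the criteria above can be sketched as follows. This is an illustration of the intended rule only, not the installer's code: the function name is hypothetical, and treating "1 control plane replica and 0 compute replicas" as the definition of SNO is an assumption taken from the description.

```python
def validate_none_platform(control_plane_replicas: int,
                           compute_replicas: int,
                           network_type: str) -> list:
    """Return validation errors for a platform None install config.

    Only the single-node case keeps the OVNKubernetes restriction;
    compact and HA topologies are no longer rejected.
    """
    errors = []
    is_sno = control_plane_replicas == 1 and compute_replicas == 0
    if is_sno and network_type != "OVNKubernetes":
        errors.append(
            "single-node clusters on platform none require the "
            "OVNKubernetes network type"
        )
    return errors


if __name__ == "__main__":
    print(validate_none_platform(1, 0, "OVNKubernetes"))  # SNO, valid
    print(validate_none_platform(1, 0, "OpenShiftSDN"))   # SNO, rejected
    print(validate_none_platform(3, 2, "OpenShiftSDN"))   # HA, accepted
```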

Feature Overview (aka. Goal Summary)  

Add support to the Installer to make the S3 bucket deletion process during cluster bootstrap on AWS optional.

Goals (aka. expected user outcomes)

Allow the user to opt out of deleting the S3 bucket created during the cluster bootstrap on AWS.

Requirements (aka. Acceptance Criteria):

The user will be able to opt out of deleting the S3 bucket created during the cluster bootstrap on AWS via the install-config manifest, so that the Installer will not try to delete this resource when destroying the bootstrap instance.

The current behavior will remain the default when deploying OpenShift on AWS: both the bootstrap instance and the S3 bucket will be removed unless the user has opted out via the install-config manifest.

Background

Some ROSA customers have SCP policies that prevent the deletion of any S3 bucket, which blocks ROSA adoption for these customers.

Documentation Considerations

Documentation will be required for this feature to explain how to prevent the Installer from removing the S3 bucket, as well as the security concerns of doing so, since the Installer will leave sensitive data used to bootstrap the cluster in the S3 bucket.

Epic Goal

  • Allow the user to opt-out of deleting the S3 bucket created during the cluster bootstrap on AWS.

Why is this important?

  • Some ROSA customers have SCP policies that prevent the deletion of any S3 bucket, which blocks ROSA adoption for these customers.

Scenarios

  1. As a user, I want to be able to instruct the Installer to keep the S3 bucket created during the cluster bootstrap so I can be compliant with my security policies, where the account used to deploy OpenShift does not have the privileges to remove any S3 bucket.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

User Story:

As a developer, I want to:

  • Make the deletion of S3 buckets in AWS optional during bootstrap destroy

so that I can

  • successfully install clusters in restricted environments.

Acceptance Criteria:

Description of criteria:

  • A field is added in the install config for the users to set the S3 deletion to optional.
  • Once the field is set, the bootstrap destroy stage does not delete the S3 buckets.

Engineering Details:

  • Adding a field in the install config and piping it to terraform.
  • If the S3 bucket creation is done in the cluster stage instead of bootstrap, the destroy bootstrap code does not delete the bucket.
  • Might need to create two instances of S3 buckets, one in the bootstrap and one in the cluster stage and control which one is created using the new field in the install config.
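The opt-out behavior can be pictured with a small sketch. The field name `preserve_bootstrap_bucket` and the resource labels are hypothetical, since the description above does not name the install-config field. The sketch only shows how a boolean could decide what bootstrap destroy removes.

```python
from dataclasses import dataclass


@dataclass
class AWSPlatform:
    # Hypothetical field name; the default keeps the existing behavior.
    preserve_bootstrap_bucket: bool = False


def bootstrap_destroy_targets(platform: AWSPlatform) -> list:
    """List the resources that bootstrap destroy would remove."""
    targets = ["bootstrap-instance"]
    if not platform.preserve_bootstrap_bucket:
        targets.append("bootstrap-s3-bucket")
    return targets


if __name__ == "__main__":
    print(bootstrap_destroy_targets(AWSPlatform()))
    print(bootstrap_destroy_targets(AWSPlatform(preserve_bootstrap_bucket=True)))
```

Defaulting the flag to false keeps the current behavior for every user who does not set it.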

Feature Overview (aka. Goal Summary)  

Console Support of OpenShift Pipelines Migration to Tekton v1 API
 

Description of problem

The Pipeline API version is being upgraded to v1 with the Red Hat Pipeline operator 1.11.0 release.
https://tekton.dev/vault/pipelines-main/migrating-v1beta1-to-v1/

Acceptance Criteria

  1. Remove Resources tab from the pipeline details page
  2. Remove Resources section in pipeline builder form

Questions

Does this have to be backward compatible?
Will the features be equivalent? Will the UX / tests / documentation have to be updated?

Description

As a user, 

Acceptance Criteria

  1. should add support for API version v1 for Pipeline as per the doc https://tekton.dev/vault/pipelines-main/migrating-v1beta1-to-v1/
  2. Update the tests and test data

Additional Details:

Description of problem:
When trying the old pipelines operator with the latest 4.14 build I couldn't see the Pipelines navigation items. The operator provides the Pipeline v1beta1, not v1.

Version-Release number of selected component (if applicable):
4.14 master only after https://github.com/openshift/console/pull/12729 was merged

How reproducible:
Always?

Steps to Reproduce:

  1. Setup a 4.14 nightly cluster
  2. Install Pipelines operator from https://artifacts.ospqa.com/builds/ (tested with 1.9.2/v4.13-2303061437)

Actual results:

  1. Pipelines navigation wasn't shown

Expected results:

  1. Pipelines plugin should work also with this Pipelines operator version?
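One way to think about the expected result is a version fallback: prefer v1 when the installed operator serves it, otherwise fall back to v1beta1. The sketch below illustrates that idea only. It is not the console's implementation, and the function name and input shape are assumptions.

```python
# Newest API version first; older operators may only serve v1beta1.
PREFERRED_ORDER = ["v1", "v1beta1"]


def pick_pipeline_api_version(served_versions):
    """Return the most preferred Pipeline API version that is served, or None."""
    for version in PREFERRED_ORDER:
        if version in served_versions:
            return version
    return None


if __name__ == "__main__":
    print(pick_pipeline_api_version(["v1beta1"]))
    print(pick_pipeline_api_version(["v1beta1", "v1"]))
    print(pick_pipeline_api_version([]))
```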

Additional info:

Problem:

We don't provide any samples for developers of serverless functions.

Goal:

Provide Serverless Function samples in the sample catalog.  These would be utilizing the Builder Image capabilities.

Dependencies (External/Internal):

  • Serverless team would need to provide sample repo for serverless function
  • Samples operator would need to be updated

Note:

  • Need to define the API and confirm with other stakeholders - need to support a serverless func image stream "tag"
  • Serverless team will need to provide updates to the existing Image Streams, as well as maintain the sample repositories which are referenced in the Image Streams.
  • Need to understand the relationship between ImageStream and Image Stream Tag
  • Should serverless function samples in the catalog have "builder image" tag?  or should it be "serverless function"

Description

As an operator author, I want to provide additional samples that are tied to an operator version, not an OpenShift release. For that, I want to create a resource to add new samples to the web console.

Acceptance Criteria

  1. openshift/console-operator update so that new clusters have the new ConsoleSample CRD
  2. Add RBAC permissions (roles and rolebinding?) so that all users have access to ConsoleSample resources

Additional Details:

Description

As an operator author, I want to provide additional samples that are tied to an operator version, not an OpenShift release. For that, I want to create a resource to add new samples to the web console.

Acceptance Criteria

  1. Load all cluster-scoped ConsoleSamples resources and show them in the sample catalog
  2. Filter duplicates based on the localization annotations (see enhancement proposal)
    1. All localization labels are optional
    2. Fallback for the name annotation should be metadata.name
    3. Fallback for the language should be english/no annotation
    4. Create a utils function with some unit tests
  3. Ensure that the Samples Import also works with Serverless functions (func.yaml detection)
  4. Show the new VSCode and IntelliJ extension cards from the "Add Serverless function" when importing a Serverless function sample.
  5. Provide some ConsoleSample YAMLs in the PR description
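The duplicate filtering with its fallbacks could be sketched as below. The annotation keys are hypothetical placeholders (the real keys are defined in the enhancement proposal), and the preference order of requested language, then English, then anything else is an assumption about the intended behavior.

```python
# Hypothetical annotation keys, used only for this sketch.
NAME_ANNOTATION = "example.openshift.io/name"
LANG_ANNOTATION = "example.openshift.io/lang"


def _annotations(sample):
    return sample.get("metadata", {}).get("annotations") or {}


def sample_name(sample):
    # Fallback for the name annotation is metadata.name.
    return _annotations(sample).get(NAME_ANNOTATION) or sample["metadata"]["name"]


def sample_lang(sample):
    # No language annotation is treated as English.
    return _annotations(sample).get(LANG_ANNOTATION) or "en"


def filter_localized(samples, preferred_lang):
    """Keep one sample per logical name, preferring preferred_lang, then English."""

    def rank(lang):
        if lang == preferred_lang:
            return 0
        return 1 if lang == "en" else 2

    chosen = {}
    for sample in samples:
        name = sample_name(sample)
        current = chosen.get(name)
        if current is None or rank(sample_lang(sample)) < rank(sample_lang(current)):
            chosen[name] = sample
    return list(chosen.values())


if __name__ == "__main__":
    samples = [
        {"metadata": {"name": "go-sample"}},
        {"metadata": {"name": "go-sample-de",
                      "annotations": {NAME_ANNOTATION: "go-sample",
                                      LANG_ANNOTATION: "de"}}},
        {"metadata": {"name": "other"}},
    ]
    print([s["metadata"]["name"] for s in filter_localized(samples, "de")])
```

Keeping the helper free of any console or API dependency makes it straightforward to unit test, which matches the request for a utils function with tests.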

Additional Details:

  1. https://github.com/openshift/enhancements/pull/1429
  2. https://github.com/openshift/api/pull/1503

Feature Overview (aka. Goal Summary)  

As Arm adoption grows, OpenShift on Arm is a key strategic initiative for Red Hat. Key to success is supporting all key cloud providers that adopt this technology. Google has announced support for Arm in its GCP offering, and we need to support OpenShift in this configuration.

Goals (aka. expected user outcomes)

The ability to have OCP on Arm running in a GCP instance

Requirements (aka. Acceptance Criteria):

OCP on Arm running in a GCP instance

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description:

Update 4.14 documentation to reflect new GCP support on ARM machines.

Updates: 

  • Add google instance types for ARM
  • Add config parameters 
  • Supported installation platforms 
  • Release note

Acceptance criteria: 

  • Dev and QE ack
  • PR is merged 

 

Description: 

In order to add instance types to the OCP documentation, there needs to be a .md file in the OpenShift installer repo that contains the 64-bit ARM machine types that have been tested and are supported on GCP.

Create a PR in the OpenShift installer repo that creates a new .md file that shows the supported instance types 

Acceptance criteria: 

  • Dev and QE ack from ARM side 
  • Dev and QE ack from Installer side
  • Approval from installer product manager 
  • PR is merged and ready to be used for OCP docs referencing 

Feature Overview (aka. Goal Summary)  

Azure File CSI supports both the SMB and NFS protocols. Currently we only support SMB, and there is strong demand from IBM and individual customers to support NFS for POSIX compliance reasons.

 

Goals (aka. expected user outcomes)

Support Azure File CSI with NFS.

The Azure File operator will not automatically create an NFS storage class; we will document how to create one.

 

Requirements (aka. Acceptance Criteria):

There are some concerns about the way Azure File CSI deals with NFS. It does not respect the FSGroup policy supplied in the pod definition. This breaks the Kubernetes convention that a pod should be able to define its own FSGroup policy; instead, Azure File CSI sets a per-driver policy that pods can't override.

 

We brought this problem up with MSFT, but no fix is planned in the driver. Given the pressure from the field, we are going to support NFS with an on root mismatch default and document this specific behavior in our documentation.

 

 

Use Cases (Optional):

As an OCP on Azure admin I want my users to be able to consume NFS based PVs through Azure File CSI.

As an OCP on Azure user I want to attach NFS based PVs to my pods.

As an ARO customer I want to consume NFS based PVs.

 

Out of Scope

Running two drivers, one for NFS and one for SMB to solve the FSGroup issue.

 

Customer Considerations

This feature is a candidate to be backported up to 4.12 if possible.

 

Documentation Considerations

Document that Azure File CSI NFS is supported, how to create a storage class as well as the FSGroup issue.

 

  1. Azure File NFS Supportability
  2. We currently have a CSI driver for Azure Files that supports SMB connectivity. In the interest of maintaining POSIX compliance, supported NFS connectivity would be required.
  3. The goal would be to have supported parity with the current AWS offerings that use "AWS EFS Provisioner" to automatically provision NFS volumes.

It's been decided to support the driver as it is today (see spike STOR-992), knowing that it violates the Kubernetes fsGroupChangePolicy standard, under which a pod is able to decide what FS group policy should be applied. Azure File with NFS applies an FS group policy at the driver level and pods cannot override it. We will keep the driver's default (on root mismatch) and document this unconventional behavior. Also, the Azure File CSI operator will not create a storage class for NFS; admins will need to create it manually, and this will be documented.

There is no need for specific development in the driver or the operator; engineering will make sure we have a working CI.
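Since admins have to create the NFS storage class themselves, the following sketch shows roughly what such a manifest could contain, built as a Python dictionary and printed as JSON. The storage class name is a hypothetical example, and the `skuName` value assumes that NFS shares need a premium storage account, as described in the upstream driver documentation. Treat it as a starting point rather than the documented procedure.

```python
import json


def azure_file_nfs_storage_class(name: str = "azurefile-csi-nfs") -> dict:
    """Build a StorageClass manifest that selects the NFS protocol.

    The default name is a hypothetical example.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "file.csi.azure.com",
        "parameters": {
            "protocol": "nfs",
            # Assumption: premium account type for NFS file shares.
            "skuName": "Premium_LRS",
        },
        "reclaimPolicy": "Delete",
        "volumeBindingMode": "Immediate",
        "allowVolumeExpansion": True,
    }


if __name__ == "__main__":
    print(json.dumps(azure_file_nfs_storage_class(), indent=2))
```

The printed JSON can be applied to a cluster like any other manifest, since Kubernetes accepts JSON as well as YAML.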

1. Proposed title of this feature request

Enable privileged containers to view rootfs of other containers

 

2. What is the nature and description of the request?

The skip_mount_home=true field in /etc/containers/storage.conf causes the mount propagation of container mounts to not be private, which allows privileged containers to access the rootfs of other containers. This is a fix for bug 2065283 (see comment #32 [2]).

This RFE is to enable that field by default in OpenShift, as well as to verify there are no performance regressions when applying it.

 

3. Why does the customer need this? (List the business requirements here)

Customer's use case:

Our agent runs as a daemonset in k8s clusters and monitors the node.
Running with mount propagation set to HostToContainer allows the agent to access any container file, including files of containers that start running after agent startup. With this setting, when a new container starts, a new mount is created and added to the host mount namespace and also to the agent container, so the agent can access the container's files.
E.g. the agent is mounted to /host and can access the filesystem of another container by path
/host/var/lib/containers/storage/overlay/xxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged/test_file

This approach works in k8s clusters and OpenShift 3, but not in OpenShift 4. How can I make the agent pod get notified of any new mount created on the node and get access to it as well?

The workaround for that was provided in bug 2065283 (see comment #32 [2]).
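For reference, the agent-side setup the customer describes (host filesystem mounted at /host with HostToContainer propagation) can be sketched as a container spec fragment. The container name, volume name, and image are hypothetical; `mountPropagation`, `hostPath`, and `securityContext.privileged` are standard Kubernetes pod fields.

```python
import json


def host_root_volume() -> dict:
    """Volume exposing the node's root filesystem (volume name is hypothetical)."""
    return {"name": "host-root", "hostPath": {"path": "/"}}


def agent_container_spec(image: str) -> dict:
    """Container fragment for a privileged node agent mounted at /host."""
    return {
        "name": "node-agent",  # hypothetical container name
        "image": image,
        "securityContext": {"privileged": True},
        "volumeMounts": [
            {
                "name": "host-root",
                "mountPath": "/host",
                # Mounts created on the host later become visible here.
                "mountPropagation": "HostToContainer",
                "readOnly": True,
            }
        ],
    }


if __name__ == "__main__":
    fragment = {
        "containers": [agent_container_spec("registry.example.com/agent:latest")],
        "volumes": [host_root_volume()],
    }
    print(json.dumps(fragment, indent=2))
```

HostToContainer only propagates mounts that are shared on the host side, which is why the private bind mount on the storage home directory matters for this use case.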

 

4. List any affected packages or components.

CRI-O, Node, MCO.

 

Additional information in this Slack discussion [3].

 

 

[1] https://docs.openshift.com/container-platform/4.11/post_installation_configuration/machine-configuration-tasks.html#create-a-containerruntimeconfig_post-install-machine-configuration-tasks
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2065283#c32
[3] https://coreos.slack.com/archives/CK1AE4ZCK/p1670491480185299

Epic Goal

  • use the `skip_mount_home` parameter of /etc/containers/storage.conf to allow containers to see other containers' rootfs

Scenarios

  1. As an author of a node agent, I would like my pods to be able to inspect the rootfs of other containers to gain insight into their behavior.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. https://issues.redhat.com/browse/PERFSCALE-2249

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Currently, SCCs are part of the OpenShift API and are subject to modifications by customers. This leads to a constant stream of issues:

  • Modifications of out-of-the-box SCCs cause core workloads to malfunction
  • Addition of new higher priority SCCs may overrule existing pinned out-of-the-box SCCs during SCC admission and cause core workloads to malfunction

Goals (aka. expected user outcomes)

  • Create a way to prevent SCC preemption and modifications of out-of-the-box SCCs  

Summary (PM+lead)

Currently, SCCs are part of the OpenShift API and are subject to modifications by customers. This leads to a constant stream of issues:

  • Modifications of out-of-the-box SCCs may cause core workloads to malfunction
  • Addition of new higher priority SCCs may overrule existing pinned out-of-the-box SCCs during SCC admission and cause core workloads to malfunction

We need to find and implement schemes to protect core workloads while retaining the API guarantee for modifications of SCCs (unfortunately).

Motivation (PM+lead)

Goals (lead)

Non-Goals (lead)

Deliverables

Proposal (lead)

User Stories (PM)

Dependencies (internal and external, lead)

Previous Work (lead)

Open questions (lead)

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

Users of the OpenShift Console get a streamlined, visual experience when discovering and installing OLM-managed operators in clusters that run on cloud providers with support for short-lived token authentication enabled. The Console makes users aware when this is the case and puts them on the happy path to configure OLM-managed operators with the information needed to support AWS STS.

 

Goals:

Customers do not need to re-learn how to enable AWS STS authentication support for each and every OLM-managed operator that supports it. The experience is standardized and repeatable, so customers spend less time on initial configuration and more time implementing business value. The process is so easy that OpenShift is perceived as an enabler for an increased security posture.

 

Requirements:

  • based on OCPBU-559 and OCPBU-560, the installation and configuration experience for any OLM-managed operator using short-lived token authentication is streamlined using the OCP console in the form of a guided process that avoids misconfiguration or unexpected behavior of the operators in question
  • the OCP Console helps in detecting when the cluster itself is already using AWS STS for core functionality
  • the OCP Console helps discover operators capable of AWS STS authentication and their IAM permission requirements
  • the OCP Console drives the collection of the required information for AWS STS authentication at the right stages of the installation process and stops the process when the information is not provided
  • the OCP Console implements this process with minimal differences across different cloud providers and is capable of adjusting the terminology depending on the cloud provider that the cluster is running on

 

Use Cases:

  • High-level mockups found here: Operators & STS
  • A cluster admin browses the OperatorHub catalog and looks at the details view of a particular operator; there they discover that the cluster is configured for AWS STS
  • A cluster admin browsing the OperatorHub catalog content can filter for operators that support the AWS STS flow described in OCPSTRAT-171
  • A cluster admin reviewing the details of a particular operator in the OperatorHub view can discover that this operator supports AWS STS authentication
  • A cluster admin installing a particular operator can get information about the AWS IAM permission requirements the operator has
  • A cluster admin installing a particular operator is asked to provide the AWS ARN that is required for AWS STS prior to the actual installation step and is prevented from continuing without this information
  • A cluster admin reviewing an installed operator with support for AWS STS can discover the related CredentialsRequest object that the operator created in an intuitive way (not generically via related objects that have an ownership reference or as part of the InstallPlan)

Out of Scope

  • update handling and blocking in case of increased permission requirements in the next / new version of the operator
  • more complex scenarios with multiple IAM roles/service principals resulting in multiple CredentialsRequest objects used by a single operator

 

Background

The OpenShift Console today provides little to no support for configuring OLM-managed operators for short-lived token authentication. Users are generally unaware whether their cluster runs on a cloud provider and is set up to use short-lived tokens for its core functionality, and they do not know which operators support this by implementing the respective flows defined in OCPBU-559 and OCPBU-560.

Customer Considerations

Customers may or may not be aware of short-lived token authentication support. They need proper context and pointers to follow-up documentation that explains the general concept and the specific configuration flow the Console supports. It needs to be clear that the Console cannot fully automate the overall process and that some steps need to be run outside of the cluster/Console using cloud-provider-specific tooling.

This epic tracks the console work needed for STS enablement, as well as the documentation needed to enable operator teams to use this new flow. It does not track Hypershift inclusion of CCO.

 

Plan is to backport to 4.12

 

install flow:

  • User knows which operators do and don’t support STS on a ROSA STS cluster
  • User installs Operator
  • UI has an option to add a RoleARN to the sub.config.env, which the operator adds to the CredentialsRequest during install
  • Operator creates CredentialsRequest with AWS_ROLE_ARN
  • Operator watches secret with special name (TBD) in the namespace
    • Secret name propagated to operator via CRD field (example)
    • Mount the bound service account token on the deployment  (example)
  • CCO creates a secret in the pod namespace based on the CredentialsRequest created
  • Operator extracts AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN from the secret
  • Operator is able to configure the cloud provider SDK

As a user of the console, I would like to provide the required fields for tokenized auth at install time (wrapping and providing sane defaults for what I can do manually in the CLI).

The role ARN provided by the user should be added to the service account of the installed operator as an annotation.

Only manual subscription is supported in STS mode; the automatic option should not be the default or should be greyed out entirely.

 

AC: Add input field to the operator install page, where user can provide the `roleARN` value. This value will be set on the operator's Subscription resource, when installing operator.
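
The acceptance criterion above can be sketched as a small builder for the Subscription object. This is a minimal illustration, not the console's implementation: the function name build_subscription and the environment variable name ROLEARN are assumptions made for the example, while installPlanApproval is set to Manual because only manual subscription is supported in STS mode.

```python
def build_subscription(name, namespace, channel, source, source_namespace, role_arn):
    """Build a Subscription manifest that carries the role ARN in spec.config.env.

    The env var name "ROLEARN" is an assumption for this sketch.
    """
    if not role_arn or not role_arn.strip():
        # The install flow must stop when the ARN is not provided.
        raise ValueError("a role ARN is required to install in STS mode")
    return {
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "Subscription",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "name": name,
            "channel": channel,
            "source": source,
            "sourceNamespace": source_namespace,
            # Only manual approval is supported in STS mode.
            "installPlanApproval": "Manual",
            "config": {"env": [{"name": "ROLEARN", "value": role_arn.strip()}]},
        },
    }
```

Raising an error on an empty ARN mirrors the requirement that the admin cannot continue the installation without this information.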

STS - Security Token Service

Cluster is in STS mode when:

  1. the cluster platform is AWS
  2. credentialsMode in the `cloudcredential` resource is "Manual"
  3. serviceAccountIssuer is non-empty
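
The three conditions above can be combined into a single predicate. The sketch below assumes the caller has already read the platform, the credentialsMode value and the serviceAccountIssuer value from the cluster; the function name is_sts_mode and its parameter names are hypothetical.

```python
def is_sts_mode(platform: str, credentials_mode: str, service_account_issuer: str) -> bool:
    """Return True only when all three STS conditions hold."""
    return (
        platform == "AWS"
        and credentials_mode == "Manual"
        # An issuer made only of whitespace is treated as empty.
        and bool(service_account_issuer and service_account_issuer.strip())
    )
```

For example, is_sts_mode("AWS", "Manual", "") is False because the issuer is empty.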

AC: Inform user on the Operator Hub item details that the cluster is in the STS mode 

As a user of the console I would like to know which operators are safe to install (i.e. support tokenized auth or don't talk to the cloud provider).

 

AC: Add a filter to the Operator Hub for operators that have Short Lived Token Enabled

Feature Overview (aka. Goal Summary)  

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). OpenShift has its own dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission. 

With OpenShift 4.11, we turned on the Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn". 

 

In OpenShift 4.14, we intend to deliver functionality in code that will help accelerate moving to PSA enforcement. This feature tracks those deliverables. 


Epic Goal*

Deliver tools and code that help move toward PSA enforcement

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

What

Don't enforce system defaults on a namespace's pod security labels, if it is managed by a user.

Why

If the managedFields (https://kubernetes.io/docs/reference/using-api/server-side-apply/#field-management) indicate that a user changed the pod security labels, we should not enforce system defaults.

A user might not be aware that the label syncer can be turned off and may try to change the state of the pod security profiles manually.

This fight between a user and the label syncer can cause violations.
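
A minimal sketch of the managedFields check described above, operating on a namespace object already decoded into a dict. The manager name in SYNCER_MANAGER is a placeholder assumption, not the syncer's actual field manager name, and user_managed_psa_labels is a hypothetical helper.

```python
PSA_LABEL_PREFIX = "pod-security.kubernetes.io/"
SYNCER_MANAGER = "label-syncer"  # placeholder; the real manager name may differ


def user_managed_psa_labels(namespace: dict) -> set:
    """Return pod security labels owned by any field manager other than the syncer."""
    owned = set()
    for entry in namespace.get("metadata", {}).get("managedFields", []):
        if entry.get("manager") == SYNCER_MANAGER:
            continue
        labels = entry.get("fieldsV1", {}).get("f:metadata", {}).get("f:labels", {})
        for key in labels:
            # FieldsV1 prefixes each field name with "f:".
            name = key[2:] if key.startswith("f:") else key
            if name.startswith(PSA_LABEL_PREFIX):
                owned.add(name)
    return owned
```

A syncer following the rule above would skip enforcing system defaults for every label in the returned set.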

< High-Level description of the feature ie: Executive Summary >

Goals

< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >

Requirements

Requirements Notes IS MVP
     
    • (Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

Out of scope

<Defines what is not included in this story>

Dependencies

< Link or at least explain any known dependencies. >

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

What does success look like?

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>

QE Contact

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Impact

< If the feature is ordered with other work, state the impact of this feature on the other work>

Related Architecture/Technical Documents

<links>

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

Goal

Additional improvements to segment, to enable the proper gathering of user telemetry and analysis

Problem

Currently, we have no accurate telemetry of the OpenShift Console usage across all fleet clusters. We should be able to utilize the auth and console telemetry to glean details which will allow us to get a picture of console usage by our customers.

There is no way to properly track specific pages

  1. Page titles are localized
  2. Details pages include the project name

Acceptance criteria

  1.  User telemetry page title for all the resource details pages should be changed to resource · tab-name format. Product name should not be part of user telemetry page title
  2. Page title in UI for all the resource details pages should be changed to resource-name · resource · tab-name · Product-name format
  3. User telemetry page title should be non-translated value for tracking purpose
  4. Page title in UI should be translated value

Note:

  • do we need to do anything to be GDPR compliant?

Description

Change page title for all resource details pages to {resource-name} · {resource} ·  {tab-name} · OKD

Acceptance Criteria

  1. Page title for all the resource details pages should be changed to  {resource-name} · {resource} · {tab-name} · OKD format
  2. If details page does not have tabs, then {tab-name} can be just "Details"
  3. Page title should be translated value
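
The title format in these criteria can be sketched as a small helper. The function name details_page_title is hypothetical, and translation is left to the caller, so the sketch only covers joining the parts and the "Details" fallback for pages without tabs.

```python
def details_page_title(resource_name, resource, tab_name=None, product="OKD"):
    """Join the title parts with a middle dot; pages without tabs use "Details"."""
    return " · ".join([resource_name, resource, tab_name or "Details", product])
```

For example, details_page_title("my-pod", "Pod") yields "my-pod · Pod · Details · OKD".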

Additional Details:

Need to check all the resource pages which have details page and change the title.

Description

Update page title to have non-translated title in {resource-name} · {resource} · {tab-name} · OKD format

The page title of every resource details page is added as a non-translated value, in {resource-name} · {resource} · {tab-name} · OKD format, as an attribute on the <title> component (for example, data-title-id). fireUrlChangeEvent uses this value as the title of the telemetry page event. Refer to spike https://issues.redhat.com/browse/ODC-7269 for more details.

 

Acceptance Criteria

  1. Add data-title-id attribute for all the resource details page title component 
  2. Use data-title-id as title value while sending URL change event to telemetry 
  3. If data-title-id attribute value is not present in title use page title value
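
The fallback in these criteria can be sketched as follows. The attributes of the title element are modelled as a plain dict, and the function name and the shape of the returned event are assumptions for illustration only; they are not the console's actual telemetry payload.

```python
def url_change_event(title_attrs: dict, page_title: str, pathname: str) -> dict:
    """Prefer the non-translated data-title-id; fall back to the visible page title."""
    title_id = title_attrs.get("data-title-id")
    return {"title": title_id if title_id else page_title, "path": pathname}
```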

Additional Details:

Refer spike https://issues.redhat.com/browse/ODC-7269 for more details

labelKeyForNodeKind now returns the translated value; previously it returned the label key. Rename labelKeyForNodeKind to getTitleForNodeKind accordingly.

Feature Overview (aka. Goal Summary)  

One of the steps in a disconnected environment install is to mirror the images to a designated system. This feature enhances oc-mirror to handle the multi release payload, that is, the payload that contains all the platform images (x86, Arm, IBM Power, IBM Z). This is a key feature towards supporting disconnected installs in a multi-architecture compute (i.e. mixed architecture) cluster environment.

 

Goals (aka. expected user outcomes)

Customers will be able to use oc-mirror to enable the multi payload in a disconnected environment.

 

Requirements (aka. Acceptance Criteria):

Allow oc-mirror to mirror the multi release payload

 

Epic Goal

  • Add 'oc new-app' support for creating image streams with manifest list support
  • Add 'oc new-build' support for creating image streams with manifest list support

Why is this important?

  • oc commands that create image streams should work correctly on multi-arch clusters

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/IR-289
  2. https://issues.redhat.com/browse/IR-192

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

ACCEPTANCE CRITERIA

  • When creating a workload with 'oc new-app' that points to a manifest-listed image, users should be able to set an "--import-mode=" flag to 'PreserveOriginal' in order to preserve all architectures of the manifest list  
  • 'oc new-app --name <name> <manifestlist-image> --import-mode=PreserveOriginal' should not cause pods to fail due to being the incorrect architecture 
  • Ensure node scheduling happens properly on a heterogeneous cluster when running 'oc new-app' with '--import-mode=PreserveOriginal'
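
As a rough illustration of what the flag controls, the sketch below builds an image stream tag entry whose import policy carries the chosen import mode. The helper name tag_reference is hypothetical, and the two accepted values are taken from the ImportMode API type as understood here; check the API reference linked below before relying on them.

```python
VALID_IMPORT_MODES = ("Legacy", "PreserveOriginal")


def tag_reference(tag: str, image: str, import_mode: str = "Legacy") -> dict:
    """Build an image stream tag entry with the requested import mode."""
    if import_mode not in VALID_IMPORT_MODES:
        raise ValueError("unsupported import mode: " + import_mode)
    return {
        "name": tag,
        "from": {"kind": "DockerImage", "name": image},
        # PreserveOriginal keeps the whole manifest list, so every
        # architecture in it stays available to the cluster.
        "importPolicy": {"importMode": import_mode},
    }
```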

 

ImportMode api reference: https://github.com/openshift/api/blob/master/image/v1/types.go#L294

Original issue and discussion: https://coreos.slack.com/archives/CFFJUNP6C/p1664890804998069

 

 

ACCEPTANCE CRITERIA

  • When creating a build with 'oc new-build' that points to a manifest-listed image, users should be able to set an "--import-mode=" flag to 'PreserveOriginal' to preserve all architectures of the manifest list and let any builder pods build from the manifest-listed image.
  • 'oc new-build' should not cause pods to fail due to being the incorrect architecture 
  • Ensure node scheduling happens properly on a heterogeneous cluster when running 'oc new-build'

 

ImportMode api reference: https://github.com/openshift/api/blob/master/image/v1/types.go#L294

Feature Overview (aka. Goal Summary)  

With this feature it will be possible to autoscale from zero, that is, to have machinesets that create new nodes without any existing nodes, in a mixed architecture cluster configured with multi-architecture compute.

 

Goals (aka. expected user outcomes)

To be able to create a machineset and scale from zero in a mixed architecture cluster environment

 

Requirements (aka. Acceptance Criteria):

Create a machineset and scale from zero in a mixed architecture cluster environment

 

Filing a ticket based on this conversation here: https://github.com/openshift/enhancements/pull/1014#discussion_r798674314

Basically the tl;dr here is that we need a way to ensure that machinesets properly advertise the architecture that their nodes will eventually have. This is needed so the autoscaler can predict the correct pool to scale up or down. This could be accomplished through user-driven means such as adding node arch labels to machinesets; if we have to do this automatically, we need to do more research and figure out a way.

For autoscaling nodes in a multi-arch compute cluster, node architecture needs to be taken into account because such a cluster could potentially have nodes of up to 4 different architectures. Labels can be propagated today from the machineset to the node group, but they have to be injected manually.

 

This story explores whether the autoscaler can use cloud provider APIs to derive the architecture of an instance type and set the label accordingly rather than it needing to be a manual step.
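
A sketch of the idea, under stated assumptions: the lookup table below stands in for a cloud provider API call and covers only two example AWS instance types, and the annotation key is the scale-from-zero labels annotation used by the upstream cluster autoscaler's Cluster API provider as understood here. The helper name arch_label_annotation is hypothetical.

```python
# Illustrative stand-in for a cloud provider API lookup.
ARCH_BY_INSTANCE_TYPE = {
    "m5.large": "amd64",
    "m6g.large": "arm64",
}

LABELS_ANNOTATION = "capacity.cluster-autoscaler.kubernetes.io/labels"


def arch_label_annotation(instance_type: str) -> dict:
    """Return the machineset annotation advertising the node architecture, if known."""
    arch = ARCH_BY_INSTANCE_TYPE.get(instance_type)
    if arch is None:
        # Unknown instance type: leave the machineset untouched.
        return {}
    return {LABELS_ANNOTATION: "kubernetes.io/arch=" + arch}
```

Returning an empty dict for unknown instance types keeps the manual labelling path available as a fallback.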

Feature Overview (aka. Goal Summary)  

In 4.13 the vSphere CSI migration is in a hybrid state. Greenfield 4.13 clusters have migration enabled by default, while upgraded clusters have it turned off unless explicitly enabled by an administrator (referred to as "opt-in").

This feature tracks the final work items required to enable vSphere CSI migration for all OCP clusters.

More information on the 4.13 vSphere CSI migration is available in the internal FAQ

Goals (aka. expected user outcomes)

Finalise vSphere CSI migration for all clusters ensuring that

  • Greenfield 4.14 clusters have migration enabled
  • Enable migration on clusters upgraded from 4.13 (for those that have it disabled)
  • Enable migration on clusters upgraded from 4.14.
  • Disable the 4.13 featureset that allowed admins to opt in to migration

Regardless of a cluster's state (new or upgraded), the version it is upgrading from, or the status of CSI migration (enabled/disabled), all clusters should have CSI migration enabled.
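
The intended end state can be summarised as a small decision function. This is only a reading of the rules stated in this feature, not code from the product; the function name and parameters are hypothetical.

```python
def csi_migration_enabled(minor: int, greenfield: bool, opted_in: bool = False) -> bool:
    """Expected vSphere CSI migration state for an OCP 4.<minor> cluster."""
    if minor >= 14:
        # 4.14 and later: enabled for every cluster, new or upgraded.
        return True
    if minor == 13:
        # 4.13 hybrid state: on for greenfield clusters, opt-in for upgraded ones.
        return greenfield or opted_in
    # 4.12 and earlier: migration disabled.
    return False
```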

This feature also includes upgrade checks in 4.12 & 4.13 to ensure that OCP is running on a recommended vSphere version (vSphere 7.0u3L+ or 8.0u2+).

Requirements (aka. Acceptance Criteria):

We should make sure that all issues that prevented us from enabling CSI migration by default in 4.13 are resolved. If some of these issues are fixed in vSphere itself, we might need to check for a certain vSphere build version before proceeding with the upgrade (from 4.12 or 4.13).

 

Use Cases (Optional):

  • New 4.14 clusters
  • Clusters upgraded from 4.13 with migration enabled
  • Clusters upgraded from 4.13 with migration disabled
  • Clusters upgraded from 4.12 (migration disabled)

 

Background

More information on the 4.13 vSphere CSI migration is available in the internal FAQ

Customer Considerations

Customers who upgraded from 4.12 are unlikely to opt in to migration, so we will have quite a few clusters with migration disabled. Given that we will enable it in 4.14 for every cluster, we need to be extra careful that all issues raised are fixed and set upgrade blockers if needed.

Documentation Considerations

Remove all migration opt-in occurrences in the documentation.

Interoperability Considerations

We need to make sure that upgraded clusters are running on top of a vSphere version that contains all the required fixes.

Epic Goal*

Remove FeatureSet InTreeVSphereVolumes that we added in 4.13.

 
Why is this important? (mandatory)

We assume that the CSI Migration will be GA and locked to default in Kubernetes 1.27 / OCP 4.14. Therefore the FeatureSet must be removed.

Scenarios (mandatory) 

See https://issues.redhat.com/browse/STOR-1265 for upgrade from 4.13 to 4.14

 
Dependencies (internal and external) (mandatory)

Same as STOR-1265, just the other way around ("a big revert")

Description of problem:

The vsphereStorageDriver is deprecated and we should allow cluster admins to remove that field from the Storage object in 4.14.

This is the validation rule that prevents removing vsphereStorageDriver:
https://github.com/openshift/api/blob/0eef84f63102e9d2dfdb489b18fa22676f2bd0c4/operator/v1/types_storage.go#L42

This was originally put in place to ensure that CSI Migration is not disabled again once it has been enabled. However, in 4.14 there is no way to disable migration, and there is an explicit rule to prevent setting LegacyDeprecatedInTreeDriver. So it should be safe to allow removing the vsphereStorageDriver field in 4.14, as this will not disable migration, and the field will eventually be removed from the API in a future release.
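
The rule change argued for above can be sketched as a plain function over the old and new field values. This is an illustration in Python, not the actual validation expression in the API type; the function name, the allow_removal switch and the first error message are assumptions, while the second message is the one reported under "Actual results" below.

```python
LEGACY_DRIVER = "LegacyDeprecatedInTreeDriver"


def validate_driver_update(old_driver: str, new_driver: str, allow_removal: bool = True) -> list:
    """Return validation errors for a change of vsphereStorageDriver ("" means unset)."""
    errors = []
    if new_driver == LEGACY_DRIVER:
        # Hypothetical message; 4.14 prevents setting the legacy driver.
        errors.append("VSphereStorageDriver can not be set to LegacyDeprecatedInTreeDriver")
    if old_driver and not new_driver and not allow_removal:
        # The existing rule that this bug asks to relax.
        errors.append("VSphereStorageDriver is required once set")
    return errors
```

With allow_removal=True, clearing the field passes validation, and migration stays enabled because the legacy driver still cannot be selected.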

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

 

Steps to Reproduce:

1. Set vsphereStorageDriver in the Storage object
2. Try to remove vsphereStorageDriver

Actual results:

* spec: Invalid value: "object": VSphereStorageDriver is required once set

Expected results:

should be allowed

Additional info:

 

Feature Overview (aka. Goal Summary)  

By moving MCO certificate management out of MachineConfigs, certificate rotation can happen at any time, even when pools are paused, and generates no drain or reboot.

Goals (aka. expected user outcomes)

Eliminate problems caused by certificate rotations being blocked by paused pools. Keep certificates up-to-date without disruption to workloads.

Requirements (aka. Acceptance Criteria):

  • MCD reads certificates from our "controllerconfig" directly.

Interoperability Considerations

Windows MCO has been updated to work with this path.

Feature Overview (aka. Goal Summary)  

Having additional MCO metrics is helpful to customers who want to closely monitor the state of their Machines and MachineConfigPools.

 

Requirements (aka. Acceptance Criteria):

Add for each MCP:

    - Paused
    - Updated
    - Updating
    - Degraded
    - MachineCount
    - ReadyMachineCount
    - UpdatedMachineCount
    - DegradedMachineCount
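
A minimal sketch of how these eight values could be derived from a MachineConfigPool object decoded into a dict. The output key names are hypothetical and are not the metric names shipped by the MCO; the sketch only shows which spec and status fields each value would come from.

```python
def pool_metrics(pool: dict) -> dict:
    """Derive per-pool values from spec.paused, status.conditions and the status counts."""
    status = pool.get("status", {})
    conditions = {c["type"]: c.get("status") == "True" for c in status.get("conditions", [])}
    return {
        "paused": int(bool(pool.get("spec", {}).get("paused", False))),
        "updated": int(conditions.get("Updated", False)),
        "updating": int(conditions.get("Updating", False)),
        "degraded": int(conditions.get("Degraded", False)),
        "machine_count": status.get("machineCount", 0),
        "ready_machine_count": status.get("readyMachineCount", 0),
        "updated_machine_count": status.get("updatedMachineCount", 0),
        "degraded_machine_count": status.get("degradedMachineCount", 0),
    }
```

Booleans are mapped to 0 or 1 so that each value can be exposed as a gauge and used in alerting rules.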

Creating this to version scope the improvements merged into 4.14. Since those changes were in a story, they need an epic.

Customers would like to have some MachineConfigOperator metrics in Prometheus. For each MCP:
    
    - Paused
    - Updated
    - Updating
    - Degraded
    - MachineCount
    - ReadyMachineCount
    - UpdatedMachineCount
    - DegradedMachineCount
   

Why does the customer need this? (List the business requirements here)

These metrics would be really important, as they could show any MachineConfig action (updating, degraded, ...) and could even trigger an alarm through a PrometheusRule. Having a MachineConfig dashboard would also be really useful.

 


Feature Overview

Extend the Workload Partitioning feature to support multi-node clusters.

Goals

Customers running RAN workloads on C-RAN hubs (i.e., multi-node clusters) who want to maximize the cores available to the workloads (DU) should be able to use Workload Partitioning (WP) to isolate control plane (CP) processes to reserved cores.

Requirements

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

requirement Notes isMvp?
     
     
     

 

Describe Use Cases (if needed)

< How will the user interact with this feature? >

< Which users will use this and when will they use it? >

< Is this feature used as part of current user interface? >

Out of Scope

 

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question Outcome
   

 

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Write a test that inspects known management pods, and creates a new management pod, to verify that they adhere to the expected CPU affinity and CPU shares.

Ex:

pgrep kube-apiserver | while read i; do taskset -cp $i; done
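A test along these lines could feed the output of the command above into a small checker. The sketch below is one possible shape, assuming `taskset -cp` prints lines of the form `pid <N>'s current affinity list: <cpus>` with plain ranges and comma-separated ids; the helper names and the reserved CPU list are hypothetical.

```python
import re

# Matches lines such as: pid 1234's current affinity list: 0-1,52-53
AFFINITY_RE = re.compile(r"^pid (\d+)'s current affinity list: (.+)$")


def parse_cpu_list(text):
    """Expand a CPU list such as '0-1,52-53' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pids_outside_reserved(taskset_output, reserved):
    """Return the pids whose affinity is not contained in the reserved CPUs."""
    reserved_cpus = parse_cpu_list(reserved)
    offenders = []
    for line in taskset_output.splitlines():
        match = AFFINITY_RE.match(line.strip())
        if not match:
            continue  # ignore anything that is not an affinity line
        pid, cpu_list = int(match.group(1)), match.group(2)
        if not parse_cpu_list(cpu_list) <= reserved_cpus:
            offenders.append(pid)
    return offenders


if __name__ == "__main__":
    sample = (
        "pid 101's current affinity list: 0-1\n"
        "pid 202's current affinity list: 0-3\n"
    )
    # Assumes CPUs 0-1 are the reserved set for management workloads.
    print(pids_outside_reserved(sample, "0-1"))
```

An empty result would indicate that every inspected process stays within the reserved cores.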
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Make validation tests run on all platforms by removing skips.

The original implementation of workload partitioning tried to leverage the default CRI-O behavior to allow full use of CPU sets when the user supplies no Performance Profile while the cluster is still CPU partitioned. This works fine for CPU affinity. However, because we don't supply a config and instead let the default behavior kick in, CRI-O does not alter the CPU shares and gives every pod a CPU share value of 2.

We need to supply a CRI-O config with an empty string for the CPU set to support both CPU share and CPU affinity behavior when NO performance profile is supplied, so that the `resource.requests`, which get converted to CPU shares, are correctly applied in the default state.

Note that this is not an issue with CPU affinity, which still behaves as expected. When a performance profile is supplied, things work as intended as well. The CPU share mismatch is the only issue identified here.
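To make the "CPU share value of 2" concrete, the sketch below models the usual conversion from a CPU request in millicores to cgroup CPU shares. It assumes the constants commonly used by the kubelet (1024 shares per CPU, a floor of 2 shares); the function name is illustrative, and this is not the CRI-O code path itself.

```python
MIN_SHARES = 2          # floor applied when no (or a tiny) CPU request is present
SHARES_PER_CPU = 1024   # shares that correspond to one full CPU
MILLI_CPU_PER_CPU = 1000


def milli_cpu_to_shares(milli_cpu):
    """Convert a CPU request in millicores to cgroup CPU shares."""
    if milli_cpu <= 0:
        # No request: the pod ends up with the minimum share value.
        return MIN_SHARES
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    return max(shares, MIN_SHARES)


if __name__ == "__main__":
    for request in (0, 100, 1000):
        print(request, milli_cpu_to_shares(request))
```

Under these assumptions, a pod whose request is not applied lands on the minimum of 2 shares, which matches the symptom described above.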

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Create generic validation tests in Origin and Release repo to check that a cluster is correctly configured. E2E tests running in a cpu partitioned cluster should run successfully.

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Value Statement

The HostedCluster and NodePool specs already have a "pausedUntil" field.

 

pausedUntil:
  description: 'PausedUntil is a field that can be used to pause reconciliation on a resource. Either a date can be provided in RFC3339 format or a boolean. If a date is provided: reconciliation is paused on the resource until that date. If the boolean true is provided: reconciliation is paused on the resource until the field is removed.'
  type: string
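The semantics in the field description can be sketched as a small decision function. This is an illustration only, not the HyperShift implementation: the `is_paused` name is hypothetical, and treating an empty value or the string "false" as "not paused" is an assumption the description does not spell out.

```python
from datetime import datetime, timezone


def is_paused(paused_until, now):
    """Decide whether reconciliation is paused for a given pausedUntil string."""
    if paused_until is None or paused_until.strip() == "":
        return False  # assumption: an unset field means "not paused"
    value = paused_until.strip()
    if value.lower() == "true":
        return True   # paused until the field is removed
    if value.lower() == "false":
        return False  # assumption: not covered by the field description
    # Otherwise expect an RFC3339 timestamp. Older Python versions do not
    # accept a trailing "Z" in fromisoformat, so normalize it to an offset.
    deadline = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return now < deadline


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(is_paused("true", now))
    print(is_paused("2024-06-01T00:00:00Z", now))
    print(is_paused("2023-06-01T00:00:00Z", now))
```

Because the field is typed as a string, both the boolean and the date form arrive as text, which is why the sketch compares strings before attempting to parse a timestamp.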

 

This option is currently not exposed in the "hypershift create cluster" command.

 

To support the HCP create/update automation template with ClusterCurator, users should be able to run "hypershift create cluster" with the PausedUntil flag.

Definition of Done for Engineering Story Owner (Checklist)

  • I can create a hosted cluster with "hypershift create cluster <platform> --pausedUntil true"
  • HostedCluster and NodePool CRs from this command should contain "pausedUntil" field in the spec.
  • The hosted cluster creation should be paused until the pausedUntil=true field is removed from the HostedCluster and NodePool CRs 
  • This should work for agent, kubevirt and aws platforms.

Development Complete

  • The code is complete.
  • Functionality is working.
  • Any required downstream Docker file changes are made.

Tests Automated

  • [ ] Unit/function tests have been automated and incorporated into the build.
  • [ ] 100% automated unit/function test coverage for new or changed APIs.

Secure Design

  • [ ] Security has been assessed and incorporated into your threat model.

Multidisciplinary Teams Readiness

Support Readiness

  • [ ] The must-gather script has been updated.

Goal

Improve the kubevirt-csi storage plugin features and integration as we make progress towards the GA of a KubeVirt provider for HyperShift.

User Stories

  • "As a hypershift user,
    I want infra cluster StorageClasses made available to guest clusters,
    so that guest clusters can have persistent storage available."

Infra storage classes made available to guest clusters must support:

  • RWX AccessMode
  • Filesystem and Block VolumeModes

Non-Requirements

  • VolumeSnapshots
  • CSI Clone
  • Volume Expansion

Notes

  • Any additional details or decisions made/needed

Done Checklist

Who | What | Reference
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue>
DEV | Upstream documentation merged | <link to meaningful PR>
DEV | gap doc updated | <name sheet and cell>
DEV | Upgrade consideration | <link to upgrade-related test or design doc>
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso>
QE | Test plans in Polarion | <link or reference to Polarion>
QE | Automated tests merged | <link or reference to automated tests>
DOC | Downstream documentation merged | <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The HyperShift KubeVirt platform only supports guest clusters running 4.14 or greater (because the KubeVirt RHCOS image is only delivered in 4.14), and it also only supports OCP 4.14 and CNV 4.14 for the infra cluster.

 

Add backend validation on the HostedCluster that validates the parameters are correct before processing the hosted cluster. If these conditions are not met, then report back the error as a condition on the hosted cluster CR
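One way to picture this validation is a function that checks the versions and returns a condition-like record. The sketch below is hypothetical throughout: the condition type `ExampleValidKubeVirtVersions`, the helper names, and the choice to treat 4.14 as a minimum for the infra cluster are assumptions, not the HyperShift API.

```python
MIN_VERSION = (4, 14)


def major_minor(version):
    """Return (major, minor) from a version string such as '4.14.0-rc.5'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def validate_versions(guest, infra_ocp, infra_cnv):
    """Build a condition-like dict describing whether the versions are supported."""
    checks = {"guest cluster": guest, "infra OCP": infra_ocp, "infra CNV": infra_cnv}
    problems = [
        "%s version %s is older than 4.14" % (name, value)
        for name, value in checks.items()
        if major_minor(value) < MIN_VERSION
    ]
    return {
        "type": "ExampleValidKubeVirtVersions",  # hypothetical condition type
        "status": "False" if problems else "True",
        "message": "; ".join(problems),
    }


if __name__ == "__main__":
    print(validate_versions("4.13.53", "4.14.1", "4.14.0"))
```

Reporting the result as a condition rather than failing outright lets the user see why the hosted cluster is not being processed.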

Based on the perf scale team's results, enabling multiqueue when jumbo frames (MTU >= 9000) are in use can greatly improve throughput, as seen by comparing slides 8 and 10 in this slide deck: https://docs.google.com/presentation/d/1cIm4EcAswVDpuDp-eHVmbB7VodZqQzTYCnx4HCfI9n4/edit#slide=id.g2563dda6aa5_1_68
However, enabling multiqueue with a small MTU causes "throughput to crater".

This task involves adding an API option to the KubeVirt platform within the NodePool API, as well as adding a CLI option for enabling multiqueue in the hcp CLI (the new productized CLI).
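The trade-off between multiqueue and MTU can be captured in a small advisory check. This is only a sketch of the reasoning: the function name is hypothetical, the 9000 threshold comes from the jumbo-frame figure quoted in this card, and nothing here reflects the actual API or CLI option.

```python
JUMBO_MTU = 9000  # threshold taken from the jumbo-frame figure quoted in the card


def multiqueue_warning(enabled, mtu):
    """Return an advisory message when the multiqueue setting and MTU disagree."""
    if enabled and mtu < JUMBO_MTU:
        return "multiqueue is enabled but MTU %d is below %d; throughput may drop" % (mtu, JUMBO_MTU)
    if not enabled and mtu >= JUMBO_MTU:
        return "MTU %d uses jumbo frames; enabling multiqueue may improve throughput" % mtu
    return None


if __name__ == "__main__":
    print(multiqueue_warning(True, 1500))
    print(multiqueue_warning(False, 9000))
```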

Problem Alignment

The Problem

Many customers still predominantly use logs as the main source of data for quickly identifying problems. Many issues can also be identified by metrics, but there are some events in security, such as suspicious IP address activity, or runtime system issues, such as host errors, where logs are your friend. OpenShift currently only supports defining alerting rules and getting notifications based on metrics. That leaves a big gap in identifying, and being notified immediately of, the events mentioned above.

High-Level Approach

As we move the Logging stack towards using Loki (see OBSDA-7), we will be able to use its out-of-the-box capabilities to define alerting rules on logs using LogQL. That approach is very similar to Prometheus' alerting ecosystem and actually gives us the opportunity to reuse Prometheus' Alertmanager to distribute alerts/notifications. For customers, this means they do not need to configure different channels twice, for metrics and logs, but can reuse the same configuration.

For the configuration itself, we need to look into introducing a CRD (similar to the PrometheusRule CRD inside the Prometheus Operator) to allow users with non-admin permissions to configure the rules without changing the central Loki configuration.

Goal & Success

  • Allow individual users to configure alerting rules based on patterns inside a log record.

Solution Alignment

Key Capabilities

  • As an Application SRE, I'd like to configure SLIs to get alerted when the number of messages that meet some criteria (e.g. errors) exceeds a particular threshold.
  • As an Application SRE, I'd like to configure where alerts will be sent so that I get notified on the right channels.
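The first capability above amounts to counting log records that match a pattern within a time window and comparing the count to a threshold. In practice this would be expressed as a LogQL alerting rule evaluated by Loki; the sketch below only models the decision logic, and the `should_alert` name and record shape are hypothetical.

```python
import re
from datetime import datetime, timedelta


def should_alert(records, pattern, window, threshold, now):
    """Fire when more than `threshold` matching records fall inside the window.

    `records` is an iterable of (timestamp, message) tuples.
    """
    regex = re.compile(pattern)
    cutoff = now - window
    matches = sum(
        1
        for timestamp, message in records
        if cutoff <= timestamp <= now and regex.search(message)
    )
    return matches > threshold


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)
    logs = [
        (now - timedelta(minutes=1), "ERROR connection refused"),
        (now - timedelta(minutes=2), "error timeout"),
        (now - timedelta(minutes=30), "ERROR stale entry"),
        (now - timedelta(minutes=3), "INFO request served"),
    ]
    print(should_alert(logs, r"(?i)error", timedelta(minutes=5), 1, now))
```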

Key Flows

Open Questions & Key Decisions (optional)

  • Do we provide integration into Prometheus Alertmanager only and if so, how?
    • Note: We could integrate into our Monitoring's Alertmanager automatically, but what happens if a customer decides to use an external Alertmanager and configures that inside Monitoring? I think we need to discuss this with the Monitoring team and identify whether we actually want a more centralized approach to alerting, as opposed to dividing it into Metrics and Logs, each with a dedicated instance. I think that's another perfect use case for why Observatorium would be better to use in general in the future. It combines the Metrics and Log stacks into one single deployment, so we would only need to expose a single Alertmanager and a configuration for pointing Prometheus and Loki to an external instance if necessary.

Goals

  1. Enable OpenShift Application owners to create alerting rules based on logs scoped only for applications they have access to.
  2. Provide support for notifying on firing alerts in OpenShift Console

Non-Goals

  1. Provide support for logs-based metrics that can be used in PrometheusRule custom resources for alerting.

Motivation

Since OpenShift 4.6, application owners can configure alerting rules based on metrics themselves as described in User Workload Monitoring (UWM) enhancement. The rules are defined as PrometheusRule resources and can be based on platform and/or application metrics.

To expand the alerting capabilities on logs as an observability signal, cluster admins and application owners should be able to configure alerting rules as described in the Loki Rules docs and in the Loki Operator Ruler upstream enhancement.

The AlertingRule CRD fulfills the requirement to define alerting rules for Loki similar to PrometheusRule.

The RulerConfig CRD fulfills the requirement to connect the Loki Ruler component to notify a list of Prometheus Alertmanager hosts on firing alerts.

Alternatives

  1. Use only the RecordingRule CRD to export logs as metrics first and rely on present cluster-monitoring/user-workload-monitoring alerting capabilities.

Acceptance Criteria

  1. OpenShift Application owners receive notifications for application logs-based alerts on the same OpenShift Console Alerts view as with metrics.

Risk and Assumptions

  1. Assuming that the present OpenShift Console implementation of the Alerts view can list and manage alerts from Alertmanager that originate from Loki.
  2. Assuming that the present UWM tenancy model applies to the logs-based alerts.

Documentation Considerations

Open Questions

Additional Notes

  1. Enhancement proposal: Cluster Logging: Logs-Based Alerts
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

“As a dev user, I want to use the silences as admins do, so I can get the same features”

Acceptance Criteria

Given a dev user logged in to the console and using a developer perspective

When the user navigates to the observe section

Then the user can see a silences tab that has the same features as the admin but restricted only to the current selected namespace

Feature Overview (aka. Goal Summary)

This feature aims to enhance observability and user experience for customers of self-managed Hosted Control Planes (HCP) using ACM/MCE by leveraging the existing observability feature stack (e.g., the pluggable dashboard console feature in the OCP console as the MVP in case ACM is not in use). This approach ensures improved monitoring capabilities and aligns with the tenancy model of User Workload Monitoring (UWM). It also strongly encourages an upsell from MCE to ACM to access those features, and provides a best-practice, validated pattern for customers willing to build it on their own (with a lot of effort compared to ACM).

Goals (aka. expected user outcomes)

Users, particularly SRE teams (the cluster service provider persona), will gain enhanced visibility into the health and performance of their HCPs through a customizable monitoring dashboard. This dashboard will provide critical metrics and alerts, aiding in proactive management and troubleshooting. Existing observability features in ACM will be expanded to include these capabilities.

Requirements (aka. Acceptance Criteria)

  • Introduction of custom dashboards via the OCP console dashboard plugin feature.
  • Monitoring and tracking for agreed-upon SLIs/SLOs.
  • Dashboard configuration per HCP, aligning with the UWM tenancy model.
  • Alerts are exposed to highlight symptoms, potentially following predefined runbooks.
  • Enhanced visibility into HCP health and performance (API server, control plane).
  • Unified observability dashboard within ACM for centralized monitoring.
  • Clear reporting of key signals for SRE teams.
  • Actionable alerts based on monitored signals.

Key Considerations

  • Dashboard creation is to be initiated when the customer opts in for all metrics (not just telemetry). By default, not all metrics are exported to avoid overloading the monitoring stack. 
  • The dashboard will track key Service Level Indicators (SLIs) and Service Level Objectives (SLOs) like API availability, API server error rates, usage for the rest of the control plane and in the future latency between the control plane and workers. We will start with the top three easiest metrics to implement.
  • Additionally, alerts should be exposed to highlight symptoms.
  • We aim to provide a pragmatic, if not aesthetically perfect, user experience from a monitoring standpoint without muddling our ACM messaging. The Northstar here is the ACM observability stack as a sustainable comprehensive monitoring solution.
  • Dashboard configuration is per HCP, with each HCP living in its own OpenShift project (namespace). This is compatible with the tenancy model of User Workload Monitoring (UWM).

Deployment Considerations

Deployment considerations | List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both | Self-managed (but reusable in managed with xCM)
Classic (standalone cluster) | N/A
Hosted control planes | Applicable
Multi node, Compact (three node), or Single node (SNO), or all | N/A
Connected / Restricted Network | Applicable
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | Applicable
Operator compatibility | Observability Operator (ObO)
Backport needed (list applicable versions) | N/A
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | OpenShift Console, dynamic plugin
Other (please specify) | N/A

Use Cases 

  • Monitoring API availability and error rates in a self-managed HCP.
  • Alerting SRE teams (cluster service providers) about critical performance issues in real-time.
  • Unified monitoring across multiple clusters via ACM including feature parity for HCP.

Open Discussion / Long-term Concerns 

The usage of UWM for HCP metrics on the management cluster has a few drawbacks:

  • Configuration via ConfigMap is more error-prone and less GitOps friendly.
  • There are fewer configuration knobs than are available out of the box with the Observability Operator (ObO), and the delivery model is slower because it is bound to the OCP release cadence.

These issues would be resolved by using ObO, which is currently being productized.

 

Other questions to answer:

  • How will the dashboard handle large volumes of metrics without overloading the monitoring stack?
  • What specific runbooks will be referenced for alerting?
  • How will the configuration be managed to ensure GitOps compatibility?

Background

This feature should leverage existing functionality when possible to align with other OCP observability efforts (e.g., pluggable dashboard console feature in the OCP console) to provide enhanced observability for HCP users. It should align with the existing UWM tenancy model and address immediate monitoring needs while considering future improvements via the Observability Operator.

Customer Considerations

Customers opting for full metrics export must be aware of the potential impact on the monitoring stack. Clear documentation and guidelines will be provided to manage configuration and alerts effectively.

Documentation Considerations

Documentation will include setup guides, configuration examples, and troubleshooting tips. It will also link to existing ACM observability documentation for comprehensive coverage.

Goal

  • Each self-managed MCE deployment of hosted clusters should bring a dashboard that provides key metrics of the hosted cluster. 

Why is this important?

  • Customers need to be able to have a standardized view of the key metrics in their cluster to be able to manage their infrastructure in general and management cluster in particular

Scenarios

  1. A management cluster admin wants to see the metrics for the hosted clusters that are deployed in their infrastructure

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • Docs on the presence of the dashboard

Dependencies (internal and external)

  1. Operator dashboard installation mechanism

Previous Work (Optional):

  1. CI dashboards

Open questions:

  1. Which are the key metrics?
  2. What should be the mechanism to enable this for only self-managed deployments?
  3. Should the dashboards be present in ACM deployments?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As a hosted cluster deployer I want to have the HyperShift Operator:

  • Create a dashboard for inclusion in the OpenShift console that is populated with the key metrics for the new cluster
  • Delete the hosted cluster metrics dashboard when a hosted cluster is removed

so that:

  • I can determine the status and health of my hosted clusters over time
  • I can identify deterioration in the service

https://docs.google.com/document/d/1UwHwkL-YtrRJYm-A922IeW3wvKEgCR-epeeeh3CBOGs/edit

configMap example: https://github.com/openshift/console-dashboards-plugin/blob/main/docs/add-datasource.md

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

OCP/Telco Definition of Done

Epic Template descriptions and documentation.

Epic Goal

Why is this important?

Drawbacks

  • N/A

Scenarios

  • CI Testing

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. SDN Team

Previous Work (Optional):

  1. N/A

Open questions:

  1. N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Go 1.16 added the new embed directive to Go. This directive lets you natively (and trivially) compile static asset files into your binary.

The current go-bindata dependency that's used in both the Ingress and DNS operators for YAML asset compilation could be dropped in exchange for the new Go embed functionality. This would reduce our dependency count, remove the need for `bindata.go` (which is version controlled and constantly updated), and make our code easier to read. This switch would also reduce the overall lines of code in our repos.

Note that this may be applicable to OCP 4.8 if and when images are built with Go 1.16.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • To refactor various unit tests in cluster-ingress-operator to align with the desired unit test standards. The unit tests need various cleanups to meet the standards of the network edge, such as:
    • Using t.Run in all unit tests for sub-test capabilities
    • Removing extraneous test cases
    • Fixing incorrect error messages

Why is this important?

  • Maintaining standards in unit tests is important for the debug-ability of our code

Scenarios

  1. ...

Acceptance Criteria

  • Unit tests generally meet our software standards

Dependencies (internal and external)

  1.  

Previous Work (Optional):

  1. For shift week, Miciah provided a handful of commits (https://github.com/Miciah/cluster-ingress-operator/commits/gateway-api) that were the motivation for creating this epic.

Open questions:

  1. N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Refactor Test_desiredLoadBalancerService to match our unit test standards, remove extraneous test cases, and make it more readable/maintainable.

Unit test names should follow the Test_<functionName> format, so that the scope of the function under test (private or public) is preserved in the test name.

Test_desiredHttpErrorCodeConfigMap contains a section of dead code when checking for expect == nil || actual == nil. Clean this up.

Also replace Ruby-style #{} syntax for string interpolation with Go string formats.

Feature Overview:

Hypershift-provisioned clusters, regardless of the cloud provider, support the proposed OLM-managed integration outlined in OCPBU-559 and OCPBU-560.

 

Goals 

There is no degradation in capability or coverage of OLM-managed operators that support short-lived token authentication on clusters that are lifecycled via Hypershift.

 

Requirements:

  • the flows in OCPBU-559 and OCPBU-560 need to work unchanged on Hypershift-managed clusters
  • most likely this means that Hypershift needs to adopt the CloudCredentialOperator
  • all operators enabled as part of OCPBU-563, OCPBU-564, OCPBU-566 and OCPBU-568 need to be able to leverage short-lived authentication on Hypershift-managed clusters without being aware that they are on Hypershift-managed clusters
  • also OCPBU-569 and OCPBU-570 should be achievable on Hypershift-managed clusters

 

Background

Currently, Hypershift lacks support for CCO.

Customer Considerations

Currently, Hypershift will be limited to deploying clusters in which the cluster core operators are leveraging short-lived token authentication exclusively.

Documentation Considerations

If we are successful, no special documentation should be needed for this.

 

Outcome Overview

Operators on guest clusters can take advantage of the new tokenized authentication workflow that depends on CCO.

Success Criteria

CCO is included in HyperShift and its footprint is minimal while meeting the above outcome.

 

Expected Results (what, how, when)

 

 

Post Completion Review – Actual Results

After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).

 

CCO currently deploys the pod identity webhook as part of its deployment. As part of the effort to reduce the footprint of CCO, the deployment of this pod should be conditional on the infrastructure.

This epic tracks work related to designing how to include CCO into HyperShift in order for operators on guest clusters to leverage the STS UX defined by this project.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

  • Enable partners to create OpenShift-based appliances

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

  • Enable partners to create OpenShift-based appliances, which means OpenShift plus other operators plus partner software.
  • These appliances may be deployed in disconnected and/or remote sites.
  • These appliances may be SNO (lower priority) or multi-node.
  • These appliances may be physical (metal) or virtual (generally vSphere).
  • The appliance cannot rely on any external infrastructure (e.g., registry, DNS, etc.)
  • The full lifecycle of the appliance must be supported in a user-friendly manner (deploy, upgrade, backup/restore, redeploy).
  • See MGMT-13122 for feature details.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Provide documentation for how partners can use the solution in a KCS article
  • Develop a blog and/or video that describes the solution and how to use it
  • Technical enablement material

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations | List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both | Self-managed (though could be managed by partner)
Classic (standalone cluster) | Classic
Hosted control planes | Future
Multi node, Compact (three node), or Single node (SNO), or all | SNO
Connected / Restricted Network | All – connected and disconnected, air-gapped
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64
Operator compatibility | TBD
Backport needed (list applicable versions) | N/A
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A
Other (please specify) |

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

In OCP 4.14, we provided the ability to pass cluster configs to the agent-based installer (AGI) after booting the image (AGENT-559).

In OCP 4.15, we published upstream documentation on how to use the Appliance Image builder utility to build disk images with the Agent-based Installer to enable appliance installations; see https://github.com/openshift/appliance/blob/master/docs/user-guide.md. This is “Dev Preview”. The appliance tooling is currently supported and maintained by ecosystem engineering.

In OCP 4.16, this Appliance image builder utility will be bundled and shipped and will be available at registry.redhat.io (we are “productizing” this part). In the near term, we’ll document this via KCS and not official docs (to minimize confusion about documenting a feature that only impacts a small subset of appliance partners).

This appliance tool combines 2 features:

  • Registry-less clusters (air-gapped) - there are currently no plans for the installation part outside of the appliance (unless we have a good solution planned for scale-out and upgrades).
  • Disk image generation from cluster configuration for appliance use

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • Enable partners to create OpenShift-based appliances, which means OpenShift plus other operators plus partner software.
  • These appliances may be deployed in disconnected and/or remote sites.
  • These appliances may be SNO (lower priority) or multi-node.
  • These appliances may be physical (metal) or virtual (generally vSphere).
  • The appliance cannot rely on any external infrastructure (e.g., registry, DNS, etc.)
  • The full lifecycle of the appliance must be supported in a user-friendly manner (deploy, upgrade, backup/restore, redeploy).

Why is this important?

  • Grow the business

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

BU Priority Overview

Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal

Goals

  • Validate OpenShift on OCI baremetal to make it officially supported.
  • Enable installation of OpenShift 4 on OCI bare metal using Assisted Installer.
  • Provide published installation instructions for how to install OpenShift on OCI baremetal.
  • OpenShift 4 on OCI baremetal can be updated, resulting in a cluster and applications that are in a healthy state when the update is completed.
  • Telemetry reports back on clusters using OpenShift 4 on OCI baremetal for connected OpenShift clusters (e.g. platform=external or none + some other indicator to know it's running on OCI baremetal).

Use scenarios

  • As a customer, I want to run OpenShift Virtualization on OpenShift running on OCI baremetal.
  • As a customer, I want to run Oracle BRM on OpenShift running on OCI baremetal.

Why is this important

  • Customers who want to move from on-premises to Oracle cloud baremetal
  • OpenShift Virtualization is currently only supported on baremetal

Requirements

 

Requirement | Notes
OCI Bare Metal Shapes must be certified with RHEL | It must also work with RHCOS (see iSCSI boot notes) as OCI BM standard shapes require RHCOS iSCSI to boot (OCPSTRAT-1246). Certified shapes: https://catalog.redhat.com/cloud/detail/249287
Successfully passing the OpenShift Provider conformance testing – this should be fairly similar to the results from the OCI VM test results. | Oracle will do these tests.
Updating Oracle Terraform files |
Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. | Support Oracle Cloud in Assisted-Installer CI: MGMT-14039

 

RFEs:

  • RFE-3635 - Supporting Openshift on Oracle Cloud Infrastructure(OCI) & Oracle Private Cloud Appliance (PCA)

OCI Bare Metal Shapes to be supported

Any bare metal Shape to be supported with OCP has to be certified with RHEL.

From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 tracks adding this support and removing this restriction in the future.

As of Aug 2023 this excludes at least all the Standard shapes, BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes 

Assumptions

  • Pre-requisite: RHEL certification which includes RHEL and OCI baremetal shapes (instance types) has successfully completed.

 

 

 

 
 

Feature goal (what are we trying to solve here?)

Please describe what this feature is going to do.

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

Feature Overview

  • As a Cluster Administrator, I want to opt-out of certain operators at deployment time using any of the supported installation methods (UPI, IPI, Assisted Installer, Agent-based Installer) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a Cluster Administrator, I want to opt-in to previously-disabled operators (at deployment time) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a ROSA service administrator, I want to exclude/disable Cluster Monitoring when I deploy OpenShift with HyperShift, using any of the supported installation methods including the ROSA wizard in OCM and rosa cli, since I get cluster metrics from the control plane.  This configuration should persist not only through initial deployment but also through cluster lifecycle operations like upgrades.
  • As a ROSA service administrator, I want to exclude/disable Ingress Operator when I deploy OpenShift with HyperShift, using any of the supported installation methods including the ROSA wizard in OCM and rosa cli, as I want to use my preferred load balancer (i.e. AWS load balancer).  This configuration should persist not only through initial deployment but also through cluster lifecycle operations like upgrades.

Goals

  • Make it possible for customers and Red Hat teams producing OCP distributions/topologies/experiences to enable/disable some CVO components while still keeping their cluster supported.

Scenarios

  1. This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), supported topologies (standard HA, compact cluster, SNO), etc.
  2. Enabled/disabled configuration must persist throughout cluster lifecycle including upgrades.
  3. If there's any risk/impact of data loss or service unavailability (for Day 2 operations), the System must provide guidance on what the risks are and let the user decide whether the risk is worth taking.
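
As a minimal sketch of scenario 2, assume capabilities are expressed as a baseline set plus an explicit list of additional capabilities (as with the `baselineCapabilitySet` and `additionalEnabledCapabilities` fields of the installer configuration). The contents of `BASELINE_SETS` and both function names are illustrative assumptions, not the CVO's actual data or code. The sketch treats enablement as additive, which is one simple way to make the selection persist across upgrades:

```python
# Illustrative baseline sets only; real sets are defined per OpenShift release.
BASELINE_SETS = {
    "None": frozenset(),
    "v4.11": frozenset({"baremetal", "marketplace", "openshift-samples"}),
}


def resolve_enabled(baseline, additional):
    """Return the set of capabilities enabled at install time."""
    if baseline not in BASELINE_SETS:
        raise ValueError(f"unknown baseline capability set: {baseline!r}")
    return set(BASELINE_SETS[baseline]) | set(additional)


def after_upgrade(enabled_before, requested_additional):
    """Carry the enabled set through an upgrade.

    Capabilities may be added, but nothing that was already enabled
    is dropped, so the earlier choice persists.
    """
    return set(enabled_before) | set(requested_additional)


if __name__ == "__main__":
    day1 = resolve_enabled("None", ["marketplace"])
    day2 = after_upgrade(day1, ["openshift-samples"])
    print(sorted(day1), sorted(day2))
```

Starting from the `None` baseline keeps every optional component off unless it is listed explicitly, which matches the opt-in use case in the Feature Overview.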

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:

Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

  • CORS-1873 Installer to allow users to select OpenShift components to be included/excluded
  • OTA-555 Provide a way with CVO to allow disabling and enabling of operators
  • OLM-2415 Make the marketplace operator optional
  • SO-11 Make samples operator optional
  • METAL-162 Make cluster baremetal operator optional
  • OCPPLAN-8286 CI Job for disabled optional capabilities

Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

Phase 3 (OpenShift 4.13): OCPBU-117

  • OTA-554 Make oc aware of cluster capabilities
  • PSAP-741 Make Node Tuning Operator (including PAO controllers) optional

Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)

  • CCO-186 ccoctl support for credentialing optional capabilities
  • MCO-499 MCD should manage certificates via a separate, non-MC path (formerly IR-230 Make node-ca managed by CVO)
  • CNF-5642 Make cluster autoscaler optional
  • CNF-5643 - Make machine-api operator optional
  • WRKLDS-695 - Make DeploymentConfig API + controller optional
  • CNV-16274 OpenShift Virtualization on the Red Hat Application Cloud (not applicable)
  • CNF-9115 - Leverage Composable OpenShift feature to make control-plane-machine-set optional
  • BUILD-565 - Make Build v1 API + controller optional
  • CNF-5647 Leverage Composable OpenShift feature to make image-registry optional (replaces IR-351 - Make Image Registry Operator optional)

Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly OCPBU-519)

  • OCPVE-634 - Leverage Composable OpenShift feature to make olm optional
  • CCO-419 (OCPVE-629) - Leverage Composable OpenShift feature to make cloud-credential optional

Phase 6 (OpenShift 4.16): OCPSTRAT-731

Phase 7 (OpenShift 4.17): OCPSTRAT-1308

  • MON-3152 (OBSDA-242) Optional built-in monitoring
  • IR-400 - Remove node-ca from CIRO
  • CNF-9116 Leverage Composable OpenShift feature to make machine-auto-approver optional
  • CCO-493 Make Cloud Credential Operator optional for remaining providers and topologies (non-SNO topologies)

References

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

 

 

Epic Goal

  • Remove node-ca from Cluster Image Registry Operator - its functionality is provided by the MCO

Why is this important?

  • To avoid potential issues, a single component should handle certificate distribution in OCP clusters

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The registry should continue to work on Hypershift

Dependencies (internal and external)

  1.   HOSTEDCP-1160

Previous Work (Optional):

  1. IR-351

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Once the MCO team is done moving the node-ca functionality to the MCO (MCO-499), we need to remove the node-ca from CIRO.

ACCEPTANCE CRITERIA

  • New clusters provisioned with the registry installed come without the node-ca daemon deployed
  • Existing clusters after upgrade will have the node-ca daemon removed
  • Works on Hypershift

Feature Overview

With this feature, MCE becomes an additional operator that can be enabled at cluster creation, for both the Assisted Installer (AI) SaaS and disconnected installations with the Agent-based installer.

Currently 4 operators have been enabled for the Assisted Service SaaS create cluster flow: Local Storage Operator (LSO), OpenShift Virtualization (CNV), OpenShift Data Foundation (ODF), Logical Volume Manager (LVM)

The Agent-based installer doesn't leverage this framework yet.

Goals

When a user performs the creation of a new OpenShift cluster with the Assisted Installer (SaaS) or with the Agent-based installer (disconnected), provide the option to enable the multicluster engine (MCE) operator.

The cluster deployed can add itself to be managed by MCE.

Background, and strategic fit

Easily deploying an on-prem cluster 0 is a key operation for the rest of the OpenShift infrastructure.

MCE/ACM are strategic in the lifecycle management of OpenShift, including the provisioning of all the clusters. Deploying the first cluster, which hosts MCE/ACM along with other tools that support the rest of the clusters (GitOps, Quay, log centralisation, monitoring...), must therefore be easy and have a high success rate.

The Assisted Installer and the Agent-based installers cover this gap and must present the option to enable MCE to keep making progress in this direction.

Assumptions

MCE engineering is responsible for adding the appropriate definition as an olm-operator-plugin.

See https://github.com/openshift/assisted-service/blob/master/docs/dev/olm-operator-plugins.md for more details

Epic Goal

  • When an Assisted Service SaaS user performs the creation of a new OpenShift cluster, provide the option to enable the multicluster engine (MCE) operator.

Why is this important?

  • Expose users in the Assisted Service SaaS to the value of the MCE
  • Customers/users want to leverage the cluster lifecycle capabilities within MCE inside of their on premises environment.
  • The 'cluster0' can be initiated from Assisted Service SaaS and include MCE hub for cluster deployment within the customer datacenter.

Automated storage configuration

  • The Infrastructure Operator, a dependency of MCE to deploy bare metal, vSphere and Nutanix clusters, requires storage. There are 3 scenarios to automate storage:
  • User selects to install ODF and MCE:
    • ODF is the ideal storage for clusters but requires an additional subscription.
    • When selected along with MCE, it will be configured as the storage required by the Infrastructure Operator, and the Infrastructure Operator will be deployed along with MCE.
  • User deploys an SNO cluster, which supports LVMS as its storage and is available to all OpenShift users.
    • If the user also chooses ODF, then ODF is used for the Infrastructure Operator.
    • If ODF isn't configured, then LVMS is enabled and the Infrastructure Operator will use it.
  • User installs neither ODF nor an SNO cluster:
    • They have to choose their storage and then install the Infrastructure Operator on day 2.
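
The three storage scenarios above can be read as a small decision function. This is a sketch only; the function name and return values are hypothetical and are not taken from the Assisted Service code:

```python
from typing import Optional


def infrastructure_operator_storage(odf_selected: bool, is_sno: bool) -> Optional[str]:
    """Pick the storage that backs the Infrastructure Operator.

    Returns "ODF" or "LVMS", or None when the user has to choose a
    storage solution and install the Infrastructure Operator on day 2.
    """
    if odf_selected:
        # ODF is used whenever it is selected, including on SNO.
        return "ODF"
    if is_sno:
        # SNO without ODF falls back to LVMS.
        return "LVMS"
    # Neither ODF nor SNO: nothing is configured automatically.
    return None


if __name__ == "__main__":
    for odf in (True, False):
        for sno in (True, False):
            print(odf, sno, infrastructure_operator_storage(odf, sno))
```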

Scenarios

  1. When a RH cloud user logs into console.redhat SaaS, they can leverage the Assisted Service SaaS flow to create a new cluster
  2. During the Assisted Service SaaS create flow, a RH cloud user can see a list of available operators that they can install at the same time as the cluster is created.
  3. An option is offered to check a box next to "multicluster engine for Kubernetes (MCE)".
  4. The RH cloud user can read a tool-tip or info-box with short description of the MCE and click a link for more details to review MCE documentation

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ensure MCE release channel can automatically deploy the latest x.y.z without needing any DevOps/SRE intervention
  • Ensure MCE release channel can be updated quickly (if not automatically) to ensure the later release x.y can be offered to the cloud user.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. for example, CNV operator: https://github.com/openshift/assisted-service/blob/master/internal/operators/cnv/manifest.go#L165

Open questions:

  1. Is there any automation that will pick up the next stable-x.y MCE, or do we need to do it manually with each release? For example, when MCE 2.2 comes out, do we need to update the SaaS plugin code, or does it automatically move to the next?  Note for example how the OLM subscription looks - and stable-2.2 will appear once MCE 2.2 comes out.
  2. How challenging is this to maintain as new OCP releases come out and QE must be performed? 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This feature follows up on OCPBU-186 (Image mirroring by tags).

OCPBU-186 implemented the new ImageDigestMirrorSet and ImageTagMirrorSet APIs and their rollout through the MCO.

This feature will update the components using ImageContentSourcePolicy to use ImageDigestMirrorSet.

The list of the components: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing.

 

Migrate OpenShift Components to use the new Image Digest Mirror Set (IDMS)

This doc lists the OpenShift components that currently use ICSP: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing

Plan for ImageDigestMirrorSet Rollout
Epic: https://issues.redhat.com/browse/OCPNODE-521

4.13: Enable ImageDigestMirrorSet, both ICSP and ImageDigestMirrorSet objects are functional

  • Document that ICSP is being deprecated and will be unsupported by 4.17 (to allow for EUS to EUS upgrades)
  • Reject writes to both ICSP and ImageDigestMirrorSet on the same cluster

4.14: Update OpenShift components to use IDMS

4.17: Remove support for ICSP within MCO

  • Error out if an old ICSP object is used

As an OpenShift developer, I want an --idms-file flag so that I can fetch image info from an alternative mirror if --icsp-file gets deprecated.

As an <OpenShift developer> trying to <mirror images for a disconnected environment using the oc command>, I want <the output to give an example ImageDigestMirrorSet manifest> because ImageContentSourcePolicy will be replaced by the CRD implemented in OCPBU-186 (Image mirroring by tags).

The ImageContentSourcePolicy manifest snippet in the command output will be updated to an ImageDigestMirrorSet manifest.

Workloads that use the `oc adm release mirror` command will be impacted.
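
To illustrate what the change in output amounts to, here is a minimal sketch that rewrites an ImageContentSourcePolicy manifest (held as a Python dict) into an ImageDigestMirrorSet manifest. It assumes the two resources differ mainly in `apiVersion`, `kind`, and the name of the mirror list field (`repositoryDigestMirrors` versus `imageDigestMirrors`). The function name and the mirror host `mirror.example.com` are hypothetical, and this is not the code used by `oc`:

```python
import copy


def icsp_to_idms(icsp: dict) -> dict:
    """Build an ImageDigestMirrorSet manifest from an ICSP manifest."""
    if icsp.get("kind") != "ImageContentSourcePolicy":
        raise ValueError("expected an ImageContentSourcePolicy manifest")
    mirrors = icsp.get("spec", {}).get("repositoryDigestMirrors", [])
    return {
        "apiVersion": "config.openshift.io/v1",
        "kind": "ImageDigestMirrorSet",
        "metadata": {"name": icsp["metadata"]["name"]},
        # Each entry keeps its source and its list of mirrors.
        "spec": {"imageDigestMirrors": copy.deepcopy(mirrors)},
    }


# Example input; the mirror registry below is a made-up host.
example_icsp = {
    "apiVersion": "operator.openshift.io/v1alpha1",
    "kind": "ImageContentSourcePolicy",
    "metadata": {"name": "example"},
    "spec": {
        "repositoryDigestMirrors": [
            {
                "source": "quay.io/openshift-release-dev/ocp-release",
                "mirrors": ["mirror.example.com/ocp/release"],
            }
        ]
    },
}

if __name__ == "__main__":
    print(icsp_to_idms(example_icsp))
```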

 

 

Feature Overview (aka. Goal Summary)

This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and saving engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.

Goals (aka. Expected User Outcomes)

  • Unified Codebase: Achieve a consistent and unified codebase across different HCP components, reducing redundancy and making the code easier to understand and maintain.
  • Enhanced Developer Experience: Streamline the developer workflow by reducing boilerplate code, standardizing interfaces, and improving documentation, leading to faster and safer development cycles.
  • Improved Maintainability: Refactor large, complex components into smaller, modular, and more manageable pieces, making the codebase more maintainable and easier to evolve over time.
  • Increased Reliability: Enhance the reliability of the platform by increasing test coverage, enforcing immutability where necessary, and ensuring that all components adhere to best practices for code quality.
  • Simplified Networking and Upgrade Mechanisms: Standardize and simplify the handling of networking flows and NodePool upgrade triggers, providing a clear, consistent, and maintainable approach to these critical operations.

Requirements (aka. Acceptance Criteria)

  • Standardized CLI Implementation: Ensure that the CLI is consistent across all supported platforms, with increased unit test coverage and refactored dependencies.
  • Unified NodePool Upgrade Logic: Implement a common abstraction for NodePool upgrade triggers, consolidating scattered inputs and ensuring a clear, consistent upgrade process.
  • Refactored Controllers: Break down large, monolithic controllers into modular, reusable components, improving maintainability and readability.
  • Improved Networking Documentation and Flows: Update networking documentation to reflect the current state, and refactor network proxies for simplicity and reusability.
  • Centralized Logic for Token and Userdata Generation: Abstract the logic for token and userdata generation into a single, reusable library, improving code clarity and reducing duplication.
  • Enforced Immutability for Critical API Fields: Ensure that immutable fields within key APIs are enforced through proper validation mechanisms, maintaining API coherence and predictability.
  • Documented and Clarified Service Publish Strategies: Provide clear documentation on supported service publish strategies, and lock down the API to prevent unsupported configurations.

Use Cases (Optional)

  • Developer Onboarding: New developers can quickly understand and contribute to the HCP project due to the reduced complexity and improved documentation.
  • Consistent Operations: Operators and administrators experience a more predictable and consistent platform, with reduced bugs and operational overhead due to the standardized and refactored components.

Out of Scope

  • Introduction of new features or functionalities unrelated to the refactor and standardization efforts.
  • Major changes to user-facing commands or APIs beyond what is necessary for standardization.

Background

Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.

Customer Considerations

  • Minimal Disruption: Ensure that existing users experience minimal disruption during the refactor, with clear communication about any changes that might impact their workflows.
  • Enhanced Stability: Customers should benefit from a more stable and reliable platform as a result of the increased test coverage and standardization efforts.

Documentation Considerations

Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.

This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.

Goal

Focus on the general modernization of the codebase, addressing technical debt, and ensuring that the platform is easy to maintain and extend.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal

Improve the consistency and reliability of APIs by enforcing immutability and clarifying service publish strategy support.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Due to low customer interest in using OpenShift on Alibaba Cloud, we have decided to deprecate and then remove IPI support for Alibaba Cloud.

https://docs.google.com/document/d/1Kp-GrdSHqsymzezLCm0bKrCI71alup00S48QeWFa0q8/edit#heading=h.v75efohim75y 

Goals (aka. expected user outcomes)

4.14

Announcement 

  1. Update cloud.redhat.com with deprecation information 
  2. Update IPI installer code with warning
  3. Update release notes with deprecation information
  4. Update OpenShift docs with deprecation information

4.15

Archive code 

 

Add a deprecation warning in the installer code for anyone trying to install on Alibaba via IPI

USER STORY:

As a user of the installer binary, I want to be warned that Alibaba support will be deprecated in 4.15, so that I'm prevented from creating clusters that will soon be unsupported.

DESCRIPTION:

Alibaba support will be decommissioned from both IPI and UPI starting in 4.15. We want to warn users of the 4.14 installer binary who pick 'alibabacloud' in the list of providers.

ACCEPTANCE CRITERIA:

A warning message is displayed after choosing 'alibabacloud'.

ENGINEERING DETAILS:

https://docs.google.com/document/d/1Kp-GrdSHqsymzezLCm0bKrCI71alup00S48QeWFa0q8/edit?usp=sharing_eip_m&ts=647df877

 

Feature Overview (aka. Goal Summary)  

The storage operators need to be automatically restarted after the certificates are renewed.

From the OCP docs: "The service CA certificate, which issues the service certificates, is valid for 26 months and is automatically rotated when there is less than 13 months validity left."

Since OCP now offers an 18-month lifecycle per release, the storage operator pods need to be automatically restarted after the certificates are renewed.
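
The timing argument can be made explicit with a little arithmetic. The sketch below uses only the two figures from the quoted doc and makes one simplifying assumption that is not in the source: the service CA is created at the start of the release lifecycle. The function names are hypothetical.

```python
SERVICE_CA_VALIDITY_MONTHS = 26   # from the quoted OCP doc
ROTATE_WHEN_REMAINING_BELOW = 13  # from the quoted OCP doc


def rotation_age_months() -> int:
    """Age of the service CA, in months, at which rotation becomes due."""
    return SERVICE_CA_VALIDITY_MONTHS - ROTATE_WHEN_REMAINING_BELOW


def rotation_within_lifecycle(lifecycle_months: int) -> bool:
    """True if a CA created at the start of the lifecycle is rotated
    before that lifecycle ends (simplifying assumption, see above)."""
    return rotation_age_months() < lifecycle_months


if __name__ == "__main__":
    for months in (12, 18):
        print(months, rotation_within_lifecycle(months))
```

Under that assumption, rotation falls outside a 12-month lifecycle but inside an 18-month one, which is why the restart now has to happen without administrator action.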

Goals (aka. expected user outcomes)

The storage operators will be restarted transparently. The benefit to the customer is that the storage operators no longer need to be restarted manually.

 

Requirements (aka. Acceptance Criteria):

The administrator should not need to restart the storage operator when certificates are renewed.

This should apply to all relevant operators with a consistent experience.

 

Use Cases (Optional):

As an administrator I want the storage operators to be automatically restarted when certificates are renewed.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

This feature request is triggered by the new extended OCP lifecycle. We are moving from 12 to 18 months of support per release.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

No doc is required

 

Interoperability Considerations

This feature only covers storage, but the same behavior should be applied to every relevant component.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The pod `csi-snapshot-webhook` mounts the secret:
```

$ cat assets/webhook/deployment.yaml
kind: Deployment
metadata:
  name: csi-snapshot-webhook
  ...
spec:
  template:
    spec:
      containers:

        volumeMounts:
          - name: certs
            mountPath: /etc/snapshot-validation-webhook/certs

      volumes:
      - name: certs
        secret:
          secretName: csi-snapshot-webhook-secret

```
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.
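
One common way to make a secret change trigger a restart is to record a checksum of the secret in a pod template annotation, so that a changed checksum changes the template and causes a rollout. The sketch below only illustrates that idea; the annotation key and function names are hypothetical, and the source does not say that the storage operators use this exact mechanism.

```python
import hashlib
import json
from typing import Dict

# Hypothetical annotation key, used only for this illustration.
CHECKSUM_ANNOTATION = "example.com/secret-checksum"


def secret_checksum(secret_data: Dict[str, str]) -> str:
    """Return a stable SHA-256 digest of the secret's key/value pairs."""
    # Sorting the keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(secret_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_restart(template_annotations: Dict[str, str], secret_data: Dict[str, str]) -> bool:
    """True when the recorded checksum no longer matches the current secret."""
    return template_annotations.get(CHECKSUM_ANNOTATION) != secret_checksum(secret_data)


if __name__ == "__main__":
    old = {"tls.crt": "old-cert", "tls.key": "old-key"}
    new = {"tls.crt": "new-cert", "tls.key": "new-key"}
    annotations = {CHECKSUM_ANNOTATION: secret_checksum(old)}
    print(needs_restart(annotations, old), needs_restart(annotations, new))
```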

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

1. The pod `vmware-vsphere-csi-driver-controller` mounts the secret:

$ oc get po -n openshift-cluster-csi-drivers vmware-vsphere-csi-driver-controller-8467ddf4c-5lgd8 -o yaml
...
  containers:
    name: driver-kube-rbac-proxy
    name: provisioner-kube-rbac-proxy
    name: attacher-kube-rbac-proxy
    name: resizer-kube-rbac-proxy
    name: snapshotter-kube-rbac-proxy
    name: syncer-kube-rbac-proxy

    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert

  volumes:
  - name: metrics-serving-cert
    secret:
      defaultMode: 420
      secretName: vmware-vsphere-csi-driver-controller-metrics-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

2. Similarly, the pod `vmware-vsphere-csi-driver-webhook` mounts another secret:

$ oc get po -n openshift-cluster-csi-drivers vmware-vsphere-csi-driver-webhook-c557dbf54-crrxp -o yaml
...
  containers:
    name: vsphere-webhook

    volumeMounts:
    - mountPath: /etc/webhook/certs
      name: certs

  volumes:
  - name: certs
    secret:
      defaultMode: 420
      secretName: vmware-vsphere-csi-driver-webhook-secret

Again, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

The pod `shared-resource-csi-driver-node` mounts the secret:

$ cat assets/node.yaml
...
      containers:
        - name: hostpath

          volumeMounts:
            - mountPath: /etc/secrets
              name: shared-resource-csi-driver-node-metrics-serving-cert

      volumes:
        - name: shared-resource-csi-driver-node-metrics-serving-cert
          secret:
            defaultMode: 420
            secretName: shared-resource-csi-driver-node-metrics-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

The pod `gcp-pd-csi-driver-controller` mounts the secret:

$ oc get po -n openshift-cluster-csi-drivers gcp-pd-csi-driver-controller-5787b9c477-q78qx -o yaml
...
    name: provisioner-kube-rbac-proxy
    ...

    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert

  volumes:
  - name: metrics-serving-cert
    secret:
      secretName: gcp-pd-csi-driver-controller-metrics-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

The pod `openstack-manila-csi-controllerplugin` mounts the secret:

$ cat assets/controller.yaml
...
      containers:
        - name: provisioner-kube-rbac-proxy

          volumeMounts:
          - mountPath: /etc/tls/private
            name: metrics-serving-cert

      volumes:
        - name: metrics-serving-cert
          secret:
            secretName: manila-csi-driver-controller-metrics-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

The pod `openstack-cinder-csi-driver-controller` mounts the secret:

$ oc get po/openstack-cinder-csi-driver-controller-689b897df8-cx5hl -oyaml|yq .spec.volumes
- emptyDir: {}
  name: socket-dir
- name: secret-cinderplugin
  secret:
    defaultMode: 420
    items:
      - key: clouds.yaml
        path: clouds.yaml
    secretName: openstack-cloud-credentials
- configMap:
    defaultMode: 420
    items:
      - key: cloud.conf
        path: cloud.conf
    name: cloud-conf
  name: config-cinderplugin
- configMap:
    defaultMode: 420
    items:
      - key: ca-bundle.pem
        path: ca-bundle.pem
    name: cloud-provider-config
    optional: true
  name: cacert
- name: metrics-serving-cert
  secret:
    defaultMode: 420
    secretName: openstack-cinder-csi-driver-controller-metrics-serving-cert
- configMap:
    defaultMode: 420
    items:
      - key: ca-bundle.crt
        path: tls-ca-bundle.pem
    name: openstack-cinder-csi-driver-trusted-ca-bundle
  name: non-standard-root-system-trust-ca-bundle
- name: kube-api-access-hz62v
  projected:
    defaultMode: 420
    sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
            - key: ca.crt
              path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
      - configMap:
          items:
            - key: service-ca.crt
              path: service-ca.crt
          name: openshift-service-ca.crt

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

The pod `shared-resource-csi-driver-webhook` mounts the secret:

$ cat assets/webhook/deployment.yaml
kind: Deployment
metadata:
  name: shared-resource-csi-driver-webhook
  ...
spec:
  template:
    spec:
      containers:

        volumeMounts:
        - mountPath: /etc/secrets/shared-resource-csi-driver-webhook-serving-cert/
          name: shared-resource-csi-driver-webhook-serving-cert

      volumes:
      - name: shared-resource-csi-driver-webhook-serving-cert
        secret:
          secretName: shared-resource-csi-driver-webhook-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

The pod `alibaba-disk-csi-driver-controller` mounts the secret:

$ cat assets/controller.yaml
...
      containers:
        - name: provisioner-kube-rbac-proxy

          volumeMounts:
          - mountPath: /etc/tls/private
            name: metrics-serving-cert

      volumes:
        - name: metrics-serving-cert
          secret:
            secretName: alibaba-disk-csi-driver-controller-metrics-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

The pod `aws-ebs-csi-driver-controller` mounts the secret:

$ oc get po -n openshift-cluster-csi-drivers aws-ebs-csi-driver-controller-559f74d7cd-5tk4p -o yaml
...
    name: driver-kube-rbac-proxy
    name: provisioner-kube-rbac-proxy
    name: attacher-kube-rbac-proxy
    name: resizer-kube-rbac-proxy
    name: snapshotter-kube-rbac-proxy

    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert

  volumes:
  - name: metrics-serving-cert
    secret:
      defaultMode: 420
      secretName: aws-ebs-csi-driver-controller-metrics-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

The pod `ibm-powervs-block-csi-driver-controller` mounts the secret:

$ cat assets/controller.yaml
...
      containers:
        - name: provisioner-kube-rbac-proxy

          volumeMounts:
          - mountPath: /etc/tls/private
            name: metrics-serving-cert

      volumes:
        - name: metrics-serving-cert
          secret:
            secretName: ibm-powervs-block-csi-driver-controller-metrics-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

The pod `azure-file-csi-driver-controller` mounts the secret:

$ oc get po -n openshift-cluster-csi-drivers azure-file-csi-driver-controller-cf84d5cf5-pzbjn -o yaml
...
  containers:
    name: driver-kube-rbac-proxy

    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert

  volumes:
  - name: metrics-serving-cert
    secret:
      defaultMode: 420
      secretName: azure-file-csi-driver-controller-metrics-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.

Goals

Track goals/requirements for self-managed GA of Hosted control planes on AWS using the AWS Provider.

  • AWS flow via the AWS provider is documented. 
    • Make sure the documentation with HyperShiftDeployment is removed.
    • Make sure the documentation uses the new flow without HyperShiftDeployment 
  • HyperShift has a UI wizard with ACM/MCE for AWS. 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Overview

Today, the upstream and more complete HyperShift documentation lives at https://hypershift-docs.netlify.app/.

However, the product documentation currently lives under https://access.redhat.com/login?redirectTo=https%3A%2F%2Faccess.redhat.com%2Fdocumentation%2Fen-us%2Fred_hat_advanced_cluster_management_for_kubernetes%2F2.6%2Fhtml%2Fmulticluster_engine%2Fmulticluster_engine_overview%23hosted-control-planes-intro

Goal

The goal of this Epic is to extract important docs and establish parity between what's documented and possible upstream and product documentation.

 

Multiple consumers have not realised a newer version of a CPO (spec.release) is not guaranteed to work with an older HO.

This is stated here https://hypershift-docs.netlify.app/reference/versioning-support/

but empirical evidence, such as the OCM integration, tells us this is not enough.

We already deploy a CM in the HO namespace with the HC supported versions.

Additionally, we can add an image label with the latest HC version supported by the operator, so you can quickly check it with docker inspect...
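
The compatibility check that consumers are expected to perform can be sketched as follows. The source does not describe the format of the ConfigMap or the image label, so the list of supported `major.minor` versions and both function names below are assumptions made for illustration.

```python
from typing import Iterable


def minor_of(release: str) -> str:
    """Reduce a release string such as '4.14.0-rc.5' to its 'major.minor' part."""
    major, minor, *_rest = release.split(".")
    return f"{major}.{minor}"


def is_supported(release: str, supported_minors: Iterable[str]) -> bool:
    """True if the requested release falls in a minor version the operator supports.

    `supported_minors` stands in for whatever the operator publishes
    (hypothetical shape: a list of 'major.minor' strings).
    """
    return minor_of(release) in set(supported_minors)


if __name__ == "__main__":
    supported = ["4.13", "4.14"]  # hypothetical published values
    for requested in ("4.14.0-rc.5", "4.15.0"):
        print(requested, is_supported(requested, supported))
```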

Feature Overview

Console enhancements based on customer RFEs that improve customer user experience.

 

Goals

  • This Section: Provide high-level goal statement, providing user context and expected user outcome(s) for this feature
  • ...

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer...

  • ...

 

Out of Scope

  • ...

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Based on https://issues.redhat.com/browse/RFE-3775 we should be extending our proxy package timeout to match the browser's timeout, which is 5 minutes.

AC: Bump the 30-second timeout in the proxy pkg to 5 minutes

For the console, we would like to have a way for customers to send direct feedback about features like multi cluster.

Acceptance criteria:

  • integrate the pf feedback extension into the console
  • add the ui for rendering the form / launching the feedback url
  • add e2e testing for the customer interaction with the feedback mechanism
  • show/hide the launch mechanism where appropriate (need more info here on this topic)

Testing instructions:

  • Right click the help button in the toolbar.
  • From the help button drop-down, right click on Share Feedback. This entry was previously Report a bug and has now been replaced with Share Feedback.
  • The Share Feedback Modal should appear.
  • Click on the share feedback link, a new tab will appear where you can share feedback.
  • Click on the open a support case link, a new tab will appear where you can report a bug.
  • Click on the inform the direction of Red Hat link, a new tab will appear where you can enter your information to join the Red Hat mailing list.
  • Click cancel and the modal should close.

According to security, it is important to disable the publicly available content of the OpenShift Web Console, which is served through: `/opt/bridge/bin/bridge --public-dir=/opt/bridge/static --config=/var/console-config` in the console pod (openshift-console namespace).

The folder /opt/bridge/static and its files are publicly available without authentication. 
The purpose of this RFE is to disable the static assets:
https://console-openshift-console.apps.example.com/static/assets/
https://console-openshift-console.apps.example.com/static/

  1. Why does the customer need this? (List the business requirements here)
    The security department of the customer recommended disabling the static assets because they are available without authentication,
    even though they contain only images in PNG or SVG format.

 

Follow on to CONSOLE-2976

See https://github.com/openshift/console/blob/637a94a1e2e3e842cc5757ad2bbcf49fb1b4d2e1/frontend/public/components/cluster-settings/cluster-settings.tsx#L668-L670

Based on the API changes for MCP, we need to check for the item with the `kube-apiserver-to-kubelet-signer` value for the `subject` key in the `status.certExpirys` array. For that item we will render the `expiry` value, which is in UTC format, as a timestamp.

 

AC:

  • Add a timestamp to the existing Update paused notification, using the appropriate `expiry` field.
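
The lookup described above can be sketched as follows. The console itself is not written in Python; this only illustrates the selection logic. The function name is hypothetical, and the timestamp layout (`YYYY-MM-DDTHH:MM:SSZ`) is an assumption about how the UTC `expiry` value is serialized.

```python
from datetime import datetime, timezone
from typing import Optional

SIGNER_SUBJECT = "kube-apiserver-to-kubelet-signer"


def signer_expiry(mcp: dict) -> Optional[datetime]:
    """Return the expiry of the kubelet signer certificate, or None if absent."""
    cert_expirys = mcp.get("status", {}).get("certExpirys") or []
    for item in cert_expirys:
        if item.get("subject") == SIGNER_SUBJECT:
            raw = item.get("expiry")
            if not raw:
                return None
            # Assumed layout of the UTC value, e.g. "2033-05-01T12:00:00Z".
            parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    return None


if __name__ == "__main__":
    example = {
        "status": {
            "certExpirys": [
                {"subject": "some-other-signer", "expiry": "2030-01-01T00:00:00Z"},
                {"subject": SIGNER_SUBJECT, "expiry": "2033-05-01T12:00:00Z"},
            ]
        }
    }
    print(signer_expiry(example))
```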

We currently implement fuzzy search in the console (project search, search resources page / list view pages). While we don't want to change the current search behavior, we would like to add some exact search capability for users that have similarly named resources where fuzzy search doesn't help narrow down the list of resources in a list view/search page.

RFE: https://issues.redhat.com/browse/RFE-3013
Customer bug: https://issues.redhat.com/browse/OCPBUGS-2603

Acceptance criteria:
all search pages in console implement

  • exact match search option for list pages that is set in the user preference page
  • work with UX team on the hints for the search options

Design
Explore help text for search inputs - this should be shown at all times and not hidden in popover
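
The difference between the two search modes can be illustrated with a small sketch. It assumes that fuzzy matching means a case-insensitive subsequence match and that exact matching means a case-insensitive contiguous substring; the console's actual algorithms may differ, and the function names are hypothetical.

```python
from typing import List


def fuzzy_match(query: str, name: str) -> bool:
    """True if the characters of `query` appear in `name` in order, gaps allowed."""
    remaining = iter(name.lower())
    # `ch in remaining` advances the iterator, so order is preserved.
    return all(ch in remaining for ch in query.lower())


def exact_match(query: str, name: str) -> bool:
    """True if `query` appears in `name` as one contiguous run of characters."""
    return query.lower() in name.lower()


def filter_names(query: str, names: List[str], exact: bool) -> List[str]:
    """Filter a list of resource names using the selected search mode."""
    matcher = exact_match if exact else fuzzy_match
    return [name for name in names if matcher(query, name)]


if __name__ == "__main__":
    resources = ["my-app", "my-app-test", "many-apples-pie"]
    print(filter_names("my-app", resources, exact=False))
    print(filter_names("my-app", resources, exact=True))
```

With similarly named resources, the fuzzy mode keeps every name that contains the query characters in order, while the exact mode narrows the list to names that contain the query verbatim.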

Feature Overview (aka. Goal Summary)  

Extend the Installer's current capabilities when deploying OCP on a GCP shared VPC (XPN) by adding support for BYO hosted zones and removing the SA requirements in the bootstrap process.

Goals (aka. expected user outcomes)

When deploying OpenShift to a shared VPC (XPN) in GCP, the user can bring their own DNS zone in which to create the required records for the API server and Ingress, and no additional SA is required to bootstrap the cluster.

Requirements (aka. Acceptance Criteria):

The user can provide an existing DNS zone when deploying OpenShift to a shared VPC (XPN) in GCP that will be used to host the required DNS records for the API server and Ingress. At the same time, today's SA requirements will be removed.

Background

While adding support for shared VPC (XPN) deployments in GCP, the BYO hosted zone capability was removed (CORS-2474) due to multiple issues found during the QE validation phase for the feature. At that time there was no evidence from customers/users that this was required for the shared VPC use case, and the capability was removed in order to declare the feature GA.

We now have evidence that this specific use case is required by users.

Documentation Considerations

Documentation about using this capability while deploying OpenShift to a shared VPC will be required.

Epic Goal

  • Remove the requirement for a separate Service Account and minimize permissions required during the Bootstrap process in GCP.

 

Background

The GCP bootstrap process creates a service account with the role roles/storage.admin. The role is required so that the service account can create a bucket to hold the bootstrap ignition file contents. As a security request from a customer, the service account created during this process can be removed. This means that not only will the service account, private key, and role not be created, but the bucket containing the bootstrap ignition file contents will also not be created in Terraform.

Why is this important?

  • Reduce number of permissions required to complete bootstrapping process.
  • Reduce unnecessary resources 

 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • No additional service accounts should be created to complete an installation

 

Open questions:

  1.  

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Ensure that unnecessary roles/permissions are not assigned during install

Acceptance Criteria:

Description of criteria:

  • The service-account-user permission/role is not assigned during the gcp/cluster/masters terraform stage.

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

This is a followup to https://issues.redhat.com/browse/OPNET-13. In that epic we implemented limited support for dual stack on vSphere, but due to limitations in upstream Kubernetes we were not able to support all of the use cases we do on baremetal. This epic tracks our upstream and downstream work to finish the dual stack implementation.

Goal

  • Allow users to set different Root volume types for each Control plane machine as a day-2 operation through CPMS
  • Allow users to set different Root volume types for each Control plane machine as install-time configuration through install-config

Why is this important?

  • In some OpenStack clouds, volume types are used to target separate OpenStack failure domains. With this feature, users can spread each Control plane root volume on separate OpenStack failure domains using the ControlPlaneMachineSet

Acceptance Criteria

  • Once the CPMS is updated with different root volume types in the Failure domains, CCPMSO spins up new master machines with their root volumes spread across those types.

Dependencies (internal and external)

  1. OpenShift-on-OpenStack integration with CPMS (OSASINFRA-3100)

Previous Work (Optional):

  1. 4.13 FailureDomains tech preview (OSASINFRA-2998)

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

FID: https://docs.google.com/document/d/1OEB7Vml1-TmpWZWbvHhf3lnrtEU5JZt2Sptcnu3Kv2I/edit#heading=h.fu58ua5viwam

  • add the JSON array controlPlane.platform.openstack.rootVolume.types (notice the "s") in install-config (this is an API addition)
  • add validation to prevent rootVolume.type and rootVolume.types from both being set
  • add validation to ensure that all variable fields (compute availability zones, storage availability zones, root volume types) that have more than one value have equal length
  • change Machine generation to vary rootVolume.volumeType according to the machine-pool rootVolume.types
  • instrument the Terraform code to apply variable volume types
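
The two validation rules in the list above can be sketched as follows. The installer is not written in Python; this is only an illustration of the checks, and the function names, error messages, and dictionary shapes are hypothetical.

```python
from typing import Dict, List


def validate_root_volume(root_volume: Dict[str, object]) -> List[str]:
    """Reject a rootVolume that sets both `type` and `types`."""
    if root_volume.get("type") and root_volume.get("types"):
        return ["rootVolume.type and rootVolume.types must not both be set"]
    return []


def validate_equal_lengths(variable_fields: Dict[str, List[str]]) -> List[str]:
    """Fields that carry more than one value must all have the same length."""
    lengths = {
        name: len(values)
        for name, values in variable_fields.items()
        if len(values) > 1
    }
    if len(set(lengths.values())) > 1:
        return [f"variable fields must have equal length, got {lengths}"]
    return []


if __name__ == "__main__":
    print(validate_root_volume({"type": "fast", "types": ["fast", "slow"]}))
    print(validate_equal_lengths({
        "compute availability zones": ["az1", "az2", "az3"],
        "storage availability zones": ["az1"],
        "root volume types": ["fast", "slow"],
    }))
```

In this sketch, a field with a single value is treated as applying to every machine, so only fields with several values are compared against each other.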

Feature Overview (aka. Goal Summary)  

The Assisted Installer is used to help streamline and improve the install experience of OpenShift UPI. Given the install footprint of OpenShift on IBM Power and IBM zSystems, we would like to bring the Assisted Installer experience to those platforms and ease the installation experience.

 

Goals (aka. expected user outcomes)

Full support of the Assisted Installer for use by IBM Power and IBM zSystems

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

As a multi-arch development engineer, I would like to evaluate if the assisted installer is a good fit for simplifying UPI deployments on Power and Z.

Acceptance Criteria

  • Evaluation report of market opportunity/impact by P&Z offering managers
  • Stories filed for delivering Assisted Installer.

Description of the problem:

Power and Z features are not displayed in the feature usage dashboard in Elastic because there is a problem in the code.
See https://kibana-assisted.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/_dashboards/app/dashboards#/view/f75f85d0-989e-11ec-ab6b-650fa8ed1edf?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-2w,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:'internal%20users',disabled:!t,index:bd9dadc0-7bfa-11eb-95b8-d13a1970ae4d,key:cluster.email_domain,negate:!f,params:!(redhat.com,ibm.com),type:phrases,value:'redhat.com,%20ibm.com'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(cluster.email_domain:redhat.com)),(match_phrase:(cluster.email_domain:ibm.com))))))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),panels:!((embeddableConfig:(),gridData:(h:11,i:c9bf6a4b-3c3a-4ad4-83ea-20b3127dc4a0,w:16,x:0,y:0),id:'44328ca6-de41-4b1e-befd-683bb51cf30f',panelIndex:c9bf6a4b-3c3a-4ad4-83ea-20b3127dc4a0,type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:11,i:'759bd387-9f4b-45cb-9c9a-b3c412b420ec',w:16,x:16,y:0),id:ffbb52b5-dbd9-47c3-8098-75513cddca8e,panelIndex:'759bd387-9f4b-45cb-9c9a-b3c412b420ec',type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:15,i:eb038dca-baf4-42d4-8e2c-298e2bbd06f6,w:16,x:32,y:0),id:'49b26e77-a2f3-42f3-8f57-9543669de8b8',panelIndex:eb038dca-baf4-42d4-8e2c-298e2bbd06f6,type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:4,i:'05c0d27a-f949-42d6-a3ef-15411411fac7',w:16,x:0,y:11),id:'088f04c9-ce46-46d0-a381-1ea822d95440',panelIndex:'05c0d27a-f949-42d6-a3ef-15411411fac7',type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:4,i:'0d696304-e4a4-4c24-beb4-f04d4af4c8d6',w:16,x:16,y:11),id:'9552a99a-4355-4e14-ad9f-90cd534f70a8',panelIndex:'0d696304-e4a4-4c24-beb4-f04d4af4c8d6',type:visualization,version:'1.3.2'),(embeddableConfig:(vis:!n),gridData:(h:10,i:ee55c626-4b30-4ac3-ad23-9b7efbf1fb04,w:16,x:0,y:15),id:fac35afd-9a6f-4bdc-868a-906a5f1e1894,panelIndex:ee55c626-4b30-4ac3-ad23-9b7efbf1fb04,type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:10,i:c4c65cab-2675-4575-aa54-d4bb2871804e,w:32,x:16,y:15),id:'3747662f-7c12-4299-b2ac-1038e62ad2f3',panelIndex:c4c65cab-2675-4575-aa54-d4bb2871804e,type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:13,i:'1ca7879c-a458-4ae3-8dc2-4dd2da59cf32',w:15,x:0,y:25),id:'96e57324-141d-46ea-8096-9e8b1a18ef62',panelIndex:'1ca7879c-a458-4ae3-8dc2-4dd2da59cf32',type:visualization,version:'1.3.2')),query:(language:kuery,query:''),timeRestore:!f,title:'%5BAI%5D%20feature_usage_dashboard',viewMode:edit)

How reproducible:

100%

Steps to reproduce:

1. Install 2 clusters with Power and Z CPU architectures and check the feature usage dashboard in Elastic.

Actual results:

Power and Z features are not displayed in the feature usage dashboard in Elastic.

Expected results:

The Power and Z features are shown in the feature usage dashboard in Elastic.

 
After more testing on staging for Power, I found that cluster-managed networking does not work for Power. It uses platform.baremetal to define the API VIP and Ingress VIP, and most installations failed at the last step, Finalizing. After more digging, I found that the machine-api operator cannot start successfully and stays in the "Operator is initializing" state. Here is the list of pods with errors:

openshift-kube-controller-manager installer-5-master-1 0/1 Error 0 25m
openshift-kube-controller-manager installer-6-master-2 0/1 Error 0 17m
openshift-machine-api ironic-proxy-kgm9g 0/1 CreateContainerError 0 32m
openshift-machine-api ironic-proxy-nc2lz 0/1 CreateContainerError 0 8m37s
openshift-machine-api ironic-proxy-pp92t 0/1 CreateContainerError 0 32m
openshift-machine-api metal3-69b945c7ff-45hqn 1/5 CreateContainerError 0 33m
openshift-machine-api metal3-image-customization-7f6c8978cf-lxbj7 0/1 CreateContainerError 0 32m

The messages from the failed pod ironic-proxy-nc2lz:

Normal Pulled 11m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f84fd895186b28af912eea42aba1276dec98c814a79310c833202960cf05407" in 1.29310959s (1.293135461s including waiting)
Warning Failed 11m kubelet Error: container create failed: time="2023-04-06T15:16:19Z" level=error msg="runc create failed: unable to start container process: exec: \"/bin/runironic-proxy\": stat /bin/runironic-proxy: no such file or directory"

Similar errors occur for the other failed pods.
Interestingly, some installations completed successfully in AI, but these pods were still in an error state.
So I asked the AI team to turn off cluster-managed network support for Power.

Feature Overview (aka. Goal Summary)  

Rebase openshift-etcd to latest upstream stable version 3.5.9

Goals (aka. expected user outcomes)

OpenShift openshift-etcd should benefit from the latest enhancements in version 3.5.9

 

https://github.com/etcd-io/etcd/issues/13538

We're currently on etcd 3.5.6; since then there has been at least one newer release. This epic description tracks the changes that we need to pay attention to:

 

Golang 1.17 update

In 3.5.7, etcd was moved to Go 1.17 to address some vulnerabilities:

https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#go

We need to update our definitions in the release repo to match this and test what impact it has.

EDIT: now moving onto 1.19 directly: https://github.com/etcd-io/etcd/pull/15337

 

WAL fix carry

3.5.6 had a nasty WAL bug that was hit by some customers, fixed with https://github.com/etcd-io/etcd/pull/15069

Due to the Golang upgrade we carried that patch through OCPBUGS-5458

When we upgrade we need to ensure the commits are properly handled and ordered with this carry.

 

IPv6 Formatting

There were some comparison issues with the same IPv6 address written in different formats. This was fixed in https://github.com/etcd-io/etcd/pull/15187 and we need to test what impact this has on our IPv6-based SKUs.
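
The class of problem is easy to reproduce: two textual forms of one IPv6 address differ as strings but are equal once parsed. The sketch below uses Python's standard `ipaddress` module purely as an illustration; it is not the etcd fix, which is in Go, and the function name is hypothetical.

```python
import ipaddress


def same_address(a: str, b: str) -> bool:
    """Compare two IP address strings by value instead of by spelling."""
    return ipaddress.ip_address(a) == ipaddress.ip_address(b)


if __name__ == "__main__":
    short_form = "2001:db8::1"
    long_form = "2001:0db8:0000:0000:0000:0000:0000:0001"
    # A plain string comparison treats the two spellings as different.
    print(short_form == long_form)
    # Parsing first compares the underlying 128-bit value.
    print(same_address(short_form, long_form))
```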

 

serializable memberlist 

This is a carry we have had for some time: https://github.com/openshift/etcd/commit/26d7d842f6fb968e55fa5dbbd21bd6e4ea4ace50

This is now officially fixed (slightly differently) with the options pattern in: https://github.com/etcd-io/etcd/pull/15261

We need to drop the carry patch and take the upstream version when rebasing.

 

 

Epic Goal

Feature Overview (aka. Goal Summary)  

The goal of this initiative is to help boost adoption of OpenShift on ppc64le. This can be further broken down into several key objectives.

  • For IBM, furthering adoption of OpenShift will continue to drive adoption of their Power hardware. In parallel, existing customers can use this to migrate their old on-prem Power workloads to a cloud environment.
  • For the Multi-Arch team, this represents our first opportunity to develop an IPI offering on one of the IBM platforms. Right now, we depend on IPI on libvirt to cover our CI needs; however, this is not a supported platform for customers. PowerVS would address this caveat for ppc64le.
  • By bringing in PowerVS, we can provide customers with the easiest possible experience to deploy and test workloads on IBM architectures.
  • Customers already have UPI methods to solve their OpenShift on-prem needs for ppc64le. This gives them a cloud-based option, furthering our hybrid-cloud story.

Goals (aka. expected user outcomes)

  • The goal of this epic is to begin the process of expanding support of OpenShift on ppc64le hardware to include IPI deployments against the IBM Power Virtual Server (PowerVS) APIs.

Requirements (aka. Acceptance Criteria):

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Epic Goal

  • Improve IPI on Power VS in the 4.14 cycle
    • Changes to the installer to handle edge cases, fix bugs, and improve usability.
    • No major changes are anticipated this cycle.

Running doc to describe terminologies and concepts which are specific to Power VS - https://docs.google.com/document/d/1Kgezv21VsixDyYcbfvxZxKNwszRK6GYKBiTTpEUubqw/edit?usp=sharing

Feature Overview

This feature aims to enhance and clarify the functionalities of the Hypershift CLI. It was initially developed as a developer tool, but as its purpose evolved, a mix of supported and unsupported features were included. This has caused confusion for users who attempt to utilize unsupported functionalities. The goal is to clearly define the boundaries of what is possible and what is supported by the product.

Goals

Users should be able to effectively and efficiently use the Hypershift CLI with a clear understanding of what features are supported and what are not. This should reduce confusion and complications when utilizing the tool.

Requirements (aka. Acceptance Criteria):

Clear differentiation between supported and unsupported functionalities within the Hypershift CLI.
Improved documentation outlining the supported CLI options.
Consistency between the Hypershift CLI and the quickstart guide on the UI.
Security, reliability, performance, maintainability, scalability, and usability must not be compromised while implementing these changes.

Use Cases (Optional):

A developer uses the hypershift install command and only supported features are executed.
A user attempts to create a cluster using hypershift cluster create, and the command defaults to a compatible release image.

Questions to Answer (Optional):

What is the most efficient method for differentiating supported and unsupported features within the Hypershift CLI?
What changes need to be made to the documentation to clearly outline supported CLI options?

Out of Scope

Changing the fundamental functionality of the Hypershift CLI.
Adding additional features beyond the scope of addressing the current issues.

Background

The Hypershift CLI started as a developer tool but evolved to include a mix of supported and unsupported features. This has led to confusion among users and potential complications when using the tool. This feature aims to clearly define what is and isn't supported by the product.

Customer Considerations

Customers should be educated about the changes to the Hypershift CLI and its intended use. Clear communication about supported and unsupported features will help them utilize the tool effectively.

Documentation Considerations

Documentation should be updated to clearly outline supported CLI options. This will be a crucial part of user education and should be easy to understand and follow.

Interoperability Considerations

This feature may impact the usage of Hypershift CLI across other projects and versions. A clear understanding of these impacts and planning for necessary interoperability test scenarios should be factored in during development.

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster Nodepool creation

Definition of done:

  • hypershift create nodepool aws exists and has only the relevant needed flags for what we support in AWS
  • Unit tests
  • cluster creation with nodepool creation test plan in QE

As a HyperShift user I want to:

  • Have a convenient command that destroys an AWS cluster I deployed

Definition of done:

  • hypershift destroy cluster aws exists and destroys an AWS hosted cluster
  • QE test plan that uses the destroy cluster aws command

User Story:

As a user of HCP CLI, I want to be able to set some platform agnostic default flags when creating a HostedCluster:

  • additional-trust-bundle
  • annotations
  • arch
  • auto-repair
  • base-domain
  • cluster-cidr
  • control-plane-availability-policy
  • etcd-storage-class
  • fips
  • generate-ssh
  • image-content-sources
  • infra-availability-policy
  • infra-id
  • infra-json
  • name
  • namespace
  • node-drain-timeout
  • node-selector
  • node-upgrade-type
  • network-type
  • release-stream
  • render
  • service-cidr
  • ssh-key
  • timeout
  • wait

so that I can set default values for these flags for my particular use cases.

Acceptance Criteria:

Description of criteria:

  • Aforementioned flags are included in the HCP CLI general create cluster command.
  • Aforementioned flags are included in test plans & testing.

Out of Scope:

The flags listed in HyperShift Create Cluster CLI that don't seem platform agnostic:

  • BaseDomainPrefix - only in AWS
  • ExternalDNSDomain - only in AWS

These flags are also out of scope:

  • control-plane-operator-image - for devs (see Alberto's comment below)

Engineering Details:

  • N/A

This requires/does not require a design proposal.
This requires/does not require a feature gate.

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster creation

Definition of done:

  • hypershift create cluster kubevirt exists and has only the relevant needed flags for what we support in Kubevirt
  • Unit tests
  • cluster creation test plan in QE (ECODEPQE pipeline)

As a HyperShift user I want to:

  • Have a convenient command that destroys an Agent cluster I deployed

Definition of done:

  • hypershift destroy cluster agent exists and destroys an agent hosted cluster
  • QE test plan that uses the destroy cluster agent command

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster Nodepool creation

Definition of done:

  • hypershift create nodepool kubevirt exists and has only the relevant needed flags for what we support in kubevirt
  • Unit tests
  • cluster creation with nodepool creation test plan in QE (ECODEPQE pipeline)

As a HyperShift user I want to:

  • Have a convenient command that destroys a kubevirt cluster I deployed

Definition of done:

  • hypershift destroy cluster kubevirt exists and destroys a Kubevirt hosted cluster
  • QE test plan that uses the destroy cluster kubevirt command

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster Nodepool creation

Definition of done:

  • hypershift create nodepool agent exists and has only the relevant needed flags for what we support on bare metal with the Cluster API agent provider
  • Unit tests
  • cluster creation with nodepool creation test plan in QE (ECODEPQE pipeline)

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster creation

Definition of done:

  • hypershift create cluster agent exists and has only the relevant needed flags for what we support on bare metal with the Cluster API agent provider
  • Unit tests
  • cluster creation test plan in QE (ECODEPQE pipeline)

As a software developer and user of HyperShift CLI, I would like a prototype of how the Makefile can be modified to build different versions of the HyperShift CLI, i.e., dev version vs productized version.

As a HyperShift user I want to:

  • Have a convenient command that generates the kubeconfig file to access the hosted cluster I just deployed

Definition of done:

  • hypershift kubeconfig create exists and generates a kubeconfig file that is valid to access the deployed hosted cluster
  • QE test plan that uses the kubeconfig generation

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster creation

Definition of done:

  • hypershift create cluster aws exists and has only the relevant needed flags for what we support in AWS
  • Unit tests
  • cluster creation test plan in QE

Feature Overview

Enable release managers/Operator authors to manage Operator releases in the file-based catalog (FBC) based on the existing catalog (in sqlite) and distribute them to multiple OCP versions with ease.

Goals

  • Operator releases can be managed declaratively in a canonical source of truth and automated via git in the context of the OpenShift release lifecycle.
  • File-based catalog (FBC) can be converted back to sqlite format in order to be distributed to those OCP versions that do not support file-based catalog yet.
  • Existing catalog image in sqlite format can be converted to the basic template of file-based catalog (FBC) for easy adoption.
  • Existing catalog image in sqlite format can be converted to the semver template of file-based catalog (FBC) when possible, and/or the incomplete sections are highlighted so users can more easily identify the gaps.

Requirements

Requirement | Notes | isMvp?
A declarative mechanism to automate the catalog update process in file-based catalog (FBC) with newly-published bundle references. | | Yes
A declarative mechanism to publish Operator releases in file-based catalog (FBC) to multiple OCP releases. | | Yes
A declarative mechanism to convert file-based catalog (FBC) to sqlite database format so it can be published to OCP versions without FBC support. | | Yes
A declarative mechanism to convert existing catalog from sqlite database to file-based catalog (FBC) basic template. | | Yes
A declarative mechanism to convert existing catalog from sqlite database to file-based catalog (FBC) semver template when possible, and/or highlight the incomplete sections so users can more easily identify the gaps. | | NO
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | Yes
Release Technical Enablement | Provide necessary release enablement details and documents. | Yes

Use Cases

  • Operator authors/release managers can manage releases (i.e., edit the update paths) in a canonical source of truth (in FBC) and automate it via git to simplify the bundle release process.
  • Operator authors/release managers can manage and publish Operator releases from a canonical source of truth (in FBC) to multiple OCP versions.
  • Operator authors/release managers can manage and publish Operator releases from a canonical source of truth (in FBC) to older OCP versions that do not yet support FBC.
  • Operator authors/release managers can convert their existing catalog images in sqlite format to the basic template of file-based catalog (FBC) to jumpstart the catalog migration process.
  • Operator authors/release managers can convert their existing catalog images in sqlite format to the semver template of file-based catalog (FBC), when possible, to drive adoption, and/or have the incomplete sections highlighted so users can more easily identify the gaps.

Definition of Done / Acceptance criteria

  • All use cases above are implemented and meet the requirements.

Background, and strategic fit

A catalog maintainer frequently needs to make changes to an OLM catalog: whenever a new software version is released, when promoting an existing version and releasing it to a different channel, or when deprecating an existing version.  All these often require non-trivial changes to the update graph of an Operator package.  The maintainers need a git- and human-friendly maintenance approach that allows reproducing the catalog at all times and is decoupled from the release of their individual software versions.

The original imperative catalog maintenance approach, which relies on `replaces`, `skips`, `skipRange` attributes at the bundle level to define the relationships between versions and the update channels, is perceived as complicated by the Red Hat internal developer community.  Hence, the new file-based catalog (FBC) was introduced, which is declarative and GitOps-friendly.

Furthermore, the concept of a so-called “template”, as an abstraction layer of the FBC, is introduced to simplify interacting with FBCs.  While the “basic template” serves as a simplified abstraction of an FBC with all the `replaces`, `skips`, `skipRange` attributes supported and configurable at the package level, the “semver template” provides the capability to auto-generate an entire upgrade graph adhering to Semantic Versioning (semver) guidelines and consistent with best practices on channel naming.

Based on the feedback at KubeCon NA 2022, people were generally excited about the features introduced with FBC and the UX provided by the templates.  What is still missing is the tooling to enable the adoption.

Therefore, it is important to allow users to:

  • convert the existing catalog image in sqlite format to the basic template of file-based catalog (FBC) for easy adoption
  • convert the existing catalog image in sqlite format to the semver template of file-based catalog (FBC) when possible, and/or highlight the incomplete sections so users can more easily identify the gaps
  • automate the catalog update process using FBC with newly-published bundle references
  • publish Operator releases in file-based catalog (FBC) to multiple OCP releases
  • convert file-based catalog (FBC) back to sqlite database format so it can be published to OCP versions without FBC support

to help users adopt this novel file-based catalog approach and deliver value to customers with a faster release cadence and higher confidence. 

Documentation Considerations

  • The way “to automate the catalog update process in FBC with newly-published bundle references” needs to be documented (in the context of “Developing Operators”).
  • The way “to publish Operator releases in file-based catalog (FBC) to multiple OCP releases” needs to be documented (in the context of “Developing Operators” and “Administrator Tasks”).
  • The way “to convert file-based catalog (FBC) to sqlite database format so it can be published to OCP versions without FBC support” needs to be documented (in the context of “Developing Operators” and “Administrator Tasks”).
  • The way “to convert existing catalog from sqlite database to file-based catalog (FBC) basic template” needs to be documented (in the context of “Developing Operators”).
  • The way “to convert existing catalog from sqlite database to file-based catalog (FBC) semver template when possible and/or highlight the incomplete sections so users can more easily identify the gaps” needs to be documented (in the context of “Developing Operators”).

Epic Goal

  • SQLite catalog maintainers need a solution to facilitate veneer adoption.  The easiest capability to provide is migration to the basic veneer.  In addition, the mechanism needs to omit any properties from the original source which are no longer relevant in the new format.

Why is this important?

  • Minimizing friction to veneer adoption is key to speeding the FBC transition

Scenarios

  1. A maintainer wants to update a legacy catalog to veneer
  2. An Operator author wants to update their catalog contribution to veneer

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Documentation - MUST have supporting documentation easily available to catalog maintainers & operator authors

Open questions:

  1. For the migration path, is documentation of the current solution (opm render + yq/jq) sufficient, or do we need to support it in formal tooling (e.g. opm migrate + flag)?
  2. Are there any other obsolete properties we need to omit from the rendered FBC?

 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

Previously, bundle deprecation was handled by assigning an `olm.deprecated` property to the olm.bundle object.  SQLite DBs had to have all valid upgrade edges backed by olm.bundle information in order to prevent foreign key violations.  This property meant that the bundle was to be ignored and never installed.

FBC has a simpler method for achieving the same goal: don't include the bundle.  Upgrade edges from it may still be specified, and the bundle will not be installable.

This likely requires an update to the opm code base in the neighborhood of https://github.com/operator-framework/operator-registry/blob/249ae621bb8fa6fc8a8e4a5ae26355577393f127/pkg/sqlite/conversion.go#L80
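As a rough illustration of the intended behavior (not the opm implementation, which is Go), the sketch below drops every olm.bundle blob that carries an `olm.deprecated` property from a list of rendered FBC blobs. The function name `drop_deprecated_bundles`, the assumed blob shape (`schema`, `name`, `properties` with `type`/`value`), and the sample data are hypothetical and used only for this example.

```python
def drop_deprecated_bundles(blobs):
    """Return the FBC blobs without bundles marked olm.deprecated."""
    kept = []
    for blob in blobs:
        is_bundle = blob.get("schema") == "olm.bundle"
        is_deprecated = any(
            prop.get("type") == "olm.deprecated"
            for prop in blob.get("properties", [])
        )
        if is_bundle and is_deprecated:
            # In FBC, leaving the bundle out is enough to make it uninstallable.
            continue
        kept.append(blob)
    return kept

# Hypothetical rendered catalog content.
catalog = [
    {"schema": "olm.package", "name": "example-operator"},
    {"schema": "olm.bundle", "name": "example-operator.v1.0.0",
     "properties": [{"type": "olm.deprecated", "value": {}}]},
    {"schema": "olm.bundle", "name": "example-operator.v1.1.0",
     "properties": [{"type": "olm.package", "value": {"version": "1.1.0"}}]},
]

for blob in drop_deprecated_bundles(catalog):
    print(blob["schema"], blob["name"])
```

Channel entries are left untouched in this sketch, which matches the note that upgrade edges from an omitted bundle may still be specified.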

A/C:

  • CI/utest/e2e passes without flakes
  • appropriate documentation (all upstream) updated/reviewed

 

 

 

 

 

 

 

Feature Overview (aka. Goal Summary)  

This feature will track upstream work from the OpenShift Control Plane teams - API, Auth, etcd, Workloads, and Storage.

Goals (aka. expected user outcomes)

To continue and develop meaningful contributions to the upstream community including feature delivery, bug fixes, and leadership contributions.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

From https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#topologyspreadconstraints-field:

Note: The matchLabelKeys field is a beta-level field and enabled by default in 1.27. You can disable it by disabling the MatchLabelKeysInPodTopologySpread [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/).

Removing it from the TP as the feature is enabled by default.

This is just cleanup work.

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (See here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on the Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt in their namespaces to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn". 
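The switch in synchronization mode described above can be sketched as follows. This is only an illustration of which per-namespace label keys are managed in each mode, not the actual controller logic; the function name `synced_psa_labels` and its boolean parameter are assumptions, while the `pod-security.kubernetes.io/` label prefix is the standard upstream Pod Security Admission prefix.

```python
PSA_LABEL_PREFIX = "pod-security.kubernetes.io/"

def synced_psa_labels(global_restricted_enforcement):
    """Return the Pod Security Admission label keys the synchronizer manages."""
    # Before the change: only "audit" and "warn" are synchronized.
    # After the change: "enforce" is synchronized instead.
    modes = ["enforce"] if global_restricted_enforcement else ["audit", "warn"]
    return [PSA_LABEL_PREFIX + mode for mode in modes]

print(synced_psa_labels(False))
print(synced_psa_labels(True))
```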

BU Priority Overview

Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) with VMs

Goals

  • Enable installation of OpenShift 4 on Oracle Cloud Infrastructure (OCI) with VMs using the platform-agnostic method with Assisted Installer.
  • OpenShift 4 on OCI (with VMs) can be updated, resulting in a cluster and applications that are in a healthy state when the update is completed.
  • Telemetry reports back on clusters using OpenShift 4 on OCI for connected OpenShift clusters (e.g. platform=none using Oracle CSI).

State of the Business

Currently, we don't yet support OpenShift 4 on Oracle Cloud Infrastructure (OCI), and we know from initial attempts that installing OpenShift on OCI requires the use of a qcow image (the OpenStack qcow seems to work fine) as well as networking and routing changes, and that it runs into storage issues, potential MTU and registry issues, etc.

Execution Plans

TBD based on customer demand.

 

Why is this important

  • OCI is starting to gain momentum.
  • In the Middle East (e.g. Saudi Arabia), only OCI and Alibaba Cloud are approved hyperscalers.

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

RFEs:

  • RFE-3635 - Supporting Openshift on Oracle Cloud Infrastructure(OCI) & Oracle Private Cloud Appliance (PCA)

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

Other

 

 

 

Description of the problem:

Currently, the infrastructure object is created as follows:

 # oc get infrastructure/cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-06-19T13:49:07Z"
  generation: 1
  name: cluster
  resourceVersion: "553"
  uid: 240dc176-566e-4471-b9db-fb25c676ba33
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: None
status:
  apiServerInternalURI: https://api-int.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  apiServerURL: https://api.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: test-infra-cluster-97-w6b42
  infrastructureTopology: HighlyAvailable
  platform: None
  platformStatus:
    type: None

Instead, it should be similar to:

# oc get infrastructure/cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-06-19T13:49:07Z"
  generation: 1
  name: cluster
  resourceVersion: "553"
  uid: 240dc176-566e-4471-b9db-fb25c676ba33
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: External
    external:
      platformName: oci
status:
  apiServerInternalURI: https://api-int.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  apiServerURL: https://api.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: test-infra-cluster-97-w6b42
  infrastructureTopology: HighlyAvailable
  platform: External
  platformStatus:
    type: External
    external:
      cloudControllerManager:
        state: External
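A check for the expected shape could be sketched as below. It uses only field names that appear in the two YAML dumps above; the function name `is_external_oci` and the trimmed sample dictionaries are assumptions for illustration, and the sketch does not talk to a cluster.

```python
def is_external_oci(infrastructure):
    """Return True when the Infrastructure object declares the external OCI platform."""
    spec = infrastructure.get("spec", {}).get("platformSpec", {})
    status = infrastructure.get("status", {}).get("platformStatus", {})
    return (
        spec.get("type") == "External"
        and spec.get("external", {}).get("platformName") == "oci"
        and status.get("type") == "External"
    )

# Trimmed versions of the two objects shown above.
current = {
    "spec": {"platformSpec": {"type": "None"}},
    "status": {"platform": "None", "platformStatus": {"type": "None"}},
}
expected = {
    "spec": {"platformSpec": {"type": "External",
                              "external": {"platformName": "oci"}}},
    "status": {"platform": "External",
               "platformStatus": {"type": "External",
                                  "external": {"cloudControllerManager":
                                               {"state": "External"}}}},
}

print(is_external_oci(current))
print(is_external_oci(expected))
```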

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

We currently rely on a hack to deploy a cluster on external platform: https://github.com/openshift/assisted-service/pull/5312

The goal of this ticket is to move the definition of the external platform into the install-config once the OpenShift installer is released with support for the external platform: https://github.com/openshift/installer/pull/7217

The taint here: https://github.com/openshift/assisted-installer/pull/629/files#diff-1046cc2d18cf5f82336bbad36a2d28540606e1c6aaa0b5073c545301ef60ffd4R593

should only be removed when the platform is nutanix or vsphere, because the credentials for these platforms are passed after cluster installation.

In contrast, on Oracle Cloud the instance gets its credentials through the instance metadata, and should be able to label the nodes from the beginning of the installation without any user intervention.
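The rule above amounts to a small platform check. The sketch below is illustrative only and is not the assisted-installer code (which is Go); the function name `should_remove_taint` and the lowercase platform strings are assumptions.

```python
# Platforms whose cloud credentials are only supplied after installation.
PLATFORMS_WITH_POST_INSTALL_CREDENTIALS = {"nutanix", "vsphere"}

def should_remove_taint(platform):
    """Return True when the installer should remove the taint itself."""
    # On OCI the instance obtains credentials from instance metadata, so the
    # taint can be left for the cloud provider to clear during installation.
    return platform.lower() in PLATFORMS_WITH_POST_INSTALL_CREDENTIALS

for name in ("nutanix", "vsphere", "oci"):
    print(name, should_remove_taint(name))
```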

Description of the problem:
The features API tells us that EXTERNAL_PLATFORM_OCI is supported for version 4.14 and the s390x CPU architecture, but the attempt to create the cluster fails with "Can't set oci platform on s390x architecture".
 

 

Steps to reproduce:

1. Register cluster with OCI platform and z architecture

 

There are 2 options to detect if the hosts are running on OCI:

1/ On OCI, the machine will have the following chassis-asset-tag:

# dmidecode --string chassis-asset-tag
OracleCloud.com

In the agent, we can override hostInventory.SystemVendor.Manufacturer when chassis-asset-tag="OracleCloud.com".

2/  Read instance metadata: curl -v -H "Authorization: Bearer Oracle"  http://169.254.169.254/opc/v2/instance

This will allow auto-detection of the platform from the provider in assisted-service, and validation that hosts are running in OCI when installing a cluster with platform=oci.
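Option 1 can be sketched as a pure function over the value reported by dmidecode. This is not the agent's implementation; the function name `detect_manufacturer` is hypothetical, and using the asset tag itself as the overriding manufacturer value is an assumption, since the text does not say which value the override should set.

```python
OCI_CHASSIS_ASSET_TAG = "OracleCloud.com"

def detect_manufacturer(chassis_asset_tag, reported_manufacturer):
    """Return the manufacturer to record in the host inventory."""
    # dmidecode output usually ends with a newline, so strip before comparing.
    if chassis_asset_tag.strip() == OCI_CHASSIS_ASSET_TAG:
        # Assumption: the override value is the asset tag itself.
        return OCI_CHASSIS_ASSET_TAG
    return reported_manufacturer

print(detect_manufacturer("OracleCloud.com\n", "QEMU"))
print(detect_manufacturer("No Asset Tag", "QEMU"))
```

The metadata endpoint in option 2 needs network access from inside the instance, so the asset-tag check is the easier one to exercise in unit tests.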

Description of the problem:

I tested a cluster with platform type = 'baremetal' and discovered hosts. Then, when I try to change to the Nutanix platform, the BE returns an error.

How reproducible:

100% 

Steps to reproduce:

1. Create cluster without platform integration

2. Discover 3 hosts

3. Try to change platform to 'Nutanix'

Actual results:

API returns an error.

Expected results:
We can change the platform type; this change should be agnostic to the discovered hosts.

The external platform will be available behind the TechPreviewNoUpgrade feature set. Automatically enable this flag in the installer config when the oci platform is selected.
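A minimal sketch of that behavior, treating the install-config as a plain dictionary: the function name `apply_feature_set` and the sample config are hypothetical, and the sketch assumes the feature set is expressed through the top-level `featureSet` field of install-config. It is not the assisted-service code.

```python
def apply_feature_set(install_config, platform):
    """Return a copy of the install-config with the feature set enabled for OCI."""
    updated = dict(install_config)  # shallow copy; the input is left unchanged
    if platform == "oci":
        updated["featureSet"] = "TechPreviewNoUpgrade"
    return updated

# Hypothetical minimal install-config content.
base = {"apiVersion": "v1", "metadata": {"name": "example"}}

print(apply_feature_set(base, "oci"))
print(apply_feature_set(base, "baremetal"))
```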

Currently, the API call "GET /v2/clusters/{cluster_id}/supported-platforms" returns the hosts' supported platforms regardless of the other cluster parameters.

In order to install the Oracle CCM driver, we need the ability to set the platform to "external" in the install-config.

The platform needs to be added here: https://github.com/openshift/assisted-service/blob/3496d1d2e185343c6a3b1175c810fdfd148229b2/internal/installcfg/installcfg.go#L8

Slack thread: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1678801176091619

The goal of this ticket is to check whether, besides setting the external platform, the AI can install the CCM, and to document it.

Feature Overview (aka. Goal Summary)  

Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.

Goals (aka. expected user outcomes)

  • Simplify the operators with a unified code pattern
  • Expose metrics from control-plane components
  • Use proper RBACs in the guest cluster
  • Scale the pods according to HostedControlPlane's AvailabilityPolicy
  • Add proper node selector and pod affinity for mgmt cluster pods

Requirements (aka. Acceptance Criteria):

  • OCP regression tests work in both standalone OCP and HyperShift
  • Code in the operators looks the same
  • Metrics from control-plane components are exposed
  • Proper RBACs are used in the guest cluster
  • Pods scale according to HostedControlPlane's AvailabilityPolicy
  • Proper node selector and pod affinity is added for mgmt cluster pods

 

Epic Goal*

In 4.12 we tried several approaches to writing an operator that works both in standalone OCP and in HyperShift's control plane running in the management cluster.

These operators need to be changed:

  • csi-snapshot-controller-operator
  • cluster-storage-operator
  • aws-ebs-csi-driver-operator

We need to unify the operators to use a similar approach, so that the code in our operators looks the same.

In addition, we need to update the operators to:

  • Expose metrics from control-plane components (esp. csi-snapshot-controller and aws-ebs-csi-driver-controller pods).
  • Use proper RBACs in the guest cluster, so csi-snapshot-controller and aws-ebs-csi-driver-controller do not run as cluster-admin
    • Note that all components already have proper RBACs in the mgmt. cluster.
  • Scale csi-snapshot-controller and aws-ebs-csi-driver-controller pods according to HostedControlPlane's AvailabilityPolicy
  • Add proper node selector + pod affinity to all Pods in the mgmt cluster according to https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/

Why is this important? (mandatory)

It will simplify our operators - we will have the same pattern in all of them.

 

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. OCP users (both standalone and hypershift) should not see any change.
  2. Code in the operators looks the same.

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development -  yes
  • Documentation - No
  • QE -  Regression tests only
  • PX - No
  • Others -

Acceptance Criteria (optional)

OCP regression tests work, both on standalone OCP and HyperShift.

Drawbacks or Risk (optional)

We could introduce regressions

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • QE - Test scenarios are written and executed successfully.
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

We should refactor CSO to remove duplicated code between HyperShift and standalone deployments.

We are also going to reduce duplication of manifests so that templates can be reused between HyperShift and standalone clusters.

This feature is the placeholder for all epics related to technical debt associated with the Console team.

Outcome Overview

Once all Features and/or Initiatives in this Outcome are complete, what tangible, incremental, and (ideally) measurable movement will be made toward the company's Strategic Goal(s)?

 

Success Criteria

What is the success criteria for this strategic outcome?  Avoid listing Features or Initiatives and instead describe "what must be true" for the outcome to be considered delivered.

 

 

Expected Results (what, how, when)

What incremental impact do you expect to create toward the company's Strategic Goals by delivering this outcome?  (possible examples:  unblocking sales, shifts in product metrics, etc. + provide links to metrics that will be used post-completion for review & pivot decisions). For each expected result, list what you will measure and when you will measure it (ex. provide links to existing information or metrics that will be used post-completion for review and specify when you will review the measurement such as 60 days after the work is complete)

 

 

Post Completion Review – Actual Results

After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).

 

Goal

Guided installation user experience that interacts via prompts for necessary inputs, informs of erroneous/invalid inputs, and provides status and feedback throughout the installation workflow with very few steps, that works for disconnected, on-premises environments.

Installation is performed from a bootable image that doesn't contain cluster details or user details, since these details will be collected during the installation flow after booting the image in the target nodes.

This means that the image is generic and can be used to install an OpenShift cluster in any supported environment.

Why is this important?

Customers/partners desire a guided installation experience to deploy OpenShift with a UI that includes support for disconnected, on-premises environments, and which is as flexible in terms of configuration as UPI.

We have partners that need to provide an installation image that can be used to install new clusters on any location and for any users, since their business is to sell the hardware along with OpenShift, where OpenShift needs to be installable in the destination premises.

Acceptance Criteria

This experience should provide an experience closely matching the current hosted service (Assisted Installer), with the exception that it is limited to a single cluster because the host running the service will reboot and become a node in the cluster as part of the deployment process.

  • User can successfully deploy OpenShift using the installer's guided experience.
  • User can specify a custom registry for disconnected scenario, which may include uploading a cert and validation.
  • User can specify node-network configurations, at a minimum: DHCP, Static IP, VLAN and Bonds.
  • User can use the same image to install clusters with different settings (collected during the installation).
  • Documentation is updated to guide user step-by-step to deploy OpenShift in disconnected settings with installer.

Dependencies

  1. Guided installation onboarding design from UXD team.
  2. UI development

 

Epic Goal

  • Allow the user to select a host to be Node 0 interactively after booting the ISO. On each host the user would be presented with a choice between two options:
  1. Select this host as the rendezvous host (it will become part of the control plane)
  2. The IP address of the rendezvous host is: [Enter IP]

(If the former option is selected, the IP address should be displayed so that it can be entered in the other hosts.)

Why is this important?

  • Currently, when using DHCP the user must determine which IP address is assigned to at least one of the hosts prior to generating the ISO. (OpenShift requires infinite DHCP leases anyway, so no extra configuration is required but it does mean trying to manually match data with an external system.) AGENT-385 would extend a similar problem to static IPs that the user is planning to configure interactively, since in that case we won't have the network config to infer them from. We should permit the user to delay collecting this information until after the hosts are booted and we can discover it for them.

Scenarios

  1. In a DHCP network, the user creates the agent ISO without knowing which IP addresses are assigned to the hosts, then selects one to act as the rendezvous host after booting.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. AGENT-7

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Currently we use templating to set the NodeZero IP address in a number of different configuration files and scripts.

We should move this configuration to a single file (/etc/assisted/rendezvous-host.env) and reference it only from there, e.g. as a systemd environment file.

We also template values like URLs, because it is easier and safer to do this in golang (e.g. to use an IP address that may be either IPv4 or IPv6 in a URL) than in bash. We may need to include all of these variables in the file.

This will enable us to interactively configure the rendezvousIP in a single place.
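
The IPv4/IPv6 concern can be illustrated with a short sketch that renders such an environment file. The variable names `NODE_ZERO_IP` and `SERVICE_BASE_URL`, the port, and both function names are assumptions made for illustration; the actual file contents are not specified here.

```python
import ipaddress


def service_base_url(rendezvous_ip: str, port: int = 8090) -> str:
    """Build a URL from an address that may be IPv4 or IPv6.

    IPv6 literals must be wrapped in square brackets inside a URL.
    The port default is illustrative only.
    """
    addr = ipaddress.ip_address(rendezvous_ip)  # raises ValueError on bad input
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    return f"http://{host}:{port}/"


def render_env_file(rendezvous_ip: str) -> str:
    """Render the content of a systemd-style environment file (hypothetical keys)."""
    return (
        f"NODE_ZERO_IP={rendezvous_ip}\n"
        f"SERVICE_BASE_URL={service_base_url(rendezvous_ip)}\n"
    )
```

Pre-computing the URL in the file keeps the bracket handling out of the shell scripts that consume it.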

Block services that depend on knowing the rendezvousIP from starting until the rendezvousIP configuration file created in AGENT-555 exists. This will probably take the form of just looping in node-zero.service until the file is present. The systemd configuration may need adjustments to prevent the service from timing out.

While we are waiting, a message should be displayed on the hardware console indicating what is happening.
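
The waiting behaviour described above amounts to a polling loop. The sketch below is an illustration only (the real service is expected to be a script started by systemd); the function name and its injectable `exists`, `sleep`, and `on_wait` parameters are hypothetical and exist to keep the loop testable.

```python
import os
import time


def wait_for_file(path, poll_seconds=5.0, on_wait=print,
                  exists=os.path.exists, sleep=time.sleep):
    """Block until `path` exists and return how many polls found it missing.

    `on_wait` receives a status message on every unsuccessful poll, standing
    in for the message shown on the hardware console.
    """
    waits = 0
    while not exists(path):
        on_wait(f"Waiting for rendezvous host configuration at {path} ...")
        waits += 1
        sleep(poll_seconds)
    return waits
```

Because the loop has no upper bound, whatever starts it must not apply a start timeout, which matches the note about adjusting the systemd configuration.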

Epic Goal

  • Have a friendly graphical user interface, running on node0, to perform interactive installation

Why is this important?

  • Allows the WebUI to run in Agent based installation where we can only count on node0 to run it
  • Provides a familiar (close to SaaS) interface to walk through the first cluster installation
  • Interactive installation takes us closer to having generated images that serve multiple first cluster installations

Scenarios

  1. As an admin, I want to generate an ISO that I can send to the field to perform a friendly, interactive installation

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Assisted-Service WebUI needs an Agent based installation wizard

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Modify the cluster registration code in the assisted-service client (used by create-cluster-and-infraenv.service) to allow creating the cluster given only the following config manifests:

  • ClusterImageSet
  • InfraEnv

If the following manifests are present, data from them should be used:

  • AgentPullSecret
  • NMStateConfig
  • extra manifests

Other manifests (ClusterDeployment, AgentClusterInstall) will not be present in an interactive install, and the information therein will be entered via the GUI instead.

A CLI flag or environment variable can be used to select the interactive mode.
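
The manifest rules above can be sketched as a small validation step. This is not the assisted-service client's code: the function name is hypothetical, and requiring ClusterDeployment and AgentClusterInstall in non-interactive mode is an assumption inferred from the text. Extra manifests are omitted because they are not a single manifest kind.

```python
REQUIRED = {"ClusterImageSet", "InfraEnv"}
OPTIONAL = {"AgentPullSecret", "NMStateConfig"}
# Assumed to be mandatory only when the install is not interactive.
NON_INTERACTIVE_ONLY = {"ClusterDeployment", "AgentClusterInstall"}


def plan_registration(present_kinds, interactive: bool):
    """Return the optional manifest kinds whose data should be used.

    Raises ValueError when a manifest needed for the selected mode is missing.
    """
    present = set(present_kinds)
    missing = REQUIRED - present
    if not interactive:
        missing |= NON_INTERACTIVE_ONLY - present
    if missing:
        raise ValueError(f"missing manifests: {sorted(missing)}")
    return sorted(present & OPTIONAL)
```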

The Control Plane MachineSet enables OCP clusters to scale Control plane machines. This epic is about making the Control Plane MachineSet controller work with OpenStack.

Goal

  • The control plane nodes can be scaled up and down, lost and recovered.

Why is this important?

  • The procedure to recover from a failed control plane node and to add new nodes is lengthy. To increase scaling flexibility, a simpler mechanism needs to be supported.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://docs.openshift.com/container-platform/4.12/machine_management/control_plane_machine_management/cpmso-about.html

The FailureDomain API introduced in 4.13 was TechPreview and has been replaced by an API in openshift/api; it no longer lives in the installer.

Therefore, we want to remove any unsupported API from the installer so that we can later add the supported API and, with it, support for CPMS on OpenStack.

Goals

  • Make kubelet aware of the underlying node shutdown event and trigger pod termination with a grace period sufficient to shut down properly
  • Handle node shutdown in a cloud-provider-agnostic way
  • Introduce a minimal shutdown delay in order to shut down the node as soon as possible (but not sooner)
  • Focus on handling shutdown on systemd based machines

Story 1

  • As a cluster administrator, I can configure the nodes in my cluster to allocate X seconds for my pods to terminate gracefully during a node shutdown

Story 2

  • As a developer I can expect that my pods will terminate gracefully during node shutdowns

 

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As an OpenShift developer, I want to have confidence that the graceful restart feature works and stays working in the future through various code changes. To that end, please add at least the following 2 E2E tests:

  • A valid pod/workload with timeout that is respected by the system before shutdown.
  • A rogue pod that has an extremely high timeout that is not respected by the system.
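
The expectation behind both tests can be written as a single rule: a pod gets its own termination grace period unless that exceeds what the node allows during shutdown. The sketch below is a simplified model under that assumption, not kubelet code, and the function and parameter names are hypothetical.

```python
def effective_grace_seconds(pod_grace_seconds: int, node_shutdown_budget_seconds: int) -> int:
    """Grace period a pod can expect during node shutdown (simplified model).

    A well-behaved pod keeps its requested timeout; a rogue pod asking for
    more than the node's shutdown budget is capped at that budget.
    """
    return min(pod_grace_seconds, node_shutdown_budget_seconds)
```

With a 30-second budget, the valid pod in the first test would keep a 20-second timeout, while the rogue pod in the second would be cut off at 30 seconds.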

Goals

Track goals/requirements for self-managed GA of Hosted control planes on BM using the agent provider. Mainly make sure: 

  • BM flow via the Agent is documented. 
    • Make sure the documentation with HyperShiftDeployment is removed.
    • Make sure the documentation uses the new flow without HyperShiftDeployment 
  • We have a reference architecture on the best way to deploy. 
  • UI for provisioning BM via MCE/ACM is complete (with host inventory). 

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Background, and strategic fit

Customers are looking at HyperShift to deploy self-managed clusters on Baremetal. We have positioned the Agent flow as the way to get BM clusters due to its ease of use (it automates many of the rather mundane tasks required to set up BM clusters), and it is planned for GA with MCE 2.3 (in the OCP 4.13 timeframe). 

 

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Feature goal (what are we trying to solve here?)

Group all tasks for CAPI-provider-agent GA readiness

Does it need documentation support?

no

Feature origin (who asked for this feature?)

  •  

Reasoning (why it’s important?)

  • In order for the Hypershift Agent platform to be GA in ACM 2.9 we need to improve our coverage and fix the bugs in this epic 

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • Does this feature exist in the UI of other installers?

The test waits until all pods in the control plane namespace report ready status, but collect-profiles is a job that sometimes completes before the other pods are ready.

Once the collect-profiles pod completes, it terminates and its status moves to ready=false.
From then on, the test is stuck.
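
One way to avoid the hang is to leave completed job pods out of the readiness check. The sketch below models pods as simple dictionaries with `phase` and `ready` keys; that shape and the function name are assumptions for illustration, not the test's actual types.

```python
def control_plane_ready(pods) -> bool:
    """Return True when every pod that is still expected to run is ready.

    Pods in the Succeeded phase (for example a finished collect-profiles
    job pod) report ready=false forever, so they are skipped.
    """
    return all(pod["ready"] for pod in pods if pod["phase"] != "Succeeded")
```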

Goal

Support Dual-Stack Networking (IPv4 & IPv6) for hosted control planes. 

Why is this important?

Many of our customers, especially Telco providers, need to support IPv6 but cannot do so immediately, because they still have legacy IPv4 workloads. To support both stacks, an OpenShift cluster must allow communication over both address families; that is, an OpenShift cluster running with hosted control planes should allow workloads to access both IP stacks.

Scenarios

As a cluster operator, you have the option to expose external endpoints using one or both address families, in any order that suits your needs. OpenShift does not make any assumptions about the network it operates on. For instance, if you have a small IPv4 address space, you can enable dual-stack on some of your cluster nodes and have the rest running on IPv6, which typically has a more extensive address space available.

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

When deploying a dual stack HostedCluster the KAS certificate won't be created with the proper SAN. If we look into a regular dual-stack cluster we can see the certificate gets generated as follows:

X509v3 Subject Alternative Name:
    DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:openshift, DNS:openshift.default, DNS:openshift.default.svc, DNS:openshift.default.svc.cluster.local, DNS:172.30.0.1, DNS:fd02::1, IP Address:172.30.0.1,
IP Address:FD02:0:0:0:0:0:0:1


whereas in a dual-stack hosted cluster this is the SAN:

X509v3 Subject Alternative Name:
    DNS:localhost, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:kube-apiserver, DNS:kube-apiserver.clusters-hosted.svc, DNS:kube-apiserver.clusters-hosted.svc.cluster.local, DNS:api.hosted.dual.lab, DNS:api.hosted.hypershift.local, IP Address:127.0.0.1, IP Address:172.31.0.1


As you can see it's missing the IPv6 pod+service IP on the certificate.

This causes issues on some controllers when contacting the KAS.

example:
E0711 16:51:42.536367       1 reflector.go:140] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.31.0.1:443/api/v1/services?limit=500&resourceVersion=0": x509: cannot validate certificate for 172.31.0.1 because it doesn't contain any IP SANs
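
For reference, the service IPs that the certificate is expected to carry can be derived from the service network CIDRs, since the kubernetes service takes the first address of each range. This is a minimal sketch with a hypothetical function name, not the certificate-generation code.

```python
import ipaddress


def kubernetes_service_ips(service_cidrs):
    """Return the first address of each service CIDR, one per address family."""
    return [str(ipaddress.ip_network(cidr)[1]) for cidr in service_cidrs]
```

For a service network of 172.31.0.0/16 and fd02::/112, this gives 172.31.0.1 and fd02::1; only the first of these appears in the hosted cluster SAN shown above.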


Version-Release number of selected component (if applicable):

latest

How reproducible:

Always

Steps to Reproduce:

1. Deploy a HC with the networking settings specified and using the image with dual stack patches included quay.io/jparrill/hypershift:OCPBUGS-15331-mix-413v4

Actual results:

KubeApiserver cert gets generated with the wrong SAN config.

Expected results:

KubeApiserver cert gets generated with the correct SAN config.

Additional info:

 

Description of problem:

Installing a 4.14 self-managed hosted cluster on a dual-stack hub with the "hypershift create cluster agent" command. The logs of the hypershift operator pod show a bunch of these errors:

{"level":"error","ts":"2023-06-08T13:36:26Z","msg":"Reconciler error","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","hostedCluster":{"name":"hosted-0","namespace":"clusters"},"namespace":"clusters","name":"hosted-0","reconcileID":"a0a0f44f-7bbe-499f-95b0-e24b793ee48c","error":"failed to reconcile network policies: failed to reconcile kube-apiserver network policy: NetworkPolicy.extensions \"kas\" is invalid: spec.egress[1].to[0].ipBlock.except[1]: Invalid value: \"fd01::/48\": must be a strict subset of `cidr`","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}


The hostedcluster CR shows the same ReconciliationError. Note that the networking section of the hostedcluster CR created by the "hypershift create cluster agent" command has IPv4 CIDRs:

  networking:
    clusterNetwork:
    - cidr: 10.132.0.0/14
    networkType: OVNKubernetes
    serviceNetwork:
    - cidr: 172.31.0.0/16


while services have IPv6 NodePort addresses.

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.14.0-0.nightly-2023-06-05-112833
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-06-05-112833
Kubernetes Version: v1.27.2+cc041e8

How reproducible:

100%

Steps to Reproduce:

1. Install 4.14 OCP dual-stack BM hub cluster
2. Install MCE 2.4 and Hypershift operator
3. Install hosted cluster with "hypershift create cluster agent" command 

Actual results:

hosted cluster CR shows ReconciliationError:

  - lastTransitionTime: "2023-06-08T10:55:33Z"
    message: 'failed to reconcile network policies: failed to reconcile kube-apiserver
      network policy: NetworkPolicy.extensions "kas" is invalid: spec.egress[1].to[0].ipBlock.except[1]:
      Invalid value: "fd01::/48": must be a strict subset of `cidr`'
    observedGeneration: 2
    reason: ReconciliationError
    status: "False"
    type: ReconciliationSucceeded

Expected results:

ReconciliationSucceeded condition should be True
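
The validation error arises because an `except` entry must fall inside the `cidr` of the same ipBlock, and an IPv6 range can never be a subset of an IPv4 one. A possible approach is to emit one ipBlock per address family. The sketch below assumes catch-all `cidr` values and a hypothetical function name; it is not the fix applied in HyperShift.

```python
import ipaddress

# Assumed catch-all ranges, one per IP version.
CATCH_ALL = {4: "0.0.0.0/0", 6: "::/0"}


def ip_blocks(except_cidrs):
    """Group except CIDRs by address family and build one ipBlock for each."""
    grouped = {}
    for cidr in except_cidrs:
        net = ipaddress.ip_network(cidr)
        grouped.setdefault(net.version, []).append(str(net))
    return [
        {"cidr": CATCH_ALL[version], "except": grouped[version]}
        for version in sorted(grouped)
    ]
```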

Additional info:

Logs and CRDs produced by the failed job: https://s3.upshift.redhat.com/DH-PROD-OCP-EDGE-QE-CI/ocp-spoke-assisted-operator-deploy/8044/post-mortem.zip

Description of problem:

When deploying a dual stack HostedCluster the worker nodes will not fully join the cluster because the CNI plugin doesn't start. If we check the cluster-network-operator pod we will see the following error:

I0711 13:46:16.012420       1 log.go:198] Failed to validate Network.Spec: hostPrefix 23 is larger than its cidr fd01::/48  



It seems that it is validating the IPv4 hostPrefix against the IPv6 pod network. This is what the networking spec for the HC looks like:

  networking:
    clusterNetwork:
    - cidr: 10.132.0.0/14
    - cidr: fd01::/48
    networkType: OVNKubernetes
    serviceNetwork:
    - cidr: 172.31.0.0/16
    - cidr: fd02::/112
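
The failing check can be reproduced with a small sketch: a host prefix only fits when it is at least as long as the prefix of the network it subdivides. The function name is hypothetical and this is not the cluster-network-operator's code.

```python
import ipaddress


def host_prefix_fits(cidr: str, host_prefix: int) -> bool:
    """Return True when per-node subnets of length host_prefix fit inside cidr."""
    net = ipaddress.ip_network(cidr)
    return net.prefixlen <= host_prefix <= net.max_prefixlen
```

A host prefix of 23 fits 10.132.0.0/14 but not fd01::/48, which is why applying the IPv4 value to the IPv6 entry fails; an IPv6-sized value such as 64 would pass.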

 

Version-Release number of selected component (if applicable):

latest

How reproducible:

Always

Steps to Reproduce:

1. Deploy a HC with the networking settings specified and using the image with dual stack patches included quay.io/jparrill/hypershift:OCPBUGS-15331-mix-413v2 

Actual results:

CNI is not deployed

Expected results:

CNI is deployed

Additional info:

Discussed on slack https://redhat-internal.slack.com/archives/C058TF9K37Z/p1689078655055779

To run a HyperShift management cluster in disconnected mode we need to document which images need to be mirrored and potentially modify the images we use for OLM catalogs.

ICSP mapping only happens for image references with a digest, not a regular tag. We need to address this for images we reference by tag:
CAPI, CAPI provider, OLM catalogs
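
To find the references that ICSP cannot map, image references can be split into digest-based and tag-based ones. The sketch below uses a simple string check rather than a full image reference parser, and both function names are hypothetical.

```python
def is_digest_reference(image_ref: str) -> bool:
    """Return True for references of the form name@algorithm:hex."""
    # Only inspect the last path component so a registry host is never misread.
    last_component = image_ref.rsplit("/", 1)[-1]
    return "@" in last_component


def tag_references(image_refs):
    """Return the references that are not pinned by digest."""
    return [ref for ref in image_refs if not is_digest_reference(ref)]
```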

Currently OLM catalogs placed in the control plane use image references to a tag so that the latest can be pulled when the catalog is restarted. There is a CRON job that restarts the deployment on a regular basis.

The issue with this is that the image cannot be mirrored for offline deployments, nor can it be used in environments (IBM cloud) where all images running on a management cluster need to be approved beforehand by digest.

As a user of Hosted Control Planes, I would like the HCP Specification API to support both ICSP & IDMS.

IDMS is replacing ICSP in OCP 4.13+.  hcp.Spec.ImageContentSources was updated in OCPBUGS-11939 to replace ICSP with IDMS. This needs to be reverted and something new added to support IDMS in addition to ICSP.

Description of problem:

HostedClusterConfigOperator doesn't check the OperatorHub object in the Hosted Cluster. As a result, the default catalogsources cannot be disabled. If there are failing catalogsources, operator deployments might be impacted.

Version-Release number of selected component (if applicable):

Any

How reproducible:

Always

Steps to Reproduce:

1. Deploy a HostedCluster
2. Connect to the hostedcluster and patch the operatorhub object: `oc --kubeconfig ./hosted-kubeadmin patch OperatorHub cluster --type json -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'` 
3. CatalogSources objects won't be removed from the openshift-marketplace namespace.

Actual results:

CatalogSources objects are not removed from the openshift-marketplace namespace.

Expected results:

CatalogSources objects are removed from the openshift-marketplace namespace.

Additional info:

This is the code where we can see that the reconcile creates the catalogsources every time.

https://github.com/openshift/hypershift/blob/dba2e9729024ce55f4f2eba8d6ccb8801e78a022/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1285
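
The expected behaviour can be sketched as a decision made before reconciling: read the OperatorHub spec and only create the default catalog sources when they are not disabled. The function name and the dictionary shape are assumptions, and per-source settings are ignored for brevity.

```python
DEFAULT_SOURCES = (
    "certified-operators",
    "community-operators",
    "redhat-marketplace",
    "redhat-operators",
)


def desired_catalog_sources(operator_hub_spec: dict):
    """Return the default CatalogSources that should exist in openshift-marketplace."""
    if operator_hub_spec.get("disableAllDefaultSources", False):
        return []
    return list(DEFAULT_SOURCES)
```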

As a user of hosted clusters in disconnected environments, I would like RegistryClientImageMetadataProvider to support registry overrides so that registry lookups utilize the registries in the registry overrides rather than what might be listed in the image reference.
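
A registry override lookup could work roughly as follows. This is a sketch with a hypothetical function name and a plain source-to-mirror mapping, not the RegistryClientImageMetadataProvider implementation.

```python
def apply_registry_overrides(image_ref: str, overrides: dict) -> str:
    """Rewrite image_ref to its mirror when a source prefix matches.

    A prefix only matches on a boundary ('/', ':' or '@', or the end of the
    reference), so 'quay.io/foo' does not match 'quay.io/foobar'.
    """
    # Try the longest source first so the most specific override wins.
    for source in sorted(overrides, key=len, reverse=True):
        rest = image_ref[len(source):]
        if image_ref.startswith(source) and (rest == "" or rest[0] in "/:@"):
            return overrides[source] + rest
    return image_ref
```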

Description of problem:

When user configures HostedCluster.Spec.additionalTrustBundle, there are some deployments that add this trust bundle using a volume. The ignition-server deployment won't add this volume.

Version-Release number of selected component (if applicable):

Any

How reproducible:

Always

Steps to Reproduce:

1. Deploy a HostedCluster with additionalTrustBundle
2. Check ignition-server deployment configuration

Actual results:

No trust bundle configured

Expected results:

Trust bundle configured.

Additional info:

There is missing code.

Ignition-server-proxy does configure the trust bundle: https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/ignitionserver/ignitionserver.go#L745-L748
Ignition-server does not: https://github.com/openshift/hypershift/blob/main/control-plane-operator/controllers/hostedcontrolplane/ignitionserver/ignitionserver.go#L694

Feature Overview (aka. Goal Summary)  

Phase 2 Goal:  

  • Complete the design of the Cluster API (CAPI) architecture and build the core operator logic
  • attach and detach of load balancers for internal and external load balancers for control plane machines on AWS, Azure, GCP and other relevant platforms
  • manage the lifecycle of Cluster API components within OpenShift standalone clusters
  • E2E tests

for Phase-1, incorporating the assets from different repositories to simplify asset management.

Background, and strategic fit

Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
Phase 1 & 2 cover implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had. The project has moved on, and CAPI is now a better fit and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect on customers/users of the MAPI. This API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.

Epic Goal

  • To create an operator to manage the lifecycle of Cluster API components within OpenShift standalone clusters

Why is this important?

  • We need to be able to install and lifecycle the Cluster API ecosystem within standalone OpenShift
  • We need to make sure that we can update the components via an operator
  • We need to make sure that we can lifecycle the APIs via an operator

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

The cluster-capi-operator repository contains several CAPI E2E tests for specific providers.

We run these tests on every PR that lands on that repository.

To test rebases of the cluster-api providers, we also want to run these tests in the provider repositories to prove that rebase PRs do not break CAPI functionality.

DoD:

  • Set techpreview openshift E2E jobs for the cluster-api providers to make sure the build doesn't break the TP payload
  • Set techpreview CAPI E2E jobs for cluster-api providers repositories for providers where a corresponding e2e test is present in the cluster-capi-operator
    • for now only AWS, GCP and IBMCloud have E2Es
  • Add a target/script in the cluster-api providers repositories for running the E2E CAPI tests, where it applies.

Feature Overview (aka. Goal Summary)  

cgroup v2 is GA as of OCP 4.13.

RHEL 9 defaults to v2, and we want to make sure we are in sync.

v1 support in systemd will end by the end of 2023.

 

Goals (aka. expected user outcomes)

  1. Default for new clusters
  2. Not the default for upgrading clusters: customers on cgroup v1 who upgrade from 4.13 to 4.14 will still have cgroup v1 (it will not be a forced migration)
  3. Upgrading customers will have the option to move to v2 as a day-2 operation

What needs to be done

  1. Default in 4.14
  2. Change 4.13.z so that a cluster upgraded to 4.14 stays on v1
  3. NTO changes to default to v1
  4. Test with cgroup v1 (where cgroup v2 was previously)
  5. Release notes on applications that are affected
    • If you run third-party monitoring and security agents that depend on the cgroup file system, update the agents to versions that support cgroup v2.
    • If you run cAdvisor as a stand-alone DaemonSet for monitoring pods and containers, update it to v0.43.0 or later.
    • If you deploy Java applications, prefer to use versions which fully support cgroup v2:
      • OpenJDK / HotSpot: jdk8u372, 11.0.16, 15 and later
      • IBM Semeru Runtimes: jdk8u345-b01, 11.0.16.0, 17.0.4.0, 18.0.2.0 and later
      • IBM Java: 8.0.7.15 and later
  6. Announcement blog (and warning about a forced upgrade in the future)
  7. Reach out to TRT

 

https://docs.google.com/document/d/1i6IEGjaM0-NeMqzm0ZnVm0UcfcVmZqfQRG1m5BjKzbo/edit 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Make cgroup v2 default in 4.14

Why is this important?

  • To bring the advantages of cgroup v2 to users of 4.14+

Scenarios

  1. As a cluster owner I want to run my system using cgroup v2 for its added benefits.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This issue consists of the following changes.

  1. Update the existing 4.13 MCO code to store the cgroups mode in the config node spec, so that 4.14 and future releases can refer to it.
  2. Require all clusters to upgrade to a release with the above change before upgrading to 4.14, by bumping the minor version in the openshift/cincinnati-graph-data repository. Refer here (reach out to #forum-updates).
  3. Remove the explicit setting of cgroups v1 and update it to cgroups v2 in the 4.14/master code of the MCO repo.
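
The intended outcome of these changes (new clusters default to v2, upgraded clusters keep what they had) can be sketched as a small decision function. This is only an illustration in Python, not the MCO's code: the function name `resolve_cgroup_mode`, its parameters, and the fallback for an upgraded cluster with no recorded mode are all assumptions.

```python
from typing import Optional

DEFAULT_NEW_CLUSTER_MODE = "v2"  # 4.14 default for fresh installs, per the goals above


def resolve_cgroup_mode(stored_mode: Optional[str], is_upgrade: bool) -> str:
    """Pick the cgroup mode a 4.14 cluster should run with (hypothetical helper).

    stored_mode: mode recorded in the config node spec by the 4.13 change
                 ("v1", "v2", or None if nothing was recorded).
    is_upgrade:  True when the cluster was upgraded from 4.13.
    """
    if stored_mode in ("v1", "v2"):
        # A recorded choice always wins, so there is no forced migration.
        return stored_mode
    if is_upgrade:
        # Assumption: an upgraded cluster without a recorded mode was on v1.
        return "v1"
    return DEFAULT_NEW_CLUSTER_MODE
```

Recording the mode before the upgrade is what lets the 4.14 code tell an upgraded v1 cluster apart from a fresh install.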

Feature Overview (aka. Goal Summary)  

Add support to the OpenShift Installer to set up the field 'managedBy' on the Azure Resource Group

Goals (aka. expected user outcomes)

As a user, I want to be able to provide a new field in the Installer's manifest that is used to configure the `managedBy` tag on the Azure Resource Group

Requirements (aka. Acceptance Criteria):

The Installer will provide a new field via the Install Config manifest to be used to tag the Azure Resource Group

Use Cases (Optional):

This is a requirement for the ARO SRE teams for their automation tool to identify these resources.

Background

ARO needs this field set for their automation tool in the background. Doc for more details.

Documentation Considerations

This new field will need to be documented like any other field supported via the Install Config manifest

Feature Overview

  • Allow for the setting of the field 'managedBy' on the Azure resource group.

Epic Goal

  • Users are able to set the 'managedBy' field of the Azure resource group during installation.

Why is this important?

  • ARO needs this field set for their automation tool in the background.

Background

  • ARO needs this field set for their automation tool in the background. Doc for more details.

Out of scope

  1. None.

Acceptance Criteria

  • Install config allows users to set a new field with the value for the 'managedBy' field.
  • Once above field is set, the installation creates a new resource group with the 'managedBy' field set.

Dependencies (internal and external)

  1. Terraform provider azurerm which is used for installation in Azure does not have functionality to set this field and needs to be updated accordingly to facilitate this feature.

Customer Considerations

  1. None

Documentation Considerations

  1. Doc for new field in the install config needs to be added.

Interoperability Considerations

  1. None

Open questions:

  1. None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As an ARO developer, I want to be able to:

  • Set the ManagedBy field for the Azure cluster resource group

so that

  •  Automatic Azure RBAC plumbing can work.

Acceptance Criteria:

  • The resource groups are created with the ManagedBy field set correctly.

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

 

Catch-all epic for cleanup work around the now non-machineconfig certificate bundles written by the MCO (kubelet, image registry)

Once we remove the certificates from the MachineConfigs, the controllerconfig would be the canonical location for all certificates.

 

We should increase visibility, potentially by either adding a new configmap that all components and console/users can read, or bubbling up status better in the controllerconfig object itself

MVP aims at refactoring MirrorToDisk and DiskToMirror for OCP releases

  • Execute command line
  • Copy & untar release-index
  • Inspect untarred folder
  • gather release images from disk
  • Generate artifacts (icsp)
  • bulk pull-push payload images
  • gather release-index
  • unit test & e2e

As an MVP, this epic covers the work for RFE-3800 (includes RFE-3393 and RFE-3733) for mirroring releases.

The full description / overview of the enclave support is best described here 

The design document can be found here 

Upcoming epics, such as CFE-942 will complete the RFE work with mirroring operators, additionalImages, etc.

 

Architecture Overview (diagram)

As a developer, I want to create an implementation based on a local container registry as the backing technology for mirroring to disk, so that:

Feature Overview (aka. Goal Summary)  

Add support of NAT Gateways in Azure while deploying OpenShift on this cloud to manage the outbound network traffic and make this the default option for new deployments

Goals (aka. expected user outcomes)

While deploying OpenShift on Azure, the Installer will configure NAT Gateways as the default method to handle outbound network traffic, so that we can prevent the existing SNAT port exhaustion issues related to the outboundType configured by default.

Requirements (aka. Acceptance Criteria):

The installer will use the NAT Gateway object from Azure to manage the outbound traffic from OpenShift.

The installer will create a NAT Gateway object per AZ in Azure so the solution is HA.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

Background

Using NAT Gateway for egress traffic is the recommended approach from Microsoft

This is also a common ask from different enterprise customers, because they are hitting SNAT port exhaustion issues with the current solution used by OpenShift for outbound traffic management in Azure.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

  • Allow Control Plane Machine Sets to specify multiple Subnets in Azure to support NAT Gateways for egress traffic

Why is this important?

  • In order to avoid the SNAT port exhaustion issues in Azure, Microsoft recommends using NAT Gateways for outbound traffic management. As part of enabling NAT Gateway support, the CPMS objects need to be able to support multiple subnets.

Scenarios

  1. One NAT Gateway per Availability Zone
  2. One Subnet per Availability Zone
  3. Multiple Subnets in multiple Availability Zones

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

This work depends on the work done in CORS-2564

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user, I want to be able to:

  • Allow Control Plane Machine Sets to specify multiple Subnets

so that I can achieve

  • One NAT Gateway per Availability Zone
  • One Subnet per Availability Zone
  • Multiple Subnets in multiple Availability Zones

Acceptance Criteria:

  • The ability to specify multiple Subnets

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)  

The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike. 

Some customer cases have revealed scenarios where the MCO state reporting is misleading and therefore could be unreliable to base decisions and automation on.

In addition to correcting some incorrect states, the MCO will be enhanced for a more granular view of update rollouts across machines.

For this epic, "state" means "what is the MCO doing?" – so the goal here is to try to make sure that it's always known what the MCO is doing. 

This includes: 

  • Conditions
  • Some Logging 
  • Possibly Some Events 

While this probably crosses a little bit into the "status" portion of certain MCO objects, as some state is definitely recorded there, this probably shouldn't turn into a "better status reporting" epic.  I'm interpreting "status" to mean "how is it going" so status is maybe a "detail attached to a state". 

 

Exploration here: https://docs.google.com/document/d/1j6Qea98aVP12kzmPbR_3Y-3-meJQBf0_K6HxZOkzbNk/edit?usp=sharing

 

https://docs.google.com/document/d/17qYml7CETIaDmcEO-6OGQGNO0d7HtfyU7W4OMA6kTeM/edit?usp=sharing

 

During upgrade tests, the MCO will become temporarily degraded with the following events showing up in the event log:

Dec 13 17:34:58.478 E clusteroperator/machine-config condition/Degraded status/True reason/RequiredPoolsFailed changed: Unable to apply 4.11.0-0.ci-2022-12-13-153933: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, pool master has not progressed to latest configuration: controller version mismatch for rendered-master-3c738a0c86e7fdea3b5305265f2a2cdb expected 92012a837e2ed0ed3c9e61c715579ac82ad0a464 has 768f73110bc6d21c79a2585a1ee678d5d9902ad5: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-61c5ab699262647bf12ea16ea08f5782, retrying]

 

This seems to be occurring with some frequency as indicated by its prevalence in CI search:

$ curl -s 'https://search.ci.openshift.org/search?search=clusteroperator%2Fmachine-config+condition%2FDegraded+status%2FTrue+reason%2F.*controller+version+mismatch&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=%5E%28periodic%7Crelease%29.*4%5C.1%5B1%2C2%5D.*&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=job' | jq 'keys | length'
399

 

The MCO should not become degraded during an upgrade unless it cannot proceed with the upgrade. In the case of these failures, I think we're timing out at some point during node reboots as either 1 or 2 of the control plane nodes are ready, with the third being unready. The MCO eventually requeues the syncRequiredMachineConfigPools step and the remaining nodes reboot and the MCO eventually clears the Degraded status.

 

Indeed, looking at the event breakdown, one can see that control plane nodes take ~21 minutes to roll out their new config with OS upgrades. By comparison, the worker nodes take ~15 minutes.

Meanwhile, the portion of the MCO which performs this sync (the syncRequiredMachineConfigPools function) has a hard-coded timeout of 10 minutes. Additionally, to my understanding, there is an additional 10 minute grace period before the MCO marks itself as degraded. Since the control plane nodes took ~21 minutes to completely reboot and roll out their new configs, we've exceeded the combined ~20 minutes allowed. With this in mind, I propose a path forward:

  1. Figure out why control plane nodes are taking > 20 minutes for OS upgrades to be performed. My initial guess is that it has to do with etcd reestablishing quorum before proceeding onto the next control plane node whereas the worker nodes don't need to delay for that. 
  2. If we conclude that OS upgrades just take longer to perform for control plane nodes, then maybe we could bump the timeout. Ideally, we could bump the timeout only for the control plane nodes, but that may take some refactoring to do.
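
The timing argument above can be checked with a few lines of arithmetic. This is a sketch only: the two 10-minute values and the rollout durations are the approximate figures quoted in this report, and `reports_degraded` is a hypothetical helper, not an MCO function.

```python
from datetime import timedelta

SYNC_TIMEOUT = timedelta(minutes=10)  # hard-coded timeout cited above
GRACE_PERIOD = timedelta(minutes=10)  # extra grace period, per the author's understanding


def reports_degraded(rollout: timedelta) -> bool:
    """Return True if a rollout of this length outlasts timeout plus grace period."""
    return rollout > SYNC_TIMEOUT + GRACE_PERIOD


control_plane = timedelta(minutes=21)  # approximate figure from the event breakdown
workers = timedelta(minutes=15)

print(reports_degraded(control_plane))  # True: 21 min > 20 min
print(reports_degraded(workers))        # False: 15 min <= 20 min
```

With these numbers the control plane pool overshoots the budget by roughly a minute, which fits a Degraded condition that appears briefly and then clears.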

Feature Overview (aka. Goal Summary)

When the cluster does not have v1 builds, console needs to either provide different ways to build applications or prevent erroneous actions.

Goals (aka. expected user outcomes)

Identify the build system in place and prompt user accordingly when building applications.

Requirements (aka. Acceptance Criteria):

Console will have to hide any workflows that rely solely on buildconfigs if Pipelines is not installed.

Use Cases (Optional):

  1. As a developer, provide me with a default build option, and show options to override.
  2. As a developer, prevent me from trying to create applications if no build option is present on the cluster.

 

ODC Jira - https://issues.redhat.com/browse/ODC-7352

Problem:

When the cluster does not have v1 builds, console needs to either provide different ways to build applications or prevent erroneous actions.

Goal:

Identify the build system in place and prompt user accordingly when building applications.

Why is it important?

Without this enhancement, users will encounter issues when trying to create applications on clusters that do not have the default s2i setup.

Use cases:

  1. As a developer, provide me with a default build option, and show options to override.
  2. As a developer, prevent me from trying to create applications if no build option is present on the cluster.

Acceptance criteria:

Console will have to hide any workflows that rely solely on buildconfigs if Pipelines is not installed.

If we detect Shipwright, then we can call that API instead of buildconfigs. We need to understand the timelines for the latter part, and create a separate work item for it.

If both buildconfigs and Shipwright are available, then we should default to Shipwright. This will be part of the separate work item needed to support Shipwright.
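
The selection rule described in these criteria can be sketched as follows. This is an illustration in Python rather than console code; `pick_build_system` and its boolean inputs are hypothetical, and the Shipwright preference belongs to the separate work item mentioned above.

```python
from typing import Optional


def pick_build_system(has_buildconfig: bool, has_shipwright: bool) -> Optional[str]:
    """Choose the build API a workflow should call, or None to hide the workflow."""
    if has_shipwright:
        # Shipwright is preferred whenever it is detected, even next to buildconfigs.
        return "shipwright"
    if has_buildconfig:
        return "buildconfig"
    # No build option on the cluster: hide workflows that rely solely on builds.
    return None
```

Returning `None` rather than raising keeps the caller's job simple: a workflow is rendered only when a build system was found.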

Dependencies (External/Internal):

Rob Gormley to confirm timelines for when customers will have the option to remove buildconfigs from their clusters. That will determine whether we take on this work in 4.15 or 4.16.
 

Design Artifacts:

Exploration:

Note:

Description of problem:

Version-Release number of selected component (if applicable):
Tested with https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-08-21-033349

How reproducible:
Always with the latest nightly build when the Build and DeploymentConfig capabilities are disabled

Steps to Reproduce:
Create a 4.14 shared cloud and disable the capabilities for Samples, Builds and DeploymentConfigs

  1. Instead of ./openshift-install create cluster, run ./openshift-install create install-config
  2. Add this to the install-config.yaml:
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - baremetal
  - Console
  - Insights
  - marketplace
  - Storage
  # - openshift-samples
  - CSISnapshot
  - NodeTuning
  - MachineAPI
  # - Build
  # - DeploymentConfig
  3. Start the cluster with ./openshift-install create cluster
  4. Log in to the new cluster and switch to the developer perspective

Actual results:
The following main navigation entries are missing:

  1. Add page
  2. Topology
  3. Observe
  4. Search
  5. Project

(Only Helm, ConfigMap and Secret are shown.)

The add page still shows "Import from Git", which cannot be used to import a resource without the BuildConfig.

Expected results:
All navigation items should be displayed.

The add page should not show "Import from Git" if the BuildConfig CRD isn't installed.

Additional info:

Feature Overview

  • Customers want to create and manage OpenShift clusters using managed identities for Azure resources for authentication.
  • In Phase 2, we want to iterate on the CAPI-related items for ARO/Azure managed identity work, e.g. update cluster-api-provider-azure to consume Azure workload identity tokens, and update the ARO Credentials Request manifest of the Cluster CAPI Operator to use the new API field for requesting permissions.

User Goals

  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.
  • As an administrator, I want to deploy OpenShift 4 and run Operators on Azure using access controls (IAM roles) with temporary, limited privilege credentials.

Requirements

  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • Support HyperShift and non-HyperShift clusters.
  • Support use of Operators with Azure managed identities.
  • Support in all Azure regions where Azure managed identity is available. Note: Federated credentials are associated with Azure Managed Identity, and federated credentials are not available in all Azure regions.

More details at ARO managed identity scope and impact.

 

This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

References

Evaluate if any of the ARO predefined roles in the credentials request manifests of OpenShift cluster operators give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operator owners/teams will be expected to review merge requests and complete appropriate QE effort for an OpenShift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

Address technical debt around self-managed HCP deployments, including but not limited to

  • Include CA ConfigMaps in the trusted bundle for both the CPO and Ignition Server, improving trust and security.
  • Create dual stack clusters through CLI with or without default values, ensuring flexibility and user preference in network management.
  • Utilize CLI commands to disable default sources, enhancing customizability.
  • Benefit from less intrusive remote write failure modes.
  • ...

Goal

  • Address all the tasks we didn't finish for the GA
  • Collect and track all missing topics for self-managed and agent provider

Description of the Problem:

When we deploy an IPv6/Disconnected HostedCluster, the Ingress Cluster Operator appears degraded, showing this message:

clusteroperator.config.openshift.io/ingress                                    4.14.0-0.nightly-2023-08-29-102237   True        False         True     43m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitive Failures: Canary route checks for the default ingress controller are failing) 

 

We can also see that the canary route is accessible from the ingress operator pod using curl, but not from the Go code:

2023-08-31T16:23:07.264Z    ERROR    operator.canary_controller    wait/backoff.go:226    error performing canary route check    {"error": "error sending canary HTTP request to \"canary-openshift-ingress-canary.apps.hosted.hypershiftbm.lab\": Get \"https://canary-openshift-ingress-canary.apps.hosted.hypershiftbm.lab/\": socks connect tcp 127.0.0.1:8090->canary-openshift-ingress-canary.apps.hosted.hypershiftbm.lab:443: unknown error host unreachable"} 

 

After a debugging session, it looks like DNS resolution from the ingress operator through the SOCKS proxy, which also goes through the Konnectivity component, does not work properly because it delegates the resolution to the Hub cluster, which is not the desired behaviour.

Release Shipwright as OpenShift Builds GA

The scope of GA is:

  • Build strategies for S2I, Buildah, and Cloud-Native Buildpacks (as Dev Preview)
  • Shipwright CLI
  • Availability in OpenShift OperatorHub
  • Product announcement with "5 minutes, 5 clicks" tutorial.

Goals

This GA release is intended to make a fully-supported offering of OpenShift Builds driven by the Shipwright framework. This includes both CLI and Operator usage. All Red Hat supported build engines are supported (buildah, s2i), but the priority is to ensure that there are no blocking Buildpacks-related issues and to triage and resolve non-blocking issues.

The softer goal of this GA release is to start to draw users to the Shipwright ecosystem, which should allow them greater flexibility in bringing their CI/CD workloads to the OpenShift platform.
 

Use Cases

  • Allow developers to utilize build strategies including Buildah, BuildKit, Buildpacks, Kaniko, ko, and Source-to-Image for their applications.
  • Allow DevOps and platform teams to move CI/CD ecosystems to OpenShift in a fundamentally integrated manner.

Out of scope

Functionality and roadmap items not specifically related to improving support for Buildpacks.

Dependencies

No known external dependencies.

Background, and strategic fit

The overarching goal for OpenShift Builds is to provide an extensible and modular framework for integrating into development workflows. Interoperability should be considered a priority, and build strategy-specific code should be kept to a minimum or implemented in a manner such that support for other build strategies is not impacted wherever possible.

Assumptions

Shipwright is an upstream community project with its own goals and direction, and while we are involved heavily in the project, we need to ensure buy-in for our initiatives, and/or determine what functionality and features we are “willing” to accept as downstream-only.

No assumptions are made about hardware, software, or people resources.

Customer Considerations

None.

Documentation Considerations

Documentation will heavily rely on the upstream Shipwright documentation. Documentation plan is here.

What does success look like?

  • Customers are able to install the product on supported versions of OpenShift
  • Customers are able to use builds for OpenShift to build container images with s2i and buildah
  • Customers are able to interact with builds through the OpenShift Dev Console and productized CLI

QE Contact

  • Jitendar Singh

Impact

N/A

Related Architecture/Technical Documents

GA involves the status of Shipwright's Build, CLI, and Operator projects for the upstream version v0.12.0. More information can be found at https://shipwright.io

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security, or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in a production environment

Problem:

OpenShift currently has limited support for Shipwright builds.  Additionally, this support is marked as Tech Preview and is using the alpha version of the API.

Goal:

Provide additional support for Shipwright builds, moving to the beta API and removing the Tech Preview labels.

Why is it important?

Supporting layered products

Use cases:

  1. <case>

Acceptance criteria:

  1. Use the latest API
  2. Removing the tech preview badge on appropriate pages
  3. Improve the Shipwright details page status
  4. Improve the Shipwright data of the list view shown in the Shipwright tab of the Build page

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description of problem:
When the user selects All namespaces in the admin perspective and navigates to Builds > BuildConfigs or Builds > Shipwright Builds (if the operator is installed), the last run is selected based on its name only. The filter doesn't check that the Build / BuildRun is in the same namespace as the BuildConfig / Build.

Version-Release number of selected component (if applicable):
4.14 after we've merged ODC-7306 / https://github.com/openshift/console/pull/12809

How reproducible:
Always

Steps to Reproduce:

  1. Create two namespaces
  2. Create a BuildConfig with the same name in both namespaces
  3. Start a Build for each BuildConfig
  4. Navigate to Build > BuildConfigs and select "All namespaces"

For Shipwright Builds and BuildRuns, you need to install the Pipelines operator and the Shipwright operator, create a Shipwright Build resource (to enable the SW operator), and create two Builds and BuildRuns in two different namespaces.

You can find some Shipwright Build samples here: https://github.com/openshift/console/tree/master/frontend/packages/shipwright-plugin/samples

Actual results:
Both BuildConfigs are shown, but both show the same Build as last run.

Expected results:
Both BuildConfigs should show and link the Build from their own namespace.

Additional info:
This issue also exists in Pipelines, but we track it in another bug so that the fix can be backported.

Description

As a user, I want to see the latest build status in the Build list similar to Pipelines.

Acceptance Criteria

  1. Create a custom row component for OpenShift BuildConfigs (in dev-console) and show the latest OpenShift Build status for the related BuildConfig.
    1. Add a "Status" filter similar to SW BuildRuns
  2. Update the existing custom row component for Shipwright Builds (in shipwright-plugin) and show the latest Shipwright BuildRun status for the related Build. Remove/replace the current Output/Status columns.
    1. Update the "Status" filter and replace it with one similar to SW BuildRuns.
  3. Both list views should show the same columns: 
    1. Name
    2. Namespace – only in admin perspective when all namespace is selected
    3. Last run
    4. Last run status
    5. Last run time
    6. Last run duration
  4. Create/update e2e tests for both list views, even though the shipwright-plugin tests currently do not run as part of our CI job.

Additional Details:

SW samples: https://github.com/openshift/console/tree/master/frontend/packages/shipwright-plugin/samples

The shipwright-plugin already contains code to render a status, age, and duration. PTAL: https://github.com/openshift/console/tree/master/frontend/packages/shipwright-plugin/src/components

 

For Pipelines we later switched from "getting the related PipelineRuns for each row" to a more performant solution that "loads all PipelineRuns" and then filters them on the client side. See https://github.com/openshift/console/pull/12071. We expect to do something similar here.

When multiple rows request the same API (get all PipelineRuns) our useK8sResource hook is smart enough to make just one API call.

To find all OpenShift Builds for one OpenShift BuildConfig, they need to be filtered by the label openshift.io/build-config.name, whose value is the metadata.name of the BuildConfig.
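The client-side filtering described above can be sketched as follows. This is a minimal illustration in Python, not the console implementation (which is TypeScript): it assumes Builds are available as plain dictionaries shaped like the Kubernetes objects, and the names latest_build_per_config and make_build are hypothetical. Keying on both namespace and BuildConfig name is what keeps equally named BuildConfigs in different namespaces from sharing a last run.

```python
BUILD_CONFIG_LABEL = "openshift.io/build-config.name"


def latest_build_per_config(builds):
    """Return {(namespace, BuildConfig name): latest Build} from one full list."""
    latest = {}
    for build in builds:
        meta = build["metadata"]
        config_name = meta.get("labels", {}).get(BUILD_CONFIG_LABEL)
        if config_name is None:
            continue  # this Build was not created from a BuildConfig
        # Key on namespace AND name so equally named BuildConfigs in
        # different namespaces never share a "last run".
        key = (meta["namespace"], config_name)
        current = latest.get(key)
        # Assumption: creationTimestamp is an RFC 3339 UTC string ("...Z"),
        # so plain string comparison orders builds chronologically.
        if current is None or meta["creationTimestamp"] > current["metadata"]["creationTimestamp"]:
            latest[key] = build
    return latest


def make_build(name, namespace, created, config):
    """Hypothetical helper that creates a minimal Build-like dict for the demo."""
    return {
        "metadata": {
            "name": name,
            "namespace": namespace,
            "creationTimestamp": created,
            "labels": {BUILD_CONFIG_LABEL: config},
        }
    }


all_builds = [
    make_build("app-1", "ns-a", "2023-06-01T10:00:00Z", "app"),
    make_build("app-2", "ns-a", "2023-06-01T11:00:00Z", "app"),
    make_build("app-1", "ns-b", "2023-06-01T09:00:00Z", "app"),
]
latest = latest_build_per_config(all_builds)
print(latest[("ns-a", "app")]["metadata"]["name"])  # app-2
print(latest[("ns-b", "app")]["metadata"]["name"])  # app-1
```

Loading the list once and grouping it this way avoids one request per row, at the cost of holding all Builds in memory on the client.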

Description

As a user, I want to see similar information at similar places for the 3 different Build types.

Acceptance Criteria

  1. Build detail page
    1. Add a "BuildConfig" resource link below the status on the right side (status.config || metadata.annotation["openshift.io/build-config.name"])
    2. Move Started field from the left side to the right side below this link, and rename it to "Start time"
    3. Add "Completion time" (status.completionTimestamp)
    4. Add "Duration" based on the two preceding fields (start time and completion time)
  2. Pipeline Run detail page
    1. Add "Completion time" (status.completionTimestamp)

Additional Details:

See https://github.com/openshift/console/blob/master/frontend/packages/shipwright-plugin/src/components/buildrun-details/BuildRunSection.tsx#L19-L55

Description

As a user, I want to see the latest build status in the Build list similar to Pipelines.

Acceptance Criteria

  1. BuildConfig list page
    1. should have a filter similar to the Build list page (New, Pending, Running ....)
    2. all columns should be sortable
    3. duration column should show the time until "now" if the latest run is running
    4. should have an action "Start last run" if there is a last run (see Build "Rebuild" action)
  2. Build list page
    1. all columns should be sortable
    2. should show a duration as well
    3. Rename column title "Created" to "Started" (similar to Pipeline Run lists)
    4. should have an action "Start last run" if there is a last run (see Build "Rerun" action)
  3. Shipwright Build list page
    1. should have a filter similar to the Shipwright BuildRun list page (New, Pending, Running ....)
    2. Rename column title "Age" to "Started" (similar to Pipeline Run lists)
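The duration behaviour in the criteria above (elapsed time until "now" while the latest run is still running) could look roughly like the sketch below. It assumes the run status carries RFC 3339 UTC timestamps in startTimestamp and completionTimestamp; only status.completionTimestamp is named in this change log, so startTimestamp and the function names are assumptions made for illustration.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an RFC 3339 UTC timestamp such as 2023-06-01T10:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def run_duration_seconds(status, now=None):
    """Return the run duration in whole seconds, or None if the run has not started."""
    start = status.get("startTimestamp")  # assumed field name
    if not start:
        return None
    completion = status.get("completionTimestamp")
    if completion:
        end = parse_timestamp(completion)
    else:
        # Still running: measure up to "now" (injectable to keep the function testable).
        end = now or datetime.now(timezone.utc)
    return max(0, int((end - parse_timestamp(start)).total_seconds()))


finished = {
    "startTimestamp": "2023-06-01T10:00:00Z",
    "completionTimestamp": "2023-06-01T10:02:30Z",
}
running = {"startTimestamp": "2023-06-01T10:00:00Z"}
fixed_now = datetime(2023, 6, 1, 10, 0, 45, tzinfo=timezone.utc)

print(run_duration_seconds(finished))            # 150
print(run_duration_seconds(running, fixed_now))  # 45
```

Returning None for a run that has not started lets the list render a placeholder instead of a misleading zero.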

Additional Details:

You might be able to address the comments from this code review: https://github.com/openshift/console/pull/12809#pullrequestreview-1471632918

Description

As a user, I want to see the Output image of a Shipwright Build on the list page. Before 4.13 the Developer console showed the Build output (full image string) and the Build status.message.

With 4.14 we show the latest BuildRun name, status, start time, and duration. But the image output is still interesting. See https://redhat-internal.slack.com/archives/C050MAQKD1A/p1688378025053659?thread_ts=1688371150.047769&cid=C050MAQKD1A

Acceptance Criteria

  1. Show the output image in the shipwright Build list page
    1. If the URL is a cluster registry URL show a link to an ImageStream
    2. If the URL is a remote registry URL try to show just the last two path parts.

For example:

  1. image-registry.openshift-image-registry.svc:5000/christoph/my-build => ImageStream "my-build"
  2. quay.io/jerolimov/nodeinfo => Show nodeinfo or jerolimov/nodeinfo and link full https:// address?
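The two display rules above can be sketched as a small classification function. This is an illustration only: the internal registry host is taken from the first example, while describe_output_image and the shape of the returned dictionary are hypothetical, and whether to link the full https:// address is left open, as in the question above.

```python
# Host of the in-cluster registry, taken from the example above.
INTERNAL_REGISTRY = "image-registry.openshift-image-registry.svc:5000"


def describe_output_image(image):
    """Classify an output image string for display in the Build list."""
    parts = image.split("/")
    if parts[0] == INTERNAL_REGISTRY and len(parts) == 3:
        # <registry>/<namespace>/<name>[:tag][@digest] -> ImageStream link
        name = parts[2].split("@")[0].split(":")[0]
        return {"kind": "ImageStream", "namespace": parts[1], "name": name}
    # Remote registry: show at most the last two path parts, without the host.
    path = parts[1:] or parts
    return {"kind": "Image", "label": "/".join(path[-2:]), "image": image}


print(describe_output_image(
    "image-registry.openshift-image-registry.svc:5000/christoph/my-build"))
print(describe_output_image("quay.io/jerolimov/nodeinfo"))
```

For the two examples above this yields an ImageStream named my-build in namespace christoph, and the label jerolimov/nodeinfo for the quay.io image.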

Additional Details:

Feature Overview

Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.

Goals

  • Simplicity: The folks preparing and installing OpenShift clusters (typically SNO) at the Far Edge range in technical expertise from technician to barista. The preparation and installation phases need to be reduced to a human-readable script that can be utilized by a variety of non-technical operators. There should be as few steps as possible in both the preparation and installation phases.
  • Minimize Deployment Time: A telecommunications provider technician or brick-and-mortar employee who is installing an OpenShift cluster at the Far Edge site needs to be able to do it quickly. The technician has to wait for the node to become in-service (CaaS and CNF provisioned and running) before they can move on to installing another cluster at a different site. The brick-and-mortar employee has other job functions to fulfill and can't stare at the server for 2 hours. The install time at the Far Edge site should be on the order of minutes, ideally less than 20m.
  • Utilize Telco Facilities: Telecommunication providers have existing Service Depots where they currently prepare SW/HW prior to shipping servers to Far Edge sites. They have asked RH to provide a simple method to pre-install OCP onto servers in these facilities. They want to do parallelized batch installation to a set of servers so that they can put these servers into a pool from which any server can be shipped to any site. They would also like to validate and update servers in these pre-installed server pools, as needed.
  • Validation before Shipment: Telecommunications providers incur a large cost if forced to manage software failures at the Far Edge due to the scale and physically disparate nature of the use case. They want to be able to validate the OCP and CNF software as a last-minute sanity check before shipping the platform to the Far Edge site.
  • IPSec Support at Cluster Boot: Some Far Edge deployments occur on an insecure network, and for that reason access to the host’s BMC is not allowed. Additionally, an IPSec tunnel must be established before any traffic leaves the cluster once it's at the Far Edge site. It is not possible to enable IPSec on the BMC NIC, and therefore the BMC is still not accessible even after OpenShift has booted.

Requirements

  • Factory Depot: Install OCP with minimal steps
    • Telecommunications Providers don't want an installation experience, just pick a version and hit enter to install
    • Configuration w/ DU Profile (PTP, SR-IOV, see telco engineering for details) as well as customer-specific addons (Ignition Overrides, MachineConfig, and other operators: ODF, FEC SR-IOV, for example)
    • The installation cannot increase in-service OCP compute budget (don't install anything other than what is needed for DU)
    • Provide ability to validate previously installed OCP nodes
    • Provide ability to update previously installed OCP nodes
    • 100 parallel installations at Service Depot
  • Far Edge: Deploy OCP with minimal steps
    • Provide site specific information via usb/file mount or simple interface
    • Minimize time spent at far edge site by technician/barista/installer
    • Register with desired RHACM Hub cluster for ongoing LCM
  • Minimal ongoing maintenance of solution
    • Some, but not all, telco operators do not want to install and maintain an OCP / ACM cluster at Service Depot
  • The current IPSec solution requires a libreswan container to run on the host so that all N/S OCP traffic is encrypted. With the current IPSec solution this feature would need to support provisioning host-based containers.

 

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

requirement Notes isMvp?
     
     
     

 

Describe Use Cases (if needed)

Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.

 

Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.

 

Out of Scope

Q: how challenging will it be to support multi-node clusters with this feature?

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question Outcome
   

 

 

Epic Goal

  • Allow relocating a pre-installed single node OpenShift.
  • Upon deployment at the edge site, the SNO should allow reconfiguring certain cluster attributes (listed below).

Why is this important?

  • SNO installation takes a very long time and even if we optimize it we will not meet customer requirements (under 20 minutes from 0 to fully configured deployment).
  • The customers would like to sanity-check the SNO and CNF before shipment to Far Edge site.

Scenarios

  1. Install single node OpenShift on baremetal using the agent-based installer at the factory.
  2. Configure the node as a relocatable SNO cluster.
  3. Take a random pre-installed SNO, run the sanity checks, and ship it to the edge site.
  4. Upon boot at the edge, the node will get the configuration (via BMC, or some other method that will place the site-specific configuration at a known <path>).
  5. Once the configuration is identified in <path>, the SNO reconfiguration will kick in and apply all the configurations mentioned in "Reconfiguration requirements" (see below).

Acceptance Criteria

  • Successfully install single node OpenShift using a pre-baked image.
  • SNO installed with this method passes the conformance tests.

Customer Requirements

  • Simplicity - no steps required for technician to deploy SNO at Far Edge site
  • Minimize Deployment Time - 20m maximum time spent deploying at Far Edge site
  • Utilize Telco Facilities - OEM or Service Depot for installation
  • Validation Before Shipment - Sanity check SNO and CNF before shipment to Far Edge site
  • Limited Network - Support deployments with static network or untrusted networks (IPSec tunnels at boot & no BMC access)
  • Time constraint - The process to boot the node and apply the configuration should take no more than 20 minutes

Reconfiguration requirements:

  1. IP address
  2. Hostname
  3. Cluster name
  4. Domain
  5. Pull Secret
  6. Proxy
  7. ICSP
  8. DNS server
  9. SSH keys

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

  1.  

Configure the static IP (during the initial "factory" installation) with nmstate.
Set the machine network to point to the network of this IP
Add node_ip hint according to the machine network. (done automatically when using assisted/ABI) 
Remove all current hacks (adding the env overrides to crio and kubelet)
Check whether the network manager pre-up script is still required. 

 

Context
https://docs.google.com/document/d/1Ywi-llZbOt-YUmqx7I6jWQP_Rss4eM-uoYJwD7Z0fh0/edit
https://github.com/loganmc10/openshift-edge-installer/blob/main/edge/docs/RELOCATABLE.md

Epic Goal

  • Install SNO within 10 minutes

Why is this important?

  • SNO installation takes around 40+ minutes.
  • This makes SNO less appealing when compared to k3s/microshift.
  • We should analyze the SNO installation, figure out why it takes so long, and come up with ways to optimize it.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

  1. https://docs.google.com/document/d/1ULmKBzfT7MibbTS6Sy3cNtjqDX1o7Q0Rek3tAe1LSGA/edit?usp=sharing

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The openshift-service-ca service-ca pod takes a few minutes to start when installing SNO.

kubectl get events -n openshift-service-ca --sort-by='.metadata.creationTimestamp' -o custom-columns=FirstSeen:.firstTimestamp,LastSeen:.lastTimestamp,Count:.count,From:.source.component,Type:.type,Reason:.reason,Message:.message                      
FirstSeen              LastSeen               Count   From                                                                                              Type      Reason                 Message
2023-01-22T12:25:58Z   2023-01-22T12:25:58Z   1       deployment-controller                                                                             Normal    ScalingReplicaSet      Scaled up replica set service-ca-6dc5c758d to 1
2023-01-22T12:26:12Z   2023-01-22T12:27:53Z   9       replicaset-controller                                                                             Warning   FailedCreate           Error creating: pods "service-ca-6dc5c758d-" is forbidden: error fetching namespace "openshift-service-ca": unable to find annotation openshift.io/sa.scc.uid-range
2023-01-22T12:27:58Z   2023-01-22T12:27:58Z   1       replicaset-controller                                                                             Normal    SuccessfulCreate       Created pod: service-ca-6dc5c758d-k7bsd
2023-01-22T12:27:58Z   2023-01-22T12:27:58Z   1       default-scheduler                                                                                 Normal    Scheduled              Successfully assigned openshift-service-ca/service-ca-6dc5c758d-k7bsd to master1
 

It seems that creating the service-ca namespace early allows it to get the
openshift.io/sa.scc.uid-range annotation and start running earlier. The
service-ca pod is required for other pods (CVO and all the control plane pods) to start, since it creates the serving-cert.

  • I'm not sure this is a CVO issue, but I think CVO is the one creating the namespace, CVO also renders some manifests during bootkube so it seems like the right component.

Description of problem:

The bootkube scripts spend ~1 minute failing to apply manifests while waiting for the openshift-config namespace to get created.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Run the PoC using the makefile here https://github.com/eranco74/bootstrap-in-place-poc
2. Observe the bootkube logs (pre-reboot)

Actual results:

Jan 12 17:37:09 master1 cluster-bootstrap[5156]: Failed to create "0000_00_cluster-version-operator_01_adminack_configmap.yaml" configmaps.v1./admin-acks -n openshift-config: namespaces "openshift-config" not found
....
Jan 12 17:38:27 master1 cluster-bootstrap[5156]: "secret-initial-kube-controller-manager-service-account-private-key.yaml": failed to create secrets.v1./initial-service-account-private-key -n openshift-config: namespaces "openshift-config" not found

Here are the logs from another installation showing that it's not 1 or 2 manifests that require this namespace to get created earlier:

Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-ca-bundle-configmap.yaml": failed to create configmaps.v1./etcd-ca-bundle -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-client-secret.yaml": failed to create secrets.v1./etcd-client -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-metric-client-secret.yaml": failed to create secrets.v1./etcd-metric-client -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-metric-serving-ca-configmap.yaml": failed to create configmaps.v1./etcd-metric-serving-ca -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-metric-signer-secret.yaml": failed to create secrets.v1./etcd-metric-signer -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-serving-ca-configmap.yaml": failed to create configmaps.v1./etcd-serving-ca -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-signer-secret.yaml": failed to create secrets.v1./etcd-signer -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "kube-apiserver-serving-ca-configmap.yaml": failed to create configmaps.v1./initial-kube-apiserver-server-ca -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "openshift-config-secret-pull-secret.yaml": failed to create secrets.v1./pull-secret -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "openshift-install-manifests.yaml": failed to create configmaps.v1./openshift-install-manifests -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "secret-initial-kube-controller-manager-service-account-private-key.yaml": failed to create secrets.v1./initial-service-account-private-key -n openshift-config: namespaces "openshift-config" not found

Expected results:

Expected the resources to be created successfully without having to wait for the namespace to get created.

Additional info:

 

Description of problem:

When installing SNO with bootstrap in place the cluster-policy-controller hangs for 6 minutes waiting for the lease to be acquired. 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Run the PoC using the makefile here https://github.com/eranco74/bootstrap-in-place-poc
2. Observe the cluster-policy-controller logs post reboot

Actual results:

I0530 16:01:18.011988       1 leaderelection.go:352] lock is held by leaderelection.k8s.io/unknown and has not yet expired
I0530 16:01:18.012002       1 leaderelection.go:253] failed to acquire lease kube-system/cluster-policy-controller-lock
I0530 16:07:31.176649       1 leaderelection.go:258] successfully acquired lease kube-system/cluster-policy-controller-lock

Expected results:

Expected the bootstrap cluster-policy-controller to release the lease so that the cluster-policy-controller running post reboot won't have to wait for the lease to expire.

Additional info:

Suggested resolution for bootstrap in place: https://github.com/openshift/installer/pull/7219/files#diff-f12fbadd10845e6dab2999e8a3828ba57176db10240695c62d8d177a077c7161R44-R59
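The length of the wait reported above can be read directly from the klog timestamps. The following sketch measures it from the three log lines quoted under "Actual results"; the function names are hypothetical, and the parsing assumes the usual klog prefix (severity letter, month and day, then time with microseconds).

```python
import re
from datetime import datetime

# klog prefix: severity letter, MMDD, then HH:MM:SS.microseconds
KLOG_PREFIX = re.compile(r"^[IWEF](\d{4} \d{2}:\d{2}:\d{2}\.\d{6})")


def klog_time(line):
    """Extract the timestamp from a klog line (klog does not log the year)."""
    match = KLOG_PREFIX.match(line)
    if match is None:
        raise ValueError("not a klog line: " + line)
    return datetime.strptime(match.group(1), "%m%d %H:%M:%S.%f")


def lease_wait_seconds(lines):
    """Seconds between the first failed acquire and the successful acquire."""
    first_failure = next(l for l in lines if "failed to acquire lease" in l)
    acquired = next(l for l in lines if "successfully acquired lease" in l)
    return (klog_time(acquired) - klog_time(first_failure)).total_seconds()


log = [
    "I0530 16:01:18.011988       1 leaderelection.go:352] lock is held by leaderelection.k8s.io/unknown and has not yet expired",
    "I0530 16:01:18.012002       1 leaderelection.go:253] failed to acquire lease kube-system/cluster-policy-controller-lock",
    "I0530 16:07:31.176649       1 leaderelection.go:258] successfully acquired lease kube-system/cluster-policy-controller-lock",
]
wait = lease_wait_seconds(log)
print(f"{wait:.0f} s (~{wait / 60:.1f} min)")  # 373 s (~6.2 min)
```

The result of a little over six minutes matches the wait described in the problem statement.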

Description of problem:

While trying to figure out why it takes so long to install Single Node OpenShift, I noticed that the kube-controller-manager cluster operator is degraded for ~5 minutes due to:
GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused
I don't understand how the prometheusClient is successfully initialized, but we get a connection refused once we try to query the rules.
Note that if the client initialization fails, the kube-controller-manager won't set GarbageCollectorDegraded to true.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. install SNO with bootstrap in place (https://github.com/eranco74/bootstrap-in-place-poc)

2. monitor the cluster operators status

Actual results:

GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused 

Expected results:

Expected the GarbageCollectorDegraded status to be false

Additional info:

It seems that for PrometheusClient to be successfully initialised it needs to successfully create a connection but we get connection refused once we make the query.
Note that installing SNO with this patch (https://github.com/eranco74/cluster-kube-controller-manager-operator/commit/26e644503a8f04aa6d116ace6b9eb7b9b9f2f23f) reduces the installation time by 3 minutes.


Feature Overview

To give Telco Far Edge customers as much of the product support lifespan as possible, we need to ensure that OCP releases are "telco ready" when the OCP release is GA.

Goals

  • All Telco Far Edge regression tests pass prior to OCP GA
  • All new features that are TP or GA quality at the time of the release pass validation prior to OCP GA
  • Ensure Telco Far Edge KPIs are met prior to OCP GA

Requirements

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • SNO DU 
  • C-RAN Hub of DUs on compact cluster or traditional cluster
  • CU on compact cluster or traditional cluster

Questions to answer…

  • What are the scale goals?
  • How many nodes must be provisioned simultaneously?

Out of Scope

  • N/A

Background, and strategic fit

Notes

Assumptions

  •  

Customer Considerations

  • ...

Documentation Considerations

No documentation required

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • Create a new informing lane in CI which includes the functional tests from OCPVE-163 and the latency tests from the RHEL Shift Left initiative started by the Telco RAN team.
  • Enable the informing lane to run on bare metal on a consistent set of hardware.

Why is this important?

  • We document that running workloads on OpenShift with a realtime kernel works, but in practice testing is often done in a bespoke fashion by teams outside of OpenShift. This epic seeks to close the gap in automated integration testing of OpenShift running a realtime kernel on real metal hardware.

Scenarios

  1. https://docs.google.com/presentation/d/1NW8vEkP7zMd0vxWpD-p82srZljcAOtqLEhymuYcUQXQ/edit#slide=id.g1407d815407_0_5

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. OCPVE-163
  2. https://github.com/openshift/enhancements/blob/master/enhancements/support-for-realtime-kernel.md

Open questions:

  1. What are the cost/licensing requirements for metal hardware (Equinix?) to support this new lane?
    1. How many jobs do we run and how often?
  2. How do we integrate the metal hardware with Prow?
  3. Who should own this lane long term?
  4. Does OpenShift make any performance guarantees/promises?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Goal

  • Definition of a CU Profile
  • Deployment of the CU profile on multi-node Bare Metal clusters using the RH declarative framework.

Why is this important?

  • Telcos will want minimal hands-on installs of all infrastructure.

Requirements

  1. CU infrastructure deployment and life-cycle management must be performed through the ZTP workflow toolset (SiteConfig, PolicyGen, ACM and ArgoCD)
  2. Performance tuning:
    • Non-RT kernel
    • Huge pages set per NUMA
  3. Day 2 operators:
    • SR-IOV network operator and sample configuration
    • OCS / ODF sample configuration, highly available storage
    • Cluster logging operator and sample configuration
  4. Additional features
    • Disk encryption (which?)
    • SCTP
    • NTP time synchronization
    • IPV4, IPV6 and dual stack options

Scenarios

  1. CU on a Three Node Cluster - zero touch provisioning and configuration
  2. CU can be on SNO, SNO+1 worker or MNO (up to 30 nodes)
  3. Cluster expansion
  4. y-stream and z-stream upgrade
  5. in-service upgrade (progressively update cluster)
  6. EUS to EUS upgrade

Acceptance Criteria

  • Reference configurations released as part of ZTP
  • Scenarios validated and automated (Reference Design Specification)
  • Lifecycle scenarios are measured and optimized
  • Documentation completed

Open questions:

  1. What kind of disk encryption is required?
  2. Should any work be done on ZTP cluster expansion?
  3. What KPIs must be met? CaaS CPU/RAM/disk budget KPIs/targets? Overall upgrade time, cluster downtime, number of reboots per node type targets? oslat/etc targets?

References:

  1. RAN DU/CU Requirements Matrix
  2. CU baseline profile 2020
  3. CU profile - requirements
  4. Nokia blueprints

https://docs.google.com/document/d/13Db7uChVx-2JXqAMJMexzHbhG3XLNLRy9nZ_7g9WbFU/edit#

Epic Goal

  • Enable setting node labels on the spoke cluster during installation

  • Right now we need to add roles; we need to check whether additional labels are required

Why is this important?

Scenarios

  1. A ZTP flow user would like to mark nodes with additional roles, like rt, storage etc., in addition to the master/worker roles that exist today and are supported by default

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Open questions:

  1. How do the master/worker roles get applied to the nodes? Maybe we can use the same flow.
  2. Do we need to support only roles, or supplying labels in general?
  3. Another alternative is to use https://github.com/openshift/assisted-service/blob/d1cde6d398a3574bda6ce356411cba93c74e1964/swagger.yaml#L4071, a remark is that this will work only for day1

Modify the scripts in assisted-service/deploy/operator/ztp.
The following environment variables will be added:

MANIFESTS: JSON containing the manifests to be added for day1 flow.  The key is the file name, and the value is the content.

NODE_LABELS: Dictionary of dictionaries.  The outer dictionary key is the node name and the value is the node labels (key, value) to be applied.

MACHINE_CONFIG_POOL: Dictionary of strings.  The key is the node name and the value is machine config pool name.

SPOKE_WORKER_AGENTS: Number of worker nodes to be added as part of day1 installation.  Default 0
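As an illustration of how these variables might be populated, the sketch below builds the four values from Python dictionaries. Only MANIFESTS is described above as JSON; encoding NODE_LABELS and MACHINE_CONFIG_POOL as JSON strings as well is an assumption, and the node name, label, pool name, manifest content, and the build_ztp_env helper are all hypothetical.

```python
import json


def build_ztp_env(manifests, node_labels, machine_config_pools, spoke_worker_agents=0):
    """Return the environment variables as strings, ready to export."""
    return {
        # File name -> file content, as described for MANIFESTS.
        "MANIFESTS": json.dumps(manifests),
        # Node name -> {label key: label value}. JSON encoding is an assumption.
        "NODE_LABELS": json.dumps(node_labels),
        # Node name -> machine config pool name. JSON encoding is an assumption.
        "MACHINE_CONFIG_POOL": json.dumps(machine_config_pools),
        # Number of day1 worker nodes; defaults to 0 as documented.
        "SPOKE_WORKER_AGENTS": str(spoke_worker_agents),
    }


env = build_ztp_env(
    manifests={"example-configmap.yaml": "apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: example\n"},
    node_labels={"spoke-worker-0": {"node-role.kubernetes.io/rt": ""}},
    machine_config_pools={"spoke-worker-0": "rt"},
    spoke_worker_agents=1,
)
for name, value in env.items():
    print(f"export {name}={value!r}")
```

Serializing through json.dumps keeps nested quoting consistent, which matters because manifest contents usually span multiple lines.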

Feature Overview

Reduce the OpenShift platform and associated RH provided components to a single physical core on the Intel Sapphire Rapids platform for vDU deployments on Single Node OpenShift.

Goals

  • Reduce CaaS platform compute needs so that it can fit within a single physical core with Hyperthreading enabled. (i.e. 2 CPUs)
  • Ensure existing DU Profile components fit within reduced compute budget.
  • Ensure existing ZTP, TALM, Observability and ACM functionality is not affected.
  • Ensure largest partner vDU can run on Single Core OCP.

Requirements

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES
Provide a mechanism to tune the platform to use only one physical core. | Users need to be able to tune different platforms. | YES
Allow for full zero touch provisioning of a node with the minimal core budget configuration. | Node provisioned with SNO Far Edge provisioning method - i.e. ZTP via RHACM, using DU Profile. | YES
Platform meets all MVP KPIs | | YES

(Optional) Use Cases

  • Main success scenario: A telecommunications provider uses ZTP to provision a vDU workload on a Single Node OpenShift instance running on an Intel Sapphire Rapids platform. The SNO is managed by an ACM instance and its lifecycle is managed by TALM.

Questions to answer...

  • N/A

Out of Scope

  • Core budget reduction on the Remote Worker Node deployment model.

Background, and strategic fit

Assumptions

  • More compute power available for RAN workloads translates directly into a larger volume of cell coverage that a Far Edge node can support.
  • Telecommunications providers want to maximize the cell coverage on Far Edge nodes.
  • To provide as much compute power as possible the OpenShift platform must use as little compute power as possible.
  • As newer generations of servers are deployed at the Far Edge and the core count increases, no additional cores will be given to the platform for basic operation, all resources will be given to the workloads.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
    • Administrators must know how to tune their Far Edge nodes to make them as computationally efficient as possible.
  • Does this feature have doc impact?
    • Possibly, there should be documentation describing how to tune the Far Edge node such that the platform uses as little compute power as possible.
  • New Content, Updates to existing content, Release Note, or No Doc Impact
    • Probably updates to existing content
  • If unsure and no Technical Writer is available, please contact Content Strategy. What concepts do customers need to understand to be successful in [action]?
    • Performance Addon Operator, tuned, MCO, Performance Profile Creator
  • How do we expect customers will use the feature? For what purpose(s)?
    • Customers will use the Performance Profile Creator to tune their Far Edge nodes. They will use RHACM (ZTP) to provision a Far Edge Single-Node OpenShift deployment with the appropriate Performance Profile.
  • What reference material might a customer want/need to complete [action]?
    • Performance Addon Operator, Performance Profile Creator
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
    • N/A
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
    • Likely updates to existing content / unsure

Latest status as of 4.14 freeze:

The MCD no longer uses MachineConfigs to update certs, but rather reads them directly from our internal resource "controllerconfig". The MachineConfig path still exists but is a no-op (although the MCO still falsely claims an update is pending as a result). The MachineConfig removal work is ready, but is waiting for windows-MCO to change their workflow so that the removal does not break them.

 

--------------------------------

 

The logic for handling certificate rotation should live outside of the MachineConfig-files path as it stands today. This will allow certs to rotate live, through paused pools, without generating additional churn in rendered configs, and most, if not all, certificates do not require drains/reboots to the node.

 

Context

The MCO has, since the beginning of time, managed certificates. The general flow is a cluster configmap -> MCO -> controllerconfig -> MCC -> renderedconfig -> MCD -> laid down to disk as a regular file.

 

When we talk about certs, the MCD actually manages 4 (originally 5) certs: see https://docs.google.com/document/d/1ehdOYDY-SvUU9ffdIKlt7XaoMaZ0ioMNZMu31-Mo1l4/edit (this document is a bit outdated)

Of these, the only one we care about is "/etc/kubernetes/kubelet-ca.crt", which is a bundle of 5 (now 7) certs. This will be expanded on below.

 

Unlike regular files though, certificates rotate automatically at some set cadence. Prior to 4.7, this would cause the MCD to seemingly randomly start an update and reboot nodes, much to the annoyance of customers, so we made it disruptionless.

 

There was still one more problem: a lot of users pause pools for additional safety (which is their way of saying they don't want us to disrupt their workloads), which still gated the certificate from actually rotating in when it updated. In 4.12 and previous versions, this means that at 80% of the 1 year mark, a new kube-apiserver-to-kubelet-signer cert would be generated. After ~12 hours, this would affect some operations (oc logs, etc.) since the old signer no longer matched the apiserver's new cert. At the one year mark, this would proceed to break the kubelet entirely. To combat this, we added an alert, MachineConfigControllerPausedPoolKubeletCA, to warn users about the effects and expiry, which was ok since this should only be an annual occurrence.
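As a rough illustration of the timing described above, the sketch below computes when a replacement signer would be generated if rotation happens at 80% of a one-year validity period. The function name `rotation_time` and the example start date are assumptions made for illustration; only the 80% fraction and the one-year lifetime come from the text.

```python
from datetime import datetime, timedelta, timezone


def rotation_time(not_before: datetime, validity: timedelta, fraction: float = 0.8) -> datetime:
    """Return the moment at which `fraction` of the validity period has elapsed."""
    return not_before + validity * fraction


# Hypothetical issue date; a real cluster would read this from the certificate.
issued = datetime(2023, 1, 1, tzinfo=timezone.utc)
one_year = timedelta(days=365)

rotates_at = rotation_time(issued, one_year)
# Time left between rotation and hard expiry, i.e. the window in which a
# paused pool keeps running with the old signer.
remaining = (issued + one_year) - rotates_at
print(rotates_at.isoformat(), remaining)
```

With a 365-day lifetime, rotation lands 292 days after issuance, which leaves roughly 73 days during which a paused pool would still hold only the old signer.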

 

Updates for 4.13

In 4.13, we realized that the kubelet-ca cert was being read from the wrong location, which updated the kube-apiserver-to-kubelet-signer mentioned above but not some other certs. This had not been a problem because nobody depended on them, but in 4.13 monitoring was updated to use the right certs, which subsequently caused KubeletDown reports to fire. David Eads fixed this via https://github.com/openshift/machine-config-operator/pull/3458

So now instead of expired certs we have correct certs, which is great, but now we realized that the cert rotation will happen much more frequently.

 

Previously on the system, we had:

admin-kubeconfig-signer, kubelet-signer, kube-apiserver-to-kubelet-signer, kube-control-plane-signer, kubelet-bootstrap-kubeconfig-signer

 

Now, with the correct certs, right after install we get: admin-kubeconfig-signer, kube-csr-signer_@1675718562, kubelet-signer, kube-apiserver-to-kubelet-signer, kube-control-plane-signer, kubelet-bootstrap-kubeconfig-signer, openshift-kube-apiserver-operator_node-system-admin-signer@1675718563

 

The most immediate issue was bootstrap drift, which John solved via https://github.com/openshift/machine-config-operator/pull/3513

 

But the issue here is now we are updating two certs:

  1. kube-csr-signer, rotated every month
  2. openshift-kube-controller-manager-operator_csr-signer-signer (called kubelet-signer until the first rotation), rotated every two months

 

Meaning that every month we would be generating at least 2 new machineconfigs (new one rotating in, old one rotating out) to manage this.

During install, due to how the certs are set up (bootstrap ones expire in 24h) this means you get 5 MCs within 24 hours: bootstrap bundle, incluster bundle, incluster bundle with 1 new, incluster bundle with 2 new, incluster bundle with 2 new 2 old removed

On top of this, the cluster previously chugged along past the expiry with only the warning, but now, when the old certs rotate and the pools are paused, TargetDown and KubeletDown fire after a few hours, making it very bad from a user perspective.

 

Solutions

Solution1: don't do anything

Nothing should badly break, but the user will get critical alerts after ~1 month if they pause and upgrade to 4.13. Not a great UX

Solution2: revert the monitoring change or mask the alert

A bit late, but potentially doable? Masking the alert will likely mask real issues, though

Solution3: MVP MCD changes (Estimate: 1week)

The MCD update, MCD verification, MCD config drift monitor all ignore the kubelet-ca cert file. The MCD gets a new routine to update the file, reading from a configmap the MCC manages. The MCC still renders the cert but the cert will be updated even if the pool is paused

Solution4: MVP MCC changes (Estimate: a few days)

Have the controller splice in changes even when the pool is paused. John has an MVP here: https://github.com/openshift/machine-config-operator/compare/master...jkyros:machine-config-operator:mco-77-bypass-pause 
This is a cleaner solution compared to 3, but will cause the pool to go into updating briefly. If there are other operations causing nodes to be cordoned, etc., we would have to consider overriding that.

Solution5: MCD cert management path (full, Estimate: 1 sprint)

The cert is removed from the rendered-config. The MCC will read it off the controllerconfig and render it into a custom configmap. The MCS will add this additional file when serving content, but it is not part of the rendered-MC otherwise. The MCD will have a new routine to manage the certs live directly.

The bootstrap MCS will also need to have a way to render it into the initial served configuration without it being part of the MachineConfigs (this is especially important for HyperShift). We will have to make sure the inplace updater doesn't break

We may also have to solve config drift problems from bootstrap to incluster, for self-driving and hypershift inplace

We also have to make sure the file isn't deleted upon an update to the new management, since the MCD would otherwise see the diff and delete it, and the certs would disappear for a while

 

DOCS (WIP)

 

https://docs.google.com/document/d/1qXYV9Hj98QhJSKx_2IWBbU_bxu30YQtbR21mmOtBsIg/edit?usp=sharing

Although we are removing the config from the machineconfig, ignition (both in bootstrap and in-cluster) still needs to be generated with the certs, so nodes can join the cluster

 

We will need the incluster MCS to read from controllerconfig, and the bootstrap MCS (during install time) to be able to remove it from the machineconfigs to ensure no drift when master nodes come up

Once we finish the new method to manage certs, we should extend it to also manage image registry certs, although that is not required for 4.14

It really hurts to have to ask customers to collect on-disk files for us, and when we do this certificate work there is the possibility we will need to chase more race-condition or rendering mismatch issues, so let's see if we can get collection of mcs-machine-config-content.json (for bootstrap mismatch) and maybe currentconfig (for those pesky validation issues) added to the must-gather. 

 

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description of problem:

After running tests on an SNO with the Telco DU profile for a couple of hours, kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulate over time.

Version-Release number of selected component (if applicable):

4.14.0-ec.3

How reproducible:

So far on 2 different environments

Steps to Reproduce:

1. Deploy SNO with Telco DU profile
2. Run system tests
3. Check CSRs status

Actual results:

oc get csr | grep Pending | wc -l
34
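The same count can be taken programmatically from captured `oc get csr` text. The sketch below is a minimal example that assumes the default tabular output with the condition in the last column; the sample rows, CSR names, and node name are invented for illustration and do not come from the affected cluster.

```python
def count_pending_csrs(oc_get_csr_output: str) -> int:
    """Count rows whose last column (CONDITION) is exactly 'Pending'."""
    pending = 0
    for line in oc_get_csr_output.splitlines():
        fields = line.split()
        # The header row ends with 'CONDITION', so it is never counted.
        if fields and fields[-1] == "Pending":
            pending += 1
    return pending


# Invented sample shaped like the default `oc get csr` table.
SAMPLE = """\
NAME        AGE   SIGNERNAME                      REQUESTOR                 REQUESTEDDURATION   CONDITION
csr-aaaaa   10m   kubernetes.io/kubelet-serving   system:node:sno-example   <none>              Pending
csr-bbbbb   12m   kubernetes.io/kubelet-serving   system:node:sno-example   <none>              Approved,Issued
"""

print(count_pending_csrs(SAMPLE))
```

Matching on the last column rather than on any occurrence of the word avoids counting a row whose name or requestor happens to contain "Pending".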

Expected results:

No Pending CSRs

Additional info:

This issue blocks retrieving pod logs.

Attaching must-gather and sosreport after manually approving CSRs.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

This epic contains all the Dynamic Plugins related stories for OCP release-4.14 and implementing Core SDK utils.

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

There are modules shared between the Console application and its dynamic plugins, as configured in

packages/console-dynamic-plugin-sdk/src/shared-modules.ts

For modules configured as "allowFallback: false" (default setting) we should validate the Console provided version range vs. plugin consumed version at webpack build time.

This allows us to detect potential compatibility problems in shared modules (i.e. plugin is built against a different version than what is provided by Console at runtime) when building dynamic plugins.
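The check described above can be pictured with a much-simplified sketch. The real validation runs in the webpack build of the Console plugin SDK; the Python functions below are hypothetical and only handle caret ranges with a non-zero major version, an assumption made to keep the example short.

```python
def parse_version(version: str) -> tuple:
    """Parse a plain 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def satisfies_caret(version: str, caret_range: str) -> bool:
    """Return True if `version` falls inside a caret range such as '^5.3.4'."""
    if not caret_range.startswith("^"):
        raise ValueError("only caret ranges are handled in this sketch")
    base = parse_version(caret_range[1:])
    if base[0] == 0:
        raise ValueError("0.x caret ranges are out of scope for this sketch")
    candidate = parse_version(version)
    # Same major version, and not older than the lower bound.
    return candidate[0] == base[0] and candidate >= base


# Hypothetical provided range versus the version a plugin was built against.
print(satisfies_caret("5.3.4", "^5.3.4"))
print(satisfies_caret("6.11.2", "^5.3.4"))
```

A plugin built against a version outside the provided range would be flagged at build time instead of failing at runtime.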

 

AC: Add validation for our shared modules of dynamic plugins

  • Changes in Console dynamic plugin SDK
    • add optional options argument to ConsoleRemotePlugin constructor
      • control JSON schema validation: validatePackageSchema, validateExtensionSchema
      • control extension integrity validation (via ExtensionValidator): validateExtensionIntegrity
      • control consumed shared module validation: validateSharedModules
  • Changes in Console dynamic demo plugin
      • update react-router and react-router-dom dependencies to Console provided semver range
      • update typing dependencies for react-router and react-router-dom
      • remove unused dependencies comment-json and read-pkg

We are missing the DeleteModal component in our console-dynamic-plugin-sdk, which means we need to copy it when building a dynamic plugin.

 

AC:

  • Expose the DeleteModal component in our console-dynamic-plugin-sdk
  • Decouple the functionality from the console internal codebase.
  • Use the original DeleteModal component as a wrapper that will use the newly exposed DeleteModal from console-dynamic-plugin-sdk
  • Review the new component's API before merge
  • Add TS docs for the migrated DeleteModal component

We are missing the AnnotationsModal component and the functions handling the input, e.g. on AnnotationSubmit, in our console-dynamic-plugin-sdk, which means we need to copy them when building a dynamic plugin.

 

AC:

This epic contains all the console components that should be refactored to use Core SDK.

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

Currently there is no good way for plugins to get the active namespace outside of resource pages. We should expose useActiveNamespace to support this. (useActiveNamespace is only exposed in the internal API.)

This seems important to pair with NamespaceBar since it's unclear how to get the initial namespace from NamespaceBar. This is borderline a bug since it's not clear how to use NamespaceBar without it. We should consider for 4.12. 

AC: 

  • Expose the `useActiveNamespace` in the console-dynamic-plugin-sdk pkg
  • Remove the internal implementation of `useActiveNamespace` from console-shared pkg
    • note the solution does not currently remove the implementation from console-shared, rather useActiveNamespace will be left in console shared, and exposed via require and imports statements.
  • Replace the imports

One of the requirements for adopting OpenShift Dynamic Plugin SDK (which is the new version of HAC Core SDK) is to bump the version of react-router to version 6.

For migration from v5 to v6 there is a `react-router-dom-v5-compat` package which should ease the migration process.

 

AC: Install the `react-router-dom-v5-compat` package into console repo and test for any regressions.

Epic Goal

Remove code that was added through the ACM integration from all of the console's codebase repositories

Why is this important?

Since the decision was made to stop the ACM integration, we as a team decided that it would be better to remove the unused code in order to avoid any confusion or regressions.

Acceptance Criteria

  • Identify all the places from which we need to remove the code that was added during the ACM integration.
  • Come up with a plan how to remove the code from our repositories and CI
  • Remove the code from console-operator repo
  • Start with code removal from the console repository

Scour through the console repo and mark all multicluster-related code for removal by adding a "TODO remove multicluster" comment.

 

AC:

  • All multicluster-related console code is marked with a "TODO remove multicluster" comment.

Revert "copiedCSVsDisabled" and "clusters" server flag changes in front and backend code.

 

AC:

  • "clusters" server flag and all references are removed from console repo
  • "copiedCSVsDisabled" server flag type updated to boolean type, and all references are updated accordingly.
  • remove these two fields from server API

One of the requirements for adopting OpenShift Dynamic Plugin SDK (which is the new version of HAC Core SDK) is to bump the version of react-router to version 6. With Console PR #12861 merged, both Console web application and its dynamic plugins should now be able to start migrating from React Router v5 to v6. 

 

As a team we decided that we are going to split the work per package, but for the core console we will split the work into standalone stories based on the migration strategy.
 
Console will keep supporting React Router v5 for two releases (end of 4.15) as per CONSOLE-3662.
 
How to prepare your dynamic plugin for React Router v5 to v6 migration:
[0] bump @openshift-console/dynamic-plugin-sdk-webpack dependency to 0.0.10

  • this release adds react-router-dom-v5-compat to Console provided shared modules

[1] (optional but recommended) bump react-router and react-router-dom dependencies to v5 latest

  • Console provided shared module version of react-router and react-router-dom is 5.3.4
  • DO NOT bump react-router and react-router-dom dependencies to v6!

[2] add react-router-dom-v5-compat dependency

  • Console provided shared module version of react-router-dom-v5-compat is 6.11.2
  • this package provides React Router v6 code which can interoperate with v5 code

[3] start migrating to React Router v6 APIs

  • v5 code is imported from react-router or react-router-dom
  • v6 code is imported from react-router-dom-v5-compat
  • follow the official React Router Migration Strategy

[4] (optional but recommended) use appropriate TypeScript typings for react-router and react-router-dom

  • Console uses @types/react-router version 5.1.20 and @types/react-router-dom version 5.3.3
  • note that react-router-dom-v5-compat already ships with its own typings

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The following files in frontend/public/components/RBAC contain components that need to use the v6 useNavigate hook, requiring them to be converted from a class component to a functional component: 

  • bindings.jsx
  • edit-rule.jsx

AC: Listed components in frontend/public/components/RBAC rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The StorageClassFormWithTranslation component in /frontend/public/components/storage-class-form.tsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component. 

AC: StorageClassFormWithTranslation component in storage-class-form.tsx is rewritten from class component to functional component.

Splitting off tile-view-page.jsx from CONSOLE-3687 into a separate story.

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

frontend/public/components/utils/tile-view-page.jsx contains a component that needs to use the v6 useNavigate hook, requiring it to be converted from a class component to a functional component.

AC: tile-view-page.jsx rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The TemplateForm_ component in /frontend/public/components/instantiate-template.tsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component. 

AC: TemplateForm_ component in instantiate-template.tsx is rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The following files in frontend/public/components/cluster-settings contain components that need to use the v6 useNavigate hook, requiring them to be converted from a class component to a functional component: 

  • basicauth-idp-form.tsx
  • github-idp-form.tsx
  • gitlab-idp-form.tsx
  • google-idp-form.tsx
  • htpasswd-idp-form.tsx
  • keystone-idp-form.tsx
  • ldap-idp-form.tsx
  • openid-idp-form.tsx
  • request-header-idp-form.tsx

AC: Listed components in frontend/public/components/cluster-settings rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The EditYAML component in /frontend/public/components/edit-yaml.jsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component. 

AC: EditYAML component in edit-yaml.jsx is rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The App component in /frontend/public/components/app.jsx needs to use the v6 useLocation hook, which requires it to be converted from a class component to a functional component. 

AC: App component in app.jsx is rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The CheckBoxes_ component in /frontend/public/components/row-filter.jsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component. 

AC: CheckBoxes_ component in row-filter.jsx is rewritten from class component to functional component. 

AC: CheckBoxes_ component is removed from the codebase. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The following files in frontend/public/components/utils contain components that need to use the v6 useNavigate hook, requiring them to be converted from a class component to a functional component: 

  • dropdown.jsx
  • kebab.tsx
  • tile-view-page.jsx

AC: Listed components in frontend/public/components/utils rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The following files in frontend/public/components/modals contain components that need to use the v6 useNavigate hook, requiring them to be converted from a class component to a functional component: 

  • add-secret-to-workload.tsx
  • create-namespace-modal.jsx

AC: Listed components in frontend/public/components/modals rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The SecretFormWrapper component in /frontend/public/components/secrets/create-secret.tsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component. 

AC: SecretFormWrapper component in create-secret.tsx is rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The EventStream component in /frontend/public/components/events.jsx needs to use the v6 useParams hook, which requires it to be converted from a class component to a functional component. 

AC: EventStream component in events.jsx is rewritten from class component to functional component. 

One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide (https://github.com/remix-run/react-router/discussions/8753). This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.

If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.

The FireMan component in /frontend/public/components/factory/list-page.tsx needs to use the v6 useParams and useLocation hooks, which requires it to be converted from a class component to a functional component. 

AC: FireMan component in list-page.tsx is rewritten from class component to functional component. 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • Collect on-prem installation data in order to be able to structure similar ELK dashboards as from SaaS deployments
  • Collect info of ZTP/CIM deployments
  • Collect info of BILLI deployments

Why is this important?

  • We want to track trends, and be able to analyze on-prem installations

Scenarios

  1. As a cluster administrator, I can provision and manage my fleet of clusters knowing that every data point is collected and sent to the Assisted Installer team without having to do anything extra. I know my data will be safe and secure and the team will only collect data they need to improve the product.
  2. As a developer on the assisted installer team, I can analyze the customer data to determine if a feature is worth implementing/keeping/improving. I know that the customer data is accurate and up-to-date. All of the data is parse-able and can be easily tailored to the graphs/visualizations that help my analysis.
  3. As a product owner, I can determine if the product is moving in the right direction based on the actual customer data. I can prioritize features and bug fixes based on the data.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. [Internal] MGMT-11244 Decision for which event streaming service used will determine the endpoint we send the data to

Previous Work (Optional):

 

 MGMT-11244: Remodeling of SaaS data pipeline

 

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  1. Query assisted service events for this cluster when a cluster reaches an "end state"
    1. End states include when the cluster is in state `error`, `cancelled`, `installed`
  2. Authenticate and send data to data streaming service
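The trigger condition in step 1 can be sketched as a simple membership test. This is only an illustration under stated assumptions: the state names are the three listed above, while the function and constant names are hypothetical and are not taken from the assisted service code.

```python
# End states as listed in the story; names are kept exactly as written there.
END_STATES = frozenset({"error", "cancelled", "installed"})


def has_reached_end_state(cluster_status: str) -> bool:
    """Return True when the cluster status is one of the listed end states."""
    return cluster_status in END_STATES


# Hypothetical usage: only query and forward events once an end state is reached.
for status in ("installing", "installed", "error"):
    if has_reached_end_state(status):
        print(f"{status}: query events and send them to the streaming service")
    else:
        print(f"{status}: nothing to send yet")
```

Keeping the end states in one constant makes it easy to extend the list if further terminal states need to trigger an upload.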

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • We need a new API that will allow us to skip cluster/host validations.
  • This API should have its own feature flag. 

Why is this important?

  • Some customers and partners have very specific HW that doesn't pass our validations, and we want to allow them to proceed anyway
  • Sometimes we have bugs in our validations that block people from installing, and we don't want our partners to be stuck because of us

Scenarios

  1. Example from kaloom:
    1. Kaloom has a very specific setup where VIPs can be shown as busy even though installation can proceed with them.
    2. Currently they need to override vips in install config to be able to install cluster
    3. After adding the new api they can just run it and skip this specific validation.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Feature flag for this api should be added to statistics calculator and if it was set cluster failure should not be counted.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of the problem:

Documentation for ignore validation API should be updated with the correct json string arrays:

  • JSON string arrays are (L53 and L62):
{ "ignored_host_validations": "[\"all\"]" "ignored_cluster_validations": "[\"all\"]" }

While it should be:

{ "host-validation-ids": "[\"all\"]", "cluster-validation-ids": "[\"all\"]" }

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of the problem:

In staging, BE 2.17.0 - Ignore validation API has no validation for the values sent. For example:

curl -X 'PUT' 'https://api.stage.openshift.com/api/assisted-install/v2/clusters/be4cdbef-7ea6-48f6-a30a-d1169eeb38fb/ignored-validations'   --header "Authorization: Bearer $(ocm token)"   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
  "host-validation-ids": "[\"testTest\",\"HasCPUCoresForRole\"]",
  "cluster-validation-ids": "[]"       
}'

Stores:

 {"cluster-validation-ids":"[]","host-validation-ids":"[\"testTest\",\"HasCPUCoresForRole\"]"}
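A server-side check for this could look roughly like the Python sketch below. It is only an illustration: the allow-list contents and the function name are placeholders, not the service's real validation IDs or code.

```python
import json

# Placeholder allow-list (assumption). The real service would use its own
# set of validation IDs; "all" is the wildcard used in the examples here.
KNOWN_VALIDATION_IDS = {"all", "some-known-validation"}

def parse_validation_ids(raw, known_ids=KNOWN_VALIDATION_IDS):
    """Decode a JSON string array and reject unknown validation IDs."""
    try:
        ids = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not a valid JSON array: {raw!r}") from exc
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        raise ValueError("expected a JSON array of strings")
    unknown = sorted(set(ids) - set(known_ids))
    if unknown:
        raise ValueError(f"unknown validation IDs: {unknown}")
    return ids
```

With a check like this, a value such as `testTest` would be rejected at request time instead of being stored.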

How reproducible:

100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of the problem:

In staging, BE 2.16.0 - while the cluster is in the installed or installing state, the ignore validation API still changes the validations; this should be blocked.

How reproducible:

100%

Steps to reproduce:

1. Send this call to an installed cluster:

curl -i -X PUT "https://api.stage.openshift.com/api/assisted-install/v2/clusters/${cluster_id}/ignored-validations"   --header "Authorization: Bearer $(ocm token)"   -H 'accept: application/json'   -H 'Content-Type: application/json' -d '{"host-validation-ids": "[\"all\"]", "cluster-validation-ids": "[\"all\"]"}'
 

2. The cluster validation is changed

3.

Actual results:

 

Expected results:

1. Proposed title of this feature request

Delete worker nodes using GitOps / ACM workflow

2. What is the nature and description of the request?

We use siteConfig to deploy a cluster using the GitOps / ACM workflow. We can also use siteConfig to add worker nodes to an existing cluster. However, today we cannot delete a worker node using the GitOps / ACM workflow. We need to manually delete the resources (BMH, nmstateConfig, etc.) and the OpenShift node. We would like to have the node deleted as part of the GitOps workflow.

3. Why does the customer need this? (List the business requirements here)

Worker nodes may need to be replaced for various reasons (for example, hardware failures), which may require deleting a node.

If we are colocating OpenShift and OpenStack control planes on the same infrastructure (using OpenStack director operator to create OpenStack control plane in OCP virtualization), then we also have the use case of assigning baremetal nodes as OpenShift worker nodes or OpenStack compute nodes. Over time we may need to change the role of those baremetal nodes (from worker to compute or from compute to worker). Having the ability to delete worker nodes via GitOps will make it easier to automate that use case.

4. List any affected packages or components.

ACM, GitOps

In order to cleanly remove a node without interrupting existing workloads it should be cordoned and drained before it is powered off.

This should be handled by BMAC and should not interrupt processing of other requests. The best implementation I could find so far is in the kubectl code, but using that directly is a bit problematic as the call waits for all the pods to be stopped or evicted before returning. There is a timeout, but then we have to either give up after one call and remove the node anyway, or track multiple calls to drain across multiple reconciles.

We should come up with a way to drain asynchronously (maybe investigate what CAPI does).
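One possible shape for an asynchronous drain is a single non-blocking step that is repeated on each reconcile. The Python sketch below is purely illustrative: `NodeState`, `drain_step`, and the `try_evict` callback are hypothetical names, and this is not how BMAC, kubectl, or CAPI implement draining.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class NodeState:
    """Hypothetical stand-in for the state a controller would track."""
    cordoned: bool = False
    pods: List[str] = field(default_factory=list)
    drain_started_at: Optional[float] = None

def drain_step(node: NodeState, now: float, timeout: float,
               try_evict: Callable[[str], bool]) -> str:
    """Run one non-blocking drain step.

    Returns "drained", "timed_out", or "requeue" so the caller can come
    back on a later reconcile instead of waiting for pods to stop.
    """
    node.cordoned = True  # stop new pods from being scheduled
    if node.drain_started_at is None:
        node.drain_started_at = now
    # Ask for eviction once per pod; keep the ones that are still running.
    node.pods = [p for p in node.pods if not try_evict(p)]
    if not node.pods:
        return "drained"
    if now - node.drain_started_at >= timeout:
        return "timed_out"
    return "requeue"
```

Because the start time is stored with the node state, the timeout spans several reconciles and no single call has to block.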

We should allow for users to control removing the spoke node using resources on the hub.

For the ZTP-gitops case, this needs to be the BMH as they are not aware of the agent resource.

The user will add an annotation to the BMH to indicate that they want us to manage the lifecycle of the spoke node based on the BMH. Then, when the BMH is deleted we will clean the host and remove it from the spoke cluster.

Epic Goal

  • Implement pagination for the events (API ref).

Why is this important?

  • The list of events fetched by clients can be very long, and fetching all of them in a long-polling loop negatively impacts client performance.

Considerations

  • Features in current UI design
  • We must define the semantics of the "Filter by text" field. Right now it executes the filtering on the client.
    Once pagination is in place, do we want this field to filter only the entries on the active page, similar to what it does today, or should it execute a query so that the filtering is performed on the BE?
  • API should contain information about the number of pages available given the number of entries per page the user would like to see.
  • API should return the current page number.
  • Link to a Patternfly Table demo for reference on what data 
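The two API requirements in the list above (number of pages and current page) can be sketched in Python as follows. The function name and the response keys are assumptions for illustration, not the actual API schema.

```python
import math

def paginate(events, page, per_page):
    """Return one page of events plus the metadata a UI would need."""
    if per_page < 1:
        raise ValueError("per_page must be at least 1")
    total = len(events)
    total_pages = max(1, math.ceil(total / per_page))
    page = min(max(page, 1), total_pages)  # clamp to the valid range
    start = (page - 1) * per_page
    return {
        "items": events[start:start + per_page],
        "page": page,
        "total_pages": total_pages,
        "total_events": total,
    }
```

Returning the total count as a number alongside the page also gives the client what it needs to render a label such as "1-10 of 25".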

Description of the problem:

In staging, UI 2.19.6 - In new cluster events - number of events is shown as "1-10 of NaN" instead of the real number

How reproducible:

100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Epic Goal

  • Today we have an API for feature support per OCP version, but we don't have an API for feature support per architecture.
    For example, ODF is not supported on ARM, so we hard-coded a block for it in the UI, and we return a Bad Request if the user requests it through the API.
    Now that we have more architectures, such as ppc64le and s390x, this becomes more complicated.

Why is this important?

  • We would like a single, maintainable API used by both BE & UI instead of hard-coded per-architecture limitations in the UI.

Scenarios

  1. We have 4 architectures: x86, arm, s390, ppc64
  2. We have a few features for each architecture:
    1. Static IP
    2. UMN
    3. Dual stack
    4. OLM operators: LVMS/ ODF/ CNV/ LSO/ MCE
    5. Platform type: Vsphere, Nutanix
    6. Disk encryption
    7. CMN
    8. SNO
    9. heterogeneous clusters 
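A per-architecture lookup of this kind could be sketched in Python as below. The table is placeholder data: only the ODF-on-ARM entry reflects the text above, and the function names and level strings are assumptions rather than the real API.

```python
SUPPORTED = "supported"
UNSUPPORTED = "unsupported"

# Placeholder data (assumption): architecture -> features blocked on it.
UNSUPPORTED_BY_ARCH = {
    "arm": {"ODF"},
}

def support_level(feature, cpu_architecture):
    """Return the support level of one feature on one architecture."""
    blocked = UNSUPPORTED_BY_ARCH.get(cpu_architecture, set())
    return UNSUPPORTED if feature in blocked else SUPPORTED

def unsupported_features(requested, cpu_architecture):
    """List the requested features that are blocked on the architecture."""
    return sorted(
        f for f in requested
        if support_level(f, cpu_architecture) == UNSUPPORTED
    )
```

Keeping the table in one place lets the UI and the BE answer the same question from the same data.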
       

Acceptance Criteria

  • Test each feature and architecture combination both via UI & API.
  •  

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. we have this internal doc: https://docs.google.com/spreadsheets/d/1RmU5cMoQgN-5Rk5i13nwrRoqXv3nDZ4uo65PXQnsKNc/edit#gid=0 
  2. we have OpenShift Multi Architecture Component Availability Matrix page

Open questions::

Done Checklist

  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • UI  

After deprecating the old API and making sure the UI no longer uses it, remove the following endpoint and definitions:

 

/v2/feature-support-levels 

definitions:
  feature-support-levels:
  feature-support-level:
   

 

 

https://github.com/openshift/assisted-service/blob/c7f1e1cc034dbdb4629c7680c1b81b8cb362ef0b/swagger.yaml#L3657-L3682

Description of the problem:

The method returns an empty object when calling GET v2/support-levels/features?openshift_version=X

How reproducible:

Call GET v2/support-levels/features?openshift_version=4.13

Steps to reproduce:

1. Call GET v2/support-levels/features?openshift_version=4.13

2.

3.

Actual results:

{}

Expected results:

{   FEATURE_A: supported,   FEATURE_B: supported ... }

Description of the problem:

Returning a Bad Request on feature-support validation collides with the multi platform feature.

Whenever the user sets the CPU architecture to P or Z, the platform is changed to multi, causing loss of information and not failing the cluster registration/update.

 

How reproducible:

Register a cluster with s390x as CPU architecture on OCP version 4.12 

 

Expected results:

Bad Request 

Description of the problem:

Currently, installing a ppc64le cluster with Cluster Managed Networking enabled and Minimal ISO is not supported.

 

Steps to reproduce:

1. Create ppc64le cluster with UMN enabled 

 

Actual results:

BadRequest

 

Expected results:

Created successfully 

Create a single place in assisted-service (update/register cluster) where we return a Bad Request if a feature combination is not supported.

Description of the problem:

BE 2.17.4 - (using API calls) creating a new cluster, PATCHing it with OLM operators, and then creating a new infra-env with P/Z should be blocked, but is allowed.

How reproducible:

100%

Steps to reproduce:

1. Create new cluster

 curl -X 'POST' \
   'https://api.stage.openshift.com/api/assisted-install/v2/clusters/' \
   --header "Authorization: Bearer $(ocm token)" \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
     "name": "s390xsno2413",
     "high_availability_mode": "Full",
     "openshift_version": "4.13",
     "pull_secret": "${pull_secret}",
     "base_dns_domain": "redhat.com",
     "cpu_architecture": "s390x",
     "disk_encryption": {
         "mode": "tpmv2",
         "enable_on": "none"
     },
     "tags": "",
     "user_managed_networking": true
   }'

2. Patch with OLM operators

curl -i -X 'PATCH'   'https://api.stage.openshift.com/api/assisted-install/v2/clusters/c05ba143-cf22-44ec-b1fd-edad5d8ca5a9'   --header "Authorization: Bearer $(ocm token)"   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
    "olm_operators":[{"name":"cnv"},{"name":"lso"},{"name":"odf"}]
}'

3. Create infra-env

curl -X 'POST'   'https://api.stage.openshift.com/api/assisted-install/v2/infra-envs'   --header "Authorization: Bearer $(ocm token)"   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
    "name": "tests390xsno_infra-env2",
    "pull_secret": "${pull_secret}",
    "cluster_id": "c05ba143-cf22-44ec-b1fd-edad5d8ca5a9",
    "openshift_version": "4.13",
    "cpu_architecture": "s390x"
}' 

Actual results:

Infra-env created

Expected results:
Should be blocked

Feature goal (what are we trying to solve here?)

Make sure we no longer support the OCP 4.8 and 4.9 releases once they reach EOL (on April 27, 2023).

DoD (Definition of Done)

Installation of OCP 4.8 and 4.9 is no longer possible in any of our envs.

Does it need documentation support?

Because the Assisted Installer documentation is embedded in the relevant OpenShift release documentation, no documentation changes are required. As the docs for those versions are marked deprecated / decommissioned, so are the Assisted Installer parts.

Feature origin (who asked for this feature?)

Catching up with OpenShift, as part of the usual lifecycle policy (currently, the Extended life phase for OCP 4.8 ends on April 27, 2023).

Reasoning (why it's important?)

  • Removing the burden from developers, testers and others in maintaining deprecated versions.

Competitor analysis reference

Not relevant.

Feature usage (do we have numbers/data?)

Numbers don't count for much here, as we're following the official policy for OpenShift. If a customer has a real need for OpenShift 4.8 or 4.9, they can go through the Support Exception process for OpenShift.

Regardless, as of today (Mar. 16, 2023) there's still some usage of OCP 4.8 & 4.9 but it's not very significant:

Feature availability (why should/shouldn't it live inside the UI/API?)

AFAIK UI shouldn't have any special code/configuration for OCP versions, so implementing the relevant pieces in the backend should suffice.

Feature goal (what are we trying to solve here?)

vSphere platform configuration is a bit different on OCP 4.13.

Changes needed:

DoD (Definition of Done)

  • Updated install-config without any deprecated parameter
  • Update the post installation guide

Does it need documentation support?

Yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

Manage the effort for adding jobs for release-ocm-2.8 on assisted installer

https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng

 

Merge order:

  1. Add temporary image streams for Assisted Installer migration - day before (make sure images were created)
  2. Add Assisted Installer fast forwards for ocm-2.x release <depends on #1> - need approval from test-platform team at https://coreos.slack.com/archives/CBN38N3MW 
  3. Branch-out assisted-installer components for ACM 2.(x-1) - <depends on #1, #2> - At the day of the FF
  4. Prevent merging into release-ocm-2.x - <depends on #3> - At the day of the FF
  5. Update BUNDLE_CHANNELS to ocm-2.x on master - <depends on #3> - At the day of the FF
  6. ClusterServiceVersion for release 2.(x-1) branch references "latest" tag <depends on #5> - After  #5
  7. Update external components to AI 2.x <depends on #3> - After a week, if there are no issues update external branches
  8. Remove unused jobs - after 2 weeks

 

Epic Goal

  • Enable replacing a master as a day-2 operation

Why is this important?

  • There is currently no simple way to replace a failing control plane. The IPI install method has network infrastructure requirements that the assisted installer does not, and the UPI method is complex enough that the target user for the assisted installer may not feel comfortable doing it.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of the problem:

Update the following Day2 procedure - https://github.com/openshift/assisted-service/blob/master/docs/user-guide/day2-master/411-healthy.md

  1. In the Add BareMetalHost object and Add Machine object sections, we should add the `oc apply -f <filename>` command so users know what they need to do.
  2. In the Link BMH and Machine and Node using the magic script section, `custom-master3-2` should be changed to `custom-master3-chocobomb` to match the YAML examples there.

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of the problem:

It is possible to create a manifest with a file name that contains spaces, for example:
ee ll ii aa.yml

How reproducible:

 

Steps to reproduce:

1. Create a cluster

2. Add a manifest with spaces in the file name

3.

Actual results:

  It allows adding the manifest

Expected results:
Even if the BE allows it, we should consider disabling the option (see Slack thread) in the UI and BE

Description of the problem:
In the File name field of the Custom manifest, there should be descriptive pop-up text that tells the user the type of file that needs to be added and the maximum size
 

How reproducible:

 

Steps to reproduce:

1. Create a cluster with a manifest

2. Navigate to the custom manifest wizard

3. Click on add new manifest

Actual results:

 File name label has no further description text

Expected results:

I suggest adding the file type and max size/length

Description of the problem:

V2CreateClusterManifest should block empty manifests

How reproducible:

100%

Steps to reproduce:

1. POST V2CreateClusterManifest manifest with empty content

Actual results:

Succeeds. Then silently breaks bootkube much later.

Expected results:
API call should fail immediately
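A fail-fast check along these lines is sketched below in Python. It assumes the manifest content arrives base64-encoded and, as an additional assumption, also rejects file names containing whitespace; the function name is hypothetical and this is not the service's actual code.

```python
import base64

def validate_manifest(file_name, content_b64):
    """Raise ValueError for manifests that should be rejected up front."""
    if any(ch.isspace() for ch in file_name):
        raise ValueError("manifest file name must not contain whitespace")
    try:
        # validate=True rejects characters outside the base64 alphabet
        content = base64.b64decode(content_b64, validate=True)
    except (ValueError, TypeError) as exc:
        raise ValueError("manifest content is not valid base64") from exc
    if not content.strip():
        raise ValueError("manifest content must not be empty")
    return content
```

Rejecting the request at this point surfaces the problem to the caller immediately rather than during bootkube.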

Description of the problem:

When installing any cluster without a custom manifest, the installation summary page shows names of custom manifests.

It is not clear whether those manifests were added by the customer or by the AI.

How reproducible:
100%
 

Steps to reproduce:

1. Install a cluster without custom manifests checked

2. After the installation completes, check the cluster summary

3.

Actual results:
In the summary, several files are listed in the Custom manifest section
and presented as custom manifests

Expected results:
The user should be informed which of these custom manifests were added from the UI and which were not

We are looking into allowing users to rename a manifest file. Currently this is only possible by issuing DELETE and POST requests, which results in a very bad UX.

We need an API that allows users to change the folder, file name, or YAML content of an existing custom manifest.

Discussion about that: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1675776466170339
 

When the control plane nodes are under pressure or the apiserver is simply not available, no telemetry data is emitted by the monitoring stack, even though the monitoring stack doesn't run on master nodes and shouldn't have to interact with the control plane in order to push metrics.

This is caused by the fact that today telemeter-client is evaluating promQL expressions on Prometheus via an oauth-proxy endpoint that requires talking to the apiserver to be authenticated.

After discussing with Simon Pasquier, a potential solution to remove the dependency on the apiserver would be to use mTLS communication between telemeter-client and the Prometheus pods.

Today, there are 3 proxies in the Prometheus pods:

  • oauth proxy for the API
  • kube-rbac-proxy for prometheus metrics
  • kube-rbac-proxy for thanos sidecar

The kube-rbac-proxy exposing the /metrics endpoint could be used by telemeter-client since it is already doing so via mTLS.

Note that this approach would require improving telemeter-client, since it doesn't support configuring TLS certs/keys.

Epic Description

This is the second part of Customizations for Node Exporter, following https://issues.redhat.com/browse/MON-2848
There are the following tasks remaining:

  • On/off switch for these collectors:
    • systemd
    • hwmon
    • mountstats (pending decision which metrics to collect)
    • ksmd
  • General options for node-exporter
    • maxprocs

 

The "mountstats" collector generates 53 high-cardinality metrics by default, so we have to refine the story to choose only the necessary metrics to collect.

 

Cluster Monitoring Operator uses the configmap "cluster-monitoring-config" in the namespace "openshift-monitoring" as its configuration. These new configurations will be added into the section  "nodeExporter".

Node Exporter comes with a set of default activated collectors and optional collectors.

To simplify the configuration, we put a config object for each collector that we allow users to activate or deactivate.

If a collector is not present, no change is made to its default on/off status. 

Each collector has a field "enabled" as an on/off switch. If "enabled" is set to "false", other fields can be omitted.

The default values for the new options are:

  • collectors
    • systemd
      • enabled: bool, default: false
    • hwmon
      • enabled: bool, default: true
    • mountstats
      • enabled: bool, default: false
    • ksmd
      • enabled: bool, default: false
  • maxProcs: int, default: 0
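The rule that an absent collector keeps its default state can be sketched in Python as follows. The default table mirrors the list above; the function name and the choice to emit an explicit flag for every collector are assumptions for illustration, not necessarily what CMO generates.

```python
# Defaults taken from the list above.
DEFAULT_ENABLED = {
    "systemd": False,
    "hwmon": True,
    "mountstats": False,
    "ksmd": False,
}

def collector_flags(node_exporter_config):
    """Map the nodeExporter config section to node_exporter collector flags.

    A collector missing from the config keeps its default on/off state.
    """
    collectors = (node_exporter_config or {}).get("collectors", {})
    flags = []
    for name, default in DEFAULT_ENABLED.items():
        enabled = collectors.get(name, {}).get("enabled", default)
        prefix = "--collector." if enabled else "--no-collector."
        flags.append(prefix + name)
    return flags
```

For example, a config that only sets `systemd.enabled: true` changes the systemd flag and leaves the other three at their defaults.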

Here is an example of what these options look like in CMO configmap:

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 
  config.yaml: |
    nodeExporter: 
      maxProcs: 4 
      collectors: 
        hwmon: 
          enabled: true
        mountstats: 
          enabled: true
        systemd: 
          enabled: true
        ksmd: 
          enabled: true


 

If the config for nodeExporter is omitted, Node Exporter should run with the same arguments concerning collectors as those in CMO v4.12:

 
--no-collector.wifi
--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/k3s/containerd/.+|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
--collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*)$
--collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*)$
--collector.cpu.info
--collector.textfile.directory=/var/node_exporter/textfile
--no-collector.cpufreq
--no-collector.tcpstat
--collector.netdev
--collector.netclass
--no-collector.buddyinfo

 

 

 

We will add a section for "ksmd" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled", the default value is false.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # enable a collector which is disabled by default
        ksmd: 
          enabled: true

refer to: https://issues.redhat.com/browse/OBSDA-308

 

We will add a section for "systemd" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled", the default value is false.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # enable a collector which is disabled by default
        systemd: 
          enabled: true

 

To avoid scraping too many metrics from systemd units, the collector should collect metrics only for selected units. We put regex patterns of the units to collect in the list `collectors.systemd.units`.
 
 

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 
  config.yaml: |    
    nodeExporter: 
      collectors: 
        # enable a collector which is disabled by default
        systemd: 
          enabled: true
          units: 
          - iscsi-init.*
          - sshd.service
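One way such a list could be combined into a single pattern is sketched below in Python. Joining the entries with `|` and anchoring the result is an assumption about the implementation, as is passing it to node_exporter's `--collector.systemd.unit-include` flag; the function name is hypothetical.

```python
import re

def unit_include_pattern(units):
    """Join per-unit regex patterns into one anchored alternation."""
    if not units:
        raise ValueError("at least one unit pattern is required")
    return "^(" + "|".join(units) + ")$"

pattern = unit_include_pattern(["iscsi-init.*", "sshd.service"])
matcher = re.compile(pattern)
print(pattern)
```

Note that `.` in `sshd.service` is a regex wildcard here, so users who want a literal dot would need to escape it.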

 

 

refer to: https://issues.redhat.com/browse/OBSDA-214

We will add a section for "mountstats" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled", the default value is false.

 

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # enable a collector which is disabled by default
        mountstats: 
          enabled: true

 

The "mountstats" collector generates many high-cardinality metrics, so we will collect only these metrics to avoid data congestion:

1. node_mountstats_nfs_read_bytes_total
2. node_mountstats_nfs_write_bytes_total
3. node_mountstats_nfs_operations_requests_total

 

refer to: https://issues.redhat.com/browse/OBSDA-293

Node Exporter has been upgraded to 1.5.0.
The default value of the argument `--runtime.gomaxprocs` is now 1, which differs from the old behavior. Node Exporter used to take advantage of multiple processes to accelerate metrics collection.
We are going to add a parameter to set the argument `--runtime.gomaxprocs` and make its default value 0, so that CMO retains the old behavior while allowing users to tune the multiprocess settings of Node Exporter.

The CMO config will have a new section `nodeExporter`, under which there is the parameter `maxProcs`, accepting an integer as the maximum number of processes Node Exporter runs concurrently. Its default value is 0 if omitted.

 config.yaml: |

    nodeExporter: 
      maxProcs: 1
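The mapping from this setting to the Node Exporter argument can be sketched in Python as below. The function name and the validation rule are assumptions; only the `maxProcs` key, its default of 0, and the `--runtime.gomaxprocs` argument come from the text above.

```python
def gomaxprocs_arg(node_exporter_config):
    """Build the --runtime.gomaxprocs argument from the nodeExporter section.

    An omitted maxProcs falls back to 0, the default described above.
    """
    max_procs = (node_exporter_config or {}).get("maxProcs", 0)
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(max_procs, bool) or not isinstance(max_procs, int) or max_procs < 0:
        raise ValueError("maxProcs must be a non-negative integer")
    return f"--runtime.gomaxprocs={max_procs}"
```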

Proposed title of this feature request

In 4.11 we introduced alert overrides and alert relabeling feature as tech preview. We should graduate this feature to GA.

What is the nature and description of the request?

This feature can address requests and issues we have seen from existing and potential customers. Moving this feature to GA would greatly enable adoption.

Why does the customer need this? (List the business requirements)

See linked issues.

List any affected packages or components.

CMO

Epic Goal

Why is this important?

  • Monitoring console pages should be visible when CMO is present.

Scenarios

  1. ...

Acceptance Criteria

  • Console plugin is deployed by CMO.

Dependencies (internal and external)

https://github.com/openshift/monitoring-plugin

Previous Work (Optional):

  1. https://issues.redhat.com/browse/OU-34
  2. Example resource for a similar logging plugin: https://github.com/openshift/logging-view-plugin/blob/main/logging-view-plugin-resources.yml

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description

CMO should deploy and enable monitoring Plugin.

Acceptance Criteria

  • Monitoring Plugin must be deployed by CMO
  • The Plugin should be enabled by default
  • Must have an e2e test that verifies the Monitoring Plugin is deployed properly

 

Other Notes/ Considerations

  • Initial implementation only applies static yaml manifests
  • Need to parameterize image
  • CVO needs to be modified to provide the image URL for the plugin image

Epic Goal

  • Enable static code analysis in cluster-monitoring-operator
  • Create a suitable config for the selected analyzers to allow for ignoring issues that are deemed safe or ignoring portions of the code (like tests)
  • set up PR checks

Why is this important?

  • static code analysis can reduce certain classes of bugs
  • highlight unused code
  • enforces consistent code quality

 

We should run at least https://github.com/golangci/golangci-lint. 

https://github.com/securego/gosec could be interesting.

We also have an internal team: https://gitlab.cee.redhat.com/covscan/covscan/-/wikis/home. Maybe there are additional scanners we can possibly run.

Acceptance Criteria

  • CI - set up PR checks
  • Run at least golangci-lint
  • Fix existing issues or create exceptions in the relevant config files.

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

https://github.com/openshift/cluster-monitoring-operator/pull/1989

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Epic Goal

  • Users can today specify resource requirements for some of the components:
    • Prometheus (in-cluster and UWM), only for the prometheus container.
    • Alertmanager (in-cluster and UWM), only for the alertmanager container.
    • Thanos Querier, only for the thanos-query container.
    • Thanos Ruler, only for the thanos-ruler container.
  • We should extend that to all containers that potentially use too many resources.
  • Similar configuration options might be exposed for other components (subject for evaluation)
    • Node exporter
    • Kube state metrics
    • OpenShift state metrics
    • prometheus-adapter
    • prometheus-operator + admission webhook (platform and UWM)
    • telemeter-client

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To help customers debug, we need to be able to include NOO pods and resources in the data collected by the must-gather script.

To collect it, use

oc adm must-gather

Epic Goal

  • Update OpenShift components that are owned by the Builds + Jenkins Team to use Kubernetes 1.27

Why is this important?

  • Our components need to be updated to ensure that they are using the latest bug/CVE fixes, features, and that they are API compatible with other OpenShift components.

Acceptance Criteria

  • Existing CI/CD tests must be passing

User Story

As a developer, I want my testing and build tooling managed in a consistent way, to reduce the number of context switches during maintenance work.

Background

Currently, our approach to managing and updating auxiliary tooling (such as envtest, controller-gen, etc.) is inconsistent. A good pattern was introduced in the CPMS repo, which relies on the Go toolchain to update, vendor, and run this auxiliary tooling.

For CPMS context see: 

https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/tools/tools.go

https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/go.mod#L24

https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/Makefile#L19

Steps

  • Align envtest, controller-gen, and other tooling management with the pattern introduced in the CPMS repo
  • Introduce an additional test that compares the envtest version (if envtest is in use within a particular repo) with the version of the k8s-related libraries in use. This helps ensure that envtest and other auxiliary tools are not forgotten during dependency bumps.
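
The version comparison in the second step could look roughly like the sketch below. It is only an illustration: the function names are hypothetical, and it assumes that the k8s library version can be read from the k8s.io/api requirement in go.mod and that the envtest version is available as a plain "1.27.1"-style string.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// k8sAPIRe matches the k8s.io/api requirement in a go.mod file and captures
// its minor version (k8s.io/api v0.27.4 -> "27").
var k8sAPIRe = regexp.MustCompile(`(?m)^\s*(?:require\s+)?k8s\.io/api\s+v0\.(\d+)\.\d+`)

// k8sMinorFromGoMod returns the Kubernetes minor version implied by the
// k8s.io/api requirement, for example "1.27".
func k8sMinorFromGoMod(goMod string) (string, error) {
	m := k8sAPIRe.FindStringSubmatch(goMod)
	if m == nil {
		return "", fmt.Errorf("k8s.io/api requirement not found")
	}
	return "1." + m[1], nil
}

// envtestMatches reports whether the envtest Kubernetes version has the same
// minor version as the k8s libraries required in go.mod.
func envtestMatches(goMod, envtestVersion string) (bool, error) {
	minor, err := k8sMinorFromGoMod(goMod)
	if err != nil {
		return false, err
	}
	return envtestVersion == minor || strings.HasPrefix(envtestVersion, minor+"."), nil
}

func main() {
	goMod := "module example.com/operator\n\nrequire (\n\tk8s.io/api v0.27.4\n)\n"
	ok, err := envtestMatches(goMod, "1.27.1")
	fmt.Println(ok, err)
}
```

In a real repository the go.mod content and the envtest version would come from the files and Makefile variables of that repo; the sketch only shows the comparison itself.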

Stakeholders

  • Cluster infra team

Definition of Done

  • All Cluster Infra Team owned repos are updated and use a consistent pattern for auxiliary tools management
    • REPO LIST TBD, raw list below
    • MAPI providers
    • MAO
    • CCCMO
    • CMA
  • Testing
    • Existing tests should pass
    • An additional test for checking the envtest version should be introduced

Goal

This epic has 3 main goals:

  1. Improve segment implementation so that we can easily enable additional telemetry pieces (hotjar, etc.) for particular cluster types (starting with sandbox, maybe expanding to RHPDS). This will help us better understand where errors and drop-offs occur in our trial and workshop clusters, thus being able to (1) help conversion and (2) proactively detect issues before they are "reported" by customers.
  2. Improve telemetry so we can START capturing console usage across the fleet
  3. Additional improvements to segment, to enable proper gathering of user telemetry and analysis

Problem

Currently we have no accurate telemetry of usage of the OpenShift Console across all clusters in the fleet. We should be able to utilize the auth and console telemetry to glean details which will allow us to get a picture of console usage by our customers.

Acceptance criteria

Let's do a spike to validate, and possibly have to update this list after the spike:

We need to verify how we define a cluster admin: by the ability to list all namespaces in a cluster? To install operators? Make sure that we consider OSD cluster admins as well (in my mind, this should be aligned with how we send people to the dev perspective).

Capture additional information via console plugin ( and possibly the auth operator )

  1. Average number of users per cluster
  2. Average number of cluster admin users per cluster
  3. Average number of dev users per cluster
  4. Average # of page views across the fleet
  5. Average # of page views per perspective across the fleet
  6. # of clusters which have disabled the admin perspective for any users
  7. # of clusters which have disabled the dev perspective for any users
  8. # of clusters which have disabled the “any” perspective for any users
  9. # of clusters which have plugin “x” installed
  10. Total number of unique users across the fleet
  11. Total number of cluster admin users across the fleet
  12. Total number of developer users across the fleet

Dependencies (External/Internal):

Understanding how to capture telemetry via the console operator

Exploration:

Note:

We have removed the following ACs for this release:

  1. (p2) Average total active time spent per User in console (per cluster for all users)
    1. per Cluster Admins
    2. per non-Cluster Admins
  2. (p2) Average active time spent in Dev Perspective [implies we can calculate this for admin perspective]
    1. per Cluster Admins
    2. per non-Cluster Admins
  3. (p3) Average # of times they change the perspective (per cluster for all users)

Description

As a RH PM/engineer, we want to understand the usage of the (dev) console. For that, we want to add new Prometheus metrics (how many users a cluster has, etc.) and collect them later (as telemetry data) via cluster-monitoring-operator.

Acceptance Criteria

  1. Add metrics that are collected in ODC-7232 to cluster-monitoring-operator so that we can get this data later in Superset DataHat or Tableau.

Additional Details:

As Red Hat, we want to understand the usage of the (dev) console. For that, we want to add new Prometheus metrics (how many users a cluster has, etc.) and collect them later (as telemetry data) via cluster-monitoring-operator.

Either the console-operator or cluster-monitoring-operator needs to apply a PrometheusRule to collect the right data and make it available later in Superset DataHat or Tableau.

Description of problem:
With 4.13 we added new metrics to the console (Epic ODC-7171 - Improved telemetry (provide new metrics)) that collect different user and cluster metrics.

The cluster metrics include:

  1. which perspectives are customized (enabled, disabled, only available for a subset of users)
  2. which plugins are installed and enabled

These metrics contain the perspective name or plugin name as a label, and the set of possible values is unbounded. Admins can configure any perspective or plugin name, even if no perspective or plugin with that name is available.

Based on the feedback in https://github.com/openshift/cluster-monitoring-operator/pull/1910 we need to reduce the cardinality and limit the metrics to, for example:

  1. perspectives: admin, dev, acm, other
  2. plugins: redhat, demo, other
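
One way to bound the label values is to map every configured name to a small fixed set before it is used as a metric label. The sketch below illustrates this under stated assumptions: the function names are hypothetical, and the lists of plugin names treated as "redhat" or "demo" are made up for the example, not taken from the console code.

```go
package main

import "fmt"

// Perspective IDs that keep their own label value.
var knownPerspectives = map[string]bool{"admin": true, "dev": true, "acm": true}

// Hypothetical allowlists: which plugin names count as "redhat" or "demo"
// is an assumption made for this sketch.
var redHatPlugins = map[string]bool{"acm": true, "mce": true, "logging-view-plugin": true}
var demoPlugins = map[string]bool{"console-demo-plugin": true}

// perspectiveLabel maps an arbitrary perspective ID to a bounded label value.
func perspectiveLabel(id string) string {
	if knownPerspectives[id] {
		return id
	}
	return "other"
}

// pluginLabel maps an arbitrary plugin name to a bounded label value.
func pluginLabel(name string) string {
	switch {
	case redHatPlugins[name]:
		return "redhat"
	case demoPlugins[name]:
		return "demo"
	default:
		return "other"
	}
}

// countByLabel counts names per label, so the number of series is bounded by
// the number of labels rather than by what admins configure.
func countByLabel(names []string, label func(string) string) map[string]int {
	counts := make(map[string]int)
	for _, n := range names {
		counts[label(n)]++
	}
	return counts
}

func main() {
	perspectives := []string{"admin", "dev", "dev1", "dev2", "dev3"}
	fmt.Println(countByLabel(perspectives, perspectiveLabel))
	plugins := []string{"acm", "mce", "my-plugin"}
	fmt.Println(countByLabel(plugins, pluginLabel))
}
```

With this kind of mapping, the custom perspectives dev1, dev2, and dev3 all end up under the single value "other" instead of creating one series each.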

Version-Release number of selected component (if applicable):
4.13.0

How reproducible:
Always

Steps to Reproduce:
On a cluster, you must update the console configuration, configure some perspectives or plugins and check the metrics in Admin > Observe > Metrics:

avg by (name, state) (console_plugins_info)

avg by (name, state) (console_customization_perspectives_info)

On a local machine, you can use this console yaml:

apiVersion: console.openshift.io/v1
kind: ConsoleConfig
plugins: 
  logging-view-plugin: https://logging-view-plugin.logging-view-plugin-namespace.svc.cluster.local:9443/
  crane-ui-plugin: https://crane-ui-plugin.crane-ui-plugin-namespace.svc.cluster.local:9443/
  acm: https://acm.acm-namespace.svc.cluster.local:9443/
  mce: https://mce.mce-namespace.svc.cluster.local:9443/
  my-plugin: https://my-plugin.my-plugin-namespace.svc.cluster.local:9443/
customization: 
  perspectives: 
  - id: admin
    visibility: 
      state: Enabled
  - id: dev
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get
  - id: dev1
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get
  - id: dev2
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get
  - id: dev3
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get

And start the bridge with:

./build-backend.sh
./bin/bridge -config ../config.yaml

After that you can fetch the metrics in a second terminal:

Actual results:

curl -s localhost:9000/metrics | grep ^console_plugins

console_plugins_info{name="acm",state="enabled"} 1
console_plugins_info{name="crane-ui-plugin",state="enabled"} 1
console_plugins_info{name="logging-view-plugin",state="enabled"} 1
console_plugins_info{name="mce",state="enabled"} 1
console_plugins_info{name="my-plugin",state="enabled"} 1

curl -s localhost:9000/metrics | grep ^console_customization

console_customization_perspectives_info{name="dev",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev1",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev2",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev3",state="only-for-developers"} 1

Expected results:
Lower cardinality; that is, the results should be grouped into a bounded set of label values.

Additional info:

Goal:

This epic aims to address some of the RFEs associated with the Pipeline user experience.

Why is it important?

Improve the overall user experience when working with OpenShift Pipelines

Acceptance criteria:

  1. Users should be able to visually differentiate between canceled & failed pipelines in the Pipeline metrics tab
  2. Users should be able to see the duration of TaskRuns in the list view, this can be achieved via the Column management feature in the TaskRuns list page
  3. Users should be able to see the duration of TaskRuns in the TaskRun details view
  4. Users should be able to see the PipelineRun duration on the PipelineRun details page
  5. Users should be able to see a list of all PipelineRuns in their project from a PipelineRuns tab in the Dev perspective Pipeline page
  6. Users should be able to easily view webhook information on the Repository details page
  7. Users should be able to easily view webhook information on the summary page

Dependencies (External/Internal):

None

Exploration:

Exploration is available in this Miro board

Description

As a user, I want to manage the columns available on the TaskRuns list page

Acceptance Criteria

  1. should provide a manage columns option on the TaskRuns details page
  2. By default, the Duration column should not be present and the user can make it visible by using manage columns option

Additional Details:

Description

As a user, I want to see the information about the cancelled pipeline on the Pipeline metrics page

Acceptance Criteria

  1. should show the cancelled status in a different color in the Pipeline Success Ratio donut chart.

Additional Details:

Description

With many PipelineRuns based on the same pipeline, it will get confusing if re-runs are named by the pipeline as they will all be named similarly. Losing the distinction between PipelineRuns will cause lots of additional hassles.

Acceptance Criteria

  1. Use the name of the last run as a prefix for the newly created PipelineRun
  2. Validate the existing e2e and add new e2e tests if needed

Additional Details:

Description

As a user, I want to see the duration on the details page of PipelineRun and TaskRun

Acceptance Criteria

  1. should show PipelineRun duration on the details page
  2. should show TaskRun duration on the details page

Additional Details:

Description

As a user, I want to see the webhook link and webhook secret on the Repository details page and the webhook link on the Repository summary page

Acceptance Criteria

  1. Should add the webhook link on the Repository details page
  2. should add the webhook secret on the Repository details page
  3. should show the webhook link and secret only if the Repository has been created using the Setup a Webhook option
  4. should add the webhook link on the Repository summary page

Additional Details:

Description

As a user, I want to see the PipelineRuns present in the current namespace from the Dev perspective

Acceptance Criteria

  1. Should add a PipelineRuns tab after the Pipeline tab on the dev perspective Pipeline page
  2. Should list all the PipelineRuns present in the namespace
  3. Should add the Create PipelineRun option in the Create action menu

Additional Details:

Problem:

ODC tests mainly run as kube:admin (cluster-admin privileges), which means that breakages caused by RBAC issues go unnoticed.

Goal:

To define some basic tests focused on the self-provisioner users which can also be run on CI

Why is it important?

Testing with non-admin users helps ensure that PR changes do not break the UI for them.

Use cases:

  1. <case>

Acceptance criteria:

  1. Collect requirements for user tests
  2. Write some basic tests for different packages
  3. Run tests with non-admin users locally
  4. Run tests with CI

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

Running pull CI tests for devconsole, pipelines, and knative packages with non-admin user

Acceptance Criteria

  1. <criteria>

Additional Details:

Problem:

ODC E2E tests have flakes, which cause failures on CI.

Goal:

Reduce the ODC E2E test flakes by stabilising the tests and improving the speed of test execution.

Why is it important?

To improve health of CI which will impact PR review effectiveness.

Use cases:

  1. <case>

Acceptance criteria:

  1. Improving E2E test of Pipelines and Knative Packages
  2. Improving E2E test of Topology and Helm Packages
  3. Improving E2E test of Dev-Console Package
  4. Running the improved test against CI

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

Skip waiting for the authentication operator to start progressing when the secret already exists

For periodic jobs, our tests are appended to the existing console tests. However, the value of `waitForAuthOperatorProgressing` changes from true to false at the start of the console tests. Because our tests follow the same procedure, they keep waiting for the value to become true, which never happens, so the tests do not start.

Description

Cover customization of the developer view with automation.

Acceptance Criteria

  1. Customisation of developer catalog and Add page through form view

Additional Details:

The upstream repos which contribute to the OLM v0 downstream repo have a 90+ commit delta, with several substantial dependency version bumps.

The interaction between these repos necessitates a coordinated solution, and potentially new upstream contributions to reach dependency equilibrium before bringing downstream. 

The goals of this epic are:

  1. to attempt a bulk sync of the upstream contributing repositories, bringing all commits downstream in accordance with the OLMv0 downstreaming doc
  2. identify impediments to the downstream, and capture a list of remediating actions to be taken both up/downstream
  3. coordinate across teams (OPRUN, OPECO, QE) to resolve the impediments and handle test impacts

 

We have some existing work in this direction, and this epic is mostly to coordinate across teams.  As a result, some existing stories will need some remodeling as we go, and teams should feel free to keep them up to date to reflect the identified work.

The openshift/operator-framework-olm repository is very out-of-date, and needs to be sync'd from upstream.

Acceptance Criteria:

All necessary upstream commits from:

  • operator-framework/operator-lifecycle-manager
  • operator-framework/operator-registry
  • operator-framework/api

are merged into the openshift/operator-framework-olm repository.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The Kube APIServer has a sidecar to output audit logs. We need similar sidecars for other APIServers that run on the control plane side. We also need to pass the same audit log policy that we pass to the KAS to these other API servers.

During a PerfScale 80 HC test in stage we found that the OBO prometheus monitoring stack was consuming 50G of memory (enough to cause OOMing on the m5.4xlarge instance it was residing on). Additionally, during this time it would also consume over 10 CPU cores. 

Snapshot of the time leading up to (effectively idle) and during the test: https://snapshots.raintank.io/dashboard/snapshot/2K5s0PzaN1U2JE1jrxTPZ5jX0fifBuRC 

As an SRE, I want to have the ability to filter metrics exposed from the Management Clusters.

Context:
RHOBS resources allocated to HCP are scarce. Currently, we push every single metric to the RHOBS instance.
However, in https://issues.redhat.com/browse/OSD-13741, we've identified a subset of metrics that are important to SRE.

The ability to export only those metrics to RHOBS will significantly reduce the cost of monitoring and increase our ability to scale RHOBS.

As discussed in this Slack thread, most of the CPU and memory consumption of the OBO operator occurs at scrape time.

The idea here is to make sure the hypershift and control-plane-operator operators no longer specify the scrape interval in ServiceMonitor and PodMonitor scrape configs (unless there is a very good reason to do so).

Indeed, when the scrape interval is not specified at scrape config level, the global scrape interval specified at the root of the config is used. This offers the following benefits:

  • The interval can be set for all scrape configs at once.
  • The interval is no longer hard-coded in HyperShift code.
  • The interval can be set to a higher value.
    This will reduce the quantity of data scraped by Prometheus and consequently lower its memory consumption.
    See the next sub-task, which will set the global scrape interval to 60 sec.

This is part of solution #1 described here.

When quorum breaks and we are able to get a snapshot of one of the etcd members, we need a procedure to restore the etcd cluster for a given HostedCluster. 

Documented here: https://docs.google.com/document/d/1sDngZF-DftU8_oHKR70E7EhU_BfyoBBs2vA5WpLV-Cs/edit?usp=sharing 

Add the above documentation to the HyperShift repo documentation.

Problem statement:

Many internal projects rely on Red Hat's fork of the OAuth2 Proxy project. The fork differs from the main upstream project in that it added an OpenShift authentication backend provider, allowing the OAuth2 Proxy service to use the OpenShift platform as an authentication broker.

Unfortunately, this provider was never contributed back to the upstream project, which caused the fork and the upstream to diverge severely. The fork is also extremely outdated and lacks features.

Among such features not present in the forked version is the support for setting up a timeout for requests from the proxy to the upstream service, otherwise controlled using the --upstream-timeout command-line switch in the official OAuth2 Proxy project.

Without the ability to set the timeout explicitly, the default value of 30 seconds is assumed (coming from Go's libraries), and this is often not enough to serve a response from a busy backend.

Thus, we need to backport this feature from the upstream project.

Resources:

Implementation ideas:

Backport the Pull Request from the upstream project into Red Hat's fork.

Acceptance Criteria:

  • The fork of the OAuth2 Proxy can now use the --upstream-timeout command-line switch to set the desired timeout.
  • A new container image has been built and uploaded to Quay so that it can be pulled when services are deployed into our OpenShift clusters.

Default Acceptance Criteria:

  • Any relevant documentation and SOPs are updated or written.
  • The code, even though it is a backport, has sufficient test coverage.

Goal: Support OVN-IPsec on IBM Cloud platform.

Why is this important: IBM Cloud is being added as a new OpenShift supported platform, targeting 4.9/4.10 GA.

Dependencies (internal and external):

Prioritized epics + deliverables (in scope / not in scope):

  • Need to have permission to spin up IBM clusters

Not in scope:

Estimate (XS, S, M, L, XL, XXL):

Previous Work:

Open questions:

Acceptance criteria:

Epic Done Checklist:

  • CI - CI Job & Automated tests: <link to CI Job & automated tests>
  • Release Enablement: <link to Feature Enablement Presentation> 
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • Notes for Done Checklist
    • Adding links to the above checklist with multiple teams contributing; select a meaningful reference for this Epic.
    • Checklist added to each Epic in the description, to be filled out as phases are completed - tracking progress towards “Done” for the Epic.

sig-cli is failing in two different ways:

  • Missing api resources from api groups.
  • A bug where the loop variables are not captured by value in closures, producing random errors on each execution because the variables are overwritten before the earlier It functions run.
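
The second failure mode is a common Go pitfall. The sketch below shows the usual fix, a per-iteration copy of the loop variable, with plain closures standing in for the It functions; the function name and the resource names are made up for the example.

```go
package main

import "fmt"

// buildChecks returns one closure per resource. Each closure runs later,
// after the loop has finished, much like a test body registered in a loop.
func buildChecks(resources []string) []func() string {
	checks := make([]func() string, 0, len(resources))
	for _, r := range resources {
		// Per-iteration copy. Before Go 1.22 the closures below would
		// otherwise all share the single loop variable r and see only
		// whatever value it holds when they finally run.
		r := r
		checks = append(checks, func() string { return "checking " + r })
	}
	return checks
}

func main() {
	for _, check := range buildChecks([]string{"routes", "configmaps"}) {
		fmt.Println(check())
	}
}
```

From Go 1.22 on, modules that declare that language version get a fresh loop variable per iteration, so the explicit copy is only needed for older toolchains, but it remains harmless.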

Failing tests 

 

  • [sig-cli] oc adm must-gather runs successfully [Suite:openshift/conformance/parallel]
  • [sig-cli] oc adm must-gather runs successfully with options [Suite:openshift/conformance/parallel]
  • [sig-cli] oc adm must-gather when looking at the audit logs [sig-node] kubelet runs apiserver processes strictly sequentially in order to not risk audit log corruption [Suite:openshift/conformance/parallel]

Five tests fail because the system:authenticated group does not have enough permissions on some resources (routes and configmaps).

"[sig-cli] oc basics can create and interact with a list of resources [Suite:openshift/conformance/parallel]"

"[sig-cli] oc basics can show correct whoami result [Suite:openshift/conformance/parallel]"

"[sig-cli] oc can route traffic to services [apigroup:route.openshift.io] [Suite:openshift/conformance/parallel]"

"[sig-cli] oc expose can ensure the expose command is functioning as expected [apigroup:route.openshift.io] [Suite:openshift/conformance/parallel]"

"[sig-network-edge][Feature:Idling] Idling with a single service and ReplicationController should idle the service and ReplicationController properly [Suite:openshift/conformance/parallel]"

Quite a few tests depend on API groups which do not exist in MicroShift. We can add an [apigroup] annotation to skip these tests.

[apigroup:oauth.openshift.io]

"[sig-auth][Feature:OAuthServer] OAuthClientWithRedirectURIs must validate request URIs according to oauth-client definition [Suite:openshift/conformance/parallel]"

"[sig-auth][Feature:OAuthServer] well-known endpoint should be reachable [apigroup:route.openshift.io] [Suite:openshift/conformance/parallel]"

[apigroup:operator.openshift.io]

"[sig-network][Feature:MultiNetworkPolicy][Serial] should enforce a network policies on secondary network IPv4 [Suite:openshift/conformance/serial]"

"[sig-network][Feature:MultiNetworkPolicy][Serial] should enforce a network policies on secondary network IPv6 [Suite:openshift/conformance/serial]"

"[sig-storage][Feature:DisableStorageClass][Serial] should not reconcile the StorageClass when StorageClassState is Unmanaged [Suite:openshift/conformance/serial]"

"[sig-storage][Feature:DisableStorageClass][Serial] should reconcile the StorageClass when StorageClassState is Managed [Suite:openshift/conformance/serial]"

"[sig-storage][Feature:DisableStorageClass][Serial] should remove the StorageClass when StorageClassState is Removed [Suite:openshift/conformance/serial]"

"[sig-auth][Feature:Authentication]  TestFrontProxy should succeed [Suite:openshift/conformance/parallel]"

This test is failing because it depends on the "aggregator-client" secret, which is not present in MicroShift. We can skip this test.
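
How the [apigroup:...] annotations described above can drive skipping is sketched below. This is not the actual test framework code: the function name is hypothetical, and the set of served API groups is an assumption chosen for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// apiGroupRe extracts the group from each [apigroup:<group>] annotation.
var apiGroupRe = regexp.MustCompile(`\[apigroup:([^\]]+)\]`)

// shouldSkip reports whether the test name is annotated with at least one
// API group that the cluster does not serve.
func shouldSkip(testName string, served map[string]bool) bool {
	for _, m := range apiGroupRe.FindAllStringSubmatch(testName, -1) {
		if !served[m[1]] {
			return true
		}
	}
	return false
}

func main() {
	// Assumption for illustration: only route.openshift.io is served.
	served := map[string]bool{"route.openshift.io": true}
	tests := []string{
		"[sig-cli] oc can route traffic to services [apigroup:route.openshift.io]",
		"[sig-auth][Feature:OAuthServer] well-known endpoint should be reachable [apigroup:route.openshift.io][apigroup:oauth.openshift.io]",
	}
	for _, t := range tests {
		fmt.Printf("skip=%v %s\n", shouldSkip(t, served), t)
	}
}
```

A test without any annotation is never skipped by this check, which is why tests that depend on a missing group fail until the annotation is added.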

The goal of this EPIC is to solve several issues related to PDBs causing issues during OCP upgrades, especially when new apiservers (which roll out one by one) were wedged (there was some issue with networking on new pods due to RHEL upgrades).

slack thread: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1673886138422059

Epic Goal*

This is a tracking issue for the Workloads related work for Microshift 4.13 Improvements. See API-1506 for the whole feature.

followup to https://issues.redhat.com/browse/WRKLDS-487

  • refactor route-controller-manager to use NewControllerCommandConfig and ControllerBuilder from library-go. Then update dep in microshift and we can pass LeaderElection.Disable in the config to disable leader election as it is not needed in microshift.
  • OCMO refactoring/separating status conditions. Create two instances of the NewDeploymentController to create separate status conditions.

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so that we do not forget to improve it in a safer and more maintainable way.

Why is this important?

Maintainability and debuggability, and in general fighting technical debt, are critical to keep velocity and ensure overall high quality.

Scenarios

  1. N/A

Acceptance Criteria

  • depends on the specific card

Dependencies (internal and external)

  • depends on the specific card

Previous Work (Optional):

https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479 
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
https://issues.redhat.com/browse/CNF-9566 

Description of problem:

According to the API documentation, the policyTypes field is optional:
https://docs.openshift.com/container-platform/4.11/rest_api/network_apis/networkpolicy-networking-k8s-io-v1.html#specification

If this field is not specified, it will default based on the existence of Ingress or Egress rules.
However, if policyTypes is not specified, all traffic is dropped regardless of what is stated in the rule.
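
The defaulting rule from the API documentation can be written down as a small sketch: every policy affects ingress, and a policy that contains an egress section also affects egress. The function below is illustrative only, with a hypothetical name, and it assumes that MultiNetworkPolicy is meant to follow the same rule as NetworkPolicy.

```go
package main

import "fmt"

// effectivePolicyTypes returns the policy types that apply to a policy.
// declared is the content of spec.policyTypes, and egressRules is the number
// of entries in spec.egress.
func effectivePolicyTypes(declared []string, egressRules int) []string {
	if len(declared) > 0 {
		return declared
	}
	// Defaulting: ingress always applies, egress only when egress rules exist.
	types := []string{"Ingress"}
	if egressRules > 0 {
		types = append(types, "Egress")
	}
	return types
}

func main() {
	// A policy with ingress rules only and no policyTypes, as in this report.
	fmt.Println(effectivePolicyTypes(nil, 0))
}
```

Under this rule, a policy with only ingress rules and no policyTypes is an ingress policy, so traffic that matches its ingress rule is expected to pass.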

 

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Configure sriov (nodepolicy + sriovnetwork)
2. Configure 2 pods
3. enable MultiNetworkPolicy
4. apply MultiNetworkPolicy:
spec:
  podSelector:
    matchLabels:
      pod: pod1
  ingress:
  - from:
    - ipBlock:
        cidr: 192.168.0.2/32
5. send traffic between pods (192.168.0.2 => pod=pod1)

Actual results:

traffic dropped 

Expected results:

traffic passed

Additional info:

 

Goal

Address miscellaneous technical debt items in order to maintain code quality, maintainability, and improved user experience.

User Stories

Non-Requirements

Notes

  • Any additional details or decisions made/needed

Owners

Role                 Contact
PM                   Peter Lauterbach
Documentation Owner  TBD
Delivery Owner       (See assignee)
Quality Engineer     (See QA contact)

Done Checklist

Who  What                             Reference
DEV  Upstream roadmap issue           <link to GitHub Issue>
DEV  Upstream code and tests merged   <link to meaningful PR or GitHub Issue>
DEV  Upstream documentation merged    <link to meaningful PR or GitHub Issue>
DEV  gap doc updated                  <name sheet and cell>
DEV  Upgrade consideration            <link to upgrade-related test or design doc>
DEV  CEE/PX summary presentation      label epic with cee-training and add a <link to your support-facing preso>
QE   Test plans in Polarion           N/A details in user stories.
QE   Automated tests merged           N/A details in user stories.
DOC  Downstream documentation merged  <link to meaningful PR>

Kubevirt csi is unable to unpublish a volume when the VM/VMI that the volume was published on unexpectedly disappears. This situation can occur for many reasons: someone could forcibly delete the VM, a replace update could destroy a VM before it can unpublish a volume, a VM node could become unresponsive and be deleted by the capi machine controller, and so on.

 

When this situation occurs, the PVC within the guest will never get deleted properly. Kubevirt csi will report the following error.

 
I0531 13:07:51.338413 1 controller.go:264] Detaching DataVolume pvc-4c4d4744-8a04-4df1-964b-d4eac90a93a2 from Node ID fc3ad096-53f0-535d-bbd8-45a3ab3803d1
E0531 13:07:51.349493 1 server.go:124] /csi.v1.Controller/ControllerUnpublishVolume returned with error: rpc error: code = NotFound desc = failed to find VM with domain.firmware.uuid 5cb46a00-2b8b-509b-b32b-39d1bab4e8b5
 

To resolve this, the kubevirt-csi controller needs to gracefully handle unpublishing a volume when the VM and VMI associated with the volume no longer exist.

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

Console-operator should switch from using bindata to using assets, similar to what cluster-kube-apiserver-operator and other operators are doing, so that we don't need to regenerate the bindata when YAML files are changed.

There is also an issue with generating bindata on ARM and other architectures, which switching to assets will make obsolete.

 

https://github.com/openshift/cluster-kube-apiserver-operator/blob/005a95607cf9f8db490e962b549811d8bc0c5eaf/bindata/assets.go

Epic Goal*

Provide a way to tune the etcd latency parameters ETCD_HEARTBEAT_INTERVAL and ETCD_ELECTION_TIMEOUT.

 
Why is this important? (mandatory)

OCP4 does not have a way to tune etcd parameters such as timeouts, heartbeat intervals, etc. Adjusting these parameters indiscriminately may compromise the stability of the control plane. In scenarios where disk IOPS are not ideal (e.g. disk degradation, storage providers in Cloud environments), these parameters could be adjusted to improve the stability of the control plane while raising the corresponding warning notifications.

In the past:

The current default values on a 4.10 deployment:
```
- name: ETCD_ELECTION_TIMEOUT
  value: "1000"
- name: ETCD_ENABLE_PPROF
  value: "true"
- name: ETCD_EXPERIMENTAL_MAX_LEARNERS
  value: "3"
- name: ETCD_EXPERIMENTAL_WARNING_APPLY_DURATION
  value: 200ms
- name: ETCD_EXPERIMENTAL_WATCH_PROGRESS_NOTIFY_INTERVAL
  value: 5s
- name: ETCD_HEARTBEAT_INTERVAL
  value: "100"
```
These values are modified for specific cloud providers (https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdenvvar/etcd_env.go#L232-L254).

The guidance for latency among control plane nodes does not translate well to live on-premise scenarios: https://access.redhat.com/articles/3220991

 
Scenarios (mandatory) 

Define an etcd-operator API that gives the cluster admin the ability to set `ETCD_ELECTION_TIMEOUT` and `ETCD_HEARTBEAT_INTERVAL` within a certain range.

 
Dependencies (internal and external) (mandatory)

No external teams

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - etcd team
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

For https://issues.redhat.com/browse/OCPBU-333 we need an enhancement proposal so we can go over the different options in how we want to allow configuration of the etcd heartbeat, leader election and any other latency parameters that might be required for OCPBU-333.

Once we have the API for configuring the heartbeat interval and leader election timeouts from https://github.com/openshift/api/pull/1538 we will need to reconcile the tuning profile set on the API onto the actual etcd deployment.

This would require updating how we set the env vars for both parameters by first reading the operator.openshift.io/v1alpha1 Etcd "cluster" object and mapping the profile value to the required heartbeat and leader election timeout values in:
https://github.com/openshift/cluster-etcd-operator/blob/381ffb81706699cdadd0735a52f9d20379505ef7/pkg/etcdenvvar/etcd_env.go#L208-L254
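The mapping step described above (profile value to heartbeat and leader election timeout env vars) might look roughly like the following Python sketch. The profile names `Standard` and `Slower` and the `Slower` values are assumptions made for illustration; only the 100/1000 defaults are taken from the values quoted earlier in this epic. The real operator is written in Go and reads the profile from the Etcd "cluster" object.

```python
# "Standard" uses the default values quoted earlier in this epic.
# "Slower" and its values are placeholders chosen for illustration only.
PROFILES = {
    "Standard": {"ETCD_HEARTBEAT_INTERVAL": "100", "ETCD_ELECTION_TIMEOUT": "1000"},
    "Slower": {"ETCD_HEARTBEAT_INTERVAL": "500", "ETCD_ELECTION_TIMEOUT": "2500"},
}


def latency_env_vars(profile):
    """Map a tuning profile name to the two etcd latency env vars."""
    if not profile:
        # An unset profile falls back to the defaults.
        profile = "Standard"
    if profile not in PROFILES:
        raise ValueError(f"unknown tuning profile: {profile!r}")
    # Return a copy so callers cannot mutate the shared table.
    return dict(PROFILES[profile])
```

Keeping the allowed values in a fixed table, rather than accepting arbitrary numbers, is one way to keep the parameters "within a certain range" as the scenario requires.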

Placeholder epic to track spontaneous tasks that do not deserve their own epic.

ServicePublishingStrategy entries of type LoadBalancer or Route could specify the same hostname, which results in one of the services not being published, i.e. no DNS records are created for it.
context: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1678287502260289
 
DOD:
Validate ServicePublishingStrategy and report conflicting service hostnames
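A minimal sketch of such a validation, assuming the publishing strategies have already been reduced to (service name, hostname) pairs; the function name and input shape are hypothetical and do not reflect the HyperShift API types.

```python
from collections import defaultdict


def conflicting_hostnames(services):
    """Return {hostname: [service names]} for hostnames used more than once.

    `services` is an iterable of (service_name, hostname) pairs; entries
    without a hostname are ignored.
    """
    by_host = defaultdict(list)
    for name, hostname in services:
        if hostname:
            by_host[hostname].append(name)
    return {host: sorted(names) for host, names in by_host.items() if len(names) > 1}
```

A non-empty result could then be surfaced in a condition message listing each hostname with the services that claim it.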

DoD:

This feature is supported by ROSA.

To have an e2e to validate publicAndPrivate <-> Private in the presubmits.

After the HostedCluster and NodePool are paused using the PausedUntil field, the awsprivatelink controller continues reconciling.

 

How to test this:

  • Deploy a private cluster
  • Put it in pause once deployed
  • Delete the AWSEndPointService and the Service from the HCP namespace
  • Wait for a reconciliation; the deleted resources should not be recreated
  • Unpause it and wait for recreation.
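The expected behaviour in the steps above amounts to a pause check at the start of reconciliation. The sketch below assumes that PausedUntil holds either the string "true" or an RFC 3339 timestamp; that interpretation, and the function name, are assumptions for illustration rather than the controller's actual code.

```python
from datetime import datetime, timezone


def is_paused(paused_until, now):
    """Return True if reconciliation should be skipped at time `now`."""
    if not paused_until:
        return False
    value = paused_until.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    try:
        # Accept a trailing "Z" as UTC, as RFC 3339 timestamps often use it.
        deadline = datetime.fromisoformat(paused_until.strip().replace("Z", "+00:00"))
    except ValueError:
        # An unparseable value is treated as "not paused" in this sketch.
        return False
    if deadline.tzinfo is None:
        deadline = deadline.replace(tzinfo=timezone.utc)
    return now < deadline
```

A controller would call this before recreating the AWSEndPointService or the Service and return early while it yields True.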

DoD:

If a NodePool is changed from using .replicas to autoscaler min/max and the min is set beyond the current replicas, the machineDeployment might be left in a state that is not suitable for autoscaling. This requires the consumer to ensure the min is <= the current replicas, which is poor UX. Ideally we should automate this.
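One way to automate this is to clamp the current replica count into the autoscaler range when switching modes. The following is a sketch under that assumption, with a hypothetical function name; it is not the NodePool controller's implementation.

```python
def initial_autoscaled_replicas(current, min_replicas, max_replicas):
    """Clamp the current replica count into [min_replicas, max_replicas]."""
    if min_replicas > max_replicas:
        raise ValueError("min must not exceed max")
    return max(min_replicas, min(current, max_replicas))
```

For example, a NodePool with 2 replicas that moves to min 3 / max 5 would start at 3, so the consumer no longer has to adjust the replica count by hand first.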

The HyperShift operator deployment fails when we try to deploy it on the RootCI server, which has PSA enabled. We therefore need to make the HyperShift operator deployment compliant with the restricted PSA profile.

Event:

0s          Warning   FailedCreate        replicaset/operator-66cc5794c9       (combined from similar events): Error creating: pods "operator-66cc5794c9-k2sq7" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "operator" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "operator" must set securityContext.capabilities.drop=["ALL"]), seccompProfile (pod or container "operator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") 
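The three violations named in the event can be checked against a container's securityContext as in the sketch below. It is illustrative only: the function name is hypothetical, it inspects only the container-level securityContext (the event notes that seccompProfile may also be set at pod level), and it covers only the three checks reported above, not the full restricted profile.

```python
def restricted_psa_violations(container_security_context):
    """List which of the three reported checks a securityContext fails."""
    sc = container_security_context or {}
    violations = []
    if sc.get("allowPrivilegeEscalation") is not False:
        violations.append("allowPrivilegeEscalation != false")
    drops = (sc.get("capabilities") or {}).get("drop") or []
    if "ALL" not in drops:
        violations.append("unrestricted capabilities")
    profile_type = (sc.get("seccompProfile") or {}).get("type")
    if profile_type not in ("RuntimeDefault", "Localhost"):
        violations.append("seccompProfile")
    return violations
```

A securityContext with `allowPrivilegeEscalation: false`, `capabilities.drop: ["ALL"]`, and `seccompProfile.type: RuntimeDefault` yields an empty list.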

OCP components could change their image key in the release payload, which might not be immediately visible to us and would break Hypershift. 

 
DOD:
Validate release contains all the images required by Hypershift and report missing images in a condition
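A rough sketch of this validation, assuming the required image keys and the keys present in the release payload are both available as collections of strings. The condition type `ReleaseImagesAvailable` and the dictionary layout are hypothetical names used only for the example.

```python
def release_images_condition(required, release_images):
    """Build a condition-like dict reporting image keys missing from a release."""
    missing = sorted(set(required) - set(release_images))
    if missing:
        return {
            "type": "ReleaseImagesAvailable",  # hypothetical condition type
            "status": "False",
            "message": "missing images: " + ", ".join(missing),
        }
    return {"type": "ReleaseImagesAvailable", "status": "True", "message": ""}
```

Reporting the missing keys by name makes a renamed image key in the payload visible immediately instead of surfacing later as a failed deployment.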

AC:

We have a connectDirectlyToCloudAPIs flag in the konnectivity socks5 proxy to dial cloud providers directly without going through konnectivity.

This introduces another path for exceptions: https://github.com/openshift/hypershift/pull/1722

We should consolidate both by continuing to use connectDirectlyToCloudAPIs until there is a reason not to.

 

AWS has a hard limit of 100 OIDC providers globally. 
Currently each HostedCluster created by e2e creates its own OIDC provider, which results in hitting the quota limit frequently and causing the tests to fail as a result.

 
DOD:
Only a single OIDC provider should be created and shared between all e2e HostedClusters. 

Most of our condition statuses are driven by the programmatic output of reconciliation loops.

E.g: the HostedCluster available

  • depends on kas, etcd and infra conditions.
  • For kas/etcd we check that the Deployment/StatefulSet resource is healthy

This is a good signal for day 1, but we might be missing relevant real state of the world for day 2. E.g:

  • Do we flip HCAvailable condition if our ingress controller is deleted/unhealthy?
  • Do we flip HCAvailable condition if a Route resource is deleted?
  • Do we flip HCAvailable condition if the LB is deleted out of band?

DoD:

Reproduce and review the behaviour of the examples above.

Consider adding additional knowledge for computing the HCAvailable condition: health-check the expected holistic day 2 e2e behaviour rather than the particular status of subcomponents.

E.g. actually query the kas through the url we expose

This is a placeholder to capture the CI changes needed at every release cut.

There are a few places in our CI config which require pinning to the new release at every release cut:

DOD:

Make sure we have this documented in the hypershift repo and that everything needed is done for the current release branch.

DoD:

At the moment if the input etcd kms encryption (key and role) is invalid we fail transparently.

We should check that both key and role are compatible/operational for a given cluster and fail in a condition otherwise

Background

This is intended to be a place to capture general "tech debt" items so they don't get lost. I very much doubt that this will ever get completed as a feature, but that's okay; the desire is more that stories get pulled out of here and put with feature work "opportunistically" when it makes sense.

Goal

If you find a "tech debt" item, and it doesn't have an obvious home with something else (e.g. with MCO-1 if it's metrics and alerting) then put it here, and we can start splitting these out/marrying them up with other epics when it makes sense.

 

Background

As part of https://github.com/openshift/machine-config-operator/pull/3270, Joel moved us to ConfigMapsLeases for our lease because the old way of using ConfigMaps was being deprecated in favor of the "Leases" resource.

ConfigMapsLeases were meant to be the first phase of the migration, eventually ending up on LeasesResourceLock, so at some point we need to finish.

Since we've already had ConfigMapsLeases for at least a release, we should now be able to complete the migration by changing the type of resource lock here https://github.com/openshift/machine-config-operator/blob/4f48e1737ffc01b3eb991f22154fc3696da53737/cmd/common/helpers.go#L43 to LeasesResourceLock

We should probably also clean up after ourselves so nobody has to open something like https://bugzilla.redhat.com/show_bug.cgi?id=1975545 again

(Yes this really should be as easy as it looks, but someone needs to pay attention to make sure something weird doesn't happen when we do it)

 

Some supporting information is here, if curious:

https://github.com/kubernetes/kubernetes/pull/106852

https://github.com/kubernetes/kubernetes/issues/80289

 

Goal

Finish the lease lock type migration by changing the lease lock type to LeasesResourceLock

Done When

  • MCO is no longer using ConfigMapsLeases
  • No weird/unexplainable timings/errors are introduced
  • Tests pass

 

 

Currently, adding a forcefile (/run/machine-config-daemon-force) starts an update, but it does not necessarily perform a complete upgrade: if the update fits one of our carve-outs for a rebootless update, or the OSImageURL is unchanged, no OS update happens. A few customers have clusters stuck in a quasi state that need a complete OS upgrade, even if the "conditions" on the cluster indicate that this isn't necessary.

The goal of this story is to update this behavior so that the forcefile also triggers an OS upgrade (executing applyOSChanges() in its entirety).

This has been broken for a long time, and the actual functionality is quite useless. We have put out a deprecation notice in 4.12, and now we should look to remove it.

The MCD reads/writes items to the journal. We should look to remove unnecessary reads from the journal and just log important info, so a broken journal doesn't break the MCD.

 

Spun off of https://issues.redhat.com/browse/OCPBUGS-8716

Requires MCO-595 and MCO-596 to be finished first

The MCD today writes pending configs to the journal, which the next boot then reads to determine the state.

 

This is mostly redundant since we also read/write the updated config to disk. The pending config was originally implemented very early on, and today causes more trouble than it helps, since the journal could be broken, or the config could not be found, which is very troublesome to debug and recover.

 

We should remove the workflow entirely

Epic Goal

As an OpenShift infrastructure owner, I want to use the Zero Touch Provisioning flow with RHACM, where RHACM is in a dual-stack hub cluster and the deployed cluster is an IPv6-only cluster.

Why is this important?

Currently ZTP doesn't work when provisioning IPv6 clusters from a dual-stack hub cluster. We have customers who aim to deploy new clusters via ZTP that don't have IPv4 and work exclusively over IPv6. To enable this use case, work on the metal platform has been identified as a requirement.

Dependencies

Converge IPI and ZTP Boot Flows: METAL-10

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

     

 

Epic Goal

  • Currently, the event scrape service polls events from assisted service, enriches them, and pushes them to Elastic.
    To also support sending events from on-prem environments, we need to remodel the data pipelines towards a push-based model. Since the SaaS environment will benefit from this approach as well, we will seek a model that is as unified as possible

Why is this important?

  • Support on-prem environments
  • Increase efficiency (we'll stop performing thousands of requests per minute to the SaaS)
  • Enhance resilience (right now if something fails, we have a relatively short time window to fix it before we lose data)

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Make a decision on what design to implement (internal)
  2. Authorization with pull-secret (TBD, is there a ticket for this? Oved Ourfali )
  3. RH Pipelines RHOSAK implementation

Previous Work (Optional):

  1. First analysis
  2. We then discussed the topic extensively: Riccardo Piccoli Igal Tsoiref Michael Levy liat gamliel Oved Ourfali Juan Hernández 
  3. We explored already existing systems that would support our needs, and we found that RH Pipelines almost exactly matches them:
  • Covers auth needed from on prem to the server
  • Accepts HTTP-based payload and files to be uploaded (very handy for bulk upload from on-prem)
  • Lacks routing: limits our ability to scale data processing horizontally
  • Lacks infinite data retention: the original design has kafka infinite retention as key characteristic
  4. We need to evaluate requirements and options we have to implement the system. Another analysis with a few alternatives

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Roadmap

  • Stream events from service to kafka
  • Enable feature flag hiding this feature in staging to gather data
  • Read events and project them to elasticsearch
  • Process on-prem events and re-stream them into the kafka stream
  • Adapt CCX export

We are missing event notifications on creation of some resources. We need to make sure they are notified.

Due to the change of Kafka provider, SASL/PLAIN is no longer supported.

We now need SASL/SCRAM for app-interface integrated MSK.

Epic Goal

  • Assisted installer should give a formal infraenv kube-api for adding additional certs to trust

Why is this important?

  • Users that install OCP on servers that communicate through transparent proxies must trust the proxy's CA for the communication to work
  • The only way users can currently do that is by using both infraenv ignition overrides and install-config overrides. These are generic messy APIs that are very error prone. We should give users a more formal, simpler API to achieve both at the same time. 

Scenarios

  1. Day 1 - discovery ISO OS should trust the bundles the user gives us as an infraenv creation param (either via REST or kube-api). A cluster formed from hosts should trust all certs from all infraenvs of all of its hosts combined.
  2. Day 2 - obviously we don't want to modify existing clusters to trust the cert bundles of infra-envs of hosts that want to join them, so we will simply not handle this case. 
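The Day 1 rule (a cluster trusts the certs of all infraenvs of all its hosts combined) can be sketched as a union over the hosts' infraenvs. The function name and the input shapes (host to infraenv name, infraenv name to PEM bundle text) are assumptions for illustration and do not reflect the assisted-service data model.

```python
def cluster_trust_bundle(host_infraenvs, infraenv_bundles):
    """Combine the cert bundles of every infraenv that contributed a host.

    host_infraenvs: mapping of host name -> infraenv name
    infraenv_bundles: mapping of infraenv name -> PEM bundle text (or None)
    """
    combined = []
    for infraenv in host_infraenvs.values():
        bundle = (infraenv_bundles.get(infraenv) or "").strip()
        # Skip infraenvs without a bundle and avoid repeating a bundle
        # when several hosts come from the same infraenv.
        if bundle and bundle not in combined:
            combined.append(bundle)
    return "\n".join(combined)
```

Infraenvs that contributed no hosts to the cluster do not influence the result, which matches the decision not to modify trust for Day 2 hosts.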

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature goal (what are we trying to solve here?)

Do not change the cluster platform in the background due to networking configuration.

DoD (Definition of Done)

Remove user_managed_networking from assisted service

 

Does it need documentation support?

Yes

Reasoning (why it’s important?)

  • We have no actual use for the user_managed_networking field, and the coupling between UMN and platform_type keeps causing issues over time (especially UMN and the none platform)

 

 

Allow the user to decide which platform is compatible with each feature, especially UMN and CMN.

e.g. on the networking step, when a platform is being selected, the UI needs to know whether to show the user the UMN or CMN networking configuration without taking cluster.user_managed_networking into consideration.

 

The goal of this task is to give the UI the option of not using the current UMN implementation, and to give the BE the flexibility to "break" the API.

When creating a cluster in the UI, there is a checkbox that the user can set to indicate that they want to use custom manifests.

Presently this will cause the upload of an empty manifest, the presence of which is later used to determine whether the checkbox is checked or not (and whether the custom manifest tab should be shown in the UI).

This is a clunky approach that confuses the user and leads to validation issues.

This functionality needs to be changed to use a cluster tag for this purpose instead.

Presently, when creating a cluster in the UI, there is a checkbox that the user can set to indicate that they want to use custom manifests.

Presently this will cause the upload of an empty manifest, the presence of which is later used to determine whether the checkbox is checked or not (and whether the custom manifest tab should be shown in the UI).

This is a clunky approach that confuses the user and leads to validation issues.

To remedy this, we would like to give the UI team a facility to store raw JSON data containing freeform UI-specific settings for a cluster.

This PR enables that.

Feature goal (what are we trying to solve here?)

  • When using ACM/MCE with the infrastructure operator, automatically import the local cluster to enable adding nodes

DoD (Definition of Done)

When the infrastructure operator is enabled, automatically import the cluster and enable users to add nodes to the local cluster via the infrastructure operator

Does it need documentation support?

Yes, it's a new functionality that will need to be documented

Feature origin (who asked for this feature?)

Reasoning (why it’s important?)

  • Right now, to enable this flow the user needs to install MCE, enable the infrastructure operator, and follow this guide to add nodes using the infrastructure operator. We would like to make this process easier for users
  • It will automatically provide an easy start with CIM

Competitor analysis reference

  • Do our competitors have this feature? N/A

Feature usage (do we have numbers/data?)

  • We are not collecting MCE data yet
  • We were asked several times by customer support how to run this flow

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API - UI will benefit from it by having the cluster prepared for the user
  • If it's for a specific customer we should consider using AMS - Some users would like to manage the cluster locally, otherwise why did they install MCE?
  • Does this feature exist in the UI of other installers? - No

Open questions

  1. How to handle static network - infrastructure operator is not aware of how the local network was defined
  2. What should the API look like? Should the user specify a target namespace or should it be automatic?
  3. How to get the local kubeconfig? Can we use the one that the pod is using?

When assisted service is started in KubeAPI mode, we want to ensure that the local cluster is registered with ACM so that it may be managed in a similar fashion to a spoke, or to put it another way, register the Hub cluster as a Day 2 spoke cluster in ACM running on itself.

The purpose of this task is to create the secrets, agentclusterinstall, and clusterdeployment CRs required to register the hub.

As referenced in the parent Epic, the following guide details the CRs that need to be created to import a "Day 2" spoke cluster: https://github.com/openshift/assisted-service/blob/master/docs/hive-integration/import-installed-cluster.md

During this change, it should be ensured that this functionality is added to the reconcile loop of the service.

note: just a placeholder for now

 
It already happened that operators had configured Prometheus rules which aren't valid:

While we can't catch everything, it should be feasible to check for most common mistakes with the CI.

Exceptions for the following alerts can be cleared, as the corresponding Bugzilla is already fixed and released.

  • CsvAbnormalFailedOver2Min
  • CsvAbnormalOver30Min
  • InstallPlanStepAppliedWithWarnings

For the BZs that are not fixed, create new Jira OCPBUGS issues.

We added E2E tests for alerting style-guide issues in MON-1643, but a lot of components needed exceptions. We filed bugzillas for these, but we need to check on them and remove the exceptions for any that are fixed.

Epic Goal

  • Scrape Profiles was introduced as Tech Preview in 4.13; the goal is now to promote it to GA
  • Scrape Profiles Enhancement Proposal should be merged
  • OpenShift developers that want to adopt the feature should have the necessary tooling and documentation on how to do so
  • OpenShift CI should validate if possible changes in profiles that might break a profile or cluster functionality

This has no link to a planning session, as this predates our Epic workflow definition.

Why is this important?

  • Enables users to minimize the resource overhead for Monitoring.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/MON-2483

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

CMO should expose a metric, that gives insight into collection profile usage. We add this signal to our telemetry payload.

The minimum solution here is to expose a metric about the collection profile configured.
Other optional metrics could include:

  • how many ServiceMonitors implement a specific profile
  • how many SMs are "unprofiled"
  • ...?
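The optional per-profile counts listed above could be derived from ServiceMonitor labels, as in this sketch. The label key `monitoring.openshift.io/collection-profile` is an assumption here, as is representing each ServiceMonitor as a plain dictionary; CMO itself is written in Go and would expose the result as a metric.

```python
from collections import Counter

# Assumed label key that marks which collection profile a ServiceMonitor implements.
PROFILE_LABEL = "monitoring.openshift.io/collection-profile"


def service_monitors_per_profile(service_monitors):
    """Count ServiceMonitors per profile; unlabeled ones count as 'unprofiled'."""
    counts = Counter()
    for sm in service_monitors:
        labels = (sm.get("metadata") or {}).get("labels") or {}
        counts[labels.get(PROFILE_LABEL, "unprofiled")] += 1
    return dict(counts)
```

Each key of the returned mapping would become one label value of a gauge, alongside the single metric that reports the configured collection profile.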

Epic Goal

  • Today users can specify TopologySpreadConstraints for in-cluster Prometheus, Alertmanager, and Thanos Ruler pods.
  • We should support setting these constraints on all pods that we deploy

Why is this important?

  • Users want to constrain pod scheduling based on their infrastructure. Currently users have the option to use
    • Node affinity. However we do not expose that field and we use it for our own purpose.
    • Node taints. Taints and tolerations lack the flexibility to specify preferred pod locations.
    • Node selectors. Node selectors have the same inflexibility as tolerations. If no node can be found, the pod is not scheduled.
  • Exposing TSC for all pods would allow users to control pod scheduling according to their own or preexisting infrastructure labels, while at the same time allowing the scheduler to deploy pods even if the constraints cannot be fulfilled.

Scenarios

  1. A user wants the Monitoring pods preferably scheduled on nodes labeled Infra, but wants them scheduled anywhere in case no nodes carry that label or they are discarded during scheduling for other reasons.
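The "schedule anyway" part of this scenario corresponds to the `whenUnsatisfiable: ScheduleAnyway` setting of a Kubernetes topology spread constraint. The sketch below builds such a constraint as a plain dictionary; the helper name, the topology key, and the pod labels in the usage example are hypothetical, and how CMO exposes the field per component is not shown.

```python
def soft_spread_constraint(topology_key, match_labels, max_skew=1):
    """Build a topology spread constraint that never blocks scheduling."""
    return {
        "maxSkew": max_skew,
        "topologyKey": topology_key,
        # ScheduleAnyway makes the constraint a preference rather than a
        # hard requirement, so pods still land somewhere if it cannot be met.
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": dict(match_labels)},
    }


constraint = soft_spread_constraint(
    "example.com/infra-zone",          # hypothetical node label key
    {"app.kubernetes.io/name": "kube-state-metrics"},
)
```

Using `DoNotSchedule` instead would turn the preference into a hard requirement, which is the inflexibility the epic wants to avoid.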

Acceptance Criteria

  • Users can configure TopologySpreadConstraints for all pods that CMO deploys
  • Unit tests are in place to confirm the config is propagated to the pod artifact
  • Documentation is changed to make clear which components can be configured with TopologySpreadConstraints

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Give users a TopologySpreadConstraints field in the PrometheusRestrictedConfig field and propagate this to the pod that is created.

Give users a TopologySpreadConstraints field in the KubeStateMetricsConfig field and propagate this to the pod that is created.

Give users a TopologySpreadConstraints field in the AlertmanagerUserWorkloadConfig field and propagate this to the pod that is created.

Give users a TopologySpreadConstraints field in the OpenShiftStateMetricsConfig field and propagate this to the pod that is created.

Give users a TopologySpreadConstraints field in the TelemeterClientConfig field and propagate this to the pod that is created.

Give users a TopologySpreadConstraints field in the PrometheusOperatorConfig field and propagate this to the pod that is created. This will take care of both the incluster PO and UWM PO.

Give users a TopologySpreadConstraints field in the ThanosQuerierConfig field and propagate this to the pod that is created.

Epic Goal

  • CMO currently has several ServiceMonitor and Rule objects that belong to other components
  • We should migrate these away from the CMO code base to the owning teams

Why is this important?

  • The respective component teams are the experts for their components and can more accurately decide on how to alert and what metrics to expose.
  • Issues with these artifacts get routed to the correct teams.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

IIUC, before using hard affinities for HA components, we needed this to avoid scheduling problems during upgrade.

See https://github.com/openshift/cluster-monitoring-operator/pull/1431#issuecomment-960845938

Now that 4.8 is no longer supported, we can get rid of this logic to simplify the code.

 

  • I don't know whether "no longer supported" also applies to 4.8 clusters that may want to upgrade someday (in which case we would never be able to get rid of the code).
  • Maybe keep this mechanism to influence upgrades somewhere (maybe git history is sufficient), we may need to re-use it in the future.

 

This will reduce technical debt and improve CMO learning curve.

To support the transition from soft anti-affinity to hard anti-affinity (4.9 > 4.10), CMO gained the ability to rebalance PVCs for Prometheus pods. The capability isn't required anymore so we can safely remove it.

Proposed title of this feature request

Enable the processes_linux collector in node_exporter

What is the nature and description of the request?

Enable node_exporter's processes_linux collector to allow customers to monitor the number of PIDs on OCP nodes.

Why does the customer need this? (List the business requirements)

They need to be able to monitor the number of PIDs on the OCP nodes.

List any affected packages or components.

cluster-monitoring-operator, node-exporter

We will add a section for "processes" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled"; the default value is false.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    nodeExporter:
      collectors:
        # enable a collector which is disabled by default
        processes:
          enabled: true

 

Epic Goal

Why is this important?

Scenarios
1. …

Acceptance Criteria

  • (Enter a list of Acceptance Criteria unique to the Epic)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

The following issues need to be taken care of on cluster deletion with resource reuse flags.

  1. Currently the deletion tries to remove the DHCP server on an existing PowerVS instance; it needs to reuse the existing one to keep it simple.
  2. When an existing VPC is reused, the load balancer is not removed.
  1. No error is thrown on new cloud connection creation when two already exist.
  2. No error is reported for a failed powervs job.

https://github.com/openshift/cluster-image-registry-operator/commit/eac9584446660721c5a31f54fd342f01415a8e92

 

With the above commit in 4.13, storage is not handled for the powervs platform, which prevents the cluster image-registry operator from being installed.

 

Need to handle powervs platform here.

The option discussed is to use a PVC with CSI.

If that is not feasible, we will try to use IBMCOS, as used by the satellite team.

The error below is returned while deleting infra with a failed powervs instance:

 

Failed to destroy infrastructure        {"error": "error in destroying infra: provided cloud instance id is not in active state, current state: failed"}

 

The create infra process also needs to handle the case where the powervs instance goes into the failed state. Currently it loops, printing the same statement while waiting for the instance to become active:

 

2022-11-11T13:03:01+05:30       INFO    hyp-dhar-osa-2  Waiting for cloud instance to up        {"id": "crn:v1:bluemix:public:power-iaas:osa21:a/c265c8cefda241ca9c107adcbbacaa84:cd743ba9-195b-46ba-951e-639f97f443d2::", "state": "failed"}
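
One way to avoid the endless wait shown in the log above is to treat "failed" as a terminal state and stop polling as soon as it is seen. The following is a minimal sketch of that idea, not the hypershift implementation: `wait_for_active`, `get_state` and `max_polls` are hypothetical names, and only the "active" and "failed" state strings come from the messages above.

```python
from typing import Callable

# Assumption: "failed" is the only terminal state known from the log above.
TERMINAL_FAILURE_STATES = {"failed"}


class InstanceFailedError(RuntimeError):
    """Raised when the cloud instance reaches a state it cannot leave."""


def wait_for_active(get_state: Callable[[], str], max_polls: int = 60) -> int:
    """Poll get_state until it returns "active"; return the number of polls used.

    Raises InstanceFailedError as soon as a terminal state is observed,
    instead of repeating the same "waiting" message until a timeout.
    """
    for attempt in range(1, max_polls + 1):
        state = get_state()
        if state == "active":
            return attempt
        if state in TERMINAL_FAILURE_STATES:
            raise InstanceFailedError(f"cloud instance entered terminal state: {state}")
        # A real implementation would sleep between polls here.
    raise TimeoutError(f"cloud instance not active after {max_polls} polls")
```

The same terminal-state check could be reused on the destroy path, so that a failed instance is cleaned up rather than rejected for not being active.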

With the latest changes, CAPI expects v1beta2 APIs by default. The CAPI API needs to be upgraded from v1beta1 to v1beta2 in hypershift.

When resources run short in the management cluster as new apps are deployed, the cloud-controller-manager pod in an existing HC's control plane is evicted.

 

Flags similar to these from the create command (https://github.com/openshift/hypershift/blob/main/cmd/cluster/powervs/create.go#L57toL61) are missing from the destroy command, so the infra destroy functionality does not receive the flags it needs to properly destroy infra that uses existing resources.

Issue and Design: https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/design/shared_gw_dgp.md 

Upstream PR: https://github.com/ovn-org/ovn-kubernetes/pull/3160 

Document that describes how to use the mgmt port VF rep for hardware offloading: https://docs.google.com/document/d/1yR4lphjPKd6qZ9sGzZITl0wH1r4ykfMKPjUnlzvWji4/edit# 

==========================================================================

After the upstream PR has been merged, we need to find a way to make the user experience of configuring the mgmt port VF rep as streamlined as possible. The basic streamlining that we have committed to is improving the config map to only require the DP resource name with the MGMT VF in the pool. OVN-K will also need to make use of DP resources.

Description of problem:

- Add support for Dynamic Creation Of DPU/Smart-NIC Daemon Sets and Device-Plugin Resources For OVN-K
- DPU/Smart-NIC Daemonsets need a way to be dynamically created via specific node labels
- The config map needs to support device plugin resources (namely SR-IOV) to be used for the management port configuration in OVN-K
- This should enhance the performance of these flows (planned to be GA-ed in 4.14) for Smart-NIC
   5-a: Pod -> NodePort Service traffic (Pod Backend - Same Node)
   4-a: Pod -> Cluster IP Service traffic (Host Backend - Same Node)

Version-Release number of selected component (if applicable):

4.14.0 (Merged D/S) 
https://github.com/openshift/ovn-kubernetes/commit/cad6ed35183a6a5b43c1550ceb8457601b53460b
https://github.com/openshift/cluster-network-operator/commit/0bb035e57ac3fd0ef7b1a9451336bfd133fa8c1e 

How reproducible:

Never been supported in the past.

Steps to Reproduce:

Please follow the documentation on how to configure this on NVIDIA Smart-NICs in OvS HWOL mode.
 - https://issues.redhat.com/browse/NHE-550 

Please also check the OVN-K daemon sets. There should be a new "smart-nic" daemon set for OVN-K.
Please check on the nodes that the ovn-k8s-mp0_0 interface exists alongside the ovn-k8s-mp0 interface.

Actual results:

Iperf3 performance:
  5-a: Pod -> NodePort Service traffic (Pod Backend - Same Node)    => ~22.5 Gbits/sec
  4-a: Pod -> Cluster IP Service traffic (Host Backend - Same Node) => ~22.5 Gbits/sec

Expected results:

Iperf3 performance:
 5-a: Pod -> NodePort Service traffic (Pod Backend - Same Node)    => ~29 Gbits/sec
 4-a: Pod -> Cluster IP Service traffic (Host Backend - Same Node) => ~29 Gbits/sec
As you can see, we can gain an additional 6.5 Gbits/sec with these service flows.
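
The gain can be checked directly from the iperf3 figures above. This short sketch only restates that arithmetic; `throughput_gain` is an illustrative helper, and the relative figure is derived here rather than quoted from the measurements.

```python
def throughput_gain(before_gbps: float, after_gbps: float) -> tuple:
    """Return the absolute gain in Gbits/sec and the gain relative to the baseline in percent."""
    absolute = after_gbps - before_gbps
    relative_pct = absolute / before_gbps * 100
    return absolute, relative_pct


# ~22.5 Gbits/sec (actual) versus ~29 Gbits/sec (expected), for both flows 5-a and 4-a.
abs_gain, rel_gain = throughput_gain(22.5, 29.0)
print(f"+{abs_gain:.1f} Gbits/sec ({rel_gain:.1f}% over baseline)")
# prints "+6.5 Gbits/sec (28.9% over baseline)"
```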

Additional info:

https://docs.google.com/spreadsheets/d/1LHY-Af-2kQHVwtW4aVdHnmwZLTiatiyf-ySffC8O5NM/edit#gid=88193790
https://github.com/ovn-org/ovn-kubernetes/pull/3160

Epic Goal

NVIDIA and Microsoft have partnered to provide instances on Azure that use the security of the NVIDIA Hopper GPU to create a Trusted Execution Environment (TEE) where the data is encrypted while being processed. This is achieved by using AMD's SEV-SNP extension, alongside the NVIDIA Hopper confidential computing capabilities.

The virtual machine created on Azure is the TEE, so any workload running within is protected from the Azure host. This is a good approach for customers to protect their data when running OpenShift on Azure, but it doesn't protect the data in a container from the OpenShift node. In this epic, we focus on protecting the OpenShift node from the Azure host.

Why is this important?

Running workloads in CSP virtual machines doesn't protect the data from an attack on the virtualization host itself. If an attacker manages to read the host memory, they can get access to the virtual machines' data, which can break confidentiality or integrity. In the context of AI/ML, both the data and the model represent intellectual property and sensitive data, so customers will want to protect them from leaks.

NVIDIA and Microsoft are key partners for Red Hat for AI/ML in the public cloud. Being able to run workloads encrypted at rest, in transport and in process will allow creating a trusted solution for our customers, spanning from self-managed OpenShift clusters to Azure Red Hat OpenShift (ARO) clusters. This will strengthen OpenShift as the Kubernetes distribution of choice in public clouds.

Scenario

  1. As an OpenShift administrator, I want to add OpenShift nodes with NVIDIA GPU and confidential computing enabled. These nodes are deployed via a MachineSet, like any other node, i.e. the experience is identical to normal nodes.
  2. As an Azure Red Hat OpenShift (ARO) customer, I want to add OpenShift nodes with NVIDIA GPU and confidential computing enabled. These nodes are deployed with the same mechanism as other nodes in ARO.

Acceptance Criteria

  • CI - Must be running successfully with tests automated.
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Docs - Add confidential VMs configuration to the OpenShift documentation.
  • Marketing - Joint blog post with Microsoft and NVIDIA for self-managed cluster.
  • Marketing - Joint blog post with Microsoft and NVIDIA for ARO cluster.

Dependencies (internal and external)

  • NVIDIA signed open source kernel driver for RHEL
  • Attestation capabilities in the NVIDIA open source kernel driver for RHEL
  • NVIDIA GPU Operator with precompiled driver container

Previous Work & References:

 

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Add support for OCP cluster creation with Confidential VMs on Azure to the OpenShift installer. The additional configuration options required are:

  • SecurityEncryptionType (enum)
  • SecureVirtualMachineEncryptionSetID (string)
  • SecureBoot (enum)
  • VirtualTrustedPlatformModule (enum)

In addition, in order to create a Confidential VM in Azure, the OS image needs to have its Security Type defined as "Confidential VM" or "Confidential VM Supported".

The changes required are:

  • add Confidential VM options to the install-config schema
  • enable Azure Confidential VM instance type families in install-config validation
  • add Confidential VM options to the Azure Machine Pool type
  • add Confidential VM options to Azure terraform variables
  • add Confidential VM options to the bootstrap and master Azure terraform modules
  • update the Azure terraform provider to support the Confidential VM Security Types for the OS image definitions
  • add the Confidential VM Security Type to the Azure terraform vnet module image

Resources:

User Story

As an OCM team member, I want to provide support for cluster service, and improve the usability and interoperability of Hypershift.

Acceptance Criteria

  • All the things that have to be done for the feature to be ready to
    release.

Default Done Criteria

  • All existing/affected SOPs have been updated.
  • New SOPs have been written.
  • Internal training has been developed and delivered.
  • The feature has both unit and end to end tests passing in all test
    pipelines and through upgrades.
  • If the feature requires QE involvement, QE has signed off.
  • The feature exposes metrics necessary to manage it (VALET/RED).
  • The feature has had a security review.
  • Contract impact assessment.
  • Service Definition is updated if needed.
  • Documentation is complete.
  • Product Manager signed off on staging/beta implementation.

Dates

Integration Testing:
Beta:
GA:

Current Status

GREEN | YELLOW | RED
GREEN = On track, minimal risk to target date.
YELLOW = Moderate risk to target date.
RED = High risk to target date, or blocked and need to highlight potential
risk to stakeholders.

References

Links to Gdocs, github, and any other relevant information about this epic.

The last version of OpenShift on RHV should target OpenShift 4.13. There are several factors for this requirement.

  1. Customers are regularly keeping up with OpenShift releases on RHV.
  2. The general guidance for a major deprecation like removal of a supported platform is 3 releases. Since we are announcing this with OpenShift 4.11, the last supported version should target OCP 4.13.
  3. From a timing point of view, OCP 4.13 is coming very soon. OCP releases ship every 4 months: OCP 4.12 is due in Q4 2022 and OCP 4.13 in Q1 2023, just six months away.
  4. OCP support consists of six months of full support plus 12 months of maintenance support. This aligns the end of maintenance support for RHV in Aug 2024 with that of OCP 4.13 (approx. Sep 2024).
  5. No new features are planned or expected for OCP on RHV after OCP 4.12. We have no plans to reverse anything that we have already NAKed.
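
The alignment in item 4 can be sketched as simple month arithmetic. The GA month used below is an assumption (the text only says Q1 2023; March 2023 is chosen here because it reproduces the "approx Sep 2024" figure), and `add_months` is an illustrative helper.

```python
from datetime import date

FULL_SUPPORT_MONTHS = 6
MAINTENANCE_SUPPORT_MONTHS = 12


def add_months(start: date, months: int) -> date:
    """Shift a date forward by whole months, returning the first day of the resulting month."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


# Assumption: OCP 4.13 GA at the end of Q1 2023.
assumed_ga = date(2023, 3, 1)
end_of_maintenance = add_months(assumed_ga, FULL_SUPPORT_MONTHS + MAINTENANCE_SUPPORT_MONTHS)
print(end_of_maintenance.isoformat())
```

Under that assumption the OCP 4.13 maintenance window ends about one month after RHV's own maintenance phase ends on August 31, 2024.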

previous: The last OCP on RHV version will be 4.13. Remove RHV from OCP in OCP 4.14.

https://access.redhat.com/support/policy/updates/rhev

On August 31, 2022, Red Hat Virtualization enters the maintenance support phase, which runs until August 31, 2024. In accordance, Red Hat Virtualization (RHV) will be deprecated beginning with OpenShift v4.13. This means that RHV will be supported through OCP 4.13. RHV will be removed from OpenShift in OpenShift v4.14.

We will use this to address tech debt in OLM in the 4.10 timeframe.

 

Items to prioritize are:

CI e2e flakes

 

Update the downstream READMEs to better describe the downstreaming process.

Include help in the sync scripts as necessary.

It has been determined that "make verify" is a necessary part of the downstream process. The scripts that do the downstreaming do not run this command.

Add "make verify" somewhere in the downstreaming scripts, either as a last step in sync.sh or per commit (which might be both necessary yet overkill) in sync_pop_candidate.sh.

The client cert/key pair is a way of authenticating that will function even without live kube-apiserver connections so we can collect metrics if the kube-apiserver is unavailable.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Any ERRORs produced by TuneD will result in Degraded Tuned Profiles.  Clean up upstream and NTO/PPC-shipped TuneD profiles and add ways of limiting the ERROR message count.
  • Review the policy of restarting TuneD on errors every resync period.  See: OCPBUGS-11150

Why is this important?

  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/PSAP-908

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

The CU cluster of the Mavenir deployment has cluster-node-tuning-operator in a CrashLoopBackOff state and does not apply the performance profile.

Version-Release number of selected component (if applicable):

4.14rc0 and 4.14rc1

How reproducible:

100%

Steps to Reproduce:

1. Deploy CU cluster with ZTP gitops method
2. Wait for Policies to be compliant
3. Check worker nodes and cluster-node-tuning-operator status 

Actual results:

Nodes do not have the performance profile applied.
cluster-node-tuning-operator is crashing with the following in its logs:

E0920 12:16:57.820680       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(nil), concrete:(*runtime._type)(nil), asserted:(*runtime._type)(0x1e68ec0), missingMethod:""} (interface conversion: interface is nil, not v1.Object)
goroutine 615 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c98c20?, 0xc0006b7a70})
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000d49500?})
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1c98c20, 0xc0006b7a70})
        /usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/cluster-node-tuning-operator/pkg/util.ObjectInfo({0x0?, 0x0})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/util/objectinfo.go:10 +0x39
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*ProfileCalculator).machineConfigLabelsMatch(0xc000a23ca0?, 0xc000445620, {0xc0001b38e0, 0x1, 0xc0010bd480?})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/profilecalculator.go:374 +0xc7
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*ProfileCalculator).calculateProfile(0xc000607290, {0xc000a40900, 0x33})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/profilecalculator.go:208 +0x2b9
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).syncProfile(0xc000195b00, 0x0?, {0xc000a40900, 0x33})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:664 +0x6fd
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).sync(0xc000195b00, {{0x1f48661, 0x7}, {0xc000000fc0, 0x26}, {0xc000a40900, 0x33}, {0x0, 0x0}})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:371 +0x1571
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).eventProcessor.func1(0xc000195b00, {0x1dd49c0?, 0xc000d49500?})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:193 +0x1de
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).eventProcessor(0xc000195b00)
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:212 +0x65
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x224ee20, 0xc000c48ab0}, 0x1, 0xc00087ade0)
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc0004e6710?)
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004e67d0?, 0x91af86?, 0xc000ace0c0?)
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).run
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:1407 +0x1ba5
panic: interface conversion: interface is nil, not v1.Object [recovered]
        panic: interface conversion: interface is nil, not v1.Object

Expected results:

cluster-node-tuning-operator is functional, performance profiles applied to worker nodes

Additional info:

There is no issue on a DU node of the same deployment coming from the same repository; the DU node is configured as requested and its cluster-node-tuning-operator is functioning correctly.

must gather from rc0: https://drive.google.com/file/d/1DlzrjQiKTVnQKXdcRIijBkEKjAGsOFn1/view?usp=sharing
must gather from rc1: https://drive.google.com/file/d/1qSqQtIunQe5e1hDVDYwa90L9MpEjEA4j/view?usp=sharing

performance profile: https://gitlab.cee.redhat.com/agurenko/mavenir-ztp/-/blob/airtel-4.14/policygentemplates/group-cu-mno-ranGen.yaml

Revived from OCSCNV-56 which was archived.

We need a solution that supports OCS encrypted volumes for CNV so that smart cloning across namespaces can be achieved for encrypted volumes.

The problem with encrypted OCS volumes is that the secrets are stored in the original namespace and get left behind. (The cloned metadata still points to the original namespace.)

The annotation required is `cdi.kubevirt.io/clone-strategy=copy`.

Tasks:

  • add annotation to encrypted default rbd storageclass.
  • add annotation to ui created encrypted rbd storageclass.
  • add KCS/document to manually annotate previously ui created encrypted rbd storageclass.
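
The annotation tasks above amount to setting one key on the StorageClass metadata. The sketch below shows that step on a StorageClass represented as a plain dictionary. It is illustrative only: `annotate_encrypted_storage_class` is a hypothetical helper, and detecting encryption through an `encrypted: "true"` parameter is an assumption about how the storage class is written.

```python
import copy

CLONE_STRATEGY_KEY = "cdi.kubevirt.io/clone-strategy"


def annotate_encrypted_storage_class(storage_class: dict) -> dict:
    """Return a copy of the StorageClass with the clone-strategy annotation set to "copy".

    Only storage classes marked as encrypted are annotated; others are returned unchanged.
    """
    sc = copy.deepcopy(storage_class)
    params = sc.get("parameters") or {}
    # Assumption: encryption is signalled by parameters.encrypted == "true".
    if str(params.get("encrypted", "")).lower() == "true":
        annotations = sc.setdefault("metadata", {}).setdefault("annotations", {})
        annotations[CLONE_STRATEGY_KEY] = "copy"
    return sc
```

For storage classes created earlier through the UI, the same key and value would be applied manually, which is what the KCS/document task covers.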

PPT Link: https://ibm-my.sharepoint.com/:p:/p/sanjal_dhir_katiyar/ESnXPI-TwmpPn3D9nliC6TMBKH1X_C7Xvth_tNXJZc3ubQ?e=QufNMW

Need to update: "console.storage-class/provisioner" extension.
Ref: https://github.com/openshift/console/pull/11931

Something like:

"properties" : {
  "CSI" : {
    . .
    "parameter" : { . . }
    "annotations" : {
      [annotationKey: string] : {
        "value" ?: string,
        "annotate" ?: CodeRef<(arg) => boolean | boolean>
      }
    . .

We can do the same for `properties.others.annotations` as well (not a requirement, but to have consistency with `properties.csi.annotations`).

Epic Goal

OpenShift Container Platform is shipping a finely tuned set of alerts to inform the cluster's owner and/or operator of events and bad conditions in the cluster.

Runbooks are associated with alerts and help SREs take action to resolve an alert. This is critical to share engineering best practices following an incident.

Goal 1: Current alerts/runbooks for hypershift need to be evaluated to ensure we have sufficient coverage before hypershift hits GA.

Goal 2: Actionable runbooks need to be provided for all alerts; therefore, we should attempt to cover as many as possible in this epic.

Goal 3: Continue adding alerts/runbooks to cover existing OVN-K functionality.

This epic will NOT cover refactors needed to alerts/runbooks due to new arch (OVN IC).

Why is this important?

In order to scale, we (engineering) must share our institutional knowledge.

In order for SREs to respond to alerts, they must have the knowledge to do so.

SD needs to have actionable runbooks to respond to alerts; otherwise, they will require engineering to engage more frequently.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an administrator of a cluster utilizing AWS STS with a public S3 bucket OIDC provider, I would like a documented procedure with steps that can be followed to migrate to a private S3 bucket with CloudFront Distribution so that I do not have to recreate my cluster.

ccoctl documentation including parameter `--create-private-s3-bucket`: https://github.com/openshift/cloud-credential-operator/blob/a8ee8a426d38cca3f7339ecd0eac88f922b6d5a0/docs/ccoctl.md

Existing manual procedure for configuring private S3 bucket with CloudFront Distribution: https://github.com/openshift/cloud-credential-operator/blob/master/docs/sts-private-bucket.md

https://coreos.slack.com/archives/CE3ETN3J8/p1666174054230389?thread_ts=1665496599.847459&cid=CE3ETN3J8

Goal:

The participation on SPLAT will be:

 

ACCEPTANCE CRITERIA

  • Document created on CCO repo, reviewed, approved by QE and merged
  • KCS/Article created

 

REFERENCES:

Supporting document: https://github.com/openshift/cloud-credential-operator/blob/master/docs/sts.md#steps-to-in-place-migrate-an-openshift-cluster-to-sts

NOTE: we should add that this step is not supported or recommended.

 

We have identified gaps in the test coverage that monitors which Alerts are acceptable to fire during cluster upgrades. These gaps need to be addressed to make sure we are not allowing regressions into the product.

This epic is to group that work.

This will make transitioning to new releases very simple because ci-tools doesn't need any logic; it just makes sure to include current and previous release data in the file and the PR going to origin. Origin is then responsible for the logic that determines which to use. Origin will check whether we have at least 100 runs and, if not, try to fall back to previous release data. All other fallback logic should already exist.
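
The selection rule described above can be sketched as follows. This is a minimal illustration, not origin's code: the shape of the data (a dictionary keyed by release, each entry carrying a `runs` count) and the name `select_release_data` are assumptions made for the example. Only the 100-run threshold comes from the description.

```python
from typing import Optional, Tuple

MIN_RUNS = 100  # threshold named in the description above


def select_release_data(data_by_release: dict, current: str, previous: str) -> Optional[Tuple[str, dict]]:
    """Pick the data set to use: current release if it has enough runs, else previous.

    Returns None when neither is usable, leaving the decision to the
    other fallback logic that already exists.
    """
    current_data = data_by_release.get(current)
    if current_data is not None and current_data.get("runs", 0) >= MIN_RUNS:
        return current, current_data
    previous_data = data_by_release.get(previous)
    if previous_data is not None:
        return previous, previous_data
    return None
```

Because the file always carries both releases, a new release starts out on the previous release's data and switches over on its own once enough runs have accumulated.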

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

legacy apiserver disruption

legacy network pod sandbox creation

kubelet logs through /api/v1/nodes/<node>/proxy/logs/

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

When CNO is managed by Hypershift, its deployment has a "hypershift.openshift.io/release-image" template metadata annotation. The annotation's value is used to track progress of cluster control plane version upgrades. Example:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      generation: 24
      labels:
        hypershift.openshift.io/managed-by: control-plane-operator
      name: cluster-network-operator
      namespace: master-cg319sf10ghnddkvo8j0
    ...
    spec:
      progressDeadlineSeconds: 600
      ...
      template:
        metadata:
          annotations:
            hypershift.openshift.io/release-image: us.icr.io/armada-master/ocp-release:4.12.7-x86_64
            target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
      ...

The same annotation must be set by CNO on multus-admission-controller deployment so that service providers can track its version upgrades as well.

CNO needs a code fix to implement this annotation propagation logic.
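
The propagation itself is a copy of one key between two pod templates. The sketch below illustrates it on deployments represented as plain dictionaries. It is not the CNO fix, and `propagate_release_image` is a hypothetical helper name.

```python
import copy

RELEASE_IMAGE_KEY = "hypershift.openshift.io/release-image"


def propagate_release_image(source_deployment: dict, target_deployment: dict) -> dict:
    """Return a copy of target_deployment whose pod template carries the source's release-image annotation.

    If the source has no such annotation, the target is returned unchanged.
    """
    source_annotations = (
        source_deployment.get("spec", {})
        .get("template", {})
        .get("metadata", {})
        .get("annotations", {})
    )
    target = copy.deepcopy(target_deployment)
    if RELEASE_IMAGE_KEY in source_annotations:
        template_meta = (
            target.setdefault("spec", {})
            .setdefault("template", {})
            .setdefault("metadata", {})
        )
        template_meta.setdefault("annotations", {})[RELEASE_IMAGE_KEY] = source_annotations[RELEASE_IMAGE_KEY]
    return target
```

Here the source would be the cluster-network-operator deployment and the target the multus-admission-controller deployment. The annotation goes on the pod template rather than on the deployment's own metadata, matching the example above.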

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create OCP cluster using Hypershift
2. Check deployment template metadata annotations on multus-admission-controller

Actual results:

No "hypershift.openshift.io/release-image" deployment template metadata annotation exists 

Expected results:

"hypershift.openshift.io/release-image" annotation must be present

Additional info:

 

 

Description of problem:
Control plane upgrades take about 23 minutes on average. The shortest time I saw was 14 minutes, and the longest was 43 minutes.
The requirement is < 10 min for a successful complete control plane upgrade.

Version-Release number of selected component (if applicable): 4.12.12

How reproducible:
100 %

Steps to Reproduce:

1. Install a hosted cluster on 4.12.12. Wait for it to be 'ready'.
2. Upgrade the control plane to 4.12.13 via OCM.

Actual results: upgrade completes on average after 23 minutes.

Expected results: upgrade completes after < 10 min

Additional info:

N/A

When the user is providing ZTP manifests, a missing value for userManagedNetworking (in AgentClusterInstall) should be defaulted based on the platform type - for platform None this should default to true.

Currently this only happens if the platform type is misspelled as none instead of None. (Both are accepted for backwards compat with OCPBUGS-7495, but they should not result in different behaviour.)

When the user starts from an install-config, we set the correct value explicitly in the generated AgentClusterInstall, so this is not a problem so long as the user doesn't edit it.
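
The intended defaulting can be sketched as a case-insensitive check on the platform type. This is a hedged illustration rather than the installer's code: `effective_user_managed_networking` is a hypothetical name, and defaulting to false for every platform other than None is an assumption, since the text only states the None case.

```python
from typing import Optional


def effective_user_managed_networking(platform_type: str, explicit: Optional[bool] = None) -> bool:
    """Return the userManagedNetworking value to use for an AgentClusterInstall.

    An explicitly provided value always wins. Otherwise the default depends
    on the platform type, compared without regard to case so that "none"
    and "None" behave identically.
    """
    if explicit is not None:
        return explicit
    # Assumption: every platform other than None defaults to False.
    return platform_type.strip().lower() == "none"
```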

Description of problem:


This test is permafailing on techpreview since https://github.com/openshift/origin/pull/27915 landed

[sig-instrumentation][Late] Alerts shouldn't exceed the 650 series limit of total series sent via telemetry from each cluster [Suite:openshift/conformance/parallel]

            s: "promQL query returned unexpected results:\navg_over_time(cluster:telemetry_selected_series:count[49m15s]) >= 650\n[\n  {\n    \"metric\": {\n      \"prometheus\": \"openshift-monitoring/k8s\"\n    },\n    \"value\": [\n      1685504058.881,\n      \"700.3636363636364\"\n    ]\n  }\n]",


Version-Release number of selected component (if applicable):


4.14

How reproducible:


Always

Steps to Reproduce:

1. Run conformance tests on a techpreview cluster

Actual results:

Test fails

Expected results:

Test succeeds

Additional info:


Example job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-techpreview/1663723476923453440

Description of problem:

Due to a security vulnerability [1] affecting Azure CLI versions earlier than 2.40.0 (2.40.0 itself is not affected), it is recommended to update the Azure CLI to a higher version to avoid this issue. Currently, the Azure CLI in CI is 2.38.0.

[1] https://github.com/Azure/azure-cli/security/advisories/GHSA-47xc-9rr2-q7p4
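
A CI step could guard against the affected range with a numeric version comparison. The sketch below is illustrative only: `parse_version` and `is_affected` are hypothetical helpers, and they assume a plain MAJOR.MINOR.PATCH version string.

```python
MIN_SAFE_VERSION = (2, 40, 0)  # first version not affected, per the advisory cited above


def parse_version(text: str) -> tuple:
    """Turn a version string such as "2.38.0" into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version_text: str) -> bool:
    """Return True when the given Azure CLI version is earlier than 2.40.0."""
    return parse_version(version_text) < MIN_SAFE_VERSION
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.9.0" sorts after "2.40.0", which would wrongly report an old version as safe.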

Version-Release number of selected component (if applicable):

All supported versions

How reproducible:

Always

Steps to Reproduce:

1. Trigger CI jobs on the azure platform that contain steps using azure cli.

Actual results:

azure cli 2.38.0 is used now.

Expected results:

azure cli 2.40.0+ to be used in CI on all supported versions

Additional info:

As azure cli 2.40.0+ is only available in the rhel8-based repository, its repo needs to be updated in the upi-installer rhel8-based Dockerfile [1].

[1] https://github.com/openshift/installer/blob/master/images/installer/Dockerfile.upi.ci.rhel8#L23

Description of problem:

We suspect that https://github.com/openshift/oc/pull/1521 has broken all Metal jobs, an example of a failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/355/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6/1691359315740332032.

Details:

The testing scripts we use set KUBECONFIG in advance to the location where we'll create it. At the time "oc adm extract" is called, the file does not exist yet. While you could argue that we should not do it, it has worked for years, and it's quite possible that customers have similar automation (e.g. setting KUBECONFIG as a global variable in their playbooks). In any case, I don't think "oc adm extract" should try to read the configuration if it does not explicitly need it.

Updated details:

After the change, "oc adm extract" expects KUBECONFIG to be present, but at the point when we call it, there is no cluster. I initially assumed that unsetting KUBECONFIG would help, but it does not.

Background

Update the CPMS docs to reflect the newly supported flavours for the upcoming 4.13 release.

Steps

  • Create a PR to update the docs

Stakeholders

  • Cloud Team

Definition of Done

  • PR merged

Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/37

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Our telemetry contains only the vCenter version ("7.0.3") and not the exact build number. We need the build number to know which exact vCenter build a user has and which bugs are fixed in it (e.g. https://issues.redhat.com/browse/OCPBUGS-5817).

 

Description of problem

CI is flaky because the TestClientTLS test fails.

Version-Release number of selected component (if applicable)

I have seen these failures in 4.13 and 4.14 CI jobs.

How reproducible

Presently, search.ci reports the following stats for the past 14 days:

Found in 16.07% of runs (20.93% of failures) across 56 total runs and 13 jobs (76.79% failed) in 185ms
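
The percentages are consistent with each other, as this quick Python check suggests. It assumes that every run matching the search also counts as a failed run, which the report does not state explicitly; the variable names are illustrative.

```python
total_runs = 56
failed_share = 76.79  # percent of all runs that failed
match_share = 16.07   # percent of all runs that hit the TestClientTLS failure

# Convert the percentages back into whole run counts.
failed_runs = round(total_runs * failed_share / 100)
matching_runs = round(total_runs * match_share / 100)

# Share of failed runs that contain the TestClientTLS failure.
share_of_failures = 100 * matching_runs / failed_runs

print(failed_runs, matching_runs, f"{share_of_failures:.2f}%")
# prints: 43 9 20.93%
```

In other words, roughly 9 of the 43 failed runs in the 14-day window involve this test.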

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestClientTLS&maxAge=336h&context=1&type=all&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.

Actual results

The test fails:

=== RUN   TestAll/parallel/TestClientTLS
=== PAUSE TestAll/parallel/TestClientTLS
=== CONT  TestAll/parallel/TestClientTLS
=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [8 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [313 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [313 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:56:24 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:56:24 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [802 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:56:25 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=104beed63d6a19782a5559400bd972b6; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

        stdout:
        000

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS alert, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS alert, unknown CA (560):
        { [2 bytes data]
        * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

        * Closing connection 0
        curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        000

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [8 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS alert, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS alert, unknown (628):
        { [2 bytes data]
        * OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0

        * Closing connection 0
        curl: (56) OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:57:00 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [802 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:57:00 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        000

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS alert, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS alert, unknown CA (560):
        { [2 bytes data]
        * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

        * Closing connection 0
        curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

=== CONT  TestAll/parallel/TestClientTLS
--- FAIL: TestAll (1538.53s)
    --- FAIL: TestAll/parallel (0.00s)
        --- FAIL: TestAll/parallel/TestClientTLS (123.10s)

Expected results

CI passes, or it fails on a different test.

Additional info

I saw that TestClientTLS failed on the test case with no client certificate and ClientCertificatePolicy set to "Required". My best guess is that the test is racy and is hitting a terminating router pod. The test uses waitForDeploymentComplete to wait until all new pods are available, but perhaps waitForDeploymentComplete should also wait until all old pods are terminated.
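
The stricter wait suggested above could look roughly like the following sketch. It uses plain Python dictionaries rather than the Go test helpers, so `deployment_settled` and the field names `desired`, `updated`, `available`, `revision`, and `terminating` are hypothetical and are not taken from the ingress operator's test suite.

```python
from typing import Iterable, Mapping


def deployment_settled(status: Mapping[str, int],
                       pods: Iterable[Mapping[str, object]],
                       current_revision: str) -> bool:
    """Return True only when the rollout is complete AND no old pod lingers.

    `status` is a hypothetical summary of the deployment (desired, updated,
    and available replica counts). `pods` is a hypothetical list of pod
    summaries, each with a `revision` value and a `terminating` flag.
    """
    rollout_complete = (
        status["updated"] == status["desired"]
        and status["available"] == status["desired"]
    )
    if not rollout_complete:
        return False
    # Extra condition proposed in the report: a pod from an older revision
    # that still exists (for example, one that is terminating) may still
    # receive connections, so keep waiting until it is gone.
    for pod in pods:
        if pod["revision"] != current_revision or pod["terminating"]:
            return False
    return True


status = {"desired": 2, "updated": 2, "available": 2}
pods_during_rollout = [
    {"revision": "new", "terminating": False},
    {"revision": "new", "terminating": False},
    {"revision": "old", "terminating": True},
]
pods_after_rollout = pods_during_rollout[:2]

print(deployment_settled(status, pods_during_rollout, "new"))
print(deployment_settled(status, pods_after_rollout, "new"))
```

With only the replica counts, both cases would look complete; the additional pod check is what distinguishes them.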

Description of problem:

During a fresh install of an operator with conversion webhooks enabled, `crd.spec.conversion.webhook.clientConfig` is at first updated dynamically, as expected, with the proper webhook namespace, name, and caBundle. However, within a few seconds, those critical settings are overwritten with the bundle's packaged CRD conversion settings. This breaks the operator and prevents the installation from completing successfully.

Oddly, if that same operator version is installed as part of an upgrade from a prior release, the dynamic clientConfig settings are retained and everything works as expected.

 

Version-Release number of selected component (if applicable):

OCP 4.10.36
OCP 4.11.18

How reproducible:

Consistently

 

Steps to Reproduce:

1. oc apply -f https://gist.githubusercontent.com/tchughesiv/0951d40f58f2f49306cc4061887e8860/raw/3c7979b58705ab3a9e008b45a4ed4abc3ef21c2b/conversionIssuesFreshInstall.yaml
2. oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}' -w

 

Actual results:

Eventually, the clientConfig settings will revert to the following and stay that way.

$ oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}'
map[service:map[name:dbaas-operator-webhook-service namespace:openshift-dbaas-operator path:/convert port:443]]
 conversion:
   strategy: Webhook
   webhook:
     clientConfig:
       service:
         namespace: openshift-dbaas-operator
         name: dbaas-operator-webhook-service
         path: /convert
         port: 443
     conversionReviewVersions:
       - v1alpha1
       - v1beta1

 

Expected results:

The `crd.spec.conversion.webhook.clientConfig` should instead retain the following settings.

$ oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}'
map[caBundle:LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJpRENDQVMyZ0F3SUJBZ0lJUVA1b1ZtYTNqUG93Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TWpFeU1UWXhPVEEwTWpsYUZ3MHlOREV5TVRVeE9UQTBNamxhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBVGcxaEtPWW40MStnTC9PdmVKT21jbkx5MzZNWTBEdnRGcXF3cjJFdlZhUWt2WnEzWG9ZeWlrdlFlQ29DZ3QKZ2VLK0UyaXIxNndzSmRSZ2paYnFHc3pGbzJFd1h6QU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFCkZPMWNXNFBrbDZhcDdVTVR1UGNxZWhST1gzRHZNQW9HQ0NxR1NNNDlCQU1DQTBrQU1FWUNJUURxN0pkUjkxWlgKeWNKT0hyQTZrL0M0SG9sSjNwUUJ6bmx3V3FXektOd0xiZ0loQU5ObUd6RnBqaHd6WXpVY2RCQ3llU3lYYkp3SAphYllDUXFkSjBtUGFha28xCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K service:map[name:dbaas-operator-controller-manager-service namespace:redhat-dbaas-operator path:/convert port:443]]
 conversion:
   strategy: Webhook
   webhook:
     clientConfig:
       service:
         namespace: redhat-dbaas-operator
         name: dbaas-operator-controller-manager-service
         path: /convert
         port: 443
       caBundle: >-
         LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJoekNDQVMyZ0F3SUJBZ0lJZXdhVHNLS0hhbWd3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TWpFeU1UWXhPVEF5TURkYUZ3MHlOREV5TVRVeE9UQXlNRGRhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUVRFQm8zb1BWcjRLemF3ZkE4MWtmaTBZQTJuVGRzU2RpMyt4d081ZmpKQTczdDQ2WVhOblFzTjNCMVBHM04KSXJ6N1dKVkJmVFFWMWI3TXE1anpySndTbzJFd1h6QU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFCkZJemdWbC9ZWkFWNmltdHl5b0ZkNFRkLzd0L3BNQW9HQ0NxR1NNNDlCQU1DQTBnQU1FVUNJRUY3ZXZ0RS95OFAKRnVrTUtGVlM1VkQ3a09DRzRkdFVVOGUyc1dsSTZlNEdBaUVBZ29aNmMvYnNpNEwwcUNrRmZSeXZHVkJRa25SRwp5SW1WSXlrbjhWWnNYcHM9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K 

 

Additional info:

If the operator is instead installed as an upgrade rather than as a fresh install, the webhook settings are set properly and permanently, and everything works as expected. This can be tested in a fresh cluster as follows.

1. oc apply -f https://gist.githubusercontent.com/tchughesiv/703109961f22ab379a45a401be0cf351/raw/2d0541b76876a468757269472e8e3a31b86b3c68/conversionWorksUpgrade.yaml
2. oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}' -w
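
While watching the CRD, the difference between the two states comes down to whether `clientConfig` still carries a usable `caBundle`. The following is a minimal sketch of that check, assuming the `clientConfig` has already been loaded into a dictionary. The helper names are hypothetical, and the `caBundle` value used here is a short placeholder generated in the code, not the certificate from the report.

```python
import base64
from typing import Mapping, Optional


def ca_bundle_is_pem(ca_bundle: Optional[str]) -> bool:
    """Return True if the base64 caBundle decodes to a PEM certificate."""
    if not ca_bundle:
        return False
    try:
        decoded = base64.b64decode(ca_bundle, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return decoded.startswith(b"-----BEGIN CERTIFICATE-----")


def client_config_reverted(client_config: Mapping[str, object]) -> bool:
    """Return True when the dynamically injected caBundle is gone."""
    return not ca_bundle_is_pem(client_config.get("caBundle"))


# Placeholder bundle for illustration only (not a real certificate).
placeholder_bundle = base64.b64encode(
    b"-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n"
).decode("ascii")

reverted = {
    "service": {
        "name": "dbaas-operator-webhook-service",
        "namespace": "openshift-dbaas-operator",
        "path": "/convert",
        "port": 443,
    }
}
retained = {
    "caBundle": placeholder_bundle,
    "service": {
        "name": "dbaas-operator-controller-manager-service",
        "namespace": "redhat-dbaas-operator",
        "path": "/convert",
        "port": 443,
    },
}

print(client_config_reverted(reverted))
print(client_config_reverted(retained))
```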

Description of problem:

On a fresh 4.12.0-0.nightly-2022-09-20-095559 cluster, each alertmanager pod restarted once before becoming ready. This is a 4.12 regression. We should make sure that /etc/alertmanager/config_out/alertmanager.env.yaml exists before Alertmanager starts.

# oc -n openshift-monitoring get pod
NAME                                                     READY   STATUS    RESTARTS       AGE
alertmanager-main-0                                      6/6     Running   1 (118m ago)   118m
alertmanager-main-1                                      6/6     Running   1 (118m ago)   118m
...

# oc -n openshift-monitoring describe pod alertmanager-main-0 
...
Containers:
  alertmanager:
    Container ID:  cri-o://31b6f3231f5a24fe85188b8b8e26c45b660ebc870ee6915919031519d493d7f8
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa
    Ports:         9094/TCP, 9094/UDP
    Host Ports:    0/TCP, 0/UDP
    Args:
      --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
      --storage.path=/alertmanager
      --data.retention=120h
      --cluster.listen-address=[$(POD_IP)]:9094
      --web.listen-address=127.0.0.1:9093
      --web.external-url=https:/console-openshift-console.apps.qe-daily1-412-0922.qe.azure.devcluster.openshift.com/monitoring
      --web.route-prefix=/
      --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
      --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
      --cluster.reconnect-timeout=5m
      --web.config.file=/etc/alertmanager/web_config/web-config.yaml
    State:       Running
      Started:   Wed, 21 Sep 2022 19:40:14 -0400
    Last State:  Terminated
      Reason:    Error
      Message:   s=2022-09-21T23:40:06.507Z caller=main.go:231 level=info msg="Starting Alertmanager" version="(version=0.24.0, branch=rhaos-4.12-rhel-8, revision=4efb3c1f9bc32ba0cce7dd163a639ca8759a4190)"
ts=2022-09-21T23:40:06.507Z caller=main.go:232 level=info build_context="(go=go1.18.4, user=root@b2df06f7fbc3, date=20220916-18:08:09)"
ts=2022-09-21T23:40:07.119Z caller=cluster.go:260 level=warn component=cluster msg="failed to join cluster" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2022-09-21T23:40:07.119Z caller=cluster.go:262 level=info component=cluster msg="will retry joining cluster every 10s"
ts=2022-09-21T23:40:07.119Z caller=main.go:329 level=warn msg="unable to join gossip mesh" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
ts=2022-09-21T23:40:07.119Z caller=cluster.go:680 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
ts=2022-09-21T23:40:07.173Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2022-09-21T23:40:07.174Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="open /etc/alertmanager/config_out/alertmanager.env.yaml: no such file or directory"
ts=2022-09-21T23:40:07.174Z caller=cluster.go:689 level=info component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=54.469985ms
      Exit Code:    1
      Started:      Wed, 21 Sep 2022 19:40:06 -0400
      Finished:     Wed, 21 Sep 2022 19:40:07 -0400
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:     4m
      memory:  40Mi
    Startup:   exec [sh -c exec curl --fail http://localhost:9093/-/ready] delay=20s timeout=3s period=10s #success=1 #failure=40
...

# oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml
"global":
  "resolve_timeout": "5m"
"inhibit_rules":
- "equal":
  - "namespace"
  - "alertname"
  "source_matchers":
  - "severity = critical"
  "target_matchers":
  - "severity =~ warning|info"
- "equal":
  - "namespace"
  - "alertname"

...

Version-Release number of selected component (if applicable):

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-09-20-095559   True        False         109m    Cluster version is 4.12.0-0.nightly-2022-09-20-095559

How reproducible:

always

Steps to Reproduce:

1. See the commands under "Description of problem" above.

Actual results:

alertmanager pod restarted once to become ready

Expected results:

no restart

Additional info:

no issue with 4.11

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-20-140029   True        False         16m     Cluster version is 4.11.0-0.nightly-2022-09-20-140029
# oc -n openshift-monitoring get pod | grep alertmanager-main
alertmanager-main-0                                      6/6     Running   0          54m
alertmanager-main-1                                      6/6     Running   0          55m 
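
The description asks that the generated configuration file exist before Alertmanager starts. One generic way to express that precondition is a bounded wait for the file, sketched below. This is only an illustration of the idea; it is not taken from the monitoring stack, and `wait_for_file` and its parameters are hypothetical.

```python
import os
import tempfile
import time


def wait_for_file(path: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until `path` is a non-empty regular file or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demonstration with a temporary directory standing in for config_out.
with tempfile.TemporaryDirectory() as config_out:
    config_path = os.path.join(config_out, "alertmanager.env.yaml")
    missing = wait_for_file(config_path, timeout=0.2, interval=0.05)
    with open(config_path, "w") as handle:
        handle.write('"global":\n  "resolve_timeout": "5m"\n')
    present = wait_for_file(config_path, timeout=0.2, interval=0.05)

print(missing, present)
```

A process that starts only after such a wait succeeds would not hit the "no such file or directory" error shown in the terminated container's log.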

Description of problem:

library-go should use Lease for leader election by default.
In 4.10 we switched from configmaps to configmapsleases; now we can switch to leases.

Change library-go to use leases by default. There is already an open PR for that: https://github.com/openshift/library-go/pull/1448

Once the PR merges, we should revendor library-go for:
- kas operator
- oas operator
- etcd operator
- kcm operator
- openshift controller manager operator
- scheduler operator
- auth operator
- cluster policy controller
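
For readers unfamiliar with lease-based leader election, the general idea can be modeled in a few lines. The sketch below is a simplified in-memory model for illustration only. It is not library-go or client-go code, which work against Lease objects in the `coordination.k8s.io` API; the `Lease` class, `try_acquire_or_renew`, and the 15-second duration are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Simplified stand-in for a lease record shared by all candidates."""
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration: float = 15.0  # seconds the holder stays leader without renewing


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Take or renew the lease if it is free, expired, or already ours."""
    expired = lease.holder is None or now >= lease.renew_time + lease.duration
    if lease.holder == candidate or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False


lease = Lease()
results = [
    try_acquire_or_renew(lease, "operator-a", now=0.0),   # free lease: acquired
    try_acquire_or_renew(lease, "operator-b", now=5.0),   # held by a: refused
    try_acquire_or_renew(lease, "operator-a", now=10.0),  # holder renews
    try_acquire_or_renew(lease, "operator-b", now=20.0),  # renewed at 10: refused
    try_acquire_or_renew(lease, "operator-b", now=26.0),  # expired at 25: acquired
]
print(results, lease.holder)
```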
 


Description of problem:

Critical Alert Rules do not have runbook url

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

This bug is being raised by the OpenShift Monitoring team as part of an effort to detect invalid Alert Rules in OCP.

1.  Check details of MultipleDefaultStorageClasses Alert Rule
2.
3.

Actual results:

The Alert Rule MultipleDefaultStorageClasses has Critical Severity, but does not have runbook_url annotation.

Expected results:

All Critical Alert Rules must have a runbook_url annotation.

Additional info:

Critical Alerts must have a runbook; please refer to the style guide at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide

The runbooks are located at github.com/openshift/runbooks

To resolve the bug:
- Add runbooks for the relevant Alerts at github.com/openshift/runbooks
- Add the link to the runbook in the Alert annotation 'runbook_url'
- Remove the exception in the origin test, added in PR https://github.com/openshift/origin/pull/27933
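
The rule being enforced (every critical alert carries a `runbook_url` annotation) is simple to check mechanically. The following is a minimal sketch, assuming the alerting rules have already been parsed into dictionaries with the usual `alert`, `labels`, and `annotations` keys. It is not the origin test itself; `critical_rules_missing_runbook`, the two `Example...` rules, and the runbook URL are hypothetical.

```python
from typing import Iterable, List, Mapping


def critical_rules_missing_runbook(rules: Iterable[Mapping[str, object]]) -> List[str]:
    """Return the names of critical alert rules without a runbook_url."""
    missing = []
    for rule in rules:
        labels = rule.get("labels") or {}
        annotations = rule.get("annotations") or {}
        if labels.get("severity") == "critical" and not annotations.get("runbook_url"):
            missing.append(rule["alert"])
    return missing


rules = [
    {
        "alert": "MultipleDefaultStorageClasses",
        "labels": {"severity": "critical"},
        "annotations": {"summary": "More than one default StorageClass detected."},
    },
    {
        # Hypothetical rule that satisfies the style guide.
        "alert": "ExampleCriticalWithRunbook",
        "labels": {"severity": "critical"},
        "annotations": {"runbook_url": "https://example.invalid/runbooks/Example.md"},
    },
    {
        # Non-critical rules are not required to have a runbook.
        "alert": "ExampleWarning",
        "labels": {"severity": "warning"},
        "annotations": {},
    },
]

print(critical_rules_missing_runbook(rules))
```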

Description of problem:

The vsphere-problem-detector feature triggers VSphereOpenshiftClusterHealthFail alerts regarding “CheckFolderPermissions” and “CheckDefaultDatastore” after upgrading from 4.9.54, forcing users to update their configuration solely to get around the problem detector. Depending on customer policies around vCenter passwords or configuration updates, this can be a major obstacle for a user who wants to keep the current vSphere settings, since they worked correctly in previous OpenShift versions.

Version-Release number of selected component (if applicable):

4.10.55

How reproducible:

Consistently

Steps to Reproduce:

1. Upgrade a cluster to 4.10 with invalid vSphere credentials

Actual results:

The cluster-storage-operator fires alarms regarding the vSphere configuration in OpenShift.

Expected results:

Allow bypassing the vsphere-problem-detector if the user doesn't want to make a config change, since the setup is working and upgrades like this succeeded for the user prior to 4.10.

Additional info:

 

Description of problem:

Create Serverless Function Form is Broken

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always on Master.

Steps to Reproduce:

1. Go to Add Page
2. Click Create Serverless Function form

Actual results:

The form throws an error.

Expected results:

Form should open and submit

Screenshot of Error: https://drive.google.com/file/d/1uyzGHktfr8tEGWPyYkv9ISYI6BhdnK6f/view?usp=sharing

Additional info:

 

One of the 4.13 nightly payload tests is failing, and it seems that kernel-uname-r is needed in base RHCOS.

Error message from rpm-ostree rebase:

 Problem: package kernel-modules-core-5.14.0-284.25.1.el9_2.x86_64 requires kernel-uname-r = 5.14.0-284.25.1.el9_2.x86_64, but none of the providers can be installed
  - conflicting requests

MCD pod log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade/1686324400581775360/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-bjhq4_machine-config-daemon.log

Perhaps something changed recently in packaging.

Description of problem:

A test in the periodic job for the 4.13 release fails in about 30% of runs:
[rfe_id:27363][performance] CPU Management Hyper-thread aware scheduling for guaranteed pods Verify Hyper-Thread aware scheduling for guaranteed pods [test_id:46959] Number of CPU requests as multiple of SMT count allowed when HT enabled

Version-Release number of selected component (if applicable):

4.13

How reproducible:

In periodic jobs

Steps to Reproduce:

Run cnf tests on 4.13

Actual results:

 

Expected results:

 

Additional info:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-telco5g-cnftests/1628395172440051712/artifacts/e2e-telco5g-cnftests/telco5g-cnf-tests/artifacts/test_results.html

Baremetal IPI jobs have been failing in 4.14 CI since May 12th.

bootkube is failing to start with:

 

May 15 10:11:56 localhost.localdomain systemd[1]: Started Bootstrap a Kubernetes cluster.
May 15 10:12:04 localhost.localdomain bootkube.sh[82661]: Rendering Kubernetes Controller Manager core manifests...
May 15 10:12:09 localhost.localdomain bootkube.sh[84029]: F0515 10:12:09.396398       1 render.go:45] error getting FeatureGates: error creating feature accessor: unable to determine features: missing desired version "4.14.0-0.nightly-2023-05-12-121801" in featuregates.config.openshift.io/cluster
May 15 10:12:09 localhost.localdomain systemd[1]: bootkube.service: Main process exited, code=exited, status=255/EXCEPTION
May 15 10:12:09 localhost.localdomain systemd[1]: bootkube.service: Failed with result 'exit-code'.

Description of problem:

Cluster deployment of 4.14.0-0.nightly-2023-06-20-065807 fails because worker nodes are stuck in the INSPECTING state despite being reported as MANAGEABLE.

From the logs of the machine-controller container in the machine-api-controllers pod:

I0621 06:12:02.779472       1 request.go:682] Waited for 2.095824347s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v2?timeout=32s
E0621 06:12:02.781540       1 logr.go:270] controller-runtime/source "msg"="if kind is a CRD, it should be installed before calling Start" "error"="no matches for kind \"Metal3Remediation\" in version \"infrastructure.cluster.x-k8s.io/v1beta1\""  "kind"={"Group":"infrastructure.cluster.x-k8s.io","Kind":"Metal3Remediation"}
I0621 06:12:02.783418       1 controller.go:179] kni-qe-4-tj65t-worker-0-h6s8g: reconciling Machine
2023/06/21 06:12:02 Checking if machine kni-qe-4-tj65t-worker-0-h6s8g exists.
2023/06/21 06:12:02 Machine kni-qe-4-tj65t-worker-0-h6s8g does not exist.
I0621 06:12:02.783439       1 controller.go:372] kni-qe-4-tj65t-worker-0-h6s8g: reconciling machine triggers idempotent create
2023/06/21 06:12:02 Creating machine kni-qe-4-tj65t-worker-0-h6s8g
2023/06/21 06:12:02 0 hosts available while choosing host for machine 'kni-qe-4-tj65t-worker-0-h6s8g'
2023/06/21 06:12:02 No available BareMetalHost found
W0621 06:12:02.783735       1 controller.go:374] kni-qe-4-tj65t-worker-0-h6s8g: failed to create machine: requeue in: 30s
I0621 06:12:02.783748       1 controller.go:404] Actuator returned requeue-after error: requeue in: 30s
I0621 06:12:02.783780       1 controller.go:179] kni-qe-4-tj65t-worker-0-j259x: reconciling Machine
2023/06/21 06:12:02 Checking if machine kni-qe-4-tj65t-worker-0-j259x exists.
2023/06/21 06:12:02 Machine kni-qe-4-tj65t-worker-0-j259x does not exist.
I0621 06:12:02.783792       1 controller.go:372] kni-qe-4-tj65t-worker-0-j259x: reconciling machine triggers idempotent create
2023/06/21 06:12:02 Creating machine kni-qe-4-tj65t-worker-0-j259x
2023/06/21 06:12:02 0 hosts available while choosing host for machine 'kni-qe-4-tj65t-worker-0-j259x'
2023/06/21 06:12:02 No available BareMetalHost found
W0621 06:12:02.783971       1 controller.go:374] kni-qe-4-tj65t-worker-0-j259x: failed to create machine: requeue in: 30s
I0621 06:12:02.783976       1 controller.go:404] Actuator returned requeue-after error: requeue in: 30s

BMH Resources:

oc get bmh -A
NAMESPACE               NAME                 STATE                    CONSUMER                  ONLINE   ERROR   AGE
openshift-machine-api   openshift-master-0   externally provisioned   kni-qe-4-tj65t-master-0   true             175m
openshift-machine-api   openshift-master-1   externally provisioned   kni-qe-4-tj65t-master-1   true             175m
openshift-machine-api   openshift-master-2   externally provisioned   kni-qe-4-tj65t-master-2   true             175m
openshift-machine-api   openshift-worker-0   inspecting                                         true             175m
openshift-machine-api   openshift-worker-1   inspecting                                         true             175m

From Ironic:

baremetal node list
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name                                     | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| 86f146e3-3e48-4a7a-b0ef-57c42083fc92 | openshift-machine-api~openshift-master-0 | 7eeb9e57-2df2-4710-82d9-d3f99a20348e | power on    | active             | False       |
| 2380f211-934f-4193-8cb1-d09e7008410c | openshift-machine-api~openshift-master-2 | fd856ced-2912-4800-848c-256c00a1fdb7 | power on    | active             | False       |
| 9ad70c58-de44-4d56-9304-4bf7c95de6fb | openshift-machine-api~openshift-master-1 | aa1a4c89-4215-44ec-90c7-9c5f3de95ab8 | power on    | active             | False       |
| bb5ea5f4-016c-4bdd-834d-61d575284bf3 | openshift-machine-api~openshift-worker-0 | None                                 | power off   | manageable         | False       |
| 3045a07a-09d6-43a0-ab9c-d856b54bad6c | openshift-machine-api~openshift-worker-1 | None                                 | power off   | manageable         | False       |
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
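
The mismatch between the two listings above can be spotted programmatically. The sketch below assumes both listings have already been reduced to name-to-state dictionaries, with values copied from the output above. `stuck_inspecting` is a hypothetical helper, and the `namespace~name` key format simply mirrors the Name column of the Ironic listing.

```python
from typing import Dict, List


def stuck_inspecting(bmh_states: Dict[str, str],
                     ironic_states: Dict[str, str],
                     namespace: str = "openshift-machine-api") -> List[str]:
    """Return hosts that are `inspecting` as BMHs but `manageable` in Ironic."""
    stuck = []
    for name, state in bmh_states.items():
        ironic_state = ironic_states.get(f"{namespace}~{name}")
        if state == "inspecting" and ironic_state == "manageable":
            stuck.append(name)
    return sorted(stuck)


bmh_states = {
    "openshift-master-0": "externally provisioned",
    "openshift-master-1": "externally provisioned",
    "openshift-master-2": "externally provisioned",
    "openshift-worker-0": "inspecting",
    "openshift-worker-1": "inspecting",
}
ironic_states = {
    "openshift-machine-api~openshift-master-0": "active",
    "openshift-machine-api~openshift-master-1": "active",
    "openshift-machine-api~openshift-master-2": "active",
    "openshift-machine-api~openshift-worker-0": "manageable",
    "openshift-machine-api~openshift-worker-1": "manageable",
}

print(stuck_inspecting(bmh_states, ironic_states))
```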

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-20-065807

How reproducible:

Once so far

Steps to Reproduce:

1. Deploy baremetal dualstack cluster with day1 networking

Actual results:

Deployment fails as worker nodes are not provisioned

Expected results:

Deployment succeeds

Description of problem: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.14-e2e-openstack-sdn/1682353286402805760 failed with:

fail [github.com/openshift/origin/test/extended/authorization/scc.go:69]: 2 pods failed before test on SCC errors
Error creating: pods "openstack-cinder-csi-driver-controller-7c4878484d-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[0].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[0].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[0].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[0].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[1].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[1].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[1].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[1].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10301: Host 
ports are not allowed to be used, provider restricted-v2: .containers[2].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[2].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[2].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[2].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[3].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[3].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[3].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[3].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[3].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[3].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[4].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[4].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[4].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[4].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[4].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[4].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider 
restricted-v2: .containers[5].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[5].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[5].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[5].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[5].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[5].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[6].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[6].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[6].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[6].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[6].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[6].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[7].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[7].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[7].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[7].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[7].containers[6].hostPort: 
Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[7].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[8].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[8].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[8].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[8].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[8].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[8].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[9].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[9].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[9].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[9].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[9].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[9].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable 
by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for ReplicaSet.apps/v1/openstack-cinder-csi-driver-controller-7c4878484d -n openshift-cluster-csi-drivers happened 13 times
Error creating: pods "openstack-cinder-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[8]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[0].capabilities.add: Invalid value: "SYS_ADMIN": capability may not be added, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[0].allowPrivilegeEscalation: Invalid value: true: Allowing privilege escalation for containers is not allowed, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or 
serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/openstack-cinder-csi-driver-node -n openshift-cluster-csi-drivers happened 12 times

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

OKD/FCOS uses FCOS as its bootimage, i.e. when booting cluster nodes
the first time during installation. FCOS does not provide tools such
as OpenShift Client (oc) or hyperkube which are used during
single-node cluster installation at first boot (e.g. oc in
bootkube.sh) and thus setup fails.
 

Version-Release number of selected component (if applicable):

4.14

Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/197

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description

As a user, I would like to see the type of technology used by the samples on the samples view similar to the all services view. 

On the samples view:

The view shows different types of samples (e.g. devfile, Helm), all labeled .NET, so it is difficult for the user to decide which .NET entry to select from the list. We need something like the all services view, which shows the type of technology at the top right of each card so that users can differentiate between the entries.

Acceptance Criteria

  1. Add a visible label to each card on the samples view, as in the all services view, showing the technology used by the sample.

Additional Details:

Description of problem:

The external link 'OpenShift Pipelines based on Tekton' in the Pipeline build strategy deprecation alert is incorrect: it is currently defined as https://openshift.github.io/pipelines-docs/, which redirects to a 'Not found' page.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-04-133505

How reproducible:

Always

Steps to Reproduce:

1. $ oc new-app -n test https://github.com/openshift/cucushift/blob/master/testdata/pipeline/samplepipeline.yaml
  
   OR Create Jenkins server and Pipeline BC
   $ oc new-app https://raw.githubusercontent.com/openshift/origin/master/examples/jenkins/jenkins-ephemeral-template.json
   $ oc new-app -f https://raw.githubusercontent.com/openshift/origin/master/examples/jenkins/pipeline/samplepipeline.yaml

2. Log in to the console as an admin user and navigate to Builds -> Build Configs -> sample-pipeline Details page
3. Check the external link 'OpenShift Pipelines based on Tekton' in the 'Pipeline build strategy deprecation' alert

Actual results:

The user is redirected to a 'Not found' page.

Expected results:

The link should point to an existing page.

Additional info:

Impacted file: build.tsx
https://github.com/openshift/console/blob/a0e7e98e5ffe4aca73f9f1f441d15cc4e9b33ee6/frontend/public/components/build.tsx#LL238C17-L238C60

Base bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1768350

Description of the problem:

Debug info is not printed for data collection

How reproducible:

Always

Steps to reproduce:

1. Deploy MCE multicluster-engine.v2.3.0-81. 

2. Enable log level debug for AI

3. Deploy spoke multinode 4.12

Actual results:

No debug info printed. 

Expected results:

Debug info should be printed:
log.Debugf("Red Hat Insights Request ID: %+v", res.Header.Get("X-Rh-Insights-Request-Id"))

Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/64

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

This issue was supposed to be fixed in 4.13.4 but is happening again. Manually creating the directory "/etc/systemd/network" allows the upgrade to complete, but this is not a sustainable workaround when there are several clusters to update.

Version-Release number of selected component (if applicable):

4.13.4

How reproducible:

At customer environment.

Steps to Reproduce:

1. Update to 4.13.4 from 4.12.21
2.
3.

Actual results:

MCO degraded blocking the upgrade.

Expected results:

Upgrade to complete.

Additional info:

 

Description of problem:

The HCP Create NodePool AWS Render command does not work correctly since it does not render a specification with the arch and instance type defined.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

No arch or instance type defined in specification.

Expected results:

Arch and instance type defined in specification.

Additional info:

 

When we create an HCP, the Root CA in the HCP namespaces has the certificate and key named as

  • ca.key
  • ca.crt

But cert manager expects them to be named as

  • tls.key
  • tls.crt

Done criteria: The Root CA should have the certificate and key named as the cert manager expects.
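
The renaming described above can be sketched as a small key mapping. This is only an illustration, not the HyperShift implementation: the function name and the `KEY_RENAMES` table are hypothetical, and it assumes the target names are `tls.crt` and `tls.key`, the keys conventionally used in `kubernetes.io/tls` secrets.

```python
# Hypothetical mapping from the key names currently written for the Root CA
# to the key names conventionally used by kubernetes.io/tls secrets.
KEY_RENAMES = {"ca.crt": "tls.crt", "ca.key": "tls.key"}


def rename_ca_keys(secret_data: dict) -> dict:
    """Return a copy of the secret data with the CA keys renamed.

    Keys that are not listed in KEY_RENAMES are copied unchanged.
    """
    return {KEY_RENAMES.get(name, name): value for name, value in secret_data.items()}
```

The function works on the `data` map of a Secret that has already been decoded into a dictionary; it does not talk to the API server.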

Description of problem:

Once a user changes the log component in a master node's Logs section, the user is unable to select a different log component from the dropdown.

To select a different log component, the user needs to revisit the Logs section of the master node, which refreshes the pane and reloads the default options.

 

Version-Release number of selected components (if applicable):

4.11.0-0.nightly-2022-08-15-152346

How reproducible:

 Always

Steps to Reproduce:

  1. Login to OCP web console.
  2. Go to Compute >  Nodes > Click on one of the master nodes.
  3. Go to the Logs section.
  4. Change the dropdown value from journal to openshift-apiserver ( also select audit log)
  5. Try to change the dropdown value from openshift-apiserver to journal/kube-apiserver/oauth-apiserver.
  6. View the behavior.

Actual results:

Unable to select or change the log component once the user already made a selection from the dropdown under master nodes' logs section.

Expected results:

Users should be able to change the log component in a master node's Logs section at any time using the dropdown.

Additional info:

Reproduced in both Chrome [103.0.5060.114 (Official Build) (64-bit)] and Firefox [91.11.0esr (64-bit)].
A screen capture is attached: ScreenRecorder_2022-08-16_26457662-aea5-4a00-aeb4-0fbddf8f16f0.mp4

Description of problem:

Azure CCM should be GA before the end of 4.14. When we previously tried to promote it there were issues, so we need to improve the feature gate promotion so that all components can be promoted in a single release, and then promote the CCM to GA once those changes are in place.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1137

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

All the DaemonSets defined within the openshift-multus namespace have a node selector predicate on the kubernetes.io/os label to schedule the DaemonSet's pods only on Linux workers. The whereabouts-reconciler seems to be missing it. We might need to add the `kubernetes.io/os: linux` node selector to stay consistent with the other DaemonSet definitions and avoid risks in clusters with Windows workers.

Version-Release number of selected component (if applicable):

4.13+

How reproducible:

Always

Steps to Reproduce:

1. oc get daemonsets -n openshift-multus
NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
multus                          6         6         6       6            6           kubernetes.io/os=linux   4h1m
multus-additional-cni-plugins   6         6         6       6            6           kubernetes.io/os=linux   4h1m
multus-networkpolicy            6         6         6       6            6           kubernetes.io/os=linux   19s

 

Actual results:

network-metrics-daemon          6         6         6       6            6           kubernetes.io/os=linux   4h1m
whereabouts-reconciler          6         6         6       6            6           <none>                   23s

Note the missing kubernetes.io/os node selector.

Expected results:

The whereabouts-reconciler should also have the node selector term kubernetes.io/os: linux.
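
A check like the following could flag DaemonSets that lack the selector. It is a minimal sketch, assuming the input is the `items` list from `oc get daemonsets -n openshift-multus -o json` already parsed into Python dictionaries; the function name is hypothetical.

```python
REQUIRED_KEY = "kubernetes.io/os"
REQUIRED_VALUE = "linux"


def daemonsets_missing_os_selector(daemonsets: list) -> list:
    """Return the names of DaemonSets whose pod template does not select Linux nodes."""
    missing = []
    for ds in daemonsets:
        pod_spec = ds.get("spec", {}).get("template", {}).get("spec", {})
        # nodeSelector may be absent or null when no selector is set.
        selector = pod_spec.get("nodeSelector") or {}
        if selector.get(REQUIRED_KEY) != REQUIRED_VALUE:
            missing.append(ds["metadata"]["name"])
    return missing
```

With the output shown in this report, only whereabouts-reconciler would be returned.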

Additional info:

https://redhat-internal.slack.com/archives/CFFSAHWHF/p1687158805205059

Description of problem:

The oc binary stored at /usr/bin in the cli-artifacts image of a non-amd64 payload is not the one for the architecture bound to the payload. It is an amd64 binary.

Version-Release number of selected component (if applicable):

4.11.4

How reproducible:

always

Steps to Reproduce:

1. CLI_ARTIFACTS_IMAGE=$(oc adm release info quay.io/openshift-release-dev/ocp-release:4.11.4-aarch64 --image-for=cli-artifacts)
2. CONTAINER=$(podman create $CLI_ARTIFACTS_IMAGE)
3. podman cp $CONTAINER:/usr/bin/oc /tmp/oc
4. file /tmp/oc

Actual results:

/tmp/oc: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked,.....

Expected results:

It should be a binary for the architecture for which the image is built, i.e. the aarch64 payload above should provide an arm64 binary at /usr/bin and the binaries for the other architectures in /usr/share/openshift.
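
The `file` check in step 4 can also be scripted by reading the `e_machine` field of the ELF header. The sketch below is illustrative only: the function name and the small lookup table are assumptions, and the table covers just a few machine codes from the ELF specification.

```python
import struct

# A few e_machine values from the ELF specification.
ELF_MACHINES = {21: "ppc64", 22: "s390", 62: "x86-64", 183: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the architecture name encoded in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[EI_DATA] (byte 5): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, after e_ident (16 bytes) and e_type (2 bytes).
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown ({machine})")
```

To check the extracted binary, pass the first 20 bytes of /tmp/oc (opened in binary mode) to `elf_machine`; for the aarch64 payload the expected result is "aarch64".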

Additional info:

https://github.com/openshift/oc/blob/master/images/cli-artifacts/Dockerfile.rhel#L13

Description of problem:
Create two custom SCCs with different permissions, for example custom-scc-1 based on 'privileged' and custom-scc-2 based on 'restricted'. Deploy a pod with the annotation "openshift.io/required-scc: custom-scc-1, custom-scc-2". Pod creation fails with the error "Error creating: pods "test-747555b669-" is forbidden: required scc/custom-restricted-v2-scc, custom-privileged-scc not found". The system does not report an appropriate error when the annotation lists multiple required SCCs, so users cannot identify the cause of the failure.

 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-31-181848

How reproducible:

Always

Steps to Reproduce:

$ oc login -u testuser-0
$ oc new-project scc-test
$ oc create sa scc-test -n scc-test
serviceaccount/scc-test created

$ oc get scc restricted-v2 -o yaml --context=admin > custom-restricted-v2-scc.yaml
$ sed -i -e 's/restricted-v2/custom-restricted-v2-scc/g' -e "s/MustRunAsRange/RunAsAny/" -e "s/priority: null/priority: 10/" custom-restricted-v2-scc.yaml

$ oc create -f custom-restricted-v2-scc.yaml --context=admin
securitycontextconstraints.security.openshift.io/custom-restricted-v2-scc created

$ oc adm policy add-scc-to-user custom-restricted-v2-scc system:serviceaccount:scc-test:scc-test --context=admin
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:custom-restricted-v2-scc added: "scc-test"

$ oc get scc privileged -o yaml --context=admin > custom-privileged-scc.yaml
$ sed -i -e 's/privileged/custom-privileged-scc/g' -e "s/priority: null/priority: 5/" custom-privileged-scc.yaml

$ oc create -f custom-privileged-scc.yaml --context=admin
securitycontextconstraints.security.openshift.io/custom-privileged-scc created

$ oc adm policy add-scc-to-user custom-privileged-scc system:serviceaccount:scc-test:scc-test --context=admin
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:custom-privileged-scc added: "scc-test"


$ cat deployment.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      deployment: test
  template:
    metadata:
      annotations:
        openshift.io/required-scc: custom-restricted-v2-scc, custom-privileged-scc
      labels:
        deployment: test
    spec:
      containers:
      - args:
        - infinity
        command:
        - sleep
        image: fedora:latest
        name: sleeper
      securityContext:
        runAsNonRoot: true
      serviceAccountName: scc-test


$ oc create -f deployment.yaml 
deployment.apps/test created

$ oc describe rs test-747555b669 | grep FailedCreate
  ReplicaFailure   True    FailedCreate
  Warning  FailedCreate  61s (x15 over 2m23s)  replicaset-controller  Error creating: pods "test-747555b669-" is forbidden: required scc/custom-restricted-v2-scc, custom-privileged-scc not found

Actual results:

Pod deployment failed with "Error creating: pods "test-747555b669-" is forbidden: required scc/custom-restricted-v2-scc, custom-privileged-scc not found"

Expected results:

Either the second SCC should be ignored instead of reporting "not found", or a proper error message should be shown.
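
One way the "proper error message" option could look is sketched below. This is not the admission plugin's code: the function name is hypothetical, and it assumes the annotation is meant to hold exactly one SCC name, which is what the "not found" lookup of the whole comma-separated string suggests.

```python
ANNOTATION = "openshift.io/required-scc"


def parse_required_scc(annotation_value: str) -> str:
    """Return the single SCC name from the annotation value or raise a descriptive error."""
    name = annotation_value.strip()
    if not name:
        raise ValueError(f"{ANNOTATION} must not be empty")
    if "," in name:
        names = [part.strip() for part in name.split(",") if part.strip()]
        raise ValueError(
            f"{ANNOTATION} accepts exactly one SCC name, got {len(names)}: "
            + ", ".join(names)
        )
    return name
```

For the Deployment above, this would report that two names were given instead of looking up the combined string as a single SCC.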

Additional info:

 

This is a clone of issue OCPBUGS-17589. The following is the description of the original issue:

This bug has been seen during the analysis of another issue

If the Server Internal IP is not defined, CBO crashes as nil is not handled in https://github.com/openshift/cluster-baremetal-operator/blob/release-4.12/provisioning/utils.go#L99

 

I0809 17:33:09.683265       1 provisioning_controller.go:540] No Machines with cluster-api-machine-role=master found, set provisioningMacAddresses if the metal3 pod fails to start

I0809 17:33:09.690304       1 clusteroperator.go:217] "new CO status" reason=SyncingResources processMessage="Applying metal3 resources" message=""

I0809 17:33:10.488862       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.1779c769624884f4  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:ValidatingWebhookConfigurationUpdated,Message:Updated ValidatingWebhookConfiguration.admissionregistration.k8s.io/baremetal-operator-validating-webhook-configuration because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2023-08-09 17:33:10.488745204 +0000 UTC m=+5.906952556,LastTimestamp:2023-08-09 17:33:10.488745204 +0000 UTC m=+5.906952556,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1768fd4]

 

goroutine 574 [running]:

github.com/openshift/cluster-baremetal-operator/provisioning.getServerInternalIP({0x1e774d0?, 0xc0001e8fd0?})

        /go/src/github.com/openshift/cluster-baremetal-operator/provisioning/utils.go:75 +0x154

github.com/openshift/cluster-baremetal-operator/provisioning.GetIronicIP({0x1ea2378?, 0xc000856840?}, {0x1bc1f91, 0x15}, 0xc0004c4398, {0x1e774d0, 0xc0001e8fd0})

        /go/src/github.com/openshift/cluster-baremetal-operator/provisioning/utils.go:98 +0xfb

Description of problem:

Reported in https://github.com/openshift/cluster-ingress-operator/issues/911

When you open a new issue, it still directs you to Bugzilla, and then doesn't work.

It can be changed here: https://github.com/openshift/cluster-ingress-operator/blob/master/.github/ISSUE_TEMPLATE/config.yml, but to what?

The correct Jira link is
https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332330&issuetype=1&components=12367900&priority=10300&customfield_12316142=26752

But can the public use this mechanism? Yes - https://redhat-internal.slack.com/archives/CB90SDCAK/p1682527645965899 

Version-Release number of selected component (if applicable):

n/a

How reproducible:

May be in other repos too.

Steps to Reproduce:

1. Open Issue in the repo - click on New Issue
2. Follow directions and click on link to open Bugzilla
3. Get message that this doesn't work anymore

Actual results:

You get instructions that don't work to open a bug from an Issue.

Expected results:

You get instructions to just open an Issue, or get correct instructions on how to open a bug using Jira.

Additional info:

 

Description of problem:

In HA mode with two dedicated nodes, ignition-server-proxy and konnectivity-server have only one replica each. I expect each of them to have two replicas, with one replica running on each dedicated node.

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. allocate two dedicated nodes
2. create a cluster in HA mode
3. check ignition-server-proxy and konnectivity-server in control plane

Actual results:

ignition-server-proxy and konnectivity-server have one replica

Expected results:

ignition-server-proxy and konnectivity-server have two replicas, each replica runs on one dedicated node

Additional info:

 

Description of problem:

More than one cluster can be created in openshift-cluster-api

$ oc get cluster                                                             
NAME                          PHASE          AGE   VERSION
ci-ln-kv1gj4b-72292-jn4rw     Provisioning   19m
ci-ln-kv1gj4b-72292-jn4rw-1   Provisioning   7s

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2022-11-25-204445

How reproducible:

Always

Steps to Reproduce:

1. 
2.
3.

Actual results:

More than one cluster can be created in openshift-cluster-api
$ oc get cluster
NAME                          PHASE          AGE   VERSION
ci-ln-kv1gj4b-72292-jn4rw     Provisioning   19m
ci-ln-kv1gj4b-72292-jn4rw-1   Provisioning   7s

Expected results:

The openshift-cluster-api namespace should contain only the cluster you're running on, and users should be able to use Cluster API to create other clusters only in other namespaces.

Additional info:

Related to https://issues.redhat.com/browse/OCPBUGS-1493

Description of the problem:

When machines have multiple IP addresses assigned to the same network interface the assisted service will create the bare metal host configuration using the first IP address of the interface. That IP address may or may not be inside the machine CIDR of the cluster. If it isn't then the bare metal host will have an IP address that is different to the IP address of the corresponding node. As a result of that the machine operator will not link the machine and the node, and the machine will never move to the `Running` phase. In that situation the corresponding machine pool will never have the minimum required number of replicas. For worker machine pools that means that the cluster will never be considered completely installed.

How reproducible:

Note that this is easy to reproduce using the current zero touch provisioning factory workflow (ZTPFW), because machines with a single NIC will have two IP addresses assigned. It may be harder to reproduce in other scenarios.

Steps to reproduce:

1. Create a bare metal cluster with three control plane nodes and one worker node, where nodes have one NIC and two IP addresses assigned to that NIC. In the ZTPFW scenario that will be a static IP address in the 192.168.7.0/24 range (which is the machine CIDR of the cluster) and another IP address assigned via DHCP, say in the 192.168.150.0/24 range (which is not the machine CIDR of the cluster).

2. Start the installation.

3. Check the manifests generated by the assisted service, in particular the `99_openshift-cluster-api_hosts-*.yaml` files. Those will contain the definition of the bare metal hosts, together with a `baremetalhost.metal3.io/status` annotation that contains the status that they should have. Check that it contains the wrong IP address in the 192.168.150.0/24 range, outside of the machine CIDR of the cluster.

4. Check that none of the machines (oc get machine -A) moved to the `Running` phase. That is because the machine API operator can't link them to the nodes due to the mismatched IP addresses: nodes have 192.168.7.* and machines have 192.168.150.* (copied from the bare metal hosts).

5. Check that the worker machine pool doesn't have the minimum required number of replicas.

6. Check that the installation doesn't complete.

Actual results:

The machines aren't in the `Running` phase, the worker pool doesn't have the minimum required number of replicas and the installation doesn't complete.
 
Expected results:

All the machines should move to the `Running` phase, the worker pool should have the minimum required number of replicas and the installation should complete.
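
The address selection that would avoid this mismatch can be sketched with Python's standard `ipaddress` module. This is an illustration rather than the assisted service's code: the function name is hypothetical, and returning `None` when no address matches is an assumption about how the caller would handle that case.

```python
import ipaddress


def pick_host_address(interface_addresses, machine_cidrs):
    """Return the first interface address that lies inside one of the machine CIDRs.

    Returns None when no address matches, so the caller can decide how to proceed.
    """
    networks = [ipaddress.ip_network(cidr) for cidr in machine_cidrs]
    for raw in interface_addresses:
        address = ipaddress.ip_address(raw)
        if any(address in network for network in networks):
            return raw
    return None
```

In the ZTPFW scenario above, a NIC that reports a 192.168.150.* address first and a 192.168.7.* address second would yield the 192.168.7.* address when the machine CIDR is 192.168.7.0/24, instead of simply taking the first one.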

Description of problem:

The agent create sub-command shows a fatal error when it is run with an invalid argument.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Execute `openshift-install agent create invalid`

Actual results:

FATA[0000] Error executing openshift-install: accepts 0 arg(s), received 1 

Expected results:

It should return the help of the create command.

Additional info:

 

As a developer, I would like a Makefile target that performs all the pre-commit checks that should be run before committing any code to GitHub. This includes updating Golang and API dependencies, building the source code, building the e2e tests, verifying source code formatting, and running unit tests.

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/62

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

What happens:

When deploying OpenShift 4.13 with Failure Domains, the PrimarySubnet in the ProviderSpec of the Machine is set to the MachinesSubnet from install-config.yaml.

 

What is expected:

Machines in failure domains with a control-plane port target should not use the MachinesSubnet as the primary subnet in the provider spec. The primary subnet should be the ID of the subnet that is actually used for the control plane in that failure domain.

 

How to reproduce:

install-config.yaml:

apiVersion: v1
baseDomain: shiftstack.com
compute:
- name: worker
  platform:
    openstack:
      type: m1.xlarge
  replicas: 1
controlPlane:
  name: master
  platform:
    openstack:
      type: m1.xlarge
      failureDomains:
      - portTargets:
        - id: control-plane
          network:
            id: fb6f8fea-5063-4053-81b3-6628125ed598
          fixedIPs:
          - subnet:
              id: b02175dd-95c6-4025-8ff3-6cf6797e5f86
        computeAvailabilityZone: nova-az1
        storageAvailabilityZone: cinder-az1
      - portTargets:
        - id: control-plane
          network:
            id: 9a5452a8-41d9-474c-813f-59b6c34194b6
          fixedIPs:
          - subnet:
              id: 5fe5b54a-217c-439d-b8eb-441a03f7636d
        computeAvailabilityZone: nova-az1
        storageAvailabilityZone: cinder-az1
      - portTargets:
        - id: control-plane
          network:
            id: 3ed980a6-6f8e-42d3-8500-15f18998c434
          fixedIPs:
          - subnet:
              id: a7d57db6-f896-475f-bdca-c3464933ec02
        computeAvailabilityZone: nova-az1
        storageAvailabilityZone: cinder-az1
  replicas: 3
metadata:
  name: mycluster
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.168.10.0/24
  - cidr: 192.168.20.0/24
  - cidr: 192.168.30.0/24
  - cidr: 192.168.72.0/24
  - cidr: 192.168.100.0/24
platform:
  openstack:
    cloud: foch_openshift
    machinesSubnet: b02175dd-95c6-4025-8ff3-6cf6797e5f86
    apiVIPs:
    - 192.168.100.240
    ingressVIPs:
    - 192.168.100.250
    loadBalancer:
      type: UserManaged
featureSet: TechPreviewNoUpgrade

Machine spec:

  Provider Spec:
    Value:
      API Version:  machine.openshift.io/v1alpha1
      Cloud Name:   openstack
      Clouds Secret:
        Name:       openstack-cloud-credentials
        Namespace:  openshift-machine-api
      Flavor:       m1.xlarge
      Image:        foch-bgp-2fnjz-rhcos
      Kind:         OpenstackProviderSpec
      Metadata:
        Creation Timestamp:  <nil>
      Networks:
        Filter:
        Subnets:
          Filter:
            Id:        5fe5b54a-217c-439d-b8eb-441a03f7636d
        Uuid:          9a5452a8-41d9-474c-813f-59b6c34194b6
      Primary Subnet:  b02175dd-95c6-4025-8ff3-6cf6797e5f86
      Security Groups:
        Filter:
        Name:  foch-bgp-2fnjz-master
        Filter:
        Uuid:             1b142123-c085-4e14-b03a-cdf5ef028d91
      Server Group Name:  foch-bgp-2fnjz-master
      Server Metadata:
        Name:                  foch-bgp-2fnjz-master
        Openshift Cluster ID:  foch-bgp-2fnjz
      Tags:
        openshiftClusterID=foch-bgp-2fnjz
      Trunk:  true
      User Data Secret:
        Name:  master-user-data
Status:
  Addresses:
    Address:  192.168.20.20
    Type:     InternalIP
    Address:  foch-bgp-2fnjz-master-1
    Type:     Hostname
    Address:  foch-bgp-2fnjz-master-1
    Type:     InternalDNS 

The machine is connected to the right subnet, but has a wrong PrimarySubnet configured.
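
The expected selection can be sketched as follows. This is only an illustration of the rule stated above, not the installer's code: the function name is hypothetical, the input is one failure domain from install-config.yaml parsed into a dictionary, and falling back to machinesSubnet when no control-plane port target is present is an assumption.

```python
def primary_subnet_for(failure_domain: dict, machines_subnet: str) -> str:
    """Pick the primary subnet ID for a machine placed in the given failure domain."""
    for target in failure_domain.get("portTargets", []):
        if target.get("id") != "control-plane":
            continue
        for fixed_ip in target.get("fixedIPs", []):
            subnet_id = fixed_ip.get("subnet", {}).get("id")
            if subnet_id:
                # Use the subnet the control-plane port is actually attached to.
                return subnet_id
    # Assumed fallback: no control-plane port target, keep the previous behavior.
    return machines_subnet
```

For the second failure domain in the install-config.yaml above, this returns 5fe5b54a-217c-439d-b8eb-441a03f7636d, which matches the subnet in the machine's Networks section, rather than b02175dd-95c6-4025-8ff3-6cf6797e5f86.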

Description of problem:
It is better for pod-security admission config to use v1 like upstream instead of still using v1beta1

Version-Release number of selected component (if applicable):
4.12, 4.13

How reproducible:
Always

Steps to Reproduce:
1. In upstream, when it was 1.24, https://v1-24.docs.kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-admission-controller/#configure-the-admission-controller shows "pod-security.admission.config.k8s.io/v1beta1".

When it was 1.25 (OCP 4.12), https://v1-25.docs.kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-admission-controller/#configure-the-admission-controller no longer shows "pod-security.admission.config.k8s.io/v1beta1". At the bottom, it notes: pod-security.admission.config.k8s.io/v1 configuration requires v1.25+. For v1.23 and v1.24, use v1beta1.

In OCP 4.12 (1.25) and 4.13 (1.26), it is still v1beta1; we should align with upstream:

4.12:
$ oc version
..
Server Version: 4.12.9
Kubernetes Version: v1.25.7+eab9cc9

$ jq "" $(oc extract cm/config -n openshift-kube-apiserver --confirm) | jq '.admission.pluginConfig.PodSecurity'
{
  "configuration": {
    "apiVersion": "pod-security.admission.config.k8s.io/v1beta1",
    "defaults": {
      "audit": "restricted",
      "audit-version": "latest",
      "enforce": "privileged",
      "enforce-version": "latest",
      "warn": "restricted",
      "warn-version": "latest"
    },
    "exemptions": {
      "usernames": [
        "system:serviceaccount:openshift-infra:build-controller"
      ]
    },
    "kind": "PodSecurityConfiguration"
  }
}

4.13:
$ oc version
...
Server Version: 4.13.0-0.nightly-2023-03-23-204038
Kubernetes Version: v1.26.2+dc93b13

$ jq "" $(oc extract cm/config -n openshift-kube-apiserver --confirm) | jq '.admission.pluginConfig.PodSecurity'
{
  "configuration": {
    "apiVersion": "pod-security.admission.config.k8s.io/v1beta1",
    "defaults": {
      "audit": "restricted",
      "audit-version": "latest",
      "enforce": "privileged",
      "enforce-version": "latest",
      "warn": "restricted",
      "warn-version": "latest"
    },
    "exemptions": {
      "usernames": [
        "system:serviceaccount:openshift-infra:build-controller"
      ]
    },
    "kind": "PodSecurityConfiguration"
  }
}

Actual results:

See above.

Expected results:

The pod-security admission config should align with upstream and use v1 rather than v1beta1.
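
The version rule quoted from the upstream documentation (v1 requires v1.25+, v1beta1 for v1.23 and v1.24) can be written as a small check against the extracted kube-apiserver config. The function names are hypothetical, and the sketch assumes the config has the shape shown in the jq output above.

```python
GROUP = "pod-security.admission.config.k8s.io"


def expected_podsecurity_api_version(kube_minor: int) -> str:
    """Return the config apiVersion upstream documents for a Kubernetes 1.<kube_minor> release."""
    return f"{GROUP}/v1" if kube_minor >= 25 else f"{GROUP}/v1beta1"


def podsecurity_config_is_aligned(config: dict, kube_minor: int) -> bool:
    """Check the PodSecurity plugin config in a parsed kube-apiserver config."""
    configured = config["admission"]["pluginConfig"]["PodSecurity"]["configuration"]["apiVersion"]
    return configured == expected_podsecurity_api_version(kube_minor)
```

With the 4.12 and 4.13 configs shown above (Kubernetes 1.25 and 1.26), the check returns False because both still use v1beta1.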

Additional info:

 

Description of problem:

InfrastructureRef (a pointer field) is dereferenced without first checking for a nil value.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Run TechPreview cluster
2. Try to create Cluster object with empty spec
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: example
  namespace: openshift-cluster-api
spec: {}
3. Observe panic in cluster-capi-operator

Actual results:

2023/03/10 14:13:31 http: panic serving 10.129.0.2:39614: runtime error: invalid memory address or nil pointer dereference
goroutine 3619 [running]:
net/http.(*conn).serve.func1()
    /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x16cada0, 0x2948bc0})
    /usr/lib/golang/src/runtime/panic.go:890 +0x262
github.com/openshift/cluster-capi-operator/pkg/webhook.(*ClusterWebhook).ValidateCreate(0xc000ceac00?, {0x24?, 0xc00090fff0?}, {0x1b72d68?, 0xc0010831e0?})
    /go/src/github.com/openshift/cluster-capi-operator/pkg/webhook/cluster.go:32 +0x39
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*validatorForType).Handle(_, {_, _}, {{{0xc000ceac00, 0x24}, {{0xc00090fff0, 0x10}, {0xc000838000, 0x7}, {0xc000838007, ...}}, ...}})
    /go/src/github.com/openshift/cluster-capi-operator/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/validator_custom.go:79 +0x2dd
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle(_, {_, _}, {{{0xc000ceac00, 0x24}, {{0xc00090fff0, 0x10}, {0xc000838000, 0x7}, {0xc000838007, ...}}, ...}})
    /go/src/github.com/openshift/cluster-capi-operator/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:169 +0xfd
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP(0xc000630e80, {0x7f26f94b5580?, 0xc000f80280}, 0xc000750800)
    /go/src/github.com/openshift/cluster-capi-operator/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:98 +0xeb5
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerInFlight.func1({0x7f26f94b5580, 0xc000f80280}, 0x1b7ff00?)
    /go/src/github.com/openshift/cluster-capi-operator/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:60 +0xd4
net/http.HandlerFunc.ServeHTTP(0x1b7ffb0?, {0x7f26f94b5580?, 0xc000f80280?}, 0x7afe60?)
    /usr/lib/golang/src/net/http/server.go:2109 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1({0x1b7ffb0?, 0xc000a72000?}, 0xc000750800)
    /go/src/github.com/openshift/cluster-capi-operator/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:146 +0xb8
net/http.HandlerFunc.ServeHTTP(0x0?, {0x1b7ffb0?, 0xc000a72000?}, 0xc00056f0e1?)
    /usr/lib/golang/src/net/http/server.go:2109 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2({0x1b7ffb0, 0xc000a72000}, 0xc000750800)
    /go/src/github.com/openshift/cluster-capi-operator/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:108 +0xbf
net/http.HandlerFunc.ServeHTTP(0xc000a72000?, {0x1b7ffb0?, 0xc000a72000?}, 0x18e45d1?)
    /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0xc00056f0c0?, {0x1b7ffb0, 0xc000a72000}, 0xc000750800)
    /usr/lib/golang/src/net/http/server.go:2487 +0x149
net/http.serverHandler.ServeHTTP({0x1b71dc8?}, {0x1b7ffb0, 0xc000a72000}, 0xc000750800)
    /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc00039af00, {0x1b81198, 0xc000416c00})
    /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
    /usr/lib/golang/src/net/http/server.go:3102 +0x4db

Expected results:

Webhook returns error, but does not panic
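
The expected behaviour can be illustrated with a small guard. The real webhook is Go code in cluster-capi-operator; the Python below is only a sketch of the idea, and the function name `validate_cluster_create` and the error strings are assumptions made for illustration.

```python
from typing import List


def validate_cluster_create(cluster: dict) -> List[str]:
    """Return validation errors for a Cluster manifest; never raise on missing fields."""
    errors: List[str] = []
    spec = cluster.get("spec") or {}
    infra_ref = spec.get("infrastructureRef")
    if infra_ref is None:
        # Guard: report the missing reference instead of dereferencing it.
        errors.append("spec.infrastructureRef: Required value")
        return errors
    if not infra_ref.get("kind"):
        errors.append("spec.infrastructureRef.kind: Required value")
    return errors


# The manifest from the reproduction steps, expressed as a dict.
empty_spec_cluster = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "example", "namespace": "openshift-cluster-api"},
    "spec": {},
}

print(validate_cluster_create(empty_spec_cluster))
```

With the guard in place, an empty spec yields a validation error that the webhook can return to the client, rather than a panic in the HTTP handler.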

Additional info:

 

Description of problem:


On August 24th, a bugfix was merged into the hypershift repo to address OCPBUGS-16813 (https://github.com/openshift/hypershift/pull/2942). This changed the konnectivity server deployment within the HCP namespace: we went from a single konnectivity server to multiple servers when HA HCPs are in use.

The konnectivity agents within the HCP worker nodes connect to the server through a route. When connecting through this route, the agents on the worker are supposed to discover all the HA konnectivity servers through round robin load balancing, meaning if the agents try to connect to the route endpoint enough times, the theory is that they should eventually discover all the servers.

With the kubevirt platform, only a single konnectivity server is discovered by the agents in the worker nodes, which leads to the inability for the kas on the HCP to reliably contact kubelets within the worker nodes.

The outcome of this issue is that webhooks (and other connections that require the kas (api server) in the HCP to contact worker nodes) fail the majority of the time.

Version-Release number of selected component (if applicable):


How reproducible:


Create a kubevirt platform HCP using the `hcp` cli tool. This defaults to HA mode, and the cluster never fully rolls out. The ingress, monitoring, and console clusteroperators flap back and forth between failing and succeeding. Usually we see an error about webhook connectivity failing.

During this time, any `oc` command that attempts to tunnel a connection through the kas to the kubelets will fail the majority of the time. This means `oc logs`, `oc exec`, etc... will not work. 


Actual results:

kas -> kubelet connections are unreliable

Expected results:


kas -> kubelet connections are reliable

Additional info:


Description of problem:

After updating the CPMS vmSize on ASH, provisioning a new control plane node fails with the error "The value 1024 of parameter 'osDisk.diskSizeGB' is out of range. The value must be between '1' and '1023', inclusive." Target="osDisk.diskSizeGB". After changing diskSizeGB to 1023, new nodes are provisioned. However, for a fresh install, the default diskSizeGB for masters is 1024.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-01-27-165107

How reproducible:

Always

Steps to Reproduce:

1. Update cpms vmSize to Standard_DS3_v2
2. Check new machine state
$ oc get machine  
NAME                                PHASE     TYPE              REGION   ZONE   AGE
jima28b-r9zht-master-h7g67-1        Running   Standard_DS5_v2   mtcazs          11h
jima28b-r9zht-master-hhfzl-0        Failed                                      24s
jima28b-r9zht-master-qtb9j-0        Running   Standard_DS5_v2   mtcazs          11h
jima28b-r9zht-master-tprc7-2        Running   Standard_DS5_v2   mtcazs          11h

$ oc get machine jima28b-r9zht-master-hhfzl-0 -o yaml
  errorMessage: 'failed to reconcile machine "jima28b-r9zht-master-hhfzl-0": failed
    to create vm jima28b-r9zht-master-hhfzl-0: failure sending request for machine
    jima28b-r9zht-master-hhfzl-0: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate:
    Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter"
    Message="The value 1024 of parameter ''osDisk.diskSizeGB'' is out of range. The
    value must be between ''1'' and ''1023'', inclusive." Target="osDisk.diskSizeGB"'
  errorReason: InvalidConfiguration
  lastUpdated: "2023-01-29T02:35:13Z"
  phase: Failed
  providerStatus:
    conditions:
    - lastTransitionTime: "2023-01-29T02:35:13Z"
      message: 'failed to create vm jima28b-r9zht-master-hhfzl-0: failure sending
        request for machine jima28b-r9zht-master-hhfzl-0: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate:
        Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter"
        Message="The value 1024 of parameter ''osDisk.diskSizeGB'' is out of range.
        The value must be between ''1'' and ''1023'', inclusive." Target="osDisk.diskSizeGB"'
      reason: MachineCreationFailed
      status: "False"
      type: MachineCreated
    metadata: {}
3. Check logs
$ oc logs -f machine-api-controllers-84444d49f-mlldl -c machine-controller
I0129 02:35:15.047784       1 recorder.go:103] events "msg"="InvalidConfiguration: failed to reconcile machine \"jima28b-r9zht-master-hhfzl-0\": failed to create vm jima28b-r9zht-master-hhfzl-0: failure sending request for machine jima28b-r9zht-master-hhfzl-0: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code=\"InvalidParameter\" Message=\"The value 1024 of parameter 'osDisk.diskSizeGB' is out of range. The value must be between '1' and '1023', inclusive.\" Target=\"osDisk.diskSizeGB\"" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jima28b-r9zht-master-hhfzl-0","uid":"6cb07114-41a6-40bc-8e83-d9f27931bc8c","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"451889"} "reason"="FailedCreate" "type"="Warning"

$ oc logs -f control-plane-machine-set-operator-69b756df4f-skv4x
E0129 02:35:13.282358       1 controller.go:818]  "msg"="Observed failed replacement control plane machines" "error"="found replacement control plane machines in an error state, the following machines(s) are currently reporting an error: jima28b-r9zht-master-hhfzl-0" "controller"="controlplanemachineset" "failedReplacements"="jima28b-r9zht-master-hhfzl-0" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="a988d699-8ddc-4880-9930-0db64ca51653"
I0129 02:35:13.282380       1 controller.go:264]  "msg"="Cluster state is degraded. The control plane machine set will not take any action until issues have been resolved." "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="a988d699-8ddc-4880-9930-0db64ca51653"
4. Change diskSizeGB to 1023, new machine Provisioned.
            osDisk:
              diskSettings: {}
              diskSizeGB: 1023

$ oc get machine                  
NAME                                PHASE      TYPE              REGION   ZONE   AGE
jima28b-r9zht-master-h7g67-1        Running    Standard_DS5_v2   mtcazs          11h
jima28b-r9zht-master-hhfzl-0        Deleting                                     7m1s
jima28b-r9zht-master-qtb9j-0        Running    Standard_DS5_v2   mtcazs          12h
jima28b-r9zht-master-tprc7-2        Running    Standard_DS5_v2   mtcazs          11h
jima28b-r9zht-worker-mtcazs-p8d79   Running    Standard_DS3_v2   mtcazs          18h
jima28b-r9zht-worker-mtcazs-x5gvh   Running    Standard_DS3_v2   mtcazs          18h
jima28b-r9zht-worker-mtcazs-xmdvw   Running    Standard_DS3_v2   mtcazs          18h
$ oc get machine        
NAME                                PHASE         TYPE              REGION   ZONE   AGE
jima28b-r9zht-master-h7g67-1        Running       Standard_DS5_v2   mtcazs          11h
jima28b-r9zht-master-qtb9j-0        Running       Standard_DS5_v2   mtcazs          12h
jima28b-r9zht-master-tprc7-2        Running       Standard_DS5_v2   mtcazs          11h
jima28b-r9zht-master-vqd7r-0        Provisioned   Standard_DS3_v2   mtcazs          16s
jima28b-r9zht-worker-mtcazs-p8d79   Running       Standard_DS3_v2   mtcazs          18h
jima28b-r9zht-worker-mtcazs-x5gvh   Running       Standard_DS3_v2   mtcazs          18h
jima28b-r9zht-worker-mtcazs-xmdvw   Running       Standard_DS3_v2   mtcazs          18h

Actual results:

For a fresh install, the default diskSizeGB for masters is 1024. But after updating the CPMS vmSize, creating the new master failed with the error "The value 1024 of parameter ''osDisk.diskSizeGB'' is out of range.  The value must be between ''1'' and ''1023'', inclusive".
After changing diskSizeGB to 1023, the new machine was Provisioned.

Expected results:

A new master can be created when the VM type changes, without needing to update diskSizeGB to 1023.
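
The limit behind the failure can be expressed as a simple range check. This is a minimal sketch, assuming the bounds quoted in the Azure Stack Hub error message (1 to 1023, inclusive); the constant and function names are hypothetical and do not come from the machine API provider.

```python
from typing import Optional

# Bounds taken from the error message in this report.
OS_DISK_MIN_GB = 1
OS_DISK_MAX_GB = 1023


def os_disk_size_error(size_gb: int) -> Optional[str]:
    """Return an error string when size_gb is outside the accepted range, else None."""
    if OS_DISK_MIN_GB <= size_gb <= OS_DISK_MAX_GB:
        return None
    return (
        f"osDisk.diskSizeGB {size_gb} is out of range "
        f"[{OS_DISK_MIN_GB}, {OS_DISK_MAX_GB}]"
    )


for size in (1023, 1024):
    print(size, os_disk_size_error(size))
```

A check of this kind shows why the installer default of 1024 is rejected while 1023 is accepted.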

Additional info:

Minimum recommendation for control plane nodes is 1024 GB
https://docs.openshift.com/container-platform/4.12/installing/installing_azure_stack_hub/installing-azure-stack-hub-network-customizations.html#installation-azure-stack-hub-config-yaml_installing-azure-stack-hub-network-customizations

Description of problem:

When the releaseImage is a digest, for example quay.io/openshift-release-dev/ocp-release@sha256:bbf1f27e5942a2f7a0f298606029d10600ba0462a09ab654f006ce14d314cb2c, a spurious warning is output when running
openshift-install agent create image

It's not calculating the releaseImage properly (see the '@sha' suffix below), so it emits this spurious message:
WARNING The ImageContentSources configuration in install-config.yaml should have at-least one source field matching the releaseImage value quay.io/openshift-release-dev/ocp-release@sha256 

This can cause confusion for users.
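
To illustrate the parsing problem, the sketch below contrasts a naive split on the last colon, which is one plausible way a value ending in "@sha256" could arise (an assumption, not a confirmed description of the installer code), with a split that handles digests before tags. The function names are hypothetical.

```python
def naive_repository(ref: str) -> str:
    """Drop everything after the last ':'; this mishandles digest references."""
    return ref.rsplit(":", 1)[0]


def image_repository(ref: str) -> str:
    """Strip a digest ('@sha256:...') or a tag (':tag') from an image reference."""
    name = ref.partition("@")[0]  # remove the digest, if any
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    if colon > last_slash:
        # A colon after the last slash is a tag separator, not a registry port.
        name = name[:colon]
    return name


digest_ref = (
    "quay.io/openshift-release-dev/ocp-release@sha256:"
    "bbf1f27e5942a2f7a0f298606029d10600ba0462a09ab654f006ce14d314cb2c"
)
print(naive_repository(digest_ref))
print(image_repository(digest_ref))
```

Comparing the source fields of ImageContentSources against the second form avoids a warning for references that differ only by their digest.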

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Every time a release image with a digest is used

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Unable to convert a deployment to Serverless because the Make Serverless form in the console is broken.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1. Create a deployment using a Container image flow
2. Select Make Serverless option from the topology actions menu of the created deployment
3.

Actual results:

After clicking Create, it throws an error

Expected results:

Should create a Serverless resource.

Additional info:

 

Description of problem:

OpenStack features SG rules opening traffic from `0.0.0.0/0` on NodePorts. This was required for the OVN loadbalancers to work properly, as they keep the source IP of the traffic when traffic reaches the LB members. This isn't needed anymore because in 4.14 OSASINFRA-3067 implemented and enabled the `manage-security-groups` option in cloud-provider-openstack, so that it creates and attaches the proper SG on its own to make sure only the necessary NodePorts are open.

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Check for existence of rules opening traffic from 0.0.0.0/0 on the master and worker nodes.

Actual results:

Rules are still there.

Expected results:

Rules are not needed anymore.

Additional info:


Description of the problem:

According to swagger.yaml cpu_architecture in infra-envs can include 'multi', but that only makes sense in the cluster entity. 

(Slack thread: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1680095368006089)

How reproducible:

100%

Steps to reproduce:

1. Check out the swagger.yaml here

Actual results:

 enum: ['x86_64', 'aarch64', 'arm64','ppc64le','s390x','multi']

Expected results:

enum: ['x86_64', 'aarch64', 'arm64','ppc64le','s390x']

Description of problem:

Running `openshift-install cluster destroy` defeats an OpenStack cloud with many Swift objects, if said cloud is low on resources.

In particular, testing the teardown of an OCP cluster with 500.000 objects in the image registry caused RabbitMQ to crash on a standalone (single-host) OpenStack deployment backed with NVMe storage.

Version-Release number of selected component (if applicable):


How reproducible:

on a constrained (single-host) OpenStack cloud, with the default limit of 10000 to the bulk-deletion of Swift objects.

Steps to Reproduce:

1. install OpenShift
2. upload 500000 arbitrary objects in the image-registry container
3. launch cluster teardown
4. observe Swift responding with 504 errors, and the rest of the cluster becoming unstable
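
One way to reduce the pressure on a constrained cloud is to delete in smaller batches. The sketch below only shows the batching step; the helper name `batches` and the batch size of 50 are hypothetical choices for illustration, and the call that would actually delete each batch from Swift is deliberately left out.

```python
from typing import Iterable, Iterator, List


def batches(names: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` object names."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for name in names:
        batch.append(name)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


# 500000 arbitrary object names, as in the reproduction steps.
object_names = (f"object-{i}" for i in range(500000))
request_count = sum(1 for _ in batches(object_names, 50))
print(request_count)
```

Smaller batches mean more requests, but each one asks the backend to do far less work at a time.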

Description of problem:

The ingress operator constantly reverts internal services when it detects service changes that are only default values.

Version-Release number of selected component (if applicable):

4.13, 4.14

How reproducible:

100%

Steps to Reproduce:

1. Create an ingress controller
2. Watch ingress operator logs for excess updates "updated internal service"
[I'll provide a more specific reproducer if needed]

Actual results:

Excess:
2023-05-04T02:08:02.331Z INFO operator.ingress_controller ingress/internal_service.go:44 updated internal service ...

Expected results:

No updates

Additional info:

The diff looks like:
2023-05-05T15:12:06.668Z    INFO    operator.ingress_controller    ingress/internal_service.go:44    updated internal service    {"namespace": "openshift-ingress", "name": "router-internal-default", "diff": "  &v1.Service{
    TypeMeta:   {},
    ObjectMeta: {Name: \"router-internal-default\", Namespace: \"openshift-ingress\", UID: \"815f1499-a4d4-4cb8-9a5b-9905580e0ffd\", ResourceVersion: \"8031\", ...},
    Spec: v1.ServiceSpec{
      Ports:                    {{Name: \"http\", Protocol: \"TCP\", Port: 80, TargetPort: {Type: 1, StrVal: \"http\"}, ...}, {Name: \"https\", Protocol: \"TCP\", Port: 443, TargetPort: {Type: 1, StrVal: \"https\"}, ...}, {Name: \"metrics\", Protocol: \"TCP\", Port: 1936, TargetPort: {Type: 1, StrVal: \"metrics\"}, ...}},
      Selector:                 {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"},
      ClusterIP:                \"172.30.56.107\",
-     ClusterIPs:               []string{\"172.30.56.107\"},
+     ClusterIPs:               nil,
      Type:                     \"ClusterIP\",
      ExternalIPs:              nil,
-     SessionAffinity:          \"None\",
+     SessionAffinity:          \"\",
      LoadBalancerIP:           \"\",
      LoadBalancerSourceRanges: nil,
      ... // 3 identical fields
      PublishNotReadyAddresses:      false,
      SessionAffinityConfig:         nil,
-     IPFamilies:                    []v1.IPFamily{\"IPv4\"},
+     IPFamilies:                    nil,
-     IPFamilyPolicy:                &\"SingleStack\",
+     IPFamilyPolicy:                nil,
      AllocateLoadBalancerNodePorts: nil,
      LoadBalancerClass:             nil,
-     InternalTrafficPolicy:         &\"Cluster\",
+     InternalTrafficPolicy:         nil,
    },
    Status: {},
  }
"}

Messing around with unit testing, it looks like internalServiceChanged triggers true when spec.IPFamilies, spec.IPFamilyPolicy, and spec.InternalTrafficPolicy are set to the default values that you see in the diff above.

Ingress operator then resets back to nil, then the API server sets them to their defaults, and this process repeats.

internalServiceChanged should either ignore, or explicitly set these values.
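
The "ignore" option can be sketched as a comparison that only looks at fields the operator sets explicitly. This is not the operator's Go implementation; the Python function below borrows the name `internal_service_changed` purely for illustration, and the dict keys mirror the Service spec fields from the diff above.

```python
def _is_unset(value) -> bool:
    """Treat None, empty strings, and empty lists as 'not set by the operator'."""
    return value is None or value == "" or value == []


def internal_service_changed(current_spec: dict, desired_spec: dict) -> bool:
    """Report a change only for fields that the desired spec sets explicitly."""
    return any(
        current_spec.get(field) != value
        for field, value in desired_spec.items()
        if not _is_unset(value)
    )


# Spec as stored by the API server, with defaults filled in.
current = {
    "type": "ClusterIP",
    "clusterIP": "172.30.56.107",
    "clusterIPs": ["172.30.56.107"],
    "sessionAffinity": "None",
    "ipFamilies": ["IPv4"],
    "ipFamilyPolicy": "SingleStack",
    "internalTrafficPolicy": "Cluster",
}
# Spec as the operator would generate it, leaving defaulted fields unset.
desired = {
    "type": "ClusterIP",
    "clusterIPs": None,
    "sessionAffinity": "",
    "ipFamilies": None,
    "ipFamilyPolicy": None,
    "internalTrafficPolicy": None,
}

print(internal_service_changed(current, desired))
```

With this comparison, server-side defaulting no longer registers as drift, so the update loop described above does not start.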

Description of the problem:

In the Create cluster wizard -> Networking page, an error is shown saying that the cluster is not ready yet. The warning message suggests defining the API or Ingress IP, but they are already entered in the form and in the YAML (see screenshots attached)

Also, the hosts are oscillating between "Pending input" and "Insufficient" states, with the errors shown in the images

Found this error while testing epic MGMT-9907

MCE image 2.3.0-DOWNANDBACK-2023-03-28-23-01-58

 

Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/62

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The OpenShift DNS daemonset uses the rolling update strategy. The "maxSurge" parameter is set to a non-zero value, which means that the "maxUnavailable" parameter is set to zero. When the user replaces the toleration in the daemonset's template spec (via the OpenShift DNS config API), swapping the one that allows scheduling on the master nodes for any other toleration, the new pods still try to be scheduled on the master nodes. The old pods on the tolerated nodes may be recreated, but only if they are processed before any pod from a node that is no longer tolerated.

The new pods are not expected to be scheduled on the nodes which are not tolerated by the new daemonset's template spec. The daemonset controller should just delete the old pods from the nodes which cannot be tolerated anymore. The old pods from the nodes which can still be tolerated should be recreated according to the rolling update parameters.
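
The expected node selection can be sketched with a simplified model of taint and toleration matching. This is an approximation written for illustration: it ignores the tolerations that the daemonset controller adds automatically, and the function names and node names are hypothetical.

```python
from typing import Dict, List


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified toleration match: optional effect, then Exists or Equal on the key."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists matches every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def schedulable_nodes(nodes: Dict[str, List[dict]], tolerations: List[dict]) -> List[str]:
    """Return the nodes whose NoSchedule/NoExecute taints are all tolerated."""
    result = []
    for name, taints in nodes.items():
        blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
        if all(any(tolerates(tol, t) for tol in tolerations) for t in blocking):
            result.append(name)
    return result


nodes = {
    "master-0": [{"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}],
    "worker-a": [],
}
old_tolerations = [{"key": "node-role.kubernetes.io/master", "operator": "Exists"}]
new_tolerations = [{"key": "test-taint", "operator": "Exists"}]

print(schedulable_nodes(nodes, old_tolerations))
print(schedulable_nodes(nodes, new_tolerations))
```

Under the new tolerations only the worker remains eligible, so a rollout should surge pods on workers and simply remove the pods on masters.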

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:
1. Create the daemonset which tolerates "node-role.kubernetes.io/master" taint and has the following rolling update parameters:

$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.updateStrategy
rollingUpdate:
  maxSurge: 10%
  maxUnavailable: 0
type: RollingUpdate

$ oc  -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: node-role.kubernetes.io/master
  operator: Exists

2. Let the daemonset be scheduled on all the target nodes (e.g. all masters and all workers)

$ oc -n openshift-dns get pods  -o wide | grep dns-default
dns-default-6bfmf     2/2     Running   0          119m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>           <none>
dns-default-9cjdf     2/2     Running   0          2m35s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
dns-default-c6j9x     2/2     Running   0          119m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>           <none>
dns-default-fhqrs     2/2     Running   0          2m12s   10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
dns-default-lx2nf     2/2     Running   0          119m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>           <none>
dns-default-mmc78     2/2     Running   0          112m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>

3. Update the daemonset's tolerations by removing "node-role.kubernetes.io/master" and adding any other toleration (one for a non-existent taint works too):

$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: test-taint
  operator: Exists

Actual results:

$ oc -n openshift-dns get pods  -o wide | grep dns-default
dns-default-6bfmf     2/2     Running   0          124m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>           <none>
dns-default-76vjz     0/2     Pending   0          3m2s    <none>        <none>                                     <none>           <none>
dns-default-9cjdf     2/2     Running   0          7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
dns-default-c6j9x     2/2     Running   0          124m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>           <none>
dns-default-fhqrs     2/2     Running   0          7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
dns-default-lx2nf     2/2     Running   0          124m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>           <none>
dns-default-mmc78     2/2     Running   0          117m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>

Expected results:

$ oc -n openshift-dns get pods  -o wide | grep dns-default
dns-default-9cjdf     2/2     Running   0          7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
dns-default-fhqrs     2/2     Running   0          7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
dns-default-mmc78     2/2     Running   0          7m54s   10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>

Additional info:
Upstream issue: https://github.com/kubernetes/kubernetes/issues/118823
Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1687455135950439

Description of problem:

We shouldn't enforce PSa in 4.14, neither by label sync nor by global cluster config.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

100%

Steps to Reproduce:

As a cluster admin:
1. create two new namespaces/projects: pokus, openshift-pokus
2. as a cluster-admin, attempt to create a privileged pod in both the namespaces from 1.

Actual results:

pod creation is blocked by pod security admission

Expected results:

only a warning about pod violating the namespace pod security level should be emitted

Additional info:


Description of problem:

When you have a HCP running and it's creating the HostedCluster pods it renders this IgnitionProxy config:

defaults
  mode http
  timeout connect 5s
  timeout client 30s
  timeout server 30s

frontend ignition-server
  bind *:8443 ssl crt /tmp/tls.pem
  default_backend ignition_servers

backend ignition_servers
  server ignition-server ignition-server:443 check ssl ca-file /etc/ssl/root-ca/ca.crt

This configuration is not supported on IPv6, causing the worker nodes to fail to download the Ignition payload

 

Version-Release number of selected component (if applicable):

MCE 2.4
OCP 4.14

How reproducible:

Always

Steps to Reproduce:

1. Create a HostedCluster with the networking parameters set to IPv6 networks.
2. Check the IgnitionProxy config using: 

oc rsh <pod>
cat /tmp/haproxy.conf

Actual results:

Agent pod in the destination workers fails with:

Jul 26 10:23:44 localhost.localdomain next_step_runne[4242]: time="26-07-2023 10:23:44" level=error msg="ignition file download failed: request failed: Get \"https://ignition-server-clusters-hosted.apps.ocp-edge-cluster-0.qe.lab.redhat.com/ignition\": EOF" file="apivip_check.go:160"

Expected results:

The worker should download the ignition payload properly

Additional info:

N/A

4.14 e2e-metal-ipi jobs are failing with 

: [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] 

e.g. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn/1643459330390888448

 

This is the alert that is firing,

promQL query returned unexpected results:
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards|KubeJobFailed|KubePodNotReady|etcdMembersDown|etcdGRPCRequestsSlow|etcdHighNumberOfFailedGRPCRequests|etcdMemberCommunicationSlow|etcdNoLeader|etcdHighFsyncDurations|etcdHighCommitDurations|etcdInsufficientMembers|etcdHighNumberOfLeaderChanges|KubeAPIErrorBudgetBurn|KubeClientErrors|KubePersistentVolumeErrors|MCDDrainError|MCDPivotError|PrometheusOperatorWatchErrors|RedhatOperatorsCatalogError|VSphereOpenshiftNodeHealthFail|SamplesImagestreamImportFailing",alertstate="firing",severity!="info"} >= 1
    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "catalog-operator-metrics",
          "namespace": "openshift-operator-lifecycle-manager",
          "prometheus": "openshift-monitoring/k8s",
          "service": "catalog-operator-metrics",
          "severity": "warning"
        },
        "value": [
          1680670057.374,
          "1"
        ]
      },

Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/37

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Currently, we unconditionally use an image mapping from the management
cluster if a mapping exists for ocp-release-dev or ocp/release.
When the individual images do not use those registries, the wrong
mapping is used.
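
The intended behaviour can be sketched as a lookup that applies a mapping only when the image actually lives under the mapped source repository. This is a minimal illustration, not the HyperShift implementation; the function name `mirrors_for`, the mirror host, and the second image reference are hypothetical.

```python
from typing import Dict, List


def mirrors_for(image: str, mappings: Dict[str, List[str]]) -> List[str]:
    """Return the mirrors for `image`, or an empty list when no source matches it."""
    repository = image.split("@")[0]
    for source, mirrors in mappings.items():
        # Match the repository itself or a path below it, never a mere string prefix.
        if repository == source or repository.startswith(source + "/"):
            return mirrors
    return []


# Hypothetical mapping with the same shape as an ICSP repositoryDigestMirrors entry.
mappings = {
    "quay.io/openshift-release-dev/ocp-release": [
        "mirror.example.com/openshift-release-dev/ocp-release",
    ],
}

release_image = "quay.io/openshift-release-dev/ocp-release@sha256:abc"
other_image = "registry.example.com/ci/some-component@sha256:def"

print(mirrors_for(release_image, mappings))
print(mirrors_for(other_image, mappings))
```

An image hosted elsewhere gets no mirrors and is pulled from its original location, rather than being rewritten to a registry that does not contain it.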

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Create an ICSP on a management cluster:

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: image-policy-39
spec:
  repositoryDigestMirrors:
  - mirrors:
    - quay.io/openshift-release-dev/ocp-release
    - pull.q1w2.quay.rhcloud.com/openshift-release-dev/ocp-release
    source: quay.io/openshift-release-dev/ocp-release

2. Create a HostedCluster that uses a CI release

Actual results:

Nodes never join because ignition server is looking up the wrong image for the CCO and MCO.

Expected results:

Nodes can join the cluster.

Additional info:

 

Please review the following PR: https://github.com/openshift/coredns/pull/89

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-11788.

Description of problem:


TRT has identified a likely regression in Metal IPv6 installations. 4.14 installs are statistically worse than 4.13. We are working on a new tool called Component Readiness that does cross-release comparisons to ensure nothing gets worse. I think it has found something in metal.

At GA, 4.13 metal installs for ipv6 upgrade micro jobs were 100%.  They are now around 89% in 4.14.  All the failures seem to have the same mode where no workers come up, with PXE errors in the serial console.  
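
As an illustration of how a drop from 100% to about 89% can be judged statistically, here is a sketch of a one-sided Fisher exact test. This is one common way to compare pass rates and is an assumption here; the text above does not say which test Component Readiness uses. The run counts in the usage note are hypothetical, since only the rates are given.

```python
from math import comb

def fisher_one_sided(base_pass: int, base_fail: int,
                     sample_pass: int, sample_fail: int) -> float:
    """P-value for seeing at least `sample_fail` failures in the sample by chance."""
    total_fail = base_fail + sample_fail
    sample_n = sample_pass + sample_fail
    total_n = base_pass + base_fail + sample_n
    upper = min(total_fail, sample_n)
    # Sum the hypergeometric probabilities for k >= observed sample failures.
    tail = sum(comb(total_fail, k) * comb(total_n - total_fail, sample_n - k)
               for k in range(sample_fail, upper + 1))
    return tail / comb(total_n, sample_n)
```

For example, with hypothetical counts of 100 passes and 0 failures in the base release against 89 passes and 11 failures in the sample release, `fisher_one_sided(100, 0, 89, 11)` is far below 0.05, so the difference would be unlikely to be noise.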


You can view the report here:

https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&baseEndTime=2023-05-16%2023%3A59%3A59&baseRelease=4.13&baseStartTime=2023-04-18%2000%3A00%3A00&capability=Other&component=Installer%20%2F%20openshift-installer&confidence=95&environment=ovn%20upgrade-micro%20amd64%20metal-ipi%20standard&excludeArches=arm64&excludeClouds=alibaba%2Cibmcloud%2Clibvirt%2Covirt&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&pity=5&platform=metal-ipi&sampleEndTime=2023-06-06%2023%3A59%3A59&sampleRelease=4.14&sampleStartTime=2023-05-09%2000%3A00%3A00&testId=cluster%20install%3A0cb1bb27e418491b1ffdacab58c5c8c0&testName=install%20should%20succeed%3A%20overall&upgrade=upgrade-micro&variant=standard

The serial console on the workers shows PXE errors:

>>Start PXE over IPv4.
  PXE-E18: Server response timeout.
BdsDxe: failed to load Boot0001 "UEFI PXEv4 (MAC:00962801D023)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(00962801D023,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Not Found

>>Start PXE over IPv6..
  Station IP address is FD00:1101:0:0:2EE1:8456:96FB:68B1
  Server IP address is FD00:1101:0:0:0:0:0:3
  NBP filename is snponly.efi
  NBP filesize is 0 Bytes
  PXE-E18: Server response timeout.
BdsDxe: failed to load Boot0002 "UEFI PXEv6 (MAC:00962801D023)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(00962801D023,0x1)/IPv6(0000:0000:0000:0000:0000:0000:0000:0000,0x0,Static,0000:0000:0000:0000:0000:0000:0000:0000,0x40,0000:0000:0000:0000:0000:0000:0000:0000): Not Found

>>Start HTTP Boot over IPv4.
  Error: Could not retrieve NBP file size from HTTP server.

  Error: Server response timeout.
BdsDxe: failed to load Boot0003 "UEFI HTTPv4 (MAC:00962801D023)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(00962801D023,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)/Uri(): Not Found

>>Start HTTP Boot over IPv6..
  Error: Could not retrieve NBP file size from HTTP server.

  Error: Remote boot cancelled.
BdsDxe: failed to load Boot0004 "UEFI HTTPv6 (MAC:00962801D023)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(00962801D023,0x1)/IPv6(0000:0000:0000:0000:0000:0000:0000:0000,0x0,Static,0000:0000:0000:0000:0000:0000:0000:0000,0x40,0000:0000:0000:0000:0000:0000:0000:0000)/Uri(): Not Found
BdsDxe: No bootable option or device was found.
BdsDxe: Press any key to enter the Boot Manager Menu.



Version-Release number of selected component (if applicable):


4.14

How reproducible:

10%

Steps to Reproduce:

1. 
2.
3.

Actual results:


Expected results:


Additional info:


Example failures:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1665428719952465920

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1664711616538611712

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1664645418744549376

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1663915360878858240




Description of the problem:

Creating a host without any disks produces the following error in the service log, but no indicative error message is displayed to the user.

In this case the status remains Discovering and the user cannot know what the issue is.

 

Log from the service:

time="2023-06-07T12:36:09Z" level=error msg="failed to create new validation context for host e0b465cc-e91f-4ca6-9594-27052a9a6f28" func="github.com/openshift/assisted-service/internal/host.(*Manager).IsValidMasterCandidate" file="/assisted-service/internal/host/host.go:1280" error="Inventory is not valid" pkg=cluster-state 

Example inventory:

{
  "bmc_address": "0.0.0.0",
  "bmc_v6address": "::/0",
  "boot": {
    "current_boot_mode": "uefi"
  },
  "cpu": {
    "architecture": "x86_64",
    "count": 8,
    "flags": [
      "fpu",
      "vme",
      "de",
      "pse",
      "tsc",
      "msr",
      "pae",
      "mce",
      "cx8",
      "apic",
      "sep",
      "mtrr",
      "pge",
      "mca",
      "cmov",
      "pat",
      "pse36",
      "clflush",
      "mmx",
      "fxsr",
      "sse",
      "sse2",
      "ht",
      "syscall",
      "nx",
      "mmxext",
      "fxsr_opt",
      "pdpe1gb",
      "rdtscp",
      "lm",
      "rep_good",
      "nopl",
      "cpuid",
      "extd_apicid",
      "tsc_known_freq",
      "pni",
      "pclmulqdq",
      "ssse3",
      "fma",
      "cx16",
      "pcid",
      "sse4_1",
      "sse4_2",
      "x2apic",
      "movbe",
      "popcnt",
      "tsc_deadline_timer",
      "aes",
      "xsave",
      "avx",
      "f16c",
      "rdrand",
      "hypervisor",
      "lahf_lm",
      "cmp_legacy",
      "cr8_legacy",
      "abm",
      "sse4a",
      "misalignsse",
      "3dnowprefetch",
      "osvw",
      "topoext",
      "perfctr_core",
      "ssbd",
      "ibrs",
      "ibpb",
      "stibp",
      "vmmcall",
      "fsgsbase",
      "tsc_adjust",
      "bmi1",
      "avx2",
      "smep",
      "bmi2",
      "rdseed",
      "adx",
      "smap",
      "clflushopt",
      "clwb",
      "sha_ni",
      "xsaveopt",
      "xsavec",
      "xgetbv1",
      "xsaves",
      "clzero",
      "xsaveerptr",
      "wbnoinvd",
      "arat",
      "umip",
      "vaes",
      "vpclmulqdq",
      "rdpid",
      "arch_capabilities"
    ],
    "frequency": 2545.214,
    "model_name": "AMD EPYC 7J13 64-Core Processor"
  },
  "disks": [],
  "gpus": [
    {
      "address": "0000:00:02.0"
    }
  ],
  "hostname": "02-00-17-01-2c-cf",
  "interfaces": [
    {
      "flags": [
        "up",
        "broadcast",
        "multicast"
      ],
      "has_carrier": true,
      "ipv4_addresses": [
        "10.0.28.205/20"
      ],
      "ipv6_addresses": [],
      "mac_address": "02:00:17:01:2c:cf",
      "mtu": 9000,
      "name": "ens3",
      "product": "0x101e",
      "speed_mbps": 50000,
      "type": "physical",
      "vendor": "0x15b3"
    }
  ],
  "memory": {
    "physical_bytes": 17179869184,
    "physical_bytes_method": "dmidecode",
    "usable_bytes": 16765730816
  },
  "routes": [
    {
      "destination": "0.0.0.0",
      "family": 2,
      "gateway": "10.0.16.1",
      "interface": "ens3",
      "metric": 100
    },
    {
      "destination": "10.0.16.0",
      "family": 2,
      "interface": "ens3",
      "metric": 100
    },
    {
      "destination": "10.88.0.0",
      "family": 2,
      "interface": "cni-podman0"
    },
    {
      "destination": "169.254.0.0",
      "family": 2,
      "interface": "ens3",
      "metric": 100
    },
    {
      "destination": "::1",
      "family": 10,
      "interface": "lo",
      "metric": 256
    },
    {
      "destination": "fe80::",
      "family": 10,
      "interface": "cni-podman0",
      "metric": 256
    },
    {
      "destination": "fe80::",
      "family": 10,
      "interface": "ens3",
      "metric": 1024
    }
  ],
  "system_vendor": {
    "manufacturer": "QEMU",
    "product_name": "Standard PC (i440FX + PIIX, 1996)",
    "virtual": true
  },
  "tpm_version": "none"
}
 

Steps to reproduce:

1. Register a new cluster 

2. Generate image and deploy nodes without disks

 

Actual results:

 

Expected results:

Fail validation if the inventory is invalid.
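
A rough sketch of the expected behavior, assuming a hypothetical `inventory_problems` helper (not the assisted-service implementation, which is written in Go): an inventory with an empty `disks` list yields a message that can be shown to the user instead of only being logged.

```python
import json

def inventory_problems(inventory_json: str) -> list[str]:
    """Return user-facing messages; an empty list means no problem was found."""
    try:
        inventory = json.loads(inventory_json)
    except json.JSONDecodeError:
        return ["Inventory could not be parsed"]
    if not isinstance(inventory, dict):
        return ["Inventory is not a JSON object"]
    problems = []
    # A missing or empty "disks" list is reported explicitly.
    if not inventory.get("disks"):
        problems.append("Host has no disks; at least one disk is required")
    return problems
```

Applied to the example inventory above, which contains `"disks": []`, this returns the "no disks" message rather than leaving the host in Discovering without explanation.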

 

Description of problem:

`cluster-reader` ClusterRole should have ["get", "list", "watch"] permissions for a number of privileged CRs, but lacks them for the API Group "k8s.ovn.org", which includes CRs such as EgressFirewalls, EgressIPs, etc.

Version-Release number of selected component (if applicable):

OCP 4.10 - 4.12 OVN

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster with OVN components, e.g. EgressFirewall
2. Check permissions of ClusterRole `cluster-reader`

Actual results:

No permissions for OVN resources 

Expected results:

Get, list, and watch verb permissions for OVN resources
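
To make the check in step 2 concrete, here is a small sketch that inspects the `rules` of a ClusterRole (for example, as parsed from `oc get clusterrole cluster-reader -o json`) and reports which read verbs are missing. The helper name and the resource name `egressfirewalls` are illustrative assumptions.

```python
READ_VERBS = {"get", "list", "watch"}

def missing_read_verbs(rules: list[dict], api_group: str, resource: str) -> set[str]:
    """Return the read verbs that no rule grants for api_group/resource."""
    granted: set[str] = set()
    for rule in rules:
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        group_ok = api_group in groups or "*" in groups
        resource_ok = resource in resources or "*" in resources
        if group_ok and resource_ok:
            verbs = rule.get("verbs", [])
            # A "*" verb grants everything, including all read verbs.
            granted |= READ_VERBS if "*" in verbs else READ_VERBS & set(verbs)
    return READ_VERBS - granted
```

An empty result means the role already grants get, list, and watch for that resource.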

Additional info:

Looks like a similar bug was opened for "network-attachment-definitions" in OCPBUGS-6959 (whose closure is being contested).

Description of problem:

The HostedCluster name is not currently validated against RFC1123.

Version-Release number of selected component (if applicable):

 

How reproducible:

Every time

Steps to Reproduce:

1.
2.
3.

Actual results:

Any HostedCluster name is allowed

Expected results:

Only HostedCluster names meeting RFC1123 validation should be allowed.
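
For illustration, a sketch of an RFC 1123 label check, using the pattern Kubernetes documents for DNS label names (lowercase alphanumerics and "-", starting and ending with an alphanumeric, at most 63 characters). Whether HostedCluster names should be validated as labels or as subdomains is not stated above, so treating them as labels is an assumption.

```python
import re

# DNS label per RFC 1123 as used for Kubernetes object names.
_DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")

def is_rfc1123_label(name: str) -> bool:
    """Return True if name is a valid RFC 1123 DNS label."""
    return len(name) <= 63 and _DNS1123_LABEL.fullmatch(name) is not None
```

Names such as `my-cluster-1` pass, while names with uppercase letters, underscores, or a leading hyphen are rejected.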

Additional info:

 

Hypershift needs to be able to specify a different release payload for control plane components without redeploying anything in the hosted cluster.

ovnkube-node DaemonSet pods in the hosted cluster and the ovnkube-master pods that run in the control plane both use the same ovn-kubernetes image passed to the CNO.

https://github.com/openshift/hypershift/blob/fc42313fc93125799f7eba5361190043cc2f6561/control-plane-operator/controllers/hostedcontrolplane/cno/clusternetworkoperator.go#L90

We need a way to specify these images separately for ovnkube-node and ovnkube-master.

Background:
https://docs.google.com/document/d/1a3tAS_K6lQ2iicjvuIvPIK5lervXFEVQBCAXopBAJ6o/edit

Description of problem:

The CoreDNS template implementation uses an incorrect regex for matching the dot [.] character.

Version-Release number of selected component (if applicable):

NA

How reproducible:

100% when you use router sharding with domains including apps

Steps to Reproduce:

1. Create an additional IngressController with domain names that include "apps", e.g. example.test-apps.<clustername>.<clusterdomain>
2. Create and configure the external LB corresponding to the additional IngressController
3. Configure the corporate DNS server and create records for this additional IngressController that resolve to the LB IP set up in step 2.
4. Try resolving the additional domain routes from outside the cluster and from within the cluster. DNS resolution works fine from outside the cluster. However, within the cluster all additional domains that contain "apps" in the domain name resolve to the default ingress VIP instead of the corresponding LB IPs configured on the corporate DNS server.

As a simpler alternative, you can reproduce it by running the dig command on a cluster node with the additional domain

for ex: 
sh-4.4# dig test.apps-test.<clustername>.<clusterdomain>

Actual results:

DNS resolves all domains that contain "apps" to the default ingress VIP. For example, example.test-apps.<clustername>.<clusterdomain> resolves to the default ingress VIP instead of its corresponding LB IP.

Expected results:

DNS should resolve it to the corresponding LB IP configured on the DNS server.

Additional info:

DNS resolution happens using the Corefile templates on the node, which treat the dot (.) as a regex wildcard character instead of a literal dot [.]. This is a regex configuration bug in the Corefile used on vSphere IPI clusters.
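
The effect of an unescaped dot can be shown with a short sketch. The actual Corefile template regex is not quoted above, so both patterns and the domain `mycluster.example.com` are hypothetical stand-ins.

```python
import re

# Hypothetical patterns: the first leaves dots unescaped, so each "." matches
# any character; the second escapes them so they match a literal dot only.
UNESCAPED = re.compile(r"^.*.apps.mycluster.example.com$")
ESCAPED = re.compile(r"^.*\.apps\.mycluster\.example\.com$")

def matches_default_ingress(pattern: re.Pattern, name: str) -> bool:
    """Return True if the template pattern would capture this DNS name."""
    return pattern.match(name) is not None
```

With the unescaped pattern, `example.test-apps.mycluster.example.com` is captured because the "-" before "apps" satisfies the wildcard dot; with the escaped pattern only names under the literal `.apps.` domain match.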

Description of problem:

We currently do some frontend logic to list and search CatalogSources for the source associated with the CSV and Subscription on the CSV details page. If we can't find the CatalogSource, we show an error message and prevent updates from the Subscription tab. 

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Create an htpasswd idp with any user
2. Create a project admin role binding for this user
3. Install an operator in the namespace where the user has project admin permissions
4. Visit the CSV details page while logged in as the project admin user
5. View the subscriptions tab

Actual results:

An alert is shown indicating that the CatalogSource is missing, and updates to the operator are prevented.

Expected results:

If the Subscription shows the catalog source as healthy in its status stanza, we shouldn't show an alert or prevent updates.
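
A sketch of the proposed check, written against a Subscription parsed into a dict. The field names (`spec.source`, `spec.sourceNamespace`, `status.catalogHealth[].catalogSourceRef`, `healthy`) are assumed from the OLM Subscription API, and the helper is hypothetical; the console itself would implement this in its frontend code.

```python
def catalog_source_healthy(subscription: dict) -> bool:
    """Return True if the Subscription status reports its own source as healthy."""
    spec = subscription.get("spec", {})
    for entry in subscription.get("status", {}).get("catalogHealth", []):
        ref = entry.get("catalogSourceRef", {})
        # Only the entry for the Subscription's own catalog source counts.
        if (ref.get("name") == spec.get("source")
                and ref.get("namespace") == spec.get("sourceNamespace")):
            return bool(entry.get("healthy"))
    return False
```

When this returns True, the alert and the update block could be skipped even if the user cannot list CatalogSources directly.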

Additional info:

Reproducing this bug depends on the fix for OCPBUGS-3036, a bug that prevents project admin users from viewing the Subscription tab at all.

 

Description of problem:

While investigating issue [1] we've noticed a few problems with CNO error reporting on the ClusterOperator status [2]:

that's fine, but I think there are a couple of bugs to write up:
1. When a panic happens, the operator doesn't go Degraded. This can definitely be done.
2. When status cannot be updated, the operator should go Degraded.
3. When service network and/or clusternetwork in status is missing, the operator should go Available=false.

[1] https://github.com/openshift/cluster-network-operator/pull/1669
[2] https://coreos.slack.com/archives/CB48XQ4KZ/p1671207248527519?thread_ts=1671197854.825529&cid=CB48XQ4KZ

Version-Release number of selected component (if applicable):

 4.13 and previous.

How reproducible:

 Always

Steps to Reproduce:

1. Cause a deliberate panic e.g. in the bootstrap code.

Actual results:

 Operator keeps getting restarted and is not Degraded.

Expected results:

 Operator goes Degraded.

Additional info:


Description of problem:

The advertise address configured for our HCP etcd clusters is not resolvable via DNS (i.e. etcd-0.etcd-client.namespace.svc:2379). This impacts some of the etcd tooling that expects to access each member by its advertise address.

Version-Release number of selected component (if applicable):

4.14 (and earlier)

How reproducible:

Always

Steps to Reproduce:

1. Create a HostedCluster and wait for it to come up.
2. Exec into an etcd pod and query cluster endpoint health:
   $ oc rsh etcd-0
   $ etcdctl --cacert /etc/etcd/tls/etcd-ca/ca.crt \
             --cert /etc/etcd/tls/server/server.crt \
             --key /etc/etcd/tls/server/server.key \
             --endpoints https://localhost:2379 \
             endpoint health --cluster -w table

Actual results:

An error is returned similar to:
{"level":"warn","ts":"2023-08-07T20:40:49.890254Z","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000378fc0/etcd-0.etcd-client.clusters-test-cluster.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp: lookup etcd-0.etcd-client.clusters-test-cluster.svc on 172.30.0.10:53: no such host\""}

Expected results:

Actual cluster health is returned:
+--------------------------------------------------------------+--------+-------------+-------+
|                           ENDPOINT                           | HEALTH |    TOOK     | ERROR |
+--------------------------------------------------------------+--------+-------------+-------+
| https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2379 |   true |  9.372168ms |       |
| https://etcd-2.etcd-discovery.clusters-cewong-guest.svc:2379 |   true | 12.269226ms |       |
| https://etcd-1.etcd-discovery.clusters-cewong-guest.svc:2379 |   true | 12.291392ms |       |
+--------------------------------------------------------------+--------+-------------+-------+

Additional info:

The etcd statefulset is created with spec.serviceName set to `etcd-discovery`. This means that pods in the statefulset get their subdomain set to `etcd-discovery`, and names like etcd-0.etcd-discovery.[ns].svc are resolvable. However, the same is not true for the etcd-client service: etcd-0.etcd-client.[ns].svc is not resolvable. The fix would be to change the advertise address of each member to a resolvable name (i.e. etcd-0.etcd-discovery.[ns].svc) and adjust the server certificate to allow those names as well.
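
The naming scheme behind the proposed fix can be sketched as follows. The helper names are hypothetical; the only facts used are that the statefulset's serviceName is `etcd-discovery` and that members listen on port 2379.

```python
def advertise_url(pod: str, namespace: str,
                  service: str = "etcd-discovery", port: int = 2379) -> str:
    """Build a member URL from the per-pod DNS name under the headless service."""
    return f"https://{pod}.{service}.{namespace}.svc:{port}"

def server_cert_names(replicas: int, namespace: str,
                      service: str = "etcd-discovery") -> list[str]:
    """List the per-member DNS names the server certificate would need to allow."""
    return [f"etcd-{i}.{service}.{namespace}.svc" for i in range(replicas)]
```

For example, `advertise_url("etcd-0", "clusters-test-cluster")` yields a name under `etcd-discovery`, which resolves, instead of one under `etcd-client`, which does not.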

Description of problem:

While or after upgrading to 4.11 2023-01-14, CoreDNS has a problem with UDP overflows, so DNS lookups are very slow and cause the ingress operator upgrade to stall. We needed to work around this with force_tcp, following https://access.redhat.com/solutions/5984291

Version-Release number of selected component (if applicable):

 

How reproducible:

100%, but seems to depend on the network environment (exact cause unknown)

Steps to Reproduce:

1. Install a cluster with OKD 4.11-2022-12-02 or earlier
2. Initiate an upgrade to OKD 4.11-2023-01-14
3. The upgrade will stall after upgrading CoreDNS

Actual results:

CoreDNS logs: [ERROR] plugin/errors: 2 oauth-openshift.apps.okd-admin.muc.lv1871.de. AAAA: dns: overflowing header size 

Expected results:

 

Additional info:

 

Description of problem:


Version-Release number of selected component (if applicable):

 4.13.0-0.nightly-2023-03-17-161027 

How reproducible:

Always

Steps to Reproduce:

1. Create a GCP XPN cluster with flexy job template ipi-on-gcp/versioned-installer-xpn-ci, then run 'oc describe node'

2. Check logs for cloud-network-config-controller pods

Actual results:


 % oc get nodes
NAME                                                          STATUS   ROLES                  AGE    VERSION
huirwang-0309d-r85mj-master-0.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-master-1.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-master-2.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal   Ready    worker                 162m   v1.26.2+06e8c46
huirwang-0309d-r85mj-worker-b-5txgq.c.openshift-qe.internal   Ready    worker                 162m   v1.26.2+06e8c46
In the `oc describe node` output, there are no egressIP-related annotations:
% oc describe node huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal 
Name:               huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n2-standard-4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal
                    kubernetes.io/os=linux
                    machine.openshift.io/interruptible-instance=
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=n2-standard-4
                    node.openshift.io/os_id=rhcos
                    topology.gke.io/zone=us-central1-a
                    topology.kubernetes.io/region=us-central1
                    topology.kubernetes.io/zone=us-central1-a
Annotations:        csi.volume.kubernetes.io/nodeid:
                      {"pd.csi.storage.gke.io":"projects/openshift-qe/zones/us-central1-a/instances/huirwang-0309d-r85mj-worker-a-wsrls"}
                    k8s.ovn.org/host-addresses: ["10.0.32.117"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal","mac-address":"42:01:0a:00:...
                    k8s.ovn.org/node-chassis-id: 7fb1870c-4315-4dcb-910c-0f45c71ad6d3
                    k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.5/16"}
                    k8s.ovn.org/node-mgmt-port-mac-address: 16:52:e3:8c:13:e2
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.32.117/32"}
                    k8s.ovn.org/node-subnets: {"default":["10.131.0.0/23"]}
                    machine.openshift.io/machine: openshift-machine-api/huirwang-0309d-r85mj-worker-a-wsrls
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true


 % oc logs cloud-network-config-controller-5cd96d477d-2kmc9  -n openshift-cloud-network-config-controller  
W0320 03:00:08.981493       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0320 03:00:08.982280       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
E0320 03:00:38.982868       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com: i/o timeout
E0320 03:01:23.863454       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: read udp 10.129.0.14:52109->172.30.0.10:53: read: connection refused
I0320 03:02:19.249359       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0320 03:02:19.250662       1 controller.go:88] Starting node controller
I0320 03:02:19.250681       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0320 03:02:19.250693       1 controller.go:88] Starting secret controller
I0320 03:02:19.250703       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0320 03:02:19.250709       1 controller.go:88] Starting cloud-private-ip-config controller
I0320 03:02:19.250715       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0320 03:02:19.258642       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal to node workqueue
I0320 03:02:19.258671       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal to node workqueue
I0320 03:02:19.258682       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal to node workqueue
I0320 03:02:19.351258       1 controller.go:96] Starting node workers
I0320 03:02:19.351303       1 controller.go:102] Started node workers
I0320 03:02:19.351298       1 controller.go:96] Starting secret workers
I0320 03:02:19.351331       1 controller.go:102] Started secret workers
I0320 03:02:19.351265       1 controller.go:96] Starting cloud-private-ip-config workers
I0320 03:02:19.351508       1 controller.go:102] Started cloud-private-ip-config workers
E0320 03:02:19.589704       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.615551       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.644628       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.774047       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.783309       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.816430       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue

Expected results:

EgressIP should work

Additional info:

It can be reproduced in 4.12 as well, so this is not a regression.

Description of problem:

documentationBaseURL is still linking to 4.13

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-04-05-183601

How reproducible:

Always

Steps to Reproduce:

1. Get documentationBaseURL in cm/console-config
$ oc get cm console-config -n openshift-console -o yaml | grep documentationBaseURL
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-04-05-183601   True        False         68m     Cluster version is 4.14.0-0.nightly-2023-04-05-183601
2.
3.

Actual results:

documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/

Expected results:

documentationBaseURL should be  https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/
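
One way to avoid a hard-coded version is to derive the URL from the cluster version string. The sketch below is illustrative only: the helper name is hypothetical, and it is not how the console operator actually computes the value.

```python
import re

DOCS_ROOT = "https://access.redhat.com/documentation/en-us/openshift_container_platform"

def documentation_base_url(cluster_version: str) -> str:
    """Build the docs URL from the major.minor part of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.", cluster_version)
    if match is None:
        raise ValueError(f"unrecognized version: {cluster_version!r}")
    return f"{DOCS_ROOT}/{match.group(1)}.{match.group(2)}/"
```

For the nightly in this report, `documentation_base_url("4.14.0-0.nightly-2023-04-05-183601")` ends in `/4.14/`.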

Additional info:

 

We should adjust CSI RPC call timeout from sidecars to CSI driver. We seem to be using default values which are just too short and hence can cause unintended side-effects.

I am using a BuildConfig with a Git source and the Docker strategy. The Git repo contains a large zip file stored via LFS, and that zip file is not downloaded. Instead, only the ASCII pointer metadata is downloaded. I've created a simple reproducer (https://github.com/selrahal/buildconfig-git-lfs) on my personal GitHub. If you clone the repo

git clone git@github.com:selrahal/buildconfig-git-lfs.git

and apply the bc.yaml file with

oc apply -f bc.yaml

Then start the build with

oc start-build test-git-lfs

You will see the build fail at the unzip step in the Dockerfile

STEP 3/7: RUN unzip migrationtoolkit-mta-cli-5.3.0-offline.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.

I've attached the full build logs to this issue.
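
To confirm that a checked-out file is an LFS pointer rather than the real archive, the first bytes can be inspected. This sketch relies on the Git LFS pointer format, whose first line is `version https://git-lfs.github.com/spec/v1`; the helper name is hypothetical.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(data: bytes) -> bool:
    """Return True if data looks like a Git LFS pointer, not the real content."""
    return data.startswith(LFS_POINTER_PREFIX)
```

A real zip archive starts with the bytes `PK`, so a file that passes this check explains the "End-of-central-directory signature not found" error from unzip.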

Description of problem:

Pages should have unique page titles, so that we can gather accurate user telemetry data via segment. The page title should differ based on the selected tab.

In order to do proper analysis, branding should not be included in the page title.

Currently the following pages have this title "Red Hat OpenShift Dedicated" (or the respective brand name):
Dev perspective:

  • BuildConfigs
  • Pipelines>Pipelines
  • Pipelines>Repositories
  • Helm>Helm Releases
  • Helm>Repositories
  • Install Helm Chart

Admin perspective:

  • Pipelines>Pipelines
  • Pipelines>PipelineRuns
  • Pipelines>PipelineResources
  • Pipelines>Repositories
  • Tasks>Tasks
  • Tasks>TaskRuns
  • Tasks>ClusterTasks

The following tabs all have the same page title Observe · Red Hat OpenShift Dedicated:
Dev perspective:

  • Observe>Dashboard
  • Observe>Alerts
  • Observe>Metrics

The following tabs all have the same page title Project Details · Red Hat OpenShift Dedicated:
Dev perspective:

  • Project>Overview
  • Project>Details
  • Project>Project access

All the user preferences tabs have the same page title : User Preferences · Red Hat OpenShift Dedicated

  • User Preferences>General
  • User Preferences>Language
  • User Preferences>Notifications
  • User Preferences>Applications

The Topology page in the Dev perspective and the Workloads tab of the Project Details page share the same title: Topology · Red Hat OpenShift Dedicated

The following tabs on the Admin Project page all share the same title. Unsure if we can handle this since it includes the namespace name: sdoyle-dev · Details · Red Hat OpenShift Dedicated. If not, we can drop it until 4.14.

  • Project>Project details>Overview
  • Project>Project details>Details
  • Project>Project details>YAML
  • Project>Project details>RoleBindings

Description of the problem:

As discussed on the GitHub PR, we want to align the severities filter with the previous implementation. Therefore the severity counts in the response headers should be:

  • the total counts of events with the respective severity across all possible pages
  • with regards to the applied filters (hosts, cluster-level, message,...)
  • but they should not take the severities filter itself into account.

In addition to that, we need a new response header with a total number of events with all current filters (severities included) applied.
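
The counting rules above can be sketched as follows, with events modeled as plain dicts. The function name, the field names, and the filter arguments are assumptions for illustration and do not reflect the actual assisted-service API.

```python
from collections import Counter

def event_counts(events: list[dict], severities=None, **filters):
    """Return (per-severity counts, total) for a list of event dicts.

    Per-severity counts honor every filter except `severities`;
    the total honors all filters, `severities` included.
    """
    def matches(event: dict) -> bool:
        return all(event.get(key) == value for key, value in filters.items())

    filtered = [e for e in events if matches(e)]
    severity_counts = Counter(e["severity"] for e in filtered)
    if severities is None:
        total = len(filtered)
    else:
        total = sum(1 for e in filtered if e["severity"] in severities)
    return dict(severity_counts), total
```

Because the per-severity counts ignore the severities filter, a client can still show how many events of each severity exist while only the selected severities are listed.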

Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/12

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Updated Description:

The MCD, during a node's lifespan, can go through multiple iterations of RHEL8 and RHEL9. This was not a problem until we turned on FIPS-enabled Golang with dynamic linking, which requires the running MCD binary (either in the container or on the host) to always match the host's built version. As an additional complication, the early boot process (machine-config-daemon-pull/firstboot.service) can be at a different version from the rest of the cluster nodes (the bootimage version is not updated), and we chroot (dynamically going from rhel8 to rhel9) in the container, so we need a better process to ensure the right binary is always used.

 

This flow is currently being tested in https://github.com/openshift/machine-config-operator/pull/3799

 

Description of problem:

MCO CI started failing this week, and the failures have also made it into 4.14 nightlies. See also: https://issues.redhat.com/browse/TRT-1143. The failure manifests as a warning in the MCO. Looking at an MCD log, you will see a failure like:

W0712 08:52:15.475268    7971 daemon.go:1089] Got an error from auxiliary tools: kubelet health check has failed 3 times: Get "http://localhost:10248/healthz": dial tcp: lookup localhost: device or resource busy

The root cause so far seems to be that 4.14 switched from a regular Golang 1.20.3 to 1.20.5 with FIPS and dynamic linking in the builder, which is when the failures began. Most functionality is not broken, but the daemon subroutine that performs the kubelet health check appears to be unable to reach the localhost endpoint.

One possibility is that the rhel8-daemon chroot'ing into the rhel9-host and running these commands is causing the issue. Regardless, there are a bunch of issues with rhel8/rhel9 duality in the MCD that we would need to address in 4.13/4.14

Also tangentially related: https://issues.redhat.com/browse/MCO-663

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When using oc image mirror, oc creates a new manifest list when filtering platforms. When this happens, oc still tries to push and tag the original manifest list.

Version-Release number of selected component (if applicable):

4.8

How reproducible:

Consistent

Steps to Reproduce:

1. Run oc image mirror --filter-by-os 'linux/arm' docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c yourregistry.io/busybox:target
2. Check the plan, see that the original manifest digest is being used for the tag

Actual results:

jammy:Downloads$ oc image mirror --filter-by-os 'linux/arm' docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c sparse-registry1.fyre.ibm.com/jammy/busybox:target
sparse-registry1.fyre.ibm.com/
  jammy/busybox
    blobs:
      docker.io/library/busybox sha256:1d57ab16f681953c15d7485bf3ee79a49c2838e5f9394c43e20e9accbb1a2b20 1.436KiB
      docker.io/library/busybox sha256:99ee43e96ff50e90c5753954d7ce2dfdbd7eb9711c1cd96de56d429cb628e343 1.436KiB
      docker.io/library/busybox sha256:a22ab831b2b2565a624635af04e5f76b4554d9c84727bf7e6bc83306b3b339a9 1.436KiB
      docker.io/library/busybox sha256:abaa813f94fdeebd3b8e6aeea861ab474a5c4724d16f1158755ff1e3a4fde8b0 1.438KiB
      docker.io/library/busybox sha256:b203a35cab50f0416dfdb1b2260f83761cb82197544b9b7a2111eaa9c755dbe7 937.1KiB
      docker.io/library/busybox sha256:46758452d3eef8cacb188405495d52d265f0c3a7580dfec51cb627c04c7bafc4 1.604MiB
      docker.io/library/busybox sha256:4c45e4bb3be9dbdfb27c09ac23c050b9e6eb4c16868287c8c31d34814008df80 1.847MiB
      docker.io/library/busybox sha256:f78e6840ded1aafb6c9f265f52c2fc7c0a990813ccf96702df84a7dcdbe48bea 1.908MiB
    manifests:
      sha256:4ff685e2bcafdab0d2a9b15cbfd9d28f5dfe69af97e3bb1987ed483b0abf5a99
      sha256:5e42fbc46b177f10319e8937dd39702e7891ce6d8a42d60c1b4f433f94200bd2
      sha256:7128d7c7704fb628f1cedf161c01d929d3d831f2a012780b8191dae49f79a5fc
      sha256:77ed5ebc3d9d48581e8afcb75b4974978321bd74f018613483570fcd61a15de8
      sha256:dde8e930c7b6a490f728e66292bc9bce42efc9bbb5278bae40e4f30f6e00fe8c
      sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c -> target

Expected results:

jammy:~$ oc-devel image mirror --filter-by-os 'linux/arm' docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c sparse-registry1.fyre.ibm.com/jammy/busybox:target
sparse-registry1.fyre.ibm.com/
  jammy/busybox
    blobs:
      docker.io/library/busybox sha256:1d57ab16f681953c15d7485bf3ee79a49c2838e5f9394c43e20e9accbb1a2b20 1.436KiB
      docker.io/library/busybox sha256:99ee43e96ff50e90c5753954d7ce2dfdbd7eb9711c1cd96de56d429cb628e343 1.436KiB
      docker.io/library/busybox sha256:a22ab831b2b2565a624635af04e5f76b4554d9c84727bf7e6bc83306b3b339a9 1.436KiB
      docker.io/library/busybox sha256:abaa813f94fdeebd3b8e6aeea861ab474a5c4724d16f1158755ff1e3a4fde8b0 1.438KiB
      docker.io/library/busybox sha256:b203a35cab50f0416dfdb1b2260f83761cb82197544b9b7a2111eaa9c755dbe7 937.1KiB
      docker.io/library/busybox sha256:46758452d3eef8cacb188405495d52d265f0c3a7580dfec51cb627c04c7bafc4 1.604MiB
      docker.io/library/busybox sha256:4c45e4bb3be9dbdfb27c09ac23c050b9e6eb4c16868287c8c31d34814008df80 1.847MiB
      docker.io/library/busybox sha256:f78e6840ded1aafb6c9f265f52c2fc7c0a990813ccf96702df84a7dcdbe48bea 1.908MiB
    manifests:
      sha256:4ff685e2bcafdab0d2a9b15cbfd9d28f5dfe69af97e3bb1987ed483b0abf5a99
      sha256:5e42fbc46b177f10319e8937dd39702e7891ce6d8a42d60c1b4f433f94200bd2
      sha256:7128d7c7704fb628f1cedf161c01d929d3d831f2a012780b8191dae49f79a5fc
      sha256:77ed5ebc3d9d48581e8afcb75b4974978321bd74f018613483570fcd61a15de8
      sha256:dde8e930c7b6a490f728e66292bc9bce42efc9bbb5278bae40e4f30f6e00fe8c
      sha256:7128d7c7704fb628f1cedf161c01d929d3d831f2a012780b8191dae49f79a5fc -> target

Additional info:

 

Description of problem:

The IPI installation fails during bootstrap in some regions, with no nodes available or ready.

Version-Release number of selected component (if applicable):

12-22 16:22:27.970  ./openshift-install 4.12.0-0.nightly-2022-12-21-202045
12-22 16:22:27.970  built from commit 3f9c38a5717c638f952df82349c45c7d6964fcd9
12-22 16:22:27.970  release image registry.ci.openshift.org/ocp/release@sha256:2d910488f25e2638b6d61cda2fb2ca5de06eee5882c0b77e6ed08aa7fe680270
12-22 16:22:27.971  release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. Try the IPI installation in the affected regions (so far tried and failed with ap-southeast-2, ap-south-1, eu-west-1, ap-southeast-6, ap-southeast-3, ap-southeast-5, eu-central-1, cn-shanghai, cn-hangzhou and cn-beijing)

Actual results:

Bootstrap failed to complete

Expected results:

Installation in those regions should succeed.

Additional info:

FYI the QE flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/166672/

No nodes are available/ready, and no operators are available.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          30m     Unable to apply 4.12.0-0.nightly-2022-12-21-202045: an unknown error has occurred: MultipleErrors
$ oc get nodes
No resources found
$ oc get machines -n openshift-machine-api -o wide
NAME                         PHASE   TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
jiwei-1222f-v729x-master-0                                  30m                       
jiwei-1222f-v729x-master-1                                  30m                       
jiwei-1222f-v729x-master-2                                  30m                       
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication
baremetal
cloud-controller-manager                                                                          
cloud-credential                                                                                  
cluster-autoscaler                                                                                
config-operator                                                                                   
console                                                                                           
control-plane-machine-set                                                                         
csi-snapshot-controller                                                                           
dns                                                                                               
etcd                                                                                              
image-registry                                                                                    
ingress                                                                                           
insights                                                                                          
kube-apiserver                                                                                    
kube-controller-manager                                                                           
kube-scheduler                                                                                    
kube-storage-version-migrator                                                                     
machine-api                                                                                       
machine-approver                                                                                  
machine-config                                                                                    
marketplace                                                                                       
monitoring                                                                                        
network                                                                                           
node-tuning                                                                                       
openshift-apiserver                                                                               
openshift-controller-manager                                                                      
openshift-samples                                                                                 
operator-lifecycle-manager                                                                        
operator-lifecycle-manager-catalog                                                                
operator-lifecycle-manager-packageserver
service-ca
storage
$

Master nodes don't run services such as kubelet and crio.
[core@jiwei-1222f-v729x-master-0 ~]$ sudo crictl ps
FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory" 
[core@jiwei-1222f-v729x-master-0 ~]$ 

The machine-config-daemon firstboot reports "failed to update OS".
[jiwei@jiwei log-bundle-20221222085846]$ grep -Ei 'error|failed' control-plane/10.0.187.123/journals/journal.log 
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:18 localhost ignition[867]: failed to fetch config: resource requires networking
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <info>  [1671726259.0329] hostname: hostname: hostnamed not used as proxy creation failed with: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <warn>  [1671726259.0464] sleep-monitor-sd: failed to acquire D-Bus proxy: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain ignition[891]: GET error: Get "https://api-int.jiwei-1222f.alicloud-qe.devcluster.openshift.com:22623/config/master": dial tcp 10.0.187.120:22623: connect: connection refused
...repeated logs omitted...
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-ctl[1888]: 2022-12-22T16:27:46Z|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-vswitchd[1888]: ovs|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 dbus-daemon[1669]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1924]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1937]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[2037]: Error: Device '' not found.
Dec 22 08:35:32 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:35:32.477770    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-910221290 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 rpm-ostree[2288]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: W1222 08:56:06.785425    2181 firstboot_complete_machineconfig.go:46] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511: Warning: The unit file, source configuration file or drop-ins of rpm-ostreed.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: error: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:57:31 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:57:31.244684    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-4021566291 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:59:20 jiwei-1222f-v729x-master-0 systemd[2353]: /usr/lib/systemd/user/podman-kube@.service:10: Failed to parse service restart specifier, ignoring: never
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2437]: Error: open default: no such file or directory
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2450]: Error: failed to start API service: accept unixgram @00026: accept4: operation not supported
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman-kube@default.service: Failed with result 'exit-code'.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: Failed to start A template for running K8s workloads via podman-play-kube.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman.service: Failed with result 'exit-code'.
[jiwei@jiwei log-bundle-20221222085846]$ 

 

Description of problem:

The Route checkbox is shown as checked even after it was unchecked while editing the Serverless Function form.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Install Serverless Operator and Create KN Serving Instance
2. Create a Serverless Function and open the Edit form of the KSVC
3. Uncheck the Create Route option and save.
4. Reopen the Edit form again.

Actual results:

The checkbox is still shown as checked.

Expected results:

The checkbox should retain its previous (unchecked) state.

Additional info:

 

Description of problem:

After opening the web console and navigating to Dashboards, with the default API Performance V2 option selected, every sub-page shows "No datapoints found".
 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-27-000502

How reproducible:

always 

Steps to Reproduce:

1. Open the web console and navigate to Dashboards. With the default API Performance V2 option selected, every sub-page shows "No datapoints found".

Actual results:

"No datapoints found" is shown for the default API Performance V2 dashboard option, and the page is blank.

Expected results:

Diagrams should be shown for the default API Performance V2 dashboard option.

Additional info:
This issue blocked bug https://issues.redhat.com/browse/OCPBUGS-14940; it was not seen when I filed that bug.

Description of problem:

OVN image pre-puller blocks upgrades in environments where the images have already been pulled but the registry server is not available.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster in a disconnected environment.

2. Manually pre-pull all the images required for the upgrade. For example, get the list of images needed:

# oc adm release info quay.io/openshift-release-dev/ocp-release:4.12.10-x86_64 -o json > release-info.json

And then pull them in all the nodes of the cluster:

# for image in $(jq -r '.references.spec.tags[].from.name' release-info.json); do crictl pull "$image"; done

3. Stop or somehow make the registry unreachable, then trigger the upgrade.

Actual results:

The upgrade blocks with the following error reported by the cluster version operator:

# oc get clusterversion; oc get co network
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.10   True        True          62m     Working towards 4.12.11: 483 of 830 done (58% complete), waiting on network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.12.10   True        True          False      133m    DaemonSet "/openshift-ovn-kubernetes/ovnkube-upgrades-prepuller" is not available (awaiting 1 nodes)

The reason for that is that the `ovnkube-upgrades-prepuller-...` pod uses `imagePullPolicy: Always` and that fails if there is no registry, even if the image has already been pulled:

# oc get pods -n openshift-ovn-kubernetes ovnkube-upgrades-prepuller-5s2cn
NAME                               READY   STATUS             RESTARTS   AGE
ovnkube-upgrades-prepuller-5s2cn   0/1     ImagePullBackOff   0          44m

# oc get events -n openshift-ovn-kubernetes --field-selector involvedObject.kind=Pod,involvedObject.name=ovnkube-upgrades-prepuller-5s2cn,reason=Failed
LAST SEEN   TYPE      REASON   OBJECT                                 MESSAGE
43m         Warning   Failed   pod/ovnkube-upgrades-prepuller-5s2cn   Failed to pull image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52f189797a83cae8769f1a4dc6dfd46d586914575ee99de6566fc23c77282071": rpc error: code = Unknown desc = (Mirrors also failed: [server.home.arpa:8443/openshift/release@sha256:52f189797a83cae8769f1a4dc6dfd46d586914575ee99de6566fc23c77282071: pinging container registry server.home.arpa:8443: Get "https://server.home.arpa:8443/v2/": dial tcp 192.168.100.1:8443: connect: connection refused]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52f189797a83cae8769f1a4dc6dfd46d586914575ee99de6566fc23c77282071: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 192.168.100.1:53: server misbehaving
43m         Warning   Failed   pod/ovnkube-upgrades-prepuller-5s2cn   Error: ErrImagePull
43m         Warning   Failed   pod/ovnkube-upgrades-prepuller-5s2cn   Error: ImagePullBackOff

# oc get pod -n openshift-ovn-kubernetes ovnkube-upgrades-prepuller-5s2cn -o json | jq -r '.spec.containers[] | .imagePullPolicy + " " + .image'
Always quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52f189797a83cae8769f1a4dc6dfd46d586914575ee99de6566fc23c77282071
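
The behavior above can be illustrated with a simplified model of the three Kubernetes image pull policies. This is a sketch for reasoning only, not kubelet code; the function name `can_start_container` and its boolean parameters are assumptions made for the example. `IfNotPresent` is shown for contrast and is not claimed to be the fix that was applied.

```python
def can_start_container(policy, image_cached, registry_reachable):
    """Simplified model: can a container start under the given pull policy?"""
    if policy == "Always":
        # The registry is contacted on every start to resolve the image,
        # so a cached copy does not help when the registry is down.
        return registry_reachable
    if policy == "IfNotPresent":
        # A local copy is enough; the registry is needed only on a cache miss.
        return image_cached or registry_reachable
    if policy == "Never":
        # Only a local copy is ever used.
        return image_cached
    raise ValueError(f"unknown image pull policy: {policy}")
```

In the scenario reported here the image is cached and the registry is down, which is the one combination where `Always` fails while the other two policies succeed.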

Expected results:

The upgrade should not block.

Additional info:

We detected this in a situation where we want to be able to perform upgrades in a disconnected environment and without the registry server running. See MGMT-13733 for details.

Description of problem:

When the --oci-registries-config flag is used explicitly, or registries.conf is picked up from the environment, related images are processed serially by the addRelatedImageToMapping function, which can drastically increase execution time depending on the number of images involved. In my testing of a large catalog with approximately 470 images, this took approximately 13 minutes. This processing occurs before the underlying oc mirror code plans out the images that should be mirrored. Actual planning time is consistent at around 1 min 30 seconds.

The cause is the need to determine mirrors for each of the related images based on the configuration provided in registries.conf, which is done serially in a loop. If I introduce parallel execution, the processing time for addRelatedImageToMapping is reduced from ~13 min to ~14 seconds.
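
The idea of resolving mirrors concurrently rather than in a serial loop can be sketched as follows. This is an illustration in Python, not the oc-mirror code (which is written in Go): `find_mirror` is a hypothetical stand-in for the per-image registries.conf lookup, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def find_mirror(image, mirrors):
    """Hypothetical stand-in for the per-image registries.conf lookup."""
    for source, mirror in mirrors.items():
        if image.startswith(source + "/"):
            return mirror + image[len(source):]
    return image  # no mirror configured for this image


def map_related_images(images, mirrors, workers=8):
    """Resolve all images concurrently and return {image: resolved}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with images.
        resolved = pool.map(lambda img: find_mirror(img, mirrors), images)
        return dict(zip(images, resolved))
```

A thread pool only helps if the lookup spends most of its time waiting (for example on network or file access), which is an assumption here rather than something stated in the report.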

Version-Release number of selected component (if applicable): 4.13

How reproducible: always

Steps to Reproduce:

Note: the catalog used here is publicly available, but the related images are not so this may be difficult to reproduce.

  1. Copy catalog image to disk in OCI layout
    mkdir -p /tmp/oci/registriesconf/performance
    skopeo --override-os linux copy docker://quay.io/jhunkins/ocp13762:v1 oci:///tmp/oci/registriesconf/performance --format v2s2
    
  2. Create a ~/.config/containers/registries.conf file with this content
    [[registry]]
    location = "icr.io/cpopen"
    insecure = false
    blocked = false
    mirror-by-digest-only = true
    prefix = ""
    [[registry.mirror]]
      location = "quay.io/jhunkins"
      insecure = false
    
  3. Create an ISC [path to isc]/isc-registriesconf-performance.yaml
    kind: ImageSetConfiguration
    apiVersion: mirror.openshift.io/v1alpha2
    mirror: 
      operators: 
      - catalog: oci:///tmp/oci/registriesconf/performance
        full: true
        targetTag: latest
        targetCatalog: ibm-catalog
    storageConfig: 
      local: 
        path: /tmp/oc-mirror-temp
    
  4. Run oc mirror with OCI flags (running with dry run is sufficient to replicate this issue)
    oc mirror --config [path to isc]/isc-registriesconf-performance.yaml --include-local-oci-catalogs --oci-insecure-signature-policy --dest-use-http docker://localhost:5000/oci --skip-cleanup --dry-run
    

Actual results:

Roughly 13 minutes elapse before the planning phase begins.

Expected results:

Much faster execution before the planning phase begins.

Additional info:

I intend to create a PR which adds parallel execution around the addRelatedImageToMapping function

Description of problem:

A cluster was installed via ACM and its nodes are showing as Unmanaged. When trying to set the BMH credential via the console, the Apply button is not clickable (greyed out).

Version-Release number of selected component (if applicable): 4.11

How reproducible: Always

Steps to Reproduce:
1. Install a cluster via ACM
2. Set a BMH credential in the console
3.

Actual results:

The Apply button on the console screen is greyed out, unclickable.

Expected results:

Should be able to configure the BMH credential

Additional info:

Based on a suggestion from Omer

"Now that we can tell apart user manifests from our own service manifests, I think it's best that this function deletes the service manifests.

https://github.com/openshift/assisted-service/blob/master/internal/cluster/cluster.go#L1418

The original motivation for this skip was that we didn't want to destroy user uploaded manifests when the user resets their installation, but preserving the service generated ones is useless, and was just an unfortunate side-effect of protecting the user manifests. The service ones would anyway get regenerated when the user hits install again, there's no point in protecting them. If anything, clearing those manifests I think this might solve some edge case bugs I can think of"
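
The reset behavior proposed in the quote can be sketched as a simple filter. This is only an illustration: the `Manifest` type, its `user_supplied` flag and the function name are hypothetical and are not taken from assisted-service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    # Hypothetical shape; the real service tracks manifests differently.
    name: str
    user_supplied: bool


def manifests_after_reset(manifests):
    """Keep user-uploaded manifests and drop service-generated ones."""
    return [m for m in manifests if m.user_supplied]
```

Dropping the service-generated entries is safe under the assumption stated in the quote, namely that they are regenerated when the user starts the installation again.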

We will need to wait for https://github.com/openshift/assisted-service/pull/5278/files to be merged before starting this as this depends on changes made in this PR

Description of problem:

The alerts table displays incorrect values (Prometheus) in the source column 

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Install LokiOperator, Cluster Logging operator and enable the logging view plugin with the alerts feature toggle enabled
2. Add a log-based alert
3. Check the alerts table source in the observe -> alerts section

Actual results:

Incorrect "Prometheus" value is displayed for non log-based alerts

Expected results:

"Platform" or "User" value is displayed for non log-based alerts

Additional info:

 

Description of problem:

When HyperShift HostedClusters are created with "OLMCatalogPlacement" set to "guest" and if the desired release is pre-GA, the CatalogSource pods cannot pull their images due to using unreleased images.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Common

Steps to Reproduce:

1. Create a HyperShift 4.13 HostedCluster with spec.OLMCatalogPlacement = "guest"
2. See the openshift-marketplace/community-operator-* pods in the guest cluster in ImagePullBackOff

Actual results:

openshift-marketplace/community-operator-* pods in the guest cluster are in ImagePullBackOff

Expected results:

All CatalogSource pods should be running and should use n-1 images if the release is pre-GA

Additional info:

 

This is a clone of issue OCPBUGS-18800. The following is the description of the original issue:

Description of problem:

Currently, the MCO updates its image registry certificate configmap by deleting and re-creating it on each MCO sync. Instead, we should be patching it.
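
To illustrate the difference between re-creating an object and patching it, the following sketch computes a JSON-merge-patch-style body for a configmap's data. It is a Python illustration only: the MCO is written in Go, and the function name and return shape here are assumptions, not its code.

```python
def configmap_data_patch(current, desired):
    """Return a merge-patch-style body for the data field, or None if unchanged."""
    patch = {}
    for key, value in desired.items():
        # Include only keys that are new or whose value changed.
        if current.get(key) != value:
            patch[key] = value
    for key in current:
        # In a JSON merge patch, a null value removes the key.
        if key not in desired:
            patch[key] = None
    return {"data": patch} if patch else None
```

When nothing changed the function returns None, so a caller could skip the write entirely instead of deleting and re-creating the object on every sync.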

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

When processing an install-config containing either BMC passwords in the baremetal platform config, or a vSphere password in the vsphere platform config, we log a warning message to say that the value is ignored.

This warning currently includes the value in the password field, which may be inconvenient for users reusing IPI configs who don't want their password values to appear in logs.
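
One way to keep the warning without leaking the value is to omit the value for sensitive fields. The sketch below is an assumption-laden illustration in Python (the installer is written in Go): the field paths, the `SENSITIVE_FIELDS` set and the message wording are hypothetical.

```python
import logging

logger = logging.getLogger("install-config")

# Hypothetical set of field names whose values must never be logged.
SENSITIVE_FIELDS = {"password"}


def ignored_field_warning(field_path, value):
    """Build the 'is ignored' warning, omitting values of sensitive fields."""
    name = field_path.rsplit(".", 1)[-1].lower()
    if name in SENSITIVE_FIELDS:
        return f"{field_path} is ignored"
    return f"{field_path}: {value} is ignored"


def warn_ignored(field_path, value):
    logger.warning(ignored_field_warning(field_path, value))
```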

Description of problem:

After https://issues.redhat.com/browse/OCPBUGS-14783 was fixed, we found another issue that prevents the kube-apiserver's init container from finishing successfully. The init container tries to reach the kube-apiserver through an IPv4-based URL, which is not up; it should use the IPv6 one.

 

Description of problem:

A HyperShift KubeVirt provider hosted cluster cannot start up after activating ovn-k interconnect in the hosted cluster.

The issue is that the ovn-k configurations mismatch:

The cluster manager config in the hosted cluster namespace:

  ovnkube.conf: |-
    [default]
    mtu="8801"
    cluster-subnets="10.132.0.0/14/23"
    encap-port="9880"
    enable-lflow-cache=true
    lflow-cache-limit-kb=1048576

    [kubernetes]
    service-cidrs="172.31.0.0/16"
    ovn-config-namespace="openshift-ovn-kubernetes"
    cacert="/hosted-ca/ca.crt"
    apiserver="https://kube-apiserver:6443"
    host-network-namespace="openshift-host-network"
    platform-type="KubeVirt"
    dns-service-namespace="openshift-dns"
    dns-service-name="dns-default"

    [ovnkubernetesfeature]
    enable-egress-ip=true
    enable-egress-firewall=true
    enable-egress-qos=true
    enable-egress-service=true
    egressip-node-healthcheck-port=9107

    [gateway]
    mode=shared
    nodeport=true
    v4-join-subnet="100.65.0.0/16"

    [masterha]
    election-lease-duration=137
    election-renew-deadline=107
    election-retry-period=26

The controller config in the hosted cluster:

  ovnkube.conf: |-
    [default]
    mtu="8801"
    cluster-subnets="10.132.0.0/14/23"
    encap-port="9880"
    enable-lflow-cache=true
    lflow-cache-limit-kb=1048576
    enable-udp-aggregation=true

    [kubernetes]
    service-cidrs="172.31.0.0/16"
    ovn-config-namespace="openshift-ovn-kubernetes"
    apiserver="https://a392ee248c42a4ffca67f2909823466e-18e866c0f5fb5880.elb.us-west-2.amazonaws.com:6443"
    host-network-namespace="openshift-host-network"
    platform-type="KubeVirt"
    healthz-bind-address="0.0.0.0:10256"
    dns-service-namespace="openshift-dns"
    dns-service-name="dns-default"

    [ovnkubernetesfeature]
    enable-egress-ip=true
    enable-egress-firewall=true
    enable-egress-qos=true
    enable-egress-service=true
    egressip-node-healthcheck-port=9107
    enable-multi-network=true

    [gateway]
    mode=shared
    nodeport=true

    [masterha]
    election-lease-duration=137
    election-renew-deadline=107
    election-retry-period=26

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Deploy the latest 4.14 OCP cluster
2. Install latest hypershift operator
3. Deploy hosted cluster with latest 4.14 ocp release image

Actual results:

The hosted cluster gets stuck at:

network                                    4.14.0-0.ci-2023-08-20-221659   True        True          False      3h53m   DaemonSet "/openshift-multus/network-metrics-daemon" is waiting for other operators to become ready...

Expected results:

All the hosted cluster's operators should be OK

Additional info:

 

This is a clone of issue OCPBUGS-19699. The following is the description of the original issue:

Description of problem:


When CPUPartitioning is not set in install-config.yaml a warning message is still generated

WARNING CPUPartitioning:  is ignored

This warning is both incorrect, since the check is against "None" and the value is an empty string when not set, and also no longer relevant now that https://issues.redhat.com//browse/OCPBUGS-18876 has been fixed.
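
The faulty comparison can be reproduced in a few lines. This is a Python sketch of the logic described above, not the installer's Go code; both function names are hypothetical, and the actual fix may simply remove the warning rather than correct the check.

```python
def buggy_warning(value):
    # Compares only against "None", so an unset (empty) value still warns.
    if value != "None":
        return f"CPUPartitioning: {value} is ignored"
    return None


def corrected_warning(value):
    # Treat both the explicit "None" and the unset empty string as "not set".
    if value in ("", "None"):
        return None
    return f"CPUPartitioning: {value} is ignored"
```

With an empty string, `buggy_warning` produces the message seen in the report, with two spaces where the value would be.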

Version-Release number of selected component (if applicable):


How reproducible:

Every time

Steps to Reproduce:

1. Create an install config with CPUPartitioning not set
2. Run "openshift-install agent create image --dir cluster-manifests/ --log-level debug"

Actual results:

See the output "WARNING CPUPartitioning:  is ignored"

Expected results:

No warning

Additional info:


Description of problem:

Since `registry.centos.org` has been shut down, all the unit tests in oc that rely on this registry have started failing.

Version-Release number of selected component (if applicable):

all versions

How reproducible:

trigger CI jobs and see unit tests are failing

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

We use the state machine design pattern to have explicit clear rules for how hosts can move in and out of states depending on the things that are happening.

This makes it relatively easy to follow / understand host behavior.

We should ensure our code doesn't contain places where we force a host into a state without going through the state machine 🍝, otherwise it defeats the purpose of having a state machine.

One example that personally confused me is this switch statement, which contains updates like this one, this one, this one, and also this one.
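As a rough illustration of the principle, the sketch below keeps a single write path for the host state and rejects any transition that the table does not list. The state names and the transition table are hypothetical and are not taken from the actual service.

```python
from enum import Enum


class HostState(Enum):
    # Hypothetical states, for illustration only.
    DISCOVERING = "discovering"
    KNOWN = "known"
    INSTALLING = "installing"
    INSTALLED = "installed"
    ERROR = "error"


# Allowed transitions; anything not listed here is rejected.
TRANSITIONS = {
    HostState.DISCOVERING: {HostState.KNOWN, HostState.ERROR},
    HostState.KNOWN: {HostState.INSTALLING, HostState.ERROR},
    HostState.INSTALLING: {HostState.INSTALLED, HostState.ERROR},
    HostState.INSTALLED: set(),
    HostState.ERROR: set(),
}


class Host:
    def __init__(self) -> None:
        self._state = HostState.DISCOVERING

    @property
    def state(self) -> HostState:
        return self._state

    def transition(self, target: HostState) -> None:
        # The only place where the state field is written.
        if target not in TRANSITIONS[self._state]:
            raise ValueError(
                f"illegal transition {self._state.value} -> {target.value}"
            )
        self._state = target
```

Because `state` is read-only and `transition` is the only writer, code that tries to force a host into an arbitrary state fails loudly instead of silently bypassing the rules.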

Description of problem:

After the changes in OCPBUGS-3036 and OCPBUGS-11596, a user who has project admin permission should be able to view all the subscription information on the operator details page. Currently, however, the InstallPlan information is shown as "None" on the page, which is incorrect.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-05-03-163151

How reproducible:

Always 

Steps to Reproduce:

1. Configure IDP. add a user
2. Install any operator in specific namespace 
3. Assign project admin permission to the user for the same namespace
   $ oc adm policy add-role-to-user admin <username> -n <projectname>
4. Check that the user has enough permission to view the installplan via CLI
   $ oc get clusterrole admin -o yaml | grep -C10 installplan
     - apiGroups:
       - operators.coreos.com
       resources:
       - clusterserviceversions
       - catalogsources
       - installplans
       - subscriptions
       verbs:
       - delete
     - apiGroups:
       - operators.coreos.com
       resources:
       - clusterserviceversions
       - catalogsources
       - installplans
       - subscriptions
       - operatorgroups
       verbs:
       - get
       - list
       - watch
5. Log in to OCP as the user and go to the InstallPlan page; the user is able to view the InstallPlan list without any error
   /k8s/ns/<projectname>/operators.coreos.com~v1alpha1~InstallPlan
6. Navigate to OperatorDetails -> Subscription tab and check whether the 'InstallPlan' name is shown on the page
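The permission check in step 4 can also be expressed programmatically. The sketch below copies the two rules printed by the `oc get clusterrole admin` command above into plain dictionaries and tests whether a verb is granted on a resource. It is a simplification: wildcards, resourceNames, and aggregation are ignored, and the helper name `allows` is hypothetical.

```python
# The two rules shown in the clusterrole output above.
RULES = [
    {
        "apiGroups": ["operators.coreos.com"],
        "resources": [
            "clusterserviceversions",
            "catalogsources",
            "installplans",
            "subscriptions",
        ],
        "verbs": ["delete"],
    },
    {
        "apiGroups": ["operators.coreos.com"],
        "resources": [
            "clusterserviceversions",
            "catalogsources",
            "installplans",
            "subscriptions",
            "operatorgroups",
        ],
        "verbs": ["get", "list", "watch"],
    },
]


def allows(rules, api_group, resource, verb):
    """Return True if any rule grants `verb` on `resource` in `api_group`."""
    for rule in rules:
        if (
            api_group in rule.get("apiGroups", [])
            and resource in rule.get("resources", [])
            and verb in rule.get("verbs", [])
        ):
            return True
    return False
```

For these rules, `allows(RULES, "operators.coreos.com", "installplans", "get")` is true, which is consistent with the report's point that the user has enough permission and the console should therefore be able to show the InstallPlan name.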

Actual results:

Only 'None' is shown on the InstallPlan section 

Expected results:

The installplan name can be shown on the subscription page 

Additional info:

 

Description of problem:

Since registry.centos.org is closed, tests relying on this registry in e2e-agnostic-ovn-cmd job are failing.

Version-Release number of selected component (if applicable):

all

How reproducible:

Trigger e2e-agnostic-ovn-cmd job

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Cluster does not finish rolling out on a 4.13 management cluster because of pod security constraints.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Install 4.14 hypershift operator on a recent 4.13 mgmt cluster
2. Create an AWS PublicAndPrivate hosted cluster on that hypershift cluster

Actual results:

Hosted cluster stalls rollout because the private router never gets created

Expected results:

Hosted cluster comes up successfully

Additional info:

Pod security enforcement is preventing the private router from getting created.

Description of problem:

Altering the ImageURL or ExtraKernelParams values in a PreprovisioningImage CR should cause the host to boot using the new image or parameters, but currently the host doesn't respond at all to changes in those fields.
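The expected behaviour amounts to comparing what the host was booted with against what the CR currently reports. The sketch below shows that comparison under stated assumptions: the `BootArtifacts` type and both function names are hypothetical, and the real controller is not written in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootArtifacts:
    # Hypothetical snapshot of the two fields named in the report.
    image_url: str
    extra_kernel_params: str


def changed_fields(booted: BootArtifacts, current: BootArtifacts) -> list:
    """List the fields that differ between the booted and current values."""
    return [
        name
        for name in ("image_url", "extra_kernel_params")
        if getattr(booted, name) != getattr(current, name)
    ]


def needs_reboot(booted: BootArtifacts, current: BootArtifacts) -> bool:
    # Any difference means the host is running something other than
    # what the PreprovisioningImage now describes.
    return bool(changed_fields(booted, current))
```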

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-01-11-225449

How reproducible:

Always

Steps to Reproduce:

1. Create a BMH
2. Set the preprovisioning image URL
3. Allow host to boot
4. Change image URL or extra kernel params

Actual results:

Host does not reboot

Expected results:

Host reboots using the newly provided image or parameters

Additional info:
BMH:

- apiVersion: metal3.io/v1alpha1
  kind: BareMetalHost
  metadata:
    annotations:
      inspect.metal3.io: disabled
    creationTimestamp: "2023-01-13T16:06:12Z"
    finalizers:
    - baremetalhost.metal3.io
    generation: 4
    labels:
      infraenvs.agent-install.openshift.io: myinfraenv
    name: ostest-extraworker-0
    namespace: assisted-installer
    resourceVersion: "61077"
    uid: 444d7246-3d0a-4188-a8c4-f407ee4f741f
  spec:
    automatedCleaningMode: disabled
    bmc:
      address: redfish+http://192.168.111.1:8000/redfish/v1/Systems/6f45ba9f-251a-46f7-a7a8-10c6ca9231dd
      credentialsName: ostest-extraworker-0-bmc-secret
    bootMACAddress: 00:b2:71:b8:14:4f
    customDeploy:
      method: start_assisted_install
    online: true
  status:
    errorCount: 0
    errorMessage: ""
    goodCredentials:
      credentials:
        name: ostest-extraworker-0-bmc-secret
        namespace: assisted-installer
      credentialsVersion: "44478"
    hardwareProfile: unknown
    lastUpdated: "2023-01-13T16:06:22Z"
    operationHistory:
      deprovision:
        end: null
        start: null
      inspect:
        end: null
        start: null
      provision:
        end: null
        start: "2023-01-13T16:06:22Z"
      register:
        end: "2023-01-13T16:06:22Z"
        start: "2023-01-13T16:06:12Z"
    operationalStatus: OK
    poweredOn: false
    provisioning:
      ID: b5e8c1a9-8061-420b-8c32-bb29a8b35a0b
      bootMode: UEFI
      image:
        url: ""
      raid:
        hardwareRAIDVolumes: null
        softwareRAIDVolumes: []
      rootDeviceHints:
        deviceName: /dev/sda
      state: provisioning
    triedCredentials:
      credentials:
        name: ostest-extraworker-0-bmc-secret
        namespace: assisted-installer
      credentialsVersion: "44478"
 

Preprovisioning Image (with changes)

- apiVersion: metal3.io/v1alpha1
  kind: PreprovisioningImage
  metadata:
    creationTimestamp: "2023-01-13T16:06:22Z"
    generation: 1
    labels:
      infraenvs.agent-install.openshift.io: myinfraenv
    name: ostest-extraworker-0
    namespace: assisted-installer
    ownerReferences:
    - apiVersion: metal3.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: BareMetalHost
      name: ostest-extraworker-0
      uid: 444d7246-3d0a-4188-a8c4-f407ee4f741f
    resourceVersion: "56838"
    uid: 37f4da76-0d1c-4e05-b618-2f0ab9d5c974
  spec:
    acceptFormats:
    - initrd
    architecture: x86_64
  status:
    architecture: x86_64
    conditions:
    - lastTransitionTime: "2023-01-13T16:34:26Z"
      message: Image has been created
      observedGeneration: 1
      reason: ImageCreated
      status: "True"
      type: Ready
    - lastTransitionTime: "2023-01-13T16:06:24Z"
      message: Image has been created
      observedGeneration: 1
      reason: ImageCreated
      status: "False"
      type: Error
    extraKernelParams: coreos.live.rootfs_url=https://assisted-image-service-assisted-installer.apps.ostest.test.metalkube.org/boot-artifacts/rootfs?arch=x86_64&version=4.12
      rd.break=initqueue
    format: initrd
    imageUrl: https://assisted-image-service-assisted-installer.apps.ostest.test.metalkube.org/images/79ef3924-ee94-42c6-96c3-2d784283120d/pxe-initrd?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI3OWVmMzkyNC1lZTk0LTQyYzYtOTZjMy0yZDc4NDI4MzEyMGQifQ.YazOZS01NoI7g_eVhLmRNmM6wKVVaZJdWbxuePia46Fo0GMLYtSOp1JTvtcStoT51g7VkSnTf8LBJ0zmbGu3HQ&arch=x86_64&version=4.12
    kernelUrl: https://assisted-image-service-assisted-installer.apps.ostest.test.metalkube.org/boot-artifacts/kernel?arch=x86_64&version=4.12
    networkData: {}

This was found while testing ZTP so in this case the assisted-service controllers are altering the preprovisioning image in response to changes made in the assisted-specific CRs, but I don't think this issue is ZTP specific.
 

Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/68

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Add snyk-secret to the parameters in the push and pull Tekton files so that a Snyk scan is performed on HO RHTAP builds.

Description of problem:

Porting rhbz#2057740 to Jira. Pods without a controller: true entry in ownerReferences are not gracefully drained by the autoscaler (and potentially other drain-library drainers). Checking a recent 4.13 CI run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn/1625150492994703360/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods.json | jq -r '.items[].metadata | select([(.ownerReferences // [])[] | select(.controller)] | length == 0) | .namespace + " " + .name + " " + (.ownerReferences | tostring)' | grep -v '^\(openshift-etcd\|openshift-kube-apiserver\|openshift-kube-controller-manager\|openshift-kube-scheduler\) ' 
openshift-marketplace certified-operators-fnm5z [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"certified-operators","uid":"4eb36072-7c56-4663-9b5a-fd23cee85432"}]
openshift-marketplace community-operators-nrfl6 [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"community-operators","uid":"0e164593-5656-4592-9915-1a5367a6a548"}]
openshift-marketplace redhat-marketplace-7j7k9 [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"redhat-marketplace","uid":"14b910c4-0e45-4188-ab57-671070b6a9f1"}]
openshift-marketplace redhat-operators-hxhxw [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"redhat-operators","uid":"ca9028e5-affb-4537-81f1-15e3a5129c6e"}]

Version-Release number of selected component (if applicable):

At least 4.11 and 4.13 (above). Likely all OpenShift 4.y which have had these openshift-marketplace pods.

How reproducible:

100%

Steps to Reproduce:

1. Launch a cluster.
2. Inspect the openshift-marketplace pods with: oc -n openshift-marketplace get -o json pods | jq -r '.items[].metadata | select(.namespace == "openshift-marketplace" and (([.ownerReferences[] | select(.controller == true)]) | length) == 0) | .name + " " + (.ownerReferences | tostring)'

Actual results:

certified-operators-fnm5z [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"certified-operators","uid":"4eb36072-7c56-4663-9b5a-fd23cee85432"}]
community-operators-nrfl6 [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"community-operators","uid":"0e164593-5656-4592-9915-1a5367a6a548"}]
redhat-marketplace-7j7k9 [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"redhat-marketplace","uid":"14b910c4-0e45-4188-ab57-671070b6a9f1"}]
redhat-operators-hxhxw [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"redhat-operators","uid":"ca9028e5-affb-4537-81f1-15e3a5129c6e"}]

Expected results:

No output.

Additional info:

Figuring out which resource to list as the controller is tricky, but there are workarounds, including pointing at the triggering resource or a ClusterOperator as the controller.
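For readers less familiar with jq, the filter used in the steps above can be sketched in Python. It takes the parsed output of `oc get -o json pods` and returns the pods that have no ownerReferences entry with `controller: true`. The function name is hypothetical.

```python
def pods_without_controller(pod_list: dict) -> list:
    """Return "namespace/name" for pods lacking a controller owner reference."""
    names = []
    for item in pod_list.get("items", []):
        meta = item.get("metadata", {})
        # ownerReferences may be absent or null; treat both as an empty list.
        owners = meta.get("ownerReferences") or []
        if not any(ref.get("controller") is True for ref in owners):
            names.append(f'{meta.get("namespace")}/{meta.get("name")}')
    return names
```

An empty result corresponds to the "No output." expectation in the report.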

Description of problem:

The chk_default_ingress.sh script for keepalived is not correctly matching the default ingress pod name anymore. The pod name in a recently deployed dev-scripts cluster is router-default-97fb6b94c-wfxfk, which does not match our grep pattern of router-default-[[:xdigit:]]\{10\}-[[:alnum:]]\{5\}. The main issue seems to be that the first id is only 9 digits, not 10.
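The mismatch is easy to reproduce. The sketch below is a Python approximation: Python's `re` module has no POSIX character classes, so `[0-9a-f]` stands in for `[[:xdigit:]]` and `[0-9a-z]` for `[[:alnum:]]`, which is narrower than the POSIX classes but sufficient for this pod name. The relaxed pattern is only one possible adjustment and is an assumption, not necessarily the fix that was applied.

```python
import re

POD_NAME = "router-default-97fb6b94c-wfxfk"

# Approximation of the original pattern: exactly 10 hex characters in the first id.
STRICT = re.compile(r"router-default-[0-9a-f]{10}-[0-9a-z]{5}")

# Hypothetical relaxed pattern: any non-empty run of hex characters.
RELAXED = re.compile(r"router-default-[0-9a-f]+-[0-9a-z]{5}")

strict_matches = STRICT.search(POD_NAME) is not None
relaxed_matches = RELAXED.search(POD_NAME) is not None
```

The first id in the pod name, 97fb6b94c, has 9 characters, so the strict pattern cannot match while the relaxed one can.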

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Unsure, but has been seen at least twice

Steps to Reproduce:

1. Deploy recent nightly build
2. Look at chk_default_ingress status
3.

Actual results:

Always failing, even on nodes with the default ingress pod

Expected results:

Passes on nodes with default ingress pod

Additional info:

 

Description of problem:

CI job "amd64-nightly-4.13-upgrade-from-stable-4.12-vsphere-ipi-proxy-workers-rhel8" failed at the RHEL node upgrade stage with the following error:

TASK [openshift_node : Apply machine config] ***********************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/apply_machine_config.yml:68
Using module file /opt/python-env/ansible-core/lib64/python3.8/site-packages/ansible/modules/command.py
Pipelining is enabled.
<192.168.233.236> ESTABLISH SSH CONNECTION FOR USER: test
<192.168.233.236> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="test"' -o ConnectTimeout=30 -o IdentityFile=/var/run/secrets/ci.openshift.io/cluster-profile/ssh-privatekey -o StrictHostKeyChecking=no -o 'ControlPath="/alabama/.ansible/cp/%h-%r"' 192.168.233.236 '/bin/sh -c '"'"'sudo -H -S -n  -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-vwugynewkogzaosazvikpnplnmjoluxs ; http_proxy=http://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX@192.168.221.228:3128 https_proxy=http://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX@192.168.221.228:3128 no_proxy=.cluster.local,.svc,10.128.0.0/14,127.0.0.1,172.30.0.0/16,192.168.233.0/25,api-int.ci-op-ssnlf4qb-1dacf.vmc-ci.devcluster.openshift.com,localhost /usr/libexec/platform-python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<192.168.233.236> (1, b'\n{"changed": XXXX, "stdout": "I0726 23:36:56.436283   27240 start.go:61] Version: v4.13.0-202307242035.p0.g7b54f1d.assembly.stream-dirty (7b54f1dcce4ea9f69f300d0e1cf2316def45bf72)\\r\\nI0726 23:36:56.437075   27240 daemon.go:478] not chrooting for source=rhel-8 target=rhel-8\\r\\nF0726 23:36:56.437240   27240 start.go:75] failed to re-exec: writing /rootfs/run/bin/machine-config-daemon: open /rootfs/run/bin/machine-config-daemon: text file busy", "stderr": "time=\\"2023-07-26T19:36:55-04:00\\" level=warning msg=\\"The input device is not a TTY. The --tty and --interactive flags might not work properly\\"", "rc": 255, "cmd": ["podman", "run", "-v", "/:/rootfs", "--pid=host", "--privileged", "--rm", "--entrypoint=/usr/bin/machine-config-daemon", "-ti", "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0110276ce82958a105cdd59028043bcdb1e5c33a77e550a13a1dc51aee08b032", "start", "--node-name", "ci-op-ssnlf4qb-1dacf-bbmqt-rhel-1", "--once-from", "/tmp/ansible.mlldlsm5/worker_ignition_config.json", "--skip-reboot"], "start": "2023-07-26 19:36:55.852527", "end": "2023-07-26 19:36:56.827081", "delta": "0:00:00.974554", "failed": XXXX, "msg": "non-zero return code", "invocation": {"module_args": {"_raw_params": "podman run -v /:/rootfs --pid=host --privileged --rm --entrypoint=/usr/bin/machine-config-daemon -ti quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0110276ce82958a105cdd59028043bcdb1e5c33a77e550a13a1dc51aee08b032 start --node-name ci-op-ssnlf4qb-1dacf-bbmqt-rhel-1 --once-from /tmp/ansible.mlldlsm5/worker_ignition_config.json --skip-reboot", "_uses_shell": false, "warn": false, "stdin_add_newline": XXXX, "strip_empty_ends": XXXX, "argv": null, "chdir": null, "executable": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
<192.168.233.236> Failed to connect to the host via ssh: 
fatal: [192.168.233.236]: FAILED! => {
    "changed": XXXX,
    "cmd": [
        "podman",
        "run",
        "-v",
        "/:/rootfs",
        "--pid=host",
        "--privileged",
        "--rm",
        "--entrypoint=/usr/bin/machine-config-daemon",
        "-ti",
        "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0110276ce82958a105cdd59028043bcdb1e5c33a77e550a13a1dc51aee08b032",
        "start",
        "--node-name",
        "ci-op-ssnlf4qb-1dacf-bbmqt-rhel-1",
        "--once-from",
        "/tmp/ansible.mlldlsm5/worker_ignition_config.json",
        "--skip-reboot"
    ],
    "delta": "0:00:00.974554",
    "end": "2023-07-26 19:36:56.827081",
    "invocation": {
        "module_args": {
            "_raw_params": "podman run -v /:/rootfs --pid=host --privileged --rm --entrypoint=/usr/bin/machine-config-daemon -ti quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0110276ce82958a105cdd59028043bcdb1e5c33a77e550a13a1dc51aee08b032 start --node-name ci-op-ssnlf4qb-1dacf-bbmqt-rhel-1 --once-from /tmp/ansible.mlldlsm5/worker_ignition_config.json --skip-reboot",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": XXXX,
            "strip_empty_ends": XXXX,
            "warn": false
        }
    },
    "msg": "non-zero return code",
    "rc": 255,
    "start": "2023-07-26 19:36:55.852527",
    "stderr": "time=\"2023-07-26T19:36:55-04:00\" level=warning msg=\"The input device is not a TTY. The --tty and --interactive flags might not work properly\"",
    "stderr_lines": [
        "time=\"2023-07-26T19:36:55-04:00\" level=warning msg=\"The input device is not a TTY. The --tty and --interactive flags might not work properly\""
    ],
    "stdout": "I0726 23:36:56.436283   27240 start.go:61] Version: v4.13.0-202307242035.p0.g7b54f1d.assembly.stream-dirty (7b54f1dcce4ea9f69f300d0e1cf2316def45bf72)\r\nI0726 23:36:56.437075   27240 daemon.go:478] not chrooting for source=rhel-8 target=rhel-8\r\nF0726 23:36:56.437240   27240 start.go:75] failed to re-exec: writing /rootfs/run/bin/machine-config-daemon: open /rootfs/run/bin/machine-config-daemon: text file busy",
    "stdout_lines": [
        "I0726 23:36:56.436283   27240 start.go:61] Version: v4.13.0-202307242035.p0.g7b54f1d.assembly.stream-dirty (7b54f1dcce4ea9f69f300d0e1cf2316def45bf72)",
        "I0726 23:36:56.437075   27240 daemon.go:478] not chrooting for source=rhel-8 target=rhel-8",
        "F0726 23:36:56.437240   27240 start.go:75] failed to re-exec: writing /rootfs/run/bin/machine-config-daemon: open /rootfs/run/bin/machine-config-daemon: text file busy"
    ]
}

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-07-26-101700

How reproducible:

always

Steps to Reproduce:

Found in ci:
1. Install a v4.13.6 cluster with rhel8 node
2. Upgrade OCP (this succeeds)
3. Upgrade rhel node

Actual results:

rhel node upgrade failed

Expected results:

rhel node upgrade succeeds

Additional info:

job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.13-amd64-nightly-4.13-upgrade-from-stable-4.12-vsphere-ipi-proxy-workers-rhel8-p2-f28/1684288836412116992

Description of problem:

A customer is raising security concerns about using port 80 for bootstrap

Version-Release number of selected component (if applicable):

4.13

RFE-3577

We should include HostedClusterDegraded in hypershift_hostedclusters_failure_conditions metric so it's obvious when there's an issue across the fleet.

  - lastTransitionTime: "2023-05-04T13:53:50Z"
    message: kube-controller-manager deployment has 1 unavailable replicas
    observedGeneration: 1
    reason: UnavailableReplicas
    status: "True"
    type: Degraded
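One way to picture the request is a table of condition types together with the status value that signals a problem, used to pick out the conditions a failure metric would count. This is only a sketch: the table contents and the function name are assumptions, and the real metric is produced by the HyperShift operator, not by this code.

```python
# Condition type -> status value that indicates a problem (illustrative only).
FAILURE_STATUS = {
    "Available": "False",
    "Degraded": "True",
}


def failure_conditions(conditions):
    """Return the types of the conditions that are in their failing state."""
    failing = []
    for cond in conditions:
        if FAILURE_STATUS.get(cond.get("type")) == cond.get("status"):
            failing.append(cond["type"])
    return failing
```

Applied to the condition shown above (type Degraded, status "True"), the function reports Degraded as failing.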

 

Description of problem:

This issue is triggered by the lack of the file "/etc/kubernetes/kubeconfig" on the node, but what I found interesting is the aesthetic error that follows:

2023-01-04T10:56:50.807982171Z I0104 10:56:50.807918   18013 start.go:112] Version: v4.11.0-202212070335.p0.g60746a8.assembly.stream-dirty (60746a843e7ef8855ae00f2ffcb655c53e0e8296)
2023-01-04T10:56:50.810326376Z I0104 10:56:50.810190   18013 start.go:125] Calling chroot("/rootfs")
2023-01-04T10:56:50.810326376Z I0104 10:56:50.810274   18013 update.go:1972] Running: systemctl start rpm-ostreed
2023-01-04T10:56:50.855151883Z I0104 10:56:50.854666   18013 rpm-ostree.go:353] Running captured: rpm-ostree status --json
2023-01-04T10:56:50.899635929Z I0104 10:56:50.899574   18013 rpm-ostree.go:353] Running captured: rpm-ostree status --json
2023-01-04T10:56:50.941236704Z I0104 10:56:50.941179   18013 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:318187717bd19ef265000570d5580ea680dfbe99c3bece6dd180537a6f268fe1 (410.84.202210061459-0)
2023-01-04T10:56:50.973206073Z I0104 10:56:50.973131   18013 start.go:101] Copied self to /run/bin/machine-config-daemon on host
2023-01-04T10:56:50.973259966Z E0104 10:56:50.973196   18013 start.go:177] failed to load kubelet kubeconfig: open /etc/kubernetes/kubeconfig: no such file or directory
2023-01-04T10:56:50.975399571Z panic: runtime error: invalid memory address or nil pointer dereference
2023-01-04T10:56:50.975399571Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x173d84f]
2023-01-04T10:56:50.975399571Z
2023-01-04T10:56:50.975399571Z goroutine 1 [running]:
2023-01-04T10:56:50.975399571Z main.runStartCmd(0x2c3da80?, {0x1bc0b3b?, 0x0?, 0x0?})
2023-01-04T10:56:50.975436752Z  /go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/start.go:179 +0x70f
2023-01-04T10:56:50.975436752Z github.com/spf13/cobra.(*Command).execute(0x2c3da80, {0x2c89310, 0x0, 0x0})
2023-01-04T10:56:50.975436752Z  /go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
2023-01-04T10:56:50.975448580Z github.com/spf13/cobra.(*Command).ExecuteC(0x2c3d580)
2023-01-04T10:56:50.975448580Z  /go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
2023-01-04T10:56:50.975456464Z github.com/spf13/cobra.(*Command).Execute(...)
2023-01-04T10:56:50.975456464Z  /go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:902
2023-01-04T10:56:50.975464649Z k8s.io/component-base/cli.Run(0x2c3d580)
2023-01-04T10:56:50.975472575Z  /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/component-base/cli/run.go:105 +0x385
2023-01-04T10:56:50.975485076Z main.main()
2023-01-04T10:56:50.975485076Z  /go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/main.go:28 +0x25

Version-Release number of selected component (if applicable):

4.11.20

How reproducible:

Always

Steps to Reproduce:

1. Remove / change the name of the file "/etc/kubernetes/kubeconfig"
2. Delete machine-config-daemon pod
3. 

Actual results:

2023-01-04T10:56:50.973259966Z E0104 10:56:50.973196   18013 start.go:177] failed to load kubelet kubeconfig: open /etc/kubernetes/kubeconfig: no such file or directory
2023-01-04T10:56:50.975399571Z panic: runtime error: invalid memory address or nil pointer dereference

Expected results:

Fatal error
 
 failed to load kubelet kubeconfig: open /etc/kubernetes/kubeconfig: no such file or directory

but no runtime error

Additional info:

https://github.com/openshift/machine-config-operator/blob/92012a837e2ed0ed3c9e61c715579ac82ad0a464/cmd/machine-config-daemon/start.go#L179
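The expected behaviour is to stop with a clear fatal error as soon as the kubeconfig cannot be loaded, rather than logging the error and then continuing with a missing object. The daemon itself is written in Go; the Python sketch below only illustrates that control flow, and the exception and function names are hypothetical.

```python
class KubeconfigError(RuntimeError):
    """Raised when the kubelet kubeconfig cannot be loaded."""


def read_kubeconfig(path: str) -> str:
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as err:
        # Fail here with a descriptive message instead of returning nothing
        # and letting a later dereference crash with an unrelated error.
        raise KubeconfigError(
            f"failed to load kubelet kubeconfig: {err}"
        ) from err
```

A caller can then treat `KubeconfigError` as fatal and exit with the message, so the user sees the missing-file error without a runtime panic after it.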

Description of problem:

The installer gets stuck at the beginning of installation if a BYO private hosted zone is configured in install-config. According to the CI logs, the installer takes no action for 2 hours.

Errors:
level=info msg=Credentials loaded from the "default" profile in file "/var/run/secrets/ci.openshift.io/cluster-profile/.awscred"
{"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:164","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2023-03-05T16:44:27Z"}


Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-23-000343

How reproducible:

Always

Steps to Reproduce:

1. Create an install-config.yaml and configure a BYO private hosted zone
2. Create the cluster

Actual results:

The installer showed the following message and then got stuck; the cluster cannot be created.

level=info msg=Credentials loaded from the "default" profile in file "/var/run/secrets/ci.openshift.io/cluster-profile/.awscred"

Expected results:

create cluster successfully

Additional info:

 

Description of problem:

It's not currently possible to override the base image selected by the command:

$ openshift-install agent create image

Also defining the OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE variable does not have any effect 

Version-Release number of selected component (if applicable):

4.14

How reproducible:

By defining the OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE when creating the image

 

Steps to Reproduce:

1. $ OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=<valid url to rhcos image> 
2. $ openshift-install agent create image
3.
 

Actual results:

The agent ISO is built by using the embedded rhcos.json metadata, instead of the rhcos image specified in the OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE

Expected results:

Defining OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE should allow overriding the base image selected for creating the agent ISO
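The expected precedence can be sketched as follows: if the environment variable is set to a non-empty value it wins, otherwise the location taken from the embedded rhcos.json metadata is used. The function name and the `embedded_url` parameter are hypothetical; only the variable name comes from the report.

```python
import os

ENV_VAR = "OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE"


def resolve_base_image(embedded_url: str, environ=None) -> str:
    """Pick the base image: the env override if set, else the embedded default."""
    env = os.environ if environ is None else environ
    override = env.get(ENV_VAR, "").strip()
    return override or embedded_url
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.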

Additional info:

 

Description of the problem:

In staging, UI 2.18.2, BE 2.18.0 - Day2 add hosts - getting the following error when assigning auto-assign role:

Failed to set roleRequested
role (auto-assign) is invalid for host c746e34f-f44a-4291-9064-402ab95b5831 from infraEnv 2b4ee2bf-ee45-4f57-b64e-715bc955f92e

How reproducible:

100%

Steps to reproduce:

1. install day1 cluster

2. In OCM, go to add host and discover new host

3. Assign auto-select role to this host

Actual results:

 

Expected results:

Description of the problem:
Please see Screening
Once installation of a cluster with a valid custom manifest has started, the manifest can no longer be listed: it is not shown in the UI or in the cluster logs, nor is it returned when calling api/assisted-install/v2/clusters/{}/manifests.
Before installation the manifest is listed; however, after installation starts the HTTP API returns an error:

{ "code": "500", "href": "", "id": 500, "kind": "Error", "reason": "Cannot list file 3a46c77e-bafc-4b66-87c8-80fe4e18806c/manifests/openshift/50-masters-chrony-configuration.yaml in cluster 3a46c77e-bafc-4b66-87c8-80fe4e18806c" }

 

How reproducible:
100%
 

Steps to reproduce:

1. Created a cluster with a custom manifest

2. Was able to see the manifest in the cluster details on the installation page (before installation started)

3. Was also able to retrieve it via an HTTP GET request

4. Started installation
Actual results:
The custom manifest is no longer visible and is not mentioned in the logs
The HTTP GET request returns the above-mentioned error (500)
It seems the custom manifest was not added

Expected results:
manifest should still be visible and applied

Description of problem:

For https://issues.redhat.com//browse/OCPBUGS-4998, additional logging was added to the wait-for command when the state is in pending-user-action in order to show the particular host errors preventing installation. This additional host info should be added at the WARNING level.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Test this in the same way as bug https://issues.redhat.com//browse/OCPBUGS-4998, i.e. by swapping the boot order of the disks
2. When the log message with additional info is logged it is logged at DEBUG level, for example
DEBUG Host master-2 Expected the host to boot from disk, but it booted the installation image - please reboot and fix boot order to boot from disk Virtual_disk 6000c295b246decdbb4f4e691c185fcf (sda, /dev/disk/by-id/wwn-0x6000c295b246decdbb4f4e691c185fcf)
INFO cluster has stopped installing... working to recover installation
3. This has now been changed to log at WARNING level
4. In addition multiple messages are logged:
"level=info msg=cluster has stopped installing... working to recover installation". This will change to only log it one time.

Actual results:

 

Expected results:

1. The message is now logged at WARNING level
2. Only one message for "cluster has stopped installing... working to recover installation" will appear
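The two expected results can be sketched with Python's standard logging module: host-specific details go out at WARNING, and the recovery message is emitted only the first time. The class, method, and logger names are hypothetical; the installer itself is not written in Python.

```python
import logging

logger = logging.getLogger("wait-for-sketch")


class InstallMonitor:
    def __init__(self) -> None:
        # Remembers whether the recovery message was already logged.
        self._recovery_logged = False

    def report_pending_user_action(self, host: str, detail: str) -> None:
        # Host problems are surfaced at WARNING so they are visible
        # without enabling debug output.
        logger.warning("Host %s %s", host, detail)
        if not self._recovery_logged:
            logger.info(
                "cluster has stopped installing... working to recover installation"
            )
            self._recovery_logged = True
```

Calling `report_pending_user_action` repeatedly while the cluster stays in pending-user-action produces one WARNING per call but only a single INFO recovery line.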

Additional info:

 

From a recent PR run of the recovery suite:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/1049/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1651162451397316608

> event happened 49 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator hmsg/593a6eb603 - pathological/true reason/UnstartedEtcdMember unstarted members: NAME-PENDING-10.0.167.169 From: 10:39:53Z To: 10:39:54Z result=reject 

 

Since the remainder of the test has passed, the event might not be reconciled correctly when a member is coming back in CEO. We should fix this event.

This is a clone of issue OCPBUGS-19492. The following is the description of the original issue:

Description of problem:

Keepalived constantly fails on bootstrap causing installation failure

Seems like it doesn't have keepalived.conf file and keepalived monitor fails on 
Version-Release number of selected component (if applicable):

4.13.12

How reproducible:

Regular installation through assisted installer 

Steps to Reproduce:

1.
2.
3.

Actual results:

keepalived fails to start

Expected results:

Success

Additional info:

Extend multus resource collection so that we gather all resources on a per namespace basis with oc adm inspect.
This way, users can create a combined must-gather with all resources in one place.

We might have to revisit this once the reconciler and other changes land in more recent versions of multus, but for the time being I think this is a good change to make, and one that we can also backport to older versions.

Due to the removal of the in-tree AWS provider (https://github.com/kubernetes/kubernetes/pull/115838), we need to ensure that KCM sets the --external-cloud-volume-plugin flag accordingly, especially since CSI migration went GA in 4.12/1.25.

The original PR that fixed this (https://github.com/openshift/cluster-kube-controller-manager-operator/pull/721) got reverted by mistake. We need to bring it back to unblock the kube rebase.

Description of problem:

When there is no public DNS zone, the public zone lookup fails during install. The installation of a private cluster does not need a public zone.
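The intended behaviour can be sketched as a guard around the lookup: only clusters published externally need a public zone, so an Internal (private) cluster skips the lookup entirely. The function name and the `lookup_public_zone` callback are hypothetical; "Internal" and "External" are the install-config publish values.

```python
from typing import Callable, Optional


def public_zone_for(
    publish: str, lookup_public_zone: Callable[[], Optional[str]]
) -> Optional[str]:
    # A private (Internal) cluster does not need a public zone,
    # so the lookup is never attempted.
    if publish == "Internal":
        return None
    zone = lookup_public_zone()
    if zone is None:
        raise LookupError("no matching public DNS Zone found")
    return zone
```

With this guard, a missing public zone is fatal only when the cluster is meant to be reachable publicly.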

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

FATAL failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": failed to get GCP public zone: no matching public DNS Zone found

Expected results:

Installation complete 

Additional info:

 

Description of problem:

cluster-dns-operator startup has an error message:

[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Start cluster-dns-operator
2. oc edit dnses.operator.openshift.io default
  -> Change operatorLogLevel to "Trace" or "Debug" (it doesn't matter which, we just want to trigger an update)
3. Observe backtrace in logs

Actual results:

[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
goroutine 201 [running]:
runtime/debug.Stack()
	/usr/lib/golang/src/runtime/debug/stack.go:24 +0x65
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
	/dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:59 +0xbd
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithValues(0xc0000bae40, {0xc000768ae0, 0x6, 0x6})
	/dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:168 +0x54
github.com/go-logr/logr.Logger.WithValues(...)
	/dns-operator/vendor/github.com/go-logr/logr/logr.go:323
sigs.k8s.io/controller-runtime/pkg/controller.NewUnmanaged.func1(0xc000991980)
	/dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/controller/controller.go:121 +0x1f6
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003265a0, {0x1bddf28, 0xc00049d7c0}, {0x17b6120?, 0xc000991960?})
	/dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:305 +0x18b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003265a0, {0x1bddf28, 0xc00049d7c0})
	/dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222 +0x587

Expected results:

No error message

Additional info:

This is due to 1.27 rebase: https://github.com/openshift/cluster-dns-operator/pull/368

This will require the following:

  • fork the webhook
  • make it part of the build process, including receiving the OCP build Dockerfile
  • write a CCO controller which deploys the webhook

https://azure.github.io/azure-workload-identity/docs/installation/mutating-admission-webhook.html

Background

  • We deploy the AWS STS pod identity webhook as a customer convenience for configuring their applications to utilize service account tokens minted by a cluster that supports STS. When you create a pod that references a service account, the webhook looks for annotations on that service account and if found, the webhook mutates the deployment in order to set environment variables + mounts the service account token on that deployment so that the pod has everything it needs to make an API client.
  • Our temporary access token (using TAT in place of STS because STS is AWS specific) enablement for (select) third party operators does not rely on the webhook and is instead using CCO to create a secret containing the variables based on the credentials requests. The service account token is also explicitly mounted for those operators. Pod identity webhooks were considered as an alternative to this approach but weren't chosen.
  • Basically, if we deploy this webhook it will be for customer convenience, and it will enable us to potentially use the Azure pod identity webhook in the future if we so choose. Note that AKS provides this webhook, and other clouds like Google offer a webhook solution for configuring customer applications.
  • This is about providing parity with other solutions but not required for anything directly related to the product.
  • If we don't provide this Azure pod identity webhook method, customers would need to get the details some other way, such as from a secret or from explicitly set environment variables. With the webhook, you just annotate your service account.

Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/193

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/363

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When running a local bridge with auth disabled, the following error appears:
GET http://localhost:9000/api/request-token 404 (Not Found) 

Version-Release number of selected component (if applicable):

latest master

How reproducible:

Always

Steps to Reproduce:

1. fetch latest openshift/console code and build
2. run local bridge './bin/bridge'
3.

Actual results:

Visiting localhost:9000 shows the error GET http://localhost:9000/api/request-token 404 (Not Found)

Expected results:

Maybe we should skip the /api/request-token request when auth is disabled, as suggested in https://github.com/openshift/console/pull/12553#discussion_r1103151813

Additional info:

 

 

Nodes in Ironic are created following pattern <namespace>~<host name>.

However, when creating nodes in Ironic, baremetal-operator first creates them without a namespace, and only prepends the namespace prefix later. This opens the possibility of node name clashes, especially in the ACM context.
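
The clash can be illustrated with a minimal sketch. The helper name `ironic_node_name` and the sample namespaces and host names are hypothetical and are not taken from baremetal-operator; only the `<namespace>~<host name>` pattern comes from the description above.

```python
def ironic_node_name(namespace: str, host_name: str) -> str:
    """Build a node name following the <namespace>~<host name> pattern."""
    return f"{namespace}~{host_name}"


# Two hosts with the same name in different namespaces (hypothetical values).
hosts = [("cluster-a", "worker-0"), ("cluster-b", "worker-0")]

# Names created without the namespace prefix collide ...
bare_names = {host for _, host in hosts}
# ... while names that carry the prefix from the start stay unique.
prefixed_names = {ironic_node_name(ns, host) for ns, host in hosts}

print(sorted(bare_names))
print(sorted(prefixed_names))
```

In this model, the window between creating the bare name and adding the prefix is where two hosts from different namespaces could collide.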

This is a clone of issue OCPBUGS-19313. The following is the description of the original issue:

Description

As a user, I don't want to see the "DeploymentConfigs" option in any form I am filling in when DeploymentConfigs are not installed in the cluster.

Acceptance Criteria

  1. Remove the DC option under the Resource Type dropdown in following forms:
    • Import from Git
    • Container Image
    • Import JAR
    • Builder Images (Developer Catalog)

Additional Details:

Description of problem:

The issue concerns the Add Pipeline checkbox. When 2 pipelines are displayed in the dropdown menu, selecting one of them unchecks the Add Pipeline checkbox.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always when 2 pipelines are in the ns

Steps to Reproduce:

1. Go to the Git Import Page. Create the application with Add Pipelines checked and a pipeline selected.
2. Go to the Serverless Function Page. Select Add Pipelines checkbox and try to select a pipeline from the drop-down. 

Actual results:

The Add Pipelines checkbox automatically gets unchecked on selecting a Pipeline from the drop-down (in case of multiple pipelines in the dropdown)

Expected results:

The Add Pipelines checkbox must not get un-checked.

Additional info:

Video Link: https://drive.google.com/file/d/1OPRXbMw-EiihO3LAlDiOsh8qvhhiJK5H/view?usp=sharing

Description of problem:

In Agent TUI, setting

IPV6 Configuration to Automatic

and enabling

Require IPV6 addressing for this connection

generates a message saying that the feature is not supported. The user is nevertheless allowed to quit the TUI, and the boot process proceeds with an unsupported, non-working network configuration. (Quitting is formally correct, given that 'Quit' was selected from the menu, but perhaps the 'Quit' option should remain greyed out until a valid config is applied?)

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-07-131556 

How reproducible:

 

Steps to Reproduce:

1. Feed the agent ISO with an agent-config.yaml file that defines an ipv6 only, static network configuration

2. Boot from the generated agent ISO, wait for the agent TUI to appear, select 'Edit a connection', then change the IPv6 configuration from Manual to Automatic and, at the same time, enable the 'Require IPV6 addressing for this connection' option. Accept the changes.

3. (Not sure if this step is necessary) Once back in the main agent TUI screen, select 'Activate a connection'.
Select the currently active connection, de-activate and re-activate it.

4. Go back to main agent TUI screen, select Quit

Actual results:

The agent TUI displays the following message, then quits:

Failed to generate network state view: support for multiple default routes not yet implemented in agent-tui

Once the TUI quits, the boot process proceeds

Expected results:

The TUI blocks the possibility to enable unsupported configurations

The agent TUI informs the user about the unsupported configuration the moment it is applied (instead of the moment the user selects 'Quit') and stays open until a valid network configuration is applied

The TUI should put the boot process on hold until a valid network config is applied

Additional info:

OCP Version: 4.13.0-0.nightly-2023-03-07-131556 

agent-config.yaml snippet

  networkConfig:
    interfaces:
      - name: eno1
        type: ethernet
        state: up
        mac-address: 34:73:5A:9E:59:10
        ipv6:
          enabled: true
          address:
            - ip: 2620:52:0:1eb:3673:5aff:fe9e:5910
              prefix-length: 64
          dhcp: false

Description of problem:

I found an old shell error while checking logs. We don't quote a variable with [ -z ].

    if [ -z $DHCP6_IP6_ADDRESS ]
    then
        >&2 echo "Not a DHCP6 address. Ignoring."
        exit 0
    fi

https://github.com/openshift/machine-config-operator/blob/master/templates/common/baremetal/files/NetworkManager-static-dhcpv6.yaml#L8


Dec 05 12:05:02 master-0-2 nm-dispatcher[1365]: time="2022-12-05T12:05:02Z" level=debug msg="Ignoring filtered route {Ifindex: 10 Dst: fd2e:6f44:5dd8::59/128 Src: <nil> Gw: <nil> Flags: [] Table: 254}"
Dec 05 12:05:02 master-0-2 nm-dispatcher[1365]: time="2022-12-05T12:05:02Z" level=debug msg="Ignoring filtered route {Ifindex: 10 Dst: fd2e:6f44:5dd8::5a/128 Src: <nil> Gw: <nil> Flags: [] Table: 254}"

Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: req:19 'up' [br-ex], "/etc/NetworkManager/dispatcher.d/30-static-dhcpv6": run script
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + '[' -z fd2e:6f44:5dd8::5a fd2e:6f44:5dd8::59 ']'
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: /etc/NetworkManager/dispatcher.d/30-static-dhcpv6: line 4: [: fd2e:6f44:5dd8::5a: binary operator expected
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: ++ ip -j -6 a show br-ex
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: ++ jq -r '.[].addr_info[] | select(.scope=="global") | select(.deprecated!=true) | select(.local=="fd2e:6f44:5dd8::5a fd2e:6f44:5dd8::59") | .preferred_life_time'
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + LEASE_TIME=
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: ++ ip -j -6 a show br-ex
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: ++ jq -r '.[].addr_info[] | select(.scope=="global") | select(.deprecated!=true) | select(.local=="fd2e:6f44:5dd8::5a fd2e:6f44:5dd8::59") | .prefixlen'
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + PREFIX_LEN=
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + '[' 0 -lt 4294967295 ']'
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + echo 'Not an infinite DHCP6 lease. Ignoring.'
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: Not an infinite DHCP6 lease. Ignoring.
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + exit 0
Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: req:19 'up' [




Version-Release number of selected component (if applicable):

4.10.0-0.nightly-2022-11-30-111136

How reproducible:

Twice

Steps to Reproduce:

1. Somehow DHCPv6 provides two IPv6 leases
2. NetworkManager sets $DHCP6_IP6_ADDRESS to all of the IPv6 addresses, separated by spaces
3. Bash error

Actual results:


/etc/NetworkManager/dispatcher.d/30-static-dhcpv6: line 4: [: fd2e:6f44:5dd8::5a: binary operator expected

Expected results:

shell inputs are sanitized or properly quoted.
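
Why the unquoted variable breaks the test can be modeled in a few lines. This is a sketch of shell word splitting under default settings, written in Python for illustration; the helper name `bracket_argv` is hypothetical and the code is not part of the dispatcher script.

```python
def bracket_argv(expansion_words):
    """Return the argument list that `[` receives for `[ -z <words> ]`."""
    return ["[", "-z", *expansion_words, "]"]


# Value taken from the log above: two addresses separated by a space.
value = "fd2e:6f44:5dd8::5a fd2e:6f44:5dd8::59"

# Unquoted ($VAR): the shell splits the value on whitespace into separate words.
unquoted = bracket_argv(value.split())
# Quoted ("$VAR"): the value stays a single word, whatever it contains.
quoted = bracket_argv([value])

print(unquoted)  # -z is followed by two operands, which `[` cannot parse
print(quoted)    # -z is followed by exactly one operand
```

The unquoted argument list matches the trace line `'[' -z fd2e:6f44:5dd8::5a fd2e:6f44:5dd8::59 ']'`. Writing `[ -z "$DHCP6_IP6_ADDRESS" ]` keeps the operand count at one.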

Additional info:

This is a clone of issue OCPBUGS-19868. The following is the description of the original issue:

Description of problem:

The cluster-version operator should not crash while trying to evaluate a bogus condition.

Version-Release number of selected component (if applicable):

4.10 and later are exposed to the bug. It's possible that the OCPBUGS-19512 series increases exposure.

How reproducible:

Unclear.

Steps to Reproduce:

1. Create a cluster.
2. Point it at https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge.json (you may need to adjust version strings and digests for your test-cluster's release).
3. Wait around 30 minutes.
4. Point it at https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json (again, may need some customization).

Actual results:

$ grep -B1 -A15 'too fresh' previous.log
I0927 12:07:55.594222       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json?arch=amd64&channel=stable-4.15&id=dc628f75-7778-457a-bb69-6a31a243c3a9&version=4.15.0-0.test-2023-09-27-091926-ci-ln-01zw7kk-latest
I0927 12:07:55.726463       1 cache.go:118] {"type":"PromQL","promql":{"promql":"0 * group(cluster_version)"}} is the most stale cached cluster-condition match entry, but it is too fresh (last evaluated on 2023-09-27 11:37:25.876804482 +0000 UTC m=+175.082381015).  However, we don't have a cached evaluation for {"type":"PromQL","promql":{"promql":"group(cluster_version_available_updates{channel=buggy})"}}, so attempt to evaluate that now.
I0927 12:07:55.726602       1 cache.go:129] {"type":"PromQL","promql":{"promql":"0 * group(cluster_version)"}} is stealing this cluster-condition match call for {"type":"PromQL","promql":{"promql":"group(cluster_version_available_updates{channel=buggy})"}}, because its last evaluation completed 30m29.849594461s ago
I0927 12:07:55.758573       1 cvo.go:703] Finished syncing available updates "openshift-cluster-version/version" (170.074319ms)
E0927 12:07:55.758847       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 194 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c4df00?, 0x32abc60})
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc001489d40?})
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1c4df00, 0x32abc60})
        /usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql.(*PromQL).Match(0xc0004860e0, {0x220ded8, 0xc00041e550}, 0x0)
        /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql/promql.go:134 +0x419
github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache.(*Cache).Match(0xc0002d3ae0, {0x220ded8, 0xc00041e550}, 0xc0033948d0)
        /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache/cache.go:132 +0x982
github.com/openshift/cluster-version-operator/pkg/clusterconditions.(*conditionRegistry).Match(0xc000016760, {0x220ded8, 0xc00041e550}, {0xc0033948a0, 0x1, 0x0?})

Expected results:

No panics.

Additional info:

I'm still not entirely clear on how OCPBUGS-19512 would have increased exposure.

There are Prometheus rules defined in the kubestate rules that trigger the `Kube*QuotaOvercommit` alerts.

These alerts are triggered when the sum of memory/CPU resource quotas for the default/kube-/openshift- namespaces exceeds the capacity of the cluster.

Since no quotas are defined in the default OCP projects, and customers are not expected to create quotas for them, these alerts add no value and it would be good to remove them.

This is a clone of issue OCPBUGS-18267. The following is the description of the original issue:

Description of problem:

'404: Not Found' will show on Knative-serving Details page

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-13-223353

How reproducible:

Always

Steps to Reproduce:

1. Install the 'Serverless' Operator, make sure the operator has been installed successfully, and make sure the Knative Serving instance is created without any error
2. Navigate to Administration -> Cluster Settings -> Global Configuration
3. Go to the Knative-serving Details page and check whether the 404 not found message is there

Actual results:

Page will show 404 not found

Expected results:

the 404 not found page should not show

Additional info:

The dependency ticket is OCPBUGS-15008; see the comments there for more information

Description of problem:

When deploying KafkaMirrorMaker through OLM form (in AMQ Streams and Strimzi operator) we have to specify fields, which already have defaults and are optional:

  • Liveness Probe
  • Readiness Probe
  • Tracing

For all other components it's correct.

Version-Release number of selected component (if applicable):

4.6
4.7
4.8
4.9

How reproducible:

Steps to Reproduce:
1. Deploy Strimzi 0.27.0 or AMQ Streams 1.8.4 via OLM
2. Try to deploy KafkaMirrorMaker via Form view without any changes

Actual results:
CR cannot be created because several required fields (all are in Liveness probe, Readiness probe and Tracing part) are not filled.

Expected results:
CR will be created, because all required fields are set (whitelist/include, kafka bootstrap address and replicas count, nothing else is needed)

Additional info:

openshift-azure-routes.path has the following [Path] section:

[Path]
PathExistsGlob=/run/cloud-routes/*
PathChanged=/run/cloud-routes/
MakeDirectory=true

 

There was a change in systemd that re-checks the files watched with PathExistsGlob once the service finishes:

With this commit, systemd rechecks all paths specs whenever the triggered unit deactivates. If any PathExists=, PathExistsGlob= or DirectoryNotEmpty= predicate passes, the triggered unit is reactivated

 

This means that openshift-azure-routes will get triggered all the time as long as there are files in /run/cloud-routes.
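
The effect of this recheck can be modeled with a small simulation. This is a sketch of the behavior quoted above, not systemd code; the function name, the `service_removes_files` flag, the sample file name, and the round cap are all hypothetical.

```python
def count_activations(pending_files, service_removes_files, max_rounds=5):
    """Count how often the triggered unit runs while the glob still matches.

    Models the quoted rule: after each deactivation the path predicate is
    rechecked, and the unit is reactivated if any matching file exists.
    max_rounds only stops the simulation; it is not a systemd feature.
    """
    files = set(pending_files)
    activations = 0
    while files and activations < max_rounds:
        activations += 1  # the service runs once, then deactivates
        if service_removes_files:
            files.clear()  # nothing left for PathExistsGlob to match
    return activations


sample = {"/run/cloud-routes/example.up"}  # hypothetical file name
print(count_activations(sample, service_removes_files=False))
print(count_activations(sample, service_removes_files=True))
```

In this model, a service that leaves its input files in place is reactivated on every recheck until the cap is hit, while one that consumes them runs once.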

Description of problem:

Backport https://github.com/kubernetes/kubernetes/pull/117371

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Web-terminal tests are constantly failing on CI. Disable them till they are fixed.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-console-master-e2e-gcp-console

https://search.ci.openshift.org/?search=Web+Terminal+for+Admin+user&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Expected results:


Additional info:


Description of problem:

kubevirt digest missing from RHCOS boot image

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

Unable to create kubevirt cluster

Expected results:

Able to create kubevirt cluster

Additional info:

 

Description of problem:

aws-proxy jobs are failing with workers unable to come up. Example job run[1].  On the console, the workers report 500 errors trying to retrieve the worker ignition[2]. 

Is it possible https://github.com/openshift/machine-config-operator/pull/3662 broke things? See logs below.


[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-proxy/1648560213655031808
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-proxy/1648560213655031808/artifacts/e2e-aws-ovn-proxy/gather-aws-console/artifacts/i-071b5af3ddb12e55c

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1.  Install with a proxy

Actual results:

No workers come up

Expected results:

 

Additional info:

Logs are reporting: 

2023-04-19T12:29:38.244051716Z I0419 12:29:38.244006 1 container_runtime_config_controller.go:415] Error syncing image config openshift-config: could not get ControllerConfig controllerconfig.machineconfiguration.openshift.io "machine-config-controller" not found
2023-04-19T12:29:56.507515526Z I0419 12:29:56.507472 1 render_controller.go:377] Error syncing machineconfigpool worker: controllerconfig.machineconfiguration.openshift.io "machine-config-controller" not found

./pods/machine-config-operator-6d7c6c8ccf-m7c57/machine-config-operator/machine-config-operator/logs/current.log:2023-04-19T12:38:15.240508503Z E0419 12:38:15.240437 1 operator.go:342] ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.proxy.apiVersion: Required value: must not be empty, spec.proxy.kind: Required value: must not be empty, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

csi-snapshot-controller ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.

Description of problem:

Configure diskEncryptionSet in install-config.yaml as shown below, without setting subscriptionId, since it is an optional parameter.

install-config.yaml
--------------------------------
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      encryptionAtHost: true
      osDisk:
        diskEncryptionSet:
          resourceGroup: jima07a-rg
          name: jima07a-des
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      encryptionAtHost: true
      osDisk:
        diskEncryptionSet:
          resourceGroup: jima07a-rg
          name: jima07a-des
  replicas: 3
platform:
  azure:
    baseDomainResourceGroupName: os4-common
    cloudName: AzurePublicCloud
    outboundType: Loadbalancer
    region: centralus
    defaultMachinePlatform:
      osDisk:
        diskEncryptionSet:
          resourceGroup: jima07a-rg
          name: jima07a-des

Then create the manifests and create the cluster. The installer fails with this error:
$ ./openshift-install create cluster --dir ipi --log-level debug
...
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json" 
FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet: Invalid value: azure.DiskEncryptionSet{SubscriptionID:"", ResourceGroup:"jima07a-rg", Name:"jima07a-des"}: failed to get disk encryption set: compute.DiskEncryptionSetsClient#Get: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="InvalidSubscriptionId" Message="The provided subscription identifier 'resourceGroups' is malformed or invalid." 

Checked manifest file cluster-config.yaml, and found that subscriptionId is not filled out automatically under defaultMachinePlatform
$ cat cluster-config.yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundlePolicy: Proxyonly
    apiVersion: v1
    baseDomain: qe.azure.devcluster.openshift.com
    compute:
    - architecture: amd64
      hyperthreading: Enabled
      name: worker
      platform:
        azure:
          encryptionAtHost: true
          osDisk:
            diskEncryptionSet:
              name: jima07a-des
              resourceGroup: jima07a-rg
              subscriptionId: 53b8f551-f0fc-4bea-8cba-6d1fefd54c8a
            diskSizeGB: 0
            diskType: ""
          osImage:
            offer: ""
            publisher: ""
            sku: ""
            version: ""
          type: ""
      replicas: 3
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform:
        azure:
          encryptionAtHost: true
          osDisk:
            diskEncryptionSet:
              name: jima07a-des
              resourceGroup: jima07a-rg
              subscriptionId: 53b8f551-f0fc-4bea-8cba-6d1fefd54c8a
            diskSizeGB: 0
            diskType: ""
          osImage:
            offer: ""
            publisher: ""
            sku: ""
            version: ""
          type: ""
      replicas: 3
    metadata:
      creationTimestamp: null
      name: jimadesa
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.0.0.0/16
      networkType: OVNKubernetes
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      azure:
        baseDomainResourceGroupName: os4-common
        cloudName: AzurePublicCloud
        defaultMachinePlatform:
          osDisk:
            diskEncryptionSet:
              name: jima07a-des
              resourceGroup: jima07a-rg
            diskSizeGB: 0
            diskType: ""
          osImage:
            offer: ""
            publisher: ""
            sku: ""
            version: ""
          type: ""
        outboundType: Loadbalancer
        region: centralus
    publish: External

Installation works well when the disk encryption set is configured without subscriptionId under either defaultMachinePlatform or controlPlane/compute.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-05-104719

How reproducible:

Always on 4.11, 4.12, 4.13

Steps to Reproduce:

1. Prepare install-config, configuring diskEncryptionSet under defaultMachinePlatform, controlPlane and compute without subscriptionId
2. Install cluster 
3.

Actual results:

The installer fails with the error shown above

Expected results:

The cluster is installed successfully

Additional info:

 

 

 

 

Description of problem:

The OCP installer's OpenStack Ironic iRMC driver doesn't work with FIPS mode enabled, because it requires the SNMP version to be set to v3. However, there is no way to set the SNMP version parameter in the RHOCP installer YAML file, so it falls back to the default (v2c), and installation fails 100% of the time.

Version-Release number of selected component (if applicable):

Release Number: 14.0-ec.0

Drivers or hardware or architecture dependency:
Deploying a baremetal node whose BMC uses the iRMC protocol (i.e. when the RHOCP installer uses the OpenStack Ironic iRMC driver)

Hardware configuration:
Model/Hypervisor: PRIMERGY RX2540 M6
CPU Info: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz
Memory Info: 125G
Hardware Component Information: None
Configuration Info: None
Guest Configuration Info: None

How reproducible:

Always

Steps to Reproduce:

  1. Enable FIPS mode of RHOCP nodes through setting "fips" to "true" at install-config.yaml.
  2. In install-config.yaml, set platform.baremetal.hosts.bmc.address to start with 'irmc://'
  3. Run OpenShift Container Platform installer.

Actual results:

OpenStack Ironic iRMC driver used in OpenShift Container Platform installer doesn't work and installation fails. Log message suggests setting SNMP version parameter of Ironic iRMC driver to v3 (non-default value) under FIPS mode enabled.

Expected results:

When FIPS mode is enabled on RHOCP, OpenStack Ironic iRMC driver used in RHOCP installer checks whether iRMC driver is configured to use SNMP (current OCP installer configures iRMC driver not to use SNMP) and if iRMC driver is configured not to use SNMP, driver doesn't require setting SNMP version parameter to v3 and installation proceeds. If iRMC driver is configured to use SNMP, driver requires setting SNMP version parameter to v3.

Additional info:

When FIPS mode is enabled, installation of RHOCP into Fujitsu server fails
because OpenStack Ironic iRMC driver, which is used in RHOCP installer,
requires iRMC driver's SNMP version parameter to be set to v3 even though
iRMC driver isn't configured to use SNMP and there is no way to set it to v3.

Installing RHOCP with IPI to baremetal node uses install-config.yaml.
User sets configuration related to RHOCP in install-config.yaml.
This installation uses OpenStack Ironic internally and values in
install-config.yaml affect behavior of Ironic.
During installation, Ironic connects to BMC(Baseboard management controller)
and does operation related to RHOCP installation (e.g. power management).

Ironic uses iRMC driver to operate on Fujitsu server's BMC. And iRMC driver checks
iRMC-driver-specific Ironic parameters stored at Ironic component.
When FIPS is enabled (i.e. "fips" is set to "true" in install-config.yaml), the iRMC
driver requires the SNMP version specified in the Ironic parameter to be v3,
even though the iRMC driver isn't configured to use SNMP internally.
Currently, the default value of Ironic's SNMP version parameter, which is specific
to the iRMC driver, is v2c and not v3. The iRMC driver fails with an error if the SNMP
version is set to anything other than v3 when FIPS is enabled.

However, there is no way to set SNMP version parameter in RHOCP and that
parameter is set to v2c by default. So when FIPS is enabled, deployment of
OpenShift to Fujitsu server always fails.

The cause of the problem is that, when FIPS is enabled, the iRMC driver always requires
the SNMP version parameter to be set to v3 even though the iRMC driver is not configured
to use SNMP (the current RHOCP installer configures the iRMC driver not to use SNMP).
To solve this problem, the iRMC driver should be modified to check whether it
is configured to use SNMP internally and to require the SNMP version parameter
to be set to v3 only if SNMP is in use and FIPS is enabled.
Such a patch has already been submitted to the OpenStack Ironic community[1].

Summary of actions taken to resolve issue:
Use OpenStack Ironic iRMC driver which incorporates bug fix patch[1] submitted on OpenStack Ironic community.

 [1] https://review.opendev.org/c/openstack/ironic/+/881358

Description of problem:

Currently PowerVS uses a DefaultMachineCIDR: 192.168.0.0/24
This will create network conflicts if another cluster is created in the same zone.

Version-Release number of selected component (if applicable):

current master branch

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:

The fix is to use a random number for DefaultMachineCIDR: 192.168.%d.0/24. This should significantly reduce the chances of collisions.
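
A minimal sketch of the idea follows. It is not the installer's code: the function name is hypothetical, and drawing the third octet uniformly from 0 to 255 is an assumption, since the description only gives the `192.168.%d.0/24` template.

```python
import ipaddress
import random


def random_machine_cidr(rng: random.Random) -> ipaddress.IPv4Network:
    """Fill the third octet of 192.168.%d.0/24 with a random value in 0..255."""
    third_octet = rng.randrange(256)  # assumed range; the report does not state it
    return ipaddress.ip_network("192.168.%d.0/24" % third_octet)


cidr = random_machine_cidr(random.Random())
print(cidr)
```

Under this assumption, two independently created clusters pick the same network with probability 1/256, so collisions become unlikely but remain possible.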

This is a clone of issue OCPBUGS-13829. The following is the description of the original issue:

Description of problem:

The configured accessTokenInactivityTimeout under tokenConfig in HostedCluster doesn't have any effect.
1. The value is not getting updated in oauth-openshift configmap 
2. hostedcluster allows the user to set an accessTokenInactivityTimeout value < 300s, whereas in the master cluster the value should be > 300s.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Install a fresh 4.13 hypershift cluster  
2. Configure accessTokenInactivityTimeout as below:
$ oc edit hc -n clusters
...
  spec:
    configuration:
      oauth:
        identityProviders:
        ...
        tokenConfig:          
          accessTokenInactivityTimeout: 100s
...
3. Check the hcp:
$ oc get hcp -oyaml
...
        tokenConfig:           
          accessTokenInactivityTimeout: 1m40s
...

4. Log in to the guest cluster with testuser-1 and get the token:
$ oc login https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443 -u testuser-1 -p xxxxxxx
$ TOKEN=`oc whoami -t`
$ oc login --token="$TOKEN"
WARNING: Using insecure TLS client config. Setting this option is not supported!
Logged into "https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443" as "testuser-1" using the token provided.
You don't have any projects. You can try to create a new project, by running
    oc new-project <projectname>

Actual results:

1. The HostedCluster allows the user to set accessTokenInactivityTimeout to a value < 300s, which is not possible on a master cluster.

2. The value is not updated in oauth-openshift configmap:
$ oc get cm oauth-openshift -oyaml -n clusters-hypershift-ci-25785 
...
      tokenConfig:
        accessTokenMaxAgeSeconds: 86400
        authorizeTokenMaxAgeSeconds: 300
...

3. Login does not fail even when the user has been inactive for longer than the configured accessTokenInactivityTimeout.

Expected results:

Login fails if the user has been inactive for longer than the configured accessTokenInactivityTimeout.
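
The missing validation can be illustrated with a short sketch. It parses durations in the forms seen above ("100s", "1m40s") and rejects values below the 300s threshold. The function names are hypothetical, and treating exactly 300s as valid is an assumption, since the report only distinguishes "< 300s" from "> 300s".

```python
import re

MIN_INACTIVITY_TIMEOUT_SECONDS = 300  # threshold taken from the report

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration_seconds(text: str) -> int:
    """Parse durations made of whole-number h/m/s parts, e.g. '100s' or '1m40s'."""
    if not re.fullmatch(r"(\d+[hms])+", text):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = re.findall(r"(\d+)([hms])", text)
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def is_allowed_timeout(text: str) -> bool:
    """Return True when the timeout is at least the minimum (assumed inclusive)."""
    return parse_duration_seconds(text) >= MIN_INACTIVITY_TIMEOUT_SECONDS


print(is_allowed_timeout("100s"))  # the value used in the reproduction steps
```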

Description of problem:

In the administrator console, as an admin user, go to "Workloads -> Pods", select a project (for example, openshift-console), select a pod to open its Pod details page, click the "Metrics" tab, and then click the "Network in" or "Network out" graph. The Prometheus expression that is shown has spaces before and after "pod_network_name_info": it reads "( pod_network_name_info )", where "pod_network_name_info" would be enough.

"Network in" expression

(sum(irate(container_network_receive_bytes_total{pod='console-5f4978747c-vmxqf', namespace='openshift-console'}[5m])) by (pod, namespace, interface)) + on(namespace,pod,interface) group_left(network_name) ( pod_network_name_info )

"Network out" expression

(sum(irate(container_network_transmit_bytes_total{pod='console-5f4978747c-vmxqf', namespace='openshift-console'}[5m])) by (pod, namespace, interface)) + on(namespace,pod,interface) group_left(network_name) ( pod_network_name_info ) 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-05-19-234822

How reproducible:

always

Steps to Reproduce:

1. see the description
2.
3.

Actual results:

there are spaces before and after pod_network_name_info

Expected results:

no additional spaces

Additional info:

the bug does not have functional impact 

Description of problem:

Using an agent-config.yaml in DHCP network mode (i.e. without the 'hosts' property) produces this error when loading the config-image:
load-config-iso.sh[1656]: Expected file /etc/assisted/manifests/nmstateconfig.yaml is not in archive

Version-Release number of selected component (if applicable):

4.14 (master)

How reproducible:

100%

Steps to Reproduce:

1. Create an agent-config.yaml without 'hosts' property.
2. Generate a config-image.
3. Boot the machine and mount the ISO.

Actual results:

Installation can't continue due to an error on config-iso load:
load-config-iso.sh[1656]: Expected file /etc/assisted/manifests/nmstateconfig.yaml is not in archive

Expected results:

The installation should continue as normal.

Additional info:

The issue is probably due to a fix introduced for static networking:
https://issues.redhat.com/browse/OCPBUGS-15637
That is, since '/etc/assisted/manifests/nmstateconfig.yaml' was added to GetConfigImageFiles, it is now mandatory in load-config-iso.sh (see the 'copy_archive_contents' func).

The failure was probably missed by the dev-scripts tests because of this issue: https://github.com/openshift-metal3/dev-scripts/pull/1551
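
One way to picture the expected behaviour is a check that distinguishes required from optional files in the archive. This is only a sketch of the idea, not the logic of load-config-iso.sh; the function name and the second file path are hypothetical.

```python
def missing_required_files(archive_names, expected):
    """Return the required paths that are absent from the archive.

    `expected` maps each path to True (required) or False (optional).
    Optional files, such as nmstateconfig.yaml in DHCP mode, never count
    as missing.
    """
    present = set(archive_names)
    return sorted(
        path for path, required in expected.items()
        if required and path not in present
    )


expected = {
    "/etc/assisted/manifests/nmstateconfig.yaml": False,  # absent without 'hosts'
    "/etc/assisted/manifests/example-required.yaml": True,  # hypothetical path
}
archive = ["/etc/assisted/manifests/example-required.yaml"]
print(missing_required_files(archive, expected))
```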

Description of problem:

https://github.com/kubernetes/kubernetes/issues/118916

Version-Release number of selected component (if applicable):

4.14

How reproducible:

100% 

Steps to Reproduce:

1. compare memory usage from v1 and v2 and notice differences with the same workloads
2.
3.

Actual results:

they slightly differ because of accounting differences 

Expected results:

they should be largely the same

Additional info:

 

Description of problem:

Since the operator watches plugins to enable dynamic plugins, it should list that resource under `status.relatedObjects` in its ClusterOperator.

Additional info:

Migrated from bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044588

Description of problem:

In a freshly installed cluster, we can see hot-looping on Service openshift-monitoring/cluster-monitoring-operator.

  1. grep -o 'Updating .*due to diff' cvo2.log | sort | uniq -c
    18 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
    12 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff

Looking at the CronJob hot-looping

# grep -A60 'Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff' cvo2.log | tail -n61
I0110 06:32:44.489277       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
  	Object: map[string]interface{}{
  		"apiVersion": string("batch/v1"),
  		"kind":       string("CronJob"),
  		"metadata":   map[string]interface{}{"annotations": map[string]interface{}{"include.release.openshift.io/ibm-cloud-managed": string("true"), "include.release.openshift.io/self-managed-high-availability": string("true")}, "creationTimestamp": string("2022-01-10T04:35:19Z"), "generation": int64(1), "managedFields": []interface{}{map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:metadata": map[string]interface{}{"f:annotations": map[string]interface{}{".": map[string]interface{}{}, "f:include.release.openshift.io/ibm-cloud-managed": map[string]interface{}{}, "f:include.release.openshift.io/self-managed-high-availability": map[string]interface{}{}}, "f:ownerReferences": map[string]interface{}{".": map[string]interface{}{}, `k:{"uid":"334d6c04-126d-4271-96ec-d303e93b7d1c"}`: map[string]interface{}{}}}, "f:spec": map[string]interface{}{"f:concurrencyPolicy": map[string]interface{}{}, "f:failedJobsHistoryLimit": map[string]interface{}{}, "f:jobTemplate": map[string]interface{}{"f:spec": map[string]interface{}{"f:template": map[string]interface{}{"f:spec": map[string]interface{}{"f:containers": map[string]interface{}{`k:{"name":"collect-profiles"}`: map[string]interface{}{".": map[string]interface{}{}, "f:args": map[string]interface{}{}, "f:command": map[string]interface{}{}, "f:image": map[string]interface{}{}, ...}}, "f:dnsPolicy": map[string]interface{}{}, "f:priorityClassName": map[string]interface{}{}, "f:restartPolicy": map[string]interface{}{}, ...}}}}, "f:schedule": map[string]interface{}{}, ...}}, "manager": string("cluster-version-operator"), ...}, map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:status": map[string]interface{}{"f:lastScheduleTime": map[string]interface{}{}, "f:lastSuccessfulTime": map[string]interface{}{}}}, "manager": string("kube-controller-manager"), ...}}, ...},
  		"spec": map[string]interface{}{
+ 			"concurrencyPolicy":      string("Allow"),
+ 			"failedJobsHistoryLimit": int64(1),
  			"jobTemplate": map[string]interface{}{
+ 				"metadata": map[string]interface{}{"creationTimestamp": nil},
  				"spec": map[string]interface{}{
  					"template": map[string]interface{}{
+ 						"metadata": map[string]interface{}{"creationTimestamp": nil},
  						"spec": map[string]interface{}{
  							"containers": []interface{}{
  								map[string]interface{}{
  									... // 4 identical entries
  									"name":                     string("collect-profiles"),
  									"resources":                map[string]interface{}{"requests": map[string]interface{}{"cpu": string("10m"), "memory": string("80Mi")}},
+ 									"terminationMessagePath":   string("/dev/termination-log"),
+ 									"terminationMessagePolicy": string("File"),
  									"volumeMounts":             []interface{}{map[string]interface{}{"mountPath": string("/etc/config"), "name": string("config-volume")}, map[string]interface{}{"mountPath": string("/var/run/secrets/serving-cert"), "name": string("secret-volume")}},
  								},
  							},
+ 							"dnsPolicy":                     string("ClusterFirst"),
  							"priorityClassName":             string("openshift-user-critical"),
  							"restartPolicy":                 string("Never"),
+ 							"schedulerName":                 string("default-scheduler"),
+ 							"securityContext":               map[string]interface{}{},
+ 							"serviceAccount":                string("collect-profiles"),
  							"serviceAccountName":            string("collect-profiles"),
+ 							"terminationGracePeriodSeconds": int64(30),
  							"volumes": []interface{}{
  								map[string]interface{}{
  									"configMap": map[string]interface{}{
+ 										"defaultMode": int64(420),
  										"name":        string("collect-profiles-config"),
  									},
  									"name": string("config-volume"),
  								},
  								map[string]interface{}{
  									"name": string("secret-volume"),
  									"secret": map[string]interface{}{
+ 										"defaultMode": int64(420),
  										"secretName":  string("pprof-cert"),
  									},
  								},
  							},
  						},
  					},
  				},
  			},
  			"schedule":                   string("*/15 * * * *"),
+ 			"successfulJobsHistoryLimit": int64(3),
+ 			"suspend":                    bool(false),
  		},
  		"status": map[string]interface{}{"lastScheduleTime": string("2022-01-10T06:30:00Z"), "lastSuccessfulTime": string("2022-01-10T06:30:11Z")},
  	},
  }
I0110 06:32:44.499764       1 sync_worker.go:771] Done syncing for cronjob "openshift-operator-lifecycle-manager/collect-profiles" (574 of 765)
I0110 06:32:44.499814       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (575 of 765)

Extract the manifest:

# cat 0000_50_olm_07-collect-profiles.cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: collect-profiles
          priorityClassName: openshift-user-critical
          containers:
            - name: collect-profiles
              image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
              imagePullPolicy: IfNotPresent
              command:
                - bin/collect-profiles
              args:
                - -n
                - openshift-operator-lifecycle-manager
                - --config-mount-path
                - /etc/config
                - --cert-mount-path
                - /var/run/secrets/serving-cert
                - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
                - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
              volumeMounts:
                - mountPath: /etc/config
                  name: config-volume
                - mountPath: /var/run/secrets/serving-cert
                  name: secret-volume
              resources:
                requests:
                  cpu: 10m
                  memory: 80Mi
          volumes:
            - name: config-volume
              configMap:
                name: collect-profiles-config
            - name: secret-volume
              secret:
                secretName: pprof-cert
          restartPolicy: Never

Looking at the in-cluster object:

# oc get cronjob.batch/collect-profiles -oyaml -n openshift-operator-lifecycle-manager
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  creationTimestamp: "2022-01-10T04:35:19Z"
  generation: 1
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 334d6c04-126d-4271-96ec-d303e93b7d1c
  resourceVersion: "450801"
  uid: d0b92cd3-3213-466c-921c-d4c4c77f7a6b
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - -n
            - openshift-operator-lifecycle-manager
            - --config-mount-path
            - /etc/config
            - --cert-mount-path
            - /var/run/secrets/serving-cert
            - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
            - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
            command:
            - bin/collect-profiles
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
            imagePullPolicy: IfNotPresent
            name: collect-profiles
            resources:
              requests:
                cpu: 10m
                memory: 80Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /etc/config
              name: config-volume
            - mountPath: /var/run/secrets/serving-cert
              name: secret-volume
          dnsPolicy: ClusterFirst
          priorityClassName: openshift-user-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: collect-profiles
          serviceAccountName: collect-profiles
          terminationGracePeriodSeconds: 30
          volumes:
          - configMap:
              defaultMode: 420
              name: collect-profiles-config
            name: config-volume
          - name: secret-volume
            secret:
              defaultMode: 420
              secretName: pprof-cert
  schedule: '*/15 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: "2022-01-11T03:00:00Z"
  lastSuccessfulTime: "2022-01-11T03:00:07Z"
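
Every field marked with "+" in the diff above is absent from the manifest but present in the in-cluster object. The sketch below shows why a plain equality check reports a difference in that situation, while a comparison restricted to the fields the manifest actually sets does not. It is an illustration only, not the CVO's comparison logic, and the function name is hypothetical.

```python
def manifest_satisfied(manifest, live):
    """Return True when every field set in the manifest has the same value in live.

    Extra fields that exist only in the live object are ignored.
    """
    if isinstance(manifest, dict):
        return isinstance(live, dict) and all(
            key in live and manifest_satisfied(value, live[key])
            for key, value in manifest.items()
        )
    if isinstance(manifest, list):
        return (
            isinstance(live, list)
            and len(manifest) == len(live)
            and all(manifest_satisfied(m, l) for m, l in zip(manifest, live))
        )
    return manifest == live


# Trimmed-down versions of the two objects shown above.
manifest_spec = {"schedule": "*/15 * * * *"}
live_spec = {
    "schedule": "*/15 * * * *",
    "concurrencyPolicy": "Allow",
    "failedJobsHistoryLimit": 1,
    "successfulJobsHistoryLimit": 3,
    "suspend": False,
}

print(manifest_spec == live_spec)                    # plain equality sees a diff
print(manifest_satisfied(manifest_spec, live_spec))  # manifest fields all match
```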

Version-Release number of the following components:
4.10.0-0.nightly-2022-01-09-195852

How reproducible:
1/1

Steps to Reproduce:
1. Install a 4.10 cluster
2. Grep 'Updating .*due to diff' in the CVO log to check for hot-looping
3.

Actual results:
CVO hotloops on CronJob openshift-operator-lifecycle-manager/collect-profiles

Expected results:
CVO should not hotloop on it in a freshly installed cluster

Additional info:
attachment 1850058 CVO log file

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-metal-ipi-sdn-virtualmedia

Reproduced locally, the failure is:

level=error msg=Attempted to gather debug logs after installation failure: must provide bootstrap host address                                                                               
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected                
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected                
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected                                   
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected                                   
level=error msg=Cluster operator network Degraded is True with ApplyOperatorConfig: Error while updating operator configuration: could not apply (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: failed to apply / update (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: Patch "https://api-int.ostest.test.metalkube.org:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-config-managed/rolebindings/openshift-network-public-role-binding?fieldManager=cluster-network-operator%2Foperconfig&force=true": dial tcp 192.168.111.5:6443: connect: connection refused

I haven't gone back to pin down all affected versions, but I wouldn't be surprised if we've had this exposure for a while. On a 4.12.0-ec.2 cluster, we have:

cluster:usage:resources:sum{resource="podnetworkconnectivitychecks.controlplane.operator.openshift.io"}

currently clocking in around 67983. I've gathered a dump with:

$ oc --as system:admin -n openshift-network-diagnostics get podnetworkconnectivitychecks.controlplane.operator.openshift.io | gzip >checks.gz

And many, many of these reference nodes which no longer exist (the cluster is aggressively autoscaled, with nodes coming and going all the time). We should fix garbage collection on this resource, to avoid consuming excessive amounts of memory in the Kube API server and etcd as they attempt to list the large resource set.
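
The garbage-collection rule being asked for can be sketched as a simple filter: keep only the checks whose node still exists. The `Check` type and its `source_node` field are hypothetical stand-ins for however the real resource records its node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    source_node: str  # hypothetical field naming the node the check refers to


def stale_checks(checks, existing_nodes):
    """Return the checks that reference a node no longer in the cluster."""
    nodes = set(existing_nodes)
    return [check for check in checks if check.source_node not in nodes]


checks = [
    Check("check-a", "node-1"),
    Check("check-b", "node-2"),  # node-2 was scaled away
]
for check in stale_checks(checks, existing_nodes=["node-1"]):
    print("would delete", check.name)
```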

Description of problem:

Machine config pool selection fails when a single node has both the master role and a custom role. The controller logs the error, but the node is not marked as degraded, so the end user is unaware of the error. No config can be applied to the node.

Version-Release number of selected component (if applicable):

4.12, 4.11.z

Steps to Reproduce:

1. setup SNO cluster
2. create custom mcp
3. add custom mcp label on the node
4. check mcc pod log to see the error message about pool selection 
5. create mc to apply config

Actual results:

The node state is reported as good, but the single node cannot be assigned to any MCP.

Expected results:

The node should be marked as degraded with an error message.

Additional info:

 

Description of problem:

Azure MAG install fails with the Terraform error ‘Error ensuring Resource Providers are registered’

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-27-172239 

How reproducible:

Always

Steps to Reproduce:

1. Create MAG Azure cluster with IPI 

Actual results:

The installer fails during ‘Creating infrastructure resources…’

In terraform.log: 
2023-07-29T11:33:02.938Z [ERROR] provider.terraform-provider-azurerm: Response contains error diagnostic: @module=sdk.proto tf_proto_version=5.3 tf_provider_addr=provider tf_req_id=45c10824-360b-b211-1ba1-9c3a722014af @caller=/go/src/github.com/openshift/installer/terraform/providers/azurerm/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:55 diagnostic_detail= diagnostic_severity=ERROR diagnostic_summary="Error ensuring Resource Providers are registered. Terraform automatically attempts to register the Resource Providers it supports to
ensure it's able to provision resources. If you don't have permission to register Resource Providers you may wish to use the
"skip_provider_registration" flag in the Provider block to disable this functionality. Please note that if you opt out of Resource Provider Registration and Terraform tries
to provision a resource from a Resource Provider which is unregistered, then the errors
may appear misleading - for example: > API version 2019-XX-XX was not found for Microsoft.Foo Could indicate either that the Resource Provider "Microsoft.Foo" requires registration,
but this could also indicate that this Azure Region doesn't support this API version. More information on the "skip_provider_registration" flag can be found here:
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs#skip_provider_registration Original Error: determining which Required Resource Providers require registration: the required Resource Provider "Microsoft.CustomProviders" wasn't returned from the Azure API" tf_rpc=Configure timestamp=2023-07-29T11:33:02.937Z
2023-07-29T11:33:02.938Z [ERROR] vertex "provider[\"openshift/local/azurerm\"]" error: Error ensuring Resource Providers are registered. Terraform automatically attempts to register the Resource Providers it supports to
ensure it's able to provision resources. If you don't have permission to register Resource Providers you may wish to use the
"skip_provider_registration" flag in the Provider block to disable this functionality. Please note that if you opt out of Resource Provider Registration and Terraform tries
to provision a resource from a Resource Provider which is unregistered, then the errors
may appear misleading - for example: > API version 2019-XX-XX was not found for Microsoft.Foo Could indicate either that the Resource Provider "Microsoft.Foo" requires registration,
but this could also indicate that this Azure Region doesn't support this API version. More information on the "skip_provider_registration" flag can be found here:
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs#skip_provider_registration Original Error: determining which Required Resource Providers require registration: the required Resource Provider "Microsoft.CustomProviders" wasn't returned from the Azure API

Expected results:

The installation should succeed.

Additional info:

The issue is suspected to be caused by https://github.com/openshift/installer/pull/7205/; an IPI install on Azure MAG with 4.14.0-0.nightly-2023-07-27-051258 works.

In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
Metal³ is planning to allow these paths in the `name` hint (see OCPBUGS-13080), and assisted's implementation of root device hints (which is used in ZTP and the agent-based installer) should be changed to match.

Description of problem:

console-operator may panic when IncludeNamesFilter receives an object from a shared informer event of type cache.DeletedFinalStateUnknown.

Example job with panic: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1687876857824808960

Specific log that shows the full stack trace: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1687876857824808960/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods/openshift-console-operator_console-operator-748d7c6cdd-vwxmx_console-operator.log

Version-Release number of selected component (if applicable):

 

How reproducible:

Sporadically

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of the problem:
Assisted installer namespace `assisted-installer` is not compliant with the `ocp4-cis-configure-network-policies-namespaces` Compliance Operator scan.

How reproducible:
Every time
 
Steps to reproduce:

1. Install a cluster with Assisted Installer
2. Confirm the `assisted-installer` Namespace is present and not removed
3. Install the Red Hat Compliance Operator
4. Run a compliance scan using the `ocp4-cis` profile

Actual results:
Cluster fails the scan with the following warning:
```
Ensure that application Namespaces have Network Policies defined high
fail
```

Expected results:
Cluster does not fail the scan

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/28

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Cluster Monitoring Operator (CMO) lacks golangci-lint checking and has several linter violations. The ones we are specifically interested in are the staticcheck violations, as they are tied to deprecated libraries in Go.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Links in both markdown documents in console-dynamic-plugin-sdk/docs are not working.
Check https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Clicking a link in any markdown doc does not take the user to the appropriate section.

Expected results:

Clicking a link in any markdown doc should take the user to the appropriate section.

Additional info:

 

Description of problem:

We have observed a situation where:
- A workload mounting multiple EBS volumes gets stuck in a Terminating state when it finishes.
- The node that the workload ran on eventually gets stuck draining, because it gets stuck on unmounting one of the volumes from that workload, despite no containers from the workload now running on the node.

What we observe via the node logs is that the volume seems to unmount successfully. Then it attempts to unmount a second time, unsuccessfully. This unmount attempt then repeats and holds up the node.

Specific examples from the node's logs to illustrate this will be included in a private comment. 

Version-Release number of selected component (if applicable):

4.11.5

How reproducible:

Has occurred on four separate nodes on one specific cluster, but the mechanism to reproduce it is not known.

Steps to Reproduce:

1.
2.
3.

Actual results:

A volume gets stuck unmounting, holding up removal of the node and completed deletion of the pod.

Expected results:

The volume should not get stuck unmounting.

Additional info:

 

Description of problem

CI is flaky because the TestAWSELBConnectionIdleTimeout test fails. Example failures:

Version-Release number of selected component (if applicable)

I have seen these failures in 4.14 and 4.13 CI jobs.

How reproducible

Presently, search.ci reports the following stats for the past 14 days:

Found in 1.24% of runs (3.52% of failures) across 404 total runs and 34 jobs (35.15% failed)

This includes two jobs:

  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 40 runs, 63% failed, 16% of failures match = 10% impact
  • pull-ci-openshift-cluster-ingress-operator-release-4.13-e2e-aws-operator (all) - 10 runs, 70% failed, 14% of failures match = 10% impact

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestAWSELBConnectionIdleTimeout&maxAge=336h&context=1&type=all&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.

Actual results

The test fails because it times out waiting for DNS to resolve:

=== RUN   TestAll/parallel/TestAWSELBConnectionIdleTimeout
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    (the line above repeats 50 more times)
    operator_test.go:2656: failed to observe expected condition: timed out waiting for the condition
    panic.go:522: deleted ingresscontroller test-idle-timeout

The above output comes from build-log.txt from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/917/pull-ci-openshift-cluster-ingress-operator-release-4.13-e2e-aws-operator/1658840125502656512.
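
The repeated "no such host" lines followed by "timed out waiting for the condition" are what a poll-until-deadline loop prints when its condition never becomes true. A minimal Python sketch of that pattern follows. It is not the operator's test code (which is written in Go); `poll_until` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    timeout: float,
    interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() every `interval` seconds until it succeeds or `timeout` elapses.

    Returns True as soon as check() returns True, otherwise False once the
    deadline has passed. check() is always called at least once.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In a real test, `check` would wrap the DNS lookup of the route host name and log each failure, which would produce one log line per attempt.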

Expected results

CI passes, or it fails on a different test.

Description of problem:

The 'hostedcluster.spec.configuration.ingress.loadBalancer.platform.aws.type' field is ignored.

Version-Release number of selected component (if applicable):

 

How reproducible:

Whenever the field is set to 'NLB'

Steps to Reproduce:

1. Set the field to 'NLB'

Actual results:

A Classic Load Balancer is created

Expected results:

A Network Load Balancer should be created
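
For illustration, the expected behavior amounts to reading the nested field and honoring its value. The Python sketch below assumes the HostedCluster is available as a plain dictionary; `requested_lb_type` is a hypothetical helper, and the "Classic" fallback is an assumption based on the actual result reported above, not on the HyperShift source.

```python
from typing import Any, Mapping

# Field path taken from the problem description above.
FIELD_PATH = (
    "spec", "configuration", "ingress",
    "loadBalancer", "platform", "aws", "type",
)


def requested_lb_type(hosted_cluster: Mapping[str, Any], default: str = "Classic") -> str:
    """Return the load balancer type requested in the HostedCluster.

    Falls back to `default` (an assumption) when any part of the path
    is missing or the value is not a non-empty string.
    """
    node: Any = hosted_cluster
    for key in FIELD_PATH:
        if not isinstance(node, Mapping) or key not in node:
            return default
        node = node[key]
    return node if isinstance(node, str) and node else default
```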

Additional info:

 

Since the change made in https://github.com/openshift/assisted-test-infra/pull/1989, deploying the assisted installer services with "make run" or "make deploy_assisted_service" deploys only a single image, the default one (e.g. OPENSHIFT_VERSION=4.13).

 

 

Description of problem:

The EgressIP was not migrated to the correct worker after the machine it was assigned to was deleted in a GCP XPN cluster.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-29-235439

How reproducible:

Always

Steps to Reproduce:

1. Set up GCP XPN cluster.
2. Scale two new worker nodes
% oc scale --replicas=2 machineset huirwang-0331a-m4mws-worker-c -n openshift-machine-api        
machineset.machine.openshift.io/huirwang-0331a-m4mws-worker-c scaled

3. Wait for the two new worker nodes to become ready.
 % oc get machineset -n openshift-machine-api
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
huirwang-0331a-m4mws-worker-a   1         1         1       1           86m
huirwang-0331a-m4mws-worker-b   1         1         1       1           86m
huirwang-0331a-m4mws-worker-c   2         2         2       2           86m
huirwang-0331a-m4mws-worker-f   0         0                             86m
% oc get nodes
NAME                                                          STATUS   ROLES                  AGE     VERSION
huirwang-0331a-m4mws-master-0.c.openshift-qe.internal         Ready    control-plane,master   82m     v1.26.2+dc93b13
huirwang-0331a-m4mws-master-1.c.openshift-qe.internal         Ready    control-plane,master   82m     v1.26.2+dc93b13
huirwang-0331a-m4mws-master-2.c.openshift-qe.internal         Ready    control-plane,master   82m     v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-a-hfqsn.c.openshift-qe.internal   Ready    worker                 71m     v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-b-vbqf2.c.openshift-qe.internal   Ready    worker                 71m     v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal   Ready    worker                 8m22s   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal   Ready    worker                 8m22s   v1.26.2+dc93b13
4. Label one new worker node as an egress node
 % oc label node huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal k8s.ovn.org/egress-assignable="" 
node/huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal labeled

5. Create the EgressIP object
oc get egressIP
NAME         EGRESSIPS     ASSIGNED NODE                                                 ASSIGNED EGRESSIPS
egressip-1   10.0.32.100   huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal   10.0.32.100
6. Label the second new worker node as an egress node
% oc label node huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal k8s.ovn.org/egress-assignable="" 
node/huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal labeled
7. Delete the assigned egress node
% oc delete machines.machine.openshift.io huirwang-0331a-m4mws-worker-c-rhbkr  -n openshift-machine-api
machine.machine.openshift.io "huirwang-0331a-m4mws-worker-c-rhbkr" deleted
 % oc get nodes
NAME                                                          STATUS   ROLES                  AGE   VERSION
huirwang-0331a-m4mws-master-0.c.openshift-qe.internal         Ready    control-plane,master   87m   v1.26.2+dc93b13
huirwang-0331a-m4mws-master-1.c.openshift-qe.internal         Ready    control-plane,master   86m   v1.26.2+dc93b13
huirwang-0331a-m4mws-master-2.c.openshift-qe.internal         Ready    control-plane,master   87m   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-a-hfqsn.c.openshift-qe.internal   Ready    worker                 76m   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-b-vbqf2.c.openshift-qe.internal   Ready    worker                 76m   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal   Ready    worker                 13m   v1.26.2+dc93b13
W0331 02:48:34.917391       1 egressip_healthcheck.go:162] Could not connect to huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal (10.129.4.2:9107): context deadline exceeded
W0331 02:48:34.917417       1 default_network_controller.go:903] Node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal is not ready, deleting it from egress assignment
I0331 02:48:34.917590       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:Logical_Switch_Port Row:map[options:{GoMap:map[router-port:rtoe-GR_huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {6efd3c58-9458-44a2-a43b-e70e669efa72}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
E0331 02:48:34.920766       1 egressip.go:993] Allocator error: EgressIP: egressip-1 assigned to node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal which is not reachable, will attempt rebalancing
E0331 02:48:34.920789       1 egressip.go:997] Allocator error: EgressIP: egressip-1 assigned to node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal which is not ready, will attempt rebalancing
I0331 02:48:34.920808       1 egressip.go:1212] Deleting pod egress IP status: {huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal 10.0.32.100} for EgressIP: egressip-1

Actual results:

The EgressIP was not migrated to the correct worker; it is still shown as assigned to the deleted node:
 oc get egressIP      
NAME         EGRESSIPS     ASSIGNED NODE                                                 ASSIGNED EGRESSIPS
egressip-1   10.0.32.100   huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal   10.0.32.100

Expected results:

The EgressIP should be migrated from the deleted node to the correct worker.
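
The expected rebalancing can be sketched as follows. This is a hedged Python illustration of the idea only, not the OVN-Kubernetes allocator (which is written in Go); `reassign_egress_ips` and the least-loaded tie-breaking rule are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Optional, Set


def reassign_egress_ips(
    assignments: Dict[str, str], ready_assignable: Set[str]
) -> Dict[str, Optional[str]]:
    """Move egress IPs off nodes that are gone or not ready.

    `assignments` maps egress IP -> node name. `ready_assignable` is the set
    of nodes that are Ready and labeled egress-assignable. An IP that cannot
    be placed anywhere is mapped to None.
    """
    # Keep every assignment whose node is still usable.
    result: Dict[str, Optional[str]] = {
        ip: node for ip, node in assignments.items() if node in ready_assignable
    }
    for ip in assignments:
        if ip in result:
            continue
        if not ready_assignable:
            result[ip] = None
            continue
        # Assumed policy: pick the node hosting the fewest IPs, ties by name.
        load = Counter(result.values())
        result[ip] = min(sorted(ready_assignable), key=lambda n: load[n])
    return result
```

With the data from this report, `egressip-1` (10.0.32.100) would move from the deleted `worker-c-rhbkr` node to `worker-c-wnm4r`, the only remaining egress-assignable node.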

Additional info:


Description of problem:

In order to test proxy installations, the CI base image for OpenShift on OpenStack needs netcat.

Description of problem:

Installation fails when featureSet: LatencySensitive or featureSet: CustomNoUpgrade is set.
With featureSet: CustomNoUpgrade set in install-config, creating the cluster produces the following errors:
[core@bootstrap ~]$ journalctl -b -f -u release-image.service -u bootkube.service
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]:         github.com/spf13/cobra@v1.6.0/command.go:968
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: k8s.io/component-base/cli.run(0xc00025c300)
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]:         k8s.io/component-base@v0.26.1/cli/run.go:146 +0x317
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: k8s.io/component-base/cli.Run(0x2ce59e8?)
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]:         k8s.io/component-base@v0.26.1/cli/run.go:46 +0x1d
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: main.main()
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]:         github.com/openshift/cluster-kube-controller-manager-operator/cmd/cluster-kube-controller-manager-operator/main.go:24 +0x2c
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Consumed 1.935s CPU time.
Apr 26 07:02:54 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 343.
Apr 26 07:02:54 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Apr 26 07:02:54 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Consumed 1.935s CPU time.
Apr 26 07:02:54 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: Started Bootstrap a Kubernetes cluster.
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670489]: Rendering Kubernetes Controller Manager core manifests...
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: panic: interface conversion: interface {} is nil, not []interface {}
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: goroutine 1 [running]:
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/operator/targetconfigcontroller.GetKubeControllerManagerArgs(0xc000746100?)
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]:         github.com/openshift/cluster-kube-controller-manager-operator/pkg/operator/targetconfigcontroller/targetconfigcontroller.go:696 +0x379
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render.(*renderOpts).Run(0xc0008d22c0)
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]:         github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render/render.go:269 +0x85c
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render.NewRenderCommand.func1.1(0x0?)
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]:         github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render/render.go:48 +0x32
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render.NewRenderCommand.func1(0xc000bee600?, {0x285dffa?, 0x8?, 0x8?})
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]:         github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render/render.go:58 +0xc8
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/spf13/cobra.(*Command).execute(0xc000bee600, {0xc00071cb00, 0x8, 0x8})
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]:         github.com/spf13/cobra@v1.6.0/command.go:920 +0x847
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/spf13/cobra.(*Command).ExecuteC(0xc000bee000)
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]:         github.com/spf13/cobra@v1.6.0/command.go:1040 +0x3bd
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/spf13/cobra.(*Command).Execute(...)


With featureSet: LatencySensitive set in install-config, creating the cluster produces the following errors:
[core@bootstrap ~]$ journalctl -b -f -u release-image.service -u bootkube.service
Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : the server could not find the requested resource
Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: Failed to create "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n : the server could not find the requested resource
Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: [#1105] failed to create some manifests:
Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : the server could not find the requested resource
Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: Failed to create "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n : the server could not find the requested resource

Version-Release number of selected component (if applicable):

OCP version: 4.13.0-0.nightly-2023-04-21-084440

How reproducible:

always

Steps to Reproduce:

1. Create install-config.yaml like below (LatencySensitive):
  apiVersion: v1
  controlPlane:
    architecture: amd64
    hyperthreading: Enabled
    name: master
    replicas: 3
  compute:
  - architecture: amd64
    hyperthreading: Enabled
    name: worker
    replicas: 2
  metadata:
    name: wwei-426h
  platform:
    none: {}
  pullSecret: xxxxx
  featureSet: LatencySensitive
  networking:
    clusterNetwork:
    - cidr: xxxxx
      hostPrefix: 23
    serviceNetwork:
    - xxxxx
    networkType: OpenShiftSDN
  publish: External
  baseDomain: xxxxxx
  sshKey: xxxxxxx

2. Then continue to install the cluster:
openshift-install create cluster --dir <install_folder> --log-level debug

3. Create install-config.yaml like below (CustomNoUpgrade):
  apiVersion: v1
  controlPlane:
    architecture: amd64
    hyperthreading: Enabled
    name: master
    replicas: 3
  compute:
  - architecture: amd64
    hyperthreading: Enabled
    name: worker
    replicas: 2
  metadata:
    name: wwei-426h
  platform:
    none: {}
  pullSecret: xxxxx
  featureSet: CustomNoUpgrade
  networking:
    clusterNetwork:
    - cidr: xxxxx
      hostPrefix: 23
    serviceNetwork:
    - xxxxx
    networkType: OpenShiftSDN
  publish: External
  baseDomain: xxxxxx
  sshKey: xxxxxxx

4. Then continue to install the cluster:
openshift-install create cluster --dir <install_folder> --log-level debug

Actual results:

Installation failed.

Expected results:

Installation succeeded.

Additional info:

The log bundle is available at: https://drive.google.com/drive/folders/1kg1EeYR6ApWXbeRZTiM4DV205nwMfSQv?usp=sharing

Description of the problem:

Some validations are only related to agents that are bound to clusters.  We had a case where an agent couldn't be bound due to failing validations, and the irrelevant validations added unnecessary noise.  I attached the relevant agent CR to the ticket.  You can see in the Conditions:

  - lastTransitionTime: "2023-01-26T21:00:29Z"
    message: 'The agent''s validations are failing: Validation pending - no cluster,Host
      couldn''t synchronize with any NTP server,Missing inventory, or missing cluster'
    reason: ValidationsFailing
    status: "False"
    type: Validated

The only relevant validation is that there is no NTP server.  "no cluster" and "Missing inventory, or missing cluster" are misleading.

How reproducible:

100%

Steps to reproduce:

1. Boot an unbound agent

2. Look at the CR

Actual results:

All validations are shown in the CR

Expected results:

Only relevant validations are shown in the CR
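
The expected filtering can be sketched as below. This is a hypothetical Python illustration, not the assisted-service implementation: the `Validation` type and the `requires_cluster` flag are invented for the example, and which validations count as cluster-only is an assumption drawn from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Validation:
    message: str
    requires_cluster: bool  # True if the check only makes sense for a bound agent
    passed: bool


def failing_messages(validations: List[Validation], bound_to_cluster: bool) -> List[str]:
    """Return messages of failing validations that are relevant to the agent.

    For an unbound agent, validations that require a cluster are skipped.
    """
    return [
        v.message
        for v in validations
        if not v.passed and (bound_to_cluster or not v.requires_cluster)
    ]
```

Applied to the three messages in the condition quoted earlier, an unbound agent would report only "Host couldn't synchronize with any NTP server".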

Description of problem:

Most recent nightly https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-04-18-152947 has a lot of OAuth test failures

Example runs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-ovn/1648348911074545664

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm/1648348885556400128

Error looks like:

fail [github.com/openshift/origin/test/extended/oauth/expiration.go:105]: Unexpected error:
    <*tls.CertificateVerificationError | 0xc0023b6330>: {
        UnverifiedCertificates: [
            {...


Looking at changes in the last day or so, nothing sticks out to me.

However, I believe ART bumped everything to be built with go1.20, and this error type is new in go1.20:

"For a handshake failure due to a certificate verification failure, the TLS client and server now return an error of the new type CertificateVerificationError, which includes the presented certificates." - https://go.dev/doc/go1.20

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-04-18-152947

How reproducible:

Looks repeatable

Steps to Reproduce:

1. Build oauth, origin, and related containers with go1.20 (not clear which is causing the test failure)
2.
3.

Actual results:

Tests fail

Expected results:

 

Additional info:

 

Description of problem:

https://github.com/openshift/hypershift/pull/2437 added the ability to override image registries with CR ImageDigestMirrorSet; however, ImageDigestMirrorSet is only valid for 4.13+.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Install the HyperShift Operator (HO) on a 4.12 management cluster

Steps to Reproduce:

1.
2.
3.

Actual results:

failed to populate image registry overrides: no matches for kind "ImageDigestMirrorSet" in version "config.openshift.io/v1"

Expected results:

No errors and HyperShift doesn't try to use ImageDigestMirrorSet prior to 4.13.
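
One way to get the expected behavior is to gate the ImageDigestMirrorSet lookup on the management cluster version. The Python sketch below only illustrates that check (HyperShift itself is written in Go); the function names are hypothetical, and the actual fix may instead probe whether the API kind exists.

```python
import re
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '4.12.20' or '4.14.0-rc.5'."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_image_digest_mirror_set(cluster_version: str) -> bool:
    """ImageDigestMirrorSet is only valid for 4.13+, per the description above."""
    return parse_major_minor(cluster_version) >= (4, 13)
```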

Additional info:

 

Description of problem:

ACLs are disabled for all newly created S3 buckets, which causes all OCP installs to fail because the bootstrap ignition cannot be uploaded:

level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg=	status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg=  with aws_s3_bucket_acl.ignition,
level=error msg=  on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg=  62: resource "aws_s3_bucket_acl" ignition {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg=	status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg=  with aws_s3_bucket_acl.ignition,
level=error msg=  on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg=  62: resource "aws_s3_bucket_acl" ignition {


Version-Release number of selected component (if applicable):

4.11+
 

How reproducible:

Always
 

Steps to Reproduce:

1. Create a cluster via IPI

Actual results:

The installation fails
 

Expected results:

The installation succeeds
 

Additional info:

Heads-Up: Amazon S3 Security Changes Are Coming in April of 2023 - https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-ownership-error-responses.html - After you apply the bucket owner enforced setting for Object Ownership, ACLs are disabled.

 

As IBM, I would like to replace the --use-oci-feature flag with --include-local-oci-catalogs.

--use-oci-feature suggests to users that the flag is about using the OCI format for images rather than docker-v2, which is hard to understand and likely to generate questions, bugs, and misunderstood requests.
For clarity, and before this feature goes GA, this flag will be replaced by --include-local-oci-catalogs in 4.14. --use-oci-feature will be marked deprecated in 4.13 and completely removed in 4.14.
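
A deprecation period like the one described can be sketched as a parser that still accepts the old flag, warns, and maps it onto the new one. The Python example below is a generic illustration and not oc-mirror code (oc-mirror is written in Go); the program name and helper functions are hypothetical, and the new flag is spelled `--include-local-oci-catalogs` here as an assumption, since the text above uses more than one spelling.

```python
import argparse
import warnings
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mirror-sketch")
    parser.add_argument(
        "--include-local-oci-catalogs",
        dest="include_local_oci_catalogs",
        action="store_true",
        help="include OCI catalogs stored on local disk",
    )
    # Deprecated spelling: still accepted, hidden from --help.
    parser.add_argument(
        "--use-oci-feature",
        dest="use_oci_feature",
        action="store_true",
        help=argparse.SUPPRESS,
    )
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.use_oci_feature:
        warnings.warn(
            "--use-oci-feature is deprecated; use --include-local-oci-catalogs",
            DeprecationWarning,
            stacklevel=2,
        )
        args.include_local_oci_catalogs = True
    return args
```

Removing the old flag later then only requires deleting the hidden argument and the mapping branch.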

 

As an oc-mirror user, I want a well-documented and intuitive process
so that I can effectively and efficiently deliver image artifacts in both connected and disconnected installs with no impact on my current workflow.

Glossary:

  • OCI-FBC operator catalog: catalog image in oci format saved to disk, referenced with oci://path-to-image
  • registry based operator catalog: catalog image hosted on a container registry.

References:

 

Acceptance criteria:

  • No regression on oc-mirror use cases that are not using OCI-FBC feature
  • mirrorToMirror use case with oci feature flag should be successful when all operator catalogs in ImageSetConfig are OCI-FBC:
    • oc-mirror -c config.yaml docker://remote-registry --use-oci-feature succeeds
    • All release images, helm charts, additional images are mirrored to the remote-registry in an incremental manner (only new images are mirrored based on contents of the storageConfig)
    • All OCI-FBC catalogs, selected bundles and their related images are mirrored to the remote-registry and corresponding catalogSource and ImageSourceContentPolicy generated
    • All registry based catalogs, selected bundles and their related images are mirrored to the remote-registry and corresponding catalogSource and ImageSourceContentPolicy generated
  • mirrorToDisk use case with the oci feature flag is forbidden. The following command should fail:
    • oc-mirror --from=seq_xx_tar docker://remote-registry --use-oci-feature
  • diskToMirror use case with oci feature flag is forbidden. The following command should fail:

Description of problem:

When using the agent-based installer to provision OCP on bare metal, some of the machines fail to use the static nmconnection files and get their IP addresses via DHCP instead.
This may cause network validation to fail.

Version-Release number of selected component (if applicable):

4.13.3

How reproducible:

100%

Steps to Reproduce:

1. Generate the agent ISO
2. Mount it via the BMC and boot from the live CD
3. Use openshift-install agent wait-for to monitor the progress

Actual results:

Network validation fails due to overlay IP address

Expected results:

Validation succeeds

Additional info:

 

Description of problem:
The dev console shows a list of samples. The user can create a sample based on a git repository. But some of these samples don't include a git repository reference and cannot be created.

Version-Release number of selected component (if applicable):
Tested different frontend versions against a 4.11 cluster and all (oldest tested frontend was 4.8) show the sample without git repository.

But the result also depends on the installed samples operator and installed ImageStreams.

How reproducible:
Always

Steps to Reproduce:

  1. Switch to the Developer perspective
  2. Navigate to Add > All Samples
  3. Search for JBoss
  4. Click on "JBoss EAP XP 4.0 with OpenJDK 11" (for example)

Actual results:
The git repository is not filled and the create button is disabled.

Expected results:
Samples without git repositories should not be displayed in the list.

Additional info:
The Git repository is saved as "sampleRepo" in the ImageStream tag section.
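
The expected filtering could look like the sketch below. It is written in Python for illustration (the console frontend is not Python) and assumes the ImageStream is a plain dictionary whose tags carry `sampleRepo` under `spec.tags[].annotations`; that location and the helper name are assumptions based on the note above.

```python
from typing import Any, Dict, List


def tags_with_sample_repo(image_stream: Dict[str, Any]) -> List[str]:
    """Return the names of ImageStream tags that define a non-empty sampleRepo.

    Tags without the annotation are left out, so they would not be listed
    as creatable samples.
    """
    tags = (image_stream.get("spec") or {}).get("tags") or []
    return [
        tag["name"]
        for tag in tags
        if (tag.get("annotations") or {}).get("sampleRepo")
    ]
```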

Description of problem:

Arm HCPs are currently broken. The following error message was observed in the ignition-server pod:

{"level":"error","ts":"2023-06-29T13:38:19Z","msg":"Reconciler error","controller":"secret","controllerGroup":"","controllerKind":"Secret","secret":{"name":"token-brcox-hypershift-arm-us-east-1a-dbe0ce2a","namespace":"clusters-brcox-hypershift-arm"},"namespace":"clusters-brcox-hypershift-arm","name":"token-brcox-hypershift-arm-us-east-1a-dbe0ce2a","reconcileID":"ff813140-d10a-464e-a1b0-c05859b64ef9","error":"error getting ignition payload: failed to execute cluster-config-operator: cluster-config-operator process failed: /bin/bash: line 21: /payloads/get-payload1590526115/bin/cluster-config-operator: cannot execute binary file: Exec format error\n: exit status 126","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal...

Version-Release number of selected component (if applicable):

 

How reproducible:

Every time

Steps to Reproduce:

1. Create an Arm Mgmt Cluster
2. Create an Arm HCP

Actual results:

Error message in ignition-server pod and failure to generate appropriate payload.

Expected results:

ignition-server picks the appropriate arch based on the mgmt cluster.

Additional info:

 

Testgrid for single-node-workers-upgrade-conformance shows that tests are failing due to the 'KubeMemoryOvercommit' alert.

We should avoid failing on this alert in single-node environments, assuming it is acceptable to overcommit memory on single-node OpenShift clusters.

Ref: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1687375398906129
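
As a rough sketch of the proposal (Python, with hypothetical names; not the test suite's actual code), the alert check could tolerate this one alert only when the cluster reports a single-replica control plane. The topology string "SingleReplica" is assumed here as the marker for single-node clusters.

```python
# Assumption: only the alert named in the text above is tolerated.
TOLERATED_ON_SINGLE_NODE = frozenset({"KubeMemoryOvercommit"})


def alerts_that_fail_the_run(firing_alerts, control_plane_topology):
    """Return the firing alerts that should still fail the test run."""
    if control_plane_topology == "SingleReplica":
        tolerated = TOLERATED_ON_SINGLE_NODE
    else:
        tolerated = frozenset()
    return [name for name in firing_alerts if name not in tolerated]


# "SomeOtherAlert" is a placeholder name used only for illustration.
firing = ["KubeMemoryOvercommit", "SomeOtherAlert"]
print(alerts_that_fail_the_run(firing, "SingleReplica"))
print(alerts_that_fail_the_run(firing, "HighlyAvailable"))
```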

Description of problem:

Failed to collect the VM serial log with 'openshift-install gather bootstrap'.

Version-Release number of selected component (if applicable):

 4.13.0-0.nightly-2023-03-14-053612

How reproducible:

Always

Steps to Reproduce:

1. Install a private cluster with IPI. Once the bootstrap node has booted, and before it is terminated, continue with step 2.
2. SSH to the bastion, then try to get the bootstrap log:
$ openshift-install gather bootstrap --key openshift-qe.pem --bootstrap 10.0.0.5 --master 10.0.0.7 --loglevel debug

Actual results:

Failed to get the VM serial logs. In the output:
…
DEBUG Gather remote logs                           
DEBUG Collecting info from 10.0.0.6                
DEBUG scp: ./installer-masters-gather.sh: Permission denied 
DEBUG Warning: Permanently added '10.0.0.6' (ECDSA) to the list of known hosts.
…
DEBUG Waiting for logs ...                         
DEBUG Log bundle written to /var/home/core/log-bundle-20230317033401.tar.gz 
WARNING Unable to stat /var/home/core/serial-log-bundle-20230317033401.tar.gz, skipping 
INFO Bootstrap gather logs captured here "/var/home/core/log-bundle-20230317033401.tar.gz"

Expected results:

The VM serial log is collected, and the output does not contain the above “WARNING Unable to stat…” message.

Additional info:

An IPI install run locally has the same issue:
INFO Pulling VM console logs                     
DEBUG attemping to download                       
…                       
INFO Failed to gather VM console logs: unable to download file: /root/temp/4.13.0-0.nightly-2023-03-14-053612/ipi/serial-log-bundle-20230317042338

We've had several forum cases and bugs already where a restart of the CEO was fixing issues that could be resolved automatically by a liveness probe.

We previously traced it down to stuck/deadlocked controllers, missing timeouts in grpc calls and other issues we haven't been able to find yet. Since the list of failures that can happen is pretty large, we should add a liveness probe to the CEO that will periodically health check:

  • all controllers have run a sync at least once in the last 5 or 10 minutes
  • on failure, produce a goroutine dump to analyse what went wrong

This check should not indicate whether the etcd cluster itself is healthy, it's purely for the CEO itself.
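
The health check described above can be sketched as follows. This is a minimal illustration in Python rather than the operator's Go code; the class, its method names and the 600-second default are assumptions, and the goroutine dump on failure is not shown. Each controller records the time of its last sync, and the probe reports unhealthy when any controller has not synced within the allowed window.

```python
import time


class SyncTracker:
    """Tracks the last sync time of each controller (hypothetical helper)."""

    def __init__(self, controller_names, max_age_seconds=600.0, clock=time.monotonic):
        self._clock = clock
        self._max_age = max_age_seconds
        start = clock()
        # Treat start-up as the first "sync" so the probe does not fail immediately.
        self._last_sync = {name: start for name in controller_names}

    def record_sync(self, name):
        """Called by a controller at the end of every sync loop."""
        self._last_sync[name] = self._clock()

    def stale_controllers(self):
        """Names of controllers whose last sync is older than the allowed window."""
        now = self._clock()
        return sorted(n for n, t in self._last_sync.items() if now - t > self._max_age)

    def healthy(self):
        """What a liveness endpoint would report."""
        return not self.stale_controllers()
```

The check only looks at the operator's own controllers, which matches the note above that it should say nothing about the health of the etcd cluster itself.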

Description of problem:

If an image stream is selected while creating a deployment, then on the Edit Deployment page the Save button is not enabled until the image stream tag is changed.

Clicking the Reload button also enables the Save button.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Search for Deployment under Resources
2. Create a deployment with an image stream
3. Edit the deployment

Actual results:

On the Edit Deployment page, the Save button stays disabled when any value is changed.

Expected results:

On the Edit Deployment page, the Save button should be enabled when any value is changed.

Video Link - https://drive.google.com/file/d/1luqcjQS5Azc0XRjpMNfKKqbXYSc17Rxc/view?usp=share_link

Description of problem:

A cluster recently upgraded to OCP 4.12.19 is experiencing serious slowness on the Project > Project access page.
The loading time of that page grows significantly faster than the number of entries, and is very noticeable even at a relatively low number of entries.

Version-Release number of selected component (if applicable):

4.12.19

How reproducible:

Easily 

Steps to Reproduce:

1. Create a namespace, and add RoleBindings for multiple users, for instance with :
$ oc -n test-namespace create rolebinding test-load --clusterrole=view --user=user01 --user=user02 --user=...
2. In Developer view of that namespace, navigate to "Project"->"Project access". The page will take a long time to load compared to the time an "oc get rolebinding" would take.

Actual results:

0 RB => instantaneous loading
40 RB => about 10 seconds until page loaded
100 RB => one try took 50 seconds, another 110 seconds
200 RB => nothing for 8 minutes, after which my web browser (Firefox) proposed to stop the page since it slowed the browser down, and after 10 minutes I stopped the attempt without ever seeing the page load. 

Expected results:

Page should load almost instantly with only a few hundred role bindings

Run isVSphereDiskUUIDEnabled validation also on baremetal platform installation.

 

From the description of https://issues.redhat.com/browse/OCPBUGS-16955: 

Storage team has observed that if disk.EnableUUID flag is not enabled on vSphere VMs in any platform, including baremetal, then no symlinks are generated in /dev/disk/by-id for attached disks.

Installing ODF via LSO or something on such a platform results in somewhat fragile installation because disks themselves could be renamed on reboot and since no permanent ids exists for disks, the PVs could become invalid.

We should update baremetal installs - https://docs.openshift.com/container-platform/4.13/installing/installing_bare_metal/installing-bare-metal.html to always enable disk.EnableUUID in both IPI and UPI installs.

Description of problem:

After enabling realtime and high power consumption under workload hints in the performance profile, the test is failing because it cannot find the stalld PID:
msg: "failed to run command [pidof stalld]: output \"\"; error \"\"; command terminated with exit code 1",

Version-Release number of selected component (if applicable):

OpenShift 4.14, 4.13

How reproducible:

Often (Flaky test)

Description of problem:

The environment variable OPENSHIFT_IMG_OVERRIDES is not retaining the order of mirrors listed under a source compared to the original mirror/source listing in the ICSP/IDMSs.

Version-Release number of selected component (if applicable):

 

How reproducible:

Every time

Steps to Reproduce:

1. Set up a mgmt cluster with an ICSP (or IDMS) like:

  apiVersion: operator.openshift.io/v1alpha1
  kind: ImageContentSourcePolicy
  metadata:
    name: image-policy-39
  spec:
    repositoryDigestMirrors:
    - mirrors:
      - quay.io/openshift-release-dev/ocp-release
      - pull.q1w2.quay.rhcloud.com/openshift-release-dev/ocp-release
      source: quay.io/openshift-release-dev/ocp-release

2. Create a Hosted Cluster

Actual results:

Nodes cannot join the cluster because ignition cannot be generated

Expected results:

Nodes can join the cluster

Additional info:

Issue is most likely coming from here - https://github.com/openshift/hypershift/blob/dce6f51355317173be6bc525edfe059572c07690/support/util/util.go#L224
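
The ordering requirement can be illustrated with a short sketch. It is written in Python and is not the HyperShift code; the function names and the "source=mirror" comma-separated rendering are assumptions made for illustration. One plausible cause of lost ordering, not confirmed by this report, is collecting mirrors in an unordered map; the sketch instead keeps an ordered list of (source, mirror) pairs in exactly the order they appear in the policy.

```python
def ordered_overrides(icsp_spec):
    """Flatten an ICSP spec into (source, mirror) pairs, preserving listing order."""
    pairs = []
    for entry in icsp_spec.get("repositoryDigestMirrors", []):
        source = entry["source"]
        for mirror in entry.get("mirrors", []):
            pairs.append((source, mirror))
    return pairs


def render_overrides(pairs):
    """Render the pairs as one string (assumed format: source=mirror,source=mirror)."""
    return ",".join(f"{source}={mirror}" for source, mirror in pairs)


# The spec from the reproduction steps above.
spec = {
    "repositoryDigestMirrors": [
        {
            "mirrors": [
                "quay.io/openshift-release-dev/ocp-release",
                "pull.q1w2.quay.rhcloud.com/openshift-release-dev/ocp-release",
            ],
            "source": "quay.io/openshift-release-dev/ocp-release",
        }
    ]
}
pairs = ordered_overrides(spec)
print(render_overrides(pairs))
```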

Description of problem:

Tested on GCP. There are 4 failureDomains (a, b, c, f) in the CPMS. When a is removed, a new master is created in f. If a is then re-added to the CPMS, the instance is moved back from f to a.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

Before update cpms.
      failureDomains:
        gcp:
        - zone: us-central1-a
        - zone: us-central1-b
        - zone: us-central1-c
        - zone: us-central1-f
$ oc get machine                  
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp22-4glmq-master-2         Running   n2-standard-4   us-central1   us-central1-c   3h4m
zhsungcp22-4glmq-master-hzsf2-0   Running   n2-standard-4   us-central1   us-central1-b   90m
zhsungcp22-4glmq-master-plch8-1   Running   n2-standard-4   us-central1   us-central1-a   11m
zhsungcp22-4glmq-worker-a-cxf5w   Running   n2-standard-4   us-central1   us-central1-a   3h
zhsungcp22-4glmq-worker-b-d5vzm   Running   n2-standard-4   us-central1   us-central1-b   3h
zhsungcp22-4glmq-worker-c-4d897   Running   n2-standard-4   us-central1   us-central1-c   3h

1. Delete failureDomain "zone: us-central1-a" in cpms, new machine Running in zone f.
      failureDomains:
        gcp:
        - zone: us-central1-b
        - zone: us-central1-c
        - zone: us-central1-f 
$ oc get machine              
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp22-4glmq-master-2         Running   n2-standard-4   us-central1   us-central1-c   3h19m
zhsungcp22-4glmq-master-b7pdl-1   Running   n2-standard-4   us-central1   us-central1-f   13m
zhsungcp22-4glmq-master-hzsf2-0   Running   n2-standard-4   us-central1   us-central1-b   106m
zhsungcp22-4glmq-worker-a-cxf5w   Running   n2-standard-4   us-central1   us-central1-a   3h16m
zhsungcp22-4glmq-worker-b-d5vzm   Running   n2-standard-4   us-central1   us-central1-b   3h16m
zhsungcp22-4glmq-worker-c-4d897   Running   n2-standard-4   us-central1   us-central1-c   3h16m
2. Add failureDomain "zone: us-central1-a" again, new machine running in zone a, the machine in zone f will be deleted.
      failureDomains:
        gcp:
        - zone: us-central1-a
        - zone: us-central1-f
        - zone: us-central1-c
        - zone: us-central1-b
$ oc get machine                          
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp22-4glmq-master-2         Running   n2-standard-4   us-central1   us-central1-c   3h35m
zhsungcp22-4glmq-master-5kltp-1   Running   n2-standard-4   us-central1   us-central1-a   12m
zhsungcp22-4glmq-master-hzsf2-0   Running   n2-standard-4   us-central1   us-central1-b   121m
zhsungcp22-4glmq-worker-a-cxf5w   Running   n2-standard-4   us-central1   us-central1-a   3h32m
zhsungcp22-4glmq-worker-b-d5vzm   Running   n2-standard-4   us-central1   us-central1-b   3h32m
zhsungcp22-4glmq-worker-c-4d897   Running   n2-standard-4   us-central1   us-central1-c   3h32m  

Actual results:

Instance is moved back from f to a

Expected results:

Instance shouldn't be moved back from f to a

Additional info:

https://issues.redhat.com//browse/OCPBUGS-7366
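
The expected behaviour amounts to a "sticky" placement rule, sketched below in Python. This is not the CPMS algorithm; the function name and data shapes are hypothetical. A machine keeps its current zone as long as that zone is still in the desired list, and only machines whose zone was removed are moved, each to the least-used remaining zone.

```python
from collections import Counter


def assign_failure_domains(current, desired):
    """current: {machine index: zone}; desired: ordered list of allowed zones.

    Returns a new {machine index: zone} mapping that moves as few machines as possible.
    Assumes `desired` is non-empty.
    """
    result = {}
    usage = Counter()
    # First pass: machines already in an allowed zone stay where they are.
    for index, zone in sorted(current.items()):
        if zone in desired:
            result[index] = zone
            usage[zone] += 1
    # Second pass: displaced machines go to the least-used zone
    # (ties broken by the order of the desired list).
    for index in sorted(current):
        if index in result:
            continue
        zone = min(desired, key=lambda z: (usage[z], desired.index(z)))
        result[index] = zone
        usage[zone] += 1
    return result


before = {0: "us-central1-b", 1: "us-central1-a", 2: "us-central1-c"}
after_removal = assign_failure_domains(
    before, ["us-central1-b", "us-central1-c", "us-central1-f"])
after_readd = assign_failure_domains(
    after_removal, ["us-central1-a", "us-central1-f", "us-central1-c", "us-central1-b"])
print(after_removal)
print(after_readd)
```

With this rule, re-adding zone a in step 2 leaves the machine in zone f untouched, because f is still an allowed zone.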

Description of the problem:

In staging, UI 2.20.6, BE 2.20.1 - not able to set ODF on, getting "Failed to update the cluster", although according to the support-level api it should be supported

How reproducible:

100%

Steps to reproduce:

1. Create new OCP 4.13 and P/Z cpu_arc

2. try to enable ODF

3.

Actual results:

 

Expected results:

Description of problem:

API fields that are defaulted by a controller should document what their default is for each release version.
Currently the field documents that "if empty, subject to platform chosen default", but it does not state what that is.

To fix this, please add, after the platform chosen default prose:
// The current default is XYZ.

This will allow users to track the platform defaults over time from the API documentation.

I would like to see this fixed before 4.13 and 4.14 are released please, it should be pretty quick to fix if we understand what those defaults are.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When ODF StorageSystem CR gets created through the Wizard, the LocalVolumeDiscovery doesn't bring/show devices with mpath type

Version-Release number of selected component (if applicable):

OCP 4.11.31

How reproducible:

All the time

Steps to Reproduce:

1. Get OCP 4.11 running with the LSO and ODF operators
2. Configure and present mpath devices to nodes used for ODF
3. Use the ODF wizard to create a StorageSystem object
4. Inspect the LocalVolumeDiscovery results.

Actual results:

There are no devices of mpath type shown by the ODF wizard / LocalVolumeDiscovery CR

Expected results:

LocalVolumeDiscovery should discover mpath device type 

Additional info:

LocalVolumeSet already works with mpath if you manually define them in .spec or  LocalVolume pointing to mpath devicePaths
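
As an illustration of the expected result, the sketch below (Python, hypothetical names, not the Local Storage Operator's code) walks lsblk-style JSON output and accepts devices of type "mpath" alongside plain disks and partitions. The set of accepted types and the sample device names are assumptions. Because a multipath device appears as a child of every path that backs it, the sketch also de-duplicates by name.

```python
import json

# Assumption: the device types a discovery pass would accept.
DISCOVERABLE_TYPES = ("disk", "part", "mpath")


def discoverable_devices(lsblk_json):
    """Return unique device names whose type is discoverable, in first-seen order."""
    seen = []

    def walk(devices):
        for dev in devices:
            if dev.get("type") in DISCOVERABLE_TYPES and dev["name"] not in seen:
                seen.append(dev["name"])
            walk(dev.get("children") or [])

    walk(json.loads(lsblk_json).get("blockdevices") or [])
    return seen


# Hypothetical output shaped like `lsblk --json`: two paths to one multipath device.
sample = json.dumps({
    "blockdevices": [
        {"name": "sdb", "type": "disk", "children": [{"name": "mpatha", "type": "mpath"}]},
        {"name": "sdc", "type": "disk", "children": [{"name": "mpatha", "type": "mpath"}]},
    ]
})
found = discoverable_devices(sample)
print(found)
```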

Description of problem:

MCO depends on the image registry. If the image registry is not installed, installation fails because MCO becomes degraded.

Version-Release number of selected component (if applicable):

payload image built from https://github.com/openshift/installer/pull/7421

How reproducible:

always

Steps to Reproduce:

1. Set "baselineCapabilitySet: None" when installing a cluster, so that none of the optional operators are installed.
2.
3.

Actual results:

09-01 15:50:34.770  level=error msg=Cluster operator machine-config Degraded is True with RenderConfigFailed: Failed to resync 4.14.0-0.ci.test-2023-08-31-033001-ci-ln-7xhl7yt-latest because: clusteroperators.config.openshift.io "image-registry" not found
09-01 15:50:34.770  level=error msg=Cluster operator machine-config Available is False with RenderConfigFailed: Cluster not available for [{operator 4.14.0-0.ci.test-2023-08-31-033001-ci-ln-7xhl7yt-latest}]: clusteroperators.config.openshift.io "image-registry" not found
09-01 15:50:34.770  level=info msg=Cluster operator network ManagementStateDegraded is False with : 
09-01 15:50:34.770  level=error msg=Cluster initialization failed because one or more operators are not functioning properly.

Expected results:

MCO should not be degraded if image registry is not installed

Additional info:

must-gather log https://drive.google.com/file/d/1E3FbPcVwZxBi33tHq7pyaHc8EM3eiTUa/view?usp=drive_link 

Description of problem:

I am trying to build the operator image locally and fail because the registry `registry.ci.openshift.org/ocp/` requires authorization

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. git clone git@github.com:openshift/cluster-ingress-operator.git
2. export REPO=<path to a repository to upload the image>
3. run `make release-local`

Actual results:

[skip several lines]
Step 1/10 : FROM registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.19-openshift-4.12 AS builder                                                                                             
unauthorized: authentication required 

Expected results:

The image is pulled and the build succeeds.

Additional info:

There are two images that are not available:
- registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.19-openshift-4.12
- registry.ci.openshift.org/ocp/4.12:base

I was able to fix this by changing the images to
- registry.ci.openshift.org/openshift/release:golang-1.19
- registry.ci.openshift.org/origin/4.12:base

see https://github.com/dudinea/cluster-ingress-operator/tree/fix-build-images-not-public

I am not sure what I did is OK, but I suppose that this project, being part of OKD, should be easily buildable by the public, or at least the issue should be documented somewhere.

I wanted to post this to the OKD project, but I am unable to select it in Jira.

Description of problem:

The machine-config operator is not compliant with the CIS benchmark rule "Ensure Usage of Unique Service Accounts" [1], which is part of the "ocp4-cis" profile used in the compliance operator [2]. The machine-config operator was observed using the default service account, which is what applies when no other service account is specified. OpenShift core operators should be compliant with the CIS benchmark, i.e. the operators should run with their own service account rather than using the "default" one.


[1] https://static.open-scap.org/ssg-guides/ssg-ocp4-guide-cis.html#xccdf_org.ssgproject.content_group_accounts
[2] https://docs.openshift.com/container-platform/4.11/security/compliance_operator/compliance-operator-supported-profiles.html

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Core operators are using default service account

Expected results:

Core operators should run with their own service account 

Additional info:

 

Kubernetes 1.27 removes long deprecated --container-runtime flag, see https://github.com/kubernetes/kubernetes/pull/114017

To ensure the upgrade path from 4.13 to 4.14 isn't affected, we need to backport the changes to both 4.14 and 4.13.

Description of problem:

'Create' button on image pull secret creation form can not be re-enabled if it is disabled once

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-17-090603

How reproducible:

Always

Steps to Reproduce:

1. Log in to the console
2. Go to Secrets -> Create Image pull secret, and on that page set:
- Secret name: test-secret
- Authentication type: Upload configuration file. Upload a file with invalid JSON; the console shows the warning 'Configuration file should be in JSON format.' and the 'Create' button is disabled
3. Change Authentication type to 'Image registry credentials' and fill in every required field (Registry server address, Username and Password); the 'Create' button is still disabled

Actual results:

3. The 'Create' button is still disabled; the user has to cancel and fill in the form again

Expected results:

3. The 'Create' button should be re-enabled, since the form is now being filled in a different way with all required fields correctly configured
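
One way to read the expected behaviour is that the button state should be derived only from the fields of the currently selected authentication type. The sketch below is in Python with hypothetical field names and is not the console's code; a validation error left over from the upload mode has no effect once the user switches to credentials.

```python
REQUIRED_CREDENTIAL_FIELDS = ("address", "username", "password")


def can_create(form):
    """Decide whether 'Create' is enabled, looking only at the active auth type."""
    if not form.get("name"):
        return False
    auth_type = form.get("auth_type")
    if auth_type == "upload":
        return bool(form.get("upload_is_valid_json"))
    if auth_type == "credentials":
        creds = form.get("credentials") or {}
        return all(creds.get(field) for field in REQUIRED_CREDENTIAL_FIELDS)
    return False


# Step 2: an invalid JSON file was uploaded.
form = {"name": "test-secret", "auth_type": "upload", "upload_is_valid_json": False}
enabled_after_upload = can_create(form)

# Step 3: switch to credentials; the stale upload error is ignored.
form["auth_type"] = "credentials"
form["credentials"] = {"address": "registry.example.com", "username": "u", "password": "p"}
enabled_after_switch = can_create(form)
print(enabled_after_upload, enabled_after_switch)
```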

Additional info:

 

 

 

Description of problem:

Hide the Duplicate Pipelines Card in the DevConsole Add Page

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Visit +Add Page of Dev Perspective

Actual results:

Duplicate Entry

Expected results:

No duplicates

Additional info:

 

Description of problem:

The control-plane-operator pod gets stuck deleting an awsendpointservice if its hostedzone is already gone:

Logs:

{"level":"error","ts":"2023-07-13T03:06:58Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","aWSEndpointService":{"name":"private-router","namespace":"ocm-staging-24u87gg3qromrf8mg2r2531m41m0c1ji-diegohcp-west2"},"namespace":"ocm-staging-24u87gg3qromrf8mg2r2531m41m0c1ji-diegohcp-west2","name":"private-router","reconcileID":"59eea7b7-1649-4101-8686-78113f27567d","error":"failed to delete resource: NoSuchHostedZone: No hosted zone found with ID: Z05483711XJV23K8E97HK\n\tstatus code: 404, request id: f8686dd6-a906-4a5e-ba4a-3dd52ad50ec3","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"} 

Version-Release number of selected component (if applicable):

4.12.24

How reproducible:

Have not tried to reproduce yet, but should be fairly reproducible

Steps to Reproduce:

1. Install a PublicAndPrivate or Private HCP
2. Delete the Route53 Hosted Zone defined in its awsendpointservice's .status.dnsZoneID field
3. Start an uninstall
4. Observe the control-plane-operator looping on the above logs and the uninstall hanging

Actual results:

Uninstall hangs due to CPO being unable to delete the awsendpointservice

Expected results:

The awsendpointservice is cleaned up. If the hosted zone is already gone, the CPO shouldn't care that it can't list hosted zones.
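
A minimal sketch of this idempotent cleanup follows. It is in Python with hypothetical names and does not use the AWS SDK; the only detail taken from the log above is the "NoSuchHostedZone" error code. A missing hosted zone is treated as "nothing left to delete", while any other error is still raised so that the reconcile is retried.

```python
class CloudError(Exception):
    """Stand-in for a cloud API error that carries an error code (hypothetical)."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


def delete_dns_records(zone_id, delete_fn):
    """Delete records in a zone; return False if the zone was already gone."""
    try:
        delete_fn(zone_id)
    except CloudError as err:
        if err.code == "NoSuchHostedZone":
            # The zone no longer exists, so there is nothing left to clean up.
            return False
        raise
    return True


def zone_already_gone(zone_id):
    raise CloudError("NoSuchHostedZone", f"No hosted zone found with ID: {zone_id}")


def throttled(zone_id):
    # "Throttling" is a placeholder code for any other, retryable failure.
    raise CloudError("Throttling", "rate exceeded")


gone_result = delete_dns_records("ZEXAMPLE", zone_already_gone)
try:
    delete_dns_records("ZEXAMPLE", throttled)
    other_error = None
except CloudError as err:
    other_error = err.code
print(gone_result, other_error)
```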

Additional info:

 

Description of problem:

CredentialsRequest for Azure AD Workload Identity contains unnecessary network permissions.

- Microsoft.Network/applicationSecurityGroups/delete
- Microsoft.Network/applicationSecurityGroups/write
- Microsoft.Network/loadBalancers/delete
- Microsoft.Network/networkSecurityGroups/delete
- Microsoft.Network/routeTables/delete
- Microsoft.Network/routeTables/write
- Microsoft.Network/virtualNetworks/subnets/delete
- Microsoft.Network/virtualNetworks/subnets/write
- Microsoft.Network/virtualNetworks/write
- Microsoft.Resources/subscriptions/resourceGroups/delete
- Microsoft.Resources/subscriptions/resourceGroups/write

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

N/A

Steps to Reproduce:

1. Remove above permissions from the Azure Credentials request and validate that MAO continues to function in Azure AD Workload Identity cluster.

Actual results:

Unnecessary network write permissions enumerated in CredentialsRequest.

Expected results:

Only necessary permissions enumerated in CredentialsRequest.

Additional info:

Additional unnecessary permissions will be hard to pinpoint, but these specific permissions were questioned by MSFT and are likely only needed by the installer, according to the CORS-1870 investigation.

Description of problem:
The oc client has recently had functionality added to reference an icsp manifest with a variety of commands (using the --icsp flag).

The issue is that the registry/repo scope in an icsp required to trigger a mapping is different between ocp and oc. OCP icsp will match an image at the registry level, where the OC client requires exact registry + repo to match. This difference can cause major confusion (especially without adequate warning/error messages in the oc client).

Example Image to mirror: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1631b0f0bf9c6dc4f9519ceb06b6ec9277f53f4599853fcfad3b3a47d2afd404

In OCP registry.mirrorregistry.com:5000/openshift-release-dev/ will accurately mirror the image

But using OC with --icsp, quay.io/openshift-release-dev/ocp-v4.0-art-dev is required or the mirroring will not match.

Version-Release number of selected component (if applicable):
oc version
Client Version: 4.11.0-202212070335.p0.g1928ac4.assembly.stream-1928ac4
Kustomize Version: v4.5.4
Server Version: 4.12.0-rc.8
Kubernetes Version: v1.25.4+77bec7a



How reproducible:

100%

Steps to Reproduce:
1. Create an ICSP file with content similar to below (Replace with your mirror registry url)

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  creationTimestamp: null
  name: image-policy
spec:
  repositoryDigestMirrors:
  - mirrors:
    - registry.mirrorregistry.com:5005/openshift-release-dev
    source: quay.io/openshift-release-dev

2. Add the ICSP to a bm openshift cluster and wait for MCP to finish node restarts
3. SSH to a cluster node
4. Try to podman pull the following image with debug log level

podman pull --log-level=debug quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1631b0f0bf9c6dc4f9519ceb06b6ec9277f53f4599853fcfad3b3a47d2afd404

5. The log will show the mirror registry is attempted (Which is similar behavior to OCP)
6. Now try to extract the payload image from the release using the oc client and the --icsp flag (the ICSP file should be the same manifest used in step 1)

oc adm release extract --command=openshift-baremetal-install --to=/data/install-config-generate/installercache/registry.mirrorregistry.com:5005/openshift-release-dev/ocp-release:4.12.0-rc.8-x86_64 --insecure=false --icsp-file=/tmp/icsp-file1635083302 registry.mirrorregistry.com:5005/openshift-release-dev/ocp-release:4.12.0-rc.8-x86_64 --registry-config=/tmp/registry-config1265925963

Expected results:
openshift-baremetal-install is extracted to the proper directory using the mirrored payload image

Actual result:
oc client does not match the payload image because the icsp is not exact, so it immediately tries quay.io rather than the mirror registry

ited with non-zero exit code 1: \nwarning: --icsp-file only applies to images referenced by digest and will be ignored for tags\nerror: unable to read image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1631b0f0bf9c6dc4f9519ceb06b6ec9277f53f4599853fcfad3b3a47d2afd404: Get \"https://quay.io/v2/\": dial tcp 52.203.129.140:443: i/o timeout\n" func=github.com/openshift/assisted-service/internal/oc.execute file="/remote-source/assisted-service/app/internal/oc/release.go:404" go-id=26228 request_id=

Additional info:

I understand that oc-mirror or oc adm release mirror provides an icsp manifest to use, but as OCP itself allows for a wider scope for mapping, it can cause great confusion that oc icsp scope is not in parity. 

At the very least a warning/error message in the oc client when the icsp partially matches an image (but is not used) would be VERY useful. 
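
The difference in matching scope can be shown with a small sketch. It is in Python with hypothetical function names and is neither the oc client's nor the node's actual matching code. Exact matching requires the source to equal the full repository, while prefix matching accepts any repository underneath the source path.

```python
def matches_exact(source, image_repo):
    """Match only when the source is the full registry + repository."""
    return image_repo == source.rstrip("/")


def matches_prefix(source, image_repo):
    """Match when the source is the repository or a parent path of it."""
    source = source.rstrip("/")
    return image_repo == source or image_repo.startswith(source + "/")


def rewrite(image_repo, source, mirror, match):
    """Return the mirrored repository, or None when the source does not match."""
    if not match(source, image_repo):
        return None
    suffix = image_repo[len(source.rstrip("/")):]
    return mirror.rstrip("/") + suffix


image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
source = "quay.io/openshift-release-dev"
mirror = "registry.mirrorregistry.com:5005/openshift-release-dev"

print(rewrite(image, source, mirror, matches_prefix))
print(rewrite(image, source, mirror, matches_exact))
```

A warning of the kind requested above could be emitted whenever the prefix match succeeds but the exact match fails, since that is the case where a policy looks applicable but is silently ignored.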

For reasons I still struggle to understand, in trying to mitigate issues stemming from the PSA changes to k8s, we decided on a convoluted architecture where one reconciler by one team (cluster-policy-controller) ignores openshift-* namespaces unless they have a specific label and are not part of the payload, while a reconciler on our team labels non-payload openshift-* namespaces appropriately so that the first one will do its security magic and keep workloads stable during this transition. This cockamamie scheme led to a dependency between olm and cpc so that we can share the list of payload openshift-* namespaces. 

This also means that we need to update the dependency at each release to keep parity with the OCP version of the dependency and olm.

We need to update the cpc dependency as the pipeline is blocked until we do (to avoid shipping an old version of the dependency, perhaps with a different list of payload openshift-* namespaces, and breaking customer clusters or impacting their experience). 

Note: this is currently blocking ART compliance PRs. We need to get this in ASAP.

1. Proposed title of this feature request

Allow the Ingress log length to be modified when using a sidecar

2. What is the nature and description of the request?

In the past, RFE-1794 created an option to specify the length of the HAProxy log. However, this option is only available when redirecting the log to an external syslog. We need this option to be available when using a sidecar to collect the logs as well.

 

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 2
  logging:
    access:
      destination:
        type: Container
        container: {}

Unlike the Syslog type, the Container type does not have any sub-parameter, which makes it impossible to configure the log length.

As we can see in RFE-1794, the option to change the log length already exists in the haproxy configuration, but when using the sidecar, only the default value (1024) is used.

3. Why does the customer need this? (List the business requirements here)

The default log length of HAProxy is 1024. When clients communicate with the application using long URI arguments, the access log cannot capture the full request and its parameters. An option to set the length to 8192 or higher is required.

4. List any affected packages or components.

  • haproxy
  • ingress
  • ingress-operator

Description of problem:

Multus mac-vlan/ipvlan/vlan cni panics when master interface in container is missing

Version-Release number of selected component (if applicable):

metallb-operator.v4.13.0-202304190216   MetalLB Operator   4.13.0-202304190216 Succeeded

How reproducible:

Create pod with multiple vlan interfaces connected to missing master interface.

Steps to Reproduce:

1. Create pod with multiple vlan interfaces connected to missing master interface in container
2. Verify that the pod is stuck in the ContainerCreating state
3. Run oc describe pod PODNAME and read crash message:

 Normal   Scheduled               22s   default-scheduler  Successfully assigned cni-tests/pod-one to worker-0
  Normal   AddedInterface          21s   multus             Add eth0 [10.128.2.231/23] from ovn-kubernetes
  Normal   AddedInterface          21s   multus             Add ext0 [] from cni-tests/tap-one
  Normal   AddedInterface          21s   multus             Add ext0.1 [2001:100::1/64] from cni-tests/mac-vlan-one
  Warning  FailedCreatePodSandBox  18s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-one_cni-tests_2e831519-effc-4502-8ea7-749eda95bf1d_0(321d7181626b8bbfad062dd7c7cc2ef096f8547e93cb7481a18b7d3613eabffd): error adding pod cni-tests_pod-one to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [cni-tests/pod-one/2e831519-effc-4502-8ea7-749eda95bf1d:mac-vlan]: error adding container to network "mac-vlan": plugin type="macvlan" failed (add): netplugin failed: "panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x54281a]\n\ngoroutine 1 [running, locked to thread]:\npanic({0x560b00, 0x6979d0})\n\t/usr/lib/golang/src/runtime/panic.go:987 +0x3ba fp=0xc0001ad8f0 sp=0xc0001ad830 pc=0x433d7a\nruntime.panicmem(...)\n\t/usr/lib/golang/src/runtime/panic.go:260\nruntime.sigpanic()\n\t/usr/lib/golang/src/runtime/signal_unix.go:835 +0x2f6 fp=0xc0001ad940 sp=0xc0001ad8f0 pc=0x449cd6\nmain.getMTUByName({0xc00001a978, 0x4}, {0xc00002004a, 0x33}, 0x1)\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:167 +0x33a fp=0xc0001ada00 sp=0xc0001ad940 pc=0x54281a\nmain.loadConf(0xc000186770, {0xc00001e009, 0x19e})\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:120 +0x155 fp=0xc0001ada80 sp=0xc0001ada00 pc=0x5422d5\nmain.cmdAdd(0xc000186770)\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:287 +0x47 fp=0xc0001adcd0 sp=0xc0001ada80 pc=0x543b07\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc0000bdec8, 0xc000186770, {0x5c02b8, 0xc0000e4e40}, 0x592e80)\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:166 +0x20a fp=0xc0001add60 sp=0xc0001adcd0 pc=0x5371ca\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc0000bdec8, 0x698320?, 0xc0000bdeb0?, 0x44ed89?, {0x5c02b8, 0xc0000e4e40}, {0xc0000000f0, 
0x22})\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:219 +0x2ca fp=0xc0001ade68 sp=0xc0001add60 pc=0x53772a\ngithub.com/containernetworking/cni/pkg/skel.PluginMainWithError(...)\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:273\ngithub.com/containernetworking/cni/pkg/skel.PluginMain(0x588e01?, 0x10?, 0xc0000bdf50?, {0x5c02b8?, 0xc0000e4e40?}, {0xc0000000f0?, 0x0?})\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:288 +0xd1 fp=0xc0001adf18 sp=0xc0001ade68 pc=0x537d51\nmain.main()\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:432 +0xb6 fp=0xc0001adf80 sp=0xc0001adf18 pc=0x544b76\nruntime.main()\n\t/usr/lib/golang/src/runtime/proc.go:250 +0x212 fp=0xc0001adfe0 sp=0xc0001adf80 pc=0x436a12\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0001adfe8 sp=0xc0001adfe0 pc=0x462fc1\n\ngoroutine 2 [force gc (idle)]:\nruntime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000acfb0 sp=0xc0000acf90 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.forcegchelper()\n\t/usr/lib/golang/src/runtime/proc.go:302 +0xad fp=0xc0000acfe0 sp=0xc0000acfb0 pc=0x436c6d\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000acfe8 sp=0xc0000acfe0 pc=0x462fc1\ncreated by runtime.init.6\n\t/usr/lib/golang/src/runtime/proc.go:290 +0x25\n\ngoroutine 3 [GC sweep wait]:\nruntime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000ad790 sp=0xc0000ad770 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.bgsweep(0x0?)\n\t/usr/lib/golang/src/runtime/mgcsweep.go:278 +0x8e fp=0xc0000ad7c8 sp=0xc0000ad790 pc=0x423e4e\nruntime.gcenable.func1()\n\t/usr/lib/golang/src/runtime/mgc.go:178 +0x26 fp=0xc0000ad7e0 sp=0xc0000ad7c8 
pc=0x418d06\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000ad7e8 sp=0xc0000ad7e0 pc=0x462fc1\ncreated by runtime.gcenable\n\t/usr/lib/golang/src/runtime/mgc.go:178 +0x6b\n\ngoroutine 4 [GC scavenge wait]:\nruntime.gopark(0xc0000ca000?, 0x5bf2b8?, 0x1?, 0x0?, 0x0?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000adf70 sp=0xc0000adf50 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.(*scavengerState).park(0x6a0920)\n\t/usr/lib/golang/src/runtime/mgcscavenge.go:389 +0x53 fp=0xc0000adfa0 sp=0xc0000adf70 pc=0x421ef3\nruntime.bgscavenge(0x0?)\n\t/usr/lib/golang/src/runtime/mgcscavenge.go:617 +0x45 fp=0xc0000adfc8 sp=0xc0000adfa0 pc=0x4224c5\nruntime.gcenable.func2()\n\t/usr/lib/golang/src/runtime/mgc.go:179 +0x26 fp=0xc0000adfe0 sp=0xc0000adfc8 pc=0x418ca6\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000adfe8 sp=0xc0000adfe0 pc=0x462fc1\ncreated by runtime.gcenable\n\t/usr/lib/golang/src/runtime/mgc.go:179 +0xaa\n\ngoroutine 5 [finalizer wait]:\nruntime.gopark(0x0?, 0xc0000ac670?, 0xab?, 0x61?, 0xc0000ac770?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000ac628 sp=0xc0000ac608 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.runfinq()\n\t/usr/lib/golang/src/runtime/mfinal.go:180 +0x10f fp=0xc0000ac7e0 sp=0xc0000ac628 pc=0x417e0f\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000ac7e8 sp=0xc0000ac7e0 pc=0x462fc1\ncreated by runtime.createfing\n\t/usr/lib/golang/src/runtime/mfinal.go:157 +0x45\n"

Actual results:

The macvlan plugin panics with a nil pointer dereference, and the pod event contains the full Go stack trace instead of a readable error message.

Expected results:

This scenario should be handled without a crash, and a log message such as the following should be used instead:

Error: Failed to create container due to the missing master interface XXX.
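
One way to turn the panic into a readable error is to check the result of the interface lookup before using it. The following is a minimal Go sketch, not the plugin's actual code: `mtuByName` is a hypothetical helper, and the standard library's `net.InterfaceByName` is used here only as a stand-in for whatever link lookup the macvlan plugin performs.

```go
package main

import (
	"fmt"
	"net"
)

// mtuByName is a hypothetical helper that returns the MTU of the named
// master interface. If the interface cannot be found, it returns a
// descriptive error instead of dereferencing a nil result.
func mtuByName(master string) (int, error) {
	iface, err := net.InterfaceByName(master)
	if err != nil {
		return 0, fmt.Errorf("failed to create container due to the missing master interface %s: %w", master, err)
	}
	return iface.MTU, nil
}

func main() {
	// "no-such-if0" is a made-up name that is assumed not to exist on the host.
	if _, err := mtuByName("no-such-if0"); err != nil {
		fmt.Println("Error:", err)
	}
}
```

With a check like this, the caller receives an ordinary error that names the missing interface, which the runtime can surface in the pod event.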

Additional info:

 

Description of problem:

Users are not able to upgrade a namespace-scoped operator in the OpenShift web console.
The Subscription tab is not visible in the web console to a user with project admin rights.
Only cluster-admin users are able to update the operator.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Configure an IDP and add a user.
2. Install any operator in a specific namespace.
3. Assign the project admin permission to the user for the same namespace.
4. Log in as the user and check whether the `Subscription` tab is visible to update the operator.

Actual results:

The user is not able to update the operator. The Subscription tab is not visible to the user in the web console.

Expected results:

The user must be able to update the namespace-scoped operator if the user has the admin permission for the same project.

Additional info:

Reproduced the issue and observed the same behavior in OCP 4.10.20, OCP 4.10.25, and OCP 4.10.34.

 

 

 

Description of problem:

During a cluster destroy on AWS, the installer does a get-all-roles and deletes roles based on a tag. If a customer is using AWS SEA, which denies any role from doing a get-all-roles in the AWS account, the installer fails.

Instead of erroring out, the installer should gracefully handle being denied get-all-roles and move on, so that a denying SCP does not get in the way of a successful cluster destroy on AWS.

Version-Release number of selected component (if applicable):

[ec2-user@ip-172-16-32-144 ~]$ rosa version
1.2.6

How reproducible:

1. Deploy ROSA STS, private with PrivateLink, with AWS SEA.
2. Run `rosa delete cluster --debug`.
3. Watch the installer debug logs to see it try to get-all-roles.
4. The installer fails when the SCP from AWS SEA denies the get-all-roles task.

Steps to Reproduce:

See the steps listed under "How reproducible" above.

Actual results:

time="2022-09-01T00:10:40Z" level=error msg="error after waiting for command completion" error="exit status 4" installID=zp56pxql
time="2022-09-01T00:10:40Z" level=error msg="error provisioning cluster" error="exit status 4" installID=zp56pxql
time="2022-09-01T00:10:40Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 4" installID=zp56pxql


time="2022-09-01T00:12:47Z" level=info msg="copied /installconfig/install-config.yaml to /output/install-config.yaml" installID=55h2cvl5
time="2022-09-01T00:12:47Z" level=info msg="cleaning up resources from previous provision attempt" installID=55h2cvl5
time="2022-09-01T00:12:47Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:12:48Z" level=debug msg="search for matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:12:48Z" level=debug msg="search for IAM roles" installID=55h2cvl5
time="2022-09-01T00:12:49Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6b4b5144-2f4e-4fde-ba1a-04ed239b84c2" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6152e9c2-9c1c-478b-a5e3-11ff2508684e" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 8636f0ff-e984-4f02-870e-52170ab4e7bb" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 2385a980-dc9b-480f-955a-62ac1aaa6718" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 02ccef62-14e7-4310-b254-a0731995bd45" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: eca2081d-abd7-4c9b-b531-27ca8758f933" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6bda17e9-83e5-4688-86a0-2f84c77db759" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 827afa4a-8bb9-4e1e-af69-d5e8d125003a" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 8dcd0480-6f9e-49cb-a0dd-0c5f76107696" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 5095aed7-45de-4ca0-8c41-9db9e78ca5a6" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 04f7d0e0-4139-4f74-8f67-8d8a8a41d6b9" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 115f9514-b78b-42d1-b008-dc3181b61d33" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 68da4d93-a93e-410a-b3af-961122fe8df0" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 012221ea-2121-4b04-91f2-26c31c8458b1" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: e6c9328d-a4b9-4e69-8194-a68ed7af6c73" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 214ca7fb-d153-4d0d-9f9c-21b073c5bd35" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: 63b54e82-e2f6-48d4-bd0f-d2663bbc58bf" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: d24982b6-df65-4ba2-a3c0-5ac8d23947e1" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: e2c5737a-5014-4eb5-9150-1dd1939137c0" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7793fa7c-4c8d-4f9f-8f23-d393b85be97c" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: bef2c5ab-ef59-4be6-bf1a-2d89fddb90f1" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: ff04eb1b-9cf6-4fff-a503-d9292ff17ccd" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: 85e05de8-ba16-4366-bc86-721da651d770" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a9d864e4-cfdf-483d-a0d2-9b48a117abc4" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="search for IAM users" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="iterating over a page of 0 IAM users" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="search for IAM instance profiles" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=info msg="error while finding resources to delete" error="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a9d864e4-cfdf-483d-a0d2-9b48a117abc4" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:12:57Z" level=info msg=Disassociated id=i-03d7570547d32071d installID=55h2cvl5 name=rosa-mv9dx3-xls7g-master-profile role=ROSA-ControlPlane-Role
time="2022-09-01T00:12:57Z" level=info msg=Deleted InstanceProfileName=rosa-mv9dx3-xls7g-master-profile arn="arn:aws:iam::646284873784:instance-profile/rosa-mv9dx3-xls7g-master-profile" id=i-03d7570547d32071d installID=55h2cvl5
time="2022-09-01T00:12:57Z" level=debug msg=Terminating id=i-03d7570547d32071d installID=55h2cvl5
time="2022-09-01T00:12:58Z" level=debug msg=Terminating id=i-08bee3857e5265ba4 installID=55h2cvl5
time="2022-09-01T00:12:58Z" level=debug msg=Terminating id=i-00df6e7b34aa65c9b installID=55h2cvl5
time="2022-09-01T00:13:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:49Z" level=info msg=Deleted id=rosa-mv9dx3-xls7g-sint/2e99b98b94304d80 installID=55h2cvl5
time="2022-09-01T00:17:49Z" level=info msg=Deleted id=eni-0e4ee5cf8f9a8fdd2 installID=55h2cvl5
time="2022-09-01T00:17:50Z" level=debug msg="Revoked ingress permissions" id=sg-03265ad2fae661b8c installID=55h2cvl5
time="2022-09-01T00:17:50Z" level=debug msg="Revoked egress permissions" id=sg-03265ad2fae661b8c installID=55h2cvl5
time="2022-09-01T00:17:50Z" level=debug msg="DependencyViolation: resource sg-03265ad2fae661b8c has a dependent object\n\tstatus code: 400, request id: f7c35709-a23d-49fd-ac6a-f092661f6966" arn="arn:aws:ec2:ca-central-1:646284873784:security-group/sg-03265ad2fae661b8c" installID=55h2cvl5
time="2022-09-01T00:17:51Z" level=info msg=Deleted id=eni-0e592a2768c157360 installID=55h2cvl5
time="2022-09-01T00:17:52Z" level=debug msg="listing AWS hosted zones \"rosa-mv9dx3.0ffs.p1.openshiftapps.com.\" (page 0)" id=Z072427539WBI718F6BCC installID=55h2cvl5
time="2022-09-01T00:17:52Z" level=debug msg="listing AWS hosted zones \"0ffs.p1.openshiftapps.com.\" (page 0)" id=Z072427539WBI718F6BCC installID=55h2cvl5
time="2022-09-01T00:17:53Z" level=info msg=Deleted id=Z072427539WBI718F6BCC installID=55h2cvl5
time="2022-09-01T00:17:53Z" level=debug msg="Revoked ingress permissions" id=sg-08bfbb32ea92f583e installID=55h2cvl5
time="2022-09-01T00:17:53Z" level=debug msg="Revoked egress permissions" id=sg-08bfbb32ea92f583e installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=info msg=Deleted id=sg-08bfbb32ea92f583e installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=info msg=Deleted id=rosa-mv9dx3-xls7g-aint/635162452c08e059 installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=info msg=Deleted id=eni-049f0174866d87270 installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:55Z" level=debug msg="search for matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:55Z" level=debug msg="no deletions from us-east-1, removing client" installID=55h2cvl5
time="2022-09-01T00:17:55Z" level=debug msg="search for IAM roles" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 06b804ae-160c-4fa7-92de-fd69adc07db2" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 2a5dd4ad-9c3e-40ee-b478-73c79671d744" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: e61daee8-6d2c-4707-b4c9-c4fdd6b5091c" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 1b743447-a778-4f9e-8b48-5923fd5c14ce" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: da8c8a42-8e79-48e5-b548-c604cb10d6f4" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7d7840e4-a1b4-4ea2-bb83-9ee55882de54" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7f2e04ed-8c49-42e4-b35e-563093a57e5b" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: cd2b4962-e610-4cc4-92bc-827fe7a49b48" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: be005a09-f62c-4894-8c82-70c375d379a9" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 541d92f4-33ce-4a50-93d8-dcfd2306eeb0" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6dd81743-94c4-479a-b945-ffb1af763007" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a269f47b-97bc-4609-b124-d1ef5d997a91" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 33c3c0a5-e5c9-4125-9400-aafb363c683c" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 32e87471-6d21-42a7-bfd8-d5323856f94d" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: b2cc6745-0217-44fe-a48b-44e56e889c9e" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 09f81582-6685-4dc9-99f0-ed33565ab4f4" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: cea9116c-2b54-4caa-9776-83559d27b8f8" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: 430d7750-c538-42a5-84b5-52bc77ce2d56" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 279038e4-f3c9-4700-b590-9a90f9b8d3a2" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: 5e2f40ae-3dc7-4773-a5cd-40bf9aa36c03" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: 92a27a7b-14f5-455b-aa39-3c995806b83e" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0da4f66c-c6b1-453c-a8c8-dc0399b24bb9" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: f2c94beb-a222-4bad-abe1-8de5786f5e59" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 829c3569-b2f2-4b9d-94a0-69644b690066" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="search for IAM users" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="iterating over a page of 0 IAM users" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="search for IAM instance profiles" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=info msg="error while finding resources to delete" error="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 829c3569-b2f2-4b9d-94a0-69644b690066" installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=info msg=Deleted id=sg-03265ad2fae661b8c installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=debug msg="no deletions from ca-central-1, removing client" installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=debug msg="search for IAM roles" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0e8e0bea-b512-469b-a996-8722a0f7fa25" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 288456a2-0cd5-46f1-a5d2-6b4006a5dc0e" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 321df940-70fc-45e7-8c56-59fe5b89e84f" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 45bebf36-8bf9-4c78-a80f-c6a5e98b2187" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: eea00ae2-1a72-43f9-9459-a1c003194137" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0ef5a102-b764-4e17-999f-d820ebc1ec12" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 107d0ccf-94e7-41c4-96cd-450b66a84101" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: da9bd868-8384-4072-9fb4-e6a66e94d2a1" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 74fbf44c-d02d-4072-b038-fa456246b6a8" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 365116d6-1467-49c3-8f58-1bc005aa251f" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 20f91de5-cfeb-45e0-bb46-7b66d62cc749" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 924fa288-f1b9-49b8-b549-a930f6f771ce" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 4beb233d-40d6-4016-872a-8757af8f98ee" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 77951f62-e0b4-4a9b-a20c-ea40d6432e84" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 13ad38c8-89dc-461d-9763-870eec3a6ba1" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a8fe199d-12fb-4141-a944-c7c5516daf25" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: b487c62f-5ac5-4fa0-b835-f70838b1d178" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: 97bfcb55-ae1f-4859-9c12-03de09607f79" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: ca1094f6-714e-4042-9134-75f4c6d9d0df" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: ca1db477-ee6a-4d03-8b57-52b335b2bbe6" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: 1fc32d09-588b-4d80-ad62-748f7fb55efd" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7d906cc2-eaaa-439b-97e0-503615ce5d43" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: ee6a5647-20b1-4880-932b-bfd70b945077" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a424891e-48ab-4ad4-9150-9ef1076dcb9c" installID=55h2cvl5

The "not authorized" errors repeat probably 50+ times.

Expected results:

For these errors not to show up during install.

Additional info:

Again, this happens only because ROSA is being installed in an AWS SEA environment - https://github.com/aws-samples/aws-secure-environment-accelerator.

"etcdserver: leader changed" causes clients to fail.

This error should never bubble up to clients because the kube-apiserver can always retry this failure mode since it knows the data was not modified. When etcd adjusts timeouts for leader election and heartbeating for slow hardware like Azure, the hardcoded timeouts in the kube-apiserver/etcd fail. See

  1. kube-apiserver tries to use etcd retries: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/etcd3.go#L308-L317
  2. etcd retries appear to be unconditionally added: https://github.com/etcd-io/etcd/blob/main/client/v3/client.go#L243-L249 and https://github.com/etcd-io/etcd/blob/release-3.5/client/v3/client.go#L286
  3. etcd retries retry a max of 2.5 seconds: https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L53 + https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L45
  4. etcd retries are further reduced by zero-second retry on quorum
  5. On Azure (https://github.com/openshift/cluster-etcd-operator/blob/d7d43ee21aff6b178b2104228bba374977777a84/pkg/etcdenvvar/etcd_env.go#L229), slower leader change reactions (https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/hwspeedhelpers/hwhelper.go#L28) mean we are likely to exceed the number of retries for requests near the beginning of a change

Simply saying, "oh, it's hardcoded and kube" isn't good enough. We have previously had a storage shim to retry such problems. If all else fails, bringing back the small shim to retry Unavailable etcd errors longer is appropriate to fix all available clients.

Additionally, this etcd capability is being made more widely available and this bug prevents that from working.

This came up a while ago, see https://groups.google.com/u/1/a/redhat.com/g/aos-devel/c/HuOTwtI4a9I/m/nX9mKjeqAAAJ

Basically this MC:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-override
spec:
  kernelType: realtime
  osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4cc3995d5fc11e3b22140d8f2f91f78834e86a210325cbf0525a62725f8e099 

 

Will degrade the node with

 

E0301 21:25:09.234001    3306 writer.go:200] Marking Degraded due to: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: Could not depsolve transaction; 1 problem detected:
 Problem: package kernel-modules-core-5.14.0-282.el9.x86_64 requires kernel-uname-r = 5.14.0-282.el9.x86_64, but none of the providers can be installed
  - conflicting requests
: exit status 1
 

 

It's kind of annoying here because the packages to remove are now OS version dependent.  A while ago I filed https://github.com/coreos/rpm-ostree/issues/2542 which would push the problem down into rpm-ostree, which is in a better situation to deal with it, and that may be the fix...but it's also pushing the problem down there in a way that's going to be maintenance pain (but, we can deal with that).

 

It's also possible that we may need to explicitly request installation of `kernel-rt-modules-core`...I'll look.

Description of problem:
Ingress-canary Daemon Set does not tolerate Infra taint "NoExecute"

Version-Release number of selected component (if applicable):
OCPv4.9

How reproducible:
Always

Steps to Reproduce:
1. Label and taint the node
$ oc describe node worker-0.cluster49.lab.pnq2.cee.redhat.com | grep infra
Roles: custom,infra,test
node-role.kubernetes.io/infra= <----
Taints: node-role.kubernetes.io/infra=reserved:NoExecute <----
node-role.kubernetes.io/infra=reserved:NoSchedule <----

2. Edit the ingress-canary ds and add a NoExecute toleration
$ oc get ds -o yaml | grep -i tole -A6
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/infra
  value: reserved
- effect: NoExecute <----
  key: node-role.kubernetes.io/infra <----
  value: reserved <----

3. The Daemon Set configuration gets overwritten after some time, probably by the managing operator, and the pods are terminated on the infra nodes.

Actual results:
The infra taint toleration NoExecute gets overwritten:
$ oc get ds -o yaml | grep -i tole -A6
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/infra
  operator: Exists

Expected results:
The ingress-canary Daemon Set should tolerate the NoExecute infra taint.

Additional info: The same taints as in the product documentation are used (node-role.kubernetes.io/infra)
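
To make the mismatch concrete, the following Go sketch checks the two node taints from the steps above against the toleration the operator restores. The Taint and Toleration types and the tolerates/untolerated helpers are simplified stand-ins written for this example, not the Kubernetes API types. The matching rules follow the usual Kubernetes semantics: an empty effect matches any effect, and operator Exists ignores the value.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes taint and toleration fields.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal": // Equal is the default operator
		return tol.Value == t.Value
	}
	return false
}

// untolerated returns the taints that none of the tolerations match.
func untolerated(tols []Toleration, taints []Taint) []Taint {
	var out []Taint
	for _, t := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	const key = "node-role.kubernetes.io/infra"
	taints := []Taint{
		{Key: key, Value: "reserved", Effect: "NoExecute"},
		{Key: key, Value: "reserved", Effect: "NoSchedule"},
	}
	// Toleration left on the Daemon Set after it is overwritten.
	managed := []Toleration{{Key: key, Operator: "Exists", Effect: "NoSchedule"}}

	for _, t := range untolerated(managed, taints) {
		fmt.Printf("untolerated: %s=%s:%s\n", t.Key, t.Value, t.Effect)
	}
}
```

With only the NoSchedule toleration present, the NoExecute taint is left unmatched, which is consistent with the pods being terminated on the infra nodes.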

Description of problem:

Under heavy control plane load (bringing up ~200 pods), prometheus/promtail spikes to over 100% CPU, and node_exporter goes to ~200% CPU and stays there for 5-10 minutes. Tested on a GCP cluster bot using 2 physical core (4 vCPU) workers. This starves essential platform functions like OVS of CPU and causes the data plane to go down.

Running perf against node_exporter reveals the application is consuming the majority of its CPU trying to list new interfaces being added in sysfs. This looks like it is due to disabling netlink via:

https://issues.redhat.com/browse/OCPBUGS-8282

This operation grabs the rtnl lock which can compete with other components on the host that are trying to configure networking.

Version-Release number of selected component (if applicable):

Tested on 4.13 and 4.14 with GCP.

How reproducible:

3/4 times

Steps to Reproduce:

1. Launch gcp with cluster bot
2. Create a deployment with pause containers which will max out pods on the nodes:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver-deployment
  namespace: openshift-ovn-kubernetes
  labels:
    pod-name: server
    app: nginx
    role: webserver
spec:
  replicas: 700
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        role: webserver
    spec:
      containers:
        - name: webserver1
          image: k8s.gcr.io/pause:3.1
          ports:
            - containerPort: 80
              name: serve-80
              protocol: TCP 
3. Watch top cpu output. Wait for node_exporter and prometheus to show very high CPU. If this does not happen, proceed to step 4.
4. Delete the deployment and then recreate it.
5. High and persistent CPU usage should now be observed.

Actual results:

CPU is pegged on the host for several minutes. The terminal is almost unresponsive. The only way to fix it was to delete the node_exporter and prometheus DS.

Expected results:

Prometheus and other metrics-related applications should:
1. use netlink to avoid grabbing the rtnl lock
2. be CPU limited. Certain required applications in OCP are resource unbounded (like the networking data plane) to ensure the node's core functions continue to work. Metrics, however, should be CPU limited to prevent tooling from locking up a node.

Additional info:

Perf summary (will attach full perf output)
    99.94%     0.00%  node_exporter  node_exporter      [.] runtime.goexit.abi0
            |
            ---runtime.goexit.abi0
               |
                --99.33%--github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func2
                          |
                           --99.33%--github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1
                                     |
                                      --99.33%--github.com/prometheus/node_exporter/collector.execute
                                                |
                                                |--97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).Update
                                                |          |
                                                |           --97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).netClassSysfsUpdate
                                                |                     |
                                                |                      --97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).getNetClassInfo
                                                |                                |
                                                |                                 --97.64%--github.com/prometheus/procfs/sysfs.FS.NetClassByIface
                                                |                                           |
                                                |                                            --97.64%--github.com/prometheus/procfs/sysfs.parseNetClassIface
                                                |                                                      |
                                                |                                                       --97.61%--github.com/prometheus/procfs/internal/util.SysReadFile
                                                |                                                                 |
                                                |                                                                  --97.45%--syscall.read
                                                |                                                                            |
                                                |                                                                             --97.45%--syscall.Syscall
                                                |                                                                                       |
                                                |                                                                                        --97.45%--runtime/internal/syscall.Syscall6
                                                |                                                                                                  |
                                                |                                                                                                   --70.34%--entry_SYSCALL_64_after_hwframe
                                                |                                                                                                             do_syscall_64
                                                |                                                                                                             |
                                                |                                                                                                             |--39.13%--ksys_read
                                                |                                                                                                             |          |
                                                |                                                                                                             |          |--31.97%--vfs_read

Description of problem:

Since we migrated some of our jobs to OCP 4.14, we are experiencing a lot of flakiness with the "openshift-tests" binary, which panics when trying to retrieve the logs of etcd: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-test-infra/2212/pull-ci-openshift-assisted-test-infra-master-e2e-metal-assisted/1673615526967906304#1:build-log.txt%3A161-191

Here's the impact on our jobs:
https://search.ci.openshift.org/?search=error+reading+pod+logs&maxAge=48h&context=1&type=build-log&name=.*assisted.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

 N/A

How reproducible:

Happens from time to time against OCP 4.14

Steps to Reproduce:

1. Provision an OCP cluster 4.14
2. Run the conformance tests on it with "openshift-tests"

Actual results:


The binary "openshift-tests" panics from time to time:

 [2023-06-27 10:12:07] time="2023-06-27T10:12:07Z" level=error msg="error reading pod logs" error="container \"etcd\" in pod \"etcd-test-infra-cluster-a1729bd4-master-2\" is not available" pod=etcd-test-infra-cluster-a1729bd4-master-2
[2023-06-27 10:12:07] panic: runtime error: invalid memory address or nil pointer dereference
[2023-06-27 10:12:07] [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x26eb9b5]
[2023-06-27 10:12:07] 
[2023-06-27 10:12:07] goroutine 1 [running]:
[2023-06-27 10:12:07] bufio.(*Scanner).Scan(0xc005954250)
[2023-06-27 10:12:07] 	bufio/scan.go:214 +0x855
[2023-06-27 10:12:07] github.com/openshift/origin/pkg/monitor/intervalcreation.IntervalsFromPodLogs({0x8d91460, 0xc004a43d40}, {0xc8b83c0?, 0xc006138000?, 0xc8b83c0?}, {0x8d91460?, 0xc004a43d40?, 0xc8b83c0?})
[2023-06-27 10:12:07] 	github.com/openshift/origin/pkg/monitor/intervalcreation/podlogs.go:130 +0x8cd
[2023-06-27 10:12:07] github.com/openshift/origin/pkg/monitor/intervalcreation.InsertIntervalsFromCluster({0x8d441e0, 0xc000ffd900}, 0xc0008b4000?, {0xc005f88000?, 0x539, 0x0?}, 0x25e1e39?, {0xc11ecb5d446c4f2c, 0x4fb99e6af, 0xc8b83c0}, ...)
[2023-06-27 10:12:07] 	github.com/openshift/origin/pkg/monitor/intervalcreation/types.go:65 +0x274
[2023-06-27 10:12:07] github.com/openshift/origin/pkg/test/ginkgo.(*MonitorEventsOptions).End(0xc001083050, {0x8d441e0, 0xc000ffd900}, 0x1?, {0x7fff15b2ccde, 0x16})
[2023-06-27 10:12:07] 	github.com/openshift/origin/pkg/test/ginkgo/options_monitor_events.go:170 +0x225
[2023-06-27 10:12:07] github.com/openshift/origin/pkg/test/ginkgo.(*Options).Run(0xc0013e2000, 0xc00012e380, {0x8126d1e, 0xf})
[2023-06-27 10:12:07] 	github.com/openshift/origin/pkg/test/ginkgo/cmd_runsuite.go:506 +0x2d9a
[2023-06-27 10:12:07] main.newRunCommand.func1.1()
[2023-06-27 10:12:07] 	github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:330 +0x2d4
[2023-06-27 10:12:07] main.mirrorToFile(0xc0013e2000, 0xc0014cdb30)
[2023-06-27 10:12:07] 	github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:476 +0x5f2
[2023-06-27 10:12:07] main.newRunCommand.func1(0xc0013e0300?, {0xc000862ea0?, 0x6?, 0x6?})
[2023-06-27 10:12:07] 	github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:311 +0x5c
[2023-06-27 10:12:07] github.com/spf13/cobra.(*Command).execute(0xc0013e0300, {0xc000862e40, 0x6, 0x6})
[2023-06-27 10:12:07] 	github.com/spf13/cobra@v1.6.0/command.go:916 +0x862
[2023-06-27 10:12:07] github.com/spf13/cobra.(*Command).ExecuteC(0xc0013e0000)
[2023-06-27 10:12:07] 	github.com/spf13/cobra@v1.6.0/command.go:1040 +0x3bd
[2023-06-27 10:12:07] github.com/spf13/cobra.(*Command).Execute(...)
[2023-06-27 10:12:07] 	github.com/spf13/cobra@v1.6.0/command.go:968
[2023-06-27 10:12:07] main.main.func1(0xc00011b300?)
[2023-06-27 10:12:07] 	github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:96 +0x8a
[2023-06-27 10:12:07] main.main()
[2023-06-27 10:12:07] 	github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:97 +0x516 

Expected results:

No panics

Additional info:

The source of the panic has been pin-pointed here: https://github.com/openshift/origin/pull/27772#discussion_r1243600596

Description of problem:

Per the `oc set route-backends -h` output:
Routes may have one or more optional backend services with weights controlling how much traffic flows to each service.
[...]
**If all weights are zero the route will not send traffic to any backends.**

This is no longer the case for a route with a single backend.
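The weighting rule quoted above can be sketched as follows. This is a minimal illustration of the documented behaviour, not the router's implementation; the function name `traffic_shares` and the backend names are hypothetical.

```python
def traffic_shares(weights: dict[str, int]) -> dict[str, float]:
    """Return the fraction of traffic each backend should receive."""
    total = sum(weights.values())
    if total == 0:
        # All weights are zero: per the help text, no backend receives traffic.
        # This must also hold when the route has a single backend.
        return {name: 0.0 for name in weights}
    return {name: weight / total for name, weight in weights.items()}


# Single backend with weight 0 (the case reported here): no traffic expected.
print(traffic_shares({"example": 0}))
# Two backends with weights 1 and 3: traffic is split 25% / 75%.
print(traffic_shares({"a": 1, "b": 3}))
```

Under this reading, the single-backend case is not special: a weight of 0 yields a share of 0, so the `curl` request in the reproduction steps should fail.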

Version-Release number of selected component (if applicable):

at least from OCP 4.12 onward

How reproducible:

all the time

Steps to Reproduce:

1. kubectl create -f example/
2. kubectl patch route example -p '{"spec":{"to": {"weight": 0}}}' --type merge
3. curl http://localhost -H "Host: example.local" 

Actual results:

curl succeeds

Expected results:

curl fails

Additional info:

https://access.redhat.com/support/cases/#/case/03567697

This is a regression following NE-822. Reverting
https://github.com/openshift/router/commit/9656da7d5e2ac0962f3eaf718ad7a8c8b2172cfa makes it work again.

Sanitize OWNERS/OWNER_ALIASES in all CSI driver and operator repos.

For driver repos:

1) OWNERS must have `component`:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of Storage team.

For operator repos:

1) OWNERS must have:

  • all team members of Storage team as `approvers`
  • `component`:
    component: "Storage / Operators"
    

If the kubeadmin secret is deleted successfully from the guest cluster but removing the `SecretHashAnnotation` annotation from the oauthDeployment fails, the removal is never reconciled again and the annotation stays in place indefinitely.

context: https://redhat-internal.slack.com/archives/C01C8502FMM/p1684765042825929

Description of problem:

GCP XPN installs require the permission `projects/<host-project>/roles/dns.networks.bindPrivateDNSZone` in the host project. This permission is not always provided in organizations. The installer requires this permission in order to create a private DNS zone and bind it to the shared networks.

Instead, the installer should be able to create records in a provided private zone that matches the base domain.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

While deploying 3671 SNOs via ACM and ZTP, 19 SNO clusters failed to install because the clusterversion object reported that the cluster operator operator-lifecycle-manager is not available.

Version-Release number of selected component (if applicable):

Hub OCP 4.12.14
SNO Deployed OCP 4.13.0-rc.6
ACM - 2.8.0-DOWNSTREAM-2023-04-30-18-44-29

How reproducible:

19 of the 51 failed clusters, out of 3671 total installs.
Only ~0.5% of installs might experience this; however, it represents ~37% of all install failures.
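The two percentages follow directly from the counts given above. A short check, using only the numbers from this report:

```python
total_installs = 3671   # SNO clusters deployed
failed_installs = 51    # clusters that failed to install
olm_failures = 19       # failures with the OLM "not available" signature

share_of_installs = olm_failures / total_installs * 100
share_of_failures = olm_failures / failed_installs * 100

print(f"{share_of_installs:.2f}% of all installs")
print(f"{share_of_failures:.1f}% of all install failures")
```

This prints 0.52% and 37.3%, which matches the rounded figures quoted in the report.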

Steps to Reproduce:

1.
2.
3.

Actual results:

# cat cluster-install-failures | grep OLM | awk '{print $1}' | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion --no-headers"
vm00096 version         False   True   15h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available                                 
vm00334 version         False   True   19h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available                                 
vm00593 version         False   True   19h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available                                 
vm01095 version         False   True   19h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available                                 
vm01192 version         False   True   19h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available                                 
vm01447 version         False   True   18h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm01566 version         False   True   19h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm01707 version         False   True   17h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm01742 version         False   True   15h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm01798 version         False   True   13h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm01810 version         False   True   19h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm02020 version         False   True   19h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm02091 version         False   True   20h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm02363 version         False   True   13h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm02590 version         False   True   20h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm02908 version         False   True   18h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm03253 version         False   True   14h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm03500 version         False   True   17h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
vm03654 version         False   True   17h   Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available

Expected results:

 

Additional info:

There appear to be two distinguishing failure signatures in the list of cluster operators: every cluster shows that OLM is unavailable and degraded, and more than half of the clusters show no status information for operator-lifecycle-manager-packageserver.

# cat cluster-install-failures | grep OLM | awk '{print $1}' | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get co operator-lifecycle-manager --no-headers"
vm00096 operator-lifecycle-manager         False   True   True   15h   
vm00334 operator-lifecycle-manager         False   True   True   19h   
vm00593 operator-lifecycle-manager         False   True   True   19h   
vm01095 operator-lifecycle-manager         False   True   True   19h   
vm01192 operator-lifecycle-manager         False   True   True   19h   
vm01447 operator-lifecycle-manager         False   True   True   18h   
vm01566 operator-lifecycle-manager         False   True   True   19h   
vm01707 operator-lifecycle-manager         False   True   True   17h   
vm01742 operator-lifecycle-manager         False   True   True   15h   
vm01798 operator-lifecycle-manager         False   True   True   13h   
vm01810 operator-lifecycle-manager         False   True   True   19h   
vm02020 operator-lifecycle-manager         False   True   True   19h   
vm02091 operator-lifecycle-manager         False   True   True   20h   
vm02363 operator-lifecycle-manager         False   True   True   13h   
vm02590 operator-lifecycle-manager         False   True   True   20h   
vm02908 operator-lifecycle-manager         False   True   True   18h   
vm03253 operator-lifecycle-manager         False   True   True   14h   
vm03500 operator-lifecycle-manager         False   True   True   17h   
vm03654 operator-lifecycle-manager         False   True   True   17h
# cat cluster-install-failures | grep OLM | awk '{print $1}' | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get co operator-lifecycle-manager-packageserver --no-headers"
vm00096 operator-lifecycle-manager-packageserver                                 
vm00334 operator-lifecycle-manager-packageserver         False   True   False   19h   
vm00593 operator-lifecycle-manager-packageserver         False   True   False   19h   
vm01095 operator-lifecycle-manager-packageserver                                 
vm01192 operator-lifecycle-manager-packageserver                                 
vm01447 operator-lifecycle-manager-packageserver                                 
vm01566 operator-lifecycle-manager-packageserver         False   True   False   19h   
vm01707 operator-lifecycle-manager-packageserver                                 
vm01742 operator-lifecycle-manager-packageserver         False   True   False   15h   
vm01798 operator-lifecycle-manager-packageserver                                 
vm01810 operator-lifecycle-manager-packageserver                                 
vm02020 operator-lifecycle-manager-packageserver                                 
vm02091 operator-lifecycle-manager-packageserver         False   True   False   20h   
vm02363 operator-lifecycle-manager-packageserver         False   True   False   13h   
vm02590 operator-lifecycle-manager-packageserver         False   True   False   20h   
vm02908 operator-lifecycle-manager-packageserver         False   True   False   18h   
vm03253 operator-lifecycle-manager-packageserver                                 
vm03500 operator-lifecycle-manager-packageserver                                 
vm03654 operator-lifecycle-manager-packageserver
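Counting the rows above by hand gives 11 clusters with no packageserver status and 8 clusters that report one. A tally like this could be automated along the following lines; this is a sketch that assumes the whitespace-separated layout shown above, and the helper name `split_by_status` is hypothetical. Only the first four rows are reproduced as sample input.

```python
SAMPLE = """\
vm00096 operator-lifecycle-manager-packageserver
vm00334 operator-lifecycle-manager-packageserver         False   True   False   19h
vm00593 operator-lifecycle-manager-packageserver         False   True   False   19h
vm01095 operator-lifecycle-manager-packageserver
"""


def split_by_status(output: str) -> tuple[list[str], list[str]]:
    """Split cluster names into (no status reported, status reported)."""
    missing, reported = [], []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # A row with only the VM name and operator name has no status columns.
        if len(fields) == 2:
            missing.append(fields[0])
        else:
            reported.append(fields[0])
    return missing, reported


missing, reported = split_by_status(SAMPLE)
print("no status:", missing)
print("status reported:", reported)
```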

Viewing the pods in the openshift-operator-lifecycle-manager namespace for these clusters shows no packageserver pod:

# cat cluster-install-failures | grep OLM | awk '{print $1}' | xargs -I % sh -c "echo '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get po -n openshift-operator-lifecycle-manager"
vm00096
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-9rm9j         1/1     Running     1 (15h ago)   15h
collect-profiles-28053720-kbsdn          0/1     Completed   0             33m
collect-profiles-28053735-dzkf8          0/1     Completed   0             18m
collect-profiles-28053750-skvcn          0/1     Completed   0             3m1s
olm-operator-66658fffbb-gj294            1/1     Running     0             15h
package-server-manager-654759688-bxnwj   1/1     Running     0             15h
vm00334
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-xcw9r         1/1     Running     1 (19h ago)   19h
collect-profiles-28053720-ppq6x          0/1     Completed   0             32m
collect-profiles-28053735-r2rvw          0/1     Completed   0             18m
collect-profiles-28053750-lgb4r          0/1     Completed   0             3m2s
olm-operator-66658fffbb-t4nxg            1/1     Running     0             19h
package-server-manager-654759688-6n7gp   1/1     Running     0             19h
vm00593
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-rwfwp         1/1     Running     1 (19h ago)   19h
collect-profiles-28053720-7p6tq          0/1     Completed   0             33m
collect-profiles-28053735-nqzn9          0/1     Completed   0             18m
collect-profiles-28053750-zppm6          0/1     Completed   0             3m2s
olm-operator-66658fffbb-4gcpv            1/1     Running     0             19h
package-server-manager-654759688-rbjdw   1/1     Running     0             19h
vm01095
NAME                                     READY   STATUS      RESTARTS   AGE
catalog-operator-94b8bfddc-2tp6j         1/1     Running     0          19h
collect-profiles-28053720-bnrfz          0/1     Completed   0          33m
collect-profiles-28053735-p8bl5          0/1     Completed   0          18m
collect-profiles-28053750-mg9nv          0/1     Completed   0          3m2s
olm-operator-66658fffbb-cb95l            1/1     Running     0          19h
package-server-manager-654759688-2mqdm   1/1     Running     0          19h
vm01192
NAME                                     READY   STATUS      RESTARTS   AGE
catalog-operator-94b8bfddc-2crgg         1/1     Running     0          19h
collect-profiles-28053720-2rknm          0/1     Completed   0          33m
collect-profiles-28053735-wc5dn          0/1     Completed   0          18m
collect-profiles-28053750-g5bhj          0/1     Completed   0          3m2s
olm-operator-66658fffbb-5hlh4            1/1     Running     0          19h
package-server-manager-654759688-xfp24   1/1     Running     0          19h
vm01447
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-p8gd4         1/1     Running     0             18h
collect-profiles-28053720-kjw4w          0/1     Completed   0             33m
collect-profiles-28053735-k7xxp          0/1     Completed   0             17m
collect-profiles-28053750-fn5gq          0/1     Completed   0             3m3s
olm-operator-66658fffbb-rshjq            1/1     Running     1 (18h ago)   18h
package-server-manager-654759688-hrmfd   1/1     Running     0             18h
vm01566
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-gbrnj         1/1     Running     0             19h
collect-profiles-28053720-2wdcp          0/1     Completed   0             33m
collect-profiles-28053735-t7x5b          0/1     Completed   0             18m
collect-profiles-28053750-wdmtt          0/1     Completed   0             3m3s
olm-operator-66658fffbb-fsxrx            1/1     Running     0             19h
package-server-manager-654759688-4mdz8   1/1     Running     1 (19h ago)   19h
vm01707
NAME                                     READY   STATUS      RESTARTS   AGE
catalog-operator-94b8bfddc-f2ns6         1/1     Running     0          17h
collect-profiles-28053720-72sjt          0/1     Completed   0          33m
collect-profiles-28053735-qzgx4          0/1     Completed   0          18m
collect-profiles-28053750-mrpbl          0/1     Completed   0          3m3s
olm-operator-66658fffbb-jwp2l            1/1     Running     0          17h
package-server-manager-654759688-f7bm4   1/1     Running     0          17h
vm01742
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-lhv6f         1/1     Running     1 (15h ago)   15h
collect-profiles-28053720-4kqtf          0/1     Completed   0             33m
collect-profiles-28053735-hw7kp          0/1     Completed   0             18m
collect-profiles-28053750-6ztq2          0/1     Completed   0             3m4s
olm-operator-66658fffbb-5sqlc            1/1     Running     0             15h
package-server-manager-654759688-n6sms   1/1     Running     0             15h
vm01798
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-kx7nx         1/1     Running     2 (13h ago)   13h
collect-profiles-28053720-7vlqq          0/1     Completed   0             33m
collect-profiles-28053735-m8ltn          0/1     Completed   0             18m
collect-profiles-28053750-hrfnk          0/1     Completed   0             3m4s
olm-operator-66658fffbb-5z74m            1/1     Running     1 (13h ago)   13h
package-server-manager-654759688-6jbnz   1/1     Running     0             13h
vm01810
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-v5vr6         1/1     Running     2 (19h ago)   19h
collect-profiles-28053720-m26dn          0/1     Completed   0             33m
collect-profiles-28053735-64j7f          0/1     Completed   0             18m
collect-profiles-28053750-qf69b          0/1     Completed   0             3m4s
olm-operator-66658fffbb-gxt2b            1/1     Running     0             19h
package-server-manager-654759688-dz6p6   1/1     Running     0             19h
vm02020
NAME                                     READY   STATUS      RESTARTS   AGE
catalog-operator-94b8bfddc-2qqk6         1/1     Running     0          19h
collect-profiles-28053720-5cktx          0/1     Completed   0          33m
collect-profiles-28053735-ls6n9          0/1     Completed   0          18m
collect-profiles-28053750-bj6gl          0/1     Completed   0          3m4s
olm-operator-66658fffbb-zsr4g            1/1     Running     0          19h
package-server-manager-654759688-2dnfd   1/1     Running     0          19h
vm02091
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-whftg         1/1     Running     1 (20h ago)   20h
collect-profiles-28053720-zqcbs          0/1     Completed   0             33m
collect-profiles-28053735-v8lf5          0/1     Completed   0             18m
collect-profiles-28053750-rshdd          0/1     Completed   0             3m5s
olm-operator-66658fffbb-876ps            1/1     Running     0             20h
package-server-manager-654759688-smc8q   1/1     Running     0             20h
vm02363
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-zgn5m         1/1     Running     1 (13h ago)   13h
collect-profiles-28053720-dpkqq          0/1     Completed   0             33m
collect-profiles-28053735-nfqmf          0/1     Completed   0             18m
collect-profiles-28053750-jfhdz          0/1     Completed   0             3m5s
olm-operator-66658fffbb-bbrgb            1/1     Running     1 (13h ago)   13h
package-server-manager-654759688-7pv96   1/1     Running     0             13h
vm02590
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-v9mvc         1/1     Running     2 (20h ago)   20h
collect-profiles-28053720-pfcbd          0/1     Completed   0             33m
collect-profiles-28053735-5dxbl          0/1     Completed   0             18m
collect-profiles-28053750-95f6g          0/1     Completed   0             3m5s
olm-operator-66658fffbb-5knlj            1/1     Running     0             20h
package-server-manager-654759688-7qkgb   1/1     Running     0             20h
vm02908
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-cnmjf         1/1     Running     0             18h
collect-profiles-28053720-ks6h7          0/1     Completed   0             33m
collect-profiles-28053735-r682b          0/1     Completed   0             18m
collect-profiles-28053750-9jrx4          0/1     Completed   0             3m5s
olm-operator-66658fffbb-7bd2v            1/1     Running     1 (18h ago)   18h
package-server-manager-654759688-5r6gq   1/1     Running     0             18h
vm03253
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-94b8bfddc-8wtgg         1/1     Running     2 (14h ago)   14h
collect-profiles-28053720-kwcgk          0/1     Completed   0             33m
collect-profiles-28053735-dv5hx          0/1     Completed   0             18m
collect-profiles-28053750-8xbmw          0/1     Completed   0             3m6s
olm-operator-66658fffbb-f2n9f            1/1     Running     0             14h
package-server-manager-654759688-tjlc9   1/1     Running     0             14h
vm03500
NAME                                     READY   STATUS      RESTARTS   AGE
catalog-operator-94b8bfddc-wdq9b         1/1     Running     0          17h
collect-profiles-28053720-jcmwf          0/1     Completed   0          33m
collect-profiles-28053735-tjw5j          0/1     Completed   0          18m
collect-profiles-28053750-5mjq9          0/1     Completed   0          3m6s
olm-operator-66658fffbb-q92bg            1/1     Running     0          17h
package-server-manager-654759688-2z656   1/1     Running     0          17h
vm03654
NAME                                     READY   STATUS      RESTARTS   AGE
catalog-operator-94b8bfddc-vq9wt         1/1     Running     0          17h
collect-profiles-28053720-dlknz          0/1     Completed   0          33m
collect-profiles-28053735-mshs7          0/1     Completed   0          18m
collect-profiles-28053750-86xrc          0/1     Completed   0          3m6s
olm-operator-66658fffbb-5qd99            1/1     Running     0          17h

 

 

Description of problem:

Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-16796. The following is the description of the original issue:

Description of problem:

 

Observation from CISv1.4 pdf:
1.1.1 Ensure that the API server pod specification file permissions are set to 600 or more restrictive



“Ensure that the API server pod specification file has permissions of 600 or more restrictive.
OpenShift 4 deploys two API servers: the OpenShift API server and the Kube API server. The OpenShift API server delegates requests for Kubernetes objects to the Kube API server.
The OpenShift API server is managed as a deployment. The pod specification yaml for openshift-apiserver is stored in etcd.
The Kube API Server is managed as a static pod. The pod specification file for the kube-apiserver is created on the control plane nodes at /etc/kubernetes/manifests/kube-apiserver-pod.yaml. The kube-apiserver is mounted via hostpath to the kube-apiserver pods via /etc/kubernetes/static-pod-resources/kube-apiserver-pod.yaml with permissions 600.”
 
To conform with the CIS benchmarks, the permissions of the kube-apiserver pod specification file /etc/kubernetes/static-pod-resources/kube-apiserver-pod.yaml should be changed to 600.

$ for i in $( oc get pods -n openshift-kube-apiserver -l app=openshift-kube-apiserver -o name )
do                 
oc exec -n openshift-kube-apiserver $i -- \
stat -c %a /etc/kubernetes/static-pod-resources/kube-apiserver-pod.yaml
done
644
644
644
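The "600 or more restrictive" criterion can be expressed as a bit test: no permission bit outside owner read/write may be set. The following is a minimal sketch of that check, not the compliance tooling's implementation; the function name `is_600_or_stricter` is hypothetical.

```python
import stat


def is_600_or_stricter(mode: int) -> bool:
    """Return True if no permission bit outside owner read/write is set."""
    # S_IMODE keeps only the permission bits; ~0o600 selects everything
    # except owner read and owner write.
    return (stat.S_IMODE(mode) & ~0o600) == 0


for mode in (0o644, 0o600, 0o400):
    print(oct(mode), is_600_or_stricter(mode))
```

The observed mode 644 fails the check because the group and other read bits are set, while 600 and 400 pass.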

 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-20-215234

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

The permission of the pod specification file for the kube-apiserver is 644.

Expected results:

The permission of the pod specification file for the kube-apiserver should be updated to 600.

Additional info:

PR: https://github.com/openshift/library-go/commit/19a42d2bae8ba68761cfad72bf764e10d275ad6e

 

Description of problem:

There is a forcedns dispatcher script, added by the assisted installer installation process, that creates /etc/resolv.conf.

This script has no shebang, which caused installation to fail because no resolv.conf was generated.

In order to fix upgrades in already installed clusters, we need to work around this issue.
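Detecting a script that lacks a shebang only requires inspecting its first two bytes. The sketch below illustrates such a check; it is not the actual workaround, and the helper names `has_shebang` and `write_script` are hypothetical.

```python
import os
import tempfile


def has_shebang(path: str) -> bool:
    """Return True if the file starts with the two bytes '#!'."""
    with open(path, "rb") as f:
        return f.read(2) == b"#!"


def write_script(body: str) -> str:
    """Write body to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".sh")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    return path


good = write_script("#!/bin/bash\necho ok\n")
bad = write_script("echo ok\n")
print(has_shebang(good), has_shebang(bad))
os.remove(good)
os.remove(bad)
```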

 

Version-Release number of selected component (if applicable):

4.13.0

How reproducible:

Happens every time

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Building Dockerfile.upi.ci.rhel8 fails with the following error:

[3/3] STEP 26/32: RUN chown 1000:1000 /output && chmod -R g=u "$HOME/.bluemix/"
chmod: cannot access '/root/.bluemix/': No such file or directory
error: build error: building at STEP "RUN chown 1000:1000 /output && chmod -R g=u "$HOME/.bluemix/"": while running runtime: exit status 1

Version-Release number of selected component (if applicable):

master (and possibly all other branches where the ibmcli tool was introduced)

How reproducible:

always

Steps to Reproduce:

1. Try to use Dockerfile.ci.upi.rhel8
2.
3.

Actual results:

[3/3] STEP 26/32: RUN chown 1000:1000 /output && chmod -R g=u "$HOME/.bluemix/"
chmod: cannot access '/root/.bluemix/': No such file or directory
error: build error: building at STEP "RUN chown 1000:1000 /output && chmod -R g=u "$HOME/.bluemix/"": while running runtime: exit status 1

Expected results:

No failures

Additional info:

We should also change the downloading of the govc image with curl to importing it from the cached container in quay.io, as it is done in Dockerfile.ci.upi

AWS Local Zone Support for OCP UPI/IPI

Current AWS-based OCP deployment models do not address Local Zones, which offer lower latency and geo-proximity to OCP cluster consumers.

OCP install support for AWS Local Zones will address customer segments where low-latency and data-locality requirements are deal breakers in our sales teams' engagements.

Description of problem:

When users try to use Duplicate ClusterRoleBinding or Edit ClusterRoleBinding subject in the RHOCP web console, they get the following error:
"Error Loading : Name parameter invalid: "system%3Acontroller%3A<name-of-role-ref>": may not contain '%'"

Version-Release number of selected component (if applicable):

Tested in OCP 4.12.18

How reproducible:

Always

Steps to Reproduce:

1. Open the OpenShift web console
2. Select the project: Openshift
3. Under User management, click RoleBindings
4. Look for any RoleBinding whose Role Ref has the format `system:<name>`
5. At the end of that row, click the three-dot menu, which offers the following options:
- Duplicate ClusterRoleBinding
- Edit ClusterRoleBinding subject
6. Click either option

Actual results:

After selecting Duplicate ClusterRoleBinding or Edit ClusterRoleBinding subject, the following error is displayed:
Error Loading : Name parameter invalid: "system%3AXXX": may not contain '%'

Expected results:

After selecting Duplicate ClusterRoleBinding or Edit ClusterRoleBinding subject, the expected web page should open.

Additional info:

Duplicating or editing the RoleBinding `registry-registry-role` with Role Ref `system:registry` works as expected.
Duplicating or editing the RoleBinding `system:sdn-readers` with Role Ref `system:sdn-reader` produces the following error:
Error Loading : Name parameter invalid: "system%3Asdn-readers": may not contain '%'

Duplicate ClusterRoleBinding and Edit ClusterRoleBinding subject work for only a few of the RoleBindings whose Role Ref has the form system:<name>.
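The `%3A` in the error message is the percent-encoded form of the colon in the name. The sketch below reproduces the symptom: a name that is URL-encoded but never decoded fails a check that rejects `%`. It does not reflect the console's actual code; `is_valid_name` is a hypothetical stand-in for the validation behind the error message.

```python
from urllib.parse import quote, unquote


def is_valid_name(name: str) -> bool:
    """Hypothetical stand-in: reject names containing '%'."""
    return "%" not in name


raw = "system:sdn-readers"
encoded = quote(raw, safe="")  # the colon becomes %3A

print(encoded)
print(is_valid_name(encoded))           # encoded name is rejected
print(is_valid_name(unquote(encoded)))  # decoding first restores the name
```

Names without a colon, such as `registry-registry-role`, are unchanged by encoding, which is consistent with those bindings working as expected.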

Screenshots are attached here : https://drive.google.com/drive/folders/1QHpdensG2gKx0tSv1zkF7Qiyert6eaSg?usp=sharing

Description of problem:

The topology page crashes.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Visit developer console
2. Topology view
3.

Actual results:

Error message:
TypeError
Description:
e is null
Component trace:
f@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~app/code-refs/actions~delete-revision~dev-console-add~dev-console-deployImage~dev-console-ed~cf101ec3-chunk-5018ae746e2320e4e737.min.js:26:14244
5363/t.a@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:177913
u@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:275718
8248/t.a<@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:475504
i@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:470135
withFallback()
5174/t.default@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:78258
s@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:237096
[...]
ne<@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1592411
r@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:36:125397
t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:58042
t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:60087
t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:54647
re@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1592722
t.a@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:791129
t.a@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1062384
s@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:613567
t.a@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:141:244663

Expected results:

No error should occur.

Additional info:

Cloud Pak Operator is installed 

Description of problem:

In ROSA, the user can specify a HostPrefix, but we are currently not passing it to the HostedCluster CR. While trying to fix this, it appears that we are also not setting it up correctly on the nodes.

Version-Release number of selected component (if applicable):

4.12.16

How reproducible:

Always

Steps to Reproduce:

1. Create an HC. Inside the spec add 
  networking:
    clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 25
2. Deploy the HC. Check its configuration. 

Actual results:

oc get network cluster shows the right config (see attachment),
but oc describe node always shows a /24 hostPrefix.

Note that this also happens with the default value of /23. On the node, under podCIDR, I always see something like
PodCIDR:                                   10.128.1.0/24 
PodCIDRs:                                  10.128.1.0/24 

Expected results:

I would expect the pod cidr mask to be reflected in the pod configuration
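
For reference, the subnet sizes implied by the clusterNetwork entry above can be worked out with plain CIDR arithmetic. The following is only an illustrative sketch: the helper name node_subnets is hypothetical, and the real per-node allocation is done by the cluster network components, not by code like this.

```python
import ipaddress
from itertools import islice


def node_subnets(cluster_cidr: str, host_prefix: int, count: int) -> list:
    """Return the first `count` subnets of size /host_prefix inside cluster_cidr.

    Hypothetical helper for illustration only; it does not reproduce the
    allocator used by the cluster.
    """
    network = ipaddress.ip_network(cluster_cidr)
    return [str(s) for s in islice(network.subnets(new_prefix=host_prefix), count)]


# With hostPrefix 25, each node would be expected to get a /25 block.
print(node_subnets("10.128.0.0/14", 25, 3))
# With a /24 mask (what the nodes actually report), the blocks look like this.
print(node_subnets("10.128.0.0/14", 24, 2))
```

The second call yields 10.128.1.0/24 as its second block, which matches the PodCIDR value seen on the node.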

Additional info:

pod cidr is correctly set

Description of problem:

Running through instructions for a smoke test on 4.14, the DNS record is incorrectly created for the Gateway.  It is missing a trailing dot in the dnsName.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Run through the steps in https://github.com/openshift/network-edge-tools/blob/2fd044d110eb737c94c8b86ea878a130cae0d03e/docs/blogs/EnhancedDevPreviewGatewayAPI/GettingStarted.md until the step "oc get dnsrecord -n openshift-ingress"
2. Check the status of the DNS record: "oc get dnsrecord xxx -n openshift-ingress -ojson | jq .status.zones[].conditions"

 

Actual results:

The status shows error conditions with a message like 'The DNS provider failed to ensure the record: googleapi: Error 400: Invalid value for ''entity.change.additions[*.gwapi.apps.ci-ln-3vxsgxb-72292.origin-ci-int-gce.dev.rhcloud.com][A].name'': ''*.gwapi.apps.ci-ln-3vxsgxb-72292.origin-ci-int-gce.dev.rhcloud.com'', invalid'

Expected results:

The status of the DNS record should show a successful publishing of the record.
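
The missing trailing dot can be illustrated with a small normalization step. This is a minimal sketch, assuming the fix amounts to making the dnsName fully qualified before it is handed to the DNS provider; the helper name ensure_trailing_dot and the example hostname are hypothetical and not taken from the operator code.

```python
def ensure_trailing_dot(dns_name: str) -> str:
    """Return dns_name as a fully qualified name ending in a single dot."""
    return dns_name if dns_name.endswith(".") else dns_name + "."


# A wildcard name without the dot gets one appended.
print(ensure_trailing_dot("*.gwapi.apps.example.com"))
# A name that is already fully qualified is left unchanged.
print(ensure_trailing_dot("*.gwapi.apps.example.com."))
```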

Additional info:

Backport to 4.13.z

When the user specifies the 'vendor' hint, it actually checks for the value of the 'model' hint in the vendor field.

Description of problem:

The title on the Overview page has changed to "Cluster · Red Hat OpenShift" instead of the "Overview · Red Hat OpenShift" title that we have had since 4.11.

Version-Release number of selected component (if applicable):

OCP 4.14

How reproducible:

Install OpenShift 4.14, log in to the management console and navigate to Home / Overview

Steps to Reproduce:

1. Install OpenShift 4.14
2. Log in to the management console
3. Navigate to Home / Overview 
4. Load the HTML DOM and verify the HTML node <title>; title is also visible when hovering on the opened tab in Chrome or Firefox

Actual results:

Cluster · Red Hat OpenShift

HTML node: <title data-telemetry="Cluster" data-react-helmet="data-telemetry" xpath="1">Cluster · Red Hat OpenShift</title>

Expected results:

Overview · Red Hat OpenShift

Additional info:

Starting from 4.11, the title on that page was always Overview · Red Hat OpenShift. UI tests rely on consistent titles to detect the currently opened web page.

* It is important to note that the change also affects accessibility, since navigating with text-to-speech is a common accessibility feature.

Description of problem:

Azure managed identity role assignments created using 'ccoctl azure' sub-commands are not cleaned up when running 'ccoctl azure delete'

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

100%

Steps to Reproduce:

1. Create Azure workload identity infrastructure using 'ccoctl azure create-all'
2. Delete Azure workload identity infrastructure using 'ccoctl azure delete'
3. Observe lingering role assignments in either the OIDC resource group if not deleted OR in the DNS Zone resource group if the OIDC resource group is deleted by providing '--delete-oidc-resource-group'. 

Actual results:

Role assignments for managed identities are not deleted following 'ccoctl azure delete'

Expected results:

Role assignments for managed identities are deleted following 'ccoctl azure delete'

Additional info:

 

Description of problem:

Cluster Provisioning fails with the message:
Internal error: failed to fetch instance type, this error usually occurs if the region or the instance type is not found

This is likely because OCM uses GCP custom machine types, for example custom-4-16384 and now the installer is validating machine types per zone (see GetMachineTypeWithZones function), which don't include custom machine types.

See https://cloud.google.com/compute/docs/instances/creating-instance-with-custom-machine-type#gcloud for more details.
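
To make the distinction concrete, a validator could recognize custom machine type names before falling back to the per-zone lookup. The sketch below is an assumption about how such a check might look, not the installer's actual fix: the function name is hypothetical, and the naming pattern (an optional family prefix, then custom-VCPUS-MEMORY_MB, with an optional -ext suffix) is my reading of the GCP documentation linked above.

```python
import re

# Assumed naming pattern for GCP custom machine types, e.g. "custom-4-16384"
# or "n2-custom-4-16384-ext". Not taken from the installer source.
CUSTOM_TYPE = re.compile(
    r"^(?:(?P<family>[a-z0-9]+)-)?custom-(?P<vcpus>\d+)-(?P<memory_mb>\d+)(?P<ext>-ext)?$"
)


def parse_custom_machine_type(name: str):
    """Return the parts of a custom machine type name, or None if it is not one."""
    match = CUSTOM_TYPE.match(name)
    if match is None:
        return None
    return {
        "family": match.group("family"),  # None when no family prefix is given
        "vcpus": int(match.group("vcpus")),
        "memory_mb": int(match.group("memory_mb")),
        "extended": match.group("ext") is not None,
    }


print(parse_custom_machine_type("custom-4-16384"))
print(parse_custom_machine_type("n1-standard-4"))
```

A predefined type such as n1-standard-4 does not match and would still go through the normal per-zone validation.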

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. ocm create cluster cluster001 --provider=gcp --ccs=true --region=us-east1 --service-account-file=token.json --version="4.14.0-0.nightly-2023-08-02-102121-nightly"

Actual results:

Cluster installation fails 

Expected results:

Cluster installation succeeds

Additional info:

 

As a developer, I would like the Getting Started page to use a numbered list so that it is easier to point people to specific sections of the document.

As a developer, I would like the Contribute page to be a numbered list so that it is easier to point people to specific line items of the document.

Description of problem:

Library-go contains code for creating token requests that should be reused by all OpenShift components. Because of time-constraints, this code did not make it to `oc` in the past.

Fix that to prevent code out-of-sync issues.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

100%

Steps to Reproduce:

1. see if https://github.com/openshift/oc/pull/991 merged

Actual results:

it hasn't merged at the time of writing this bug

Expected results:

it's merged

Additional info:


Description of problem:
When adding a "Git Repository" (a Tekton or Pipelines Repository) and entering a GitLab or Bitbucket PAC repository, the created Repository resource is invalid.

Version-Release number of selected component (if applicable):
4.11-4.13

How reproducible:
Always

Steps to Reproduce:
Setup a PAC git repo, you can mirror these projects if you want: https://github.com/jerolimov/nodeinfo-pac

For GitHub you need to set up:

  1. an account-global "private access token" > a classic access token, see https://github.com/settings/tokens
  2. a repo > webhook

For GitLab:

  1. a repo > Project Access Tokens
  2. a repo > webhook

For Bitbucket:

  1. an account-global "app password", see https://bitbucket.org/account/settings/app-passwords/
  2. a repo > webhook

On a cluster bot instance:

  1. Install OpenShift Pipelines operator
  2. Navigate to Developer perspective > Pipelines
  3. Select Create > Repository
  4. Enter a GitLab based git repository with Git access token and Webhook secret
  5. Enter a Bitbucket based git repository with Git access token (webhook secret isn't supported)

Actual results:
The GitLab created resource looks like this:

apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: gitlab-nodeinfo-pac
spec:
  git_provider:
    secret:
      key: provider.token
      name: gitlab-nodeinfo-pac-token-gfr66
    url: gitlab.com   # missing scheme
    webhook_secret:
      key: webhook.secret
      name: gitlab-nodeinfo-pac-token-gfr66
  url: 'https://gitlab.com/jerolimov/nodeinfo-pac'

The Bitbucket resource looks like this:

apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: bitbucket-nodeinfo-pac
spec:
  git_provider:
    secret:
      key: provider.token
      name: bitbucket-nodeinfo-pac-token-9pf75
    url: bitbucket.org   # missing scheme and invalid API URL !
    webhook_secret:   # didn't enter a webhook URL, see OCPBUGS-7035
      key: webhook.secret
      name: bitbucket-nodeinfo-pac-token-9pf75
  url: 'https://bitbucket.org/jerolimov/nodeinfo-pac'

The pipeline-as-code controller Pod log contains some error messages and no PipelineRun is created.

Expected results:
For GitLab:

  1. The spec.git_provider.url should contain the scheme https://, so it should be https://gitlab.com, or can be removed completely. Both work fine.
    A working example:
apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: gitlab-nodeinfo-pac
spec:
  git_provider:
    secret:
      key: provider.token
      name: gitlab-nodeinfo-pac-token-gfr66
    url: https://gitlab.com
    webhook_secret:
      key: webhook.secret
      name: gitlab-nodeinfo-pac-token-gfr66
  url: 'https://gitlab.com/jerolimov/nodeinfo-pac'

Bitbucket:

  1. The spec.git_provider.url should be https://api.bitbucket.org/2.0, or can be removed completely. Both work fine.
  2. The Account Secret needs also a Bitbucket login name, passed as spec.git_provider.user.

A working example:

apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: bitbucket-nodeinfo-pac
spec:
  git_provider:
    user: jerolimov
    secret:
      key: provider.token
      name: bitbucket-nodeinfo-pac-token-9pf75
    webhook_secret:
      key: webhook.secret
      name: bitbucket-nodeinfo-pac-token-9pf75
  url: 'https://bitbucket.org/jerolimov/nodeinfo-pac'

A PipelineRun should be created for each push to the git repo.
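
The expected values above can be summarized as a small derivation from the repository URL. This is a rough sketch under stated assumptions: the helper name provider_url is hypothetical, the console code is not shown in this report, and leaving the field out entirely is described above as working equally well.

```python
from urllib.parse import urlsplit


def provider_url(repo_url: str) -> str:
    """Derive a value for spec.git_provider.url from a repository URL.

    Hypothetical helper: it always includes the scheme and maps bitbucket.org
    to the API endpoint named in the expected results.
    """
    # Tolerate input without a scheme by assuming https.
    parts = urlsplit(repo_url if "://" in repo_url else "https://" + repo_url)
    if parts.hostname == "bitbucket.org":
        return "https://api.bitbucket.org/2.0"
    return f"{parts.scheme}://{parts.netloc}"


print(provider_url("https://gitlab.com/jerolimov/nodeinfo-pac"))
print(provider_url("https://bitbucket.org/jerolimov/nodeinfo-pac"))
```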

Additional info:

  1. Bitbucket is spelled with a lowercase second "b".
  2. For the Bitbucket issue see also https://github.com/openshift-pipelines/pipelines-as-code/issues/416

The "sufficient-masters-count' failed" test is intermittently failing due to a suspected race condition that causes a duplicate cluster event.

"Cluster validation 'sufficient-masters-count' that used to succeed is now failing"

The aim of this ticket is to ensure that this test does not flake

Description of problem:

The default PipelineRun template name has been updated in the backend in Pipeline operator 1.10, so we need to update the name in the UI code as well.

 

https://github.com/openshift/console/blob/master/frontend/packages/pipelines-plugin/src/components/pac/const.ts#L9

 

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/33

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Seeing `Secret {{newImageSecret}} was created.` string for the created Image pull secret alert in the Container image flow.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Navigate to the +Add page
2. Open the Container Image form
3. Click on the Create an Image pull secret link and create a secret

Actual results:

Secret {{newImageSecret}} was created. gets rendered in the alert

Expected results:

Secret <-Secret name-> was created. should render in the alert

Additional info:

 

Description of problem:

https://issues.redhat.com//browse/OCPBUGS-10342 tracked the issue when the number of replicas exceeded the number of hosts. However, it does not detect the case when the number of hosts exceeds the number of replicas as it was not counting the hosts correctly. Fix to detect this case correctly.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Set compute replicas in install-config.yaml
2. Add hosts in agent-config.yaml - 3 with role of master and more than 2 with role of worker.
3. The installation will fail and following error could be seen in the journal 
Jun 12 01:10:57 master-0 start-cluster-installation.sh[3879]: Hosts known and ready for cluster installation (5/3) 

Actual results:

No warning regarding the number of configured hosts

Expected results:

A warning about the number of configured hosts not matching the replicas.
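
The check being asked for can be sketched as a comparison in both directions. The function below is only an illustration with hypothetical names; it assumes every host in agent-config.yaml has an explicit role and does not reflect how the installer actually counts hosts.

```python
from collections import Counter


def host_count_warnings(replicas: dict, host_roles: list) -> list:
    """Return one warning per role whose host count differs from its replicas.

    `replicas` maps a role name to the replica count from install-config.yaml;
    `host_roles` lists the role of each host from agent-config.yaml.
    """
    counts = Counter(host_roles)
    warnings = []
    for role, wanted in replicas.items():
        have = counts.get(role, 0)
        if have != wanted:  # catches both too few and too many hosts
            warnings.append(
                f"{have} hosts with role {role} are configured, but {wanted} replicas are requested"
            )
    return warnings


# Three masters and three workers against 3 + 2 replicas triggers a worker warning.
print(host_count_warnings({"master": 3, "worker": 2}, ["master"] * 3 + ["worker"] * 3))
```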

Additional info:

 

Description of problem:

On a hypershift cluster that has public certs for OAuth configured, the console reports an x509 certificate error when attempting to display a token

Version-Release number of selected component (if applicable):

4.12.z

How reproducible:

always

Steps to Reproduce:

1. Create a hosted cluster configured with a letsencrypt certificate for the oauth endpoint.
2. Go to the console of the hosted cluster. Click on the user icon and get token.

Actual results:

The console displays an oauth cert error

Expected results:

The token displays

Additional info:

The hcco reconciles the oauth cert into the console namespace. However, it is only reconciling the self-signed one and not the one that was configured through .spec.configuration.apiserver of the hostedcluster. It needs to detect the actual cert used for oauth and send that one.

 

Description of the problem:

In BE 2.15.x, the API and Ingress VIP values are not validated against broadcast IPs (e.g. if the network is 192.168.123.0/24, these are 192.168.123.0 and 192.168.123.255).

How reproducible:

100%

Steps to reproduce:

1. Create cluster with Ingress or API vip with broadcast IP


Actual results:

 

Expected results:
BE should block those IPs
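
A minimal sketch of such a check, assuming IPv4 and a machine network wider than /31; the function name is_usable_vip is hypothetical and the actual validation in assisted-service may be structured differently.

```python
import ipaddress


def is_usable_vip(vip: str, machine_cidr: str) -> bool:
    """Return True if vip lies inside machine_cidr and is neither the first
    (network) nor the last (broadcast) address of that network."""
    network = ipaddress.ip_network(machine_cidr)
    address = ipaddress.ip_address(vip)
    if address not in network:
        return False
    return address not in (network.network_address, network.broadcast_address)


print(is_usable_vip("192.168.123.5", "192.168.123.0/24"))    # ordinary host address
print(is_usable_vip("192.168.123.0", "192.168.123.0/24"))    # first address of the network
print(is_usable_vip("192.168.123.255", "192.168.123.0/24"))  # last address of the network
```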

Description of problem:

Missing workload annotations from deployments. This is in relation to the openshift/platform-operator repo.

Missing annotations.

Namespace name, `workload.openshift.io/allowed: management`

`target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'`. That annotation is required for the admission webhook to modify the resource for workload pinning. 

Related Enhancements: 
https://github.com/openshift/enhancements/pull/703 
https://github.com/openshift/enhancements/pull/1213

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

KCM crashes when Topology cache's HasPopulatedHints method attempts concurrent map access

Miciah has started working on the upstream fix and we need to bring the changes into openshift/kubernetes as soon as we can

https://redhat-internal.slack.com/archives/C01CQA76KMX/p1684876782205129 for more context  

Version-Release number of selected component (if applicable):

 

How reproducible:

CI 4.14 upgrade jobs run into this problem quite often: https://search.ci.openshift.org/?search=pkg%2Fcontroller%2Fendpointslice%2Ftopologycache%2Ftopologycache.go&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job 

Steps to Reproduce:

 

Actual results:

KCM crashing

Expected results:

KCM not crashing

Additional info:

 

We are pushing to find a resolution for OCPBUGS-11591 and the SDN team has identified a key message that appears related in the system journald logs:

Apr 12 11:53:51.395838 ci-op-xs3rnrtc-2d4c7-4mhm7-worker-b-dwc7w ovs-vswitchd[1124]: ovs|00002|timeval(urcu4)|WARN|Unreasonably long 109127ms poll interval (0ms user, 0ms system)

We should detect this in origin and create an interval so it can be charted in the timelines, as well as a unit test that fails if detected so we can see where it's happening.
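
As an illustration of the detection, the message above can be matched with a regular expression that also extracts the duration. This is a sketch in Python with hypothetical names; the real detection would live in origin's own monitoring code.

```python
import re

# Matches the ovs-vswitchd warning quoted above and captures the duration.
LONG_POLL = re.compile(r"Unreasonably long (?P<ms>\d+)ms poll interval")


def long_poll_interval_ms(line: str):
    """Return the poll interval in milliseconds, or None if the line does not match."""
    match = LONG_POLL.search(line)
    return int(match.group("ms")) if match else None


line = (
    "Apr 12 11:53:51.395838 ci-op-xs3rnrtc-2d4c7-4mhm7-worker-b-dwc7w ovs-vswitchd[1124]: "
    "ovs|00002|timeval(urcu4)|WARN|Unreasonably long 109127ms poll interval (0ms user, 0ms system)"
)
print(long_poll_interval_ms(line))
```

The extracted duration could then be used as the length of the interval to chart, or compared against a threshold in the failing test.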

Ovnkube-node container max memory usage was 110 MiB with the 4.14.0-0.nightly-2023-05-18-231932 image and is now 530 MiB with the 4.14.0-0.nightly-2023-07-31-181848 image, for the same test (cluster-density-v2 with 800 iterations, churn=false) on a 120-node environment. We observed the same pattern in the OVN-IC environment as well.

Note: As churn is false, we are calculating memory usage for only resource creation.

Grafana panel for OVN with 4.14.0-0.nightly-2023-05-18-231932 image -

https://grafana.rdu2.scalelab.redhat.com:3000/dashboard/snapshot/H9pAb07fsPEOFyd5dhKLFP602A7S18uC

 

Grafana panel for OVN with 4.14.0-0.nightly-2023-07-31-181848 image -

https://grafana.rdu2.scalelab.redhat.com:3000/dashboard/snapshot/8158bJgv3e4P2uiVernbc2E5ypBWFYHt

 

As the test was successfully run in the CI, we couldn't collect a must-gather. I can provide must-gather and pprof data if needed.

 

We observed an increase from 100 MiB to 550 MiB in OVN-IC between the 4.14.0-0.nightly-2023-06-12-141936 and 4.14.0-0.nightly-2023-07-30-191504 versions.

OVN-IC  4.14.0-0.nightly-2023-06-12-141936

https://grafana.rdu2.scalelab.redhat.com:3000/dashboard/snapshot/o5SXLdHIL8whsdgaMyXwWamipBP8J2fF

 

OVN-IC 4.14.0-0.nightly-2023-07-30-191504

https://grafana.rdu2.scalelab.redhat.com:3000/dashboard/snapshot/NMuSQx7YAJ9jokoKMl6Me9StHp33tjwD

Description of the problem:

When invoking the installation with the assisted-service scripts (make deploy-all), as is done when installing in the PSI env, the pods for assisted-service and assisted-image-service produce warnings about a failing readiness probe:

Readiness probe failed: Get "http://172.28.8.39:8090/ready": dial tcp 172.28.8.39:8090: connect: connection refused 

Those warnings are harmless, but they make people think that there is a problem with the running pods (or that they are not ready yet, even though the pods are marked as ready).

How reproducible:

100%

Steps to reproduce:

1. invoke make deploy-all on PSI or other places (for some reason it doesn't reproduce on minikube)

2. inspect the pod's conditions part with oc describe, and look for warnings

Actual results:

Warnings emitted 

Expected results:
No warnings should be emitted for the initial setup time of each pod. The fix just requires setting initialDelaySeconds in the readinessProbe configuration, just like we did in the template: https://github.com/openshift/assisted-service/pull/4557 
see also: https://github.com/openshift/assisted-service/pull/380#pullrequestreview-490308765 

Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/44

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

ODC automatically loads all Camel K Kamelets from openshift-operators namespace in order to display those resources in the event sources/sinks catalog. This is not working when the Camel K operator is installed in another namespace (e.g. in Developer Sandbox the Camel K operator had to be installed in camel-k-operator namespace)

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Display event sources/sinks catalog in ODC on a cluster where Camel K is installed in a namespace other than openshift-operators (e.g. Developer Sandbox)

Steps to Reproduce:

1. Make sure to have a cluster where Knative eventing is available
2. Install Camel K operator in camel-k-operator namespace (e.g. via OLM)
3. Display the event source/sink catalog in ODC

Actual results:

No Kamelets are visible in the catalog

Expected results:

All Kamelets (automatically installed with the operator) should be visible as potential event sources/sinks in the catalog

Additional info:

The Kamelet resources are being watched in two namespaces (the current user namespace and the global operator namespace): https://github.com/openshift/console/blob/master/frontend/packages/knative-plugin/src/hooks/useKameletsData.ts#L12-L28

We should allow configuration of the global namespace, or also add the camel-k-operator namespace as a third place to look for installed Kamelets.

This is a clone of issue OCPBUGS-19017. The following is the description of the original issue:

dnsmasq isn't starting on okd-scos in the bootstrap VM

 

The logs show it failing with "Operation not permitted"

`useExtensions` is not available in the dynamic plugin SDK, which prevents this functionality being copied to `monitoring-plugin`. `useResolvedExtensions` is available and provides the same functionality so we should use that instead.

For static pod readiness we check /readyz and /healthz endpoints for kube-apiserver. For SNO exclude openshift-apiserver from the health checks using the 'exclude' query parameter

Example:
> oc get --raw '/readyz?verbose&exclude=api-openshift-apiserver-available'

Should we also remove 'oauth-apiserver'?

Description of problem:

No MachineSet is created for workers if replicas == 0

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

replicas: 0 in install-config for workers

Steps to Reproduce:

1. Deploy a cluster with 0 worker
2. After deployment, list MachineSets
3. Zero can be found

Actual results:

No MachineSet found:
No resources found in openshift-machine-api namespace.

Expected results:

A worker MachineSet should have been created like before.

Additional info:

We broke it during CPMS integration.

Please review the following PR: https://github.com/openshift/cloud-provider-nutanix/pull/18

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When installing a cluster on IBM Cloud, the image registry defaults to Removed with no storage configured; this started after 4.13.0-ec.3.
Image registry should use ibmcos object storage on IPI-IBM cluster 
https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/storage.go#L182 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-27-101545

How reproducible:

always

Steps to Reproduce:

1. Install an IPI cluster on IBM cloud
2. Check the image registry after the install succeeds

Actual results:

oc get config.image/cluster -o yaml 
  spec:
    logLevel: Normal
    managementState: Removed
    observedConfig: null
    operatorLogLevel: Normal
    proxy: {}
    replicas: 1
    requests:
      read:
        maxWaitInQueue: 0s
      write:
        maxWaitInQueue: 0s
    rolloutStrategy: RollingUpdate
    storage: {}
    unsupportedConfigOverrides: null
oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-03-02T02:21:06Z"
  generation: 1
  name: cluster
  resourceVersion: "531"
  uid: 8d61a1e2-3852-40a2-bf5d-b7f9c92cda7b
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.wxjibm32.ibmcloud.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.wxjibm32.ibmcloud.qe.devcluster.openshift.com:6443
  controlPlaneTopology: HighlyAvailable
  etcdDiscoveryDomain: ""
  infrastructureName: wxjibm32-lmqh7
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      cisInstanceCRN: 'crn:v1:bluemix:public:internet-svcs:global:a/fdc2e14cf8bc4d53a67f972dc2e2c861:e8ee6ca1-4b31-4307-8190-e67f6925f83b::'
      location: eu-gb
      providerType: VPC
      resourceGroupName: wxjibm32-lmqh7
    type: IBMCloud 

Expected results:

Image registry should use ibmcos object storage on IPI-IBM cluster 

Additional info:

Must-gather log https://drive.google.com/file/d/1N-WUOZLRjlXcZI0t2O6MXsxwnsVPDCGQ/view?usp=share_link 

Description of the problem:

When patching the platform and leaving UMN unchanged, the log shows "false" instead of nil, which makes it look as if the cluster will end up in an invalid state (e.g. none + UMN disabled)

 

time="2023-06-15T09:59:54Z" level=info msg="Platform verification completed, setting platform type to none and user-managed-networking to false" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).validateUpdateCluster" file="/assisted-service/internal/bminventory/inventory.go:1928" cluster_id=468bffe8-ce24-400e-a104-b0aab378eb75 go-id=94310 pkg=Inventory request_id=2fbb74ba-4390-4f27-b6fd-ee11ac1a7895 

 

Steps to reproduce:

1. Create cluster with platform == OCI or vSphere with UMN enabled

2. Patch the cluster with '{"platform": {"type": "none"}}'

 

Actual results:

Log shows 

setting platform type to none and user-managed-networking to false 

 

Expected results:

setting platform type to none and user-managed-networking to nil
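
The difference between the two messages comes down to whether an unset value is kept distinct from false when it is formatted. The report does not say how the log message is built, so the following is only a sketch of the distinction in Python, with None standing in for nil and a hypothetical helper name.

```python
from typing import Optional


def describe_umn(value: Optional[bool]) -> str:
    """Format an optional boolean so that "not set" stays distinct from false."""
    return "nil" if value is None else str(value).lower()


unset = None
# Coercing the unset value to a plain boolean loses the distinction.
print(str(bool(unset)).lower())
# Formatting the optional value directly keeps it.
print(describe_umn(unset))
```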

aws-ebs-csi-driver-controller-ca ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.

Description of problem

CI is flaky because tests pull the "openshift/origin-node" image from Docker Hub and get rate-limited:

E0803 20:44:32.429877    2066 kuberuntime_image.go:53] "Failed to pull image" err="rpc error: code = Unknown desc = reading manifest latest in docker.io/openshift/origin-node: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" image="openshift/origin-node:latest"

This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/929/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/16871891662673059841687189166267305984. I don't know how to search for this failure using search.ci. I discovered the rate-limiting through Loki: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22PCEB727DF2F34084E%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fpull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator%2F1687189166267305984%5C%22%7D%20%7C%20unpack%20%7C~%20%5C%22pull%20rate%20limit%5C%22%22,%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221691086303449%22,%22to%22:%221691122303451%22%7D%7D.

Version-Release number of selected component (if applicable)

This happened on 4.14 CI job.

How reproducible

I have observed this once so far, but it is quite obscure.

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check Loki using the following query:

{...} {invoker="openshift-internal-ci/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/*"} | unpack | systemd_unit="kubelet.service" |~ "pull rate limit"

Actual results

CI pulls from Docker Hub and fails.

Expected results

CI passes, or fails on some other test failure. CI should never pull from Docker Hub.

Additional info

We have been using the "openshift/origin-node" image in multiple tests for years. I have no idea why it is suddenly pulling from Docker Hub, or how we failed to notice that it was pulling from Docker Hub if that's what it was doing all along.

Description of problem:

[CSI Inline Volume admission plugin] when using deployment/statefulset/daemonset workload with inline volume doesn't record audit logs/warning correctly

Version-Release number of selected component (if applicable):

4.13.0-0.ci.test-2023-03-02-013814-ci-ln-yd4m4st-latest (nightly build also could be reproduced)

How reproducible:

Always

Steps to Reproduce:

1. Enable feature gate to auto install the csi.sharedresource csi driver

2. Add security.openshift.io/csi-ephemeral-volume-profile: privileged to CSIDriver 'csi.sharedresource.openshift.io'
# scale down the cvo,cso and shared-resource-csi-driver-operator
$ oc scale --replicas=0 deploy/cluster-version-operator -n openshift-cluster-version
deployment.apps/cluster-version-operator scaled
$ oc scale --replicas=0 deploy/cluster-storage-operator -n openshift-cluster-storage-operator
deployment.apps/cluster-storage-operator scaled
$ oc scale --replicas=0 deploy/shared-resource-csi-driver-operator -n openshift-cluster-csi-drivers
deployment.apps/shared-resource-csi-driver-operator scaled
# Add security.openshift.io/csi-ephemeral-volume-profile: privileged to CSIDriver
$ oc get csidriver/csi.sharedresource.openshift.io -o yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  annotations:
    csi.openshift.io/managed: "true"
    operator.openshift.io/spec-hash: 4fc61ff54015a7e91e07b93ac8e64f46983a59b4b296344948f72187e3318b33
  creationTimestamp: "2022-10-26T08:10:23Z"
  labels:
    security.openshift.io/csi-ephemeral-volume-profile: privileged

3. Create different workloads with inline volume in a restricted namespace
$ oc apply -f examples/simple 
role.rbac.authorization.k8s.io/shared-resource-my-share-pod created 
rolebinding.rbac.authorization.k8s.io/shared-resource-my-share-pod created
configmap/my-config created
sharedconfigmap.sharedresource.openshift.io/my-share-pod created
Error from server (Forbidden): error when creating "examples/simple/03-pod.yaml": pods "my-csi-app-pod" is forbidden: admission denied: pod my-csi-app-pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security enforce level that is lower than privileged 
Error from server (Forbidden): error when creating "examples/simple/04-deployment.yaml": deployments.apps "mydeployment" is forbidden: admission denied: pod  uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security enforce level that is lower than privileged 
Error from server (Forbidden): error when creating "examples/simple/05-statefulset.yaml": statefulsets.apps "my-sts" is forbidden: admission denied: pod  uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security enforce level that is lower than privileged

4.  Add enforce: privileged label to the test ns and create different workloads with inline volume again 
$ oc label ns/my-csi-app-namespace security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=restricted pod-security.kubernetes.io/warn=restricted --overwrite
namespace/my-csi-app-namespace labeled

$ oc apply -f examples/simple                    
role.rbac.authorization.k8s.io/shared-resource-my-share-pod created
rolebinding.rbac.authorization.k8s.io/shared-resource-my-share-pod created
configmap/my-config created
sharedconfigmap.sharedresource.openshift.io/my-share-pod created
Warning: pod my-csi-app-pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security warn level that is lower than privileged
pod/my-csi-app-pod created
Warning: pod  uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security warn level that is lower than privileged
deployment.apps/mydeployment created
daemonset.apps/my-ds created
statefulset.apps/my-sts created

$ oc get po                                               
NAME                            READY   STATUS    RESTARTS   AGE
my-csi-app-pod                  1/1     Running   0          34s
my-ds-cw4k7                     1/1     Running   0          32s
my-ds-sv9vp                     1/1     Running   0          32s
my-ds-v7f9m                     1/1     Running   0          32s
my-sts-0                        1/1     Running   0          31s
mydeployment-664cd95cb4-4s2cd   1/1     Running   0          33s

5. Check the api-server audit logs
$ oc adm node-logs ip-10-0-211-240.us-east-2.compute.internal --path=kube-apiserver/audit.log | grep 'uses an inline volume provided by'| tail -1 | jq . | grep 'CSIInlineVolumeSecurity'
    "storage.openshift.io/CSIInlineVolumeSecurity": "pod  uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security audit level that is lower than privileged"

Actual results:

In steps 3 and 4: for deployment workloads, the pod name in the warning is empty;
for statefulset/daemonset workloads, no warning is displayed.
In step 5: the pod name in the audit log entry is empty.

Expected results:

In steps 3 and 4: for deployment workloads, the warning should include the pod name;
for statefulset/daemonset workloads, the warning should be displayed.
In step 5: the pod name in the audit log entry shouldn't be empty; it should record the workload type and the specific pod names.

Additional info:

Testdata:
https://github.com/Phaow/csi-driver-shared-resource/tree/test-inlinevolume/examples/simple
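
The expected behaviour can be sketched as follows. This is an illustration only, not the admission plugin's code: the helper `inline_volume_message` and its parameters are hypothetical, and only the message wording is taken from the warnings quoted in the steps above.

```python
def inline_volume_message(kind, name, driver, namespace, level):
    """Build the pod security message for a workload using a CSI inline volume.

    Hypothetical helper: the pod template inside a Deployment, StatefulSet or
    DaemonSet has no pod name yet, so the owning workload's kind and name are
    used as the subject instead of leaving the name empty.
    """
    subject = f"{kind} {name}" if name else kind
    return (
        f"{subject} uses an inline volume provided by CSIDriver {driver} "
        f"and namespace {namespace} has a pod security {level} level "
        f"that is lower than privileged"
    )


if __name__ == "__main__":
    print(inline_volume_message(
        "deployment", "mydeployment",
        "csi.sharedresource.openshift.io", "my-csi-app-namespace", "warn"))
```

The same helper could be used for the warn, audit and enforce levels, so the three outputs stay consistent.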

Description of problem:

When running a cluster on application credentials, this event appears repeatedly:

ns/openshift-machine-api machineset/nhydri0d-f8dcc-kzcwf-worker-0 hmsg/173228e527 - pathological/true reason/ReconcileError could not find information for "ci.m1.xlarge"

Version-Release number of selected component (if applicable):

 

How reproducible:

Happens in the CI (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/33330/rehearse-33330-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.13-e2e-openstack-ovn-serial/1633149670878351360).

Steps to Reproduce:

1. On a running cluster, rotate the OpenStack cloud credentials
2. Invalidate the previous credentials
3. Watch the machine-api events (`oc -n openshift-machine-api get event`). A `Warning` event with the message could not find information for "name-of-the-flavour" will appear.

If the cluster was installed using a password that you can't invalidate:
1. Rotate the cloud credentials to application credentials
2. Restart MAPO (`oc -n openshift-machine-api get pods -o NAME | xargs -r oc -n openshift-machine-api delete`)
3. Rotate cloud credentials again
4. Revoke the first application credentials you set
5. Finally watch the events (`oc -n openshift-machine-api get event`)

The event signals that MAPO wasn't able to update flavour information on the MachineSet status.

Actual results:

 

Expected results:

No issue detecting the flavour details

Additional info:

Offending code likely around this line: https://github.com/openshift/machine-api-provider-openstack/blob/bcb08a7835c08d20606d75757228fd03fbb20dab/pkg/machineset/controller.go#L116

Currently the assisted installer adds to the ISO a dracut hook that is executed early during the boot process. That hook generates the NetworkManager configuration files that will be used during the boot and also once the machine is installed. But that hook is not guaranteed to run before NetworkManager, and the files it generates may not be loaded by NetworkManager at the right time. We have seen such issues in the recent upgrade from RHEL 8 to RHEL 9 that is part of OpenShift 4.13. The RHCOS team recommends replacing it with a systemd unit that runs before NetworkManager.

Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/29

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/53

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When creating a machine with Azure Ultra Disks attached as data disks in an Arm cluster, the machine reaches the Provisioned phase, but the Azure web console shows that the instance failed with the error ZonalAllocationFailed.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-arm64-2023-03-22-204044

How reproducible:

Always

Steps to Reproduce:


/// Not Needed up to point 6 ////

1. Make sure the StorageClass is already present
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ultra-disk-sc
provisioner: disk.csi.azure.com # replace with "kubernetes.io/azure-disk" if aks version is less than 1.21
volumeBindingMode: WaitForFirstConsumer # optional, but recommended if you want to wait until the pod that will use this disk is created 
parameters:
  skuname: UltraSSD_LRS
  kind: managed
  cachingMode: None
  diskIopsReadWrite: "2000"  # minimum value: 2 IOPS/GiB 
  diskMbpsReadWrite: "320"   # minimum value: 0.032/GiB
2. Create a new custom secret based on the worker-user-data secret
$ oc -n openshift-machine-api get secret worker-user-data --template='{{index .data.userData | base64decode}}' | jq > userData.txt
3. Edit userData.txt by adding the part below just before the closing '}', and add a comma
"storage": {
  "disks": [
    {
      "device": "/dev/disk/azure/scsi1/lun0",
      "partitions": [
        {
          "label": "lun0p1",
          "sizeMiB": 1024,
          "startMiB": 0
        }
      ]
    }
  ],
  "filesystems": [
    {
      "device": "/dev/disk/by-partlabel/lun0p1",
      "format": "xfs",
      "path": "/var/lib/lun0p1"
    }
  ]
},
"systemd": {
  "units": [
    {
      "contents": "[Unit]\nBefore=local-fs.target\n[Mount]\nWhere=/var/lib/lun0p1\nWhat=/dev/disk/by-partlabel/lun0p1\nOptions=defaults,pquota\n[Install]\nWantedBy=local-fs.target\n",
      "enabled": true,
      "name": "var-lib-lun0p1.mount"
    }
  ]
}
4. Extract the disabling template value using below
$ oc -n openshift-machine-api get secret worker-user-data --template='{{index .data.disableTemplating | base64decode}}' | jq > disableTemplating.txt
5. Merge the two files to create a datasecret file to be used 
$ oc -n openshift-machine-api create secret generic worker-user-data-x5 --from-file=userData=userData.txt --from-file=disableTemplating=disableTemplating.txt 


/// Not needed up to here ///

6. Modify the new machineset YAML, adding the dataDisks below as a separate field at the same level as osDisk
          dataDisks:
          - nameSuffix: ultrassd
            lun: 0
            diskSizeGB: 4 # The same issue on the machine status fields is reproducible on x86_64 by setting 65535 to overcome the maximum limits of the Azure accounts we use.
            cachingType: None
            deletionPolicy: Delete
            managedDisk:
              storageAccountType: UltraSSD_LRS
7. Scale up the machineset or delete an existing machine to force reprovisioning.

Actual results:

The machine is stuck in the Provisioned phase, but Azure shows that it failed
$ oc get machine -o wide                
NAME                                        PHASE         TYPE               REGION      ZONE   AGE     NODE                                        PROVIDERID                                                                                                                                                                              STATE
zhsunaz3231-lds8h-master-0                  Running       Standard_D8ps_v5   centralus   1      4h15m   zhsunaz3231-lds8h-master-0                  azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-0                  Running
zhsunaz3231-lds8h-master-1                  Running       Standard_D8ps_v5   centralus   2      4h15m   zhsunaz3231-lds8h-master-1                  azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-1                  Running
zhsunaz3231-lds8h-master-2                  Running       Standard_D8ps_v5   centralus   3      4h15m   zhsunaz3231-lds8h-master-2                  azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-2                  Running
zhsunaz3231-lds8h-worker-centralus1-sfhs7   Provisioned   Standard_D4ps_v5   centralus   1      3m23s                                               azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-worker-centralus1-sfhs7   Creating

$ oc get machine zhsunaz3231-lds8h-worker-centralus1-sfhs7 -o yaml
  - lastTransitionTime: "2023-03-23T06:07:32Z"
    message: 'Failed to check if machine exists: vm for machine zhsunaz3231-lds8h-worker-centralus1-sfhs7
      exists, but has unexpected ''Failed'' provisioning state'
    reason: ErrorCheckingProvider
    status: Unknown
    type: InstanceExists
  - lastTransitionTime: "2023-03-23T06:07:05Z"
    status: "True"
    type: Terminable
  lastUpdated: "2023-03-23T06:07:32Z"
  phase: Provisioned

Expected results:

The machine should move to the Failed phase if the instance failed in Azure

Additional info:

must-gather: https://drive.google.com/file/d/1z1gyJg4NBT8JK2-aGvQCruJidDHs0DV6/view?usp=sharing
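
The expected result can be sketched as a small state mapping. This is a hypothetical illustration, not the Azure machine provider's code: the function name and the set of terminal states are assumptions, while the 'Failed' provisioning state and the Provisioned/Failed phases come from the output above.

```python
# Assumption for illustration: only 'Failed' is treated as a terminal VM state.
TERMINAL_FAILURE_STATES = {"Failed"}


def next_machine_phase(current_phase, vm_provisioning_state):
    """Return the phase a machine should report for a given VM state.

    A VM whose provisioning state is terminal moves the machine to Failed
    instead of leaving it in its current phase indefinitely.
    """
    if vm_provisioning_state in TERMINAL_FAILURE_STATES:
        return "Failed"
    return current_phase


if __name__ == "__main__":
    print(next_machine_phase("Provisioned", "Failed"))
```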

Description of the problem:

On staging, the Ignition override test used to pass. With the latest code, the API exception code returned appears to have changed to 500 (Internal Server Error).

Previously the API returned error code 400.

 

 

(Pdb++) cluster.patch_discovery_ignition(ignition=ignition_override)
 'image_type': None,
 'kernel_arguments': None,
 'proxy': None,
 'pull_secret': None,
 'ssh_authorized_key': None,
 'static_network_config': None}     (/home/benny/assisted-test-infra/src/service_client/assisted_service_api.py:169)
*** assisted_service_client.rest.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'vary': 'Accept-Encoding,Origin', 'date': 'Sun, 11 Jun 2023 04:26:53 GMT', 'content-length': '141', 'x-envoy-upstream-service-time': '1538', 'server': 'envoy', 'set-cookie': 'bd0de3dae0f495ebdb32e3693e2b9100=de3a34d29f1e78d0c404b6c5e84b502b; path=/; HttpOnly; Secure; SameSite=None'})
HTTP response body: {"code":"500","href":"","id":500,"kind":"Error","reason":"The ignition archive size (365 KiB) is over the maximum allowable size (256 KiB)"}
Traceback (most recent call last):
  File "/home/benny/assisted-test-infra/src/assisted_test_infra/test_infra/helper_classes/cluster.py", line 501, in patch_discovery_ignition
    self._infra_env.patch_discovery_ignition(ignition_info=ignition)
  File "/home/benny/assisted-test-infra/src/assisted_test_infra/test_infra/helper_classes/infra_env.py", line 116, in patch_discovery_ignition
    self.api_client.patch_discovery_ignition(infra_env_id=self.id, ignition_info=ignition_info)
  File "/home/benny/assisted-test-infra/src/service_client/assisted_service_api.py", line 407, in patch_discovery_ignition
    self.update_infra_env(infra_env_id=infra_env_id, infra_env_update_params=infra_env_update_params)
  File "/home/benny/assisted-test-infra/src/service_client/assisted_service_api.py", line 170, in update_infra_env
    self.client.update_infra_env(infra_env_id=infra_env_id, infra_env_update_params=infra_env_update_params)
  File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api/installer_api.py", line 1696, in update_infra_env
    (data) = self.update_infra_env_with_http_info(infra_env_id, infra_env_update_params, **kwargs)  # noqa: E501
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api/installer_api.py", line 1767, in update_infra_env_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api_client.py", line 325, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api_client.py", line 157, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api_client.py", line 383, in request
    return self.rest_client.PATCH(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/rest.py", line 289, in PATCH
    return self.request("PATCH", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/rest.py", line 228, in request
    raise ApiException(http_resp=r)
(Pdb++) 
 
 

How reproducible:

Always

 

Steps to reproduce:

Run test: 
test_discovery_ignition_exceed_size_limit

Actual results:

Returns error 500

Expected results:

Error 400
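
A minimal sketch of the expected behaviour, assuming the size check runs on the request payload: an oversized ignition archive is a client error, so it should map to 400 rather than 500. The function name and return shape are hypothetical; the 256 KiB limit and the reason text are taken from the response body quoted above.

```python
MAX_IGNITION_ARCHIVE_KIB = 256  # limit quoted in the error body above


def check_ignition_archive_size(size_bytes, limit_kib=MAX_IGNITION_ARCHIVE_KIB):
    """Return (400, reason) if the archive is too large, otherwise None."""
    if size_bytes > limit_kib * 1024:
        size_kib = size_bytes // 1024
        reason = (
            f"The ignition archive size ({size_kib} KiB) is over "
            f"the maximum allowable size ({limit_kib} KiB)"
        )
        return 400, reason
    return None


if __name__ == "__main__":
    print(check_ignition_archive_size(365 * 1024))
```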

Please review the following PR: https://github.com/openshift/telemeter/pull/452

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The upgrade to 4.14.0-ec.2 from 4.14.0-ec.1 was blocked by the error message on the UI:

Could not update rolebinding "openshift-monitoring/cluster-monitoring-operator-techpreview-only" (531 of 993): the object is invalid, possibly due to local cluster configuration

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:

Unblocked by 

oc --context build02 delete rolebinding cluster-monitoring-operator-techpreview-only -n openshift-monitoring --as system:admin
rolebinding.rbac.authorization.k8s.io "cluster-monitoring-operator-techpreview-only" deleted

Description of problem:

Some of the components in the Console Dynamic Plugin SDK take the `GroupVersionKind` type, which is a string, for the `groupVersionKind` prop; they should instead use the new `K8sGroupVersionKind` object.

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/192

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The agent-config-template creation command gives no INFO log in the output, even though it generates the file.

Version-Release number of selected component (if applicable):

v4.13

How reproducible:

$ openshift-install agent create agent-config-template --dir=./foo

Steps to Reproduce:

1.
2.
3.

Actual results:

$ openshift-install agent create agent-config-template --dir=./foo
INFO

Expected results:

 

Additional info:

$ openshift-install agent create agent-config-template --dir=./foo
INFO Created Agent Config Template in . directory

Description of problem:

On the openshift/console master branch, a devfile import fails by default. I have noticed that when a repository url has a .git extension, the pod fails due to a bug where the container image is trying to pull from dockerhub rather than the openshift image registry. For example, the container image is Image:          devfile-sample-code-with-quarkus.git:latest but the image from the imagestreamtag is image-registry.openshift-image-registry.svc:5000/maysun/devfile-sample-code-with-quarkus.git@sha256:e6aa9d29be48b33024eb271665d11a7557c9f140c9bd58aeb19fe4570fffb421.

A pod describe shows the expected error "Failed to pull image "devfile-sample-code-with-quarkus.git:latest": rpc error: code = Unknown desc = reading manifest latest in docker.io/library/devfile-sample-code-with-quarkus.git: requested access to the resource is denied".

However, if you remove the .git extension from the repository link during import, the import is successful.

I only see this on the master branch and it seems to be fine on my local crc which is on OpenShift version: 4.13.0
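
The workaround of dropping the .git extension can be expressed as a name-normalization step. The sketch below is an assumption about one possible approach, not the console's actual import code; the helper name and the example URL are hypothetical.

```python
from urllib.parse import urlparse


def name_from_repo_url(repo_url):
    """Derive a resource name from a Git URL, dropping a trailing '.git'."""
    path = urlparse(repo_url).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment.endswith(".git"):
        last_segment = last_segment[: -len(".git")]
    return last_segment


if __name__ == "__main__":
    # Hypothetical repository URL, used only to exercise the helper.
    print(name_from_repo_url(
        "https://example.com/org/devfile-sample-code-with-quarkus.git"))
```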

Version-Release number of selected component (if applicable):

4.13.z

How reproducible:

Always

Steps to Reproduce:

1. Build from openshift/console master
2. Import Devfile sample
3. If repo has a .git extension, pod fails with the wrong image

Actual results:

POD describe:

Failed to pull image "devfile-sample-code-with-quarkus.git:latest": rpc error: code = Unknown desc = reading manifest latest in docker.io/library/devfile-sample-code-with-quarkus.git: requested access to the resource is denied

Expected results:

Successful running pod

Additional info:

Fine on Openshift 4.13.0, tested on local crc:

$ crc version
WARN A new version (2.23.0) has been published on https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.23.0/crc-macos-installer.pkg 
CRC version: 2.20.0+f3a947
OpenShift version: 4.13.0
Podman version: 4.4.4

This is a clone of issue OCPBUGS-5969. The following is the description of the original issue:

Description of problem:

A Nutanix machine without enough memory is stuck in Provisioning, and machineset scale/delete does not work

Version-Release number of selected component (if applicable):

Server Version: 
4.12.0
4.13.0-0.nightly-2023-01-17-152326

How reproducible:

Always

Steps to Reproduce:

1. Install Nutanix Cluster 
Template https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/tree/master/functionality-testing/aos-4_12/ipi-on-nutanix//versioned-installer
master_num_memory: 32768
worker_num_memory: 16384
networkType: "OVNKubernetes"
installer_payload_image: quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64
2.
3. Scale up the cluster worker machineset from 2 replicas to 40 replicas
4. Install an infra machineset with 3 replicas and a workload machineset with 1 replica
Refer to this doc https://docs.openshift.com/container-platform/4.11/machine_management/creating-infrastructure-machinesets.html#machineset-yaml-nutanix_creating-infrastructure-machinesets and configure the following resources
VCPU=16
MEMORYMB=65536
MEMORYSIZE=64Gi

Actual results:

1. The new infra machines are stuck in 'Provisioning' status for about 3 hours.

% oc get machines -A | grep Prov                                               
openshift-machine-api   qili-nut-big-jh468-infra-48mdt      Provisioning                                      175m
openshift-machine-api   qili-nut-big-jh468-infra-jnznv      Provisioning                                      175m
openshift-machine-api   qili-nut-big-jh468-infra-xp7xb      Provisioning                                      175m

2. Checking the Nutanix web console, I found 
infra machine 'qili-nut-big-jh468-infra-jnznv' had the following msg
"
No host has enough available memory for VM qili-nut-big-jh468-infra-48mdt (8d7eb6d6-a71e-4943-943a-397596f30db2) that uses 4 vCPUs and 65536MB of memory. You could try downsizing the VM, increasing host memory, power off some VMs, or moving the VM to a different host. Maximum allowable VM size is approximately 17921 MB
"

infra machine 'qili-nut-big-jh468-infra-jnznv' is not found

infra machine 'qili-nut-big-jh468-infra-xp7xb' is in green without warning.
But in the must-gather I found this error:
03:23:49 openshift-machine-api nutanixcontroller qili-nut-big-jh468-infra-xp7xb FailedCreate qili-nut-big-jh468-infra-xp7xb: reconciler failed to Create machine: failed to update machine with vm state: qili-nut-big-jh468-infra-xp7xb: failed to get node qili-nut-big-jh468-infra-xp7xb: Node "qili-nut-big-jh468-infra-xp7xb" not found

3. Scaling down the worker machineset from 40 replicas to 30 replicas does not work. There are still 40 Running worker machines and 40 Ready nodes after about 3 hours.

% oc get machinesets -A
NAMESPACE               NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   qili-nut-big-jh468-infra      3         3                             176m
openshift-machine-api   qili-nut-big-jh468-worker     30        30        30      30          5h1m
openshift-machine-api   qili-nut-big-jh468-workload   1         1                             176m

% oc get machines -A | grep worker| grep Running -c
40

% oc get nodes | grep worker | grep Ready -c
40

4. I deleted the infra machineset, but the machines are still in Provisioning status and won't get deleted

% oc delete machineset -n openshift-machine-api   qili-nut-big-jh468-infra
machineset.machine.openshift.io "qili-nut-big-jh468-infra" deleted

% oc get machinesets -A
NAMESPACE               NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   qili-nut-big-jh468-worker     30        30        30      30          5h26m
openshift-machine-api   qili-nut-big-jh468-workload   1         1                             3h21m

% oc get machines -A | grep -v Running
NAMESPACE               NAME                                PHASE          TYPE   REGION    ZONE              AGE
openshift-machine-api   qili-nut-big-jh468-infra-48mdt      Provisioning                                      3h22m
openshift-machine-api   qili-nut-big-jh468-infra-jnznv      Provisioning                                      3h22m
openshift-machine-api   qili-nut-big-jh468-infra-xp7xb      Provisioning                                      3h22m
openshift-machine-api   qili-nut-big-jh468-workload-qdkvd                                                     3h22m

Expected results:

The new infra machines should be either Running or Failed.
Cluster worker machineset scale-up and scale-down should not be impacted.

Additional info:

must-gather download url will be added to the comment.

Description of problem:

On an SNO node one of the CatalogSources gets deleted after multiple reboots.

In the initial stage we have 2 catalogsources:

$ oc get catsrc -A
NAMESPACE NAME DISPLAY TYPE PUBLISHER AGE
openshift-marketplace certified-operators Intel SRIOV-FEC Operator grpc Red Hat 20h
openshift-marketplace redhat-operators Red Hat Operators Catalog grpc Red Hat 18h

After running several node reboots, one of the catalogsources no longer shows up:

$ oc get catsrc -A
NAMESPACE NAME DISPLAY TYPE PUBLISHER AGE
openshift-marketplace certified-operators Intel SRIOV-FEC Operator grpc Red Hat 21h

Version-Release number of selected component (if applicable):
4.11.0-fc.3

How reproducible:
Inconsistent but reproducible

Steps to Reproduce:

1. Deploy and configure SNO node via ZTP process. Configuration sets up 2 CatalogSources in a restricted environment for redhat-operators and certified-operators

  - apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: certified-operators
      namespace: openshift-marketplace
    spec:
      displayName: Intel SRIOV-FEC Operator
      image: registry.kni-qe-0.lab.eng.rdu2.redhat.com:5000/olm/far-edge-sriov-fec:v4.11
      publisher: Red Hat
      sourceType: grpc
  - apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: redhat-operators
      namespace: openshift-marketplace
    spec:
      displayName: Red Hat Operators Catalog
      image: registry.kni-qe-0.lab.eng.rdu2.redhat.com:5000/olm/redhat-operators:v4.11
      publisher: Red Hat
      sourceType: grpc

2. Reboot the node via `sudo reboot` several times

3. Check catalogsources

Actual results:

$ oc get catsrc -A
NAMESPACE NAME DISPLAY TYPE PUBLISHER AGE
openshift-marketplace certified-operators Intel SRIOV-FEC Operator grpc Red Hat 22h

Expected results:

All catalogsources created initially are still present.

Additional info:

Attaching must-gather.

Description of problem:

Users cannot install single-node-openshift if the hostname contains the word etcd

Version-Release number of selected component (if applicable):

Probably since 4.8

How reproducible:

100%

Steps to Reproduce:

1. Install SNO with either Assisted or BIP
2. Make sure node hostname is etcd-1 (e.g. via DHCP hostname)

Actual results:

Bootstrap phase never ends

Expected results:

Bootstrap phase should complete successfully

Additional info:

This code is the likely culprit - it uses a naive way to check if etcd is running, accidentally capturing the node name (which contains etcd) in the crictl output as "evidence" that etcd is still running, so it never completes.

See OCPBUGS-15826 (aka AITRIAGE-7677)
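
The failure mode described above can be illustrated with a small sketch. It is not the bootstrap script itself: the rows below are hypothetical (container name, pod name) pairs standing in for a container listing on a node whose hostname is etcd-1, and both function names are made up for the example.

```python
# Hypothetical listing taken after etcd has stopped on a node named "etcd-1".
# Static pod names embed the node hostname, so "etcd" still appears in them.
SAMPLE_ROWS = [
    ("kube-apiserver", "kube-apiserver-etcd-1"),
    ("kube-scheduler", "kube-scheduler-etcd-1"),
]


def naive_etcd_running(rows):
    """Substring search over the rendered listing, like grepping for 'etcd'."""
    listing = "\n".join(f"{container} {pod}" for container, pod in rows)
    return "etcd" in listing


def exact_etcd_running(rows):
    """Only a container actually named 'etcd' counts as evidence."""
    return any(container == "etcd" for container, _pod in rows)


if __name__ == "__main__":
    print(naive_etcd_running(SAMPLE_ROWS))  # True: the hostname is matched
    print(exact_etcd_running(SAMPLE_ROWS))  # False: no etcd container listed
```

Matching on the container name field rather than on the whole output keeps the check independent of the node's hostname.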

Description of problem:

CheckNodePerf runs on non-master nodes when the worker role label is not present.

Version-Release number of selected component (if applicable):

 

How reproducible:

In a VMware cluster, create an infra MCP and label a node as role:infra

vsphere-problem-detector-operator will produce CheckNodePerf alerts and logs like

CheckNodePerf: xxxxxx failed: master node has disk latency of greater than 100ms

https://docs.openshift.com/container-platform/4.10/machine_management/creating-infrastructure-machinesets.html#creating-infra-machines_creating-infrastructure-machinesets

Steps to Reproduce:

1.
2.
3.

Actual results:

CheckNodePerf: xxxxx failed: master node has disk latency of greater than 100ms

Expected results:

no log entry, and no alert

Additional info:

The code only considers the worker and master labels, and its conditions are nested in a very complex way.

https://github.com/openshift/vsphere-problem-detector/blob/ca408db88a70cfa5aefa3128dff971a555994c29/pkg/check/node_perf.go#L133-L143
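
One plausible reading of the report, sketched in Python rather than the operator's Go: a node without the worker label falls through to the master branch. The function names are hypothetical and the "buggy" variant is an assumption about the logic, not a transcription of the linked code; the label keys are the standard node-role labels.

```python
WORKER_LABEL = "node-role.kubernetes.io/worker"
MASTER_LABELS = (
    "node-role.kubernetes.io/master",
    "node-role.kubernetes.io/control-plane",
)


def node_kind_buggy(labels):
    """Assumed faulty logic: anything that is not a worker is a master."""
    return "worker" if WORKER_LABEL in labels else "master"


def node_kind(labels):
    """Treat a node as master only when it carries a master label."""
    if any(label in labels for label in MASTER_LABELS):
        return "master"
    return "non-master"


if __name__ == "__main__":
    infra_node = {"node-role.kubernetes.io/infra": ""}
    print(node_kind_buggy(infra_node))  # master
    print(node_kind(infra_node))        # non-master
```

Deciding on the presence of a master label, instead of the absence of a worker label, also flattens the nested conditions.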

 This will allow the installer to depend on just the client/api/models modules, and not pull in all of the dependencies of the service (such as libnmstate).

Description of problem:

When deploying a disconnected cluster with the installer, the image-registry operator will fail to deploy because it cannot reach the COS endpoint.

Version-Release number of selected component (if applicable):

 

How reproducible:

Easily

Steps to Reproduce:

1. Deploy a disconnected cluster with the installer
2. Watch the image-registry operator; it will fail to deploy

Actual results:

image-registry operator doesn't deploy because the COS endpoint is unreachable.

Expected results:

image-registry operator should deploy

Additional info:

Fix identified.

Sanitize OWNERS/OWNER_ALIASES:

1) OWNERS must have:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of Storage team.

This is a clone of issue OCPBUGS-18386. The following is the description of the original issue:

How reproducible:

Always

Steps to Reproduce:

1. The Kubernetes API introduces a new Pod Template parameter (`ephemeral`)
2. This parameter is not in the allowed list of the default SCC
3. Customers are not allowed to edit the default SCCs, and as far as I know we have no mechanism in place to update the built-in SCCs
4. Users of existing clusters cannot use the new parameter without manually creating SCCs and assigning them to service accounts themselves, which is clunky. This is documented in https://access.redhat.com/articles/6967808

Actual results:

Users of existing clusters cannot use ephemeral volumes after an upgrade

Expected results:

Users of existing clusters *can* use ephemeral volumes after an upgrade

Current status

Description of problem:

Deployment of a standard masters+workers cluster using 4.13.0-rc.6 does not configure the cgroup structure according to OCPNODE-1539

Version-Release number of selected component (if applicable):

OCP 4.13.0-rc.6

How reproducible:

Always

Steps to Reproduce:

1. Deploy the cluster
2. Check for presence of /sys/fs/cgroup/cpuset/system*
3. Check the status of cpu balancing of the root cpuset cgroup (should be disabled)
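
Steps 2 and 3 can be scripted roughly as below. This sketch assumes a cgroup v1 layout and assumes that "cpu balancing" refers to the `cpuset.sched_load_balance` file of the root cpuset; the function names are hypothetical.

```python
import glob
import os


def system_cpusets(cpuset_root="/sys/fs/cgroup/cpuset"):
    """Return the system* cpuset cgroup directories under cpuset_root."""
    pattern = os.path.join(cpuset_root, "system*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def root_balancing_enabled(cpuset_root="/sys/fs/cgroup/cpuset"):
    """Read the root cpuset's load-balancing flag ('1' means enabled)."""
    flag_file = os.path.join(cpuset_root, "cpuset.sched_load_balance")
    with open(flag_file) as handle:
        return handle.read().strip() == "1"


if __name__ == "__main__":
    print(system_cpusets())
    print(root_balancing_enabled())
```

On a correctly configured node, the first call would be expected to return at least one directory and the second to return False.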

Actual results:

No system cpuset exists and all services are still present in the root cgroup with cpu balancing enabled.

Expected results:

 

Additional info:

The code has a bug we missed. It is nested under the Workload partitioning check on line https://github.com/haircommander/cluster-node-tuning-operator/blob/123e26df30c66fd5c9836726bd3e4791dfd82309/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L251

This is a clone of issue OCPBUGS-18999. The following is the description of the original issue:

Description of problem:

Image pulls fail with http status 504, gateway timeout until image registry pods are restarted.

Version-Release number of selected component (if applicable):

4.13.12

How reproducible:

Intermittent

Steps to Reproduce:

1.
2.
3.

Actual results:

Images can't be pulled: 
podman pull registry.ci.openshift.org/ci/applyconfig:latest Trying to pull registry.ci.openshift.org/ci/applyconfig:latest... Getting image source signatures Error: reading signatures: downloading signatures for sha256:83c1b636069c3302f5ba5075ceeca5c4a271767900fee06b919efc3c8fa14984 in registry.ci.openshift.org/ci/applyconfig: received unexpected HTTP status: 504 Gateway Time-out


Image registry pods contain errors:
time="2023-09-01T02:25:39.596485238Z" level=warning msg="error authorizing context: access denied" go.version="go1.19.10 X:strictfipsruntime" http.request.host=registry.ci.openshift.org http.request.id=3e805818-515d-443f-8d9b-04667986611d http.request.method=GET http.request.remoteaddr=18.218.67.82 http.request.uri="/v2/ocp/4-dev-preview/manifests/sha256:caf073ce29232978c331d421c06ca5c2736ce5461962775fdd760b05fb2496a0" http.request.useragent="containers/5.24.1 (github.com/containers/image)" vars.name=ocp/4-dev-preview vars.reference="sha256:caf073ce29232978c331d421c06ca5c2736ce5461962775fdd760b05fb2496a0"

Expected results:

Image registry does not return gateway timeouts

Additional info:

Must gather(s) attached, additional information in linked OHSS ticket.

 

Please review the following PR: https://github.com/openshift/router/pull/455

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Unit test failing 

=== RUN   TestNewAppRunAll/app_generation_using_context_dir
    newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby'
    --- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)


Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648 

Actual results:

unit tests fail

Expected results:

TestNewAppRunAll unit test should pass

Additional info:

 

Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/70

Description of problem:

This Jira is filed to track the upstream issue (fix and backport):
https://github.com/kubernetes-sigs/azuredisk-csi-driver/issues/1893

Version-Release number of selected component (if applicable):

4.14

Description of problem:
An unprivileged user with the cluster-readers role cannot view NetworkAttachmentDefinition resources.

Version-Release number of selected component (if applicable):
oc Version: 4.10.0-202203141248.p0.g6db43e2.assembly.stream-6db43e2
OCP Version: 4.10.4
Kubernetes Version: v1.23.3+e419edf
ose-multus-cni:v4.1.0-7.155662231

How reproducible:
100%

Steps to Reproduce:
1. In an OCP cluster with Multus installed, check which roles can view ("get") the NetworkAttachmentDefinition resource, and whether the "cluster-readers" role is in that list, by running:
$ oc adm policy who-can get network-attachment-definitions | grep "cluster-reader"

Actual results:
Empty output

Expected results:
Non-empty output with "cluster-readers" in it, e.g. when running the same command for the Namespace resource:
$ oc adm policy who-can get namespace | grep "cluster-reader"
system:cluster-readers

Description of problem:

After upgrading from OpenShift 4.13 to 4.14 with the Kuryr network type, the network operator is Degraded and the cluster version reports that it is unable to apply the 4.14 update. The issue seems to be related to MTU settings, as indicated by the message: "Not applying unsafe configuration change: invalid configuration: [cannot change mtu for the Pods Network]."

Version-Release number of selected component (if applicable):

Upgrading from 4.13 to 4.14
4.14.0-0.nightly-2023-09-15-233408
Kuryr network type
RHOS-17.1-RHEL-9-20230907.n.1

How reproducible:

Consistently reproducible on attempting to upgrade from 4.13 to 4.14.

Steps to Reproduce:

1. Install OpenShift version 4.13 on OpenStack.
2. Initiate an upgrade to OpenShift version 4.14.

Actual results:

The network operator shows as Degraded with the message:

network                                    4.13.13                              True        False         True       13h     Not applying unsafe configuration change: invalid configuration: [cannot change mtu for the Pods Network]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
 
Additionally, "oc get clusterversions" shows:

Unable to apply 4.14.0-0.nightly-2023-09-15-233408: wait has exceeded 40 minutes for these operators: network

Expected results:

The upgrade should complete successfully without any operator being degraded.

Additional info:

Some components remain at version 4.13.13 despite the upgrade attempt. Specifically, the dns, machine-config, and network operators are still at version 4.13.13:

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE                                                                                                         
authentication                             4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
baremetal                                  4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
cloud-controller-manager                   4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
cloud-credential                           4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
cluster-autoscaler                         4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
config-operator                            4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
console                                    4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
control-plane-machine-set                  4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
csi-snapshot-controller                    4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
dns                                        4.13.13                              True        False         False      13h                                                                                                                     
etcd                                       4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
image-registry                             4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
ingress                                    4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
insights                                   4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
kube-apiserver                             4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
kube-controller-manager                    4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
kube-scheduler                             4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
kube-storage-version-migrator              4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
machine-api                                4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
machine-approver                           4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
machine-config                             4.13.13                              True        False         False      13h                                                                                                                     
marketplace                                4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
monitoring                                 4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
network                                    4.13.13                              True        False         True       13h     Not applying unsafe configuration change: invalid configuration: [cannot change mtu for the Pods Network]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
node-tuning                                4.14.0-0.nightly-2023-09-15-233408   True        False         False      12h                                                                                                                     
openshift-apiserver                        4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
openshift-controller-manager               4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
openshift-samples                          4.14.0-0.nightly-2023-09-15-233408   True        False         False      12h                                                                                                                     
operator-lifecycle-manager                 4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
operator-lifecycle-manager-catalog         4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
operator-lifecycle-manager-packageserver   4.14.0-0.nightly-2023-09-15-233408   True        False         False      12h                                                                                                                     
service-ca                                 4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h                                                                                                                     
storage                                    4.14.0-0.nightly-2023-09-15-233408   True        False         False      13h  

Description of problem:

Updating the k* version to v0.27.2 in cluster samples operator for OCP 4.14 release

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

In a fully disconnected environment, synchronizing twice with the target mirror returns an error when nothing has changed between the first and the second synchronization. The first synchronization works; the second fails with an error and exit code -1.

 

This happens whenever a disconnected registry is synchronized regularly and there is no change between two synchronizations.

This case is presented hereafter:
https://docs.openshift.com/container-platform/4.11/installing/disconnected_install/installing-mirroring-disconnected.html#oc-mirror-differential-updates_installing-mirroring-disconnected

The documentation states:

"Like this, the desired mirror content can be declared in the imageset configuration file statically while the mirror jobs are executed regularly, for example as part of a cron job. This way, the mirror can be kept up to date in an automated fashion"

The main question is how to synchronize a fully disconnected registry regularly (with no change between synchronizations) without an error being returned.

 

Version-Release number of selected component (if applicable):

oc-mirror 4.11

 

How reproducible:

Follow https://docs.openshift.com/container-platform/4.11/installing/disconnected_install/installing-mirroring-disconnected.html#mirroring-image-set-full and synchronize twice with the target mirror.

 

Steps to Reproduce:

1. oc-mirror --from=output-dir/mirror_seq1_000000.tar  docker://quay-server.example.com/foo --dest-skip-tls 
2. oc-mirror --from=output-dir/mirror_seq1_000000.tar  docker://quay-server.example.com/foo --dest-skip-tls  

Actual results:

oc-mirror --from=output-dir/mirror_seq1_000000.tar  docker://quay-server.example.com/foo --dest-skip-tls 
Checking push permissions for quay-server.example.com
Publishing image set from archive "output-dir/mirror_seq1_000000.tar" to registry "quay-server.example.com"
error: error during publishing, expecting imageset with prefix mirror_seq2: invalid mirror sequence order, want 2, got 1

=> return -1

Expected results:

oc-mirror --from=output-dir/mirror_seq1_000000.tar  docker://quay-server.example.com/foo --dest-skip-tls 
...
No diff from last synchronization, nothing to do

=> return 0

 

Additional info:

The error is triggered in pkg/cli/mirror/sequence.go:

+       default:
+               // Complete metadata checks
+               // UUID mismatch will now be seen as a new workspace.
+               klog.V(3).Info("Checking metadata sequence number")
+               currRun := current.PastMirror
+               incomingRun := incoming.PastMirror
+               if incomingRun.Sequence != (currRun.Sequence + 1) {
+                       return &ErrInvalidSequence{currRun.Sequence + 1, incomingRun.Sequence}
+               }

The error handling in ./pkg/cli/mirror/mirror.go could instead log a warning that there is no difference and return 0 rather than -1:

          }
        case diskToMirror:
                dir, err := o.createResultsDir()
                if err != nil {
                        return err
                }
                o.OutputDir = dir

                // Publish from disk to registry
                // this takes care of syncing the metadata to the
                // registry backends.
                mapping, err = o.Publish(ctx)
                if err != nil {
                        serr := &ErrInvalidSequence{}
                        if errors.As(err, &serr) {
                                return fmt.Errorf("error during publishing, expecting imageset with prefix mirror_seq%d: %v", serr.wantSeq, err)
                        }
                        return err
                }

 

Description of problem:

OSSM daily builds were updated to no longer support the spec.techPreview.controlPlaneMode field, and OSSM will not create an SMCP as a result. The field needs to be updated to spec.mode.

Gateway API enhanced dev preview is currently broken (currently using latest 2.4 daily build because 2.4 is unreleased). This should be resolved before OSSM 2.4 is GA.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

100%

Steps to Reproduce:

1. Follow instructions in http://pastebin.test.redhat.com/1092754

Actual results:

CIO fails to create an SMCP

"error": "failed to create ServiceMeshControlPlane openshift-ingress/openshift-gateway: admission webhook \"smcp.validation.maistra.io\" denied the request: the spec.techPreview.controlPlaneMode field is not supported in version 2.4+; use spec.mode"

Expected results:

CIO is able to create an SMCP

Additional info:

 

Description of the problem:
The e2e-metal-assisted-day2-arm-workers-periodic job fails to install the day2 ARM worker because the service marks the setup as incompatible:

  time="2023-04-04T12:03:37Z" level=error msg="cannot use arm64 architecture because it's not compatible on version  of OpenShift" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).handlerClusterInfoOnRegisterInfraEnv" file="/assisted-service/internal/bminventory/inventory.go:4466" pkg=Inventory
time="2023-04-04T12:03:37Z" level=error msg="Failed to register InfraEnv test-infra-infra-env-fd527e12 with id 3e21770d-d607-431c-967c-5f632bec0cfb. Error: cannot use arm64 architecture because it's not compatible on version  of OpenShift" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterInfraEnvInternal.func1" file="/assisted-service/internal/bminventory/inventory.go:4528" cluster_id=3e21770d-d607-431c-967c-5f632bec0cfb go-id=235 pkg=Inventory request_id=f8dd7eeb-efa7-4828-a8c5-e1486a8bc1d2

See https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-test-infra/2109/pull-ci-openshift-assisted-test-infra-master-e2e-metal-assisted-day2-arm-workers/1643199500098998272

How reproducible:

Run the job e2e-metal-assisted-day2-arm-workers which:

  • Installs a day1 x86 cluster
  • Adds a day2 ARM worker to the day1 x86 cluster

Steps to reproduce:

Actual results:

The job fails to add the day2 worker and the assisted service log shows:
"Error: cannot use arm64 architecture because it's not compatible on version of OpenShift"
 

Expected results:

The installation of the day2 ARM worker succeeds without errors.

Elior Erez, I am assigning this ticket to you as it looks like it is linked to the feature support code. Can you have a look?

Description of problem:

PRs were previously merged to add SC2S support via AWS SDK here:

https://github.com/openshift/installer/pull/5710
https://github.com/openshift/installer/pull/5597
https://github.com/openshift/cluster-ingress-operator/pull/703

However, further updates to add support for SC2S region (us-isob-east-1) and new TC2S region (us-iso-west-1) are still required.           

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

1. Try to deploy a cluster on us-isob-east-1 or us-iso-west-1

Actual results:

Regions are not supported

Expected results:

 

Additional info:

Both TC2S and SC2S support ALIAS records now.

Description of problem:

For unknown reasons, the management cluster AWS endpoint service sometimes has an active connection leftover. This blocks the uninstallation, as the AWS endpoint service cannot be deleted before this connection is rejected.

Version-Release number of selected component (if applicable):

4.12.z,4.13.z,4.14.z

How reproducible:

Irregular

Steps to Reproduce:

Actual results:

AWSEndpointService cannot be deleted by the hypershift operator, the uninstallation is stuck

Expected results:

There are no leftover active AWSEndpoint connections when deleting the AWSEndpointService and it can be deleted properly.

OR

The HyperShift operator rejects active endpoint connections when trying to delete AWSEndpointServices from the management cluster AWS account

Additional info:

Added mustgathers in comment. 

Description of problem:

In the Konnectivity SOCKS proxy, the current default is to proxy cloud endpoint traffic: https://github.com/openshift/hypershift/blob/main/konnectivity-socks5-proxy/main.go#L61

Because of this default, after this change: https://github.com/openshift/hypershift/commit/0c52476957f5658cfd156656938ae1d08784b202

the oauth server began to proxy IAM traffic, which it previously did not. This causes a regression in Satellite environments running with an HTTP_PROXY server. The original network traffic path needs to be restored.

Version-Release number of selected component (if applicable):

4.13, 4.12

How reproducible:

100%

Steps to Reproduce:

1. Set up an IBM Cloud Satellite environment with HTTP_PROXY
2. In the oauth-server pod, run curl against IAM (curl -v https://iam.cloud.ibm.com)
3. The output shows that the proxy is being used

Actual results:

The traffic goes through the proxy

Expected results:

It should send traffic directly (as it does in 4.11 and 4.10)

Additional info:

 

This is a clone of issue OCPBUGS-18830. The following is the description of the original issue:

Description of problem:

Installing a cluster in the SC2S region fails with:

level=error msg=Error: reading Security Group (sg-0b0cd054dd599602f) Rules: UnsupportedOperation: The functionality you requested is not available in this region. 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-11-201102
 

How reproducible:

Always
 

Steps to Reproduce:

1. Create an OCP cluster on SC2S

Actual results:

Install fails:
level=error msg=Error: reading Security Group (sg-0b0cd054dd599602f) Rules: UnsupportedOperation: The functionality you requested is not available in this region.

Expected results:

Install succeeds.
 

Additional info:

* C2S region is not affected

Description of problem:

When you migrate a HostedCluster, the AWSEndpointService from the old management cluster conflicts with the one on the new management cluster. The AWSPrivateLink_Controller does not perform any validation when this happens, and that validation is needed for the Disaster Recovery HostedCluster migration to work. The issue shows up when the nodes of the HostedCluster cannot join the new management cluster because the AWSEndpointServiceName still points to the old one.

Version-Release number of selected component (if applicable):

4.12
4.13
4.14

How reproducible:

Follow the migration procedure from the upstream documentation; the nodes in the destination HostedCluster will stay in the NotReady state.

Steps to Reproduce:

1. Set up a management cluster with the 4.12-13-14/main version of the HyperShift operator.
2. Run the in-place node DR Migrate E2E test from this PR https://github.com/openshift/hypershift/pull/2138:
bin/test-e2e \
  -test.v \
  -test.timeout=2h10m \
  -test.run=TestInPlaceUpgradeNodePool \
  --e2e.aws-credentials-file=$HOME/.aws/credentials \
  --e2e.aws-region=us-west-1 \
  --e2e.aws-zones=us-west-1a \
  --e2e.pull-secret-file=$HOME/.pull-secret \
  --e2e.base-domain=www.mydomain.com \
  --e2e.latest-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.previous-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.skip-api-budget \
  --e2e.aws-endpoint-access=PublicAndPrivate

Actual results:

The nodes stay in NotReady state

Expected results:

The nodes should join the migrated HostedCluster

Additional info:

 

Description of problem:

When forcing a reboot of a BMH with the annotation reboot.metal3.io: '{"force": true}' after setting a new preprovisioningimage URL, the host never reboots.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-05-03-150228

How reproducible:

100%

Steps to Reproduce:

1. Create a BMH and stall the provisioning process at "provisioning"
2. Set a new URL in the preprovisioningimage
3. Set the force reboot annotation on the BMH (reboot.metal3.io: '{"force": true}')

Actual results:

Host does not reboot and the annotation remains on the BMH

Expected results:

Host reboots into the new image

Additional info:

This was reproduced using assisted installer (MCE central infrastructure management)

This is a ticket created based off a GitHub comment from a random user

Description of the problem:

 See GitHub comment

How reproducible:

 Unknown

Steps to reproduce:

1. See GitHub comment

Actual results:

DNS wildcard validation failure is a false positive

Expected results:

DNS wildcard validation should probably avoid domain-search
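
One common way to keep a lookup from being expanded with resolver search domains is to query an absolute (rooted) name, that is, one with a trailing dot. The sketch below only illustrates that general DNS technique; it is not the assisted-service validation code, and the helper name `rooted` and the probe hostname are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// rooted returns name as an absolute DNS name by appending a trailing dot
// when it is missing. Resolvers do not append search domains to absolute
// names, so a wildcard probe is only answered for the name actually asked.
func rooted(name string) string {
	if strings.HasSuffix(name, ".") {
		return name
	}
	return name + "."
}

func main() {
	// Hypothetical probe name; the name used by the real validation may differ.
	probe := rooted("wildcard-probe.test-cluster.example.com")
	fmt.Println(probe)
	// The rooted name could then be passed to a resolver call such as
	// net.LookupHost(probe) on the host being validated.
}
```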

Description of problem:

During cluster installation, if the host systems have multiple dual-stack interfaces configured via install-config.yaml, the installation fails. Notably, a single-stack IPv4 installation with multiple interfaces succeeds, and a dual-stack installation with only a single interface also succeeds.

Version-Release number of selected component (if applicable):

Reproduced on 4.12.1 and 4.12.7

How reproducible:

100%

Steps to Reproduce:

1. Assign an IPv4 and an IPv6 address to both the apiVIPs and ingressVIPs parameters in the install-config.yaml
2. Configure all hosts with at least two interfaces in the install-config.yaml
3. Assign an IPv4 and an IPv6 address to each interface in the install-config.yaml
4. Begin cluster installation and wait for failure

Actual results:

Failed cluster installation

Expected results:

Successful cluster installation

Additional info:

 

The CLI option --logtostderr was removed in prometheus-adapter v0.11. CMO uses this argument, and this currently blocks the update to v0.11: https://github.com/openshift/k8s-prometheus-adapter/pull/72

IIUC, we can simply drop this argument.

Description of problem:

SNO installation does not finish because machine-config is waiting for a nonexistent machine config.

 oc get co machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config             True        True          True       14h     Unable to apply 4.14.0-0.nightly-2023-08-23-075058: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]

oc -n openshift-machine-config-operator logs machine-config-daemon-2stpc --tail 5
Defaulted container "machine-config-daemon" out of: machine-config-daemon, kube-rbac-proxy
I0824 07:39:12.117508   22874 daemon.go:1370] In bootstrap mode
E0824 07:39:12.117525   22874 writer.go:226] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-231b9341930d0616544ad05989a5c1b8" not found
W0824 07:40:12.131400   22874 daemon.go:1630] Failed to persist NIC names: open /etc/systemd/network: no such file or directory
I0824 07:40:12.131417   22874 daemon.go:1370] In bootstrap mode
E0824 07:40:12.131429   22874 writer.go:226] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-231b9341930d0616544ad05989a5c1b8" not found

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-23-075058

How reproducible:

100%

Steps to Reproduce:

1. Deploy SNO with Telco DU profile
2. Wait for installation to finish

Actual results:

Installation doesn't complete due to master MCP being degraded waiting for a non-existing machineconfig.

Expected results:

Installation succeeds.

Additional info:

Attaching sosreport and must-gather

This is a clone of issue OCPBUGS-18113. The following is the description of the original issue:

Description of problem:

When the installer generates a CPMS, it should only add the `failureDomains` field when there is more than one failure domain. When there is only one failure domain, the fields from the failure domain, e.g. the zone, should be injected directly into the provider spec and the failure domain should be omitted.

By doing this, we avoid having to care about failure domain injection logic for single-zone clusters, potentially avoiding bugs (such as some we have seen recently).

IIRC we already did this for OpenStack, but AWS, Azure and GCP may not be affected.

Version-Release number of selected component (if applicable):

 

How reproducible:

Can be demonstrated on Azure on the westus region which has no AZs available. Currently the installer creates the following, which we can omit entirely:
```
failureDomains:
  platform: Azure
  azure:
  - zone: ""
```
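
The rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the installer code: `FailureDomain`, `ProviderSpec` and `normalize` are hypothetical stand-ins that model only the zone, whereas the real CPMS types carry many more fields.

```go
package main

import "fmt"

// FailureDomain and ProviderSpec are simplified, hypothetical stand-ins for
// the real CPMS types; only the zone is modelled here.
type FailureDomain struct {
	Zone string
}

type ProviderSpec struct {
	Zone string
}

// normalize returns the provider spec and failure domains to write into the
// generated CPMS. With zero or one failure domain the failureDomains field
// is omitted (nil); a single domain's zone is injected into the spec.
func normalize(spec ProviderSpec, domains []FailureDomain) (ProviderSpec, []FailureDomain) {
	switch len(domains) {
	case 0:
		return spec, nil
	case 1:
		spec.Zone = domains[0].Zone
		return spec, nil
	default:
		return spec, domains
	}
}

func main() {
	// Single domain with an empty zone, as in the westus example above.
	spec, fds := normalize(ProviderSpec{}, []FailureDomain{{Zone: ""}})
	fmt.Printf("single: spec=%+v failureDomains=%v\n", spec, fds)

	// Several zones: the failure domains are kept as they are.
	spec, fds = normalize(ProviderSpec{}, []FailureDomain{{Zone: "1"}, {Zone: "2"}})
	fmt.Printf("multi:  spec=%+v failureDomains=%v\n", spec, fds)
}
```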

Steps to Reproduce:

Actual results:

 

Expected results:

 

Additional info:

 

Apart from the default SC, we should check whether non-default SCs created on the vSphere platform use a datastore that OCP can access and has the necessary permissions for.

This will avoid hard-to-debug errors in cases where a customer creates an additional SC but forgets to grant the necessary permissions on the newer datastore.

Description of problem:

When creating a sample Devfile from the Samples page, the corresponding Topology icon for the app is not set. This issue is not observed when we create a BuilderImage sample from the Samples page.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Create a Sample Devfile App from the Samples Page
2. Go to the Topology Page and check the icon of the app created.

Actual results:

The generic OpenShift logo is displayed

Expected results:

Need to show the corresponding app icon (Golang, Quarkus, etc.)

Additional info:

When creating a BuilderImage sample, the icon is set properly according to the BuilderImage used.

Current label: app.openshift.io/runtime=dotnet-basic
Change to: app.openshift.io/runtime=dotnet

Please review the following PR: https://github.com/openshift/configmap-reload/pull/51

Description of problem:

4.14-e2e-metal-ipi-sdn-bm jobs are failing with:

 

2023-08-29 15:43:27.066 1 ERROR ironic.api.method [None req-00977b71-1b61-4452-8f6c-a43a47b1e92e - - - - - -] Server-side error: "<Future at 0x7fe7b2b86250 state=finished raised OperationalError>". Detail: 
Traceback (most recent call last):
File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1089, in _commit_impl
self.engine.dialect.do_commit(self.connection)
File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/default.py", line 686, in do_commit
dbapi_connection.commit()
sqlite3.OperationalError: database is locked

 

Description of problem:

Install issues for 4.14 and 4.15 where we lose contact with the kubelet on master nodes.

https://search.ci.openshift.org/?search=Kubelet+stopped+posting+node+status&maxAge=168h&context=1&type=build-log&name=periodic.*4.14.*azure.*sdn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

This search shows it is happening on about 35% of Azure SDN 4.14 jobs over at least the past week. There are no OVN hits.

1703590387039342592/artifacts/e2e-azure-sdn-upgrade/gather-extra/artifacts/nodes.json

                    {
                        "lastHeartbeatTime": "2023-09-18T02:33:11Z",
                        "lastTransitionTime": "2023-09-18T02:35:39Z",
                        "message": "Kubelet stopped posting node status.",
                        "reason": "NodeStatusUnknown",
                        "status": "Unknown",
                        "type": "Ready"
                    }
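
A condition like the one above can be pulled out of a gathered nodes.json with a short script. This is a sketch that assumes the file is a standard NodeList (the shape produced by `oc get nodes -o json`); the function name is illustrative.

```python
import json


def nodes_not_ready(nodes_json: str) -> list:
    """Return (node name, reason, lastHeartbeatTime) for every node whose
    Ready condition is anything other than "True"."""
    doc = json.loads(nodes_json)
    result = []
    for node in doc.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") != "True":
                result.append(
                    (
                        node["metadata"]["name"],
                        cond.get("reason"),
                        cond.get("lastHeartbeatTime"),
                    )
                )
    return result


if __name__ == "__main__":
    import sys

    # Usage: python nodes_not_ready.py path/to/nodes.json
    with open(sys.argv[1]) as fh:
        for name, reason, heartbeat in nodes_not_ready(fh.read()):
            print(name, reason, heartbeat)
```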

4.14 is interesting as it is a minor upgrade from 4.13 and we see the install failures with a master node dropping out.

Focusing on periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1703590387039342592

Build log shows

INFO[2023-09-18T02:03:03Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.13.0-0.ci-2023-09-17-050449) 

ipi-azure-conf shows region centralus (not the single zone westus)

get ocp version: 4.13
/output
Azure region: centralus

oc_cmds/nodes shows master-1 not ready

ci-op-82xkimh8-0dd98-9g9wh-master-1                  NotReady   control-plane,master   82m   v1.26.7+c7ee51f   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 413.92.202309141211-0 (Plow)  

ci-op-82xkimh8-0dd98-9g9wh-master-1-boot.log shows ignition

install log shows we have lost contact

time="2023-09-18T03:15:33Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-0, Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-2]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-82xkimh8-0dd98-9g9wh-master-1\" not ready since 2023-09-18 02:35:39 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"

4.15 4.15.0-0.ci-2023-09-17-172341 and 4.14 4.14.0-0.ci-2023-09-18-020137

Version-Release number of selected component (if applicable):

 

How reproducible:

We are seeing this on a high number of failed payloads for 4.14 && 4.15. Additional recent failures

4.14.0-0.ci-2023-09-17-012321
aggregated-azure-sdn-upgrade-4.14-minor shows failures like "Passed 5 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success", indicating that only 5 of the 10 runs were valid.
Checking install logs shows we have lost master-2

time="2023-09-17T02:44:22Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-1, Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-0]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-crj5cf00-0dd98-p5snd-master-2\" not ready since 2023-09-17 02:01:49 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"

oc_cmds/nodes also shows master-2 not ready

4.15.0-0.nightly-2023-09-17-113421: install analysis failed due to azure tech preview; oc_cmds/nodes shows master-1 not ready.

4.15.0-0.ci-2023-09-17-112341: aggregated-azure-sdn-upgrade-4.15-minor, only 5 of 10 runs are valid; sample oc_cmds/nodes shows master-0 not ready.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When using the k8sResourcePrefix x-descriptor with custom resource kinds, the form-view dropdown currently doesn't accept the initial user selection, requiring the user to make their selection twice. Also, if the configuration panel contains multiple custom resource dropdowns, each previous dropdown selection on the panel is cleared each time the user configures another custom resource dropdown, requiring the user to reconfigure each previous selection.

Here's an example of my configuration:

        specDescriptors:
          - displayName: Collection
            path: collection
            x-descriptors:
              - >-
                urn:alm:descriptor:io.kubernetes:abc.zzz.com:v1beta1:Collection
          - displayName: Endpoints
            path: 'mapping[0].endpoints[0].name'
            x-descriptors:
              - >-
                urn:alm:descriptor:io.kubernetes:abc.zzz.com:v1beta1:Endpoint
          - displayName: Requested Credential Secret
            path: 'mapping[0].endpoints[0].credentialName'
            x-descriptors:
              - 'urn:alm:descriptor:io.kubernetes:Secret'
          - displayName: Namespaces
            path: 'mapping[0].namespace'
            x-descriptors:
              - 'urn:alm:descriptor:io.kubernetes:Namespace'
With this configuration, when a user wants to select a Collection or Endpoint from the form view dropdown, the user is forced to make their selection twice before the selection is accepted in the dropdown. Also, if the user configures the Collection dropdown and then decides to configure the Endpoint dropdown, the Collection dropdown is cleared once the Endpoint selection is made.
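
To make the descriptor format in the configuration above concrete, here is a rough Python sketch that splits such a descriptor into group, version and kind. It assumes the suffix after the `urn:alm:descriptor:io.kubernetes:` prefix is either `Kind` or `group:version:Kind`, as in the examples above. The function name is hypothetical, and this is not the console's implementation.

```python
from typing import Optional

K8S_RESOURCE_PREFIX = "urn:alm:descriptor:io.kubernetes:"


def parse_k8s_resource_descriptor(descriptor: str) -> Optional[dict]:
    """Split a k8sResourcePrefix x-descriptor into group/version/kind.

    Returns None when the string does not use the prefix or does not have
    one of the two assumed shapes.
    """
    if not descriptor.startswith(K8S_RESOURCE_PREFIX):
        return None
    parts = descriptor[len(K8S_RESOURCE_PREFIX):].split(":")
    if len(parts) == 1 and parts[0]:
        # Built-in kind such as Secret or Namespace: no group/version given.
        return {"group": None, "version": None, "kind": parts[0]}
    if len(parts) == 3 and all(parts):
        group, version, kind = parts
        return {"group": group, "version": version, "kind": kind}
    return None


if __name__ == "__main__":
    for d in (
        "urn:alm:descriptor:io.kubernetes:abc.zzz.com:v1beta1:Collection",
        "urn:alm:descriptor:io.kubernetes:Secret",
    ):
        print(parse_k8s_resource_descriptor(d))
```

The bug described here affects only the first shape (custom resource kinds); the Secret and Namespace dropdowns use the second.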

Version-Release number of selected component (if applicable):

4.8

How reproducible:

Always

Steps to Reproduce:

1. Create a new project: 
  oc new-project descriptor-test
2. Create the resources in this gist: 
  oc create -f https://gist.github.com/TheRealJon/99aa89c4af87c4b68cd92a544cd7c08e/raw/a633ad172ff071232620913d16ebe929430fd77a/reproducer.yaml
3. In the admin console, go to the installed operators page in project 'descriptor-test'
4. Select Mock Operator from the list
5. Select "Create instance" in the Mock Resource provided API card
6. Scroll to the field-1
7. Select 'example-1' from the dropdown

Actual results:

Selection is not retained on the first click.

Expected results:

The selection should be retained on the first click.

Additional info:

In addition to this behavior, if a form has multiple k8sResourcePrefix dropdown fields, they all get cleared when attempting to select an item from one of them.

Description of problem:

The kube apiserver manages the endpoints resource of the default/kubernetes service so that pods can access the kube apiserver. It does this via the --advertise-address flag and the container port for the kube apiserver pod. Currently the HCCO overwrites the endpoints resource with another port. This conflicts with what the KAS manages; the HCCO should not do that.

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Create an AWS publicAndPrivate cluster with DNS hostnames and a Route publishing strategy for the apiserver.

Actual results:

The HCCO overwrites the default/kubernetes endpoints resource in the guest cluster.

Expected results:

The HCCO does not overwrite the default/kubernetes endpoints resource 

Additional info:

 

Description of problem:

When a cluster has operators in an abnormal state, running `oc adm must-gather` exits with code 1.

Version-Release number of selected component (if applicable):

4.12/4.13

Actual results:

     [must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-gfcpc deleted
      
      
      Reprinting Cluster State:
      When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
      ClusterID: 0ba6ca81-e6d8-4d15-b345-70f81bd5a005
      ClusterVersion: Stable at "4.13.0-0.nightly-2023-04-01-062001"
      ClusterOperators:
      	clusteroperator/cloud-credential is not upgradeable because Upgradeable annotation cloudcredential.openshift.io/upgradeable-to on cloudcredential.operator.openshift.io/cluster object needs updating before upgrade. See Manually Creating IAM documentation for instructions on preparing a cluster for upgrade.
      	clusteroperator/ingress is progressing: ingresscontroller "test-34166" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...
      ).
      Not all ingress controllers are available.
      
      
      
      STDERR:
      error: yaml: line 7: did not find expected key
      [08:06:46] INFO> Exit Status: 1
Expected results:

The abnormal status of any operator should not affect must-gather's exit code.

Additional info:

Description of problem:

Alibaba clusters were never declared GA. They are still in TechPreview.
We do not allow upgrades between TechPreview clusters in minor streams (e.g. 4.12 to 4.13).

To allow a future deprecation and removal of the platform, we will prevent upgrades past 4.13.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a manual clone of https://issues.redhat.com/browse/OCPBUGS-18902 for backporting purposes.

 

In this recently merged PR, a number of API calls do not use caches, causing excessive calls.

Done when:

- Change all Get() calls to use listers

- API call metric should decrease
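
The effect of moving from direct Get() calls to a cache-backed lister can be illustrated with a toy model. This is a Python sketch with hypothetical class names, not the operator's Go code: a real lister is fed by an informer that keeps the cache current through a watch, whereas this one only does a single initial list.

```python
class CountingClient:
    """Stand-in for an API client that counts every round trip."""

    def __init__(self, objects: dict):
        self._objects = dict(objects)
        self.api_calls = 0

    def get(self, name: str):
        self.api_calls += 1
        return self._objects[name]

    def list(self) -> dict:
        self.api_calls += 1
        return dict(self._objects)


class Lister:
    """Serves reads from a local cache filled by one list call."""

    def __init__(self, client: CountingClient):
        self._cache = client.list()

    def get(self, name: str):
        return self._cache[name]


def reconcile_direct(client: CountingClient, names: list, rounds: int) -> int:
    """Read every object through the API on each reconcile round."""
    for _ in range(rounds):
        for name in names:
            client.get(name)
    return client.api_calls


def reconcile_cached(client: CountingClient, names: list, rounds: int) -> int:
    """Read every object through the lister on each reconcile round."""
    lister = Lister(client)
    for _ in range(rounds):
        for name in names:
            lister.get(name)
    return client.api_calls


if __name__ == "__main__":
    objs = {"a": 1, "b": 2}
    print(reconcile_direct(CountingClient(objs), ["a", "b"], rounds=50))
    print(reconcile_cached(CountingClient(objs), ["a", "b"], rounds=50))
```

With direct reads the call count grows with rounds times objects; with the cache it stays constant, which is the decrease the API call metric should show.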

When a HostedCluster is configured as `Private`, annotate the necessary hosted CP components (API and OAuth) so that External DNS can still create public DNS records (pointing to private IP resources).

The External DNS record should be pointing to the resource for the PrivateLink VPC Endpoint. "We need to specify the IP of the A record. We can do that with a cluster IP service."

Context: https://redhat-internal.slack.com/archives/C01C8502FMM/p1675432805760719

aws-ebs-csi-driver-operator ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.

Description of problem:

Quoting Joel: "In 4.14 there's been an effort to make Machine API optional. Anything that relies on the CRD needs to be able to detect that the CRD is not installed and then not error should that be the case. You should be able to use a discovery client to determine if the API group is installed or not."

We have several controllers and informers that depend on the Machine API being at least available to list and sync caches with. When the API is not installed at all, the dependent controllers are blocked forever and eventually get killed by the liveness probe. That causes hot restart loops that cause installations to fail.
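
The discovery check suggested in the quote can be sketched as follows. This Python example works on an already fetched APIGroupList document (the JSON served at `/apis`) rather than a live cluster; the function and controller names are illustrative, and the real fix would use a Go discovery client.

```python
MACHINE_API_GROUP = "machine.openshift.io"


def api_group_installed(api_group_list: dict, group: str) -> bool:
    """Report whether `group` appears in an APIGroupList document."""
    return any(g.get("name") == group for g in api_group_list.get("groups", []))


def controllers_to_start(api_group_list: dict) -> list:
    """Pick which (hypothetical) controllers to start.

    Controllers that need to list and sync Machine objects are skipped
    when the Machine API group is absent, so they never block on a cache
    sync that cannot complete.
    """
    controllers = ["node-controller"]
    if api_group_installed(api_group_list, MACHINE_API_GROUP):
        controllers.append("machine-dependent-controller")
    return controllers


if __name__ == "__main__":
    with_machine_api = {
        "kind": "APIGroupList",
        "groups": [{"name": "apps"}, {"name": "machine.openshift.io"}],
    }
    without_machine_api = {"kind": "APIGroupList", "groups": [{"name": "apps"}]}
    print(controllers_to_start(with_machine_api))
    print(controllers_to_start(without_machine_api))
```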

https://redhat-internal.slack.com/archives/C027U68LP/p1690436286860899

 

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

1. install a machineAPI=false cluster
2. ??? 
3. watch it fail

Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-16776.

Description of problem:

CPMS creates two replacement machines when a master machine is deleted on vSphere.

Sorry, I have to revisit https://issues.redhat.com/browse/OCPBUGS-4297. All the related PRs are merged, but I hit the issue twice on the template cluster ipi-on-vsphere/versioned-installer-vmc7-ovn-winc-thin_pvc-ci and once on the template cluster ipi-on-vsphere/versioned-installer-vmc7-ovn today.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-13-235211

How reproducible:

Three times

Steps to Reproduce:

1. On the template cluster ipi-on-vsphere/versioned-installer-vmc7-ovn-winc-thin_pvc-ci, I first hit this after updating all 3 master machines using the RollingUpdate strategy and then deleting a master machine. The redundant machine seems to have been deleted automatically, because there was only one replacement machine when I checked again.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-djlxv-2   Running                          47m
huliu-vs15b-75tr7-master-h76sp-1   Running                          58m
huliu-vs15b-75tr7-master-wtzb7-0   Running                          70m
huliu-vs15b-75tr7-worker-gzsp9     Running                          4h43m
huliu-vs15b-75tr7-worker-vcqqh     Running                          4h43m
winworker-4cltm                    Running                          4h19m
winworker-qd4c4                    Running                          4h19m
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs15b-75tr7-master-djlxv-2
machine.machine.openshift.io "huliu-vs15b-75tr7-master-djlxv-2" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-bzd4h-2   Provisioning                          34s
huliu-vs15b-75tr7-master-djlxv-2   Deleting                              48m
huliu-vs15b-75tr7-master-gzhlk-2   Provisioning                          35s
huliu-vs15b-75tr7-master-h76sp-1   Running                               59m
huliu-vs15b-75tr7-master-wtzb7-0   Running                               70m
huliu-vs15b-75tr7-worker-gzsp9     Running                               4h44m
huliu-vs15b-75tr7-worker-vcqqh     Running                               4h44m
winworker-4cltm                    Running                               4h20m
winworker-qd4c4                    Running                               4h20m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-bzd4h-2   Running                          38m
huliu-vs15b-75tr7-master-h76sp-1   Running                          97m
huliu-vs15b-75tr7-master-wtzb7-0   Running                          108m
huliu-vs15b-75tr7-worker-gzsp9     Running                          5h22m
huliu-vs15b-75tr7-worker-vcqqh     Running                          5h22m
winworker-4cltm                    Running                          4h57m
winworker-qd4c4                    Running                          4h57m 

2. Then I changed the strategy to OnDelete, updated all 3 master machines using the OnDelete strategy, and then deleted a master machine.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-hzhgq-0   Running                          137m
huliu-vs15b-75tr7-master-kj9zf-2   Running                          89m
huliu-vs15b-75tr7-master-kz6cx-1   Running                          59m
huliu-vs15b-75tr7-worker-gzsp9     Running                          7h46m
huliu-vs15b-75tr7-worker-vcqqh     Running                          7h46m
winworker-4cltm                    Running                          7h21m
winworker-qd4c4                    Running                          7h21m
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs15b-75tr7-master-hzhgq-0
machine.machine.openshift.io "huliu-vs15b-75tr7-master-hzhgq-0" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-hzhgq-0   Deleting                              138m
huliu-vs15b-75tr7-master-kb687-0   Provisioning                          26s
huliu-vs15b-75tr7-master-kj9zf-2   Running                               90m
huliu-vs15b-75tr7-master-kz6cx-1   Running                               60m
huliu-vs15b-75tr7-master-qn6kq-0   Provisioning                          26s
huliu-vs15b-75tr7-worker-gzsp9     Running                               7h47m
huliu-vs15b-75tr7-worker-vcqqh     Running                               7h47m
winworker-4cltm                    Running                               7h22m
winworker-qd4c4                    Running                               7h22m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-kb687-0   Running                          154m
huliu-vs15b-75tr7-master-kj9zf-2   Running                          4h5m
huliu-vs15b-75tr7-master-kz6cx-1   Running                          3h34m
huliu-vs15b-75tr7-master-qn6kq-0   Running                          154m
huliu-vs15b-75tr7-worker-gzsp9     Running                          10h
huliu-vs15b-75tr7-worker-vcqqh     Running                          10h
winworker-4cltm                    Running                          9h
winworker-qd4c4                    Running                          9h
liuhuali@Lius-MacBook-Pro huali-test % oc get co     
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      5h13m   
baremetal                                  4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
cloud-controller-manager                   4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
cloud-credential                           4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
cluster-autoscaler                         4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
config-operator                            4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
console                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      145m    
control-plane-machine-set                  4.13.0-0.nightly-2023-02-13-235211   True        False         True       10h     Observed 1 updated machine(s) in excess for index 0
csi-snapshot-controller                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
dns                                        4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
etcd                                       4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
image-registry                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
ingress                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
insights                                   4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
kube-apiserver                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
kube-controller-manager                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
kube-scheduler                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
kube-storage-version-migrator              4.13.0-0.nightly-2023-02-13-235211   True        False         False      6h18m   
machine-api                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
machine-approver                           4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
machine-config                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      3h59m   
marketplace                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
monitoring                                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
network                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
node-tuning                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
openshift-apiserver                        4.13.0-0.nightly-2023-02-13-235211   True        False         False      145m    
openshift-controller-manager               4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
openshift-samples                          4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
operator-lifecycle-manager                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-02-13-235211   True        False         False      6h7m    
service-ca                                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
storage                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      3h57m   
liuhuali@Lius-MacBook-Pro huali-test %  

3. On the ipi-on-vsphere/versioned-installer-vmc7-ovn template cluster:
after updating all 3 master machines using the RollingUpdate strategy, no issue;
then deleting a master machine, no issue;
then changing the strategy to OnDelete and replacing the master machines one by one: when I deleted the last one, two replacement machines were created.

liuhuali@Lius-MacBook-Pro huali-test % oc get co 
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      73m     
baremetal                                  4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
cloud-controller-manager                   4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
cloud-credential                           4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
cluster-autoscaler                         4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
config-operator                            4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
console                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      129m    
control-plane-machine-set                  4.13.0-0.nightly-2023-02-13-235211   True        True          False      9h      Observed 1 replica(s) in need of update
csi-snapshot-controller                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
dns                                        4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
etcd                                       4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
image-registry                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      8h      
ingress                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      8h      
insights                                   4.13.0-0.nightly-2023-02-13-235211   True        False         False      8h      
kube-apiserver                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
kube-controller-manager                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
kube-scheduler                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
kube-storage-version-migrator              4.13.0-0.nightly-2023-02-13-235211   True        False         False      3h22m   
machine-api                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
machine-approver                           4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
machine-config                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
marketplace                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
monitoring                                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      8h      
network                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
node-tuning                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
openshift-apiserver                        4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
openshift-controller-manager               4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
openshift-samples                          4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
operator-lifecycle-manager                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-02-13-235211   True        False         False      46m     
service-ca                                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
storage                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      77m    
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15a-kjm6h-master-55s4l-1   Running                          84m
huliu-vs15a-kjm6h-master-ppc55-2   Running                          3h4m
huliu-vs15a-kjm6h-master-rqb52-0   Running                          53m
huliu-vs15a-kjm6h-worker-6nbz7     Running                          9h
huliu-vs15a-kjm6h-worker-g84xg     Running                          9h
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs15a-kjm6h-master-ppc55-2
machine.machine.openshift.io "huliu-vs15a-kjm6h-master-ppc55-2" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs15a-kjm6h-master-55s4l-1   Running                               85m
huliu-vs15a-kjm6h-master-cvwzz-2   Provisioning                          27s
huliu-vs15a-kjm6h-master-ppc55-2   Deleting                              3h5m
huliu-vs15a-kjm6h-master-qp9m5-2   Provisioning                          27s
huliu-vs15a-kjm6h-master-rqb52-0   Running                               54m
huliu-vs15a-kjm6h-worker-6nbz7     Running                               9h
huliu-vs15a-kjm6h-worker-g84xg     Running                               9h
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15a-kjm6h-master-55s4l-1   Running                          163m
huliu-vs15a-kjm6h-master-cvwzz-2   Running                          79m
huliu-vs15a-kjm6h-master-qp9m5-2   Running                          79m
huliu-vs15a-kjm6h-master-rqb52-0   Running                          133m
huliu-vs15a-kjm6h-worker-6nbz7     Running                          10h
huliu-vs15a-kjm6h-worker-g84xg     Running                          10h
liuhuali@Lius-MacBook-Pro huali-test % 

Actual results:

CPMS creates two replacement machines when a master machine is deleted, and both replacement machines remain for a long time.

Expected results:

CPMS should create only one replacement machine when a master machine is deleted, or quickly delete the redundant machine.
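
The surplus machines in the listings above can be spotted mechanically. The sketch below parses `oc get machine` text output and groups control plane machines by the index at the end of their name. It assumes the `<cluster>-master-<suffix>-<index>` naming seen in this report and ignores machines already in the Deleting phase; the function name is illustrative.

```python
from collections import defaultdict


def surplus_master_indexes(oc_get_machine_output: str) -> dict:
    """Map each control plane index to its machines when there is more
    than one machine for that index that is not being deleted."""
    by_index = defaultdict(list)
    # Skip the header row (NAME PHASE TYPE ...).
    for line in oc_get_machine_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 2:
            continue
        name, phase = fields[0], fields[1]
        if "-master-" not in name or phase == "Deleting":
            continue
        index = name.rsplit("-", 1)[1]
        by_index[index].append(name)
    return {i: names for i, names in by_index.items() if len(names) > 1}


if __name__ == "__main__":
    import sys

    # Usage: oc get machine | python surplus_masters.py
    for index, names in surplus_master_indexes(sys.stdin.read()).items():
        print(f"index {index}: {', '.join(names)}")
```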

Additional info:

Must-gather: https://drive.google.com/file/d/1aCyFn9okNxRz7nE3Yt_8g6Kx7sPSGCg2/view?usp=sharing for ipi-on-vsphere/versioned-installer-vmc7-ovn-winc-thin_pvc-ci template cluster
https://drive.google.com/file/d/1i0fWSP0-HqfdV5E0wcNevognLUQKecvl/view?usp=sharing for ipi-on-vsphere/versioned-installer-vmc7-ovn template cluster

This is a clone of issue OCPBUGS-19494. The following is the description of the original issue:

Description of problem:

The ipsec container kills pluto even if it was started by systemd.

Version-Release number of selected component (if applicable):

on any 4.14 nightly

How reproducible:

every time 

Steps to Reproduce:

1. enable N-S ipsec
2. enable E-W IPsec
3. kill/stop/delete one of the ipsec-host pods

Actual results:

pluto is killed on that host

Expected results:

pluto keeps running

Additional info:

https://github.com/yuvalk/cluster-network-operator/blob/37d1cc72f4f6cd999046bd487a705e6da31301a5/bindata/network/ovn-kubernetes/common/ipsec-host.yaml#L235
this should be removed

Description of problem:

According to PR https://github.com/openshift/cluster-monitoring-operator/pull/1824, the startupProbe for both UWM Prometheus and platform Prometheus should allow 1 hour. However, after enabling UWM, the startupProbe for UWM Prometheus still allows only 15m. Platform Prometheus does not have the issue; its startupProbe is increased to 1 hour.

$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 60
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
...

$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 240
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
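
The 15m and 1 hour figures follow from the probe fields shown above: a startup probe gives the container roughly failureThreshold x periodSeconds to become ready. A small Python check of that arithmetic (ignoring any initial delay and the per-attempt timeout; the helper name is illustrative):

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time a container gets to pass its startup probe."""
    return failure_threshold * period_seconds


if __name__ == "__main__":
    uwm = startup_budget_seconds(failure_threshold=60, period_seconds=15)
    platform = startup_budget_seconds(failure_threshold=240, period_seconds=15)
    print(f"UWM Prometheus: {uwm} s = {uwm // 60} min")
    print(f"platform Prometheus: {platform} s = {platform // 60} min")
```

For the UWM pod this gives 900 seconds (15 minutes); for the platform pod it gives 3600 seconds (1 hour).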

 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-19-052243

How reproducible:

always

Steps to Reproduce:

1. enable UWM, check startupProbe for UWM prometheus/platform prometheus
2.
3.

Actual results:

startupProbe for UWM prometheus is still 15m

Expected results:

startupProbe for UWM prometheus should be 1 hour

Additional info:

Since the startupProbe for platform Prometheus has been increased to 1 hour and no similar bug has been reported for UWM Prometheus, closing this issue as won't fix is OK.

When ProjectID is not set, TenantID might be ignored in MAPO.

Context: When setting additional networks in Machine templates, networks can be identified by means of a filter. The network filter has both TenantID and ProjectID as fields. TenantID was ignored.

Steps to reproduce:
Create a Machine or a MachineSet with a template containing a Network filter that sets a TenantID.

```
networks:
- filter:
    id: 'the-network-id'
    tenantId: '123-123-123'
```

One cheap way of testing this could be to pass a valid network ID and set a bogus tenantID. If the machine gets associated with the network, then tenantID has been ignored and the bug is present. If instead MAPO errors, then it means that it has taken tenantID into consideration.
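
The expected behavior and the cheap test can be modeled in a few lines. This is a hypothetical Python sketch, not MAPO's Go code: it assumes ProjectID takes precedence and TenantID is used as a fallback when ProjectID is unset, and it represents networks as plain dictionaries with made-up field names.

```python
def effective_project(network_filter: dict):
    """Project to filter on: projectId if set, otherwise tenantId."""
    return network_filter.get("projectId") or network_filter.get("tenantId") or None


def match_networks(networks: list, network_filter: dict) -> list:
    """Return the networks that satisfy both the ID and the project filter."""
    wanted_id = network_filter.get("id")
    project = effective_project(network_filter)
    return [
        n
        for n in networks
        if (not wanted_id or n["id"] == wanted_id)
        and (project is None or n["project_id"] == project)
    ]


if __name__ == "__main__":
    networks = [{"id": "the-network-id", "project_id": "real-project"}]
    # Valid network ID with a bogus tenantId: no match is expected, which
    # corresponds to MAPO erroring instead of attaching the network.
    print(match_networks(networks, {"id": "the-network-id", "tenantId": "123-123-123"}))
```

If the tenantId fallback in `effective_project` is dropped, the bogus tenant no longer prevents the match, which is the buggy behavior described above.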

Description of problem:

This Jira is filed to track upstream issue (fix and backport) https://github.com/kubernetes-sigs/azurefile-csi-driver/issues/1308

Version-Release number of selected component (if applicable):

4.14

Description of problem:

[Hypershift] default KAS PSA config should be consistent with OCP 
 enforce: privileged 

Version-Release number of selected component (if applicable):

Cluster version is 4.14.0-0.nightly-2023-10-08-220853

How reproducible:

Always

Steps to Reproduce:

1. Install OCP cluster and hypershift operator
2. Create hosted cluster
3. Check the default kas config of the hosted cluster

Actual results:

The hosted cluster default kas PSA config enforce is 'restricted'
$ jq '.admission.pluginConfig.PodSecurity' < `oc extract cm/kas-config -n clusters-9cb7724d8bdd0c16a113 --confirm`
{
  "location": "",
  "configuration": {
    "kind": "PodSecurityConfiguration",
    "apiVersion": "pod-security.admission.config.k8s.io/v1beta1",
    "defaults": {
      "enforce": "restricted",
      "enforce-version": "latest",
      "audit": "restricted",
      "audit-version": "latest",
      "warn": "restricted",
      "warn-version": "latest"
    },
    "exemptions": {
      "usernames": [
        "system:serviceaccount:openshift-infra:build-controller"
      ]
    }
  }
}

Expected results:

The hosted cluster default kas PSA config enforce should be 'privileged' in

https://github.com/openshift/hypershift/blob/release-4.13/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L93

Additional info:

References: OCPBUGS-8710

Description of problem:

OAuth tokens scoped with user:check-access cannot be used to check access as intended. SelfSubjectAccessReviews made with such a scoped token always report allowed: false, denied: true, unless the review checks the ability to create SelfSubjectAccessReviews. This does not seem to be the intended behavior per the documentation.

https://docs.openshift.com/container-platform/4.12/authentication/tokens-scoping.html

OAuth tokens scoped with user:check-access are only authorized for SelfSubjectAccessReview, which is as intended. This appears to be enforced by the scopeauthorizer. However, the authorizer used by SelfSubjectAccessReview includes this filter, so the returned response is useless: you can only check access to SelfSubjectAccessReview itself, instead of using the token to check the RBAC access of the parent user the token is scoped from.

https://github.com/openshift/kubernetes/blob/master/openshift-kube-apiserver/authorization/scopeauthorizer/authorizer.go

https://github.com/openshift/kubernetes/blob/master/pkg/registry/authorization/selfsubjectaccessreview/rest.go

 

Version-Release number of selected component (if applicable):

 

How reproducible:

Create user:check-access scoped token.  Token must not have user:full scope.  Use the token to do a SelfSubjectAccessReview.

Steps to Reproduce:

1. Create user:check-access scoped token.  Must not have user:full scope.
2. Use the token to do a SelfSubjectAccessReview against a resource the parent user has access to.
3. Observe the status response is allowed: false, denied: true.
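
For step 2, the request body might look like the following sketch. The `authorization.k8s.io/v1` SelfSubjectAccessReview kind and its `resourceAttributes` fields are part of the Kubernetes API; the helper name and the namespace, verb, and resource values are placeholders chosen for illustration.

```python
import json


def build_ssar(namespace: str, verb: str, resource: str) -> dict:
    """Build a SelfSubjectAccessReview body for a namespaced resource check."""
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SelfSubjectAccessReview",
        "spec": {
            "resourceAttributes": {
                "namespace": namespace,
                "verb": verb,
                "resource": resource,
            }
        },
    }


# Placeholder values: pick a resource the parent user is known to have access to.
print(json.dumps(build_ssar("default", "get", "pods"), indent=2))
```

Posting this body with the scoped token as the bearer credential and inspecting `status.allowed` in the response is one way to observe the behavior described in step 3.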

Actual results:

Unable to check user access with a user:check-access scoped token.

Expected results:

Ability to check user access with a user:check-access scoped token, without user:full scope which would give the token full access and abilities of the parent user.

Additional info:

 

Some tests may cause unexpected node reboots. On HA setups this is checked by the "should report ready nodes the entire duration of the test run" test, which ensures that the Prometheus metric for node readiness didn't flip.

On SNO, however, we can't use the metrics: Prometheus goes down along with the node, and the node becomes ready again before Prometheus/kube-state-metrics is back up. For SNO we have to check that the node has the expected number of reboots: the number of "rendered-master/rendered-worker" MCs + 1.
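
The SNO check reduces to a simple count. A rough sketch, assuming the MachineConfig names have already been listed (the helper name and the sample names are made up, and the real test may derive the count differently):

```python
def expected_reboots(machine_config_names: list[str]) -> int:
    """Number of rendered-master/rendered-worker MCs, plus one."""
    rendered = [
        name
        for name in machine_config_names
        if name.startswith(("rendered-master", "rendered-worker"))
    ]
    return len(rendered) + 1


# Hypothetical MachineConfig names for illustration only.
sample = ["00-master", "rendered-master-abc", "rendered-worker-def"]
print(expected_reboots(sample))
```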

Description of problem:

The MCDaemon has a codepath for "pivot" that was used in older versions and later in solutions articles to initiate a direct pivot to an ostree version, mostly when things fail.

As of 4.12 this codepath should no longer work because we switched to the new-format OSImage, so we should fully deprecate it.

This is likely where it fails:
https://github.com/openshift/machine-config-operator/blob/ecc6bf3dc21eb33baf56692ba7d54f9a3b9be1d1/pkg/daemon/rpm-ostree.go#L248

Version-Release number of selected component (if applicable):

4.12+

How reproducible:

Not sure but should be 100%

Steps to Reproduce:

1. Follow https://access.redhat.com/solutions/5598401
2.
3.

Actual results:

fails

Expected results:

MCD telling you pivot is deprecated

Additional info:

 

Description of problem:

The secret generated by CCO in STS mode differs from the one created by ccoctl on the command line.

ccoctl generates:

[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::269733383066:role/jsafrane-1-5h8rm-openshift-cluster-csi-drivers-aws-efs-cloud-cre
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token

CCO generates:

sts_regional_endpoints = regional
role_arn = arn:aws:iam::269733383066:role/jsafrane-1-5h8rm-openshift-cluster-csi-drivers-aws-efs-cloud-cre
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token 

IMO these two should be the same. AWS EFS CSI driver does not work without "[default]" at the beginning.
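
The effect of the missing header can be illustrated with a generic INI parser. The sketch below uses Python's standard `configparser` purely as a stand-in (the AWS SDKs use their own parsers, so this is an analogy, not the driver's code path), and the role ARN is a placeholder.

```python
import configparser

WITH_HEADER = """[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::123456789012:role/example-role
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
"""

# Same content with the first line ("[default]") removed.
WITHOUT_HEADER = WITH_HEADER.split("\n", 1)[1]


def has_default_profile(text: str) -> bool:
    """Return True if the text parses as INI and contains a [default] section."""
    parser = configparser.ConfigParser()
    try:
        parser.read_string(text)
    except configparser.MissingSectionHeaderError:
        # Key/value pairs appeared before any section header.
        return False
    return parser.has_section("default")


print(has_default_profile(WITH_HEADER), has_default_profile(WITHOUT_HEADER))
```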

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-11-092038

How reproducible:

Always

Steps to Reproduce:

1. Create a Manual mode, STS cluster in AWS.
2. Create a CredentialsRequest which provides .spec.cloudTokenPath and .spec.providerSpec.stsIAMRoleARN.
3. Observe that secret is created by CCO in the target namespace specified by the CredentialsRequest. 

Actual results:

The secret does not have [default] in the `data` content.

Expected results:

 

 

Background

When we run our agent we set the proxy environment variables as can be seen here

When the user SSHs into the host, the shell does not have those environment variables set.

Issue

This means that when the user is trying to debug network connectivity (for example, in day-2 users often SSH to see why they can't reach the day-1 cluster's API), they will usually try to run curl to see whether they can reach the URL themselves, but it might behave differently than the agent because the shell, by default, doesn't use the proxy settings.

Solution

Set the default environment variables (through .profile) of the core and root shells to include the same proxy environment variables as the agent, so that when the user logs into the host to run commands, they would have the same proxy settings as the ones the agent has.
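
As an illustration of the idea, the following sketch renders the lines that could be appended to a shell profile. The helper name and the proxy values are hypothetical, and emitting both upper-case and lower-case variable names is an assumption made here because tools differ in which form they read.

```python
import shlex


def render_proxy_exports(proxy: dict[str, str]) -> str:
    """Render `export` lines for the proxy variables that are set."""
    lines = []
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
        value = proxy.get(name)
        if value:
            quoted = shlex.quote(value)  # guard against shell metacharacters
            lines.append(f"export {name}={quoted}")
            lines.append(f"export {name.lower()}={quoted}")
    return "\n".join(lines)


# Placeholder values; the agent's real settings would be used instead.
print(render_proxy_exports({
    "HTTP_PROXY": "http://proxy.example.com:3128",
    "NO_PROXY": ".cluster.local,10.0.0.0/8",
}))
```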

Example

One example where we ran into this issue is when a customer forgot to set the correct noProxy settings in the UI during day-2, and so the agent was complaining about not being able to reach the day-1 API server (as the API server is unreachable through the proxy), but when we SSHd into the host and tried to curl, everything seemed to be working fine. Only after we ran tcpdump to compare the requests did we notice that the agent was routing requests through the proxy but curl wasn't, because the shell didn't have the proxy settings by default. If the shell had the correct proxy settings, it would've been easier to troubleshoot the problem.

Description of problem:

The NS autolabeler should adjust the PSS namespace labels such that a previously permitted workload (based on the SCCs it has access to) can still run.

The autolabeler requires the RoleBinding's .subjects[].namespace to be set when .subjects[].kind is ServiceAccount, even though the RBAC system does not require this to successfully bind the SA to a Role.

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.7.0-0.ci-2021-05-21-142747
Server Version: 4.12.0-0.nightly-2022-08-15-150248
Kubernetes Version: v1.24.0+da80cd0

How reproducible: 100%

Steps to Reproduce:

---
apiVersion: v1
kind: Namespace
metadata:
  name: test

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mysa
  namespace: test

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: myrole
  namespace: test
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: myrb
  namespace: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: myrole
subjects:
- kind: ServiceAccount
  name: mysa
  #namespace: test  # This is required for the autolabeler

---
kind: Job
apiVersion: batch/v1
metadata:
  name: myjob
  namespace: test
spec:
  template:
    spec:
      containers:
        - name: ubi
          image: registry.access.redhat.com/ubi8
          command: ["/bin/bash", "-c"]
          args: ["whoami; sleep infinity"]
      restartPolicy: Never
      securityContext:
        runAsUser: 0
      serviceAccount: mysa
      terminationGracePeriodSeconds: 2

Actual results:

After applying the manifest above, the Job's pod does not start:

$ kubectl -n test describe job/myjob
...
Events:
  Type     Reason        Age   From            Message
  ----     ------        ----  ----            -------
  Warning  FailedCreate  20s   job-controller  Error creating: pods "myjob-zxcvv" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  Warning  FailedCreate  20s   job-controller  Error creating: pods "myjob-fkb9x" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  Warning  FailedCreate  10s   job-controller  Error creating: pods "myjob-5klpc" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Uncommenting the "namespace" field in the RoleBinding will allow it to start as the autolabeler will adjust the Namespace labels.

However, the namespace field isn't actually required by the RBAC system. Instead of using the autolabeler, the pod can be allowed to run by (w/o uncommenting the field):

$ kubectl label ns/test security.openshift.io/scc.podSecurityLabelSync=false
namespace/test labeled
$ kubectl label ns/test pod-security.kubernetes.io/enforce=privileged --overwrite
namespace/test labeled

 

We now see that the pod is running as root and has access to the privileged scc:

$ kubectl -n test get po -oyaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.2.18/23"],"mac_address":"0a:58:0a:81:02:12","gateway_ips":["10.129.2.1"],"ip_address":"10.129.2.18/23","gateway_ip":"10.129.2.1"}}'
      k8s.v1.cni.cncf.io/network-status: |-
        [{
            "name": "ovn-kubernetes",
            "interface": "eth0",
            "ips": [
                "10.129.2.18"
            ],
            "mac": "0a:58:0a:81:02:12",
            "default": true,
            "dns": {}
        }]
      k8s.v1.cni.cncf.io/networks-status: |-
        [{
            "name": "ovn-kubernetes",
            "interface": "eth0",
            "ips": [
                "10.129.2.18"
            ],
            "mac": "0a:58:0a:81:02:12",
            "default": true,
            "dns": {}
        }]
      openshift.io/scc: privileged
    creationTimestamp: "2022-08-16T13:08:24Z"
    generateName: myjob-
    labels:
      controller-uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
      job-name: myjob
    name: myjob-rwjmv
    namespace: test
    ownerReferences:
    - apiVersion: batch/v1
      blockOwnerDeletion: true
      controller: true
      kind: Job
      name: myjob
      uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
    resourceVersion: "36418"
    uid: 39f18dea-31d4-4783-85b5-8ae6a8bec1f4
  spec:
    containers:
    - args:
      - whoami; sleep infinity
      command:
      - /bin/bash
      - -c
      image: registry.access.redhat.com/ubi8
      imagePullPolicy: Always
      name: ubi
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-6f2h6
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: mysa-dockercfg-mvmtn
    nodeName: ip-10-0-140-172.ec2.internal
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext:
      runAsUser: 0
    serviceAccount: mysa
    serviceAccountName: mysa
    terminationGracePeriodSeconds: 2
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-6f2h6
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
        - configMap:
            items:
            - key: service-ca.crt
              path: service-ca.crt
            name: openshift-service-ca.crt
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:24Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:28Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:28Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:24Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: cri-o://8fd1c3a5ee565a1089e4e6032bd04bceabb5ab3946c34a2bb55d3ee696baa007
      image: registry.access.redhat.com/ubi8:latest
      imageID: registry.access.redhat.com/ubi8@sha256:08e221b041a95e6840b208c618ae56c27e3429c3dad637ece01c9b471cc8fac6
      lastState: {}
      name: ubi
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2022-08-16T13:08:28Z"
    hostIP: 10.0.140.172
    phase: Running
    podIP: 10.129.2.18
    podIPs:
    - ip: 10.129.2.18
    qosClass: BestEffort
    startTime: "2022-08-16T13:08:24Z"
kind: List
metadata:
  resourceVersion: ""

 

$ kubectl -n test logs job/myjob
root

 

Expected results:

The autolabeler should properly follow the RoleBinding back to the SCC
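
The expected resolution can be sketched as a small defaulting rule: for a namespaced RoleBinding, a ServiceAccount subject without a namespace is taken to live in the binding's own namespace. This is an illustrative sketch with a made-up helper name, not the autolabeler's code.

```python
def subject_namespace(subject: dict, binding_namespace: str) -> str:
    """Namespace that a RoleBinding subject refers to."""
    if subject.get("kind") == "ServiceAccount":
        # Fall back to the RoleBinding's namespace when the field is omitted.
        return subject.get("namespace") or binding_namespace
    return subject.get("namespace", "")


# The subject from the reproducer, with the namespace field left out.
subject = {"kind": "ServiceAccount", "name": "mysa"}
print(subject_namespace(subject, "test"))
```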

 

Additional info:

Description of problem:

While updating a cluster to 4.12.11, which contains the bug fix for OCPBUGS-7999 (https://issues.redhat.com/browse/OCPBUGS-7999), the 4.12.z backport of OCPBUGS-2783 (https://issues.redhat.com/browse/OCPBUGS-2783), it seems that the older {Custom|Default}RouteSync{Degraded|Progressing} conditions are not cleaned up as they should be per the OCPBUGS-2783 resolution, while the newer ones are added.

Due to this, on an upgrade to 4.12.11 (or higher, until this bug is fixed), it is possible to hit a problem very similar to the one that led to OCPBUGS-2783 in the first place.

So, we need to do a proper cleanup of the older conditions.

Version-Release number of selected component (if applicable):

4.12.11 and higher

How reproducible:

Always, as far as the wrong conditions are concerned. It only leads to issues if one of the wrong conditions was in an unhealthy state.

Steps to Reproduce:

1. Upgrade
2.
3.

Actual results:

Both the new (and correct) conditions and the older (and wrong) conditions are present.

Expected results:

Only the new (and correct) conditions are present.

Additional info:

The problem seems to be that the stale conditions controller is created[1] with a list containing CustomRouteSync and DefaultRouteSync, while that list should contain CustomRouteSyncDegraded, CustomRouteSyncProgressing, DefaultRouteSyncDegraded and DefaultRouteSyncProgressing. From reading the controller's source code, it seems that it does not accept prefixes but performs a literal comparison.

[1] - https://github.com/openshift/console-operator/blob/0b54727/pkg/console/starter/starter.go#L403-L404
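
The difference between a literal comparison and the intended list can be shown with a small sketch. The `remove_stale` helper and the `SomeNewCondition` placeholder are hypothetical; only the condition names come from this report.

```python
# List the controller is created with, according to the report.
STALE_CONFIGURED = ["CustomRouteSync", "DefaultRouteSync"]

# List the report says it should be created with.
STALE_INTENDED = [
    "CustomRouteSyncDegraded",
    "CustomRouteSyncProgressing",
    "DefaultRouteSyncDegraded",
    "DefaultRouteSyncProgressing",
]


def remove_stale(conditions: list[str], stale: list[str]) -> list[str]:
    """Drop conditions whose type literally equals an entry in `stale`."""
    stale_set = set(stale)
    return [c for c in conditions if c not in stale_set]


current = STALE_INTENDED + ["SomeNewCondition"]  # placeholder for a newer condition

print(remove_stale(current, STALE_CONFIGURED))  # literal match removes nothing
print(remove_stale(current, STALE_INTENDED))
```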

Description of problem:
During the creation of a new HostedCluster, the control-plane-operator logs several error lines like the following:

{"level":"error","ts":"2023-05-04T05:24:03Z","msg":"failed to remove service ca annotation and secret: %w","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","hostedControlPlane":{"name":"demo-02","namespace":"clusters-demo-02"},"namespace":"clusters-demo-02","name":"demo-02","reconcileID":"5ffe0a7f-94ce-4745-b89d-4d5168cabe8d","error":"failed to get service: Service \"node-tuning-operator\" not found","stacktrace":"github.com/openshift/hypershift/control-plane-operator/controllers/hostedcontrolplane.(*HostedControlPlaneReconciler).reconcile\n\t/hypershift/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:929\ngithub.com/openshift/hypershift/control-plane-operator/controllers/hostedcontrolplane.(*HostedControlPlaneReconciler).update\n\t/hypershift/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:830\ngithub.com/openshift/hypershift/control-plane-operator/controllers/hostedcontrolplane.(*HostedControlPlaneReconciler).Reconcile\n\t/hypershift/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:677\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

These errors repeat until the Service and Secret are created.

Version-Release number of selected component (if applicable):

Management cluster: 4.14.0-nightly
Hosted Cluster: 4.13.0 or 4.14.0-nightly

How reproducible:

Always

Steps to Reproduce:

1. Create a hosted cluster

Actual results:

HostedCluster is created but there are several unnecessary "error" logs in the control-plane-operator

Expected results:

No error logs from control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:removeServiceCAAnnotationAndSecret() during normal cluster creation

Additional info:

 

Marko Luksa mentioned that multus is missing the '/etc/cni/multus/net.d' mount in OCP 4.14. Here are the reproduction steps (verified by the multus team):

Our original reproducer would be too complex, so I had to write a simple one for you:
Use a 4.14 OpenShift cluster
Create the CNI plugin installer DaemonSet in namespace test:

oc apply -f https://gist.githubusercontent.com/luksa/c4d444e918124604839c424339c29a62/raw/1454bd389138980ea3f93bcfaf6026d4821e3543/noop-cni-plugin-installer.yaml

Create the test Deployment:

oc apply -f https://gist.githubusercontent.com/luksa/4c7c144ef88b1b0d8f772d6eacdeec14/raw/06b161fdb8c71406f4531d35550bd507a6a25200/test-deployment.yaml

Describe the test pod:

oc -n test describe po test

The last event shows the following:

ERRORED: error configuring pod [test/test-6cf67dcfb6-hgszq] networking: Multus: [test/test-6cf67dcfb6-hgszq/3e8a6f0d-ce84-4885-a7a7-43506669339f]: error loading k8s delegates k8s args: TryLoadPodDelegates: error in getting k8s network for pod: GetNetworkDelegates: failed getting the delegate: GetCNIConfig: err in GetCNIConfigFromFile: No networks found in /etc/cni/multus/net.d

The same reproducer runs fine on OCP 4.13

Description of problem:

The current version of openshift/router vendors Kubernetes 1.26 packages. OpenShift 4.14 is based on Kubernetes 1.27.   

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/router/blob/release-4.14/go.mod 

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/apimachinery, k8s.io/apiserver, and k8s.io/client-go) are at version v0.26

Expected results:

Kubernetes packages are at version v0.27.0 or later.

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.
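
A check like the one in the reproduction step could be automated along these lines. The sketch parses go.mod text for `k8s.io/*` modules and reports those below a minimum minor version; the sample content (including the patch versions and the non-Kubernetes module) is invented for illustration and is not the actual openshift/router go.mod.

```python
import re

# Hypothetical go.mod excerpt for illustration only.
SAMPLE_GO_MOD = """\
require (
    k8s.io/api v0.26.1
    k8s.io/apimachinery v0.26.1
    k8s.io/client-go v0.26.1
    github.com/example/other v1.2.3
)
"""

# Matches lines such as "k8s.io/api v0.26.1" and captures module and minor version.
K8S_LINE = re.compile(r"^\s*(k8s\.io/[\w.-]+)\s+v0\.(\d+)\.\d+", re.MULTILINE)


def outdated_k8s_modules(go_mod: str, min_minor: int) -> dict[str, int]:
    """Map each k8s.io module below `min_minor` to its minor version."""
    return {
        m.group(1): int(m.group(2))
        for m in K8S_LINE.finditer(go_mod)
        if int(m.group(2)) < min_minor
    }


print(outdated_k8s_modules(SAMPLE_GO_MOD, 27))
```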

Description of problem:

I attempted to install a BM SNO with the agent based installer.
In the install_config, I disabled all supported capabilities except marketplace. Install_config snippet: 

capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - marketplace

The system installed fine but the capabilities config was not passed down to the cluster. 

clusterversion: 
status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - CSISnapshot
      - Console
      - Insights
      - Storage
      - baremetal
      - marketplace
      - openshift-samples
      knownCapabilities:
      - CSISnapshot
      - Console
      - Insights
      - Storage
      - baremetal
      - marketplace
      - openshift-samples

oc -n kube-system get configmap cluster-config-v1 -o yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundlePolicy: Proxyonly
    apiVersion: v1
    baseDomain: ptp.lab.eng.bos.redhat.com
    bootstrapInPlace:
      installationDisk: /dev/disk/by-id/wwn-0x62cea7f04d10350026c6f2ec315557a0
    compute:
    - architecture: amd64
      hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 0
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform: {}
      replicas: 1
    metadata:
      creationTimestamp: null
      name: cnfde8
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.16.231.0/24
      networkType: OVNKubernetes
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      none: {}
    publish: External
    pullSecret: ""





Version-Release number of selected component (if applicable):

4.12.0-rc.5

How reproducible:

100%

Steps to Reproduce:

1. Install SNO with agent based installer as described above
2.
3.

Actual results:

Capabilities installed  

Expected results:

Capabilities not installed 
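
The mismatch can be expressed as a set difference between what the install config requested and what ClusterVersion reports. This is a simplified sketch: the helper name is made up, it treats baselineCapabilitySet: None as an empty set, and it ignores any capabilities the cluster might enable implicitly.

```python
def unexpected_capabilities(
    baseline: set[str], additional: set[str], enabled: set[str]
) -> set[str]:
    """Capabilities that are enabled but were not requested."""
    return enabled - (baseline | additional)


# Requested in the install config: baseline None plus marketplace.
requested_baseline: set[str] = set()
requested_additional = {"marketplace"}

# enabledCapabilities from the ClusterVersion status in this report.
enabled = {
    "CSISnapshot", "Console", "Insights", "Storage",
    "baremetal", "marketplace", "openshift-samples",
}

print(sorted(unexpected_capabilities(requested_baseline, requested_additional, enabled)))
```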

Additional info:

 

Description of problem:
When trying to import the Helm chart "httpd-imagestreams", the "Create Helm Release" page shows an info alert saying that the form isn't available because there is no schema for this Helm chart. However, the YAML view is not visible either.

Info Alert:

Form view is disabled for this chart because the schema is not available

Version-Release number of selected component (if applicable):
4.9-4.14 (current master)

How reproducible:
Always

Steps to Reproduce:

  1. Switch to the developer perspective
  2. Navigate to Add > Helm Chart
  3. Search and select "httpd-imagestreams", click the card and then Create to open the "Create Helm Release" page

Actual results:

  1. Form / YAML switch is disabled
  2. Info alert is shown: Form view is disabled for this chart because the schema is not available
  3. There is no YAML editor

Expected results:

  1. It's fine that the Form/ YAML switch is disabled
  2. Info alert is also fine
  3. YAML editor should be displayed

Additional info:
The chart yaml is available here and doesn't contain a schema (at the moment).

https://github.com/openshift-helm-charts/charts/blob/main/charts/redhat/redhat/httpd-imagestreams/0.0.1/src/Chart.yaml

Description of problem:

machine-config-operator will fail on clusters deployed with IPI on Power Virtual Server with the following error:

Cluster not available for []: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: spec.infra.status.platformStatus.powervs.resourceGroup: Invalid value: "": spec.infra.status.platformStatus.powervs.resourceGroup in body should match '^[a-zA-Z0-9-_ 

Version-Release number of selected component (if applicable):

4.14 and 4.13

How reproducible:

100%

Steps to Reproduce:

1. Deploy with openshift-installer to Power VS
2. Wait for masters to start deploying
3. Error will appear for the machine-config CO

Actual results:

MCO fails

Expected results:

MCO should come up

Additional info:

Fix has been identified

Description of problem:

The Pipelines Creation YAML form does not allow v1beta1 YAMLs to be created.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Open the Pipelines Creation YAML form
2. Paste the following YAML
3. Submit the form

Actual results:

The form does not submit and reports a version mismatch: it expects v1 but got v1beta1.

Expected results:

The YAML form should support creating both versions.

Additional info:

The issue is not observed when the "Import from YAML" Form is used.

Attachment: https://drive.google.com/file/d/1B_sAuGREgmX800JXGmrL30iByowfHzs7/view?usp=sharing

 

Description of problem:

The TRT ComponentReadiness tool shows what looks like a regression (https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&baseEndTime=2023-05-16%2023%3A59%3A59&baseRelease=4.13&baseStartTime=2023-04-16%2000%3A00%3A00&capability=Other&component=Monitoring&confidence=95&environment=ovn%20no-upgrade%20amd64%20aws%20hypershift&excludeArches=heterogeneous%2Carm64%2Cppc64le%2Cs390x&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&pity=5&platform=aws&sampleEndTime=2023-07-20%2023%3A59%3A59&sampleRelease=4.14&sampleStartTime=2023-07-13%2000%3A00%3A00&testId=openshift-tests%3A79898d2e28b78374d89e10b38f88107b&testName=%5Bsig-instrumentation%5D%20Prometheus%20%5Bapigroup%3Aimage.openshift.io%5D%20when%20installed%20on%20the%20cluster%20should%20report%20telemetry%20%5BLate%5D%20%5BSkipped%3ADisconnected%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D&upgrade=no-upgrade&variant=hypershift)

in the "[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster should report telemetry [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" test.

In the ComponentReadiness link above, you can see the sample runs (linked with red "F").

Version-Release number of selected component (if applicable):

4.14

How reproducible:

The pass rate in 4.13 is 100% vs. 81% in 4.14

Steps to Reproduce:

1.  The query above focuses on "periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance" jobs and the specific test mentioned.  You can see the failures by clicking on the red "F"s
2.
3.

Actual results:

The failures look like:

{  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:365]: Unexpected error:
    <errors.aggregate | len:2, cap:2>: 
    [promQL query returned unexpected results:
    metricsclient_request_send{client="federate_to",job="telemeter-client",status_code="200"} >= 1
    [], promQL query returned unexpected results:
    federate_samples{job="telemeter-client"} >= 10
    []]
    [
        <*errors.errorString | 0xc0017611b0>{
            s: "promQL query returned unexpected results:\nmetricsclient_request_send{client=\"federate_to\",job=\"telemeter-client\",status_code=\"200\"} >= 1\n[]",
        },
        <*errors.errorString | 0xc00203d380>{
            s: "promQL query returned unexpected results:\nfederate_samples{job=\"telemeter-client\"} >= 10\n[]",
        },
    ]

Expected results:

Query should succeed

Additional info:

I set the severity to Major because this looks like a regression from where it was in the 5 weeks before 4.13 went GA.

Description of the problem:

When an ICSP is provided in the install config to cache images locally while also using the SaaS, the cluster fails to prepare for installation because oc adm release extract tries to use the ICSP from the install config.

How reproducible:

 100% on a fresh deploy, but if the installer cache is already warmed up 0%

Steps to reproduce:

1. Deploy fresh replicas to the SaaS environment

2. Create a cluster

3. Override install config and add ICSP content for an inaccessible (from the SaaS) registry

4. Install cluster

Actual results:

 Cluster fails to prepare with an error like:

Failed to prepare the installation due to an unexpected error: failed generating install config for cluster f3e55b14-297d-453b-8ef4-953caebefc67: failed to get installer path: command 'oc adm release extract --command=openshift-install --to=/data/install-config-generate/installercache/quay.io/openshift-release-dev/ocp-release:4.13.0-x86_64 --insecure=false --icsp-file=/tmp/icsp-file1525063401 quay.io/openshift-release-dev/ocp-release:4.13.0-x86_64 --registry-config=/tmp/registry-config882468533' exited with non-zero exit code 1: warning: --icsp-file only applies to images referenced by digest and will be ignored for tags error: unable to read image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:81be8aec46465412abbef5f1ec252ee4a17b043e82d31feac13d25a8a215a2c9: unauthorized: access to the requested resource is not authorized . Please retry later

Expected results:

Installer image is pulled successfully.

Additional Information

This seems to have been introduced in https://github.com/openshift/assisted-service/pull/4115 when we started pulling ICSP information from the install config.

Description of problem:

Cluster Network Operator managed component multus-admission-controller does not conform to Hypershift control plane expectations.

When CNO is managed by Hypershift, multus-admission-controller and other CNO-managed deployments should run with a non-root security context. If Hypershift runs the control plane on a Kubernetes (as opposed to OpenShift) management cluster, it adds a pod security context, including a runAsUser element, to its managed deployments, including CNO. In that case CNO should do the same and set a security context on its managed deployments, such as multus-admission-controller, to meet the Hypershift security rules.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create an OCP cluster using Hypershift with a Kubernetes management cluster
2. Check the pod security context of multus-admission-controller

Actual results:

no pod security context is set on multus-admission-controller

Expected results:

pod security context is set with runAsUser: xxxx

Additional info:

Corresponding CNO change 

Description of problem:
Component Readiness is showing a regression in 4.14 compared to 4.13 in the rt variant of test Cluster resource quota should control resource limits across namespaces. Example

{  fail [github.com/openshift/origin/test/extended/quota/clusterquota.go:107]: unexpected error: timed out waiting for the condition
Ginkgo exit error 1: exit with code 1}
 

Looker studio graph (scroll down to see) shows the regression started around May 24th.

Version-Release number of selected component (if applicable):

 

How reproducible:
4.13 Sippy shows 100% success rate vs. 4.14 which is down to about 91%

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Historical pass rate was 100%

Additional info:

 

Description of problem:
Same for OCP 4.14.

In OCP 4.13, when trying to reach the Prometheus UI via port-forward (e.g. `oc port-forward prometheus-k8s-0`), the UI URL ($HOST:9090/graph) returns `Error opening React index.html: open web/ui/static/react/index.html: no such file or directory`

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-01-24-061922

How reproducible:

100%

Steps to Reproduce:

1.  oc -n openshift-monitoring port-forward prometheus-k8s-0 9090:9090 --address='0.0.0.0' 

2. curl http://localhost:9090/graph

Actual results:

Error opening React index.html: open web/ui/static/react/index.html: no such file or directory

Expected results:

Prometheus UI is loaded

Additional info:

 The UI loads fine when following the same steps in 4.12.

Removes the version check on reconciling the image content type policy since that is not needed in release image versions greater than 4.13.

Description of problem:

Visiting the global configurations page returns an error after 'Red Hat OpenShift Serverless' is installed. The error persists even after the operator is uninstalled.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-06-212044

How reproducible:

Always

Steps to Reproduce:

1. Subscribe 'Red Hat OpenShift Serverless' from OperatorHub, wait for the operator to be successfully installed
2. Visit Administration -> Cluster Settings -> Configurations tab

Actual results:

react_devtools_backend_compact.js:2367 unhandled promise rejection: TypeError: Cannot read properties of undefined (reading 'apiGroup') 
    at r (main-chunk-e70ea3b3d562514df486.min.js:1:1)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
    at Array.map (<anonymous>)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
overrideMethod @ react_devtools_backend_compact.js:2367
window.onunhandledrejection @ main-chunk-e70ea3b3d562514df486.min.js:1

main-chunk-e70ea3b3d562514df486.min.js:1 Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'apiGroup')
    at r (main-chunk-e70ea3b3d562514df486.min.js:1:1)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
    at Array.map (<anonymous>)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1

 

Expected results:

no errors

Additional info:

 

This is a clone of issue OCPBUGS-19512. The following is the description of the original issue:

OCPBUGS-5469 and backports began prioritizing later target releases, but we still wait 10m between different PromQL evaluations while evaluating conditional update risks.  This ticket is tracking work to speed up cache warming, and allows changes that are too invasive to be worth backporting.

Definition of done:

  • When presented with new risks, the CVO will initially evaluate one PromQL expression every second or so, instead of waiting 10m between different evaluations.  Each PromQL expression will still only be evaluated once every hour or so, to avoid excessive load on the PromQL engine.

Acceptance Criteria:

  • After changing the channel and receiving a new graph, conditional risks are evaluated as quickly as possible, ideally in less than 500ms per unique risk
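
The throttling described in the definition of done can be sketched as follows. This is a minimal illustration in Python, not the CVO's actual code (the CVO is written in Go). The class name `RiskEvaluationThrottle`, its fields, and the exact intervals of one second and one hour are assumptions derived from the "every second or so" and "every hour or so" wording above.

```python
from dataclasses import dataclass, field

NEVER = float("-inf")  # marker for "not evaluated yet"


@dataclass
class RiskEvaluationThrottle:
    """Hypothetical scheduler: one evaluation per second overall,
    and each individual PromQL expression at most once per hour."""

    min_between_any: float = 1.0      # seconds between any two evaluations (assumed)
    min_between_same: float = 3600.0  # seconds between evaluations of one expression (assumed)
    last_any: float = NEVER
    last_eval: dict = field(default_factory=dict)

    def next_to_evaluate(self, expressions, now):
        """Return the expression to evaluate at time `now`, or None if throttled."""
        if now - self.last_any < self.min_between_any:
            return None
        # Only expressions whose cached result is old enough are eligible.
        eligible = [
            e for e in expressions
            if now - self.last_eval.get(e, NEVER) >= self.min_between_same
        ]
        if not eligible:
            return None
        # Prefer the expression with the oldest (or missing) cached result.
        choice = min(eligible, key=lambda e: self.last_eval.get(e, NEVER))
        self.last_any = now
        self.last_eval[choice] = now
        return choice
```

With two new risks, a caller polling this object would evaluate the first expression immediately and the second one a second later, rather than waiting 10m between them.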

Description of problem:

In an STS cluster with the TechPreviewNoUpgrade featureset enabled, CCO ignores CRs whose .spec.providerSpec.stsIAMRoleARN is unset. 

While the CR controller does not provision a Secret for the aforementioned type of CRs, it still sets .status.provisioned to true for them. 

Steps to Reproduce:

1. Create an STS cluster, enable the feature set. 

2. Create a dummy CR like the following:
fxie-mac:cloud-credential-operator fxie$ cat cr2.yaml
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: test-cr-2
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - action:
      - ec2:CreateTags
      effect: Allow
      resource: '*'
  secretRef:
    name: test-secret-2
    namespace: default
  serviceAccountNames:
  - default

3. Check CR.status
fxie-mac:cloud-credential-operator fxie$ oc get credentialsrequest test-cr-2 -n openshift-cloud-credential-operator -o yaml
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  creationTimestamp: "2023-07-24T09:21:44Z"
  finalizers:
  - cloudcredential.openshift.io/deprovision
  generation: 1
  name: test-cr-2
  namespace: openshift-cloud-credential-operator
  resourceVersion: "180154"
  uid: 34b36cac-3fca-4fa5-a003-a9b64c5fbf00
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - action:
      - ec2:CreateTags
      effect: Allow
      resource: '*'
  secretRef:
    name: test-secret-2
    namespace: default
  serviceAccountNames:
  - default
status:
  lastSyncGeneration: 0
  lastSyncTimestamp: "2023-07-24T09:39:40Z"
  provisioned: true 

Description of problem:

After a private cluster is destroyed, the cluster's DNS records are left behind.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-02-26-022418 
4.13.0-0.nightly-2023-02-26-081527 

How reproducible:

always

Steps to Reproduce:

1. Create a private cluster
2. Destroy the cluster
3. Check the DNS records
$ibmcloud dns zones | grep private-ibmcloud.qe.devcluster.openshift.com (base_domain)
3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b   private-ibmcloud.qe.devcluster.openshift.com     PENDING_NETWORK_ADD
$zone_id=3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b
$ibmcloud dns resource-records $zone_id
CNAME:520c532f-ca61-40eb-a04e-1a2569c14a0b   api-int.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com   CNAME   60    10a7a6c7-jp-tok.lb.appdomain.cloud   
CNAME:751cf3ce-06fc-4daf-8a44-bf1a8540dc60   api.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com       CNAME   60    10a7a6c7-jp-tok.lb.appdomain.cloud   
CNAME:dea469e3-01cd-462f-85e3-0c1e6423b107   *.apps.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com    CNAME   120   395ec2b3-jp-tok.lb.appdomain.cloud 

Actual results:

the dns records of the cluster were left

Expected results:

All DNS records created by the installer are deleted after the cluster is destroyed.

Additional info:

This blocks creating private clusters later, because the maximum limit of 5 wildcard records is easily reached (QE account limitation).
Checking the *ingress-operator.log of the failed cluster shows the error: "createOrUpdateDNSRecord: failed to create the dns record: Reached the maximum limit of 5 wildcard records."

It is caused by the power off routine, which initialises last_error to None. The field is later restored, but BMO manages to observe and record the wrong value.

This issue is not trivial to reproduce in the product. You need OCPBUGS-2471 to land first, then you need to trigger the cleaning failure several times. I used direct access to Ironic via CLI to abort cleaning (`baremetal node abort <node name>`) during deprovisioning. After a few attempts you can observe the following in the BMH's status:

status:
  errorCount: 2
  errorMessage: 'Cleaning failed: '
  errorType: provisioning error

The empty message after the colon is a sign of this bug.

Description of the problem:

If an interface name is over 15 characters long, NetworkManager refuses to allow the interface to come up.

How reproducible:

Depends on the system interface names

Steps to reproduce:

1. Create a cluster with static networking (a vlan with a large id works best)

2. Boot a host with the discovery ISO

Actual results:

Host interface does not come up if the resulting interface name is over 15 characters

Expected results:

Interfaces should always come up

Additional info:

Slack thread: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1689956128746919?thread_ts=1689774706.220319&cid=CUPJTHQ5P

Attached a screenshot of the log stating the connection name is too long.

This happens because our script to apply static networking on a host uses the host interface name and appends the extension nmstate added for the interface.

In this case the interface name was enp94s0f0np0 with a vlan id of 2507. This meant that the resulting interface name was enp94s0f0np0.2507 (17 characters).

When configuring this interface manually as a workaround, the user stated that the interface name (not the vlan id) was truncated to accommodate the length limit.
So in this case the valid interface created by nmcli was "enp94s0f0n.2507". We should attempt to replicate this behavior.

Also attached a screenshot of the working interface.
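
The truncation behavior described above can be sketched as follows. This is a minimal illustration in Python, not the actual static-networking script; the function name `truncate_vlan_interface_name` and the constant `MAX_IFNAME_LEN` are hypothetical. It keeps the vlan suffix intact and shortens only the base interface name so that the result fits the 15-character limit.

```python
MAX_IFNAME_LEN = 15  # limit reported above; longer names are rejected


def truncate_vlan_interface_name(base: str, vlan_id: int,
                                 max_len: int = MAX_IFNAME_LEN) -> str:
    """Build '<base>.<vlan_id>', truncating the base name (never the
    vlan id) when the combined name would exceed max_len."""
    suffix = f".{vlan_id}"
    name = base + suffix
    if len(name) <= max_len:
        return name
    keep = max_len - len(suffix)
    if keep < 1:
        raise ValueError(f"vlan suffix {suffix!r} leaves no room for a base name")
    return base[:keep] + suffix


if __name__ == "__main__":
    print(truncate_vlan_interface_name("enp94s0f0np0", 2507))
```

For the interface from this report, the 17-character name enp94s0f0np0.2507 becomes enp94s0f0n.2507, which matches what nmcli produced.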

Description of problem:

'Show tooltips' toggle is added on resource YAML page, but the checkbox icon seems not aligned with other icons

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-05-23-103225

How reproducible:

Always

Steps to Reproduce:

1. Go to any resource YAML page and check the position of the 'Show tooltips' icon
2.
3.

Actual results:

1. the checkbox is a little above other icons, see screenshot https://drive.google.com/file/d/10wKeRaaE76GBXBph93wAkFCWYGrEKcA9/view?usp=share_link 

Expected results:

1. all icons should be aligned

Additional info:

 

Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-13960.

1. Proposed title of this feature request
Support new Azure LoadBalancer 100min idle TCP timeout

2. What is the nature and description of the request ?
When provisioning a service of type LoadBalancer for OCP cluster on Azure, it is possible to customize TCP idle timeouts in minutes using the LoadBalancer annotations 'service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout'

Currently, the min and max values are hardcoded to 4 and 30 respectively, in both the legacy Azure Cloud Provider implementation and Cloud Provider Azure.

Azure recently upgraded its implementation to support a maximum idle timeout of 100 min. The corresponding documentation, Configure TCP reset and idle timeout for Azure Load Balancer, should be updated soon. An idle timeout of more than 30 min can now be set manually in the Azure portal or with the Azure CLI, but not from a K8s load balancer, because the maximum value is still 30 min in the K8s code.
Error message returned is

`Warning  SyncLoadBalancerFailed  2s (x3 over 18s)    service-controller  Error syncing load balancer: failed to ensure load balancer: idle timeout value must be a whole number representing minutes between 4 and 30`

3. Why does the customer need this? (List the business requirements here)
The customer is migrating workloads from an on-premise datacenter to Azure. An idle timeout of more than 30 min is critical for migrating some of our customer links to Azure, and the migration is blocked until this is supported by OpenShift.

4. List any affected packages or components.
Azure cloud controller
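
The requested range check can be sketched as follows. This is a minimal illustration in Python, not the cloud provider's actual code (which is written in Go). The function name `parse_idle_timeout` and its parameters are hypothetical, and the error text is only modeled on the message quoted above. The lower bound of 4 and the upper bounds of 30 (current) and 100 (requested) come from this request.

```python
def parse_idle_timeout(value: str, min_minutes: int = 4,
                       max_minutes: int = 100) -> int:
    """Parse the annotation value as whole minutes and enforce the range.

    max_minutes defaults to the requested new limit of 100; passing 30
    reproduces the current behavior described in this request."""
    message = (
        "idle timeout value must be a whole number representing minutes "
        f"between {min_minutes} and {max_minutes}"
    )
    try:
        minutes = int(value)
    except ValueError:
        raise ValueError(message) from None
    if not min_minutes <= minutes <= max_minutes:
        raise ValueError(message)
    return minutes


if __name__ == "__main__":
    print(parse_idle_timeout("100"))  # accepted with the raised limit
    try:
        parse_idle_timeout("100", max_minutes=30)
    except ValueError as err:
        print(err)  # rejected under the current 30-minute limit
```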

Seeing segfault failures related to HAProxy on multiple platforms that begin around the same time as the [HAProxy bump|http://example.com] like:

{ nodes/ci-op-5s09hi2q-0dd98-rwds8-worker-centralus1-8nkx5/journal.gz:Apr 10 06:21:54.317971 ci-op-5s09hi2q-0dd98-rwds8-worker-centralus1-8nkx5 kernel: haproxy[302399]: segfault at 0 ip 0000556eadddafd0 sp 00007fff0cceed50 error 4 in haproxy[556eadc00000+2a3000]}

Sippy Node Process Segfaulted

release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1645265104259780608

periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn-upgrade/1645265114720374784

periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade/1644449798939480064

Description of problem:

The IPv6 VIP does not seem to be present in the keepalived.conf.

networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: fd65:10:128::/56
    hostPrefix: 64
  machineNetwork:
  - cidr: 192.168.110.0/23
  - cidr: fd65:a1a8:60ad::/112
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
  - fd65:172:16::/112
platform:
  vsphere:
    apiVIPs:
    - 192.168.110.116
    - fd65:a1a8:60ad:271c::1116
    ingressVIPs:
    - 192.168.110.117
    - fd65:a1a8:60ad:271c::1117
    vcenters:
    - datacenters:
      - IBMCloud
      server: ibmvcenter.vmc-ci.devcluster.openshift.com

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-04-21-084440

How reproducible:

Frequently.
2 failures out of 3 attempts.

Steps to Reproduce:

1. Install vSphere dual-stack with dual VIPs, see above config
2. Check keepalived.conf
for f in $(oc get pods -n openshift-vsphere-infra -l app=vsphere-infra-vrrp --no-headers -o custom-columns=N:.metadata.name  ) ; do oc -n openshift-vsphere-infra exec -c keepalived $f -- cat /etc/keepalived/keepalived.conf | tee $f-keepalived.conf ; done

Actual results:

IPv6 VIP is not in keepalived.conf

Expected results:

vrrp_instance rbrattai_INGRESS_1 {
    state BACKUP
    interface br-ex
    virtual_router_id 129
    priority 20
    advert_int 1

    unicast_src_ip fd65:a1a8:60ad:271c::cc
    unicast_peer {
        fd65:a1a8:60ad:271c:9af:16a9:cb4f:d75c
        fd65:a1a8:60ad:271c:86ec:8104:1bc2:ab12
        fd65:a1a8:60ad:271c:5f93:c9cf:95f:9a6d
        fd65:a1a8:60ad:271c:bb4:de9e:6d58:89e7
        fd65:a1a8:60ad:271c:3072:2921:890:9263
    }
...
    virtual_ipaddress {
        fd65:a1a8:60ad:271c::1117/128
    }
...
}

Additional info:

See OPNET-207
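
The check in step 2 can be automated once the keepalived.conf files are saved locally. This is a minimal sketch in Python; the helper names `virtual_ips` and `has_ipv6_vip` are hypothetical, and it assumes each `virtual_ipaddress` block lists one address per line, as in the expected output above.

```python
import ipaddress
import re


def virtual_ips(conf: str):
    """Return every address listed in virtual_ipaddress { ... } blocks."""
    vips = []
    for block in re.findall(r"virtual_ipaddress\s*\{([^}]*)\}", conf):
        for line in block.splitlines():
            tokens = line.split()
            # Skip blank lines and keepalived comments ('#' or '!').
            if not tokens or tokens[0][0] in "#!":
                continue
            # The first token is the address, optionally with a /prefix.
            vips.append(ipaddress.ip_interface(tokens[0]).ip)
    return vips


def has_ipv6_vip(conf: str) -> bool:
    """True when at least one configured VIP is an IPv6 address."""
    return any(ip.version == 6 for ip in virtual_ips(conf))
```

Running `has_ipv6_vip(open(path).read())` over each saved `*-keepalived.conf` file flags the nodes where the IPv6 VIP is missing.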

Description of problem:

It seems that we don't correctly update the network data secret version in the PreprovisioningImage, resulting in BMO assuming that the image is still stale, while the image-customization-controller assumes it's done. As a result, the host is stuck in inspecting.

How reproducible:

What I think I did is to add a network data secret to a host which already has a preprovisioningimage previously created. I need to check if I can repeat it.

Actual results:

Host in inspecting, BMO logs show

{"level":"info","ts":"2023-05-11T11:52:52.348Z","logger":"controllers.BareMetalHost","msg":"network data in pre-provisioning image is out of date","baremetalhost":"openshift-machine-api/ostest-extraworker-0","provisioningState":"inspecting","latestVersion":"9055823","currentVersion":"9055820"}

Indeed, the image has the old version:

status:
  architecture: x86_64
  conditions:
  - lastTransitionTime: "2023-05-11T11:27:51Z"
    message: Generated image
    observedGeneration: 1
    reason: ImageSuccess
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-05-11T11:27:51Z"
    message: ""
    observedGeneration: 1
    reason: ImageSuccess
    status: "False"
    type: Error
  format: iso
  imageUrl: http://metal3-image-customization-service.openshift-machine-api.svc.cluster.local/231b39d5-1b83-484c-9096-aa87c56a222a
  networkData:
    name: ostest-extraworker-0-network-config-secret
    version: "9055820"

What I find puzzling is that we even have two versions of the secret. I only created it once.

What

Address issues and PRs.

In particular:

  • Make downstream version bump
  • Merge Standa's Open PR.

Why

A healthy open source repo is being maintained and keeps users.

Description of problem:

Unable to set protectKernelDefaults from "true" to "false" in kubelet.conf on the nodes in RHOCP4.13 although this was possible in RHOCP4.12.

Version-Release number of selected component (if applicable):

   Red Hat OpenShift Container Platform Version Number: 4
   Release Number: 13
   Kubernetes Version: v1.26.3+b404935
   Docker Version: N/A
   Related Package Version: 
	   - cri-o-1.26.3-3.rhaos4.13.git641290e.el9.x86_64
   Related Middleware/Application: none
   Underlying RHEL Release Number: Red Hat Enterprise Linux CoreOS release 4.13
   Underlying RHEL Architecture: x86_64
   Underlying RHEL Kernel Version: 5.14.0-284.13.1.el9_2.x86_64
   
Drivers or hardware or architecture dependency: none

How reproducible:


 always

Steps to Reproduce:

    1. Deploy OCP cluster using RHCOS
    2. Set protectKernelDefaults as true using the document [1]

Actual results:

protectKernelDefaults can't be set.

Expected results:

 protectKernelDefaults can be set.

Additional info:



protectKernelDefaults is NOT set in kubelet.conf

    ---
    # oc debug node/ocp4-worker1

    # chroot /host

    # cat /etc/kubernetes/kubelet.conf
      ...
      "protectKernelDefaults": true, <- NOT modified. Moreover, the format is changed to json.
      ...
    ---

Also    "protectKernelDefaults: false" does not seem to be set into the machineConfig created by kubeletConfig Kind. See below:

    ---
    # oc get mc 99-worker-generated-kubelet -o yaml
    ...
    storage:
      files:
      - contents:
          compression: "" 
          source: data:text/plain;charset=utf-8;base64, [The contents of kubelet.conf encoded with base64]
        mode: 420
        overwrite: true
        path: /etc/kubernetes/kubelet.conf

    // Write [The contents of kubelet.conf encoded with base64] to the file.
    # vim kubelet.conf 

    // Decode [The contents of kubelet.conf encoded with base64]
    # cat kubelet.conf | base64 -d
    ...
    "protectKernelDefaults": true, <- "protectKernelDefaults: false" is not set.
    ----



[1] https://access.redhat.com/solutions/6974438
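
The manual decode steps above can also be scripted. This is a minimal sketch in Python; the function names `decode_data_url` and `protect_kernel_defaults` are hypothetical. It assumes the machine config `source` field is a base64 `data:` URL as shown above and that the decoded kubelet.conf is JSON, as this report observes.

```python
import base64
import json


def decode_data_url(source: str) -> str:
    """Decode a 'data:<media type>;base64,<payload>' URL into text."""
    header, sep, payload = source.partition(",")
    if not sep or not header.startswith("data:"):
        raise ValueError("not a data URL")
    if not header.endswith(";base64"):
        raise ValueError("only base64-encoded data URLs are handled here")
    return base64.b64decode(payload.strip()).decode("utf-8")


def protect_kernel_defaults(source: str):
    """Return the protectKernelDefaults value from an encoded kubelet.conf,
    or None when the key is absent."""
    return json.loads(decode_data_url(source)).get("protectKernelDefaults")
```

Passing the `source` value from `oc get mc 99-worker-generated-kubelet -o yaml` to `protect_kernel_defaults` shows which value the generated machine config actually carries.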

Sanitize OWNERS/OWNER_ALIASES:

1) OWNERS must have:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of Storage team.

Description of the problem:

After successfully creating a hosted cluster using the CAPI agent provider with 6 worker nodes (on two different subnets), I attempted to scale down the nodepool to 0 replicas.

2 agents returned to the infraenv in "known-unbound" state, but the other 4 are still bound to the cluster, and their related Machine CRs are stuck in phase Deleting

$ oc get machines.cluster.x-k8s.io -n clusters-hosted-1
NAME                        CLUSTER          NODENAME            PROVIDERID                                     PHASE      AGE   VERSION
hosted-1-6655884866-dr4mv   hosted-1-vhc4f   hosted-rwn-1-1      agent://4cc93549-45cd-42a9-8c61-5d72b802ebe5   Deleting   94m   4.14.0-ec.3
hosted-1-6655884866-fkfjf   hosted-1-vhc4f   hosted-worker-1-0   agent://324afeeb-1af1-45d9-a2ba-f1101ffb6a6b   Deleting   94m   4.14.0-ec.3
hosted-1-6655884866-nzflz   hosted-1-vhc4f   hosted-rwn-1-2      agent://50b12199-7e95-4b3a-a5ce-d4aa0fa7909e   Deleting   94m   4.14.0-ec.3
hosted-1-6655884866-pc67l   hosted-1-vhc4f   hosted-worker-1-2   agent://284eb9e6-4375-4e59-9a11-a0a3131aa08b   Deleting   94m   4.14.0-ec.3 

In the capi-provider pod logs I have the following:

time="2023-07-25T15:23:27Z" level=error msg="failed to add finalizer agentmachine.agent-install.openshift.io/deprovision to resource hosted-1-2ntnh clusters-hosted-1" func="github.com/openshift/cluster-api-provider-agent/controllers.(*AgentMachineReconciler).handleDeletionHook" file="/remote-source/app/controllers/agentmachine_controller.go:206" agent_machine=hosted-1-2ntnh agent_machine_namespace=clusters-hosted-1 error="Operation cannot be fulfilled on agentmachines.capi-provider.agent-install.openshift.io \"hosted-1-2ntnh\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/capi-provider.agent-install.openshift.io/agentmachines/clusters-hosted-1/hosted-1-2ntnh, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 75febba6-8e98-4fca-861f-e83c467a3368, UID in object meta: " 

and

time="2023-07-25T15:23:50Z" level=error msg="Failed to get agentMachine clusters-hosted-1/hosted-1-l4pp7" func="github.com/openshift/cluster-api-provider-agent/controllers.(*AgentMachineReconciler).Reconcile" file="/remote-source/app/controllers/agentmachine_controller.go:95" agent_machine=hosted-1-l4pp7 agent_machine_namespace=clusters-hosted-1 error="AgentMachine.capi-provider.agent-install.openshift.io \"hosted-1-l4pp7\" not found" 

Actual results:

4 out of 6 agents are still bound to cluster

Expected results:

The nodepool is scaled to 0 replicas

Description of problem:

After customizing the routes for Console and Downloads, the `Downloads` route is not updated on the `https://custom-console-route/command-line-tools` page, which still points to the old/default downloads route.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Customize Console and Downloads routes.
2. Access the web-console using custom console route.
3. Go to Command-line-tools.
4. Try to access the downloads urls.

Actual results:

The download URLs point to the default/old downloads route

Expected results:

The download URLs should point to the custom downloads route

Additional info:

 

Description of problem:

As discovered in https://bugzilla.redhat.com/show_bug.cgi?id=2111632 the dispatcher scripts don't have permission to set the hostname directly. We need to use systemd-run to get them into an appropriate SELinux context.

I doubt the static DHCP scripts are still being used intentionally since we have proper static IP support now, but since the fix is pretty trivial we should go ahead and do it since technically the feature is still supported.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

CR.status.lastSyncGeneration is not updated in STS mode (AWS). 

Steps to Reproduce:

See https://issues.redhat.com/browse/OCPBUGS-16684.

Description of problem:

On Azure, when the vmSize or location field is dropped from the CPMS's providerSpec, a master ends up in a creating/deleting loop.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-25-210451

How reproducible:

always

Steps to Reproduce:

1. Create an Azure cluster with a CPMS
2. Activate the CPMS
3. Drop the vmsize field from the providerSpec

Actual results:

New machine is created, deleted, created, deleted ...
$ oc get machine         
NAME                                    PHASE      TYPE              REGION   ZONE   AGE
zhsuncpms1-7svhz-master-0               Running    Standard_D8s_v3   eastus   2      3h21m
zhsuncpms1-7svhz-master-1               Running    Standard_D8s_v3   eastus   3      3h21m
zhsuncpms1-7svhz-master-2               Running    Standard_D8s_v3   eastus   1      3h21m
zhsuncpms1-7svhz-master-l489k-0         Deleting                                     0s
zhsuncpms1-7svhz-worker-eastus1-6vsl4   Running    Standard_D4s_v3   eastus   1      3h16m
zhsuncpms1-7svhz-worker-eastus2-dpvp9   Running    Standard_D4s_v3   eastus   2      3h16m
zhsuncpms1-7svhz-worker-eastus3-sg7dx   Running    Standard_D4s_v3   eastus   3      19m
$ oc get machine  
NAME                                    PHASE     TYPE              REGION   ZONE   AGE
zhsuncpms1-7svhz-master-0               Running   Standard_D8s_v3   eastus   2      3h26m
zhsuncpms1-7svhz-master-1               Running   Standard_D8s_v3   eastus   3      3h26m
zhsuncpms1-7svhz-master-2               Running   Standard_D8s_v3   eastus   1      3h26m
zhsuncpms1-7svhz-master-wmnfq-0                                                     1s
zhsuncpms1-7svhz-worker-eastus1-6vsl4   Running   Standard_D4s_v3   eastus   1      3h21m
zhsuncpms1-7svhz-worker-eastus2-dpvp9   Running   Standard_D4s_v3   eastus   2      3h21m
zhsuncpms1-7svhz-worker-eastus3-sg7dx   Running   Standard_D4s_v3   eastus   3      24m

$ oc get controlplanemachineset   
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         4         3                               Active   25m
$ oc get co control-plane-machine-set      
NAME                        VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
control-plane-machine-set   4.12.0-0.nightly-2022-10-25-210451   True        True          False      4h38m   Observed 3 replica(s) in need of update

Expected results:

Either errors are logged and no machine is created, or the new machine is created successfully.

Additional info:

After dropping vmSize, a new machine can be created (the default value seems to be Standard_D4s_v3), but updating without vmSize is not allowed.
$ oc get machine                
NAME                                      PHASE         TYPE              REGION   ZONE   AGE
zhsunazure11-cdbs8-master-0               Running       Standard_D8s_v3   eastus   2      4h7m
zhsunazure11-cdbs8-master-000             Provisioned   Standard_D4s_v3   eastus   2      48s
zhsunazure11-cdbs8-master-1               Running       Standard_D8s_v3   eastus   3      4h7m
zhsunazure11-cdbs8-master-2               Running       Standard_D8s_v3   eastus   1      4h7m
zhsunazure11-cdbs8-worker-eastus1-5v66l   Running       Standard_D4s_v3   eastus   1      4h1m
zhsunazure11-cdbs8-worker-eastus1-test    Running       Standard_D4s_v3   eastus   1      7m45s
zhsunazure11-cdbs8-worker-eastus2-hm9bm   Running       Standard_D4s_v3   eastus   2      4h1m
zhsunazure11-cdbs8-worker-eastus3-7j9kf   Running       Standard_D4s_v3   eastus   3      4h1m

$ oc edit machineset zhsuncpms1-7svhz-worker-eastus3         
error: machinesets.machine.openshift.io "zhsuncpms1-7svhz-worker-eastus3" could not be patched: admission webhook "validation.machineset.machine.openshift.io" denied the request: providerSpec.vmSize: Required value: vmSize should be set to one of the supported Azure VM sizes

Description of problem:

A leftover comment in CPMSO tests is causing a linting issue.

Version-Release number of selected component (if applicable):

4.13.z, 4.14.0

How reproducible:

Always

Steps to Reproduce:

1. make lint
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When using a disconnected env and OPENSHIFT_INSTALL_RELEASE_IMAGE_MIRROR env var is specified, the create-cluster-and-infraenv service fails[*].
Seems that the issue happens due to a missing registries.conf in the assisted-service container, which is required for pulling the image.

[*]
create-cluster-and-infraenv[2784]: level=fatal msg="Failed to register cluster with assisted-service: command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=true quay.io/openshift-release-dev/ocp-release@sha256:3c050cb52fdd3e65c518d4999d238ec026ef724503f275377fee6bf0d33093ab --registry-config=/tmp/registry-config1560177852' exited with non-zero exit code 1: \nerror: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:3c050cb52fdd3e65c518d4999d238ec026ef724503f275377fee6bf0d33093ab: Get "http://quay.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\n"

Version-Release number of selected component (if applicable):

4.14

How reproducible:

100%

Steps to Reproduce:

1. Add registries.conf with mirror config set to a local registry (e.g. use imageContentSources in install-config)
2. Ensure that a custom release image mirror that refers to the registry is set in the OPENSHIFT_INSTALL_RELEASE_IMAGE_MIRROR env var.
3. Boot the machine on a disconnected env.

Actual results:

create-cluster-and-infraenv service fails to pull the release image.

Expected results:

create-cluster-and-infraenv service should finish successfully.

Additional info:

Pushed a PR to the installer for propagating registries.conf: https://github.com/openshift/installer/pull/7332

We have a workaround in the appliance by overriding the service:
https://github.com/openshift/appliance/pull/94/

 

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/470

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Updating the availability requirement is disabled on the Edit PDB page. Also, when the user tries to edit it, the current value is cleared, so the user has no idea what the current setting is.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-04-03-211601

How reproducible:

Always

Steps to Reproduce:

1. Go to the deployment page -> Actions -> Add PodDisruptionBudget
2. On the 'Create PodDisruptionBudget' page, set the following fields and hit 'Create'
Name: example-pdb
Availability requirement:  maxUnavailable: 2
3. Make sure pdb/example-pdb is successfully created
$ oc get pdb
NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
example-pdb   N/A             2                 2                     99s
4. Go to the deployment page again, Actions -> Edit PodDisruptionBudget

Actual results:

The 'Availability requirement' value is disabled from editing by default. When the user clicks 'maxUnavailable', the value is set to empty (the user has no idea what the original value was)

Expected results:

When editing a PDB, the form should load with the current value, and the user should be able to update the values by default

Additional info:

 

Description of problem:

[AWS EBS CSI Driver Operator] should not update the default storageclass annotation back after customers remove the default storageclass annotation

Version-Release number of selected component (if applicable):

Server Version: 4.14.0-0.nightly-2023-06-08-102710

How reproducible:

Always

Steps to Reproduce:

1. Install an aws openshift cluster
2. Create 6 extra storage classes(any sc is ok)
3. Overwrite all the sc with storageclass.kubernetes.io/is-default-class=false and check that all the sc are set as non-default
4. Overwrite all the sc with storageclass.kubernetes.io/is-default-class=true
5. Repeat steps 3 and 4 several times

Actual results:

After overwriting all the sc with storageclass.kubernetes.io/is-default-class=false, the annotation is sometimes restored by the driver operator

Expected results:

Overwriting all the sc with storageclass.kubernetes.io/is-default-class=false should always succeed

Additional info:

 

Description of problem:
This is a clone of the doc issue OCPBUGS-9162.

Importing JAR files doesn't work if the Cluster Samples Operator is not installed. This is a common issue in disconnected clusters where the Cluster Samples Operator is disabled by default. Users should not see the JAR import option if it's not working correctly.

Version-Release number of selected component (if applicable):
4.9+

How reproducible:
Always, when the samples operator is not installed

Steps to Reproduce:

  1. Setup a cluster without samples operator or uninstall all "Java" Builder Images (ImageStreams from the openshift namespace)
  2. Switch to the Developer perspective
  3. Navigate to Add > Import JAR file
  4. Upload a JAR file and press Create

Actual results:
Import doesn't work

Expected results:
The Import JAR file option should not be displayed if no "Java" Builder Image (ImageStream in the openshift namespace) is available

Additional info:

  1. https://docs.openshift.com/container-platform/4.9/applications/creating_applications/odc-creating-applications-using-developer-perspective.html#odc-deploying-java-applications_odc-creating-applications-using-developer-perspective
  2. https://docs.openshift.com/container-platform/4.11/post_installation_configuration/cluster-capabilities.html
  3. https://docs.openshift.com/container-platform/4.9/openshift_images/configuring-samples-operator.html
  4. https://github.com/jerolimov/openshift/blob/master/notes/cluster.md

This is a clone of issue OCPBUGS-18485. The following is the description of the original issue:

Description of problem:

In the developer console, go to "Observe -> openshift-monitoring -> Alerts" and silence the Watchdog alert. At first the alert state is Silenced in the Alerts tab, but it quickly changes to Firing (the alert is actually silenced); see the attached screenshot.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-02-132842

How reproducible:

always

Steps to Reproduce:

1. silence alert in the dev console, and check alert state in Alerts tab
2.
3.

Actual results:

alert state is changed from Silenced to Firing quickly

Expected results:

state should be Silenced

This is a clone of issue OCPBUGS-18788. The following is the description of the original issue:

Description of problem:

metal3-baremetal-operator-7ccb58f44b-xlnnd pod failed to start on the SNO baremetal dualstack cluster:

Events:
  Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Normal   Scheduled               34m                    default-scheduler  Successfully assigned openshift-machine-api/metal3-baremetal-operator-7ccb58f44b-xlnnd to sno.ecoresno.lab.eng.tlv2.redhat.com
  Warning  FailedScheduling        34m                    default-scheduler  0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports..
  Warning  FailedCreatePodSandBox  34m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to add hostport mapping for sandbox k8s_metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0(c4a8b353e3ec105d2bff2eb1670b82a0f226ac1088b739a256deb9dfae6ebe54): cannot open hostport 60000 for pod k8s_metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0_: listen tcp4 :60000: bind: address already in use
  Warning  FailedCreatePodSandBox  34m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to add hostport mapping for sandbox k8s_metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0(9e6960899533109b02fbb569c53d7deffd1ac8185cef3d8677254f9ccf9387ff): cannot open hostport 60000 for pod k8s_metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0_: listen tcp4 :60000: bind: address already in use

Version-Release number of selected component (if applicable):

4.14.0-rc.0

How reproducible:

so far once

Steps to Reproduce:

1. Deploy disconnected baremetal SNO node with dualstack networking with agent-based installer
2.
3.

Actual results:

metal3-baremetal-operator pod fails to start

Expected results:

metal3-baremetal-operator pod is running

Additional info:

Checking the ports on the node showed that the `kube-apiserver` process was bound to the port:

tcp   ESTAB      0      0                                                [::1]:60000                        [::1]:2379    users:(("kube-apiserver",pid=43687,fd=455))


After rebooting the node all pods started as expected

Description of problem:

Critical Alert Rules do not have runbook url

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

This bug is being raised by Openshift Monitoring team as part of effort to detect invalid Alert Rules in OCP.

1. Check details of KubeSchedulerDown Alert Rule
2.
3.

Actual results:

The Alert Rule KubeSchedulerDown has Critical Severity, but does not have runbook_url annotation.

Expected results:

All Critical Alert Rules must have a runbook_url annotation

Additional info:

Critical Alerts must have a runbook, please refer to style guide at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide 

The runbooks are located at github.com/openshift/runbooks

To resolve the bug, 
 - Add runbooks for the relevant Alerts at github.com/openshift/runbooks
 - Add the link to the runbook in the Alert annotation 'runbook_url'
 - Remove the exception in the origin test, added in PR https://github.com/openshift/origin/pull/27933
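
The check described above can be sketched roughly as follows. This is only an illustration in Python, not the origin test itself: the rule dictionaries mimic the `labels` and `annotations` layout of Prometheus alerting rules, while the helper name `critical_alerts_missing_runbook` and the sample data are hypothetical.

```python
def critical_alerts_missing_runbook(rules):
    """Return the names of critical alert rules that lack a runbook_url annotation."""
    missing = []
    for rule in rules:
        severity = rule.get("labels", {}).get("severity")
        runbook = rule.get("annotations", {}).get("runbook_url", "")
        # Only critical alerts are required to carry a runbook link.
        if severity == "critical" and not runbook.strip():
            missing.append(rule["alert"])
    return missing


# Hypothetical sample rules, for illustration only.
SAMPLE_RULES = [
    {"alert": "KubeSchedulerDown",
     "labels": {"severity": "critical"},
     "annotations": {"summary": "Scheduler target has disappeared"}},
    {"alert": "ExampleWarning",
     "labels": {"severity": "warning"},
     "annotations": {}},
    {"alert": "ExampleCritical",
     "labels": {"severity": "critical"},
     "annotations": {"runbook_url": "https://github.com/openshift/runbooks"}},
]

if __name__ == "__main__":
    print(critical_alerts_missing_runbook(SAMPLE_RULES))
```

Warning-level rules are deliberately ignored here, since the requirement quoted above applies to critical alerts only.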

Description of problem:

The reconciler removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources whether the pod is alive or not. 

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create pods and check the overlappingrangeipreservations.whereabouts.cni.cncf.io resources:

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
NAMESPACE          NAME                      AGE
openshift-multus   2001-1b70-820d-4b04--13   4m53s
openshift-multus   2001-1b70-820d-4b05--13   4m49s

2. Verify that the ip-reconciler cronjob removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources when it runs:

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             4d13h

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
No resources found

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5s              4d13h

 

Actual results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io resources are removed for each created pod by the ip-reconciler cronjob.
The "overlapping ranges" are not used. 

Expected results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io should not be removed regardless of if a pod has used an IP in the overlapping ranges.
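
One reading of the problem statement is that a reservation should only be cleaned up once its owning pod is gone. The Python sketch below illustrates that rule under stated assumptions: reservations are modeled as a plain mapping from reservation name to a "namespace/name" pod reference, and `stale_reservations` is a hypothetical helper, not the reconciler's actual code.

```python
def stale_reservations(reservations, live_pods):
    """Return the names of reservations whose owning pod is no longer alive.

    reservations: mapping of reservation name -> pod reference ("namespace/name").
    live_pods: set of pod references for pods that currently exist.
    """
    return sorted(
        name for name, podref in reservations.items()
        if podref not in live_pods
    )


if __name__ == "__main__":
    # Hypothetical pod references paired with the reservation names from the report.
    reservations = {
        "2001-1b70-820d-4b04--13": "default/pod-a",
        "2001-1b70-820d-4b05--13": "default/pod-b",
    }
    # Both pods are alive, so nothing should be removed.
    print(stale_reservations(reservations, {"default/pod-a", "default/pod-b"}))
```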

Additional info:

 

Description of problem:

When user-defined taints are set in a machineset and the machineset is scaled up, the instance can join the cluster and the Node becomes Ready, but pods cannot be deployed. The node yaml file shows that the uninitialized taint was not removed.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-20-215234

How reproducible:

Always

Steps to Reproduce:

1. Set up a cluster on Azure
2. Create a machineset with a taint
      taints:
      - effect: NoSchedule
        key: mapi
        value: mapi_test
3. Check the node yaml file

Actual results:

The uninitialized taint is still on the node, and there is no providerID in the node.
$ oc get node 
NAME                                              STATUS   ROLES                  AGE   VERSION
zhsun724-mh4dt-master-0                           Ready    control-plane,master   9h    v1.27.3+4aaeaec
zhsun724-mh4dt-master-1                           Ready    control-plane,master   9h    v1.27.3+4aaeaec
zhsun724-mh4dt-master-2                           Ready    control-plane,master   9h    v1.27.3+4aaeaec
zhsun724-mh4dt-worker-westus21-8rzqw              Ready    worker                 21m   v1.27.3+4aaeaec
zhsun724-mh4dt-worker-westus21-additional-q58zp   Ready    worker                 9h    v1.27.3+4aaeaec
zhsun724-mh4dt-worker-westus21-additional-vwwhh   Ready    worker                 9h    v1.27.3+4aaeaec
zhsun724-mh4dt-worker-westus21-v7k7s              Ready    worker                 9h    v1.27.3+4aaeaec
zhsun724-mh4dt-worker-westus22-ggxql              Ready    worker                 9h    v1.27.3+4aaeaec
zhsun724-mh4dt-worker-westus23-zf8l5              Ready    worker                 9h    v1.27.3+4aaeaec

$ oc edit node zhsun724-mh4dt-worker-westus21-8rzqw
spec:
  taints:
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
  - effect: NoSchedule
    key: mapi
    value: mapi_test

Expected results:

uninitialized taint is removed, providerID is set in node.

Additional info:

must-gather: https://drive.google.com/file/d/12ypYmHN98j9lyWCS9Dgaqq5MLpftqEkS/view?usp=sharing

It seems the e2e-metal-ipi-ovn-dualstack job is permafailing the last couple of days.
sippy link

one common symptom seems to be that some nodes are being fully provisioned.
here is an example from this job

you can see the clusteroperators are not happy and specifically machine-api is stuck in init

Description of problem:

OCP 4.14 installation fails.

Waiting for the UPI installation to complete using wait-for ends with a cluster operator (CO) error:
```
$ openshift-install wait-for install-complete --log-level=debug

level=error msg=failed to initialize the cluster: Cluster operator control-plane-machine-set is not available
```

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          122m    Unable to apply 4.14.0-0.nightly-2023-07-18-085740: the cluster operator control-plane-machine-set is not available
```

```
$ oc get co | grep control-plane-machine-set
control-plane-machine-set                  4.14.0-0.nightly-2023-07-18-085740   False       False         True       6h47m   Missing 3 available replica(s)
```

Version-Release number of selected component (if applicable):

Openshift on Openstack
OCP 4.14.0-0.nightly-2023-07-18-085740
RHOS-16.2-RHEL-8-20230413.n.1
UPI installation

How reproducible:

Always

Steps to Reproduce:

Run the UPI openshift installation  

Actual results:

UPI installation fails

Expected results:

UPI installation passes

Additional info:

  • Last UPI successful installation in D/S CI used: 4.14.0-0.nightly-2023-07-05-191022 
  • control-plane-machine-set-operator log:
$ oc logs -n openshift-machine-api control-plane-machine-set-operator-5cbb7f68cc-h5f4p | tail
E0719 14:20:52.645504       1 controller.go:649]  "msg"="Observed unmanaged control plane nodes" "error"="found unmanaged control plane nodes, the following node(s) do not have associated machines: ostest-c2drn-master-0, ostest-c2drn-master-1, ostest-c2drn-master-2" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="1984ddf9-506f-4d10-88e5-0787b305484e" "unmanagedNodes"="ostest-c2drn-master-0,ostest-c2drn-master-1,ostest-c2drn-master-2"
I0719 14:20:52.645530       1 controller.go:268]  "msg"="Cluster state is degraded. The control plane machine set will not take any action until issues have been resolved." "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="1984ddf9-506f-4d10-88e5-0787b305484e"
I0719 14:20:52.667462       1 controller.go:212]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="1984ddf9-506f-4d10-88e5-0787b305484e"
I0719 14:20:52.668013       1 controller.go:156]  "msg"="Reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce"
I0719 14:20:52.668718       1 controller.go:121]  "msg"="Reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="e80d898c-9a8d-4774-8f22-fb464be45758"
I0719 14:20:52.668780       1 controller.go:142]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="e80d898c-9a8d-4774-8f22-fb464be45758"
I0719 14:20:52.669005       1 status.go:119]  "msg"="Observed Machine Configuration" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "observedGeneration"=1 "readyReplicas"=0 "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce" "replicas"=0 "unavailableReplicas"=3 "updatedReplicas"=0
E0719 14:20:52.669237       1 controller.go:649]  "msg"="Observed unmanaged control plane nodes" "error"="found unmanaged control plane nodes, the following node(s) do not have associated machines: ostest-c2drn-master-0, ostest-c2drn-master-1, ostest-c2drn-master-2" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce" "unmanagedNodes"="ostest-c2drn-master-0,ostest-c2drn-master-1,ostest-c2drn-master-2"
I0719 14:20:52.669267       1 controller.go:268]  "msg"="Cluster state is degraded. The control plane machine set will not take any action until issues have been resolved." "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce"
I0719 14:20:52.669842       1 controller.go:212]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce"
  • The nodes are up:
[cloud-user@installer-host ~]$ oc get nodes
NAME                    STATUS   ROLES                  AGE     VERSION
ostest-c2drn-master-0   Ready    control-plane,master   6h55m   v1.27.3+4aaeaec
ostest-c2drn-master-1   Ready    control-plane,master   6h55m   v1.27.3+4aaeaec
ostest-c2drn-master-2   Ready    control-plane,master   6h55m   v1.27.3+4aaeaec
ostest-c2drn-worker-0   Ready    worker                 6h36m   v1.27.3+4aaeaec
ostest-c2drn-worker-1   Ready    worker                 6h35m   v1.27.3+4aaeaec
ostest-c2drn-worker-2   Ready    worker                 6h36m   v1.27.3+4aaeaec 

 

Description of problem:

On the command-line-tools page, the title is "Command line tools" instead of "Command Line Tools"

Version-Release number of selected component (if applicable):

 

How reproducible:

1/1

Steps to Reproduce:

1. Go to the command-line-tools page
2. Check the title

Actual results:

the title is "Command line tools"

Expected results:

the title should be "Command Line Tools"

Additional info:

 

When implementing support for IPv6-primary dual-stack clusters, we have extended the available IP families to

const (
	IPFamiliesIPv4                 IPFamiliesType = "IPv4"
	IPFamiliesIPv6                 IPFamiliesType = "IPv6"
	IPFamiliesDualStack            IPFamiliesType = "DualStack"
	IPFamiliesDualStackIPv6Primary IPFamiliesType = "DualStackIPv6Primary"
)

At the same time, the definitions of the kubelet.service systemd unit still contain the code

{{- if eq .IPFamilies "DualStack"}}
        --node-ip=${KUBELET_NODE_IPS} \
{{- else}}
        --node-ip=${KUBELET_NODE_IP} \
{{- end}}

which only matches the "old" dual-stack family. Because of this, an IPv6-primary dual-stack cluster renders the node-ip parameter with only 1 IP address instead of the 2 required for dual-stack.
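
The intended branching can be modeled in a few lines. The sketch below is written in Python purely to illustrate the condition (the real unit file is a Go template, and this is not the actual fix); `node_ip_flag` and `DUAL_STACK_FAMILIES` are hypothetical names, while the family strings and flag values are taken from the snippets above.

```python
# Both dual-stack variants need the two-address form of --node-ip.
DUAL_STACK_FAMILIES = {"DualStack", "DualStackIPv6Primary"}


def node_ip_flag(ip_families):
    """Pick the kubelet --node-ip argument for the given IP family setting."""
    if ip_families in DUAL_STACK_FAMILIES:
        return "--node-ip=${KUBELET_NODE_IPS}"
    return "--node-ip=${KUBELET_NODE_IP}"


if __name__ == "__main__":
    for family in ("IPv4", "IPv6", "DualStack", "DualStackIPv6Primary"):
        print(family, node_ip_flag(family))
```

Matching against a set of dual-stack families, rather than a single string, keeps the condition correct if further dual-stack variants are added later.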

Description of problem:

The ACM dropdown has a filter and a clusters title even though there are only ever 2 items in the dropdown: local cluster and all clusters. A customer has reported this as confusing because it suggests that many clusters can be added to the dropdown.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

1. install acm dynamic plugin to cluster
2. open cluster dropdown
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

RHCOS is being published to new AWS regions (https://github.com/openshift/installer/pull/6861), but aws-sdk-go needs to be bumped to recognize those regions

Version-Release number of selected component (if applicable):

master/4.14

How reproducible:

always

Steps to Reproduce:

1. openshift-install create install-config
2. Try to select ap-south-2 as a region
3.

Actual results:

New regions are not found. New regions are: ap-south-2, ap-southeast-4, eu-central-2, eu-south-2, me-central-1.

Expected results:

Installer supports and displays the new regions in the Survey

Additional info:

See https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/regions.go#L13-L23

 

Description of problem:

oc patch project command is failing to annotate the project

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Run the below patch command to update the annotation on existing project
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "This is a new project"}}}'
~~~


Actual results:

It produces the error output below:
~~~
The Project "<PROJECT_NAME>" is invalid: * metadata.namespace: Invalid value: "<PROJECT_NAME>": field is immutable * metadata.namespace: Forbidden: not allowed on this type 
~~~ 

Expected results:

The `oc patch project` command should patch the project with specified annotation.

Additional info:

Tried to patch the project with OCP 4.11.26 version, and it worked as expected.
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "New project"}}}'

project.project.openshift.io/<PROJECT_NAME> patched
~~~

The issue is with OCP 4.12, where it is not working. 

 

Description of problem:

When we rebased to 1.26, the rebase picked up https://github.com/kubernetes-sigs/cloud-provider-azure/pull/2653/ which made the Azure cloud node manager stop applying beta topology labels, such as failure-domain.beta.kubernetes.io/zone

Since we haven't completed the removal cycle for this, we still need the node manager to apply these labels. In the future we must ensure that these labels are available until users are no longer using them.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Create a TP cluster on 4.13
2. Observe no beta label for zone or region
3.

Actual results:

Beta labels are not present

Expected results:

Beta labels are present and should match GA labels

Additional info:

Created https://github.com/kubernetes-sigs/cloud-provider-azure/pull/3685 to try and make upstream allow this to be flagged

Description of problem:

When the configuration is installed with the config-image,
the kubeadmin-password is not accepted for logging in to the console.

Version-Release number of selected component (if applicable):

 

How reproducible:

Every time

Steps to Reproduce:

1. Build and install unconfigured ignition
2. Build and install config-image
3. When able to ssh into host0, attempt to log into console using the core user and generated kubeadmin-password.

Actual results:

The login fails.

Expected results:

The login should succeed.

Additional info:

 

Description of problem:

When creating an OCP cluster with Nutanix infrastructure and using DHCP instead of IPAM network config, the hostname of the VM is not set by DHCP. In this case we need to inject the desired hostname through cloud-init for both control-plane and worker nodes.

Version-Release number of selected component (if applicable):

 

How reproducible:

Reproducible when creating an OCP cluster with Nutanix infrastructure and using DHCP instead of IPAM network config.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The aforementioned test in the e2e origin test suite sometimes fails because it can't connect to the API endpoint.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Sometimes

Steps to Reproduce:

1. See https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn-upgrade/1673703516675248128
2.
3.

Actual results:

The test failed.

Expected results:

The test should retry a couple of times with a delay when it didn't get an HTTP response from the endpoint (e.g. connection issue).
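
A retry loop of the kind described in the expected results might look like the following. This is a generic Python sketch, not the origin test code: `get_with_retry`, its parameters, and the use of `ConnectionError` to represent "no HTTP response" are all assumptions made for illustration.

```python
import time


def get_with_retry(fetch, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch() until it returns, retrying on connection errors.

    fetch: zero-argument callable that performs the request.
    attempts: total number of tries before giving up.
    delay_seconds: pause between tries.
    sleep: injectable sleep function, so tests can skip the real delay.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError as err:
            # No HTTP response was received; wait and try again.
            last_error = err
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error
```

Only connection-level failures are retried here; a request that does receive an HTTP response, even an error status, is returned to the caller as-is.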

Additional info:

 

This is a clone of issue OCPBUGS-18137. The following is the description of the original issue:

Description of problem:

When a workload includes a node selector term on the label kubernetes.io/arch and the allowed values do not include amd64, the auto scaler does not trigger the scale out of a valid, non-amd64, machine set if its current replicas are 0 and (for 4.14+) no architecture capacity annotation is set (ref MIXEDARCH-129).

The issue is due to https://github.com/openshift/kubernetes-autoscaler/blob/f0ceeacfca57014d07f53211a034641d52d85cfd/cluster-autoscaler/cloudprovider/utils.go#L33

This bug should first be considered on clusters that have the same architecture for the control plane and the data plane.

In the case of multi-arch compute clusters, there is probably no alternative to having the capacity annotation properly set in the machine set, either manually or by the cloud provider actuator, as already discussed in the MIXEDARCH-129 work; otherwise the autoscaler would fall back to the control plane architecture.
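
The fallback order discussed above can be sketched as follows. This Python snippet is an assumption-laden illustration, not the autoscaler's code: the annotation key `example.com/arch` is a made-up placeholder for the architecture capacity annotation, and both function names are hypothetical.

```python
# Hypothetical placeholder for the architecture capacity annotation key.
ARCH_ANNOTATION = "example.com/arch"


def scale_from_zero_arch(machineset_annotations, control_plane_arch):
    """Guess the architecture of nodes a zero-replica machine set would create.

    Prefer the explicit annotation; otherwise fall back to the control plane
    architecture rather than a fixed amd64 default.
    """
    return machineset_annotations.get(ARCH_ANNOTATION) or control_plane_arch


def can_satisfy(required_arches, machineset_annotations, control_plane_arch):
    """Report whether scaling this machine set could satisfy the pod's arch selector."""
    arch = scale_from_zero_arch(machineset_annotations, control_plane_arch)
    return arch in required_arches


if __name__ == "__main__":
    # Single-arch arm64 cluster, no annotation set on the machine set.
    print(can_satisfy({"arm64"}, {}, "arm64"))
```

On a multi-arch compute cluster the fallback can still guess wrong, which is why the annotation remains the reliable source in that case.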

Version-Release number of selected component (if applicable):

- ARM64 IPI on GCP 4.14
- ARM64 IPI on AWS and Azure <=4.13
- In general, non-amd64 single-arch clusters supporting autoscale from 0

How reproducible:

Always

Steps to Reproduce:

1. Create an arm64 IPI cluster on GCP
2. Set one of the machinesets to have 0 replicas: 
    oc scale -n openshift-machine-api machineset/adistefa-a1-zn8pg-worker-f --replicas=0
3. Deploy the default autoscaler
4. Deploy the machine autoscaler for the given machineset
5. Deploy a workload with node affinity to arm64 only nodes, large resource requests and enough number of replicas. 

Actual results:

From the pod events: 

pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector

Expected results:

The cluster autoscaler scales the machineset with 0 replicas in order to provide resources for the pending pods.

Additional info:

---
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec: {}
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-1a
  namespace: openshift-machine-api
spec:
  minReplicas: 0
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: adistefa-a1-zn8pg-worker-f
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: openshift-machine-api
  name: 'my-deployment'
  annotations: {}
spec:
  selector:
    matchLabels:
      app: name
  replicas: 3
  template:
    metadata:
      labels:
        app: name
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values:
                    - "arm64"
      containers:
        - name: container
          image: >-
            image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
          ports:
            - containerPort: 8080
              protocol: TCP
          env: []
          resources:
              requests:
                cpu: "2"
      imagePullSecrets: []
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  paused: false

Description of problem:

Dev sandbox - CronJobs table/details UI doesn't have Suspend indication

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Create sample CronJob with either @daily or @hourly as schedule
2. Navigate to Administrator/Workloads/CronJobs area
3. Observe that the table with CronJobs contains your created entry, but no column with a Suspend True/False indication
4. Navigate into the details of that same cron job: still no indication of the Suspend state
5. Then invoke the 'oc get cj' command; example output could be:
NAME      SCHEDULE   SUSPEND   ACTIVE   LAST SCHEDULE   AGE
example   @hourly    True      0        24m             34m

where you can see a separate SUSPEND column

Actual results:

 

Expected results:

 

Additional info:

 

As a HyperShift developer, I would like a config file created to control the creation frequency of RHTAP PRs so that the HyperShift repo & CI are not inundated with RHTAP PRs.

Description of problem:

At the moment we are using an alpha version of controller-runtime in the machine-api-operator.
Now that controller-runtime v0.15.0 is out, we want to bump to it.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem

The oc adm node-logs feature has been upstreamed and is part of k8s 1.27. This resulted in the addition of the kubelet configuration field enableSystemLogQuery to enable the feature. This feature has been enabled in the base kubelet configs in MCO. However, in situations where TechPreview is enabled, MCO generates a kubelet configuration that overwrites the default, and when it does this, the unmarshal and marshal cycle drops the field it is not aware of. This is because MCO currently vendors in k8s.io/kubelet at v0.25.1; it can be fixed by vendoring in v0.27.1.

How reproducible: always
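
The way a decode and re-encode cycle through an older typed structure loses a newer field can be shown with a small model. The Python sketch below is only an analogy for the behavior described above, not MCO code: the two dataclasses stand in for the older and newer vendored config types, `round_trip` is a hypothetical helper, and `maxPods` is included just as a second field for contrast.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class OldKubeletConfig:
    """Stand-in for a config type vendored before enableSystemLogQuery existed."""
    maxPods: int = 110


@dataclass
class NewKubeletConfig(OldKubeletConfig):
    """Stand-in for a config type that knows about enableSystemLogQuery."""
    enableSystemLogQuery: bool = False


def round_trip(raw, config_type):
    """Decode JSON into config_type, ignoring unknown keys, then encode it again."""
    data = json.loads(raw)
    known = {f.name for f in fields(config_type)}
    typed = config_type(**{k: v for k, v in data.items() if k in known})
    return json.dumps(asdict(typed), sort_keys=True)


if __name__ == "__main__":
    raw = '{"enableSystemLogQuery": true, "maxPods": 250}'
    print(round_trip(raw, OldKubeletConfig))  # the newer field is silently dropped
    print(round_trip(raw, NewKubeletConfig))  # the newer field survives
```

Nothing fails during the round trip with the older type, which is what makes this kind of loss easy to miss until the feature stops working.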

Steps to Reproduce:

1. Bring up a 4.14 cluster with TechPreview enabled
2. Run oc adm node-logs
3.

Actual results:

Command returns "<a href="ec274df5b608cc7a149ece1ce673306c/">ec274df5b608cc7a149ece1ce673306c/</a>" which is the contents of /var/log/journal

Expected results:

Should return journal logs from the node

Additional info

I took a quick cut of updating the OpenShift and k8s APIs to 1.27. Running into the following during make verify:

cmd/machine-config-controller/start.go:18:2: could not import github.com/openshift/machine-config-operator/pkg/controller/template (-: # github.com/openshift/machine-config-operator/pkg/controller/template
pkg/controller/template/render.go:396:91: cannot use cfg.FeatureGate (variable of type *"github.com/openshift/api/config/v1".FeatureGate) as featuregates.FeatureGateAccess value in argument to cloudprovider.IsCloudProviderExternal: *"github.com/openshift/api/config/v1".FeatureGate does not implement featuregates.FeatureGateAccess (missing method AreInitialFeatureGatesObserved)
pkg/controller/template/render.go:441:90: cannot use cfg.FeatureGate (variable of type *"github.com/openshift/api/config/v1".FeatureGate) as featuregates.FeatureGateAccess value in argument to cloudprovider.IsCloudProviderExternal: *"github.com/openshift/api/config/v1".FeatureGate does not implement featuregates.FeatureGateAccess (missing method AreInitialFeatureGatesObserved)) (typecheck)
        "github.com/openshift/machine-config-operator/pkg/controller/template"
        ^

Here are some examples of how other operators have handled this. 

 This is a critical bug as oc adm node-logs runs as part of must-gather and debugging node issues with TechPreview jobs in CI is impossible without this working.

Description of problem:

When searching InstallPlans with a specific project selected, all InstallPlans are still listed; the selected project is not applied in the filter

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-05-112833

How reproducible:

Always

Steps to Reproduce:

1. Install some operators to specific namespace and all namespaces
$ oc get ip -A
NAMESPACE             NAME            CSV                                 APPROVAL    APPROVED
default               install-tftg4   etcdoperator.v0.9.4                 Automatic   true
openshift-operators   install-5g2l4   3scale-community-operator.v0.10.1   Automatic   true
$ oc get sub -A
NAMESPACE             NAME                        PACKAGE                     SOURCE                CHANNEL
default               etcd                        etcd                        community-operators   singlenamespace-alpha
openshift-operators   3scale-community-operator   3scale-community-operator   community-operators   threescale-2.13  
2. Navigate to the Home -> Search page, select project 'default' in the project dropdown, and choose the 'InstallPlan' resource
3. Check the filtered list

Actual results:

3. InstallPlans in all namespaces are listed

Expected results:

3. only the InstallPlan in 'default' project should be listed

Additional info:

 

This is a clone of issue OCPBUGS-18720. The following is the description of the original issue:

Description of problem:

Catalog pods in the hypershift control plane are in ImagePullBackOff

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Create a cluster in 4.14 HO + OCP 4.14.0-0.ci-2023-09-07-120503
2. Check the control plane pods; the catalog pods in the control plane namespace are in ImagePullBackOff
3.

Actual results:

 

jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jie-test | grep catalog
catalog-operator-64fd787d9c-98wx5                     2/2     Running            0          2m43s
certified-operators-catalog-7766fc5b8-4s66z           0/1     ImagePullBackOff   0          2m43s
community-operators-catalog-847cdbff6-wsf74           0/1     ImagePullBackOff   0          2m43s
redhat-marketplace-catalog-fccc6bbb5-2d5x4            0/1     ImagePullBackOff   0          2m43s
redhat-operators-catalog-86b6f66d5d-mpdsc             0/1     ImagePullBackOff   0          2m43s

Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       65m                 default-scheduler  Successfully assigned clusters-jie-test/certified-operators-catalog-7766fc5b8-4s66z to ip-10-0-64-135.us-east-2.compute.internal
  Normal   AddedInterface  65m                 multus             Add eth0 [10.128.2.141/23] from openshift-sdn
  Normal   Pulling         63m (x4 over 65m)   kubelet            Pulling image "from:imagestream"
  Warning  Failed          63m (x4 over 65m)   kubelet            Failed to pull image "from:imagestream": rpc error: code = Unknown desc = reading manifest imagestream in docker.io/library/from: requested access to the resource is denied
  Warning  Failed          63m (x4 over 65m)   kubelet            Error: ErrImagePull
  Warning  Failed          63m (x6 over 65m)   kubelet            Error: ImagePullBackOff
  Normal   BackOff         9s (x280 over 65m)  kubelet            Back-off pulling image "from:imagestream"
jiezhao-mac:hypershift jiezhao$

Expected results:

catalog pods are running

Additional info:

slack:
https://redhat-internal.slack.com/archives/C01C8502FMM/p1694170060144859

Description of problem:

Running the following tests using Openshift on Openstack with Kuryr
"[sig-cli] oc idle [apigroup:apps.openshift.io][apigroup:route.openshift.io][apigroup:project.openshift.io][apigroup:image.openshift.io] by all [Suite:openshift/conformance/parallel]"
"[sig-cli] oc idle [apigroup:apps.openshift.io][apigroup:route.openshift.io][apigroup:project.openshift.io][apigroup:image.openshift.io] by checking previous scale [Suite:openshift/conformance/parallel]"
"[sig-cli] oc idle [apigroup:apps.openshift.io][apigroup:route.openshift.io][apigroup:project.openshift.io][apigroup:image.openshift.io] by label [Suite:openshift/conformance/parallel]"
"[sig-cli] oc idle [apigroup:apps.openshift.io][apigroup:route.openshift.io][apigroup:project.openshift.io][apigroup:image.openshift.io] by name [Suite:openshift/conformance/parallel]"

They fail while waiting for endpoints:
STEP: wait until endpoint addresses are scaled to 2 01/21/23 01:16:42.024
Jan 21 01:16:42.025: INFO: Running 'oc --namespace=e2e-test-oc-idle-h2mvt --kubeconfig=/tmp/configfile3007731725 get endpoints idling-echo --template={{ len (index .subsets 0).addresses }} --output=go-template'
Jan 21 01:16:42.158: INFO: Error running /usr/local/bin/oc --namespace=e2e-test-oc-idle-h2mvt --kubeconfig=/tmp/configfile3007731725 get endpoints idling-echo --template={{ len (index .subsets 0).addresses }} --output=go-template:
StdOut>
Error executing template: template: output:1:8: executing "output" at <index .subsets 0>: error calling index: index of untyped nil. Printing more information for debugging the template:
    template was:
        {{ len (index .subsets 0).addresses }}
    raw data was:
        {"apiVersion":"v1","kind":"Endpoints","metadata":{"annotations":{"endpoints.kubernetes.io/last-change-trigger-time":"2023-01-21T01:16:40Z"},"creationTimestamp":"2023-01-21T01:16:40Z","labels":{"app":"idling-echo"},"managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:endpoints.kubernetes.io/last-change-trigger-time":{}},"f:labels":{".":{},"f:app":{}}}},"manager":"kube-controller-manager","operation":"Update","time":"2023-01-21T01:16:40Z"}],"name":"idling-echo","namespace":"e2e-test-oc-idle-h2mvt","resourceVersion":"409973","uid":"91cd122e-b418-4e29-98c6-2ff757c74a15"}}
    object given to template engine was:
        map[apiVersion:v1 kind:Endpoints metadata:map[annotations:map[endpoints.kubernetes.io/last-change-trigger-time:2023-01-21T01:16:40Z] creationTimestamp:2023-01-21T01:16:40Z labels:map[app:idling-echo] managedFields:[map[apiVersion:v1 fieldsType:FieldsV1 fieldsV1:map[f:metadata:map[f:annotations:map[.:map[] f:endpoints.kubernetes.io/last-change-trigger-time:map[]] f:labels:map[.:map[] f:app:map[]]]] manager:kube-controller-manager operation:Update time:2023-01-21T01:16:40Z]] name:idling-echo namespace:e2e-test-oc-idle-h2mvt resourceVersion:409973 uid:91cd122e-b418-4e29-98c6-2ff757c74a15]]

The tests pass when the PollImmediate timeout is raised from 30 to 60 seconds.
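
The template error above occurs because `.subsets` is still absent while the endpoint has no addresses, so `index .subsets 0` is evaluated against nil. A minimal Python sketch of the polling pattern follows. It assumes a hypothetical `poll_immediate` helper (an illustration, not the Go `wait.PollImmediate` API) and a hypothetical `ready_addresses` accessor that treats a missing `subsets` field as zero addresses rather than failing:

```python
import time


def poll_immediate(interval, timeout, condition, clock=time.monotonic, sleep=time.sleep):
    """Check `condition` right away, then every `interval` seconds until `timeout`."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def ready_addresses(endpoints):
    """Count addresses in the first subset; a missing `subsets` key counts as 0."""
    subsets = endpoints.get("subsets") or []
    if not subsets:
        return 0
    return len(subsets[0].get("addresses") or [])
```

With a condition such as `lambda: ready_addresses(fetch()) == 2` (where `fetch` is whatever retrieves the Endpoints object), a slow environment only needs a larger `timeout`. An endpoint without subsets yet is simply "not ready" instead of an error.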

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-01-19-110743

How reproducible:

Consistently 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

DoD:

Currently we return early if we fail to apply a resource during installation https://github.com/openshift/hypershift/blob/main/cmd/install/install.go#L248

There's no reason why we wouldn't keep going, aggregate errors and return at the end.

It might help in scenarios where one broken CR prevents everything else from being installed, e.g.

https://redhat-internal.slack.com/archives/C02LM9FABFW/p1680599409023509?thread_ts=1680589848.540709&cid=C02LM9FABFW
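
The "keep going, aggregate errors and return at the end" behaviour can be sketched as follows. This is a Python illustration only, not the code in `install.go`. `apply_all` and the `apply` callback are hypothetical names:

```python
def apply_all(resources, apply):
    """Apply every resource, collecting failures instead of stopping at the first."""
    errors = []
    for resource in resources:
        try:
            apply(resource)
        except Exception as err:  # keep going; everything is reported at the end
            errors.append((resource, err))
    return errors
```

The caller can then fail the installation once, with the full list of resources that could not be applied, while every healthy resource has still been installed.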

 

Description of problem:

We need to update the operator so that it is in sync with the Kubernetes API version used by OCP 4.13. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of the problem:

Events search should not be case sensitive 

 

How reproducible:

100%

 

Steps to reproduce:

1. In the UI, view Cluster Events.

2. Enter text in the "Filter by text" field (e.g. "success" or "Success").

 

Actual results:

Events filter is case sensitive. 

See screenshots enclosed

 

Expected results:

Events filter should not be case sensitive
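
A minimal sketch of the expected matching, in Python for illustration only (the UI itself is not written in Python). It assumes each event is a mapping with a `message` field, which is a hypothetical shape:

```python
def filter_events(events, text):
    """Return the events whose message contains `text`, ignoring case."""
    needle = text.casefold()
    return [event for event in events if needle in event["message"].casefold()]
```

With this comparison, "success" and "Success" return the same set of events.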

Description of problem:

The CRL list is capped at 1MB because of the configmap maximum size. If multiple public CRLs are needed for an ingress controller, the CRL PEM file will exceed 1MB.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Create CRL configmap with the following distribution points: 

         Issuer: C=US, O=DigiCert Inc, CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
         Subject: SOME SIGNED CERT
         X509v3 CRL Distribution Points:
                Full Name:
                  URI:http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.cr
       
      
# curl -o DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl
# openssl crl -in  DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl -inform DER -out  DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem 
# du -bsh DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem 
604K    DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem


I still need to find more intermediate CRLs to grow this.

Actual results:

2023-01-25T13:45:01.443Z ERROR operator.init controller/controller.go:273 Reconciler error {"controller": "crl", "object": {"name":"custom","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "custom", "reconcileID": "d49d9b96-d509-4562-b3d9-d4fc315226c0", "error": "failed to ensure client CA CRL configmap for ingresscontroller openshift-ingress-operator/custom: failed to update configmap: ConfigMap \"router-client-ca-crl-custom\" is invalid: []: Too long: must have at most 1048576 bytes"}

Expected results:

First, it should be possible to create a configmap whose data alone stays within the 1MB maximum (see additional info below for more details). Second, there should be some way to compress, or otherwise allow, a CRL list that is larger than 1MB.

Additional info:

Using only this CRL, which is only about 600K, still causes the issue. This could be due to the `last-applied-configuration` annotation on the configmap, which is added because we do an apply operation (update) on the configmap. I am not sure whether this counts towards the 1MB max.

https://github.com/openshift/cluster-ingress-operator/blob/release-4.10/pkg/operator/controller/crl/crl_configmap.go#L295 

Not sure if we could just replace the configmap.   
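
A rough size check makes the arithmetic concrete. The sketch below assumes the 1048576-byte limit from the reconciler error is compared against the summed byte length of the configmap's data values. Whether annotations such as `last-applied-configuration` also count is left open, as noted above. The helper names are hypothetical:

```python
MAX_CONFIGMAP_BYTES = 1048576  # limit quoted in the reconciler error


def total_data_bytes(data):
    """Sum the byte length of every value in a configmap-style `data` mapping."""
    return sum(len(value.encode("utf-8")) for value in data.values())


def fits_in_configmap(data, limit=MAX_CONFIGMAP_BYTES):
    """Return True if the summed data values stay within `limit` bytes."""
    return total_data_bytes(data) <= limit
```

Under that assumption, one PEM of roughly 604 KiB (618,496 bytes) fits, but anything that doubles the payload (1,236,992 bytes) does not, whether that is a second CRL of similar size or a second copy of the same content.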

 

Description of problem:

node-driver-registrar and hostpath containers in pod shared-resource-csi-driver-node-xxxxx under openshift-cluster-csi-drivers namespace are not pinned to reserved management cores.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Deploy SNO via ZTP with workload partitioning enabled
2. Check mgmt pods affinity
3.

Actual results:

Pods do not have the workload partitioning annotation and are not pinned to mgmt cores

Expected results:

All management pods should be pinned to reserved cores

Pod should be annotated with: target.workload.openshift.io/management: '{"effect":"PreferredDuringScheduling"}'
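
A check for the expected annotation could look like the sketch below. It is a Python illustration, not the cnf-gotests helper. It assumes the pod is available as a parsed mapping (for example from `oc get pod -o json`), and `has_management_annotation` is a hypothetical name:

```python
import json

MANAGEMENT_ANNOTATION = "target.workload.openshift.io/management"


def has_management_annotation(pod):
    """Return True if the pod carries the workload partitioning annotation."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    raw = annotations.get(MANAGEMENT_ANNOTATION)
    if raw is None:
        return False
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(value, dict) and value.get("effect") == "PreferredDuringScheduling"
```

Applied to the pod metadata under "Additional info", which lists only network and SCC annotations, this check returns False.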

Additional info:

pod metadata

metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["fd01:0:0:1::5f/64"],"mac_address":"0a:58:97:51:ad:31","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:1::5f/64","gateway_ip":"fd01:0:0:1::1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "fd01:0:0:1::5f"
          ],
          "mac": "0a:58:97:51:ad:31",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "fd01:0:0:1::5f"
          ],
          "mac": "0a:58:97:51:ad:31",
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: privileged
/var/lib/jenkins/workspace/ocp-far-edge-vran-deployment/cnf-gotests/test/ran/workloadpartitioning/tests/workload_partitioning.go:113


SNO management workload partitioning [It] should have management pods pinned to reserved cpus
/var/lib/jenkins/workspace/ocp-far-edge-vran-deployment/cnf-gotests/test/ran/workloadpartitioning/tests/workload_partitioning.go:113

  [FAILED] Expected
      <[]ranwphelper.ContainerInfo | len:3, cap:4>: [
          {
              Name: "hostpath",
              Cpus: "2-55,58-111",
              Namespace: "openshift-cluster-csi-drivers",
              PodName: "shared-resource-csi-driver-node-vzvtc",
              Shares: 10,
              Pid: 41650,
          },
          {
              Name: "cluster-proxy-service-proxy",
              Cpus: "2-55,58-111",
              Namespace: "open-cluster-management-agent-addon",
              PodName: "cluster-proxy-service-proxy-66599b78bf-k2dvr",
              Shares: 2,
              Pid: 35093,
          },
          {
              Name: "node-driver-registrar",
              Cpus: "2-55,58-111",
              Namespace: "openshift-cluster-csi-drivers",
              PodName: "shared-resource-csi-driver-node-vzvtc",
              Shares: 10,
              Pid: 34782,
          },
      ]
  to be empty
  In [It] at: /var/lib/jenkins/workspace/ocp-far-edge-vran-deployment/cnf-gotests/test/ran/workloadpartitioning/ranwphelper/ranwphelper.go:172 @ 02/22/23 01:05:00.268

cluster-proxy-service-proxy is reported in https://issues.redhat.com/browse/OCPBUGS-7652

The X-CSRF token is currently added automatically to any request made with the `coFetch` functions. In some cases, plugins would like to use their own functions or libraries, such as axios. Console should provide a way to retrieve the X-CSRF token.

Acceptance Criteria:

  • Dynamic plugin can retrieve X-CSRF token via their own functions (axios)

Description of problem:

The current version of openshift/cluster-dns-operator vendors Kubernetes 1.26 packages. OpenShift 4.14 is based on Kubernetes 1.27.   

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/cluster-dns-operator/blob/release-4.14/go.mod 

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.26

Expected results:

Kubernetes packages are at version v0.27.0 or later.

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.
controller-runtime will need to be bumped to v0.15.0 as well
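
The manual go.mod check in the steps above can be automated roughly as follows. This is a sketch with hypothetical helper names. It only understands plain `module vX.Y.Z` requirement lines and ignores `replace` directives and pseudo-version suffixes:

```python
import re

K8S_MODULES = ("k8s.io/api", "k8s.io/apimachinery", "k8s.io/client-go")


def module_versions(go_mod_text):
    """Map module path -> (major, minor, patch) for the tracked k8s.io modules."""
    versions = {}
    for line in go_mod_text.splitlines():
        match = re.match(r"^\s*(?:require\s+)?(\S+)\s+v(\d+)\.(\d+)\.(\d+)", line)
        if match and match.group(1) in K8S_MODULES:
            versions[match.group(1)] = tuple(int(g) for g in match.group(2, 3, 4))
    return versions


def outdated(go_mod_text, minimum=(0, 27, 0)):
    """Return the tracked modules whose version is below `minimum`, sorted by name."""
    return sorted(m for m, v in module_versions(go_mod_text).items() if v < minimum)
```

An empty result from `outdated` corresponds to the expected result above: all three packages at v0.27.0 or later.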

Description of problem:

Accidentally merged before being fully reviewed.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

The CPO does not currently respect the CVO runlevels as standalone OCP does.

The CPO reconciles everything at once during upgrades. As a result, FeatureSet-aware components try to start because the FeatureSet status is already set for that version, which leads to pod restarts.

It should roll things out in the following order for both initial install and upgrade, waiting between stages until rollout is complete:

  • etcd
  • kas
  • kcm and ks
  • everything else
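
The staged ordering above can be sketched as a simple loop. This is a Python illustration of the ordering only, not CPO code. The `apply` and `wait_ready` callbacks are hypothetical:

```python
ROLLOUT_STAGES = [
    ["etcd"],
    ["kas"],
    ["kcm", "ks"],
    ["everything else"],
]


def roll_out(stages, apply, wait_ready):
    """Apply each stage in order; start the next only after the current is ready."""
    for stage in stages:
        for component in stage:
            apply(component)
        for component in stage:
            wait_ready(component)
```

Components within one stage (kcm and ks) are applied together, but nothing from a later stage is applied until every component of the current stage has finished rolling out.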

In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
This is fixed by the first commit in the upstream Metal³ PR https://github.com/metal3-io/baremetal-operator/pull/1264

Description of problem:

The usage of "compute.platform.gcp.serviceAccount" needs to be clarified, as does the installation failure it causes.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-16-230237

How reproducible:

Always

Steps to Reproduce:

1. "openshift-install explain installconfig.compute.platform.gcp.serviceAccount"
2. "create cluster" with an existing install-config having the field configured 

Actual results:

1. The output says "The provided service account will be attached to control-plane nodes...", although the field is under compute.platform.gcp.
2. The installation fails while creating the install config, with the error "service accounts only valid for master nodes, provided for worker nodes".

Expected results:

1. Should the "explain" command list the field "serviceAccount" under "installconfig.compute.platform.gcp"?
2. Please clarify how "compute.platform.gcp.serviceAccount" should be used.

Additional info:

FYI the corresponding PR: https://github.com/openshift/installer/pull/7308

$ openshift-install version
openshift-install 4.14.0-0.nightly-2023-07-16-230237
built from commit c2d7db9d4eedf7b79fcf975f3cbd8042542982ca
release image registry.ci.openshift.org/ocp/release@sha256:e31716b6f12a81066c78362c2f36b9f18ad51c9768bdc894d596cf5b0f689681
release architecture amd64
$ openshift-install explain installconfig.compute.platform.gcp.serviceAccount
KIND:     InstallConfig
VERSION:  v1
RESOURCE: <string>
  ServiceAccount is the email of a gcp service account to be used for shared vpn installations. The provided service account will be attached to control-plane nodes in order to provide the permissions required by the cloud provider in the host project.

$ openshift-install explain installconfig.controlPlane.platform.gcp.serviceAccount
KIND:     InstallConfig
VERSION:  v1
RESOURCE: <string>
  ServiceAccount is the email of a gcp service account to be used for shared vpn installations. The provided service account will be attached to control-plane nodes in order to provide the permissions required by the cloud provider in the host project.

$ yq-3.3.0 r test2/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
  computeSubnet: installer-shared-vpc-subnet-2
  controlPlaneSubnet: installer-shared-vpc-subnet-1
  network: installer-shared-vpc
  networkProjectID: openshift-qe-shared-vpc
$ yq-3.3.0 r test2/install-config.yaml credentialsMode
Passthrough
$ yq-3.3.0 r test2/install-config.yaml baseDomain
qe1.gcp.devcluster.openshift.com
$ yq-3.3.0 r test2/install-config.yaml metadata
creationTimestamp: null
name: jiwei-0718b
$ yq-3.3.0 r test2/install-config.yaml compute
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    gcp:
      ServiceAccount: ipi-xpn-minpt-permissions@openshift-qe.iam.gserviceaccount.com
      tags:
      - preserved-ipi-xpn-compute
  replicas: 2
$ yq-3.3.0 r test2/install-config.yaml controlPlane
architecture: amd64
hyperthreading: Enabled
name: master
platform:
  gcp:
    ServiceAccount: ipi-xpn-minpt-permissions@openshift-qe.iam.gserviceaccount.com
    tags:
    - preserved-ipi-xpn-control-plane
replicas: 3
$ openshift-install create cluster --dir test2
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: compute[0].platform.gcp.serviceAccount: Invalid value: "ipi-xpn-minpt-permissions@openshift-qe.iam.gserviceaccount.com": service accounts only valid for master nodes, provided for worker nodes 
$ 

Description of problem:

When listing installed operators, we attempt to list subscriptions in all namespaces in order to associate subscriptions with CSVs. This prevents users without cluster-scoped list privileges from seeing subscriptions on this page, which makes the uninstall action unavailable.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Install a namespaced operator
2. Log in as a user with project admin permissions where the operator was installed
3. Visit the installed operators page
4. Click the kebab menu for the operator from step 1

Actual results:

The only action available is to delete the CSV

Expected results:

The "Uninstall Operator" and "Edit Subscriptions" actions should show since the user has permission to view, edit, delete Subscription resources in this namespace.

Additional info:

 

Description of problem:



Stop changing the image name for a MachineSet when ClusterOSImage is set.

Terraform has already created an image bucket for us based on OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE, so worker nodes should use that image bucket instead of using OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE directly.

Version-Release number of selected component (if applicable):

current master branch

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

When creating a pod controller (e.g. a deployment) with a pod spec that will be mutated by SCCs, users might still get a warning that the pod does not meet the namespace's pod security level.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

100%

Steps to Reproduce:

1. create a namespace with restricted PSa warning level (the default)
2. create a deployment with a pod with an empty security context

Actual results:

You get a warning about the deployment's pod not meeting the NS's pod security admission requirements.

Expected results:

No warning if the pod for the deployment would be properly mutated by SCCs in order to fulfill the NS's pod security requirements.

Additional info:

Originally implemented as a part of https://issues.redhat.com/browse/AUTH-337

 

The agent integration tests are failing with different errors when run multiple times locally:

Local Run 1:

 

level=fatal msg=failed to fetch Agent Installer PXE Files: failed to fetch dependency of "Agent Installer PXE Files": failed to generate asset "Agent Installer Artifacts": lstat /home/rwsu/.cache/agent/files_cache/libnmstate.so.2: no such file or directory
[exit status 1]
FAIL: testdata/agent/pxe/configurations/sno.txt:3: unexpected command failure

 

Local Run 2:

 

level=fatal msg=failed to fetch Agent Installer PXE Files: failed to fetch dependency of "Agent Installer PXE Files": failed to generate asset "Agent Installer Artifacts": file /usr/bin/agent-tui was not found
[exit status 1]
FAIL: testdata/agent/pxe/configurations/sno.txt:3: unexpected command failure

 

In CI (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7299/pull-ci-openshift-installer-master-agent-integration-tests/1677347591739674624), it has failed in this PR multiple times with this error:

level=fatal msg=failed to fetch Agent Installer PXE Files: failed to fetch dependency of "Agent Installer PXE Files": failed to generate asset "Agent Installer Artifacts": lstat /.cache/agent/files_cache/agent-tui: no such file or directory
[exit status 1]
FAIL: testdata/agent/pxe/configurations/sno.txt:3: unexpected command failure

I believe the issue is that the integration tests run in parallel, and the extractFileFromImage function in pkg/asset/agent/image/oc.go is problematic because it clears the cache and then extracts files to the same path. When the tests run in parallel, another test can clear the cached files, so when the current test tries to read a file from the cache directory, the file has disappeared.

Adding 

-parallel 1

to ./hack/go-integration-test.sh eliminates the errors, which is why I think it is a concurrency issue.
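
One way to make a shared cache safe for parallel callers is to never clear it, and to publish each file with an atomic rename. The sketch below shows that pattern in Python. It is not the change made in the installer (the report only confirms that `-parallel 1` avoids the errors). `cache_file` and its `extract` callback are hypothetical:

```python
import os
import tempfile


def cache_file(cache_dir, name, extract):
    """Populate cache_dir/name without ever clearing the shared directory.

    `extract` is a caller-supplied function that returns the file's bytes.
    """
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, name)
    if os.path.exists(target):
        return target
    # Write to a private temp file, then publish it with an atomic rename so a
    # concurrent reader sees either no file or the complete file.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(extract())
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target
```

Two tests racing on the same file may both extract it, but neither can delete a file the other is about to read.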
 

If the cluster enters the installing-pending-user-action state in assisted-service, it will not recover absent user action.
One way to reproduce this is to have the wrong boot order set in the host, so that it reboots into the agent ISO again instead of the installed CoreOS on disk. (I managed this in dev-scripts by setting a root device hint that pointed to a secondary disk, and only creating that disk once the VM was up. This does not add the new disk to the boot order list, and even if you set it manually it does not take effect until after a full shutdown of the VM - the soft reboot doesn't count.)

Currently we report:

cluster has stopped installing... working to recover installation

in a loop. This is not accurate (unlike in e.g. the install-failed state) - it cannot be recovered automatically.

Also we should only report this, or any other, status once when the status changes, and not continuously in a loop.
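
Reporting a status only when it changes can be sketched like this. It is a Python illustration, and `StatusReporter` is a hypothetical name rather than anything in assisted-service or the agent installer:

```python
class StatusReporter:
    """Emit a status message only when it differs from the last one emitted."""

    def __init__(self, emit=print):
        self._emit = emit
        self._last = None

    def report(self, status):
        """Emit `status` if it changed; return whether anything was emitted."""
        if status == self._last:
            return False
        self._last = status
        self._emit(status)
        return True
```

Calling `report` on every pass of the monitoring loop then produces one line per state transition instead of one line per iteration.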

Description of problem:

Install failed with External platform type

Version-Release number of selected component (if applicable):

4.14.0-0.ci-2023-03-07-170635
(No 4.14 nightly build is available, so the CI build was used.)

How reproducible:

Always

Steps to Reproduce:

1. Set up a UPI vSphere cluster with platform set to External

2. The install fails

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion               
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          141m    Unable to apply 4.14.0-0.ci-2023-03-07-170635: the cluster operator cloud-controller-manager is not available
liuhuali@Lius-MacBook-Pro huali-test % oc get co                           
NAME                                       VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-0.ci-2023-03-07-170635   True        False         False      118m    
baremetal                                  4.14.0-0.ci-2023-03-07-170635   True        False         False      137m    
cloud-controller-manager                   4.14.0-0.ci-2023-03-07-170635                                                
cloud-credential                           4.14.0-0.ci-2023-03-07-170635   True        False         False      140m    
cluster-autoscaler                         4.14.0-0.ci-2023-03-07-170635   True        False         False      137m    
config-operator                            4.14.0-0.ci-2023-03-07-170635   True        False         False      139m    
console                                    4.14.0-0.ci-2023-03-07-170635   True        False         False      124m    
control-plane-machine-set                  4.14.0-0.ci-2023-03-07-170635   True        False         False      137m    
csi-snapshot-controller                    4.14.0-0.ci-2023-03-07-170635   True        False         False      138m    
dns                                        4.14.0-0.ci-2023-03-07-170635   True        False         False      137m    
etcd                                       4.14.0-0.ci-2023-03-07-170635   True        False         False      137m    
image-registry                             4.14.0-0.ci-2023-03-07-170635   True        False         False      127m    
ingress                                    4.14.0-0.ci-2023-03-07-170635   True        False         False      126m    
insights                                   4.14.0-0.ci-2023-03-07-170635   True        False         False      132m    
kube-apiserver                             4.14.0-0.ci-2023-03-07-170635   True        False         False      134m    
kube-controller-manager                    4.14.0-0.ci-2023-03-07-170635   True        False         False      136m    
kube-scheduler                             4.14.0-0.ci-2023-03-07-170635   True        False         False      135m    
kube-storage-version-migrator              4.14.0-0.ci-2023-03-07-170635   True        False         False      138m    
machine-api                                4.14.0-0.ci-2023-03-07-170635   True        False         False      137m    
machine-approver                           4.14.0-0.ci-2023-03-07-170635   True        False         False      138m    
machine-config                             4.14.0-0.ci-2023-03-07-170635   True        False         False      136m    
marketplace                                4.14.0-0.ci-2023-03-07-170635   True        False         False      137m    
monitoring                                 4.14.0-0.ci-2023-03-07-170635   True        False         False      124m    
network                                    4.14.0-0.ci-2023-03-07-170635   True        False         False      139m    
node-tuning                                4.14.0-0.ci-2023-03-07-170635   True        False         False      137m    
openshift-apiserver                        4.14.0-0.ci-2023-03-07-170635   True        False         False      132m    
openshift-controller-manager               4.14.0-0.ci-2023-03-07-170635   True        False         False      138m    
openshift-samples                          4.14.0-0.ci-2023-03-07-170635   True        False         False      131m    
operator-lifecycle-manager                 4.14.0-0.ci-2023-03-07-170635   True        False         False      138m    
operator-lifecycle-manager-catalog         4.14.0-0.ci-2023-03-07-170635   True        False         False      138m    
operator-lifecycle-manager-packageserver   4.14.0-0.ci-2023-03-07-170635   True        False         False      132m    
service-ca                                 4.14.0-0.ci-2023-03-07-170635   True        False         False      138m    
storage                                    4.14.0-0.ci-2023-03-07-170635   True        False         False      138m    
liuhuali@Lius-MacBook-Pro huali-test % oc get infrastructure cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-03-08T07:46:07Z"
  generation: 1
  name: cluster
  resourceVersion: "527"
  uid: 096a54bc-8a35-4071-b750-cfac439c1916
spec:
  cloudConfig:
    name: ""
  platformSpec:
    external:
      platformName: vSphere
    type: External
status:
  apiServerInternalURI: https://api-int.huliu-vs8x.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.huliu-vs8x.qe.devcluster.openshift.com:6443
  controlPlaneTopology: HighlyAvailable
  etcdDiscoveryDomain: ""
  infrastructureName: huliu-vs8x-fk79b
  infrastructureTopology: HighlyAvailable
  platform: External
  platformStatus:
    external: {}
    type: External
liuhuali@Lius-MacBook-Pro huali-test % 

Actual results:

The install failed: the cluster operator cloud-controller-manager is not available

Expected results:

The install succeeds

Additional info:

This is for testing https://issues.redhat.com/browse/OCPCLOUD-1772

Currently, assisted installer does not verify that etcd is OK before rebooting the bootstrap node, because wait_for_ceo in bootkube does nothing.

In 4.13 (backported to 4.12), the etcd team added a status that assisted installer can check to decide whether it is safe to reboot the bootstrap node. We should check it before running the shutdown command.

Eran Cohen Rom Freiman 

We want to parametrize the envoy configmap name. With that, we can configure a private envoy configuration, which brings the following advantages:

  • private infra details
  • changing envoy config can be done with app-interface MR only

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/36

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

If a JSON schema used by a chart contains an unknown value format (non-standard in JSON Schema but valid in the OpenAPI spec, for example), the Helm form view hangs on validation and stays in the "submitting" state.

 

As per JSON Schema standard the "format" keyword should only take an advisory role (like an annotation) and should not affect validation.

https://json-schema.org/understanding-json-schema/reference/string.html#format 

Version-Release number of selected component (if applicable):

Verified against 4.13, but probably applies to others.

How reproducible:

100%

Steps to Reproduce:

1. Go to Helm tab.
2. Click create in top right and select Repository
3. Paste following into YAML view and click Create:

apiVersion: helm.openshift.io/v1beta1
kind: ProjectHelmChartRepository
metadata:
  name: reproducer
spec:
  connectionConfig:
    url: 'https://raw.githubusercontent.com/tumido/helm-backstage/repo-multi-schema2'

4. Go to the Helm tab again (if redirected elsewhere)
5. Click create in top right and select Helm Release
6. In catalog filter select Chart repositories: Reproducer
7. Click on the single tile available (Backstage) and click Create
8. Switch to Form view
9. Leave default values and click Create
10. Stare at the always loading screen that never proceeds further.

Actual results:

The form never finishes submitting and never displays any error in the UI.

Expected results:

Unknown format should not result in rejected validation. JSON Schema standard says that formats should not be used for validation.
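
The advisory role of "format" can be illustrated with a deliberately tiny validator. This is a Python sketch, not the console's validator and not a full JSON Schema implementation. `validate_string`, `KNOWN_FORMATS` and the `assert_formats` switch are hypothetical:

```python
import re

# Checkers for the formats this sketch chooses to understand (the ipv4 check is
# intentionally loose). Any other format is treated as an annotation and never
# fails validation.
KNOWN_FORMATS = {
    "ipv4": lambda s: re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", s) is not None,
}


def validate_string(value, schema, assert_formats=False):
    """Return a list of error strings for a string value against a tiny schema."""
    if not isinstance(value, str):
        return ["expected a string"]
    errors = []
    if "minLength" in schema and len(value) < schema["minLength"]:
        errors.append("shorter than minLength")
    checker = KNOWN_FORMATS.get(schema.get("format"))
    if assert_formats and checker is not None and not checker(value):
        errors.append("does not match format " + schema["format"])
    return errors
```

An unknown format therefore yields no error in either mode, and validation always returns a result, so the form can either submit or show a message.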

Additional info:

This is not a schema violation by itself since Helm itself is happy about it and doesn't complain. The same chart can be successfully deployed via the YAML view.

See this component readiness page.

test=[sig-cluster-lifecycle] cluster upgrade should complete in 105.00 minutes

This appears to indicate that upgrades now take longer than 105 minutes about 7% of the time; previously they never did.

Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1694547497553699

wking points out it may be a one time ovn IC thing. Find out what's up and route to appropriate team.

Description of problem:

Multiple instances of tabs are shown under the ODF dashboard, and sometimes a 404 error is also shown when such a tab is selected and the page is reloaded.

https://bugzilla.redhat.com/show_bug.cgi?id=2124829

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

We faced an issue where the quota for VPCE was reached. This is visible in the status of AWSEndpointService:

  - lastTransitionTime: "2023-03-01T10:23:08Z"
    message: 'failed to create vpc endpoint: VpcEndpointLimitExceeded'
    reason: AWSError
    status: "False"
    type: EndpointAvailable

However, it should also be propagated to the HC, both because it blocks worker creation (ignition was not working) and for better visibility.

 

Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390

When a Deployment, DeploymentConfig, or Knative Service is created with a Pipeline enabled and then deleted with the option "Delete other resources created by console" enabled (only available on 4.13+ with the PR above), the automatically created Pipeline is not deleted.

When the user tries to create the same resource with a Pipeline again this fails with an error:

An error occurred
secrets "nodeinfo-generic-webhook-secret" already exists

Version-Release number of selected component (if applicable):
4.13

(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5547)

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Pipelines operator (tested with 1.8.2)
  2. Create a new project
  3. Navigate to Add > Import from git and create an application
  4. Case 1: In the topology select the new resource and delete it
  5. Case 2: In the topology select the application group and delete the complete app

Actual results:
Case 1: Delete resources:

  1. Deployment (tries it twice!) $name
  2. Service $name
  3. Route $name
  4. ImageStream $name

Case 2: Delete application:

  1. Deployment (just once) $name
  2. Service $name
  3. Route $name
  4. ImageStream $name

Expected results:
Case 1: Delete resource:

  1. Delete Deployment $name should be called just once
  2. (Keep this deletion) Service $name
  3. (Keep this deletion) Route $name
  4. (Keep this deletion) ImageStream $name
  5. Missing deletion of the Tekton Pipeline $name
  6. Missing deletion of the Tekton TriggerTemplate with generated name trigger-template-$name-$random
  7. Missing deletion of the Secret $name-generic-webhook-secret
  8. Missing deletion of the Secret $name-github-webhook-secret

Case 2: Delete application:

  1. (Keep this deletion) Deployment $name
  2. (Keep this deletion) Service $name
  3. (Keep this deletion) Route $name
  4. (Keep this deletion) ImageStream $name
  5. Missing deletion of the Tekton Pipeline $name
  6. Missing deletion of the Tekton TriggerTemplate with generated name trigger-template-$name-$random
  7. Missing deletion of the Secret $name-generic-webhook-secret
  8. Missing deletion of the Secret $name-github-webhook-secret
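
The expected deletions above follow a naming pattern, which can be sketched as a small helper. This is only an illustration: the function name and the (kind, name) tuple format are hypothetical, and the console's actual deletion code is not shown in this report.

```python
from typing import List, Tuple


def pipeline_resources_to_delete(name: str,
                                 trigger_template_names: List[str]) -> List[Tuple[str, str]]:
    """List the (kind, name) pairs the report expects to be deleted for a workload."""
    resources = [
        ("Pipeline", name),
        ("Secret", f"{name}-generic-webhook-secret"),
        ("Secret", f"{name}-github-webhook-secret"),
    ]
    # TriggerTemplates carry a generated suffix, so match on the known prefix.
    prefix = f"trigger-template-{name}-"
    resources.extend(
        ("TriggerTemplate", t) for t in trigger_template_names if t.startswith(prefix)
    )
    return resources
```

Note that prefix matching alone could also select templates that belong to a different workload whose name starts with the same string, so a real implementation would likely need a stricter association such as labels or owner references.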

Additional info:

Description of problem:

For HOSTEDCP-1062, components without the label `hypershift.openshift.io/need-management-kas-access: "true"` cannot access the management cluster KAS resources.
But `kube-apiserver` in the HCP does not have the target label `hypershift.openshift.io/need-management-kas-access: "true"`, yet it can still access the mgmt KAS:


jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jie-test | grep kube-apiserver
kube-apiserver-6799b6cfd8-wk8pv                      3/3     Running   0          178m
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get pods kube-apiserver-6799b6cfd8-wk8pv -n clusters-jie-test -o yaml | grep hypershift.openshift.io/need-management-kas-access
jiezhao-mac:hypershift jiezhao$ 

jiezhao-mac:hypershift jiezhao$ oc -n clusters-jie-test rsh pod/kube-apiserver-6799b6cfd8-wk8pv curl --connect-timeout 2 -Iks https://10.0.142.255:6443 -v
Defaulted container "apply-bootstrap" out of: apply-bootstrap, kube-apiserver, audit-logs, init-bootstrap (init), wait-for-etcd (init)
* Rebuilt URL to: https://10.0.142.255:6443/
..
< HTTP/2 403 
HTTP/2 403 
...
< 
* Connection #0 to host 10.0.142.255 left intact

How reproducible:

Refer to the test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-65141

Steps to Reproduce:

https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-65141 

Additional info:

The router pod has the label and can access the mgmt KAS. My expectation is that the router pod shouldn't have the label and shouldn't access the mgmt KAS.
$ oc get pods router-667cb7f844-lx8mv -n clusters-jie-test -o yaml | grep hypershift.openshift.io/need-management-kas-access
hypershift.openshift.io/need-management-kas-access: "true"
jiezhao-mac:hypershift jiezhao$ oc -n clusters-jie-test rsh pod/router-667cb7f844-lx8mv curl --connect-timeout 2 -Iks https://10.0.142.255:6443 -v
Rebuilt URL to: https://10.0.142.255:6443/
Trying 10.0.142.255...
...
< HTTP/2 403
HTTP/2 403

> Actually, router doesn't need it anymore after https://github.com/openshift/hypershift/pull/2778 

Description of the problem:

Adding an invalid label (key or value) to a node returns error code 500 "Internal Server Error" instead of 400 "Bad Request".

 

How reproducible:

100%

 

Steps to reproduce:

1. Create a cluster

2. Boot node from ISO

3. Add a label with an invalid key or value

e.g:

curl -s -H 'Content-Type: application/json' -X PATCH -d '{"node_labels": [{"key": "Label-1", "value": "Label1*1"},{"key": "worker.label2", "value": "Label-2"}]}' https://api.stage.openshift.com/api/assisted-install/v2/infra-envs/8603fe29-e67f-49ad-8ba7-7a256bcb3923/hosts/af629f1e-da67-4211-97f0-f27cb10471ff --header "Authorization: Bearer $(ocm token)"

 

Actual results:

Action failed with error code 500

{"code":"500","href":"","id":500,"kind":"Error","reason":"node_labels: Invalid value: \"Label1*1\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')"}

 

Expected results:

Action failed with error code 400
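
The expected behavior can be sketched as follows: validate each label value against the regex quoted in the error message and treat a failure as a client error. The function names are hypothetical, the success code is a placeholder, and only label values are checked here (label keys follow different rules that are not shown in this report).

```python
import re

# Regex copied from the error message in the actual results above.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """True if the whole value matches the label-value regex."""
    return LABEL_VALUE_RE.fullmatch(value) is not None


def status_for_node_labels(node_labels) -> int:
    """Return 400 for any invalid value; 200 is a placeholder for success."""
    for label in node_labels:
        if not is_valid_label_value(label["value"]):
            return 400  # Invalid user input is a bad request, not a server error.
    return 200


# Payload from the reproduction step above.
payload = [
    {"key": "Label-1", "value": "Label1*1"},
    {"key": "worker.label2", "value": "Label-2"},
]
print(status_for_node_labels(payload))
```

The point of the sketch is only the mapping: input that fails validation is the caller's fault, so it belongs in the 4xx range.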

Description of problem:

Noticed an issue with the ignition server when testing some of the latest HO updates on our older control planes:
❯ oc logs ignition-server-5fd4c89764-bddss -n master-roks-dev-4-9
Defaulted container "ignition-server" out of: ignition-server, fetch-feature-gate (init)
Error: unknown flag: --feature-gate-manifest
This seems to be thrown because that flag doesn't exist in the ignition server source code for previous control plane versions. We're specifically only seeing this in 4.9 and 4.10, where the ignition server was not being managed by the CPO.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Install HO off main
2. Bring up 4.9/4.10 hosted control planes
3. Ignition server crashes

Actual results:

Ignition server crashes

Expected results:

Ignition server to run without issues
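
One way to avoid the crash is to pass the flag only to control plane versions that understand it. The sketch below assumes, based solely on the observation above, that versions newer than 4.10 accept `--feature-gate-manifest`; the function names, the argument layout, and the manifest path parameter are hypothetical and are not taken from the HyperShift code.

```python
from typing import List, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '4.10.3'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def extra_ignition_server_args(control_plane_version: str,
                               manifest_path: str) -> List[str]:
    """Return the feature-gate arguments only for versions assumed to support them."""
    args: List[str] = []
    # Assumption: 4.9 and 4.10 ignition servers reject the flag (see the log above).
    if parse_major_minor(control_plane_version) > (4, 10):
        args += ["--feature-gate-manifest", manifest_path]
    return args
```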

Additional info:

 

This is a clone of issue OCPBUGS-18246. The following is the description of the original issue:

Description of problem:

Role assignment for Azure AD Workload Identity performed by ccoctl does not provide an option to scope role assignments to the resource group containing the customer vnet in a bring-your-own (BYO) vnet installation workflow.

https://docs.openshift.com/container-platform/4.13/installing/installing_azure/installing-azure-vnet.html

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

100%

Steps to Reproduce:

1. Create Azure resource group and vnet for OpenShift within that resource group.
2. Create Azure AD Workload Identity infrastructure with ccoctl.
3. Follow steps to configure existing vnet for installation setting networkResourceGroupName within the install config.
4. Attempt cluster installation.

Actual results:

Cluster installation fails.

Expected results:

Cluster installation succeeds.

Additional info:

ccoctl must be extended to accept a parameter specifying the network resource group name and scope relevant component role assignments to the network resource group in addition to the installation resource group.

Description of problem:

When installing a HyperShift cluster into ap-southeast-3 (currently only available in the production environment), the install never succeeds because the hosted KCM pods are stuck in CrashLoopBackOff.

Version-Release number of selected component (if applicable):

4.12.18

How reproducible:

100%

Steps to Reproduce:

1. Install a HyperShift Cluster in ap-southeast-3 on AWS

Actual results:

kube-controller-manager-54fc4fff7d-2t55x                 1/2     CrashLoopBackOff   7 (2m49s ago)   16m
kube-controller-manager-54fc4fff7d-dxldc                 1/2     CrashLoopBackOff   7 (93s ago)     16m
kube-controller-manager-54fc4fff7d-ww4kv                 1/2     CrashLoopBackOff   7 (21s ago)     15m

With selected "important" logs:
I0606 15:16:25.711483       1 event.go:294] "Event occurred" object="kube-system/kube-controller-manager" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="LeaderElection" message="kube-controller-manager-54fc4fff7d-ww4kv_6dbab916-b4bf-447f-bbb2-5037864e7f78 became leader"
I0606 15:16:25.711498       1 event.go:294] "Event occurred" object="kube-system/kube-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="kube-controller-manager-54fc4fff7d-ww4kv_6dbab916-b4bf-447f-bbb2-5037864e7f78 became leader"
W0606 15:16:25.741417       1 plugins.go:132] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release. Please use https://github.com/kubernetes/cloud-provider-aws
I0606 15:16:25.741763       1 aws.go:1279] Building AWS cloudprovider
F0606 15:16:25.742096       1 controllermanager.go:245] error building controller context: cloud provider could not be initialized: could not init cloud provider "aws": not a valid AWS zone (unknown region): ap-southeast-3a

Expected results:

The KCM pods are Running

Description of problem:

Credentials secret generated by CCO on STS Manual Mode cluster does not have status

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

4.14.0

Steps to Reproduce:

1. Create a Manual mode, STS cluster in AWS.
2. Create a CredentialsRequest which provides .spec.cloudTokenPath and .spec.providerSpec.stsIAMRoleARN.
3. Observe that secret is created by CCO in the target namespace specified by the CredentialsRequest.
4. Observe that the CredentialsRequest does not set status once the secret is generated. Specifically, the CredentialsRequest does not set .status.provisioned == true.

Actual results:

Status is not set on CredentialsRequest with provisioned secret.

Expected results:

Status is set on CredentialsRequest with provisioned secret.

Additional info:

Reported by Jan Safranek when testing integration with the aws-efs-csi-driver-operator.

Description of problem: When running in development mode [1], the Loaded enabled plugin count numbers in the Cluster Dashboard Dynamic Plugins popover may be incorrect. In order to make the experience less confusing for users working with the console in development mode, we need to:

Note there is additional work planned in https://issues.redhat.com/browse/CONSOLE-3185. This bug is intended to only capture improving the experience for development mode.

[1] https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/README.md#plugin-development

Description of problem:

I have deployed multicluster-engine.v2.3.0-81 with spoke cluster 4.12.ec5

In the assisted pod I see data collection is enabled:
sh-4.4$ env | grep DATA
DATA_UPLOAD_ENDPOINT=https://console.redhat.com/api/ingress/v1/upload
ENABLE_DATA_COLLECTION=True 

But in the AI logs I see "Event uploading is not enabled"

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

  1. Deploy multicluster-engine.v2.3.0-81 with spoke cluster 4.12.ec5
  2. Check the logs and env vars in the pod
  3. ...

Actual results:

In the AI logs I see "Event uploading is not enabled"

Expected results:

Data should be uploaded

Additional info:

On https://issues.redhat.com/browse/RFE-2273 the customer analyzed quite correctly:

I have re-reviewed all of the provided data from the attached cases (DHL and ANZ) and have documented my findings below:
1) It looks like the request mentioned by the customer is sent to the Console API. Specifically `api/prometheus-tenancy/api/v1/*`
2) This is then forwarded to Cluster Monitoring (Thanos Querier) [0]
3) Thanos is configured to set the CORS headers to `*` due to the absence of the `--web.disable-cors` argument.[1]
4) The Thanos deployment is managed by the Cluster Monitoring Operator directly [2]
5) When using Postman, we can see the endpoint respond with an `access-control-allow-origin: *` header [see image 1]
6) After manually setting the `--web.disable-cors` argument inside the Thanos Querier deployment, the `access-control-allow-origin: *` header is removed.
7) Changed the Cluster Monitoring Operator deployment template[4] to include the flag and pushed the custom image into an OCP 4.10.31 cluster [3]
8) Everything seems to be working and the endpoint is no longer returning the CORS header. [see image 2]

We should set `--web.disable-cors` for our Thanos deployment. We don't load any cross-origin resources through the console > Thanos Querier path, so this should just work.

Description of the problem:

A base domain containing a double hyphen (`--`), like cat--rahul.com, is allowed by the UI and BE, but once a node is discovered, network validation fails.

The current domain is one particular case of using `--`, but note that the UI and BE allow many consecutive `-` characters as part of the domain name.

 

from agent logs:

 

 

Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Creating execution step for ntp-synchronizer ntp-synchronizer-70565cf4 args <[{\"ntp_source\":\"\"}]>" file="step_processor.go:123" request_id=5467e025-2683-4119-a55a-976bb7787279
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Creating execution step for domain-resolution domain-resolution-f3917dea args <[{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}]>" file="step_processor.go:123" request_id=5467e025-2683-4119-a55a-976bb7787279
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating domain resolution with args [{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}]" file="action.go:29"
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating inventory with args [fea3d7b9-a990-48a6-9a46-4417915072b0]" file="action.go:29"
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=error msg="Failed to validate domain resolution: data, {\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}" file="action.go:42" error="validation failure list:\nvalidation failure list:\ndomains.0.domain_name in body should match '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'"
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating ntp synchronizer with args [{\"ntp_source\":\"\"}]" file="action.go:29"
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating free addresses with args [[\"192.168.123.0/24\"]]" file="action.go:29"
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- sh -c cp /etc/mtab /root/mtab-fea3d7b9-a990-48a6-9a46-4417915072b0 && podman run --privileged --pid=host --net=host --rm --quiet -v /var/log:/var/log -v /run/udev:/run/udev -v /dev/disk:/dev/disk -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/host/var/log:ro -v /proc/meminfo:/host/proc/meminfo:ro -v /sys/kernel/mm/hugepages:/host/sys/kernel/mm/hugepages:ro -v /proc/cpuinfo:/host/proc/cpuinfo:ro -v /root/mtab-fea3d7b9-a990-48a6-9a46-4417915072b0:/host/etc/mtab:ro -v /sys/block:/host/sys/block:ro -v /sys/devices:/host/sys/devices:ro -v /sys/bus:/host/sys/bus:ro -v /sys/class:/host/sys/class:ro -v /run/udev:/host/run/udev:ro -v /dev/disk:/host/dev/disk:ro registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-agent-rhel8:v1.0.0-279 inventory]" file="execute.go:39"
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=error msg="Unable to create runner for step <domain-resolution-f3917dea>, args <[{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}]>" file="step_processor.go:126" error="validation failure list:\nvalidation failure list:\ndomains.0.domain_name in body should match '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'" request_id=5467e025-2683-4119-a55a-976bb7787279
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- findmnt --raw --noheadings --output SOURCE,TARGET --target /run/media/iso]" file="execute.go:39"
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- sh -c podman ps --format '{{.Names}}' | grep -q '^free_addresses_scanner$' || podman run --privileged --net=host --rm --quiet --name free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-agent-rhel8:v1.0.0-279 free_addresses '[\"192.168.123.0/24\"]']" file="execute.go:39"
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- timeout 30 chronyc -n sources]" file="execute.go:39"
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=warning msg="Sending step <domain-resolution-f3917dea> reply output <> error <validation failure list:\nvalidation failure list:\ndomains.0.domain_name in body should match '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'> exit-code <-1>" file="step_processor.go:76" request_id=5467e025-2683-4119-a55a-976bb7787279

 

 

 

How reproducible:

Create a cluster with the domain cat--rahul.com, using a UI that includes the fix allowing it.

Once a node is discovered, network validation fails on:

  • DNS wildcard not configured: DNS wildcard check cannot be performed yet because the host has not yet performed DNS resolution.

Steps to reproduce:

see above

Actual results:

Unable to install cluster due to network validation failure

Expected results:
The domain should be allowed by the regex
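
The mismatch can be reproduced directly with the regex from the agent logs. The relaxed pattern below is only one possible adjustment (allowing runs of hyphens between alphanumeric groups) and is an assumption made for illustration, not the pattern adopted by the fix.

```python
import re

# Pattern copied from the agent log error message.
CURRENT_DOMAIN_RE = re.compile(
    r"^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$")

# Hypothetical relaxed variant: "-+" accepts one or more consecutive hyphens.
RELAXED_DOMAIN_RE = re.compile(
    r"^([a-zA-Z0-9]+(-+[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$")

# Domain names taken from the log lines above.
domains = [
    "api.dummy---dummy.cat--rahul.com",
    "validateNoWildcardDNS.dummy---dummy.cat--rahul.com.",
    "quay.io",
]

for domain in domains:
    current_ok = CURRENT_DOMAIN_RE.match(domain) is not None
    relaxed_ok = RELAXED_DOMAIN_RE.match(domain) is not None
    print(f"{domain}: current={current_ok} relaxed={relaxed_ok}")
```

Both patterns still reject labels that start or end with a hyphen; only the handling of consecutive hyphens inside a label differs.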

Description of problem:

When modifying a secret in the Management Console that has a binary file included (such as a keystore), the keystore gets corrupted after the modification and therefore impacts application functionality (as the keystore cannot be read).

$ openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365
$ cat cert.pem key.pem > file.crt.txt
$ openssl pkcs12 -export -in file.crt.txt -out mykeystore.pkcs12 -name myAlias -noiter -nomaciter
$ oc create secret generic keystore --from-file=mykeystore.pkcs12 --from-file=cert.pem --from-file=key.pem -n project-300

apiVersion: v1
kind: Pod
metadata:
  name: mypod
  namespace: project-300
spec:
  containers:
  - name: mypod
    image: quay.io/rhn_support_sreber/curl:latest
    volumeMounts:
    - name: foo
      mountPath: "/keystore"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: keystore
      optional: true

# Getting the md5sum from the file on the local Laptop to compare with what is available in the pod
$ md5sum mykeystore.pkcs12
c189536854e59ab444720efaaa76a34a  mykeystore.pkcs12

sh-5.2# ls -al /keystore/..data/
total 16
drwxr-xr-x. 2 root root  100 Mar 24 11:19 .
drwxrwxrwt. 3 root root  140 Mar 24 11:19 ..
-rw-r--r--. 1 root root 1992 Mar 24 11:19 cert.pem
-rw-r--r--. 1 root root 3414 Mar 24 11:19 key.pem
-rw-r--r--. 1 root root 4380 Mar 24 11:19 mykeystore.pkcs12

sh-5.2# md5sum /keystore/..data/mykeystore.pkcs12
c189536854e59ab444720efaaa76a34a  /keystore/..data/mykeystore.pkcs12
sh-5.2#

Edit cert.pem in secret using the Management Console

$ oc delete pod mypod -n project-300

apiVersion: v1
kind: Pod
metadata:
  name: mypod
  namespace: project-300
spec:
  containers:
  - name: mypod
    image: quay.io/rhn_support_sreber/curl:latest
    volumeMounts:
    - name: foo
      mountPath: "/keystore"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: keystore
      optional: true

sh-5.2# ls -al /keystore/..data/
total 20
drwxr-xr-x. 2 root root   100 Mar 24 12:52 .
drwxrwxrwt. 3 root root   140 Mar 24 12:52 ..
-rw-r--r--. 1 root root  1992 Mar 24 12:52 cert.pem
-rw-r--r--. 1 root root  3414 Mar 24 12:52 key.pem
-rw-r--r--. 1 root root 10782 Mar 24 12:52 mykeystore.pkcs12

sh-5.2# md5sum /keystore/..data/mykeystore.pkcs12
56f04fa8059471896ed5a3c54ade707c  /keystore/..data/mykeystore.pkcs12
sh-5.2#      

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-03-23-204038   True        False         91m     Cluster version is 4.13.0-0.nightly-2023-03-23-204038

The modification was done in the Management Console by selecting the secret and then using: Actions -> Edit Secrets -> modifying the value of cert.pem and submitting via the Save button

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.13.0-0.nightly-2023-03-23-204038 and 4.12.6

How reproducible:

Always

Steps to Reproduce:

1. See the detailed steps above

Actual results:

# md5sum on the Laptop for the file
$ md5sum mykeystore.pkcs12
c189536854e59ab444720efaaa76a34a  mykeystore.pkcs12

# md5sum of the file in the pod after the modification in the Management Console
sh-5.2# md5sum /keystore/..data/mykeystore.pkcs12
56f04fa8059471896ed5a3c54ade707c  /keystore/..data/mykeystore.pkcs12

The file got corrupted and is no longer usable. The binary file should not be modified when editing the secret in the Management Console if no changes were made to its value.

Expected results:

The binary file should not be modified when editing the secret in the Management Console if no changes were made to its value.

Additional info:

A similar problem was already fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1879638, but that fix covered the case where the binary file was uploaded. It is possible that the secret edit functionality is also missing binary file support.
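
One mechanism that can produce this kind of corruption is treating binary data as UTF-8 text and writing it back. The sketch below demonstrates that effect in isolation; it is an assumption about a plausible cause, not a confirmed analysis of the console code. Each invalid byte sequence is replaced by U+FFFD, which encodes to three bytes, so the data both changes and grows, which would be consistent with the file size increase seen above (4380 to 10782 bytes).

```python
import base64


def lossy_text_round_trip(data: bytes) -> bytes:
    """Decode as UTF-8 (replacing invalid sequences) and encode again."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


def base64_round_trip(data: bytes) -> bytes:
    """Keep the value base64-encoded while it is handled as text."""
    return base64.b64decode(base64.b64encode(data))


# Stand-in for a binary keystore: every possible byte value once.
binary_value = bytes(range(256))

corrupted = lossy_text_round_trip(binary_value)
preserved = base64_round_trip(binary_value)

print(len(binary_value), len(corrupted), corrupted == binary_value)
print(len(preserved), preserved == binary_value)
```

Plain-text values such as PEM files survive the text round trip unchanged, which is why cert.pem and key.pem keep their sizes in the listing above while only the PKCS#12 file changes.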

Improve the logging format of KNI haproxy logs to display tcplogs plus the frontend IP and frontend port.

The current logging format is not very verbose:

<134>Jun  2 22:54:02 haproxy[11]: Connect from ::1:42424 to ::1:9445 (main/TCP)
<134>Jun  2 22:54:04 haproxy[11]: Connect from ::1:42436 to ::1:9445 (main/TCP)
<134>Jun  2 22:54:04 haproxy[11]: Connect from ::1:42446 to ::1:9445 (main/TCP)

It lacks critical information for troubleshooting, such as load-balancing destination and timestamps.
https://www.haproxy.com/blog/introduction-to-haproxy-logging recommends the following for tcp mode:

When in TCP mode, which is set by adding mode tcp, you should also add [option tcplog](https://www.haproxy.com/documentation/hapee/1-8r1/onepage/#option%20tcplog).

Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/535

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

CSI storage capacity tracking is GA since Kubernetes 1.24, yet must-gather does not collect CSIStorageCapacity objects. It would be useful for single node clusters with LVMO, but other clusters could benefit from it too.

Version-Release number of selected component (if applicable):

4.11.0

How reproducible:

always

Steps to Reproduce:

1. oc adm must-gather

Actual results:

Output does not contain CSIStorageCapacity objects

Expected results:

Output contains CSIStorageCapacity objects

Additional info:

We should go through all new additions to storage APIs (storage.k8s.io/v1) and any missing items.

Description of problem:

CNO panics with "net/http: abort Handler" while installing an SNO cluster on OpenShiftSDN

network                                    4.14.0-0.nightly-2023-07-05-191022   True        False         True       9h      Panic detected: net/http: abort Handler

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-05-191022

How reproducible:

sometimes

Steps to Reproduce:

1. Install an OpenShiftSDN cluster on SNO

Actual results:

Cluster (CNO) reports errors

Expected results:

Cluster should be installed fine

Additional info:

SOS: http://shell.lab.bos.redhat.com/~anusaxen/sosreport-rg-0707-tl6fd-master-0-2023-07-07-pyaruar.tar.xz

MG:  http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.4340060474822893433/

Hypershift needs to be able to specify a different release payload for control plane components without redeploying anything in the hosted cluster.

csi-driver-node DaemonSet pods in the hosted cluster and the csi-driver-controller Deployment that runs in the control plane both use the AWS_EBS_DRIVER_IMAGE and LIVENESS_PROBE_IMAGE

https://github.com/openshift/hypershift/blob/fc42313fc93125799f7eba5361190043cc2f6561/control-plane-operator/controllers/hostedcontrolplane/storage/envreplace.go#L9-L48

We need a way to specify these images separately for csi-driver-node and csi-driver-controller.

Description of problem:

Even in environments where container images are manually loaded into containers-store, services will fail because they are written to always pull images prior to starting the container, without first checking with podman whether the image already exists.


Description of problem:

Business Automation Operands fail to load in the uninstall operator modal, with the alert message "Cannot load Operands. There was an error loading operands for this operator. Operands will need to be deleted manually...".

"Delete all operand instances for this operator__checkbox" is not shown so the test fails. 

https://search.ci.openshift.org/?search=Testing+uninstall+of+Business+Automation+Operator&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


Description of problem:

The cluster-policy-controller container of kube-controller-manager shows unusual error logs, such as:
I0214 10:49:34.698154       1 interface.go:71] Couldn't find informer for template.openshift.io/v1, Resource=templateinstances
I0214 10:49:34.698159       1 resource_quota_monitor.go:185] QuotaMonitor unable to use a shared informer for resource "template.openshift.io/v1, Resource=templateinstances": no informer found for template.openshift.io/v1, Resource=templateinstances

Version-Release number of selected component (if applicable):

 

How reproducible:

When the cluster-policy-controller restarts, you will see these logs

Steps to Reproduce:

1. oc logs kube-controller-manager-master0 -n openshift-kube-controller-manager -c cluster-policy-controller


Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1042

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

nmstate packages > 2.2.9 will cause MCD firstboot to fail. For now, let's pin the nmstate version and fix properly via https://github.com/openshift/machine-config-operator/pull/3720


Description of problem:

No datapoints found for Long Running Requests by Resource and Long Running Requests by Instance of "API Performance" dashboard on web-console UI

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-13-223353

How reproducible:

always

Steps to Reproduce:

1. Install an OCP cluster with a 4.14 nightly payload
2. Open the web console and view the "API Performance" dashboard

Actual results:

1. The Long Running Requests by Resource and Long Running Requests by Instance pages show "No datapoints found"

Expected results:

1. The Long Running Requests by Resource and Long Running Requests by Instance pages should show data.

Additional info:

1. Got the same results on 4.13.
2. apiserver_longrunning_gauge was not found in the Prometheus data, only apiserver_longrunning_requests

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep apiserver_longrunning_gauge
no result

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep apiserver_long
    "apiserver_longrunning_requests",

Description of problem:

In the assisted-installer flow, the bootkube service is started on the Live ISO, so the root FS is read-only. The OKD installer attempts to pivot the booted OS to machine-os-content via `rpm-ostree rebase`. This is not necessary since we're already using SCOS in the Live ISO.


Description of problem:

Print preview of Topology presents incorrect layout

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Always

Steps to Reproduce:

1. Have 2 KNative/Serverless Functions deployed (in my case 1 is Quarkus and another is Spring Boot)
2. In the Topology UI, observe that their snippets are displayed properly within Graph view.
3. Now switch to List view.
4. In my case, the List view shows this short list of items:
Broker
  default
Operator Backed Service
DW terminal-avby87
  D workspaceb5975d64dbc54983
Service
KSVC caller-function
  REV caller-function-00002
Service
KSVC callme-function
  REV callme-function-00001
5. Now, using the Chrome browser, press Ctrl+P to open Print preview.
6. Observe that, even in Landscape mode, only the items up to the workspace item are displayed, with no further pages or info.

Actual results:

Incomplete Topology info from List view in Print Preview

Expected results:

Full and accurate Topology info from List view in Print Preview

Additional info:

 

Description of problem:

When installing a new cluster with TechPreviewNoUpgrade featureSet, Nodes never become Ready.

Logs from control-plane components indicate that a resource associated with the DynamicResourceAllocation feature can't be found:

E0804 15:48:51.094383       1 reflector.go:147] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1alpha2.PodSchedulingContext: failed to list *v1alpha2.PodSchedulingContext: the server could not find the requested resource (get podschedulingcontexts.resource.k8s.io)

It turns out we either need to:

1. Enable the resource.k8s.io/v1alpha2=true API in kube-apiserver.
2. Or disable the DynamicResourceAllocation feature as TP.

For now I added a commit to invalidate this feature in o/k and disable all related tests. Please let me know once this is sorted out so that I can drop that commit from the rebase PR.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Always when installing a new cluster with TechPreviewNoUpgrade featureSet.

Steps to Reproduce:

1. Install cluster with TechPreviewNoUpgrade featureSet (this can be done passing an install-config.yaml to the installer).
2. Check logs from one of the control-plane components.

Actual results:

Nodes are NotReady and ClusterOperators Degraded.

Expected results:

Cluster is installed successfully.

Additional info:

Slack thread: https://redhat-internal.slack.com/archives/C05HQGU8TFF/p1691154653507499

How to enable an API in KAS: https://kubernetes.io/docs/tasks/administer-cluster/enable-disable-api/

When making a change to the uninstaller for GCP, the linter picked up an error:

 

 

pkg/destroy/gcp/gcp.go:42:2: found a struct that contains a context.Context field (containedctx)
	Context           context.Context 

 

 

Contexts should not be added to structs. Instead the context should be created at the top level of the uninstaller OR a separate context can be used for each stage of the uninstallation process.

 

Currently this error can be bypassed by adding:

//nolint:containedctx 

to the offending line

 

Description of problem:

We need to update the operator to sync with the Kubernetes API version used by OCP 4.14. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 
The help text for --oci-registries-config should be updated to reference --include-local-oci-catalogs:


      --oci-registries-config string    Registries config file location (used only with --use-oci-feature flag)
The `--use-oci-feature` flag has been deprecated; please replace it with --include-local-oci-catalogs in the help text.

Description of problem:

After updating the sysctl config map, the test waits up to 30s for the pod to be in ready state. From the logs, it could be seen that the allowlist controller takes more than 30s to reconcile when multiple tests are running in parallel.

The internal logic of the allowlist controller waits up to 60s for the pods of the allowlist DS to be running. Therefore, it is logical to increase the timeout in the test to 60s.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Go to the console
2. Click on "Installed Operators"
3. Add an operator (Node Feature Discovery)
4. Go to all instances and click "Create new" (see image)

Actual results:

The drop-down is empty, but as a user you can still click its entries and get to the new instance YAML

Expected results:

For a better user experience, the drop-down should show at least some labels or clickable text

Additional info:

 

Description of problem:

While installing clusters with the assisted installer, we have lately seen cases where one of the masters joins very quickly and starts all the pods needed for cluster bootstrap to finish, but the second one joins only after that.
Keepalived can't start if only one master has joined, as it doesn't have enough data to build its configuration files.
In HA mode, cluster bootstrap should wait for at least 2 joined masters before removing the bootstrap control plane, as without that the installation will fail.
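
The gating rule above can be sketched as a small predicate. This is an illustration only: `canRemoveBootstrap` and `minJoinedMasters` are hypothetical names, and the non-HA branch (one joined master is enough) is an assumption that the report does not state.

```go
package main

import "fmt"

// minJoinedMasters is the HA threshold named in the report.
const minJoinedMasters = 2

// canRemoveBootstrap reports whether the bootstrap control plane may be removed.
// podsReady stands for "all required pods are running on the joined masters".
func canRemoveBootstrap(highlyAvailable bool, joinedMasters int, podsReady bool) bool {
	if !podsReady {
		return false
	}
	if highlyAvailable {
		// Keepalived needs data from at least two masters to build its configuration.
		return joinedMasters >= minJoinedMasters
	}
	// Assumption: outside HA mode a single joined master is sufficient.
	return joinedMasters >= 1
}

func main() {
	// The failing scenario from the report: HA mode, one fast master, pods already up.
	fmt.Println(canRemoveBootstrap(true, 1, true))
	// The same cluster once a second master has joined.
	fmt.Println(canRemoveBootstrap(true, 2, true))
}
```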
 

Version-Release number of selected component (if applicable):

 

How reproducible:

Start bm installation and start one master, wait till it starts all required pods and then add others.

Steps to Reproduce:

1. Start bm installation 
2. Start one master 
3. Wait till it starts all required pods.
4. Add others

Actual results:

no vip, installation fails

Expected results:

installation succeeds, vip moves to master

Additional info:

 

Description of problem:

After a replace upgrade from one OCP 4.14 image to another 4.14 image, the first node is NotReady.

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-128-175.us-east-2.compute.internal   Ready      worker   72m     v1.26.2+06e8c46
ip-10-0-134-164.us-east-2.compute.internal   Ready      worker   68m     v1.26.2+06e8c46
ip-10-0-137-194.us-east-2.compute.internal   Ready      worker   77m     v1.26.2+06e8c46
ip-10-0-141-231.us-east-2.compute.internal   NotReady   worker   9m54s   v1.26.2+06e8c46

- lastHeartbeatTime: "2023-03-21T19:48:46Z"
  lastTransitionTime: "2023-03-21T19:42:37Z"
  message: 'container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
   message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/.
   Has your network provider started?'
  reason: KubeletNotReady
  status: "False"
  type: Ready

Events:
Type     Reason                   Age                 From                   Message
----     ------                   ----                ----                   -------
Normal   Starting                 11m                 kubelet                Starting kubelet.
Normal   NodeHasSufficientMemory  11m (x2 over 11m)   kubelet                Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Normal   NodeHasNoDiskPressure    11m (x2 over 11m)   kubelet                Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Normal   NodeHasSufficientPID     11m (x2 over 11m)   kubelet                Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientPID
Normal   NodeAllocatableEnforced  11m                 kubelet                Updated Node Allocatable limit across pods
Normal   Synced                   11m                 cloud-node-controller  Node synced successfully
Normal   RegisteredNode           11m                 node-controller        Node ip-10-0-141-231.us-east-2.compute.internal event: Registered Node ip-10-0-141-231.us-east-2.compute.internal in Controller
Warning  ErrorReconcilingNode     17s (x30 over 11m)  controlplane           nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation

ovnkube-master log:

I0321 20:55:16.270197       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270273       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:17.851497       1 master.go:719] Adding or Updating Node "ip-10-0-137-194.us-east-2.compute.internal"
I0321 20:55:25.965132       1 master.go:719] Adding or Updating Node "ip-10-0-128-175.us-east-2.compute.internal"
I0321 20:55:45.928694       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432145 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:55:46.270129       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270154       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270164       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:55:46.270201       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270284       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:52.916512       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 5 items received
I0321 20:56:06.910669       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 12 items received
I0321 20:56:15.928505       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432175 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:56:16.269611       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269637       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269646       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:56:16.269688       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269697       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269724       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation

cluster-network-operator log:

I0321 21:03:38.487602       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.488312       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:38.499825       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.571013       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available
I0321 21:03:39.450912       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.450953       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.493206       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.494050       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:39.508538       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.684429       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. management cluster 4.13
2. bring up the hostedcluster and nodepool in 4.14.0-0.nightly-2023-03-19-234132
3. upgrade the hostedcluster to 4.14.0-0.nightly-2023-03-20-201450 
4. replace upgrade the nodepool to 4.14.0-0.nightly-2023-03-20-201450 

Actual results:

First node is in NotReady

Expected results:

All nodes should be Ready

Additional info:

No issue with replace upgrade from 4.13 to 4.14

 

 

 

 

 

 

Description of problem:

When mirroring the NVIDIA operator with oc-mirror 4.13, the ImageContentSourcePolicy is not created correctly

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create imageset file

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 4
storageConfig:
  local:
    path: /home/name/nvidia
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.11
    packages:
    - name: nvidia-network-operator

2. Mirror to disk using oc-mirror 4.13
$ oc-mirror -c imageset.yaml file:///home/name/nvidia/
$ ./oc-mirror version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.13.0-202307242035.p0.gf11a900.assembly.stream-f11a900", GitCommit:"f11a9001caad8fe146c73baf2acc38ddcf3642b5", GitTreeState:"clean", BuildDate:"2023-07-24T21:25:46Z", GoVersion:"go1.19.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

3. Now generate the manifest

$ oc-mirror --from /home/name/nvidia/ docker://registry:8443 --manifests-only

- mirrors:
    - registry:8443/nvidia/cloud-native
    source: nvcr.io/nvidia

However the correct mapping should be:
    - mirrors:
        - registry/nvidia
      source: nvcr.io/nvidia

4. Perform the same steps with oc-mirror 4.12.0; the issue does not occur.
./oc-mirror version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.12.0-202304241542.p0.g5fc00fe.assembly.stream-5fc00fe", GitCommit:"5fc00fe735d8fb3b6125f358f5d6b9fe726fad10", GitTreeState:"clean", BuildDate:"2023-04-24T16:01:29Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

 

 

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
Builds navigation item is missing in Developer perspective

Version-Release number of selected component (if applicable):
4.14.0

How reproducible:
Always

Steps to Reproduce:

  1. Open the developer perspective on a cluster with BuildConfigs (default)

Actual results:
"Builds" is missing as a navigation item below "Search".

Expected results:
"Builds" navigation item should be displayed again when BuildConfigs CRD is available.

Additional info:
Might be dropped with PR https://github.com/openshift/console/pull/13097

Description of problem:

We disabled copies of CSVs in our clusters. The list of installed operators is visible, but when we go (within the context of a user namespace) to:
Developer Catalog -> Operator Backed
then the list is empty.

When we enable the copies of CSVs, then the operator backed catalog shows the expected items.

Version-Release number of selected component (if applicable):

OpenShift 4.13.1

How reproducible:

every time

Steps to Reproduce:

1. install Camel-k operator (community version, stable channel)
2. Disable copies of CSV by setting 'OLMConfig.spec.features.disableCopiedCSVs' to 'true'
3. create a new namespace/project
4. go to Developer Catalog -> Operator backed

Actual results:

the Operator Backed Catalog is empty

Expected results:

the Operator Backed Catalog should show Camel-K related items

Additional info:

 

Description of problem:

Dockerfile.fast relies on picking up the `bin` directory built in the host for inclusion in the HyperShift Operator image for development.

Containerfile.operator, for RHTAP, relies on .dockerignore to keep a `/bin` directory out of the podman build context, because its permissions prevent the user `default` (used by the golang build container) from writing to it.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. make docker-build-fast

Actual results:

COPY bin/* /usr/bin/ fails due to bin not being included in the podman build context

Expected results:

The container builds successfully

Additional info:

 

This is a clone of issue OCPBUGS-19311. The following is the description of the original issue:

Description

As a user, I would like to use the Import from Git form even if BuildConfig (BC) is not installed in my cluster, as long as I have installed the Pipelines Operator.

Acceptance Criteria

  1. Show the Import From Git Tab on the Add page if Pipelines Operator is installed and BuildConfig is not installed in the cluster

Additional Details:

Description of problem:

The Jenkins and Jenkins Agent Base image versions need to be updated to use the latest images to mitigate known CVEs in plugins and Jenkins versions.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-15999.

The PSA changes introduced in 4.12 meant that we had to find a way to ensure that customer workloads (3rd-party or otherwise) wouldn't grind to a halt because pods cannot be scheduled due to PSA. The solution was to have another controller that introspects a namespace to determine the best pod security standard to apply to it. This controller ignores payload namespaces (usually named openshift-*), but will reconcile non-payload openshift-* namespaces that have a special label applied to them. On the OLM side, we had to create a controller that applies the PSA label syncer label to non-payload openshift-* namespaces with operators (CSVs) installed in them.

OLM took a dependency on the cluster-policy-controller (CPC) in order to get the list of payload namespaces. This dependency introduced a few challenges for our CI:

  • we need to ensure parity between the CPC and OLM OpenShift releases, since the list of payload namespaces could vary between OpenShift releases.
  • because the CPC is also a controller, it depends on many of the same libraries as OLM. This can cause vendoring problems, or force OLM to be in lockstep with CPC w.r.t. the common controller libraries.

To avoid these issues, and seeing as the list probably won't change very often, we'll make our own copy of the list and maintain it on this side, as this will be less busy work than the alternative.
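
A minimal sketch of the decision described above, assuming a locally maintained set of payload namespaces. The function name `needsSyncerLabel`, the variable `payloadNamespaces`, and the namespace `openshift-example-operator` are hypothetical, and the three listed namespaces are only an illustrative subset, not the real list kept by OLM or the CPC.

```go
package main

import (
	"fmt"
	"strings"
)

// payloadNamespaces is a locally maintained copy of the payload namespace list.
// Illustrative subset only; the real list is longer and may differ per release.
var payloadNamespaces = map[string]struct{}{
	"openshift-monitoring":     {},
	"openshift-multus":         {},
	"openshift-ovn-kubernetes": {},
}

// needsSyncerLabel reports whether the OLM-side controller should apply the
// PSA label syncer label: only non-payload openshift-* namespaces that have
// operators (CSVs) installed in them qualify.
func needsSyncerLabel(namespace string, hasCSVs bool) bool {
	if !strings.HasPrefix(namespace, "openshift-") {
		return false
	}
	if _, isPayload := payloadNamespaces[namespace]; isPayload {
		return false
	}
	return hasCSVs
}

func main() {
	for _, ns := range []string{"openshift-monitoring", "openshift-example-operator", "my-project"} {
		fmt.Printf("%s -> label: %v\n", ns, needsSyncerLabel(ns, true))
	}
}
```

Keeping the set in a plain map means a release bump only touches this one variable, which is the trade-off the paragraph above argues for.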

Description of problem:

On attempting to perform an EUS->EUS upgrade from 4.12.z->4.14 (CI builds), I consistently see that after upgrading OCP to 4.14, the worker MachineConfigPool goes to a degraded state, complaining about:

message: 'Node c01-dbn-412-tzm44-worker-0-7w6wg is reporting: "failed to run
        nmstatectl: fork/exec /run/machine-config-daemon-bin/nmstatectl: no such file
        or directory", Node c01-dbn-412-tzm44-worker-0-cmqsl is reporting: "failed
        to run nmstatectl: fork/exec /run/machine-config-daemon-bin/nmstatectl: no
        such file or directory", Node c01-dbn-412-tzm44-worker-0-qrp6v is reporting:
        "failed to run nmstatectl: fork/exec /run/machine-config-daemon-bin/nmstatectl:
        no such file or directory"'

And then clusterversion reports an error:

[cloud-user@ocp-psi-executor dbasunag]$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.ci-2023-08-14-110508   True        True          125m    Unable to apply 4.14.0-0.ci-2023-08-14-152624: wait has exceeded 40 minutes for these operators: machine-config
[cloud-user@ocp-psi-executor dbasunag]$

This is consistently reproducible in clusters with knmstate installed.

Version-Release number of selected component (if applicable):

4.12.29 -> 4.13.0-0.ci-2023-08-14-110508->4.14.0-0.ci-2023-08-14-152624

How reproducible:

100%

Steps to Reproduce:

1. Perform EUS upgrade on a cluster with CNV, ODF, Knmstate
2. After pausing worker mcp, upgraded OCP, ODF, CNV, KNMstate to 4.13 - everything worked fine
3. After upgrading OCP to 4.14, when master mcp is updated, worker mcp went to degraded state and clusterversion eventually reported error (all the master nodes were updated)

Actual results:

[cloud-user@ocp-psi-executor dbasunag]$ oc get co
NAME                                       VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-0.ci-2023-08-14-152624   True        False         False      9h      
baremetal                                  4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
cloud-controller-manager                   4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
cloud-credential                           4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
cluster-autoscaler                         4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
config-operator                            4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
console                                    4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
control-plane-machine-set                  4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
csi-snapshot-controller                    4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
dns                                        4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
etcd                                       4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
image-registry                             4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
ingress                                    4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
insights                                   4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
kube-apiserver                             4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
kube-controller-manager                    4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
kube-scheduler                             4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
kube-storage-version-migrator              4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
machine-api                                4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
machine-approver                           4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
machine-config                             4.13.0-0.ci-2023-08-14-110508   True        True          True       2d23h   Unable to apply 4.14.0-0.ci-2023-08-14-152624: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)]]
marketplace                                4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
monitoring                                 4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
network                                    4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
node-tuning                                4.14.0-0.ci-2023-08-14-152624   True        False         False      95m     
openshift-apiserver                        4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
openshift-controller-manager               4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
openshift-samples                          4.14.0-0.ci-2023-08-14-152624   True        False         False      98m     
operator-lifecycle-manager                 4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
operator-lifecycle-manager-catalog         4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
operator-lifecycle-manager-packageserver   4.14.0-0.ci-2023-08-14-152624   True        False         False      2d22h   
service-ca                                 4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
storage                                    4.14.0-0.ci-2023-08-14-152624   True        False         False      2d23h   
[cloud-user@ocp-psi-executor dbasunag]$ 
[cloud-user@ocp-psi-executor dbasunag]$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-693b054330417fe5e098b58716603fc8   True      False      False      3              3                   3                     0                      2d23h
worker   rendered-worker-b2f5a9084e9919b4c1c491658c73bce5   False     False      True       3              0                   0                     3                      2d23h
[cloud-user@ocp-psi-executor dbasunag]$
[cloud-user@ocp-psi-executor dbasunag]$ oc get node
NAME                               STATUS   ROLES                  AGE     VERSION
c01-dbn-412-tzm44-master-0         Ready    control-plane,master   2d23h   v1.27.4+deb2c60
c01-dbn-412-tzm44-master-1         Ready    control-plane,master   2d23h   v1.27.4+deb2c60
c01-dbn-412-tzm44-master-2         Ready    control-plane,master   2d23h   v1.27.4+deb2c60
c01-dbn-412-tzm44-worker-0-7w6wg   Ready    worker                 2d22h   v1.25.11+1485cc9
c01-dbn-412-tzm44-worker-0-cmqsl   Ready    worker                 2d22h   v1.25.11+1485cc9
c01-dbn-412-tzm44-worker-0-qrp6v   Ready    worker                 2d22h   v1.25.11+1485cc9
[cloud-user@ocp-psi-executor dbasunag]$ 

Expected results:

EUS upgrade should work without error

Additional info:

Must-gather can be found here: https://drive.google.com/drive/folders/1SCZoYpGiRpOteTM-sTLmbfgr3hqsICVO?usp=drive_link

Description of problem:

The CredentialsRequest for Azure AD Workload Identity is missing the disk encryption set read permission:

- Microsoft.Compute/diskEncryptionSets/read

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

Every time when creating a machine with a disk encryption set

Steps to Reproduce:

1. Create workload identity cluster
2. Create keyvault and secret within keyvault
3. Create disk encryption set and point it to keyvault; can use system-assigned identity 
4. Create or modify existing machineset to include a disk encryption set.  
            managedDisk:
              diskEncryptionSet:
                id: /subscriptions/<subscription_id>/resourceGroups/<resource_id>/providers/Microsoft.Compute/diskEncryptionSets/<disk_encryption_set_name>
5. Scale machineset 

Actual results:

'failed to create vm <vm_name>:
        failure sending request for machine steven-wi-cluster-pzqvm-worker-eastus3-mfk5z:
        cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending
        request: StatusCode=403 -- Original Error: Code="LinkedAuthorizationFailed"
        Message="The client ''55c10ba9-f891-4f42-a697-0ab283b86c63'' with object id
        ''55c10ba9-f891-4f42-a697-0ab283b86c63'' has permission to perform action
        ''Microsoft.Compute/virtualMachines/write'' on scope ''/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Compute/virtualMachines/steven-wi-cluster-pzqvm-worker-eastus3-mfk5z'';
        however, it does not have permission to perform action ''read'' on the linked
        scope(s) ''/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Compute/diskEncryptionSets/test-disk-encryption-set''
        or the linked scope(s) are invalid."'

Expected results:

The machine is created and joins the cluster successfully.

Additional info:

Docs about preparing disk encryption sets on Azure: https://docs.openshift.com/container-platform/4.12/installing/installing_azure/enabling-user-managed-encryption-azure.html 

Description of problem:

Labels added in the Git import flow are not propagated to the pipeline resources when a pipeline is added

Steps to Reproduce:

1. Go to the Git Import form
2. Add Pipeline
3. Add labels
4. Submit the form

Actual results:

The added labels are not propagated to the pipeline resources

Expected results:

The added labels should be added to the pipeline resources

Please review the following PR: https://github.com/openshift/etcd/pull/208

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Cannot list Kepler CSV

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Install Kepler Community Operator
2. Create Kepler Instance
3. Console gets error and shows "Oops, something went wrong"

Actual results:

Console gets error and shows "Oops, something went wrong"

Expected results:

Should list Kepler Instance

OAuth-Proxy should send an Audit-Id header with its requests to the kube-apiserver so that we can easily track its requests and be able to tell which arrived and which were processed.

This comes from a time when the CI was in disarray and oauth-proxy requests were failing to reach the KAS, but we did not know whether any were processed at all or whether they were all simply rejected somewhere in the middle.
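
As a rough illustration of the idea, the sketch below attaches an Audit-Id header to a set of outgoing request headers when none is present. The helper name with_audit_id and the choice of a random UUID as the value are assumptions for illustration; this is not the oauth-proxy implementation.

```python
import uuid


def with_audit_id(headers: dict) -> dict:
    """Return a copy of headers that carries an Audit-Id.

    An existing Audit-Id (matched case-insensitively, as HTTP header
    names are) is preserved so that a caller-supplied ID stays traceable.
    """
    out = dict(headers)
    if not any(name.lower() == "audit-id" for name in out):
        # Assumption: a random UUID is an acceptable ID value.
        out["Audit-Id"] = str(uuid.uuid4())
    return out
```

Logging the generated ID on the client side next to each request would then allow matching it against the kube-apiserver audit log to tell which requests arrived.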

Description of the problem:

 assisted-service pod crashloops with kube-api enabled without the BMH CRD.

How reproducible:

 100%

Steps to reproduce:

1. Deploy assisted-service with kube-api enabled

2. Either don't create or remove the BMH CRD (if removed you will need to restart the assisted-service pod)

3. Observe assisted-service pod

Actual results:

 After a few minutes assisted-service will crash with a message like:

time="2023-01-12T14:26:03Z" level=fatal msg="failed to run manager" func=main.main.func1 file="/remote-source/assisted-service/app/cmd/main.go:204" error="failed to wait for baremetal-agent-controller caches to sync: timed out waiting for cache to be synced"

Expected results:

Either assisted-service comes up without the BMAC controller and without errors, or it fails with a clear error stating that the BMH CRD is required and is missing.
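
The two acceptable outcomes can be sketched as a start-up check. This is an illustrative sketch only: the function name, the require_bmh switch, and the CRD name used here are assumptions, not the assisted-service code.

```python
# Assumed CRD name, used here for illustration only.
BMH_CRD = "baremetalhosts.metal3.io"


def should_start_bmac(installed_crds: set, require_bmh: bool = False) -> bool:
    """Decide whether to start the baremetal-agent-controller.

    Returns True when the BMH CRD is installed. When it is missing,
    either skip the controller (return False) or, if the CRD is
    required, fail immediately with a clear message instead of timing
    out later while waiting for caches to sync.
    """
    if BMH_CRD in installed_crds:
        return True
    if require_bmh:
        raise RuntimeError(f"required CRD {BMH_CRD} is missing")
    return False
```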

Description of problem:

The test for updating the sysctl whitelist fails to check the error returned when the pod running state is verified.

As a result, the test always passes, and we failed to detect a bug in the cluster network operator for the allowlist controller.

Please review the following PR: https://github.com/openshift/cluster-image-registry-operator/pull/855

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

OCP 4.11 ships the alertingrules CRD as a techpreview feature. Before graduating to GA we need to have e2e tests in the CMO repository.

AC:

  • End-to-end tests in the CMO repository validating that
    • Admins can create/update/delete alertingrules
    • Invalid resources are rejected, and invalid alertingrules don't break the system
  • Configuration of a blocking job in openshift/release.

Description of problem:

When running the nutanix-e2e-windows test from the WMCO PR https://github.com/openshift/windows-machine-config-operator/pull/1398, the MAPI nutanix-controller failed to create the Windows machine VM, with the error logs below. It failed to unmarshal the windows-user-data into the IgnitionConfig struct, because the windows-user-data is a PowerShell script rather than Ignition data.

I0424 17:37:43.472054       1 recorder.go:103] events "msg"="ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt: reconciler failed to Create machine: failed to get user data: Failed to unmarshal userData to IgnitionConfig. invalid character '<' looking for beginning of value" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt","uid":"d3981cb0-4f98-4424-9252-b100521c2a93","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"31045"} "reason"="FailedCreate" "type"="Warning"
E0424 17:37:43.472923       1 controller.go:329]  "msg"="Reconciler error" "error"="ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt: reconciler failed to Create machine: failed to get user data: Failed to unmarshal userData to IgnitionConfig. invalid character '<' looking for beginning of value" "controller"="machine-controller" "name"="ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt" "namespace"="openshift-machine-api" "object"={"name":"ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt","namespace":"openshift-machine-api"} "reconcileID"="16572b5d-2418-4f7c-b7a8-5f08f2659391"
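
The failure mode can be reproduced in a few lines: a strict JSON parse of a PowerShell payload fails on the leading '<'. The sketch below shows one tolerant way to handle it, assuming that non-Ignition user data should be passed through unchanged. The function name and the pass-through behaviour are assumptions for illustration, not the actual fix in the Nutanix controller.

```python
import json


def classify_user_data(raw: bytes):
    """Return ("ignition", parsed) for Ignition JSON, else ("raw", raw).

    Anything that is not a JSON object with an "ignition" key (for
    example a PowerShell script) is returned untouched instead of
    raising a parse error.
    """
    try:
        parsed = json.loads(raw)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return "raw", raw
    if isinstance(parsed, dict) and "ignition" in parsed:
        return "ignition", parsed
    return "raw", raw
```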

How reproducible:

When the Machine is configured to be a Windows node

Steps to Reproduce:

Run the ci/prow/nutanix-e2e-operator test.

Actual results:

The MAPI nutanix-controller failed to create the Windows VM, with the error logs shown above.

Expected results:

The Windows VM and node can be successfully created and provisioned.

From deads2k: I think creating pods that should get rejected in the kube-system namespace would ensure it.  OCP-classic is still struggling with customers who did naughty things.

Description of problem:

There are several labels used by the Nutanix platform which can vary between instances. If not set as ignore labels on the Cluster Autoscaler, features such as balancing similar node groups will not work predictably.

The Cluster Autoscaler Operator should be updated with the following labels on Nutanix:

* nutanix.com/prism-element-name
* nutanix.com/prism-element-uuid
* nutanix.com/prism-host-name
* nutanix.com/prism-host-uuid

For reference, see this code: https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.14/pkg/controller/clusterautoscaler/clusterautoscaler.go#L72-L159
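
As a hedged sketch of the expected outcome, the labels listed above could be rendered as cluster-autoscaler command line flags like this. The helper name is hypothetical, and the operator itself is not written this way; --balancing-ignore-label is the cluster-autoscaler flag for ignoring a label when comparing node groups.

```python
NUTANIX_IGNORE_LABELS = [
    "nutanix.com/prism-element-name",
    "nutanix.com/prism-element-uuid",
    "nutanix.com/prism-host-name",
    "nutanix.com/prism-host-uuid",
]


def balancing_ignore_args(labels):
    """Render one --balancing-ignore-label flag per label."""
    return [f"--balancing-ignore-label={label}" for label in labels]
```

Inspecting the cluster-autoscaler deployment in step 2 below should then show one such flag per label in the container arguments.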

Version-Release number of selected component (if applicable):

master, 4.14

How reproducible:

always

Steps to Reproduce:

1. create a ClusterAutoscaler CR on Nutanix platform
2. inspect the deployment for the cluster-autoscaler
3. see that it does not have the ignore labels added as command line flags

Actual results:

labels are not added as flags

Expected results:

labels should be added as flags

Additional info:

This should probably be backported to 4.13 as well, since the labels will be applied by the Nutanix CCM.

Description of problem:

Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/255

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We should log vcenter version information in plain text.

There are cases in the code where the vCenter version that we receive from vCenter could be unparseable. I see errors in the problem-detector while parsing the version, and both the CSI driver and the operator depend on the ability to determine the vCenter version.
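
A minimal sketch of the idea: log the raw version string before any parsing, so that an unparseable value is still visible in the logs. The regular expression, logger name, and function name are assumptions for illustration and do not reflect the actual driver or operator code.

```python
import logging
import re

log = logging.getLogger("vcenter-version")

# Assumption: versions start with MAJOR.MINOR and an optional PATCH.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def parse_vcenter_version(raw: str):
    """Log the raw version, then return (major, minor, patch) or None."""
    log.info("vCenter reported version %r", raw)
    match = _VERSION_RE.match(raw.strip())
    if match is None:
        log.warning("could not parse vCenter version %r", raw)
        return None
    # A missing patch component is treated as 0.
    return tuple(int(g) if g is not None else 0 for g in match.groups())
```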

A clone of https://issues.redhat.com/browse/OCPBUGS-11143 but for the downstream openshift/cloud-provider-azure

 

Description of problem:

On Azure, after deleting a master, the old machine is stuck in Deleting and some pods in the cluster are in ImagePullBackOff. Checking the Azure console shows that the new master was not added to the load balancer backend, which seems to leave the machine without an internet connection.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-12-024338

How reproducible:

Always

Steps to Reproduce:

1. Set up a cluster on Azure, networkType ovn
2. Delete a master
3. Check master and pod

Actual results:

Old machine stuck in Deleting,  some pods are in ImagePullBackOff.
 $ oc get machine    
NAME                                    PHASE      TYPE              REGION   ZONE   AGE
zhsunaz2132-5ctmh-master-0              Deleting   Standard_D8s_v3   westus          160m
zhsunaz2132-5ctmh-master-1              Running    Standard_D8s_v3   westus          160m
zhsunaz2132-5ctmh-master-2              Running    Standard_D8s_v3   westus          160m
zhsunaz2132-5ctmh-master-flqqr-0        Running    Standard_D8s_v3   westus          105m
zhsunaz2132-5ctmh-worker-westus-dhwfz   Running    Standard_D4s_v3   westus          152m
zhsunaz2132-5ctmh-worker-westus-dw895   Running    Standard_D4s_v3   westus          152m
zhsunaz2132-5ctmh-worker-westus-xlsgm   Running    Standard_D4s_v3   westus          152m

$ oc describe machine zhsunaz2132-5ctmh-master-flqqr-0  -n openshift-machine-api |grep -i "Load Balancer"
      Internal Load Balancer:  zhsunaz2132-5ctmh-internal
      Public Load Balancer:      zhsunaz2132-5ctmh

$ oc get node            
NAME                                    STATUS     ROLES                  AGE    VERSION
zhsunaz2132-5ctmh-master-0              Ready      control-plane,master   165m   v1.26.0+149fe52
zhsunaz2132-5ctmh-master-1              Ready      control-plane,master   165m   v1.26.0+149fe52
zhsunaz2132-5ctmh-master-2              Ready      control-plane,master   165m   v1.26.0+149fe52
zhsunaz2132-5ctmh-master-flqqr-0        NotReady   control-plane,master   109m   v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-dhwfz   Ready      worker                 152m   v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-dw895   Ready      worker                 152m   v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-xlsgm   Ready      worker                 152m   v1.26.0+149fe52
$ oc describe node zhsunaz2132-5ctmh-master-flqqr-0
  Warning  ErrorReconcilingNode       3m5s (x181 over 108m)  controlplane         [k8s.ovn.org/node-chassis-id annotation not found for node zhsunaz2132-5ctmh-master-flqqr-0, macAddress annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0" , k8s.ovn.org/l3-gateway-config annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0"]

$ oc get po --all-namespaces | grep ImagePullBackOf   
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-l8ng4                                  0/3     Init:ImagePullBackOff   0              113m
openshift-cluster-csi-drivers                      azure-file-csi-driver-node-99k82                                  0/3     Init:ImagePullBackOff   0              113m
openshift-cluster-node-tuning-operator             tuned-bvvh7                                                       0/1     ImagePullBackOff        0              113m
openshift-dns                                      node-resolver-2p4zq                                               0/1     ImagePullBackOff        0              113m
openshift-image-registry                           node-ca-vxv87                                                     0/1     ImagePullBackOff        0              113m
openshift-machine-config-operator                  machine-config-daemon-crt5w                                       1/2     ImagePullBackOff        0              113m
openshift-monitoring                               node-exporter-mmjsm                                               0/2     Init:ImagePullBackOff   0              113m
openshift-multus                                   multus-4cg87                                                      0/1     ImagePullBackOff        0              113m
openshift-multus                                   multus-additional-cni-plugins-mc6vx                               0/1     Init:ImagePullBackOff   0              113m
openshift-ovn-kubernetes                           ovnkube-master-qjjsv                                              0/6     ImagePullBackOff        0              113m
openshift-ovn-kubernetes                           ovnkube-node-k8w6j                                                0/6     ImagePullBackOff        0              113m

Expected results:

The master is replaced successfully.

Additional info:

Tested payload 4.13.0-0.nightly-2023-02-03-145213, same result.
When we tested earlier with 4.13.0-0.nightly-2023-01-27-165107, everything worked well.

Description of problem:

The Helm view in the Dev console doesn't allow you to edit Helm repositories through the "Edit" option in the three-dots menu. It results in a 404.

Prerequisites (if any, like setup, operators/versions):

Tried in 4.13 only, not sure if other versions are affected

Steps to Reproduce

1. Create a new Helm chart repository (/ns/<NAMESPACE>/helmchartrepositories/~new/form endpoint)
2. List all the custom Helm repositories ( /helm-releases/ns/<NAMESPACE>/repositories endpoint)
3. Click three dots menu on the right of any chart repository and select "Edit ProjectHelmChartRepository" (leads to /k8s/ns/<NAMESPACE>/helmchartrepositories/<REPO_NAME>/edit)
4. You land on 404 page

Actual results:

404 page, see the attached GIF

Expected results:

Edit view

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

Observed in OCP 4.13 (Dev sandbox and OpenShift Local)

Workaround:

Follow steps 1 and 2 from the reproducer above, then:
3. Click on Helm repository name
4. Click YAML tab to edit resource (/k8s/ns/<NAMESPACE>/helm.openshift.io~v1beta1~ProjectHelmChartRepository/<REPO_NAME>/yaml endpoint)

Description of the problem:

Since MGMT-13083 merged, disconnected jobs are failing in the ephemeral installer (specifically e2e-agent-sno-ipv6 and e2e-agent-ha-dualstack). Preparing for installation fails because we can't get the installer binary:
 

Apr 21 10:00:43 master-0 service[2298]: time="2023-04-20T22:00:43Z" level=info msg="Successfully extracted openshift-baremetal-install binary from the release to: /data/install-config-generate/installercache/virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install" func="github.com/openshift/assisted-service/internal/oc.(*release).extractFromRelease" file="/src/internal/oc/release.go:376" cluster_id=a3945e90-44a8-436c-89ad-12d3a5820a26 go-id=18956 request_id=
Apr 21 10:00:43 master-0 service[2298]: time="2023-04-20T22:00:43Z" level=error msg="failed generating install config for cluster a3945e90-44a8-436c-89ad-12d3a5820a26" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).generateClusterInstallConfig" file="/src/internal/bminventory/inventory.go:1738" cluster_id=a3945e90-44a8-436c-89ad-12d3a5820a26 error="failed to get installer path: Failed to create hard link to binary /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install: link /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/ln_1682028043_openshift-baremetal-install: no such file or directory" go-id=18956 pkg=Inventory request_id=
Apr 21 10:00:43 master-0 service[2298]: time="2023-04-20T22:00:43Z" level=warning msg="Cluster installation initialization failed" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).InstallClusterInternal.func3.1" file="/src/internal/bminventory/inventory.go:1339" cluster_id=a3945e90-44a8-436c-89ad-12d3a5820a26 error="failed generating install config for cluster a3945e90-44a8-436c-89ad-12d3a5820a26: failed to get installer path: Failed to create hard link to binary /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install: link /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/ln_1682028043_openshift-baremetal-install: no such file or directory" go-id=18932 pkg=Inventory request_id=ca799c5a-c798-4a93-9bf8-7f27ed93ca20
Apr 21 10:00:43 master-0 service[2298]: time="2023-04-20T22:00:43Z" level=warning msg="Failed to prepare installation of cluster a3945e90-44a8-436c-89ad-12d3a5820a26" func="github.com/openshift/assisted-service/internal/cluster.(*Manager).HandlePreInstallError" file="/src/internal/cluster/cluster.go:985" cluster_id=a3945e90-44a8-436c-89ad-12d3a5820a26 error="failed generating install config for cluster a3945e90-44a8-436c-89ad-12d3a5820a26: failed to get installer path: Failed to create hard link to binary /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install: link /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/ln_1682028043_openshift-baremetal-install: no such file or directory" go-id=18956 pkg=cluster-state request_id=

The issue appears to be that we extract the binary to a path including the mirror registry (installercache/virthost.ostest.test.metalkube.org:5000/localimages/local-release-image) but then look for it at a path representing the original pullspec (installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release)
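
The mismatch can be illustrated with a small sketch that derives the cache path from a pullspec. The names resolve_pullspec and installer_path and the MIRRORS mapping are hypothetical; the paths and pullspecs are taken from the logs above. The point is only that extraction and lookup must use the same key, not that this is how assisted-service implements it.

```python
import posixpath

CACHE_ROOT = "/data/install-config-generate/installercache"
DIGEST = "sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58"
ORIGINAL = "registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@" + DIGEST

# Hypothetical mirror mapping, as would come from ImageContentSources.
MIRRORS = {
    "registry.build05.ci.openshift.org/ci-op-1w73h6fv/release":
        "virthost.ostest.test.metalkube.org:5000/localimages/local-release-image",
}


def resolve_pullspec(pullspec: str, mirrors: dict) -> str:
    """Map a pullspec to its mirror, keeping the digest unchanged."""
    repo, sep, digest = pullspec.partition("@")
    return mirrors.get(repo, repo) + sep + digest


def installer_path(cache_root: str, pullspec: str) -> str:
    """Cache location of the installer binary for a given pullspec."""
    return posixpath.join(cache_root, pullspec, "openshift-baremetal-install")


# Resolve once, then use the same key for both extraction and lookup.
resolved = resolve_pullspec(ORIGINAL, MIRRORS)
extracted_at = installer_path(CACHE_ROOT, resolved)
looked_up_at = installer_path(CACHE_ROOT, resolved)
```

Using ORIGINAL for the lookup while extracting under the resolved mirror pullspec yields two different directories, which matches the "no such file or directory" error in the logs.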

How reproducible:

 100%

Steps to reproduce:

1. Use the agent-based installer to install using a disconnected mirror registry in the ImageContentSources.

Actual results:

 Installation never starts; we just see a loop of:

level=debug msg=Host worker-0: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)
level=debug msg=Host worker-1: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)
level=debug msg=Host master-0: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)
level=debug msg=Host master-1: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)
level=debug msg=Host master-2: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)
level=debug msg=Host worker-0: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
level=debug msg=Host worker-1: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
level=debug msg=Host master-0: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
level=debug msg=Host master-1: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
level=debug msg=Host master-2: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
level=debug msg=Host worker-0: updated status from preparing-successful to known (Host is ready to be installed)
level=debug msg=Host worker-1: updated status from preparing-successful to known (Host is ready to be installed)
level=debug msg=Host master-0: updated status from preparing-successful to known (Host is ready to be installed)
level=debug msg=Host master-1: updated status from preparing-successful to known (Host is ready to be installed)
level=debug msg=Host master-2: updated status from preparing-successful to known (Host is ready to be installed)

Expected results:

Cluster is installed.

Description of problem:

IngressVIP is getting attached to two nodes at once.

Version-Release number of selected component (if applicable):

4.11.39

How reproducible:

Always in customer cluster

Actual results:

IngressVIP is getting attached to two nodes at once.

Expected results:

The IngressVIP should be attached to only one node.

Additional info:

 

This is a clone of issue OCPBUGS-18954. The following is the description of the original issue:

Description of problem:

While installing 3618 SNOs via ZTP using ACM 2.9, 15 clusters failed to complete install and have failed on the cluster-autoscaler operator. This represents the bulk of all cluster install failures in this testbed for OCP 4.14.0-rc.0.


# cat aci.InstallationFailed.autoscaler  | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion --no-headers "
vm00527 version         False   True   20h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm00717 version         False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm00881 version         False   True   19h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm00998 version         False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01006 version         False   True   17h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01059 version         False   True   15h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01155 version         False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01930 version         False   True   17h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm02407 version         False   True   16h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm02651 version         False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03073 version         False   True   19h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03258 version         False   True   20h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03295 version         False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03303 version         False   True   15h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03517 version         False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available

Version-Release number of selected component (if applicable):

Hub 4.13.11
Deployed SNOs 4.14.0-rc.0
ACM 2.9 - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52

How reproducible:

15 out of 20 failures (75% of the failures)
15 out of 3618 total attempted SNO installs (~0.4% of all installs)
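
The two rates quoted above can be recomputed with a short Python sketch. The counts come from this report; the helper name failure_share is only illustrative.

```python
def failure_share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Counts taken from the report above.
autoscaler_failures = 15   # clusters stuck on cluster-autoscaler
total_failures = 20        # all failed SNO installs
total_installs = 3618      # all attempted SNO installs

share_of_failures = failure_share(autoscaler_failures, total_failures)
share_of_installs = failure_share(autoscaler_failures, total_installs)

print(f"{share_of_failures:.0f}% of the failures")
print(f"{share_of_installs:.2f}% of all installs")
```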

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Some of the failed clusters show an error in the cluster-autoscaler-operator logs. Example:

I0912 19:54:39.962897       1 main.go:15] Go Version: go1.20.5 X:strictfipsruntime
I0912 19:54:39.962977       1 main.go:16] Go OS/Arch: linux/amd64
I0912 19:54:39.962982       1 main.go:17] Version: cluster-autoscaler-operator v4.14.0-202308301903.p0.gb57f5a9.assembly.stream-dirty
I0912 19:54:39.963137       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0912 19:54:39.975478       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"="127.0.0.1:9191"
I0912 19:54:39.976939       1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-clusterautoscalers"
I0912 19:54:39.976984       1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-machineautoscalers"
I0912 19:54:39.977082       1 main.go:41] Starting cluster-autoscaler-operator
I0912 19:54:39.977216       1 server.go:216] controller-runtime/webhook/webhooks "msg"="Starting webhook server" 
I0912 19:54:39.977693       1 certwatcher.go:161] controller-runtime/certwatcher "msg"="Updated current TLS certificate" 
I0912 19:54:39.977813       1 server.go:273] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=8443
I0912 19:54:39.977938       1 certwatcher.go:115] controller-runtime/certwatcher "msg"="Starting certificate watcher" 
I0912 19:54:39.978008       1 server.go:50]  "msg"="starting server" "addr"={"IP":"127.0.0.1","Port":9191,"Zone":""} "kind"="metrics" "path"="/metrics"
I0912 19:54:39.978052       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader...
I0912 19:54:39.982052       1 leaderelection.go:255] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
I0912 19:54:39.983412       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ClusterAutoscaler"
I0912 19:54:39.983462       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Deployment"
I0912 19:54:39.983483       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Service"
I0912 19:54:39.983501       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ServiceMonitor"
I0912 19:54:39.983520       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.PrometheusRule"
I0912 19:54:39.983532       1 controller.go:185]  "msg"="Starting Controller" "controller"="cluster_autoscaler_controller"
I0912 19:54:39.986041       1 controller.go:177]  "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *v1beta1.MachineAutoscaler"
I0912 19:54:39.986065       1 controller.go:177]  "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *unstructured.Unstructured"
I0912 19:54:39.986072       1 controller.go:185]  "msg"="Starting Controller" "controller"="machine_autoscaler_controller"
I0912 19:54:40.095808       1 webhookconfig.go:72] Webhook configuration status: created
I0912 19:54:40.101613       1 controller.go:219]  "msg"="Starting workers" "controller"="cluster_autoscaler_controller" "worker count"=1
I0912 19:54:40.102857       1 controller.go:219]  "msg"="Starting workers" "controller"="machine_autoscaler_controller" "worker count"=1
E0912 19:58:48.113290       1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
E0912 20:02:48.135610       1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
E0913 13:49:02.118757       1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused


Description of problem:

Terraform does not create VMs for master and worker nodes on UPI vSphere when var.control_plane_ip_addresses and var.compute_ip_addresses are unset. This happens when users rely on IPAM (as before) to reserve IPs instead of setting static IPs directly in var.control_plane_ip_addresses and var.compute_ip_addresses. Based on upstream code #1 and #2, the master and worker counts are then always 0, so Terraform does not create any VMs for master and worker nodes. Changing the code as shown below makes the IPAM case work as before.

Current code:
control_plane_fqdns = [for idx in range(length(var.control_plane_ip_addresses)) : "control-plane-${idx}.${var.cluster_domain}"]
compute_fqdns = [for idx in range(length(var.compute_ip_addresses)) : "compute-${idx}.${var.cluster_domain}"]

Proposed code:
control_plane_fqdns = [for idx in range(var.control_plane_count) : "control-plane-${idx}.${var.cluster_domain}"]
compute_fqdns = [for idx in range(var.compute_count) : "compute-${idx}.${var.cluster_domain}"]
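
The difference between the two expressions can be illustrated outside Terraform. The following Python sketch mirrors the list comprehensions above; the function names and the example.test domain are hypothetical and only serve to show why an empty IP address list yields zero machines.

```python
from typing import List


def fqdns_from_ip_list(prefix: str, ip_addresses: List[str], cluster_domain: str) -> List[str]:
    """Mirror range(length(var.<role>_ip_addresses)): one name per listed IP."""
    return [f"{prefix}-{idx}.{cluster_domain}" for idx in range(len(ip_addresses))]


def fqdns_from_count(prefix: str, count: int, cluster_domain: str) -> List[str]:
    """Mirror range(var.<role>_count): one name per requested machine."""
    return [f"{prefix}-{idx}.{cluster_domain}" for idx in range(count)]


cluster_domain = "example.test"  # hypothetical domain, for illustration only
ipam_case_ips: List[str] = []    # IPAM case: no static addresses are set
control_plane_count = 3

# Deriving names from the empty IP list produces nothing, so no VMs follow.
before = fqdns_from_ip_list("control-plane", ipam_case_ips, cluster_domain)
# Deriving names from the count still produces one name per machine.
after = fqdns_from_count("control-plane", control_plane_count, cluster_domain)

print(before)
print(after)
```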

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-11-033820

How reproducible:

always

Steps to Reproduce:

1. Trigger a job to install a cluster on vSphere with UPI.
2. If the IPs for the master and worker VMs are obtained from an IPAM server instead of being set statically in var.control_plane_ip_addresses and var.compute_ip_addresses, VM creation fails.

Actual results:

VM creation fails.

Expected results:

VM creation succeeds.

Additional info:

#1 link:https://github.com/openshift/installer/blob/master/upi/vsphere/main.tf#L15-L16
#2 link:https://github.com/openshift/installer/blob/master/upi/vsphere/main.tf#L211
This bug only affects UPI vSphere installations where the user relies on an IPAM server to reserve static IPs instead of setting static IPs directly in var.control_plane_ip_addresses and var.compute_ip_addresses. It does not currently affect QE testing, because we still install with the previous code.

 

Description of problem:

The "Create BuildConfig" button on the Dev console Builds page opens the form view, but in the default namespace.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Go to the Dev perspective
2. Click on Builds
3. Click on "Create BuildConfig"

Actual results:

"default" namespace is selected in the namespace selector

Expected results:

It should open the form in the active namespace

Additional info:

 

Description of problem:

In hypershift context:
Operands managed by Operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/
https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265

These operands, which run on the management side, should honour the same affinity, tolerations, node selector and priority rules as the operator.
This could be done by looking at the operator deployment itself or at the HCP resource.

aws-ebs-csi-driver-controller
aws-ebs-csi-driver-operator
csi-snapshot-controller
csi-snapshot-webhook


Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a hypershift cluster.
2. Check affinity rules and node selector of the operands above.
3.

Actual results:

Operands are missing affinity rules and node selector

Expected results:

Operands have the same affinity rules and node selector as the operator

Additional info:

 

The aggregated https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-ovn-rt-upgrade-4.14-minor-release-openshift-release-analysis-aggregator/1633554110798106624 job failed.  Digging into one of them:

 

This MCD log has https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1633554106595414016/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-p2vf4_machine-config-daemon.log

 

Deployments:
* ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
                   Digest: sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
                  Version: 413.86.202302230536-0 (2023-03-08T20:10:47Z)
      RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-372.43.1.el8_6
          LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                           kernel-rt-modules-extra
...
E0308 22:11:21.925030 74176 writer.go:200] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1: error: Importing: remote error: fetching blob: received unexpected HTTP status: 500 Internal Server Error
... 
I0308 22:11:36.959143   74176 update.go:2010] Running: rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra
...
E0308 22:12:35.525156   74176 writer.go:200] Marking Degraded due to: error running rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra: error: Package/capability 'kernel-rt-core' is not currently requested
: exit status 1
  

 

Something is going wrong here in our retry loop.   I think it might be that we don't clear the pending deployment on failure.  IOW we need to

rpm-ostree cleanup -p 

before we retry.
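
The suggested ordering (clean up the pending deployment, then retry) can be sketched as follows. This is a minimal Python illustration, not the machine-config-daemon code: the function names, the attempt count and the image reference are assumptions, and the demonstration uses a fake runner so that rpm-ostree is not actually invoked.

```python
import subprocess
from typing import Callable, List, Sequence


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit status without raising on failure."""
    return subprocess.run(list(argv), check=False).returncode


def rebase_with_cleanup(
    rebase_cmd: Sequence[str],
    runner: Callable[[Sequence[str]], int] = run_command,
    attempts: int = 3,
) -> bool:
    """Try the rebase up to `attempts` times, cleaning up after each failure."""
    cleanup_cmd = ["rpm-ostree", "cleanup", "-p"]
    for _ in range(attempts):
        if runner(rebase_cmd) == 0:
            return True
        # A failed attempt may leave a pending deployment behind; drop it
        # so that the next attempt starts from a clean state.
        runner(cleanup_cmd)
    return False


# Demonstration with a fake runner: the first rebase fails, the retry succeeds.
calls: List[List[str]] = []


def fake_runner(argv: Sequence[str]) -> int:
    calls.append(list(argv))
    rebase_attempts = sum(1 for c in calls if c[1] == "rebase")
    if argv[1] == "rebase" and rebase_attempts == 1:
        return 1  # simulate the registry error on the first attempt
    return 0


# Hypothetical image reference, for illustration only.
rebase = [
    "rpm-ostree", "rebase", "--experimental",
    "ostree-unverified-registry:registry.example/os:latest",
]
succeeded = rebase_with_cleanup(rebase, runner=fake_runner)
print(succeeded, [c[1] for c in calls])
```

Passing the runner as a parameter keeps the retry logic testable without touching the host.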

 

This is fallout from https://github.com/openshift/machine-config-operator/pull/3580 - Although I suspect it may have been an issue before too.

 

Description of problem:

The "pipelines-as-code-pipelinerun-go" configMap is not used for a Go repository when creating a Pipeline Repository; the "pipelines-as-code-pipelinerun-generic" configMap is used instead.

Prerequisites (if any, like setup, operators/versions):

Install Red Hat Pipeline operator

Steps to Reproduce

  1. Navigate to Create Repository form 
  2. Enter the Git URL `https://github.com/vikram-raj/hello-func-go`
  3. Click on Add

Actual results:

The `pipelines-as-code-pipelinerun-generic` PipelineRun template is shown on the overview page

Expected results:

`pipelines-as-code-pipelinerun-go` PipelineRun template should show on the overview page

Reproducibility (Always/Intermittent/Only Once):

Build Details:

4.13

Workaround:

Additional info:

Description of problem:

We need to export the hook function from the module that's required in the dynamic core api, otherwise an exception will be thrown if the hook is imported/used by plugins.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Plugins using this hook throw an exception.

Expected results:

The hook should be imported and function properly.

Additional info:

 

Description of problem:

Enabling IPSec doesn't result in IPsec tunnels being created

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Deploy & Enable IPSec

Steps to Reproduce:

1.
2.
3.

Actual results:

000 Total IPsec connections: loaded 0, active 0
000  
000 State Information: DDoS cookies not required, Accepting new IKE connections
000 IKE SAs: total(0), half-open(0), open(0), authenticated(0), anonymous(0)
000 IPsec SAs: total(0), authenticated(0), anonymous(0)

Expected results:

Active connections > 0

Additional info:

$ oc -n openshift-ovn-kubernetes -c nbdb rsh ovnkube-master-qw4zv ovn-nbctl --no-leader-only get nb_global . ipsec
true
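
The expected result (active connections > 0) could be checked automatically against status output like the one under "Actual results". The following Python sketch assumes that the summary line keeps the format shown above; the function name is illustrative.

```python
import re


def active_ipsec_connections(status_text: str) -> int:
    """Return the 'active' count from the 'Total IPsec connections' line."""
    match = re.search(
        r"Total IPsec connections: loaded (\d+), active (\d+)", status_text
    )
    if match is None:
        raise ValueError("no connection summary line found")
    return int(match.group(2))


# Sample copied from the actual results in this report.
sample = """000 Total IPsec connections: loaded 0, active 0
000 IKE SAs: total(0), half-open(0), open(0), authenticated(0), anonymous(0)
000 IPsec SAs: total(0), authenticated(0), anonymous(0)
"""

tunnels_up = active_ipsec_connections(sample) > 0
print(tunnels_up)
```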

Description of problem:

When installing OCP on AWS, a user can set the metadataService authentication to Required in order to use IMDSv2; in that case the user requires all VMs to use it.
Currently the bootstrap machine always runs with Optional, which can be blocked in the user's AWS account and cause the installation to fail.

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

Install aws cluster and set metadataService to Required

Steps to Reproduce:

1.
2.
3.

Actual results:

Bootstrap has IMDSv2 set to optional

Expected results:

All VMs have IMDSv2 set to required

Additional info:

 

Description of problem:

The newly introduced `--idms-file` flag in oc image extract is incorrectly mapped to the ICSPFile object instead of IDMSFile

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

SNO installation performed with the assisted-installer failed 

Version-Release number of selected component (if applicable):

4.10.32
# oc get co authentication -o yaml
- lastTransitionTime: '2023-01-30T00:51:11Z'
    message: 'IngressStateEndpointsDegraded: No subsets found for the endpoints of
      oauth-server      OAuthServerConfigObservationDegraded: secret "v4-0-config-system-router-certs"
      not found      OAuthServerDeploymentDegraded: 1 of 1 requested instances are unavailable for
      oauth-openshift.openshift-authentication (container is waiting in pending oauth-openshift-58b978d7f8-s6x4b
      pod)      OAuthServerRouteEndpointAccessibleControllerDegraded: secret "v4-0-config-system-router-certs"

# oc logs ingress-operator-xxx-yyy -c ingress-operator 
2023-01-30T08:14:13.701799050Z 2023-01-30T08:14:13.701Z ERROR   operator.certificate_publisher_controller       certificate-publisher/controller.go:80  failed to list ingresscontrollers for secret    {"related": "", "error": "Index with name field:defaultCertificateName does not exist"}

Restarting the ingress-operator pod helped fix the issue, but a permanent fix is required.

A bug (https://bugzilla.redhat.com/show_bug.cgi?id=2005351) was filed earlier but was closed due to inactivity.

 

 

Description of problem:

Add storage admission plugin "storage.openshift.io/CSIInlineVolumeSecurity"

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create an OCP cluster v4.13
2. Check the config map kas-config

Actual results:

The CM does not include "storage.openshift.io/CSIInlineVolumeSecurity" storage plugin

Expected results:

The plugin should be included

Additional info:

 

Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/195

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Fix cnf compute tests to check scheduler settings under /sys/kernel/debug/sched/ 

Version-Release number of selected component (if applicable):

4.14

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/355

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The NetworkPolicyLegacy tests time out on the bump PR; the latest is https://github.com/openshift/origin/pull/27912
Job example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27912/pull-ci-openshift-origin-master-e2e-gcp-ovn/1655997089001246720

The problem appears to be the 15 min timeout: the test fails with "Interrupted by User". I think this is the change that affected it: https://github.com/kubernetes/kubernetes/pull/112923.

From what I saw in the logs, "testCannotConnect" reaches its 5 min timeout instead of completing in ~45 sec, based on the client pod command. But this is NetworkPolicyLegacy, so I am not sure how much time we want to spend debugging it.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Slack thread https://redhat-internal.slack.com/archives/C04UQLWQAP3/p1683640905643069

This is a clone of issue OCPBUGS-17682. The following is the description of the original issue:

Description of problem:

Both the in-cluster prometheus-operator and the UWM prometheus-operator pods are scheduled to master nodes; see:

https://github.com/openshift/cluster-monitoring-operator/blob/release-4.14/assets/prometheus-operator/deployment.yaml#L88-L97

https://github.com/openshift/cluster-monitoring-operator/blob/release-4.14/assets/prometheus-operator-user-workload/deployment.yaml#L91-L103

After enabling UWM and adding topologySpreadConstraints for both the in-cluster prometheus-operator and the UWM prometheus-operator (with topologyKey set to node-role.kubernetes.io/master), the topologySpreadConstraints take effect for the in-cluster prometheus-operator but not for the UWM prometheus-operator.

apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusOperator:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: node-role.kubernetes.io/master
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring

For the in-cluster prometheus-operator, the topologySpreadConstraints settings are loaded to the prometheus-operator pod and deployment:

$ oc -n openshift-monitoring get deploy prometheus-operator -oyaml | grep topologySpreadConstraints -A7
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator
        maxSkew: 1
        topologyKey: node-role.kubernetes.io/master
        whenUnsatisfiable: DoNotSchedule
      volumes:

$ oc -n openshift-monitoring get pod -l app.kubernetes.io/name=prometheus-operator -o wide
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                                 NOMINATED NODE   READINESS GATES
prometheus-operator-65496d5b78-fb9nq   2/2     Running   0          105s   10.128.0.71   juzhao-0813-szb9h-master-0.c.openshift-qe.internal   <none>           <none>

$ oc -n openshift-monitoring get pod prometheus-operator-65496d5b78-fb9nq -oyaml | grep topologySpreadConstraints -A7
    topologySpreadConstraints:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus-operator
      maxSkew: 1
      topologyKey: node-role.kubernetes.io/master
      whenUnsatisfiable: DoNotSchedule
    volumes: 

However, the topologySpreadConstraints settings are not loaded to the UWM prometheus-operator pod and deployment:

$ oc -n openshift-user-workload-monitoring get cm user-workload-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    prometheusOperator:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: node-role.kubernetes.io/master
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator
kind: ConfigMap
metadata:
  creationTimestamp: "2023-08-14T08:10:49Z"
  labels:
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/part-of: openshift-monitoring
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
  resourceVersion: "212490"
  uid: 048f91cb-4da6-4b1b-9e1f-c769096ab88c

$ oc -n openshift-user-workload-monitoring get deploy prometheus-operator -oyaml | grep topologySpreadConstraints -A7
no result

$ oc -n openshift-user-workload-monitoring get pod -l app.kubernetes.io/name=prometheus-operator
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-77bcdcbd9c-m5x8z   2/2     Running   0          15m

$ oc -n openshift-user-workload-monitoring get pod prometheus-operator-77bcdcbd9c-m5x8z -oyaml | grep topologySpreadConstraints
no result 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-11-055332

How reproducible:

always

Steps to Reproduce:

1. see the description
2.
3.

Actual results:

topologySpreadConstraints settings are not loaded to UWM prometheus-operator pod and deployment

Expected results:

topologySpreadConstraints settings are loaded to the UWM prometheus-operator pod and deployment
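
The grep-based check used in the description can also be expressed as a small script over the JSON form of the deployment (for example, the output of oc get deploy with -o json). The Python sketch below is only an illustration; the two sample documents are trimmed, hypothetical stand-ins for the in-cluster and UWM deployments.

```python
import json


def spread_constraints(deployment: dict) -> list:
    """Return the pod template's topologySpreadConstraints, or [] if unset."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    return pod_spec.get("topologySpreadConstraints") or []


# Trimmed, hypothetical stand-ins for the two deployments in this report.
in_cluster = json.loads("""
{"spec": {"template": {"spec": {"topologySpreadConstraints": [
  {"maxSkew": 1,
   "topologyKey": "node-role.kubernetes.io/master",
   "whenUnsatisfiable": "DoNotSchedule"}
]}}}}
""")
uwm = json.loads('{"spec": {"template": {"spec": {"volumes": []}}}}')

in_cluster_keys = [c["topologyKey"] for c in spread_constraints(in_cluster)]
uwm_keys = [c["topologyKey"] for c in spread_constraints(uwm)]

print(in_cluster_keys)
print(uwm_keys)
```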

This is a clone of issue OCPBUGS-17391. The following is the description of the original issue:

The pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration job recently started failing because the
ovnkube-master daemonset does not finish rolling out within 360s.

The must-gather, which is collected a few minutes after the test
failure, shows that the daemonset is still not ready, so I believe that
increasing the timeout is not the answer.

Some debug info:

 

static-kas git:(master) oc --kubeconfig=/tmp/kk get daemonsets -A
NAMESPACE                                NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                         AGE
openshift-cluster-csi-drivers            aws-ebs-csi-driver-node           6         6         6       6            6           kubernetes.io/os=linux                                                8h
openshift-cluster-node-tuning-operator   tuned                             6         6         6       6            6           kubernetes.io/os=linux                                                8h
openshift-dns                            dns-default                       6         6         6       6            6           kubernetes.io/os=linux                                                8h
openshift-dns                            node-resolver                     6         6         6       6            6           kubernetes.io/os=linux                                                8h
openshift-image-registry                 node-ca                           6         6         6       6            6           kubernetes.io/os=linux                                                8h
openshift-ingress-canary                 ingress-canary                    3         3         3       3            3           kubernetes.io/os=linux                                                8h
openshift-machine-api                    machine-api-termination-handler   0         0         0       0            0           kubernetes.io/os=linux,machine.openshift.io/interruptible-instance=   8h
openshift-machine-config-operator        machine-config-daemon             6         6         6       6            6           kubernetes.io/os=linux                                                8h
openshift-machine-config-operator        machine-config-server             3         3         3       3            3           node-role.kubernetes.io/master=                                       8h
openshift-monitoring                     node-exporter                     6         6         6       6            6           kubernetes.io/os=linux                                                8h
openshift-multus                         multus                            6         6         6       6            6           kubernetes.io/os=linux                                                9h
openshift-multus                         multus-additional-cni-plugins     6         6         6       6            6           kubernetes.io/os=linux                                                9h
openshift-multus                         network-metrics-daemon            6         6         6       6            6           kubernetes.io/os=linux                                                9h
openshift-network-diagnostics            network-check-target              6         6         6       6            6           beta.kubernetes.io/os=linux                                           9h
openshift-ovn-kubernetes                 ovnkube-master                    3         3         2       2            2           beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=           9h
openshift-ovn-kubernetes                 ovnkube-node                      6         6         6       6            6           beta.kubernetes.io/os=linux                                           9h
Name:           ovnkube-master
Selector:       app=ovnkube-master
Node-Selector:  beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=
Labels:         networkoperator.openshift.io/generates-operator-status=stand-alone
Annotations:    deprecated.daemonset.template.generation: 3
                kubernetes.io/description: This daemonset launches the ovn-kubernetes controller (master) networking components.
                networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14
                networkoperator.openshift.io/hybrid-overlay-status: disabled
                networkoperator.openshift.io/ip-family-mode: single-stack
                release.openshift.io/version: 4.14.0-0.ci.test-2023-08-04-123014-ci-op-c6fp05f4-latest
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=ovnkube-master
                    component=network
                    kubernetes.io/os=linux
                    openshift.io/component=network
                    ovn-db-pod=true
                    type=infra
  Annotations:      networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14
                    networkoperator.openshift.io/hybrid-overlay-status: disabled
                    networkoperator.openshift.io/ip-family-mode: single-stack
                    target.workload.openshift.io/management: {"effect": "PreferredDuringScheduling"}
  Service Account:  ovn-kubernetes-controller

 

It seems that one pod is not coming up all the way; that pod has
two containers that are not ready (sbdb and nbdb). Logs from those containers are below:

 

static-kas git:(master) oc --kubeconfig=/tmp/kk describe pod ovnkube-master-7qlm5 -n openshift-ovn-kubernetes | rg '^ [a-z].*:|Ready'
northd:
Ready: True
nbdb:
Ready: False
kube-rbac-proxy:
Ready: True
sbdb:
Ready: False
ovnkube-master:
Ready: True
ovn-dbchecker:
Ready: True
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c sbdb
2023-08-04T13:08:49.127480354Z + [[ -f /env/_master ]]
2023-08-04T13:08:49.127562165Z + trap quit TERM INT
2023-08-04T13:08:49.127609496Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes
2023-08-04T13:08:49.127637926Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
2023-08-04T13:08:49.127637926Z + transport=ssl
2023-08-04T13:08:49.127645167Z + ovn_raft_conn_ip_url_suffix=
2023-08-04T13:08:49.127682687Z + [[ 10.0.42.108 == \: ]]
2023-08-04T13:08:49.127690638Z + db=sb
2023-08-04T13:08:49.127690638Z + db_port=9642
2023-08-04T13:08:49.127712038Z + ovn_db_file=/etc/ovn/ovnsb_db.db
2023-08-04T13:08:49.127854181Z + [[ ! ssl:10.0.102.2:9642,ssl:10.0.42.108:9642,ssl:10.0.74.128:9642 =~ .:10\.0\.42\.108:. ]]
2023-08-04T13:08:49.128199437Z ++ bracketify 10.0.42.108
2023-08-04T13:08:49.128237768Z ++ case "$1" in
2023-08-04T13:08:49.128265838Z ++ echo 10.0.42.108
2023-08-04T13:08:49.128493242Z + OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
2023-08-04T13:08:49.128535253Z + CLUSTER_INITIATOR_IP=10.0.102.2
2023-08-04T13:08:49.128819438Z ++ date -Iseconds
2023-08-04T13:08:49.130157063Z 2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2
2023-08-04T13:08:49.130170893Z + echo '2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2'
2023-08-04T13:08:49.130170893Z + initialize=false
2023-08-04T13:08:49.130179713Z + [[ ! -e /etc/ovn/ovnsb_db.db ]]
2023-08-04T13:08:49.130318475Z + [[ false == \t\r\u\e ]]
2023-08-04T13:08:49.130406657Z + wait 9
2023-08-04T13:08:49.130493659Z + exec /usr/share/ovn/scripts/ovn-ctl --db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-sb-log=-vconsole:info -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' run_sb_ovsdb
2023-08-04T13:08:49.208399304Z 2023-08-04T13:08:49.208Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log
2023-08-04T13:08:49.213507987Z ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed (No such file or directory)
2023-08-04T13:08:49.224890005Z 2023-08-04T13:08:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2023-08-04T13:08:49.224912156Z 2023-08-04T13:08:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt failed (No such file or directory)
2023-08-04T13:08:49.255474964Z 2023-08-04T13:08:49.255Z|00002|raft|INFO|local server ID is 7f92
2023-08-04T13:08:49.333342909Z 2023-08-04T13:08:49.333Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
2023-08-04T13:08:49.348948944Z 2023-08-04T13:08:49.348Z|00004|reconnect|INFO|ssl:10.0.102.2:9644: connecting...
2023-08-04T13:08:49.349002565Z 2023-08-04T13:08:49.348Z|00005|reconnect|INFO|ssl:10.0.74.128:9644: connecting...
2023-08-04T13:08:49.352510569Z 2023-08-04T13:08:49.352Z|00006|reconnect|INFO|ssl:10.0.102.2:9644: connected
2023-08-04T13:08:49.353870484Z 2023-08-04T13:08:49.353Z|00007|reconnect|INFO|ssl:10.0.74.128:9644: connected
2023-08-04T13:08:49.889326777Z 2023-08-04T13:08:49.889Z|00008|raft|INFO|server 2501 is leader for term 5
2023-08-04T13:08:49.890316765Z 2023-08-04T13:08:49.890Z|00009|raft|INFO|rejecting append_request because previous entry 5,1538 not in local log (mismatch past end of log)
2023-08-04T13:08:49.891199951Z 2023-08-04T13:08:49.891Z|00010|raft|INFO|rejecting append_request because previous entry 5,1539 not in local log (mismatch past end of log)
2023-08-04T13:08:50.225632838Z 2023-08-04T13:08:50Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2023-08-04T13:08:50.225677739Z 2023-08-04T13:08:50Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
2023-08-04T13:08:50.227772827Z Waiting for OVN_Southbound to come up.
2023-08-04T13:08:55.716284614Z 2023-08-04T13:08:55.716Z|00011|raft|INFO|ssl:10.0.74.128:43498: learned server ID 3dff
2023-08-04T13:08:55.716323395Z 2023-08-04T13:08:55.716Z|00012|raft|INFO|ssl:10.0.74.128:43498: learned remote address ssl:10.0.74.128:9644
2023-08-04T13:08:55.724570375Z 2023-08-04T13:08:55.724Z|00013|raft|INFO|ssl:10.0.102.2:47804: learned server ID 2501
2023-08-04T13:08:55.724599466Z 2023-08-04T13:08:55.724Z|00014|raft|INFO|ssl:10.0.102.2:47804: learned remote address ssl:10.0.102.2:9644
2023-08-04T13:08:59.348572779Z 2023-08-04T13:08:59.348Z|00015|memory|INFO|32296 kB peak resident set size after 10.1 seconds
2023-08-04T13:08:59.348648190Z 2023-08-04T13:08:59.348Z|00016|memory|INFO|atoms:35959 cells:31476 monitors:0 n-weak-refs:749 raft-connections:4 raft-log:1543 txn-history:100 txn-history-atoms:7100
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c nbdb 
2023-08-04T13:08:48.779743434Z + [[ -f /env/_master ]]
2023-08-04T13:08:48.779743434Z + trap quit TERM INT
2023-08-04T13:08:48.779825516Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes
2023-08-04T13:08:48.779825516Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
2023-08-04T13:08:48.779825516Z + transport=ssl
2023-08-04T13:08:48.779825516Z + ovn_raft_conn_ip_url_suffix=
2023-08-04T13:08:48.779825516Z + [[ 10.0.42.108 == \: ]]
2023-08-04T13:08:48.779825516Z + db=nb
2023-08-04T13:08:48.779825516Z + db_port=9641
2023-08-04T13:08:48.779825516Z + ovn_db_file=/etc/ovn/ovnnb_db.db
2023-08-04T13:08:48.779887606Z + [[ ! ssl:10.0.102.2:9641,ssl:10.0.42.108:9641,ssl:10.0.74.128:9641 =~ .:10\.0\.42\.108:. ]]
2023-08-04T13:08:48.780159182Z ++ bracketify 10.0.42.108
2023-08-04T13:08:48.780167142Z ++ case "$1" in
2023-08-04T13:08:48.780172102Z ++ echo 10.0.42.108
2023-08-04T13:08:48.780314224Z + OVN_ARGS='--db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
2023-08-04T13:08:48.780314224Z + CLUSTER_INITIATOR_IP=10.0.102.2
2023-08-04T13:08:48.780518588Z ++ date -Iseconds
2023-08-04T13:08:48.781738820Z 2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108
2023-08-04T13:08:48.781753021Z + echo '2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108'
2023-08-04T13:08:48.781753021Z + initialize=false
2023-08-04T13:08:48.781753021Z + [[ ! -e /etc/ovn/ovnnb_db.db ]]
2023-08-04T13:08:48.781816342Z + [[ false == \t\r\u\e ]]
2023-08-04T13:08:48.781936684Z + wait 9
2023-08-04T13:08:48.781974715Z + exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' run_nb_ovsdb
2023-08-04T13:08:48.851644059Z 2023-08-04T13:08:48.851Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2023-08-04T13:08:48.852091247Z ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2023-08-04T13:08:48.875126148Z 2023-08-04T13:08:48.875Z|00002|raft|INFO|local server ID is c503
2023-08-04T13:08:48.911846610Z 2023-08-04T13:08:48.911Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
2023-08-04T13:08:48.918864408Z 2023-08-04T13:08:48.918Z|00004|reconnect|INFO|ssl:10.0.102.2:9643: connecting...
2023-08-04T13:08:48.918934490Z 2023-08-04T13:08:48.918Z|00005|reconnect|INFO|ssl:10.0.74.128:9643: connecting...
2023-08-04T13:08:48.923439162Z 2023-08-04T13:08:48.923Z|00006|reconnect|INFO|ssl:10.0.102.2:9643: connected
2023-08-04T13:08:48.925166154Z 2023-08-04T13:08:48.925Z|00007|reconnect|INFO|ssl:10.0.74.128:9643: connected
2023-08-04T13:08:49.861650961Z 2023-08-04T13:08:49Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2023-08-04T13:08:49.861747153Z 2023-08-04T13:08:49Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
2023-08-04T13:08:49.875272530Z 2023-08-04T13:08:49.875Z|00008|raft|INFO|server fccb is leader for term 6
2023-08-04T13:08:49.875302480Z 2023-08-04T13:08:49.875Z|00009|raft|INFO|rejecting append_request because previous entry 6,1732 not in local log (mismatch past end of log)
2023-08-04T13:08:49.876027164Z Waiting for OVN_Northbound to come up.
2023-08-04T13:08:55.694760761Z 2023-08-04T13:08:55.694Z|00010|raft|INFO|ssl:10.0.74.128:57122: learned server ID d382
2023-08-04T13:08:55.694800872Z 2023-08-04T13:08:55.694Z|00011|raft|INFO|ssl:10.0.74.128:57122: learned remote address ssl:10.0.74.128:9643
2023-08-04T13:08:55.706904913Z 2023-08-04T13:08:55.706Z|00012|raft|INFO|ssl:10.0.102.2:43230: learned server ID fccb
2023-08-04T13:08:55.706931733Z 2023-08-04T13:08:55.706Z|00013|raft|INFO|ssl:10.0.102.2:43230: learned remote address ssl:10.0.102.2:9643
2023-08-04T13:08:58.919567770Z 2023-08-04T13:08:58.919Z|00014|memory|INFO|21944 kB peak resident set size after 10.1 seconds
2023-08-04T13:08:58.919643762Z 2023-08-04T13:08:58.919Z|00015|memory|INFO|atoms:8471 cells:7481 monitors:0 n-weak-refs:200 raft-connections:4 raft-log:1737 txn-history:72 txn-history-atoms:8165
➜ static-kas git:(master)

This seems to happen very frequently now, but was not happening before around July 21st.

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration?buildId=1684628739427667968

 

Description of problem:

When attempting to add nodes to a long-lived 4.12.3 cluster, net new nodes are not able to join the cluster. They are provisioned in the cloud provider (AWS), but never actually join as a node.

Version-Release number of selected component (if applicable):

4.12.3

How reproducible:

Consistent

Steps to Reproduce:

1. On a long-lived cluster, add a new machineset

Actual results:

Machines reach "Provisioned" but don't join the cluster

Expected results:

Machines join cluster as nodes

Additional info:


Currently, the installer depends on the main assisted-service Go module. This means that we pull in all of its dependencies, which include libnmstate (the Rust one). In practice, we can't update assisted-service until at least AGENT-139 is implemented. And since the main assisted-service module and the API module should be in lockstep, we can't update to pick up recent changes to the ZTP API either.

Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/271

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

We have OCP 4.10 installed along with Tigera 3.13 with no issues. We could also update OCP to 4.11 and 4.12 along with a Tigera upgrade to 3.15 and 3.16. The upgrade works with no issue. The problem appears when we install Tigera 3.16 along with OCP 4.12 (fresh install).
Tigera support says the OCP install parameters need to be updated to accommodate new Tigera updates. Either the Terraform plug-in or the file called main.tf needs to be updated.
Please engage someone from the Red Hat OCP engineering team.

Ref doc:  https://access.redhat.com/solutions/6980264

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

install Tigera 3.16 along with OCP 4.12. (fresh install) 

Actual results:

Installation fails with the error: "rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5330750 vs. 4194304)"

Expected results:

As with 4.10, a 4.12 installation should work with Tigera Calico.

Additional info:

 

Description of problem:

According to https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html, the default number of security groups per network interface is 5 and can be raised to at most 16, so the installer should pre-check the number of custom security groups provided.

When more than 15 are provided (the maximum is 16, but the installer also creates one of its own, ${var.cluster_id}-master-sg or ${var.cluster_id}-worker-sg), the installer should quit and warn the user.

Version-Release number of selected component (if applicable):

registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-07-11-092038

How reproducible:

Always

Steps to Reproduce:

1. Set 16 Security groups IDs in compute.platform.aws.additionalSecurityGroupIDs

compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      additionalSecurityGroupIDs:
      - sg-06e63a6ad731c10cc
      - sg-054614d4f4eb5751d
      - sg-05c4fe202c8e2c28c
      - sg-0c948fa8b85bf4af1
      - sg-0cfb0c91c0b48f0de
      - sg-0eff6077ca727c921
      - sg-0d2d1f41f1ac9801c
      - sg-047c67d5decb64563
      - sg-0ee63f164c0ab8b04
      - sg-033ff80fa12e43c7f
      - sg-0ccad43754d9652cd
      - sg-04e4cbca2b5d50c3a
      - sg-0d133411fdcb0a4e0
      - sg-0b2b0e0d515b2f561
      - sg-045fde620b3e702da
      - sg-07e0493a65749973c
  replicas: 3

2. The installation fails because the workers cannot be provisioned.

Actual results:

[root@preserve-gpei-worker k_files]# oc get machines -A
NAMESPACE               NAME                                       PHASE     TYPE         REGION      ZONE         AGE
openshift-machine-api   gpei-0613g-wp7zw-master-0                  Running   m6i.xlarge   us-west-2   us-west-2a   66m
openshift-machine-api   gpei-0613g-wp7zw-master-1                  Running   m6i.xlarge   us-west-2   us-west-2b   66m
openshift-machine-api   gpei-0613g-wp7zw-master-2                  Running   m6i.xlarge   us-west-2   us-west-2a   66m
openshift-machine-api   gpei-0613g-wp7zw-worker-us-west-2a-7rszc   Failed                                          62m
openshift-machine-api   gpei-0613g-wp7zw-worker-us-west-2a-pwnvp   Failed                                          62m
openshift-machine-api   gpei-0613g-wp7zw-worker-us-west-2b-n2cs9   Failed                                          62m
[root@preserve-gpei-worker k_files]# oc describe machine gpei-0613g-wp7zw-worker-us-west-2b-n2cs9 -n openshift-machine-api
Name:         gpei-0613g-wp7zw-worker-us-west-2b-n2cs9
..
Spec:
  Lifecycle Hooks:
  Metadata:
  Provider Spec:
    Value:
      Ami:
        Id:         ami-01bfc200595c748a1
      API Version:  machine.openshift.io/v1beta1
      Block Devices:
        Ebs:
      Metadata Service Options:
      Placement:
        Availability Zone:  us-west-2b
        Region:             us-west-2
      Security Groups:
        Filters:
          Name:  tag:Name
          Values:
            gpei-0613g-wp7zw-worker-sg
        Id:  sg-033ff80fa12e43c7f
        Id:  sg-045fde620b3e702da
        Id:  sg-047c67d5decb64563
        Id:  sg-04e4cbca2b5d50c3a
        Id:  sg-054614d4f4eb5751d
        Id:  sg-05c4fe202c8e2c28c
        Id:  sg-06e63a6ad731c10cc
        Id:  sg-07e0493a65749973c
        Id:  sg-0b2b0e0d515b2f561
        Id:  sg-0c948fa8b85bf4af1
        Id:  sg-0ccad43754d9652cd
        Id:  sg-0cfb0c91c0b48f0de
        Id:  sg-0d133411fdcb0a4e0
        Id:  sg-0d2d1f41f1ac9801c
        Id:  sg-0ee63f164c0ab8b04
        Id:  sg-0eff6077ca727c921
      Subnet:
        Id:  subnet-0641814f00311bd9c
      Tags:
        Name:   kubernetes.io/cluster/gpei-0613g-wp7zw
        Value:  owned
      User Data Secret:
        Name:  worker-user-data
Status:
  Conditions:
    Last Transition Time:  2023-07-13T09:58:02Z
    Status:                True
    Type:                  Drainable
    Last Transition Time:  2023-07-13T09:58:02Z
    Message:               Instance has not been created
    Reason:                InstanceNotCreated
    Severity:              Warning
    Status:                False
    Type:                  InstanceExists
    Last Transition Time:  2023-07-13T09:58:02Z
    Status:                True
    Type:                  Terminable
  Error Message:           error launching instance: You have exceeded the maximum number of security groups allowed per network interface.

Expected results:

The installer should abort and report that the number of provided custom security groups exceeds the maximum allowed.
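
The requested pre-check could look roughly like the sketch below. It is not the installer's actual validation code: the function and constant names are hypothetical, and the limit of 16 is taken from the AWS documentation cited in this report (an account whose quota has not been raised from the default of 5 would need a lower bound).

```go
package main

import "fmt"

// maxSecurityGroupsPerENI is the upper bound per network interface quoted
// in the report (assumption: the account quota has been raised to 16).
const maxSecurityGroupsPerENI = 16

// installerManagedGroups counts the group the installer attaches itself
// (the <cluster_id>-master-sg or <cluster_id>-worker-sg group).
const installerManagedGroups = 1

// validateAdditionalSecurityGroups is a hypothetical helper that rejects an
// additionalSecurityGroupIDs list that would exceed the per-interface limit
// once the installer-managed group is added.
func validateAdditionalSecurityGroups(ids []string) error {
	limit := maxSecurityGroupsPerENI - installerManagedGroups
	if len(ids) > limit {
		return fmt.Errorf("%d additional security groups provided, at most %d are allowed", len(ids), limit)
	}
	return nil
}

func main() {
	// Sixteen IDs, as in the reproduction steps, should be rejected.
	ids := make([]string, 16)
	if err := validateAdditionalSecurityGroups(ids); err != nil {
		fmt.Println("install-config rejected:", err)
	}
}
```

Failing at validation time would surface the problem before any machines are created, instead of leaving workers in the Failed phase.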

Additional info:


Related to TRT-849, we want to write a test to see how often this is happening before we undertake a major effort to get to the bottom of it.

The test will need to process disruption across all backends, look for DNS lookup disruptions, and then see if we have overlap with non-DNS lookup disruptions within those timeframes.

We have some precedent for similar code in KubePodNotReady alerts that we handle differently if in proximity to other intervals.

The test should flake; we can then see how often it's happening in Sippy and on what platforms. With SQL we could likely narrow it down to certain build clusters as well.
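
The overlap check described above can be sketched as follows. This is only an illustration under stated assumptions: the `interval` type, its fields, and the helper names are hypothetical and do not correspond to the monitor types used by the real test suite.

```go
package main

import (
	"fmt"
	"time"
)

// interval is a hypothetical stand-in for one disruption interval.
type interval struct {
	Backend  string
	DNS      bool // true when the disruption was a DNS lookup failure
	From, To time.Time
}

// at builds a timestamp from a Unix second, to keep the examples short.
func at(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

// overlaps reports whether two half-open intervals [From, To) intersect.
func overlaps(a, b interval) bool {
	return a.From.Before(b.To) && b.From.Before(a.To)
}

// nonDNSDuringDNS returns every non-DNS disruption that overlaps at least
// one DNS lookup disruption, across all backends.
func nonDNSDuringDNS(all []interval) []interval {
	var dns, out []interval
	for _, iv := range all {
		if iv.DNS {
			dns = append(dns, iv)
		}
	}
	for _, iv := range all {
		if iv.DNS {
			continue
		}
		for _, d := range dns {
			if overlaps(iv, d) {
				out = append(out, iv)
				break
			}
		}
	}
	return out
}

func main() {
	ivs := []interval{
		{Backend: "dns", DNS: true, From: at(10), To: at(20)},
		{Backend: "kube-api", From: at(15), To: at(25)},
		{Backend: "oauth", From: at(30), To: at(40)},
	}
	for _, iv := range nonDNSDuringDNS(ivs) {
		fmt.Println("non-DNS disruption during DNS disruption:", iv.Backend)
	}
}
```

A test built on this could flake whenever the returned slice is non-empty, which is enough to measure frequency without failing jobs.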

Description of problem:

According to the Red Hat documentation https://docs.openshift.com/container-platform/4.12/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html, the maximum number of IP aliases per node is 10 - "Per node, the maximum number of IP aliases, both IPv4 and IPv6, is 10.".

Looking at the code base, the number of allowed IPs is calculated as
Capacity = defaultGCPPrivateIPCapacity (which is set to 10) + cloudPrivateIPsCount (that is number of available IPs from the range) - currentIPv4Usage (number of assigned v4 IPs) - currentIPv6Usage (number of assigned v6 IPs)
https://github.com/openshift/cloud-network-config-controller/blob/master/pkg/cloudprovider/gcp.go#L18-L22
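
As a worked example of the formula quoted above, the sketch below restates it in Go. Only `defaultGCPPrivateIPCapacity` and the three operand names come from this report; the function itself is a hypothetical restatement, not a copy of the code in gcp.go.

```go
package main

import "fmt"

// defaultGCPPrivateIPCapacity is the constant named in the report.
const defaultGCPPrivateIPCapacity = 10

// nodeCapacity restates the formula from the report: the default capacity,
// plus the cloud private IP count, minus the IPv4 and IPv6 addresses
// already in use on the node.
func nodeCapacity(cloudPrivateIPsCount, currentIPv4Usage, currentIPv6Usage int) int {
	return defaultGCPPrivateIPCapacity + cloudPrivateIPsCount - currentIPv4Usage - currentIPv6Usage
}

func main() {
	// With no extra cloud IPs, three IPv4 and one IPv6 address in use,
	// six more addresses fit on the node.
	fmt.Println(nodeCapacity(0, 3, 1))
}
```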

Speaking to GCP, they support up to 100 alias IP ranges (not IPs) per vNIC.

Can Red Hat confirm
1) If there is a limitation of 10 from OCP and why?
2) If there isn't a limit, what is the maximum number of egress IPs that could be supported per node?

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Case:  03487893
This is one of the bugs our customer has highlighted most.

 

This is a clone of issue OCPBUGS-13044. The following is the description of the original issue:

Description of problem:

During cluster installations/upgrades with an imageContentSourcePolicy in place but with access to quay.io, the ICSP is not honored to pull the machine-os-content image from a private registry.

Version-Release number of selected component (if applicable):

$ oc logs -n openshift-machine-config-operator ds/machine-config-daemon -c machine-config-daemon|head -1
Found 6 pods, using pod/machine-config-daemon-znknf
I0503 10:53:00.925942    2377 start.go:112] Version: v4.12.0-202304070941.p0.g87fedee.assembly.stream-dirty (87fedee690ae487f8ae044ac416000172c9576a5)

How reproducible:

100% in clusters with ICSP configured BUT with access to quay.io

Steps to Reproduce:

1. Create mirror repo:
$ cat <<EOF > /tmp/isc.yaml                                                    
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 4
storageConfig:
  registry:
    imageURL: quay.example.com/mirror/oc-mirror-metadata
    skipTLS: true
mirror:
  platform:
    channels:
    - name: stable-4.12
      type: ocp
      minVersion: 4.12.13
    graph: true
EOF
$ oc mirror --dest-skip-tls  --config=/tmp/isc.yaml docker://quay.example.com/mirror/oc-mirror-metadata
<...>
info: Mirroring completed in 2m27.91s (138.6MB/s)
Writing image mapping to oc-mirror-workspace/results-1683104229/mapping.txt
Writing UpdateService manifests to oc-mirror-workspace/results-1683104229
Writing ICSP manifests to oc-mirror-workspace/results-1683104229

2. Confirm machine-os-content digest:
$ oc adm release info 4.12.13 -o jsonpath='{.references.spec.tags[?(@.name=="machine-os-content")].from}'|jq
{
  "kind": "DockerImage",
  "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a1660c8086ff85e569e10b3bc9db344e1e1f7530581d742ad98b670a81477b1b"
}
$ oc adm release info 4.12.14 -o jsonpath='{.references.spec.tags[?(@.name=="machine-os-content")].from}'|jq
{
  "kind": "DockerImage",
  "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ed68d04d720a83366626a11297a4f3c5761c0b44d02ef66fe4cbcc70a6854563"
}

3. Create 4.12.13 cluster with ICSP at install time:
$ grep imageContentSources -A6 ./install-config.yaml
imageContentSources:
  - mirrors:
    - quay.example.com/mirror/oc-mirror-metadata/openshift/release
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
  - mirrors:
    - quay.example.com/mirror/oc-mirror-metadata/openshift/release-images
    source: quay.io/openshift-release-dev/ocp-release


Actual results:

1. After the installation is completed, no pulls for a166 (4.12.13-x86_64-machine-os-content) are logged in the Quay usage logs whereas e.g. digest 22d2 (4.12.13-x86_64-machine-os-images) are reported to be pulled from the mirror. 

2. After upgrading to 4.12.14 no pulls for ed68 (4.12.14-x86_64-machine-os-content) are logged in the mirror-registry while the image was pulled as part of `oc image extract` in the machine-config-daemon:

[core@master-1 ~]$ sudo less /var/log/pods/openshift-machine-config-operator_machine-config-daemon-7fnjz_e2a3de54-1355-44f9-a516-2f89d6c6ab8f/machine-config-daemon/0.log
2023-05-03T10:51:43.308996195+00:00 stderr F I0503 10:51:43.308932   11290 run.go:19] Running: nice -- ionice -c 3 oc image extract -v 10 --path /:/run/mco-extensions/os-extensions-content-4035545447 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad48fe01f3e82584197797ce2151eecdfdcce67ae1096f06412e5ace416f66ce
2023-05-03T10:51:43.418211869+00:00 stderr F I0503 10:51:43.418008  184455 client_mirrored.go:174] Attempting to connect to quay.io/openshift-release-dev/ocp-v4.0-art-dev
2023-05-03T10:51:43.418211869+00:00 stderr F I0503 10:51:43.418174  184455 round_trippers.go:466] curl -v -XGET  -H "User-Agent: oc/4.12.0 (linux/amd64) kubernetes/31aa3e8" 'https://quay.io/v2/'
2023-05-03T10:51:43.419618513+00:00 stderr F I0503 10:51:43.419517  184455 round_trippers.go:495] HTTP Trace: DNS Lookup for quay.io resolved to [{34.206.15.82 } {54.209.210.231 } {52.5.187.29 } {52.3.168.193 } {52.21.36.23 } {50.17.122.58 } {44.194.68.221 } {34.194.241.136 } {2600:1f18:483:cf01:ebba:a861:1150:e245 } {2600:1f18:483:cf02:40f9:477f:ea6b:8a2b } {2600:1f18:483:cf02:8601:2257:9919:cd9e } {2600:1f18:483:cf01:8212:fcdc:2a2a:50a7 } {2600:1f18:483:cf00:915d:9d2f:fc1f:40a7 } {2600:1f18:483:cf02:7a8b:1901:f1cf:3ab3 } {2600:1f18:483:cf00:27e2:dfeb:a6c7:c4db } {2600:1f18:483:cf01:ca3f:d96e:196c:7867 }]
2023-05-03T10:51:43.429298245+00:00 stderr F I0503 10:51:43.429151  184455 round_trippers.go:510] HTTP Trace: Dial to tcp:34.206.15.82:443 succeed

Expected results:

All images are pulled from the location as configured in the ICSP.

Additional info:

 

Description of problem:

When CNO is managed by Hypershift, multus-admission-controller does not have the correct RollingUpdate parameters to meet the Hypershift requirements outlined here: https://github.com/openshift/hypershift/blob/646bcef53e4ecb9ec01a05408bb2da8ffd832a14/support/config/deployment.go#L81
```
There are two standard cases currently with hypershift: HA mode where there are 3 replicas spread across zones and then non ha with one replica. When only 3 zones are available you need to be able to set maxUnavailable in order to progress the rollout. However, you do not want to set that in the single replica case because it will result in downtime.
```
So when multus-admission-controller has more than one replica the RollingUpdate parameters should be
```
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
```

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create OCP cluster using Hypershift
2. Check rolling update parameters of multus-admission-controller

Actual results:

the operator has default parameters: {"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}

Expected results:

{"rollingUpdate":{"maxSurge":0,"maxUnavailable":1},"type":"RollingUpdate"}
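
The replica-dependent rule described in this report can be sketched as below. The `rollingUpdate` type and `rollingUpdateFor` function are hypothetical simplifications; real code would operate on the Kubernetes apps/v1 Deployment strategy types. The 25% values are the Kubernetes Deployment defaults shown under "Actual results".

```go
package main

import "fmt"

// rollingUpdate is a simplified, hypothetical stand-in for the rolling
// update parameters of a Deployment; values are kept as strings so that
// both absolute numbers and percentages can be represented.
type rollingUpdate struct {
	MaxSurge       string
	MaxUnavailable string
}

// rollingUpdateFor picks the parameters by replica count: with more than
// one replica, allow one pod to be unavailable and no surge so a rollout
// can progress when only three zones exist; with a single replica, keep
// the defaults to avoid downtime.
func rollingUpdateFor(replicas int) rollingUpdate {
	if replicas > 1 {
		return rollingUpdate{MaxSurge: "0", MaxUnavailable: "1"}
	}
	return rollingUpdate{MaxSurge: "25%", MaxUnavailable: "25%"}
}

func main() {
	fmt.Printf("HA (3 replicas): %+v\n", rollingUpdateFor(3))
	fmt.Printf("non-HA (1 replica): %+v\n", rollingUpdateFor(1))
}
```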

Additional info:

 

User Story

As a user I want to see what differs between the Machine's (current) ProviderSpec and the Control Plane Machine Set (desired) ProviderSpec so that I can understand why the CPMSO is replacing my control plane machine.

Background

Work spawned out of discussions in https://redhat-internal.slack.com/archives/CCX9DB894/p1678820665803259 and https://redhat-internal.slack.com/archives/C04UB95G802 

We believe this is already being logged. It would be good to emit either an event or the diff into the status; whoever takes this card should investigate the best way of surfacing this.

Outcome:

  • Decision on event/status/both
  • If status, API design scoped out
  • Cards written for implementation

Steps

  • PR
  • update tests

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • <Add items that need to be completed for this card>
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/726

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/1952

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The MCO's "Certificate Observability" CRD fields (introduced in MCO-607) are non-RFC3339 formatted strings and are unparseable as the API standard metav1.Time

For context, the MCO is currently migrating its API to openshift/api where it needs to comply with API standards, and if these strings are still present in the API when 4.14 ships, we will be unable to upgrade from the shipping version to the one where the API has migrated, so we need to adjust this now before it ships. 

Version-Release number of selected component (if applicable):

 

How reproducible:

100% 

Steps to Reproduce:

1. Create a cluster
2. Observe ControllerConfig status.controllerCertificates
3. Observe MachineConfigPool status.certExpirys

Actual results:

Types are wrong, and strings are formatted thusly: 2033-08-12 01:47:54 +0000 UTC 

Expected results:

ControllerConfig and MachineConfigPools do not contain certificate observability fields formatted as "2033-08-12 01:47:54 +0000 UTC".

They should either contain certificate observability fields formatted as "2006-01-02T15:04:05Z07:00" or not contain them at all.
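
The difference between the two formats can be illustrated with the Go standard library. In the sketch below, `toRFC3339` and `goStringLayout` are hypothetical names; the layout is an assumption that matches the string reported under "Actual results", and this is not the MCO's actual fix.

```go
package main

import (
	"fmt"
	"time"
)

// goStringLayout matches strings such as "2033-08-12 01:47:54 +0000 UTC",
// the shape time.Time.String() produces when there are no fractional
// seconds (assumption: the stored values always have this shape).
const goStringLayout = "2006-01-02 15:04:05 -0700 MST"

// toRFC3339 converts a timestamp in the old string format to RFC 3339.
func toRFC3339(s string) (string, error) {
	t, err := time.Parse(goStringLayout, s)
	if err != nil {
		return "", err
	}
	return t.UTC().Format(time.RFC3339), nil
}

func main() {
	old := "2033-08-12 01:47:54 +0000 UTC"

	// Parsing the old string as RFC 3339 fails, as in the reflector errors.
	if _, err := time.Parse(time.RFC3339, old); err != nil {
		fmt.Println("RFC 3339 parse failed:", err)
	}

	// Converting it first yields a value an RFC 3339 parser accepts.
	converted, err := toRFC3339(old)
	fmt.Println(converted, err) // 2033-08-12T01:47:54Z <nil>
}
```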

Additional info:

If we ship 4.14 with these strings how they are, we will be stuck like that and unable to easily upgrade out of it (because the new MCO that regards the fields as metav1.Time will be unable to parse the old strings), e.g.

2023-08-15T05:03:40.989575279Z W0815 05:03:40.989527 1 reflector.go:533] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:101: failed to list *v1.MachineConfigPool: parsing time "2033-08-12 01:47:54 +0000 UTC" as "2006-01-02T15:04:05Z07:00": cannot parse " 01:47:54 +0000 UTC" as "T"
2023-08-15T05:03:40.989575279Z E0815 05:03:40.989555 1 reflector.go:148] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfigPool: failed to list *v1.MachineConfigPool: parsing time "2033-08-12 01:47:54 +0000 UTC" as "2006-01-02T15:04:05Z07:00": cannot parse " 01:47:54 +0000 UTC" as "T"
2023-08-15T05:04:05.304139210Z W0815 05:04:05.304088 1 reflector.go:533] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:101: failed to list *v1.ControllerConfig: parsing time "2033-08-12 01:47:54 +0000 UTC" as "2006-01-02T15:04:05Z07:00": cannot parse " 01:47:54 +0000 UTC" as "T"
2023-08-15T05:04:05.304139210Z E0815 05:04:05.304121 1 reflector.go:148] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:101: Failed to watch *v1.ControllerConfig: failed to list *v1.ControllerConfig: parsing time "2033-08-12 01:47:54 +0000 UTC" as "2006-01-02T15:04:05Z07:00": cannot parse " 01:47:54 +0000 UTC" as "T"


Allow creating a single NAT gateway for a multi-zone hosted cluster. The route table in other zones should point to the one NAT gateway.

This allows running a cluster in multiple zones with a single NAT gateway, which reduces cost because NAT gateways can be expensive to run in AWS.

Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/30

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

ControllerConfig renders properly until Infrastructure object changes, then:
- 'Kind' and 'APIVersion' are no longer present on the object resulting from a "get" for that object via the lister and
- as a result, the embedded dns and infrastructure objects in ControllerConfig fail to validate 
- this results in ControllerConfig failing to sync 

Version-Release number of selected component (if applicable):

4.14 machine-config-operator

How reproducible:

I can reproduce it every time 

Steps to Reproduce:

1. Build a 4.14 cluster
2. Update Infrastructure non-destructively, e.g.: oc annotate infrastructure cluster break.the.mco=yep
3. Watch the machine-config-operator pod logs (or oc get co, the error will propagate) to see the validation errors for the new controllerconfig

Actual results:

2023-05-17T20:45:04.627320107Z I0517 20:45:04.627281       1 event.go:285] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"d52d09f4-f7bb-497a-a5c3-92861aa6796f", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: MachineConfigControllerFailed' Failed to resync 4.14.0-0.ci.test-2023-05-17-193937-ci-op-dcrr8kjq-latest because: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.apiVersion: Required value: must not be empty, spec.infra.kind: Required value: must not be empty, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

Expected results:

machine-config-operator quietly syncs controllerconfig :) 

Additional info:

The MCO itself is not doing this. It's not part of resourcemerge or anything like that. It's happening "below" us. 

The short version here is that when using a typed client, the group, version, kind (GVK) gets stripped during decoding because it's redundant (you already know the type). For "top level" objects, it gets put back during an update request automatically, but it doesn't recurse into embedded objects (which Infrastructure and DNS are). So we end up with embedded objects that are missing explicit GVKs and won't validate. 

Why does it only happen after the objects change? We're using a lister, and the lister's "strip-on-decode" behavior seems a little inconsistent. Sometimes the GVK is populated. If you use a direct client "get", the GVK will never be populated. 
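
One possible workaround, sketched below, is to set the GVK on the embedded objects explicitly before writing the ControllerConfig. This is a simplified illustration, not the fix that was merged: the struct types are hypothetical stand-ins for the real API types, and the helper name is made up. The group/version `config.openshift.io/v1` and the kinds `Infrastructure` and `DNS` are the ones these objects carry in the OpenShift API.

```go
package main

import "fmt"

// typeMeta is a hypothetical stand-in for metav1.TypeMeta.
type typeMeta struct {
	APIVersion string
	Kind       string
}

// infrastructure and dns are simplified stand-ins for the embedded objects.
type infrastructure struct {
	typeMeta
	Name string
}

type dns struct {
	typeMeta
	Name string
}

// controllerConfigSpec is a simplified stand-in holding the embedded objects.
type controllerConfigSpec struct {
	Infra *infrastructure
	DNS   *dns
}

// restoreEmbeddedGVK re-populates the APIVersion and Kind that decoding
// through a typed client or lister may have stripped, so the embedded
// objects pass validation again.
func restoreEmbeddedGVK(spec *controllerConfigSpec) {
	if spec.Infra != nil {
		spec.Infra.APIVersion = "config.openshift.io/v1"
		spec.Infra.Kind = "Infrastructure"
	}
	if spec.DNS != nil {
		spec.DNS.APIVersion = "config.openshift.io/v1"
		spec.DNS.Kind = "DNS"
	}
}

func main() {
	// The embedded object arrives with an empty GVK, as after a lister "get".
	spec := controllerConfigSpec{Infra: &infrastructure{Name: "cluster"}}
	restoreEmbeddedGVK(&spec)
	fmt.Println(spec.Infra.APIVersion, spec.Infra.Kind)
}
```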

There is a lot of history on this behavior, and it won't be changed any time soon. Here are some entry points: 
- https://github.com/kubernetes/kubernetes/pull/63972
- https://github.com/kubernetes/kubernetes/issues/80609
 

Description of problem:

The test "operator conditions control-plane-machine-set" fails: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216
The control-plane-machine-set operator is Unavailable because it doesn't reconcile node events. If a node becomes ready later than the Machine that references it, the Node update event does not trigger reconciliation.

Version-Release number of selected component (if applicable):

 

How reproducible:

depends on the sequence of Node vs Machine events

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

operator logs 
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-machine-api_control-plane-machine-set-operator-5d5848c465-g4q2p_control-plane-machine-set-operator.log

machines 
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/machines.json

nodes 
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes.json

Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/357

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-18304. The following is the description of the original issue:

Description of problem:

https://github.com/openshift/installer/pull/6770 reverted part of https://github.com/openshift/installer/pull/5788, which had set guestinfo.domain for the bootstrap machine. This breaks some OKD installations, which require that setting.

Description of problem:


NodePool conditions AllMachinesReady and AllNodesHealthy are used by Cluster Service to detect problems on customer nodes.

Every time a NodePool is updated, it triggers an update in a ManifestWork that is processed by CS to build a user message about why a specific machinepool/nodepool is not healthy.

Because the message is not sorted when there is more than one machine, the NodePool is updated multiple times even though the state is the same.

For example, CS may capture messages like the following, where the state is the same and only the ordering differs:

Machine rosa-vws58-workshop-69b55d58b-mq44p: UnhealthyNode
Machine rosa-vws58-workshop-69b55d58b-97n47: UnhealthyNode ,
Machine rosa-vws58-workshop-69b55d58b-mq44p: NodeConditionsFailed
Machine rosa-vws58-workshop-69b55d58b-97n47: Deleting ,

Machine rosa-vws58-workshop-69b55d58b-97n47: UnhealthyNode
Machine rosa-vws58-workshop-69b55d58b-mq44p: UnhealthyNode ,
Machine rosa-vws58-workshop-69b55d58b-97n47: Deleting
Machine rosa-vws58-workshop-69b55d58b-mq44p: NodeConditionsFailed ,

Machine rosa-vws58-workshop-69b55d58b-mq44p: UnhealthyNode
Machine rosa-vws58-workshop-69b55d58b-97n47: UnhealthyNode ,
Machine rosa-vws58-workshop-69b55d58b-mq44p: NodeConditionsFailed
Machine rosa-vws58-workshop-69b55d58b-97n47: Deleting ,

Expected results:


The HyperShift Operator should sort the messages where multiple machines/nodes are involved:

https://github.com/openshift/hypershift/blob/86af31a5a5cdee3da0d7f65f3bd550f4ec9cac55/hypershift-operator/controllers/nodepool/nodepool_controller.go#L2509
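
The expected behavior can be sketched as follows: build the condition message from a sorted view of the per-machine reasons, so that the same state always yields the same string. The function name and the input shape (a mapping of machine name to reason) are assumptions made for illustration, not the HyperShift implementation.

```python
def build_condition_message(machine_reasons):
    """Build a deterministic message from a {machine name: reason} mapping."""
    lines = [
        f"Machine {name}: {reason}"
        # Sorting by machine name removes any dependence on iteration order.
        for name, reason in sorted(machine_reasons.items())
    ]
    return "\n".join(lines)
```

Two observations of the same state that list the machines in a different order then produce identical messages, so no spurious NodePool update is triggered.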

Description of problem:

TypeErrors appear on the operand creation page.

Version-Release number of selected component (if applicable):

cluster-bot cluster 
launch 4.14-ci,openshift/console#12525

How reproducible:

Always

Steps to Reproduce:

1. create mock CRD and CSV files into project 'test'
$ oc project test
$ oc apply -f mock-crd-and-csv.yaml 
customresourcedefinition.apiextensions.k8s.io/mock-k8s-dropdown-resources.test.tectonic.com created
clusterserviceversion.operators.coreos.com/mock-k8s-resource-dropdown-operator created
2. Go to the CR creation page: Operators -> Installed Operators -> Mock K8sResourcePrefixOperator -> Mock Resource tab -> click the 'Create MockK8sDropdownResource' button

Actual results:

2. we can see errors

Description:
e is undefined

Component trace: 
g@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:17026
v@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:54359
div
N@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:173048
R@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:173543
_@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:20749
10807/t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:145
4156/t.default@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:22586
s@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:223444
t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:69403
T
t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:71448
Suspense
i@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:435931
section
m@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:170312
div
div
t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1501506
div
div
c@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:699298
d@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:219161
div
d@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:89596
l@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1151500
H<@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:442786
S@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:87:86675
main
div
v@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:466912
div
div
c@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:311348
div
div
c@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:699298
d@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:219161
div
d@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:89596
Jn@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:36:185686
t.default@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:854425
5404/t.default@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/quick-start-chunk-0b68859d1eaa39849249.min.js:1:1264
s@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:223444
t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1581508
ee@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1599747
St@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:36:142700
ee@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1599747
ee@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1599747
i@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:809765
t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1575685
t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1575874
t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1573290
te@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1599889
ne<@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1603021
r@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:36:122338
t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:69403
t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:71448
t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:66008
re@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1603332
t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:783751
t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1084331
s@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:635039
t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:135:257437
Suspense


Expected results:

2. operand creation form/yaml page should be loaded successfully

Additional info:

mock-crd-and-csv.yaml and screenshot are at https://drive.google.com/drive/folders/1Z432vVMArHLgCgzu5IMGi9_oq3iRtezx 

A workloads change introduces the DeploymentConfigs and Builds APIs as capabilities, which gives the cluster admin the option to enable or disable each of these APIs.

If the DeploymentConfigs capability is disabled, we should remove the `Deployment Config` subsection from the `Workloads` nav section.

If the Builds capability is disabled, we should remove the `Builds` and `Build Configs` subsections from the `Workloads` nav section.
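
As a rough sketch of the rule above (not the console's actual code), the nav items can be filtered against the set of enabled capabilities. The mapping, the capability identifiers and the function name are hypothetical; the real console is not written in Python.

```python
# Hypothetical mapping from a Workloads nav item to the capability it needs.
# Items that are not listed here do not depend on any optional capability.
NAV_ITEM_CAPABILITY = {
    "Deployment Config": "DeploymentConfigs",
    "Builds": "Builds",
    "Build Configs": "Builds",
}


def visible_workloads_nav(items, enabled_capabilities):
    """Keep nav items whose required capability (if any) is enabled."""
    enabled = set(enabled_capabilities)
    return [
        item
        for item in items
        if item not in NAV_ITEM_CAPABILITY or NAV_ITEM_CAPABILITY[item] in enabled
    ]
```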

 

This is a clone of issue OCPBUGS-7893. The following is the description of the original issue:

Description of problem:
The TaskRun duration diagram on the "Metrics" tab of a pipeline shows only 4 TaskRuns in the legend, regardless of the number of TaskRuns on the diagram.

 

 

Expected results:

All TaskRuns should be displayed in the legend.

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/61

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Hello, one of our customers had a large number of cni-sysctl-allowlist-ds pods created (around 10,000) in the openshift-multus namespace. That caused several issues in the cluster, as nodes were full of pods and ran out of IPs.

After deleting them, the situation has improved. But we want to know the root cause of this issue.

Searching in the network-operator pod logs, it seems that the customer faced some networking issues. After this issue, we can see that the cni-sysctl-allowlist pods started to be created.

Could we know why the cni-sysctl-allowlist-ds pods were created?

Description of problem:

Unable to successfully create a HyperShift KubeVirt HostedCluster on BM; the control plane's pod/importer-prime-xxx can't become ready

Version-Release number of selected component (if applicable):

4.14

How reproducible:

100%

Steps to Reproduce:

1. HyperShift install operator
2. HyperShift create cluster KubeVirt xxx

Actual results:

➜  oc get pod -n clusters-3d9ec3c7e495f1c58da1  | grep "importer-prime"
importer-prime-90175dc9-21bf-4f13-a021-6c42a2e19652   1/2     Error              16 (5m13s ago)   57m
importer-prime-9f153661-1c2c-4b61-84fd-0a2d83f30699   1/2     Error              16 (5m4s ago)    57m
importer-prime-cb817383-58bd-4480-a7e1-49ae42368cae   1/2     CrashLoopBackOff   15 (4m51s ago)   57m

➜  oc logs importer-prime-90175dc9-21bf-4f13-a021-6c42a2e19652 -c importer -n clusters-3d9ec3c7e495f1c58da1

I0728 18:41:20.106447       1 importer.go:103] Starting importer
E0728 18:41:20.107346       1 importer.go:133] exit status 1, blockdev: cannot open /dev/cdi-block-volume: Permission denied

kubevirt.io/containerized-data-importer/pkg/util.GetAvailableSpaceBlock
        /remote-source/app/pkg/util/util.go:136
kubevirt.io/containerized-data-importer/pkg/util.GetAvailableSpaceByVolumeMode
        /remote-source/app/pkg/util/util.go:106
main.main
        /remote-source/app/cmd/cdi-importer/importer.go:131
runtime.main
        /usr/lib/golang/src/runtime/proc.go:250
runtime.goexit
        /usr/lib/golang/src/runtime/asm_amd64.s:1598
➜  oc get hostedcluster -n clusters 3d9ec3c7e495f1c58da1 -ojsonpath='{.status.version.desired}' | jq
{
  "image": "registry.build01.ci.openshift.org/ci-op-ywf2rxrx/release@sha256:940a0463d1203888fb4e5fa4a09b69dc4eb3cc5d70dee22e1155c677aafca197",
  "version": "4.14.0-0.ci-2023-07-28-090906"
}
➜  oc get hostedcluster -n clusters 3d9ec3c7e495f1c58da1                                    
NAME                   VERSION   KUBECONFIG                              PROGRESS   AVAILABLE   PROGRESSING   MESSAGE
3d9ec3c7e495f1c58da1             3d9ec3c7e495f1c58da1-admin-kubeconfig   Partial    True        False         The hosted control plane is available
➜  oc get clusterversion version -ojsonpath='{.status.desired.image}'
registry.build01.ci.openshift.org/ci-op-ywf2rxrx/release@sha256:940a0463d1203888fb4e5fa4a09b69dc4eb3cc5d70dee22e1155c677aafca197                                                       
➜  oc get vmi -A                                             
No resources found 

Expected results:

All pods on the control plane should be ready

Additional info:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/41772/rehearse-41772-periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-kubevirt-baremetalds-conformance/1684954151244533760

Description of problem:

container_network* metrics stop reporting after a container restarts. Other container_* metrics continue to report for the same pod. 

How reproducible:

Issue can be reproduced by triggering a container restart 

Steps to Reproduce:

1. Restart a container
2. Check metrics and see that container_network* is not reporting

Additional info:
Ticket with more detailed debugging process OHSS-16739

First showed on https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-08-16-042125

Did not appear to happen on https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-08-15-200133

Changelog is getting huge but I diffed these two PRs:

❯ diff 1.txt 2.txt 
2a3
>     Use go 1.18 when setting up environment (#5422) #5422
15a17
>     CFE-688: Update install-config CRD to support gcp labels and tags #7126
23a26,27
>     OCPBUGS-17711: Revert "pkg/cli/admin/release/extract: Add --included and --install-config" #1527
>     Update openshift/api #1525
28a33
>     pkg/aws/actuator: Drop comment which suggested passthrough permission verification #590
49a55,59
> cluster-control-plane-machine-set-operator
> 
>     OCPCLOUD-2130: Add subnet to Azure FD, fix for optional fields in FD #229
>     Full changelog
> 
64a75
>     IR-373: remove node-ca daemon #867
126a138,147
> cluster-storage-operator
> 
>     STOR-1274: use granular permissons for Azure credential requests #388
>     Full changelog
> 
> cluster-version-operator
> 
>     CNF-9385: add ImageRegistry capability #950
>     Full changelog
> 
132a154,158
> container-networking-plugins
> 
>     OCPBUGS-17681: Default CNI binaries to RHEL 8 #116
>     Full changelog
> 
143a170,174
> haproxy-router
> 
>     OCPBUGS-17653: haproxy/template: mitigate CVE-2023-40225 #505
>     Full changelog
> 
193a225,229
> monitoring-plugin
> 
>     OCPBUGS-17650: Fix /monitoring/ redirects #68
>     Full changelog
> 
204a241,245
> openstack-machine-api-provider
> 
>     Bump CAPO to match branch release-0.7 #80
>     Full changelog
> 
206a248,249
>     OCPBUGS-17157: scripts: add a Go-based bumper, sync upstream #534
>     Add ncdc to DOWNSTREAM_OWNERS #539
223a267
>     update watch-endpoint-slices to usable shape #28184

Description of problem:

A runtime error is encountered when running the console backend in off-cluster mode against only one cluster (non-multicluster configuration)

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Follow readme instructions for running bridge locally

Actual results:

Bridge crashes with a runtime error

Expected results:

Bridge should run normally

Description of problem:

Alert Rules do not have summary/description

How reproducible:

Always

Steps to Reproduce:

This bug is being raised by the OpenShift Monitoring team as part of an effort to detect invalid Alert Rules in OCP.

Check details of following Alert Rules
1. KubeletHealthState
2. MCCDrainError
3. MCDPivotError
4. MCDRebootError
5. SystemMemoryExceedsReservation 

Actual results:

These Alert Rules do not have Summary/Description annotation, but have a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation

Expected results:

Alerts should have Summary/Description annotation.

Additional info:

Alerts must have a summary/description annotation, please refer to style guide at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide 


To resolve the bug, 
- Rename message annotation to summary/description annotation
- Remove the exception in the origin test, added in PR https://github.com/openshift/origin/pull/27944
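
The check behind this bug can be sketched as a small lint over a rule definition. This is a minimal illustration, assuming a rule is a dict with `alert` and `annotations` keys (as in a PrometheusRule) and assuming both `summary` and `description` are required; the function name is hypothetical and this is not the origin test itself.

```python
def lint_alert_annotations(rule):
    """Return a list of annotation problems for a single alerting rule."""
    problems = []
    name = rule.get("alert", "<unnamed>")
    annotations = rule.get("annotations") or {}
    # Assumption: both annotations are required by the style guide.
    for required in ("summary", "description"):
        if not annotations.get(required):
            problems.append(f"{name}: missing '{required}' annotation")
    if "message" in annotations:
        problems.append(f"{name}: uses 'message'; consider renaming it to 'description'")
    return problems
```

A rule such as MCDRebootError that carries only a `message` annotation would be reported three times: once for each missing annotation and once for the `message` key.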

From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376:

```
alert TargetDown fired for 13 seconds with labels:

{job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}

```

Checking kubelet logs for all the nodes:
```
Aug 07 10:11:49.788245 libvirt-ppc64le-1-1-9-kfv8v-master-0 crio[1244]: time="2021-08-07 10:11:49.788169211Z" level=info msg="Started container dd7e2473c51870c1894531af9a3935b907340a31216f85c32e391bddf22d7fd0: openshift-machine-config-operator/machine-config-daemon-7r2bb/machine-config-daemon" id=15456b41-39c9-41ce-8f10-71398df6dd26 name=/runtime.v1alpha2.RuntimeService/StartContainer
Aug 07 10:11:49.265439 libvirt-ppc64le-1-1-9-kfv8v-master-1 crio[1242]: time="2021-08-07 10:11:49.264443242Z" level=info msg="Created container 0651d7904d63a3f2c1fa9177d2ccf890c8fc769e96c836074aa8cc28a8bd7e04: openshift-machine-config-operator/machine-config-daemon-pk29l/machine-config-daemon" id=a622e284-7d45-4b72-b271-c39081c2c77a name=/runtime.v1alpha2.RuntimeService/CreateContainer
Aug 07 10:11:49.602420 libvirt-ppc64le-1-1-9-kfv8v-master-2 crio[1243]: time="2021-08-07 10:11:49.602359290Z" level=info msg="Started container 5a24f464210595cd394aacd4e98903a196d67762a53d764bd6f4a6010cc17acf: openshift-machine-config-operator/machine-config-daemon-69fw6/machine-config-daemon" id=89b0650c-741e-4c61-ab49-f68aa82cb302 name=/runtime.v1alpha2.RuntimeService/StartContainer
Aug 07 10:15:54.666525 libvirt-ppc64le-1-1-9-kfv8v-worker-0-gddxw crio[1252]: time="2021-08-07 10:15:54.666233168Z" level=info msg="Started container 8ba32989af629e00c35578c51e9b5612ca8ddcf97b32f2b500d777a6eb2ff2e1: openshift-machine-config-operator/machine-config-daemon-5tb88/machine-config-daemon" id=4fa0e2ba-54aa-41a8-ab7b-7a3b6f6a9998 name=/runtime.v1alpha2.RuntimeService/StartContainer
Aug 07 10:16:14.170188 libvirt-ppc64le-1-1-9-kfv8v-worker-0-p76x7 crio[1235]: time="2021-08-07 10:16:14.170137303Z" level=info msg="Started container 78d933af1e7100050332b1df62e67d1fc71ca735c7a7d3c060411f61f32a0c74: openshift-machine-config-operator/machine-config-daemon-k6l8w/machine-config-daemon" id=c344fd94-abeb-4393-87f3-5bcaba21d45f name=/runtime.v1alpha2.RuntimeService/StartContainer
```

All containers started before the test started (before 2021-08-07T10:28:00Z, see https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376/build-log.txt). Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376/artifacts/ocp-jenkins-e2e-remote-libvirt-ppc64le/gather-libvirt/artifacts/pods.json:

```
machine-config-daemon-5tb88_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-worker-0-gddxw, 0 restarts, ready since 2021-08-07T10:16:07Z
machine-config-daemon-k6l8w_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-worker-0-p76x7, 0 restarts, ready since 2021-08-07T10:16:14Z
machine-config-daemon-69fw6_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-2, 0 restarts, ready since 2021-08-07T10:11:49Z
machine-config-daemon-pk29l_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-1, 0 restarts, ready since 2021-08-07T10:11:49Z
machine-config-daemon-7r2bb_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-0, 0 restarts, ready since 2021-08-07T10:11:49Z
```

All containers were running since they got created and never restarted.

The incident (alert TargetDown fired for 13 seconds) occurred at August 7, 2021 10:33:18 AM. The test suite finished 2021-08-07T10:33:40Z.

Based on the TargetDown definition (see https://github.com/openshift/cluster-monitoring-operator/blob/001eccd81ff51af0ed7a9d463dd35bfa9b75d102/assets/cluster-monitoring-operator/prometheus-rule.yaml#L16-L28):
```
- alert: TargetDown
  annotations:
    description: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service
      }} targets in {{ $labels.namespace }} namespace have been unreachable for
      more than 15 minutes. This may be a symptom of network connectivity issues,
      down nodes, or failures within these components. Assess the health of the
      infrastructure and nodes running these targets and then contact support.'
    summary: Some targets were not reachable from the monitoring server for an
      extended period of time.
  expr: |
    100 * (count(up == 0 unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service) /
      count(up unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service)) > 10
  for: 15m
```

The machine-config-daemon was down for 15m and 13s. Given that the incident occurred only ~5m18s after the test suite started (10:33:18-10:28:00), the target was already down before the test suite started to run.
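
The timing argument can be checked with a few lines of arithmetic. The sketch below assumes that 10:33:18 is the moment the alert started firing; if it is instead the moment the alert resolved, the target went down even earlier, so the conclusion is unchanged.

```python
from datetime import datetime, timedelta

suite_start = datetime(2021, 8, 7, 10, 28, 0)   # test suite start, from build-log.txt
incident = datetime(2021, 8, 7, 10, 33, 18)     # TargetDown incident time
for_duration = timedelta(minutes=15)            # "for: 15m" in the rule

# Time between the start of the suite and the incident.
elapsed = incident - suite_start

# The expression must have been true for the whole "for" window before firing,
# so the target was down no later than this point in time.
latest_down_since = incident - for_duration

down_before_suite = latest_down_since < suite_start
```

Here `elapsed` is 5 minutes 18 seconds and `latest_down_since` is 10:18:18, which is before the suite started.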

This pattern repeats in other jobs as well. See:
https://search.ci.openshift.org/?search=alert+TargetDown+fired+for+.*+seconds+with+labels%3A+%5C%7Bjob%3D%22machine-config-daemon%22%2C+namespace%3D%22openshift-machine-config-operator%22%2C+service%3D%22machine-config-daemon%22%2C+severity%3D%22warning%22%5C%7D&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/459

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

VPC endpoint service cannot be cleaned up by HyperShift operator when the OIDC provider of the customer cluster has been deleted.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Sometimes

Steps to Reproduce:

1. Create a HyperShift hosted cluster
2. Delete the HyperShift cluster's OIDC provider in AWS
3. Delete the HyperShift hosted cluster

Actual results:

Cluster is stuck deleting

Expected results:

Cluster deletes

Additional info:

The hypershift operator is stuck trying to delete the AWS endpoint service but it can't be deleted because it gets an error that there are active connections.

Description of problem:

Bump Kubernetes to 0.27.1 and bump dependencies

Description of problem:

On a freshly installed cluster, the control-plane-machineset-operator begins rolling a new master node, but the machine remains in a Provisioned state and never joins as a node.

Its status is:
Drain operation currently blocked by: [{Name:EtcdQuorumOperator Owner:clusteroperator/etcd}]

The cluster is left in this state until an admin manually removes the stuck master node, at which point a new master machine is provisioned and successfully joins the cluster.

Version-Release number of selected component (if applicable):

4.12.4

How reproducible:

Observed at least 4 times over the last week, but unsure on how to reproduce.

Actual results:

A master node remains in a stuck Provisioned state and requires manual deletion to unstick the control plane machine set process.

Expected results:

No manual interaction should be necessary.

Description of problem:

The certificates synced by MCO in 4.13 onwards are more comprehensive and correct, and out of sync issues will surface much faster.

See https://issues.redhat.com/browse/MCO-499 for details

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Install 4.13, pause MCPs

Actual results:

Within ~24 hours the cluster will fire critical clusterdown alerts

Expected results:

No alerts fire

This PR will allow the installation of non-latest Operator channels and associated versions. https://github.com/openshift/console/pull/12743

When a version is installed that is not the `currentCSV` default version for a channel, the data returns `installed: false` and `installState: "Not Installed"`.

So the UI doesn't place an "Installed" label on the operator card in OperatorHub and the user doesn't see that it's already installed when viewing the operator details.

 

Version-Release number of selected component (if applicable):

4.14 cluster

 

Steps to Reproduce: 

  1. In OperatorHub select Data Grid operator and install version 8.4.3.
  2. Once installed, go into OperatorHub and select Data Grid operator card. Note there isn't an "Installed" label on card. 
  3. Select the Data Grid card; once open, it should show that the operator is installed, with a link to the installed version.

 

Animated screen gif of installed Data Grid version 8.4.3, the default latest version is 8.4.4

https://drive.google.com/file/d/1KVMCdflBYsI3yiLf2oQv69MoStgA5kof/view?usp=sharing

 

Actual results:

obj data returns `installState: "Not Installed"` and `installed: false`

Expected results:

obj data returns `installState: "Installed"` and `installed: true`

 

Additional info:

Requires 4.14 cluster to support installing previous versions and channels

Description of problem:

On 4.14, 'MachineAPI' is marked as an optional capability; disabling it disables two operators, machine-api and cluster-autoscaler.

epic link: https://issues.redhat.com/browse/CNF-6318

The machine-api operator is required for a common IPI cluster (neither SNO nor compact), so if "MachineAPI" is disabled in install-config.yaml, installation of a common IPI cluster fails.

Suggest adding a pre-check on the installer side for common IPI (neither SNO nor compact) when running "openshift-install create cluster". If MachineAPI is disabled, the installer should exit with a corresponding message.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-30-131338

How reproducible:

Always

Steps to Reproduce:

1. Prepare install-config.yaml and set baselineCapabilitySet as None, make sure that compute node number is greater than 0.
2. Run command "openshift-install create cluster" to install common IPI

Actual results:

Installation fails because the machine-api operator is missing

Expected results:

Installer should have pre-check for this scenario and exit with error message if MachineAPI is disabled
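
A minimal sketch of such a pre-check, operating on an already parsed install-config. It assumes the `capabilities.baselineCapabilitySet`, `capabilities.additionalEnabledCapabilities` and `compute[].replicas` fields, and it simplifies in two ways that are assumptions of this example: any baseline set other than `None` is treated as enabling MachineAPI, and only explicitly set replica counts are considered. The function name and error text are hypothetical; the real installer is written in Go.

```python
def check_machine_api_capability(install_config):
    """Return validation errors for a parsed install-config (a dict)."""
    caps = install_config.get("capabilities") or {}
    baseline = caps.get("baselineCapabilitySet")
    additional = caps.get("additionalEnabledCapabilities") or []

    # Simplification: every baseline set except "None" is assumed to include MachineAPI.
    machine_api_enabled = baseline != "None" or "MachineAPI" in additional

    # Simplification: only explicitly configured replica counts are summed.
    compute_replicas = sum(
        (pool.get("replicas") or 0) for pool in (install_config.get("compute") or [])
    )

    if compute_replicas > 0 and not machine_api_enabled:
        return [
            "the MachineAPI capability is disabled, but compute replicas is greater than 0"
        ]
    return []
```

Returning a list of errors instead of raising lets the caller report this problem together with other validation failures.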

Description of the problem:

We get the disk serial from ghw, which gets it from looking at 2 udev properties.  There are a couple more recent udev properties that should be tried first, as lsblk does:

https://github.com/util-linux/util-linux/blob/36c52fd14b83e6f7eff9a565c426a1e21812403b/misc-utils/lsblk-properties.c#L122-L128

 

I have a PR open on ghw that should solve the issue.  We'll need to update our version of ghw once it's merged.

 

See more info in the ABI ticket: https://issues.redhat.com/browse/OCPBUGS-18174
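
A minimal sketch of the "try the newer properties first" lookup, written in Python for illustration only (ghw itself is a Go library). The property names and their order below are an assumption based on the linked lsblk source and should be checked against it; the function name is hypothetical.

```python
from typing import Mapping, Optional

# Assumed precedence, following the linked lsblk source: the two sg3_utils/SCSI
# properties first, then the two properties that were consulted previously.
SERIAL_PROPERTIES = (
    "SCSI_IDENT_SERIAL",
    "ID_SCSI_SERIAL",
    "ID_SERIAL_SHORT",
    "ID_SERIAL",
)


def disk_serial(udev_props: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty serial found in the udev properties."""
    for key in SERIAL_PROPERTIES:
        value = udev_props.get(key, "").strip()
        if value:
            return value
    return None
```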

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/59

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

This test tends to be flaky, depending on how the cert changes are propagated. We rotate 2 of the 7 certs in the bundle; if the changes don't get batched together, the assertion that verifies the bundle after the rotation runs too soon, causing the test to fail.

Version-Release number of selected component (if applicable):

4.14.0
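
One common way to avoid asserting too soon is to poll until the expected state is observed or a timeout expires. The sketch below is a generic Python illustration of that idea, not the test's actual code; `wait_until` and its parameters are hypothetical.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 60.0,
               interval: float = 1.0) -> bool:
    """Poll condition() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test would pass a condition that checks that both rotated certs are present in the bundle, so separate propagation batches no longer matter.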

Description of problem:

The StatefulSet thanos-ruler-user-workload has no serviceName. As the documentation describes, serviceName is required for a StatefulSet. I'm not sure whether we need a service here, but one question: if we don't need a service, why not use a regular Deployment? Thanks!

MacBook-Pro:k8sgpt jianzhang$ oc explain statefulset.spec.serviceName 
KIND:     StatefulSet
VERSION:  apps/v1

FIELD:    serviceName <string>

DESCRIPTION:
     serviceName is the name of the service that governs this StatefulSet. This
     service must exist before the StatefulSet, and is responsible for the
     network identity of the set. Pods get DNS/hostnames that follow the
     pattern: pod-specific-string.serviceName.default.svc.cluster.local where
     "pod-specific-string" is managed by the StatefulSet controller.

MacBook-Pro:k8sgpt jianzhang$ oc get statefulset -n openshift-user-workload-monitoring -o=jsonpath={.spec.serviceName}
MacBook-Pro:k8sgpt jianzhang$ 

MacBook-Pro:k8sgpt jianzhang$ oc get statefulset -n openshift-user-workload-monitoring
NAME                         READY   AGE
prometheus-user-workload     2/2     4h44m
thanos-ruler-user-workload   2/2     4h44m

MacBook-Pro:k8sgpt jianzhang$ oc get svc -n openshift-user-workload-monitoring
NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
prometheus-operated                       ClusterIP   None            <none>        9090/TCP,10901/TCP            4h44m
prometheus-operator                       ClusterIP   None            <none>        8443/TCP                      4h44m
prometheus-user-workload                  ClusterIP   172.30.46.204   <none>        9091/TCP,9092/TCP,10902/TCP   4h44m
prometheus-user-workload-thanos-sidecar   ClusterIP   None            <none>        10902/TCP                     4h44m
thanos-ruler                              ClusterIP   172.30.110.49   <none>        9091/TCP,9092/TCP,10901/TCP   4h44m
thanos-ruler-operated                     ClusterIP   None            <none>        10902/TCP,10901/TCP           4h44m


Version-Release number of selected component (if applicable):

MacBook-Pro:k8sgpt jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-05-31-080250   True        False         7h30m   Cluster version is 4.14.0-0.nightly-2023-05-31-080250

How reproducible:

always

Steps to Reproduce:

1. Install OCP 4.14 cluster.
2. Check cluster's statefulset instances or run `k8sgpt analyze -d`
3.

Actual results:

MacBook-Pro:k8sgpt jianzhang$ k8sgpt analyze -d
Service nfs-provisioner/example.com-nfs does not exist
AI Provider: openai


0 openshift-user-workload-monitoring/thanos-ruler-user-workload(thanos-ruler-user-workload)
- Error: StatefulSet uses the service openshift-user-workload-monitoring/ which does not exist.
  Kubernetes Doc: serviceName is the name of the service that governs this StatefulSet. This service must exist before the StatefulSet, and is responsible for the network identity of the set. Pods get DNS/hostnames that follow the pattern: pod-specific-string.serviceName.default.svc.cluster.local where "pod-specific-string" is managed by the StatefulSet controller.

Expected results:

The StatefulSet has a serviceName set.

Additional info:

 

Description of problem:

The script for checking the certs for Openshift install on openstack fails. 

https://docs.openshift.com/container-platform/4.12/installing/installing_openstack/preparing-to-install-on-openstack.html#security-osp-validating-certificates_preparing-to-install-on-openstack

I see that the command "openstack catalog list --format json --column Name --column Endpoints" returns output as,

-----------
[
  {
    "Name": "heat-cfn",
    "Endpoints": "RegionOne\n  admin: http://10.254.x.x:8000/v1\nRegionOne\n  public: https://<domain_name>:8000/v1\nRegionOne\n  internal: http://10.254.x.x:8000/v1\n"
  },
  {
    "Name": "cinderv2",
    "Endpoints": "RegionOne\n  admin: http://10.254.x.x:8776/v2/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n  internal: http://10.254.x.x:8776/v2/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n  public: https://<domain_name>:8776/v2/f36f2db6bb434484b71a45aa84b9d790\n"
  },
  {
    "Name": "glance",
    "Endpoints": "RegionOne\n  public: https://<domain_name>:9292\nRegionOne\n  admin: http://10.254.x.x:9292\nRegionOne\n  internal: http://10.254.x.x:9292\n"
  },
  {
    "Name": "keystone",
    "Endpoints": "RegionOne\n  internal: http://10.254.x.x:5000\nRegionOne\n  admin: http://10.254.x.x:35357\nRegionOne\n  public: https://<domain_name>:5000\n"
  },
  {
    "Name": "swift",
    "Endpoints": "RegionOne\n  admin: https://ch-dc-s3-gsn-33.eecloud.nsn-net.net:10032/swift/v1\nRegionOne\n  public: https://ch-dc-s3-gsn-33.eecloud.nsn-net.net:10032/swift/v1\nRegionOne\n  internal: https://ch-dc-s3-gsn-33.eecloud.nsn-net.net:10032/swift/v1\n"
  },
  {
    "Name": "nova",
    "Endpoints": "RegionOne\n  public: https://<domain_name>:8774/v2.1\nRegionOne\n  internal: http://10.254.x.x:8774/v2.1\nRegionOne\n  admin: http://10.254.x.x:8774/v2.1\n"
  },
  {
    "Name": "heat",
    "Endpoints": "RegionOne\n  internal: http://10.254.x.x:8004/v1/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n  public: https://<domain_name>:8004/v1/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n  admin: http://10.254.x.x:8004/v1/f36f2db6bb434484b71a45aa84b9d790\n"
  },
  {
    "Name": "cinder",
    "Endpoints": ""
  },
  {
    "Name": "cinderv3",
    "Endpoints": "RegionOne\n  public: https://<domain_name>:8776/v3/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n  admin: http://10.254.x.x:8776/v3/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n  internal: http://10.254.x.x:8776/v3/f36f2db6bb434484b71a45aa84b9d790\n"
  },
  {
    "Name": "neutron",
    "Endpoints": "RegionOne\n  internal: http://10.254.x.x:9696\nRegionOne\n  public: https://<domain_name>:9696\nRegionOne\n  admin: http://10.254.x.x:9696\n"
  },
  {
    "Name": "placement",
    "Endpoints": "RegionOne\n  internal: http://10.254.x.x:8778\nRegionOne\n  admin: http://10.254.x.x:8778\nRegionOne\n  public: https://<domain_name>:8778\n"
  }
]
-----------

This output is then expected to be filtered with jq as " | jq -r '.[] | .Name as $name | .Endpoints[] | [$name, .interface, .url] | join(" ")'| sort "


But it fails with error as,

----------------
./certs.sh
jq: error (at <stdin>:46): Cannot iterate over string ("RegionOne\...)

On further checking, the following command from the script is the one that fails:
 openstack catalog list --format json --column Name --column Endpoints \
> | jq -r '.[] | .Name as $name | .Endpoints[] | [$name, .interface, .url] | join(" ")'
jq: error (at <stdin>:46): Cannot iterate over string ("RegionOne\...)
----------------

Where certs.sh is the script we copied from documentation.

I did some debugging to map the .interface and .url fields to the internal, public, and admin entries of each endpoint, but I'm not sure whether that is how OpenStack is meant to return them, so I'm filing this as a BZ to have it reviewed.
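
For illustration, the string form of "Endpoints" shown above can be parsed as sketched below. This is a hypothetical Python helper, not the fix applied to the documented script; it assumes each endpoint appears on a line of the form "<interface>: <url>" and that lines without ": " are region names.

```python
import json
from typing import Iterator, Tuple


def iter_endpoints(catalog_json: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (service name, interface, url) from `openstack catalog list` JSON."""
    for service in json.loads(catalog_json):
        endpoints = service["Endpoints"]
        if isinstance(endpoints, list):
            # Structured form that the documented jq filter expects.
            for ep in endpoints:
                yield service["Name"], ep["interface"], ep["url"]
            continue
        # String form from the report: "RegionOne\n  admin: http://...\n"
        for line in endpoints.splitlines():
            line = line.strip()
            if ": " not in line:
                continue  # region name or empty line
            interface, url = line.split(": ", 1)
            yield service["Name"], interface, url


if __name__ == "__main__":
    import sys
    for row in sorted(iter_endpoints(sys.stdin.read())):
        print(" ".join(row))
```

Piping the output of the "openstack catalog list --format json --column Name --column Endpoints" command into this script would print one "name interface url" line per endpoint, for either form of the output.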

 

 

 

 

Version-Release number of selected component (if applicable):

Openshift Container Platform 4.12 on 3.18.1 release of openstack 

How reproducible:

- Always

Steps to Reproduce:

1. Copy the script and run it on the given OpenStack release.
2.
3.

Actual results:

Fails with a parsing error

Expected results:

Shouldn't fail.

Additional info:

 

Invoking 'create cluster-manifests' fails when imageContentSources is missing in install-config yaml:

$ openshift-install agent create cluster-manifests
INFO Consuming Install Config from target directory
FATAL failed to write asset (Mirror Registries Config) to disk: failed to write file: open .: is a directory

install-config.yaml:

apiVersion: v1alpha1
metadata:
  name: appliance
rendezvousIP: 192.168.122.116
hosts:
  - hostname: sno
    installerArgs: '["--save-partlabel", "agent*", "--save-partlabel", "rhcos-*"]'
    interfaces:
     - name: enp1s0
       macAddress: 52:54:00:e7:05:72
    networkConfig:
      interfaces:
        - name: enp1s0
          type: ethernet
          state: up
          mac-address: 52:54:00:e7:05:72
          ipv4:
            enabled: true
            dhcp: true 

Description of problem:

The following changes are required for openshift/route-controller-manager#22 refactoring.

add POD_NAME to route-controller-manager deployment
introduce route-controller-defaultconfig and customize lease name openshift-route-controllers to override the default one supplied by library-go
add RBAC for infrastructures which is used by library-go for configuring leader election

Description of problem:

We are seeing flakes in HyperShift CI jobs: https://search.ci.openshift.org/?search=Alerting+rule+%22CsvAbnormalFailedOver2Min%22&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Sample failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1692/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-hypershift/1664244482360479744

{  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:148]: Incompliant rules detected:

Alerting rule "CsvAbnormalFailedOver2Min" (group: olm.csv_abnormal.rules) has no 'description' annotation, but has a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation
Alerting rule "CsvAbnormalFailedOver2Min" (group: olm.csv_abnormal.rules) has no 'summary' annotation
Alerting rule "CsvAbnormalOver30Min" (group: olm.csv_abnormal.rules) has no 'description' annotation, but has a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation
Alerting rule "CsvAbnormalOver30Min" (group: olm.csv_abnormal.rules) has no 'summary' annotation
Ginkgo exit error 1: exit with code 1}

Version-Release number of selected component (if applicable):

4.14 CI

How reproducible:

sometimes

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Serverless -> Eventing -> Channels: values under the Conditions column are in English.
Translator comments:
"x OK/y" should be translated as "x个 OK(共y个)"

Version-Release number of selected component (if applicable):

4.13.0-ec.1

How reproducible:

always

Steps to Reproduce:

1. Navigate to Serverless -> Eventing -> Channels.
2. Values under Conditions column are in English.
3.

Actual results:

Content is in English.

Expected results:

Content should be in the target language. "x OK/y" should be translated as "x个 OK(共y个)"

Additional info:

screenshot provided

Description of the problem:
The OCI platform is available only from OCP 4.14 onwards, so it shouldn't be possible to create an OCI cluster with OCP < 4.14.
 

How reproducible:

You can reproduce with aicli
 

Steps to reproduce:

$ aicli --integration create cluster agentil-test-oci-19 -P platform='{"type": "oci"}' -P pull_secret=<your pull secret> -P user_managed_networking=true -P minimal=true -P openshift_version=4.13

Actual results:

 [agentil@fedora Downloads]$ aicli --integration create cluster agentil-test-oci-19 -P platform='{"type": "oci"}' -P pull_secret=~/Downloads/pull-secret.txt -P user_managed_networking=true -P minimal=true -P openshift_version=4.13
Creating cluster agentil-test-oci-19
Using karmalabs.corp as DNS domain as no one was provided
Forcing network_type to OVNKubernetes
Using version 4.13.2
Creating infraenv agentil-test-oci-19_infra-env
Using karmalabs.corp as DNS domain as no one was provided

[agentil@fedora Downloads]$ aicli --integration info cluster agentil-test-oci-19
ams_subscription_id: 2QvJWtlvlUIvFtCmOIPiwkHRirC
api_vips: []
base_dns_domain: karmalabs.corp
cluster_networks: [{'cluster_id': '65f2a1fa-efd2-419a-9bf0-802e595a0a63', 'cidr': '10.128.0.0/14', 'host_prefix': 23}]
connectivity_majority_groups: {"IPv4":[],"IPv6":[]}
controller_logs_collected_at: 0001-01-01 00:00:00+00:00
controller_logs_started_at: 0001-01-01 00:00:00+00:00
cpu_architecture: x86_64
created_at: 2023-06-08 12:42:36.327854+00:00
disk_encryption: {'enable_on': 'none', 'mode': 'tpmv2', 'tang_servers': None}
email_domain: redhat.com
feature_usage: {"Cluster Tags":{"id":"CLUSTER_TAGS","name":"Cluster Tags"},"Hyperthreading":{"data":{"hyperthreading_enabled":"all"},"id":"HYPERTHREADING","name":"Hyperthreading"},"OVN network type":{"id":"OVN_NETWORK_TYPE","name":"OVN network type"},"Platform selection":{"data":{"platform_type":"oci"},"id":"PLATFORM_SELECTION","name":"Platform selection"},"User Managed Networking With Multi Node":{"id":"USER_MANAGED_NETWORKING_WITH_MULTI_NODE","name":"User Managed Networking With Multi Node"}}
high_availability_mode: Full
hyperthreading: all
id: 65f2a1fa-efd2-419a-9bf0-802e595a0a63
ignition_endpoint: {'url': None, 'ca_certificate': None}
imported: False
ingress_vips: []
install_completed_at: 0001-01-01 00:00:00+00:00
install_started_at: 0001-01-01 00:00:00+00:00
ip_collisions: {}
machine_networks: []
monitored_operators: [{'cluster_id': '65f2a1fa-efd2-419a-9bf0-802e595a0a63', 'name': 'console', 'version': None, 'namespace': None, 'subscription_name': None, 'operator_type': 'builtin', 'properties': None, 'timeout_seconds': 3600, 'status': None, 'status_info': None, 'status_updated_at': datetime.datetime(1, 1, 1, 0, 0, tzinfo=tzutc())}, {'cluster_id': '65f2a1fa-efd2-419a-9bf0-802e595a0a63', 'name': 'cvo', 'version': None, 'namespace': None, 'subscription_name': None, 'operator_type': 'builtin', 'properties': None, 'timeout_seconds': 3600, 'status': None, 'status_info': None, 'status_updated_at': datetime.datetime(1, 1, 1, 0, 0, tzinfo=tzutc())}]
name: agentil-test-oci-19
network_type: OVNKubernetes
ocp_release_image: quay.io/openshift-release-dev/ocp-release:4.13.2-x86_64
openshift_version: 4.13.2
org_id: 11009103
platform: {'type': 'oci'}
progress: {'total_percentage': None, 'preparing_for_installation_stage_percentage': None, 'installing_stage_percentage': None, 'finalizing_stage_percentage': None}
schedulable_masters: False
schedulable_masters_forced_true: True
service_networks: [{'cluster_id': '65f2a1fa-efd2-419a-9bf0-802e595a0a63', 'cidr': '172.30.0.0/16'}]
status: insufficient
status_info: Cluster is not ready for install
status_updated_at: 2023-06-08 12:42:36.324000+00:00
tags: aicli
updated_at: 2023-06-08 12:42:43.362119+00:00
user_managed_networking: True
user_name: agentil@redhat.com

Expected results:

The cluster creation should fail because the version of OCP is incompatible with OCI platform.
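
The expected validation can be sketched as a simple version gate. This Python snippet only illustrates the check and is not the assisted-service implementation; the function names and the error text are hypothetical.

```python
from typing import Tuple

MIN_OCI_VERSION = (4, 14)  # OCI platform is available only from OCP 4.14


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as "4.13", "4.13.2" or "4.14.0-rc.5"."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def validate_platform(platform_type: str, openshift_version: str) -> None:
    """Raise ValueError if the platform is not supported by the requested version."""
    if platform_type == "oci" and parse_major_minor(openshift_version) < MIN_OCI_VERSION:
        raise ValueError(
            f"platform 'oci' requires OpenShift 4.14 or later, got {openshift_version}"
        )
```

In the reproduction above, the request with openshift_version=4.13 would be rejected at cluster creation time.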

Description of problem:

When authenticating openshift-install with the gcloud cli, rather than using a service account key file, the installer will throw an error because https://github.com/openshift/installer/blob/master/pkg/asset/machines/gcp/machines.go#L170-L178 ALWAYS expects to extract a service account to passthrough to nodes in XPN installs. 

An alternative approach would be to handle the lack of a service account without error, and allow the required service accounts to be passed in through another mechanism.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create install config for gcp xpn install
2. Authenticate installer without service account key file (either gcloud cli auth or through a VM).
3.

Actual results:

 

Expected results:

 

Additional info:

 

As discussed in https://issues.redhat.com/browse/MON-1634, adding ownerref will be put on hold for now until CMO has a CR.

 

 

In the meantime we'll add (let's hope temporary) labels to emphasize ownership, this will help guide users for now and help us highlight relations and how we can/want to express them using ownerref in the future. (See option 1 and option 2 in the doc above)

Description of problem:

The "oc adm upgrade --to-multi-arch" command has no guard for cases where cluster conditions may interfere with the transition, such as:
Invalid=True, Failing=True, and Progressing=True

Steps to Reproduce:

Run the command either while an upgrade is in progress, or while cluster conditions such as Invalid=True or Failing=True are present.

Actual results:

accepts the command

Expected results:

warns about the interfering condition, while allowing to progress only if --allow-upgrade-with-warnings is applied
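
The expected guard can be sketched as follows. This is a Python illustration of the logic only, not the oc implementation (which is written in Go); the function names are hypothetical, and the conditions are assumed to be given as a list of type/status pairs like those in the ClusterVersion status.

```python
from typing import Dict, List

INTERFERING_TYPES = ("Invalid", "Failing", "Progressing")


def interfering_conditions(conditions: List[Dict[str, str]]) -> List[str]:
    """Return the types of conditions that are True and may interfere."""
    return [
        c["type"]
        for c in conditions
        if c.get("type") in INTERFERING_TYPES and c.get("status") == "True"
    ]


def may_proceed(conditions: List[Dict[str, str]],
                allow_upgrade_with_warnings: bool) -> bool:
    """Warn about interfering conditions; proceed only if explicitly allowed."""
    found = interfering_conditions(conditions)
    for cond_type in found:
        print(f"warning: cluster condition {cond_type}=True may interfere with the transition")
    return not found or allow_upgrade_with_warnings
```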

 

Description of problem:

The e2e-nutanix test run failed at the bootstrap stage when testing the PR https://github.com/openshift/cloud-provider-nutanix/pull/7. The bootstrap failure could be reproduced manually by creating a Nutanix OCP cluster with the latest nutanix-ccm image.

time="2023-03-06T12:25:56-05:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2023-03-06T12:25:56-05:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
time="2023-03-06T12:25:56-05:00" level=warning msg="The bootstrap machine is unable to resolve API and/or API-Int Server URLs" 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

From the PR https://github.com/openshift/cloud-provider-nutanix/pull/7, trigger the e2e-nutanix test. The test will fail at bootstrap stage with the described errors.

Actual results:

The e2e-nutanix test run failed at bootstrapping with the errors: 

level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.

Expected results:

The e2e-nutanix test will pass

Additional info:

Investigation showed the root cause was the Nutanix cloud-controller-manager pod did not have permission to get/list ConfigMap resource. The error logs from the Nutanix cloud-controller-manager pod:

E0307 16:08:31.753165       1 reflector.go:140] pkg/provider/client.go:124: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" cannot list resource "configmaps" in API group "" at the cluster scope
I0307 16:09:30.050507       1 reflector.go:257] Listing and watching *v1.ConfigMap from pkg/provider/client.go:124
W0307 16:09:30.052278       1 reflector.go:424] pkg/provider/client.go:124: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" cannot list resource "configmaps" in API group "" at the cluster scope
E0307 16:09:30.052308       1 reflector.go:140] pkg/provider/client.go:124: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" cannot list resource "configmaps" in API group "" at the cluster scope 

This is a clone of issue OCPBUGS-18772. The following is the description of the original issue:

MCO installs resolve-prepender NetworkManager script on the nodes. In order to find out node details it needs to pull baremetalRuntimeCfgImage. However, this image needs to be pulled just the first time, in the followup attempts this script just verifies that this image is available.

This is not desirable in situations where the mirror or quay is unavailable or has a temporary problem. These kinds of issues should not prevent the node from starting kubelet. During certificate rotation testing I noticed that a node with a significant time skew won't start kubelet, because it tries to pull baremetalRuntimeCfgImage before kubelet can start, even though the image is already on the node and doesn't need refreshing.

Manifests are copied from the object store (either S3 or pod) into the node that is performing the role of bootstrap during installation (or to the single node in an SNO setup)

They are copied into one of two directories according to the directory into which they were uploaded to the object store.

<cluster-id>/manifests/manifests/* will end up being copied to /run/ephemeral/var/opt/openshift/manifests/
<cluster-id>/manifests/openshift/* will end up being copied to /run/ephemeral/var/opt/openshift/openshift/manifest

After this step, any files that have been written to /run/ephemeral/var/opt/openshift/openshift/ are also copied to /run/ephemeral/var/opt/openshift/manifests/, any identically named files are overwritten as part of this operation.

https://github.com/openshift/installer/blob/1e9209ac80ed2cb4ba5663f519e51161a1d8858a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L71C1-L71C27

This behaviour is entirely expected and correct. However, it leads to an issue if a user uploads files with identical names to both directories, for example:

File 1: <cluster-id>/manifests/manifests/manifest1.yaml
File 2: <cluster-id>/manifests/openshift/manifest1.yaml

In that case only File 2 would end up being applied, because File 1 would be overwritten during the bootkube phase.

We should prevent this from happening by treating any attempt to introduce the same file in two places as illegal, meaning that if File 2 is present, we should prevent the upload of File 1 and vice versa during the creation/update of a manifest.
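
A minimal sketch of such a check, assuming manifests are tracked as "<folder>/<file name>" keys relative to <cluster-id>/manifests/. The helper name is hypothetical and this is a Python illustration, not the service's actual code.

```python
from typing import Iterable, Optional

FOLDERS = ("manifests", "openshift")


def conflicting_manifest(existing: Iterable[str], folder: str,
                         file_name: str) -> Optional[str]:
    """Return the existing key that would clash with folder/file_name, if any.

    A clash means the same file name is already present in the *other* folder,
    so one of the two files would be overwritten during the bootkube phase.
    """
    if folder not in FOLDERS:
        raise ValueError(f"unsupported folder: {folder}")
    other = "openshift" if folder == "manifests" else "manifests"
    candidate = f"{other}/{file_name}"
    return candidate if candidate in set(existing) else None
```

On create or update, a non-None result would be turned into a rejection of the upload.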

 

Description of problem:

Now that the bug to include libnmstate.2.2.x has been resolved (https://issues.redhat.com/browse/OCPBUGS-11659), we are seeing a boot issue in which agent-tui can't start. It looks like it is failing to find the symlink libnmstate.so.2, because when it is run directly we see
$ /usr/local/bin/agent-tui
/usr/local/bin/agent-tui: error while loading shared libraries: libnmstate.so.2: cannot open shared object file: No such file or directory

As a result, neither the console nor ssh is available during bootstrap, which makes debugging difficult. However, it does not affect the installation, as we still get a successful install. The bootstrap screenshots are attached.

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

If the user specifies a DNS name in an EgressNetworkPolicy for which the upstream server returns a truncated DNS response, openshift-sdn does not fall back to TCP as expected but simply treats this as a failure.

Version-Release number of selected component (if applicable):

4.11 (originally reproduced on 4.9)

How reproducible:

Always

Steps to Reproduce:

1. Setup an EgressNetworkPolicy that points to a domain where a truncated response is returned while querying via UDP.
2.
3.

Actual results:

Error, DNS resolution not completed.

Expected results:

Request retried via TCP and succeeded.

Additional info:

In comments.
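
For reference, a truncated response is signalled by the TC bit in the DNS message header. The sketch below shows the expected retry logic in Python; it is an illustration only, not the openshift-sdn code, and `send_udp` / `send_tcp` are hypothetical transport callables supplied by the caller.

```python
import struct
from typing import Callable

TC_FLAG = 0x0200  # "truncated" bit in the 16-bit flags field of the DNS header


def is_truncated(response: bytes) -> bool:
    """Return True if the DNS response has the TC (truncated) bit set."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    _msg_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


def resolve(query: bytes,
            send_udp: Callable[[bytes], bytes],
            send_tcp: Callable[[bytes], bytes]) -> bytes:
    """Query over UDP first and retry over TCP if the answer was truncated."""
    response = send_udp(query)
    if is_truncated(response):
        response = send_tcp(query)
    return response
```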

Description of problem:
When the user edits a deployment and switches only the rollout "Strategy type", the form cannot be saved because the Save button stays disabled.

Version-Release number of selected component (if applicable):
4.13

How reproducible:
Always

Steps to Reproduce:

  1. Import an application from git
  2. Select action "Edit Deployment"
  3. Change the "Strategy type" value

Actual results:
Save button stays disabled

Expected results:
Save button should enable when changing a value (that doesn't make the form state invalid)

Additional info:

Description of problem:

egressip cannot be assigned on hypershift hosted cluster node

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-09-162945

How reproducible:

100%

Steps to Reproduce:

1. setup hypershift env


2. label the egress IP nodes on the hosted cluster
% oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-129-175.us-east-2.compute.internal   Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-129-244.us-east-2.compute.internal   Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-141-41.us-east-2.compute.internal    Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-142-54.us-east-2.compute.internal    Ready    worker   3h20m   v1.26.2+bc894ae

% oc label node/ip-10-0-129-175.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-129-175.us-east-2.compute.internal labeled
% oc label node/ip-10-0-129-244.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-129-244.us-east-2.compute.internal labeled
% oc label node/ip-10-0-141-41.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-141-41.us-east-2.compute.internal labeled
% oc label node/ip-10-0-142-54.us-east-2.compute.internal  k8s.ovn.org/egress-assignable=""
node/ip-10-0-142-54.us-east-2.compute.internal labeled


3. create egressip
% cat egressip.yaml 
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs: [ "10.0.129.180" ]
  namespaceSelector:
    matchLabels:
      env: ovn-tests
% oc apply -f egressip.yaml 
egressip.k8s.ovn.org/egressip-1 created


4. check egressip assignment
             

Actual results:

egressip cannot be assigned to a node
% oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   10.0.129.180

Expected results:

egressip can be assigned to one of the hosted cluster nodes

Additional info:

 

Description of problem:

Starting with 4.12.0-0.nightly-2023-03-13-172313, the machine API operator began receiving an invalid version tag, due to either a missing or an invalid VERSION_OVERRIDE (https://github.com/openshift/machine-api-operator/blob/release-4.12/hack/go-build.sh#L17-L20) value being passed to the build.

This is resulting in all jobs invoked by the 4.12 nightlies failing to install.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-03-13-172313 and later

How reproducible:

Consistently, in 4.12 nightlies only (CI builds do not seem to be impacted).

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Example of failure https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-csi/1635331349046890496/artifacts/e2e-aws-csi/gather-extra/artifacts/pods/openshift-machine-api_machine-api-operator-866d7647bd-6lhl4_machine-api-operator.log

Description of problem:

Command `$ oc explain route.spec.tls.insecureEdgeTerminationPolicy` shows different values than the actual values.

Version-Release number of selected component (if applicable):

4.10.z

How reproducible:

100%

Steps to Reproduce:

1. $ oc explain route.spec.tls.insecureEdgeTerminationPolicy
KIND:     Route
VERSION:  route.openshift.io/v1

FIELD:    insecureEdgeTerminationPolicy <string>

DESCRIPTION:
     insecureEdgeTerminationPolicy indicates the desired behavior for insecure
     connections to a route. While each router may make its own decisions on
     which ports to expose, this is normally port 80.     
    
     * Allow - traffic is sent to the server on the insecure port (default)
     * Disable - no traffic is allowed on the insecure port.
     * Redirect - clients are redirected to the secure port.

2. Set the option to 'Disable' in any secure route :
   $ oc edit route <route-name>
     spec:
       host: hello.example.com
       port:
         targetPort: https
       tls:
         insecureEdgeTerminationPolicy: Disable

3. After editing the route and setting `insecureEdgeTerminationPolicy: Disable` , it gives error :
Danger alert:An error occurred
Error "Invalid value: "Disable": invalid value for InsecureEdgeTerminationPolicy option, acceptable values are None, Allow, Redirect, or empty" for field "spec.tls.insecureEdgeTerminationPolicy".

Actual results:

Based on the API Usage information, the Disable value for insecureEdgeTerminationPolicy field is not acceptable.

Expected results:

The `oc explain route.spec.tls.insecureEdgeTerminationPolicy` must show the correct values.

Additional info:

 

Description of problem:

We do not check the response for errors when we request console plugins in getConsolePlugins. If this request fails, we still try to access the "Items" property of the response, which is nil, and this causes an exception to be thrown. We need to make sure the request succeeded before referencing any properties of the response.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Run bridge locally without setting the requisite env vars

Actual results:

A runtime exception is thrown from the getConsolePlugins function and bridge terminates

Expected results:

An error should be logged and bridge should continue to run
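
The expected control flow can be sketched as below. The bridge itself is written in Go; this Python version only illustrates "check the error before touching the response". The `list_plugins` callable, the logger name and the response shape are hypothetical.

```python
import logging
from typing import Any, Callable, Dict, List, Optional, Tuple

logger = logging.getLogger("bridge-sketch")

ListResult = Tuple[Optional[Dict[str, Any]], Optional[Exception]]


def get_console_plugins(list_plugins: Callable[[], ListResult]) -> List[str]:
    """Return plugin names, or an empty list if the request failed."""
    response, err = list_plugins()
    if err is not None or response is None:
        # Log and continue instead of dereferencing a missing response.
        logger.error("failed to list console plugins: %s", err)
        return []
    return [item["metadata"]["name"] for item in response.get("items", [])]
```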

Additional info:

 

Owner: Architect:

Story (Required)

As an ODC helm backend developer I would like to be able to bump version of helm to 3.12 to stay synched up with the version we will ship with OCP 4.14

Background (Required)

Normal activity we do every time a new OCP version is released, to stay current

Glossary

NA

Out of scope

NA

Approach(Required)

Bump the version of helm to 3.12, then run, build, and unit test to make sure everything is working as expected. Last time we had a conflict with the DevFile backend.

Dependencies

Might have dependencies on the DevFile team to move some dependencies forward

Edge Case

NA

Acceptance Criteria

Console Helm dependency is moved to 3.12

INVEST Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

Legend

Unknown
Verified
Unsatisfied

Description of problem:

NAT gateway is not yet a supported feature and the current implementation is a partial non-zonal solution.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

1. Set OutboundType = NatGateway
2. Deploy cluster
3.

Actual results:

Install successful

Expected results:

Install requires TechPreviewNoUpgrade before proceeding

Additional info:

 

Description of problem:

https://github.com/openshift/openshift-docs/pull/59549#discussion_r1184195239

per the discussion here, the text in the dev console when creating a function says a func.yaml file must be present OR it must use the s2i build strategy, when in fact both things are required

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Go to +Add -> Create Serverless function and use a repo URL that doesn't fit the requirements in order to see the error

Actual results:

 

Expected results:

 

Additional info:

 

Version:

$ openshift-install version

./openshift-install 4.9.11
built from commit 4ee186bb88bf6aeef8ccffd0b5d4e98e9ddd895f
release image quay.io/openshift-release-dev/ocp-release@sha256:0f72e150329db15279a1aeda1286c9495258a4892bc5bf1bf5bb89942cd432de
release architecture amd64

Platform: Openstack

install type: IPI

What happened?

Image streams use the Swift container to store images. After running many image streams, the Swift container holds a huge number of objects. If I destroy the cluster at that point, the destroy takes a very long time, depending on the size of the Swift container.

What did you expect to happen?

The destroy script should clean the resources in some reasonable time

How to reproduce it (as minimally and precisely as possible)?

Deploy OCP, run a workload that creates a lot of image streams, and destroy the cluster. The destroy command takes a long time to complete.

Anything else we need to know?

Here is the output of the swift stat command and the time it took to complete the destroy job:

$ swift stat vlan609-26jxm-image-registry-nseyclolgfgxoaiysrlejlhvoklcawbxt
Account: AUTH_2b4d979a2a9e4cf88b2509e9c5e0e232
Container: vlan609-26jxm-image-registry-nseyclolgfgxoaiysrlejlhvoklcawbxt
Objects: 723756
Bytes: 652448740473
Read ACL:
Write ACL:
Sync To:
Sync Key:
Meta Name: vlan609-26jxm-image-registry-nseyclolgfgxoaiysrlejlhvoklcawbxt
Meta Openshiftclusterid: vlan609-26jxm
Content-Type: application/json; charset=utf-8
X-Timestamp: 1640248399.77606
Last-Modified: Thu, 23 Dec 2021 08:34:48 GMT
Accept-Ranges: bytes
X-Storage-Policy: Policy-0
X-Trans-Id: txb0717d5198e344a5a095d-0061c93b70
X-Openstack-Request-Id: txb0717d5198e344a5a095d-0061c93b70

Time took to complete the destroy: 6455.42s

If the user provides a partial, empty, or invalid CA certificate in the ignition endpoint override, the ignitionDownloadable/API_VIP validation fails, but the user is not told why.
In the agent log we see this error:

Failed to download worker.ign: unable to parse cert 

One option is to return the error as part of the APIVipConnectivityResponse when the download fails, and to use that value in the failing validation message shown to the user.
This is a bit tricky: the current error messages are not user facing, so we would need to adjust them.
It also requires API changes.
Another option is to validate the parameters the user provides.
Description of the problem:

While scale testing ACM 2.8, sometimes 0 of the SNOs are discovered. Upon review, the agent on the SNOs is attempting to return the inspection data to the API VIP IP address instead of the IP address of the metal3 pod (which is the address of the node hosting the metal3 pod). Presumably, in the runs where the agents were discovered, the API VIP happened to be on the same node as the metal3 pod.

How reproducible:

With a 3-node cluster, you should encounter this roughly 66% of the time.
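
The estimate can be checked by enumeration. The sketch below assumes the API VIP and the metal3 pod each land on one of the three nodes independently and with equal likelihood, which the report does not state explicitly.

```python
from fractions import Fraction
from itertools import product


def mismatch_probability(nodes: int) -> Fraction:
    """Probability that the API VIP and the metal3 pod sit on different nodes."""
    placements = list(product(range(nodes), repeat=2))
    mismatches = sum(1 for vip, pod in placements if vip != pod)
    return Fraction(mismatches, len(placements))


print(mismatch_probability(3))  # 2/3
```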

Steps to reproduce:

1.

2.

3.

Actual results:

Ironic agents attempt to access "fc00:1004::3", which is the API VIP address:

2023-03-12 17:52:51.441 1 CRITICAL ironic-python-agent [-] Unhandled error: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='fc00:1004::3', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f94354114c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))                                                    
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent Traceback (most recent call last):
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 169, in _new_conn                                                                       
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     conn = connection.create_connection(
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/util/connection.py", line 96, in create_connection                                                           
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     raise err
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/util/connection.py", line 86, in create_connection                                                           
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     sock.connect(sa)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/eventlet/greenio/base.py", line 253, in connect                                                                      
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     socket_checkerr(fd)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/eventlet/greenio/base.py", line 51, in socket_checkerr                                                               
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     raise socket.error(err, errno.errorcode[err])
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent ConnectionRefusedError: [Errno 111] ECONNREFUSED
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred:                                                                                           
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent Traceback (most recent call last):
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen                                                                     
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     httplib_response = self._make_request(
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 382, in _make_request                                                               
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     self._validate_conn(conn)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn                                                             
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     conn.connect()
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 353, in connect                                                                         
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     conn = self._new_conn()
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 181, in _new_conn                                                                       
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     raise NewConnectionError(
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f94354114c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred:                                                                                           
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent Traceback (most recent call last):
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 439, in send                                                                             
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     resp = conn.urlopen(
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen                                                                     
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     retries = retries.increment(
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment                                                                       
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     raise MaxRetryError(_pool, url, error or ResponseError(cause))                                                                                            
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='fc00:1004::3', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f94354114c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))                                                                               
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred:                                                                                           
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent Traceback (most recent call last):
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/bin/ironic-python-agent", line 10, in <module>                                                                                                   
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     sys.exit(run())
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/cmd/agent.py", line 50, in run                                                                   
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     agent.IronicPythonAgent(CONF.api_url,
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 471, in run                                                                      
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     uuid = inspector.inspect()
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py", line 106, in inspect                                                              
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     resp = call_inspector(data, failures)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py", line 145, in call_inspector                                                       
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     resp = _post_to_inspector()
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 329, in wrapped_f                                                                        
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     return self.call(f, *args, **kw)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 409, in call                                                                             
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     do = self.iter(retry_state=retry_state)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 368, in iter                                                                             
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     raise retry_exc.reraise()
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 186, in reraise                                                                          
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     raise self.last_attempt.result()
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result                                                                                
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     return self.__get_result()
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result                                                                          
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     raise self._exception
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 412, in call                                                                             
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     result = fn(*args, **kwargs)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py", line 142, in _post_to_inspector                                                   
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     return requests.post(CONF.inspection_callback_url, data=data,                                                                                             
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/requests/api.py", line 119, in post                                                                                  
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     return request('post', url, data=data, json=json, **kwargs)                                                                                               
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/requests/api.py", line 61, in request                                                                                
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     return session.request(method=method, url=url, **kwargs)                                                                                                  
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/requests/sessions.py", line 542, in request                                                                          
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     resp = self.send(prep, **send_kwargs)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/requests/sessions.py", line 655, in send                                                                             
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     r = adapter.send(request, **kwargs)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 516, in send                                                                             
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent     raise ConnectionError(e, request=request)
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent requests.exceptions.ConnectionError: HTTPSConnectionPool(host='fc00:1004::3', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f94354114c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))                                                                            
2023-03-12 17:52:51.441 1 ERROR ironic-python-agent

You can see the metal3 pod node and ip address:

# oc get po -n openshift-machine-api metal3-5cc95d74d8-lqd9x -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE               NOMINATED NODE   READINESS GATES
metal3-5cc95d74d8-lqd9x   5/5     Running   0          2d16h   fc00:1004::7   e27-h05-000-r650   <none>           <none> 

The addresses on the e27-h05-000-r650 node:

[root@e27-h05-000-r650 ~]# ip a | grep "fc00"
    inet6 fc00:1004::4/128 scope global nodad deprecated
    inet6 fc00:1004::7/64 scope global noprefixroute

You can see the API VIP is actually on a different host, e27-h03-000-r650:

[root@e27-h03-000-r650 ~]# ip a | grep "fc00"
    inet6 fc00:1004::3/128 scope global nodad deprecated 
    inet6 fc00:1004::6/64 scope global noprefixroute 

 

Expected results:

 

Versions:

Hub and SNO OCP 4.12.2

ACM - 2.8.0-DOWNSTREAM-2023-02-28-23-06-27

Description of problem:

nodeip-configuration.service has failed on cluster nodes:
systemctl status nodeip-configuration.service
× nodeip-configuration.service - Writes IP address configuration so that kubelet and crio services select a valid node IP
     Loaded: loaded (/etc/systemd/system/nodeip-configuration.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Tue 2023-08-15 16:28:09 UTC; 18h ago
   Main PID: 3709 (code=exited, status=0/SUCCESS)
        CPU: 237ms

Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3761]: ++ [[ -z bond0.354 ]]
Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3761]: ++ echo bond0.354
Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3760]: + iface=bond0.354
Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3760]: + echo 'Node IP interface determined as: bond0.354. Enabling IP forwarding...'
Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3760]: Node IP interface determined as: bond0.354. Enabling IP forwarding...
Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3760]: + sysctl -w net.ipv4.conf.bond0.354.forwarding=1
Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3767]: sysctl: cannot stat /proc/sys/net/ipv4/conf/bond0/354/forwarding: No such file or directory
Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com systemd[1]: nodeip-configuration.service: Control process exited, code=exited, status=1/FAILURE
Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-08-005757

How reproducible:

so far once

Steps to Reproduce:

1. Deploy multinode spoke cluster with GitOps-ZTP
2. Configure baremetal network to be on top of vlan interface
              - name: bond0.354
                description: baremetal network
                type: vlan
                state: up
                vlan:
                  base-iface: bond0
                  id: 354
                ipv4:
                  enabled: true
                  dhcp: false
                  address:
                  - ip: 10.x.x.20
                    prefix-length: 26
                ipv6:
                  enabled: false
                  dhcp: false
                  autoconf: false

Actual results:

Cluster is deployed but nodeip-configuration.service is Failed

Expected results:

nodeip-configuration.service is Active
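
The log suggests the failure comes from the dot in the VLAN interface name being treated as a key separator. A minimal sketch of the difference follows; the helper names are hypothetical, and the two printed paths match the one in the log and the one the interface actually needs.

```python
def naive_proc_path(key: str) -> str:
    """Translate every dot to a slash, which reproduces the failing path."""
    return "/proc/sys/" + key.replace(".", "/")


def forwarding_proc_path(iface: str) -> str:
    """Build the forwarding path while keeping the interface name intact."""
    return f"/proc/sys/net/ipv4/conf/{iface}/forwarding"


print(naive_proc_path("net.ipv4.conf.bond0.354.forwarding"))
# /proc/sys/net/ipv4/conf/bond0/354/forwarding
print(forwarding_proc_path("bond0.354"))
# /proc/sys/net/ipv4/conf/bond0.354/forwarding
```

The sysctl man page also describes "/" as an accepted separator in place of ".", which may be another way to pass such a key. That has not been checked against the script in this report.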

Please review the following PR: https://github.com/openshift/thanos/pull/104

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/router/pull/473

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. create and start a pipeline and navigate to the Pipeline metrics page 

Actual results:

Pipeline metrics page crashes

Expected results:

Pipeline metrics page should work

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

4.14.0-0.nightly-2023-05-29-174116

Workaround:

Additional info:

This is a regression introduced after this was merged: https://github.com/openshift/console/pull/12821/commits/c2d24932cd41b1b4c89d7b9fa5ca46d18b0d2d29#diff-782cbf3ae7050932e76be67d990d9cdaa02e322ea6c2b53083a677ed311ff612R40

 

Description of the problem:

In Staging, after deleting a host in the UI, the host re-registers after ~15 mins

How reproducible:

100%

Steps to reproduce:

1. Before cluster installation, delete a random host using the UI

2. Wait 15 mins

3. Host re-registers without rebooting

Actual results:

Agent automatically re-registers itself after 15 min

Expected results:

Agent should register again only after a reboot

Description of problem:

The test TestPrometheusRemoteWrite/assert_remote_write_cluster_id_relabel_config_works is flaky and keeps blocking PR merges. After investigation, it seems the timeout for waiting on the expected value is simply too short.

Description of problem:

The hypershift CLI tool allows any string as the cluster name. But later, when the cluster is to be imported, the name needs to conform to RFC1123.

So the user needs to read the error, destroy the cluster and then try again with a proper name. This experience can be improved.

Version-Release number of selected component (if applicable):

4.13.4

How reproducible:

Always

Steps to Reproduce:

1. hypershift create cluster kubevirt --name virt-4.12 ...
2. try to import it

Actual results:

cluster fails to import due to its name

Expected results:

validate the cluster name in the hypershift cli, fail early
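
An early check might look like the following sketch. It assumes the import step requires the RFC 1123 label form (lowercase alphanumerics and "-", starting and ending with an alphanumeric, at most 63 characters), which is consistent with virt-4.12 being rejected. The function name is hypothetical.

```python
import re
from typing import Optional

# RFC 1123 label: lowercase alphanumerics or '-', alphanumeric at both ends.
_DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63


def validate_cluster_name(name: str) -> Optional[str]:
    """Return an error message for an invalid cluster name, else None."""
    if len(name) > _MAX_LABEL_LENGTH:
        return f"cluster name must be at most {_MAX_LABEL_LENGTH} characters"
    if not _DNS1123_LABEL.fullmatch(name):
        return (
            "cluster name must consist of lowercase alphanumeric characters "
            "or '-', and must start and end with an alphanumeric character"
        )
    return None
```

Running this before any resources are created would reject virt-4.12 because of the dot, while a name such as virt-412 would pass.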

Additional info:

 

Reported by IBM.

Apparently, they run in such a way that status.Version.Desired.Version is not guaranteed to be a parseable semantic version. Thus isUpgradeble returns an error and blocks upgrade, even if the force upgrade annotation is present.

We should check for the annotation first and if the upgrade is being forced, we don't need to do the z-stream upgrade check.
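
The proposed ordering can be sketched as follows. The annotation key, function names, and the shape of the z-stream check are placeholders for illustration, not the actual implementation.

```python
import re
from typing import Mapping, Optional, Tuple

# Hypothetical annotation key, used only for this illustration.
FORCE_UPGRADE_ANNOTATION = "example.com/force-upgrade"

_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading major.minor.patch of a version string."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"cannot parse version {version!r}")
    return tuple(int(part) for part in match.groups())


def upgrade_error(
    annotations: Mapping[str, str], current: str, desired: str
) -> Optional[str]:
    """Return a reason the upgrade is blocked, or None if it may proceed."""
    # Check the force annotation first so an unparseable version cannot
    # block a forced upgrade.
    if FORCE_UPGRADE_ANNOTATION in annotations:
        return None
    try:
        cur, des = parse_version(current), parse_version(desired)
    except ValueError as err:
        return str(err)
    # Stand-in for the z-stream check: major and minor must match.
    if cur[:2] != des[:2]:
        return "only z-stream upgrades are allowed without the force annotation"
    return None
```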

https://redhat-internal.slack.com/archives/C01C8502FMM/p1689279310050439

Description of problem:

ccoctl does not prevent the user from using the same resource group name for the OIDC and installation resource groups which can result in resources existing in the resource group used for cluster installation. The OpenShift installer requires that the installation resource group be empty so OIDC and installation resource groups must be distinct.

ccoctl currently allows providing both --oidc-resource-group-name and --installation-resource-group-name but does not indicate a problem when those resource group names are the same. When the same resource group name is provided using a combination of the --name, --oidc-resource-group-name and --installation-resource-group-name parameters, ccoctl should exit with an error indicating that the resource group names must be different.

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

100%

Steps to Reproduce:

1. Run ccoctl azure create-all with a combination of --name, --oidc-resource-group-name or --installation-resource-group-name resulting in OIDC and installation resource group names being the same.

./ccoctl azure create-all --name "abutchertest" --region centralus --subscription-id "${SUBSCRIPTION_ID}" --credentials-requests-dir "${MYDIR}/credreqs" --oidc-resource-group-name "abutchertest" --dnszone-resource-group-name "${DNS_RESOURCE_GROUP}"

ccoctl will default the installation resource group to match the provided --name parameter "abutchertest", which results in both the OIDC and installation resource groups being "abutchertest" since --oidc-resource-group-name uses the same name. This means that OIDC resources will be created in the resource group that will be configured for the OpenShift installer within the install-config.yaml.

2. Run the OpenShift installer having set .platform.azure.resourceGroupName in the install-config.yaml to "abutchertest" and receive an error that the installation resource group is not empty. The resources identified will include user-assigned managed identities meant to be created in the OIDC resource group, which must be separate from the installation resource group.

FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.azure.resourceGroupName: Invalid value: "abutchertest": resource group must be empty but it has 8 resources like...

Actual results:

ccoctl allows OIDC and installation resource group names to be the same.

Expected results:

ccoctl does not allow OIDC and installation resource groups to be the same.
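
A minimal sketch of the check follows. The installation default (falling back to --name) comes from the report; the OIDC default of "<name>-oidc", the case-insensitive comparison, and the function name are assumptions made for illustration.

```python
from typing import Optional, Tuple


def resolve_resource_groups(
    name: str,
    oidc_rg: Optional[str] = None,
    installation_rg: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (oidc, installation) group names, rejecting identical names."""
    # Per the report, the installation resource group defaults to --name.
    installation = installation_rg or name
    # Assumption: the OIDC resource group defaults to "<name>-oidc".
    oidc = oidc_rg or f"{name}-oidc"
    # Assumption: names are compared case-insensitively.
    if oidc.lower() == installation.lower():
        raise ValueError(
            f"OIDC resource group {oidc!r} and installation resource group "
            f"{installation!r} must be different"
        )
    return oidc, installation
```

With the arguments from step 1, this would raise before any OIDC resources are created.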

Additional info:

 

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/220

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

OpenShift Console fails to render the Monitoring Dashboard when a proxy is expected to be used. Additionally, WebSocket connections fail because they do not use the proxy.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Connect to a cluster using backplane and use one of IT's proxies
2. Execute "ocm backplane console -b"
3. Attempt to view the monitoring dashboard

Actual results:

Monitoring dashboard fails to load with an EOF error
Terminal is spammed with EOF errors

Expected results:

Monitoring dashboard should be rendered correctly
Terminal should not be spammed with error logs

Additional info:

When we apply the changes in this PR, the monitoring dashboard works with the proxy: https://github.com/openshift/console/pull/12877

Description of problem:

When the OIDC provider is deleted on the customer side, AWS resource deletion is not skipped when the ValidAWSIdentityProvider state is 'Unknown'.

This results in clusters being stuck during deletion.

Version-Release number of selected component (if applicable):

4.12.z, 4.13.z, 4.14.z

How reproducible:

Irregular

Steps to Reproduce:

1.
2.
3.

Actual results:

Cluster stuck in uninstallation

Expected results:

Clusters not stuck in uninstallation, AWS customer resources being skipped for removal

Additional info:

Added MG for all hypershift related NS

Bug seems to be at https://github.com/openshift/hypershift/pull/2281/files#diff-f90ab1b32c9e1b349f04c32121d59f5e9081ccaf2be490f6782165d2960bc6c7R295 : 'Unknown' needs to be added to the check if OIDC is valid or not.
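
The suggested change amounts to treating 'Unknown' the same as 'False' when deciding whether to skip AWS cleanup. A sketch, with a hypothetical function name and the standard condition status strings:

```python
def should_skip_aws_cleanup(valid_identity_provider_status: str) -> bool:
    """Skip AWS resource deletion unless the identity provider is known valid."""
    # Condition status is one of "True", "False", or "Unknown".
    # Treating "Unknown" like "False" avoids waiting on AWS calls that
    # cannot succeed once the OIDC provider is gone.
    return valid_identity_provider_status in ("False", "Unknown")
```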

Description of problem:

A customer has reported that the Thanos querier pods are OOM-killed when loading the API performance dashboard with large time ranges (e.g. >= 1 week).

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Always for the customer

Steps to Reproduce:

1. Open the "API performance" dashboard in the admin console.
2. Select a time range of 2 weeks.
3.

Actual results:

The dashboard fails to refresh and the thanos-query pods are killed.

Expected results:

The dashboard loads without error.

Additional info:

The issue arises for the customer because they have very large clusters (hundreds of nodes) which generate lots of metrics.
In practice, the queries executed by the dashboard are costly because they access lots of series (probably more than tens of thousands). To make it more efficient, the "upstream" dashboard from kubernetes-monitoring/kubernetes-mixin uses recording rules [1] instead of raw queries. While this slightly reduces accuracy (one can only distinguish between read and write API requests), it's the only solution that avoids overloading the Thanos query endpoint.

[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/05a58f765eda05902d4f7dd22098a2b870f7ca1e/dashboards/apiserver.libsonnet#L50-L75

 

Description of problem:

In the metric `cluster:capacity_cpu_cores:sum` there is an attribute label `label_node_role_kubernetes_io` that has `infra` or `master`. There is no label for `worker`. If the infra nodes are missing this label, they get added into the "unlabeled" worker nodes. 

For example:
This cluster has all three types `cluster:capacity_cpu_cores:sum{_id="0702a3b1-c2d8-427f-865d-3ce7dc3a2be7"}`

But this cluster has the infra and worker merged. `cluster:capacity_cpu_cores:sum{_id="0e60ac76-d61a-4e6d-a4f3-269110b6b1f9"}`


If I count clusters that have sockets with infra but capacity_cpu without infra, I get 7,617 clusters for 2023-03-15

If I count clusters that have sockets with infra but capacity_cpu with infra, I get 2,015 clusters for 2023-03-15

That means that there are 5602 clusters that are missing the infra label. 

This metric is used to identify the vCPU/CPU count that is used in TeleSense. This is presented to the Sales teams and upper management. If there is another metric we should use, please let me know. Otherwise, this needs to be fixed. 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

refer to Slack thread: https://redhat-internal.slack.com/archives/C0VMT03S5/p1678967355450719

Description of problem:

ROSA is being branded via custom branding; as a result, the favicon disappears since we do not want any Red Hat/OpenShift-specific branding to appear when custom branding is in use. Since ROSA is a Red Hat product, it should get a branding option added to the console so that all the correct branding, including the favicon, appears.

Version-Release number of selected component (if applicable):

4.14.0, 4.13.z, 4.12.z, 4.11.z

How reproducible:

Always

Steps to Reproduce:

1.  View a ROSA cluster
2.  Note the absence of the OpenShift logo favicon

Description of problem:

Daemonset cni-sysctl-allowlist-ds is missing annotation for workload partitioning.

Version-Release number of selected component (if applicable):

 

How reproducible:

Executing the daemonset shows the pod missing the workload annotation

Steps to Reproduce:

1. Run Daemonset
2.
3.

Actual results:

No workload annotation present.

Expected results:

annotations:
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'

Additional info:

 

Description of problem:

vSphere dual-stack added support for both IPv4 and IPv6 in kubelet --node-ip.
However, the masters are booting without the IPv6 address in --node-ip:

"Ignoring filtered route {Ifindex: 2 Dst: <nil> Src: 192.168.130.19 Gw: 192.168.130.1 Flags: [] Table: 254}"
"Ignoring filtered route {Ifindex: 2 Dst: 192.168.130.0/24 Src: 192.168.130.19 Gw: <nil> Flags: [] Table: 254}"
"Ignoring filtered route {Ifindex: 2 Dst: fd65:a1a8:60ad:271c::22/128 Src: <nil> Gw: <nil> Flags: [] Table: 254}"
"Ignoring filtered route {Ifindex: 2 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254}"
"Ignoring filtered route {Ifindex: 2 Dst: <nil> Src: <nil> Gw: fe80::9eb4:f9fa:2b8d:8372 Flags: [] Table: 254}"

"Writing Kubelet service override with content [Service]\nEnvironment=\"KUBELET_NODE_IP=192.168.130.19\" \"KUBELET_NODE_IPS=192.168.130.19\"\n"

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-28-154013

How reproducible:

Intermittent (DHCPv6 related)

Steps to Reproduce:

1. install vsphere dual-stack IPI with DHCPv6


networking:
  clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    - cidr: fd65:10:128::/56
      hostPrefix: 64
  machineNetwork:
    - cidr: 192.168.0.0/16
    - cidr: fd65:a1a8:60ad:271c::/64
  networkType: OVNKubernetes



Actual results:

Masters missing IPv6 address in KUBELET_NODE_IPS

Install fails with

time="2023-08-30T19:54:19Z" level=error msg="failed to initialize the cluster: Cluster operators authentication, console, ingress, monitoring are not available"

Expected results:

Both IPv4 and IPv6 address in KUBELET_NODE_IPS

Install succeeds
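
The expected result can be expressed as a small check on the generated kubelet override. This is a sketch only: it assumes that multiple addresses in KUBELET_NODE_IPS are comma-separated (the report only shows a single value), and the function name `node_ip_families` is hypothetical.

```python
import ipaddress
import re


def node_ip_families(override: str) -> set:
    """Return the IP versions (4 and/or 6) listed in KUBELET_NODE_IPS."""
    match = re.search(r'KUBELET_NODE_IPS=([^"\s]+)', override)
    if match is None:
        return set()
    return {ipaddress.ip_address(ip).version for ip in match.group(1).split(",")}


# Override content as reported in the log above (unescaped).
observed = '[Service]\nEnvironment="KUBELET_NODE_IP=192.168.130.19" "KUBELET_NODE_IPS=192.168.130.19"\n'
print(node_ip_families(observed) == {4, 6})  # False: only IPv4 is present
```

A dual-stack master would be expected to yield both versions, so a result other than `{4, 6}` flags the problem before the install times out.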

Additional info:

Do we set ipv6.may-fail with NetworkManager?

Description of problem:
After upgrading a plugin image the browser continues to request old plugin files

How reproducible:
100%

Steps to Reproduce:
1. Build and deploy a plugin generated from the console-plugin-template repo
2. Open one of the plugin pages in the browser
3. Make a change in the code of that page, then rebuild and deploy a new image
4. Try to view this page in Firefox: you'll get a 404 error. In Chrome you'll get the old page

Root cause:
The plugin JS file names are auto-generated, so the new image has different JS file names.
But the plugin-entry.js file name remains the same; the file is cached by default, so the browser continues to request the old files.
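
One common way to avoid this class of problem is to serve the stable entry file with a revalidating cache policy and only the hashed files with a long-lived one. The sketch below illustrates that idea; it is not the fix that was applied in the console, and the file-name pattern for hashed chunks is an assumption.

```python
import re

# Assumption: generated chunks embed a hex content hash of at least 8 characters.
HASHED_CHUNK = re.compile(r"[.-][0-9a-f]{8,}(\.chunk)?\.js$")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for a plugin asset path."""
    name = path.rsplit("/", 1)[-1]
    if name == "plugin-entry.js":
        # Stable name, changing content: always revalidate with the server.
        return "no-cache"
    if HASHED_CHUNK.search(name):
        # Name changes whenever content changes: safe to cache for a long time.
        return "public, max-age=31536000, immutable"
    return "no-cache"
```

With such a policy the browser re-checks plugin-entry.js after every deployment and therefore learns the new chunk names.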

Description of problem: The openshift-manila-csi-driver namespace should have the "workload.openshift.io/allowed=management" label.

This is currently not the case:

❯ oc describe ns openshift-manila-csi-driver  
Name:         openshift-manila-csi-driver
Labels:       kubernetes.io/metadata.name=openshift-manila-csi-driver
              pod-security.kubernetes.io/audit=privileged
              pod-security.kubernetes.io/enforce=privileged
              pod-security.kubernetes.io/warn=privileged
Annotations:  include.release.openshift.io/self-managed-high-availability: true
              openshift.io/node-selector: 
              openshift.io/sa.scc.mcs: s0:c24,c4
              openshift.io/sa.scc.supplemental-groups: 1000560000/10000
              openshift.io/sa.scc.uid-range: 1000560000/10000
Status:       Active

No resource quota.

No LimitRange resource.

It is causing CI jobs to fail with:

{  fail [github.com/openshift/origin/test/extended/cpu_partitioning/platform.go:82]: projects [openshift-manila-csi-driver] do not contain the annotation map[workload.openshift.io/allowed:management]
Expected
    <[]string | len:1, cap:1>: [
        "openshift-manila-csi-driver",
    ]
to be empty
Ginkgo exit error 1: exit with code 1}

For instance https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27831/pull-ci-openshift-origin-release-4.13-e2e-openstack-ovn/1641317874201006080.

Description of problem:

thanos-sidecar is panicking after the image was rebuilt in this payload https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-04-18-045408


Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm/1648276769645531136

Logs:
  - containerID: cri-o://c62dcc73b8203bfd968ffca95bba8607e24a06492948a0179cde6a57a897d431
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a007b49153ee517ab4fe0600d217832bac0fd6152b5a709da291b60c82a5875d
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a007b49153ee517ab4fe0600d217832bac0fd6152b5a709da291b60c82a5875d
    lastState:
      terminated:
        containerID: cri-o://c62dcc73b8203bfd968ffca95bba8607e24a06492948a0179cde6a57a897d431
        exitCode: 2
        finishedAt: '2023-04-18T12:30:20Z'
        message: "panic: Something in this program imports go4.org/unsafe/assume-no-moving-gc\
          \ to declare that it assumes a non-moving garbage collector, but your version\
          \ of go4.org/unsafe/assume-no-moving-gc hasn't been updated to assert that\
          \ it's safe against the go1.20 runtime. If you want to risk it, run with\
          \ environment variable ASSUME_NO_MOVING_GC_UNSAFE_RISK_IT_WITH=go1.20 set.\
          \ Notably, if go1.20 adds a moving garbage collector, this program is unsafe\
          \ to use.\n\ngoroutine 1 [running]:\ngo4.org/unsafe/assume-no-moving-gc.init.0()\n\
          \t/go/src/github.com/improbable-eng/thanos/vendor/go4.org/unsafe/assume-no-moving-gc/untested.go:25\
          \ +0x1ba\n"
        reason: Error
        startedAt: '2023-04-18T12:30:20Z'
    name: thanos-sidecar
    ready: false
    restartCount: 14
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=thanos-sidecar pod=prometheus-k8s-0_openshift-monitoring(bafeb85b-3980-4153-90bc-a302b93c3465)
        reason: CrashLoopBackOff

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-04-18-045408

How reproducible:

Always

Steps to Reproduce:

1. Install 4.14.0-0.nightly-2023-04-18-045408

Actual results:

thanos-sidecar panics and cluster doesn't install

Expected results:

 

Additional info:

 

Description of problem:

Deployed an OCP cluster using hypershift agent with the 4.14.0-ec.4 release version on Power.
Loading the OperatorHub page in the GUI returns a 404 error.

Version-Release number of selected component (if applicable):

OCP 4.14.0-ec.4

How reproducible:

Every time

Steps to Reproduce:

1. Deploy Hypershift cluster 
2. Go to GUI and check OperatorHub
3. 

Actual results:

OperatorHub page in GUI is throwing 404 error

Expected results:

OperatorHub page should show Operators

Additional information:

Failure status in olm operator pod from management cluster:

# oc get pod olm-operator-754779f559-846tw -n clusters-hypershift-015 -oyaml

        message: |
          'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator
          time="2023-08-17T10:58:37Z" level=error msg="initialization error - failed to ensure name=\"\" - ClusterOperator.config.openshift.io \"\\\"\\\"\" is invalid: metadata.name: Invalid value: \"\\\"\\\"\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator
          time="2023-08-17T10:59:37Z" level=error msg="initialization error - failed to ensure name=\"\" - ClusterOperator.config.openshift.io \"\\\"\\\"\" is invalid: metadata.name: Invalid value: \"\\\"\\\"\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator
          time="2023-08-17T11:00:37Z" level=error msg="initialization error - failed to ensure name=\"\" - ClusterOperator.config.openshift.io \"\\\"\\\"\" is invalid: metadata.name: Invalid value: \"\\\"\\\"\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator
          I0817 11:01:33.000390       1 trace.go:205] Trace[2006040218]: "DeltaFIFO Pop Process" ID:system:controller:route-controller,Depth:152,Reason:slow event handlers blocking the queue (17-Aug-2023 11:01:28.947) (total time: 456ms):
          Trace[2006040218]: [456.950035ms] [456.950035ms] END
          2023/08/17 11:01:41 http: TLS handshake error from 10.244.0.10:33355: read tcp 172.17.53.0:8443->10.244.0.10:33355: read: connection reset by peer
        reason: Error
        startedAt: "2023-08-14T11:03:46Z" 
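
The validation failure in the log can be reproduced outside the cluster. The sketch below reuses the regex quoted in the error message; the 253-character limit is the usual DNS subdomain length limit, and the function name `is_valid_name` is hypothetical.

```python
import re

# Regex copied from the validation error in the log above.
RFC1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_valid_name(name: str) -> bool:
    """Return True if name is a lowercase RFC 1123 subdomain."""
    return len(name) <= 253 and RFC1123_SUBDOMAIN.fullmatch(name) is not None


print(is_valid_name('""'))  # the literal two-quote name from the log is rejected
```

The rejected value is a name consisting of two quote characters, which suggests the operator received a quoted empty string rather than a real ClusterOperator name.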

Screenshot: https://drive.google.com/file/d/1I_XkX15xEl9ZBtAIZ2yp70twD4z2ASlS/view?usp=sharing

Must gather logs:

https://drive.google.com/file/d/1AkmzC_TUi9z6p13funrSygBm2CgepbpU/view?usp=sharing

Description of problem:

maxUnavailable defaults to 50% for anything under 4: https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/poddisruptionbudget.go#L71

Based on PDB rounding logic, a percentage always rounds up to the next whole integer, so 1.5 becomes 2.

spec:
  maxUnavailable: 50%
  selector:
    matchLabels:
      ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
  currentHealthy: 3
  desiredHealthy: 1
  disruptionsAllowed: 2

Whereas with 4 router pods, we only allow 1 of 4 to be disrupted at a time.
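
The arithmetic behind these numbers can be sketched as follows. The function names are hypothetical, the threshold parameter models the "50% below the threshold, 25% otherwise" rule described above, and all replicas are assumed to be healthy.

```python
def max_unavailable_percent(replicas: int, threshold: int) -> int:
    """Percent used for maxUnavailable: 50% below the threshold, 25% at or above it."""
    return 50 if replicas < threshold else 25


def disruptions_allowed(replicas: int, percent: int) -> int:
    """Percentage of replicas, rounded up to the next whole pod (integer ceiling)."""
    return -(-replicas * percent // 100)


for replicas in (3, 4):
    current = disruptions_allowed(replicas, max_unavailable_percent(replicas, threshold=4))
    proposed = disruptions_allowed(replicas, max_unavailable_percent(replicas, threshold=3))
    print(f"{replicas} replicas: current={current}, proposed={proposed}")
# prints:
# 3 replicas: current=2, proposed=1
# 4 replicas: current=1, proposed=1
```

Lowering the threshold from 4 to 3 means 3 replicas use 25%, which rounds 0.75 up to 1 and keeps two routers available.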

Version-Release number of selected component (if applicable):

4.x

How reproducible:

Always

Steps to Reproduce:

1. Set 3 replicas
2. Look at the disruptionsAllowed on the PDB

Actual results:

You can take down 2 of 3 routers at once, leaving no HA.

Expected results:

With 3+ routers, we should always ensure 2 are up with the PDB.

Additional info:

Reduce the maxUnavailable to 25% for >= 3 pods instead of 4

Description of problem:

An empty page is returned when a normal user tries to view the Route Metrics page

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-13-223353

How reproducible:

Always

Steps to Reproduce:

1. Check any Routes metrics page with cluster-admin user, for example /k8s/ns/openshift-monitoring/routes/alertmanager-main/metrics, we can see the route metrics page and charts are loaded successfully
2. Grant normal user admin permission on 'openshift-monitoring' project
$ oc adm policy add-role-to-user admin testuser-1 -n openshift-monitoring
clusterrole.rbac.authorization.k8s.io/admin added: "testuser-1"
3. Login with normal user 'testuser-1' and check Networking -> Routes -> alertmanager-main -> Metrics page again 

Actual results:

3. empty page is returned

Expected results:

3. If a normal user doesn't have the ability to view Route Metrics, we should either hide the 'Metrics' tab or show an error message instead of a totally empty page

Additional info:

 

Description of problem:

The operator catalog images used in 4.13 hosted clusters are the ones from 4.12

Version-Release number of selected component (if applicable):

4.13.z

How reproducible:

Always

Steps to Reproduce:

1. Create a 4.13 HostedCluster
2. Inspect the image tags used for catalog imagestreams (oc get imagestreams -n CONTROL_PLANE_NAMESPACE)

Actual results:

image tags point to 4.12 catalog images

Expected results:

image tags point to 4.13 catalog images

Additional info:

These image tags need to be updated: https://github.com/openshift/hypershift/blob/release-4.13/control-plane-operator/controllers/hostedcontrolplane/olm/catalogs.go#L117-L120

Description of problem:

The MCO must have compatibility in place one OCP version in advance if we want to bump ignition spec version, otherwise downgrades will fail.

This is NOT needed in 4.14, only 4.13

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. None atm, this is preventative for the future
2.
3.

Actual results:

N/A

Expected results:

N/A

Additional info:

 

As part of a single run, we fetch the same objects over and over again and hence make API calls that should not be made at all.

For example:

1. The privileges check verifies permissions of the datastore, which is also verified by the storageclass check. What is more, each of those checks fetches the datacenter and datastore, which results in several duplicate API calls.

Exit Criteria:
1. Remove duplicate checks
2. Avoid fetching the same API object again and again as part of the same system check.
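
A per-run cache is one way to meet the second exit criterion. The sketch below is illustrative only: `CheckContext` and `fake_fetch` are hypothetical names, and the real checks talk to the vSphere API rather than to a stub.

```python
from typing import Callable, Dict, Tuple


class CheckContext:
    """Per-run cache so each (kind, name) object is fetched at most once."""

    def __init__(self, fetch: Callable[[str, str], dict]):
        self._fetch = fetch
        self._cache: Dict[Tuple[str, str], dict] = {}

    def get(self, kind: str, name: str) -> dict:
        key = (kind, name)
        if key not in self._cache:
            self._cache[key] = self._fetch(kind, name)
        return self._cache[key]


calls = []


def fake_fetch(kind: str, name: str) -> dict:
    """Stand-in for a real API call; records every invocation."""
    calls.append((kind, name))
    return {"kind": kind, "name": name}


ctx = CheckContext(fake_fetch)
ctx.get("datastore", "ds1")  # privileges check
ctx.get("datastore", "ds1")  # storageclass check reuses the cached object
print(len(calls))  # 1
```

Creating a fresh context for every run keeps the data from going stale between runs while still removing the duplicate calls within one run.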

Description of the problem:

In staging, BE 2.18.0, when using the UI to create a new cluster with P/Z CPU architecture and OCP 4.10, the following response is returned:

Non x86_64 CPU architectures for version 4.10 are supported only with User Managed Networking 

How reproducible:

100%

Steps to reproduce:

1. 

2.

3.

Actual results:

 

Expected results:
Message should be clearer for the user to understand the issue:
p/Z cpu arch. is only supported with OCP ver >= 4.12

Description of problem:

2022-09-12T13:48:57.505323919Z {"level":"info","ts":1662990537.5052269,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"qe2/master-1-0"}
2022-09-12T13:48:57.566917845Z {"level":"info","ts":1662990537.5668473,"logger":"provisioner.ironic","msg":"no node found, already deleted","host":"qe2~master-1-0"}
2022-09-12T13:48:57.566945972Z {"level":"info","ts":1662990537.566904,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"qe2/master-1-0","provisioningState":"available","requeue":true,"after":600}
2022-09-12T13:49:13.556690278Z {"level":"info","ts":1662990553.556591,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"qe2/master-1-0"}
2022-09-12T13:49:13.614818643Z {"level":"info","ts":1662990553.6147015,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"qe2/master-1-0","node":"48d24898-1911-4f43-82b0-0b15f8484ae7"}
2022-09-12T13:49:13.629455616Z {"level":"info","ts":1662990553.6293764,"logger":"controllers.HostFirmwareSettings","msg":"provisioner returns error","hostfirmwaresettings":"qe2/master-1-0","RequeueAfter:":30}

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Detach a BMH
2. Check BMO logs for errors
3. Check Ironic logs for errors

Actual results:

BMO and Ironic logs have errors related to the already deleted node.

Expected results:

No noise in the logs.

Additional info:

 

Description of problem:

tested https://issues.redhat.com/browse/OCPBUGS-10387 with PR

launch 4.14-ci,openshift/cluster-monitoring-operator#1926 no-spot

3 masters, 3 workers, each node is with 4 cpus, no infra node

$ oc get node
NAME                                         STATUS   ROLES                  AGE   VERSION
ip-10-0-132-193.us-east-2.compute.internal   Ready    control-plane,master   23m   v1.26.2+d2e245f
ip-10-0-135-65.us-east-2.compute.internal    Ready    control-plane,master   23m   v1.26.2+d2e245f
ip-10-0-149-72.us-east-2.compute.internal    Ready    worker                 14m   v1.26.2+d2e245f
ip-10-0-158-0.us-east-2.compute.internal     Ready    worker                 14m   v1.26.2+d2e245f
ip-10-0-229-135.us-east-2.compute.internal   Ready    worker                 17m   v1.26.2+d2e245f
ip-10-0-234-36.us-east-2.compute.internal    Ready    control-plane,master   23m   v1.26.2+d2e245f

labels see below

control-plane: node-role.kubernetes.io/control-plane: ""
master: node-role.kubernetes.io/master: ""
worker: node-role.kubernetes.io/worker: ""

search with "cluster:capacity_cpu_cores:sum" on admin console "Observe -> Metrics", label_node_role_kubernetes_io=master and label_node_role_kubernetes_io="" are both calculated twice

Name                            label_beta_kubernetes_io_instance_type  label_kubernetes_io_arch  label_node_openshift_io_os_id  label_node_role_kubernetes_io  prometheus                Value
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                                                         openshift-monitoring/k8s  12
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                          master                         openshift-monitoring/k8s  12
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                                                         openshift-monitoring/k8s  12
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                          master                         openshift-monitoring/k8s  12

checked from the thanos-querier API; the result is the same as in the console UI (the console UI uses the thanos-querier API)

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=cluster:capacity_cpu_cores:sum' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "cluster:capacity_cpu_cores:sum",
          "label_beta_kubernetes_io_instance_type": "m6a.xlarge",
          "label_kubernetes_io_arch": "amd64",
          "label_node_openshift_io_os_id": "rhcos",
          "prometheus": "openshift-monitoring/k8s"
        },
        "value": [
          1682394655.248,
          "12"
        ]
      },
      {
        "metric": {
          "__name__": "cluster:capacity_cpu_cores:sum",
          "label_beta_kubernetes_io_instance_type": "m6a.xlarge",
          "label_kubernetes_io_arch": "amd64",
          "label_node_openshift_io_os_id": "rhcos",
          "label_node_role_kubernetes_io": "master",
          "prometheus": "openshift-monitoring/k8s"
        },
        "value": [
          1682394655.248,
          "12"
        ]
      },
      {
        "metric": {
          "__name__": "cluster:capacity_cpu_cores:sum",
          "label_beta_kubernetes_io_instance_type": "m6a.xlarge",
          "label_kubernetes_io_arch": "amd64",
          "label_node_openshift_io_os_id": "rhcos",
          "prometheus": "openshift-monitoring/k8s"
        },
        "value": [
          1682394655.248,
          "12"
        ]
      },
      {
        "metric": {
          "__name__": "cluster:capacity_cpu_cores:sum",
          "label_beta_kubernetes_io_instance_type": "m6a.xlarge",
          "label_kubernetes_io_arch": "amd64",
          "label_node_openshift_io_os_id": "rhcos",
          "label_node_role_kubernetes_io": "master",
          "prometheus": "openshift-monitoring/k8s"
        },
        "value": [
          1682394655.248,
          "12"
        ]
      }
    ]
  }
} 

no such issue if we query the expr for "cluster:capacity_cpu_cores:sum" directly

Name                            label_beta_kubernetes_io_instance_type  label_kubernetes_io_arch  label_node_openshift_io_os_id  label_node_role_kubernetes_io  prometheus                Value
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                                                         openshift-monitoring/k8s  12
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                          master                         openshift-monitoring/k8s  12

should do deduplication for thanos-querier API
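
To make the expected behaviour concrete, the sketch below collapses samples that share an identical label set, which is what the query result above would need to look like. It is a client-side illustration only and does not describe how Thanos itself deduplicates; the function name `dedupe_series` is hypothetical.

```python
def dedupe_series(result: list) -> list:
    """Keep the first sample for each distinct label set, preserving order."""
    seen = set()
    unique = []
    for sample in result:
        key = frozenset(sample["metric"].items())
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique
```

Applied to the four-entry `result` array shown in the JSON response, this leaves two entries: one without label_node_role_kubernetes_io and one with the value master.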

Version-Release number of selected component (if applicable):

tested https://issues.redhat.com/browse/OCPBUGS-10387 with PR

How reproducible:

always

Steps to Reproduce:

1. see the description
2.
3.

Actual results:

node role is calculated twice in thanos-querier API

Expected results:

node role should be calculated only once in thanos-querier API

Description of problem:

When updating an s390x cluster from 4.10.35 to 4.11.34, I got the following message in the UI:

Updating this cluster to 4.11.34 is supported, but not recommended as it might not be optimized for some components in this cluster.

Exposure to KeepalivedMulticastSkew is unknown due to an evaluation failure: client-side throttling: only 9m20.476632575s has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
On OpenStack, oVirt, and vSphere infrastructure, updates to 4.11 can cause degraded cluster operators as a result of a multicast-to-unicast keepalived transition, until all nodes have updated to 4.11. https://access.redhat.com/solutions/7007826

As we discussed on Slack[1], the message could be more user-friendly, something like this[2]:

"Throttling risk evaluation, 2 risks to evaluate, next evaluation in 9m59s."

[1] https://redhat-internal.slack.com/archives/CEGKQ43CP/p1683621220358259
[2] https://redhat-internal.slack.com/archives/CEGKQ43CP/p1683643286581299?thread_ts=1683621220.358259&cid=CEGKQ43CP
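
A message in the suggested style could be produced from two values: the number of risks still waiting for evaluation and the time until the next attempt. The sketch below is hypothetical (both function names are invented for illustration) and only shows the formatting, not how the cluster-version operator tracks its throttling state.

```python
def format_duration(seconds: int) -> str:
    """Render whole seconds as e.g. '9m59s' or '45s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs}s" if minutes else f"{secs}s"


def throttle_message(pending_risks: int, seconds_until_next: int) -> str:
    """Build a short, user-facing throttling notice."""
    noun = "risk" if pending_risks == 1 else "risks"
    return (
        f"Throttling risk evaluation, {pending_risks} {noun} to evaluate, "
        f"next evaluation in {format_duration(seconds_until_next)}."
    )


print(throttle_message(2, 599))
# prints: Throttling risk evaluation, 2 risks to evaluate, next evaluation in 9m59s.
```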

Version-Release number of selected component (if applicable):

4.11.34

How reproducible:

Have a cluster on 4.10.35 (or, I guess, any 4.10.z) and update to 4.11.34

Steps to Reproduce: 

1. Open webconsole
2. On the dashboard/Overview click on Update cluster
3. Change the channel to stable-4.11
4. Select new version and from the drop down menu click on Include supported but not recommended versions
5. Select 4.11.34
6. Message from the problem description appears 

Actual results:

Unclear message

Expected results:

Clear message

Description of problem:

etcd-backup fails with 'FIPS mode is enabled, but the required OpenSSL library is not available' on 4.13 FIPS enabled cluster

Version-Release number of selected component (if applicable):

OCP 4.13

How reproducible:

 

Steps to Reproduce:

1. run etcd-backup script on FIPS enabled OCP 4.13
2.
3.

Actual results:

backup script fails with

+ etcdctl snapshot save /home/core/assets/backup/snapshot_2023-08-28_125218.db
FIPS mode is enabled, but the required OpenSSL library is not available

Expected results:

successful run of etcd-backup script

Additional info:

4.13 uses RHEL9-based RHCOS while the etcd image still uses RHEL8, which could be the main issue. If so, the image should be rebuilt with RHEL9.

Description of problem:

STS cluster awareness was in techpreview for testing and assurance of quality before release. The created unit tests and runs have indicated no change in operation to the cluster. QE has reported several bugs and they've been fixed. A periodic e2e test, which verifies that a Secret is generated when an STS cluster is detected and proper AWS resource access tokens are present in the CredentialsRequest, has been passing and has also passed when run manually on several follow-on PRs.

Description of problem:

The Azure CCM will panic when it loses its leader election lease. This is contrary to the behaviour of other components which exit intentionally.

See https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure-modern/1632791244243472384

Version-Release number of selected component (if applicable):


How reproducible:

Force the CCM to lose leader election, can happen during upgrades

Steps to Reproduce:

1.
2.
3.

Actual results:

Code will panic, eg 

E0306 18:09:14.315039       1 runtime.go:77] Observed a panic: leaderelection lost
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1adc660?, 0x219b9c0})
	/go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x81e22e?})
	/go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1adc660, 0x219b9c0})
	/usr/lib/golang/src/runtime/panic.go:884 +0x212
sigs.k8s.io/cloud-provider-azure/cmd/cloud-controller-manager/app.NewCloudControllerManagerCommand.func1.1()
	/go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/app/controllermanager.go:138 +0x27
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1()
	/go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:203 +0x1f
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0002c0d80, {0x21bce08, 0xc0001ac008})
	/go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:213 +0x14d
k8s.io/client-go/tools/leaderelection.RunOrDie({0x21bce08, 0xc0001ac008}, {{0x21c0e00, 0xc0002c0c60}, 0x1fe5d61a00, 0x18e9b26e00, 0x60db88400, {0xc000418080, 0x1fc4978, 0x0}, ...})
	/go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:226 +0x94
sigs.k8s.io/cloud-provider-azure/cmd/cloud-controller-manager/app.NewCloudControllerManagerCommand.func1(0xc000170000?, {0x1ea43e2?, 0xd?, 0xd?})
	/go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/app/controllermanager.go:130 +0x3a7
github.com/spf13/cobra.(*Command).execute(0xc000170000, {0xc00019e010, 0xd, 0xd})
	/go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc000170000)
	/go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:990 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:918
main.main()
	/go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/controller-manager.go:47 +0xc5
panic: leaderelection lost [recovered]
	panic: leaderelection lost

goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x81e22e?})
	/go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x1adc660, 0x219b9c0})
	/usr/lib/golang/src/runtime/panic.go:884 +0x212
sigs.k8s.io/cloud-provider-azure/cmd/cloud-controller-manager/app.NewCloudControllerManagerCommand.func1.1()
	/go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/app/controllermanager.go:138 +0x27
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1()
	/go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:203 +0x1f
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0002c0d80, {0x21bce08, 0xc0001ac008})
	/go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:213 +0x14d
k8s.io/client-go/tools/leaderelection.RunOrDie({0x21bce08, 0xc0001ac008}, {{0x21c0e00, 0xc0002c0c60}, 0x1fe5d61a00, 0x18e9b26e00, 0x60db88400, {0xc000418080, 0x1fc4978, 0x0}, ...})
	/go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:226 +0x94
sigs.k8s.io/cloud-provider-azure/cmd/cloud-controller-manager/app.NewCloudControllerManagerCommand.func1(0xc000170000?, {0x1ea43e2?, 0xd?, 0xd?})
	/go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/app/controllermanager.go:130 +0x3a7
github.com/spf13/cobra.(*Command).execute(0xc000170000, {0xc00019e010, 0xd, 0xd})
	/go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc000170000)
	/go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:990 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:918
main.main()
	/go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/controller-manager.go:47 +0xc5

Expected results:

Code should exit without panicking

Additional info:


Description of problem:

The modal displayed when installing a Helm chart shows a Documentation link field. This field can never be populated with a value and is always N/A.

Annotation for documentation URL doesn't exist in https://github.com/redhat-certification/chart-verifier/blob/main/docs/helm-chart-annotations.md#provider-annotations

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Go to Helm chart catalog
2. View any chart
3. See documentation = "N/A"

Actual results:

N/A

Expected results:

A way to populate the value

Additional info:

The value is consumed here: https://github.com/openshift/console/blob/2e8624014065d09ba40164221dd612d882f20395/frontend/packages/console-shared/src/components/catalog/details/CatalogDetailsPanel.tsx

But it is never extracted from a chart:
https://github.com/openshift/console/blob/2e8624014065d09ba40164221dd612d882f20395/frontend/packages/helm-plugin/src/catalog/utils/catalog-utils.tsx#L138

It is probably because no such annotation exists in chart certification requirements/recommendations:
https://github.com/redhat-certification/chart-verifier/blob/main/docs/helm-chart-annotations.md#provider-annotations

This is a clone of issue OCPBUGS-19674. The following is the description of the original issue:

Description of problem:

When using a route to expose the API server endpoint in a HostedCluster, the .status.controlPlaneEndpoint.port is reported as 6443 (the internal port) instead of 443, which is the port that is externally exposed via the route.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a HostedCluster with a custom DNS name using route as the strategy
2. Inspect .status.controlPlaneEndpoint

Actual results:

It has 6443 as the port

Expected results:

It has 443 as the port
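
The expected behaviour amounts to choosing the reported port from the publishing strategy rather than from the internal listener. The sketch below is a simplified illustration: the function name is hypothetical, and treating every non-route strategy as reporting 6443 is an assumption made only to keep the example small.

```python
INTERNAL_API_PORT = 6443   # port the API server listens on internally
ROUTE_EXTERNAL_PORT = 443  # port exposed externally when a route fronts the API server


def control_plane_endpoint_port(publishing_strategy: str) -> int:
    """Return the port that should be reported in .status.controlPlaneEndpoint."""
    if publishing_strategy.lower() == "route":
        return ROUTE_EXTERNAL_PORT
    # Assumption: other strategies report the internal port unchanged.
    return INTERNAL_API_PORT
```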

Additional info:

 

Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/188

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Based on bugs from the ART team (example: https://issues.redhat.com/browse/OCPBUGS-12347), 4.14 images should be built with go 1.20, but the prometheus container image is built with go1.19.6.

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/goversion/values' | jq
{
  "status": "success",
  "data": [
    "go1.19.6",
    "go1.20.3"
  ]
}
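
The label values returned above can be filtered for anything older than the required toolchain. This is a small sketch with hypothetical helper names; it assumes the values always have the form `goMAJOR.MINOR` or `goMAJOR.MINOR.PATCH`.

```python
import re


def parse_go_version(value: str) -> tuple:
    """Turn 'go1.19.6' into (1, 19, 6); a missing patch level counts as 0."""
    match = re.fullmatch(r"go(\d+)\.(\d+)(?:\.(\d+))?", value)
    if match is None:
        raise ValueError(f"unrecognized Go version: {value!r}")
    return tuple(int(part or 0) for part in match.groups())


def older_than(versions: list, minimum: tuple) -> list:
    """Return the versions that sort below the required minimum."""
    return [v for v in versions if parse_go_version(v) < minimum]


print(older_than(["go1.19.6", "go1.20.3"], (1, 20, 0)))  # ['go1.19.6']
```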

searched from thanos API

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query={__name__=~".*",goversion="go1.19.6"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "prometheus_build_info",
          "branch": "rhaos-4.14-rhel-8",
          "container": "kube-rbac-proxy",
          "endpoint": "metrics",
          "goarch": "amd64",
          "goos": "linux",
          "goversion": "go1.19.6",
          "instance": "10.128.2.19:9092",
          "job": "prometheus-k8s",
          "namespace": "openshift-monitoring",
          "pod": "prometheus-k8s-0",
          "prometheus": "openshift-monitoring/k8s",
          "revision": "fe01b9f83cb8190fc8f04c16f4e05e87217ab03e",
          "service": "prometheus-k8s",
          "tags": "unknown",
          "version": "2.43.0"
        },
        "value": [
          1682576802.496,
          "1"
        ]
      },
...

The prometheus-k8s-0 pod has these containers: [prometheus config-reloader thanos-sidecar prometheus-proxy kube-rbac-proxy kube-rbac-proxy-thanos]. Of these, the prometheus image is built with go1.19.6:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- prometheus --version
prometheus, version 2.43.0 (branch: rhaos-4.14-rhel-8, revision: fe01b9f83cb8190fc8f04c16f4e05e87217ab03e)
  build user:       root@402ffbe02b57
  build date:       20230422-00:43:08
  go version:       go1.19.6
  platform:         linux/amd64
  tags:             unknown

$ oc -n openshift-monitoring exec -c config-reloader prometheus-k8s-0 -- prometheus-config-reloader --version
prometheus-config-reloader, version 0.63.0 (branch: rhaos-4.14-rhel-8, revision: ce71a7d)
  build user:       root
  build date:       20230424-15:53:51
  go version:       go1.20.3
  platform:         linux/amd64

$ oc -n openshift-monitoring exec -c thanos-sidecar prometheus-k8s-0 -- thanos --version
thanos, version 0.31.0 (branch: rhaos-4.14-rhel-8, revision: d58df6d218925fd007e16965f50047c9a4194c42)
  build user:       root@c070c5e6af32
  build date:       20230422-00:44:21
  go version:       go1.20.3
  platform:         linux/amd64


# owned by the oauth team, not the responsibility of the Monitoring team
$ oc -n openshift-monitoring exec -c prometheus-proxy prometheus-k8s-0 -- oauth-proxy --version
oauth2_proxy was built with go1.18.10

# the issue below is tracked by bug OCPBUGS-12821
$ oc -n openshift-monitoring exec -c kube-rbac-proxy prometheus-k8s-0 -- kube-rbac-proxy --version
Kubernetes v0.0.0-master+$Format:%H$

$ oc -n openshift-monitoring exec -c kube-rbac-proxy-thanos prometheus-k8s-0 -- kube-rbac-proxy --version
Kubernetes v0.0.0-master+$Format:%H$

Files that should be fixed:
https://github.com/openshift/prometheus/blob/master/.ci-operator.yaml#L4
https://github.com/openshift/prometheus/blob/master/Dockerfile.ocp#L1

 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-04-26-154754  

How reproducible:

always

Actual results:

4.14 prometheus is built with go1.19.6

Expected results:

4.14 prometheus image should be built with go1.20
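
The check performed by hand above can be sketched in a few lines. This is only an illustration: it reuses the goversion label values returned by the Thanos query in this report, and the helper name `unexpected_versions` and the `EXPECTED_PREFIX` constant are assumptions introduced here, not part of any OpenShift tooling.

```python
import json

# Label values as returned by the /api/v1/label/goversion/values query above.
response = json.loads('{"status": "success", "data": ["go1.19.6", "go1.20.3"]}')

# Assumed expectation for 4.14 images, taken from the bug description.
EXPECTED_PREFIX = "go1.20"


def unexpected_versions(values, prefix=EXPECTED_PREFIX):
    """Return the Go versions that are not the expected minor release."""
    return [v for v in values if not (v == prefix or v.startswith(prefix + "."))]


print(unexpected_versions(response["data"]))
```

Any value the helper returns points at a component that still needs its builder image updated.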

Additional info:

no functional impact

Along with the external disruption tests via the api DNS name, we should also check that the apiserver is not disrupted via the api-int and service network endpoints.

Ref: https://issues.redhat.com/browse/API-1526

Description of problem:

The CCMs at the moment are given RBAC permissions of "get, list, watch" on secrets across all namespaces. This was a security concern raised by the OpenShift Security team. 

The Nutanix CCM currently creates a secrets informer and a configmaps informer at the cluster scope; these are then passed into the NewProvider call for the prism environment. Within the prism environment, the configmap and secret informers are each used once, and only to list a single namespace. We should modify the informer creation so that it covers just the namespaces required. This would reduce the scope of RBAC required and meet the OpenShift security requirements.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Description of problem:

Fixed by @wking, opening bug for Jira linking.

The cluster-dns-operator sets the status condition's lastTransitionTime whenever the status (true, false, unknown), reason, or message changed on a condition.  

It should only set the lastTransitionTime if the condition status changes. Otherwise this can have an effect on status flapping between true and false.  See https://github.com/openshift/api/blob/master/config/v1/types_cluster_operator.go#L129

Version-Release number of selected component (if applicable):

4.15 and earlier

How reproducible:

100%

Steps to Reproduce:

1. Put cluster-dns-operator in a Degraded condition by stopping a pod, notice the lastTransitionTime
2. Wait 1 second and stop another pod, which only updates the condition message

Actual results:

Notice the lastTransitionTime for the Degraded condition changes when the message changes, even though the status is still Degraded=true

Expected results:

The lastTransitionTime should not change unless the Degraded status changes, not the message or reason.
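
The expected behavior can be sketched as follows. This is a minimal Python illustration of the rule, not the operator's Go code; the `Condition` class and the `set_condition` helper are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Condition:
    # Hypothetical stand-in for a ClusterOperator status condition.
    type: str
    status: str  # "True", "False", or "Unknown"
    reason: str
    message: str
    last_transition_time: datetime


def set_condition(old, status, reason, message, now):
    """Update a condition, bumping the timestamp only when the status changes."""
    transition_time = now if status != old.status else old.last_transition_time
    return replace(
        old,
        status=status,
        reason=reason,
        message=message,
        last_transition_time=transition_time,
    )


t0 = datetime(2023, 1, 1, tzinfo=timezone.utc)
degraded = Condition("Degraded", "True", "PodsUnavailable", "1 pod unavailable", t0)

# One second later only the message changes, so the timestamp stays at t0.
updated = set_condition(
    degraded, "True", "PodsUnavailable", "2 pods unavailable", t0 + timedelta(seconds=1)
)
print(updated.last_transition_time == t0)
```

Reason and message are always refreshed; only the timestamp is guarded by the status comparison.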

Additional info:

 

Description of problem:

The QE Prow CI job updates hostedcluster.spec.pullSecret for some QE catalog source configurations. The 4.13 jobs failed with this error message:

Error from server (HostedCluster.spec.pullSecret.name: Invalid value: "9509a26c339de31aa3c9-pull-secret-new": Attempted to change an immutable field): admission webhook "hostedclusters.hypershift.openshift.io" denied the request: HostedCluster.spec.pullSecret.name: Invalid value: "9509a26c339de31aa3c9-pull-secret-new": Attempted to change an immutable field

Version-Release number of selected component (if applicable):

4.13

How reproducible:

4.13 job:
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/41339/rehearse-41339-periodic-ci-openshift-openshift-tests-private-release-4.13-amd64-nightly-aws-ipi-ovn-hypershift-guest-p1-f7/1689831180221812736

Steps to Reproduce:

see the above job

Actual results:

The job failed to configure the pull secret for the hostedcluster.

Expected results:

The job runs successfully.

Additional info:

1. The 4.14 hypershift QE CI jobs ran successfully with the same code.
2. I can update the 4.13 hostedcluster spec.pullSecret in my local hypershift environment.

Could this be caused by a limitation that exists only in Prow?

 

slack thread: https://redhat-internal.slack.com/archives/C01C8502FMM/p1691736890938529

Description of problem:

TRT has unfortunately had to revert this breaking change to get CI and/or nightly payloads flowing again. 
The original PR was https://github.com/openshift/cluster-storage-operator/pull/381.
The revert PR: https://github.com/openshift/cluster-storage-operator/pull/384

The following evidence helped us push for the revert:
In the nightly payload runs, periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm has failed consistently in the last three nightly payloads, but the run in the revert PR passed.

To restore your change, create a new PR that reverts the revert and layers additional separate commit(s) on top that address the problem.

Contact information for TRT is available at https://source.redhat.com/groups/public/atomicopenshift/atomicopenshift_wiki/how_to_contact_the_technical_release_team. Please reach out if you need assistance in relanding your change or have feedback about this process.

Description of problem:

When a machine is created with a compute availability zone (defined via mpool.zones) and a storage root volume (defined as mpool.rootVolume) and that rootVolume has no specified zones, CAPO will use the compute AZ for the volume AZ.

This can be problematic if the AZ doesn't exist in Cinder.
Source:

https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/9d183bd479fe9aed4f6e7ac3d5eee46681c518e7/pkg/cloud/services/compute/instance.go#L439-L442

Version-Release number of selected component (if applicable):

All versions supporting rootVolume AZ.

Steps to Reproduce:

1. In install-config.yaml, add "zones" with valid Nova AZs, and a rootVolume without "zones". Your OpenStack cloud must not have Cinder AZs (only Nova AZs)
2. Day 1 deployment will go fine, Terraform will create the machines with no AZ.
3. Day 2 operations on machines will fail: CAPO tries to use the Nova AZ for the root volume if no volume AZ is provided, but since the AZs don't match between Cinder and Nova, the machine will never be created
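
The fallback described above can be sketched as follows. This is an illustrative Python model of the behavior, not CAPO's Go code; the function name `volume_az`, the AZ name `az-compute-1`, and the Cinder AZ set are all hypothetical.

```python
def volume_az(compute_az: str, root_volume_az: str) -> str:
    """Model of the reported behavior: an unset volume AZ falls back to the compute AZ."""
    return root_volume_az or compute_az


# Hypothetical cloud that exposes only a default Cinder AZ and no Nova-named ones.
cinder_azs = {"nova"}

chosen = volume_az(compute_az="az-compute-1", root_volume_az="")
print(chosen, chosen in cinder_azs)
```

When the chosen AZ is not one Cinder knows about, the volume request cannot be satisfied, which matches the "Machine not created" result reported here.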

Actual results:

Machine not created

Expected results:

Machine created in the right AZ for both Nova & Cinder

Description of problem:

- Calico Virtual NICs should be excluded from node_exporter collector.
- All NICs beginning with cali* should be added to collector.netclass.ignored-devices to ensure that metrics are not collected.
- node_exporter is meant to collect metrics for physical interfaces only. 
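
As a rough illustration of the exclusion requested above, the sketch below applies an ignore pattern to a list of device names. The pattern `^cali.*$` and the sample device names are assumptions made for this example; the value actually passed to collector.netclass.ignored-devices in OpenShift may differ.

```python
import re

# Hypothetical ignore pattern covering every device whose name begins with "cali".
IGNORED_DEVICES = re.compile(r"^cali.*$")


def collected(devices):
    """Return the devices that would still be reported after applying the ignore pattern."""
    return [d for d in devices if not IGNORED_DEVICES.search(d)]


# Sample device names, invented for illustration.
devices = ["ens3", "eth0", "cali1a2b3c4d5e6", "calif00dcafe123"]
print(collected(devices))
```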

Version-Release number of selected component (if applicable):

OpenShift 4.12

How reproducible:

Always

Steps to Reproduce:

Run an OpenShift cluster using Calico SDN.
Observe -> Metrics -> Run the following PromQL query: "group by(device) (node_network_info)"
Observe that Calico Virtual NICs are present.

Actual results:

Calico Virtual NICs are present in OCP Metrics.

Expected results:

Only physical network interfaces should be present.

Additional info:

Similar to this bug, but for Calico virtual NICs: https://issues.redhat.com/browse/OCPBUGS-1321

We've removed the SR-IOV code that was using python-grpcio and python-protobuf. These are gone from Python's requirements.txt, but we never removed them from the RPM spec we use to build Kuryr in OpenShift. This should be fixed.

Description of problem:

When updating from 4.12 to 4.13, the incoming ovn-k8s-cni-overlay expects RHEL 9, and fails to run on the still-RHEL-8 4.12 nodes.

Version-Release number of selected component (if applicable):

4.13 and 4.14 ovn-k8s-cni-overlay vs. 4.12 RHCOS's RHEL 8.

How reproducible:

100%

Steps to Reproduce:

Picked up in TestGrid.

Actual results:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade/1677232369326624768/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/nodes/ci-op-y7r1x9z3-3a480-9swt7-master-2/journal | zgrep dns-operator | tail -n1
Jul 07 12:34:30.202100 ci-op-y7r1x9z3-3a480-9swt7-master-2 kubenswrapper[2168]: E0707 12:34:30.201720    2168 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"dns-operator-78cbdc89fd-kckcd_openshift-dns-operator(5c97a52b-f774-40ae-8c17-a17b30812596)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"dns-operator-78cbdc89fd-kckcd_openshift-dns-operator(5c97a52b-f774-40ae-8c17-a17b30812596)\\\": rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-operator-78cbdc89fd-kckcd_openshift-dns-operator_5c97a52b-f774-40ae-8c17-a17b30812596_0(1fa1dd2b35100b0f1ec058d79042a316b909e38711fcadbf87bd9a1e4b62e0d3): error adding pod openshift-dns-operator_dns-operator-78cbdc89fd-kckcd to CNI network \\\"multus-cni-network\\\": plugin type=\\\"multus\\\" name=\\\"multus-cni-network\\\" failed (add): [openshift-dns-operator/dns-operator-78cbdc89fd-kckcd/5c97a52b-f774-40ae-8c17-a17b30812596:ovn-kubernetes]: error adding container to network \\\"ovn-kubernetes\\\": netplugin failed: \\\"/var/lib/cni/bin/ovn-k8s-cni-overlay: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /var/lib/cni/bin/ovn-k8s-cni-overlay)\\\\n/var/lib/cni/bin/ovn-k8s-cni-overlay: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /var/lib/cni/bin/ovn-k8s-cni-overlay)\\\\n\\\"\"" pod="openshift-dns-operator/dns-operator-78cbdc89fd-kckcd" podUID=5c97a52b-f774-40ae-8c17-a17b30812596

Expected results:

Successful update.

Additional info:

Both 4.14 and 4.13 control planes can be associated with 4.12 compute nodes, because of EUS-to-EUS updates.

This is a clone of issue OCPBUGS-19550. The following is the description of the original issue:

Multus doesn't need to watch pods on other nodes. To save memory and CPU set MULTUS_NODE_NAME to filter pods that multus watches.
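
The idea can be sketched as a simple filter on the pod's node name. This is a Python illustration only, not Multus code; the helper `pods_on_node`, the sample pods, and the `worker-0` fallback are hypothetical. In a real Kubernetes client the same effect is usually achieved with a field selector on spec.nodeName, so that pods from other nodes are never sent to the watcher at all.

```python
import os


def pods_on_node(pods, node_name):
    """Keep only the pods scheduled onto node_name."""
    return [p for p in pods if p.get("spec", {}).get("nodeName") == node_name]


# "worker-0" is only a fallback for this sketch when MULTUS_NODE_NAME is unset.
node_name = os.environ.get("MULTUS_NODE_NAME", "worker-0")

# Hypothetical sample data.
pods = [
    {"metadata": {"name": "a"}, "spec": {"nodeName": "worker-0"}},
    {"metadata": {"name": "b"}, "spec": {"nodeName": "worker-1"}},
]
print([p["metadata"]["name"] for p in pods_on_node(pods, node_name)])
```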

Description of problem: Multus currently issues a certificate that is valid for only 10 minutes; we need to add configuration so that certificates are valid for 24 hours.

Description of problem:

Similar to OCPBUGS-11636 ccoctl needs to be updated to account for the s3 bucket changes described in https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

These changes have rolled out to us-east-2 and the China regions as of today and will roll out to additional regions in the near future.

See OCPBUGS-11636 for additional information

Version-Release number of selected component (if applicable):

 

How reproducible:

Reproducible in affected regions.

Steps to Reproduce:

1. Use "ccoctl aws create-all" flow to create STS infrastructure in an affected region like us-east-2. Notice that document upload fails because the s3 bucket is created in a state that does not allow usage of ACLs with the s3 bucket.

Actual results:

./ccoctl aws create-all --name abutchertestue2 --region us-east-2 --credentials-requests-dir ./credrequests --output-dir _output
2023/04/11 13:01:06 Using existing RSA keypair found at _output/serviceaccount-signer.private
2023/04/11 13:01:06 Copying signing key for use by installer
2023/04/11 13:01:07 Bucket abutchertestue2-oidc created
2023/04/11 13:01:07 Failed to create Identity provider: failed to upload discovery document in the S3 bucket abutchertestue2-oidc: AccessControlListNotSupported: The bucket does not allow ACLs
        status code: 400, request id: 2TJKZC6C909WVRK7, host id: zQckCPmozx+1yEhAj+lnJwvDY9rG14FwGXDnzKIs8nQd4fO4xLWJW3p9ejhFpDw3c0FE2Ggy1Yc=

Expected results:

"ccoctl aws create-all" successfully creates IAM and S3 infrastructure. OIDC discovery and JWKS documents are successfully uploaded to the S3 bucket and are publicly accessible.

Additional info:

 

Description of problem

CI is flaky because the TestRouterCompressionOperation test fails.

Version-Release number of selected component (if applicable)

I have seen these failures on 4.14 CI jobs.

How reproducible

Presently, search.ci reports the following stats for the past 14 days:

Found in 7.71% of runs (16.58% of failures) across 402 total runs and 24 jobs (46.52% failed)

GCP is most impacted:

pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator (all) - 44 runs, 86% failed, 37% of failures match = 32% impact

Azure and AWS are also impacted:

pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator (all) - 36 runs, 64% failed, 43% of failures match = 28% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 38 runs, 79% failed, 23% of failures match = 18% impact
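
The "impact" figures above appear to be the job failure rate multiplied by the share of failures that match the search, rounded to a whole percent. That reading is an inference from the numbers, not something search.ci states here; the helper name `impact` is introduced only for this check.

```python
def impact(failed_pct, match_pct):
    """Percent of all runs that failed with a matching error, rounded to a whole percent."""
    return round(failed_pct * match_pct / 100)


# (failed %, % of failures that match) as quoted above.
jobs = {"gcp": (86, 37), "azure": (64, 43), "aws": (79, 23)}
for name, (failed, match) in jobs.items():
    print(name, impact(failed, match))
```

For GCP, 86% of 37% is 31.82%, which rounds to the 32% quoted; the Azure and AWS figures work out to 27.52% and 18.17%.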

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=compression+error%3A+expected&maxAge=336h&context=1&type=build-log&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.

Actual results

The test fails:

TestAll/serial/TestRouterCompressionOperation 
=== RUN   TestAll/serial/TestRouterCompressionOperation
    router_compression_test.go:209: compression error: expected "gzip", got "" for canary route

Expected results

CI passes, or it fails on a different test.

Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/66

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.

Description of problem:

Since the introduction of https://github.com/openshift/origin/pull/27570, the openshift-tests binary looks up the cluster infra resource for later use (setting the TEST_PROVIDER env var when running the run-test command to inject details about the cluster). Since microshift does not have this resource, the returned value is nil, and the binary panics when the value is used later in the code.

Version-Release number of selected component (if applicable):

 

How reproducible:

Run openshift-tests and it immediately panics

Due to the removal of the in-tree AWS provider (https://github.com/kubernetes/kubernetes/pull/115838), we need to ensure that KCM sets the --external-cloud-volume-plugin flag accordingly, especially since CSI migration went GA in 4.12/1.25.

Description of problem:

In the topology side panel, in the PipelineRuns section, clicking the "Start last run" button displays an error alert message.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Create a deployment with pipeline
2. Click on deployment to open side panel
3. Click "Start last run" button in PipelineRuns section 

Actual results:

Error alert message is displayed

Expected results:

Should be able to run the last run

Additional info:

 

Description of problem:

We have seen unit tests flaking on the mapping within the OnDelete policy tests for the control plane machine set.

It turns out there is a race condition: given the right timing, if a reconcile is in progress while a machine is marked for deletion, the load balancing part of the algorithm fails to apply properly.

Description of problem:

Sync "Debug in Terminal" feature with 3.x pods in web console
The types of pods that enable the "Debug in terminal" feature should be in alignment with those in v3.11. See code here: https://github.com/openshift/origin-web-console/blob/c37982397087036321312172282e139da378eff2/app/scripts/directives/resources.js#L33-L53

Description of problem:

UPSTREAM: <carry>: Force using host go always and use host libriaries introduced a build failure for the Windows kubelet that is showing up only in release-4.11 for an unknown reason but could potentially occur on other releases too.

Version-Release number of selected component (if applicable):

WMCO version: 9.0.0 and below
 

How reproducible:

Always on release-4.11
 

Steps to Reproduce:

1. Clone the WMCO repo
2. Build the WMCO image

Actual results:

WMCO image build fails

Expected results:

 WMCO image build should succeed

Description of problem:

Most content on the "Command Line Tools" page is not internationalized (i18n).

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-10-165006

How reproducible:

Always

Steps to Reproduce:

1. Go to the "?" -> "Command Line Tools" page. Add "?pseudolocalization=true&lng=en" to the end of the URL. Check whether all content is internationalized.

Actual results:

1. Most of the content is not internationalized.

Expected results:

1. All content should be internationalized.

Additional info:


Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/104

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The OCP upgrade is blocked because the csi-snapshot-controller cluster operator fails to start its deployment, with a fatal "read-only filesystem" message.

Version-Release number of selected component (if applicable):

Red Hat OpenShift 4.11
rhacs-operator.v3.72.1

How reproducible:

At least once in user's cluster while upgrading 

Steps to Reproduce:

1. Have an OCP 4.11 cluster installed
2. Install ACS on top of the OCP cluster
3. Upgrade OCP to the next z-stream version

Actual results:

Upgrade gets blocked: waiting on csi-snapshot-controller

Expected results:

Upgrade should succeed

Additional info:

The stackrox SCCs (stackrox-admission-control, stackrox-collector and stackrox-sensor) set `readOnlyRootFilesystem` to `true`. If an SCC is not explicitly defined or requested, other Pods might receive one of these SCCs, which makes the deployment fail with a `read-only filesystem` message.

Description of problem:

When installing a 3 master + 2 worker BM IPv6 cluster with proxy, worker BMHs are failing inspection with the message: "Could not contact ironic-inspector for version discovery: Unable to find a version discovery document". This causes the installation to fail due to nodes with worker role never joining the cluster. However, when installing with no workers, the issue does not reproduce and the cluster installs successfully.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-01-04-203333

How reproducible:

100%

Steps to Reproduce:

1. Attempt to install an IPv6 cluster with 3 masters + 2 workers and proxy with baremetal installer

Actual results:

Installation never completes because a number of pods are in Pending status

Expected results:

Workers join the cluster and installation succeeds 

Additional info:

$ oc get events
LAST SEEN   TYPE     REASON              OBJECT                               MESSAGE
174m        Normal   InspectionError     baremetalhost/openshift-worker-0-1   Failed to inspect hardware. Reason: unable to start inspection: Could not contact ironic-inspector for version discovery: Unable to find a version discovery document at https://[fd2e:6f44:5dd8::37]:5050, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
174m        Normal   InspectionError     baremetalhost/openshift-worker-0-0   Failed to inspect hardware. Reason: unable to start inspection: Could not contact ironic-inspector for version discovery: Unable to find a version discovery document at https://[fd2e:6f44:5dd8::37]:5050, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
174m        Normal   InspectionStarted   baremetalhost/openshift-worker-0-0   Hardware inspection started
174m        Normal   InspectionStarted   baremetalhost/openshift-worker-0-1   Hardware inspection started

This is actually a better design since BMO does not need to be coupled with Ironic (unlike Ironic and httpd, for example). But the current architecture also has two real issues:

  1. BMO needs to know the IP address of Ironic, which causes a chicken-and-egg problem: the IP is not known until the pod starts.
  2. Since BMO is a part of the Metal3 pod, it also uses host networking and other privileges. For example, the webhook port is exposed externally.

The main thing to fix is to make BMO talk to Ironic via its external IP instead of localhost.

Description of problem:

RHEL-7 already comes with xz installed, but in RHEL-8 it needs to be explicitly installed.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

1. Use an image based on Dockerfile.upi.ci.rhel8
2. Trigger a CI job that uses the xz tool
3.

Actual results:

/bin/sh: xz: command not found
tar: /tmp/secret/terraform_state.tar.xz: Wrote only 4096 of 10240 bytes
tar: Child returned status 127
tar: Error is not recoverable: exiting now 

Expected results:

no errors

Additional info:

Step: https://github.com/openshift/release/blob/master/ci-operator/step-registry/upi/install/vsphere/upi-install-vsphere-commands.sh#L185

And investigation by Jinyun Ma: https://github.com/openshift/release/pull/39991#issuecomment-1581937323

Description of problem:

The Machine and its respective Node should indicate the proper zones, but the Machine doesn’t indicate the proper zone on a cluster with multiple vCenter zones.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-07-064924

How reproducible:

always

Steps to Reproduce:

1. Create a multiple vCenter zones cluster

sh-4.4$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-02-07-064924   True        False         58m     Cluster version is 4.13.0-0.nightly-2023-02-07-064924
sh-4.4$ oc get machine
NAME                           PHASE     TYPE   REGION    ZONE   AGE
jima15b-x4584-master-0         Running          us-east          88m
jima15b-x4584-master-1         Running          us-east          88m
jima15b-x4584-master-2         Running          us-west          88m
jima15b-x4584-worker-0-26hml   Running          us-east          81m
jima15b-x4584-worker-1-zljp8   Running          us-east          81m
jima15b-x4584-worker-2-kkdzf   Running          us-west          81m

2. Check machine labels and node labels
sh-4.4$ oc get machine jima15b-x4584-worker-0-26hml -oyaml 
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: poweredOn
  creationTimestamp: "2023-02-09T02:28:03Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: jima15b-x4584-worker-0-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: jima15b-x4584
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: jima15b-x4584-worker-0
    machine.openshift.io/region: us-east
    machine.openshift.io/zone: ""
  name: jima15b-x4584-worker-0-26hml
  namespace: openshift-machine-api

sh-4.4$ oc get node jima15b-x4584-worker-0-26hml --show-labels
NAME                           STATUS   ROLES    AGE    VERSION           LABELS
jima15b-x4584-worker-0-26hml   Ready    worker   9m4s   v1.26.0+9eb81c2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-1a,kubernetes.io/arch=amd64,kubernetes.io/hostname=jima15b-x4584-worker-0-26hml,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,node.openshift.io/os_id=rhcos,topology.csi.vmware.com/openshift-region=us-east,topology.csi.vmware.com/openshift-zone=us-east-1a,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-1a

Actual results:

The Machine doesn’t indicate the proper zone: its label is machine.openshift.io/zone: ""

Expected results:

Machine should indicate proper zone

Additional info:

Discussed here https://redhat-internal.slack.com/archives/GE2HQ9QP4/p1675848293159359

Description of problem:

While checking bug https://issues.redhat.com/browse/OCPBUGS-15976, we found that the default ingresscontroller reports DNSReady as True even when DNS records failed to be published to the public zone, and co/ingress doesn't report any error.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-05-191022

How reproducible:

100%

Steps to Reproduce:

1. Install an Azure cluster configured for manual mode with Azure Workload Identity

2. Check the dnsrecords of default-wildcard
$ oc -n openshift-ingress-operator get dnsrecords default-wildcard -oyaml
<---snip--->
  - conditions:
    - lastTransitionTime: "2023-07-10T04:23:55Z"
      message: 'The DNS provider failed to ensure the record: failed to update dns ......
      reason: ProviderError
      status: "False"
      type: Published
    dnsZone:
      id: /subscriptions/xxxxx/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com

3. Check ingresscontroller status
$ oc -n openshift-ingress-operator get ingresscontroller default -oyaml
<---snip--->
  - lastTransitionTime: "2023-07-10T04:23:55Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady

4. Check co/ingress status
$ oc get co/ingress
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.14.0-0.nightly-2023-07-05-191022   True        False         False      127m    

Actual results:

1. DNSReady is True and message shows: The record is provisioned in all reported zones.
2. co/ingress doesn't report any error

Expected results:

DNSReady should be False, since the record failed to publish to the public zone.
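
The expected aggregation can be sketched as follows. This is a simplified Python model, not the ingress operator's Go code: the zone dictionaries mimic the dnsrecord status shown in the steps above, the zone ids are placeholders, and the "FailedZones" reason is an assumed name (only "NoFailedZones" appears in this report).

```python
def dns_ready(zones):
    """Derive a DNSReady condition from the per-zone Published conditions."""
    failed = []
    for zone in zones:
        published = [c for c in zone["conditions"] if c["type"] == "Published"]
        # Assumption of this sketch: a missing or non-"True" Published condition is a failure.
        if not published or any(c["status"] != "True" for c in published):
            failed.append(zone["dnsZone"]["id"])
    if failed:
        return {"type": "DNSReady", "status": "False", "reason": "FailedZones", "zones": failed}
    return {"type": "DNSReady", "status": "True", "reason": "NoFailedZones"}


# Placeholder zones: one published, one failing with ProviderError as in step 2.
zones = [
    {"dnsZone": {"id": "private-zone"},
     "conditions": [{"type": "Published", "status": "True"}]},
    {"dnsZone": {"id": "public-zone"},
     "conditions": [{"type": "Published", "status": "False", "reason": "ProviderError"}]},
]
print(dns_ready(zones)["status"])
```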

Additional info:

 

This is a clone of issue OCPBUGS-19314. The following is the description of the original issue:

Description

As a user, I don't want to see the "DeploymentConfigs" option in the User settings when DeploymentConfigs are not installed in the cluster.

Acceptance Criteria

  1. Hide the DeploymentConfig option as the Default Resource Type when it's not installed

Additional Details:

Description of problem:

When deploying a 4.14 spoke, the agentclusterinstall is stuck at the finalizing stage.

The clusterversion on the spoke reports "Unable to apply 4.14.0-0.ci-2023-06-13-083232: the cluster operator monitoring is not available"

Please note: the console operator is disabled on purpose; this is needed in the telco case to reduce platform resource usage.

[kni@registry.kni-qe-28 ~]$ oc get clusterversions.config.openshift.io -A
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          46m     Unable to apply 4.14.0-0.ci-2023-06-13-083232: the cluster operator monitoring is not available
[kni@registry.kni-qe-28 ~]$ oc get clusterversions.config.openshift.io -n version -o yaml 
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2023-06-13T15:16:32Z"
    generation: 2
    name: version
    resourceVersion: "20061"
    uid: f8fc0c3e-009d-4d86-a05d-2fd0aba59528
  spec:
    capabilities:
      additionalEnabledCapabilities:
      - marketplace
      - NodeTuning
      baselineCapabilitySet: None
    channel: stable-4.14
    clusterID: 5cfc0491-5a23-4383-935b-71e3c793e875
  status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - NodeTuning
      - marketplace
      knownCapabilities:
      - CSISnapshot
      - Console
      - Insights
      - NodeTuning
      - Storage
      - baremetal
      - marketplace
      - openshift-samples
    conditions:
    - lastTransitionTime: "2023-06-13T15:16:33Z"
      message: 'Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.14&id=5cfc0491-5a23-4383-935b-71e3c793e875&version=4.14.0-0.ci-2023-06-13-083232":
        dial tcp 54.211.39.83:443: connect: network is unreachable'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2023-06-13T15:16:33Z"
      message: Capabilities match configured spec
      reason: AsExpected
      status: "False"
      type: ImplicitlyEnabledCapabilities
    - lastTransitionTime: "2023-06-13T15:16:33Z"
      message: Payload loaded version="4.14.0-0.ci-2023-06-13-083232" image="registry.kni-qe-28.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev/ocp-release@sha256:826bb878c5a1469ee8bb991beebc38a4e25b8f5cef9cdf1931ef99ffe5ffbc80"
        architecture="amd64"
      reason: PayloadLoaded
      status: "True"
      type: ReleaseAccepted
    - lastTransitionTime: "2023-06-13T15:16:33Z"
      status: "False"
      type: Available
    - lastTransitionTime: "2023-06-13T15:41:36Z"
      message: Cluster operator monitoring is not available
      reason: ClusterOperatorNotAvailable
      status: "True"
      type: Failing
    - lastTransitionTime: "2023-06-13T15:16:33Z"
      message: 'Unable to apply 4.14.0-0.ci-2023-06-13-083232: the cluster operator
        monitoring is not available'
      reason: ClusterOperatorNotAvailable
      status: "True"
      type: Progressing
    desired:
      image: registry.kni-qe-28.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev/ocp-release@sha256:826bb878c5a1469ee8bb991beebc38a4e25b8f5cef9cdf1931ef99ffe5ffbc80
      version: 4.14.0-0.ci-2023-06-13-083232
    history:
    - completionTime: null
      image: registry.kni-qe-28.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev/ocp-release@sha256:826bb878c5a1469ee8bb991beebc38a4e25b8f5cef9cdf1931ef99ffe5ffbc80
      startedTime: "2023-06-13T15:16:33Z"
      state: Partial
      verified: false
      version: 4.14.0-0.ci-2023-06-13-083232
    observedGeneration: 2
    versionHash: H6tRc6p_ZWU=
kind: List
metadata:
  resourceVersion: ""

[kni@registry.kni-qe-28 ~]$ oc get co -A
NAME                                       VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-0.ci-2023-06-13-083232   True        False         False      14m     
cloud-controller-manager                   4.14.0-0.ci-2023-06-13-083232   True        False         False      24m     
cloud-credential                           4.14.0-0.ci-2023-06-13-083232   True        False         False      25m     
cluster-autoscaler                         4.14.0-0.ci-2023-06-13-083232   True        False         False      24m     
config-operator                            4.14.0-0.ci-2023-06-13-083232   True        False         False      25m     
control-plane-machine-set                  4.14.0-0.ci-2023-06-13-083232   True        False         False      24m     
dns                                        4.14.0-0.ci-2023-06-13-083232   True        False         False      19m     
etcd                                       4.14.0-0.ci-2023-06-13-083232   True        False         False      22m     
image-registry                             4.14.0-0.ci-2023-06-13-083232   True        False         False      14m     
ingress                                    4.14.0-0.ci-2023-06-13-083232   True        False         False      25m     
kube-apiserver                             4.14.0-0.ci-2023-06-13-083232   True        False         False      18m     
kube-controller-manager                    4.14.0-0.ci-2023-06-13-083232   True        False         False      19m     
kube-scheduler                             4.14.0-0.ci-2023-06-13-083232   True        False         False      17m     
kube-storage-version-migrator              4.14.0-0.ci-2023-06-13-083232   True        False         False      25m     
machine-api                                4.14.0-0.ci-2023-06-13-083232   True        False         False      25m     
machine-approver                           4.14.0-0.ci-2023-06-13-083232   True        False         False      24m     
machine-config                             4.14.0-0.ci-2023-06-13-083232   True        False         False      21m     
marketplace                                4.14.0-0.ci-2023-06-13-083232   True        False         False      25m     
monitoring                                                                 False       True          True       14m     reconciling Console Plugin failed: creating ConsolePlugin object failed: the server could not find the requested resource (post consoleplugins.console.openshift.io)
network                                    4.14.0-0.ci-2023-06-13-083232   True        False         False      26m     
node-tuning                                4.14.0-0.ci-2023-06-13-083232   True        False         False      25m     
openshift-apiserver                        4.14.0-0.ci-2023-06-13-083232   True        False         False      14m     
openshift-controller-manager               4.14.0-0.ci-2023-06-13-083232   True        False         False      18m     
operator-lifecycle-manager                 4.14.0-0.ci-2023-06-13-083232   True        False         False      25m     
operator-lifecycle-manager-catalog         4.14.0-0.ci-2023-06-13-083232   True        False         False      25m     
operator-lifecycle-manager-packageserver   4.14.0-0.ci-2023-06-13-083232   True        False         False      19m     
service-ca                                 4.14.0-0.ci-2023-06-13-083232   True        False         False      25m    

Version-Release number of selected component (if applicable):
4.14

How reproducible:

100%

Steps to Reproduce:

1. Deploy a RAN DU spoke cluster via the GitOps ZTP approach with multiple base capabilities disabled, including the Console operator.
   spec:
     capabilities:
       additionalEnabledCapabilities:
         - marketplace
         - NodeTuning
       baselineCapabilitySet: None
     channel: stable-4.14
2. Monitor ocp deployment on spoke.

Actual results:

Deployment fails while finalizing agentclusterinstall. The clusterversion on the spoke reports "the cluster operator monitoring is not available"

Expected results:

Successful spoke deployment

Additional info:

After manually enabling the Console capability in clusterversion, the monitoring operator succeeded and the OCP install completed.

must-gather logs:
https://drive.google.com/file/d/19zO21jqcVTIkAdGS2DEqQuhg2oGUmuNY/view?usp=sharing
https://drive.google.com/file/d/1PXjZmBdMwHWNwkaXr2wE9tTtBRJWYeKP/view?usp=sharing
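
The failure above suggests that a component creating a ConsolePlugin object should first check whether the Console capability is enabled. The following is a minimal sketch of such a check, written in Python for illustration only (the operators themselves are written in Go). The function name `console_enabled` is hypothetical, and the dictionary mirrors the `status.capabilities` section of the ClusterVersion output shown above.

```python
def console_enabled(cluster_version: dict) -> bool:
    """Return True if the Console capability is listed as enabled.

    `cluster_version` is a ClusterVersion object parsed into a dict.
    """
    caps = cluster_version.get("status", {}).get("capabilities", {})
    # enabledCapabilities may be missing or null when nothing is enabled.
    return "Console" in (caps.get("enabledCapabilities") or [])


# Values copied from the ClusterVersion status in this report.
spoke_cluster_version = {
    "status": {
        "capabilities": {
            "enabledCapabilities": ["NodeTuning", "marketplace"],
            "knownCapabilities": [
                "CSISnapshot", "Console", "Insights", "NodeTuning",
                "Storage", "baremetal", "marketplace", "openshift-samples",
            ],
        }
    }
}
```

With the spoke's status, the check returns False, so a caller following this pattern would skip the ConsolePlugin reconciliation instead of failing.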

 

Description of problem:

While reviewing PRs in CoreDNS 1.11.0, we stumbled upon https://github.com/coredns/coredns/pull/6179, which describes a CoreDNS crash in the kubernetes plugin that occurs if you create an EndpointSlice object containing a port without a port number.

I reproduced this myself and was able to successfully bring down all of CoreDNS so that the cluster was put into a degraded state.

We've bumped to CoreDNS 1.11.1 in 4.15, so this is a concern only for versions earlier than 4.15.

Version-Release number of selected component (if applicable):

Less than or equal to 4.14

How reproducible:

100%

Steps to Reproduce:

1. Create an endpointslice with a port with no port number:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-abc
addressType: IPv4
ports:
  - name: ""

2. Shortly after creating this object, all DNS pods continuously crash:
oc get -n openshift-dns pods
NAME                  READY   STATUS             RESTARTS     AGE
dns-default-57lmh     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-h6cvm     1/2     CrashLoopBackOff   1 (4s ago)   79m
dns-default-mn7qd     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-mxq5g     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-wdrff     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-zs7cd     1/2     CrashLoopBackOff   1 (3s ago)   79m

Actual results:

DNS Pods crash

Expected results:

DNS Pods should NOT crash
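
Because the port number in an EndpointSlice port entry is optional, a consumer has to tolerate its absence. The sketch below shows a defensive pattern of this kind in Python; it is an illustration, not the upstream CoreDNS patch, and the function name `usable_ports` is hypothetical.

```python
def usable_ports(endpoint_slice: dict) -> list:
    """Collect (name, number) pairs, skipping ports that have no number."""
    ports = []
    for entry in endpoint_slice.get("ports") or []:
        number = entry.get("port")
        if number is None:
            # No port number given: skip the entry rather than dereference it.
            continue
        ports.append((entry.get("name", ""), number))
    return ports


# The EndpointSlice from the reproduction steps, parsed into a dict.
example_slice = {
    "apiVersion": "discovery.k8s.io/v1",
    "kind": "EndpointSlice",
    "metadata": {"name": "example-abc"},
    "addressType": "IPv4",
    "ports": [{"name": ""}],
}
```

For the object from the reproduction steps this yields an empty list, so nothing downstream ever sees a missing port number.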

Additional info:

 

Description of problem:

The dynamic demo plugin locales are missing a correct plural string. The dynamic demo plugin doesn't make use of the script that console uses to transform plural strings, so we need to update the plural string manually.

This would help with further validation of the i18n dependency update changes, and with the investigation of the [Dynamic plugin translation support for plurals broken](https://issues.redhat.com/browse/OCPBUGS-11285) bug.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Deploy dynamic demo plugin on a cluster
2. Go to the Overview page
3.

Actual results:

The Node Worker string is NOT in correct plural format

Expected results:

The node Worker string is in the correct plural format

Additional info:

 

Description of problem:

In order for Windows nodes to use the openshift-cluster-csi-drivers/internal-feature-states.csi.vsphere.vmware.com ConfigMap, which contains the configuration for vSphere CSI, `csi-windows-support` must be set to true.
This is documented here: https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/833421f42475809b4f76ea125095b5120af0f8e1/docs/book/features/csi_driver_on_windows.md#how-to-enable-vsphere-csi-with-windows-nodes

Without this, a separate ConfigMap must be created and used for a user deploying Windows vSphere CSI drivers.

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Add a Windows node to the cluster
2. Deploy vsphere csi daemonset for windows nodes as documented upstream
3. Add a Windows pod with a pvc mount

Actual results:

The pod is unable to mount the volume as windows support is not enabled

Expected results:

The pod can mount the volume

Additional info:


Description of problem:

When we expand the baremetal IPI cluster with a static IP, no information is logged if the nmstate output is "--- {}\n", and the customized image is generated without the static network configuration.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

100%

Steps to Reproduce:

1. Expand the baremetal IPI cluster with a node, using the invalid nmstate data below.
   ---
   apiVersion: v1
   kind: Secret
   metadata:
    name: openshift-worker-0-network-config-secret
   type: Opaque
   stringData:
    nmstate: |
     foo:
      bar: baz
   ---
   apiVersion: v1
   kind: Secret
   metadata:
     name: openshift-worker-0-bmc-secret
     namespace: openshift-machine-api
   type: Opaque
   data:
     username: YWRtaW4K
     password: cGFzc3dvcmQK
   ---
   apiVersion: metal3.io/v1alpha1
   kind: BareMetalHost
   metadata:
     name: openshift-worker-0
     namespace: openshift-machine-api
   spec:
     online: True
     bootMACAddress: 52:54:00:11:22:b4
     bmc:
       address: ipmi://192.168.123.1:6233
       credentialsName: openshift-worker-0-bmc-secret
       disableCertificateVerification: True
       username: admin
       password: password
     rootDeviceHints:
       deviceName: "/dev/sda"
     preprovisioningNetworkDataName: openshift-worker-0-network-config-secret

2. Check if an IP is configured with the node
3.

Actual results:

No static network configuration in the metal3 customized image.

Expected results:

Information should be logged and the metal3 customized image should not be generated.
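
A check for the empty result described above could look roughly like the following sketch. It is written in Python for illustration and uses plain string handling rather than a YAML parser; the function name `is_empty_nmstate_output` is hypothetical and is not taken from the image-customization-controller code.

```python
def is_empty_nmstate_output(output: str) -> bool:
    """Return True when the rendered output carries no configuration.

    Treats an empty string and an empty YAML mapping, with or without the
    leading document marker (for example "--- {}\n"), as empty.
    """
    body = output.strip()
    if body.startswith("---"):
        # Drop the YAML document start marker before comparing.
        body = body[3:].strip()
    return body in ("", "{}")
```

A caller could log the condition and refuse to build the image when the function returns True.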

Additional info:

https://github.com/openshift/image-customization-controller/pull/72

This is a clone of issue OCPBUGS-17724. The following is the description of the original issue:

Environment: OCP 4.12.24
Installation Method: IPI: Manual Mode + STS using a customer-provided AWS IAM Role

I am trying to deploy an OCP4 cluster on AWS for my customer. The customer does not permit creation of IAM users, so I am performing a Manual Mode with STS IPI installation instead. I have been given an IAM role to assume for the OCP installation, but unfortunately the customer's AWS Organizational Service Control Policy (SCP) does not permit the use of the iam:GetUser permission.

(I have informed my customer that iam:GetUser is an installation requirement; it is clearly documented in our docs. I have also raised a ticket with their internal support team requesting that their SCP be amended to include iam:GetUser, but I have been informed that my request is likely to be rejected.)

With this limitation understood, I still attempted to install OCP4. Surprisingly, I was able to deploy an OCP (4.12) cluster without any apparent issues, however when I tried to destroy the cluster I encountered the following error from the installer (note: fields in brackets <> have been redacted):

DEBUG search for IAM roles
DEBUG iterating over a page of 74 IAM roles
DEBUG search for IAM users
DEBUG iterating over a page of 1 IAM users
INFO get tags for <ARN of the IAM user>: AccessDenied: User:<ARN of my user> is not authorized to perform: iam:GetUser on resource: <IAM username> with an explicit deny in a service control policy
INFO status code: 403, request id: <request ID>
DEBUG search for IAM instance profiles
INFO error while finding resources to delete error=get tags for <ARN of IAM user> AccessDenied: User:<ARN of my user> is not authorized to perform: iam:GetUser on resource: <IAM username> with an explicit deny in a service control policy status code: 403, request id: <request ID>

Similarly, the error in AWS CloudTrail logs shows the following (note: some fields in brackets have been redacted):
User: arn:aws:sts::<AWS account no>:assumed-role/<role-name>/<user name> is not authorized to perform: iam:GetUser on resource <IAM User> with an explicit deny in a service control policy

It appears that the destroy operation is failing when the installer is trying to list tags on the only IAM user in the customer's AWS account. As discussed, the SCP does not permit the use of iam:GetUser and consequently this API call on the IAM user is denied. The installer then enters an endless loop as it continuously retries the operation. We have potentially identified the iamUserSearch function within the installer code at pkg/destroy/aws/iamhelpers.go as the area where this call is failing.

See: https://github.com/openshift/installer/blob/16f19ea94ecdb056d4955f33ddacc96c57341bb2/pkg/destroy/aws/iamhelpers.go#L95

There does not appear to be a handler for "AccessDenied" API error in this function. Therefore we request that the access denied event is gracefully handled and skipped over when processing IAM users, allowing the installer to continue with the destroy operation, much in the same way that a similar access denied event is handled within the iamRoleSearch function when processing IAM roles:

See: https://github.com/openshift/installer/blob/16f19ea94ecdb056d4955f33ddacc96c57341bb2/pkg/destroy/aws/iamhelpers.go#L51

We therefore request that the following is considered and addressed:

1. Re-assess if the iam:GetUser permission is actually needed for cluster installation/cluster operations. 
2. If the permission is required then the installer should provide a warning or halt the installation.
3. During a "destroy" cluster operation, the installer should gracefully handle AccessDenied errors from the API, skip over any IAM users that it does not have permission to list tags for, and then continue with the destroy operation.
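
The skip-on-AccessDenied behavior requested for the destroy operation can be sketched as follows. This is an illustration in Python, not the installer's Go code: `AccessDeniedError`, `users_with_tag`, and the `get_tags` callback are hypothetical names standing in for the real API error and tag lookup.

```python
class AccessDeniedError(Exception):
    """Hypothetical stand-in for an AccessDenied API error (HTTP 403)."""


def users_with_tag(users, get_tags, key, value):
    """Return (matched, skipped) IAM users for a cluster ownership tag.

    `get_tags(user)` returns a dict of tags or raises AccessDeniedError.
    Users whose tags cannot be read are skipped instead of retried forever.
    """
    matched, skipped = [], []
    for user in users:
        try:
            tags = get_tags(user)
        except AccessDeniedError:
            # Not permitted to inspect this user: record it and move on.
            skipped.append(user)
            continue
        if tags.get(key) == value:
            matched.append(user)
    return matched, skipped
```

Returning the skipped users separately lets the caller log them once, rather than looping on the same denied call.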

Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/18

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/oauth-server/pull/119

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When an HCP Service LB is created, for example for an IngressController, the CAPA controller calls ModifyNetworkInterfaceAttribute. It references the default security group for the VPC in addition to the security group created for the cluster (with the right tags). Ideally, the LBs (and any other HCP components) should not be using the default VPC security group.

Version-Release number of selected component (if applicable):

All 4.12 and 4.13

How reproducible:

100%

Steps to Reproduce:

1. Create HCP
2. Wait for Ingress to come up.
3. Look in CloudTrail for ModifyNetworkInterfaceAttribute, and see default security group referenced 

Actual results:

Default security group is used

Expected results:

Default security group should not be used

Additional info:

This is problematic as we are attempting to scope our AWS permissions as narrowly as possible. The goal is to only use resources that are tagged with `red-hat-managed: true` so that our IAM policies can be conditioned to only access these resources. Using the security group created for the cluster should be sufficient, and the default security group does not need to be used, so if the usage can be removed here, we can secure our AWS policies that much better. Similar to OCPBUGS-11894.

Description of problem:

The oc idle tests do not expect the deprecation warning in the command output, so they break.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Run the test
2. Watch it fail
3.

Actual results:

Error running /usr/bin/oc --namespace=e2e-test-oc-idle-hns4c --kubeconfig=/tmp/configfile3347652119 describe deploymentconfigs v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
deploymentconfig.apps.openshift.io:
StdOut>
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
Error from server (NotFound): deploymentconfigs.apps.openshift.io "v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
deploymentconfig.apps.openshift.io" not found
StdErr>
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
Error from server (NotFound): deploymentconfigs.apps.openshift.io "v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
deploymentconfig.apps.openshift.io" not found
exit status 1

Expected results:

Tests should pass

Additional info:

I have tracked down the problem to this line: https://github.com/openshift/origin/blob/master/test/extended/cli/idle.go#LL49C40-L49C40

deploymentConfigName gets assigned to "v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ deploymentconfig.apps.openshift.io", which leads to the next command not finding a deployment config.
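
One way to make this kind of parsing robust is to drop warning lines before extracting the name. The sketch below is in Python for illustration (the test itself is written in Go) and assumes the command prints `<resource>/<name>` on its last non-warning line; the function name `resource_name` and the example name `idling-echo` are hypothetical.

```python
def resource_name(output: str) -> str:
    """Extract the object name from `<resource>/<name>` command output.

    Lines starting with "Warning:" (such as deprecation notices) and blank
    lines are ignored, so they cannot leak into the parsed name.
    """
    lines = [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("Warning:")
    ]
    if not lines:
        raise ValueError("no resource line found in output")
    _, separator, name = lines[-1].partition("/")
    if not separator or not name:
        raise ValueError("unexpected output format: " + lines[-1])
    return name


sample_output = (
    "Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, "
    "unavailable in v4.10000+\n"
    "deploymentconfig.apps.openshift.io/idling-echo\n"
)
```

Splitting only the last remaining line on the first "/" avoids picking up the "/" inside `apps.openshift.io/v1` from the warning text.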

Description of problem:

The target.workload.openshift.io/management annotation causes CNO operator pods to wait for nodes to appear. Eventually they give up waiting and they get scheduled. This annotation should not be set for the hosted control plane topology, given that we should not wait for nodes to exist for the CNO to be scheduled.

Version-Release number of selected component (if applicable):

4.14, 4.13

How reproducible:

always

Steps to Reproduce:

1. Create IBM ROKS cluster
2. Wait for cluster to come up
3.

Actual results:

Cluster takes a long time to come up because CNO pods take ~15 min to schedule.

Expected results:

Cluster comes up quickly

Additional info:

Note: Verification for the fix has already happened on the IBM Cloud side. All OCP QE needs to do is to make sure that the fix doesn't cause any regression to the regular OCP use case.

Description of problem:

Techpreview parallel jobs are failing due to changes in the insights operator

Example failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview/1663408887002304512

Looks like it's from https://github.com/openshift/insights-operator/pull/764

https://sippy.dptools.openshift.org/sippy-ng/jobs/4.14/analysis?filters=%7B%22items%22%3A%5B%7B%22id%22%3A0%2C%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-techpreview%22%7D%2C%7B%22id%22%3A1%2C%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview%22%7D%2C%7B%22id%22%3A2%2C%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.14-e2e-vsphere-ovn-techpreview%22%7D%5D%2C%22linkOperator%22%3A%22or%22%7D


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of the problem:

On a cluster with 3 masters and 5 additional disks attached across them, the device storage sets for the operator show only 3 storage devices instead of the expected 5 additional disks.

How reproducible:

80%,

OCP 4.12, OCS 4.12.1

also reproduces on OCP 4.11

Steps to reproduce:

1. Create a Cluster with 3 master nodes

2. Attach 2 additional disks to master 1, 2 additional disks to master 2, and 1 additional disk to master 3

3. Check the count of storage devices on the operator

Actual results:
The operator shows device set count = 3

Expected results:
The device set count should equal the number of valid additional attached disks (= 5)

Description of problem:

When deploying hosts using ironic's agent both the ironic service address and inspector address are required.

The ironic service is proxied such that it can be accessed at a consistent endpoint regardless of where the pod is running. This is not the case for the inspection service.

This means that if the inspection service moves after we find the address, provisioning will fail.

In particular this non-matching behavior is frustrating when using the CBO [GetIronicIP function](https://github.com/openshift/cluster-baremetal-operator/blob/6f0a255fdcc7c0e5c04166cb9200be4cee44f4b7/provisioning/utils.go#L95-L127) as one return value is usable forever but the other needs to somehow be re-queried every time the pod moves.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Relatively

Steps to Reproduce:

1. Retrieve the inspector IP from GetIronicIP
2. Reschedule the inspector service pod
3. Provision a host

Actual results:

Ironic python agent raises an exception

Expected results:

Host provisions

Additional info:

This was found while deploying clusters using ZTP

In this scenario specifically an image containing the ironic inspector IP is valid for an extended period of time. The same image can be used for multiple hosts and possibly multiple different spoke clusters.

Our controller shouldn't be expected to watch the ironic pod to ensure we update the image whenever it moves. The best we can do is re-query the inspector IP whenever a user makes changes to the image, but that may still not be often enough.

Description of problem:

When a catalogsource name starts with a number, the pod does not run properly. Could we add a check on the name so that, if it does not match the regex used for validation ('[a-z]([-a-z0-9]*[a-z0-9])?'), a message is printed and the catalogsource is not created?

Version-Release number of selected component (if applicable):


How reproducible:

always 

Steps to Reproduce:

1. skopeo copy --all --format v2s2 docker://icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:6f02ecef46020bcd21bdd24a01f435023d5fc3943972ef0d9769d5276e178e76 oci:///home1/611/oci-index
2. Change the working directory: `cd  home1/611/oci-index`
3. Run the oc-mirror command:
cat config.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /home1/ocilocalstorage
mirror:
  operators:
  - catalog: oci:///home1/611/oci-index

`oc-mirror --config config.yaml docker://ec2-18-217-58-249.us-east-2.compute.amazonaws.com:5000/multi-oci --dest-skip-tls --include-local-oci-catalogs`
4. Apply the catalogsource and ICSP YAML files.
5. Check the catalogsource pod.


Actual results:

[root@preserve-fedora36 oci-index]# oc get pod --show-labels 
NAME                                    READY   STATUS              RESTARTS   AGE     LABELS
611-oci-index-2sfh8                     0/1     Terminating         0          4s      olm.catalogSource=611-oci-index,olm.pod-spec-hash=6b8656f87
611-oci-index-dbj9b                     0/1     ContainerCreating   0          1s      olm.catalogSource=611-oci-index,olm.pod-spec-hash=6b8656f87
611-oci-index-w4tfd                     0/1     Terminating         0          2s      olm.catalogSource=611-oci-index,olm.pod-spec-hash=6b8656f87
611-oci-index-zj8nn                     0/1     Terminating         0          3s      olm.catalogSource=611-oci-index,olm.pod-spec-hash=6b8656f87

oc get catalogsource 611-oci-index -oyaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2023-05-10T03:01:36Z"
  generation: 1
  name: 611-oci-index
  namespace: openshift-marketplace
  resourceVersion: "97108"
  uid: 2287434b-9e70-4865-b1a1-95997165f94e
spec:
  image: ec2-18-217-58-249.us-east-2.compute.amazonaws.com:5000/multi-oci/home1/611/oci-index:6f02ec
  sourceType: grpc
status:
  message: 'couldn''t ensure registry server - error ensuring service: 611-oci-index:
    Service "611-oci-index" is invalid: metadata.name: Invalid value: "611-oci-index":
    a DNS-1035 label must consist of lower case alphanumeric characters or ''-'',
    start with an alphabetic character, and end with an alphanumeric character (e.g.
    ''my-name'',  or ''abc-123'', regex used for validation is ''[a-z]([-a-z0-9]*[a-z0-9])?'')'
  reason: RegistryServerError

Expected results:

The catalogsource should not be created when its name does not match the regex used for validation.

Additional info:
After renaming the catalogsource to oci-611-index, the pod runs well, and the operator and instance can be created.
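
The requested name check can be sketched with the regex quoted in the error message above. This is a Python illustration under the assumption that the check runs before the catalogsource is created; `is_dns1035_label` is a hypothetical helper name, and the 63-character limit is the usual length cap for DNS labels.

```python
import re

# Regex quoted in the Service validation error for DNS-1035 labels.
DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
DNS1035_MAX_LENGTH = 63


def is_dns1035_label(name: str) -> bool:
    """Return True if `name` is usable as a DNS-1035 label."""
    if len(name) > DNS1035_MAX_LENGTH:
        return False
    # fullmatch ensures the whole name, not just a prefix, fits the pattern.
    return DNS1035_LABEL.fullmatch(name) is not None
```

With this check, `611-oci-index` is rejected because it starts with a digit, while the renamed `oci-611-index` passes.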

 

Description of problem:

The current version of openshift/cluster-ingress-operator vendors Kubernetes 1.26 packages. OpenShift 4.14 is based on Kubernetes 1.27.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.14/go.mod 

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.26

Expected results:

Kubernetes packages are at version v0.27.0 or later.

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.
controller-runtime will need to be bumped to 0.15 as well.

This is a clone of issue OCPBUGS-19376. The following is the description of the original issue:

Description of problem:

IPI installation using the service account attached to a GCP VM always fails with the error "unable to parse credentials".

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-15-233408

How reproducible:

Always

Steps to Reproduce:

1. "create install-config"
2. edit install-config.yaml to insert "credentialsMode: Manual"
3. "create manifests"
4. manually create the required credentials and copy the manifests to installation-dir/manifests directory
5. launch the bastion host along with binding to the pre-configured service account ipi-on-bastion-sa@openshift-qe.iam.gserviceaccount.com and scopes being "cloud-platform"
6. copy the installation-dir and openshift-install to the bastion host
7. try "create cluster" on the bastion host 

Actual results:

The installation failed on "Creating infrastructure resources"

Expected results:

The installation should succeed.

Additional info:

(1) FYI the 4.12 epic: https://issues.redhat.com/browse/CORS-2260

(2) 4.12.34 doesn't have the issue (Flexy-install/234112/). 

(3) 4.13.13 doesn’t have the issue (Flexy-install/234126/).

(4) The 4.14 errors (Flexy-install/234113/):
09-19 16:13:44.919  level=info msg=Consuming Master Ignition Config from target directory
09-19 16:13:44.919  level=info msg=Consuming Bootstrap Ignition Config from target directory
09-19 16:13:44.919  level=info msg=Consuming Worker Ignition Config from target directory
09-19 16:13:44.919  level=info msg=Credentials loaded from gcloud CLI defaults
09-19 16:13:49.071  level=info msg=Creating infrastructure resources...
09-19 16:13:50.950  level=error
09-19 16:13:50.950  level=error msg=Error: unable to parse credentials
09-19 16:13:50.950  level=error
09-19 16:13:50.950  level=error msg=  with provider["openshift/local/google"],
09-19 16:13:50.950  level=error msg=  on main.tf line 10, in provider "google":
09-19 16:13:50.950  level=error msg=  10: provider "google" {
09-19 16:13:50.950  level=error
09-19 16:13:50.950  level=error msg=unexpected end of JSON input
09-19 16:13:50.950  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "cluster" stage: failed to create cluster: failed to apply Terraform: exit status 1
09-19 16:13:50.950  level=error
09-19 16:13:50.950  level=error msg=Error: unable to parse credentials
09-19 16:13:50.950  level=error
09-19 16:13:50.950  level=error msg=  with provider["openshift/local/google"],
09-19 16:13:50.950  level=error msg=  on main.tf line 10, in provider "google":
09-19 16:13:50.950  level=error msg=  10: provider "google" {
09-19 16:13:50.950  level=error
09-19 16:13:50.950  level=error msg=unexpected end of JSON input
09-19 16:13:50.950  level=error

The agent does not replace localhost.localdomain node names with MAC addresses
when the cluster network configuration is Static IPs with VLAN.
Found in the agent log:
Dec 20 17:37:42 localhost.localdomain inventory[2284]: time="20-12-2022 17:37:42" level=info msg="Replaced original forbidden hostname with calculated one" file="inventory.go:63" calculated=localhost.localdomain original=localhost.localdomain

As a result:
Cluster is not ready yet.
The cluster is not ready yet. Some hosts have an ineligible name. To change the hostname, click on it.

How reproducible:
1. Provision libvirt VMs and network with VLAN
2. Create cluster and select Static IP Network configuration
3. Fill in all required fields in the form view and press Next
4. Generate and download ISO
5. Wait until the nodes are up and discovered

Actual results:
Nodes have localhost.localdomain names
 
Expected results:
Nodes are named after the host's MAC address
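
The expected replacement can be sketched as below. This is a Python illustration, not the agent's code: the set of forbidden hostnames and the dash-separated form of the generated name are assumptions, and `calculate_hostname` is a hypothetical function name. The MAC address in the usage note is an example value.

```python
# Assumed list of hostnames that must not be kept as node names.
FORBIDDEN_HOSTNAMES = {"localhost", "localhost.localdomain"}


def calculate_hostname(original: str, mac: str) -> str:
    """Return the original hostname, or a MAC-derived one if it is forbidden."""
    if original.lower() not in FORBIDDEN_HOSTNAMES:
        return original
    # Colons are not valid in hostnames, so use dashes (assumed format).
    return mac.lower().replace(":", "-")
```

For example, `calculate_hostname("localhost.localdomain", "52:54:00:11:22:B4")` yields `52-54-00-11-22-b4`, whereas the log line above shows the calculated name equal to the forbidden original.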

Description of problem:

Cluster Network Operator managed component multus-admission-controller does not conform to Hypershift control plane expectations.

When CNO is managed by Hypershift, multus-admission-controller must run with non-root security context. If Hypershift runs control plane on kubernetes (as opposed to Openshift) management cluster, it adds pod or container security context to most deployments with runAsUser clause inside.

In Hypershift CPO, the security context of deployment containers, including CNO, is set when it detects that SCC's are not available, see https://github.com/openshift/hypershift/blob/9d04882e2e6896d5f9e04551331ecd2129355ecd/support/config/deployment.go#L96-L100. In such a case CNO should do the same, set security context for its managed deployment multus-admission-controller to meet Hypershift standard.

 

How reproducible:

Always

Steps to Reproduce:

1. Create an OCP cluster with Hypershift using a Kube management cluster
2. Check the pod security context of multus-admission-controller

Actual results:

no pod security context is set

Expected results:

pod security context is set with runAsUser: xxxx

Additional info:

This is the highest priority item from https://issues.redhat.com/browse/OCPBUGS-7942 and it needs to be fixed ASAP as it is a security issue preventing IBM from releasing Hypershift-managed Openshift service.
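
The requested behavior can be sketched on a plain dictionary representation of a Deployment. This is a minimal illustration, not the CNO or Hypershift implementation: the function name, the `scc_available` flag, and the UID passed in the example are hypothetical.

```python
import copy


def apply_security_context(deployment: dict, scc_available: bool, run_as_user: int) -> dict:
    """Return a copy of the deployment with a pod-level runAsUser when SCCs are unavailable."""
    result = copy.deepcopy(deployment)
    if scc_available:
        # On an OpenShift management cluster the SCC admission assigns the UID.
        return result
    pod_spec = result.setdefault("spec", {}).setdefault("template", {}).setdefault("spec", {})
    pod_spec.setdefault("securityContext", {})["runAsUser"] = run_as_user
    return result


deployment = {"metadata": {"name": "multus-admission-controller"}, "spec": {"template": {"spec": {}}}}
# 1001 is a placeholder UID chosen for the example only.
patched = apply_security_context(deployment, scc_available=False, run_as_user=1001)
print(patched["spec"]["template"]["spec"]["securityContext"])
```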

Description of the problem:

When a 9.2-based live ISO is used in agentserviceconfig, after booting into the CD the spoke console gets stuck acquiring the live PXE rootfs with a "could not resolve host" error.

 

It seems the DNS server configured in nmstate is not applied to spoke.

 

How reproducible:

100% 

 

Steps to reproduce:

  1. configure agentserviceconfig to use 4.13 9.2 live iso. (413.92.202303190222-0)

2. install SNO via ZTP 

3. Monitor install CRs on hub

Actual results:

  • agentclusterinstall stuck at "insufficient" state
  • spoke console shows could not resolve host when attempt to download rootfs image (screenshot attached)

Expected results:

  • install succeeded

 

Extra info:

  • ACM version: latest 2.7.3 downstream snapshot  
  • Did not encounter this specific issue if switch to 8.6 based 4.13 live iso in agentserviceconfig.
  • However, even if we bypass this step, a similar issue occurs after booting into the HD, which has a 9.2-based OS: the DNS server on the spoke differs from the one configured in nmstate, causing DNS resolution to fail.
    • And we did not see this issue when using ACM 2.7.2 snapshot from about 3 weeks ago. We were able to install the same cluster using same networking configs with 4.13 9.2 build (8.6 live iso). 

Description of the problem:

Infraenv creation data missing

 

How reproducible:

data is propagated only on infraenv update

 

Steps to reproduce:

1. create new cluster

2. check elastic data: some special feature is missing

 

Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/42

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-13152. The following is the description of the original issue:

Description of problem:
With OCPBUGS-11099 our Pipeline Plugin supports the TektonConfig config "embedded-status: minimal" option that will be the default in OpenShift Pipelines 1.11+.

But since this change, the Pipeline pages load the TaskRuns for every Pipeline and PipelineRun row. To decrease the risk of a performance issue, we should make this call only if status.tasks isn't defined.

Version-Release number of selected component (if applicable):

  • 4.12-4.14, as soon as OCPBUGS-11099 is backported.
  • Tested with Pipelines operator 1.10.1

How reproducible:
Always

Steps to Reproduce:

  1. Install Pipelines operator
  2. Import a Git repository and enable the Pipeline option
  3. Open the browser network inspector
  4. Navigate to the Pipeline page

Actual results:
The list page loads a list of TaskRuns for each Pipeline / PipelineRun even if the PipelineRun already contains the related data (status.tasks)

Expected results:
No unnecessary network calls. When the admin changes the TektonConfig config "embedded-status" option to minimal, the UI should still work and load the TaskRuns as it does today.
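
The conditional loading described above can be sketched as follows. The console plugin itself is not written in Python; the function name and the `fetch_task_runs` callback are hypothetical and only illustrate the "fetch only when status.tasks is missing" rule.

```python
from typing import Callable


def task_runs_for(pipeline_run: dict, fetch_task_runs: Callable[[str], list]) -> list:
    """Use the embedded status data when present, otherwise fall back to a network call."""
    embedded = pipeline_run.get("status", {}).get("tasks")
    if embedded is not None:
        return embedded
    # Only reached when the embedded status is missing (e.g. embedded-status: minimal).
    return fetch_task_runs(pipeline_run["metadata"]["name"])


calls = []


def fake_fetch(name: str) -> list:
    # Stand-in for the API request; records that a call was made.
    calls.append(name)
    return [f"{name}-taskrun"]


full = {"metadata": {"name": "run-a"}, "status": {"tasks": ["embedded-taskrun"]}}
minimal = {"metadata": {"name": "run-b"}, "status": {}}
print(task_runs_for(full, fake_fetch), task_runs_for(minimal, fake_fetch), calls)
```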

Additional info:
None

Description of the problem:

#!/bin/bash
while sleep 0.5; do
    for i in {1..10}; do
        curl -I -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" 'https://api.stage.openshift.com/api/assisted-install/v2/infra-envs/3dc00d41-46bf-4b83-9874-f21812263c97/downloads/files?discovery_iso_type=full-iso&file_name=discovery.ign' > /dev/null &
    done ;
done
 

 

The script above causes assisted-service CPU usage to spike and the 99th percentile of request latency to jump to 10s

How reproducible:

100%

Steps to reproduce:

1. run script above

2. check response time/cpu usage

3.

Actual results:

response time really slow / 504

Expected results:

service continues to run smoothly

Description of the problem:

Change the user message from: "Host is not compatible with cluster platform %s; either disable this host or choose a compatible cluster platform (%v)" to "Host is not compatible with cluster platform %s; either disable this host or discover a new, compatible host."

How reproducible:

100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of problem:

Fix grammatical error in feedback modal. Remove 'the' before openshift text.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

OCP FeatureGate object gets a new status field, where the enabled feature gates are listed. We should use this new field instead of parsing FeatureGate.Spec.

This should be fully transparent to users: they still set FeatureGate.Spec, and they should still observe that the SharedResource CSI driver and operator are installed when they enable the TechPreviewNoUpgrade feature set there.

Enhancement: https://github.com/openshift/cluster-storage-operator/pull/368
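
Reading the enabled gates from the status can be sketched as below. The field layout (`status.featureGates[].version` and `.enabled[].name`) is written from memory of the FeatureGate API and should be checked against the API reference; the function name, version string and gate name are example values.

```python
def enabled_feature_gates(feature_gate: dict, version: str) -> set:
    """Collect the names of gates listed as enabled for the given payload version."""
    for entry in feature_gate.get("status", {}).get("featureGates", []):
        if entry.get("version") == version:
            return {gate["name"] for gate in entry.get("enabled", [])}
    # No entry for this version: treat every gate as not enabled.
    return set()


feature_gate = {
    "spec": {"featureSet": "TechPreviewNoUpgrade"},
    "status": {
        "featureGates": [
            {
                "version": "4.14.0",
                "enabled": [{"name": "CSIDriverSharedResource"}],
                "disabled": [{"name": "SomeOtherGate"}],
            }
        ]
    },
}
gates = enabled_feature_gates(feature_gate, "4.14.0")
print("CSIDriverSharedResource" in gates)
```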

Sanitize OWNERS/OWNER_ALIASES:

1) OWNERS must have:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of Storage team.

Description of problem:

Metrics page is broken

Version-Release number of selected component (if applicable):

Openshift Pipelines 1.9.0 on 4.12

How reproducible:

Always

Steps to Reproduce:

1. Install Openshift Pipelines 1.9.0
2. Create a pipeline and run it several times
3. Update metrics.pipelinerun.duration-type and metrics.taskrun.duration-type to lastvalue
4. Navigate to created pipeline 
5. Switch to Metrics tab

Actual results:

The Metrics page is showing error

Expected results:

Metrics of the pipeline should be shown

Additional info:

 

Description of problem:

The operator has several versions and channels whose images may all resolve to the same 'latest' tag. Mirroring them as `additionalImages` fails with the error below:

[root@ip-172-31-249-209 jian]# oc-mirror --config mirror.yaml file:///root/jian/test/
...
...
sha256:672b4bee759f8115e5538a44c37c415b362fc24b02b0117fd4bdcc129c53e0a1 file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest
sha256:d90aecc425e1b2e0732d0a90bc84eb49eb1139e4d4fd8385070d00081c80b71c file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
info: Mirroring completed in 22.48s (125.8MB/s)
error: one or more errors occurred while uploading images

Version-Release number of selected component (if applicable):

[root@ip-172-31-249-209 jian]# oc-mirror version
Client Version: version.Info{Major:"0", Minor:"1", GitVersion:"v0.1.0", GitCommit:"6ead1890b7a21b6586b9d8253b6daf963717d6c3", GitTreeState:"clean", BuildDate:"2022-08-25T05:27:39Z", GoVersion:"go1.17.12", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1. use the below config:
[cloud-user@preserve-olm-env2 mirror-tmp]$ cat mirror.yaml
apiVersion: mirror.openshift.io/v1alpha1
kind: ImageSetConfiguration
# archiveSize: 4
mirror:
  additionalImages:
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:46a62d73aeebfb72ccc1743fc296b74bf2d1f80ec9ff9771e655b8aa9874c933
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:9e549c09edc1793bef26f2513e72e589ce8f63a73e1f60051e8a0ae3d278f394
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:c16891ee9afeb3fcc61af8b2802e56605fff86a505e62c64717c43ed116fd65e
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:5c37bd168645f3d162cb530c08f4c9610919d4dada2f22108a24ecdea4911d60
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:89a6abbf10908e9805d8946ad78b98a13a865cefd185d622df02a8f31900c4c1
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:de5b339478e8e1fc3bfd6d0b6784d91f0d3fbe0a133354be9e9d65f3d7906c2d
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:fdf774c4365bde48d575913d63ef3db00c9b4dda5c89204029b0840e6dc410b1
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:d90aecc425e1b2e0732d0a90bc84eb49eb1139e4d4fd8385070d00081c80b71c
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:15cc75164335fa178c80db4212d11e4a793f53d2b110c03514ce4c79a3717ca0
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:9e66db3a282ee442e71246787eb24c218286eeade7bce4d1149b72288d3878ad
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:546b14c1f3fb02b1a41ca9675ac57033f2b01988b8c65ef3605bcc7d2645be60
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:12d7061012fd823b57d7af866a06bb0b1e6c69ec8d45c934e238aebe3d4b68a5
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:41025e3e3b72f94a3290532bdd6cabace7323c3086a9ce434774162b4b1dd601
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:672b4bee759f8115e5538a44c37c415b362fc24b02b0117fd4bdcc129c53e0a1
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:92542b22911fbd141fadc53c9737ddc5e630726b9b53c477f4dfe71b9767961f
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:1feb7073dec9341cadcc892df39ae45c427647fb034cf09dce1b7aa120bbb459
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:7ca05f93351959c0be07ec3af84ffe6bb5e1acea524df210b83dd0945372d432
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:c0fe8830f8fdcbe8e6d69b90f106d11086c67248fa484a013d410266327a4aed
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:b386d0e1c9e12e9a3a07aa101257c6735075b8345a2530d60cf96ff970d3d21a


2. Run the following command:
$ oc-mirror --config mirror.yaml file:///root/jian/test/  

Actual results:

error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists

Expected results:

No error
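
The "file exists" errors suggest that several digests of one repository are written to the same `latest` target. A minimal sketch that detects such collisions in an `additionalImages` list, assuming digest-only references fall back to a `latest` tag (the function name is hypothetical and this is not oc-mirror code):

```python
from collections import defaultdict


def find_tag_collisions(images: list, default_tag: str = "latest") -> dict:
    """Map each target tag to its digests, keeping only targets claimed more than once."""
    by_target = defaultdict(list)
    for ref in images:
        repo, _, digest = ref.partition("@")
        by_target[f"{repo}:{default_tag}"].append(digest)
    return {target: digests for target, digests in by_target.items() if len(digests) > 1}


images = [
    "brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:d90aecc425e1b2e0732d0a90bc84eb49eb1139e4d4fd8385070d00081c80b71c",
    "brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:672b4bee759f8115e5538a44c37c415b362fc24b02b0117fd4bdcc129c53e0a1",
    "brew.registry.redhat.io/openshift4/ose-descheduler@sha256:15cc75164335fa178c80db4212d11e4a793f53d2b110c03514ce4c79a3717ca0",
]
collisions = find_tag_collisions(images)
for target, digests in collisions.items():
    print(target, len(digests))
```

Any target reported here needs a distinct tag per digest before the symlinks can be created without conflict.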

Additional info:

 

Description of problem

CI is flaky because of test failures such as the following:

[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]
Run #0: Failed
{  fail [github.com/openshift/origin/test/extended/authorization/scc.go:69]: 1 pods failed before test on SCC errors
Error creating: pods "azure-file-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[10]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.initContainers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.initContainers[0].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[0].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[1].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[1].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, 
spec.containers[1].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[2].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[2].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/azure-file-csi-driver-node -n openshift-cluster-csi-drivers happened 12 times

Ginkgo exit error 1: exit with code 1}

Run #1: Failed
{  fail [github.com/openshift/origin/test/extended/authorization/scc.go:69]: 1 pods failed before test on SCC errors
Error creating: pods "azure-file-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[10]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.initContainers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.initContainers[0].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[0].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[1].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[1].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, 
spec.containers[1].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[2].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[2].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/azure-file-csi-driver-node -n openshift-cluster-csi-drivers happened 12 times

Ginkgo exit error 1: exit with code 1}

This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/901/pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-ovn/1638557668689842176. Search.ci has additional similar errors.

Version-Release number of selected component (if applicable)

I have seen these failures in 4.14 CI jobs.

How reproducible

Presently, search.ci shows the following stats for the past two days:

Found in 0.00% of runs (0.01% of failures) across 131399 total runs and 7623 jobs (19.50% failed) in 1.01s

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check search.ci: https://search.ci.openshift.org/?search=pods+%22azure-file-csi-driver-%28controller%7Cnode%29-%22+is+forbidden&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Actual results

CI fails.

Expected results

CI passes, or fails on some other test failure, and the failures don't show up in search.ci.

Description of problem:

With a new S3 bucket, the hc fails with the following condition:
- lastTransitionTime: "2023-04-13T14:17:11Z"
  message: 'failed to upload /.well-known/openid-configuration to the heli-hypershift-demo-oidc-2
    s3 bucket: aws returned an error: AccessControlListNotSupported'
  observedGeneration: 3
  reason: OIDCConfigurationInvalid
  status: "False"
  type: ValidOIDCConfiguration

Version-Release number of selected component (if applicable):

 

How reproducible:

1. Create an S3 bucket
$ aws s3api create-bucket --create-bucket-configuration  LocationConstraint=us-east-2 --region=us-east-2 --bucket heli-hypershift-demo-oidc-2
{
  "Location": "http://heli-hypershift-demo-oidc-2.s3.amazonaws.com/"
}
[cloud-user@heli-rhel-8 ~]$ aws s3api delete-public-access-block --bucket heli-hypershift-demo-oidc-2

2. Install HO and create a hc on AWS us-west-2
3. The hc fails with the following condition:
- lastTransitionTime: "2023-04-13T14:17:11Z"
  message: 'failed to upload /.well-known/openid-configuration to the heli-hypershift-demo-oidc-2
    s3 bucket: aws returned an error: AccessControlListNotSupported'
  observedGeneration: 3
  reason: OIDCConfigurationInvalid
  status: "False"
  type: ValidOIDCConfiguration

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

create a hc successfully

Additional info:

 

The dns operator appears to have begun frequently spamming kube Events in some serial jobs across multiple clouds. (especially gcp and azure, aws is less common but there are some failures with the same signature)

According to the pathological events test, this appears to have started on May 5th. See the Pass Rate By NURP+ Combination panel for where this is most common.

As of the date of filing, pass rates are:
56% - gcp, amd64, sdn, ha, serial, techpreview
57% - gcp, amd64, sdn, ha, serial
60% - azure, amd64, ovn, ha, serial
60% - azure, amd64, ovn, ha, serial, techpreview

The events seem to consistently appear as follows on all clouds:

ns/openshift-dns service/dns-default hmsg/ade328ddf3 - pathological/true reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 From: 08:58:41Z To: 08:58:42Z

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-techpreview-serial/1656207924667617280 (intervals)

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview-serial/1656207916375478272

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1655277608981499904

The Intervals item under "Debug Tools" is a great way to see these charted in time, see the "interesting events" section.

 

test=[sig-arch] events should not repeat pathologically for namespace openshift-dns

Description of problem:

Not able to provision a new baremetalhost because ironic is not able to find a suitable virtual media device.

Version-Release number of selected component (if applicable):

 

How reproducible:

100% if you have a UCS Blade

Steps to Reproduce:

1. add the baremetalhost 
2. wait for the error
3.

Actual results:

No suitable virtual media device found.

Expected results:

Provisioning succeeds

Additional info:

I tried to insert an ISO using curl and I can do it on the virtualmedia[3] device, which is a virtual DVD.

When I look at the metal3-ironic logs I can see the following entry:
Received representation of VirtualMedia /redfish/v1/Managers/CIMC/VirtualMedia/3: {'_actions': {'eject_media': {'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Managers/CIMC/VirtualMedia/3/Actions/VirtualMedia.EjectMedia'}, 'insert_media': {'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Managers/CIMC/VirtualMedia/3/Actions/VirtualMedia.InsertMedia'}}, '_certificates_path': None, '_oem_vendors': ['Cisco'], 'connected_via': <ConnectedVia.URI: 'URI'>, 'identity': '3', 'image': None, 'image_name': None, 'inserted': False, 'links': None, 'media_types': [<VirtualMediaType.DVD: 'DVD'>], 'name': 'CIMC-Mapped vDVD', 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': None, 'state': <State.DISABLED: 'Disabled'>}, 'transfer_method': None, 'user_name': None, 'verify_certificate': None, 'write_protected': False}

I'm sure this is the correct device, and verified that I can insert vmedia using curl.

Somehow metal3/ironic is not selecting this device.
I suspect the reason is that "DVD" is not a valid media_type.
When I look at [the ironic code](https://github.com/openstack/ironic/blob/b4f8209b99af32d8d2a646591af9b62436aad3d8/ironic/drivers/modules/redfish/boot.py#LL188C31-L188C31) I can see that there is a check for the media_type.

I'm not able to see which values are accepted by metal3.

I was able to validate the media_types for a rackmount server which works and there I see the following values: "CD, DVD".

This led me to believe that DVD is not an accepted value.

Can you please confirm that this is the case and if so, can we add the DVD as a suitable device?
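
The suspected selection logic, and the requested change, can be sketched as follows. This is not the ironic code: the function name, the dictionary shape, and the accepted-type sets are assumptions made to illustrate why a device that reports only "DVD" would be skipped by a check for "CD".

```python
from typing import Optional

# Assumption: the requested behavior treats a DVD device as usable for CD boot.
ACCEPTED_FOR_CD_BOOT = frozenset({"CD", "DVD"})


def find_virtual_media(devices: list, accepted=ACCEPTED_FOR_CD_BOOT) -> Optional[dict]:
    """Return the first empty virtual media device that reports an accepted media type."""
    for device in devices:
        if device["inserted"]:
            continue
        if accepted.intersection(device["media_types"]):
            return device
    return None


# Shape loosely based on the log entry above (UCS blade, VirtualMedia/3).
ucs_devices = [{"identity": "3", "name": "CIMC-Mapped vDVD", "inserted": False, "media_types": ["DVD"]}]

strict = find_virtual_media(ucs_devices, accepted=frozenset({"CD"}))
relaxed = find_virtual_media(ucs_devices)
print(strict, relaxed["identity"] if relaxed else None)
```

With only "CD" accepted, no device is found on the blade; accepting "DVD" as well selects VirtualMedia/3, while the rackmount server reporting "CD, DVD" matches in both cases.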

 

Description of problem:

The customer experiences console slowness when loading a workloads page that contains 300+ workloads.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1. Login to OCP console
2. Workloads -> Projects -> Project -> Deployment Configs (300+)
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/97

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

oc should not append the -x86_64 suffix when mirroring multi-arch payloads

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. oc adm release mirror quay.io/openshift-release-dev/ocp-release:4.12.13-multi --keep-manifest-list=true --to=someregistry.io/somewhere/release
2.
3.

Actual results:

05-31 04:54:15.807        sha256:cd8639e34840833dd98d8323f1999b00ca06c73d7ae9ad8945f7b397450821ee -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-insights-operator
05-31 04:54:15.807        sha256:d0443f26968a2159e8b9590b33c428b6af7c0220ab6cc13633254d8843818cdf -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-keepalived-ipfailover
05-31 04:54:15.807        sha256:d2126187264d04f812068c03b59316547f043f97e90ec1a605ac24ab008c85a0 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-agent-installer-orchestrator
05-31 04:54:15.807        sha256:d445a4ece53f0695f1b812920e4bbb8a73ceef582918a0f376c2c5950a3e050b -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-ovn-kubernetes
05-31 04:54:15.807        sha256:d4bfe3bac81d5bb758efced8706a400a4b1dad7feb2c9a9933257fde9f405866 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-csi-snapshot-controller
05-31 04:54:15.807        sha256:d50c009e4b47bb6d93125c08c19c13bf7fd09ada197b5e0232549af558b25d19 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-vsphere-csi-driver-operator
05-31 04:54:15.807        sha256:d844ecbbba99e64988f4d57de9d958172264e88b9c3bfc7b43e5ee19a1a2914e -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-ironic
05-31 04:54:15.807        sha256:d90b37357d4c2c0182787f6842f89f56aaebeab38a139c62f4a727126e036578 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-baremetal-machine-controllers
05-31 04:54:15.807        sha256:d928536d8d9c4d4d078734004cc9713946da288b917f1953a8e7b1f2a8428a64 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-azure-cloud-controller-manager
05-31 04:54:15.807        sha256:da049d5a453eeb7b453e870a0c52f70df046f2df149bca624248480ef83f2ac8 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-cli-artifacts
05-31 04:54:15.807        sha256:db1cf013e3f845be74553eecc9245cc80106b8c70496bbbc0d63b497dcbb6556 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-cluster-capi-controllers
05-31 04:54:15.807        sha256:dc7b1305c7fec48d29adc4d8b3318d3b1d1d12495fb2d0ddd49a33e3b6aed0cc -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-gcp-pd-csi-driver
05-31 04:54:15.807        sha256:de8753eb8b2ccec3474016cd5888d03eeeca7e0f23a171d85b4f9d76d91685a3 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-baremetal-installer

Expected results:

No -x86_64 suffix is added to the image tags
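
The expected tag naming can be sketched as below. The function and its parameters are hypothetical and do not reflect how oc builds tags internally; the sketch only restates the rule that the architecture suffix is omitted when the manifest list is kept.

```python
def mirror_tag(release: str, component: str, arch: str, keep_manifest_list: bool) -> str:
    """Build the destination tag, adding the architecture only for single-arch mirroring."""
    if keep_manifest_list:
        # Multi-arch payload: the tag points at a manifest list, so no arch suffix.
        return f"{release}-{component}"
    return f"{release}-{arch}-{component}"


release = "4.14.0-0.nightly-multi-2023-05-30-024840"
actual = mirror_tag(release, "ironic", "x86_64", keep_manifest_list=False)
expected = mirror_tag(release, "ironic", "x86_64", keep_manifest_list=True)
print(actual)
print(expected)
```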

Additional info:

 

Description of problem:
Navigation:
Workloads -> Deployments -> Edit update strategy
'greater than pod' is in English

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-23-044003

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Translation missing

Expected results:
Translation should appear

Additional info:

Description of the problem:

In BE 2.16, the base domain accepts a single-character string. This results in a cluster address like clustername.r, but the Networking page then reports "DNS wildcard not configured".

How reproducible:

100%

Steps to reproduce:

1. Create a cluster with 1 character string as base domain (i.e. "c" )

2. move to Networking page

3. Set all needed info (API + ingress VIPs). The validation error "DNS wildcard not configured" is shown

Actual results:

 

Expected results:

Description of problem:

Global configuration of 'KnativeServing' is missing after user installed the Operator of 'Serverless' successfully

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-13-223353

How reproducible:

Always

Steps to Reproduce:

1. Install the 'Serverless' Operator, make sure it has been installed successfully, and make sure the Knative Serving instance is created without any error
2. Navigate to Administration -> Cluster Settings -> Global Configuration
3. Check if KnativeServing is listed in the Cluster Setting page

Actual results:

KnativeServing is missing

Expected results:

KnativeServing should be listed on the Global Configuration page

Additional info:

 

Description of problem:

When --oci-registries-config is used, oc-mirror panics

Version-Release number of selected component (if applicable):

Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.14.0-202308091944.p0.gdba4a0c.assembly.stream-dba4a0c", GitCommit:"dba4a0cfd0a9fd29c1e4b5bc1da737e1153cc679", GitTreeState:"clean", BuildDate:"2023-08-10T00:13:31Z", GoVersion:"go1.20.5 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always 

Steps to Reproduce:

1. Mirror to localhost:
cat config.yaml 
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
    - catalog: oci:///home1/oci-414
      packages:
      - name: cluster-logging
oc-mirror --config config.yaml docker://localhost:5000 --dest-use-http
2. Use --oci-registries-config:
`oc-mirror --config config.yaml docker://localhost:5000 --dest-use-http   --oci-registries-config /home1/registry.conf`

Actual results:

2. oc-mirror panics:
oc-mirror --config config.yaml docker://ec2-18-117-165-30.us-east-2.compute.amazonaws.com:5000  --dest-use-http   --oci-registries-config /home1/registry.conf 
Logging to .oc-mirror.log
Checking push permissions for ec2-18-117-165-30.us-east-2.compute.amazonaws.com:5000
Found: oc-mirror-workspace/src/publish
Found: oc-mirror-workspace/src/v2
Found: oc-mirror-workspace/src/charts
Found: oc-mirror-workspace/src/release-signatures
backend is not configured in config.yaml, using stateless mode
backend is not configured in config.yaml, using stateless mode
No metadata detected, creating new workspace
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x2e8a774]

goroutine 43 [running]:
github.com/containers/image/v5/docker.(*dockerImageSource).Close(0x3?)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/containers/image/v5/docker/docker_image_src.go:170 +0x14
github.com/openshift/oc-mirror/pkg/cli/mirror.findFirstAvailableMirror.func1()
	/go/src/github.com/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:449 +0x42
github.com/openshift/oc-mirror/pkg/cli/mirror.findFirstAvailableMirror({0x4c67b38, 0xc0004ca230}, {0xc00ad56000, 0x1, 0x40d19c0?}, {0xc00077e000, 0x94}, {0xc00ac0f6b0, 0x24}, {0x0, ...})
	/go/src/github.com/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:467 +0x6df
github.com/openshift/oc-mirror/pkg/cli/mirror.(*MirrorOptions).addRelatedImageToMapping(0xc0001c0f00, {0x4c67b38, 0xc0004ca230}, 0xc00ac13480?, {{0xc0074a14e8?, 0x18?}, {0xc0076563f0?, 0x8b?}}, {0xc000c5b580, 0x36})
	/go/src/github.com/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:154 +0x3c5
github.com/openshift/oc-mirror/pkg/cli/mirror.(*OperatorOptions).plan.func3()
	/go/src/github.com/openshift/oc-mirror/pkg/cli/mirror/operator.go:570 +0x52
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/go/src/github.com/openshift/oc-mirror/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/src/github.com/openshift/oc-mirror/vendor/golang.org/x/sync/errgroup/errgroup.go:72 +0xa5
 

Expected results:

Should  not panic 

Additional info:

Description of problem:

The fix for https://issues.redhat.com/browse/OCPBUGS-15947 seems to have introduced a problem in our keepalived-monitor logic. What I'm seeing is that at some point all of the apiservers became unavailable, which caused haproxy-monitor to drop the redirect firewall rule since it wasn't able to reach the API and we normally want to fall back to direct, un-loadbalanced API connectivity in that case.

However, due to the fix linked above we now short-circuit the keepalived-monitor update loop if we're unable to retrieve the node list, which is what will happen if the node holding the VIP has neither a local apiserver nor the HAProxy firewall rule. Because of this we will also skip updating the status of the firewall rule and thus the keepalived priority for the node won't be dropped appropriately.

Version-Release number of selected component (if applicable):

We backported the fix linked above to 4.11 so I expect this goes back at least that far.

How reproducible:

Unsure. It's clearly not happening every time, but I have a local dev cluster in this state so it can happen.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

I think the solution here is just to move the firewall rule check earlier in the update loop so it will have run before we try to retrieve nodes. There's no dependency on the ordering of those two steps so I don't foresee any major issues.

To work around this, I believe we can just bounce keepalived on the affected node until the VIP ends up on a node with a local apiserver.
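
The proposed reordering can be illustrated with a minimal sketch. This is not the keepalived-monitor code; `MonitorState`, `update_once`, and the two callables are hypothetical names used only to show the firewall rule check running before the node list retrieval, so that a failed retrieval no longer leaves the firewall status stale.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MonitorState:
    """Hypothetical state kept between iterations of the update loop."""
    firewall_rule_present: Optional[bool] = None
    nodes: List[str] = field(default_factory=list)


def update_once(
    state: MonitorState,
    check_firewall_rule: Callable[[], bool],
    list_nodes: Callable[[], List[str]],
) -> bool:
    """Run one iteration; return False when the node list is unavailable."""
    # Refresh the firewall rule status first: it does not depend on the API.
    state.firewall_rule_present = check_firewall_rule()
    try:
        state.nodes = list_nodes()
    except ConnectionError:
        # The rest of the iteration is skipped, but the firewall status
        # recorded above is already up to date.
        return False
    return True
```

With this ordering, a node that holds the VIP but cannot reach the API still records that the firewall rule is gone, which is the input the priority calculation needs.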

Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/94

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

mTLS connections do not work when an intermediate CA is used in addition to the root CA, both with a CRL defined.
The Intermediate CA Cert had a published CDP which pointed to a CRL issued by the root CA.

The config map in the openshift-ingress namespace contains the CRL as issued by the root CA. The CRL issued by the Intermediate CA is not present since that CDP is in the user cert and so not in the bundle.

When attempting to connect using a user certificate issued by the Intermediate CA it fails with an error of unknown CA.

When attempting to connect using a user certificate issued by the Root CA the connection is successful.

Version-Release number of selected component (if applicable):

4.10.24

How reproducible:
Always

Steps to Reproduce:

1. Configure CA and intermediate CA with CRL
2. Sign client certificate with the intermediate CA
3. Configure mtls in openshift-ingress

Actual results:

When attempting to connect using a user certificate issued by the Intermediate CA it fails with an error of unknown CA.
When attempting to connect using a user certificate issued by the Root CA the connection is successful.

Expected results:

Be able to connect with a client certificate signed by the intermediate CA

Additional info:

This is a clone of issue OCPBUGS-13034. The following is the description of the original issue:

Description of problem:

The cluster-api pod can't create events due to RBAC. We may miss some useful events because of this.
E0503 07:20:44.925786       1 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ad1-workers-f5f568855-vnzmn.175b911e43aa3f41", GenerateName:"", Namespace:"ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Machine", Namespace:"ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1", Name:"ad1-workers-f5f568855-vnzmn", UID:"2b40a694-d36d-4b13-9afc-0b5daeecc509", APIVersion:"cluster.x-k8s.io/v1beta1", ResourceVersion:"144260357", FieldPath:""}, Reason:"DetectedUnhealthy", Message:"Machine ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1/ad1-workers/ad1-workers-f5f568855-vnzmn/ has unhealthy node ", Source:v1.EventSource{Component:"machinehealthcheck-controller", Host:""}, FirstTimestamp:time.Date(2023, time.May, 3, 7, 20, 44, 923289409, time.Local), LastTimestamp:time.Date(2023, time.May, 3, 7, 20, 44, 923289409, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1:cluster-api" cannot create resource "events" in API group "" in the namespace "ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1"' (will not retry!)

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Create a hosted cluster
2. Check cluster-api pod for some kind of error (e.g. slow node startup)
3.

Actual results:

Error

Expected results:

Event generated

Additional info:
ClusterRole hypershift-cluster-api is created here https://github.com/openshift/hypershift/blob/e7eb32f259b2a01e5bbdddf2fe963b82b331180f/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go#L2720

We should add create/patch/update for events there

Description of problem:

MetalLB does not work when traffic comes from a secondary NIC. The root cause of this failure is the net.ipv4.ip_forward flag changing from 1 to 0. If we re-enable this flag, everything works as expected.

Version-Release number of selected component (if applicable):

Server Version: 4.14.0-0.nightly-2023-07-05-191022

How reproducible:

Run any test case that tests metallb via secondary interface. 

Steps to Reproduce:

1.
2.
3.

Actual results:

Test failed

Expected results:

Test Passed

Additional info:

Looks like this PR is the root cause: https://github.com/openshift/machine-config-operator/pull/3676/files#

Description of problem:

When applying a CSV with the current label recommendation for STS, the following error occurs:

error creating csv ack-s3-controller.v1.0.3: ClusterServiceVersion.operators.coreos.com "ack-s3-controller.v1.0.3" is invalid: metadata.annotations: Invalid value: "operators.openshift.io/infrastructure-features/token-auth/aws": a qualified name must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]') with an optional DNS subdomain prefix and '/' (e.g. 'example.com/MyName')
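
The rejection can be reproduced offline with a small sketch of the rule quoted in the error message: a name part matching the quoted regex, optionally preceded by one DNS subdomain prefix and a single '/'. The name regex is copied from the message; the prefix regex and the length limits (63 and 253) are assumptions based on the usual Kubernetes conventions, and `is_qualified_name` is a hypothetical helper, not the API server's validator.

```python
import re

# Name part: regex taken from the error message above.
NAME_RE = re.compile(r"^([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$")
# Prefix part: simplified DNS subdomain check (assumption).
PREFIX_RE = re.compile(
    r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$"
)


def is_qualified_name(key: str) -> bool:
    """Return True if key looks like '[prefix/]name' with at most one '/'."""
    parts = key.split("/")
    if len(parts) == 1:
        name = parts[0]
    elif len(parts) == 2:
        prefix, name = parts
        if not prefix or len(prefix) > 253 or not PREFIX_RE.match(prefix):
            return False
    else:
        # More than one '/' can never be a valid qualified name.
        return False
    return 0 < len(name) <= 63 and bool(NAME_RE.match(name))
```

Under these assumptions the annotation key from this report is rejected because it contains three '/' separators, while the examples listed in the error message ('MyName', 'my.name', '123-abc', 'example.com/MyName') pass.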

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

1. create a CSV with an annotation "operators.openshift.io/infrastructure-features/token-auth/aws: `false`"
2. apply the CSV on cluster

Actual results:

fails with the above error

Expected results:

should not fail

Additional info:

 

Description of problem:
The vsphereStorageDriver validation error message is odd when I change LegacyDeprecatedInTreeDriver to "". I get:

Invalid value: "string": VSphereStorageDriver can not be changed once it is set to CSIWithMigrationDriver

There is no CSIWithMigrationDriver either in the old or new Storage CR.
 
Version-Release number of selected component (if applicable):

4.13 with this PR: https://github.com/openshift/api/pull/1433

Description of problem:

We have presubmit and periodic jobs failing on

: [sig-arch] events should not repeat pathologically for namespace openshift-monitoring
{  2 events happened too frequently

event happened 21 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/6f9bc9e1d7 - pathological/true reason/RecreatingFailedPod StatefulSet openshift-monitoring/prometheus-k8s is recreating failed Pod prometheus-k8s-1 From: 16:11:36Z To: 16:11:37Z result=reject 
event happened 22 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/ecfdd1d225 - pathological/true reason/SuccessfulDelete delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful From: 16:11:36Z To: 16:11:37Z result=reject }

The failure occurs when the event happens over 20 times.

The RecreatingFailedPod reason shows up in 4.14 and Presubmits but does not show up in 4.13.
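
As a rough illustration of the threshold described above, the following sketch scans test output lines of the form shown in this report and returns the reasons whose count exceeds 20. The regular expression and the `pathological` helper are assumptions made for illustration; the real test in origin has its own parsing and allow-list logic.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as "event happened 21 times, ... reason/RecreatingFailedPod ..."
EVENT_RE = re.compile(r"event happened (\d+) times.*?reason/(\S+)")
THRESHOLD = 20  # events seen more than this many times are flagged


def pathological(lines: Iterable[str], threshold: int = THRESHOLD) -> List[Tuple[str, int]]:
    """Return (reason, count) pairs for events repeated more than threshold times."""
    hits = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match and int(match.group(1)) > threshold:
            hits.append((match.group(2), int(match.group(1))))
    return hits
```

Fed the two event lines quoted above, this returns both `RecreatingFailedPod` (21) and `SuccessfulDelete` (22); an event seen exactly 20 times would not be flagged.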

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Run presubmits or periodics; here are latest examples:

 2023-05-24 06:25:52.551883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1661210557367193600                                                                                        | {aws,amd64,sdn,ha,serial}
 2023-05-24 10:20:54.91883+00  | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn-serial/1661267817128792064                                                                                   | {gcp,amd64,sdn,ha,serial}
 2023-05-24 14:17:18.849402+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27899/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1661321663389634560                                                                                      | {gcp,amd64,ovn,upgrade,upgrade-micro,ha}
 2023-05-24 14:17:51.908405+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1583/pull-ci-openshift-kubernetes-master-e2e-azure-ovn-upgrade/1661324100011823104                                                            | {azure,amd64,ovn,upgrade,upgrade-micro,ha}

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

That event/reason should not show up as a failure in the pathological test

Additional info:

This table shows which variants are affected on 4.14 and Presubmits:

                     variants                     | test_count 
--------------------------------------------------+------------
 {aws,amd64,ovn,upgrade,upgrade-micro,ha}         |         63
 {gcp,amd64,ovn,upgrade,upgrade-micro,ha}         |         14
 {gcp,amd64,sdn,ha,serial,techpreview}            |         12
 {azure,amd64,sdn,ha,serial,techpreview}          |          7
 {aws,amd64,sdn,upgrade,upgrade-micro,ha}         |          6
 {aws,amd64,ovn,ha}                               |          6
 {vsphere-ipi,amd64,ovn,upgrade,upgrade-micro,ha} |          5
 {aws,amd64,sdn,ha,serial}                        |          5
 {azure,amd64,ovn,upgrade,upgrade-micro,ha}       |          5
 {metal-ipi,amd64,ovn,upgrade,upgrade-micro,ha}   |          5
 {vsphere-ipi,amd64,ovn,ha,serial}                |          4
 {gcp,amd64,sdn,ha,serial}                        |          3
 {aws,amd64,ovn,single-node}                      |          3
 {metal-ipi,amd64,ovn,ha,serial}                  |          2
 {aws,amd64,ovn,ha,serial}                        |          2
 {aws,amd64,upgrade,upgrade-micro,ha}             |          1
 {aws,arm64,sdn,ha,serial}                        |          1
 {aws,arm64,ovn,ha,serial,techpreview}            |          1
 {vsphere-ipi,amd64,ovn,ha,serial,techpreview}    |          1
 {aws,amd64,sdn,ha,serial,techpreview}            |          1
 {libvirt,ppc64le,ovn,ha,serial}                  |          1
 {amd64,upgrade,upgrade-micro,ha}                 |          1

Just for my record, I'm using this query to check 4.14 and Presubmits:

SELECT
    rt.created_at, url, variants
FROM
    prow_jobs pj
    JOIN prow_job_runs r ON r.prow_job_id = pj.id
    JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
    JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
    JOIN tests ON rt.test_id = tests.id
WHERE
    pj.release IN ('4.14', 'Presubmits')
    AND rt.status = 12
    AND tests.id = 65991
    AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;

And this query for checking 4.13:

SELECT
    rt.created_at, url, variants
FROM
    prow_jobs pj
    JOIN prow_job_runs r ON r.prow_job_id = pj.id
    JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
    JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
    JOIN tests ON rt.test_id = tests.id
WHERE
    pj.release IN ('4.13')
    AND rt.status = 12
    AND tests.id IN (65991, 244,245)
    AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;

This shows jobs beginning on 4/13 to today.

Description of problem:

When viewing the ServiceMonitor schema in the YAML sidebar, the console has no 'View details' button for many fields whose type is Object

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-12-044657

How reproducible:

Always

Steps to Reproduce:

1. Go to any ServiceMonitor YAML page and open the schema by clicking 'View sidebar'
Click 'View details' of 'spec' -> click 'View details' of 'endpoints'
2. Check the object and array type schemas
spec.endpoints.authorization
spec.endpoints.basicAuth
spec.endpoints.bearerTokenSecret
spec.endpoints.oauth2
spec.endpoints.params
spec.endpoints.tlsConfig
spec.endpoints.relabelings

Actual results:

2. There is no 'View details' button for these 'object' and 'array' type fields

Expected results:

2. We should provide a 'View details' link for 'object' and 'array' fields so that users can view more details

For example
$ oc explain servicemonitors.spec.endpoints.tlsConfig
KIND:     ServiceMonitor
VERSION:  monitoring.coreos.com/v1

RESOURCE: tlsConfig <Object>

DESCRIPTION:
     TLS configuration to use when scraping the endpoint

FIELDS:
   ca    <Object>
     Certificate authority used when verifying server certificates.

   caFile    <string>
     Path to the CA cert in the Prometheus container to use for the targets.

   cert    <Object>
     Client certificate to present when doing client-authentication.

   certFile    <string>
     Path to the client cert file in the Prometheus container for the targets.

   insecureSkipVerify    <boolean>
     Disable target certificate validation.

   keyFile    <string>
     Path to the client key file in the Prometheus container for the targets.

   keySecret    <Object>
     Secret containing the client key file for the targets.

   serverName    <string>
     Used to verify the hostname for the targets.


$ oc explain servicemonitors.spec.endpoints.relabelings
KIND:     ServiceMonitor
VERSION:  monitoring.coreos.com/v1

RESOURCE: relabelings <[]Object>

DESCRIPTION:
     RelabelConfigs to apply to samples before scraping. Prometheus Operator
     automatically adds relabelings for a few standard Kubernetes fields. The
     original scrape job's name is available via the `__tmp_prometheus_job_name`
     label. More info:
     https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

     RelabelConfig allows dynamic rewriting of the label set, being applied to
     samples before ingestion. It defines `<metric_relabel_configs>`-section of
     Prometheus configuration. More info:
     https://prometheus.io/docs/prometheus/latest/configuration/configuration/#metric_relabel_configs

FIELDS:
   action    <string>
     Action to perform based on regex matching. Default is 'replace'. uppercase
     and lowercase actions require Prometheus >= 2.36.

   modulus    <integer>
     Modulus to take of the hash of the source label values.

   regex    <string>
     Regular expression against which the extracted value is matched. Default is
     '(.*)'

   replacement    <string>
     Replacement value against which a regex replace is performed if the regular
     expression matches. Regex capture groups are available. Default is '$1'

   separator    <string>
     Separator placed between concatenated source label values. default is ';'.

   sourceLabels    <[]string>
     The source labels select values from existing labels. Their content is
     concatenated using the configured separator and matched against the
     configured regular expression for the replace, keep, and drop actions.

   targetLabel    <string>
     Label to which the resulting value is written in a replace action. It is
     mandatory for replace actions. Regex capture groups are available.

Additional info:

 

 

Description of problem:

`rprivate`  default mount propagation in combination with `hostPath: path: /` breaks CSI driver relying on multipath

How reproducible:

Always

Steps to Reproduce (simplified):

1. ssh to node, 
2. mount a partition, for instance /dev/{s,v}da2, which on CoreOS is a UEFI FAT partition
    $ sudo mount /dev/vda2 /mnt
3. start a debug pod on that node ( or any pod that does a hostPath mount of /, like the node tuning operand pod, the machine config operand, the filesystem integrity operand ) 
    $ oc debug nodes/master-2.sharedocp4upi411ovn.lab.upshift.rdu2.redhat.com
4. unmount the partition on node

5. notice the debug pod still has a reference to the filesystem
grep vda2 /proc/*/mountinfo
/proc/3687945/mountinfo:11219 10837 252:2 / /host/var/mnt rw,relatime - vfat /dev/vda2 rw,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro

6. On the node, although the mount is absent from /proc/mounts, the file system is still mounted, as shown by the dirty bit being still set on the FAT filesystem:

sudo fsck -n  /dev/vda2 
fsck from util-linux 2.32.1
fsck.fat 4.1 (2017-01-24)
0x25: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.

Expected results:

File system is unmounted in host and in container.

Additional info:

Although the steps above show the behaviour in a simple way, this becomes quite problematic when using multipath on a host mount.
We noticed in a customer environment that we cannot reschedule some pods from old node to new node using oc adm drain when these pods have a Persistent Volume mount created by the third party CSI driver block.csi.ibm.com.

The CSI driver uses multipath from CoreOS to manage multipath block devices. However, the multipath daemon blocks the volume removal from the node: the multipath -f flushing calls from the CSI driver always return busy. (In storage parlance, flushing a multipath device means removing it from the device tree in /dev.)

Multipath flushes always fail because, although the multipath block device is unmounted on the host, the machine-config, file integrity, and node tuning pods do hostPath volume mounts of /, the host root filesystem, and thus get a copy of the mounts.
Due to that mount copy, the kernel sees the filesystem as still in use, although there are no file descriptors open on that filesystem, and considers it unsafe to remove the multipath block device. The node CSI driver then cannot finish unmounting the volume, which blocks the container creation on another node.

We can see these mount copies by looking at /proc/<container pid>/mountinfo:

$ grep mpathes proc/*/mountinfo
proc/3295781/mountinfo:56348 52693 253:42 / /var/lib/kubelet/plugins/kubernetes.io/csi/block.csi.ibm.com/12345/globalmount rw,relatime - xfs /dev/mapper/mpathes rw,seclabel,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota
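
The same lookup can be scripted. The sketch below parses mountinfo lines (field layout as documented in proc(5): mount point in the fifth field, filesystem type and source after the `-` separator) and reports which PIDs still reference a given source device. `parse_mountinfo_line` and `pids_holding` are hypothetical helpers, and the mountinfo contents are passed in as strings so the logic can be checked without access to a node.

```python
from typing import Dict, List, NamedTuple


class MountEntry(NamedTuple):
    mount_point: str
    fs_type: str
    source: str


def parse_mountinfo_line(line: str) -> MountEntry:
    """Parse one /proc/<pid>/mountinfo line into its key fields."""
    fields = line.split()
    # Optional fields end at a lone "-"; it cannot appear before index 6.
    sep = fields.index("-", 6)
    return MountEntry(
        mount_point=fields[4],
        fs_type=fields[sep + 1],
        source=fields[sep + 2],
    )


def pids_holding(mountinfo_by_pid: Dict[int, str], source: str) -> List[int]:
    """Return the PIDs whose mount namespace still contains the given source."""
    holders = []
    for pid, text in sorted(mountinfo_by_pid.items()):
        entries = [parse_mountinfo_line(l) for l in text.splitlines() if l.strip()]
        if any(e.source == source for e in entries):
            holders.append(pid)
    return holders
```

On a node, the dictionary could be filled by reading each `/proc/<pid>/mountinfo` file; any PID returned for `/dev/mapper/mpathes` is a candidate for the stale mount copy described above.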

cri-o is doing this mount copy using `rprivate` mount propagation
( see https://github.com/cri-o/cri-o/blob/b098bec2d4d79bdf99c3ce89b0eeb16bfe8b5645/server/container_create_linux.go#L1030 )

The semantics of rprivate are mapped in `runc`
https://github.com/opencontainers/runc/blob/ba58ee9c3b9550c3e32b94802b0fb29761955290/libcontainer/specconv/spec_linux.go#L55
to mount flags passed to the mount(2) system call:

MS_REC (since Linux 2.4.11)
              Used in conjunction with MS_BIND to create a recursive bind mount, and in
              conjunction with the propagation type flags to recursively change the
              propagation type of all of the mounts in a subtree. See below for further
              details.

MS_PRIVATE
              Make this mount private. Mount and unmount events do not propagate into or
              out of this mount.

The key here is the MS_PRIVATE flag. The unmounting of the multipath block device is not propagated to the mount namespace of containers, thus keeping the filesystem eternally mounted and preventing the flushing of the multipath device.

Maybe hostPath mounts should be done using `rslave` mount propagation when we see an attempt to bind mount /var/lib?
It seems cri-dockerd does something similar, according to https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation
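
To make the difference between the propagation modes concrete, here is a small sketch that maps the propagation names to Linux mount flag values and answers whether a host-side unmount would reach the container's copy. The flag values are the ones defined in the Linux headers; the table and `receives_host_unmount` are illustrative assumptions, not the runc or cri-o implementation.

```python
# Mount flag values from the Linux UAPI headers (linux/mount.h).
MS_REC = 0x4000
MS_UNBINDABLE = 1 << 17
MS_PRIVATE = 1 << 18
MS_SLAVE = 1 << 19
MS_SHARED = 1 << 20

# Propagation names as used in container runtime specs; the "r" prefix
# adds MS_REC so the setting applies to the whole subtree.
PROPAGATION_FLAGS = {
    "private": MS_PRIVATE,
    "rprivate": MS_PRIVATE | MS_REC,
    "slave": MS_SLAVE,
    "rslave": MS_SLAVE | MS_REC,
    "shared": MS_SHARED,
    "rshared": MS_SHARED | MS_REC,
}


def receives_host_unmount(mode: str) -> bool:
    """True if mount/unmount events on the host propagate into this mount."""
    flags = PROPAGATION_FLAGS[mode]
    # Slave and shared mounts receive events from their peer group or master;
    # private mounts receive nothing.
    return bool(flags & (MS_SLAVE | MS_SHARED))
```

With `rprivate` the function returns False, which matches the behaviour in this report: the container keeps its copy of the mount after the host unmounts it. With `rslave` it returns True.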

Description of problem:

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. Follow https://github.com/openshift/console/blob/master/docs/helm/configure-namespaced-helm-repos.md for adding project helm chart repositories supporting basic auth.
  2. If we create a repository and provide basicAuthConfig as shown in current documentation we will get an error.
  3. The documentation here needs an update as the basicAuthConfig secret name should be specified with a `name` field

Actual results:

  1. * spec.connectionConfig.basicAuthConfig: Invalid value: "string": spec.connectionConfig.basicAuthConfig in body must be of type object: "string"

    Expected results:

We should be able to add the repository supporting basic auth
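
The shape the API expects can be sketched as follows. The check mirrors the error above: `basicAuthConfig` must be an object that carries the secret name in a `name` field rather than a bare string. The `validate_basic_auth_config` helper, the repository URL, and the secret name are hypothetical, and the `apiVersion`/`kind` values are assumptions to be checked against the linked documentation.

```python
from typing import List


def validate_basic_auth_config(repo: dict) -> List[str]:
    """Return validation errors for spec.connectionConfig.basicAuthConfig."""
    cfg = repo.get("spec", {}).get("connectionConfig", {}).get("basicAuthConfig")
    if cfg is None:
        return []
    if not isinstance(cfg, dict):
        return ["spec.connectionConfig.basicAuthConfig in body must be of type object"]
    if not cfg.get("name"):
        return ["spec.connectionConfig.basicAuthConfig.name is required"]
    return []


# Hypothetical manifest using the object form with a `name` field.
good_repo = {
    "apiVersion": "helm.openshift.io/v1beta1",  # assumption
    "kind": "ProjectHelmChartRepository",        # assumption
    "metadata": {"name": "my-repo", "namespace": "my-namespace"},
    "spec": {
        "connectionConfig": {
            "url": "https://charts.example.com",
            "basicAuthConfig": {"name": "my-basic-auth-secret"},
        }
    },
}
```

A manifest that sets `basicAuthConfig` to a plain string fails this check in the same way the API server rejected it in the actual results.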

Reproducibility (Always/Intermittent/Only Once):

Build Details:

Additional info:

Documentation Requirement: Yes/No (needs-docs|upstream-docs / no-doc)

Upstream: <Inputs/Requirement details>/ Not Applicable

Downstream: <Type: Doc defect/More inputs to doc>/ Not Applicable

Provide link to the relevant section
Provide doc inputs and details required

Release Notes Type: <New Feature/Enhancement/Known Issue/Bug
fix/Breaking change/Deprecated Functionality/Technology Preview>

LatencySensitive has been functionally equivalent to "" (Default) for several years. Code has forgotten that the featureset must be handled, and it's more efficacious to remove the featureset (with migration code) than to try to plug all the holes.

To ensure this is working, update a cluster to use LatencySensitive and see that the FeatureSet value is reset after two minutes
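
The migration amounts to rewriting one field. A minimal sketch, assuming the FeatureGate resource is available as a dictionary with the feature set under `spec.featureSet`; `migrate_feature_set` is a hypothetical helper, not the operator's migration code.

```python
def migrate_feature_set(feature_gate: dict) -> bool:
    """Reset LatencySensitive to the default ("") and report whether it changed."""
    spec = feature_gate.setdefault("spec", {})
    if spec.get("featureSet") == "LatencySensitive":
        # LatencySensitive is functionally equivalent to the default set.
        spec["featureSet"] = ""
        return True
    return False
```

Running it twice on the same object is safe: the second call finds nothing to change and returns False.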

Description of problem:

In 4.10 we added an option, REGISTRY_AUTH_PREFERENCE, to opt in to the podman registry auth file preference reading order. This is important for oc registry commands like oc registry login and oc image. https://github.com/openshift/oc/pull/893

We also started warning users that we will remove support for the docker order and default to the podman order, meaning we will check podman locations first and then fall back to docker locations.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

We should default to podman auth file locations and remove the warning shown when using oc registry login or oc image commands without the REGISTRY_AUTH_PREFERENCE variable.
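
The lookup order can be sketched as a function that builds the candidate list. The concrete file paths below are assumptions based on common podman and docker defaults and are not taken from the oc source; `auth_file_candidates` is a hypothetical helper.

```python
import posixpath
from typing import List, Mapping


def auth_file_candidates(env: Mapping[str, str], podman_first: bool = True) -> List[str]:
    """Return registry auth file paths in the order they would be tried."""
    home = env.get("HOME", "")
    podman = []
    runtime_dir = env.get("XDG_RUNTIME_DIR")
    if runtime_dir:
        # Assumed podman per-session location.
        podman.append(posixpath.join(runtime_dir, "containers", "auth.json"))
    # Assumed podman per-user location.
    podman.append(posixpath.join(home, ".config", "containers", "auth.json"))
    # Assumed docker locations, used as the fallback.
    docker = [
        posixpath.join(home, ".docker", "config.json"),
        posixpath.join(home, ".dockercfg"),
    ]
    return podman + docker if podman_first else docker + podman
```

With `podman_first=True`, which corresponds to the default requested here, the docker paths are only reached when no podman auth file exists.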

Additional info:

 

Description of problem:

During an operator installation with the Installation mode set to all namespaces, the "Installed Namespace" dropdown selection is restricted to "openshift-operators" or another specific namespace, if one is recommended by the operator owners.

With the recent* change to allow non-latest operator version installs, users should be allowed to select any namespace to install a globally installed operator.

 

Related info:
Operators can now be installed at non-latest versions with the merge of * https://github.com/openshift/console/pull/12743. They require a manual approval and, because of the way InstallPlan upgrades work, this affects all operators installed in that namespace.

 

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-19411. The following is the description of the original issue:

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.oc -n openshift-machine-api get role/cluster-autoscaler-operator -o yaml
2. Observe missing watch verb
3. Tail cluster-autoscaler logs to see error

status.go:444] No ClusterAutoscaler. Reporting available.
I0919 16:40:52.877216       1 status.go:244] Operator status available: at version 4.14.0-rc.1
E0919 16:40:53.719592       1 reflector.go:148] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: Failed to watch *v1.ClusterOperator: unknown (get clusteroperators.config.openshift.io) 
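
Step 2 can be automated by checking the role's rules for the verb. The sketch below works on the role as a dictionary (for example the output of the command in step 1 with `-o json`); `role_allows` and the sample role contents are hypothetical and only illustrate a rule set that grants get and list but not watch.

```python
from typing import Iterable


def _matches(value: str, allowed: Iterable[str]) -> bool:
    allowed = list(allowed)
    return value in allowed or "*" in allowed


def role_allows(role: dict, api_group: str, resource: str, verb: str) -> bool:
    """Return True if any rule in the role grants verb on api_group/resource."""
    for rule in role.get("rules", []):
        if (
            _matches(api_group, rule.get("apiGroups", []))
            and _matches(resource, rule.get("resources", []))
            and _matches(verb, rule.get("verbs", []))
        ):
            return True
    return False


# Hypothetical role contents, for illustration only.
sample_role = {
    "rules": [
        {
            "apiGroups": ["config.openshift.io"],
            "resources": ["clusteroperators"],
            "verbs": ["get", "list"],
        }
    ]
}
```

For the sample role, `role_allows(sample_role, "config.openshift.io", "clusteroperators", "watch")` is False, which is the condition that produces the "Failed to watch" reflector error in the log.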

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-18439. The following is the description of the original issue:

Description of problem:

In the developer sandbox, the happy path to create operator-backed resources is broken.

Users can only work in their assigned namespace. When they do so and attempt to create an Operator-backed resource from the Developer console, the user interface inadvertently switches the working namespace from the user's to the `openshift` one. The console shows an error message when the user clicks the "create" button.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Login to the Developer Sandbox
2. Choose the Developer view
3. Click Add+ -> Developer Catalog -> Operator Backed
4. Filter by "integration"
5. Notice the working namespace is still the user's one. 
6. Select "Integration" (Camel K operator)
7. Click "Create"
8. Notice the working namespace has switched to `openshift`
9. Notice the custom resource in YAML view includes `namespace: openshift`
10. Click "Create"


Actual results:

An error message shows: "Danger alert: An error occurred: integrations.camel.apache.org is forbidden: User "bmesegue" cannot create resource "integrations" in API group "camel.apache.org" in the namespace "openshift""

Expected results:

On step 8, the working namespace should remain the user's one
On step 9, in the YAML view, the namespace should be the user's one, or none.
After step 10, the creation process should trigger the creation of a Camel K integration.

Additional info:

 

The code in our infrastructure tests needs to be updated to make the tests more accurate. Currently we target gomock.any() in many cases, which means that the tests are not as accurate as they could be.

Updates should be similar to MGMT-13918

Description of the problem:

In Staging, UI 2.18.6: enable DHCP and then switch to UMN. The BE responds with "User Managed Networking cannot be set with VIP DHCP Allocation"

How reproducible:

100%

Steps to reproduce:

1. In networking page - enable DHCP

2. Switch to UMN

3. The BE responds with "User Managed Networking cannot be set with VIP DHCP Allocation"

Actual results:

 

Expected results:

Description of problem:

When installing the cert-manager operator, version cert-manager-operator-bundle:v1.11.1-6, from the console, the version shown in the UI keeps flipping between v1.11.1 and v1.10.2.

Version-Release number of selected component (if applicable):

cert-manager-operator-bundle:v1.11.1-6, 4.13.0-0.nightly-2023-05-18-195839

How reproducible:

Always. I tried a few times in different envs, double confirmed.

Steps to Reproduce:

1. Install the cert-manager operator, version cert-manager-operator-bundle:v1.11.1-6, from the console
2. Watch console
3.

Actual results:

The version shown in the UI keeps flipping between v1.11.1 and v1.10.2.
See attached video https://drive.google.com/drive/folders/1AFWquCK-pDCoQFMEOONQwGByBUg6tKR9?usp=sharing .

Expected results:

Should always show v1.11.1

Additional info:

The issue reproduces with both index images: v4.13 brew.registry.redhat.io/rh-osbs/iib:500235 (taken from the email "[CVP] (SUCCESS) (cvp-redhatopenshiftcfe: cert-manager-operator-bundle-container-v1.11.1-6)") and brew.registry.redhat.io/rh-osbs/iib-pub-pending:v4.13.

 

Description

Multiple gherkin files are missing package tags, which can be utilised for further automation. Currently, tag allocation is inconsistent across gherkin files.

Acceptance Criteria

  1. Every gherkin file should have a package tag in its first line.

PR: https://github.com/openshift/console/pull/12847

Description of problem:

When CNO is managed by Hypershift, its deployment has the "hypershift.openshift.io/release-image" template metadata annotation. The annotation's value is used to track the progress of cluster control plane version upgrades. But the multus-admission-controller created and managed by CNO does not have that annotation, so service providers are not able to track its version upgrades.

The proposed solution is for CNO to propagate its "hypershift.openshift.io/release-image" annotation down to the multus-admission-controller deployment. For that, CNO needs "get" access to its own deployment manifest to be able to read the deployment template metadata annotations.

Hypershift needs a code change to assign CNO the "get" permission on the CNO deployment object.
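
The propagation step can be sketched on plain dictionaries that stand in for the two Deployment objects. The annotation key comes from this report; `propagate_release_image` is a hypothetical helper and not the CNO implementation, which works on typed Kubernetes objects.

```python
ANNOTATION = "hypershift.openshift.io/release-image"


def propagate_release_image(source_deploy: dict, target_deploy: dict) -> bool:
    """Copy the release-image template annotation from source to target.

    Returns False (and leaves target untouched) if the source lacks it.
    """
    src_annotations = (
        source_deploy.get("spec", {})
        .get("template", {})
        .get("metadata", {})
        .get("annotations", {})
    )
    value = src_annotations.get(ANNOTATION)
    if value is None:
        return False
    # Create the nested template metadata on the target if it is missing.
    dst_annotations = (
        target_deploy.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("metadata", {})
        .setdefault("annotations", {})
    )
    dst_annotations[ANNOTATION] = value
    return True
```

Reading the source annotations is the part that requires the "get" permission on the CNO deployment mentioned above.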

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.Create OCP cluster using Hypershift
2.Check deployment template metadata annotations on multus-admission-controller

Actual results:

No "hypershift.openshift.io/release-image" deployment template metadata annotation exists 

Expected results:

"hypershift.openshift.io/release-image" annotation must be present

Additional info:

 

Description of problem:

When no configuration is set for node-exporter in the CMO config, the 2 arguments collector.netclass.ignored-devices and collector.netdev.device-exclude do not appear in the node-exporter daemonset. For full info see: http://pastebin.test.redhat.com/1093428

In 4.13.0-0.nightly-2023-02-27-101545, with no configuration for node-exporter, the collector.netclass.ignored-devices setting is present.
See: http://pastebin.test.redhat.com/1093429

After disabling netdev/netclass on the bot cluster, the collector.netclass.ignored-devices and collector.netdev.device-exclude settings do appear in node-exporter. Since OCPBUGS-7282 was filed against 4.12, where disabling netdev/netclass is not supported, I don't think we should disable netdev/netclass.

$ oc -n openshift-monitoring get ds node-exporter -oyaml | grep collector
        - --no-collector.wifi
        - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/k3s/containerd/.+|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*|cali[a-f0-9]*)$
        - --collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*|cali[a-f0-9]*)$
        - --collector.cpu.info
        - --collector.textfile.directory=/var/node_exporter/textfile
        - --no-collector.cpufreq
        - --no-collector.tcpstat
        - --no-collector.netdev
        - --no-collector.netclass
        - --no-collector.buddyinfo
        - '[[ ! -d /node_exporter/collectors/init ]] || find /node_exporter/collectors/init

Version-Release number of selected component (if applicable):

4.13

How reproducible:


Steps to Reproduce:

The 2 arguments are missing when booting up OCP with default configurations for CMO.

Actual results:

The 2 arguments collector.netclass.ignored-devices and collector.netdev.device-exclude are missing in node-exporter DaemonSet.

Expected results:

The 2 arguments collector.netclass.ignored-devices and collector.netdev.device-exclude are present in node-exporter DaemonSet.
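
The expected result can be expressed as a check over the container's argument list. This is a minimal sketch, assuming the arguments have already been extracted from the DaemonSet (for example from the `oc ... -oyaml` output) into a list of strings. The helper name is hypothetical.

```python
REQUIRED_FLAGS = (
    "--collector.netclass.ignored-devices",
    "--collector.netdev.device-exclude",
)


def missing_flags(args, required=REQUIRED_FLAGS):
    """Return the required flags that do not appear in args.

    Only the flag name (the part before "=") is compared, so the regular
    expression passed as the value does not matter.
    """
    present = {arg.split("=", 1)[0] for arg in args}
    return [flag for flag in required if flag not in present]


# Arguments resembling the default configuration described in the report.
print(missing_flags(["--no-collector.wifi", "--collector.cpu.info"]))
```

An empty list means both arguments are present.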

Additional info:


Description of problem:

An OpenShift Container Platform 4.12.5 cluster installed with the IPI method on Microsoft Azure shows undesired behavior when running curl against "https://api.<clustername>.<domain>:6443/readyz". From a pod using `HostNetwork`, it all works without any issues. But when the same request is made from a pod that does not have `HostNetwork` capabilities, and therefore has an IP from the SDN range, a large portion of the requests fail.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.5    True        False         29m     Cluster version is 4.12.5

$ oc get network cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2023-03-10T13:12:06Z"
  generation: 2
  name: cluster
  resourceVersion: "2975"
  uid: e1e9c464-526c-4ebf-ab84-0deedf092cac
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1400
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16

$ oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-03-10T13:12:04Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: 5c260276-d901-40f7-a28c-172c492e81e6
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: Azure
status:
  apiServerInternalURI: https://api-int.clustername.domain.lab:6443
  apiServerURL: https://api.clustername.domain.lab:6443
  controlPlaneTopology: HighlyAvailable
  etcdDiscoveryDomain: ""
  infrastructureName: sreberazure-njj24
  infrastructureTopology: HighlyAvailable
  platform: Azure
  platformStatus:
    azure:
      cloudName: AzurePublicCloud
      networkResourceGroupName: sreberazure-njj24-rg
      resourceGroupName: sreberazure-njj24-rg
    type: Azure

$ oc project openshift-apiserver
Already on project "openshift-apiserver" on server "https://api.clustername.domain.lab:6443".
$ oc get pod
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-6f58784797-kq4kr   2/2     Running   0          41m
apiserver-6f58784797-l69jr   2/2     Running   0          38m
apiserver-6f58784797-nn6tn   2/2     Running   0          45m

$ oc get pod -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
apiserver-6f58784797-kq4kr   2/2     Running   0          42m   10.130.0.21   sreberazure-njj24-master-0   <none>           <none>
apiserver-6f58784797-l69jr   2/2     Running   0          38m   10.129.0.29   sreberazure-njj24-master-2   <none>           <none>
apiserver-6f58784797-nn6tn   2/2     Running   0          45m   10.128.0.36   sreberazure-njj24-master-1   <none>           <none>

$ oc rsh apiserver-6f58784797-l69jr
Defaulted container "openshift-apiserver" out of: openshift-apiserver, openshift-apiserver-check-endpoints, fix-audit-permissions (init)
sh-4.4# while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done
curl: (28) Connection timed out after 1000 milliseconds
okokokcurl: (28) Connection timed out after 1001 milliseconds
okokcurl: (28) Connection timed out after 1003 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
okokokokokokokokokcurl: (28) Connection timed out after 1001 milliseconds
okokcurl: (28) Connection timed out after 1001 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
^C
sh-4.4# exit
exit
command terminated with exit code 130

$ oc project openshift-kube-apiserver
Now using project "openshift-kube-apiserver" on server "https://api.clustername.domain.lab:6443".

$ oc get pod -o wide
NAME                                              READY   STATUS      RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
apiserver-watcher-sreberazure-njj24-master-0      1/1     Running     0          55m   10.0.0.6      sreberazure-njj24-master-0   <none>           <none>
apiserver-watcher-sreberazure-njj24-master-1      1/1     Running     0          57m   10.0.0.8      sreberazure-njj24-master-1   <none>           <none>
apiserver-watcher-sreberazure-njj24-master-2      1/1     Running     0          57m   10.0.0.7      sreberazure-njj24-master-2   <none>           <none>
installer-2-sreberazure-njj24-master-2            0/1     Completed   0          51m   10.129.0.27   sreberazure-njj24-master-2   <none>           <none>
installer-3-sreberazure-njj24-master-2            0/1     Completed   0          50m   10.129.0.32   sreberazure-njj24-master-2   <none>           <none>
installer-4-sreberazure-njj24-master-2            0/1     Completed   0          49m   10.129.0.36   sreberazure-njj24-master-2   <none>           <none>
installer-5-sreberazure-njj24-master-2            0/1     Completed   0          46m   10.129.0.15   sreberazure-njj24-master-2   <none>           <none>
installer-6-sreberazure-njj24-master-0            0/1     Completed   0          37m   10.130.0.27   sreberazure-njj24-master-0   <none>           <none>
installer-6-sreberazure-njj24-master-1            0/1     Completed   0          39m   10.128.0.45   sreberazure-njj24-master-1   <none>           <none>
installer-6-sreberazure-njj24-master-2            0/1     Completed   0          36m   10.129.0.37   sreberazure-njj24-master-2   <none>           <none>
kube-apiserver-guard-sreberazure-njj24-master-0   1/1     Running     0          37m   10.130.0.29   sreberazure-njj24-master-0   <none>           <none>
kube-apiserver-guard-sreberazure-njj24-master-1   1/1     Running     0          38m   10.128.0.47   sreberazure-njj24-master-1   <none>           <none>
kube-apiserver-guard-sreberazure-njj24-master-2   1/1     Running     0          50m   10.129.0.31   sreberazure-njj24-master-2   <none>           <none>
kube-apiserver-sreberazure-njj24-master-0         5/5     Running     0          37m   10.0.0.6      sreberazure-njj24-master-0   <none>           <none>
kube-apiserver-sreberazure-njj24-master-1         5/5     Running     0          38m   10.0.0.8      sreberazure-njj24-master-1   <none>           <none>
kube-apiserver-sreberazure-njj24-master-2         5/5     Running     0          34m   10.0.0.7      sreberazure-njj24-master-2   <none>           <none>
revision-pruner-6-sreberazure-njj24-master-0      0/1     Completed   0          33m   10.130.0.35   sreberazure-njj24-master-0   <none>           <none>
revision-pruner-6-sreberazure-njj24-master-1      0/1     Completed   0          33m   10.128.0.56   sreberazure-njj24-master-1   <none>           <none>
revision-pruner-6-sreberazure-njj24-master-2      0/1     Completed   0          33m   10.129.0.39   sreberazure-njj24-master-2   <none>           <none>

$ oc rsh kube-apiserver-sreberazure-njj24-master-1
sh-4.4# while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done
okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokok

Changing curl's `--connect-timeout 1` to, for example, `--connect-timeout 10` has no effect either; it simply takes longer until the timeout is reached.
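
To quantify how many requests fail, the captured loop output can be tallied. The sketch below assumes the output format shown above: "ok" bodies printed without newlines, and curl error 28 messages on timeouts. The function name is hypothetical.

```python
import re


def tally_readyz(output: str) -> tuple:
    """Return (successes, timeouts) counted in captured curl loop output."""
    timeouts = len(re.findall(r"curl: \(28\)", output))
    # Drop the curl error messages so that only response bodies remain.
    bodies = re.sub(r"curl: \(28\)[^\n]*", "", output)
    successes = len(re.findall(r"ok", bodies))
    return successes, timeouts


sample = (
    "curl: (28) Connection timed out after 1000 milliseconds\n"
    "okokokcurl: (28) Connection timed out after 1001 milliseconds\n"
    "okok"
)
print(tally_readyz(sample))
```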

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12 (previous versions were not tested)

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.12 on Azure using IPI install method and set the SDN to OVN-Kubernetes
2. Once successfully installed run `oc project openshift-apiserver`
3. rsh apiserver-<podID>
4. while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done

Actual results:

sh-4.4# while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done
curl: (28) Connection timed out after 1000 milliseconds
okokokcurl: (28) Connection timed out after 1001 milliseconds
okokcurl: (28) Connection timed out after 1003 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
okokokokokokokokokcurl: (28) Connection timed out after 1001 milliseconds
okokcurl: (28) Connection timed out after 1001 milliseconds
curl: (28) Connection timed out after 1001 milliseconds
 

Expected results:

sh-4.4# while true; do curl -k --connect-timeout 1  https://api.clustername.domain.lab:6443/readyz; sleep 1; done
okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokok
 

Additional info:

 

Follow up for https://issues.redhat.com/browse/HOSTEDCP-969

Create metrics and grafana panel in

https://hypershift-monitoring.homelab.sjennings.me:3000/d/PGCTmCL4z/hypershift-slos-slis-alberto-playground?orgId=1&from=now-24h&to=now

https://github.com/openshift/hypershift/tree/main/contrib/metrics

for NodePool internal SLOs/SLIs:

  • NodePoolDeletionDuration
  • NodePoolInitialRolloutDuration

Move existing metrics when possible from metrics loop into nodepool controller:

- nodePoolSize

Explore and discuss granular metrics to track NodePool lifecycle bottlenecks: infra, ignition, node networking, available. Consolidate these with the hostedClusterTransitionSeconds metrics and dashboard panels.

Explore and discuss metrics for upgrade duration SLO for both HC and NodePool.

Description of problem:

OCP 4.13 uses a release candidate, v3.0.0-rc.1, of vsphere-csi-driver. We should ship OCP with a GA version.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-17-161027

I tried to update my cluster from 4.12.0 to 4.12.2, which left both prometheus adapter pods in a crashlooping state. I then tried to downgrade back to 4.12.0 and to upgrade to 4.12.4, but neither approach resolved the situation.

 

What I can see in the logs of the adapters is the following:

 

I0216 15:24:59.144559 1 adapter.go:114] successfully using in-cluster auth
I0216 15:25:00.345620 1 request.go:601] Waited for 1.180640418s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
I0216 15:25:10.345634 1 request.go:601] Waited for 11.180149045s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/triggers.tekton.dev/v1beta1?timeout=32s
I0216 15:25:20.346048 1 request.go:601] Waited for 2.597453714s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apiextensions.k8s.io/v1?timeout=32s
I0216 15:25:30.347435 1 request.go:601] Waited for 12.598768922s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1?timeout=32s
I0216 15:25:40.545767 1 request.go:601] Waited for 22.797001115s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/samples.operator.openshift.io/v1?timeout=32s
I0216 15:25:50.546588 1 request.go:601] Waited for 32.797748538s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/metrics.k8s.io/v1beta1?timeout=32s
I0216 15:25:56.041594 1 secure_serving.go:210] Serving securely on [::]:6443
I0216 15:25:56.042265 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/etc/tls/private/tls.crt::/etc/tls/private/tls.key"
I0216 15:25:56.042971 1 dynamic_cafile_content.go:157] "Starting controller" name="request-header::/etc/tls/private/requestheader-client-ca-file"
I0216 15:25:56.043309 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0216 15:25:56.043310 1 object_count_tracker.go:84] "StorageObjectCountTracker pruner is exiting"
I0216 15:25:56.043398 1 dynamic_serving_content.go:146] "Shutting down controller" name="serving-cert::/etc/tls/private/tls.crt::/etc/tls/private/tls.key"
I0216 15:25:56.043562 1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0216 15:25:56.043606 1 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/tls/private/client-ca-file"
I0216 15:25:56.043614 1 secure_serving.go:255] Stopped listening on [::]:6443
I0216 15:25:56.043621 1 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/tls/private/client-ca-file"
I0216 15:25:56.043635 1 dynamic_cafile_content.go:171] "Shutting down controller" name="request-header::/etc/tls/private/requestheader-client-ca-file"

I also tried to search online for known issues and bugs and found this one that might be related:

https://github.com/kubernetes-sigs/metrics-server/issues/983

I also tried rebooting the server but it didn't help.

A workaround is needed at least, because at the moment the cluster is still in a pending state.

Description of problem:

Following https://bugzilla.redhat.com/show_bug.cgi?id=2102765 and https://issues.redhat.com/browse/OCPBUGS-2140, problems with OpenID Group sync have been resolved.

Yet the problem documented in https://bugzilla.redhat.com/show_bug.cgi?id=2102765 still exists: Groups that have been removed are still part of the cache in oauth-apiserver, causing a panic in the respective components and login failures for potentially affected users.

So in general, it looks like the oauth-apiserver cache is not properly refreshing or handling the OpenID Groups being synced.

E1201 11:03:14.625799       1 runtime.go:76] Observed a panic: interface conversion: interface {} is nil, not *v1.Group
goroutine 3706798 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:103 +0xb0
panic({0x1aeab00, 0xc001400390})
    runtime/panic.go:838 +0x207
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1.1()
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:80 +0x2a
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1()
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:89 +0x250
panic({0x1aeab00, 0xc001400390})
    runtime/panic.go:838 +0x207
github.com/openshift/library-go/pkg/oauth/usercache.(*GroupCache).GroupsFor(0xc00081bf18?, {0xc000c8ac03?, 0xc001400360?})
    github.com/openshift/library-go@v0.0.0-20211013122800-874db8a3dac9/pkg/oauth/usercache/groups.go:47 +0xe7
github.com/openshift/oauth-server/pkg/groupmapper.(*UserGroupsMapper).processGroups(0xc0002c8880, {0xc0005d4e60, 0xd}, {0xc000c8ac03, 0x7}, 0x1?)
    github.com/openshift/oauth-server/pkg/groupmapper/groupmapper.go:101 +0xb5
github.com/openshift/oauth-server/pkg/groupmapper.(*UserGroupsMapper).UserFor(0xc0002c8880, {0x20f3c40, 0xc000e18bc0})
    github.com/openshift/oauth-server/pkg/groupmapper/groupmapper.go:83 +0xf4
github.com/openshift/oauth-server/pkg/oauth/external.(*Handler).login(0xc00022bc20, {0x20eebb0, 0xc00041b058}, 0xc0015d8200, 0xc001438140?, {0xc0000e7ce0, 0x150})
    github.com/openshift/oauth-server/pkg/oauth/external/handler.go:209 +0x74f
github.com/openshift/oauth-server/pkg/oauth/external.(*Handler).ServeHTTP(0xc00022bc20, {0x20eebb0, 0xc00041b058}, 0x0?)
    github.com/openshift/oauth-server/pkg/oauth/external/handler.go:180 +0x74a
net/http.(*ServeMux).ServeHTTP(0x1c9dda0?, {0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    net/http/server.go:2462 +0x149
github.com/openshift/oauth-server/pkg/server/headers.WithRestoreAuthorizationHeader.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:27 +0x10f
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0xc0005e0280?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/authorization.go:64 +0x498
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178
net/http.HandlerFunc.ServeHTTP(0x2f6cea0?, {0x20eebb0?, 0xc00041b058?}, 0x3?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/maxinflight.go:187 +0x2a4
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0x11?, {0x20eebb0?, 0xc00041b058?}, 0x1aae340?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/impersonation.go:50 +0x21c
net/http.HandlerFunc.ServeHTTP(0xc000d52120?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0xc0015d8100?, {0x20eebb0?, 0xc00041b058?}, 0xc000531930?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1({0x7fae682a40d8?, 0xc00041b048}, 0x9dbbaa?)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:111 +0x549
net/http.HandlerFunc.ServeHTTP(0xc00003def0?, {0x7fae682a40d8?, 0xc00041b048?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178
net/http.HandlerFunc.ServeHTTP(0x0?, {0x7fae682a40d8?, 0xc00041b048?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x7fae682a40d8?, 0xc00041b048?}, 0x20cfd00?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.withAuthentication.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/authentication.go:80 +0x8b9
net/http.HandlerFunc.ServeHTTP(0x20f0f20?, {0x7fae682a40d8?, 0xc00041b048?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc000e69e00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:88 +0x46b
net/http.HandlerFunc.ServeHTTP(0xc0019f5890?, {0x7fae682a40d8?, 0xc00041b048?}, 0xc000848764?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.WithCORS.func1({0x7fae682a40d8, 0xc00041b048}, 0xc000e69e00)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/cors.go:75 +0x10b
net/http.HandlerFunc.ServeHTTP(0xc00149a380?, {0x7fae682a40d8?, 0xc00041b048?}, 0xc0008487d0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1()
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:108 +0xa2
created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:94 +0x2cc

goroutine 3706802 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x19eb780?, 0xc001206e20})
    k8s.io/apimachinery@v0.22.2/pkg/util/runtime/runtime.go:74 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0016aec60, 0x1, 0x1560f26?})
    k8s.io/apimachinery@v0.22.2/pkg/util/runtime/runtime.go:48 +0x75
panic({0x19eb780, 0xc001206e20})
    runtime/panic.go:838 +0x207
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc0005047c8, {0x20eecd0?, 0xc0010fae00}, 0xdf8475800?)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:114 +0x452
k8s.io/apiserver/pkg/endpoints/filters.withRequestDeadline.func1({0x20eecd0, 0xc0010fae00}, 0xc000e69d00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/request_deadline.go:101 +0x494
net/http.HandlerFunc.ServeHTTP(0xc0016af048?, {0x20eecd0?, 0xc0010fae00?}, 0xc0000bc138?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1({0x20eecd0?, 0xc0010fae00}, 0xc000e69d00)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/waitgroup.go:59 +0x177
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x7fae705daff0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithAuditAnnotations.func1({0x20eecd0, 0xc0010fae00}, 0xc000e69c00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit_annotations.go:37 +0x230
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithWarningRecorder.func1({0x20eecd0?, 0xc0010fae00}, 0xc000e69b00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/warning.go:35 +0x2bb
net/http.HandlerFunc.ServeHTTP(0x1c9dda0?, {0x20eecd0?, 0xc0010fae00?}, 0xd?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1({0x20eecd0, 0xc0010fae00}, 0x0?)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/cachecontrol.go:31 +0x126
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/httplog.WithLogging.func1({0x20ef480?, 0xc001c20620}, 0xc000e69a00)
    k8s.io/apiserver@v0.22.2/pkg/server/httplog/httplog.go:103 +0x518
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20ef480?, 0xc001c20620?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1({0x20ef480, 0xc001c20620}, 0xc000e69900)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/requestinfo.go:39 +0x316
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20ef480?, 0xc001c20620?}, 0xc0007c3f70?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.withRequestReceivedTimestampWithClock.func1({0x20ef480, 0xc001c20620}, 0xc000e69800)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/request_received_time.go:38 +0x27e
net/http.HandlerFunc.ServeHTTP(0x419e2c?, {0x20ef480?, 0xc001c20620?}, 0xc0007c3e40?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1({0x20ef480?, 0xc001c20620?}, 0xc0004ff600?)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/wrap.go:74 +0xb1
net/http.HandlerFunc.ServeHTTP(0x1c05260?, {0x20ef480?, 0xc001c20620?}, 0x8?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.withAuditID.func1({0x20ef480, 0xc001c20620}, 0xc000e69600)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/with_auditid.go:66 +0x40d
net/http.HandlerFunc.ServeHTTP(0x1c9dda0?, {0x20ef480?, 0xc001c20620?}, 0xd?)
    net/http/server.go:2084 +0x2f
github.com/openshift/oauth-server/pkg/server/headers.WithPreserveAuthorizationHeader.func1({0x20ef480, 0xc001c20620}, 0xc000e69600)
    github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:16 +0xe8
net/http.HandlerFunc.ServeHTTP(0xc0016af9d0?, {0x20ef480?, 0xc001c20620?}, 0x16?)
    net/http/server.go:2084 +0x2f
github.com/openshift/oauth-server/pkg/server/headers.WithStandardHeaders.func1({0x20ef480, 0xc001c20620}, 0x4d55c0?)
    github.com/openshift/oauth-server/pkg/server/headers/headers.go:30 +0x18f
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20ef480?, 0xc001c20620?}, 0xc0016afac8?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc00098d622?, {0x20ef480?, 0xc001c20620?}, 0xc000401000?)
    k8s.io/apiserver@v0.22.2/pkg/server/handler.go:189 +0x2b
net/http.serverHandler.ServeHTTP({0xc0019f5170?}, {0x20ef480, 0xc001c20620}, 0xc000e69600)
    net/http/server.go:2916 +0x43b
net/http.(*conn).serve(0xc0002b1720, {0x20f0f58, 0xc0001e8120})
    net/http/server.go:1966 +0x5d7
created by net/http.(*Server).Serve
    net/http/server.go:3071 +0x4db

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.11.13

How reproducible:

- Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.11
2. Configure OpenID Group Sync (as per https://docs.openshift.com/container-platform/4.11/authentication/identity_providers/configuring-oidc-identity-provider.html#identity-provider-oidc-CR_configuring-oidc-identity-provider)
3. Have users with hundreds of groups
4. Log in and, after a while, remove some Groups from the user in the IDP and from OpenShift Container Platform
5. Try to log in again and see the panic in oauth-apiserver

Actual results:

The user is unable to log in and the oauth pods report a panic as shown above

Expected results:

oauth-apiserver should invalidate the cache quickly to remove potentially invalid references to nonexistent groups
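
The panic above comes from converting a nil cache entry to a group object. One defensive pattern is to skip stale references during lookup. The sketch below is an illustration only, not the library-go GroupCache code. The names `index` (user to group names) and `store` (group name to group object) are hypothetical.

```python
def groups_for(user, index, store):
    """Return the group objects for user, skipping stale index entries."""
    groups = []
    for name in index.get(user, ()):
        group = store.get(name)
        if group is None:
            # Stale reference: the group was deleted after the index was built.
            continue
        groups.append(group)
    return groups


store = {"dev": {"name": "dev"}}
index = {"alice": ["dev", "removed-group"]}
print(groups_for("alice", index, store))
```

Skipping stale entries avoids the crash, but the cache still needs to be refreshed so that the index stops referring to deleted groups.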

Additional info:

 

Description of problem:

In certain cases, an AWS cluster running 4.12 doesn't automatically generate a controlplanemachineset when it's expected to.

It looks like CPMS is looking for `infrastructure.Spec.PlatformSpec.Type` (https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/2aeaaf9ec714ee75f933051c21a44f648d6ed42b/pkg/controllers/controlplanemachinesetgenerator/controller.go#L180) and, as a result, clusters born earlier than 4.5, when this field was introduced (https://github.com/openshift/installer/pull/3277), will not be able to generate a CPMS.

I believe we should be looking at `infrastructure.Status.PlatformStatus.Type` instead
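
The suggested lookup can be sketched over the Infrastructure object represented as a plain dictionary. It assumes, as the report suggests, that the status field is populated even on clusters whose spec predates 4.5. The function name is hypothetical and this is not the operator's code.

```python
def platform_type(infrastructure: dict) -> str:
    """Return status.platformStatus.type, or "" when it is absent."""
    status = infrastructure.get("status") or {}
    platform_status = status.get("platformStatus") or {}
    return platform_status.get("type", "")


# Hypothetical object for an old cluster: the spec field is empty,
# while the status field carries the platform.
old_cluster = {
    "spec": {"platformSpec": {}},
    "status": {"platformStatus": {"type": "AWS"}},
}
print(platform_type(old_cluster))
```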

Version-Release number of selected component (if applicable):

4.12.9

How reproducible:

Consistent

Steps to Reproduce:

1. Install a cluster on a version earlier than 4.5
2. Upgrade cluster through to 4.12
3. Observe "Unable to generate control plane machine set, unsupported platform" error message from the control-plane-machine-set-operator, as well as the missing CPMS object in the openshift-machine-api namespace

Actual results:

No generated CPMS is created, despite the platform being AWS

Expected results:

A generated CPMS existing in the openshift-machine-api namespace

Additional info:


Description of problem:

Running `yarn dev` results in the build running in a loop. This issue appears to be related to changes in https://github.com/openshift/console/pull/12821.

How reproducible:

Always

Steps to Reproduce:

1. Run `yarn dev`
2. Make changes to a file and save
3. Watch the terminal output of `yarn dev` and note the build is looping

Description of problem:
I have a customer on OCP 4.9 who has configured the IngressControllers with a long httpLogFormat, and the routers print the following warnings every time they reload:

I0927 13:29:45.495077 1 router.go:612] template "msg"="router reloaded" "output"="[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'public'.\n[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'fe_sni'.\n[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'fe_no_sni'.\n - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"

This is the Ingress Controller configuration:

  logging:
    access:
      destination:
        syslog:
          address: 10.X.X.X
          port: 10514
        type: Syslog
      httpCaptureCookies:
      - matchType: Exact
        maxLength: 128
        name: ITXSESSIONID
      httpCaptureHeaders:
        request:
        - maxLength: 128
          name: Host
        - maxLength: 128
          name: itxrequestid
      httpLogFormat: actconn="%ac",backend_name="%b",backend_queue="%bq",backend_source_ip="%bi",backend_source_port="%bp",beconn="%bc",bytes_read="%B",bytes_uploaded="%U",captrd_req_cookie="%CC",captrd_req_headers="%hr",captrd_res_cookie="%CS",captrd_res_headers="%hs",client_ip="%ci",client_port="%cp",cluster="ieec1ocp1",datacenter="ieec1",environment="pro",fe_name_transport="%ft",feconn="%fc",frontend_name="%f",hostname="%H",http_version="%HV",log_type="http",method="%HM",query_string="%HQ",req_date="%tr",request="%HP",res_time="%TR",retries="%rc",server_ip="%si",server_name="%s",server_port="%sp",srv_queue="%sq",srv_conn="%sc",srv_queue="%sq",status_code="%ST",Ta="%Ta",Tc="%Tc",tenant="bk",term_state="%tsc",tot_wait_q="%Tw",Tr="%Tr"
      logEmptyRequests: Ignore

Any way to avoid this truncate warning?
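
As a rough aid, the configuration can be scanned for cookie captures longer than the length named in the warning. This sketch assumes the 63-byte figure from the warning text is the effective limit for cookie captures; the report does not say whether it also applies to header captures. The function name is hypothetical.

```python
CAPTURE_LIMIT = 63  # taken from the warning text; an assumption, not a documented constant


def oversized_cookie_captures(access_logging: dict, limit: int = CAPTURE_LIMIT):
    """Return names of httpCaptureCookies entries whose maxLength exceeds limit."""
    return [
        cookie.get("name")
        for cookie in access_logging.get("httpCaptureCookies", [])
        if cookie.get("maxLength", 0) > limit
    ]


access = {
    "httpCaptureCookies": [
        {"matchType": "Exact", "maxLength": 128, "name": "ITXSESSIONID"},
    ]
}
print(oversized_cookie_captures(access))
```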

How reproducible:
For every reload of haproxy config

Steps to Reproduce:
You can reproduce easily with the following configuration in the default ingress controller:

  logging:
    access:
      destination:
        type: Container
      httpCaptureCookies:
      - matchType: Exact
        maxLength: 128
        name: _abck

Then, when accessing the console, you will get a log line like:

2022-10-18T14:13:53.068164+00:00 xxxx xxxxxx haproxy[38]: 10.39.192.203:40698 [18/Oct/2022:14:13:52.488] fe_sni~ be_secure:openshift-console:console/pod:console-5976495467-zxgxr:console:https:10.128.1.116:8443 0/0/0/10/580 200 1130598 _abck=B7EA642C9E828FA8210F329F80B7B2D80YAAQnVozuFVfkOaDAQAADk - --VN 78/37/33/33/0 0/0 "GET /api/kubernetes/openapi/v2 HTTP/1.1"

Description of problem:

Trying to deploy a HostedCluster using an IPv6 network, the control plane fails to start. These are the networking parameters for the HostedCluster:

  networking:
    clusterNetwork:
    - cidr: fd01::/48
    networkType: OVNKubernetes
    serviceNetwork:
    - cidr: fd02::/112


When the control plane pods are created, the etcd pod will remain in crashloopbackoff. The error in the logs:

invalid value "https://fd01:0:0:3::4c:2380" for flag -listen-peer-urls: URL address does not have the form "host:port": https://fd01:0:0:3::4c:2380
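
The flag is rejected because an IPv6 literal has to be wrapped in square brackets before a port is appended to it in a URL. A minimal sketch of that formatting rule follows, written in Python for illustration only; the helper name `peer_url` is hypothetical and this is not the HyperShift or etcd code:

```python
import ipaddress


def peer_url(host: str, port: int, scheme: str = "https") -> str:
    """Build a URL, bracketing the host when it is an IPv6 literal."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        # Not an IP literal (for example a DNS name): use it unchanged.
        is_v6 = False
    hostport = f"[{host}]:{port}" if is_v6 else f"{host}:{port}"
    return f"{scheme}://{hostport}"


print(peer_url("fd01:0:0:3::4c", 2380))  # https://[fd01:0:0:3::4c]:2380
print(peer_url("10.0.0.5", 2380))        # https://10.0.0.5:2380
```

With the brackets in place, the address from the log above takes the "host:port" form that the flag parser expects.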

 

Version-Release number of selected component (if applicable):

Any

How reproducible:

Always

Steps to Reproduce:

1. Create a HostedCluster with the networking parameters set to IPv6 networks.
2. The etcd pod will be created and will fail to start.

Actual results:

etcd crashes at start

Expected results:

etcd starts properly and the other control plane pods follow

Additional info:

N/A

Description of problem:

Selecting "Manual" for Update approval does not take effect.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

ose-gcp-pd-csi-driver fails to build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=54433295

Error:
/usr/lib/golang/pkg/tool/linux_amd64/link: running gcc failed: exit status 1
gcc: error: static: No such file or directory

make: *** [Makefile:40: gce-pd-driver] Error 1

Version-Release number of selected component (if applicable):

4.14 / master

How reproducible:

run osbs build

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

When changing channels it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails`, but when the channel is changed to a 4.11 channel, multiple new risks require evaluation, and the evaluation of new risks is throttled at one every 10 minutes. This means that if there are three new risks, it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended, because most users will not wait 30 minutes; they expect immediate feedback.
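
The delay described above is simple arithmetic: one risk is evaluated per throttle interval, so the upper bound grows linearly with the number of newly exposed risks. A small sketch, in which the function name is hypothetical and the 10-minute interval is taken from the description rather than from the cluster-version operator source:

```python
THROTTLE_MINUTES = 10  # one new risk evaluated per 10 minutes, per the description


def worst_case_wait_minutes(new_risks: int, throttle: int = THROTTLE_MINUTES) -> int:
    """Upper bound on the time until all newly exposed risks have been evaluated."""
    if new_risks < 0:
        raise ValueError("new_risks must be non-negative")
    return new_risks * throttle


print(worst_case_wait_minutes(3))  # 30
```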

Version-Release number of selected component (if applicable):

4.10.z, 4.11.z, 4.12, 4.13

How reproducible:

100% 

Steps to Reproduce:

1. Install 4.10.34
2. Switch from stable-4.10 to stable-4.11
3. 

Actual results:

Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them

Expected results:

Risks are computed in a timely manner for an interactive UX, let's say < 10s

Additional info:

This was intentional in the design: we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack. However, we didn't anticipate that we'd have a long-standing pile of risks, nor did we realize how confusing the user experience would be.

We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always` avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.

Description of problem:

Usually the etcd pod is named "etcd-bootstrap" for a multinode install. In bootstrap-in-place mode the only master is not started during bootstrap, so it's useful to use the expected pod name during bootstrap. This would allow us to re-use the bootstrap-generated certificates on "real" master startup.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

Adding an audit configuration for a HyperShift hosted cluster does not work as expected.

Version-Release number of selected component (if applicable):

# oc get clusterversions.config.openshift.io
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-05-04-090524   True        False         15m     Cluster version is 4.13.0-0.nightly-2023-05-04-090524       

How reproducible:

Always

Steps to Reproduce:

1. Get hypershift hosted cluster detail from management cluster. 

# hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r .items[].metadata.name)  

2. Apply audit profile for hypershift hosted cluster. 
# oc patch HostedCluster $hostedcluster -n clusters -p '{"spec": {"configuration": {"apiServer": {"audit": {"profile": "WriteRequestBodies"}}}}}' --type merge     
hostedcluster.hypershift.openshift.io/85ea85757a5a14355124 patched 

# oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.apiServer.audit        
{
  "profile": "WriteRequestBodies"
}

3. Check Pod or operator restart to apply configuration changes. 

# oc get pods -l app=kube-apiserver  -n clusters-${hostedcluster}
NAME                              READY   STATUS    RESTARTS   AGE
kube-apiserver-7c98b66949-9z6rw   5/5     Running   0          36m
kube-apiserver-7c98b66949-gp5rx   5/5     Running   0          36m
kube-apiserver-7c98b66949-wmk8x   5/5     Running   0          36m

# oc get pods -l app=openshift-apiserver   -n clusters-${hostedcluster}
NAME                                  READY   STATUS    RESTARTS   AGE
openshift-apiserver-dc4c84ff4-566z9   3/3     Running   0          29m
openshift-apiserver-dc4c84ff4-99zq9   3/3     Running   0          29m
openshift-apiserver-dc4c84ff4-9xdrz   3/3     Running   0          30m

4. Check generated audit log.
# NOW=$(date -u "+%s"); echo "$NOW"; echo "$NOW" > now
1683711189

# kaspod=$(oc get pods -l app=kube-apiserver -n clusters-${hostedcluster} --no-headers -o=jsonpath={.items[0].metadata.name})                                     

# oc logs $kaspod -c audit-logs -n clusters-${hostedcluster} > kas-audit.log                                                                                      
# cat kas-audit.log | grep -iE '"verb":"(get|list|watch)","user":.*(requestObject|responseObject)' | jq -c 'select (.requestReceivedTimestamp | .[0:19] + "Z" | fromdateiso8601 > '"`cat now`)" | wc -l
0

# cat kas-audit.log | grep -iE '"verb":"(create|delete|patch|update)","user":.*(requestObject|responseObject)' | jq -c 'select (.requestReceivedTimestamp | .[0:19] + "Z" | fromdateiso8601 > '"`cat now`)" | wc -l
0  

Neither count should be zero.
The backend should apply the configuration, or the pods/operator should restart, after the configuration change.

Actual results:

Config changes are not applied in the backend. Neither the operator nor the pods restart.

Expected results:

The configuration should be applied, and the pods and operator should restart after config changes.

Additional info:

 

In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
Metal³ now allows these paths in the `name` hint (see OCPBUGS-13080), so the IPI installer's implementation using terraform must be changed to match.

Description of problem:

When a MCCPoolAlert is fired and we fix the problem that caused this alert, the alert is not removed.
 

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-06-06-212044   True        False         114m    Cluster version is 4.14.0-0.nightly-2023-06-06-212044
 

How reproducible:

Always
 

Steps to Reproduce:

1. Create a custom MCP

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [master,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""


2. Label a master node so that it is included in the new custom MCP

$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra=""

3. Verify that the alert is fired

alias thanosalerts='curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring thanos-querier -o jsonpath={.spec.host})/api/v1/alerts | jq '

$ thanosalerts |grep alertname
  ....
          "alertname": "MCCPoolAlert",


4. Remove the label from the node to fix the problem

$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra-

Actual results:

The alert is not removed.

When we have a look at the mcc_pool_alert  metric we find 2 values with 2 different "alert" fields.

alias thanosquery='function __lgb() { unset -f __lgb; oc rsh -n openshift-monitoring prometheus-k8s-0 curl -s -k  -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" --data-urlencode "query=$1" https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query | jq -c | jq; }; __lgb'

$ thanosquery mcc_pool_alert
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcc_pool_alert",
          "alert": "Applying custom label for pool",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.130.0.86:9001",
          "job": "machine-config-controller",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-129-20.us-east-2.compute.internal",
          "pod": "machine-config-controller-76dbddff49-75ggr",
          "pool": "infra",
          "prometheus": "openshift-monitoring/k8s",
          "service": "machine-config-controller"
        },
        "value": [
          1686137977.158,
          "0"
        ]
      },
      {
        "metric": {
          "__name__": "mcc_pool_alert",
          "alert": "Given both master and custom pools. Defaulting to master: custom infra",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.130.0.86:9001",
          "job": "machine-config-controller",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-129-20.us-east-2.compute.internal",
          "pod": "machine-config-controller-76dbddff49-75ggr",
          "pool": "infra",
          "prometheus": "openshift-monitoring/k8s",
          "service": "machine-config-controller"
        },
        "value": [
          1686137977.158,
          "1"
        ]
      }
    ]
  }
}
 

Expected results:

The alert should be removed.
 

Additional info:

If we remove the MCO controller pod, a new mcc_pool_alert data is generated with the right value and the other values are removed. If we execute this workaround the alert is removed.
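
The two series shown above differ only in their "alert" label, which suggests that a value written under an earlier label text is never cleared once the text changes. One possible way to avoid that is sketched below with a toy in-memory gauge; it is not the Prometheus client API or the machine-config controller code, and every name in it is hypothetical:

```python
from typing import Dict, Tuple

# Toy stand-in for a labelled gauge: (pool, alert) -> value.
Series = Dict[Tuple[str, str], int]


def set_pool_alert(series: Series, pool: str, alert: str, value: int) -> None:
    """Record the current alert for a pool, discarding stale label combinations."""
    for key in [k for k in series if k[0] == pool]:
        del series[key]
    series[(pool, alert)] = value


metrics: Series = {}
set_pool_alert(metrics, "infra", "Given both master and custom pools. Defaulting to master: custom infra", 1)
set_pool_alert(metrics, "infra", "Applying custom label for pool", 0)
print(metrics)  # only the latest series for "infra" remains
```

Dropping the old series for the pool before writing the new one has the same effect as the pod restart workaround, without losing the series of other pools.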

 

This is a clone of issue OCPBUGS-18754. The following is the description of the original issue:

Description of problem:

After control plane release upgrade, in the guest cluster pod 'tuned' uses control plane release image

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. create a cluster in 4.14.0-0.ci-2023-09-06-180503
2. control plane release upgrade to 4.14-2023-09-07-180503
3. in the guest cluster check container image in pod tuned

Actual results:

pod tuned uses control plane release image 4.14-2023-09-07-180503

 

Expected results:

pod tuned uses release image 4.14.0-0.ci-2023-09-06-180503

Additional info:

After controlplane release upgrade, in control plane namespace, cluster-node-tuning-operator uses control plane release image:

jiezhao-mac:hypershift jiezhao$ oc get pods cluster-node-tuning-operator-6dc549ffdf-jhj2k -n clusters-jie-test -ojsonpath='{.spec.containers[].name}{"\n"}'
cluster-node-tuning-operator
jiezhao-mac:hypershift jiezhao$ oc get pods cluster-node-tuning-operator-6dc549ffdf-jhj2k -n clusters-jie-test -ojsonpath='{.spec.containers[].image}{"\n"}'
registry.ci.openshift.org/ocp/4.14-2023-09-07-180503@sha256:60bd6e2e8db761fb4b3b9d68c1da16bf0371343e3df8e72e12a2502640173990

Description of problem:

Stop option for pipelinerun is not working

Version-Release number of selected component (if applicable):

Openshift Pipelines 1.9.x

How reproducible:

Always

Steps to Reproduce:

1. Create a pipeline and start it
2. From the Actions dropdown, select the Stop option

Actual results:

Pipelinerun is not getting cancelled

Expected results:

Pipelinerun should get cancelled

Additional info:

 

 

Description of problem:

With 4.13.0-rc.6, the cluster enters "Cluster status: error" while trying to install a cluster with the agent-based installer.
After the read disk stage, the cluster status turns to "error".

Version-Release number of selected component (if applicable):


How reproducible:

Create an image with the attached install config and agent config files and boot the node with this image

Steps to Reproduce:

1. Create an image with the attached install config and agent config files and boot the node with this image

Actual results:

Cluster status: error

Expected results:

Should continue with cluster status: installing 

Additional info:


Description of problem:

In hypershift context:
Operands managed by Operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/
https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265

These operands running on the management side should honour the same affinity, tolerations, node selector and priority rules as the operator.
This could be done by looking at the operator deployment itself or at the HCP resource.

multus-admission-controller
cloud-network-config-controller
ovnkube-master

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a hypershift cluster.
2. Check affinity rules and node selector of the operands above.
3.

Actual results:

Operands are missing affinity rules and node selector

Expected results:

Operands have the same affinity rules and node selector as the operator

Additional info:

 

Description of problem:

Pod Status Overlapping in Sidebar
Status that is breaking the UI: CreateContainerConfigError

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always when the status is CreateContainerConfigError

Steps to Reproduce:

1. Create a Pod that gives CreateContainerConfigError

Sample YAML:

apiVersion: v1
kind: Pod
metadata:
  name: example
  labels:
    app: httpd
  namespace: avik
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: httpd
      image: docker.io/httpd:latest
      ports:
        - containerPort: 80
      securityContext:
        allowPrivilegeEscalation: true
        capabilities:
          drop:
            - ALL 

Actual results:

The Pod Status overlaps other content when the status is long.

Expected results:

The Pod Status should not overlap. Also, this error status should look like the other error statuses.

Additional info:

 

Description of problem:

Unable to import a repository that contains both a .tekton directory and a func.yaml file. The import fails with the error `Cannot read properties of undefined (reading 'filter')`

Version-Release number of selected component (if applicable):

4.13, Pipeline and Serverless is installed

How reproducible:

 

Steps to Reproduce:

1. In import from git form enter the git URL: https://github.com/Lucifergene/oc-pipe-func
2. Pipeline is checked and the PAC option is selected by default; even if the user unchecks the Pipeline option, the same error occurs
3. click Create button

Actual results:

The import fails with the error `Cannot read properties of undefined (reading 'filter')`

Expected results:

Should be able to import without any error

Additional info:

 

Description of problem:

An uninstall was started, however it failed due to the hosted-cluster-config-operator being unable to clean up the default ingresscontroller

Version-Release number of selected component (if applicable):

4.12.18

How reproducible:

Unsure - though definitely not 100%

Steps to Reproduce:

1. Uninstall a HyperShift cluster

Actual results:

❯ k logs -n ocm-staging-2439occi66vhbj0pee3s4d5jpi4vpm54-mshen-dr2 hosted-cluster-config-operator-5ccdbfcc4c-9mxfk --tail 10 -f

{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Image registry is removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Ensuring ingress controllers are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Ensuring load balancers are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Load balancers are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Ensuring persistent volumes are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"There are no more persistent volumes. Nothing to cleanup.","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Persistent volumes are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}

After manually connecting to the hostedcluster and deleting the ingresscontroller, the uninstall progressed and succeeded

Expected results:

The hosted cluster can clean up the ingresscontrollers successfully and progress the uninstall

Additional info:

HyperShift dump: https://drive.google.com/file/d/1qqjkG4F_mSUCVMz3GbN-lEoqbshPvQcU/view?usp=sharing 

Description of problem:

While trying to deploy OCP on GCP, the installer gets stuck on the very first step, trying to list all the projects visible to the GCP service account used to deploy OCP

Version-Release number of selected component (if applicable):

4.13.3 but also happening on 4.12.5 and I presume other releases as well

How reproducible:

Every time

Steps to Reproduce:

1. Use openshift-install to create a cluster in GCP

Actual results:

$ ./openshift-install-4.13.3 create cluster --dir gcp-doha/ --log-level debug
DEBUG OpenShift Installer 4.13.3                   
DEBUG Built from commit 90bb61f38881d07ce94368f0b34089d152ffa4ef 
DEBUG Fetching Metadata...                         
DEBUG Loading Metadata...                          
DEBUG   Loading Cluster ID...                      
DEBUG     Loading Install Config...                
DEBUG       Loading SSH Key...                     
DEBUG       Loading Base Domain...                 
DEBUG         Loading Platform...                  
DEBUG       Loading Cluster Name...                
DEBUG         Loading Base Domain...               
DEBUG         Loading Platform...                  
DEBUG       Loading Networking...                  
DEBUG         Loading Platform...                  
DEBUG       Loading Pull Secret...                 
DEBUG       Loading Platform...                    
INFO Credentials loaded from environment variable "GOOGLE_CREDENTIALS", file "/home/mak/.gcp/aos-serviceaccount.json"
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.gcp.project: Internal error: context deadline exceeded 

Expected results:

The cluster should be deployed with no issues

Additional info:

The GCP user used to deploy OCP has visibility of thousands of projects:

> gcloud projects list | wc -l
  152793

Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-10738.

Description of problem:

Tests Failed: logs in as 'test' user via htpasswd identity provider: Auth test logs in as 'test' user via htpasswd identity provider

 CI-search
Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

API documentation for HostedCluster states that the webhook kubeconfig field is only supported for IBM Cloud. It should be supported for all platforms.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Review API documentation at https://hypershift-docs.netlify.app/reference/api/

Actual results:

 

Expected results:

 

Additional info:

 

While running the e2e test locally with Hypershift cluster from cluster-bot I noticed that it fails on step waiting for 2 prometheus instances.

“wait for prometheus-k8s: expected 2 Prometheus instances but got: 1: timed out waiting for the condition” 

Since Hypershift clusters from cluster-bot have a single worker node, the test will always fail, because main_test.go always expects 2 instances.

Ideally we need to check the infrastructureTopology field and adjust the test if the infrastructure is “SingleReplica”
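
The adjustment proposed above could look roughly like the following. The real test lives in Go (main_test.go); this is only a Python sketch of the decision, with a hypothetical function name, and it assumes that any topology other than "SingleReplica" should still expect two instances:

```python
def expected_prometheus_replicas(infrastructure_topology: str) -> int:
    """Return how many prometheus-k8s instances the test should wait for."""
    # "SingleReplica" is the value named in the report; treating every other
    # value as highly available is an assumption of this sketch.
    return 1 if infrastructure_topology == "SingleReplica" else 2


print(expected_prometheus_replicas("SingleReplica"))    # 1
print(expected_prometheus_replicas("HighlyAvailable"))  # 2
```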

 

Description of problem:

ControlPlaneMachineSet Machines are considered Ready once the underlying MAPI machine is Running.
This should not be a sufficient condition, as the Node linked to that Machine should also be Ready for the overall CPMS Machine to be considered Ready.
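
The stricter condition described above can be expressed as a conjunction of the Machine phase and the linked Node's readiness. A minimal sketch, assuming a hypothetical helper that receives both values already extracted from the cluster; it is not the ControlPlaneMachineSet operator code:

```python
from typing import Optional


def cpms_machine_ready(machine_phase: str, node_ready: Optional[bool]) -> bool:
    """Ready only when the Machine is Running and its linked Node reports Ready.

    node_ready is None when no Node is linked to the Machine yet.
    """
    return machine_phase == "Running" and node_ready is True


print(cpms_machine_ready("Running", True))   # True
print(cpms_machine_ready("Running", False))  # False: Node exists but is not Ready
print(cpms_machine_ready("Running", None))   # False: no Node linked yet
```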

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/1914

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

oc debug fails with the error "container "container-00" in pod "xiyuan24-f3-h4264-master-0-debug" is waiting to start: ContainerCreating". The error happens when the command is run via automation; running it locally does not show the issue. When the wait time around the command is increased in the automation script, it also works without any issues.

Version-Release number of selected component (if applicable):

03-24 17:57:54.649        [12:27:48] INFO> Shell Commands: oc version -o yaml --client --kubeconfig=/tmp/kubeconfig20230324-374-gt1vvm
03-24 17:57:54.649        clientVersion:
03-24 17:57:54.649          buildDate: "2023-03-17T23:32:35Z"
03-24 17:57:54.649          compiler: gc
03-24 17:57:54.649          gitCommit: eed143055ede731029931ad204b19cd2f565ef1a
03-24 17:57:54.649          gitTreeState: clean
03-24 17:57:54.649          gitVersion: 4.13.0-202303172327.p0.geed1430.assembly.stream-eed1430
03-24 17:57:54.649          goVersion: go1.19.4
03-24 17:57:54.649          major: ""
03-24 17:57:54.649          minor: ""
03-24 17:57:54.649          platform: linux/amd64
03-24 17:57:54.649        kustomizeVersion: v4.5.7
03-24 17:57:54.649        [12:27:49] INFO> Exit Status: 0 

How reproducible:

Always

Steps to Reproduce:

1. Install the latest 4.13 cluster
2. Run script https://github.com/openshift/verification-tests/blob/master/features/upgrade/security_compliance/fips.feature#L66

Actual results:

Test fails with error mentioned in the description

Expected results:

Test should not fail

Additional info:

Adding a link to the conversation I had with Maciej about this issue: https://redhat-internal.slack.com/archives/GK58XC2G2/p1679655589922729

Run log with --loglevel=9 -> https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Runner/770180/console

Description of problem:

Seen in 4.13.0-rc.2, mcc_drain_err is being served for nodes that have been deleted, causing un-actionable MCDDrainError.

Version-Release number of selected component (if applicable):

At least 4.13.0-rc.2. Further exposure unclear.

How reproducible:

At least four nodes on build01. Possibly all nodes that are removed while suffering from drain issues on 4.13.0-rc.2.

Steps to Reproduce:

Unclear.

Actual results:

The machine-config controller continues to serve mcc_drain_err for the removed nodes.

Expected results:

The machine-config controller never serves mcc_drain_err for non-existent nodes.

Description of problem:

Bump Kubernetes to 0.27.1 and bump dependencies

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Backport support starting in 4.12.z to a new GCP region europe-west12

Version-Release number of selected component (if applicable):

4.12.z and 4.13.z

How reproducible:

Always

Steps to Reproduce:

1. Use openshift-install to deploy OCP in europe-west12

Actual results:

europe-west12 is not available as a supported region in the user survey

Expected results:

europe-west12 to be available as a supported region in the user survey

Additional info:

 

Description of problem:

On clusters without the TechPreview feature set enabled, machines are failing to delete due to an attempt to list an IPAM that is not installed.

Version-Release number of selected component (if applicable):

4.14 nightly

How reproducible:

consistently

Steps to Reproduce:

1. Create a platform vSphere cluster
2. Scale down a machine

Actual results:

Machine fails to delete

Expected results:

Machine should delete

Additional info:

Fails with unable to list IPAddressClaims: failed to get API group resources: unable to retrieve the complete list of server APIs: ipam.cluster.x-k8s.io/v1alpha1: the server could not find the requested resource
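
The error suggests that the delete path queries the IPAM API unconditionally. One way to avoid it is to skip the lookup when the feature set that installs the IPAM resources is not enabled. The sketch below illustrates that guard only; the function name, the boolean flag and the `list_claims` callback are all hypothetical, and the actual fix in the machine API provider may differ:

```python
from typing import Callable, List


def claims_to_release(tech_preview_enabled: bool,
                      list_claims: Callable[[], List[str]]) -> List[str]:
    """Return IPAddressClaims to clean up, skipping the API call when IPAM is absent."""
    if not tech_preview_enabled:
        # The IPAM resources are not installed, so listing them would fail.
        return []
    return list_claims()


print(claims_to_release(False, lambda: ["claim-a"]))  # []
print(claims_to_release(True, lambda: ["claim-a"]))   # ['claim-a']
```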

Description of problem:

After the installation of a cluster, based on the agent installer ISO, is completed, the job assisted-installer-controller remains up

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Generate a valid ISO image using the agent installer. All kinds of topologies (compact/ha/sno) and configurations are affected by this problem

Steps to Reproduce:

1.
2.
3.

Actual results:

$ oc get jobs -n assisted-installer
NAME                            COMPLETIONS   DURATION   AGE
assisted-installer-controller   0/1           102m       102m

Expected results:

oc get jobs -n assisted-installer should not return any job

Additional info:

It looks like the assisted-installer-controller was designed on the assumption that the Assisted Service (AS) is always available and reachable. This is not necessarily true when using the agent installer, since the AS initially running on the rendezvous node is no longer available after that node has rebooted.

The assisted-installer-controller performs a number of different tasks internally, and from the logs not all of them complete successfully (a condition to terminate the job).
It could be useful to troubleshoot the ApproveCsrs task in more depth, as it is one of those that does not terminate properly

 

 

 

Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/478

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Viewing OperatorHub details page will return error page

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-03-28-180259

How reproducible:

Always on Hypershift Guest cluster

Steps to Reproduce:

1. Visit OperatorHub details page via Administration -> Cluster Settings -> Configuration -> OperatorHub 
2.
3.

Actual results:

Cannot read properties of undefined (reading 'sources')

Expected results:

page can be loaded successfully

Additional info:

screenshot one: https://drive.google.com/file/d/12cgpChKYuen2v6DWvmMrir273wONo5oY/view?usp=share_link
screenshot two: https://drive.google.com/file/d/1vVsczu7ScIqznoKNsR8V0w4k9bF1xWhB/view?usp=share_link 

Description of problem:

When a (recommended/conditional) release image is provided with --to-image='', the specified image name is not preserved in the ClusterVersion object.

Version-Release number of selected component (if applicable):

 

How reproducible:

100% with oc >4.9

Steps to Reproduce:

$ oc version
Client Version: 4.12.2
Kustomize Version: v4.5.7
Server Version: 4.12.2
Kubernetes Version: v1.25.4+a34b9e9

$ oc get clusterversion/version -o jsonpath='{.status.desired}'|jq
{
  "channels": [
    "candidate-4.12",
    "candidate-4.13",
    "eus-4.12",
    "fast-4.12",
    "stable-4.12"
  ],
  "image": "quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1",
  "url": "https://access.redhat.com/errata/RHSA-2023:0569",
  "version": "4.12.2"
}
$ oc adm release info 4.12.3 -o jsonpath='{.image}'
quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36
$ skopeo copy docker://quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36 docker://quay.example.com/playground/release-images
Getting image source signatures
Copying blob 64096b96a7b0 done  
Copying blob 0e0550faf8e0 done  
Copying blob 97da74cc6d8f skipped: already exists  
Copying blob d8190195889e skipped: already exists  
Copying blob 17997438bedb done  
Copying blob fdbb043b48dc done  
Copying config b49bc8b603 done  
Writing manifest to image destination
Storing signatures
$ skopeo inspect docker://quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36|jq '.Name,.Digest'
"quay.example.com/playground/release-images"
"sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36"
$ oc adm upgrade --to-image=quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36
Requesting update to 4.12.3
 

Actual results:

$ oc get clusterversion/version -o jsonpath='{.status.desired}'|jq
{
  "channels": [
    "candidate-4.12",
    "candidate-4.13",
    "eus-4.12",
    "fast-4.12",
    "stable-4.12"
  ],
  "image": "quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36",    <--- not quay.example.com
  "url": "https://access.redhat.com/errata/RHSA-2023:0728",
  "version": "4.12.3"
}

$ oc get clusterversion/version -o jsonpath='{.status.history}'|jq
[
  {
    "completionTime": null,
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36",         <--- not quay.example.com
    "startedTime": "2023-04-28T07:39:11Z",
    "state": "Partial",
    "verified": true,
    "version": "4.12.3"
  },
  {
    "completionTime": "2023-04-27T14:48:06Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1",
    "startedTime": "2023-04-27T14:24:29Z",
    "state": "Completed",
    "verified": false,
    "version": "4.12.2"
  }
]

Expected results:

$ oc get clusterversion/version -o jsonpath='{.status.desired}'|jq
{
  "channels": [
    "candidate-4.12",
    "candidate-4.13",
    "eus-4.12",
    "fast-4.12",
    "stable-4.12"
  ],
  "image": "quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36",
  "url": "https://access.redhat.com/errata/RHSA-2023:0728",
  "version": "4.12.3"
}

$ oc get clusterversion/version -o jsonpath='{.status.history}'|jq
[
  {
    "completionTime": null,
    "image": "quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36",
    "startedTime": "2023-04-28T07:39:11Z",
    "state": "Partial",
    "verified": true,
    "version": "4.12.3"
  },
  {
    "completionTime": "2023-04-27T14:48:06Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1",
    "startedTime": "2023-04-27T14:24:29Z",
    "state": "Completed",
    "verified": false,
    "version": "4.12.2"
  }
]

Additional info:

While in earlier versions (<4.10) we used to preserve the specified image [1], we now (as of 4.10) store the public image as the desired version [2].
[1] https://github.com/openshift/oc/blob/88cfeb4aa2d74ee5f5598c571661622c0034081b/pkg/cli/admin/upgrade/upgrade.go#L278
[2] https://github.com/openshift/oc/blob/5711859fac135177edf07161615bdabe3527e659/pkg/cli/admin/upgrade/upgrade.go#L278 
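
The expected behaviour can be summarised as a rule: when the requested pullspec has the same digest as a known (recommended or conditional) update, reuse that update's metadata but keep the image name the user supplied. The following is a minimal Python sketch of that rule, not the actual oc code; `split_pullspec`, `resolve_desired_image` and the `known_updates` structure are hypothetical names introduced for illustration.

```python
from __future__ import annotations


def split_pullspec(pullspec: str) -> tuple[str, str]:
    """Split 'repository@sha256:...' into (repository, digest)."""
    repository, sep, digest = pullspec.partition("@")
    if not sep or not digest.startswith("sha256:"):
        raise ValueError(f"expected a by-digest pullspec, got {pullspec!r}")
    return repository, digest


def resolve_desired_image(requested: str, known_updates: list[dict]) -> dict:
    """Return the desired update, keeping the image name the user asked for.

    known_updates is a hypothetical list of {"version", "image"} dicts that
    stands in for the recommended and conditional updates.
    """
    _, requested_digest = split_pullspec(requested)
    for update in known_updates:
        _, digest = split_pullspec(update["image"])
        if digest == requested_digest:
            # Same release content: reuse the metadata, keep the user's pullspec.
            return {**update, "image": requested}
    # Unknown digest: pass the pullspec through unchanged, version unknown.
    return {"version": "", "image": requested}
```

With this rule, requesting the mirrored image from the steps above would leave quay.example.com in the desired image while still reporting version 4.12.3.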

Description of the problem:

ProLiant Gen 11 hosts always report the serial number "PCA_number.ACC", so all such hosts register with the same UUID.

How reproducible:

100%

Steps to reproduce:

1. Boot two ProLiant Gen 11 hosts

2. Observe that both hosts update a single host entry in the service

Actual results:

All hosts with this hardware are assigned the same UUID

Expected results:

Each host should have a unique UUID
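
One possible way to get unique IDs is to treat known placeholder serial numbers as missing and fall back to another identifier. The sketch below illustrates that idea in Python under stated assumptions: the placeholder value "PCA_number.ACC" comes from this report, while the fallback to MAC addresses, the `host_uuid` function and the UUID namespace are hypothetical choices, not the service's actual logic.

```python
from __future__ import annotations

import uuid

# Serial numbers known to be placeholders rather than unique values.
# "PCA_number.ACC" is taken from the report; the fallback logic is an assumption.
PLACEHOLDER_SERIALS = {"", "PCA_number.ACC"}


def host_uuid(serial: str, mac_addresses: list[str]) -> uuid.UUID:
    """Derive a stable host UUID, ignoring placeholder serial numbers."""
    serial = serial.strip()
    if serial not in PLACEHOLDER_SERIALS:
        seed = f"serial:{serial}"
    elif mac_addresses:
        # Sort and lowercase so the result does not depend on NIC enumeration order.
        seed = "mac:" + ",".join(sorted(m.lower() for m in mac_addresses))
    else:
        raise ValueError("no usable identifier for this host")
    return uuid.uuid5(uuid.NAMESPACE_OID, seed)
```

Two hosts that both report the placeholder serial then get different UUIDs as long as their MAC addresses differ, and a host keeps the same UUID across reboots.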

Description of problem:

Hypershift does not utilize existing liveness and readiness probes on openshift-route-controller-manager and openshift-controller-manager.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create OCP cluster using Hypershift
2. Look at openshift-route-controller-manager and openshift-controller-manager yaml manifests

Actual results:

No probes defined for pods of those two deployments

Expected results:

Probes should be defined because the services implement them

Additional info:

This is the result of a security review for 4.12 Hypershift. The original investigation can be found at https://github.ibm.com/alchemy-containers/armada-update/issues/4117#issuecomment-53149378

Description of problem:

Any FBC-enabled OLM catalog displays its channels in a random order.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a catalog source for icr.io/cpopen/ibm-operator-catalog:latest
2. Navigate to OperatorHub
3. Click on the `ibm-mq` operator
4. Click on the Install button.

Actual results:

The list of channels is in random order. The order changes with each page refresh.

Expected results:

The list of channels should be in ascending lexicographical order, as it was for SQLite-based catalogs.

Additional info:

See related operator-registry upstream issue:
https://github.com/operator-framework/operator-registry/issues/1069#top

Note:  I think both `operator-registry` and the OperatorHub should provide deterministic sorting of these channels.
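
Deterministic ordering can be as simple as sorting the channel names before they are returned or rendered. A minimal Python sketch, assuming channels are plain strings; `sorted_channels` and the channel names in the usage note are hypothetical:

```python
def sorted_channels(channels):
    """Return channel names in ascending lexicographical order, without duplicates."""
    return sorted(set(channels))
```

For example, `sorted_channels(["v2.0", "v1.1", "v1.0"])` always yields the same order regardless of how the catalog returned the channels. Note that a plain lexicographical sort places "v10.0" before "v2.0"; whether a version-aware sort is preferable is a separate design decision.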

Request for sending data via telemetry

The goal is to collect metrics about the number of LIST and WATCH requests to the apiserver, which will allow us to measure the deployment progress of the API streaming feature. The new feature will replace the use of LIST requests with WATCH.

apiserver_list_watch_request_total:rate:sum 

apiserver_list_watch_request_total:rate:sum represents the rate of change for the LIST and WATCH requests over a 5 minute period.

Labels

  • verb, possible values are: LIST, WATCH

The cardinality of the metric is at most 2.
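
To illustrate what the recorded value represents, the following Python sketch computes a per-second request rate over a 5 minute window and sums it per verb. It is only an arithmetic illustration under assumptions: the input tuples are hypothetical counter readings, counter resets are ignored, and this is not the actual recording rule expression.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # the 5 minute period mentioned above


def rate_sum_by_verb(samples):
    """Sum per-second request rates per verb for LIST and WATCH.

    samples is an iterable of (verb, count_at_window_start, count_at_window_end),
    one tuple per counter series (hypothetical input format).
    """
    totals = defaultdict(float)
    for verb, start, end in samples:
        if verb not in ("LIST", "WATCH"):
            continue  # other verbs are not part of this metric
        totals[verb] += (end - start) / WINDOW_SECONDS
    return dict(totals)
```

Because only the verb label is kept, the result has at most two entries, which matches the stated cardinality.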

Description of problem:

This is a clone of https://issues.redhat.com/browse/CNV-26608

Description of problem:

Update to use Jenkins 4.13 images to address CVEs

Description of problem:

The info message below the "Git access token" field, shown when creating the Pipelines Repository in the Pipelines section of the Import from Git page, falls back to the default text instead of showing the message curated for each Git provider.

The info messages are curated for each Git provider when the Repository is created from the Pipelines page.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Go to the Import from Git Page
2. Add a Git URL with PAC ( https://github.com/Lucifergene/oc-pipe )
3. Check the text under the "Git access token" Field 

Actual results:

Use your Git Personal token. Create a token with repo, public_repo & admin:repo_hook scopes and give your token an expiration, i.e 30d.

Expected results:

Use your GitHub Personal token. Use this link to create a token with repo, public_repo & admin:repo_hook scopes and give your token an expiration, i.e 30d.

Additional info:

 

This issue has been reported multiple times over the years with no resolution
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-vsphere-zones/1655633815252504576

kubeconfig received!
waiting for api to be available
level=error
level=error msg=Error: failed to parse ovf: failed to parse ovf: XML syntax error on line 1: illegal character code U+0000
level=error
level=error msg= with vsphereprivate_import_ova.import[0],
level=error msg= on main.tf line 70, in resource "vsphereprivate_import_ova" "import":
level=error msg= 70: resource "vsphereprivate_import_ova" "import" {
level=error
level=error
level=error msg=Error: failed to parse ovf: failed to parse ovf: XML syntax error on line 1: illegal character code U+0000

https://issues.redhat.com/browse/OCPQE-13219
https://issues.redhat.com/browse/TRT-741

Description of problem:

On OpenShift Container Platform, the etcd Pod is showing messages like the following:

2023-06-19T09:10:30.817918145Z {"level":"warn","ts":"2023-06-19T09:10:30.817Z","caller":"fileutil/purge.go:72","msg":"failed to lock file","path":"/var/lib/etcd/member/wal/000000000000bc4b-00000000183620a4.wal","error":"fileutil: file already locked"}


This is described in KCS https://access.redhat.com/solutions/7000327

Version-Release number of selected component (if applicable):

Any currently supported version (> 4.10) running etcd 3.5.x

How reproducible:

always

Steps to Reproduce:

happens after running etcd for a while

 

This has been discussed in https://github.com/etcd-io/etcd/issues/15360

It's not a harmful error message, it merely indicates that some WALs have not been included in snapshots yet.

This was caused by a change in default values: https://github.com/etcd-io/etcd/issues/13889

This was fixed in https://github.com/etcd-io/etcd/pull/15408/files but never backported to 3.5.

To mitigate that error and stop confusing people, we should also supply that argument when starting etcd in: https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L170-L187

That way we're not surprised by changes of the default values upstream.

Description of problem:

The agent-tui should be shown only before the installation, but it is shown again during the installation, and when it quits again the installation fails to continue.

Version-Release number of selected component (if applicable):

4.13.0-0.ci-2023-03-14-045458

How reproducible:

always

Steps to Reproduce:

1. Make sure the primary check passes, then boot the agent.x86_64.iso file. The agent-tui is shown before the installation

2. Tracking installation by both wait-for output and console output

3. The agent-tui is shown again during the installation. Wait for the agent-tui to quit automatically, without any user interaction. The installation then quits with a failure, and the wait-for output is as follows:

DEBUG asset directory: .                           
DEBUG Loading Agent Config...                      
...
DEBUG Agent Rest API never initialized. Bootstrap Kube API never initialized 
INFO Waiting for cluster install to initialize. Sleeping for 30 seconds 
DEBUG Agent Rest API Initialized                   
INFO Cluster is not ready for install. Check validations 
DEBUG Cluster validation: The pull secret is set.  
WARNING Cluster validation: The cluster has hosts that are not ready to install. 
DEBUG Cluster validation: The cluster has the exact amount of dedicated control plane nodes. 
DEBUG Cluster validation: API virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: API virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: The Cluster Network CIDR is defined. 
DEBUG Cluster validation: The base domain is defined. 
DEBUG Cluster validation: Ingress virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: Ingress virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: The Machine Network CIDR is defined. 
DEBUG Cluster validation: The Cluster Machine CIDR is not required: User Managed Networking 
DEBUG Cluster validation: The Cluster Network prefix is valid. 
DEBUG Cluster validation: The cluster has a valid network type 
DEBUG Cluster validation: Same address families for all networks. 
DEBUG Cluster validation: No CIDRS are overlapping. 
DEBUG Cluster validation: No ntp problems found    
DEBUG Cluster validation: The Service Network CIDR is defined. 
DEBUG Cluster validation: cnv is disabled          
DEBUG Cluster validation: lso is disabled          
DEBUG Cluster validation: lvm is disabled          
DEBUG Cluster validation: odf is disabled          
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Valid inventory exists for the host 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient CPU cores 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient minimum RAM 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient disk capacity 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient CPU cores for role master 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient RAM for role master 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Hostname openshift-qe-049.arm.eng.rdu2.redhat.com is unique in cluster 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Hostname openshift-qe-049.arm.eng.rdu2.redhat.com is allowed 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Speed of installation disk has not yet been measured 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host is compatible with cluster platform none 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: VSphere disk.EnableUUID is enabled for this virtual machine 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host agent compatibility checking is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No request to skip formatting of the installation disk 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: All disks that have skipped formatting are present in the host inventory 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host is connected 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Media device is connected 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No Machine Network CIDR needed: User Managed Networking 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host belongs to all machine network CIDRs 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host has connectivity to the majority of hosts in the cluster 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Platform PowerEdge R740 is allowed 
WARNING Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host couldn't synchronize with any NTP server 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host clock is synchronized with service 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: All required container images were either pulled successfully or no attempt was made to pull them 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Network latency requirement has been satisfied. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Packet loss requirement has been satisfied. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host has been configured with at least one default route. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the api.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the api-int.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the *.apps.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host subnets are not overlapping 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No IP collisions were detected by host 7a9649d8-4167-a1f9-ad5f-385c052e2744 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: cnv is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: lso is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: lvm is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: odf is disabled 
WARNING Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server) 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host NTP is synced 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from insufficient to known (Host is ready to be installed) 
INFO Cluster is ready for install                 
INFO Cluster validation: All hosts in the cluster are ready to install. 
INFO Preparing cluster for installation           
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation) 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: New image status registry.ci.openshift.org/ocp/4.13-2023-03-14-045458@sha256:b0d518907841eb35adbc05962d4b2e7d45abc90baebc5a82d0398e1113ec04d0. result: success. time: 1.35 seconds; size: 401.45 Megabytes; download rate: 312.54 MBps 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation) 
INFO Cluster installation in progress             
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from preparing-successful to installing (Installation is in progress) 
INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Starting installation: bootstrap 
INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Installing: bootstrap 
INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon registry.ci.openshift.org/ocp/4.13-2023-03-14-045458@sha256:f85a278868035dc0a40a66ea7eaf0877624ef9fde9fc8df1633dc5d6d1ad4e39 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "...  to initialize single run daemon: error initializing rpm-ostree: Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists" 
INFO Cluster has hosts in error                   
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation   

4. During the installation, NetworkManager-wait-online.service ran for a while and then failed:
-- Logs begin at Wed 2023-03-15 03:06:29 UTC, end at Wed 2023-03-15 03:27:30 UTC. --
Mar 15 03:18:52 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: Starting Network Manager Wait Online...
Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: Failed to start Network Manager Wait Online.

Expected results:

The TUI should be shown only once, before the installation.

Description of problem:

The following tests broke the payload for CI and nightly

[sig-network][Feature:MultiNetworkPolicy][Serial] should enforce a network policies on secondary network IPv6 [Suite:openshift/conformance/serial]

[sig-network][Feature:MultiNetworkPolicy][Serial] should enforce a network policies on secondary network IPv4 [Suite:openshift/conformance/serial]

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Test Panicked: runtime error: invalid memory address or nil pointer dereference

Expected results:

 

Additional info:

Original PR that broke the payload https://github.com/openshift/origin/pull/27795 

Revert to get payloads back to normal https://github.com/openshift/origin/pull/27926

Broken payloads and related jobs and sippy link for additional info

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.ci/release/4.14.0-0.ci-2023-05-17-212447

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1659065324743430144

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-18-040905

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-serial/1659088328617627648
https://sippy.dptools.openshift.org/sippy-ng/tests/4.14?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522current_runs%2522%252C%2522operatorValue%2522%253A%2522%253E%253D%2522%252C%2522value%2522%253A%25227%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522never-stable%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522aggregated%2522%257D%252C%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bsig-network%255D%255BFeature%253AMultiNetworkPolicy%255D%255BSerial%255D%2520should%2520enforce%2520a%2520network%2520policies%2520on%2520secondary%2520network%2520IPv6%2520%255BSuite%253Aopenshift%252Fconformance%252Fserial%255D%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=current_working_percentage

Description of problem:

2023-02-20T16:27:58.107800612Z + oc observe pods -n openshift-sdn --listen-addr= -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh
2023-02-20T16:27:58.181727766Z Flag --argument has been deprecated, and will be removed in a future release. Use --template instead.

 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-17-090603
 

How reproducible:

Always

Steps to Reproduce:

1. Deploy Azure OpenShiftSDN cluster
2. Check drop-icmp container logs
oc logs -n openshift-sdn -c drop-icmp -l app=sdn --previous

Actual results:

+ true
+ iptables -F AZURE_ICMP_ACTION
+ iptables -A AZURE_ICMP_ACTION -j LOG
+ iptables -A AZURE_ICMP_ACTION -j DROP
+ oc observe pods -n openshift-sdn --listen-addr= -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh
Flag --argument has been deprecated, and will be removed in a future release. Use --template instead.
E0220 16:27:07.553592   27842 memcache.go:238] couldn't get current server API group list: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: connection refused
E0220 16:27:07.553913   27842 memcache.go:238] couldn't get current server API group list: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: connection refused
The connection to the server 172.30.0.1:443 was refused - did you specify the right host or port?
Error from server (BadRequest): previous terminated container "drop-icmp" in pod "sdn-v7gqq" not found

 

Expected results:

No deprecation warning

Additional info:

Description of problem:

In the web console Administrator view, the items under "Observe" in the side navigation menu are duplicated.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

This is happening because those menu items are now provided by the `monitoring-plugin` dynamic plugin, so we need to remove them from the web console codebase.

Description of problem:

1. CR.status.LastSyncTimestamp should also be updated in the "else" code branch: 
https://github.com/openshift/cloud-credential-operator/blob/4cb9faca62c31ebea9a11b55f7af764be4ee2cd8/pkg/operator/credentialsrequest/credentialsrequest_controller.go#L1054

2. r.Client.Status().Update is not called on the CR object in memory after this line:
https://github.com/openshift/cloud-credential-operator/blob/4cb9faca62c31ebea9a11b55f7af764be4ee2cd8/pkg/operator/credentialsrequest/credentialsrequest_controller.go#L713
So CR.status.conditions are not updated. 

Steps to Reproduce:

This results from a static code check.

Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/41

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

After upgrading a cluster from 4.10.47 to 4.11.25, an issue is observed with the egress router pods: the pods are stuck in the Pending state.

Version-Release number of selected component (if applicable):

4.11.25

How reproducible:

 

Steps to Reproduce:

1. Upgrade from 4.10.47 to 4.11.25
2. Check if co network is in Managed state
3. Verify that egress pods are not created and that errors like the following are reported:
55s         Warning   FailedCreatePodSandBox   pod/******     (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox *******_d6918859-a4e9-4e5b-ba44-acc70499fa7c_0(9c464935ebaeeeab7be0b056c3f7ed1b7279e21445b9febea29eb280f7ee7429): error adding pod ****** to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [ns/pod/d6918859-a4e9-4e5b-ba44-acc70499fa7c:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'could not open netns "/var/run/netns/503fb77f-3b96-4f23-8356-43e7ae1e1b49": unknown FS magic on "/var/run/netns/503fb77f-3b96-4f23-8356-43e7ae1e1b49": 1021994
 

Actual results:

Egress router pods in pending state with error message as below:
$ omg get events 
...
49s        Warning  FailedCreatePodSandBox  pod/xxxx  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_xxxx_379fa7ec-4702-446c-9162-55c2f76989f6_0(86f8c76e9724216143bef024996cb14a7614d3902dcf0d3b7ea858298766630c): error adding pod xxx to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [xxxx/xxxx/379fa7ec-4702-446c-9162-55c2f76989f6:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'could not open netns "/var/run/netns/0d39f378-29fd-4858-a947-51c5c06f1598": unknown FS magic on "/var/run/netns/0d39f378-29fd-4858-a947-51c5c06f1598": 1021994

Expected results:

Egress router pods in running state

Additional info:

The workaround from https://access.redhat.com/solutions/6986283 works.
Edit the sdn DaemonSet in the openshift-sdn namespace:
- mountPath: /host/var/run/netns <<<<< /var/run/netns
  mountPropagation: HostToContainer
  name: host-run-netns   
  readOnly: true 

Dependencies for the ironic containers are quite old. We need to upgrade them to the latest available versions to keep up with upstream requirements.

Description of problem:

Placeholder bug to backport common latency failures

Description of problem:

Error message seen during testing:
2023-03-23T22:33:02.507Z	ERROR	operator.dns_controller	dns/controller.go:348	failed to publish DNS record to zone	{"record": {"dnsName":"*.example.com","targets":["34.67.189.132"],"recordType":"A","recordTTL":30,"dnsManagementPolicy":"Managed"}, "dnszone": {"id":"ci-ln-95xvtb2-72292-9jj4w-private-zone"}, "error": "googleapi: Error 400: Invalid value for 'entity.change.additions[*.example.com][A].name': '*.example.com', invalid"}

Version-Release number of selected component (if applicable):

4.13

How reproducible:


Steps to Reproduce:

1. Setup 4.13 gcp cluster, install OSSM using http://pastebin.test.redhat.com/1092754
2. Run gateway api e2e against cluster (or create gateway with listener hostname *.example.com)
3. Check ingress operator logs

Actual results:

DNS record not published, and continuous errors in the log

Expected results:

Should publish DNS record to zone without errors

Additional info:

Miciah: The controller should check ManageDNSForDomain when calling EnsureDNSRecord.  
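
The suggestion amounts to a guard in front of the publish call. The Python sketch below illustrates the shape of such a guard only; the function names echo the ones mentioned above, but their bodies, the rule "manage only names under the cluster base domain", and the example domains are assumptions made for illustration, not the operator's actual Go code.

```python
def manage_dns_for_domain(dns_name: str, base_domain: str) -> bool:
    """Assumed rule: only manage records that fall under the cluster base domain."""
    name = dns_name.rstrip(".").lower()
    base = base_domain.rstrip(".").lower()
    # Drop a leading wildcard label so "*.apps.example.org" compares as "apps.example.org".
    if name.startswith("*."):
        name = name[2:]
    return name == base or name.endswith("." + base)


def ensure_dns_record(dns_name: str, base_domain: str, publish) -> bool:
    """Publish the record only when the domain is managed; return True if published."""
    if not manage_dns_for_domain(dns_name, base_domain):
        return False  # skip instead of retrying a request the provider will reject
    publish(dns_name)
    return True
```

Under this assumed rule, a listener hostname such as *.example.com on a cluster whose base domain is unrelated would be skipped rather than producing a repeated error in the log.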

Description of the problem:

vSphere vCenter cluster field is missing description

How reproducible:

always

Steps to reproduce:

1. Install OCP on the vSphere platform

2. Go to Overview -> vSphere, configure

Actual results:

vCenter cluster field is missing description

Expected results:

Description is present

Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/515

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

If appsDomain (https://docs.openshift.com/container-platform/4.13/networking/ingress-operator.html#nw-ingress-configuring-application-domain_configuring-ingress) is specified and a cluster-admin accidentally deletes all routes on a cluster, the canary route in the openshift-ingress-canary namespace is recreated with the domain specified in .spec.appsDomain instead of .spec.domain of the Ingress.config.openshift.io definition.

Additionally, the docs are a bit confusing. One page (https://docs.openshift.com/container-platform/4.13/networking/ingress-operator.html#nw-ingress-configuring-application-domain_configuring-ingress) defines it as follows:

As a cluster administrator, you can specify an alternative to the default cluster domain for user-created routes by configuring the appsDomain field. The appsDomain field is an optional domain for OpenShift Container Platform to use instead of the default, which is specified in the domain field. If you specify an alternative domain, it overrides the default cluster domain for the purpose of determining the default host for a new route.

For example, you can use the DNS domain for your company as the default domain for routes and ingresses for applications running on your cluster.

The API spec (https://docs.openshift.com/container-platform/4.11/rest_api/config_apis/ingress-config-openshift-io-v1.html#spec) explains the correct behaviour:

appsDomain is an optional domain to use instead of the one specified in the domain field when a Route is created without specifying an explicit host. If appsDomain is nonempty, this value is used to generate default host values for Route. Unlike domain, appsDomain may be modified after installation. This assumes a new ingresscontroller has been setup with a wildcard certificate.

It would be nice if the wording could be adjusted, because `you can specify an alternative to the default cluster domain for user-created routes by configuring` does not fit well: more or less all newly created routes (operator-created and so on) are created with the appsDomain.

Version-Release number of selected component (if applicable):

OpenShift 4.12.22

How reproducible:

see steps below

Steps to Reproduce:

1. Install OpenShift
2. Define .spec.appsDomain in Ingress.config.openshift.io
3. oc delete route canary -n openshift-ingress-canary
4. Wait a few seconds for the route to be recreated, then check the cluster operator

Actual results:

Ingress Operator degraded and route recreated with wrong domain (.spec.appsDomain)

Expected results:

Ingress Operator not degraded and route recreated with the correct domain (.spec.domain)

Additional info:

Please see screenshot
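
The expected behaviour for the canary route can be sketched as follows. This is only an illustration of the rule described above, not the ingress operator's code: the helper name `default_route_host`, the `is_canary` flag, the example domains and the `<name>-<namespace>.<domain>` host pattern are assumptions made for the example.

```python
from typing import Optional


def default_route_host(name: str, namespace: str, domain: str,
                       apps_domain: Optional[str] = None,
                       is_canary: bool = False) -> str:
    """Return the default host for a route created without an explicit host.

    Hypothetical helper: the canary route always uses .spec.domain, while
    other routes use .spec.appsDomain when it is set.
    """
    # Per the expected results, the canary route stays on .spec.domain
    # even when .spec.appsDomain is configured.
    base = domain if (is_canary or not apps_domain) else apps_domain
    return f"{name}-{namespace}.{base}"


if __name__ == "__main__":
    print(default_route_host("canary", "openshift-ingress-canary",
                             "apps.cluster.example.com",
                             apps_domain="apps.corp.example.com",
                             is_canary=True))
```

With this rule a user-created route still picks up the appsDomain, while the recreated canary route keeps the default ingress domain.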

Description of problem:

The PowerVS installer will have code which creates a new service instance during installation.  Therefore, we need to delete that service instance upon cluster deletion.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Create cluster
2. Delete cluster

Actual results:

No leftover service instance

Expected results:


Additional info:


Description of problem:

This may be something we want to either add a validation for or document. It was initially found at a customer site but I've also confirmed it happens with just a Compact config with no workers. 

They created an agent-config.yaml with 2 worker nodes but did not set the replicas in install-config.yaml, i.e. they did not set 
compute:
- hyperthreading: Enabled
  name: worker
  replicas: {{ num_workers }} 

This resulted in an install failure as by default 3 worker replicas are created if not defined
https://github.com/openshift/installer/blob/master/pkg/types/defaults/machinepools.go#L11

See the attached console screenshot showing that the expected number of hosts doesn't match the actual.

I've also duplicated this with a compact config. We can see that the install failed as start-cluster-installation.sh is looking for 6 hosts.

[core@master-0 ~]$ sudo systemctl status start-cluster-installation.service
● start-cluster-installation.service - Service that starts cluster installation
   Loaded: loaded (/etc/systemd/system/start-cluster-installation.service; enabled; vendor preset: enabled)
   Active: activating (start) since Wed 2023-03-15 14:40:04 UTC; 3min 41s ago
 Main PID: 3365 (start-cluster-i)
    Tasks: 5 (limit: 101736)
   Memory: 1.7M
   CGroup: /system.slice/start-cluster-installation.service
           ├─3365 /bin/bash /usr/local/bin/start-cluster-installation.sh
           ├─5124 /bin/bash /usr/local/bin/start-cluster-installation.sh
           ├─5132 /bin/bash /usr/local/bin/start-cluster-installation.sh
           └─5138 diff /tmp/tmp.vIq1jH9Vf2 /etc/issue.d/90_start-install.issue

Mar 15 14:42:54 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
Mar 15 14:43:04 master-0 start-cluster-installation.sh[4746]: Hosts known and ready for cluster installation (3/6)
Mar 15 14:43:04 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
Mar 15 14:43:15 master-0 start-cluster-installation.sh[4980]: Hosts known and ready for cluster installation (3/6)
Mar 15 14:43:15 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
Mar 15 14:43:25 master-0 start-cluster-installation.sh[5026]: Hosts known and ready for cluster installation (3/6)
Mar 15 14:43:25 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
Mar 15 14:43:35 master-0 start-cluster-installation.sh[5079]: Hosts known and ready for cluster installation (3/6)
Mar 15 14:43:35 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
Mar 15 14:43:45 master-0 start-cluster-installation.sh[5124]: Hosts known and ready for cluster installation (3/6)

Since the compute section in install-config.yaml is optional we can't assume that it will be there 
https://github.com/openshift/installer/blob/master/pkg/types/installconfig.go#L126
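
A validation along the lines suggested above could look like the following sketch. It assumes both files have already been parsed into dictionaries and that each agent-config host carries a `role` key; the function names and the error wording are hypothetical. The default of 3 worker replicas is the one cited in the report.

```python
DEFAULT_WORKER_REPLICAS = 3  # default cited in the report (machinepools.go)


def expected_worker_replicas(install_config: dict) -> int:
    """Worker count the installer would assume from install-config.yaml."""
    # The compute section is optional, so fall back to the default.
    for pool in install_config.get("compute") or []:
        if pool.get("name") == "worker" and pool.get("replicas") is not None:
            return int(pool["replicas"])
    return DEFAULT_WORKER_REPLICAS


def validate_worker_count(install_config: dict, agent_hosts: list) -> list:
    """Return a list of validation errors (empty when the counts agree)."""
    workers = [h for h in agent_hosts if h.get("role") == "worker"]
    expected = expected_worker_replicas(install_config)
    if len(workers) != expected:
        return [f"agent-config.yaml defines {len(workers)} worker host(s) but "
                f"install-config.yaml resolves to {expected} worker replica(s)"]
    return []
```

For the compact case in this report (three masters, no compute section) the check would flag the mismatch up front instead of leaving the installer waiting for six hosts.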

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1. Remove the compute section from install-config.yaml
2. Do an install
3. See the failure

Actual results:

 

Expected results:

 

Additional info:

 

After https://issues.redhat.com//browse/HOSTEDCP-1062, the `olm-collect-profiles` CronJob pods did not get the NeedManagementKASAccessLabel label and thus fail:

# oc logs olm-collect-profiles-28171952-2v8gn
Error: Get "https://172.29.0.1:443/api?timeout=32s": dial tcp 172.29.0.1:443: i/o timeout

Description of the problem:

Staging, BE v2.17.3 - Trying to install an OCP 4.13 Nutanix cluster fails with a "no ingress for host" error. Igal saw that the error is:

Warning  FailedScheduling  98m                 default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling..

Which comes from 

removeUninitializedTaint := false
if cluster.Platform != nil && *cluster.Platform.Type == models.PlatformTypeVsphere {
   removeUninitializedTaint = true
}

How reproducible:

 

Steps to reproduce:

1. 

2.

3.

Actual results:

 

Expected results:

Description of problem:

When deploying a whereabouts-IPAM-based additional network through the cluster-network-operator, the whereabouts-reconciler daemonset is not deployed on non-amd64 clusters due to a hard-coded nodeSelector introduced by https://github.com/openshift/cluster-network-operator/commit/be095d8c378e177d625a92aeca4e919ed0b5a14f

Version-Release number of selected component (if applicable):

4.13+

How reproducible:

Always. Tested on a connected arm64 AWS cluster using the openshift-sdn network

Steps to Reproduce:

1. oc new-project test1
2. oc patch networks.operator.openshift.io/cluster -p '{"spec":{"additionalNetworks":[{"name":"tertiary-net2","namespace":"test1","rawCNIConfig":"{\n  \"cniVersion\": \"0.3.1\",\n  \"name\": \"test\",\n  \"type\": \"macvlan\",\n  \"master\": \"bond0.100\",\n  \"ipam\": {\n    \"type\": \"whereabouts\",\n    \"range\": \"10.10.10.0/24\"\n  }\n}","type":"Raw"}],"useMultiNetworkPolicy":true}}' --type=merge
3. oc get daemonsets -n openshift-multus 

Actual results:

NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR              AGE
whereabouts-reconciler          0         0         0       0            0           kubernetes.io/arch=amd64   7m27s

Expected results:

No kubernetes.io/arch=amd64 node selector is set, so that non-amd64 and multi-arch compute clusters can schedule the daemonset on each node, regardless of the architecture.

Additional info:

Same problem on s390x

https://github.com/openshift/hypershift/pull/2437 created a coupling between the HyperShift Operator (HO) and the Control Plane Operator (CPO): a CPO that contains this PR crashes when deployed by an HO that does not.

The reason appears to be related to the absence of the OPENSHIFT_IMG_OVERRIDES envvar on the CPO deployment.

{"level":"info","ts":"2023-06-06T16:36:21Z","logger":"setup","msg":"Using CPO image","image":"registry.ci.openshift.org/ocp/4.14-2023-06-06-102645@sha256:2d81c28856f5c0a73e55e7cb6fbc208c738fb3ca7c200cc7eb46efb40c8e10d2"}
panic: runtime error: index out of range [1] with length 1

goroutine 1 [running]:
github.com/openshift/hypershift/support/util.ConvertImageRegistryOverrideStringToMap({0x0, 0x0})
        /hypershift/support/util/util.go:237 +0x454
main.NewStartCommand.func1(0xc000d80000, {0xc000a71180, 0x0, 0x8})
        /hypershift/control-plane-operator/main.go:345 +0x2225

      containers:
      - args:
        - run
        - --namespace
        - $(MY_NAMESPACE)
        - --deployment-name
        - control-plane-operator
        - --metrics-addr
        - 0.0.0.0:8080
        - --enable-ci-debug-output=false
        - --registry-overrides==
        command:
        - /usr/bin/control-plane-operator
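
The panic above (`index out of range [1] with length 1` on an empty input string) is consistent with splitting each entry on `=` and reading the second element without checking the length. A defensive parse can be sketched as below. The `source=mirror,source=mirror` format and the function name are assumptions for the illustration, and this is not the HyperShift implementation.

```python
def parse_registry_overrides(raw: str) -> dict:
    """Parse 'src1=dst1,src2=dst2' into a dict, tolerating empty input."""
    overrides = {}
    for entry in raw.split(","):
        source, sep, mirror = entry.partition("=")
        # Skip empty or malformed entries instead of indexing past the end,
        # which is what an unguarded split-and-index does for "".
        if not sep or not source or not mirror:
            continue
        overrides[source.strip()] = mirror.strip()
    return overrides
```

An unset environment variable then yields an empty mapping rather than a crash, which removes the ordering dependency between the two operators for this particular input.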

Description of problem:
Sometimes the oc-mirror command leaves large amounts of data under the /tmp directory, and the disk runs out of space.

Version-Release number of selected component (if applicable):
oc mirror version
4.12/4.13

How reproducible:
Always

Steps to Reproduce:
1. The detailed steps are not known, but see the logs from running the oc-mirror command:

Actual results:

[root@preserve-fedora36 588]# oc-mirror --config config.yaml docker://yinzhou-133.mirror-registry.qe.gcp.devcluster.openshift.com:5000 --dest-skip-tls
Checking push permissions for yinzhou-133.mirror-registry.qe.gcp.devcluster.openshift.com:5000
Creating directory: oc-mirror-workspace/src/publish
Creating directory: oc-mirror-workspace/src/v2
Creating directory: oc-mirror-workspace/src/charts
Creating directory: oc-mirror-workspace/src/release-signatures
No metadata detected, creating new workspace

The rendered catalog is invalid.

Run "oc-mirror list operators --catalog CATALOG-NAME --package PACKAGE-NAME" for more information.

error: error rendering new refs: render reference "registry.redhat.io/redhat/redhat-operator-index:v4.11": write /tmp/render-unpack-2866670795/tmp/cache/cache/red-hat-camel-k_latest_red-hat-camel-k-operator.v1.6.0.json: no space left on device
[root@preserve-fedora36 588]# cd /tmp/
[root@preserve-fedora36 tmp]# ls
imageset-catalog-registry-333402727  render-unpack-2230547823

Expected results:
Always delete the data created under /tmp, in every situation (including when the command fails).
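
The general pattern for the expected behaviour is to tie the scratch directory to a scope that is cleaned up on both success and failure. The sketch below shows the idea with Python's standard library; oc-mirror itself is written in Go, and `render_in_scratch_dir` and the `render` callback are hypothetical names used only for the illustration.

```python
import os
import tempfile


def render_in_scratch_dir(ref, render):
    """Call render(ref, workdir) in a scratch directory that is always removed."""
    # TemporaryDirectory removes the directory when the `with` block exits,
    # whether render() returns normally or raises (for example on ENOSPC).
    with tempfile.TemporaryDirectory(prefix="render-unpack-") as workdir:
        return render(ref, workdir)


if __name__ == "__main__":
    seen = []

    def failing_render(ref, workdir):
        seen.append(workdir)
        raise OSError("no space left on device")

    try:
        render_in_scratch_dir("example-index:v4.11", failing_render)
    except OSError as err:
        print(f"render failed: {err}; scratch dir still present: "
              f"{os.path.exists(seen[0])}")
```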

Additional info:

Description of problem:

Tests like lint and vet used to be run within a container engine by default if an engine was detected, both locally and in CI. Up until now no container engine was detected in CI, so tests would run natively there.

Now that the base image we use in CI has started shipping `podman`, a container engine is detected by default and tests are run within podman by default. But creating nested containers doesn't work in CI at the moment and thus results in a test failure.

As such we are switching the default behaviour for tests (both locally and in CI): by default no container engine is used to run tests, even if one is detected; instead tests are run natively unless otherwise specified.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

We merged a change into origin to modify a test so that `/readyz` would be used as the health check path. It turns out this makes things worse because we want to use kube-proxy's health probe endpoint to monitor the node health, and kube-proxy only exposes `/healthz` which is the default path anyway.

We should remove the annotation added to change the path and go back to the defaults.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

In an IPv6 environment using DHCP, it may not be possible to configure a rendezvousIP that matches the actual address. This is because by default NetworkManager uses DUID-UUIDs as the Client ID in the IPv6 DHCP Solicitation (see https://datatracker.ietf.org/doc/html/rfc6355), and these are machine dependent. As a result, the DHCPv6 server cannot be configured with a pre-determined Client ID/IPv6 Address pair that matches the rendezvousIP, and the nodes will be assigned random IPv6 addresses from the pool of DHCP addresses.

We can see the flow here (the DUID-UUID has a 00:04 prefix)

DHCPSOLICIT(ostestbm) 00:04:56:d2:b1:0b:ba:ef:8c:1a:00:58:3f:ed:e5:d3:5f:85

The DHCP server therefore assigns a new address from the pool, fd2e:6f44:5dd8:c956::32 in this case:
DHCPREPLY(ostestbm) fd2e:6f44:5dd8:c956::32 00:04:56:d2:b1:0b:ba:ef:8c:1a:00:58:3f:ed:e5:d3:5f:85

NetworkManager needs to be configured to use a deterministic Client ID so that a reliable Client ID/IPv6 address pair can be added to a DHCP server. The best way to do this is to configure NM with dhcp-duid=ll so that it uses a DUID-LL, which is based on the interface MAC address. This is the approach taken by Baremetal IPI in https://github.com/openshift/machine-config-operator/pull/1395
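
To show why a DUID-LL is predictable, the sketch below derives one from a MAC address. A DUID-LL is the 2-byte DUID type 3, followed by a 2-byte hardware type (1 for Ethernet) and the link-layer address, so it can be computed before the node ever boots. The helper name `duid_ll` and the MAC address in the usage note are made up for the example.

```python
def duid_ll(mac: str, hardware_type: int = 1) -> str:
    """Build a DUID-LL (DUID type 3) from a MAC address.

    Layout: 2-byte DUID type (0x0003), 2-byte hardware type
    (1 = Ethernet), then the link-layer address.
    """
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected a 6-octet MAC address, got {mac!r}")
    raw = (3).to_bytes(2, "big") + hardware_type.to_bytes(2, "big") + octets
    return ":".join(f"{b:02x}" for b in raw)
```

For example, `duid_ll("52:54:00:12:34:56")` gives `00:03:00:01:52:54:00:12:34:56`, a value that could be registered on the DHCPv6 server together with the intended rendezvousIP.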

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

Every time

Steps to Reproduce:

1. In an IPv6 environment set up agent-config.yaml with an expected IPv6 address and create the ISO
2. It's not possible to configure the DHCP server to assign this address since the Client ID that Node0 will use is unknown
3. Boot the nodes using the created ISO. The nodes will get IPv6 addresses from the DHCP server but it's not possible to access the RendezvousIP

Actual results:

 

Expected results:

 

Additional info:

 

It is possible, due to the way that the UI is currently implemented, that a user may be able to submit a manifest with no content.
We need to filter manifests before they are applied to ensure that any manifests that are empty (lack at least one key/value) are not applied.

A good suggested location to look at might be

https://github.com/openshift/assisted-service/blob/master/internal/ignition/ignition.go#L402-L409
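
The filter described above can be sketched as follows. To stay within the standard library the example only inspects JSON manifests and leaves anything it cannot parse (for example YAML) untouched; the function names and the mapping of file name to content are assumptions for the illustration, not the assisted-service data model.

```python
import json


def is_empty_manifest(content: str) -> bool:
    """True when a JSON manifest carries no key/value pair at all."""
    if not content.strip():
        return True
    try:
        doc = json.loads(content)
    except json.JSONDecodeError:
        # Not JSON (for example YAML): leave the decision to a real parser.
        return False
    return not (isinstance(doc, dict) and len(doc) > 0)


def filter_manifests(manifests: dict) -> dict:
    """Drop empty manifests; keys are file names, values are file contents."""
    return {name: body for name, body in manifests.items()
            if not is_empty_manifest(body)}
```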

Description of problem:

When installing OCP in a disconnected network which doesn’t have access to the public registry, bootkube.service failed

Version-Release number of selected component (if applicable):

from 4.14.0-0.nightly-2023-04-29-153308

How reproducible:

Always

Steps to Reproduce:

1. Prepare a VPC that doesn’t have access to the Internet, set up a mirror registry inside the VPC, and set the related ImageContentSource in the install-config
2. Start the installation
3.

Actual results:

Failed when provisioning masters as it couldn’t get master ignition from bootstrap

May 04 07:31:56 maxu-az-dis-6d74v-bootstrap bootkube.sh[246724]: error: unable to read image registry.ci.openshift.org/ocp/release@sha256:227a73d8ff198a55ca0d3314d8fa94835d90769981d1c951ac741b82285f99fc: Get "https://registry.ci.openshift.org/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
May 04 07:31:56 maxu-az-dis-6d74v-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
May 04 07:31:56 maxu-az-dis-6d74v-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.

Expected results:

Installation succeeded. 

Additional info:

In a disconnected install, we use the ICSP to pull images from the mirror registry, but bootkube.service was still trying to access the public registry. Checking the change log of bootkube.sh.template, this appears to be a regression from https://github.com/openshift/installer/pull/6990, which uses “oc adm release info -o 'jsonpath={.metadata.version}' "${RELEASE_IMAGE_DIGEST}"” to get the current OCP version in this scenario.

Description of problem:

After configuring a custom toleration for the DNS pods (so that they no longer tolerate the master node taint), a DNS pod is stuck in the Pending state

Version-Release number of selected component (if applicable):

 

How reproducible:

https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-41050

Steps to Reproduce:

1.melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-05-03-163151   True        False         4h5m    Cluster version is 4.14.0-0.nightly-2023-05-03-163151
2.check default dns pods placement
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pod -owide
NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE                                                       NOMINATED NODE   READINESS GATES
dns-default-6cv9k     2/2     Running   0          4h12m   10.131.0.8    shudi-gcp4h-whdkl-worker-a-qnvjw.c.openshift-qe.internal   <none>           <none>
dns-default-8g2w8     2/2     Running   0          4h12m   10.129.2.5    shudi-gcp4h-whdkl-worker-c-b8qrq.c.openshift-qe.internal   <none>           <none>
dns-default-df7zj     2/2     Running   0          4h18m   10.128.0.40   shudi-gcp4h-whdkl-master-1.c.openshift-qe.internal         <none>           <none>
dns-default-kmv4c     2/2     Running   0          4h18m   10.130.0.9    shudi-gcp4h-whdkl-master-2.c.openshift-qe.internal         <none>           <none>
dns-default-lxxkt     2/2     Running   0          4h18m   10.129.0.11   shudi-gcp4h-whdkl-master-0.c.openshift-qe.internal         <none>           <none>
dns-default-mjrnx     2/2     Running   0          4h11m   10.128.2.4    shudi-gcp4h-whdkl-worker-b-scqdh.c.openshift-qe.internal   <none>           <none>
node-resolver-5bnjv   1/1     Running   0          4h12m   10.0.128.3    shudi-gcp4h-whdkl-worker-a-qnvjw.c.openshift-qe.internal   <none>           <none>
node-resolver-7ns8b   1/1     Running   0          4h18m   10.0.0.4      shudi-gcp4h-whdkl-master-1.c.openshift-qe.internal         <none>           <none>
node-resolver-bz7k5   1/1     Running   0          4h12m   10.0.128.2    shudi-gcp4h-whdkl-worker-c-b8qrq.c.openshift-qe.internal   <none>           <none>
node-resolver-c67mw   1/1     Running   0          4h18m   10.0.0.3      shudi-gcp4h-whdkl-master-2.c.openshift-qe.internal         <none>           <none>
node-resolver-d8h65   1/1     Running   0          4h12m   10.0.128.4    shudi-gcp4h-whdkl-worker-b-scqdh.c.openshift-qe.internal   <none>           <none>
node-resolver-rgb92   1/1     Running   0          4h18m   10.0.0.5      shudi-gcp4h-whdkl-master-0.c.openshift-qe.internal         <none>           <none>

3. oc -n openshift-dns get ds/dns-default -oyaml
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists

melvinjoseph@mjoseph-mac Downloads % oc get dns.operator default -oyaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  creationTimestamp: "2023-05-08T00:39:00Z"
  finalizers:
  - dns.operator.openshift.io/dns-controller
  generation: 1
  name: default
  resourceVersion: "22893"
  uid: ae53e756-42a3-4c9d-8284-524df006382d
spec:
  cache:
    negativeTTL: 0s
    positiveTTL: 0s
  logLevel: Normal
  nodePlacement: {}
  operatorLogLevel: Normal
  upstreamResolvers:
    policy: Sequential
    transportConfig: {}
    upstreams:
    - port: 53
      type: SystemResolvConf
status:
  clusterDomain: cluster.local
  clusterIP: 172.30.0.10
  conditions:
  - lastTransitionTime: "2023-05-08T00:46:20Z"
    message: Enough DNS pods are available, and the DNS service has a cluster IP address.
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2023-05-08T00:46:20Z"
    message: All DNS and node-resolver pods are available, and the DNS service has
      a cluster IP address.
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2023-05-08T00:39:25Z"
    message: The DNS daemonset has available pods, and the DNS service has a cluster
      IP address.
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2023-05-08T00:39:01Z"
    message: DNS Operator can be upgraded
    reason: AsExpected
    status: "True"
    type: Upgradeable


4. config custom tolerations of dns pod (to not tolerate master node taints)
 $ oc edit dns.operator default
 spec:
   nodePlacement:
     tolerations:
     - effect: NoExecute
       key: my-dns-test
       operators: Equal
       value: abc
       tolerationSeconds: 3600 
melvinjoseph@mjoseph-mac Downloads % oc edit dns.operator default
Warning: unknown field "spec.nodePlacement.tolerations[0].operators"
dns.operator.openshift.io/default edited
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pod -owide
NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE                                                       NOMINATED NODE   READINESS GATES
dns-default-6cv9k     2/2     Running   0          5h16m   10.131.0.8    shudi-gcp4h-whdkl-worker-a-qnvjw.c.openshift-qe.internal   <none>           <none>
dns-default-8g2w8     2/2     Running   0          5h16m   10.129.2.5    shudi-gcp4h-whdkl-worker-c-b8qrq.c.openshift-qe.internal   <none>           <none>
dns-default-df7zj     2/2     Running   0          5h22m   10.128.0.40   shudi-gcp4h-whdkl-master-1.c.openshift-qe.internal         <none>           <none>
dns-default-kmv4c     2/2     Running   0          5h22m   10.130.0.9    shudi-gcp4h-whdkl-master-2.c.openshift-qe.internal         <none>           <none>
dns-default-lxxkt     2/2     Running   0          5h22m   10.129.0.11   shudi-gcp4h-whdkl-master-0.c.openshift-qe.internal         <none>           <none>
dns-default-mjrnx     2/2     Running   0          5h16m   10.128.2.4    shudi-gcp4h-whdkl-worker-b-scqdh.c.openshift-qe.internal   <none>           <none>
dns-default-xqxr9     0/2     Pending   0          7s      <none>        <none>                                                     <none>           <none>
node-resolver-5bnjv   1/1     Running   0          5h17m   10.0.128.3    shudi-gcp4h-whdkl-worker-a-qnvjw.c.openshift-qe.internal   <none>           <none>
node-resolver-7ns8b   1/1     Running   0          5h22m   10.0.0.4      shudi-gcp4h-whdkl-master-1.c.openshift-qe.internal         <none>           <none>
node-resolver-bz7k5   1/1     Running   0          5h16m   10.0.128.2    shudi-gcp4h-whdkl-worker-c-b8qrq.c.openshift-qe.internal   <none>           <none>
node-resolver-c67mw   1/1     Running   0          5h22m   10.0.0.3      shudi-gcp4h-whdkl-master-2.c.openshift-qe.internal         <none>           <none>
node-resolver-d8h65   1/1     Running   0          5h16m   10.0.128.4    shudi-gcp4h-whdkl-worker-b-scqdh.c.openshift-qe.internal   <none>           <none>
node-resolver-rgb92   1/1     Running   0          5h22m   10.0.0.5      shudi-gcp4h-whdkl-master-0.c.openshift-qe.internal         <none>           <none>


The DNS pod is stuck in the Pending state

melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get ds/dns-default -oyaml
<-----snip--->
      tolerations:
      - effect: NoExecute
        key: my-dns-test
        tolerationSeconds: 3600
        value: abc
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: Corefile
            path: Corefile
          name: dns-default
        name: config-volume
      - name: metrics-tls
        secret:
          defaultMode: 420
          secretName: dns-default-metrics-tls
  updateStrategy:
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 3
  numberReady: 3
  observedGeneration: 2


melvinjoseph@mjoseph-mac Downloads % oc get dns.operator default -oyaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  creationTimestamp: "2023-05-08T00:39:00Z"
  finalizers:
  - dns.operator.openshift.io/dns-controller
  generation: 2
  name: default
  resourceVersion: "125435"
  uid: ae53e756-42a3-4c9d-8284-524df006382d
spec:
  cache:
    negativeTTL: 0s
    positiveTTL: 0s
  logLevel: Normal
  nodePlacement:
    tolerations:
    - effect: NoExecute
      key: my-dns-test
      tolerationSeconds: 3600
      value: abc
  operatorLogLevel: Normal
  upstreamResolvers:
    policy: Sequential
    transportConfig: {}
    upstreams:
    - port: 53
      type: SystemResolvConf
status:
  clusterDomain: cluster.local
  clusterIP: 172.30.0.10
  conditions:
  - lastTransitionTime: "2023-05-08T00:46:20Z"
    message: Enough DNS pods are available, and the DNS service has a cluster IP address.
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2023-05-08T06:01:52Z"
    message: Have 0 up-to-date DNS pods, want 3.
    reason: Reconciling
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-05-08T00:39:25Z"
    message: The DNS daemonset has available pods, and the DNS service has a cluster
      IP address.
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2023-05-08T00:39:01Z"
    message: DNS Operator can be upgraded
    reason: AsExpected
    status: "True"
    type: Upgradeable


melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pod                  
NAME                  READY   STATUS    RESTARTS   AGE
dns-default-6cv9k     2/2     Running   0          5h18m
dns-default-8g2w8     2/2     Running   0          5h18m
dns-default-df7zj     2/2     Running   0          5h25m
dns-default-kmv4c     2/2     Running   0          5h25m
dns-default-lxxkt     2/2     Running   0          5h25m
dns-default-mjrnx     2/2     Running   0          5h18m
dns-default-xqxr9     0/2     Pending   0          2m12s
node-resolver-5bnjv   1/1     Running   0          5h19m
node-resolver-7ns8b   1/1     Running   0          5h25m
node-resolver-bz7k5   1/1     Running   0          5h19m
node-resolver-c67mw   1/1     Running   0          5h25m
node-resolver-d8h65   1/1     Running   0          5h19m
node-resolver-rgb92   1/1     Running   0          5h25m

Actual results:

The DNS pod dns-default-xqxr9 is stuck in the Pending state

Expected results:

The DNS pods are reloaded

Additional info:

melvinjoseph@mjoseph-mac Downloads % oc describe po/dns-default-xqxr9  -n openshift-dns
Name:                 dns-default-xqxr9
Namespace:            openshift-dns
Priority:             2000001000


<----snip--->
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 my-dns-test=abc:NoExecute for 3600s
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m45s  default-scheduler  0/6 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 1 Preemption is not helpful for scheduling, 2 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) didn't match Pod's node affinity/selector..
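
The scheduling failure follows from how tolerations are matched against taints. The sketch below models the matching rules in simplified form (it is not the scheduler's code, and the function names are hypothetical). In the report the misspelled `operators` field was dropped, so the default operator `Equal` applies, and the custom toleration replaced the default one for `node-role.kubernetes.io/master`.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    # An empty key (only meaningful with operator Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint.get("key"):
        return False
    operator = toleration.get("operator") or "Equal"  # Equal is the default
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: list, taints: list) -> bool:
    """A pod fits a node only if every NoSchedule/NoExecute taint is tolerated."""
    hard = [t for t in taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)
```

With only the `my-dns-test=abc:NoExecute` toleration, `schedulable` is false for a node carrying the `node-role.kubernetes.io/master` NoSchedule taint, which matches the "untolerated taint" event above.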

Description of problem:

The upgrade Helm Release tab in OpenShift GUI Developer console is not refreshing with updated values.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Add below Helm chart repository from CLI

~~~
apiVersion: helm.openshift.io/v1beta1
kind: HelmChartRepository
metadata:
  name: prometheus-community
spec:
  connectionConfig:
    url: 'https://prometheus-community.github.io/helm-charts'
  name: prometheus-community
~~~
2. Go to the GUI and select Developer console --> +Add --> Developer Catalog --> Helm Chart --> Select Prometheus Helm chart --> Install Helm chart --> From dropdown of chart version select 22.3.0 --> Install

3. You will see the image tag as v0.63.0
~~~
    image:
      digest: ''
      pullPolicy: IfNotPresent
      repository: quay.io/prometheus-operator/prometheus-config-reloader
      tag: v0.63.0
~~~ 
4. Once that is installed, go to Helm --> Helm Releases --> Prometheus --> Upgrade --> From dropdown of chart version select 22.4.0 --> the page does not refresh with the new value of the tag.

~~~
    image:
      digest: ''
      pullPolicy: IfNotPresent
      repository: quay.io/prometheus-operator/prometheus-config-reloader
      tag: v0.63.0
~~~

NOTE: With the same steps before installing the Helm chart, the value is updated when we select different versions.
Go to the GUI and select Developer console --> +Add --> Developer Catalog --> Helm Chart --> Select Prometheus Helm chart --> Install Helm chart --> From dropdown of chart version select 22.3.0 --> Now select different chart version like 22.7.0 or 22.4.0

Actual results:

The YAML view of the Upgrade Helm Release tab shows the values of the older chart version.

Expected results:

The yaml view of Upgrade Helm Release tab should contain latest values as per selected chart version.

Additional info:

 

Description of problem:

A customer upgraded an AWS cluster from 4.8 to 4.9. Everything updated successfully, but when checking co/storage.status.versions, the AWSEBSCSIDriverOperator version is still listed with the previous version:
$ oc get co storage -o json | jq .status.versions
[
  {
    "name": "operator",
    "version": "4.9.50"
  },
  {
    "name": "AWSEBSCSIDriverOperator",
    "version": "4.8.48"
  }
]

From 4.9 on, it seems the CSO no longer reports the CSIDriverOperator version, so the stale (incorrect) previous CSIDriverOperator version should be cleaned up in this case.

Version-Release number of selected component (if applicable):

upgrade from 4.8.48 to 4.9.50

How reproducible:

Always

Steps to Reproduce:

1. Install AWS cluster with 4.8
2. Upgrade cluster to 4.9
3. Check co/storage.status.versions  

Actual results:

[ { "name": "operator", "version": "4.9.50" }, { "name": "AWSEBSCSIDriverOperator", "version": "4.8.48" } ]

Expected results:

From 4.9 on, it seems the CSO no longer reports the CSIDriverOperator version, so the stale (incorrect) previous CSIDriverOperator version should be cleaned up.
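
One way to express the cleanup is to rebuild `status.versions` from the operands that are still reported, so entries that are no longer reported do not survive an upgrade. This is a sketch under assumptions: `prune_versions` and its arguments are hypothetical and do not reflect the cluster-storage-operator code.

```python
def prune_versions(current: list, reported: dict) -> list:
    """Rebuild status.versions from the operands that are still reported.

    `current` is the existing list of {"name", "version"} entries and
    `reported` maps operand name -> version for this reconcile.
    """
    # Entries whose name is no longer reported (for example a stale
    # AWSEBSCSIDriverOperator entry) are dropped rather than carried over.
    kept = [name for name in (e["name"] for e in current) if name in reported]
    # Preserve the existing order, then append operands that are new.
    names = kept + [n for n in reported if n not in kept]
    return [{"name": n, "version": reported[n]} for n in names]
```

Applied to the output in this report with only `operator` reported at 4.9.50, the stale 4.8.48 entry disappears.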

Additional info:

 

Description of problem:

Bump Kubernetes to 0.27.1 and bump dependencies

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

https://search.ci.openshift.org/?search=error%3A+tag+latest+failed%3A+Internal+error+occurred%3A+registry.centos.org&maxAge=48h&context=1&type=build-log&name=okd&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

all currently tested versions

How reproducible:

~ 9% of jobs fail on this test

 

 ! error: Import failed (InternalError): Internal error occurred: registry.centos.org/dotnet/dotnet-31-runtime-centos7:latest: Get "https://registry.centos.org/v2/": dial tcp: lookup registry.centos.org on 172.30.0.10:53: no such host   782 31 minutes ago 

 

Description of problem:

 

The customer used the Agent-based Installer to install 4.13.8 in their CID environment, but during the install process the bootstrap machine hit an OOM issue. The sosreport shows that the init container was OOM-killed.

NOTE: The issue is not seen when testing with 4.13.6, per the customer

initContainers:
- name: machine-config-controller
  image: .Images.MachineConfigOperator
  command: ["/usr/bin/machine-config-controller"]
  args:
  - "bootstrap"
  - "--manifest-dir=/etc/mcc/bootstrap"
  - "--dest-dir=/etc/mcs/bootstrap"
  - "--pull-secret=/etc/mcc/bootstrap/machineconfigcontroller-pull-secret"
  - "--payload-version=.ReleaseVersion"
  resources:
    limits:
      memory: 50Mi

In the sosreport, the dmesg and CRI-O logs show that the machine-config-controller container was OOM-killed. The kill was triggered by the cgroup limit, so it looks like the 50Mi limit is too small.

The customer used a physical machine that had 100GB of memory

The customer had some network configuration in the assisted installer YAML file; maybe the issue is related to their NIC configuration?

log files:
1. sosreport
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/b5501734-60be-4de4-adcf-da57e22cbb8e?usePresignedUrl=true

2. assisted installer YAML file
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/a32635cf-112d-49ed-828c-4501e95a0e7a?usePresignedUrl=true

3. bootstrap machine oom screenshot
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/eefe2e57-cd23-4abd-9e0b-dd45f20a34d2?usePresignedUrl=true

Description of problem:

Machine creation should fail when availabilityZone and the subnet ID do not match.
Currently the machine is created successfully despite the mismatch, and the CPMS cannot be recreated after it is deleted.
In addition, when the subnet is specified by filter and availabilityZone does not match the filter, machine creation fails.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-01-31-072358

How reproducible:

always

Steps to Reproduce:

1. Create a machineset whose availabilityZone and subnet ID do not match. For example, availabilityZone is us-east-2a, but the subnet ID belongs to us-east-2b.

          placement:
            availabilityZone: us-east-2a
            region: us-east-2
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - huliu-aws1w-nk5xd-worker-sg
          subnet:
            id: subnet-0107b4d7cfa35eb9b 

2. The machine is created successfully in the us-east-2b zone.
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                                PHASE     TYPE         REGION      ZONE         AGE
huliu-aws1w-nk5xd-master-0                          Running   m6i.xlarge   us-east-2   us-east-2a   62m
huliu-aws1w-nk5xd-master-1                          Running   m6i.xlarge   us-east-2   us-east-2b   62m
huliu-aws1w-nk5xd-master-2                          Running   m6i.xlarge   us-east-2   us-east-2a   62m
huliu-aws1w-nk5xd-windows-worker-us-east-2a-689vq   Running   m5a.large    us-east-2   us-east-2b   37m
huliu-aws1w-nk5xd-windows-worker-us-east-2a-nf9dl   Running   m5a.large    us-east-2   us-east-2b   37m
huliu-aws1w-nk5xd-worker-us-east-2a-8kpht           Running   m6i.xlarge   us-east-2   us-east-2a   59m
huliu-aws1w-nk5xd-worker-us-east-2a-dmtlc           Running   m6i.xlarge   us-east-2   us-east-2a   59m
huliu-aws1w-nk5xd-worker-us-east-2b-kdn75           Running   m6i.xlarge   us-east-2   us-east-2b   59m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml |grep "id: subnet"
          id: subnet-0fef0e9e255742f3a
          id: subnet-0107b4d7cfa35eb9b
          id: subnet-0fef0e9e255742f3a
          id: subnet-0107b4d7cfa35eb9b
          id: subnet-0107b4d7cfa35eb9b
          id: subnet-0fef0e9e255742f3a
          id: subnet-0fef0e9e255742f3a
          id: subnet-0107b4d7cfa35eb9b 

Actual results:

The machine is created successfully in the zone that the subnet ID belongs to; in this case it is created in us-east-2b.

huliu-aws1w-nk5xd-windows-worker-us-east-2a-689vq   Running   m5a.large    us-east-2   us-east-2b   37m
huliu-aws1w-nk5xd-windows-worker-us-east-2a-nf9dl   Running   m5a.large    us-east-2   us-east-2b   37m

Expected results:

Machine creation should fail because availabilityZone and the subnet ID do not match.
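
The expected check can be sketched as follows. This is only an illustration of the validation being asked for, not the machine-api implementation: `SUBNET_ZONES`, `Placement`, and `validate_placement` are hypothetical names, and the lookup table stands in for a real query to AWS. The subnet-to-zone pairs are the ones visible in the `oc get machine` output above.

```python
from dataclasses import dataclass

# Hypothetical lookup table standing in for a subnet query against AWS.
# The pairs below are taken from the `oc get machine` output in this report.
SUBNET_ZONES = {
    "subnet-0fef0e9e255742f3a": "us-east-2a",
    "subnet-0107b4d7cfa35eb9b": "us-east-2b",
}


@dataclass
class Placement:
    availability_zone: str
    subnet_id: str


def validate_placement(placement, subnet_zones=SUBNET_ZONES):
    """Return an error string on mismatch, or None when the placement is consistent."""
    zone = subnet_zones.get(placement.subnet_id)
    if zone is None:
        return f"subnet {placement.subnet_id} not found"
    if zone != placement.availability_zone:
        return (
            f"subnet {placement.subnet_id} is in {zone}, "
            f"but availabilityZone is {placement.availability_zone}"
        )
    return None
```

With the machineset from step 1, `validate_placement(Placement("us-east-2a", "subnet-0107b4d7cfa35eb9b"))` returns a mismatch message instead of letting the machine land in us-east-2b.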

Additional info:

1. When the subnet is specified by filter and availabilityZone does not match the filter, machine creation fails.

huliu-aws1w2-x2tnx-worker-2-m4r8m            Failed                                          4s 
liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-aws1w2-x2tnx-worker-2-m4r8m  -o yaml
…
      placement:
        availabilityZone: us-east-2a
        region: us-east-2
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - huliu-aws1w2-x2tnx-worker-sg
      spotMarketOptions: {}
      subnet:
        filters:
        - name: tag:Name
          values:
          - huliu-aws1w2-x2tnx-private-us-east-2c
      tags:
      - name: kubernetes.io/cluster/huliu-aws1w2-x2tnx
        value: owned
      userDataSecret:
        name: worker-user-data
status:
  conditions:
  - lastTransitionTime: "2023-02-01T02:45:52Z"
    status: "True"
    type: Drainable
  - lastTransitionTime: "2023-02-01T02:45:52Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  - lastTransitionTime: "2023-02-01T02:45:52Z"
    status: "True"
    type: Terminable
  errorMessage: 'error getting subnet IDs: no subnet IDs were found'
  errorReason: InvalidConfiguration
  lastUpdated: "2023-02-01T02:45:53Z"
  phase: Failed
  providerStatus:
    conditions:
    - lastTransitionTime: "2023-02-01T02:45:53Z"
      message: 'error getting subnet IDs: no subnet IDs were found'
      reason: MachineCreationFailed
      status: "False"
      type: MachineCreation

2. In this case, where the machine is created successfully despite the availabilityZone and subnet ID mismatch, the CPMS cannot be recreated after it is deleted.

liuhuali@Lius-MacBook-Pro huali-test % oc delete controlplanemachineset cluster 
controlplanemachineset.machine.openshift.io "cluster" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset                                
No resources found in openshift-machine-api namespace.

I0201 02:11:07.850022       1 http.go:143] controller-runtime/webhook/webhooks "msg"="wrote response" "UID"="12f118c4-fafe-45f9-bd24-876abdb8ba83" "allowed"=false "code"=403 "reason"="spec.template.machines_v1beta1_machine_openshift_io.failureDomains: Forbidden: no control plane machine is using specified failure domain(s) [AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:ID, Value:subnet-0107b4d7cfa35eb9b}}], failure domain(s) [AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:ID, Value:subnet-0fef0e9e255742f3a}}] are duplicated within the control plane machines, please correct failure domains to match control plane machines" "webhook"="/validate-machine-openshift-io-v1-controlplanemachineset"
I0201 02:11:07.850787       1 controller.go:144]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="767c4631-ed83-47da-b316-29a21cdba245"
E0201 02:11:07.850828       1 controller.go:326]  "msg"="Reconciler error" "error"="error reconciling control plane machine set: unable to create control plane machine set: unable to create control plane machine set: admission webhook \"controlplanemachineset.machine.openshift.io\" denied the request: spec.template.machines_v1beta1_machine_openshift_io.failureDomains: Forbidden: no control plane machine is using specified failure domain(s) [AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:ID, Value:subnet-0107b4d7cfa35eb9b}}], failure domain(s) [AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:ID, Value:subnet-0fef0e9e255742f3a}}] are duplicated within the control plane machines, please correct failure domains to match control plane machines" "controller"="controlplanemachinesetgenerator" "reconcileID"="767c4631-ed83-47da-b316-29a21cdba245"

Description of problem:

With the recent update in the logic for considering a CPMS replica Ready only when both the backing Machine is running and the backing Node is Ready: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/171, we now need to watch nodes at all times to detect nodes transitioning in readiness.

The majority of occurrences of this issue have been fixed with: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/177 (https://issues.redhat.com//browse/OCPBUGS-10032) but we also need to watch the control plane nodes at steady state (when they are already Ready), to notice if they go UnReady at any point, as relying on control plane machine events is not enough (they might be Running, while the Node has transitioned to NotReady).
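
The readiness rule described above can be sketched as a small predicate. This is an illustrative model only, not the operator's code: `Replica`, `replica_ready`, and `ready_count` are hypothetical names, and the phases are plain strings.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    machine_phase: str  # e.g. "Running", "Provisioning", "Failed"
    node_ready: bool    # status of the backing Node's Ready condition


def replica_ready(replica):
    """A replica counts as Ready only if the Machine is Running AND the Node is Ready."""
    return replica.machine_phase == "Running" and replica.node_ready


def ready_count(replicas):
    """Number of replicas that satisfy both conditions."""
    return sum(1 for r in replicas if replica_ready(r))
```

Because the result depends on the Node as well as the Machine, a controller that reacts only to Machine events would miss the case where a Machine stays Running while its Node becomes NotReady, which is why Node events also have to be watched at steady state.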

Version-Release number of selected component (if applicable):

4.13, 4.14

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The Topology UI does not recognize Serverless Rust functions and therefore does not show the proper icon for them

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Always

Steps to Reproduce:

1. Deploy 3 KNative/Serverless functions: Quarkus, Spring Boot, Rust
2. Observe in the Topology UI that specific icons are used only for Quarkus and Spring Boot, while the Rust function gets the regular OpenShift icon
3. Check each of the presented UI snippets/rectangles and note the related labels:
For Quarkus: 
app.openshift.io/runtime=quarkus
function.knative.dev/runtime=rust

For Spring Boot:
app.openshift.io/runtime=spring-boot
function.knative.dev/runtime=springboot

For Rust:
function.knative.dev/runtime=rust (app.openshift.io/runtime=rust is not present)

Actual results:

No specific UI icon for Rust function

Expected results:

Specific UI icon for Rust function
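
One way to get an icon in the Rust case is to fall back to the function runtime label when the app runtime label is absent. The sketch below only illustrates that lookup order; it is an assumption, not the console's actual fix, and `runtime_for_icon` is a hypothetical helper. The two label keys are the ones listed in the steps above.

```python
APP_RUNTIME_LABEL = "app.openshift.io/runtime"
FUNCTION_RUNTIME_LABEL = "function.knative.dev/runtime"


def runtime_for_icon(labels):
    """Prefer the app runtime label; fall back to the function runtime label."""
    return labels.get(APP_RUNTIME_LABEL) or labels.get(FUNCTION_RUNTIME_LABEL)
```

For the Rust function, `runtime_for_icon({"function.knative.dev/runtime": "rust"})` yields `"rust"`, which the UI could then map to a Rust icon.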

Additional info:

 

Description of problem:

Currently, HyperShift squashes any user-configured proxy configuration based on these lines: https://github.com/openshift/hypershift/blob/main/support/globalconfig/proxy.go#L21-L28, https://github.com/openshift/hypershift/blob/release-4.11/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L487-L493. Because of this, any user changes to the cluster-wide proxy configuration documented here: https://docs.openshift.com/container-platform/4.12/networking/enable-cluster-wide-proxy.html are squashed and remain in effect for no more than a few seconds. This blocks some functionality in the OpenShift cluster from working, including application builds from the OpenShift samples provided in the cluster.

 

Version-Release number of selected component (if applicable):

4.13 4.12 4.11

How reproducible:

100%

Steps to Reproduce:

1. Make a change to the Proxy object in the cluster with kubectl edit proxy cluster
2. Save the change
3. Wait a few seconds

Actual results:

The HostedClusterConfig operator overwrites (squashes) the value

Expected results:

The value the user provides remains in the configuration and is not squashed to an empty value
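
The expected behaviour can be illustrated with a merge in which user-set values win and managed values only fill gaps. This is a sketch under that assumption, not the HyperShift reconcile logic; `reconcile_proxy_spec` is a hypothetical helper, and the proxy URL in the usage note is a placeholder. `httpProxy` and `noProxy` are fields of the cluster-wide Proxy spec.

```python
def reconcile_proxy_spec(current, managed):
    """Return the spec to write: user-set values win, managed values fill gaps."""
    result = dict(current)
    for key, value in managed.items():
        # Only fill fields the user left unset or empty; never overwrite.
        if not result.get(key):
            result[key] = value
    return result
```

For example, `reconcile_proxy_spec({"httpProxy": "http://proxy.example.com:3128"}, {"httpProxy": "", "noProxy": ".cluster.local"})` keeps the user's `httpProxy` and adds the managed `noProxy`, rather than resetting `httpProxy` to an empty value.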

Additional info:

 

Description of problem:

In the awsendpointservice CR, AWSEndpointAvailable is still true after the endpoint is deleted in the AWS console, and AWSEndpointServiceAvailable is still true after the endpoint service is deleted in the AWS console.

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Create a PublicAndPrivate or Private cluster, wait for cluster to come up
2. Check conditions in awsendpointservice cr, status of AWSEndpointAvailable and AWSEndpointServiceAvailable should be True
3. On AWS console delete endpoint
4. In awsendpointservice cr, check if condition AWSEndpointAvailable is changed to false 
5. On AWS console delete endpoint service
6. In awsendpointservice cr, check if condition AWSEndpointServiceAvailable is changed to false

Actual results:

status of AWSEndpointAvailable and AWSEndpointServiceAvailable is True

Expected results:

status of AWSEndpointAvailable and AWSEndpointServiceAvailable should be False
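
The expected behaviour amounts to deriving both condition statuses from what currently exists in AWS on every reconcile, rather than setting them once. A minimal sketch, assuming the caller has already determined whether each resource exists; `endpoint_conditions` is a hypothetical helper, not the controller's code.

```python
def endpoint_conditions(endpoint_exists, endpoint_service_exists):
    """Derive both condition statuses from what currently exists in AWS."""

    def status(exists):
        # Kubernetes condition statuses are the strings "True" / "False".
        return "True" if exists else "False"

    return {
        "AWSEndpointAvailable": status(endpoint_exists),
        "AWSEndpointServiceAvailable": status(endpoint_service_exists),
    }
```

After step 3 above (endpoint deleted, endpoint service still present), this would report `AWSEndpointAvailable` as `"False"` and `AWSEndpointServiceAvailable` as `"True"`.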

Additional info:

 

Description of problem

Since resource type option has been moved to an advanced option in both the Deploy Image and Import from Git flows, there is confusion for some existing customers who are using the feature.

The UI no longer provides transparency of the type of resource which is being created.

Version-Release number of selected component (if applicable)

How reproducible

Steps to Reproduce

1.
2.
3.

Actual results

Expected results

Remove Resource type from Advanced Options and place it back where it was previously.  Resource type selection is now a dropdown, so it will return to its previous spot but will use a different component than in 4.11.

Description of problem:

clusteroperator/network is degraded after running

    FEATURES_ENVIRONMENT="ci" make feature-deploy-on-ci

from openshift-kni/cnf-features-deploy against IPI clusters with OCP 4.13 and 4.14 in CI jobs from Telco 5G DevOps/CI.

Details for a 4.13 job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/42141/rehearse-42141-periodic-ci-openshift-release-master-nightly-4.13-e2e-telco5g/1689935408508440576

Details for a 4.14 job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/42141/rehearse-42141-periodic-ci-openshift-release-master-nightly-4.14-e2e-telco5g/1689935408541995008

For example, go to artifacts/e2e-telco5g/telco5g-gather-pao/build-log.txt and it will report:

Error from server (BadRequest): container "container-00" in pod "cnfdu5-worker-0-debug" is waiting to start: ContainerCreating
Running gather-pao for T5CI_VERSION=4.13
Running for CNF_BRANCH=master
Running PAO must-gather with tag pao_mg_tag=4.12
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-kni/performance-addon-operator-must-gather:4.12-snapshot
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 60503edf-ecc6-48f7-b6a6-f4dc34842803
ClusterVersion: Stable at "4.13.0-0.nightly-2023-08-10-021434"
ClusterOperators:
	clusteroperator/network is degraded because DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-7lmlq is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-95tzb is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-hfxkd is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-mhwtp is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-q7gfb is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - last change 2023-08-11T10:54:10Z

Version-Release number of selected component (if applicable):

branch release-4.13 from https://github.com/openshift-kni/cnf-features-deploy.git for OCP 4.13
branch master from https://github.com/openshift-kni/cnf-features-deploy.git for OCP 4.14

How reproducible:

Always.

Steps to Reproduce:

1. Install OCP 4.13 or OCP 4.14 with IPI on 3x masters, 2x workers.
2. Clone https://github.com/openshift-kni/cnf-features-deploy.git
3. FEATURES_ENVIRONMENT="ci" make feature-deploy-on-ci
4. oc wait nodes --all --for=condition=Ready=true --timeout=10m
5. oc wait clusteroperators --all --for=condition=Progressing=false --timeout=10m

Actual results:

See above.

Expected results:

All clusteroperators have finished progressing.

Additional info:

Without 'FEATURES_ENVIRONMENT="ci" make feature-deploy-on-ci' the steps to reproduce above work as expected.

This is a clone of issue OCPBUGS-18517. The following is the description of the original issue:

Description of problem:

Installation with Kuryr is failing because multiple components are attempting to connect to the API and fail with the following error:

failed checking apiserver connectivity: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-service-ca/leases/service-ca-controller-lock": tls: failed to verify certificate: x509: cannot validate certificate for 172.30.0.1 because it doesn't contain any IP SANs

$ oc get po -A -o wide |grep -v Running |grep -v Pending |grep -v Completed
NAMESPACE                                          NAME                                                        READY   STATUS             RESTARTS          AGE     IP              NODE                   NOMINATED NODE   READINESS GATES
openshift-apiserver-operator                       openshift-apiserver-operator-559d855c56-c2rdr               0/1     CrashLoopBackOff   42 (2m28s ago)    3h44m   10.128.16.86    kuryr-5sxhw-master-2   <none>           <none>
openshift-apiserver                                apiserver-6b9f5d48c4-bj6s6                                  0/2     CrashLoopBackOff   92 (4m25s ago)    3h36m   10.128.70.10    kuryr-5sxhw-master-2   <none>           <none>
openshift-cluster-csi-drivers                      manila-csi-driver-operator-75b64d8797-fckf5                 0/1     CrashLoopBackOff   42 (119s ago)     3h41m   10.128.56.21    kuryr-5sxhw-master-0   <none>           <none>
openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-operator-84dfd8d89f-kgtr8       0/1     CrashLoopBackOff   42 (82s ago)      3h41m   10.128.56.9     kuryr-5sxhw-master-0   <none>           <none>
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-7fbb66545c-kh6th               0/1     CrashLoopBackOff   46 (3m5s ago)     3h44m   10.128.6.40     kuryr-5sxhw-master-2   <none>           <none>
openshift-cluster-storage-operator                 cluster-storage-operator-5545dfcf6d-n497j                   0/1     CrashLoopBackOff   42 (2m23s ago)    3h44m   10.128.21.175   kuryr-5sxhw-master-2   <none>           <none>
openshift-cluster-storage-operator                 csi-snapshot-controller-ddb9469f9-bc4bb                     0/1     CrashLoopBackOff   45 (2m17s ago)    3h41m   10.128.20.106   kuryr-5sxhw-master-1   <none>           <none>
openshift-cluster-storage-operator                 csi-snapshot-controller-operator-6d7b66dbdd-xdwcs           0/1     CrashLoopBackOff   42 (92s ago)      3h44m   10.128.21.220   kuryr-5sxhw-master-2   <none>           <none>
openshift-config-operator                          openshift-config-operator-c5d5d964-2w2bv                    0/1     CrashLoopBackOff   80 (3m39s ago)    3h44m   10.128.43.39    kuryr-5sxhw-master-2   <none>           <none>
openshift-controller-manager-operator              openshift-controller-manager-operator-754d748cf7-rzq6f      0/1     CrashLoopBackOff   42 (3m6s ago)     3h44m   10.128.25.166   kuryr-5sxhw-master-2   <none>           <none>
openshift-etcd-operator                            etcd-operator-76ddc94887-zqkn7                              0/1     CrashLoopBackOff   49 (30s ago)      3h44m   10.128.32.146   kuryr-5sxhw-master-2   <none>           <none>
openshift-ingress-operator                         ingress-operator-9f76cf75b-cjx9t                            1/2     CrashLoopBackOff   39 (3m24s ago)    3h44m   10.128.9.108    kuryr-5sxhw-master-2   <none>           <none>
openshift-insights                                 insights-operator-776cd7cfb4-8gzz7                          0/1     CrashLoopBackOff   46 (4m21s ago)    3h44m   10.128.15.102   kuryr-5sxhw-master-2   <none>           <none>
openshift-kube-apiserver-operator                  kube-apiserver-operator-64f4db777f-7n9jv                    0/1     CrashLoopBackOff   42 (113s ago)     3h44m   10.128.18.199   kuryr-5sxhw-master-2   <none>           <none>
openshift-kube-apiserver                           installer-5-kuryr-5sxhw-master-1                            0/1     Error              0                 3h35m   10.128.68.176   kuryr-5sxhw-master-1   <none>           <none>
openshift-kube-controller-manager-operator         kube-controller-manager-operator-746497b-dfbh5              0/1     CrashLoopBackOff   42 (2m23s ago)    3h44m   10.128.13.162   kuryr-5sxhw-master-2   <none>           <none>
openshift-kube-controller-manager                  installer-4-kuryr-5sxhw-master-0                            0/1     Error              0                 3h35m   10.128.65.186   kuryr-5sxhw-master-0   <none>           <none>
openshift-kube-scheduler-operator                  openshift-kube-scheduler-operator-695fb4449f-j9wqx          0/1     CrashLoopBackOff   42 (63s ago)      3h44m   10.128.44.194   kuryr-5sxhw-master-2   <none>           <none>
openshift-kube-scheduler                           installer-5-kuryr-5sxhw-master-0                            0/1     Error              0                 3h35m   10.128.60.44    kuryr-5sxhw-master-0   <none>           <none>
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-6c5cd46578-qpk5z     0/1     CrashLoopBackOff   42 (2m18s ago)    3h44m   10.128.4.120    kuryr-5sxhw-master-2   <none>           <none>
openshift-machine-api                              cluster-autoscaler-operator-7b667675db-tmlcb                1/2     CrashLoopBackOff   46 (2m53s ago)    3h45m   10.128.28.146   kuryr-5sxhw-master-2   <none>           <none>
openshift-machine-api                              machine-api-controllers-fdb99649c-ldb7t                     3/7     CrashLoopBackOff   184 (2m55s ago)   3h40m   10.128.29.90    kuryr-5sxhw-master-0   <none>           <none>
openshift-route-controller-manager                 route-controller-manager-d8f458684-7dgjm                    0/1     CrashLoopBackOff   43 (100s ago)     3h36m   10.128.55.11    kuryr-5sxhw-master-2   <none>           <none>
openshift-service-ca-operator                      service-ca-operator-654f68c77f-g4w55                        0/1     CrashLoopBackOff   42 (2m2s ago)     3h45m   10.128.22.30    kuryr-5sxhw-master-2   <none>           <none>
openshift-service-ca                               service-ca-5f584b7d75-mxllm                                 0/1     CrashLoopBackOff   42 (45s ago)      3h42m   10.128.49.250   kuryr-5sxhw-master-0   <none>           <none>
$ oc get svc -A |grep  172.30.0.1 
default                                            kubernetes                                       ClusterIP   172.30.0.1       <none>        443/TCP                           3h50m

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of the problem:

In staging, BE 2.18.0 - Trying to set all validation IDs to be ignored with:

curl -X 'PUT' 'https://api.stage.openshift.com/api/assisted-install/v2/clusters/26a69b99-06a3-441b-be40-73cadbac6b6a/ignored-validations'   --header "Authorization: Bearer $(ocm token)"   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
  "host-validation-ids": "[]",                          
  "cluster-validation-ids": "[\"all\"]"       
}'

Getting this response:

 {"code":"400","href":"","id":400,"kind":"Error","reason":"cannot proceed due to the following errors: Validation ID 'all' is not a known cluster validation"}

How reproducible:

100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:
All ignorable validations should be added to the ignore list
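
The expected handling of the special value can be sketched as follows. This is an illustration only, not the assisted-service implementation: `resolve_ignored` is a hypothetical helper and the IDs in `IGNORABLE_CLUSTER_VALIDATIONS` are placeholders, since the real list of ignorable validations is not given here. As in the request above, each field is a string that itself contains a JSON array.

```python
import json

# Placeholder IDs; the real set of ignorable validations lives in the service.
IGNORABLE_CLUSTER_VALIDATIONS = ["example-validation-a", "example-validation-b"]


def resolve_ignored(raw, ignorable):
    """Decode the JSON-encoded ID list and expand the special value "all"."""
    ids = json.loads(raw)
    if "all" in ids:
        # "all" is not a validation ID itself; it selects every ignorable one.
        return sorted(ignorable)
    unknown = [i for i in ids if i not in ignorable]
    if unknown:
        raise ValueError(
            f"Validation ID '{unknown[0]}' is not a known cluster validation"
        )
    return ids
```

With this ordering of checks, `resolve_ignored('["all"]', IGNORABLE_CLUSTER_VALIDATIONS)` returns every ignorable ID instead of failing the known-ID check, which is the failure shown in the response above.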

Description of problem:

This came out of the investigation of https://issues.redhat.com/browse/OCPBUGS-11691 . The nested node configs used to support dual stack VIPs do not correctly respect the EnableUnicast setting. This is causing issues on EUS upgrades where the unicast migration cannot happen until all nodes are on 4.12. This is blocking both the workaround and the eventual proper fix.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Deploy 4.11 with unicast explicitly disabled (via MCO patch)
2. Write /etc/keepalived/monitor-user.conf to suppress unicast migration
3. Upgrade to 4.12

Actual results:

Nodes come up in unicast mode

Expected results:

Nodes remain in multicast mode until monitor-user.conf is removed

Additional info:

 

Description of problem:

In the Reliability (loaded long-run) test, the memory of the ovnkube-node-xxx pods on all 6 nodes keeps increasing. Within 24 hours it increased to about 1.6G. I did not see this issue in previous releases.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-27-000502

How reproducible:

This is the first time I have encountered this issue

Steps to Reproduce:

1. Install an AWS OVN cluster with 3 masters and 3 workers; vm_type is m5.xlarge for all of them.
2. Run the reliability-v2 test https://github.com/openshift/svt/tree/master/reliability-v2 with config: 1 admin, 15 dev-test, 1 dev-prod. The test runs the configured tasks over a long period.
3. Monitor the test failures and the performance dashboard.

Test failures slack notification: https://redhat-internal.slack.com/archives/C0266JJ4XM5/p1687944463913769

Performance dashboard:http://dittybopper-dittybopper.apps.qili-414-haproxy.qe-lrc.devcluster.openshift.com/d/IgK5MW94z/openshift-performance?orgId=1&from=1687944452000&to=now&refresh=1h

Actual results:

The memory of the ovnkube-node-xxx pods on all 6 nodes keeps increasing.
Within 24 hours it increased to about 1.6G.

Expected results:

The memory of the ovnkube-node-xxx pods should not keep increasing.

Additional info:

% oc adm top pod -n openshift-ovn-kubernetes | grep node
ovnkube-node-4t282     146m         1862Mi          
ovnkube-node-9p462     41m          1847Mi          
ovnkube-node-b6rqj     46m          2032Mi          
ovnkube-node-fp2gn     72m          2107Mi          
ovnkube-node-hxf95     11m          2359Mi          
ovnkube-node-ql9fx     38m          2089Mi          

I took a pprof heap profile on one of the pods and uploaded it as heap-ovnkube-node-4t282.out
The must-gather is uploaded as must-gather.local.1315176578017655774.tar.gz
Performance dashboard screenshot: ovnkube-node-memory.png
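
To track this kind of growth over a long run, the `oc adm top pod` output above can be parsed and compared against a threshold. The sketch below is a hypothetical monitoring helper, not part of the reliability-v2 test; `pods_over_limit` and the 2048Mi threshold in the usage note are assumptions, and only memory values reported in `Mi` are handled.

```python
# Sample copied from the `oc adm top pod` output in this report.
TOP_OUTPUT = """\
ovnkube-node-4t282     146m         1862Mi
ovnkube-node-9p462     41m          1847Mi
ovnkube-node-b6rqj     46m          2032Mi
ovnkube-node-fp2gn     72m          2107Mi
ovnkube-node-hxf95     11m          2359Mi
ovnkube-node-ql9fx     38m          2089Mi
"""


def pods_over_limit(top_output, limit_mib):
    """Map pod name -> memory in Mi for every pod above limit_mib."""
    over = {}
    for line in top_output.splitlines():
        fields = line.split()
        # Expect: NAME CPU MEMORY; skip headers and anything not reported in Mi.
        if len(fields) != 3 or not fields[2].endswith("Mi"):
            continue
        memory_mib = int(fields[2][:-2])
        if memory_mib > limit_mib:
            over[fields[0]] = memory_mib
    return over
```

Calling `pods_over_limit(TOP_OUTPUT, 2048)` on the sample flags ovnkube-node-fp2gn, ovnkube-node-hxf95, and ovnkube-node-ql9fx.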

This is a clone of issue OCPBUGS-17906. The following is the description of the original issue:

Description of problem:

On a HyperShift (guest) cluster, the EFS driver pod is stuck in the ContainerCreating state

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-11-055332

How reproducible:

Always

Steps to Reproduce:

1. Create Hypershift cluster.    
Flexy template: aos-4_14/ipi-on-aws/versioned-installer-ovn-hypershift-ci

2. Try to install the EFS operator and driver from a YAML file or the web console, as described in the steps below.
a) Create an IAM role with the ccoctl tool and take the ROLE ARN value from the output
b) Install the EFS operator using the above ROLE ARN value.
c) Check that the EFS operator, node, and controller pods are up and running

// og-sub-hcp.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-cluster-csi-drivers-
  namespace: openshift-cluster-csi-drivers
spec:
  namespaces:
  - ""
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: aws-efs-csi-driver-operator
  namespace: openshift-cluster-csi-drivers
spec:
    channel: stable
    name: aws-efs-csi-driver-operator
    source: qe-app-registry
    sourceNamespace: openshift-marketplace
    config:
      env:
      - name: ROLEARN
        value: arn:aws:iam::301721915996:role/hypershift-ci-16666-openshift-cluster-csi-drivers-aws-efs-cloud-

// driver.yaml
apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  logLevel: TraceAll
  managementState: Managed
  operatorLogLevel: TraceAll

Actual results:

aws-efs-csi-driver-controller-699664644f-dkfdk   0/4     ContainerCreating   0          87m

Expected results:

EFS controller pods should be up and running

Additional info:

oc -n openshift-cluster-csi-drivers logs aws-efs-csi-driver-operator-6758c5dc46-b75hb

E0821 08:51:25.160599       1 base_controller.go:266] "AWSEFSDriverCredentialsRequestController" controller failed to sync "key", err: cloudcredential.operator.openshift.io "cluster" not found

Discussion: https://redhat-internal.slack.com/archives/GK0DA0JR5/p1692606247221239
Installation steps epic: https://issues.redhat.com/browse/STOR-1421 

Description of problem:

Set custom security group IDs in the following fields of install-config.yaml

installconfig.controlPlane.platform.aws.additionalSecurityGroupIDs installconfig.compute.platform.aws.additionalSecurityGroupIDs

such as: 

apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    aws:
      additionalSecurityGroupIDs:
      - sg-0d2f88b2980aa5547
      - sg-01f1d2f60a3b4cf6d
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      additionalSecurityGroupIDs:
      - sg-03418b6e2f68e1f63
      - sg-0376fc68fd4b834a4
  replicas: 3


After installation, check the security groups attached to the masters and workers: the masters do not have the specified custom security groups attached, while the workers do.

For one of the masters:
[root@preserve-gpei-worker ~]# aws ec2 describe-instances --instance-ids i-0cd007cca57c86ee9 --region us-west-2 --query 'Reservations[*].Instances[*].SecurityGroups[*]' --output json
[
    [
        [
            {
                "GroupName": "terraform-20230713031140984600000002",
                "GroupId": "sg-05495718555950f77"
            }
        ]
    ]
]

For one of the workers:
[root@preserve-gpei-worker ~]# aws ec2 describe-instances --instance-ids i-0572b7bde8ff07ac4 --region us-west-2 --query 'Reservations[*].Instances[*].SecurityGroups[*]' --output json
[
    [
        [
            {
                "GroupName": "gpei-0613a-worker-2",
                "GroupId": "sg-0376fc68fd4b834a4"
            },
            {
                "GroupName": "gpei-0613a-worker-1",
                "GroupId": "sg-03418b6e2f68e1f63"
            },
            {
                "GroupName": "terraform-20230713031140982700000001",
                "GroupId": "sg-0ce73044e426fe249"
            }
        ]
    ]
]

Also checked the master's controlplanemachineset: it does have the custom security groups configured, but in the end they are not attached to the master instances.

[root@preserve-gpei-worker k_files]# oc get controlplanemachineset -n openshift-machine-api cluster -o yaml |yq .spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.securityGroups
- filters:
    - name: tag:Name
      values:
        - gpei-0613a-pzjbk-master-sg
- id: sg-01f1d2f60a3b4cf6d
- id: sg-0d2f88b2980aa5547



Version-Release number of selected component (if applicable):

registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-07-11-092038

How reproducible:

 Always

Steps to Reproduce:

1. As mentioned above
2.
3.

Actual results:

Masters don't have the custom security groups added

Expected results:

Masters should have the custom security groups added, like the workers
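
The manual comparison above can be automated by diffing the IDs from install-config.yaml against the JSON printed by the `aws ec2 describe-instances ... --query 'Reservations[*].Instances[*].SecurityGroups[*]'` command. This is a hypothetical verification helper, not installer code; `missing_security_groups` is an assumed name, and the sample is the master output shown in this report.

```python
import json

# Output for one of the masters, copied from this report.
MASTER_OUTPUT = """
[[[{"GroupName": "terraform-20230713031140984600000002",
    "GroupId": "sg-05495718555950f77"}]]]
"""


def missing_security_groups(describe_output, expected_ids):
    """Return the expected group IDs that are not attached to the instance(s)."""
    data = json.loads(describe_output)
    # The query yields three nested lists: reservations -> instances -> groups.
    attached = {
        group["GroupId"]
        for reservation in data
        for instance in reservation
        for group in instance
    }
    return sorted(set(expected_ids) - attached)
```

For the master, `missing_security_groups(MASTER_OUTPUT, ["sg-0d2f88b2980aa5547", "sg-01f1d2f60a3b4cf6d"])` reports both custom groups as missing; for an instance that has every expected group attached, the same check returns an empty list.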

Additional info:


In Hypershift CI, we see a nil pointer dereference panic

I0801 06:35:38.203019       1 controller.go:182] Assigning key: ip-10-0-132-175.ec2.internal to node workqueue
E0801 06:35:38.567021       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 195 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x28103a0?, 0x47a6400})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00088f260?})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x28103a0, 0x47a6400})
	/usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*AWS).getSubnet(0xc000c05220, 0xc000d760b0)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/aws.go:266 +0x24a
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*AWS).GetNodeEgressIPConfiguration(0x0?, 0x31b8490?, {0x0, 0x0, 0x0})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/aws.go:200 +0x185
github.com/openshift/cloud-network-config-controller/pkg/controller/node.(*NodeController).SyncHandler(0xc000d526e0, {0xc00005d7e0, 0x1c})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/node/node_controller.go:129 +0x44f
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc00071f740, {0x25ff720?, 0xc00088f260?})
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x11c
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc00071f740)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(...)
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x318e140, 0xc0005aa1e0}, 0x1, 0xc0000c4ba0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
	/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x3aa
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x236d14a]

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn/1686255525022404608/artifacts/e2e-aws-ovn/run-e2e/artifacts/TestNodePool_PreTeardownClusterDump/namespaces/e2e-clusters-m222b-example-85hhk/core/pods/logs/cloud-network-config-controller-6984cd6dcb-l7pcx-controller-previous.log

https://github.com/openshift/cloud-network-config-controller/blob/master/pkg/cloudprovider/aws.go#L266

The code does an unprotected dereference of `networkInterface.SubnetId`, which appears to be `nil`; that is probably why multiple subnets are returned in the first place.

Description of problem:


MCO sets duplicate feature flags for the kubelet, causing errors on bringup.

I0421 15:32:04.308472    2135 codec.go:98] "Using lenient decoding as strict decoding failed" err=<
Apr 21 15:32:04 ip-10-0-156-156 kubenswrapper[2135]:         strict decoding error: yaml: unmarshal errors:
Apr 21 15:32:04 ip-10-0-156-156 kubenswrapper[2135]:           line 29: key "RotateKubeletServerCertificate" already set in map
Apr 21 15:32:04 ip-10-0-156-156 kubenswrapper[2135]:  >
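
The strict decoding error above is what a duplicate key in a YAML mapping produces. As a rough illustration only (this is not MCO or kubelet code), the sketch below scans a flat `key: value` block for keys that appear more than once. The function name and the sample gate list are hypothetical; only `RotateKubeletServerCertificate` comes from the log.

```python
from collections import Counter


def duplicate_keys(mapping_text: str) -> list[str]:
    """Return the keys that occur more than once in a flat 'key: value' block."""
    keys = []
    for line in mapping_text.splitlines():
        stripped = line.strip()
        # Skip blanks, comments, and anything that is not a 'key: value' pair.
        if not stripped or stripped.startswith("#") or ":" not in stripped:
            continue
        keys.append(stripped.split(":", 1)[0].strip().strip('"'))
    counts = Counter(keys)
    return sorted(key for key, seen in counts.items() if seen > 1)


# Hypothetical feature gate block; the duplicated key mirrors the log above.
feature_gates = """
APIPriorityAndFairness: true
RotateKubeletServerCertificate: true
DownwardAPIHugePages: true
RotateKubeletServerCertificate: true
"""

print(duplicate_keys(feature_gates))
```

A check like this could run against a rendered kubelet configuration before it is written to the node, assuming the feature gates are rendered as a flat mapping.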

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-19018. The following is the description of the original issue:

Using metal-ipi on 4.14, the cluster fails to come up.

The network cluster operator fails to start, and the sdn pod shows the error:

bash: RHEL_VERSION: unbound variable

Description of problem:

Create a new host and cluster folder, qe-cluster, under the datacenter, and move the workloads cluster into that folder.

$ govc find -type r
/OCP-DC/host/qe-cluster/workloads

Use the install-config.yaml file below to create a single-zone cluster:

apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: 
    vsphere:
      cpus: 4
      memoryMB: 8192
      osDisk:
        diskSizeGB: 60
      zones:
        - us-east-1
  replicas: 2
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere: 
      cpus: 4
      memoryMB: 16384 
      osDisk:
        diskSizeGB: 60
      zones:
        - us-east-1
  replicas: 3
metadata:
  name: jima-permission
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.19.46.0/24
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    apiVIP: 10.19.46.99
    cluster: qe-cluster/workloads
    datacenter: OCP-DC
    defaultDatastore: my-nfs
    ingressVIP: 10.19.46.98
    network: "VM Network"
    username: administrator@vsphere.local
    password: xxx
    vCenter: xxx
    vcenters:
    - server: xxx
      user: administrator@vsphere.local
      password: xxx
      datacenters:
      - OCP-DC
    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      topology:
        datacenter: OCP-DC
        computeCluster: /OCP-DC/host/qe-cluster/workloads
        networks:
        - "VM Network"
        datastore: my-nfs
      server: xxx
pullSecret: xxx 

The installer fails with this error:

$ ./openshift-install create cluster --dir ipi5 --log-level debug
DEBUG   Generating Platform Provisioning Check...  
DEBUG   Fetching Common Manifests...               
DEBUG   Reusing previously-fetched Common Manifests 
DEBUG Generating Terraform Variables...            
FATAL failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": failed to get vSphere network ID: could not find vSphere cluster at /OCP-DC/host//OCP-DC/host/qe-cluster/workloads: cluster '/OCP-DC/host//OCP-DC/host/qe-cluster/workloads' not found 
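
The doubled prefix in the error (`/OCP-DC/host//OCP-DC/host/qe-cluster/workloads`) suggests that the datacenter host prefix was prepended to a computeCluster value that was already an absolute inventory path. The sketch below illustrates that idea under this assumption; it is not the installer's actual fix, and the function name is hypothetical.

```python
def resolve_compute_cluster(datacenter: str, compute_cluster: str) -> str:
    """Return an absolute inventory path for a vSphere compute cluster."""
    if compute_cluster.startswith("/"):
        # Already absolute, as in failureDomains[].topology.computeCluster.
        return compute_cluster
    # Relative form, as in platform.vsphere.cluster: add the host prefix once.
    return f"/{datacenter}/host/{compute_cluster}"


# Both spellings from the install-config resolve to the same path.
print(resolve_compute_cluster("OCP-DC", "/OCP-DC/host/qe-cluster/workloads"))
print(resolve_compute_cluster("OCP-DC", "qe-cluster/workloads"))
```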
 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-05-053337

How reproducible:

always

Steps to Reproduce:

1. Create a new host/cluster folder under the datacenter, and move the vSphere cluster into that folder
2. Prepare an install-config with zone configuration
3. Deploy the cluster

Actual results:

Cluster creation fails.

Expected results:

Cluster creation succeeds.

Additional info:

Description of problem:

In the control plane machine set operator we perform e2e periodic tests that check the ability to do a rolling update of an entire OCP control plane.

This is quite an involved test: we need to drain and replace all the master machines/nodes, which means altering operators, waiting for machines to come up and bootstrap, and waiting for nodes to drain and move their workloads to other nodes while respecting PDBs and etcd quorum.

As such, we need to make sure we are robust to transient issues, occasional slowdowns, and network errors.

We have investigated these timeout issues and identified some common culprits that we want to address, see: https://redhat-internal.slack.com/archives/GE2HQ9QP4/p1678966522151799

Description of problem:

CPO reconciliation loop hangs after "Reconciling infrastructure status"

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Frequently

Steps to Reproduce:

1. Create a HostedCluster with a recent 4.14 release image
2. Watch CPO logs
3.

Actual results:

Reconcile gets stuck

Expected results:

Reconcile happens fairly quickly

Additional info:

 

Description of problem:

A cluster upgrade failure has affected three consecutive nightly payloads.

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-20-041508
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-21-120836
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-22-035713

In all three cases, upgrade seems to fail waiting on network. Take this job as an example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624

The cluster version operator complains that the network operator has not finished upgrading:

I0522 07:12:58.540244       1 sync_worker.go:1149] Update error 684 of 845: ClusterOperatorUpdating Cluster operator network is updating versions (*errors.errorString: cluster operator network is available and not degraded but has not finished updating to target version)

This log can be seen in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5565f87cc6-6sjqf_cluster-version-operator.log

The network operator keeps waiting with the following log:
I0522 07:12:58.563312       1 connectivity_check_controller.go:166] ConnectivityCheckController is waiting for transition to desired version (4.14.0-0.nightly-2023-05-22-035713) to be completed.

This lasted over 2 hours. The log can be seen in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-network-operator_network-operator-6975b7b8ff-pdxzk_network-operator.log

Compared with a working job, there seems to be an error getting *v1alpha1.PodNetworkConnectivityCheck in the openshift-network-diagnostics_network-check-source:
W0522 04:34:18.527315       1 reflector.go:424] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
E0522 04:34:18.527391       1 reflector.go:140] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)

It is not clear whether this is really relevant. Also worth mentioning: every time this problem happens, machine-config and dns are also stuck on the older version.

This has affected the 4.14 nightly payload three times. If it recurs more consistently, we might have to increase the severity of the bug. Please ping TRT if any more info is needed.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

During installation:

level=error msg=Error: reading Security Group (sg-0f07c871bdbd6379f) Rules: UnauthorizedOperation: You are not authorized to perform this operation.
level=error msg=	status code: 403, request id: f3e18ac0-f2fc-471f-8055-7194112c8225 

Users are unable to create the security groups for the bootstrap node

Version-Release number of selected component (if applicable):

 

How reproducible:

Always 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Warning/Error should come up when the permission does not exist.

Additional info:

 

Starting with https://amd64.origin.releases.ci.openshift.org/releasestream/4.13.0-0.okd/release/4.13.0-0.okd-2023-02-28-170012 multiple storage tests are failing:

[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (block volmode)] volumes should store data [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with pvc data source [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: block] [Testpattern: Pre-provisioned PV (block volmode)] volumes should store data [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] PersistentVolumes-local [Volume type: block] One pod requesting one prebound PVC should be able to mount volume and write from pod1 [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource] [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] PersistentVolumes-local [Volume type: block] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] PersistentVolumes-local [Volume type: block] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] PersistentVolumes-local [Volume type: block] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: aws] [Testpattern: Dynamic PV (block volmode)] volumes should store data [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: aws] [Testpattern: Pre-provisioned PV (block volmode)] volumes should store data [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]

cc Hemant Kumar

Description of problem:

When we create a cluster with --secret-creds, an MCE AWS k8s secret that includes aws-creds, pull secret, and base domain, the binary should not ask for a pull secret. However, it now does, after the change from hypershift.

Adding the pull secret param allows the command to continue as expected, though I would think the whole point of secret-creds is to reuse what exists.

 /usr/local/bin/hcp create cluster aws --name acmqe-hc-ad5b1f645d93464c --secret-creds test1-cred --region us-east-1 --node-pool-replicas 1 --namespace local-cluster --instance-type m6a.xlarge --release-image quay.io/openshift-release-dev/ocp-release:4.14.0-ec.4-multi --generate-ssh
Output:
  Error: required flag(s) "pull-secret" not set
  required flag(s) "pull-secret" not set
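
The behaviour the report asks for can be stated as a small rule: pull-secret is only mandatory when secret-creds is absent. The sketch below models that rule on a plain dictionary of parsed flags. It is an illustration of the expectation, not the hcp CLI's real validation code, and the function name is hypothetical.

```python
def missing_required_flags(flags: dict) -> list:
    """Return the flags that still have to be supplied on the command line."""
    if flags.get("secret-creds"):
        # The referenced secret is expected to carry the pull secret already.
        return []
    return [name for name in ("pull-secret",) if not flags.get(name)]


# With --secret-creds the command should proceed without --pull-secret.
print(missing_required_flags({"secret-creds": "test1-cred"}))
# Without it, pull-secret remains required.
print(missing_required_flags({"region": "us-east-1"}))
```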

Version-Release number of selected component (if applicable):

2.4.0-DOWNANDBACK-2023-08-31-13-34-02 or mce 2.4.0-137

hcp version openshift/hypershift: 8b4b52925d47373f3fe4f0d5684c88dc8a93368a. Latest supported OCP: 4.14.0

How reproducible:

always

Steps to Reproduce:

  1. download hcp cli from mce
  2. run hcp create cluster aws with a valid secret-creds param
  3. ...

Actual results:

Expected results:

Additional info:

Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-13061.

Description of problem:

When a fresh normal user visits the BuildConfigs page of the 'default' project, an error page is shown.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-05-191022

How reproducible:

Always

Steps to Reproduce:

1. A normal user without any projects logs in to the console
2. Switch to the Admin perspective
3. Visit a workloads page for the 'default' project, for example
/k8s/ns/default/route.openshift.io~v1~Route
/k8s/ns/default/core~v1~Service
/k8s/ns/default/apps~v1~Deployment
/k8s/ns/default/build.openshift.io~v1~BuildConfig

Actual results:

3. We can see an error page when visiting BuildConfigs page 

Expected results:

3. No error should be shown, and the page should show info consistent with the other workloads pages

Additional info:

 

Description of problem:

Repository creation in the console asks for a mandatory secret and does not allow creating a repository even for a public Git URL, which is unexpected. However, it works fine with the OCP CLI.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create repository crd via openshift console
2.
3.

Actual results:

It does not allow me to create the repository

Expected results:

We should be able to create repository crd

Additional info:

slack thread: https://redhat-internal.slack.com/archives/C6A3NV5J9/p1691057766516119

Description of problem:

With 120+ node clusters, we are seeing a roughly 10x higher rate of node patch requests coming from node service accounts. This higher rate of updates causes "nodes" watchers to be terminated, which triggers a storm of watch requests that increases CPU load on the cluster.

What I see is that node resourceVersions are incremented rapidly and in large bursts, and watchers are terminated as a result.

Version-Release number of selected component (if applicable):

4.14.0-ec.4
4.14.0-0.nightly-2023-08-08-222204
4.13.0-0.nightly-2023-08-10-021434

How reproducible:

Repeatable

Steps to Reproduce:

1. Create 4.14 cluster with 120 nodes with m5.8xlarge control plane and c5.4xlarge workers.
2. Run `oc get nodes -w -o custom-columns='NAME:.metadata.name,RV:.metadata.resourceVersion' ` 
3. Wait for a big chunk of nodes to be updated and observe the watch terminate.
4. Optionally run `kube-burner ocp node-density-cni --pods-per-node=100` to generate some load.

Actual results:

kube-apiserver audit events show >1500 node patch requests from a single node SA in a certain amount of time:
   1678 ["system:node:ip-10-0-69-142.us-west-2.compute.internal",null]
   1679 ["system:node:ip-10-0-33-131.us-west-2.compute.internal",null]
   1709 ["system:node:ip-10-0-41-44.us-west-2.compute.internal",null]

Observe that apiserver_terminated_watchers_total{resource="nodes"} starts to increment before 120 node scaleup is even complete.
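
Per-user counts like the ones above can be derived from kube-apiserver audit events. The sketch below assumes one JSON audit event per line and uses the `verb`, `user.username`, and `objectRef.resource` fields of the audit Event schema. The sample events and node names are made up for illustration, and this is not necessarily how the numbers in this report were produced.

```python
import json
from collections import Counter


def count_node_patches(audit_lines):
    """Count 'patch' requests against the nodes resource, keyed by username."""
    counts = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        ref = event.get("objectRef") or {}
        if event.get("verb") == "patch" and ref.get("resource") == "nodes":
            counts[(event.get("user") or {}).get("username")] += 1
    return counts


# Hypothetical audit events: two node patches from node-a, one from node-b,
# and one unrelated pod request that must not be counted.
sample = [
    '{"verb": "patch", "user": {"username": "system:node:node-a"}, "objectRef": {"resource": "nodes"}}',
    '{"verb": "patch", "user": {"username": "system:node:node-a"}, "objectRef": {"resource": "nodes"}}',
    '{"verb": "patch", "user": {"username": "system:node:node-b"}, "objectRef": {"resource": "nodes"}}',
    '{"verb": "get", "user": {"username": "system:node:node-b"}, "objectRef": {"resource": "pods"}}',
]

for username, total in count_node_patches(sample).most_common():
    print(total, username)
```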

Expected results:

Patch requests in a certain amount of time are more aligned with what we see on the 4.13.0-0.nightly-2023-08-10-021434 nightly:
     57 ["system:node:ip-10-0-247-122.us-west-2.compute.internal",null]
     62 ["system:node:ip-10-0-239-217.us-west-2.compute.internal",null]
     63 ["system:node:ip-10-0-165-255.us-west-2.compute.internal",null]
     64 ["system:node:ip-10-0-136-122.us-west-2.compute.internal",null]

Observe that apiserver_terminated_watchers_total{resource="nodes"} does not increment.

Observe that rate of mutating node requests levels off after nodes are created.

Additional info:

I suspect these updates coming from nodes could be part of a response to the MCO controllerconfigs resource being updated every few minutes or more frequently.

This is one of the suspected causes of the increased kube-apiserver CPU usage seen in the investigation of ovn-ic.

An upstream partial fix to logging means that the BMO log now contains a mixture of structured and unstructured logs, making it impossible to read with the structured log parsing tool (bmo-log-parse) we use for debugging customer issues.
This is fixed upstream by https://github.com/metal3-io/baremetal-operator/pull/1249, which will get picked up automatically in 4.14 but which needs to be backported to 4.13.

Description of problem:

Currently, only one ServerGroup is created in OpenStack when 3 masters are deployed on 3 AZs, while 3 should have been created (one per AZ). With the work on CPMS, we made the decision to only create one ServerGroup for the masters. However, this will require a change in the installer to reflect this decision.
Indeed, when specifying AZs, the master machines would have their own ServerGroup, while only one actually existed in OpenStack. This was a mistake but instead of fixing that bug, we'll change the behaviour to have only one ServerGroup for masters.

Version-Release number of selected component (if applicable):

latest (4.14)

How reproducible: deploy a control plane with 3 failure domains:

controlPlane:
  name: master
  platform:
    openstack:
      type: m1.xlarge
      failureDomains:
      - computeAvailabilityZone: az0
      - computeAvailabilityZone: az1
      - computeAvailabilityZone: az2

Steps to Reproduce:

1. Deploy the control plane in 3 AZ
2. List OpenStack Compute Server Groups

Actual results:

+--------------------------------------+--------------------------+--------------------+
| ID                                   | Name                     | Policy             |
+--------------------------------------+--------------------------+--------------------+
| 0750c579-d2cf-41b3-9e88-003dcbcad0c5 | refarch-jkn8g-master-az0 | soft-anti-affinity |
| 05715c08-ac2b-439d-9bd5-5803ac40c322 | refarch-jkn8g-worker     | soft-anti-affinity |
+--------------------------------------+--------------------------+--------------------+

Expected results without our work on CPMS:

refarch-jkn8g-master-az1 and refarch-jkn8g-master-az2 should have been created.

This expectation is purely for documentation, QE should ignore it.

 

Expected results with our work on CPMS (which should be taken into account by QE when testing CPMS):

refarch-jkn8g-master-az0 should not exist, and the ServerGroup should be named refarch-jkn8g-master.
All the masters should use that ServerGroup in both the Nova instance properties and in the MachineSpec once the machines are enrolled by CCPMSO.
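
The CPMS expectation above can be written as a check over the list of server group names. This is a minimal sketch: the function name is hypothetical, and the infrastructure ID `refarch-jkn8g` is inferred from the names in the table rather than stated in the report.

```python
def check_master_server_groups(infra_id: str, group_names: list) -> list:
    """Return a list of problems; an empty list means the expectation holds."""
    expected = f"{infra_id}-master"
    problems = []
    if expected not in group_names:
        problems.append(f"missing server group {expected}")
    for name in sorted(group_names):
        # Any '<infra_id>-master-<suffix>' group is a leftover per-AZ group.
        if name.startswith(expected + "-"):
            problems.append(f"unexpected per-AZ server group {name}")
    return problems


# Names taken from the "Actual results" table.
observed = ["refarch-jkn8g-master-az0", "refarch-jkn8g-worker"]
print(check_master_server_groups("refarch-jkn8g", observed))
```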

Description of problem:

In a 4.14 nightly HyperShift hosted cluster, aws-pod-identity does not work. Pods are not injected with the env vars AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE.

In 4.13 HyperShift hosted cluster, it works well, see Additional info.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-11-055332

How reproducible:

Always

Steps to Reproduce:

1.
$ export KUBECONFIG=/path/to/hypershift-hosted-cluster/kubeconfig
$ ogcv
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-08-11-055332   True        False         8h      Cluster version is 4.14.0-0.nightly-2023-08-11-055332
$ oc get mutatingwebhookconfigurations --context admin
NAME               WEBHOOKS   AGE
aws-pod-identity   1          6h5m

$ oc get --raw=/.well-known/openid-configuration | jq -r '.issuer'
https://xxxx.s3.us-east-2.amazonaws.com/hypershift-xxxx

2.
$ oc new-project xxia-proj
$ oc create sa aws-provider
serviceaccount/aws-provider created

3.
$ ccoctl aws create-iam-roles --name=xxia --region=$REGION --credentials-requests-dir=credentialsrequest-dir-aws --identity-provider-arn=arn:aws:iam::xxxx:oidc-provider/xxxx.s3.us-east-2.amazonaws.com/hypershift-xxxx --output-dir=credrequests-ccoctl-output
2023/08/24 17:54:32 Role arn:aws:iam::xxxx:role/xxia-xxia-proj-aws-creds created
2023/08/24 17:54:32 Saved credentials configuration to: credrequests-ccoctl-output/manifests/xxia-proj-aws-creds-credentials.yaml
2023/08/24 17:54:32 Updated Role policy for Role xxia-xxia-proj-aws-creds

4.
$ oc annotate sa/aws-provider eks.amazonaws.com/role-arn="arn:aws:iam::xxxx:role/xxia-xxia-proj-aws-creds"
$ oc create deployment aws-cli --image=amazon/aws-cli --dry-run=client -o yaml -- sleep 360d | sed "/containers/i \      serviceAccountName: aws-provider" | oc create -f -
deployment.apps/aws-cli created
$ oc get po
NAME                               READY   STATUS              RESTARTS   AGE
aws-cli-5c4f6d7d5b-g6d5v           1/1     Running             0          18s

5.
$ oc rsh aws-cli-5c4f6d7d5b-g6d5v
sh-4.2$ env | grep AWS
sh-4.2$ ls /var/run/secrets/eks.amazonaws.com/serviceaccount/token
ls: cannot access /var/run/secrets/eks.amazonaws.com/serviceaccount/token: No such file or directory
sh-4.2$ exit
command terminated with exit code 1

Actual results:

5. No AWS env vars.

Expected results:

5. Should have AWS env vars.
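
The check in step 5 can be automated by parsing the output of `env` from the pod. The sketch below assumes that output is available as text; the function name is hypothetical, and the two required variable names are the ones named in the problem description.

```python
REQUIRED_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def missing_identity_vars(env_output: str) -> list:
    """Return the pod identity variables that are absent from `env` output."""
    present = {
        line.split("=", 1)[0]
        for line in env_output.splitlines()
        if "=" in line
    }
    return [name for name in REQUIRED_VARS if name not in present]


# Empty output, as seen on the 4.14 hosted cluster, reports both as missing.
print(missing_identity_vars(""))
```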

Additional info:

In 4.13 HyperShift hosted cluster, it works well:

1.
$ ogcv    
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-08-11-101506   True        False         10h     Cluster version is 4.13.0-0.nightly-2023-08-11-101506
$ oc get --raw=/.well-known/openid-configuration | jq -r '.issuer'
https://aos-xxxx.s3.us-east-2.amazonaws.com/xxxx
$ oc get no                       
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-139-76.us-east-2.compute.internal   Ready    worker   10h   v1.26.6+6bf3f75
...
$ REGION=us-east-2

2.
$ oc new-project xxia-proj
$ oc create sa aws-provider

3.
$ ccoctl aws create-iam-roles --name=xxia-test --region=$REGION --credentials-requests-dir=credentialsrequest-dir-aws --identity-provider-arn=arn:aws:iam::xxxx:oidc-provider/aos-xxxx.s3.us-east-2.amazonaws.com/xxxx --output-dir=credrequests-ccoctl-output
2023/08/24 20:06:53 Role arn:aws:iam::xxxx:role/xxia-test-xxia-proj-aws-creds created 
2023/08/24 20:06:53 Saved credentials configuration to: credrequests-ccoctl-output/manifests/xxia-proj-aws-creds-credentials.yaml
2023/08/24 20:06:53 Updated Role policy for Role xxia-test-xxia-proj-aws-creds

4.
$ oc annotate sa/aws-provider eks.amazonaws.com/role-arn="arn:aws:iam::xxxx:role/xxia-test-xxia-proj-aws-creds"
$ oc create deployment aws-cli --image=amazon/aws-cli --dry-run=client -o yaml -- sleep 360d | sed "/containers/i \      serviceAccountName: aws-provider" | oc create -f -
$ oc get pod               
NAME                       READY   STATUS    RESTARTS   AGE
aws-cli-84875995cc-svszl   1/1     Running   0          16s

5.
$ oc rsh aws-cli-84875995cc-svszl
sh-4.2$ env | grep AWS
AWS_ROLE_ARN=arn:aws:iam::xxxx:role/xxia-test-xxia-proj-aws-creds
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_DEFAULT_REGION=us-east-2
AWS_REGION=us-east-2

Description of problem:

When upgrading a 4.11.43 cluster to 4.12.21, the Cluster Version Operator is stuck waiting for the Network Operator to update:

$ omc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.43   True        True          14m     Working towards 4.12.21: 672 of 831 done (80% complete), waiting on network

CVO pod log states:

2023-06-16T12:07:22.596127142Z I0616 12:07:22.596023       1 metrics.go:490] ClusterOperator network is not setting the 'operator' version

Indeed the NO version is empty:

$ omc get co network -o json|jq '.status.versions'
null
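
The same check can be done on the ClusterOperator JSON directly. The sketch below assumes the object has been loaded into a dictionary and that `status.versions` is a list of `name`/`version` entries. It returns `None` when the 'operator' version is not set, which is the condition the CVO log reports. The function name is hypothetical.

```python
def operator_version(cluster_operator: dict):
    """Return the 'operator' version from status.versions, or None if unset."""
    versions = (cluster_operator.get("status") or {}).get("versions") or []
    for entry in versions:
        if entry.get("name") == "operator":
            return entry.get("version")
    return None


# status.versions is null in the stuck cluster, so no version is reported.
print(operator_version({"status": {"versions": None}}))
# A healthy operator reports its version under the 'operator' name.
print(operator_version({"status": {"versions": [{"name": "operator", "version": "4.12.21"}]}}))
```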

However, its status is available, not progressing, and not degraded:

$ omc get co network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network             True        False         False      19m
   
Network operator pod log states:

2023-06-16T12:08:56.542287546Z I0616 12:08:56.542271       1 connectivity_check_controller.go:138] ConnectivityCheckController is waiting for transition to desired version (4.12.21) to be completed.
2023-06-16T12:04:40.584407589Z I0616 12:04:40.584349       1 ovn_kubernetes.go:1437] OVN-Kubernetes master and node already at release version 4.12.21; no changes required

The Network Operator pod, however, has the version correctly:
$ omc get pods -n openshift-network-operator -o jsonpath='{.items[].spec.containers[0].env[?(@.name=="RELEASE_VERSION")]}'|jq
{
  "name": "RELEASE_VERSION",
  "value": "4.12.21"
}

Restarts of the related pods had no effect.  I have trace logs of the Network Operator available.  It looked like it might be related to https://github.com/openshift/cluster-network-operator/pull/1818 but that looks to be code introduced in 4.14.

 

Version-Release number of selected component (if applicable):

 

How reproducible:

I have not reproduced.

Steps to Reproduce:

1.  Cluster version began at stable 4.10.56
2.  Upgraded to 4.11.43 successfully
3.  Upgraded to 4.12.21 and is stuck. 

Actual results:

CVO Stuck waiting on NO to complete, NO 

Expected results:

NO to update its version so the CVO can continue.

Additional info:

Bare Metal IPI cluster with OVN Networking.

This is a clone of issue OCPBUGS-18396. The following is the description of the original issue:

CI is almost permanently failing on MTU migration in 4.14 (both SDN and OVN-Kubernetes):

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-cluster-network-operator-master-e2e-network-mtu-migration-sdn-ipv4

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-cluster-network-operator-master-e2e-network-mtu-migration-ovn-ipv4

 

The common issue appears to be that waiting for the MCO times out:

+ echo '[2023-08-31T03:58:16+00:00] Waiting for final Machine Controller Config...'
[2023-08-31T03:58:16+00:00] Waiting for final Machine Controller Config...
+ timeout 900s bash
migration field is not cleaned by MCO
migration field is not cleaned by MCO
migration field is not cleaned by MCO
...

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1979/pull-ci-openshift-cluster-network-operator-master-e2e-network-mtu-migration-sdn-ipv4/1697077984948654080/build-log.txt

Description of problem:

[vmware csi driver] vsphere-syncer does not retry populating the CSINodeTopology with topology information when registration fails.

When the syncer starts, it watches for node events, but it does not retry if registration fails. In the meantime, CSINodeTopology requests might not be served because the VM is not found.
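
The missing behavior is a retry around node registration. The sketch below shows one common way to do that, retrying with exponential backoff. It is written in Python purely for illustration and is not the vSphere CSI syncer code; `register_with_retry`, its parameters, and the `register` callback are all hypothetical names.

```python
import time
from typing import Callable


def register_with_retry(
    register: Callable[[], bool],
    attempts: int = 5,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call register() until it succeeds or the attempts run out.

    The delay doubles after each failed attempt (exponential backoff).
    Hypothetical helper; the real syncer may use a different strategy.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        if register():
            return True
        if attempt < attempts:
            sleep(delay)
            delay *= 2
    return False


# Example: registration fails twice (VM not found yet), then succeeds.
outcomes = iter([False, False, True])
delays = []  # record the requested delays instead of really sleeping
ok = register_with_retry(lambda: next(outcomes), sleep=delays.append)
print(ok, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; a real caller would leave the default.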

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-05-04-090524

How reproducible:

Randomly

Steps to Reproduce:

1. Install an OCP cluster using UPI with encryption
2. Check that the cluster storage operator is not degraded

Actual results:

The cluster storage operator is degraded with VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods

...
2023-05-09T06:06:22.146861934Z I0509 06:06:22.146850       1 main.go:183] ServeMux listening at "0.0.0.0:10300"
2023-05-09T06:07:00.283007138Z E0509 06:07:00.282912       1 main.go:64] failed to establish connection to CSI driver: context canceled
2023-05-09T06:07:07.283109412Z W0509 06:07:07.283061       1 connection.go:173] Still connecting to unix:///csi/csi.sock
...

# Many error logs in the CSI driver report a timeout while waiting for topology labels to be updated in the "compute-2" CSINodeTopology instance.

...
2023-05-09T06:19:16.499856730Z {"level":"error","time":"2023-05-09T06:19:16.499687071Z","caller":"k8sorchestrator/topology.go:837","msg":"timed out while waiting for topology labels to be updated in \"compute-2\" CSINodeTopology instance.","TraceId":"b8d9305e-9681-4eba-a8ac-330383227a23","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/common/commonco/k8sorchestrator.(*nodeVolumeTopology).GetNodeTopologyLabels\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/common/commonco/k8sorchestrator/topology.go:837\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).NodeGetInfo\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/node.go:429\ngithub.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6231\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:922"}
...

Expected results:

The vSphere OCP cluster installs successfully and the cluster storage operator is healthy.

Additional info:

 

Version:

$ openshift-install version
./openshift-install 4.11.0-0.nightly-2022-07-13-131410
built from commit cdb9627de7efb43ad7af53e7804ddd3434b0dc58
release image registry.ci.openshift.org/ocp/release@sha256:c5413c0fdd0335e5b4063f19133328fee532cacbce74105711070398134bb433
release architecture amd64

Platform:

  • Azure IPI

What happened?
When one creates an IPI Azure cluster with an `internal` publishing method, it creates a standard load balancer with an empty definition. This load balancer doesn't serve a purpose as far as I can tell since the configuration is completely empty. Because it doesn't have a public IP address and backend pools it's not providing any outbound connectivity, and there are no frontend IP configurations for ingress connectivity to the cluster.

Below is the ARM template that is deployed by the installer (through terraform)

```
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "loadBalancers_mgahagan411_7p82n_name": {
      "defaultValue": "mgahagan411-7p82n",
      "type": "String"
    }
  },
  "variables": {},
  "resources": [
    {
      "type": "Microsoft.Network/loadBalancers",
      "apiVersion": "2020-11-01",
      "name": "[parameters('loadBalancers_mgahagan411_7p82n_name')]",
      "location": "northcentralus",
      "sku": {
        "name": "Standard",
        "tier": "Regional"
      },
      "properties": {
        "frontendIPConfigurations": [],
        "backendAddressPools": [],
        "loadBalancingRules": [],
        "probes": [],
        "inboundNatRules": [],
        "outboundRules": [],
        "inboundNatPools": []
      }
    }
  ]
}
```

What did you expect to happen?

  • Don't create the standard load balancer on an internal Azure IPI cluster (as it appears to serve no purpose)

How to reproduce it (as minimally and precisely as possible)?
1. Create an IPI cluster with the `publish` installation config set to `Internal` and the `outboundType` set to `UserDefinedRouting`.
```
apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure: {}
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure: {}
  replicas: 3
metadata:
  name: mgahaganpvt
platform:
  azure:
    region: northcentralus
    baseDomainResourceGroupName: os4-common
    outboundType: UserDefinedRouting
    networkResourceGroupName: mgahaganpvt-rg
    virtualNetwork: mgahaganpvt-vnet
    controlPlaneSubnet: mgahaganpvt-master-subnet
    computeSubnet: mgahaganpvt-worker-subnet
pullSecret: HIDDEN
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
publish: Internal
proxy:
  httpProxy: http://proxy-user1:password@10.0.0.0:3128
  httpsProxy: http://proxy-user1:password@10.0.0.0:3128
baseDomain: qe.azure.devcluster.openshift.com
```

2. Show that the JSON content of the standard load balancer is completely empty:
`az network lb show -g myResourceGroup -n myLbName`

```
{
  "name": "mgahagan411-7p82n",
  "id": "/subscriptions/00000000-0000-0000-00000000/resourceGroups/mgahagan411-7p82n-rg/providers/Microsoft.Network/loadBalancers/mgahagan411-7p82n",
  "etag": "W/\"40468fd2-e56b-4429-b582-6852348b6a15\"",
  "type": "Microsoft.Network/loadBalancers",
  "location": "northcentralus",
  "tags": {},
  "properties": {
    "provisioningState": "Succeeded",
    "resourceGuid": "6fb11ec9-d89f-4c05-b201-a61ea8ed55fe",
    "frontendIPConfigurations": [],
    "backendAddressPools": [],
    "loadBalancingRules": [],
    "probes": [],
    "inboundNatRules": [],
    "inboundNatPools": []
  },
  "sku": {
    "name": "Standard"
  }
}
```
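
To spot this condition without reading the JSON by eye, a small script can test whether every traffic-carrying collection is empty. The following Python sketch assumes the JSON shape shown in the `az network lb show` output above; the helper name `is_empty_load_balancer` and the `LB_COLLECTIONS` tuple are hypothetical and cover only the fields that appear in that output.

```python
import json

# Collections that need at least one entry for the load balancer to carry
# traffic. The list is limited to the fields visible in the report.
LB_COLLECTIONS = (
    "frontendIPConfigurations",
    "backendAddressPools",
    "loadBalancingRules",
    "probes",
    "inboundNatRules",
    "inboundNatPools",
)


def is_empty_load_balancer(lb: dict) -> bool:
    """Return True if every listed collection is empty or missing."""
    props = lb.get("properties") or {}
    return all(not props.get(key) for key in LB_COLLECTIONS)


# Trimmed copy of the load balancer JSON from the report.
reported = json.loads(
    """
    {
      "name": "mgahagan411-7p82n",
      "properties": {
        "provisioningState": "Succeeded",
        "frontendIPConfigurations": [],
        "backendAddressPools": [],
        "loadBalancingRules": [],
        "probes": [],
        "inboundNatRules": [],
        "inboundNatPools": []
      },
      "sku": {"name": "Standard"}
    }
    """
)

print(is_empty_load_balancer(reported))
```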

As a developer, I would like to make sure we are using the latest versions of the dependencies we utilize in the /hack/tools/go.mod file.

Description of problem:

On a 4.12.0-0.nightly-2022-09-08-114806 AWS cluster, "remote error: tls: bad certificate" appears in the prometheus-operator-admission-webhook logs. This looks like a regression, because 4.11 has no such issue. The defect does not block functionality, and it seems to come from AWS.

$ oc -n openshift-monitoring get pod | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-7d8fd8b5bb-kjh4f   1/1     Running   0          3h
prometheus-operator-admission-webhook-7d8fd8b5bb-whl5n   1/1     Running   0          3h

$ oc -n openshift-monitoring logs prometheus-operator-admission-webhook-7d8fd8b5bb-kjh4f
level=info ts=2022-09-08T23:32:53.782445094Z caller=main.go:130 address=[::]:8443 msg="Starting TLS enabled server"
ts=2022-09-08T23:33:09.057366056Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:52820: remote error: tls: bad certificate"
ts=2022-09-08T23:33:10.071639453Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:52830: remote error: tls: bad certificate"
ts=2022-09-08T23:33:12.07959313Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:52842: remote error: tls: bad certificate"
ts=2022-09-08T23:33:31.729332249Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39188: remote error: tls: bad certificate"
ts=2022-09-08T23:33:32.7374936Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39196: remote error: tls: bad certificate"
ts=2022-09-08T23:33:34.745945871Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39206: remote error: tls: bad certificate"
ts=2022-09-08T23:33:57.460069283Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:37500: remote error: tls: bad certificate"
ts=2022-09-08T23:33:58.469984958Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:37508: remote error: tls: bad certificate"
ts=2022-09-08T23:34:00.479578826Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:40948: remote error: tls: bad certificate"
ts=2022-09-08T23:36:22.861562723Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:53866: remote error: tls: bad certificate"
ts=2022-09-08T23:36:24.870186206Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:53882: remote error: tls: bad certificate"
ts=2022-09-08T23:39:43.613375962Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:38780: remote error: tls: bad certificate"
ts=2022-09-08T23:39:45.621205524Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:38792: remote error: tls: bad certificate"
ts=2022-09-08T23:46:03.653578785Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:57878: remote error: tls: bad certificate"
ts=2022-09-08T23:46:05.662237056Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:57890: remote error: tls: bad certificate"
ts=2022-09-08T23:49:08.643599472Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:48340: remote error: tls: bad certificate"
ts=2022-09-08T23:52:08.809838473Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:51682: remote error: tls: bad certificate"
ts=2022-09-08T23:52:09.817050146Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:51698: remote error: tls: bad certificate"
ts=2022-09-08T23:55:11.862993344Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:54280: remote error: tls: bad certificate"
ts=2022-09-08T23:58:15.820629264Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:59462: remote error: tls: bad certificate"
ts=2022-09-09T00:01:17.913920461Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:47320: remote error: tls: bad certificate"
ts=2022-09-09T00:04:21.086495988Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:52438: remote error: tls: bad certificate"
ts=2022-09-09T00:07:24.050365477Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:55148: remote error: tls: bad certificate"
ts=2022-09-09T00:07:27.066559749Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:55168: remote error: tls: bad certificate"
ts=2022-09-09T00:10:28.193017562Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:42222: remote error: tls: bad certificate"
ts=2022-09-09T00:10:30.201598245Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:59802: remote error: tls: bad certificate"
ts=2022-09-09T00:13:30.282592276Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:45648: remote error: tls: bad certificate"
ts=2022-09-09T00:13:31.290450933Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:45654: remote error: tls: bad certificate"
ts=2022-09-09T00:13:33.298604517Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:45668: remote error: tls: bad certificate"
ts=2022-09-09T00:16:33.274732648Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:56710: remote error: tls: bad certificate"
ts=2022-09-09T00:19:39.47117325Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:54978: remote error: tls: bad certificate"
ts=2022-09-09T00:25:43.708275724Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:54638: remote error: tls: bad certificate"
ts=2022-09-09T00:28:46.627225713Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:58124: remote error: tls: bad certificate"
ts=2022-09-09T00:28:48.63515681Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39454: remote error: tls: bad certificate"
ts=2022-09-09T00:31:51.728153893Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:56894: remote error: tls: bad certificate"
ts=2022-09-09T00:34:52.775067246Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:34884: remote error: tls: bad certificate"
ts=2022-09-09T00:41:00.843743907Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:41784: remote error: tls: bad certificate"
ts=2022-09-09T00:44:00.933970145Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:36150: remote error: tls: bad certificate"
ts=2022-09-09T00:44:03.949135311Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:36166: remote error: tls: bad certificate"
ts=2022-09-09T00:47:03.97630552Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:44732: remote error: tls: bad certificate"
ts=2022-09-09T00:47:06.991580657Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:44748: remote error: tls: bad certificate"
ts=2022-09-09T00:50:08.31637565Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:54092: remote error: tls: bad certificate"
ts=2022-09-09T00:53:11.264559449Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:43144: remote error: tls: bad certificate"
ts=2022-09-09T00:59:16.306282415Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39864: remote error: tls: bad certificate"
ts=2022-09-09T00:59:17.314074479Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39878: remote error: tls: bad certificate"
ts=2022-09-09T00:59:19.32313415Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:56104: remote error: tls: bad certificate"
ts=2022-09-09T01:08:25.613927992Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:44280: remote error: tls: bad certificate"
ts=2022-09-09T01:08:26.622625145Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:44290: remote error: tls: bad certificate"
ts=2022-09-09T01:08:28.631034721Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:48838: remote error: tls: bad certificate"
ts=2022-09-09T01:11:28.704732265Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:37372: remote error: tls: bad certificate"
ts=2022-09-09T01:11:31.723552093Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:37392: remote error: tls: bad certificate"
ts=2022-09-09T01:17:34.794690109Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:46750: remote error: tls: bad certificate"
ts=2022-09-09T01:17:35.803918438Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:46752: remote error: tls: bad certificate"
ts=2022-09-09T01:17:37.812700046Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:46768: remote error: tls: bad certificate"
ts=2022-09-09T01:20:38.79326772Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:53880: remote error: tls: bad certificate"
ts=2022-09-09T01:23:41.073187846Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:46086: remote error: tls: bad certificate"
ts=2022-09-09T01:23:44.088529273Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:46090: remote error: tls: bad certificate"
ts=2022-09-09T01:26:44.077154097Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:54234: remote error: tls: bad certificate"
ts=2022-09-09T01:26:45.085277729Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:54248: remote error: tls: bad certificate"
ts=2022-09-09T01:26:47.092797767Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:54254: remote error: tls: bad certificate"
ts=2022-09-09T01:29:48.255127155Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39536: remote error: tls: bad certificate"
ts=2022-09-09T01:29:50.263225272Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:56030: remote error: tls: bad certificate"
ts=2022-09-09T01:32:51.618334928Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:42836: remote error: tls: bad certificate"
ts=2022-09-09T01:32:53.627565113Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:42844: remote error: tls: bad certificate"
ts=2022-09-09T01:35:56.945306145Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:57828: remote error: tls: bad certificate"
ts=2022-09-09T01:38:57.721110974Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:54038: remote error: tls: bad certificate"
ts=2022-09-09T01:41:59.901865996Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:46096: remote error: tls: bad certificate"
ts=2022-09-09T01:42:00.903596845Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:46102: remote error: tls: bad certificate"
ts=2022-09-09T01:45:03.034044637Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:55868: remote error: tls: bad certificate"
ts=2022-09-09T01:45:04.042270514Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:55874: remote error: tls: bad certificate"
ts=2022-09-09T01:45:06.05067642Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:55888: remote error: tls: bad certificate"
ts=2022-09-09T01:48:06.178001976Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:56024: remote error: tls: bad certificate"
ts=2022-09-09T01:48:09.192075072Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:37562: remote error: tls: bad certificate"
ts=2022-09-09T01:51:10.203900665Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:33016: remote error: tls: bad certificate"
ts=2022-09-09T01:51:12.212458619Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:33022: remote error: tls: bad certificate"
ts=2022-09-09T01:54:13.294550312Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:38042: remote error: tls: bad certificate"
ts=2022-09-09T01:57:15.292731466Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:43838: remote error: tls: bad certificate"
ts=2022-09-09T02:00:19.408152102Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:42838: remote error: tls: bad certificate"
ts=2022-09-09T02:00:21.41717724Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:42842: remote error: tls: bad certificate"
ts=2022-09-09T02:03:21.342937844Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:55026: remote error: tls: bad certificate"
ts=2022-09-09T02:03:22.350450637Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:55034: remote error: tls: bad certificate"
ts=2022-09-09T02:06:25.421123942Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:34882: remote error: tls: bad certificate"
ts=2022-09-09T02:06:27.428721002Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:34884: remote error: tls: bad certificate"
ts=2022-09-09T02:09:28.541378288Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:52888: remote error: tls: bad certificate"
ts=2022-09-09T02:12:31.610427648Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:47430: remote error: tls: bad certificate"
ts=2022-09-09T02:12:33.618581498Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:47434: remote error: tls: bad certificate"
ts=2022-09-09T02:15:33.601606956Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:37706: remote error: tls: bad certificate"
ts=2022-09-09T02:15:36.617807944Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:37730: remote error: tls: bad certificate"
ts=2022-09-09T02:18:37.815046583Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:45066: remote error: tls: bad certificate"
ts=2022-09-09T02:18:39.822858743Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39614: remote error: tls: bad certificate"
ts=2022-09-09T02:21:40.885368415Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:42250: remote error: tls: bad certificate"

Version-Release number of selected component (if applicable):

"remote error: tls: bad certificate" is in prometheus-operator-admission-webhook logs

How reproducible:

always

Steps to Reproduce:

1. check prometheus-operator-admission-webhook logs.

Actual results:

"remote error: tls: bad certificate" is in prometheus-operator-admission-webhook logs

Expected results:

no error logs

Additional info:

 

 

Description of problem:


We are facing the same issue as JIRA[1] in OCP 4.12 and are requesting a backport of that bug's fix to OCP 4.12.

JIRA[1]: https://issues.redhat.com/browse/OCPBUGS-14064

Port 9447 is exposed from the cluster on one of the control plane nodes and uses weak ciphers and TLS 1.0/TLS 1.1, which is incompatible with the security standards for our product release. We should be able to either disable this port or update the ciphers and TLS version to meet those standards, since TLS 1.0 and TLS 1.1 are old and already deprecated.

We confirmed that FIPS was enabled during cluster deployment by passing the following key-value pair in the config file:
fips: true

On JIRA[1] it is suggested to open a separate Bug for backporting. 


controller: Drop noisy log message about certificates

I often turn to the controller pod logs to debug issues, and
this log message is repeated very often. While it was
probably useful at the time the feature was being developed/tested
I doubt it will be necessary in the future.

In the end, the status really is the debugging frontend I believe.


controller: Drop noisy BaseOSContainerImage log message

In general we should avoid logging unless something changed.
I don't believe we need this log message, we can detect OS
changes from e.g. the MCD logs.

Description of problem:

The HyperShift KubeVirt (OpenShift Virtualization) platform has worker nodes that are hosted by KubeVirt virtual machines. The worker node's internal IP address is determined by inspecting the KubeVirt VMI's vmi.status.interfaces field.

Because the vmi.status.interfaces field sources its information from the QEMU guest agent, it is not guaranteed to remain static in some scenarios, such as a soft reboot or when the QEMU agent is temporarily unavailable. In these situations, the interfaces list is empty.

When the interfaces list is empty on the VMI, HyperShift-related components (cloud-provider-kubevirt and cluster-api-provider-kubevirt) strip the worker node's internal IP. Stripping the node's internal IP causes unpredictable behavior that results in connectivity failures from the KAS to the worker node kubelets.

To address this, the HyperShift-related KubeVirt components need to update the internal IP of worker nodes only when the vmi.status.interfaces list has an IP for the default interface. Otherwise, these HyperShift components should use the last known internal IP address rather than stripping the internal IP address from the node.
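
The proposed rule can be sketched as a small function. This is an illustrative Python sketch, not the code in cloud-provider-kubevirt or cluster-api-provider-kubevirt: the function name `next_internal_ip` is hypothetical, and the interface entries are assumed to be dictionaries with `name` and `ipAddress` keys.

```python
from typing import Optional


def next_internal_ip(
    interfaces: list, default_interface: str, last_known_ip: Optional[str]
) -> Optional[str]:
    """Pick the internal IP to report for a worker node.

    interfaces stands in for vmi.status.interfaces (assumed shape: a list
    of dicts with 'name' and 'ipAddress'). If the default interface has no
    IP right now, keep the last known address instead of stripping it.
    """
    for iface in interfaces:
        if iface.get("name") == default_interface and iface.get("ipAddress"):
            return iface["ipAddress"]
    return last_known_ip


# Guest agent is reporting: the IP is taken from the default interface.
ip = next_internal_ip(
    [{"name": "default", "ipAddress": "10.0.0.5"}], "default", None
)
# Guest agent temporarily unavailable: the list is empty, the IP is kept.
ip_after_outage = next_internal_ip([], "default", ip)
print(ip, ip_after_outage)
```

The key design point is that an empty or incomplete interfaces list is treated as "no new information" rather than "the node has no IP".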

Version-Release number of selected component (if applicable):

4.14

How reproducible:

100% given enough time and the right environment.

Steps to Reproduce:

1. create a hypershift kubevirt guest cluster
2. run the csi conformance test suite in a loop (this test suite causes the vmi.status.interfaces list to become unstable briefly at times)

Actual results:

The CSI test suite eventually begins failing because it is unable to pod exec into worker node pods. This is caused by the node's internal IP being removed.

Expected results:

csi conformance should pass reliably

Additional info:

 

We have occasional cases where admins attempt a rollback, despite long-standing docs:

Only upgrading to a newer version is supported. Reverting or rolling back your cluster to a previous version is not supported. If your update fails, contact Red Hat support.

Deeper history for that content here, here, and here. We could refuse to accept rollbacks unless the administrator sets Force to waive our guards.
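
A guard of this kind could look roughly like the sketch below. It is a simplified Python illustration, not the CVO's implementation: the names `parse_version` and `accept_update` are hypothetical, the `force` parameter stands in for the Force setting mentioned above, and the comparison deliberately ignores pre-release suffixes such as `-rc.5`.

```python
import re


def parse_version(version: str) -> tuple:
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints.

    Simplification: any suffix after X.Y.Z (for example '-rc.5') is ignored.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.groups())


def accept_update(current: str, target: str, force: bool = False) -> bool:
    """Refuse a target older than the current version unless force is set."""
    if parse_version(target) < parse_version(current):
        return force
    return True


print(accept_update("4.11.43", "4.12.21"))              # upgrade
print(accept_update("4.12.21", "4.11.43"))              # rollback
print(accept_update("4.12.21", "4.11.43", force=True))  # forced rollback
```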

From wking:

$ git --no-pager grep OCPBUGS-10218
test/e2e/nodepool_test.go: // TODO: (csrwng) Re-enable when https://issues.redhat.com/browse/OCPBUGS-10218 is fixed
test/e2e/nodepool_test.go: // TODO: (jparrill) Re-enable when https://issues.redhat.com/browse/OCPBUGS-10218 is fixed
but https://issues.redhat.com/browse/OCPBUGS-10218 was closed as a dup of https://issues.redhat.com/browse/OCPBUGS-10485 , and OCPBUGS-10485 is Verified with happy sounds for both 4.13 and 4.14 nightlies
 

Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/48

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The Horizontal Nav component doesn't re-render when the location changes. Currently it only updates itself when basePath changes. The location-based re-render was previously triggered by the withRouter HoC, which was recently removed.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

1/1

Steps to Reproduce:

1. Go to Storage -> ODF (version 4.13-pre-release)
2. Click on the Storage System tab and then the Topology tab

Actual results:

The selected tab doesn't get highlighted as the active tab.

Expected results:

The selected tab should have the active blue color.

Additional info:

 

This is a clone of issue OCPBUGS-18498. The following is the description of the original issue:

Description of problem:

When the Build and DeploymentConfig capabilities are not installed, running `oc new-app registry.redhat.io/<namespace>/<image>:<tag>` creates a Deployment whose spec.containers[0].image is empty. The Deployment then fails to start its pod.
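
One way to detect the symptom is to scan a Deployment for containers that have no image. The Python sketch below assumes the Deployment has already been loaded as a dictionary (for example from `oc get deploy <name> -o json`) and looks in the pod template at `spec.template.spec.containers`; the helper name `containers_missing_image` is hypothetical.

```python
def containers_missing_image(deployment: dict) -> list:
    """Return names of containers whose image is empty, blank, or absent."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    return [
        container.get("name", "<unnamed>")
        for container in pod_spec.get("containers", [])
        if not (container.get("image") or "").strip()
    ]


# Minimal stand-in for the Deployment created by 'oc new-app' in this report.
broken = {
    "kind": "Deployment",
    "metadata": {"name": "httpd-24"},
    "spec": {
        "template": {
            "spec": {"containers": [{"name": "httpd-24", "image": " "}]}
        }
    },
}

print(containers_missing_image(broken))
```

The `.strip()` call treats a whitespace-only image the same as an empty one, since either would prevent the pod from starting.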

Version-Release number of selected component (if applicable):

oc version
Client Version: 4.14.0-0.nightly-2023-08-22-221456
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2023-09-02-132842
Kubernetes Version: v1.27.4+2c83a9f

How reproducible:

Always

Steps to Reproduce:

1. Install a cluster without the Build and DeploymentConfig capabilities:
Set "baselineCapabilitySet: None" in install-config
2. Create a Deployment using the 'new-app' command:
oc new-app registry.redhat.io/ubi8/httpd-24:latest
3.

Actual results:

2.
$oc new-app registry.redhat.io/ubi8/httpd-24:latest
--> Found container image c412709 (11 days old) from registry.redhat.io for "registry.redhat.io/ubi8/httpd-24:latest"

    Apache httpd 2.4
    ----------------
    Apache httpd 2.4 available as container, is a powerful, efficient, and extensible web server. Apache supports a variety of features, many implemented as compiled modules which extend the core functionality. These can range from server-side programming language support to authentication schemes. Virtual hosting allows one Apache installation to serve many different Web sites.

    Tags: builder, httpd, httpd-24

    * An image stream tag will be created as "httpd-24:latest" that will track this image

--> Creating resources ...
    imagestream.image.openshift.io "httpd-24" created
    deployment.apps "httpd-24" created
    service "httpd-24" created
--> Success
    Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
     'oc expose service/httpd-24'
    Run 'oc status' to view your app

3. oc get deploy -o yaml
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"httpd-24:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"httpd-24\")].image"}]'
      openshift.io/generated-by: OpenShiftNewApp
    creationTimestamp: "2023-09-04T07:44:01Z"
    generation: 1
    labels:
      app: httpd-24
      app.kubernetes.io/component: httpd-24
      app.kubernetes.io/instance: httpd-24
    name: httpd-24
    namespace: wxg
    resourceVersion: "115441"
    uid: 909d0c4e-180c-4f88-8fb5-93c927839903
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        deployment: httpd-24
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          openshift.io/generated-by: OpenShiftNewApp
        creationTimestamp: null
        labels:
          deployment: httpd-24
      spec:
        containers:
        - image: ' '
          imagePullPolicy: IfNotPresent
          name: httpd-24
          ports:
          - containerPort: 8080
            protocol: TCP
          - containerPort: 8443
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    conditions:
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Created new replica set "httpd-24-7f6b55cc85"
      reason: NewReplicaSetCreated
      status: "True"
      type: Progressing
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: 'Pod "httpd-24-7f6b55cc85-pvvgt" is invalid: spec.containers[0].image:
        Invalid value: " ": must not have leading or trailing whitespace'
      reason: FailedCreate
      status: "True"
      type: ReplicaFailure
    observedGeneration: 1
    unavailableReplicas: 1
kind: List
metadata:

Expected results:

Should set spec.containers[0].image to registry.redhat.io/ubi8/httpd-24:latest
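
The ReplicaFailure condition above rejects the pod because the image field is a single space. As a rough illustration, the same condition can be checked on the client side. This is a minimal Python sketch that assumes the Deployment has already been loaded into a dict (for example from `oc get deploy -o json`); the helper name `blank_image_containers` is hypothetical and is not part of `oc`.

```python
def blank_image_containers(deployment):
    """Return names of containers whose image is empty or has leading/trailing whitespace."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    bad = []
    for container in containers:
        image = container.get("image", "")
        # The API server rejects images with leading or trailing whitespace.
        if image == "" or image != image.strip():
            bad.append(container.get("name", "<unnamed>"))
    return bad


# Trimmed-down copy of the Deployment shown in the actual results.
broken = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "httpd-24", "image": " "},
    ]}}}
}
# The same Deployment with the image the report expects.
fixed = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "httpd-24", "image": "registry.redhat.io/ubi8/httpd-24:latest"},
    ]}}}
}

print(blank_image_containers(broken))  # ['httpd-24']
print(blank_image_containers(fixed))   # []
```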

Additional info:

 

Currently the upgrade feature agent is disabled by default and enabled explicitly only for the SaaS environment. This ticket is about enabling it by default also for ACM.
 

Description of problem:

Deploying a Helm chart that features a values.schema.json using either the 2019-09 or 2020-12 (latest) revision of JSON Schema results in the UI hanging on create, showing a three-dot loading indicator. This is not the case if the YAML view is used, presumably because that view does not attempt any validation itself and lets Helm validate the chart values against the schema.

Version-Release number of selected component (if applicable):

Reproduced in 4.13, probably affects other versions as well.

How reproducible:

100%

Steps to Reproduce:

1. Go to Helm tab.
2. Click create in top right and select Repository
3. Paste following into YAML view and click Create:

apiVersion: helm.openshift.io/v1beta1
kind: ProjectHelmChartRepository
metadata:
  name: reproducer
spec:
  connectionConfig:
    url: 'https://raw.githubusercontent.com/tumido/helm-backstage/blog2'

4. Go to the Helm tab again (if redirected elsewhere)
5. Click create in top right and select Helm Release
6. In catalog filter select Chart repositories: Reproducer
7. Click on the single tile available (Backstage) and click Create
8. Switch to Form view
9. Leave default values and click Create
10. Stare at the always loading screen that never proceeds further.

Actual results:

Expected results:

It installs and deploys the chart

Additional info:

This is caused by a JSON Schema containing a $schema key indicating which revision of the JSON Schema standard should be used:

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
}

I've managed to trace this back to this react-jsonschema-form issue:

https://github.com/rjsf-team/react-jsonschema-form/issues/2241

It seems the library used here for validation supports neither the 2019-09 draft nor the most current revision, 2020-12.

It happens only if the chart follows the JSON Schema standard and declares the revision properly.

Workarounds:

IMO best solution:
Helm form renderer should NOT do any validation, since it can't handle the schema properly. Instead, it should leave this job to the Helm backend. Helm validates the values against the schema when installing the chart anyways. The YAML view also does no validation. That one seems to do the job properly.
 
Currently, no formal requirement for charts admitted to the curated Helm catalog states that the most recent supported JSON Schema revision is 4 years old and that the 2 later revisions are not supported.

Also, the Form UI should not just hang on submit. Instead, it should at least fail gracefully.
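
One way to read the "fail gracefully" suggestion is to inspect the declared $schema before attempting form validation. The following is a minimal Python sketch of that idea, not the console's implementation: the function name and the list of unsupported draft markers are assumptions taken from the description above (2019-09 and 2020-12 reported as unsupported).

```python
import json

# Assumption: drafts the form validator cannot handle, per the issue description.
UNSUPPORTED_DRAFT_MARKERS = ("draft/2019-09", "draft/2020-12")


def form_validation_supported(schema_text):
    """Return False when the schema declares a draft the form validator cannot handle."""
    schema = json.loads(schema_text)
    declared = schema.get("$schema", "")
    return not any(marker in declared for marker in UNSUPPORTED_DRAFT_MARKERS)


newer = '{"$schema": "https://json-schema.org/draft/2020-12/schema"}'
older = '{"$schema": "http://json-schema.org/draft-07/schema#"}'

print(form_validation_supported(newer))  # False
print(form_validation_supported(older))  # True
```

A caller could use the False result to skip client-side validation and leave validation to Helm, or to show an error instead of waiting indefinitely.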

 

Related to:

https://github.com/janus-idp/helm-backstage/issues/64#issuecomment-1587678319

Description of problem

CI is flaky because of test failures such as the following:

{  fail [github.com/openshift/origin/test/extended/oauth/requestheaders.go:218]: full response header: HTTP/1.1 403 Forbidden
Content-Length: 192
Audit-Id: f6026f9b-06c5-4b4a-9414-8dc5c681b45a
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Content-Type: application/json
Date: Tue, 08 Aug 2023 11:26:35 GMT
Expires: 0
Pragma: no-cache
Referrer-Policy: strict-origin-when-cross-origin
X-Content-Type-Options: nosniff
X-Dns-Prefetch-Control: off
X-Frame-Options: DENY
X-Xss-Protection: 1; mode=block

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"forbidden: User \"system:anonymous\" cannot get path \"/metrics\"","reason":"Forbidden","details":{},"code":403}


Expected
    <string>: 403 Forbidden
to contain substring
    <string>: 401 Unauthorized
Ginkgo exit error 1: exit with code 1}

This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/380/pull-ci-openshift-openshift-apiserver-master-e2e-aws-ovn-serial/1688848417708576768. Search.ci has other similar failures.

Version-Release number of selected component (if applicable)

I have seen this in 4.14 CI jobs and 4.13 CI jobs.

How reproducible

Presently, search.ci shows the following stats for the past 14 days:

Found in 2.41% of runs (4.36% of failures) across 1078 total runs and 58 jobs (55.38% failed)
pull-ci-openshift-openshift-apiserver-master-e2e-aws-ovn-serial (all) - 25 runs, 40% failed, 20% of failures match = 8% impact
openshift-cluster-network-operator-1874-nightly-4.14-e2e-aws-ovn-serial (all) - 42 runs, 67% failed, 14% of failures match = 10% impact
pull-ci-openshift-kubernetes-master-e2e-aws-ovn-serial (all) - 59 runs, 54% failed, 6% of failures match = 3% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-serial (all) - 434 runs, 66% failed, 2% of failures match = 1% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-serial (all) - 55 runs, 49% failed, 7% of failures match = 4% impact
pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-serial (all) - 60 runs, 58% failed, 3% of failures match = 2% impact
pull-ci-operator-framework-operator-marketplace-master-e2e-aws-ovn-serial (all) - 24 runs, 38% failed, 22% of failures match = 8% impact
pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-serial (all) - 81 runs, 58% failed, 4% of failures match = 2% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-serial (all) - 35 runs, 46% failed, 13% of failures match = 6% impact
rehearse-41872-pull-ci-openshift-ovn-kubernetes-release-4.14-e2e-aws-ovn-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial (all) - 72 runs, 49% failed, 3% of failures match = 1% impact
pull-ci-openshift-cluster-kube-apiserver-operator-release-4.13-e2e-aws-ovn-serial (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
pull-ci-openshift-cluster-dns-operator-master-e2e-aws-ovn-serial (all) - 19 runs, 63% failed, 8% of failures match = 5% impact
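
The "impact" figure on each line appears to be the share of all runs that hit this failure. The Python sketch below reconstructs it from the three published numbers; this is an interpretation of the figures above, not search.ci's documented formula, and the function name is hypothetical.

```python
def impact_percent(runs, failed_pct, match_pct):
    """Estimate the share of all runs (in percent) that hit the matching failure."""
    failures = round(runs * failed_pct / 100)    # runs that failed
    matches = round(failures * match_pct / 100)  # failed runs that match the search
    return round(100 * matches / runs)


# Figures taken from three of the job lines above.
print(impact_percent(25, 40, 20))  # 8
print(impact_percent(42, 67, 14))  # 10
print(impact_percent(4, 75, 33))   # 25
```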

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check search.ci using the link above.

Actual results

CI fails.

Expected results

CI passes, or fails on some other test failure.

Context:

In 4.14 the kubelet config from the MCO payload comes with --external, which means the node.cloudprovider.kubernetes.io/uninitialized taint is set on new nodes. The taint prevents workloads from being scheduled and is only removed by the external cloud provider.

This is a result of AWS removing their in-tree provider implementation for K8s 1.27.
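
To illustrate the symptom, the Python sketch below checks whether a Node object still carries the uninitialized taint. It assumes the Node has been loaded into a dict (for example from `oc get node <name> -o json`); the helper name is hypothetical.

```python
UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"


def is_uninitialized(node):
    """Return True while the node still carries the cloud-provider uninitialized taint."""
    taints = node.get("spec", {}).get("taints") or []
    return any(taint.get("key") == UNINITIALIZED_TAINT for taint in taints)


# Hypothetical node before the external cloud provider has processed it.
pending_node = {
    "spec": {"taints": [
        {"key": UNINITIALIZED_TAINT, "value": "true", "effect": "NoSchedule"},
    ]}
}
# Hypothetical node after the taint has been removed.
ready_node = {"spec": {}}

print(is_uninitialized(pending_node))  # True
print(is_uninitialized(ready_node))    # False
```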

DoD:

We need to let the CPO run the AWS external cloud provider.

Description of problem:

2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health [-] Component KuryrPortHandler is dead. Last caught exception below: openstack.exceptions.InvalidRequest: Request requires an ID but none was found
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last):
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 169, in on_finalize
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     pod = self.k8s.get(f"{constants.K8S_API_NAMESPACES}"
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 121, in get
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._raise_from_response(response)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 99, in _raise_from_response
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     raise exc.K8sResourceNotFound(response.text)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \\"mygov-tuo-microservice-dev2-59fffbc58c-l5b79\\" not found","reason":"NotFound","details":{"name":"mygov-tuo-microservice-dev2-59fffbc58c-l5b79","kind":"pods"},"code":404}\n'
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health During handling of the above exception, another exception occurred:
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last):
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/logging.py", line 38, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._handler(event, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 85, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._handler(event, *args, retry_info=info, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 98, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self.on_finalize(obj, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 184, in on_finalize
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     pod = self._mock_cleanup_pod(kuryrport_crd)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 160, in _mock_cleanup_pod
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     host_ip = utils.get_parent_port_ip(port_id)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/utils.py", line 830, in get_parent_port_ip
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     parent_port = os_net.get_port(port_id)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 1987, in get_port
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     return self._get(_port.Port, port)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 48, in check
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     return method(self, expected, actual, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 513, in _get
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     resource_type=resource_type.__name__, value=value))
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1472, in fetch
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     base_path=base_path)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_base.py", line 26, in _prepare_request
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     base_path=base_path, params=params)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1156, in _prepare_request
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     "Request requires an ID but none was found")
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health openstack.exceptions.InvalidRequest: Request requires an ID but none was found
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.918 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping
2023-04-20 02:08:09.919 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworks'
2023-04-20 02:08:10.026 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/machine.openshift.io/v1beta1/machines'
2023-04-20 02:08:10.152 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/pods'
2023-04-20 02:08:10.174 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/networking.k8s.io/v1/networkpolicies'
2023-04-20 02:08:10.857 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/namespaces'
2023-04-20 02:08:10.877 1 WARNING kuryr_kubernetes.controller.drivers.utils [-] Namespace dev-health-air-ids not yet ready: kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnetworks.openstack.org \\"dev-health-air-ids\\" not found","reason":"NotFound","details":{"name":"dev-health-air-ids","group":"openstack.org","kind":"kuryrnetworks"},"code":404}\n'
2023-04-20 02:08:11.024 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/services'
2023-04-20 02:08:11.078 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/endpoints'
2023-04-20 02:08:11.170 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrports'
2023-04-20 02:08:11.344 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworkpolicies'
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrloadbalancers'
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] No remaining active watchers, Exiting...
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Create a pod.
2. Stop kuryr-controller.
3. Delete the pod and the finalizer on it.
4. Delete pod's subport.
5. Start the controller.

Actual results:

Crash

Expected results:

Port cleaned up normally.
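
The traceback shows the port lookup being called without an ID once the subport is gone. The Python sketch below illustrates one defensive pattern for that situation: return early when no port ID is known. It is not the actual Kuryr fix; `FakeNetworkClient` and `safe_parent_port_ip` are hypothetical stand-ins written for this example.

```python
class FakeNetworkClient:
    """Hypothetical stand-in for the OpenStack network client, for illustration only."""

    def __init__(self, ports):
        self._ports = ports

    def get_port(self, port_id):
        # Mimics the failure in the traceback when no ID is supplied.
        if not port_id:
            raise ValueError("Request requires an ID but none was found")
        return self._ports[port_id]


def safe_parent_port_ip(os_net, port_id):
    """Return the first fixed IP of the port, or None when no port ID is known."""
    if not port_id:
        # Nothing to look up; the caller can continue with the rest of the cleanup.
        return None
    port = os_net.get_port(port_id)
    fixed_ips = port.get("fixed_ips") or []
    return fixed_ips[0]["ip_address"] if fixed_ips else None


client = FakeNetworkClient({"port-1": {"fixed_ips": [{"ip_address": "10.0.0.5"}]}})
print(safe_parent_port_ip(client, "port-1"))  # 10.0.0.5
print(safe_parent_port_ip(client, None))      # None
```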

Additional info:


Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/75

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In https://github.com/openshift/cluster-baremetal-operator/blob/master/provisioning/utils.go#L65 we reference the .PlatformStatus.BareMetal.APIServerInternalIP attribute from the config API. Meanwhile, a recent change, https://github.com/openshift/api/commit/51f399230d604fa013c2bb341040c4ad126e7309, deprecated this field in favour of .APIServerInternalIPs (note the plural) to better suit the dual-stack case.

We need to update the code (trivial) along with the vendored dependencies (openshift/api needs a bump to a version equal to or later than the one including the commit referenced above). Code changes will likely be required in CBO to adapt to the newer API package.
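
The field change can be sketched as "prefer the plural field, fall back to the deprecated singular one". The Python example below illustrates that lookup on an Infrastructure object loaded into a dict. CBO itself is written in Go, so this is only an illustration of the logic; the function name is hypothetical and the JSON key names are assumed to match the openshift/api field tags.

```python
def api_server_internal_ips(infrastructure):
    """Return the API server VIPs, preferring the plural field over the deprecated one."""
    platform_status = infrastructure.get("status", {}).get("platformStatus") or {}
    baremetal = platform_status.get("baremetal") or {}
    ips = baremetal.get("apiServerInternalIPs") or []
    if ips:
        return list(ips)
    # Fall back to the deprecated single-value field.
    legacy = baremetal.get("apiServerInternalIP")
    return [legacy] if legacy else []


# Hypothetical dual-stack status using the plural field.
dual_stack = {"status": {"platformStatus": {"baremetal": {
    "apiServerInternalIPs": ["192.0.2.5", "2001:db8::5"],
    "apiServerInternalIP": "192.0.2.5",
}}}}
# Hypothetical older status with only the deprecated field.
legacy_only = {"status": {"platformStatus": {"baremetal": {
    "apiServerInternalIP": "192.0.2.5",
}}}}

print(api_server_internal_ips(dual_stack))   # ['192.0.2.5', '2001:db8::5']
print(api_server_internal_ips(legacy_only))  # ['192.0.2.5']
```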

Slack threads for reference: https://app.slack.com/client/T027F3GAJ/C01RJHA6BRC/thread/C01RJHA6BRC-1661416223.353009 (vendor dependency update)

openshift/api change:
https://coreos.slack.com/archives/C01RJHA6BRC/p1660573560434409?thread_ts=1660229723.998839&cid=C01RJHA6BRC

IMPORTANT NOTE: there is an in-flight PR which is making changes to the CBO code fetching the VIP: https://github.com/openshift/cluster-baremetal-operator/pull/285.

Work done to address this bug needs to be stacked on top of this to avoid duplication of effort (the easiest way is to work on the code from the in-flight PR285 and merge once PR285 merges)

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/95

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bugs are required for all 4.14 merges right now due to instability. We need to bump the version of the cvo so that the version is consistent with the cluster being installed.

After running several scale tests on a large cluster (252 workers), etcd ran out of space and became unavailable.

 

These tests consisted of running our node-density workload (Creates more than 50k pause pods) and cluster-density 4k several times (creates 4k namespaces with https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner#cluster-density-variables).

 

The actions above led the etcd peers to run out of free space in their 4GiB PVCs, producing the following error trace:

{"level":"warn","ts":"2023-03-31T09:50:57.532Z","caller":"rafthttp/http.go:271","msg":"failed to save incoming database snapshot","local-member-id":"b14198cd7f0eebf1","remote-snapshot-sender-id":"a4e894c3f4af1379","incoming-snapshot-index ":19490191,"error":"write /var/lib/data/member/snap/tmp774311312: no space left on device"} 

 

Etcd uses 4GiB PVCs to store its data, which seems to be insufficient for this scenario. In addition, unlike non-HyperShift clusters, we are not applying any periodic database defragmentation (on those clusters this is done by cluster-etcd-operator), which could lead to a larger database size.

 

The graph below represents the metrics etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes
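
The gap between those two metrics gives a rough measure of how much space defragmentation could reclaim. The Python sketch below computes that share; the function name and the sample values are hypothetical and are not taken from the cluster in this report.

```python
def fragmentation_percent(total_bytes, in_use_bytes):
    """Share of the on-disk database size that is not in use, in percent."""
    if total_bytes <= 0:
        return 0.0
    return 100.0 * (total_bytes - in_use_bytes) / total_bytes


GIB = 2 ** 30
# Hypothetical sample: a 4 GiB database file of which 1 GiB is in use.
print(fragmentation_percent(4 * GIB, 1 * GIB))  # 75.0
```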

 

 

Description of problem:

In our IBM Cloud use-case of RHCOS, we are seeing 4.13 RHCOS nodes failing to properly bootstrap to a HyperShift 4.13 control plane. RHCOS worker node kubelet is failing with "failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/kubelet-ca.crt: open /etc/kubernetes/kubelet-ca.crt: no such file or directory". 

Version-Release number of selected component (if applicable):

4.13.0-rc.6

How reproducible:

100%

Steps to Reproduce:

1. Create a HyperShift 4.13 control plane
2. Boot a RHCOS host outside of cluster
3. After initial RHCOS boot, fetch ignition from control plane
4. Attempt to bootstrap to cluster via `machine-config-daemon firstboot-complete-machineconfig`

Actual results:

Kubelet service fails with "failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/kubelet-ca.crt: open /etc/kubernetes/kubelet-ca.crt: no such file or directory".

Expected results:

RHCOS worker node to properly bootstrap to HyperShift control plane. This has been the supported bootstrapping flow for releases <4.13.

Additional info:

References:
- https://redhat-internal.slack.com/archives/C01C8502FMM/p1682968210631419
- https://github.com/openshift/machine-config-operator/pull/3575
- https://github.com/openshift/machine-config-operator/pull/3654

This is a clone of issue OCPBUGS-18907. The following is the description of the original issue:

Description of problem:

Follow-on to https://issues.redhat.com/browse/OCPBUGS-17827

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster -n clusters
NAME       VERSION                              KUBECONFIG                  PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
jie-test   4.14.0-0.nightly-2023-09-12-024050   jie-test-admin-kubeconfig   Completed   True        False         The hosted control plane is available
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jie-test | grep router
router-78d47f4c69-2mvbp                               1/1     Running            0          68m
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get pods router-78d47f4c69-2mvbp -n clusters-jie-test -ojsonpath='{.metadata.labels}' | jq
{
  "app": "private-router",
  "hypershift.openshift.io/hosted-control-plane": "clusters-jie-test",
  "hypershift.openshift.io/request-serving-component": "true",
  "pod-template-hash": "78d47f4c69"
}
jiezhao-mac:hypershift jiezhao$ oc get networkpolicy management-kas  -n clusters-jie-test
NAME             POD-SELECTOR                                                                                   AGE
management-kas   !hypershift.openshift.io/need-management-kas-access,name notin (aws-ebs-csi-driver-operator)   76m
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get networkpolicy management-kas  -n clusters-jie-test -o yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    hypershift.openshift.io/cluster: clusters/jie-test
  creationTimestamp: "2023-09-12T14:43:13Z"
  generation: 1
  name: management-kas
  namespace: clusters-jie-test
  resourceVersion: "54049"
  uid: 72288fed-a1f6-4dc9-bb63-981d7cdd479f
spec:
  egress:
  - to:
    - podSelector: {}
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.0.46.47/32
        - 10.0.7.159/32
        - 10.0.77.20/32
        - 10.128.0.0/14
  - ports:
    - port: 5353
      protocol: UDP
    - port: 5353
      protocol: TCP
    to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-dns
  podSelector:
    matchExpressions:
    - key: hypershift.openshift.io/need-management-kas-access
      operator: DoesNotExist
    - key: name
      operator: NotIn
      values:
      - aws-ebs-csi-driver-operator
  policyTypes:
  - Egress
status: {}
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get endpoints -n default kubernetes
NAME         ENDPOINTS                                         AGE
kubernetes   10.0.46.47:6443,10.0.7.159:6443,10.0.77.20:6443   150m
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get endpoints -n default kubernetes -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: "2023-09-12T13:32:47Z"
  labels:
    endpointslice.kubernetes.io/skip-mirror: "true"
  name: kubernetes
  namespace: default
  resourceVersion: "31961"
  uid: bc170a67-018f-4490-a18c-811ebd3f3676
subsets:
- addresses:
  - ip: 10.0.46.47
  - ip: 10.0.7.159
  - ip: 10.0.77.20
  ports:
  - name: https
    port: 6443
    protocol: TCP
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get endpoints -n default kubernetes -ojsonpath='{.subsets[].addresses[].ip}{"\n"}'
10.0.46.47
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get endpoints -n default kubernetes -ojsonpath='{.subsets[].ports[].port}{"\n"}'
6443
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc project clusters-jie-test
Now using project "clusters-jie-test" on server "https://api.jiezhao-091201.qe.devcluster.openshift.com:6443".
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc -n clusters-jie-test rsh pod/router-78d47f4c69-2mvbp curl --connect-timeout 2 -Iks https://10.0.46.47:6443 -v 
* Rebuilt URL to: https://10.0.46.47:6443/
*   Trying 10.0.46.47...
* TCP_NODELAY set
* Connected to 10.0.46.47 (10.0.46.47) port 6443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=172.30.0.1
*  start date: Sep 12 13:35:51 2023 GMT
*  expire date: Oct 12 13:35:52 2023 GMT
*  issuer: OU=openshift; CN=kube-apiserver-service-network-signer
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* Using Stream ID: 1 (easy handle 0x55c5c46cb990)
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> HEAD / HTTP/2
> Host: 10.0.46.47:6443
> User-Agent: curl/7.61.1
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* Connection state changed (MAX_CONCURRENT_STREAMS == 2000)!
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/2 403 
HTTP/2 403 
< audit-id: 82d5f3f7-6e5b-4bb5-b846-54df09aefb54
audit-id: 82d5f3f7-6e5b-4bb5-b846-54df09aefb54
< cache-control: no-cache, private
cache-control: no-cache, private
< content-type: application/json
content-type: application/json
< strict-transport-security: max-age=31536000; includeSubDomains; preload
strict-transport-security: max-age=31536000; includeSubDomains; preload
< x-content-type-options: nosniff
x-content-type-options: nosniff
< x-kubernetes-pf-flowschema-uid: 6edd6532-2d15-4d8d-9cea-4dcce99b881f
x-kubernetes-pf-flowschema-uid: 6edd6532-2d15-4d8d-9cea-4dcce99b881f
< x-kubernetes-pf-prioritylevel-uid: 4115bb59-a78d-42ab-9136-37529cf107e1
x-kubernetes-pf-prioritylevel-uid: 4115bb59-a78d-42ab-9136-37529cf107e1
< content-length: 218
content-length: 218
< date: Tue, 12 Sep 2023 16:05:02 GMT
date: Tue, 12 Sep 2023 16:05:02 GMT
< 
* Connection #0 to host 10.0.46.47 left intact
jiezhao-mac:hypershift jiezhao$ 
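
The ipBlock rule in the management-kas policy above lists the three kubernetes endpoint addresses under `except`, yet the curl from the router pod still connected to 10.0.46.47:6443. The Python sketch below evaluates only that ipBlock rule for a destination address. It ignores the policy's other egress rules, the pod selector, and how the CNI enforces the policy; the function name is hypothetical.

```python
import ipaddress


def ip_block_allows(dest, cidr, excepts):
    """Return True if dest is inside cidr and not inside any of the except ranges."""
    addr = ipaddress.ip_address(dest)
    if addr not in ipaddress.ip_network(cidr):
        return False
    return not any(addr in ipaddress.ip_network(e) for e in excepts)


# Values copied from the NetworkPolicy output above.
CIDR = "0.0.0.0/0"
EXCEPT = ["10.0.46.47/32", "10.0.7.159/32", "10.0.77.20/32", "10.128.0.0/14"]

print(ip_block_allows("10.0.46.47", CIDR, EXCEPT))  # False
print(ip_block_allows("8.8.8.8", CIDR, EXCEPT))     # True
```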

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-19059. The following is the description of the original issue:

Description of problem:

On a baremetal 4.14.0-rc.0 IPv6 SNO cluster, after logging in to the admin console as an admin user, there is no Observe menu on the left navigation bar (see picture: https://drive.google.com/file/d/13RAXPxtKhAElN9xf8bAmLJa0GI8pP0fH/view?usp=sharing). The monitoring-plugin status is Failed (see: https://drive.google.com/file/d/1YsSaGdLT4bMn-6E-WyFWbOpwvDY4t6na/view?usp=sharing), and the error is:

Failed to get a valid plugin manifest from /api/plugins/monitoring-plugin/
r: Bad Gateway 

The console logs show that connections to port 9443 are refused:

$ oc -n openshift-console logs console-6869f8f4f4-56mbj
...
E0915 12:50:15.498589       1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::f735]:9443: connect: connection refused
2023/09/15 12:50:15 http: panic serving [fd01:0:0:1::2]:39156: runtime error: invalid memory address or nil pointer dereference
goroutine 183760 [running]:
net/http.(*conn).serve.func1()
    /usr/lib/golang/src/net/http/server.go:1854 +0xbf
panic({0x3259140, 0x4fcc150})
    /usr/lib/golang/src/runtime/panic.go:890 +0x263
github.com/openshift/console/pkg/plugins.(*PluginsHandler).proxyPluginRequest(0xc0003b5760, 0x2?, {0xc0009bc7d1, 0x11}, {0x3a41fa0, 0xc0002f6c40}, 0xb?)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:165 +0x582
github.com/openshift/console/pkg/plugins.(*PluginsHandler).HandlePluginAssets(0xaa00000000000010?, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7500)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:147 +0x26d
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func23({0x3a41fa0?, 0xc0002f6c40?}, 0x7?)
    /go/src/github.com/openshift/console/pkg/server/server.go:604 +0x33
net/http.HandlerFunc.ServeHTTP(...)
    /usr/lib/golang/src/net/http/server.go:2122
github.com/openshift/console/pkg/server.authMiddleware.func1(0xc0001f7500?, {0x3a41fa0?, 0xc0002f6c40?}, 0xd?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:25 +0x31
github.com/openshift/console/pkg/server.authMiddlewareWithUser.func1({0x3a41fa0, 0xc0002f6c40}, 0xc0001f7500)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:81 +0x46c
net/http.HandlerFunc.ServeHTTP(0x5120938?, {0x3a41fa0?, 0xc0002f6c40?}, 0x7ffb6ea27f18?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.StripPrefix.func1({0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400)
    /usr/lib/golang/src/net/http/server.go:2165 +0x332
net/http.HandlerFunc.ServeHTTP(0xc001102c00?, {0x3a41fa0?, 0xc0002f6c40?}, 0xc000655a00?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0x34025e0?, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400)
    /usr/lib/golang/src/net/http/server.go:2500 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x3a41fa0, 0xc0002f6c40}, 0x3305040?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:128 +0x3af
net/http.HandlerFunc.ServeHTTP(0x0?, {0x3a41fa0?, 0xc0002f6c40?}, 0x11db52e?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0xc0008201e0?}, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400)
    /usr/lib/golang/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc0009b4120, {0x3a43e70, 0xc001223500})
    /usr/lib/golang/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
    /usr/lib/golang/src/net/http/server.go:3089 +0x5ed
I0915 12:50:24.267777       1 handlers.go:118] User settings ConfigMap "user-settings-4b4c2f4d-159c-4358-bba3-3d87f113cd9b" already exist, will return existing data.
I0915 12:50:24.267813       1 handlers.go:118] User settings ConfigMap "user-settings-4b4c2f4d-159c-4358-bba3-3d87f113cd9b" already exist, will return existing data.
E0915 12:50:30.155515       1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::f735]:9443: connect: connection refused
2023/09/15 12:50:30 http: panic serving [fd01:0:0:1::2]:42990: runtime error: invalid memory address or nil pointer dereference 

The monitoring-plugin pod is Running, but its service refuses connections on port 9443:

$ oc -n openshift-monitoring get pod -o wide
NAME                                                     READY   STATUS    RESTARTS   AGE     IP                  NODE    NOMINATED NODE   READINESS GATES
alertmanager-main-0                                      6/6     Running   6          3d22h   fd01:0:0:1::564     sno-2   <none>           <none>
cluster-monitoring-operator-6cb777d488-nnpmx             1/1     Running   4          7d16h   fd01:0:0:1::12      sno-2   <none>           <none>
kube-state-metrics-dc5f769bc-p97m7                       3/3     Running   12         7d16h   fd01:0:0:1::3b      sno-2   <none>           <none>
monitoring-plugin-85bfb98485-d4g5x                       1/1     Running   4          7d16h   fd01:0:0:1::55      sno-2   <none>           <none>
node-exporter-ndnnj                                      2/2     Running   8          7d16h   2620:52:0:165::41   sno-2   <none>           <none>
openshift-state-metrics-78df59b4d5-j6r5s                 3/3     Running   12         7d16h   fd01:0:0:1::3a      sno-2   <none>           <none>
prometheus-adapter-6f86f7d8f5-ttflf                      1/1     Running   0          4h23m   fd01:0:0:1::b10c    sno-2   <none>           <none>
prometheus-k8s-0                                         6/6     Running   6          3d22h   fd01:0:0:1::566     sno-2   <none>           <none>
prometheus-operator-7c94855989-csts2                     2/2     Running   8          7d16h   fd01:0:0:1::39      sno-2   <none>           <none>
prometheus-operator-admission-webhook-7bb64b88cd-bvq8m   1/1     Running   4          7d16h   fd01:0:0:1::37      sno-2   <none>           <none>
thanos-querier-5bbb764599-vlztq                          6/6     Running   6          3d22h   fd01:0:0:1::56a     sno-2   <none>           <none>

$  oc -n openshift-monitoring get svc monitoring-plugin
NAME                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
monitoring-plugin   ClusterIP   fd02::f735   <none>        9443/TCP   7d16h


$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -v 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq
*   Trying fd02::f735...
* TCP_NODELAY set
* connect to fd02::f735 port 9443 failed: Connection refused
* Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
command terminated with exit code 7

The issue does not occur on another 4.14.0-rc.0 IPv4 cluster, but it reproduces on another 4.14.0-rc.0 IPv6 cluster.
On the 4.14.0-rc.0 IPv4 cluster:

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-rc.0   True        False         20m     Cluster version is 4.14.0-rc.0

$ oc -n openshift-monitoring get pod -o wide | grep monitoring-plugin
monitoring-plugin-85bfb98485-nh428                       1/1     Running   0          4m      10.128.0.107   ci-ln-pby4bj2-72292-l5q8v-master-0   <none>           <none>

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k  'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq
...
{
  "name": "monitoring-plugin",
  "version": "1.0.0",
  "displayName": "OpenShift console monitoring plugin",
  "description": "This plugin adds the monitoring UI to the OpenShift web console",
  "dependencies": {
    "@console/pluginAPI": "*"
  },
  "extensions": [
    {
      "type": "console.page/route",
      "properties": {
        "exact": true,
        "path": "/monitoring",
        "component": {
          "$codeRef": "MonitoringUI"
        }
      }
    },
...

The "9443: Connection refused" issue reproduces on a 4.14.0-rc.0 IPv6 cluster (launched with cluster-bot: launch 4.14.0-rc.0 metal,ipv6) after logging in to the console:

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-rc.0   True        False         44m     Cluster version is 4.14.0-rc.0
$ oc -n openshift-monitoring get pod -o wide | grep monitoring-plugin
monitoring-plugin-bd6ffdb5d-b5csk                        1/1     Running   0          53m   fd01:0:0:4::b             worker-0.ostest.test.metalkube.org   <none>           <none>
monitoring-plugin-bd6ffdb5d-vhtpf                        1/1     Running   0          53m   fd01:0:0:5::9             worker-2.ostest.test.metalkube.org   <none>           <none>
$ oc -n openshift-monitoring get svc monitoring-plugin
NAME                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
monitoring-plugin   ClusterIP   fd02::402d   <none>        9443/TCP   59m

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -v 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq
*   Trying fd02::402d...
* TCP_NODELAY set
* connect to fd02::402d port 9443 failed: Connection refused
* Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
command terminated with exit code 7

$ oc -n openshift-console get pod | grep console
console-5cffbc7964-7ljft     1/1     Running   0          56m
console-5cffbc7964-d864q     1/1     Running   0          56m

$ oc -n openshift-console logs console-5cffbc7964-7ljft
...
E0916 14:34:16.330117       1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::402d]:9443: connect: connection refused
2023/09/16 14:34:16 http: panic serving [fd01:0:0:4::2]:37680: runtime error: invalid memory address or nil pointer dereference
goroutine 3985 [running]:
net/http.(*conn).serve.func1()
    /usr/lib/golang/src/net/http/server.go:1854 +0xbf
panic({0x3259140, 0x4fcc150})
    /usr/lib/golang/src/runtime/panic.go:890 +0x263
github.com/openshift/console/pkg/plugins.(*PluginsHandler).proxyPluginRequest(0xc0008f6780, 0x2?, {0xc000665211, 0x11}, {0x3a41fa0, 0xc0009221c0}, 0xb?)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:165 +0x582
github.com/openshift/console/pkg/plugins.(*PluginsHandler).HandlePluginAssets(0xfe00000000000010?, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d600)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:147 +0x26d
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func23({0x3a41fa0?, 0xc0009221c0?}, 0x7?)
    /go/src/github.com/openshift/console/pkg/server/server.go:604 +0x33
net/http.HandlerFunc.ServeHTTP(...)
    /usr/lib/golang/src/net/http/server.go:2122
github.com/openshift/console/pkg/server.authMiddleware.func1(0xc000d8d600?, {0x3a41fa0?, 0xc0009221c0?}, 0xd?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:25 +0x31
github.com/openshift/console/pkg/server.authMiddlewareWithUser.func1({0x3a41fa0, 0xc0009221c0}, 0xc000d8d600)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:81 +0x46c
net/http.HandlerFunc.ServeHTTP(0xc000653830?, {0x3a41fa0?, 0xc0009221c0?}, 0x7f824506bf18?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.StripPrefix.func1({0x3a41fa0, 0xc0009221c0}, 0xc000d8d500)
    /usr/lib/golang/src/net/http/server.go:2165 +0x332
net/http.HandlerFunc.ServeHTTP(0xc00007e800?, {0x3a41fa0?, 0xc0009221c0?}, 0xc000b2da00?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0x34025e0?, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d500)
    /usr/lib/golang/src/net/http/server.go:2500 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x3a41fa0, 0xc0009221c0}, 0x3305040?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:128 +0x3af
net/http.HandlerFunc.ServeHTTP(0x0?, {0x3a41fa0?, 0xc0009221c0?}, 0x11db52e?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0xc000db9b00?}, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d500)
    /usr/lib/golang/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc000653680, {0x3a43e70, 0xc000676f30})
    /usr/lib/golang/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
    /usr/lib/golang/src/net/http/server.go:3089 +0x5ed 
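
In the trace above, the failed GET is logged at handlers.go:164 and the panic follows at handlers.go:165. That is consistent with the handler continuing to use a response that was never returned, although the report does not confirm this. The following is a minimal sketch of the early-return pattern, written in Python rather than the console's Go; proxy_plugin_request, fetch, and refused are hypothetical names, and the URL is a placeholder.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for the upstream GET: it either returns (status, body)
# or raises OSError (for example ConnectionRefusedError) when the dial fails.
Fetch = Callable[[str], Tuple[int, bytes]]


def proxy_plugin_request(url: str, fetch: Fetch) -> Tuple[int, bytes]:
    """Return the upstream response, or 502 Bad Gateway if the request failed."""
    try:
        status, body = fetch(url)
    except OSError as err:
        # Return early: no response exists on this path, so nothing after
        # this point may touch one.
        return 502, f"failed to reach plugin: {err}".encode()
    return status, body


def refused(_url: str) -> Tuple[int, bytes]:
    """Simulate the 'connection refused' dial error seen in the logs."""
    raise ConnectionRefusedError("connection refused")


status, body = proxy_plugin_request("https://plugin.example:9443/plugin-manifest.json", refused)
print(status, body.decode())
```

With this shape the client still receives a Bad Gateway response, but the server does not panic while producing it.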

Version-Release number of selected component (if applicable):

Bare-metal 4.14.0-rc.0 IPv6 SNO cluster:
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=virt_platform'  | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "virt_platform",
          "baseboard_manufacturer": "Dell Inc.",
          "baseboard_product_name": "01J4WF",
          "bios_vendor": "Dell Inc.",
          "bios_version": "1.10.2",
          "container": "kube-rbac-proxy",
          "endpoint": "https",
          "instance": "sno-2",
          "job": "node-exporter",
          "namespace": "openshift-monitoring",
          "pod": "node-exporter-ndnnj",
          "prometheus": "openshift-monitoring/k8s",
          "service": "node-exporter",
          "system_manufacturer": "Dell Inc.",
          "system_product_name": "PowerEdge R750",
          "system_version": "Not Specified",
          "type": "none"
        },
        "value": [
          1694785092.664,
          "1"
        ]
      }
    ]
  }
}

How reproducible:

only seen on this cluster

Steps to Reproduce:

1. see the description
2.
3.

Actual results:

There is no Observe menu in the admin console, and the monitoring-plugin status is Failed.

Expected results:

no error

Description of problem:

During a 7-day reliability test, the kube-apiserver's memory usage kept increasing, peaking at over 3 GB.
In our 4.12 test results, the kube-apiserver's memory usage was stable at around 1.7 GB and did not keep increasing.
I'll redo the test on a 4.12.0 build to see if I can reproduce this issue.

I'll also run a test longer than 7 days to see how high the memory usage can grow.

About Reliability Test
https://github.com/openshift/svt/tree/master/reliability-v2

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-14-053612

How reproducible:

Always

Steps to Reproduce:

1. Install an AWS cluster with m5.xlarge type
2. Run reliability test for 7 days
Reliability Test Configuration example:
https://github.com/openshift/svt/tree/master/reliability-v2#groups-and-tasks-1
Config used in this test:
admin: 1 user
dev-test: 15 users
dev-prod: 1 user 
3. Use dittybopper dashboard to monitor the kube-apiserver's memory usage

Actual results:

The kube-apiserver's memory usage keeps increasing. The maximum is over 3 GB.

Expected results:

The kube-apiserver's memory usage should not keep increasing.

Additional info:

Screenshots are uploaded to shared folder OCPBUGS-10829 - Google Drive

413-kube-apiserver-memory.png
413-api-performance-last2d.png - test was stopped on [2023-03-24 04:21:10 UTC]
412-kube-apiserver-memory.png
must-gather.local.525817950490593011.tar.gz - 4.13 cluster's must gather

Description of problem:

The hypershift_hostedclusters_failure_conditions metric produced by the HyperShift operator does not report a value of 0 for conditions that no longer apply. The result is that if a hostedcluster had a failure condition at a given point, but that condition has gone away, the metric still reports a count for that condition.

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Create a HostedCluster, watch the hypershift_hostedclusters_failure_conditions metric as failure conditions occur.
2.
3.

Actual results:

A cluster count of 1 with a failure condition is reported even if the failure condition no longer applies.

Expected results:

Once failure conditions no longer apply, 0 clusters with those conditions should be reported.

Additional info:

The metric should report an accurate count for each possible failure condition of all clusters at any given time.
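
One way to get the behavior described above is to start every known failure condition at 0 on each collection pass and then count only the conditions that are currently active. The sketch below illustrates that idea in Python; it is not the HyperShift operator's code, and the condition names and the failure_condition_counts function are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical condition names, used for illustration only.
KNOWN_FAILURE_CONDITIONS = ["ConditionA", "ConditionB", "ConditionC"]


def failure_condition_counts(clusters: Iterable[List[str]]) -> Dict[str, int]:
    """Count clusters per failure condition, reporting 0 for inactive ones.

    Each element of `clusters` is the list of failure conditions that
    currently apply to one cluster.
    """
    # Start every known condition at 0 so a condition that has gone away
    # is reported as 0 instead of keeping a stale value.
    counts = {name: 0 for name in KNOWN_FAILURE_CONDITIONS}
    for active in clusters:
        for name in set(active):  # count each cluster at most once per condition
            if name in counts:
                counts[name] += 1
    return counts


# One cluster currently failing ConditionA, one healthy cluster.
print(failure_condition_counts([["ConditionA"], []]))
```

Because the counts are rebuilt from scratch on every pass, a condition that cleared since the previous pass drops back to 0 without any explicit reset step.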

Description of problem:

When adding a repository URL that contains hyphens in the <owner> part of the URL
(<https://github.com/owner/url>, e.g. https://github.com/redhat-developer/s2i-dotnetcore-ex.git), the Create button stays disabled and no validation errors are presented in the UI.
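
The report does not show the validation rule that rejects this URL, so the following is only an illustration of a pattern that accepts hyphens in both the owner and repository segments. The regular expression and the parse_github_url function are assumptions made for this sketch, not the console's actual validation.

```python
import re
from typing import Optional, Tuple

# Illustrative pattern only: owner and repository segments may contain
# letters, digits, dots, underscores, and hyphens; a trailing ".git" and a
# trailing slash are optional.
GITHUB_URL = re.compile(
    r"^https://github\.com/"
    r"(?P<owner>[A-Za-z0-9._-]+)/"
    r"(?P<repo>[A-Za-z0-9._-]+?)"
    r"(?:\.git)?/?$"
)


def parse_github_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) for a GitHub HTTPS URL, or None if it does not match."""
    match = GITHUB_URL.match(url)
    if match is None:
        return None
    return match.group("owner"), match.group("repo")


print(parse_github_url("https://github.com/redhat-developer/s2i-dotnetcore-ex.git"))
```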

Version-Release number of selected component (if applicable):
4.9

How reproducible:
Always

Steps to Reproduce:
1. Go to Developer -> Add -> Import from Git page
2. use the repo url https://github.com/redhat-developer/s2i-dotnetcore-ex.git
3. add `/app` in the context dir under advanced git options.

Actual results:

Once the builder image is detected, the Create button is disabled but the form shows no errors. When the user touches the name field, a name validation error message is shown even if the suggested name is valid.

Expected results:

After detecting the builder image, the create button should be enabled.

Additional info:

Description of problem:

Authorization by OpenShift Container Platform 4 is not working as expected, when using system:serviceaccounts Group in the ClusterRoleBinding.

Here, one would assume that every serviceAccount would be granted the permissions to access the defined resources but actually access is denied.

$ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews
{
  "kind": "SubjectAccessReview",
  "apiVersion": "authorization.k8s.io/v1",
  "metadata": {
    "creationTimestamp": null,
    "managedFields": [
      {
        "manager": "curl",
        "operation": "Update",
        "apiVersion": "authorization.k8s.io/v1",
        "time": "2023-03-13T09:17:45Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:spec": {
            "f:resourceAttributes": {
              ".": {},
              "f:group": {},
              "f:name": {},
              "f:namespace": {},
              "f:resource": {},
              "f:verb": {}
            },
            "f:user": {}
          }
        }
      }
    ]
  },
  "spec": {
    "resourceAttributes": {
      "namespace": "project-100",
      "verb": "use",
      "group": "sharedresource.openshift.io",
      "resource": "sharedsecrets",
      "name": "shared-subscription"
    },
    "user": "system:serviceaccount:project-100:builder"
  },
  "status": {
    "allowed": false
  }
}

When specifying the serviceAccount in the ClusterRoleBinding access is granted:

$ oc get clusterrolebinding shared-secret-cluster-role-binding -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRoleBinding","metadata":{"annotations":{},"name":"shared-secret-cluster-role-binding"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"shared-secret-cluster-role"},"subjects":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:serviceaccounts"}]}
  creationTimestamp: "2023-03-13T08:59:46Z"
  name: shared-secret-cluster-role-binding
  resourceVersion: "1575464"
  uid: dd11825d-834a-4807-ab82-30dc0a415985
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: shared-secret-cluster-role
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts
- kind: ServiceAccount
  name: builder
  namespace: project-101

$ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews
{
  "kind": "SubjectAccessReview",
  "apiVersion": "authorization.k8s.io/v1",
  "metadata": {
    "creationTimestamp": null,
    "managedFields": [
      {
        "manager": "curl",
        "operation": "Update",
        "apiVersion": "authorization.k8s.io/v1",
        "time": "2023-03-13T09:16:47Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:spec": {
            "f:resourceAttributes": {
              ".": {},
              "f:group": {},
              "f:name": {},
              "f:namespace": {},
              "f:resource": {},
              "f:verb": {}
            },
            "f:user": {}
          }
        }
      }
    ]
  },
  "spec": {
    "resourceAttributes": {
      "namespace": "project-101",
      "verb": "use",
      "group": "sharedresource.openshift.io",
      "resource": "sharedsecrets",
      "name": "shared-subscription"
    },
    "user": "system:serviceaccount:project-101:builder"
  },
  "status": {
    "allowed": true,
    "reason": "RBAC: allowed by ClusterRoleBinding \"shared-secret-cluster-role-binding\" of ClusterRole \"shared-secret-cluster-role\" to ServiceAccount \"builder/project-101\""
  }
}

Both namespaces exist and have the serviceAccount automatically created.

$ oc get sa -n project-100
NAME       SECRETS   AGE
builder    1         11m
default    1         11m
deployer   1         11m

$ oc get sa -n project-101
NAME       SECRETS   AGE
builder    1         4m1s
default    1         4m1s
deployer   1         4m

The only difference is how authorization is granted. For project-101 the serviceAccount is granted access explicitly, while for project-100 authorization should be granted via the Group called system:serviceaccounts.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.5

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.12
2. Create SharedSecret CRD using oc apply -f https://raw.githubusercontent.com/openshift/api/master/sharedresource/v1alpha1/0000_10_sharedsecret.crd.yaml
3. Create SharedSecret resource:
$ oc get sharedsecret shared-subscription -o yaml
apiVersion: sharedresource.openshift.io/v1alpha1
kind: SharedSecret
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"sharedresource.openshift.io/v1alpha1","kind":"SharedSecret","metadata":{"annotations":{},"name":"shared-subscription"},"spec":{"secretRef":{"name":"etc-pki-entitlement","namespace":"openshift-config-managed"}}}
  creationTimestamp: "2023-03-13T08:54:48Z"
  generation: 1
  name: shared-subscription
  resourceVersion: "1567499"
  uid: 15c350aa-0de1-4a02-b876-9b822ba0afe5
spec:
  secretRef:
    name: etc-pki-entitlement
    namespace: openshift-config-managed
4. Create ClusterRole to grant access to SharedSecret:
$ oc get clusterrole shared-secret-cluster-role -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"name":"shared-secret-cluster-role"},"rules":[{"apiGroups":["sharedresource.openshift.io"],"resourceNames":["shared-subscription"],"resources":["sharedsecrets"],"verbs":["use"]}]}
  creationTimestamp: "2023-03-13T08:57:24Z"
  name: shared-secret-cluster-role
  resourceVersion: "1568481"
  uid: 99324722-ac62-4bb8-a7fe-7ac915393e19
rules:
- apiGroups:
  - sharedresource.openshift.io
  resourceNames:
  - shared-subscription
  resources:
  - sharedsecrets
  verbs:
  - use
5. Create ClusterRoleBinding to access SharedSecret
$ oc get clusterrolebinding shared-secret-cluster-role-binding -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRoleBinding","metadata":{"annotations":{},"name":"shared-secret-cluster-role-binding"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"shared-secret-cluster-role"},"subjects":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:serviceaccounts"}]}
  creationTimestamp: "2023-03-13T08:59:46Z"
  name: shared-secret-cluster-role-binding
  resourceVersion: "1575464"
  uid: dd11825d-834a-4807-ab82-30dc0a415985
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: shared-secret-cluster-role
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts
- kind: ServiceAccount
  name: builder
  namespace: project-101
6. Run SubjectAccessReview call to validate authorization:
$ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews
{
  "kind": "SubjectAccessReview",
  "apiVersion": "authorization.k8s.io/v1",
  "metadata": {
    "creationTimestamp": null,
    "managedFields": [
      {
        "manager": "curl",
        "operation": "Update",
        "apiVersion": "authorization.k8s.io/v1",
        "time": "2023-03-13T09:17:45Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:spec": {
            "f:resourceAttributes": {
              ".": {},
              "f:group": {},
              "f:name": {},
              "f:namespace": {},
              "f:resource": {},
              "f:verb": {}
            },
            "f:user": {}
          }
        }
      }
    ]
  },
  "spec": {
    "resourceAttributes": {
      "namespace": "project-100",
      "verb": "use",
      "group": "sharedresource.openshift.io",
      "resource": "sharedsecrets",
      "name": "shared-subscription"
    },
    "user": "system:serviceaccount:project-100:builder"
  },
  "status": {
    "allowed": false
  }
}

Actual results:

$ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews
{
  "kind": "SubjectAccessReview",
  "apiVersion": "authorization.k8s.io/v1",
  "metadata": {
    "creationTimestamp": null,
    "managedFields": [
      {
        "manager": "curl",
        "operation": "Update",
        "apiVersion": "authorization.k8s.io/v1",
        "time": "2023-03-13T09:17:45Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:spec": {
            "f:resourceAttributes": {
              ".": {},
              "f:group": {},
              "f:name": {},
              "f:namespace": {},
              "f:resource": {},
              "f:verb": {}
            },
            "f:user": {}
          }
        }
      }
    ]
  },
  "spec": {
    "resourceAttributes": {
      "namespace": "project-100",
      "verb": "use",
      "group": "sharedresource.openshift.io",
      "resource": "sharedsecrets",
      "name": "shared-subscription"
    },
    "user": "system:serviceaccount:project-100:builder"
  },
  "status": {
    "allowed": false
  }
}
 

Expected results:

$ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews
{
  "kind": "SubjectAccessReview",
  "apiVersion": "authorization.k8s.io/v1",
  "metadata": {
    "creationTimestamp": null,
    "managedFields": [
      {
        "manager": "curl",
        "operation": "Update",
        "apiVersion": "authorization.k8s.io/v1",
        "time": "2023-03-13T09:16:47Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:spec": {
            "f:resourceAttributes": {
              ".": {},
              "f:group": {},
              "f:name": {},
              "f:namespace": {},
              "f:resource": {},
              "f:verb": {}
            },
            "f:user": {}
          }
        }
      }
    ]
  },
  "spec": {
    "resourceAttributes": {
      "namespace": "project-101",
      "verb": "use",
      "group": "sharedresource.openshift.io",
      "resource": "sharedsecrets",
      "name": "shared-subscription"
    },
    "user": "system:serviceaccount:project-101:builder"
  },
  "status": {
    "allowed": true,
    "reason": "RBAC: allowed by ClusterRoleBinding \"shared-secret-cluster-role-binding\" of ClusterRole \"shared-secret-cluster-role\" to ServiceAccount \"builder/project-101\""
  }
}
 

Additional info:

The goal is to use the Group "system:serviceaccounts" to authorize all serviceAccounts to access the given resources. This avoids listing every namespace explicitly, which would require a controller to keep such a list up to date.
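
One thing worth checking: the SubjectAccessReview responses above echo a spec that contains only user and resourceAttributes. The SubjectAccessReview spec also has an optional groups field, and the review is evaluated against the user and groups given in the spec. The contents of /tmp/post.json are not shown in the report, so whether a missing groups list explains the denial is an assumption. The sketch below builds a request body that lists the groups a service account normally carries; service_account_sar is a hypothetical helper name.

```python
import json


def service_account_sar(namespace: str, name: str) -> dict:
    """Build a SubjectAccessReview body for a service account, groups included."""
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SubjectAccessReview",
        "spec": {
            "resourceAttributes": {
                "namespace": namespace,
                "verb": "use",
                "group": "sharedresource.openshift.io",
                "resource": "sharedsecrets",
                "name": "shared-subscription",
            },
            "user": f"system:serviceaccount:{namespace}:{name}",
            # Groups that an authenticated service account token carries;
            # they are listed explicitly because the review uses the spec as given.
            "groups": [
                "system:serviceaccounts",
                f"system:serviceaccounts:{namespace}",
                "system:authenticated",
            ],
        },
    }


# Print a body that could be saved to a file and posted with curl --data.
print(json.dumps(service_account_sar("project-100", "builder"), indent=2))
```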
 

Description of problem:

When creating an image for arm, i.e. using:
  architecture: arm64

and running
$ ./bin/openshift-install agent create image --dir ./cluster-manifests/ --log-level debug

the output indicates that the correct base ISO was extracted from the release:
INFO Extracting base ISO from release payload     
DEBUG Using mirror configuration                   
DEBUG Fetching image from OCP release (oc adm release info --image-for=machine-os-images --insecure=true --icsp-file=/tmp/icsp-file347546417 registry.ci.openshift.org/origin/release:4.13) 
DEBUG extracting /coreos/coreos-aarch64.iso to /home/bfournie/.cache/agent/image_cache, oc image extract --path /coreos/coreos-aarch64.iso:/home/bfournie/.cache/agent/image_cache --confirm --icsp-file=/tmp/icsp-file3609464443 registry.ci.openshift.org/origin/4.13-2023-03-09-142410@sha256:e3c4445cabe16ca08c5b874b7a7c9d378151eb825bacc90e240cfba9339a828c 
INFO Base ISO obtained from release and cached at /home/bfournie/.cache/agent/image_cache/coreos-aarch64.iso 
DEBUG Extracted base ISO image /home/bfournie/.cache/agent/image_cache/coreos-aarch64.iso from release payload 

In fact, the ISO was not extracted from the release image and the command failed:
ERROR failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors 
FATAL failed to fetch Agent Installer ISO: failed to generate asset "Agent Installer ISO": provided device /home/bfournie/.cache/agent/image_cache/coreos-aarch64.iso does not exist
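
The misleading "Base ISO obtained" message could be avoided by confirming that the extracted file exists before reporting success. The sketch below shows that check in Python; it is not the installer's Go code, and confirm_extracted_iso is a hypothetical helper name.

```python
import os


def confirm_extracted_iso(path: str) -> str:
    """Return path if the extracted ISO exists and is non-empty, else raise."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        # Fail at the extraction step instead of logging success and
        # failing later when the ISO is opened.
        raise FileNotFoundError(f"base ISO was not extracted to {path}")
    return path


try:
    iso = confirm_extracted_iso("/nonexistent/image_cache/coreos-aarch64.iso")
    print(f"Base ISO obtained from release and cached at {iso}")
except FileNotFoundError as err:
    print(f"extraction failed: {err}")
```

Checking at the extraction step keeps the error next to its cause, so the log never claims the ISO was cached when it was not.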

Version-Release number of selected component (if applicable):

4.13

How reproducible:

every time

Steps to Reproduce:

1. Set architecture: arm64  for all hosts in install-config.yaml 
2. Run the openshift-install command as above
3. See the log messages and the command fails

Actual results:

Invalid messages are logged and command fails

Expected results:

Command succeeds

Additional info:

 

Description of problem:

During the documentation writing phase, we have received suggestions to improve texts in the vSphere Connection modal. We should address them.

https://docs.google.com/document/d/1jLnHuJyOR5nyuFTpSO6LcuHDVrVGUSs2EMpLFey1qDQ/edit

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Deploy OCP cluster on the vSphere platform
2. On the homepage of the Console, see VCenter status plugin
3.

Actual results:

 

Expected results:

 

Additional info:

It's about rephrasing only.

Description of problem:

An IPv6-only agent-based installation on bare metal fails if the rendezvousIP value is not in canonical form.

Version-Release number of selected component (if applicable):

OCP 4.12

How reproducible:

Every time.

Steps to Reproduce:

1. Configure the agent through agent-config.yaml for an IPv6-only install.
2. Set rendezvousIP to a value that is valid but not canonical:
   for example: rendezvousIP: 2a00:8a00:4000:020c:0000:0000:0018:143c 
3. Generate discovery iso and boot nodes. 

Actual results:

Installation fails because the set-node-zero.sh script fails to discover that it is running on node zero.

Expected results:

Installation completes. 

Additional info:

The code that detects whether a host is node-zero uses this:

is_rendezvous_host=$(ip -j address | jq "[.[].addr_info] | flatten | map(.local==\"$NODE_ZERO_IP\") | any")

This fails in unexpected ways with IPv6 addresses that are not canonical: the output of ip address is always canonical, but in this case the value of $NODE_ZERO_IP was not.
We tested this on the node itself:

[root@slabnode2290 bin]# ip -j address | jq '[.[].addr_info] | flatten | map(.local=="2a00:8a00:4000:020c:0000:0000:0018:143c") | any' 
false

[root@slabnode2290 bin]# ip -j address | jq '[.[].addr_info] | flatten | map(.local=="2a00:8a00:4000:20c::18:143c") | any'
true

A solution may be to use a tool like ipcalc, once available, to do this test and make it less strict. In the meantime, a note in the docs would be a good idea.
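
The comparison problem described above can be illustrated outside the script. The following is a minimal sketch in Python, not the fix that was shipped: it parses both strings with the standard-library ipaddress module so that they are compared by value rather than by spelling. The function names is_same_address and is_rendezvous_host are hypothetical, and the list of local addresses stands in for the .local values that ip -j address reports.

```python
import ipaddress
from typing import Iterable


def is_same_address(configured: str, local: str) -> bool:
    """Compare two IP address strings by value, not by spelling."""
    return ipaddress.ip_address(configured) == ipaddress.ip_address(local)


def is_rendezvous_host(node_zero_ip: str, local_addresses: Iterable[str]) -> bool:
    """Return True if any local address equals the configured rendezvous IP."""
    return any(is_same_address(node_zero_ip, addr) for addr in local_addresses)


# Values taken from the report; the link-local address is illustrative only.
configured = "2a00:8a00:4000:020c:0000:0000:0018:143c"
local = ["fe80::1", "2a00:8a00:4000:20c::18:143c"]

print(is_rendezvous_host(configured, local))          # True
print(ipaddress.ip_address(configured).compressed)    # 2a00:8a00:4000:20c::18:143c
```

Normalizing the configured value once (for example with .compressed) before passing it to the existing jq expression would be another way to get the same effect.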

 

This is a clone of issue OCPBUGS-18990. The following is the description of the original issue:

Description of problem:

The script refactoring from https://github.com/openshift/cluster-etcd-operator/pull/1057 introduced a regression. 

Since the static pod list variable was renamed, it is now empty and won't restore the non-etcd pod yamls anymore. 

Version-Release number of selected component (if applicable):

4.14 and later

How reproducible:

always

Steps to Reproduce:

1. create a cluster
2. restore using cluster-restore.sh

Actual results:

the apiserver and other static pods are not immediately restored

The script only outputs this log:

removing previous backup /var/lib/etcd-backup/member
Moving etcd data-dir /var/lib/etcd/member to /var/lib/etcd-backup
starting restore-etcd static pod

Expected results:

the non-etcd static pods should be immediately restored by moving them into the manifest directory again.

You can see this by the log output:

Moving etcd data-dir /var/lib/etcd/member to /var/lib/etcd-backup
starting restore-etcd static pod
starting kube-apiserver-pod.yaml
static-pod-resources/kube-apiserver-pod-7/kube-apiserver-pod.yaml
starting kube-controller-manager-pod.yaml
static-pod-resources/kube-controller-manager-pod-7/kube-controller-manager-pod.yaml
starting kube-scheduler-pod.yaml
static-pod-resources/kube-scheduler-pod-8/kube-scheduler-pod.yaml

Additional info:

 

 

Description of problem:

Pods are being terminated on Kubelet restart if they consume any device.

In the case of CNV, these Pods carry VMs, and the assumption is that Kubelet will not terminate the Pod in this case.

Version-Release number of selected component (if applicable):

4.14 / 4.13.z / 4.12.z

How reproducible:

As far as I understand, this should be reproducible with any device plugin

Steps to Reproduce:

1. Create a Pod that requests a device from a device plugin
2. Restart Kubelet
3.

Actual results:

Admission error -> Pod terminates

Expected results:

No error -> Existing & Running Pods will continue running after Kubelet restart

Additional info:

The culprit seems to be https://github.com/kubernetes/kubernetes/pull/116376

Description of problem:

Currently, when the oc-mirror command runs, the generated ImageContentSourcePolicy.yaml includes mirrors for the mirrored operator catalogs. It should not.

This should be the case for catalogs located in a registry and for OCI FBC catalogs (located on disk).
Jennifer Power, Alex Flom, can you help us confirm this is the expected behavior?

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Run the oc-mirror command to mirror the catalog:
/bin/oc-mirror --config imageSetConfig.yaml  docker://localhost:5000  --use-oci-feature  --dest-use-http  --dest-skip-tls
with the following ImageSetConfiguration:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /tmp/storageBackend
mirror:
  operators:
  - catalog: oci:///home/user/catalogs/rhop4.12
    # copied from registry.redhat.io/redhat/redhat-operator-index:v4.12
    targetCatalog: "mno/redhat-operator-index"
    targetVersion: "v4.12"
    packages:
    - name: aws-load-balancer-operator

Actual results:

Catalog is included in the imageContentSourcePolicy.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: localhost:5000/mno/redhat-operator-index:v4.12
  sourceType: grpc

---
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  labels:
    operators.openshift.org/catalog: "true"
  name: operator-0
spec:
  repositoryDigestMirrors:
  - mirrors:
    - localhost:5000/albo
    source: registry.redhat.io/albo
  - mirrors:
    - localhost:5000/mno
    source: mno
  - mirrors:
    - localhost:5000/openshift4
    source: registry.redhat.io/openshift4

Expected results:

No catalog should be included in the imageContentSourcePolicy.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: localhost:5000/mno/redhat-operator-index:v4.12
  sourceType: grpc

---
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  labels:
    operators.openshift.org/catalog: "true"
  name: operator-0
spec:
  repositoryDigestMirrors:
  - mirrors:
    - localhost:5000/albo
    source: registry.redhat.io/albo
  - mirrors:
    - localhost:5000/openshift4
    source: registry.redhat.io/openshift4
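
The difference between the actual and expected results above is a single repositoryDigestMirrors entry, the one for the catalog's own namespace (mno). As a rough illustration only, the Python sketch below filters such an entry out of a list shaped like the YAML above. The function name drop_catalog_mirrors is hypothetical, and deriving the namespace from targetCatalog by splitting on the last slash is an assumption, not necessarily how oc-mirror decides what to emit.

```python
def drop_catalog_mirrors(digest_mirrors, target_catalog):
    """Remove mirror entries whose source is the catalog's own namespace.

    Assumption: the catalog namespace is everything before the last "/"
    in targetCatalog, e.g. "mno/redhat-operator-index" -> "mno".
    """
    catalog_namespace = target_catalog.rsplit("/", 1)[0]
    return [entry for entry in digest_mirrors if entry["source"] != catalog_namespace]


# Entries copied from the "Actual results" ImageContentSourcePolicy.
actual = [
    {"mirrors": ["localhost:5000/albo"], "source": "registry.redhat.io/albo"},
    {"mirrors": ["localhost:5000/mno"], "source": "mno"},
    {"mirrors": ["localhost:5000/openshift4"], "source": "registry.redhat.io/openshift4"},
]

expected = drop_catalog_mirrors(actual, "mno/redhat-operator-index")
print([entry["source"] for entry in expected])
# ['registry.redhat.io/albo', 'registry.redhat.io/openshift4']
```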

Additional info:

 

Description of problem:

Looking at the telemetry data for Nutanix I noticed that the “host_type” for clusters installed with platform nutanix shows as “virt-unknown”. Do you know what needs to happen in the code to tell telemetry about host type being Nutanix? The problem is that we can’t track those installations with platform none, just IPI.

Refer to the slack thread https://redhat-internal.slack.com/archives/C0211848DBN/p1687864857228739.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

Create an OCP Nutanix cluster

Actual results:

The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as “virt-unknown”.

Expected results:

The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as "nutanix".

Additional info:

 

Description of problem:

The link from a service to its OpenShift Route breaks because the targetPort value is hardcoded. If the targetPort is changed, the route still points to the old port value.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install the latest available version of Openshift Pipelines
2. Create the pipeline and triggerbinding using the attached files
3. Add trigger to the created pipeline from devconsole UI, select the above created triggerbinding while adding trigger
4. Trigger an event using the curl command curl -X POST -d '{ "url": "https://www.github.com/VeereshAradhya/cli" }' -H 'Content-Type: application/json' <route> and make sure that the pipelinerun gets started
5. Update the targetPort in the svc from 8080 to 8000
6. Again use the above curl command to trigger one more event

Actual results:

The curl command returns an error

Expected results:

The curl command should be successful and the pipelinerun should get started successfully

Additional info:

Error:
curl -X POST -d '{ "url": "https://www.github.com/VeereshAradhya/cli" }' -H 'Content-Type: application/json' http://el-event-listener-3o9zcv-test-devconsole.apps.ve412psi.psi.ospqa.com
<html>
  <head>
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <style type="text/css">
      body {
        font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
        line-height: 1.66666667;
        font-size: 16px;
        color: #333;
        background-color: #fff;
        margin: 2em 1em;
      }
      h1 {
        font-size: 28px;
        font-weight: 400;
      }
      p {
        margin: 0 0 10px;
      }
      .alert.alert-info {
        background-color: #F0F0F0;
        margin-top: 30px;
        padding: 30px;
      }
      .alert p {
        padding-left: 35px;
      }
      ul {
        padding-left: 51px;
        position: relative;
      }
      li {
        font-size: 14px;
        margin-bottom: 1em;
      }
      p.info {
        position: relative;
        font-size: 20px;
      }
      p.info:before, p.info:after {
        content: "";
        left: 0;
        position: absolute;
        top: 0;
      }
      p.info:before {
        background: #0066CC;
        border-radius: 16px;
        color: #fff;
        content: "i";
        font: bold 16px/24px serif;
        height: 24px;
        left: 0px;
        text-align: center;
        top: 4px;
        width: 24px;
      }
      @media (min-width: 768px) {
        body {
          margin: 6em;
        }
      }
    </style>
  </head>
  <body>
    <div>
      <h1>Application is not available</h1>
      <p>The application is currently not serving requests at this endpoint. It may not have been started or is still starting.</p>
      <div class="alert alert-info">
        <p class="info">
          Possible reasons you are seeing this page:
        </p>
        <ul>
          <li>
            <strong>The host doesn't exist.</strong>
            Make sure the hostname was typed correctly and that a route matching this hostname exists.
          </li>
          <li>
            <strong>The host exists, but doesn't have a matching path.</strong>
            Check if the URL path was typed correctly and that the route was created using the desired path.
          </li>
          <li>
            <strong>Route and path matches, but all pods are down.</strong>
            Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
          </li>
        </ul>
      </div>
    </div>
  </body>
</html>

Note:

The above scenario works fine if we create triggers using the yaml files instead of using devconsole UI

Description of the problem:

EnsureOperatorPrerequisite uses the cluster CPU architecture, but on a multi-arch cluster the CPU architecture is always multi. On a cluster update, EnsureOperatorPrerequisite does not prevent the cluster from being updated, but the next update request fails.

 

Steps to reproduce:

1. Register multi arch cluster (P or Z)

2. Update cluster with ODF operator 

3. Update any cluster field

 

Actual results:

The cluster fails to update the second time

 

Expected results:

The update does not fail

Description of problem:

These alerts fire without a namespace label:
* KubeStateMetricsListErrors
* KubeStateMetricsWatchErrors
* KubeletPlegDurationHigh
* KubeletTooManyPods
* KubeNodeReadinessFlapping
* KubeletPodStartUpLatencyHigh

Alerting rules without a namespace label make it harder for cluster admins to route the alerts.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Check the definitions of the said alerting rules.

Actual results:

The PromQL expressions aggregate away the namespace label and there's no static namespace label either.

Expected results:

Static namespace label in the rule definition.

Additional info:

https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide

Alerts SHOULD include a namespace label indicating the source of the alert. Many alerts will include this by virtue of the fact that their PromQL expressions result in a namespace label. Others may require a static namespace label
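
The check in the reproduction step (inspecting rule definitions for a static namespace label) could be automated along the following lines. This is only a sketch: the function name rules_missing_namespace is hypothetical, the rules are assumed to be already loaded as dictionaries in the usual Prometheus rule shape, and the expr and label values below are placeholders rather than the real definitions. It looks at static labels only, so a rule whose PromQL expression already yields a namespace label would still be reported.

```python
def rules_missing_namespace(rules):
    """Return the names of alerting rules that have no static `namespace` label."""
    missing = []
    for rule in rules:
        if "alert" not in rule:
            continue  # recording rules are not alerts; skip them
        labels = rule.get("labels") or {}
        if "namespace" not in labels:
            missing.append(rule["alert"])
    return missing


# Placeholder rule definitions; only the alert names come from the report.
sample_rules = [
    {"alert": "KubeletTooManyPods", "expr": "placeholder_expr",
     "labels": {"severity": "info"}},
    {"alert": "KubeNodeReadinessFlapping", "expr": "placeholder_expr",
     "labels": {"severity": "warning", "namespace": "kube-system"}},
    {"record": "hypothetical:recording:rule", "expr": "placeholder_expr"},
]

print(rules_missing_namespace(sample_rules))  # ['KubeletTooManyPods']
```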

Description of problem:

4.14 cluster installation failed with TECH_PREVIEW featuregate

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-04-03-002631

How reproducible:

Always on GCP and Azure platform

Steps to Reproduce:

1. Install 4.14 cluster  with TECH_PREVIEW featuregate

Actual results:

Cluster installation fails and shows the error below

oc get pod -n openshift-kube-apiserver -l apiserver --show-labels
E0404 18:13:56.266461   73688 memcache.go:238] couldn't get current server API group list: Get "https://api.maxu-az-tp1.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": dial tcp 20.253.227.131:6443: i/o timeout
E0404 18:14:26.270883   73688 memcache.go:238] couldn't get current server API group list: Get "https://api.maxu-az-tp1.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": dial tcp 20.253.227.131:6443: i/o timeout
E0404 18:14:56.269363   73688 memcache.go:238] couldn't get current server API group list: Get "https://api.maxu-az-tp1.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": dial tcp 20.253.227.131:6443: i/o timeout
E0404 18:14:58.075111   73688 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 18:14:58.302392   73688 memcache.go:255] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
E0404 18:14:58.309541   73688 memcache.go:255] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E0404 18:14:58.313497   73688 memcache.go:255] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
NAME                                        READY   STATUS             RESTARTS        AGE   LABELS
kube-apiserver-maxu-az-tp1-86n5v-master-2   4/5     CrashLoopBackOff   7 (2m41s ago)   16m   apiserver=true,app=openshift-kube-apiserver,revision=16

Expected results:

Cluster installation should succeed without showing any errors

Additional info:

https://issues.redhat.com/browse/OCPQE-14686

https://drive.google.com/file/d/1EHVuPFaSJA50R2k8uVVUVDvGDCfG9ZYN/view?usp=sharing

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/?job=*4.14*-tp-*
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/?job=*4.14*-techpreview*

Description of problem:

When testing AWS on-prem BM expansion, the BMO is not able to reach the IRONIC_ENDPOINT

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-10-021647

How reproducible:

100%

Steps to Reproduce:

1. Install IPI AWS 3-node-compact cluster
2. Deploy BMO via YAML
3. Connect AWS against external on-prem env via VPN (out of scope)
4. Create BMH using "preprovisioningNetworkDataName" to push static IP and routes.

Actual results:

BMO is not able to reach the Ironic endpoint with the following error:

~~~
2023-08-10T16:09:22.216778289Z {"level":"info","ts":"2023-08-10T16:09:22Z","logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~openshift-qe-065","endpoint":"https://metal3-state.openshift-machine-api.svc.cluster.local:6385/v1/","error":"Get \"https://metal3-state.openshift-machine-api.svc.cluster.local:6385/v1\": dial tcp 172.30.19.119:6385: i/o timeout"}
~~~

Expected results:

Standard deploy

Additional info:

Must-gather provided separately

Description of problem:

OpenShift Console does not filter the SecretList when displaying the ServiceAccount details page

When reviewing the details page of an OpenShift ServiceAccount, at the bottom of the page there is a SecretList which is intended to display all of the relevant Secrets that are attached to the ServiceAccount.

In OpenShift 4.8.X, this SecretList only displayed the relevant Secrets. In OpenShift 4.9+ the SecretList now displays all Secrets within the entire Namespace.

Version-Release number of selected component (if applicable):

4.8.57 < Most recent release without issue
4.9.0 < First release with issue 
4.10.46 < Issue is still present

How reproducible:

Every time

Steps to Reproduce:

1. Deploy a cluster with OpenShift 4.8.57 
      (or replace the OpenShift Console image with `sha256:9dd115a91a4261311c44489011decda81584e1d32982533bf69acf3f53e17540` )
2. Access the ServiceAccounts Page ( User Management -> ServiceAccounts)
3. Click a ServiceAccount to display the Details page
4. Scroll down and review the Secrets section
5. Repeat steps with an OpenShift 4.9 release 
   (or check using image `sha256:fc07081f337a51f1ab957205e096f68e1ceb6a5b57536ea6fc7fbcea0aaaece0` )

Actual results:

All Secrets in the Namespace are displayed

Expected results:

Only Secrets associated with the ServiceAccount are displayed

Additional info:

Lightly reviewing the code, the following links might be a good start:
- https://github.com/openshift/console/blob/master/frontend/public/components/secret.jsx#L126
- https://github.com/openshift/console/blob/master/frontend/public/components/service-account.jsx#L151:L151
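
To make the expected behavior concrete, the sketch below filters a namespace's Secrets down to those associated with one ServiceAccount. It is written in Python over plain dictionaries and is not the console's code. The function name secrets_for_service_account is hypothetical, and the matching criteria are an assumption: a Secret counts as associated if the ServiceAccount lists it under secrets or imagePullSecrets, or if it carries the kubernetes.io/service-account.name annotation with the ServiceAccount's name.

```python
SA_NAME_ANNOTATION = "kubernetes.io/service-account.name"


def secrets_for_service_account(service_account, secrets):
    """Return only the Secrets associated with the given ServiceAccount."""
    referenced = {ref["name"] for ref in service_account.get("secrets", [])}
    referenced |= {ref["name"] for ref in service_account.get("imagePullSecrets", [])}
    sa_name = service_account["metadata"]["name"]

    matched = []
    for secret in secrets:
        meta = secret["metadata"]
        annotations = meta.get("annotations") or {}
        if meta["name"] in referenced or annotations.get(SA_NAME_ANNOTATION) == sa_name:
            matched.append(secret)
    return matched


# Hypothetical objects for illustration.
builder = {
    "metadata": {"name": "builder"},
    "secrets": [{"name": "builder-dockercfg-abcde"}],
}
namespace_secrets = [
    {"metadata": {"name": "builder-dockercfg-abcde"}},
    {"metadata": {"name": "builder-token-fghij",
                  "annotations": {SA_NAME_ANNOTATION: "builder"}}},
    {"metadata": {"name": "unrelated-tls-cert"}},
]

names = [s["metadata"]["name"] for s in secrets_for_service_account(builder, namespace_secrets)]
print(names)  # ['builder-dockercfg-abcde', 'builder-token-fghij']
```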

Description of problem:

On Azure, after deleting a master machine, the old machine is stuck in Deleting and some pods in the cluster are in ImagePullBackOff. The Azure console shows that the new master was not added to the load balancer backend, which seems to leave the machine without an internet connection.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-12-024338

How reproducible:

Always

Steps to Reproduce:

1. Set up a cluster on Azure, networkType ovn
2. Delete a master
3. Check master and pod

Actual results:

The old machine is stuck in Deleting, and some pods are in ImagePullBackOff.
 $ oc get machine    
NAME                                    PHASE      TYPE              REGION   ZONE   AGE
zhsunaz2132-5ctmh-master-0              Deleting   Standard_D8s_v3   westus          160m
zhsunaz2132-5ctmh-master-1              Running    Standard_D8s_v3   westus          160m
zhsunaz2132-5ctmh-master-2              Running    Standard_D8s_v3   westus          160m
zhsunaz2132-5ctmh-master-flqqr-0        Running    Standard_D8s_v3   westus          105m
zhsunaz2132-5ctmh-worker-westus-dhwfz   Running    Standard_D4s_v3   westus          152m
zhsunaz2132-5ctmh-worker-westus-dw895   Running    Standard_D4s_v3   westus          152m
zhsunaz2132-5ctmh-worker-westus-xlsgm   Running    Standard_D4s_v3   westus          152m

$ oc describe machine zhsunaz2132-5ctmh-master-flqqr-0  -n openshift-machine-api |grep -i "Load Balancer"
      Internal Load Balancer:  zhsunaz2132-5ctmh-internal
      Public Load Balancer:      zhsunaz2132-5ctmh

$ oc get node            
NAME                                    STATUS     ROLES                  AGE    VERSION
zhsunaz2132-5ctmh-master-0              Ready      control-plane,master   165m   v1.26.0+149fe52
zhsunaz2132-5ctmh-master-1              Ready      control-plane,master   165m   v1.26.0+149fe52
zhsunaz2132-5ctmh-master-2              Ready      control-plane,master   165m   v1.26.0+149fe52
zhsunaz2132-5ctmh-master-flqqr-0        NotReady   control-plane,master   109m   v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-dhwfz   Ready      worker                 152m   v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-dw895   Ready      worker                 152m   v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-xlsgm   Ready      worker                 152m   v1.26.0+149fe52
$ oc describe node zhsunaz2132-5ctmh-master-flqqr-0
  Warning  ErrorReconcilingNode       3m5s (x181 over 108m)  controlplane         [k8s.ovn.org/node-chassis-id annotation not found for node zhsunaz2132-5ctmh-master-flqqr-0, macAddress annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0" , k8s.ovn.org/l3-gateway-config annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0"]

$ oc get po --all-namespaces | grep ImagePullBackOf   
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-l8ng4                                  0/3     Init:ImagePullBackOff   0              113m
openshift-cluster-csi-drivers                      azure-file-csi-driver-node-99k82                                  0/3     Init:ImagePullBackOff   0              113m
openshift-cluster-node-tuning-operator             tuned-bvvh7                                                       0/1     ImagePullBackOff        0              113m
openshift-dns                                      node-resolver-2p4zq                                               0/1     ImagePullBackOff        0              113m
openshift-image-registry                           node-ca-vxv87                                                     0/1     ImagePullBackOff        0              113m
openshift-machine-config-operator                  machine-config-daemon-crt5w                                       1/2     ImagePullBackOff        0              113m
openshift-monitoring                               node-exporter-mmjsm                                               0/2     Init:ImagePullBackOff   0              113m
openshift-multus                                   multus-4cg87                                                      0/1     ImagePullBackOff        0              113m
openshift-multus                                   multus-additional-cni-plugins-mc6vx                               0/1     Init:ImagePullBackOff   0              113m
openshift-ovn-kubernetes                           ovnkube-master-qjjsv                                              0/6     ImagePullBackOff        0              113m
openshift-ovn-kubernetes                           ovnkube-node-k8w6j                                                0/6     ImagePullBackOff        0              113m

Expected results:

The master is replaced successfully

Additional info:

Tested payload 4.13.0-0.nightly-2023-02-03-145213, same result.
An earlier test on 4.13.0-0.nightly-2023-01-27-165107 worked well.

Description of problem:

If the HyperShift operator is installed onto a cluster, it creates VPC Endpoint Services fronting the hosted Kubernetes API Server for downstream HyperShift clusters to connect to. These VPC Endpoint Services are tagged such that the uninstaller would attempt to action them:

"kubernetes.io/cluster/${ID}: owned"

However they cannot be deleted until all active VPC Endpoint Connections are rejected - the uninstaller should be able to do this.

Version-Release number of selected component (if applicable):

4.12 (but shouldn't be version-specific)

How reproducible:

100%

Steps to Reproduce:

1. Create an NLB + VPC Endpoint Service in the same VPC as a cluster
2. Tag it accordingly and create a VPC Endpoint connection to it

Actual results:

The uninstaller will not be able to delete the VPC Endpoint Service + the NLB that the VPC Endpoint Service is fronting

Expected results:

The VPC Endpoint Service can be completely cleaned up, which would allow the NLB to be cleaned up

Additional info:

 

Description of problem:

When clicking on "Duplicate RoleBinding" in the OpenShift Container Platform Web Console, users are taken to a form where they can review the duplicated RoleBinding.

When the RoleBinding has a ServiceAccount as a subject, clicking "Create" leads to the following error:

An error occurred
Error "Unsupported value: "rbac.authorization.k8s.io": supported values: """ for field "subjects[0].apiGroup".

The root cause seems to be that the field "subjects[0].apiGroup" is set to "rbac.authorization.k8s.io" even for "kind: ServiceAccount" subjects. For "kind: ServiceAccount" subjects, this field is not necessary but the "namespace" field should be set instead.

The functionality works as expected for User and Group subjects.
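
The root cause described above suggests normalizing each subject by kind before the duplicated RoleBinding is submitted. The Python sketch below shows one way this could look; it is an illustration over plain dictionaries, not the console's implementation. The function name normalize_subject and the default_namespace parameter are hypothetical.

```python
RBAC_API_GROUP = "rbac.authorization.k8s.io"


def normalize_subject(subject, default_namespace):
    """Return a copy of a RoleBinding subject with fields valid for its kind."""
    cleaned = dict(subject)
    if cleaned.get("kind") == "ServiceAccount":
        # ServiceAccount subjects take a namespace instead of the RBAC apiGroup.
        cleaned.pop("apiGroup", None)
        cleaned.setdefault("namespace", default_namespace)
    else:
        # User and Group subjects use the RBAC apiGroup.
        cleaned.setdefault("apiGroup", RBAC_API_GROUP)
    return cleaned


# A subject as the duplicate form is described to produce it (names are made up).
broken = {"kind": "ServiceAccount", "name": "builder", "apiGroup": RBAC_API_GROUP}
print(normalize_subject(broken, "demo"))
# {'kind': 'ServiceAccount', 'name': 'builder', 'namespace': 'demo'}
```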

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.19

How reproducible:

Always

Steps to Reproduce:

1. In the OpenShift Container Platform Web Console, click on "User Management" => "Role Bindings"
2. Search for a RoleBinding that has a "ServiceAccount" as the subject. On the far right, click on the dots and choose "Duplicate RoleBinding"
3. Review the fields, set a new name for the duplicated RoleBinding, click "Create"

Actual results:

Duplicating fails with the following error message being shown:

An error occurred
Error "Unsupported value: "rbac.authorization.k8s.io": supported values: """ for field "subjects[0].apiGroup".

Expected results:

RoleBinding is duplicated without an error message

Additional info:

Reproduced with OpenShift Container Platform 4.12.18 and 4.12.19

Description of problem:

The readme.md of builder is just a one-line overview of the project. It would be helpful to add more details for new contributors and visitors to the project.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Install an IPI cluster where all nodes are provisioned from an Azure Marketplace image with a purchase plan.

install-config.yaml:
---------------------------
platform:
  azure:
    region: eastus
    baseDomainResourceGroupName: os4-common
    defaultMachinePlatform:
      osImage:
        publisher: Redhat  <----  contains uppercase letter
        offer: rh-ocp-worker
        sku: rh-ocp-worker
        version: 4.8.2021122100
        plan: WithPurchasePlan

Some marketplace images are free and have no plan, so the publisher in install-config should come from the output of `az vm image list`:

# az vm image list --offer rh-ocp-worker --all -otable
Architecture    Offer          Publisher       Sku                 Urn                                                             Version
--------------  -------------  --------------  ------------------  --------------------------------------------------------------  --------------
x64             rh-ocp-worker  redhat-limited  rh-ocp-worker       redhat-limited:rh-ocp-worker:rh-ocp-worker:4.8.2021122100       4.8.2021122100
x64             rh-ocp-worker  RedHat          rh-ocp-worker       RedHat:rh-ocp-worker:rh-ocp-worker:4.8.2021122100               4.8.2021122100
x64             rh-ocp-worker  redhat-limited  rh-ocp-worker-gen1  redhat-limited:rh-ocp-worker:rh-ocp-worker-gen1:4.8.2021122100  4.8.2021122100
x64             rh-ocp-worker  RedHat          rh-ocp-worker-gen1  RedHat:rh-ocp-worker:rh-ocp-worker-gen1:4.8.2021122100          4.8.2021122100

The image plan is shown below; its publisher is lowercase.
# az vm image show --urn RedHat:rh-ocp-worker:rh-ocp-worker:4.8.2021122100 --query plan
{
  "name": "rh-ocp-worker",
  "product": "rh-ocp-worker",
  "publisher": "redhat"
}

In the installer (https://github.com/openshift/installer/blob/master/data/data/azure/bootstrap/main.tf#L243-L246), the publisher property in the image plan is taken from the publisher set in install-config.yaml. The installer should instead use the publisher property from the image plan output.

Because the image plan is case-sensitive, provisioning the bootstrap instance fails with the error below in this case.

Unable to deploy from the Marketplace image or a custom image sourced from Marketplace image. The part number in the purchase information for VM '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima15image1-flg24-rg/providers/Microsoft.Compute/virtualMachines/jima15image1-flg24-bootstrap' is not as expected. Beware that the Plan object's properties are case-sensitive. Learn more about common virtual machine error codes.

Similar errors occur when provisioning worker instances from this image, where the image publisher contains uppercase letters but the publisher in its plan is all lowercase.

worker machineset:
----------------------------
Spec:
  Lifecycle Hooks:
  Metadata:
  Provider ID:  azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-cc5g2rw8-55267-q66k7-rg/providers/Microsoft.Compute/virtualMachines/ci-op-cc5g2rw8-55267-q66k7-worker-southcentralus1-dq6sp
  Provider Spec:
    Value:
      Accelerated Networking:  true
      API Version:             machine.openshift.io/v1beta1
      Credentials Secret:
        Name:       azure-cloud-credentials
        Namespace:  openshift-machine-api
      Diagnostics:
        Boot:
          Storage Account Type:  AzureManaged
      Image:
        Offer:           rh-ocp-worker
        Publisher:       RedHat
        Resource ID:     
        Sku:             rh-ocp-worker
        Type:            WithPurchasePlan
        Version:         4.8.2021122100
      Kind:              AzureMachineProviderSpec
      Location:          southcentralus
      Managed Identity:  ci-op-cc5g2rw8-55267-q66k7-identity

Error when provisioning the worker instance:
Unable to deploy from the Marketplace image or a custom image sourced from Marketplace image. The part number in the purchase information for VM '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-cc5g2rw8-55267-q66k7-rg/providers/Microsoft.Compute/virtualMachines/ci-op-cc5g2rw8-55267-q66k7-worker-southcentralus1-mmr2h' is not as expected. Beware that the Plan object's properties are case-sensitive. Learn more about common virtual machine error codes.

 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-11-055332

How reproducible:

Always on 4.14 for bootstrap/masters
Always on 4.11+ for workers

Steps to Reproduce:

1. Configure osImage for all nodes in install-config and set publisher to RedHat
2. Install the cluster.
3.

Actual results:

Provisioning of the bootstrap instance fails.

Expected results:

installation is successful.

Additional info:

Installation is successful when setting publisher to "redhat"
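
The report asks for the plan's publisher to come from the image plan rather than from install-config.yaml. A minimal Python sketch of that idea follows, using the values shown in the report. The helper names purchase_plan and differs_only_by_case are hypothetical, and reported_plan stands in for the output of `az vm image show --query plan`; this is not the installer's code.

```python
def purchase_plan(reported_plan):
    """Copy the plan fields exactly as Azure reports them for the image."""
    if not reported_plan:
        return None  # image has no purchase plan
    return {key: reported_plan[key] for key in ("name", "product", "publisher")}


def differs_only_by_case(configured, reported):
    """True when two values match except for letter case."""
    return configured != reported and configured.lower() == reported.lower()


# osImage as set in install-config.yaml, and the plan as reported for the image.
os_image = {"publisher": "RedHat", "offer": "rh-ocp-worker",
            "sku": "rh-ocp-worker", "version": "4.8.2021122100"}
reported_plan = {"name": "rh-ocp-worker", "product": "rh-ocp-worker",
                 "publisher": "redhat"}

plan = purchase_plan(reported_plan)
print(plan["publisher"])                                               # redhat
print(differs_only_by_case(os_image["publisher"], plan["publisher"]))  # True
```

Keeping the image publisher (RedHat) and the plan publisher (redhat) as separate values avoids relying on users to know which casing each Azure API expects.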

Description of problem:

A build which works on 4.12 errored out on 4.13.

Version-Release number of selected component (if applicable):

oc --context build02 get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-ec.3   True        False         4d2h    Cluster version is 4.13.0-ec.3

How reproducible:

Always

Steps to Reproduce:

1. oc new-project hongkliu-test
2. oc create is test-is --as system:admin
3. oc apply -f test-bc.yaml # the file is in the attachment

Actual results:

oc --context build02 logs test-bc-5-build
Defaulted container "docker-build" out of: docker-build, manage-dockerfile (init)
time="2023-02-20T19:13:38Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
I0220 19:13:38.405163       1 defaults.go:112] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].
Caching blobs under "/var/cache/blobs".Pulling image image-registry.openshift-image-registry.svc:5000/ci/html-proofer@sha256:684aae4e929e596f7042c34a3604c81137860187305f775c2380774bda4b6b08 ...
Trying to pull image-registry.openshift-image-registry.svc:5000/ci/html-proofer@sha256:684aae4e929e596f7042c34a3604c81137860187305f775c2380774bda4b6b08...
Getting image source signatures
Copying blob sha256:aa8ae8202b42d1c70c3a7f65680eabc1c562a29227549b9a1b33dc03943b20d2
Copying blob sha256:31326f32ac37d5657248df0a6aa251ec6a416dab712ca1236ea40ca14322a22c
Copying blob sha256:b21786fe7c0d7561a5b89ca15d8a1c3e4ea673820cd79f1308bdfd8eb3cb7142
Copying blob sha256:68296e6645b26c3af42fa29b6eb7f5befa3d8131ef710c25ec082d6a8606080d
Copying blob sha256:6b1c37303e2d886834dab68eb5a42257daeca973bbef3c5d04c4868f7613c3d3
Copying blob sha256:cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08
Copying blob sha256:46cf6a1965a3b9810a80236b62c42d8cdcd6fb75f9b58d1b438db5736bcf2669
Copying config sha256:9aefe4e59d3204741583c5b585d4d984573df8ff751c879c8a69379c168cb592
Writing manifest to image destination
Storing signatures
Adding transient rw bind mount for /run/secrets/rhsm
STEP 1/4: FROM image-registry.openshift-image-registry.svc:5000/ci/html-proofer@sha256:684aae4e929e596f7042c34a3604c81137860187305f775c2380774bda4b6b08
STEP 2/4: RUN apk add --no-cache bash
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/community/x86_64/APKINDEX.tar.gz
(1/1) Installing bash (5.0.11-r1)
Executing bash-5.0.11-r1.post-install
ERROR: bash-5.0.11-r1.post-install: script exited with error 127
Executing busybox-1.31.1-r9.trigger
ERROR: busybox-1.31.1-r9.trigger: script exited with error 127
1 error; 21 MiB in 40 packages
error: build error: building at STEP "RUN apk add --no-cache bash": while running runtime: exit status 1

Expected results:

 

Additional info:

Run the build on build01 (4.12.4) and it works fine.

oc --context build01 get clusterversion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.4    True        False         2d11h   Cluster version is 4.12.4

Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/64

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Following doc [1], a custom role with the minimum permissions for destroying a cluster was assigned to the installer Service Principal.

That doc omits read permission on the public and private DNS zones, which is needed when destroying an IPI cluster, so the installer has no permission to remove the public DNS records.

Even so, the installer destroy completes without any warning message:
$ ./openshift-install destroy cluster --dir ipi --log-level debug
DEBUG OpenShift Installer 4.13.0-0.nightly-2023-02-16-120330 
DEBUG Built from commit c0bf49ca9e83fd00dfdfbbdddd47fbe6b5cdd510 
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json" 
DEBUG deleting public records                      
DEBUG deleting resource group                      
INFO deleted                                       resource group=jima-ipi-role-l7qgz-rg
DEBUG deleting application registrations           
DEBUG Purging asset "Metadata" from disk           
DEBUG Purging asset "Master Ignition Customization Check" from disk 
DEBUG Purging asset "Worker Ignition Customization Check" from disk 
DEBUG Purging asset "Terraform Variables" from disk 
DEBUG Purging asset "Kubeconfig Admin Client" from disk 
DEBUG Purging asset "Kubeadmin Password" from disk 
DEBUG Purging asset "Certificate (journal-gatewayd)" from disk 
DEBUG Purging asset "Cluster" from disk            
INFO Time elapsed: 6m16s                          
INFO Uninstallation complete!                     

$ az network dns record-set a list --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com  -o table| grep jima-ipi-role
*.apps.jima-ipi-role                                       os4-common       30     A       kubernetes.io_cluster.jima-ipi-role-l7qgz="owned"

$ az network dns record-set cname list --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com  -o table| grep jima-ipi-role
api.jima-ipi-role                 os4-common       300    CNAME   kubernetes.io_cluster.jima-ipi-role-l7qgz="owned"

[1] https://docs.google.com/document/d/1iEs7T09Opj0iMXvpKeSatsAyPoda_gWQvFKQuWA3QdM/edit#

Version-Release number of selected component (if applicable):

4.13 nightly build

How reproducible:

always

Steps to Reproduce:

1. Create a custom role with limited permissions for destroying a cluster, without read permission on the public and private DNS zones.
2. Assign the custom role to the Service Principal.
3. Use this Service Principal to destroy the cluster.

Actual results:

Although some permissions were missing, the installer's destroy cluster run completed without any warning.

Expected results:

The installer should emit a warning that lists the leftover resources and the specific reason each one was not removed, so that the user can clean them up.
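The expected behaviour can be sketched as follows. This is an illustrative Python sketch only, not the installer's code (the installer is written in Go); `DestroyReport`, `destroy_all`, and `format_warnings` are hypothetical names, and `PermissionError` stands in for an authorization failure returned by the cloud API.

```python
from dataclasses import dataclass, field


@dataclass
class DestroyReport:
    """Outcome of a destroy run: what was removed and what was left behind."""
    deleted: list = field(default_factory=list)
    leftovers: list = field(default_factory=list)  # (resource, reason) pairs


def destroy_all(steps):
    """Run every destroy step and record permission failures instead of hiding them.

    `steps` is a list of (resource_name, callable) pairs (hypothetical shape).
    """
    report = DestroyReport()
    for name, action in steps:
        try:
            action()
            report.deleted.append(name)
        except PermissionError as err:
            # Keep going, but remember why this resource could not be removed.
            report.leftovers.append((name, str(err)))
    return report


def format_warnings(report):
    """Build one warning line per leftover resource."""
    return [
        f"WARNING {name} was not deleted: {reason}"
        for name, reason in report.leftovers
    ]
```

With a report like this, the final "Uninstallation complete!" message could be preceded by the warning lines whenever `leftovers` is non-empty.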

Additional info:

Description of problem:

When creating a hosted cluster on a management cluster that has an imagecontentsourcepolicy that does not include openshift-release-dev or ocp/release images, the control plane operator fails reconciliation with an error:

{"level":"error","ts":"2023-08-22T18:26:07Z","msg":"Reconciler error","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","HostedControlPlane":{"name":"jiezhao-test","namespace":"clusters-jiezhao-test"},"namespace":"clusters-jiezhao-test","name":"jiezhao-test","reconcileID":"9b3c101b-b4d2-4d9e-b71c-ede9e0b55374","error":"failed to update control plane: failed to reconcile ignition server: failed to parse private registry hosted control plane image reference \"\": repository name must have at least one component","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

1. Create an ImageContentSourcePolicy on a management cluster:

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: brew-registry
  resourceVersion: "31794"
  uid: 7231c634-da35-4c56-b2ef-be48c2571a9c
spec:
  repositoryDigestMirrors:
  - mirrors:
    - brew.registry.redhat.io
    source: registry.redhat.io
  - mirrors:
    - brew.registry.redhat.io
    source: registry.stage.redhat.io
  - mirrors:
    - brew.registry.redhat.io
    source: registry-proxy.engineering.redhat.com


2. Install the latest hypershift operator and create a hosted cluster with the latest 4.14 ci build

Actual results:

The hostedcluster never creates machines and never gets to a Complete state

Expected results:

The hostedcluster comes up and gets to a Complete state

Additional info:

 

Description of problem:

When trying to delete a BMH object that is unmanaged, Metal3 cannot delete it. The BMH object is unmanaged because it does not provide any BMC information (neither address nor credentials).

In this case Metal3 tries to delete the object but fails and never finalizes. The BMH deletion gets stuck.
This is the log from Metal3:

{"level":"info","ts":1676531586.4898946,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.4980938,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.5050912,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.5105371,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676531586.51569,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"}                                                                                            
{"level":"info","ts":1676531586.5191178,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676531586.525755,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                 
{"level":"info","ts":1676531586.5356712,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676532186.5117555,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.5195107,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.526355,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"}                                                                                           
{"level":"info","ts":1676532186.5317476,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.5361836,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.5404322,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.5482726,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.555394,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532532.3448665,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532532.344922,"logger":"controllers.BareMetalHost","msg":"hardwareData is ready to be deleted","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"}
{"level":"info","ts":1676532532.3656478,"logger":"controllers.BareMetalHost","msg":"Initiating host deletion","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged"}
{"level":"error","ts":1676532532.3656952,"msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","bareMetalHost":{"name":"worker-1.el8k-ztp-1.hpecloud.org","namespace":"openshift-machine-api"},
"namespace":"openshift-machine-api","name":"worker-1.el8k-ztp-1.hpecloud.org","reconcileID":"525a5b7d-077d-4d1e-a618-33d6041feb33","error":"action \"unmanaged\" failed: failed to determine current provisioner capacity: failed to parse BMC address informa
tion: missing BMC address","errorVerbose":"missing BMC address\ngithub.com/metal3-io/baremetal-operator/pkg/hardwareutils/bmc.NewAccessDetails\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/metal3-io/baremetal-operator/pkg/hardwareu
tils/bmc/access.go:145\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:112\ngithub.com/metal3-io/baremetal-operator/pkg/pro
visioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/githu
b.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/meta
l3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal
3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareM
etalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremet
al-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/contr
oller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/contro
ller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\
n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to parse BMC address information\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/iro
nic/ironic.go:114\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controlle
rs/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n
\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator
/controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithu
b.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controll
er.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/sr
c/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-
operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-
runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to determine current provisioner capacity\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensur
eCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:85\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal
-operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machin
e.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/contr
ollers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/gi
thub.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operato
r/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-r
untime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controll
er.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\naction \"unmanaged\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operato
r/controllers/metal3.io/baremetalhost_controller.go:230\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/contr
oller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller
-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.
(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594","stacktrace":"sigs.k8s.io/cont
roller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/contr
oller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Provide a BMH object with no BMC credentials. The BMH is set unmanaged.

Steps to Reproduce:

1. Delete the BMH object.
2. The deletion gets stuck.

Actual results:

The BMH deletion gets stuck.

Expected results:

Metal3 detects that the BMH is unmanaged and does not try to deprovision it.

Additional info:

 

Description of problem:

APIServer service not selected correctly for PublicAndPrivate when external-dns isn't configured. 
Image: 4.14 Hypershift operator + OCP 4.14.0-0.nightly-2023-03-23-050449

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jz-test -n clusters -ojsonpath='{.spec.platform.aws.endpointAccess}{"\n"}'
PublicAndPrivate

    - lastTransitionTime: "2023-03-24T15:13:15Z"
      message: Cluster operators console, dns, image-registry, ingress, insights,
        kube-storage-version-migrator, monitoring, openshift-samples, service-ca are
        not available
      observedGeneration: 3
      reason: ClusterOperatorsNotAvailable
      status: "False"
      type: ClusterVersionSucceeding

services:
  - service: APIServer
    servicePublishingStrategy:
      type: LoadBalancer
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
  - service: Konnectivity
    servicePublishingStrategy:
      type: Route
  - service: Ignition
    servicePublishingStrategy:
      type: Route
  - service: OVNSbDb
    servicePublishingStrategy:
      type: Route

jiezhao-mac:hypershift jiezhao$ oc get service -n clusters-jz-test | grep kube-apiserver
kube-apiserver            LoadBalancer  172.30.211.131  aa029c422933444139fb738257aedb86-9e9709e3fa1b594e.elb.us-east-2.amazonaws.com  6443:32562/TCP         34m
kube-apiserver-private        LoadBalancer  172.30.161.79  ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com  6443:32100/TCP         34m
jiezhao-mac:hypershift jiezhao$

jiezhao-mac:hypershift jiezhao$ cat hostedcluster.kubeconfig | grep server
  server: https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443
jiezhao-mac:hypershift jiezhao$

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
E0324 11:17:44.003589   95300 memcache.go:238] couldn't get current server API group list: Get "https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443/api?timeout=32s": dial tcp 10.0.129.24:6443: i/o timeout

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a PublicAndPrivate cluster without external-dns.
2. Access the guest cluster (it should fail).

Actual results:

Unable to access the guest cluster via 'oc get node --kubeconfig=<guest cluster kubeconfig>'; some guest cluster operators are not available.

Expected results:

The cluster is up and running, the guest cluster can be accessed via 'oc get node --kubeconfig=<guest cluster kubeconfig>'

Additional info:

 

 

Description of problem:

Reported upstream in https://github.com/kubernetes/cloud-provider-openstack/issues/2217

Not specifically reproduced in OpenShift, but I have no reason to think we would not be affected, and I know we have users with strict proxy requirements.

The user's configuration requires all OpenStack API requests from the tenant network to go through a proxy. They have configured a proxy 'globally' in their cluster in a manner which also affects the CSI driver.

Attempting to attach a volume to a pod fails. Inspecting the logs we see that cinder attempted to attach the volume to the proxy server, not the node hosting the pod. The reason for this is that the metadata request was also proxied, meaning the returned values relate to the proxy server, not the local server.
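One way to picture the required behaviour is a per-request proxy decision that always exempts the metadata service. The Python sketch below is an illustration under assumptions, not the CSI driver's implementation (the driver is written in Go, and the upstream fix may take a different approach). It assumes the metadata service is reached at the link-local address 169.254.169.254; `should_bypass_proxy` and `METADATA_HOSTS` are hypothetical names.

```python
from urllib.parse import urlparse

# Assumption: the instance metadata service lives at the usual link-local address.
METADATA_HOSTS = {"169.254.169.254"}


def should_bypass_proxy(url: str) -> bool:
    """Return True when the request must go direct, never through the proxy.

    A proxied metadata request would describe the proxy server rather than
    the node that is asking, which is the failure described above.
    """
    host = urlparse(url).hostname
    return host in METADATA_HOSTS
```

Requests to the OpenStack API endpoints would still follow the cluster-wide proxy settings; only the metadata lookup is exempted.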

Version-Release number of selected component (if applicable):

4.13, but likely all versions since we enabled CSI

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Ever since the introduction of the latest invariants feature in origin, MicroShift is unable to run the conformance tests.
Failing invariants include load balancer, image registry and kube-apiserver (https://github.com/openshift/origin/blob/master/pkg/defaultinvariants/types.go#L48-L52) and they are tested for disruptions. These tests don't apply in MicroShift because some of those components don't exist, and none of them are HA.
Requiring the invariants without checking the platform breaks conformance testing in MicroShift.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Run `openshift-tests run openshift/conformance --provider none` with MicroShift kubeconfig.

Steps to Reproduce:

1. 
2.
3.

Actual results:

KUBECONFIG=~/.kube/config ./openshift-tests run openshift/conformance -v 2 --provider none
  Aug  3 11:37:39.859: INFO: MicroShift cluster with version: 4.14.0_0.nightly_2023_06_30_131338_20230703175041_1b2a630fc
I0803 11:37:39.859929    9250 test_setup.go:94] Extended test version v4.1.0-6883-g6ee9dc5
openshift-tests version: v4.1.0-6883-g6ee9dc5
  Aug  3 11:37:39.898: INFO: Enabling in-tree volume drivers
Attempting to pull tests from external binary...
Falling back to built-in suite, failed reading external test suites: unable to extract k8s-tests binary: failed reading ClusterVersion/version: the server could not find the requested resource (get clusterversions.config.openshift.io version)
  W0803 11:37:40.849399    9250 warnings.go:70] unknown field "spec.tls.externalCertificate"
Suite run returned error: [namespaces "openshift-image-registry" not found, the server could not find the requested resource (get infrastructures.config.openshift.io cluster)]
No manifest filename passed
error running options: [namespaces "openshift-image-registry" not found, the server could not find the requested resource (get infrastructures.config.openshift.io cluster)]error: [namespaces "openshift-image-registry" not found, the server could not find the requested resource (get infrastructures.config.openshift.io cluster)]

Expected results:

Tests running to completion.

Additional info:

A nice addition would be extra presubmits in origin that run MicroShift conformance, to catch these problems earlier.

Description of the problem:

Day-2 host stuck in insufficient

How reproducible:

100%

Steps to reproduce:

1. See CI job

Actual results:

Day-2 host stuck in insufficient

Expected results:

Day-2 host becomes known

We should check whether CBT is enabled on the cluster's nodes on the vSphere platform.

1. Perform a full sweep and log each node that has CBT enabled.
2. Create an alert if some VMs have CBT enabled and others don't.
3. The alert should not be emitted if all VMs in the cluster uniformly have CBT enabled.
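The three rules above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the per-node CBT state has already been collected into a dict mapping node name to a boolean, and `nodes_with_cbt` and `cbt_alert_needed` are hypothetical names, not part of any operator.

```python
def nodes_with_cbt(cbt_by_node: dict) -> list:
    """Rule 1: list every node that has CBT enabled, for logging."""
    return sorted(name for name, enabled in cbt_by_node.items() if enabled)


def cbt_alert_needed(cbt_by_node: dict) -> bool:
    """Rules 2 and 3: alert only when the CBT setting is mixed across VMs."""
    states = set(cbt_by_node.values())
    # One distinct state (all on or all off) means the cluster is uniform.
    return len(states) > 1
```

An empty cluster or a cluster where every VM agrees yields no alert; any mix of enabled and disabled VMs does.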

This will avoid issues like - https://issues.redhat.com/browse/OCPBUGS-12249?filter=12399251

The dependencies for the ironic containers are quite old. We need to upgrade them to the latest available versions to keep up with upstream requirements.

Description of problem:

Please check: https://issues.redhat.com/browse/OCPBUGS-18702?focusedId=23021716&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-23021716 for more details.

https://drive.google.com/drive/folders/14aSJs-lO6HC-2xYFlOTJtCZIQg3ekE85?usp=sharing (please see the recording "sc_form_typeerror.mp4").
Issues:
1. The TypeError mentioned above.
2. Default params added by an extension are not added to the created StorageClass.
3. Validation for parameters added by an extension is not working correctly either.
4. The Provisioner child details get stuck once the user selects 'openshift-storage.cephfs.csi.ceph.com'.

Version-Release number of selected component (if applicable):

4.14 (OCP)

How reproducible:

 

Steps to Reproduce:

1. Install ODF operator.
2. Create StorageSystem (once dynamic plugin is loaded).
3. Wait a while for the ODF-related StorageClasses to be created.
4. Once they are created, go to "Create StorageSystem" form.
5. Switch to provisioners (rbd.csi.ceph) added by ODF dynamic plugin. 

Actual results:

Page breaks with an error.

Expected results:

Page should not break.
The functionality should also behave as it did before the refactoring introduced by PR: https://github.com/openshift/console/pull/13036

Additional info:

Stack trace:
Caught error in a child component: TypeError: Cannot read properties of undefined (reading 'parameters')
    at allRequiredFieldsFilled (storage-class-form.tsx:204:1)
    at validateForm (storage-class-form.tsx:235:1)
    at storage-class-form.tsx:262:1
    at invokePassiveEffectCreate (react-dom.development.js:23487:1)
    at HTMLUnknownElement.callCallback (react-dom.development.js:3945:1)
    at Object.invokeGuardedCallbackDev (react-dom.development.js:3994:1)
    at invokeGuardedCallback (react-dom.development.js:4056:1)
    at flushPassiveEffectsImpl (react-dom.development.js:23574:1)
    at unstable_runWithPriority (scheduler.development.js:646:1)
    at runWithPriority$1 (react-dom.development.js:11276:1) {componentStack: '\n    at StorageClassFormInner (http://localhost:90...c03030668ef271da51f.js:491534:20)\n    at Suspense'}

Description of problem:

Incorrect AWS ARN [1] is used for GovCloud and AWS China regions, which will cause the command `ccoctl aws create-all` to fail:

Failed to create Identity provider: failed to apply public access policy to the bucket ci-op-bb5dgq54-77753-oidc: MalformedPolicy: Policy has invalid resource
	status code: 400, request id: VNBZ3NYDH6YXWFZ3, host id: pHF8v7C3vr9YJdD9HWamFmRbMaOPRbHSNIDaXUuUyrgy0gKCO9DDFU/Xy8ZPmY2LCjfLQnUDmtQ=

Correct AWS ARN prefix:
GovCloud (us-gov-east-1 and us-gov-west-1): arn:aws-us-gov
AWS China (cn-north-1 and cn-northwest-1): arn:aws-cn
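The mapping above can be sketched as a small helper. This Python snippet is illustrative only and is not the ccoctl code (ccoctl is written in Go); `arn_partition` and `bucket_objects_arn` are hypothetical names, and deciding the partition from the region-name prefix is an assumption that covers only the regions listed above.

```python
def arn_partition(region: str) -> str:
    """Pick the ARN partition for a region name.

    Assumption: GovCloud regions start with "us-gov-" and AWS China
    regions start with "cn-"; everything else uses the standard partition.
    """
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    if region.startswith("cn-"):
        return "aws-cn"
    return "aws"


def bucket_objects_arn(region: str, bucket: str) -> str:
    """Build the ARN for all objects in an S3 bucket, using the right partition."""
    return f"arn:{arn_partition(region)}:s3:::{bucket}/*"
```

A bucket policy that hard-codes `arn:aws:` as its resource prefix would be rejected in these partitions, which matches the MalformedPolicy error shown above.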

[1] https://github.com/openshift/cloud-credential-operator/pull/526/files#diff-1909afc64595b92551779d9be99de733f8b694cfb6e599e49454b380afc58876R211


 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-05-11-024616

How reproducible:

Always
 

Steps to Reproduce:

1. Run the command `ccoctl aws create-all --name="${infra_name}" --region="${REGION}" --credentials-requests-dir="/tmp/credrequests" --output-dir="/tmp"` in a GovCloud region.

Actual results:

Failed to create Identity provider
 

Expected results:

Create resources successfully.
 

Additional info:

Related PRs:
4.10: https://github.com/openshift/cloud-credential-operator/pull/531
4.11: https://github.com/openshift/cloud-credential-operator/pull/530
4.12: https://github.com/openshift/cloud-credential-operator/pull/529
4.13: https://github.com/openshift/cloud-credential-operator/pull/528
4.14: https://github.com/openshift/cloud-credential-operator/pull/526
 

Description of problem:
The size of PVC/datadir-ibm-spectrum-scale-pmcollector-0 is displayed incorrectly in the OpenShift web console. The PVC size is shown as a negative value, -17.6GiB.
The SC, PV, and PVC details are below.

$ oc get storageclass
NAME                            PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
ibm-spectrum-fusion-mgmt-sc     spectrumscale.csi.ibm.com      Delete          Immediate              true                   2d
ibm-spectrum-fusion (default)   spectrumscale.csi.ibm.com      Delete          Immediate              true                   2d
ibm-spectrum-scale-internal     kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  2d
ibm-spectrum-scale-sample       spectrumscale.csi.ibm.com      Delete          Immediate              false                  2d


$ oc get pv
control-1.ncw-az1-005.caas.bbtnet.com-pmcollector   25Gi          RWO           Retain           Bound    ibm-spectrum-scale/datadir-ibm-spectrum-scale-pmcollector-0                     ibm-spectrum-scale-internal  

$ oc get pvc  -A
NAMESPACE            NAME                                       STATUS   VOLUME                                              CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
ibm-spectrum-scale   datadir-ibm-spectrum-scale-pmcollector-0   Bound    control-1.ncw-az1-005.caas.bbtnet.com-pmcollector   25Gi       RWO            ibm-spectrum-scale-internal   3d


$ oc get pvc datadir-ibm-spectrum-scale-pmcollector-0 -n ibm-spectrum-scale
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
  resourceVersion: '5360546'
  name: datadir-ibm-spectrum-scale-pmcollector-0
  uid: 7a7d0609-0608-409f-91e1-209bb0b3c8d1
  creationTimestamp: '2023-05-01T14:13:40Z'
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2023-05-01T14:13:40Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:pv.kubernetes.io/bind-completed': {}
            'f:pv.kubernetes.io/bound-by-controller': {}
          'f:labels':
            .: {}
            'f:app.kubernetes.io/instance': {}
            'f:app.kubernetes.io/name': {}
        'f:spec':
          'f:accessModes': {}
          'f:resources':
            'f:requests':
              .: {}
              'f:storage': {}
          'f:storageClassName': {}
          'f:volumeMode': {}
          'f:volumeName': {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2023-05-01T14:13:40Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:accessModes': {}
          'f:capacity':
            .: {}
            'f:storage': {}
          'f:phase': {}
      subresource: status
  namespace: ibm-spectrum-scale
  finalizers:
    - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/instance: ibm-spectrum-scale
    app.kubernetes.io/name: pmcollector
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 25Gi
  volumeName: control-1.ncw-az1-005.caas.bbtnet.com-pmcollector
  storageClassName: ibm-spectrum-scale-internal
  volumeMode: Filesystem
status:
  phase: Bound
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 25Gi


However, running `df` inside pod ibm-spectrum-scale-pmcollector-0 shows that the mountPath `/opt/IBM/zimon/data`, where PVC/datadir-ibm-spectrum-scale-pmcollector-0 is mounted, has only 12K used so far and 11G currently available.

[C49904@openshift-eng-bastion-vm ~]$ oc rsh ibm-spectrum-scale-pmcollector-0
Defaulted container "pmcollector" out of: pmcollector, sysmon

sh-4.4$ df -Th | grep -iE 'size|zimon'
Filesystem     Type     Size  Used Avail Use% Mounted on
tmpfs          tmpfs     11G   12K   11G   1% /opt/IBM/zimon/config   

Version-Release number of selected component (if applicable):

OCP 4.10.21
isf-operator.v2.4.0  

How reproducible:

 

Steps to Reproduce:

1. Install IBM Spectrum Scale

Actual results:

The PVC size displayed in the OpenShift web console is a negative value.

Expected results:

 
The PVC size displayed in the OpenShift web console should not be a negative value.
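
To cross-check the value the console shows against the `25Gi` capacity reported by the PVC, the quantity string can be converted to bytes by hand. The following is a minimal sketch, not the console's implementation: the helper names `quantity_to_bytes` and `format_gib` are hypothetical, and only the common binary and decimal suffixes are handled (milli-units and exponent notation are ignored).

```python
from decimal import Decimal

# Multipliers for the binary (power-of-two) and decimal suffixes.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a storage quantity such as '25Gi' to a whole number of bytes."""
    text = quantity.strip()
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return int(Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix])
    # No suffix: the value is already in bytes.
    return int(Decimal(text))


def format_gib(size_bytes: int) -> str:
    """Render a byte count in GiB; a negative size is rejected rather than shown."""
    if size_bytes < 0:
        raise ValueError(f"negative size: {size_bytes}")
    return f"{size_bytes / 2**30:.1f}GiB"


if __name__ == "__main__":
    print(format_gib(quantity_to_bytes("25Gi")))
```

Rejecting a negative byte count in the formatter is a design choice for this sketch: it surfaces a bad input instead of rendering a misleading size.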

Additional info:

 

 

Description of problem:

Application groups cannot be deleted in topology

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Create an application with an application group
2. Go to topology 
3. Delete the application group containing the application

Actual results:

Application group persists in topology

Expected results:

The application group should be deleted

Additional info:

Pipeline API is giving 404 even if the pipelines operator is not installed

Description of problem:

CCO's ServiceAccount cannot list ConfigMaps at the cluster scope.  

Steps to Reproduce:

1. Install an OCP cluster (4.14.0-0.nightly-2023-07-17-215017, CCO commit id = 0c80cc35f6ee4b45016050b3e5a8710a8ed4dd81) with default configuration (CCO in default mode)

2. Create a dummy CredentialsRequest as follows:
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: test-cr
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - action:
      - ec2:CreateTags
      effect: Allow
      resource: '*'
    stsIAMRoleARN: whatever
  secretRef:
    name: test-secret
    namespace: default
  serviceAccountNames:
  - default 

3. Check CCO Pod logs:
time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:02:45Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/test-cr
time="2023-07-18T10:02:45Z" level=info msg="adding finalizer: cloudcredential.openshift.io/deprovision" controller=credreq cr=openshift-cloud-credential-operator/test-cr secret=default/test-secret
time="2023-07-18T10:02:45Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/test-cr
time="2023-07-18T10:02:45Z" level=info msg="stsFeatureGateEnabled: false" actuator=aws cr=openshift-cloud-credential-operator/test-cr
time="2023-07-18T10:02:45Z" level=info msg="stsDetected: false" actuator=aws cr=openshift-cloud-credential-operator/test-cr
time="2023-07-18T10:02:45Z" level=info msg="clusteroperator status updated" controller=status
time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status"
W0718 10:02:45.352434       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
E0718 10:02:45.352460       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
W0718 10:02:46.512738       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
E0718 10:02:46.512763       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
W0718 10:02:48.859931       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
E0718 10:02:48.859957       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
W0718 10:02:53.514713       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
E0718 10:02:53.514798       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
W0718 10:03:03.042040       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
E0718 10:03:03.042068       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
W0718 10:03:25.023729       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
E0718 10:03:25.023758       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
time="2023-07-18T10:04:10Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2023-07-18T10:04:10Z" level=info msg="reconcile complete" controller=metrics elapsed=4.470475ms
W0718 10:04:11.033286       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
E0718 10:04:11.033311       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
W0718 10:04:42.316200       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
E0718 10:04:42.316223       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
W0718 10:05:40.852983       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
E0718 10:05:40.853008       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope
time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:06:10Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2023-07-18T10:06:10Z" level=info msg="reconcile complete" controller=metrics elapsed=3.531182ms
time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status"
time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status"
... 
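
The repeated reflector messages above can be condensed into a list of missing permissions. The sketch below is only a log-reading aid under the assumption that the message wording matches the lines shown; the function name `summarize_forbidden` is hypothetical and not part of CCO.

```python
import re
from collections import Counter

# Matches the "cannot <verb> resource ... at the cluster scope" part of the log line.
FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)" at the cluster scope'
)


def summarize_forbidden(log_lines):
    """Count forbidden cluster-scope requests per (user, verb, resource, group)."""
    counts = Counter()
    for line in log_lines:
        match = FORBIDDEN.search(line)
        if match:
            key = (match["user"], match["verb"], match["resource"], match["group"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: oc logs <cco-pod> | python summarize.py
    for (user, verb, resource, group), n in summarize_forbidden(sys.stdin).items():
        print(f'{n}x {user}: {verb} {resource} (API group "{group}")')
```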

Description of problem:
Starting with OpenShift 4.13, we show a copy button next to the OpenShift Route URL in the topology, the route list, and the route detail page. But the Knative Route URL doesn't show this button, as Vikram mentioned in this code review: https://github.com/openshift/console/pull/12853#issuecomment-1594829827

Version-Release number of selected component (if applicable):
4.13+

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Serverless operator
  2. Import an application as Knative Service
  3. Open the Service in the topology sidebar

Actual results:
Copy button is not shown

Expected results:
Copy button should be displayed

Additional info:

Description of problem:

cluster-ingress-operator E2E has an error message:

[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:

Looks like newClient is called from two places, TestMain and TestIngressStatus

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Run E2E tests that call newClient, such as TestIngressStatus
2. Examine logs

Actual results:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/924/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1663696029016395776/build-log.txt 

[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
goroutine 9120 [running]:
runtime/debug.Stack()
	/usr/lib/golang/src/runtime/debug/stack.go:24 +0x65
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:59 +0xbd
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithName(0xc000113000, {0x1dd106b, 0x14})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:147 +0x4c
github.com/go-logr/logr.Logger.WithName({{0x21435e0, 0xc000113000}, 0x0}, {0x1dd106b?, 0xe?})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/github.com/go-logr/logr/logr.go:336 +0x46
sigs.k8s.io/controller-runtime/pkg/client.newClient(0xc00086afc0, {0x0, 0xc0001a0fc0, {0x2144930, 0xc00033ac00}, 0x0, {0x0, 0x0}, 0x0})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:115 +0xb4
sigs.k8s.io/controller-runtime/pkg/client.New(0xc00086afc0?, {0x0, 0xc0001a0fc0, {0x2144930, 0xc00033ac00}, 0x0, {0x0, 0x0}, 0x0})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:101 +0x85
github.com/openshift/cluster-ingress-operator/pkg/operator/client.NewClient(0x0?)
	/go/src/github.com/openshift/cluster-ingress-operator/pkg/operator/client/client.go:83 +0x145
github.com/openshift/cluster-ingress-operator/test/e2e.TestIngressStatus(0xc000503520)
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/dns_ingressdegrade_test.go:33 +0x95
testing.tRunner(0xc000503520, 0x1f015a0)
	/usr/lib/golang/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
	/usr/lib/golang/src/testing/testing.go:1629 +0x3ea

Expected results:

No error message

Additional info:

This is due to 1.27 rebase

Description of problem:

According to the attached Slack thread, cluster uninstallation gets stuck when load balancers are removed before ingress controllers. This can happen when the ingress controller removal fails and the control plane operator moves on to deleting load balancers without waiting.

Code ref https://github.com/openshift/hypershift/blob/248cea4daef9d8481c367f9ce5a5e0436e0e028a/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1505-L1520

Version-Release number of selected component (if applicable):

4.12.z 4.13.z

How reproducible:

Whenever the load balancer is deleted before the ingress controller

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Load balancer deletion waits for the ingress controller deletion
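
The expected ordering can be sketched as a single reconcile pass that refuses to touch load balancers while any ingress controller still exists. This is an illustration only, not the HyperShift code referenced above; `reconcile_teardown` and its four callables are hypothetical stand-ins for the real list and delete operations.

```python
from typing import Callable, List


def reconcile_teardown(
    list_ingress_controllers: Callable[[], List[str]],
    delete_ingress_controller: Callable[[str], None],
    list_load_balancers: Callable[[], List[str]],
    delete_load_balancer: Callable[[str], None],
) -> bool:
    """Run one teardown pass; return True only when nothing is left to delete."""
    remaining = list_ingress_controllers()
    if remaining:
        # Request deletion, then stop: load balancers stay until a later pass
        # observes that every ingress controller is really gone.
        for name in remaining:
            delete_ingress_controller(name)
        return False

    load_balancers = list_load_balancers()
    for name in load_balancers:
        delete_load_balancer(name)
    # If anything was deleted in this pass, let the next pass confirm it.
    return not load_balancers
```

The caller is assumed to requeue whenever the function returns False, so a failed ingress controller removal delays load balancer deletion instead of racing it.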

Additional info:

 

Slack: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1681310121904539?thread_ts=1681216434.676009&cid=C04EUL1DRHC 

Description of problem:

Image registry pruner job fails when cluster was installed without DeploymentConfig capability. 

The cluster was installed with only the following capabilities:
{\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"marketplace\", \"NodeTuning\" ] }}"

image-pruner pods are failing with the following error:

    state:
      terminated:
        containerID: cri-o://69562d80cafb23a07b9f1d020e1943448916558986092d8540b9a0e1fc3731a1
        exitCode: 1
        finishedAt: "2023-08-21T00:07:37Z"
        message: |
          Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io)
          attempt #1 has failed (exit code 1), going to make another attempt...
          Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io)
          attempt #2 has failed (exit code 1), going to make another attempt...
          Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io)
          attempt #3 has failed (exit code 1), going to make another attempt...
          Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io)
          attempt #4 has failed (exit code 1), going to make another attempt...
          Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io)
          attempt #5 has failed (exit code 1), going to make another attempt...
          Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io)
        reason: Error
        startedAt: "2023-08-21T00:00:05Z"

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-16-114741

How reproducible:

100%

Steps to Reproduce:

1. Install an SNO cluster without the DeploymentConfig capability
2. Check image pruner jobs status

Actual results:

Image pruner jobs do not complete because the deploymentconfigs.apps.openshift.io API is not available.

Expected results:

Image pruner jobs can run without the deploymentconfigs API
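
One way to picture the expected behavior is a collector that skips resource types the API server does not serve instead of failing the whole run. The sketch below makes that assumption explicit: `ResourceNotFound` and `collect_image_references` are hypothetical names, and this is not the actual pruner implementation.

```python
from typing import Callable, Dict, List, Tuple


class ResourceNotFound(Exception):
    """Hypothetical error for a resource type the API server does not serve."""


def collect_image_references(
    listers: Dict[str, Callable[[], List[str]]],
) -> Tuple[List[str], List[str]]:
    """Gather image references per resource type, skipping unavailable APIs.

    Returns (references, skipped_resource_types).
    """
    references: List[str] = []
    skipped: List[str] = []
    for resource, lister in listers.items():
        try:
            references.extend(lister())
        except ResourceNotFound:
            # The capability is disabled, so no objects of this type can
            # reference images; record the skip and continue.
            skipped.append(resource)
    return references, skipped
```

Any other error still propagates, so a genuine API failure is not silently treated as a disabled capability.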

Additional info:

 

Description of problem:


OCP deployments are failing with machine-api-controller pod crashing.

Version-Release number of selected component (if applicable):

OCP 4.14.0-ec.3 

How reproducible:

Always

Steps to Reproduce:

1. Deploy a Baremetal cluster
2. After bootstrap is completed, check the pods running in the openshift-machine-api namespace
3. Check machine-api-controllers-* pod status (it goes from Running to Crashing all the time)
4. Deployment eventually times out and stops with only the master nodes getting deployed.

Actual results:

machine-api-controllers-* pod remains in a crashing loop and OCP 4.14.0-ec.3 deployments fail.

Expected results:

machine-api-controllers-* pod remains running and OCP 4.14.0-ec.3 deployments are completed 

Additional info:

Jobs with older nightly releases in 4.14 are passing, but since Saturday Jul 10th, our CI jobs are failing

$ oc version
Client Version: 4.14.0-ec.3
Kustomize Version: v5.0.1
Kubernetes Version: v1.27.3+e8b13aa

$ oc get nodes
NAME       STATUS   ROLES                  AGE   VERSION
master-0   Ready    control-plane,master   37m   v1.27.3+e8b13aa
master-1   Ready    control-plane,master   37m   v1.27.3+e8b13aa
master-2   Ready    control-plane,master   38m   v1.27.3+e8b13aa

$ oc -n openshift-machine-api get pods -o wide
NAME                                                  READY   STATUS             RESTARTS        AGE   IP              NODE       NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-75b96869d8-gzthq          2/2     Running            0               48m   10.129.0.6      master-0   <none>           <none>
cluster-baremetal-operator-7c9cb8cd69-6bqcg           2/2     Running            0               48m   10.129.0.7      master-0   <none>           <none>
control-plane-machine-set-operator-6b65b5b865-w996m   1/1     Running            0               48m   10.129.0.22     master-0   <none>           <none>
machine-api-controllers-59694ff965-v4kxb              6/7     CrashLoopBackOff   7 (2m31s ago)   46m   10.130.0.12     master-2   <none>           <none>
machine-api-operator-58b54d7c86-cnx4w                 2/2     Running            0               48m   10.129.0.8      master-0   <none>           <none>
metal3-6ffbb8dcd4-drlq5                               6/6     Running            0               45m   192.168.62.22   master-1   <none>           <none>
metal3-baremetal-operator-bd95b6695-q6k7c             1/1     Running            0               45m   10.130.0.16     master-2   <none>           <none>
metal3-image-cache-4p7ln                              1/1     Running            0               45m   192.168.62.22   master-1   <none>           <none>
metal3-image-cache-lfmb4                              1/1     Running            0               45m   192.168.62.23   master-2   <none>           <none>
metal3-image-cache-txjg5                              1/1     Running            0               45m   192.168.62.21   master-0   <none>           <none>
metal3-image-customization-65cf987f5c-wgqs7           1/1     Running            0               45m   10.128.0.17     master-1   <none>           <none>
$ oc -n openshift-machine-api logs machine-api-controllers-59694ff965-v4kxb -c machine-controller | less
...
E0710 15:55:08.230413       1 logr.go:270] controller-runtime/source "msg"="if kind is a CRD, it should be installed before calling Start" "error"="no matches for kind \"Metal3Remediation\" in version \"infrastructure.cluster.x-k8s.io/v1beta1\""  "kind"={"Group":"infrastructure.cluster.x-k8s.io","Kind":"Metal3Remediation"}
E0710 15:55:14.019930       1 controller.go:210]  "msg"="Could not wait for Cache to sync" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced" "controller"="metal3remediation" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="Metal3Remediation" 
I0710 15:55:14.020025       1 logr.go:252]  "msg"="Stopping and waiting for non leader election runnables"  
I0710 15:55:14.020054       1 logr.go:252]  "msg"="Stopping and waiting for leader election runnables"  
I0710 15:55:14.020095       1 controller.go:247]  "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-drain-controller" 
I0710 15:55:14.020147       1 controller.go:247]  "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machineset-controller" 
I0710 15:55:14.020169       1 controller.go:247]  "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-controller" 
I0710 15:55:14.020184       1 controller.go:249]  "msg"="All workers finished" "controller"="machineset-controller" 
I0710 15:55:14.020181       1 controller.go:249]  "msg"="All workers finished" "controller"="machine-drain-controller" 
I0710 15:55:14.020190       1 controller.go:249]  "msg"="All workers finished" "controller"="machine-controller" 
I0710 15:55:14.020209       1 logr.go:252]  "msg"="Stopping and waiting for caches"  
I0710 15:55:14.020323       1 logr.go:252]  "msg"="Stopping and waiting for webhooks"  
I0710 15:55:14.020327       1 reflector.go:225] Stopping reflector *v1alpha1.BareMetalHost (10h53m58.149951981s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
I0710 15:55:14.020393       1 reflector.go:225] Stopping reflector *v1beta1.Machine (9h40m22.116205595s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
I0710 15:55:14.020399       1 logr.go:252] controller-runtime/webhook "msg"="shutting down webhook server"  
I0710 15:55:14.020437       1 reflector.go:225] Stopping reflector *v1.Node (10h3m14.461941979s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
I0710 15:55:14.020466       1 logr.go:252]  "msg"="Wait completed, proceeding to shutdown the manager"  
I0710 15:55:14.020485       1 reflector.go:225] Stopping reflector *v1beta1.MachineSet (10h7m28.391827596s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
E0710 15:55:14.020500       1 main.go:218] baremetal-controller-manager/entrypoint "msg"="unable to run manager" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced"  
E0710 15:55:14.020504       1 logr.go:270]  "msg"="error received after stop sequence was engaged" "error"="leader election lost" 

Our CI job logs can be seen here (Red Hat SSO): https://www.distributed-ci.io/jobs/7da8ee48-8918-4a97-8e3c-f525d19583b8/files

Description of problem:

The AdditionalTrustBundle field in install-config.yaml can be used to add additional certs, however these certs are only propagated to the final image when the ImageContentSources field is also set for mirroring. If mirroring is not set then the additional certs will be on the bootstrap but not the final image.

This can cause a problem when a user has set up a proxy and wants to add additional certs as described here: https://docs.openshift.com/container-platform/4.12/networking/configuring-a-custom-pki.html#installation-configure-proxy_configuring-a-custom-pki

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. In install-config.yaml set additionalTrustBundle and don't set imageContentSources.
2. Do an installation using the install-config.yaml.
3. After the final image is installed and rebooted view the certs in /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt. 

Actual results:

The certs defined in additionalTrustBundle are not in /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt.

Expected results:

The certs defined in additionalTrustBundle will be in /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt even when imageContentSources is not defined.

Additional info:

 

Description of problem:

Pull-through only checks for ICSP, ignoring IDMS/ITMS.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create an IDMS/ITMS rule, for example:

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: digest-mirror
spec:
  imageDigestMirrors:
  - mirrors:
    - registry.access.redhat.com/ubi8/ubi-minimal
    source: quay.io/podman/hello
    mirrorSourcePolicy: NeverContactSource

apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  name: tag-mirror
spec:
  imageTagMirrors:
  - mirrors:
    - registry.access.redhat.com/ubi8/ubi-minimal
    source: quay.io/podman/hello
    mirrorSourcePolicy: NeverContactSource

2. Create an image stream with `referencePolicy: local`. Example: https://gist.github.com/flavianmissi/0518239edd6f51d54b5633212f2b2ac9 
3. Pull the image from the image stream created above. Example `oc new-app test-1:latest` 
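
The lookup that pull-through would need once IDMS/ITMS rules are honored alongside ICSP can be sketched as follows. This is a simplified model under stated assumptions, not the image registry's code: `MirrorRule`, `split_reference`, and `candidate_locations` are hypothetical names, digest rules stand for ICSP and IDMS entries, tag rules stand for ITMS entries, and `never_contact_source` models `mirrorSourcePolicy: NeverContactSource`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MirrorRule:
    source: str
    mirrors: List[str]
    applies_to: str  # "digest" (ICSP, IDMS) or "tag" (ITMS)
    never_contact_source: bool = False


def split_reference(image: str) -> Tuple[str, str, str]:
    """Split an image reference into (repository, suffix, kind)."""
    if "@" in image:
        repo, digest = image.split("@", 1)
        return repo, "@" + digest, "digest"
    # Only a colon in the last path segment is a tag separator;
    # earlier colons belong to a registry port.
    if ":" in image.rsplit("/", 1)[-1]:
        repo, tag = image.rsplit(":", 1)
        return repo, ":" + tag, "tag"
    return image, ":latest", "tag"


def candidate_locations(image: str, rules: List[MirrorRule]) -> List[str]:
    """Return pull locations to try in order: matching mirrors, then the source."""
    repo, suffix, kind = split_reference(image)
    locations: List[str] = []
    allow_source = True
    for rule in rules:
        if rule.applies_to != kind:
            continue
        if repo == rule.source or repo.startswith(rule.source + "/"):
            rest = repo[len(rule.source):]
            locations.extend(mirror + rest + suffix for mirror in rule.mirrors)
            if rule.never_contact_source:
                allow_source = False
    if allow_source:
        locations.append(repo + suffix)
    return locations
```

With the ImageTagMirrorSet from step 1 expressed as a tag rule, `quay.io/podman/hello:latest` resolves only to the mirror, because the rule forbids contacting the source.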

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
As part of Chaos Monkey testing, we deleted the machine-config-controller pod in an SNO+1 cluster. Restarting the machine-config-controller pod also restarts the daemonset/sriov-network-config-daemon and linuxptp-daemon pods.

      

1m47s       Normal   Killing            pod/machine-config-controller-7f46c5d49b-w4p9s    Stopping container machine-config-controller
1m47s       Normal   Killing            pod/machine-config-controller-7f46c5d49b-w4p9s    Stopping container oauth-proxy

 

 

 

openshift-sriov-network-operator   23m         Normal   Killing            pod/sriov-network-config-daemon-pv4tr   Stopping container sriov-infiniband-cni
openshift-sriov-network-operator   23m         Normal   SuccessfulDelete   daemonset/sriov-network-config-daemon   Deleted pod: sriov-network-config-daemon-pv4tr 

Version-Release number of selected component (if applicable):

 

4.12

How reproducible:

Steps to Reproduce:

Restart the machine-config-controller pod in openshift-machine-config-operator namespace. 
1. oc get pod -n openshift-machine-config-operator 
2. oc delete  pod/machine-config-controller-xxx -n openshift-machine-config-operator 

 

 

Actual results:

Restarting the machine-config-controller pod also restarts the daemonset/sriov-network-config-daemon and linuxptp-daemon pods.

Expected results:

Restarting the machine-config-controller pod should not restart these pods.

Additional info:

logs : https://drive.google.com/drive/folders/1XxYen8tzENrcIJdde8sortpyY5ZFZCPW?usp=share_link

Description of problem:

CNCC failed to assign egressIP to NIC for Azure Workload Identity Cluster

Refer to https://issues.redhat.com/browse/CCO-294

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-11-055332

How reproducible:

Always

Steps to Reproduce:

1. Create an Azure Workload Identity cluster with "workflow-launch cucushift-installer-rehearse-azure-ipi-cco-manual-workload-identity-tp 4.14" from cluster-bot
2. Configure egressIP

Actual results:

 % oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-3   10.0.128.100     

% oc get cloudprivateipconfig -o yaml
apiVersion: v1
items:
- apiVersion: cloud.network.openshift.io/v1
  kind: CloudPrivateIPConfig
  metadata:
    annotations:
      k8s.ovn.org/egressip-owner-ref: egressip-3
    creationTimestamp: "2023-08-14T04:41:05Z"
    finalizers:
    - cloudprivateipconfig.cloud.network.openshift.io/finalizer
    generation: 1
    name: 10.0.128.100
    resourceVersion: "65159"
    uid: 2b7b1137-0e2e-46e8-9bca-1176330322a9
  spec:
    node: ci-ln-b4tlp9t-1d09d-2chnb-worker-centralus1-jgqp2
  status:
    conditions:
    - lastTransitionTime: "2023-08-14T04:41:17Z"
      message: 'Error processing cloud assignment request, err: network.InterfacesClient#CreateOrUpdate:
        Failure sending request: StatusCode=0 -- Original Error: Code="LinkedAuthorizationFailed"
        Message="The client ''d367c1b8-9f5d-4257-b5c8-363f61af32c2'' with object id
        ''d367c1b8-9f5d-4257-b5c8-363f61af32c2'' has permission to perform action
        ''Microsoft.Network/networkInterfaces/write'' on scope ''/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-ln-b4tlp9t-1d09d/providers/Microsoft.Network/networkInterfaces/ci-ln-b4tlp9t-1d09d-2chnb-worker-centralus1-jgqp2-nic'';
        however, it does not have permission to perform action ''Microsoft.Network/virtualNetworks/subnets/join/action''
        on the linked scope(s) ''/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-ln-b4tlp9t-1d09d/providers/Microsoft.Network/virtualNetworks/ci-ln-b4tlp9t-1d09d-2chnb-vnet/subnets/ci-ln-b4tlp9t-1d09d-2chnb-worker-subnet''
        or the linked scope(s) are invalid."'
      observedGeneration: 1
      reason: CloudResponseError
      status: "False"
      type: Assigned
    node: ci-ln-b4tlp9t-1d09d-2chnb-worker-centralus1-jgqp2
kind: List
metadata:
  resourceVersion: ""

Expected results:

The EgressIP is assigned to the egress node

Additional info:


Description of problem:

Upgraded from 4.11.17 -> 4.12.0 rc3 and found (after successful upgrade) this repeating in Machine Config Operator logs:

2022-12-13T23:11:51.511167249Z W1213 23:11:51.511120       1 warnings.go:70] unknown field "spec.dns.metadata.creationTimestamp"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511140       1 warnings.go:70] unknown field "spec.dns.metadata.generation"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511143       1 warnings.go:70] unknown field "spec.dns.metadata.managedFields"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511146       1 warnings.go:70] unknown field "spec.dns.metadata.name"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511148       1 warnings.go:70] unknown field "spec.dns.metadata.resourceVersion"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511151       1 warnings.go:70] unknown field "spec.dns.metadata.uid"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511153       1 warnings.go:70] unknown field "spec.infra.metadata.creationTimestamp"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511155       1 warnings.go:70] unknown field "spec.infra.metadata.generation"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511157       1 warnings.go:70] unknown field "spec.infra.metadata.managedFields"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511159       1 warnings.go:70] unknown field "spec.infra.metadata.name"
2022-12-13T23:11:51.511167249Z W1213 23:11:51.511161       1 warnings.go:70] unknown field "spec.infra.metadata.resourceVersion"
2022-12-13T23:11:51.511211644Z W1213 23:11:51.511163       1 warnings.go:70] unknown field "spec.infra.metadata.uid"

Version-Release number of selected component (if applicable):

4.12.0-rc3
Platform agnostic installation 

How reproducible:

Just once (working with user outside RH)

Steps to Reproduce:

1. Install 4.11.17
2. Set candidate-4.12 upgrade channel
3. Initiate upgrade (apply admin ack as needed)
4. After upgrade, check Machine Config Operator logs

Actual results:

The upgrade went fine and I don't see any symptoms outside of warnings repeating in MCO log

Expected results:

I don't expect the warnings to be logged repeatedly 

Additional info:

 

Description of problem:

IPI installation to a shared VPC with 'credentialsMode: Manual' failed because no IAM service accounts were created for the control-plane machines and compute machines

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-04-18-005127

How reproducible:

Always

Steps to Reproduce:

1. "create install-config", and then insert the settings of interest in install-config.yaml
2. "create manifests"
3. run "ccoctl" to create the required credentials
4. grant the above IAM service accounts the required permissions in the host project (see https://github.com/openshift/openshift-docs/pull/58474)
5. "create cluster" 

Actual results:

The installer doesn't create the 2 IAM service accounts (one for control-plane machines and another for compute machines), so no compute machines get created, which leads to installation failure.

Expected results:

The installation should succeed.

Additional info:

FYI https://issues.redhat.com/browse/OCPBUGS-11605
$ gcloud compute instances list --filter='name~jiwei-0418'
NAME                        ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP  STATUS
jiwei-0418a-9kvlr-master-0  us-central1-a  n2-standard-4               10.0.0.62                 RUNNING
jiwei-0418a-9kvlr-master-1  us-central1-b  n2-standard-4               10.0.0.58                 RUNNING
jiwei-0418a-9kvlr-master-2  us-central1-c  n2-standard-4               10.0.0.29                 RUNNING
$ gcloud iam service-accounts list --filter='email~jiwei-0418'
DISPLAY NAME                                                     EMAIL                                                                DISABLED
jiwei-0418a-14589-openshift-image-registry-gcs                   jiwei-0418a--openshift-i-zmwwh@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-machine-api-gcp                      jiwei-0418a--openshift-m-5cc5l@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-cloud-credential-operator-gcp-ro-creds         jiwei-0418a--cloud-crede-p8lpc@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-gcp-ccm                              jiwei-0418a--openshift-g-bljz6@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-ingress-gcp                          jiwei-0418a--openshift-i-rm4vz@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-cloud-network-config-controller-gcp  jiwei-0418a--openshift-c-6dk7g@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-gcp-pd-csi-driver-operator           jiwei-0418a--openshift-g-pjn24@openshift-qe.iam.gserviceaccount.com  False
$

 

Description of problem:

When the user selects the "Use Pipeline from this cluster" option in the Add Pipeline section, the Create button should be enabled, but due to PAC validation the Create button is disabled

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

Always

Steps to Reproduce:

1. Go to Import from Git page
2. Add repository https://bitbucket.org/lokanandap/hello-func
3. Select Use Pipeline from this cluster in Add Pipeline section 

Actual results:

Create button is disabled

Expected results:

Create button should be enabled to create the workload

Additional info:

 

Description of problem:

The IPv6 interface and IP address are missing in all pods created in OCP 4.12 EC-2.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Every time

Steps to Reproduce:

We create network-attachment-definitions.k8s.cni.cncf.io in OCP cluster at namespace scope for our software pods to get IPV6 IPs. 

Actual results:

Pods do not receive IPv6 addresses

Expected results:

Pods receive IPv6 addresses

Additional info:

This worked flawlessly up to OCP 4.10.21. However, when we try the same code in OCP 4.12-ec2, all our pods are missing their IPv6 address, and we have to restart the pods a couple of times before they get one.

This is a clone of issue OCPBUGS-19418. The following is the description of the original issue:

Description of problem:

OCP Upgrades fail with message "Upgrade error from 4.13.X: Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors"

Version-Release number of selected component (if applicable):

Currently 4.14.0-rc.1, but we observed the same issue with previous 4.14 nightlies too: 
4.14.0-0.nightly-2023-09-12-195514
4.14.0-0.nightly-2023-09-02-132842
4.14.0-0.nightly-2023-08-28-154013

How reproducible:

1 out of 2 upgrades

Steps to Reproduce:

1. Deploy OCP 4.13 with latest GA on a baremetal cluster with IPI and OVN-K
2. Upgrade to latest 4.14 available
3. Check cluster version status during the upgrade, at some point upgrade stops with message: "Upgrade error from 4.13.X Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors"
4. Check OVN pods with "oc get pods -n openshift-ovn-kubernetes": some pods run 7 out of 8 containers (ovnkube-node is missing) and restart constantly, and some pods run only 5 containers and show errors connecting to the OVN DBs.
5. Check cluster operators with "oc get co": mainly dns, network, and machine-config remain at 4.13 and are degraded.

Actual results:

Upgrade not completed, and OVN pods remain in a restarting loop with failures.

Expected results:

Upgrade should be completed without issues, and OVN pods should remain in a Running status without restarts.

Additional info:

  • We have tested this with the latest GA version of 4.13 (as of today, Sep 19: 4.13.13 to 4.14.0-rc1), but we have been observing it for about 20 days with previous versions of 4.13 and 4.14.
  • Our deployments have single stack IPv4 , one NIC for provisioning and one NIC for baremetal (machine network)

These are the results from our latest test from 4.13.13 to 4.14.0-rc1

$ oc get clusterversion
NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version           True       True         2h8m   Unable to apply 4.14.0-rc.1: an unknown error has occurred: MultipleErrors

$ oc get mcp
NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
master  rendered-master-ebb1da47ad5cb76c396983decb7df1ea  True     False     False     3             3                  3                    0                     3h41m
worker  rendered-worker-26ccb35941236935a570dddaa0b699db  False    True      True      3             2                  2                    1                     3h41m

$ oc get co
NAME                                      VERSION      AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication                            4.14.0-rc.1  True       False        False     2h21m
baremetal                                 4.14.0-rc.1  True       False        False     3h38m
cloud-controller-manager                  4.14.0-rc.1  True       False        False     3h41m
cloud-credential                          4.14.0-rc.1  True       False        False     2h23m
cluster-autoscaler                        4.14.0-rc.1  True       False        False     2h21m
config-operator                           4.14.0-rc.1  True       False        False     3h40m
console                                   4.14.0-rc.1  True       False        False     2h20m
control-plane-machine-set                 4.14.0-rc.1  True       False        False     3h40m
csi-snapshot-controller                   4.14.0-rc.1  True       False        False     2h21m
dns                                       4.13.13      True       True         True      2h9m
etcd                                      4.14.0-rc.1  True       False        False     2h40m
image-registry                            4.14.0-rc.1  True       False        False     2h9m
ingress                                   4.14.0-rc.1  True       True         True      1h14m
insights                                  4.14.0-rc.1  True       False        False     3h34m
kube-apiserver                            4.14.0-rc.1  True       False        False     2h35m
kube-controller-manager                   4.14.0-rc.1  True       False        False     2h30m
kube-scheduler                            4.14.0-rc.1  True       False        False     2h29m
kube-storage-version-migrator             4.14.0-rc.1  False      True         False     2h9m
machine-api                               4.14.0-rc.1  True       False        False     2h24m
machine-approver                          4.14.0-rc.1  True       False        False     3h40m
machine-config                            4.13.13      True       False        True      59m
marketplace                               4.14.0-rc.1  True       False        False     3h40m
monitoring                                4.14.0-rc.1  False      True         True      2h3m
network                                   4.13.13      True       True         True      2h4m
node-tuning                               4.14.0-rc.1  True       False        False     2h9m
openshift-apiserver                       4.14.0-rc.1  True       False        False     2h20m
openshift-controller-manager              4.14.0-rc.1  True       False        False     2h20m
openshift-samples                         4.14.0-rc.1  True       False        False     2h23m
operator-lifecycle-manager                4.14.0-rc.1  True       False        False     2h23m
operator-lifecycle-manager-catalog        4.14.0-rc.1  True       False        False     2h18m
operator-lifecycle-manager-packageserver  4.14.0-rc.1  True       False        False     2h20m
service-ca                                4.14.0-rc.1  True       False        False     2h23m
storage                                   4.14.0-rc.1  True       False        False     3h40m

Some OVN pods run 7 out of 8 containers (ovnkube-node is missing) and restart constantly, and some pods run only 5 containers and show errors connecting to the OVN DBs.

$ oc get pods -n openshift-ovn-kubernetes -o wide
NAME                                    READY  STATUS   RESTARTS  AGE    IP             NODE
ovnkube-control-plane-5f5c598768-czkjv  2/2    Running  0         2h16m  192.168.16.32  dciokd-master-1
ovnkube-control-plane-5f5c598768-kg69r  2/2    Running  0         2h16m  192.168.16.31  dciokd-master-0
ovnkube-control-plane-5f5c598768-prfb5  2/2    Running  0         2h16m  192.168.16.33  dciokd-master-2
ovnkube-node-9hjv9                      5/5    Running  1         3h43m  192.168.16.32  dciokd-master-1
ovnkube-node-fmswc                      7/8    Running  19        2h10m  192.168.16.36  dciokd-worker-2
ovnkube-node-pcjhp                      7/8    Running  20        2h15m  192.168.16.35  dciokd-worker-1
ovnkube-node-q7kcj                      5/5    Running  1         3h43m  192.168.16.33  dciokd-master-2
ovnkube-node-qsngm                      5/5    Running  3         3h27m  192.168.16.34  dciokd-worker-0
ovnkube-node-v2d4h                      7/8    Running  20        2h15m  192.168.16.31  dciokd-master-0

$ oc logs ovnkube-node-9hjv9 -c ovnkube-node -n openshift-ovn-kubernetes | less
...
2023-09-19T03:40:23.112699529Z E0919 03:40:23.112660    5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Northbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
2023-09-19T03:40:23.112699529Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.112699529Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1)
2023-09-19T03:40:23.112699529Z E0919 03:40:23.112677    5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1
2023-09-19T03:40:23.114791313Z E0919 03:40:23.114777    5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_NORTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
2023-09-19T03:40:23.114791313Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.114791313Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 memory/show' failed: exit status 1)
2023-09-19T03:40:23.116492808Z E0919 03:40:23.116478    5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Southbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
2023-09-19T03:40:23.116492808Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.116492808Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1)
2023-09-19T03:40:23.116492808Z E0919 03:40:23.116488    5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1
2023-09-19T03:40:23.118468064Z E0919 03:40:23.118450    5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_SOUTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
2023-09-19T03:40:23.118468064Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.118468064Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 memory/show' failed: exit status 1)
2023-09-19T03:40:25.118085671Z E0919 03:40:25.118056    5883 ovn_northd.go:128] Failed to get ovn-northd status stderr() :(failed to run the command since failed to get ovn-northd's pid: open /var/run/ovn/ovn-northd.pid: no such file or directory)

Description: During an upgrade from non-IC to IC, the CNO status logic looks up a well-known configmap that indicates whether an upgrade to IC is ongoing, so that it does not report the new operator version (4.14) until the second and final phase of the IC upgrade is done.

The following corrections are needed:

  •  CNO shouldn't report new version if IC configmap can't be retrieved for whatever reason, as suggested in a review to the CNO PR that enabled IC support: https://github.com/openshift/cluster-network-operator/pull/1874#pullrequestreview-1560616992
  •  looking up the key "ongoing-upgrade" inside the IC configmap is enough; checking for its value to be "true" is not correct, since after code reviews it was decided not to set the value to "true", but to set it to an empty string;
  • (optimization) CNO shouldn't look up the IC configmap if the cluster is not running ovn kubernetes 
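The three corrections above can be sketched as a single decision function. This is only an illustration in Python, not the CNO code: the function name, the `fetch_configmap` callable, and the convention that it returns `None` when the configmap does not exist are all assumptions made for the example. Only the key name "ongoing-upgrade" comes from the description.

```python
from typing import Callable, Mapping, Optional

# Key named in the description; its value is an empty string, not "true".
ONGOING_UPGRADE_KEY = "ongoing-upgrade"


def should_report_new_version(
    is_ovn_kubernetes: bool,
    fetch_configmap: Callable[[], Optional[Mapping[str, str]]],
) -> bool:
    """Decide whether the operator may report the new version.

    fetch_configmap is a hypothetical helper: it returns the configmap data,
    returns None if the configmap does not exist (an assumption), and raises
    on any retrieval error.
    """
    # Optimization: skip the lookup entirely on clusters not running OVN-Kubernetes.
    if not is_ovn_kubernetes:
        return True
    try:
        data = fetch_configmap()
    except Exception:
        # The configmap could not be retrieved: do not report the new version.
        return False
    if data is None:
        return True
    # Presence of the key is enough; its value is deliberately not inspected.
    return ONGOING_UPGRADE_KEY not in data
```

With this shape, a configmap containing `{"ongoing-upgrade": ""}` blocks the version report even though the value is empty, which is the behavior the second correction asks for.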

Please review the following PR: https://github.com/openshift/images/pull/133

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-18003. The following is the description of the original issue:

Description of problem:

The automated test case OCP-42340 failed in a CI job running 4.14.0-ec.4; the issue was then reproduced in 4.14.0-0.nightly-2023-08-22-221456

Version-Release number of selected component (if applicable):

4.14.0-ec.4 4.14.0-0.nightly-2023-08-22-221456

How reproducible:

Always

Steps to Reproduce:

1. Deploy egressrouter on baremetal with 
{
    "kind": "List",
    "apiVersion": "v1",
    "metadata": {},
    "items": [
        {
            "apiVersion": "network.operator.openshift.io/v1",
            "kind": "EgressRouter",
            "metadata": {
                "name": "egressrouter-42430",
                "namespace": "e2e-test-networking-egressrouter-l4xgx"
            },
            "spec": {
                "addresses": [
                    {
                        "gateway": "192.168.111.1",
                        "ip": "192.168.111.55/24"
                    }
                ],
                "mode": "Redirect",
                "networkInterface": {
                    "macvlan": {
                        "mode": "Bridge"
                    }
                },
                "redirect": {
                    "redirectRules": [
                        {
                            "destinationIP": "142.250.188.206",
                            "port": 80,
                            "protocol": "TCP"
                        },
                        {
                            "destinationIP": "142.250.188.206",
                            "port": 8080,
                            "protocol": "TCP",
                            "targetPort": 80
                        },
                        {
                            "destinationIP": "142.250.188.206",
                            "port": 8888,
                            "protocol": "TCP",
                            "targetPort": 80
                        }
                    ]
                }
            }
        }
    ]
}

 % oc get pods -n  e2e-test-networking-egressrouter-l4xgx -o wide
NAME                                           READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
egress-router-cni-deployment-c4bff88cf-skv9j   1/1     Running   0          69m   10.131.0.26   worker-0   <none>           <none>

2. Create a service that points to the egressrouter
% oc get svc -n e2e-test-networking-egressrouter-l4xgx -o yaml  
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    creationTimestamp: "2023-08-23T05:58:30Z"
    name: ovn-egressrouter-multidst-svc
    namespace: e2e-test-networking-egressrouter-l4xgx
    resourceVersion: "50383"
    uid: 07341ff1-6df3-40a6-b27e-59102d56e9c1
  spec:
    clusterIP: 172.30.10.103
    clusterIPs:
    - 172.30.10.103
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: con1
      port: 80
      protocol: TCP
      targetPort: 80
    - name: con2
      port: 5000
      protocol: TCP
      targetPort: 8080
    - name: con3
      port: 6000
      protocol: TCP
      targetPort: 8888
    selector:
      app: egress-router-cni
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
kind: List
metadata:
  resourceVersion: ""

3. Create a test pod to access the service, or curl the egressrouter IP:port directly
oc rsh -n e2e-test-networking-egressrouter-l4xgx hello-pod1                                  
~ $ curl 172.30.10.103:80 --connect-timeout 5
curl: (28) Connection timeout after 5001 ms
~ $ curl 10.131.0.26:80 --connect-timeout 5
curl: (28) Connection timeout after 5001 ms
~ $ curl 10.131.0.26:8080 --connect-timeout 5
curl: (28) Connection timeout after 5001 ms




Actual results:

  connection failed

Expected results:

  connection succeeds

Additional info:
Note, the issue didn't exist in 4.13. It passed in 4.13 latest nightly build 4.13.0-0.nightly-2023-08-11-101506

08-23 15:26:16.955  passed: (1m3s) 2023-08-23T07:26:07 "[sig-networking] SDN ConnectedOnly-Author:huirwang-High-42340-Egress router redirect mode with multiple destinations."

Description of problem:

node-exporter profiling shows that ~16% of CPU time is spent fetching details about btrfs mounts. The RHEL kernel doesn't have btrfs, so it's safe to disable this exporter

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/telemeter/pull/460

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

As endorsed at DNS Flag Day, the DNS Community recommends a bufsize setting of 1232 as a safe default that supports larger payloads, while generally avoiding IP fragmentation on most networks. This is particularly relevant for payloads like those generated by DNSSEC, which tend to be larger.

Previously, CoreDNS always used the EDNS0 extension, which enables UDP-based DNS queries to exceed 512 bytes, when CoreDNS forwarded DNS queries to an upstream name server, and so OpenShift specified a bufsize setting of 512 to maintain compatibility with applications and name servers that did not support the EDNS0 extension.

For clients and name servers that do support EDNS0, a bufsize setting of 512 can result in more DNS truncation and unnecessary TCP retransmissions, resulting in worse DNS performance for most users. This is due to the fact that if a response is larger than the bufsize setting, it gets truncated, prompting clients to initiate a TCP retry. In this situation, two DNS requests are made for a single DNS answer, leading to higher bandwidth usage and longer response times.

Currently, CoreDNS no longer uses EDNS0 when forwarding requests if the original client request did not use EDNS0 (ref: coredns/coredns@a5b9749), and so the reasoning for using a bufsize setting of 512 no longer applies. By increasing the bufsize setting to the recommended value of 1232 bytes, we can enhance DNS performance by decreasing the probability of DNS truncations.

Using a larger bufsize setting of 1232 bytes also would potentially help alleviate bugs like https://issues.redhat.com/browse/OCPBUGS-6829 in which a non-compliant upstream DNS is not respecting a bufsize of 512 bytes and sending larger-than-512-bytes responses. A bufsize setting of 1232 bytes doesn't fix the root cause of this issue; rather, it decreases the likelihood of its occurrence by increasing the acceptable size range for UDP responses.

Note that clients that don’t support EDNS0 or TCP, such as applications built using older versions of Alpine Linux, are still subject to the aforementioned truncation issue. To avoid these issues, ensure that your application is built using a DNS resolver library that supports EDNS0 or TCP-based DNS queries.
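The truncation behavior described above can be illustrated with a small model. This is a simplified sketch and not CoreDNS code: the function names are made up for the example, and the model assumes that the effective UDP limit is the smaller of the client's advertised EDNS0 buffer size and the configured bufsize, with a 512-byte limit for clients that do not use EDNS0.

```python
from typing import Optional

CLASSIC_UDP_LIMIT = 512  # limit for clients that do not use EDNS0


def udp_limit(client_edns_bufsize: Optional[int], configured_bufsize: int) -> int:
    """Largest UDP response (in bytes) sent to this client in the model."""
    if client_edns_bufsize is None:
        return CLASSIC_UDP_LIMIT
    return min(client_edns_bufsize, configured_bufsize)


def queries_needed(
    response_size: int,
    client_edns_bufsize: Optional[int],
    configured_bufsize: int,
) -> int:
    """1 if the answer fits in UDP, 2 if it is truncated and retried over TCP."""
    limit = udp_limit(client_edns_bufsize, configured_bufsize)
    return 1 if response_size <= limit else 2


# A 900-byte answer for an EDNS0 client that advertises 4096 bytes:
print(queries_needed(900, 4096, 512))   # truncated, then retried over TCP
print(queries_needed(900, 4096, 1232))  # fits in a single UDP response
```

In this model a client without EDNS0 still needs the TCP retry for the same 900-byte answer regardless of the configured bufsize, which matches the note about older resolver libraries.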

Brief history of OpenShift's Bufsize changes:

  1. During the development of OpenShift 4.8.0, we updated to 1232 bytes due to Bug - 1949361 and backported to 4.7 and 4.6. However, later on, 4.8.0 (in development), 4.7, and 4.6 were reverted back to 512 bytes due to Bug - 1966116.
  2. Also in OpenShift 4.8.0, we bumped CoreDNS to v1.8.1, and picked up a commit that forced DNS queries that did not have the DO Bit (DNSSEC) set to set bufsize as 2048 bytes despite 512 bytes being set in the configuration.
  3. In OpenShift 4.12.0, we fixed OCPBUGS-240 to limit all DNS queries, specifically queries that had DO Bit off, to what is configured in the configuration file (512 bytes) and we backported the fix to 4.11, 4.10, and 4.9.
  4. Now, this PR is changing bufsize to 1232 bytes.

Version-Release number of selected component (if applicable):

4.14, 4.13, 4.12, 4.11

How reproducible:

100%

Steps to Reproduce:

1. oc -n openshift-dns get configmaps/dns-default -o yaml | grep -i bufsize

Actual results:

Bufsize = 512

Expected results:

Bufsize = 1232

Additional info:

 

This is a clone of issue OCPBUGS-20104. The following is the description of the original issue:

Description of problem:

The recently introduced node identity feature introduces pods that are running as root. While it's understood there may be situations where that is absolutely required, the goal should be to always run with least privilege / non-root.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Deploy an IBM Managed OpenShift 4.14.0 cluster. I suspect any OpenShift 4.14.0 cluster will have these pods running as root as well.

Actual results:

network-node-identity pods are running as root

Expected results:

network-node-identity pods should be running as non-root

Additional info:

Due to the introduction of these pods running as root in an IBM Managed OpenShift 4.14.0 cluster, we will have to file for a security exception.

Description of the problem:

Searching cluster events with message=\ (or message=%5C) returns all "Writing image to disk" messages.
e.g. "Host: test-infra-cluster-f5e3a8e9-master-1, reached installation stage Writing image to disk: 5%"

 

How reproducible:

100%  

 

Steps to reproduce:

1. Install cluster 

2. List events with message=\ , or message=%5C

 

curl -s -v  --location --request GET 'https://api.stage.openshift.com/api/assisted-install/v2/events?cluster_id=2aa44b94-e533-44fe-9c0f-3b20a3d91b4e&message=%5C' --header "Authorization: Bearer $(ocm token)" | jq '.'

or

curl -s -v  --location --request GET 'https://api.stage.openshift.com/api/assisted-install/v2/events?cluster_id=2aa44b94-e533-44fe-9c0f-3b20a3d91b4e&message=\' --header "Authorization: Bearer $(ocm token)" | jq '.' 

 

Actual results:

All "Writing image to disk" events are returned

 

Expected results:

Only events whose message includes '\' are returned
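One plausible explanation for this symptom, stated here as an assumption rather than a confirmed root cause, is that the message filter is passed unescaped into a SQL LIKE pattern in which the backslash is the escape character. The pattern `%\%` then means "anything followed by a literal percent sign", which matches progress messages ending in "5%". The sketch below reproduces that effect with an in-memory SQLite table; the table layout, the helper names, and the third sample event are hypothetical.

```python
import sqlite3


def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE metacharacters so `term` is matched literally."""
    # Escape the escape character first, then the two LIKE wildcards.
    for ch in (escape, "%", "_"):
        term = term.replace(ch, escape + ch)
    return term


def search_events(conn: sqlite3.Connection, term: str, escaped: bool) -> list:
    """Substring search on the message column, with or without escaping."""
    needle = escape_like(term) if escaped else term
    rows = conn.execute(
        "SELECT message FROM events WHERE message LIKE ? ESCAPE '\\' ORDER BY id",
        ("%" + needle + "%",),
    )
    return [row[0] for row in rows]


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory events table for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, message TEXT)")
    conn.executemany(
        "INSERT INTO events (message) VALUES (?)",
        [
            ("Host: master-1, reached installation stage Writing image to disk: 5%",),
            ("Host: master-1, reached installation stage Done",),
            ("Hypothetical event mentioning C:\\temp",),  # made-up sample
        ],
    )
    return conn
```

Calling `search_events(conn, "\\", escaped=False)` returns only the "Writing image to disk: 5%" event, while `escaped=True` returns only the event that actually contains a backslash.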

Description of the problem:

CVO 4.14 failed to install when Nutanix platform provider is selected.

 

 

{
"cluster_id": "c8359d4e-141b-45ff-9979-d49dd679d56b",
"name": "cvo",
"operator_type": "builtin",
"status": "failed",
"status_updated_at": "2023-06-29T07:40:47.855Z",
"timeout_seconds": 3600,
"version": "4.14.0-0.nightly-2023-06-27-233015"
}

 

 

e.g https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-assisted-test-infra-master-e2e-nutanix-assisted-periodic/1674303871989583872

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

We need to improve our must-gather so that we can collect the CRs on which the vSphere CSI driver depends.

IMO they contain vital cluster state, and not collecting them makes certain parts of CSI driver debugging way harder than they need to be.

Sanitize OWNERS/OWNER_ALIASES:

1) OWNERS must have:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of Storage team.

Some unit tests are flaky because they check that timestamps have changed.

When creation and the test happen in very quick succession, the timestamp might appear not to have changed.

https://redhat-internal.slack.com/archives/C014N2VLTQE/p1681827276489839

 

We can fix this by simulating host creation to have happened in the past
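The fix can be sketched as follows. This is an illustration in Python rather than the project's own test code, and the `Host` type, the `touch` function, and the one-hour offset are all hypothetical. The point is that a fixture created with a timestamp well in the past makes the "timestamp changed" assertion independent of clock resolution.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Host:
    name: str
    updated_at: datetime


def touch(host: Host, now: Optional[datetime] = None) -> None:
    """Stand-in for the code under test: it refreshes the host's timestamp."""
    host.updated_at = now or datetime.now(timezone.utc)


def make_host_in_past(name: str, age: timedelta = timedelta(hours=1)) -> Host:
    """Create a test host as if it had been registered `age` ago."""
    return Host(name=name, updated_at=datetime.now(timezone.utc) - age)


def test_touch_changes_timestamp() -> None:
    host = make_host_in_past("test-host")
    before = host.updated_at
    touch(host)
    # Holds even if creation and update run within the same clock tick.
    assert host.updated_at > before
```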

Description of problem:

Bump Kubernetes to 0.27.1 and bump dependencies

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

IPI install on Azure Stack failed when platform.azure.osDisk.diskType was set to StandardSSD_LRS in install-config.yaml.

When controlPlane.platform.azure.osDisk.diskType is set to StandardSSD_LRS, the Terraform log shows the following error, and some resources have already been created.

level=error msg=Error: expected storage_os_disk.0.managed_disk_type to be one of [Premium_LRS Standard_LRS], got StandardSSD_LRS
level=error
level=error msg=  with azurestack_virtual_machine.bootstrap,
level=error msg=  on main.tf line 107, in resource "azurestack_virtual_machine" "bootstrap":
level=error msg= 107: resource "azurestack_virtual_machine" "bootstrap" {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: expected storage_os_disk.0.managed_disk_type to be one of [Premium_LRS Standard_LRS], got StandardSSD_LRS
level=error
level=error msg=  with azurestack_virtual_machine.bootstrap,
level=error msg=  on main.tf line 107, in resource "azurestack_virtual_machine" "bootstrap":
level=error msg= 107: resource "azurestack_virtual_machine" "bootstrap" {
level=error
level=error

When compute.platform.azure.osDisk.diskType is set to StandardSSD_LRS, the compute machines fail to provision:

$ oc get machine -n openshift-machine-api
NAME                                     PHASE     TYPE              REGION   ZONE   AGE
jima414ash03-xkq5x-master-0              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-master-1              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-master-2              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-worker-mtcazs-89mgn   Failed                                      52m
jima414ash03-xkq5x-worker-mtcazs-jl5kk   Failed                                      52m
jima414ash03-xkq5x-worker-mtcazs-p5kvw   Failed                                      52m

$ oc describe machine jima414ash03-xkq5x-worker-mtcazs-jl5kk -n openshift-machine-api
...
  Error Message:           failed to reconcile machine "jima414ash03-xkq5x-worker-mtcazs-jl5kk": failed to create vm jima414ash03-xkq5x-worker-mtcazs-jl5kk: failure sending request for machine jima414ash03-xkq5x-worker-mtcazs-jl5kk: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="Storage account type 'StandardSSD_LRS' is supported by Microsoft.Compute API version 2018-04-01 and above" Target="osDisk.managedDisk.storageAccountType"
...

According to the Azure Stack documentation [1], the supported disk types on ASH are Premium SSD and Standard HDD. The installer should validate diskType on Azure Stack to avoid the errors above.

[1] https://learn.microsoft.com/en-us/azure-stack/user/azure-stack-managed-disk-considerations?view=azs-2206&tabs=az1%2Caz2#cheat-sheet-managed-disk-differences

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-05-16-085836

How reproducible:

Always

Steps to Reproduce:

1. Prepare install-config.yaml and set platform.azure.osDisk.diskType to StandardSSD_LRS
2. Install an IPI cluster on Azure Stack
3.

Actual results:

Installation failed

Expected results:

The installer validates diskType on Azure Stack and exits with an error message when the disk type is unsupported.
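
The requested validation could look roughly like the following. This is a minimal Python sketch rather than the installer's actual Go code: the allowed set is taken from the Terraform error above, the function name is hypothetical, and treating an empty value as "use the platform default" is an assumption.

```python
# Disk types that Azure Stack Hub accepted according to the Terraform error above.
SUPPORTED_ASH_DISK_TYPES = {"Premium_LRS", "Standard_LRS"}


def validate_ash_disk_type(disk_type: str) -> list[str]:
    """Return a list of validation errors (empty when the value is acceptable)."""
    # Assumption: an empty value means "use the platform default" and is not an error.
    if not disk_type or disk_type in SUPPORTED_ASH_DISK_TYPES:
        return []
    allowed = ", ".join(sorted(SUPPORTED_ASH_DISK_TYPES))
    return [
        f"osDisk.diskType: unsupported value {disk_type!r} on Azure Stack Hub; "
        f"supported values: {allowed}"
    ]


if __name__ == "__main__":
    # Fail early instead of letting Terraform or the machine API reject the disk type.
    for error in validate_ash_disk_type("StandardSSD_LRS"):
        print(error)
```

Running the check before any infrastructure is created would surface the problem at install-config validation time instead of during the bootstrap stage.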

Additional info:

 

Seen in build02, currently running 4.12.0-ec.3:

mcd_update_state{node="build0-gstfj-m-0.c.openshift-ci-build-farm.internal"}

returns:

Those are identical, except:

  • The first has config populated with rendered-... and has a non-zero value.
  • The second has config empty and has a zero value.

Looking at the backing code, my guess is that we're doing something like this:

  • Things are happy; export with a populated config.
  • Things get sad. Export with an empty config and a new error. But the happy time-series sticks around, and somehow has the value move to zero.
  • Things get happy again; and we return to setting a value for the happy time-series. But the sad time-series sticks around, and somehow has the value move to zero.

Or something like that.  I expect we want to drop the zero-valued time-series, but I'm not clear enough on how the MCO pushes values into the export set to have code suggestions.
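
The suspected behaviour, and the "drop the stale time-series" idea, can be modelled with a toy gauge. This is only a sketch in Python: the class, the method name, and the (node, config, err) label layout are hypothetical, and it says nothing about how the MCO actually pushes values into its export set.

```python
class StateGauge:
    """Toy stand-in for a labelled gauge; keys are (node, config, err) label tuples."""

    def __init__(self):
        self.series = {}

    def set_state(self, node, config, err, value):
        # Drop every older series for this node, so only the current label set
        # is exported and no zero-valued leftovers remain.
        for labels in [key for key in self.series if key[0] == node]:
            del self.series[labels]
        self.series[(node, config, err)] = value


if __name__ == "__main__":
    gauge = StateGauge()
    gauge.set_state("node-a", "rendered-1", "", 1.0)   # happy
    gauge.set_state("node-a", "", "some error", 2.0)   # sad
    gauge.set_state("node-a", "rendered-1", "", 3.0)   # happy again
    print(gauge.series)
```

With this approach a query for one node returns a single series at any time, whichever state the node is in.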

 

When my pipeline is displayed, it is not rendered correctly: segments overlap between parallel branches. However, if I edit the pipeline, it appears fine. I have attached screenshots showing the issue.

This is a regression from 4.11 where it rendered fine.

Description of problem:
When "Service Binding Operator" is successfully installed in the cluster for the first time, the page automatically redirects to the Operator installation page with the error message "A subscription for this Operator already exists in Namespace "XXX"".

Note: This issue only happens when the user installs "Service Binding Operator" for the first time. If the user uninstalls and then re-installs the operator, the issue does not occur.

Version-Release number of selected components (if applicable):
4.12.0-0.nightly-2022-08-12-053438

How reproducible:
Always

Steps to Reproduce:

  1. Log in to the OCP web console and go to the Operators -> OperatorHub page
  2. Install "Service Binding Operator", wait until the installation finishes, and check the page
  3.  

Actual results:
The page redirects to the Operator installation page with the error message "A subscription for this Operator already exists in Namespace "XXX"".
 
Expected results:
The page should stay on the install page, with the message "Installed operator- ready for use"

Additional info:

Please find the attached snap for more details 

Description of problem:

SCOS times out during provisioning of BM nodes

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

https://github.com/openshift/ironic-image/pull/377

Description of problem:

In Helm charts we define a values.schema.json file, a JSON schema for all the possible values the user can set in a chart. This schema needs to follow the JSON Schema standard. The standard includes $ref, a reference to either a local or a remote definition. A schema with remote references causes various problems in OCP: different OCP versions give different results, and even on the same OCP version the results differ depending on how tightly locked down the cluster networking is.

Prerequisites (if any, like setup, operators/versions):

Tried in Developer Sandbox, OpenShift Local, Baremetal Public Cluster in Operate First, OCP provisioned through clusterbot. It behaves differently in each instance. Individual cases are described below.

Steps to Reproduce

1. Go to the "Helm" tab in Developer Perspective
2. Click "Create" in top right and select "Repository"
3. Use the following ProjectHelmChartRepository resource and click "Create" (this repo contains a single chart, and this chart has a values.schema.json with the content linked below):

apiVersion: helm.openshift.io/v1beta1
kind: ProjectHelmChartRepository
metadata:
  name: reproducer
spec:
  connectionConfig:
    url: https://raw.githubusercontent.com/tumido/helm-backstage/reproducer

4. Go back to the "Helm" tab in Developer Perspective
5. Click "Create" in top right and select "Helm Release"
6. In filters section of the catalog in the "Chart repositories" select "Reproducer"
7. Click on the single tile available (Backstage)
8. Click "Install Helm Chart"
9. You are either greeted with various error screens or you see the "YAML view" tab (this tab selection is not the default and, I suppose, is remembered only for the user session)
10. Select "Form view"

Actual results:

Various error screens depending on OCP version and network restrictions. I've attached screen captures showing how it behaves in different settings.

Expected results:

Either render the form view (resolve remote references) or make it obvious that remote references are not supported. Optionally, fall back to the "YAML view", given that the user doesn't have the full schema available but the chart is still deployable.
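
To make "remote references are not supported" obvious, the schema could be scanned for remote $ref values before the form view is attempted. The following is a minimal Python sketch of such a scan, not the console's implementation; the function name and the example.com URL are hypothetical, and only http/https references are treated as remote.

```python
import json
from urllib.parse import urlparse


def find_remote_refs(schema, path="#"):
    """Yield (json_path, ref) for every $ref that points at a remote URL."""
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key == "$ref" and isinstance(value, str):
                # Local references such as "#/definitions/x" have no URL scheme.
                if urlparse(value).scheme in ("http", "https"):
                    yield path, value
            else:
                yield from find_remote_refs(value, f"{path}/{key}")
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            yield from find_remote_refs(item, f"{path}/{index}")


if __name__ == "__main__":
    # Hypothetical schema fragment; the URL is a placeholder.
    raw = """
    {
      "properties": {
        "image": {"$ref": "https://example.com/schemas/image.json"},
        "replicas": {"$ref": "#/definitions/replicas"}
      }
    }
    """
    for location, ref in find_remote_refs(json.loads(raw)):
        print(f"remote reference at {location}: {ref}")
```

If the scan returns anything, the UI could show a clear message and open the "YAML view" directly instead of failing while rendering the form.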

Reproducibility (Always/Intermittent/Only Once):

Depends on the environment
Always in OpenShift Local, Developer Sandbox, cluster bot clusters

Build Details:

Workaround:

1. Select any other chart to install, click "Install Helm Chart"
2. Change the view to "YAML view"
3. Go back to the Helm catalog without actually deploying anything
4. Select the faulty chart and click "Install Helm Chart"
5. Proceed with installation

Additional info:

The new test introduced by https://issues.redhat.com/browse/HOSTEDCP-960 fails for platforms other than AWS because some AWS-specific conditions, such as `ValidAWSIdentityProvider`, are always set regardless of the platform.

OCP Version at Install Time: 4.11-fc.3
RHCOS Version at Install Time: 411.86.202206172255-0
Platform: vSphere
Architecture: x86_64

I'm trying to verify that the IPI installer uses UEFI when creating VMs on VMware, following https://github.com/coreos/coreos-assembler/pull/2762 (merged Mar 19).

However, the 4.11.0-fc.3 installer taken from https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.11.0-fc.3/openshift-install-linux.tar.gz still seems to use BIOS.

Reproducing:

1. Run openshift-install against a VMware vSphere cluster.
2. Wait for an OpenShift VM (bootstrap, control, or worker node) to show up in vCenter.
3. Go to the VM's boot options - the firmware is set to BIOS instead of UEFI, which was supposed to be set by default.

Description of problem:

Bump Kubernetes to 0.27.1 and bump dependencies

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

An OCP cluster born on 4.1 fails to scale up nodes because of the older podman version 1.0.2 present in the 4.1 bootimage. This was observed while testing bug https://issues.redhat.com/browse/OCPBUGS-7559?focusedCommentId=21889975&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21889975

Journal log:
- Unit machine-config-daemon-update-rpmostree-via-container.service has finished starting up.
--
-- The start-up result is RESULT.
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: flag provided but not defined: -authfile
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: See 'podman run --help'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Main process exited, code=exited, status=125/n/a
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Failed with result 'exit-code'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Consumed 24ms CPU time

Version-Release number of selected component (if applicable):

OCP 4.12 and later

Steps to Reproduce:

1. Upgrade a 4.1-based cluster to 4.12 or a later version
2. Try to scale up a node
3. The node fails to join

 

Additional info:  https://issues.redhat.com/browse/OCPBUGS-7559?focusedCommentId=21890647&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21890647
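
One defensive pattern for this kind of failure is to probe the local binary before passing a flag it may not know. The Python sketch below only illustrates that idea and is not the fix that was shipped; the helper names are hypothetical, and the only call made is `podman run --help`.

```python
import subprocess


def help_mentions_flag(help_text: str, flag: str) -> bool:
    """Return True when `flag` appears as a whole token in the help output."""
    tokens = help_text.replace(",", " ").replace("=", " ").split()
    return flag in tokens


def podman_run_supports(flag: str, podman: str = "podman") -> bool:
    """Ask the local podman binary whether `podman run` documents `flag`."""
    # Raises FileNotFoundError when the binary is not installed.
    result = subprocess.run(
        [podman, "run", "--help"], capture_output=True, text=True, check=False
    )
    return help_mentions_flag(result.stdout + result.stderr, flag)


if __name__ == "__main__":
    sample_help = "      --authfile string   Path of the authentication file"
    print(help_mentions_flag(sample_help, "--authfile"))
```

A caller could then add `--authfile` only when `podman_run_supports("--authfile")` is true and fall back to another way of supplying credentials otherwise.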

This is the downstreaming issue for the upstream operator-registry changes. Upstream olm-docs repo will be downstreamed as part of later docs updates.
https://docs.google.com/document/d/139yXeOqAJbV1ndC7Q4NbaOtzbSdNpcuJan0iemORd3g/

-------------------------------------------

 

Veneer is viewed as a confusing and counter-intuitive term.  PM floated `catalog template` (`template` for short) as a replacement and it's resonated sufficiently with folks that we want to update references/commands to use the new term. 

 

A/C:

  • updates to all upstream docs (olm.operatorframework.io)
  • updates to hackmd references (hierarchy head at https://hackmd.io/O-DelGCnRbSmioFYnuBqkA)
  • updates to operator-registry commands (strongly prefer to also make changes to code paths, module names, etc. to make the change consistently)
  • updates to the generated demo for semver (or deletion.... really, the thing here is to be consistent)
  • Docs audit (collaboration with docs Michael Peter and Alex Dellapenta )
  • creation of a new downstreaming story to populate the changes to master, 4.12 so that early adopters aren't ambushed by what is merely a name change.

 

 

 

 

Description of problem:

When we delete any CR from the common OCP operator page, it would be good, from the user's perspective, to add an indication that the resource is being deleted, or at least to grey out the dot at the right corner.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1. Go to Operators -> Installed Operators -> click any installed operator -> click a CRD name in the header tab -> delete any CR from the list page using the kebab menu.
2. There is no indication of the deletion; the user can perform any action even after deletion is triggered.

Actual results:

No indication of deletion on the kebab menu

Expected results:

Grey out the dot and display a tooltip about the deletion.

Additional info:

https://github.com/openshift/console/pull/11860 does not fix this issue for the operator page.

Description of problem:

This ticket was created to track: https://issues.redhat.com/browse/CNV-31770

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-18464. The following is the description of the original issue:

Description of problem:

Hide the Builds NavItem if BuildConfig is not installed in the cluster

This is a clone of issue OCPBUGS-18641. The following is the description of the original issue:

Description of problem:

vSphere dual-stack install fails in bootstrap.
All nodes are tainted with node.cloudprovider.kubernetes.io/uninitialized

cloud-controller-manager can't find the nodes?

I0906 15:05:22.922183       1 search.go:49] WhichVCandDCByNodeID called but nodeID is empty
E0906 15:05:22.922187       1 nodemanager.go:197] shakeOutNodeIDLookup failed. Err=nodeID is empty

Version-Release number of selected component (if applicable):

4.14.0-0.ci.test-2023-09-06-141839-ci-ln-98f4iqb-latest

How reproducible:

Always

Steps to Reproduce:

1. Install vSphere IPI with OVN Dual-stack
platform:
  vsphere:
    apiVIPs:
      - 192.168.134.3
      - fd65:a1a8:60ad:271c::200
    ingressVIPs:
      - 192.168.134.4
      - fd65:a1a8:60ad:271c::201
networking:
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.0.0/16
  - cidr: fd65:a1a8:60ad:271c::/64
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: fd65:10:128::/56
    hostPrefix: 64
  serviceNetwork:
  - 172.30.0.0/16
  - fd65:172:16::/112

Actual results:

Install fails in bootstrap

Expected results:

Install succeeds

Additional info:

I0906 15:03:21.393629       1 search.go:69] WhichVCandDCByNodeID by UUID
I0906 15:03:21.393632       1 search.go:76] WhichVCandDCByNodeID nodeID: 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.406797       1 search.go:208] Found node 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.406816       1 search.go:210] Hostname: ci-ln-bllxr6t-c1627-5p7mq-master-2, UUID: 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.406830       1 nodemanager.go:159] Discovered VM using normal UUID format
I0906 15:03:21.416168       1 nodemanager.go:268] Adding Hostname: ci-ln-bllxr6t-c1627-5p7mq-master-2
I0906 15:03:21.416218       1 nodemanager.go:438] Adding Internal IP: 192.168.134.60
I0906 15:03:21.416229       1 nodemanager.go:443] Adding External IP: 192.168.134.60
I0906 15:03:21.416244       1 nodemanager.go:349] Found node 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.416266       1 nodemanager.go:351] Hostname: ci-ln-bllxr6t-c1627-5p7mq-master-2 UUID: 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.416278       1 instances.go:77] instances.NodeAddressesByProviderID() FOUND with 421b78c3-f8bb-970c-781b-76827306e89e
E0906 15:03:21.416326       1 node_controller.go:236] error syncing 'ci-ln-bllxr6t-c1627-5p7mq-master-2': failed to get node modifiers from cloud provider: provided node ip for node "ci-ln-bllxr6t-c1627-5p7mq-master-2" is not valid: failed to get node address from cloud provider that matches ip: fd65:a1a8:60ad:271c::70, requeuing
I0906 15:03:21.623573       1 instances.go:102] instances.InstanceID() CACHED with ci-ln-bllxr6t-c1627-5p7mq-master-1
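
The error in the log above amounts to an address-matching check that cannot succeed: the node asks for an IPv6 node IP, while the cloud provider only reported an IPv4 address for that VM. The Python sketch below reproduces that check with the values from the log; the function name is hypothetical and this is not the cloud provider's code.

```python
import ipaddress


def node_ip_matches(node_ip: str, provider_addresses: list[str]) -> bool:
    """Return True when node_ip equals one of the addresses the provider reported."""
    wanted = ipaddress.ip_address(node_ip)
    return any(ipaddress.ip_address(addr) == wanted for addr in provider_addresses)


if __name__ == "__main__":
    # Addresses the provider added for master-2, per the log.
    reported = ["192.168.134.60"]
    print(node_ip_matches("192.168.134.60", reported))           # IPv4 node IP
    print(node_ip_matches("fd65:a1a8:60ad:271c::70", reported))  # IPv6 node IP
```

For a dual-stack node the check can only pass if the provider also reports the node's IPv6 address.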

Description of problem:

Upgrading from 4.12 to 4.13 causes cpuset-configure.service to fail, because `mkdir` wasn't persistent for `/sys/fs/cgroup/cpuset/system` and `/sys/fs/cgroup/cpuset/machine.slice`.

Version-Release number of selected component (if applicable):

 

How reproducible:

Extremely (probably for every upgrade to the NTO)

Steps to Reproduce:

1. Upgrade from 4.12
2. Service will fail...

Actual results:

 

Expected results:

Service should start/finish correctly
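
One way a service can tolerate directories that do not persist is to create them idempotently on every start. The Python sketch below shows that pattern against a temporary directory; it is an illustration under that assumption and not the actual fix, and the function name is hypothetical.

```python
import os
import tempfile


def ensure_cpuset_dirs(root, names=("system", "machine.slice")):
    """Create the cpuset sub-directories under `root` if they are missing."""
    paths = []
    for name in names:
        path = os.path.join(root, name)
        # exist_ok=True makes the call safe to repeat on every service start.
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # A temporary directory stands in for /sys/fs/cgroup/cpuset in this demo.
    with tempfile.TemporaryDirectory() as root:
        print(ensure_cpuset_dirs(root))
        print(ensure_cpuset_dirs(root))  # second run is a no-op
```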

Additional info:

 

Description of problem:

The cluster network operator crashes in an IBM ROKS cluster with the following error:
2023-06-07T12:21:37.402285420-05:00 stderr F I0607 17:21:37.402108       1 log.go:198] Failed to render: failed to render multus admission controller manifests: failed to render file bindata/network/multus-admission-controller/admission-controller.yaml: failed to render manifest bindata/network/multus-admission-controller/admission-controller.yaml: template: bindata/network/multus-admission-controller/admission-controller.yaml:199:12: executing "bindata/network/multus-admission-controller/admission-controller.yaml" at <.HCPNodeSelector>: map has no entry for key "HCPNodeSelector"

Version-Release number of selected component (if applicable):

4.13.1

How reproducible:

Always

Steps to Reproduce:

1. Run a ROKS cluster with OCP 4.13.1
2.
3.

Actual results:

CNO crashes

Expected results:

CNO functions normally

Additional info:

ROKS worked ok with 4.13.0
This change was introduced in 4.13.1:
https://github.com/openshift/cluster-network-operator/pull/1802
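
The failure mode is a template that references a key the render data does not contain. The Python sketch below is only an analogy using `string.Template`, which also fails on a missing key; it is not the operator's Go template code, and the manifest fragment and the empty-selector default are hypothetical.

```python
from string import Template

# Hypothetical one-line stand-in for the manifest fragment that uses the key.
MANIFEST = Template("nodeSelector: $HCPNodeSelector\n")


def render(data, defaults=None):
    """Render the manifest, filling optional keys from `defaults` when absent."""
    merged = dict(defaults or {})
    merged.update(data)
    # substitute() raises KeyError when a referenced key is still missing.
    return MANIFEST.substitute(merged)


if __name__ == "__main__":
    try:
        render({})
    except KeyError as missing:
        print(f"render failed, missing key: {missing}")
    print(render({}, defaults={"HCPNodeSelector": "{}"}))
```

Either every caller sets the key, or the renderer supplies a default for environments that do not provide it.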

Description of problem:
Duplicate "using" in a log message.

log.Infof("For node %s selected peer address %s using using OVN annotations.", node.Name, addr)

Version-Release number of selected component (if applicable):

4.14

How reproducible:
always

Steps to Reproduce:

1. code review
2.
3.

Actual results:

log.Infof("For node %s selected peer address %s using using OVN annotations.", node.Name, addr)
 

Expected results:

log.Infof("For node %s selected peer address %s using OVN annotations.", node.Name, addr)
 

Additional info:

Description of problem:
After installing and uninstalling some Helm charts, the Helm details page could no longer be loaded. This issue also exists in older versions and is aligned with OCPBUGS-7517.

If the backend fails to load, the frontend never stops loading the Helm details page.

Version-Release number of selected component (if applicable):
Details page never stops loading

How reproducible:
Always with the Helm chart secret below.

Steps to Reproduce:
Unable to reproduce this manually again.

But you can apply the Secret at the end to any namespace.

You can create this in any namespace, but because it contains the namespace info "christoph", the Helm list page links to a non-existent URL. You can fix that manually or use the namespace "christoph".
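
To inspect what such a release Secret contains (for example the embedded namespace), its `release` field can be decoded. The Python sketch below assumes the usual Helm v3 storage layering: base64 applied by Kubernetes, base64 applied by Helm, then gzip-compressed JSON. That layering is an assumption stated here, not something this report spells out, and the function name is hypothetical.

```python
import base64
import gzip
import json


def decode_helm_release(secret_data_release: str) -> dict:
    """Decode the `data.release` field of a Helm v3 release Secret."""
    # Layer 1: Kubernetes base64-encodes every value under Secret.data.
    # b64decode ignores whitespace left over from folded YAML scalars.
    helm_payload = base64.b64decode(secret_data_release)
    # Layer 2: Helm base64-encodes its own payload before storing it.
    compressed = base64.b64decode(helm_payload)
    # Layer 3: the payload is gzip-compressed JSON.
    return json.loads(gzip.decompress(compressed))


if __name__ == "__main__":
    # Round-trip demo with a tiny hypothetical release document.
    release = {"name": "dotnet", "namespace": "christoph"}
    encoded = base64.b64encode(
        base64.b64encode(gzip.compress(json.dumps(release).encode()))
    ).decode()
    print(decode_helm_release(encoded)["namespace"])
```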

Actual results:

  1. Helm release detail page never finishes loading

Expected results:

  1. Helm release detail page should load fine

Additional info:

Secret to reproduce this issue:

kind: Secret
apiVersion: v1
metadata: 
  name: sh.helm.release.v1.dotnet.v1
  labels: 
    name: dotnet
    owner: helm
    status: deployed
    version: '1'
data: 
  release: >-
    SDRzSUFBQUFBQUFDLytTOWEzT2JUTkl3L0ZmMDZ2NzRPZ2tnS3h1NWFqOFlZaUVVaVVUSTRyVFoybUlHREVqRDRSRWdHZTJULy83VXpBQUNoR3pMY1pMcjNyMnFyb3JGWWVqcGMvZDB6L3k3SDFxQjA3L3AyMUVhT21uL3F1K0hEMUgvNXQvOUIzK2JwUCt5blJoRnVXUDNiL29jd3czZU1kdzc5dnFlRzl4Y2oyNVk3djNINFhBMFpKa2g5Lzh6N0EzRDlLLzZ5SHJOVzdhRG5KUThUMzRrY092SHFSK0YvWnUrRkNhcGhWQVBSa0dNSCtwZjlaUFVTck1FQTExKzU2b2ZScW1ETDMwUGpTamI5dDdMZC9jOUs0NTdmdElEbVk5c1AzVC92OTU5MU52NXpyNlhlZzY5MmtPUm0xejF0bGw0OHozOEhrYVFYT2dCK0lIaW8vZnUzVU9FVUxUSGQrVW9kWHFwWjZXOUhIL2lNL2w0NElScGIrOGoxTnM2Y2JSTmU5LzdkOXV0RkZpdTh5MUQ2SHUvWjRWMjczdS91c0piY1BQMTRlRjd2NWVGcVk5cXNQaEpOY24zdmE4aGRMcnZYZEhQKzNoQSttWGc5S3dzalFJcjlhR0ZVTjdiUmdnNWRpL0swdmY5SDFkOTZGbmJGTk0wY0ZMTHRsSUwvOTJtKzg3WkpoVGp6SHZtUFh0Q2g5dmV4RUZCajR6VlM2TUNManc1U29VSzVjaUhGbjRuNlYvMU4wNitqN1oyMHIvNVIzK0w1eHM0K0hMeDBYOWU5YTNZVjZzUDc3aitWZDhLd3lndEJyajVONFg5WDlrVzlXLzZYcHJHeWMySEQ2NmZlaGw0RDZQZ1F4UTdZZUw1RCtrN3owSEJPL0owOHFINForc2d4MHFjNUlNZDdVTVVXZmFIcldON1Z2cU9mdjhkbVdqWHRmZXBlK2ovK0hIVlJ4SGM5Ry9DREtHcmZ1b0VNYklJbC8yalFsOTE4WVA4OWY1dStUNTl4TGlrT080N2d5U1ZSQlJJd25CcGFvL0kwR1UwMjZERFVoc2ViSEdjQUlFZlBTeGlFd3pVWEJLR1h4VjE0Um82djVkRWRKREVLV3Rwanh0TEc0YlNrbCtCbk9jc1RSMUlFeVV5bDd4dmF5Z3hCVDRCbkgyWUNYeHVhOWNmQlRmZUdUbTlKb25UOVd5US9rMFNVV1p3ajZ3cHJsd3BVSGEyT0VTMk1Nd01qVVdTZjV0SkUzWWtDVXhxQnFNRWlLT0I0TVpmd1VCQitEdUd2bkFkYmNSQ243OHpkVDRCQTVTYTJwQ1JKbllNeEwwTEEzVVBCbE5HRXFaakdFNm5RQnVIcHNxelFOejdrampPVE9IV1gycXNaM0xxd3RZZWswVXdYbHZNS0RCOXliVzFJV05wZTljV1BWVE9sYzViM2dHZFQweGRRVE9mL3dZQ0diWG1ITU9jWHdPTzNRTlJaY3psdm9ReEp0OWY4Z05MZTB3a2NZb2tjY3phNGlnMWRDVTJ1SEVDSmhzWGtubXFHMGtjc2Jady9ZWFNTTTFNUW91Ly83My80NnFEdVAveUhCUTcyK1I5R3FNMmZSVmtCaWd6bDdlK0tZNFlFS2pNTEJoNlFGdjVrc0JpK3Y3TnlmbU5xWm1lclQweVRWNGd6N2t6WkhwZ29pS1lEME1nam54RDIyY2dHS2ZtYXNTWitqUzNORXdQZGlTRTZkOW1TeDZCWU9IT2RIWWt1SGhzeGpWRk5iQzBJWktFNlFZTWxNelVGeGtReDc2cFBSNGsvelo5MEprdmxxZ21ZRGs4V01Kb2JZbmozUDRjdWM0Z2NXY2JPVEwwS1ZQQ2dwL0YxeTF0dUFZVGRkT1lWeWdqSUtwcld4emw5OGZ4c3hwc3NlbmZaZ3ZPODJDNHlCWTZ2MWNETlljYzJnR2Y4TG9ISjdlWk5WQjlVNTltbU1Zd0g4WWdKL004V05vbysrcnpm
M1B5czJOOGtpWmpsdkpuRXg4WWJpdzdzeUJsalVETk1ieW1QczhzN2RNT2FPUE0wR3hrQ3F6djNCZnpSbE04Rnc5eXEyekZxYmtkb0xXNVBNSUlCandDb1J4Wm1zbk1BclNiRGFZc0NKVVlhS3VQa2xqSTBXMkJmMjI0SWZBOFFRL0lxWW1weVF3WVRPZUdOa1ZnTWkvNTR4eE9pSXhTZlBBeENPVEV4bnhRcHpIbWthWGt6cDdHYlF4Q21URzA0ZHJzbVBzOUdZTXYrTFNZQzRFcituS2V2MUdLOFhsZmZsK250S0RQamxrd1diaGZudEU3WDVhM21ScU1FMXRURCtWNGhkeGdXbjZrY0ZaeVVjajJrREUwV1BKb0liZUV2OC9JTFRGU01Bb2ZmUGQ5YmdXb1V6ZHJodmJJWWw0eFFqVUc0aUl6dGFGbkJJK0k2b1RZZ3lMU2F2eFo2S0hoRG9wcUJqa3ZOc01GNFRON2ZmZkY0bEJtZm83Y0JSbEwrUXk0WVdCcDhBdlFWTWJRRk04Vzd6NEsvcTFMYUZmUW8xUFdLaDF5VGVZckNYeFM4QTE1WHhKNFFxL3VkeDg5STFBVG1CUGUrQ1NKd3hnRUNnTGh3cFhwbkE1UVZOZGYzY2dsZW5EQ3MvYm42SXNrM0xyU1JObVI2d0w1eHRiU2hwdFNKby8wb3ZwNkZoVHZDa1B5SEpFQkF0dXRLNGtFL28vUzVLd05valJkTmVkWjBZWGFqOGpVNytwOFVPSGUxcFc5clM4eWdtL2gxbGZFMGRyaTFKemFtNVhmNUs4VGVQZTJMa2NyVGwzRFFHV09jUE9ONnpVODFHVHh3bkZiT2tvUytBTzI5d2EzS3VuSU9EcVB4NTVZK29MU1FMVGppaDdDcld2cjAvanN0ME0xdDRqOFZyRG1wbVpRdkhmd05nelVvSzJ2VCtjajcwQ29JR2VpM0ZtNlZNQ040V3BjUC9zTmd4dGx0cWhlMjNkS0RQMldiaWx3RFFkS2J1Z0tNZ2ViTmg3dUMvd1UvQ2p2YkgyNk5sV1pnY0dZTVRWN1dLTkxBSU5SZXZ5TllVeGpFQ3crU25kVXA2d0dTbTVxNDFRVngySEZtMEpUL2pyNGkvSW0rYWJxQVZYeHpMelFYUVg4VCtrUGkvbzg5L1praWd5TlhSa2F6R3hkUnFzQTI0RHhnZks4ZW9EaXVMQTVkZmhyOTg3cGExM2VHNXJjZ2tWTklMYzYwcXJHdDNEQWU1amZ6dEdyQzE1dy9qdUZyeFM1ZUx5elBCUmlQL0R4M3RUazNOUVhEYmpnUkUzQVdFYkdZSXJxZlA2c25IV095Wi93RnNXam10bnI2TXRSVHhwZGRFWWdOTm80STgvYkV6NlJCSThCTFBLQXRqLzNia3owbHNCbldQOFIzL2l6K3pSY3dxMDdXNWJ6dlBXVnU5SHFmcU91ZEZaZUxkVHBTbFg1aDlWNCttMjVVVCtyZ2xTREM4c0NoZUU4Zm1RRyszSzJ6aTlnTVBvLzJOK1FKbnNYNnVyT0ZsZE4vZHF0R3dyZHBCNHFPYTFkSytXTWpERlJkcG8yVG9IUUJjY1VRVzdFd2tCR01PKzBQeWU0c1NmVDJPUnNCTVBTdnQzaWJ3eWh1UG9vM2NrN0VKaXh5Y2lSb1ExRHMvVit0KzJuWmordzRoZFlmbE5VOTBBY0RXZkJlQS9GUnh3dE1OamFyeVpUYk9WelcwU1k4Z2dFWTU5RUR4anFZTHkzVkJQQlVJNEJkLzFSbWhpUFFsQnFueEppMW9PM2NXcnFpbWVLWThhNEp4eDVvV2VIclhSaDBRK2xsYWFTMS9sdXd6UGZ1d0I3YjFnYWhGdHFrWUtqRjRJOVppQ2lOWTZRQUhlZHdmcDhENUg3SUREMTd6RlFiRWpDaGthRm02dzV6aEJ6Mzk3VXA0eUZ1U0xrYyt4TncxQ0pUWDEremlONUFVWHRLdVBTSXFtaDgzRVZK
S3dqTWkyWWo3ajVJaTRkbUVZQUt3UXNzc1h4eHRBVmp6cEJ6em9yallDWk9IQUZtaHRDMGYxdWNuVDQyOHBpMGVYMGRCaGtOVE8wYVdKcTJMRldIN3VFRnd4VUJrNnc4MGRZMEpVMnB3WlE4amVsY3ZKQU1OelpJbVh6aXEzRTBoRWY3VTF0ZUxCRUZCQkhMUjh4TUVDaHlhbDVneTJFVzFmYjE0elhKR2txTEdGS0RMUzBqdjlXVjRERlBlbzBycU15U1ZBM1FQN3NObW85ZitzWFl2RlJDaTl5S3YzbXQvblJ5ZGhZVkxYSHpRcmpROERqeTN0VG0yTW5Kb1hpbzJlTHF3d09lR1Rrd3pYZ2NCQ0NNaHdRYUFmZUxvTVhxZTZFVEk3NDBkdktqbzc5c1ZDdWhkak1UNHh6cFpLd01xVXE2YWlVcTJCU0twMm4xTkNWdFhYWFVoTlA4K1dCSkNJR3lnNXVuZ2dZU2hVMFVSRFErUVE3YlNYUHQ0TWF5a09uTUx3MldQa3FISjBqZ3Y5RDlPVnA5WTB5U3lkQlYwV2pwbE53ZXIvaFBOYlUzSmRPQTZkZ1d1eWNKWVlSTVF2aTZJNWpnSFZQdmprRGYzekdRU0hPdEdkcHc3clJhemtJL01EVVdrNUFJYU5QbVkvQ29mdmExbGsxZXBERUhTenhPWmw2SUxCUk4vL3hPeGdxaDlNeDhQOU0wNUUrQnZBdFBVQytXWmVkQmY1KzZjalk0amczT1pWWmlhUGNGbG9PY1VVYmJFYVVuY0dOa3ZJOWJLNXNjYlFHM0w2VkZEaml2ZVg0VlNhcnk5bXB3WnFiZWhGNDZFM2ExUG1rd3prOE0zN0RERC9PdTRLaXpvQ3M0cmZFMGswRUF2VUFXVDRIM0JSMXdIenlUSU8zWCtiY1Z2QURFWEVtNXMyQmpNMjVieTVQK1h0K0w3MEc3NTRwWWg2UUQ5aTlNb0lPZmlGMlFJbmZhaTRkMyt4dzNPL3lyb0Q5YVgyalpyWi95cSttTnVSK0JsNzgvcGZsaWZ2MkdyNUJJRFJGWW9OUCthVzY5N093S3VGMEIxN01IODF2MmNFb3NUVVczV3NqRm9USzRhdjdOUDg4NldyVy9LUXVIVlRUcXg2YzhJbWx5WjR0b2gzdzJYMTluRk05aDIwZGdXOWg2RXAwR29CVitHNk9peHF1YjFZZjQySmVDODBkbUtpcHVXSjN0alprWU42bEoxOU90emJlMzRyZm5JbVNHNnVHYmJzMHdEN3lsdTR4TUJnMzdIVUhuTmZkaWJZbWY4Rm5mWWNMUWovL1ozbUsyTUxBMG16WjBHOVA3Y3VsOGNocitFaC9QVjBxbkw3UTUra081OGdLZHBKdUhTczRGNkwvcG5pb0hUOVMvMm5Wc1FoVUQvR2I0LzNWWXNyU1g1YklJdkZvYSt2OEFuQ1BzWEZNdUNhQWt6M3dPWEx0eVpSOVdWSmxHMldwYzBsQ0paenViRjFCYmMzY3hqZ01SaXlPc3A3RStKaU85UmZHTkFPcVNMcUVXVVl3Tk9NcW5mMEtXQ0gyaW8vTE14NE1iR1NQc1ZlK29PTk1sUDJaS0NYSFVrQ1d6ZlJwYU9yS2dpN1hYNzlBU3hSMEM1V2tTL3ZaNGhGM3Rxam1RRU5aa1VSNktwS3RqOG1ZK2pTMXRHR2hMWS9Xek5Kd1pDcXpNRkRIcG1nanRUSCtzT0xpRjM0bkJxR01qSUdhbXl0MVkzTHFxdkZkeE8rd04rRXNuMHBwbitJVFRPYVp4YW5EMnRMUjF0UXhUUHUwMHVaa3JKZU8wN0JvM0dVcDkrNXhUVkU5MkdLRnQ4K0xsVXc5a1lCNFQ3VUlndCtZdXN4VU9ObkkvSUlqbGkrZzFtejFxbms5Ly8yM243UEJqVDlUaTJzU1MxNWZYam01ZDkvTVpKSHZQczlQYStPM3pLT0IvL29TWE9QYlgzMyswK3k0OUVjMGVIY0VS
UFVybHR0WjhCcjRPNzNBMHVNNHMveWVPTnVkRDUxbnNyWDFaZk9xRkdQeEYwdWExN0oyOWdUdE81WU9qN2d1NTZDUzVZdHEyYmZLdHEyZmg2ZTdYS1RiL3RTek9SZG1zZWg3SFY2Y1hKVkQvZk9xdjdOUTVwQnFQRkpQUWNyeW9qQjFIdFBQL3JZc2ozTkNDeURIN3QrazI4ekJQM2ZsSGVMbkxZbWZkMis1SjdXSE53TlNiWWl2SmJFRjhZMnFxcTkvMWM4U1I2RjFmUEx4aVFjTEpjNlBxMzZVcFhGR1NoczNmbWozYjJpWjVmRmJWLzA0Uzd5bEE3ZE9Tc0g1Z1M4aFZMOTAxZDg2RHhVNE1ObzY3eWhJV3llSnNpM0VVNmZQSmFtMVRiUDQyelphT3pEdDMvU3RPTVlnYnYzdTZzU3l0TkRaT1NpS25lMkhoUFBmMVQ3alBQWi9YQlZWckhnU3RlckpiMXY4UXVwVHZGZklKUk8vNmdkUkZxYmZyTlRyMy9Scnl5TGxvdGNIUFBIYUFQMy8rWi9lY2NDZUcvVThaK3ZnYjlmSTVJUzc4VFlLcXArUDZkWVNvakMxL05EWlZpandRejg5dllyOG5STTZTZkp0R3dFSEE1ekNlQm5CalVOb0UwZmJ0RUFRcWFyRXZ4dFZsT1RPVmZIY0orWVRROEJQSXhpaC9rMy9YdmpXditxbjF0WjEwbS9WSTVneHQ0NWwrNDN2NHBIRTRxc0ZlcXFCandCc0hZTG5wSC9EZGxDWitMZ05yRk9XcmtOUWdwd2lXcVZxQ1JpM0Q1aDRUamtPUEwxa08wbnFoNFRBd20zSEs2MHYrbUhpd0d6cjNObXVjKzlzZytMVmJ4SHlZZDYvNlN1TzdXOHhKNUpLMjJPaGF2VWs5czV0MXlHVExwVHhmUjVqbEFzb1MxSm5LMkhVN2lLVUJjNGM4MVNGQkhvdHFZVEdSUkd3VUNtOFgzZk9kdXZiVG5XYnlQaFJ0QXRBc0xUM2lTbElLUWpRY3dJU01wU0xScjV5TURnUEFlM08vK3JmK3RaRVllRG5hRGZqNGdQZ3JsUEl5Wkdwc2Q0c0dPVm1QdHJBWUJ6WUFyT1g4MUgxWHJXWTAxeG45TEdwYUFUVzVVTE5PbktkL2VuaUVsSHJTK21qSkV4M1JoQWpZN0RvWElUQ2JvMHhtTVp3UXR4ZEFyZFNWUHpCbkdjc2NWVUdrS1F5VlpyWUhsYXB0dmpKTFlMVFhlbGFTUDcrTkZIKzNEeTZGc1JPRnQ3T3pHMmMrSENnNUtTcTJOKzdVakFrMWJyNmN2L2srMTF6TGlvSGQ2Wi8yWnhuUG45WFZzTmlmSUVDWjdDc2ppa1NLak5mNm9acHdpVG44R0hqb0w2YnZ2V0ZSMUpwaEorVFFwbUIyTlRuMHJreEM5NVJFT1RrM05KNWtod2k3eUxGTXduc1kwYWFvSjI5NUFjR3FZNVdkbVZGODR3clRlM1p1WnhjYnlUMTJuTXRFaUMvZ29kTDkwQ2FUQkVReHd3TzFUSDlHaFhhUDhtdng4cktDM2hXbVBxQUcySGV5RHEvLzV4c0Z0K2NjVW9NT1J6R3J1aWM3a2FmVndJdjBHcWVWL0NhUG8xL0g2K3B5eVdWdFltbEs1S3RTVVgxdlJ6YjRpaC9nci9Pd2s4cUFYOFgvQnM3dG1sbG90Lzlic2VpZk1WZkhWVk52dzN2dkdlTExwR0QyV1k0VmdWK1g4RWdtakhtSlRDUVhEaVo3cXhBWGRzQ3Y3SDBLWFh6dzAwbXVkMHdQcHpWdDFPeVNHcnFIcU9JS0w5a25sbytQZGlUYVF3QzZNK3diUWpWQkFoVCt5eGU2ZnM0OUYvREFPMXBHb2JJMitjU0JrbFVZaGpRaW45bnlROHNYWWtzN2Jyc3VKaFkrb0x5SWRYakxPUldycUhQcVh4TnBqc3dXTGhtTU1xYkhSeVg4MnBIaGVKVGVxYmdH
eEorRVIwQXVDbWwyU3YweDVONmNUTFByWEplb3BwUEIzUDNoY1VzZFJvMEZncWVwL2xGdHYvblpPSkNINkJyN3MrT2Y1N3VkZGhaeUtsVjUweWpDdlpsK0RyaENTTVk3WUNvZXNEL09Sd29vbHFtTXJIL0Z6K0JDOWZTNXk2V0gycFRHMVhBUXpEQXNqTkZsTTlVRDNLWVBsaXVwR2ZwKy9DMC8xYm90L3IzL2l6ZnFLZnpwMzdVc1NqbVVPaU02V2tsOWd2d3NZaWU0cmVMOVUrNW1IU1IzUWxHUHJVSnI3RTc4dDdVNU5nTUVPYXBnU1dxT2NYUjBjKzJPWlFBZ2ZmTkplMWFLUFVTNEliclV0ZmF3clY3cjQzd3V6RUl6QjBNV0pyaVhVY3VpYlVtODQremZMUUJuSHhvRmYydEFjZnNqRnFCMDR3V2Z3VmdNRTFuaDBVbSt5T3E5eWJ6c3NNSzI1NjA4UGZUNHdLY3h3QnQvMHQwSU8zKytMTzgzSkwvQ0F4Z0lkOUZ2RmwwU3hyQnlvVVQ5V0NKNnVZWk8xU0hGNEZRVFF2NzNpVUxpU1JNN3dBbmIwMjl1TCtjMm0ra2M1dmRMSy9Tc3p5bzQxa1NwcG10UFNZU1luNEs1eXVNUjVKU3BaMEExQmJ1WFZ1WGtTbndPeEE4RHVrQ01sMzgvWGJQdUZORzJSbGNpbUN4RUR6OUEzcWswZmtnL0ptNFhXM3dKdW5XZFdGQjQ1blB5MkF3eGZjejdMY0JqUlpEZlBYNXlKNG9lM2lJZGpOTzJSbURlV3VwVnQ2QjVhaFc0Q2NWaGJOWTV6QTdXYmptWmx4TnY0eEh0RkJYcitzTys2SHc4dzZ6Z1hyQWM1MXBSVUd5VHVCTVN6aGhQb3hza1UxZTRWOGpFQm9YK0k0OGtJSnhEb1B4OEdmeHJvUlRaR29FSDZSQURNY1BwdmE0ZVJuT1Q3dGFUWEcwaHZtSU1YUjVDL05SREdpOG41SWxreVhiS2tZWmJVek5qRUd3U3ZHM0xYMjZBd0dMUUxoSTdXQ2NXeHFKaTlPR3ZzOWZGVk1laXg0dmkxMk8rWXFmaTExRUdLaUk0SEhaS09KMHNTMEY0aUtUN3RnZERMQWRIUkpiVmkxYml4NWpUL2pEVi8vVHJxT0xsdHBJVHQ2QlFFWndvaFIvbTdHSjUwdkhxRHFOWjNxOUE0WnRGSTAvZ2RjTGMwRlZicWxiajd3MC91bjBQOHBsTFJ5ell6bFdOeVN2UlgyeVRXTTNBSEVVa1B5WExyWDdTVHB6VDQwZWsvd3BIVGpOOFhjd0R6LzkzRW0rS0FhaGdreE96VjhUNzkySGFvcGxqYzZMMzVta0dMaUVnOFM1NVZMZkszSVpkYjY0VFAvWGFQaG1lcWdocjIramo5YkUvOVI1cHZnN3NEU2JoUUVkWThheEhnaXdqOExXWmJPaGQyRCs2VFU1b3FMTVJsMlZPdVUzNVlDeFBOQ3hDTDlVNVQ0MDl6MllJb1B1WlBGV09EMlkrcFN6TktKWHNINGFnQUZwcEFsbmcrcmJPMm5BczBid0dFUE9JejU1dFNTdHo0OS9MMWtDTjN5Tm5oZEh1VDJaWDVTRE1mUnBidWdiL3lkMWVqVi9MSnVrTWVIdCtmWmxPT2JvemdqVVQ3bXI0ZlUxZHBPVVoveituQmIyejN3ZSsxVFlUOGpNblBlQXozOWJzTGRsU2Q0dmlkc3VXQWM0RTF0UGQ0QjdSSVoyL1F4OHorV2xGVlNXcjVra04yTlV1VXVibE1iSUVSaW9pVW5qN0FKUDZ1YWMzL2tpWGRWY3I2bzF2dndGY2pKRXBoWXU0SzVkS0k0MmREMFRIWTc0NEhjV0xUNS91N3dVS2F0NkxSKzhNTWR5b1R3YzdSYVJpZFg5ZUU1d1ltalg3ajBqTDBwOC9OcDRhWWpzaWIyRHBQd1Y3cWc4NHRpb0tHZlVGbWwxN1VVNWxsZXQy
WjJScGFxYzkvSjMzMEtXTDgvSkVocEN6dHZaWktjcFRMUGpIQzZCLzBVODNaRHhSbm5zeitQcmNubC96cjVPUWFEUWtzaHE3VVpCTUdCalVQaHRSU2YrTDhYVEM4dCsvM2ZnVDhTMkJtRVpkWTFBalF6ZGpNRkFvbXRoSXNvZ3A2NXRIZk1namlHSHlCaVFPUjZldHl1WDV2RHFNcHNpNTZLOC95L0o1ejJIeTVtcGIzQ3NubUI3UzlZaGliMlJMb0Q1UmJhM3NlWjZVdEo3U2E3ejdmSjFGL2d0TFhqSlRuZmVEZ2FJY1piOHVsVUMvZHZ3MlB6dVg1N1hQdjhoUEQxWGJ2L1RPdTUxdFFBdnQ2KzA4VjNON0FuMml3cWYrVTdtMitYcE5DWW1NWEtBNXd2YXJRYitaWGgrR1U1ZThOeTUzUDJUN3o5QjUrQXh0Z08xM21COFlZNzU2TWUrbU04NzlZS1ptNXBLOHBxU2VBTFRSVGxRR2hjM2Q3a3p1RkZLOHA1bGM2ZlA3a2xOQkluTlB6R3p0YkZyTmVnZVpseXpzWEttZWNqUUhobExLSEw0Wisrem5vSDkyL0dvM1ZnWm1kbzRzVVgzVmZtM2RtUDVYeUpQclkwM1ZxUFpIc3NMamp0LzZmcHRFNi9odkVXNzY5UVNWUTlNbEtpSUw5Wm43MnRqc2thdW42WGw1VGtSc2taeUdXMDhHRTQ5Wi9tV01rUWEvQytoeGRiV3BnZ0dRMFRqTWxUR2Z6dGJIQitzd1J6SHp5UnZNNk1icDZRdG5PN0szVU5ubXByWkFjb0JOeVI1OXBsdWVqQkF0SlpScTh2Z0svS2xnWnJaR3pNSEhQT1hXQXVqR3dpOExaNHg3NXNCQ3JHZlBkUDVuU25VMTJHa2MvZHgzSjhhK3UxT3FxM3ZtRXZXQStJK3RUaDFpT2tBSmlwK3g3UDA2V0dtb1d5bTNhWGxlRUFiNzJmYStOQ2tFWXRBYU1Zd0dHVUEwMVZnT1VPZnhpVCsxT2V2b045VHplb1hyWlU4VlNmOG5BNHJXK1dLdUVhOUpyRnVNRTRzUGNhK0I4bVhQTG5IV0k1cC9vaWV5MkhDeStiM016bUsxOVFkUDRlbk83S09TT0pwOVVEcUVxaFAxTEpydzJZdXRhZ3ZiZVVzNmpoR3BzREh3T2U5eG41endsdlZpOUdOSnFwTnNmNTR2ZGpqcnVSM1dtTlA0U3R3MllrN1cvejBWdldIcjhsei81WmFtQmZKVjdGekt3aVZ3MHR5MTAvQmM2NG01b21hQ3c1d2p5elFWQmtNU016d3gyMU9lL09UWDdHR1pJdWpuTlFDREtvTk4zYXZxRmNwY1hmNDQ3N1E1TGh4eUp2WFVneElxNnRuY3F2ZGNYT1IxL2cxSFJ2QS9YRWZzZ09tdCtjM3NrWUp4SkZuVHVZN3VuWXpJcHZVTmZ5UThGVTgyTFdwengrUjRZV21iR0RPVTNpV2pRM2xEclEraGRhaDRQblBhSzhHWjJrS3RwTWV6SGtQeDhSd3NDQTVDL3hNZlZORG1QK2MzNG45UDROVDkvWmt2ck81VVc1eGp6dER3N3pON3pCTnBBSHhNUmxUSzJMbUoveSswdzNuR2h2MXRQaC9XcDRhY1lZbUwxMHZlOXJIczBYUHN3WCtZSWtqRm9nTFVzOXE0amtHNDBRU3gyc1lqQTR3NS9lR1BpVXQ1SVkyM0pBOHVLZ1dyZlM4WkdxUHFTVFNFeWRncDA5d2daMmw5ZXpmN0VETllZQTJmNndQY21BaUdFNWpmSy93UmZLeVQ2SFk3aVdUN2xBS3hmSGFuc3lidGNIMHZ2dUYxS28yVDBHdzlMa0xSRFd3QmQ0SDRqaXo4azJCMC9aWTlIbk0wMlc5aVNtVWEvaTErcDV6Y24rNmlWaUQvOHI3K0YreUpjQlYvOEZIeldOd2xMdmJ6L083OTRGOTNPVkJ5bSt6KzQyNmt1NDhC
RVRHTFE3MCtMSllOdG1nV0hJdFdtaUtFb1JnNFpidG9vUkU0cDJyNWNPdmlxcllYNW9wS21hR1JWSDRGRXRpTXlTU3hGRW0zWkdVeUQxTmlWeC9FZno1V2hyenVhbFBFZFRWR0hJSWkrR1hSYUFtWUFDTDlvdVQreVhycDRhK29lSE1aRVBLZTRvMktOcjJ4STBQNXJMNFJzNlRBMitrc3RUM05qYkJvQ3JaejB5dEtLY1Q1ZHpVU09yWmE1ZmkwakNCdEpYdXlWamlPSlBHODN4WmF6ZVNSQkxDa3pETDFEMEdtMWxESXdmemhKWXVNNlFGYmF3ZXl0YUI4cEFmaWxONUJ6U1c0TnJRNTY1QjJOWkVFSWROcmZLbFNxMU9WQXgvV1hiOVVRaDQxeENuSHVUY0w0Q2IxNTR2UzV6NURTMU5sOUlhVEM3UU55a2RpNjFLdUdkTDl2Z3NLYVZSODI5TDVWNVJwNXFpVGg5VWRUb25CeFVWQnozTWRQVkE1OHVpYjB0RlhUSHE4bjR6bHBYbGJUclRpbEp2bjkwYnVuekE2dGo4ekd4V2QrUDdGV3QvVzIyYTN6TTExck8wL1doNnA4cUxGWnFUZWQxR1h6UnR4RXFpN0lHQ3hyYm94VEEvbHAwYjRjYUY0dmhRdE9yZ01RLzBlckdoZW1PamV6WjFsaXloNVV3dnJvbTNGTThpbFJHclBCaEt2RTBrY1pRWDlGK1RNTHFXcnNCcXBtd2xNcFk4ZHd6RFVXTGVSMThNa1hjZGJaeUMyNWp5Q3RrOWl2SlJkYmlGejUvQ2N4dTdobmo3aGFvWklrOURTWndPcFFudndZRk1RM3FCU2UyeVhFSlF0TVhxVVZWVStVSFpvTG1pM1dhQ0c2MmxmTzRXSmZybFp2MFVsMFVyQkFoVXJLSVlrSmNsTlNzOEQ5SnVVT09kMlBScFc1VE5qYkg1d004WHoxQityRnBoTUE1L253elVXdzkrVmdZTzFuKzRERlJ0UUNGKzN5djVZWFE2ZjhXbCsvMDlNUWJkQ1c3VVBPeEZkYWt1NVNOcVYxQUdCNG9IeEVkM0p2RFl0dERXT202YzBlWDJNcVdIK1lId0JVMmhUS3FTWnhJeWYzV3hMUEJEUTJNVG9XaTc3end3cHpwd3BObGUwbm1nVENsenVoeTFaWTdhcGdLR2ZTeVkydVBPenNsaFo1NDBVVWphbHl5bmlhcG5jSzVMWlhCVnRyd1FXVHFXTFZWMG9vZDlZZHVIaUM1SmJWMW1pTzNWaG1FcEU0WllPMHhHdzdxbk91UUl5ME5GbDBxdkRSUVBoZ29MeDN3T3VCZ1pBNFhTWURKRlpxRG1kVmNjRE95c2JadG5HTGZQSFQ2aTNicEFWdzgyTGIwai9FRGEyN1F4USthUFhaL2kwRHo3ZUVBWUN6bDFRM1ZXWjBrNjFrOWZIZ1NldXJWTC9wTjAxd3JmSm50WEVkWEEwTlhFRnZEOThjWVVFYm1IOWNxeUdlcTZEN2Z4Snl5VHN5V1Q0bmZ4djdYLzNRZmh0dnJkY2IvMVAvOUpDUGV1MFhBRk1YL3YzZGR1dDlHWUQzRVppOWJsd2k4NzNYYUQveVNOaVJ0YjhFNTVvRjdwcTR3YU9oTHJYWTFwN29XcEw3MWYxaTVVekR1RnhiZGd6cWFHTmlUN2RWb1RKUVhDeGpXQzhjSFVhQmxqSVFmVVJpNExlb2w4N1VBeG0rQ1g3QWRhTy9Ud09ad2FCRjcxZ0czNGc4Q3ZpQnhSSDdmMDgwcmJ0b09CR21DZmtoeG1TUHk0OUxTVmoyV2lOMWpTcXkzWEtzZDJLcTVzb3lxK3A4L1RxbFdsV05xcm50V2JyVmNsbm1lNjRwa0QrZUU4L3JIK1A4RjlGWjZRdmRweG1LWXRTaG9jRDlJcGRwYzBCQ3FQN1pMY3NxVVU4Mm9SM2pSYWU1b3A4b0pQdFFXUlpXT2k1SFloUXl0T1pVcmZp
ZnBkV04vS3lPajJOY28yRCtLYndFRGxMT3p2YzhRbnd2cVNxVUkybjUrYjJwaWoraFFkc08wdnhxQjdxMnEzWkI1bitLb1dLai91SEJ3TGlFT0VkVC9sMGVzMnZsZ1lJRElmaFVPTU5DWmJneFRiMEZEOWEycERncGNpUGdscjJ6UjhhaXp4YzRpeEpxcGZ5R051YW9UL1UxTlVPV3gvb0tqbXM4RTh0N0NmUUhhbVU5WmdNRVV6VGIybU1sT1hyZU1obDBXYmpDZE5aSThjNW5teTFISHRMb0tWQ3dzN0RISitkd3pqQ3h3ZTQ1K1RiWFVLZTFUUnA1am56dVpPbDV1a3lSN3JlN2QrQUJybHMycExFdmNvVTdUbG5zaHljNXl6dEh2QS9sMGROL2Z6Ykw2a09wVHdXV0duelM0ZGZCWS92QStEY1dad2JpYmRZWDlmTG16Nkp4ZFU2WWJLeG5meGJubFlhaU9XZnRJbSs2WHRPWCtZRk1IYitMZ2xDcDFENlFOUXZNQlF4V090S0EySjMwanRoVi95eFBGNVd1WjNHNW5MWVRqeitUVWM0SHRSWFBodFhtdjdrY3BlbERrQjRuTnlqM1VZcU15VHVaVjZHY3NqOWF1T0IxUXFoMW83MjhFZ0tibFRtaCszZGZsei9ObzRUQzg0MmhvNFVQMloxdHFlcGJaTndNbGNpcll6N0FOajNqRjFMNFlENVVCOUVoNzdNRzA5Yy9ZUVBNb0g1N1pub0g1RXJmUnJZK295aG5OVDRzT3N4VzU0cnVWWk1GL1g4Mnl1dlpSdVJYNXBVaDNCZFN4RmU3bmVLV01hSEZNZnQvTVNXYzhXNWVoWUR0MzhSckVWMEpBYzNOek9PNVU3TkUvbUo3Uzg2R0JBaW9kelM4ZkhIRkRIUVljTk82RFJHSzJTNitEMCtjclBwSHljMWVUaS9EWWx4emlMNUFVS1hkVE03UkZhN2V0MjlpSStvd2NYcW9WL2RySmxTRk1mdkFndkdKYWFUMFYweEczYXBsTXozbFRjd29mdTNPYmx1ZTA0c0gvY3RSVU1HRWpZcjJodDNYQWJuY2xoeEJpR3JuRWVoTE5MU0wrd05ZWHlFc3hITmQrUExlakpIZzluSmY1NHk2NmNPU3kxc0MwczVOeGFEclE0VDRqU3lGVlB0bmM0Y3hEdFoyWWtDWFlMdDdETmQ0MThHUFVKdXJkRnFIVGtMeXlick5OQUwvejFCbjdaWXd4azZ5UVhsMWErZUNPWHFwVHRRS2QrZG1oU1hxNHRoUnVpbXRrQk9aRFZBMkhzWFRMWHMwdTdBOFdEWElyampvemFUNWJzMWovVFdqNEhWeDR1L1VSNTFMSlp0dFhoejFWTEhKU0c5YVhYZXB3Z3U5S0VISVQ5MFNqdG5mREhXNTFMR3RYUEtpcnN0a3pqL3BlM294TmJTdTNuVWFjMXpwRXJXODg1WWx2cUphaHVDNDhIV3h4blRuQlh5ZDViTFkzVzh0amwxaFJsL0o3V1lUaFp0ajZaVDdQbHUvQkp5STdiWkhuM2VLaDdJOCtNeDFod2p5d2NLQWh1MElMeXpKdVZNWlF3SFdaYXEzMnZndWZUR2s1VUg0Z0kyeUNzTit2dHh0WGZNeDNQaFp3ZDFwbzNiWHMrNGZWYTY3a3hxWjRwNlpnVjhTK1N4bW01WHBDUk5ZSXhFODM4VWZPYXNIbEwyZmpKZHVyU2ZwenNsUDk4M21ESkwrbXp6V1hyMmpJcGxqMEdoaXJsbjdzeFdSdUFPYjhHSXFMbFVpVGZLNU40Qy9QVlBkYlRpT3BwNnVPUDE1amVNRC9uRDU3SVlYbWFRTzBrRCtYbzR4UXR1TVdhSS9FeFZyUVYrNWpuS0gxTWdXZGdNQTdNQUsyUHZoYmhXYmZVNDJkd1IwOW9LTnV3eFNkOXpVNGNxbmVQOXpOTnpZekJkQWduWGJqOGhiWXlVQmxhWW9IbFowVG5w
TTkzWlZ1ZEtiRGx0WllQMG8ySFpvdm1zTXZmSTNTZzVRTWtCMHZhU1Z1dG5UUW0xbVZFNlVCT0c2RTZmTUNYNTZ4VzEyY0MwYmtJQkhMdTZEeGpDSHNzdHg0Y3lJdzFtZTVzelk2TVorQitXY3NrT3VlL0hrOUdXZGI1cU5IeW5wdFZqS2x1eUx6R1UyU0tLSy95QVhlamZkRSs4RkVTcGp3UUgzZDJzUzJacGN0MDNjTGZ1eEk2dmlmNXo4eUxVNGVGUDVpRGdVbExLOFFVT2N1T2NzYlNYeW55TFJaZHh5dlhwekxwRHN2b2NHY0wvTm9TMWJWVnRiU1hzU1hLTU4xTURqR3paK0E2T1VHRXhtaUxzc3lvNUJOeWVvZkFlN2F1UkdBd2plM0p4bTJmNkdIVVdxaEtHL3VjdkRiSEtIS2FaVjRWei9zTnZ2SGNxUzBuZERuRytMVzJMdjd6WXNtMzIrc0NzbFZ1amVmTE5ucHlTTGpBZFBzdEo2MVZhb2NQMi9WTTZldVJVelB1VFczbGFvUHF0QTZ5cnFjdjNXeld1dmJsRi92NXY3RTlxc3UzYkoyRDJZSExqck0zZjcwZjhkbzR0SWtibUovRlJXRUg1SFAzVTBNanZQaHdyc1hwbEM5cDNOVDJvMDF0eUwyMS8vd0xXNGRPZUtScXg5RzY1MWJlK3pYeFlxUGxaZys0UlhOTHVqUDNxN2FiWEs2dmdhZUc5cGpNTkd3MzJHS05ndDBiR3NwaHpzYkFaejJDQ3p3ZXgzcFFZTDNtVm0zU0UxdmxkZmpsTHp3LzhxeXYycEY3cnBYeE44c3VxeHpSWSt5UXZDcktKUEhPWFJTNHVOZkcreWZ5Ymk4OFM3WFcva0g5d3puZ3F2VUpIRk9scEp4ZjZNdzNkN2NoWUppRVVXUDd1R1BzREhldmdyT3hsWWxjeXhXMjlHWnp5NU9PTFFhZXNFSHRzMWRQNCtlaVRIOVZuaE43ZUJPNWVMNUUvZ1JYMWIyek1LcC9ERFpMRzhiMlhUMnVsMC9zRDNsR2FKZDJ2elc4SkM1UEFEZmV3SHkwQjV4Q2MxWDY0dG44VE5lWnRLZDVwOWI1dDIrY1EzbGhlWGtKZTFrZVR1dHFWaVBPMUtjNTlsY0wvNzM2RGM4Y3hYTDBzZXZiRi9JZXBnV3QxVm00aG1ZeXBpNlYyMmNWVzVnWDk1Zlh6YlgzWWsyMGEwMjh0YjdaT2RiRGJmVDMvbzkvL3JqcTAvT3VHa2VUdGM5UStuT25qLzA0ZTdMWWY5NUJZdityVHdDNy9NQ3YxeC9VZGVIcFhQV3p0VTdPMHdxczBIL0FQMjc2Nzk2OSt4NytUMjlKampLNzZWSGUrTkI5Rk9QMzBJcDkxZGttZmhUZTlIYnM5eER6NzAxdlNaLzVIZ1pPYXRsV2F0MThEM3M5VEtGeVFQd2JXY0JCQ2JuVjYza09DdDRuM2dmb1dkdTAvbFN2WjhYeCswMEduRzNvcEU3eTNvOCt0RWZxZXNZUGs5UUs0YlBQQlZab3VZNzlEdVEzdlltRGd1TnpsZmppeDdaWm1QcjFyeWF4QXduc2FSNDdONzBLMGZoUzRpQUhwdEgyNXEwblFOaTlHUFZkZ1ZETVQvUUt2WC9Ud3p4ZFhTbVkvNlozTDN3ckx4NzVzWHo0T2FvZkpicUQ4RlljSngrTzFQOWNQZnNmelFDOW5oV0dVVW9Fc0p3RWtiRG1lK25XZDExbm05ejAvdSs3RXYvL0tQL285ZjU5L0xQWCs5NS8yRWJCOS81TjR5cStqakg3dlgvenZXVVp2dmV2Mms5aTFKQW5DN05FcGZ4N3YvN2NqNnZXVjMwSDJWaDVreGN4Wjc4dlNmK2UvSUxWUVkzL1lQNzVuc3l5UHVLUDhzOS8xdVNpMUl3M1BiWkRKZ0lyaGQ2c3pnQXZvL05MS1YzQ1gzNnV6b2Y0UDlUODlKUDg5M0xZWHM2SGwvRGlT
L214MTZ1UWovODdFcTAyelZKcjdCMVE1d0ZDMG5Lc2ttZHE5K3VLcHoxVVhRR1YvMVhmL25haWthb2h1elFUSVUzZEEyaDlzM0lHYms2R0l4OXF3OUkwNjYyWENndC9PcFNWZWplOUR5LzdRdjNNeTJxazR0RExtK2NWSy9FMnFZL1VvVm5KM1NiZGozcVd4emNGOHVwL2g2V2xibkl4alRTcXNFM1IwZEtNeGIzNkJHcDhUVTlxTFljaUZsejBDOGhkLzhnUzJkYW5KTC9Zank1SDJEb1A1ZmRMdjUwQWtHNnQxSEh6QmdpVVN3cFJKbjh2bTQvMWV0aEExQmoycWFtM0psOTgrSGlIQkNFM3ZRcjU1VjBuM0hVb2pPLzl6MS92NWJ2N2Z5M3ZiNVg3MWJkL2ZWTytUdStFKzZabElTYzg0NGV0T0taM0t2dFhlaTJGdjBUNFZ2Q3MwSFdlbHhLaW5oSXl3UTRwNmJDNlJ5bXA0ZWEvUTBwUUZHMnltRVlNeFdSUUJDMTAwOE1oeHZPNEprRk1CNWJwOVROWVZ2RE4veEovdjFROHJWaW5MWENsdjE1S2VNM25MbTFJV3FHakZzemQ5SEFzVi9pVFQ0V0RONzB5R3Z3ZTlxLzZPMG9vRW9qV3N4RFEyL3BKR3NWZS84Zi9Dd0FBLy8raHFZVU1wYWNBQUE9PQ==
type: helm.sh/release.v1

Decoded json:

{
  "name": "dotnet",
  "info": {
    "first_deployed": "2023-02-14T23:49:12.655951052+01:00",
    "last_deployed": "2023-02-14T23:49:12.655951052+01:00",
    "deleted": "",
    "description": "Install complete",
    "status": "deployed",
    "notes": "\nYour .NET app is building! To view the build logs, run:\n\noc logs bc/dotnet --follow\n\nNote that your Deployment will report \"ErrImagePull\" and \"ImagePullBackOff\" until the build is complete. Once the build is complete, your image will be automatically rolled out."
  },
  "chart": {
    "metadata": {
      "name": "dotnet",
      "version": "0.0.1",
      "description": "A Helm chart to build and deploy .NET applications",
      "keywords": [
        "runtimes",
        "dotnet"
      ],
      "apiVersion": "v2",
      "annotations": {
        "chart_url": "https://github.com/openshift-helm-charts/charts/releases/download/redhat-dotnet-0.0.1/redhat-dotnet-0.0.1.tgz"
      }
    },
    "lock": null,
    "templates": [
      /* removed */
    ],
    "values": {
      "build": {
        "contextDir": null,
        "enabled": true,
        "env": null,
        "imageStreamTag": {
          "name": "dotnet:3.1",
          "namespace": "openshift",
          "useReleaseNamespace": false
        },
        "output": {
          "kind": "ImageStreamTag",
          "pushSecret": null
        },
        "pullSecret": null,
        "ref": "dotnetcore-3.1",
        "resources": null,
        "startupProject": "app",
        "uri": "https://github.com/redhat-developer/s2i-dotnetcore-ex"
      },
      "deploy": {
        "applicationProperties": {
          "enabled": false,
          "mountPath": "/deployments/config/",
          "properties": "## Properties go here"
        },
        "env": null,
        "envFrom": null,
        "extraContainers": null,
        "initContainers": null,
        "livenessProbe": {
          "tcpSocket": {
            "port": "http"
          }
        },
        "ports": [
          {
            "name": "http",
            "port": 8080,
            "protocol": "TCP",
            "targetPort": 8080
          }
        ],
        "readinessProbe": {
          "httpGet": {
            "path": "/",
            "port": "http"
          }
        },
        "replicas": 1,
        "resources": null,
        "route": {
          "enabled": true,
          "targetPort": "http",
          "tls": {
            "caCertificate": null,
            "certificate": null,
            "destinationCACertificate": null,
            "enabled": true,
            "insecureEdgeTerminationPolicy": "Redirect",
            "key": null,
            "termination": "edge"
          }
        },
        "serviceType": "ClusterIP",
        "volumeMounts": null,
        "volumes": null
      },
      "global": {
        "nameOverride": null
      },
      "image": {
        "name": null,
        "tag": "latest"
      }
    },
    "schema": "removed",
    "files": [
      {
        "name": "README.md",
        "data": "removed"
      }
    ]
  },
  "config": {
    "build": {
      "enabled": true,
      "imageStreamTag": {
        "name": "dotnet:3.1",
        "namespace": "openshift",
        "useReleaseNamespace": false
      },
      "output": {
        "kind": "ImageStreamTag"
      },
      "ref": "dotnetcore-3.1",
      "startupProject": "app",
      "uri": "https://github.com/redhat-developer/s2i-dotnetcore-ex"
    },
    "deploy": {
      "applicationProperties": {
        "enabled": false,
        "mountPath": "/deployments/config/",
        "properties": "## Properties go here"
      },
      "livenessProbe": {
        "tcpSocket": {
          "port": "http"
        }
      },
      "ports": [
        {
          "name": "http",
          "port": 8080,
          "protocol": "TCP",
          "targetPort": 8080
        }
      ],
      "readinessProbe": {
        "httpGet": {
          "path": "/",
          "port": "http"
        }
      },
      "replicas": 1,
      "route": {
        "enabled": true,
        "targetPort": "http",
        "tls": {
          "enabled": true,
          "insecureEdgeTerminationPolicy": "Redirect",
          "termination": "edge"
        }
      },
      "serviceType": "ClusterIP"
    },
    "image": {
      "tag": "latest"
    }
  },
  "manifest": "---\n# Source: dotnet/templates/service.yaml\napiVersion: v1\nkind: Service\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\nspec:\n  type: ClusterIP\n  selector:\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n  ports:\n    - name: http\n      port: 8080\n      protocol: TCP\n      targetPort: 8080\n---\n# Source: dotnet/templates/deployment.yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\n  annotations:\n    image.openshift.io/triggers: |-\n      [\n        {\n          \"from\":{\n            \"kind\":\"ImageStreamTag\",\n            \"name\":\"dotnet:latest\"\n          },\n          \"fieldPath\":\"spec.template.spec.containers[0].image\"\n        }\n      ]\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app.kubernetes.io/name: dotnet\n      app.kubernetes.io/instance: dotnet\n  template:\n    metadata:\n      labels:\n        helm.sh/chart: dotnet\n        app.kubernetes.io/name: dotnet\n        app.kubernetes.io/instance: dotnet\n        app.kubernetes.io/managed-by: Helm\n        app.openshift.io/runtime: dotnet\n    spec:\n      containers:\n        - name: web\n          image: dotnet:latest\n          ports:\n            - name: http\n              containerPort: 8080\n              protocol: TCP\n          livenessProbe:\n            tcpSocket:\n              port: http\n          readinessProbe:\n            httpGet:\n              path: /\n              port: http\n          volumeMounts:\n      volumes:\n---\n# Source: dotnet/templates/buildconfig.yaml\napiVersion: build.openshift.io/v1\nkind: 
BuildConfig\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\nspec:\n  output:\n    to:\n      kind: ImageStreamTag\n      name: dotnet:latest\n  source:\n    type: Git\n    git:\n      uri: https://github.com/redhat-developer/s2i-dotnetcore-ex\n      ref: dotnetcore-3.1\n  strategy:\n    type: Source\n    sourceStrategy:\n      from:\n        kind: ImageStreamTag\n        name: dotnet:3.1\n        namespace: openshift\n      env:\n        - name: \"DOTNET_STARTUP_PROJECT\"\n          value: \"app\"\n  triggers:\n    - type: ConfigChange\n---\n# Source: dotnet/templates/imagestream.yaml\napiVersion: image.openshift.io/v1\nkind: ImageStream\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\nspec:\n  lookupPolicy:\n    local: true\n---\n# Source: dotnet/templates/route.yaml\napiVersion: route.openshift.io/v1\nkind: Route\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\nspec:\n  to:\n    kind: Service\n    name: dotnet\n  port:\n    targetPort: http\n  tls:\n    termination: edge\n    insecureEdgeTerminationPolicy: Redirect\n",
  "version": 1
}
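
The relationship between the encoded Secret value and the decoded JSON above can be sketched in a few lines. This is a minimal illustration, assuming the payload is gzip-compressed JSON wrapped in two base64 layers (one applied by Helm, one by the Kubernetes Secret); the helper names and the tiny sample release are hypothetical and not taken from the original report.

```python
import base64
import gzip
import json


def encode_release(release: dict) -> str:
    """Encode a release dict the way it is assumed to appear in the Secret's data field."""
    raw = json.dumps(release).encode("utf-8")
    inner = base64.b64encode(gzip.compress(raw))    # assumed Helm-side encoding
    return base64.b64encode(inner).decode("ascii")  # assumed Secret-side encoding


def decode_release(secret_value: str) -> dict:
    """Undo both base64 layers, decompress, and parse the JSON document."""
    inner = base64.b64decode(secret_value)
    compressed = base64.b64decode(inner)
    return json.loads(gzip.decompress(compressed))


# Hypothetical, heavily trimmed stand-in for the release shown above.
sample = {"name": "dotnet", "info": {"status": "deployed"}, "version": 1}
decoded = decode_release(encode_release(sample))
print(decoded["name"], decoded["info"]["status"])
```

Running `decode_release` on the real `release` value from the Secret should yield a document shaped like the decoded JSON above, provided the encoding assumption holds.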

Clone of OCPBUGS-7906, but for all CSI drivers and operators other than shared resource. All Pods / containers that are part of the OCP platform should run on dedicated "management" CPUs (if configured), i.e. they should have the annotation 'target.workload.openshift.io/management:{"effect": "PreferredDuringScheduling"}'.

Enhancement: https://github.com/openshift/enhancements/blob/master/enhancements/workload-partitioning/management-workload-partitioning.md

So far nobody has run our cloud CSI drivers with CPU pinning enabled, so this bug is low priority. I checked LSO; it already has correct CPU pinning in all Pods, e.g. here.
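
A check for the annotation described above could look roughly like the following. It is a sketch only: the function name and the sample metadata dictionaries are hypothetical, and only the annotation key and effect value come from the text.

```python
import json

MANAGEMENT_ANNOTATION = "target.workload.openshift.io/management"
EXPECTED_EFFECT = "PreferredDuringScheduling"


def has_management_annotation(pod_metadata: dict) -> bool:
    """Return True if the pod metadata carries the management annotation with the expected effect."""
    annotations = pod_metadata.get("annotations") or {}
    raw = annotations.get(MANAGEMENT_ANNOTATION)
    if raw is None:
        return False
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(value, dict) and value.get("effect") == EXPECTED_EFFECT


# Hypothetical pod metadata, for illustration only.
pinned = {"annotations": {MANAGEMENT_ANNOTATION: '{"effect": "PreferredDuringScheduling"}'}}
unpinned = {"annotations": {}}
print(has_management_annotation(pinned), has_management_annotation(unpinned))
```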

Description of problem:

The UI should add an alert for deprecating DeploymentConfig in 4.14

Version-Release number of selected component (if applicable):

pre-merge

How reproducible:

Always

Steps to Reproduce:

1. 
2.
3.

Actual results:

The alert is missing

Expected results:

The alert should exist

Additional info:

 

Description of problem:

This is to track the SDN specific issue in https://issues.redhat.com/browse/OCPBUGS-18389

4.14 nightly has a higher pod ready latency compared to 4.14 ec4 and 4.13.z in node-density (lite) test

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-11-201102

How reproducible:

Every time

Steps to Reproduce:

1. Install a SDN cluster and scale up to 24 worker nodes, install 3 infra nodes and move monitoring, ingress, registry components to infra nodes. 
2. Run node-density (lite) test with 245 pods per node
3. Compare the pod ready latency to 4.13.z, and 4.14 ec4 

Actual results:

4.14 nightly has a higher pod ready latency compared to 4.14 ec4 and 4.13.10

Expected results:

4.14 should have similar pod ready latency compared to previous release

Additional info:

 
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather
4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link
4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link

 

With the new multus image provided by Dan Williams in https://issues.redhat.com/browse/OCPBUGS-18389, the latency on a 24-node SDN cluster is similar to the latency without the fix.

% oc -n openshift-network-operator get deployment.apps/network-operator -o yaml | grep MULTUS_IMAGE -A 1
        - name: MULTUS_IMAGE
          value: quay.io/dcbw/multus-cni:informer 
 % oc get pod -n openshift-multus -o yaml | grep image: | grep multus
      image: quay.io/dcbw/multus-cni:informer
....
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer | 232389 | 314 | f2c290c1-73ea-4f10-a797-3ab9d45e94b3 | aws | amd64 | SDN | 24 | 245 | 61234 | 311776 | https://drive.google.com/file/d/1o7JXJAd_V3Fzw81pTaLXQn1ms44lX6v5/view?usp=drive_link
4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link
4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link

 

Zenghui Shi and Peng Liu requested modifying the multus-daemon-config ConfigMap by removing the readinessindicatorfile flag:

  1. scale down CNO deployment to 0
  2. edit configmap to remove 80-openshift-network.conf (sdn) or 10-ovn-kubernetes.conf (ovn-k)
  3. restart (delete) multus pod on each worker

Steps:

  1. oc scale --replicas=0 -n openshift-network-operator deployments network-operator
  2. oc edit cm multus-daemon-config -n openshift-multus, and remove the line "readinessindicatorfile": "/host/run/multus/cni/net.d/80-openshift-network.conf",
  3. oc get po -n openshift-multus | grep multus | egrep -v "multus-additional|multus-admission" | awk '{print $1}' | xargs oc delete po -n openshift-multus

Now the readinessindicatorfile flag is removed and all multus pods are restarted.

 

% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
0  
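
The edit made in step 2 amounts to dropping one key from the daemon configuration. The sketch below assumes the ConfigMap payload is a JSON object with `readinessindicatorfile` as a top-level key; the helper name and the sample document are hypothetical, and only the key name and file path are taken from the steps above.

```python
import json


def strip_readiness_indicator(daemon_config_json: str) -> str:
    """Return the config JSON with the readinessindicatorfile key removed, if present."""
    config = json.loads(daemon_config_json)
    config.pop("readinessindicatorfile", None)
    return json.dumps(config, indent=2)


# Hypothetical minimal config; a real multus-daemon-config has more fields.
sample = json.dumps({
    "logLevel": "verbose",
    "readinessindicatorfile": "/host/run/multus/cni/net.d/80-openshift-network.conf",
})
updated = strip_readiness_indicator(sample)
print(updated.count("readinessindicatorfile"))
```

The final count plays the same role as the `grep readinessindicatorfile -c` check above.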

Test Result: p99 is better than without the fix (removing readinessindicatorfile) but is still worse than ec4; avg is still bad.
 

OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer and remove readinessindicatorfile flag | 232389 | 316 | d7a754aa-4f52-49eb-80cf-907bee38a81b | aws | amd64 | SDN | 24 | 245 | 51775 | 105296 | https://drive.google.com/file/d/1h-3JeZXQRO-zsgWzen6aNDQfSDqoKAs2/view?usp=drive_link

Zenghui Shi and Peng Liu requested setting logLevel to debug in addition to removing the readinessindicatorfile flag.

Edit the cm to set "logLevel": "verbose" -> "debug" and restart all multus pods.

Now the logLevel is debug and all multus pods are restarted.

% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep logLevel
        "logLevel": "debug",
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
0 
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer and remove readinessindicatorfile flag and logLevel=debug | 232389 | 320 | 5d1d3e6a-bfa1-4a4b-bbfc-daedc5605f7d | aws | amd64 | SDN | 24 | 245 | 49586 | 105314 | https://drive.google.com/file/d/1p1PDbnqm0NlWND-komc9jbQ1PyQMeWcV/view?usp=drive_link
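
To put the measurements above side by side, the following sketch expresses each run as a multiple of the 4.14.0-ec.4 baseline. The latency figures are copied from the tables in this report; the `Run` class, the shortened labels, and the `slowdown` helper are illustrative names only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    label: str
    avg_ms: int
    p99_ms: int


BASELINE = Run("4.14.0-ec.4", 2186, 3256)
RUNS = [
    Run("nightly-2023-09-02-132842", 58725, 294279),
    Run("nightly-2023-09-11-201102 + informer image", 61234, 311776),
    Run("same + readinessindicatorfile removed", 51775, 105296),
    Run("same + logLevel=debug", 49586, 105314),
]


def slowdown(run: Run, baseline: Run) -> tuple:
    """Return (avg, p99) pod-ready latency as multiples of the baseline."""
    return run.avg_ms / baseline.avg_ms, run.p99_ms / baseline.p99_ms


for run in RUNS:
    avg_x, p99_x = slowdown(run, BASELINE)
    print(f"{run.label}: avg {avg_x:.1f}x, p99 {p99_x:.1f}x of {BASELINE.label}")
```

Expressed this way, removing readinessindicatorfile lowers the p99 multiple considerably while the average stays more than twenty times the baseline, which matches the test result noted above.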

 

Description of problem:

The bootstrapExternalStaticGateway IP is used as the DNS server for the bootstrap node

Version-Release number of selected component (if applicable):

4.11

How reproducible:

100%

Steps to Reproduce:

1. Deploy baremetal IPI using a static bootstrap IP.
2. It consumes bootstrapExternalStaticGateway as DNS for the bootstrap node.
3.

Actual results:

Sometimes bootstrapExternalStaticGateway cannot act as DNS

Expected results:

DNS resolution should work on bootstrap if it uses static IP

Additional info:

 

Description of problem: While running scale tests of OpenShift on OpenStack at scale, we're seeing it performing significantly worse than on AWS platform for the same number of nodes. More specifically, we're seeing high traffic to API server, and high load for the haproxy pod.

Version-Release number of selected component (if applicable):

All supported versions

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Slack thread at https://coreos.slack.com/archives/CBZHF4DHC/p1669910986729359 provides more info.

Description of the problem:

When starting an installation on 4.13 where the nodes have multiple disks, the installation might get stuck after reboot on "pending user action" with the following error:

Expected the host to boot from disk, but it booted the installation image - please reboot and fix the boot order to boot from disk QEMU_HARDDISK 05abcd32e95a61a3 (sda, /dev/disk/by-id/wwn-0x05abcd32e95a61a3). 

 

When running the live ISO with RHEL, /dev/sda might actually be vdb.
Since the boot order configuration is usually HD first, the machine usually tries vda before moving on to other (non-HD) boot options.
When installing on /dev/sda (vdb), the machine might not try to boot from the installation disk.

Solution suggestion:
A better way to find vda is by the hctl (0:0:0:0 should be vda).
Action item: in the case of libvirt (why not all platforms?), we should update the way we choose the default installation disk and choose the disk with hctl 0:0:0:0 (when it's available...)
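
The suggested selection rule can be sketched as follows. This is not the assisted-installer implementation: the disk dictionaries, their field names, and the fallback to the first listed disk are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def pick_install_disk(disks: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Prefer the disk at hctl 0:0:0:0; otherwise fall back to the first listed disk."""
    for disk in disks:
        if disk.get("hctl") == "0:0:0:0":
            return disk
    # Assumed fallback: keep whatever ordering the inventory already provides.
    return disks[0] if disks else None


# Hypothetical inventory in which sda is not the first-boot disk.
inventory = [
    {"name": "sda", "hctl": "0:0:0:1"},
    {"name": "sdb", "hctl": "0:0:0:0"},
]
print(pick_install_disk(inventory)["name"])
```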

 

How reproducible:

Create nodes with 2 disks and start installation.

 

Steps to reproduce:

1. Register new cluster

2. Add 6 nodes (3 master + 3 workers) with multiple disks each - might be even reproducible with only 3 masters

3. Start the installation

 

Note that it might take a few attempts to reproduce this issue

 

Actual results:

Pending for input

 

Expected results:

Installation success 

 

Slack thread https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1684317064257809

 

 

 

Description of problem:
If secure boot is currently disabled and the user attempts to enable it via ZTP, the install will not begin the first time ZTP is triggered.

When secure boot is enabled via ZTP, the boot options are configured before the virtual CD is attached, so the first boot goes into the existing HD with secure boot on. The install then gets stuck because boot from CD was never triggered.

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always

Steps to Reproduce:
1. Secure boot is currently disabled in bios
2. Attempt to deploy a cluster with secure boot enabled via ZTP
3.

Actual results:

  • spoke cluster got booted with secure boot option toggled, into existing HD
  • spoke cluster did not boot into virtual CD, thus install never started.
  • agentclusterinstall gets stuck here:
    State: insufficient
    State Info: Cluster is not ready for install

Expected results:

  • installation started and completed successfully

Additional info:

Secure boot config used in ZTP siteconfig:
http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/ff814164cdcd355ed980f1edf269dbc2afbe09aa/siteconfig/master-2.yaml#L40

Description of problem:

The option to Enable/Disable a console plugin on the Operator details page is no longer shown. This looks like a regression (the option is shown in 4.13).

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-04-19-125337

How reproducible:

Always

Steps to Reproduce:

1. Subscribe 'OpenShift Data Foundation' Operator from OperatorHub
2. on Operator installation page, we choose 'Disable' plugin
3. once operator is successfully installed, go to Installed Operators list page /k8s/all-namespaces/operators.coreos.com~v1alpha1~ClusterServiceVersion
4. console will show 'Plugin available' button for 'OpenShift Data Foundation' Operator, click on the button and hit 'View operator details', user will be taken to Operator details page

Actual results:

4. in OCP <= 4.13, we show a 'Console plugin' item where the user can Enable/Disable the console plugin that the operator has brought in

however this option is not shown in 4.14

Expected results:

4. Enable/Disable console plugin should be shown on Operator details page

Additional info:

screen recording https://drive.google.com/drive/folders/1fNlodAg6yUeUqf07BG9scvwHlzAwS-Ao?usp=share_link 

Description of problem:

Due to a CI configuration issue (lack of nmstatectl in the image), the current CI unit-test job skips silently those unit tests requiring nmstatectl.

Version-Release number of selected component (if applicable):


How reproducible:

hack/go-test.sh

Steps to Reproduce:

1.
2.
3.

Actual results:

Unit tests are failing

Expected results:

No failure

Additional info:


The following install-config fields are new in 4.13:

  • cpuPartitioning
  • platform.baremetal.loadBalancer
  • platform.vsphere.loadBalancer

These fields are ignored by the agent-based installation method. Until such time as they are implemented, we should print a warning if they are set to non-default values, as we do for other fields that are ignored.
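
The warning behaviour described here could be sketched like this. The field paths come from the list above; the lookup helper, the warning text, and the idea of passing in a map of default values are assumptions, and the real default values are deliberately not hard-coded.

```python
from typing import Any, Dict, List, Optional

IGNORED_FIELDS = (
    "cpuPartitioning",
    "platform.baremetal.loadBalancer",
    "platform.vsphere.loadBalancer",
)


def lookup(config: Dict[str, Any], dotted_path: str) -> Any:
    """Walk a nested dict along a dotted path; return None if any segment is missing."""
    node: Any = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def ignored_field_warnings(install_config: Dict[str, Any],
                           defaults: Optional[Dict[str, Any]] = None) -> List[str]:
    """Return one warning per ignored field that is set to a non-default value."""
    defaults = defaults or {}
    warnings = []
    for path in IGNORED_FIELDS:
        value = lookup(install_config, path)
        if value is None or value == defaults.get(path):
            continue
        warnings.append(f"{path} is set to {value!r} but is ignored by the agent-based installer")
    return warnings


# Hypothetical install-config fragment and value, for illustration only.
example = {"platform": {"baremetal": {"loadBalancer": {"type": "example-value"}}}}
for line in ignored_field_warnings(example):
    print(line)
```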

Description of problem:

After a component is ready, if we edit the component YAML from the console, it shows a stream of errors. The YAML does get updated, but the errors go away only after multiple reloads.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Deploy a pod/deployment
2. After they are seen ready, update the YAML from console
3. Error is seen

Actual results:

 

Expected results:

No error

Additional info:

 

Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1127

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/180

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Extracting the cli in darwin from a multi payload leads to "filtered all images from manifest list"

Version-Release number of selected component (if applicable):

Tested with oc4.11

How reproducible:

Always on Darwin machines

Steps to Reproduce:

1. oc adm release extract --command=oc quay.io/openshift-release-dev/ocp-release:4.11.4-multi -v5

Actual results:

I0909 18:37:28.591323   37669 config.go:127] looking for config.json at /Users/lwan/.docker/config.json
I0909 18:37:28.591601   37669 config.go:135] found valid config.json at /Users/lwan/.docker/config.json
Warning: the default reading order of registry auth file will be changed from "${HOME}/.docker/config.json" to podman registry config locations in the future version of oc. "${HOME}/.docker/config.json" is deprecated, but can still be used for storing credentials as a fallback. See https://github.com/containers/image/blob/main/docs/containers-auth.json.5.md for the order of podman registry config locations.
I0909 18:37:30.391895   37669 client_mirrored.go:174] Attempting to connect to quay.io/openshift-release-dev/ocp-release
I0909 18:37:30.696483   37669 client_mirrored.go:412] get manifest for sha256:53679d92dc0aea8ff6ea4b6f0351fa09ecc14ee9eda1b560deeb0923ca2290a1 served from registryclient.retryManifest{ManifestService:registryclient.manifestServiceVerifier{ManifestService:(*client.manifests)(0x14000a36330)}, repo:(*registryclient.retryRepository)(0x14000f46e80)}: <nil>
I0909 18:37:30.696738   37669 manifest.go:405] Skipping image sha256:fcf4d95df9a189527453d8961a22a3906514f5ecbb05afbcd0b2cdd212aab1a2 for manifestlist.PlatformSpec{Architecture:"amd64", OS:"linux", OSVersion:"", OSFeatures:[]string(nil), Variant:"", Features:[]string(nil)} from quay.io/openshift-release-dev/ocp-release:4.11.4-multi
I0909 18:37:30.696843   37669 manifest.go:405] Skipping image sha256:1992a4713410b7363ae18b0557a7587eb9e0d734c5f0f21fb1879196f40233a3 for manifestlist.PlatformSpec{Architecture:"ppc64le", OS:"linux", OSVersion:"", OSFeatures:[]string(nil), Variant:"", Features:[]string(nil)} from quay.io/openshift-release-dev/ocp-release:4.11.4-multi
I0909 18:37:30.696869   37669 manifest.go:405] Skipping image sha256:3698082cd66e90d2b79b62d659b4e7399bfe0b86c05840a4c31d3197cdac4bfa for manifestlist.PlatformSpec{Architecture:"s390x", OS:"linux", OSVersion:"", OSFeatures:[]string(nil), Variant:"", Features:[]string(nil)} from quay.io/openshift-release-dev/ocp-release:4.11.4-multi
I0909 18:37:30.697106   37669 manifest.go:405] Skipping image sha256:15fc18c81f053cad15786e7a52dc8bff29e647ea642b3e1fabf2621953f727eb for manifestlist.PlatformSpec{Architecture:"arm64", OS:"linux", OSVersion:"", OSFeatures:[]string(nil), Variant:"", Features:[]string(nil)} from quay.io/openshift-release-dev/ocp-release:4.11.4-multi
I0909 18:37:30.697570   37669 workqueue.go:143] about to send work queue error: unable to read image quay.io/openshift-release-dev/ocp-release:4.11.4-multi: filtered all images from manifest list
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.11.4-multi: filtered all images from manifest list

Expected results:

The darwin/$(uname -m) cli is extracted

Additional info:

Are we re-using some function from the `oc mirror` feature to select the manifest to use? It looks as if it is searching for a "darwin/$(uname -m)" manifest and filtering out all the available linux manifests.
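
The suspected behaviour can be reproduced with a small model of platform filtering. This is an illustration and not the `oc` code: the manifest dictionaries and both function names are hypothetical, and the fallback to linux manifests of the same architecture is only one possible remedy, assuming the darwin binary ships inside a linux image.

```python
from typing import Dict, List

Manifest = Dict[str, str]


def select_manifests(manifest_list: List[Manifest], os_name: str, arch: str) -> List[Manifest]:
    """Keep only manifests whose platform matches the requested OS and architecture exactly."""
    return [m for m in manifest_list if m["os"] == os_name and m["architecture"] == arch]


def select_for_extract(manifest_list: List[Manifest], host_os: str, host_arch: str) -> List[Manifest]:
    """Hypothetical remedy: fall back to linux manifests of the host architecture."""
    exact = select_manifests(manifest_list, host_os, host_arch)
    if exact:
        return exact
    return select_manifests(manifest_list, "linux", host_arch)


# The four platforms that the log above reports as skipped.
PAYLOAD = [{"os": "linux", "architecture": a} for a in ("amd64", "ppc64le", "s390x", "arm64")]

print(select_manifests(PAYLOAD, "darwin", "arm64"))    # strict filter on the host platform
print(select_for_extract(PAYLOAD, "darwin", "arm64"))  # with the assumed fallback
```

A strict match on the host platform leaves nothing to read, which corresponds to the "filtered all images from manifest list" error.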

This is a clone of issue OCPBUGS-19037. The following is the description of the original issue:

The agent-interactive-console service is required by both sshd and systemd-logind, so if it exits with an error code there is no way to connect or log in to the box to debug.

Platform:

IPI on Baremetal

What happened?

In cases where no hostname is provided, hosts are automatically assigned the name "localhost" or "localhost.localdomain".

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
localhost.localdomain Ready master 31m v1.22.1+6859754
master-0-1 Ready master 39m v1.22.1+6859754
master-0-2 Ready master 39m v1.22.1+6859754
worker-0-0 Ready worker 12m v1.22.1+6859754
worker-0-1 Ready worker 12m v1.22.1+6859754

What did you expect to happen?

Having all hosts come up as localhost is the worst possible user experience, because they'll fail to form a cluster but you won't know why.

However, we know the BMH name in the image-customization-controller, so it would be possible to configure the ignition to set a default hostname if we don't have one from DHCP/DNS.

If not, we should at least fail the installation with an error message specific to this situation.

----------
30/01/22 - adding how to reproduce
----------

How to Reproduce:

1) Prepare an installation with a day-1 static IP.

Add to install-config under one of the nodes:
networkConfig:
  routes:
    config:
    - destination: 0.0.0.0/0
      next-hop-address: 192.168.123.1
      next-hop-interface: enp0s4
  dns-resolver:
    config:
      server:
      - 192.168.123.1
  interfaces:
  - name: enp0s4
    type: ethernet
    state: up
    ipv4:
      address:
      - ip: 192.168.123.110
        prefix-length: 24
      enabled: true

2) Ensure a DNS PTR record for the address IS NOT configured.

3) Create manifests and cluster from install-config.yaml

The installation should either:
1) fail as early as possible and provide feedback that no hostname was provided, or
2) derive the hostname from the BMH or the ignition files
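
The two acceptable outcomes can be sketched as a single decision function. This is a hedged illustration rather than the image-customization-controller logic: the function name, the placeholder set, and the error text are hypothetical.

```python
from typing import Optional

PLACEHOLDER_HOSTNAMES = {"", "localhost", "localhost.localdomain"}


def effective_hostname(detected: Optional[str], bmh_name: Optional[str]) -> str:
    """Use the detected hostname unless it is a placeholder; then fall back to the BMH name or fail."""
    if (detected or "").strip().lower() not in PLACEHOLDER_HOSTNAMES:
        return detected.strip()
    if bmh_name:
        return bmh_name
    raise ValueError("no hostname from DHCP/DNS and no BareMetalHost name to fall back on")


print(effective_hostname("localhost.localdomain", "master-0-0"))
print(effective_hostname("worker-0-1", "ignored-bmh-name"))
```

Raising an error in the last branch corresponds to failing the installation early with a message specific to the missing hostname.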

Please review the following PR: https://github.com/openshift/images/pull/131

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/images/pull/132

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Nodes are taking more than 5m0s to stage OSUpdate

https://sippy.dptools.openshift.org/sippy-ng/tests/4.13/analysis?test=%5Bbz-Machine%20Config%20Operator%5D%20Nodes%20should%20reach%20OSUpdateStaged%20in%20a%20timely%20fashion 

Test started failing back on 2/16/2023. First occurrence of the failure https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade/1626326464246845440 

Most recent occurrences across multiple platforms https://search.ci.openshift.org/?search=Nodes+should+reach+OSUpdateStaged+in+a+timely+fashion&maxAge=48h&context=1&type=junit&name=4.13&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

6 nodes took over 5m0s to stage OSUpdate:
node/ip-10-0-216-81.ec2.internal OSUpdateStarted at 2023-02-16T22:24:56Z, did not make it to OSUpdateStaged
node/ip-10-0-174-123.ec2.internal OSUpdateStarted at 2023-02-16T22:13:07Z, did not make it to OSUpdateStaged
node/ip-10-0-144-29.ec2.internal OSUpdateStarted at 2023-02-16T22:12:50Z, did not make it to OSUpdateStaged
node/ip-10-0-179-251.ec2.internal OSUpdateStarted at 2023-02-16T22:15:48Z, did not make it to OSUpdateStaged
node/ip-10-0-180-197.ec2.internal OSUpdateStarted at 2023-02-16T22:19:07Z, did not make it to OSUpdateStaged
node/ip-10-0-213-155.ec2.internal OSUpdateStarted at 2023-02-16T22:19:21Z, did not make it to OSUpdateStaged}
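
The check behind this failure can be sketched as follows. This is a minimal illustration, not the actual test code: the function names and the shape of the input are assumptions, and only the 5m0s limit and the timestamp format come from the report.

```python
from datetime import datetime, timedelta

STAGING_LIMIT = timedelta(minutes=5)  # the 5m0s threshold from the test message


def parse_ts(value):
    """Parse a UTC timestamp such as 2023-02-16T22:24:56Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def slow_nodes(events, limit=STAGING_LIMIT):
    """Return names of nodes that staged late or never staged.

    `events` (hypothetical shape) maps a node name to a (started, staged)
    pair of timestamp strings; `staged` is None when the node never
    reached OSUpdateStaged.
    """
    late = []
    for node, (started, staged) in sorted(events.items()):
        if staged is None or parse_ts(staged) - parse_ts(started) > limit:
            late.append(node)
    return late


# Hypothetical sample: the first entry mirrors a node from the report,
# the second is an invented node that staged within the limit.
sample = {
    "ip-10-0-216-81.ec2.internal": ("2023-02-16T22:24:56Z", None),
    "example-fast-node": ("2023-02-16T22:13:07Z", "2023-02-16T22:15:00Z"),
}
print(slow_nodes(sample))
```

A node that never reports OSUpdateStaged is treated the same as one that staged late, which matches how the failure message lists nodes that "did not make it to OSUpdateStaged".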

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/112

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-13253.

This is a clone of issue OCPBUGS-18103. The following is the description of the original issue:

Description:

Now that the large number of e2e test case failures in CI jobs is resolved, recent jobs show an "Undiagnosed panic detected in pod" issue.

JobLink

Error:

{ pods/openshift-image-registry_cluster-image-registry-operator-7f7bd7c9b4-k8fmh_cluster-image-registry-operator_previous.log.gz:E0825 02:44:06.686400 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) pods/openshift-image-registry_cluster-image-registry-operator-7f7bd7c9b4-k8fmh_cluster-image-registry-operator_previous.log.gz:E0825 02:44:06.686630 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)}

Some observations:
1) While starting, ImageConfigController failed to watch *v1.Route because the server could not find the requested resource.

2) This eventually led to a sync problem: "E0825 01:26:52.428694       1 clusteroperator.go:104] unable to sync ClusterOperatorStatusController: config.imageregistry.operator.openshift.io "cluster" not found, requeuing"

3) Then, while creating the deployment resource for "cluster-image-registry-operator", the operator panicked: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

Description of the problem:

When installing a cluster with multiple networks, the machine network cannot be changed from the UI: the selection does not switch to the new machine network, although the chosen network is shown during installation.

 

From the customer's view:

The customer chooses a machine network. It is in the list but is never shown as selected, yet it appears when installing.

How reproducible:

Always

Steps to reproduce:

Install a cluster with multiple networks

Try to change machine network -> does not work

Actual results:

 

Expected results:

Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/187

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

On the Add Storage page, if the user chooses to use an existing PVC but leaves the PVC name empty, fills in the other fields, and clicks "Save", no warning is shown for the PVC name field. Loading dots are shown under the "Save" button.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-12-124310

How reproducible:

Always

Steps to Reproduce:

1.Create a deployment.
2.Click "Add Storage" item in action list of the deployment
3.Choose "Use existing claim", but leave it empty.
4.Set mount dir and click "Save".

Actual results:

4. There is no warning about the empty PVC name.

Expected results:

4. A message should be shown for the field: "Please fill out this field"

Additional info:

 

Description of the problem:

When creating/updating an InfraEnv, the size of compressed ignition should be validated.
That is, the service should generate the entire ignition for each request, compress it (as done in ignition Archive), and ensure its size does not exceed 256 KiB.

Notes:

  • The validation added by MGMT-13008 is performed directly on the `IgnitionConfigOverride` property. The validation is therefore inaccurate, because it should be performed on the entire generated ignition config.
  • See full discussion here.
  • Related issue: MGMT-13643
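
A minimal sketch of the proposed validation, assuming plain gzip over the JSON-serialized config as a stand-in for the real ignition archive format (which is not reproduced here). The function names are hypothetical; only the 256 KiB limit comes from this report.

```python
import gzip
import json

MAX_ARCHIVE_BYTES = 256 * 1024  # 256 KiB limit stated in the report


def compressed_size(ignition_config):
    """Serialize the full config to JSON and return its gzip-compressed size.

    Assumption: gzip over JSON approximates the archive the service builds.
    """
    raw = json.dumps(ignition_config, separators=(",", ":")).encode("utf-8")
    return len(gzip.compress(raw))


def validate_ignition_size(ignition_config, limit=MAX_ARCHIVE_BYTES):
    """Raise ValueError when the compressed config exceeds the limit."""
    size = compressed_size(ignition_config)
    if size > limit:
        raise ValueError(
            f"generated ignition archive is too large: {size} bytes > {limit} bytes"
        )
    return size
```

The point of the sketch is that the check runs on the entire generated config after compression, not on the override property alone, so a request fails at registration time instead of at ISO download.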

How reproducible:

100%

Steps to reproduce:

1. Register an InfraEnv that would result in an ignition archive larger than 256 KiB.
E.g. invoke 'POST /v2/infra-envs' with large values in the body (infra-env-create-params)

Actual results:

The register request succeeds, but downloading the ISO fails.

Expected results:

The request should fail with an error message explaining the generated ignition archive is too large.

Description of problem:

oc-mirror fails to complete a heads-only mirror, complaining about devworkspace-operator.

Version-Release number of selected component (if applicable):

# oc-mirror version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.12.0-202302280915.p0.g3d51740.assembly.stream-3d51740", GitCommit:"3d517407dcbc46ededd7323c7e8f6d6a45efc649", GitTreeState:"clean", BuildDate:"2023-03-01T00:20:53Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

Attempt a headsonly mirroring for registry.redhat.io/redhat/redhat-operator-index:v4.10

Steps to Reproduce:

1. Imageset currently:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  registry:
    imageURL: myregistry.mydomain:5000/redhat-operators
    skipTLS: false
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
2. $ oc mirror --config=./imageset-config.yml docker://otherregistry.mydomain:5000/redhat-operators

Checking push permissions for otherregistry.mydomain:5000
Found: oc-mirror-workspace/src/publish
Found: oc-mirror-workspace/src/v2
Found: oc-mirror-workspace/src/charts
Found: oc-mirror-workspace/src/release-signatures
WARN[0026] DEPRECATION NOTICE:
Sqlite-based catalogs and their related subcommands are deprecated. Support for
them will be removed in a future release. Please migrate your catalog workflows
to the new file-based catalog format. 

The rendered catalog is invalid.

Run "oc-mirror list operators --catalog CATALOG-NAME --package PACKAGE-NAME" for more information.  

error: error generating diff: channel fast: head "devworkspace-operator.v0.19.1-0.1679521112.p" not reachable from bundle "devworkspace-operator.v0.19.1"  

Actual results:

error: error generating diff: channel fast: head "devworkspace-operator.v0.19.1-0.1679521112.p" not reachable from bundle "devworkspace-operator.v0.19.1"

Expected results:

For the catalog to be mirrored.

Description of problem:

When deploying with external platform, the reported state of the machine config pool is degraded, and we can observe a drift in the configuration:

$ diff /etc/mcs-machine-config-content.json ~/rendered-master-1b6aab788192600896f36c5388d48374
<                         "contents": "[Unit]\nDescription=Kubernetes Kubelet\nWants=rpc-statd.service network-online.target\nRequires=crio.service kubelet-auto-node-size.service\nAfter=network-online.target crio.service kubelet-auto-node-size.service\nAfter=ostree-finalize-staged.service\n\n[Service]\nType=notify\nExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests\nExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state\nExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state\nEnvironmentFile=/etc/os-release\nEnvironmentFile=-/etc/kubernetes/kubelet-workaround\nEnvironmentFile=-/etc/kubernetes/kubelet-env\nEnvironmentFile=/etc/node-sizing.env\n\nExecStart=/usr/local/bin/kubenswrapper \\\n    /usr/bin/kubelet \\\n      --config=/etc/kubernetes/kubelet.conf \\\n      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \\\n      --kubeconfig=/var/lib/kubelet/kubeconfig \\\n      --container-runtime-endpoint=/var/run/crio/crio.sock \\\n      --runtime-cgroups=/system.slice/crio.service \\\n      --node-labels=node-role.kubernetes.io/control-plane,node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \\\n      --node-ip=${KUBELET_NODE_IP} \\\n      --minimum-container-ttl-duration=6m0s \\\n      --cloud-provider=external \\\n      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \\\n       \\\n      --hostname-override=${KUBELET_NODE_NAME} \\\n      --provider-id=${KUBELET_PROVIDERID} \\\n      --register-with-taints=node-role.kubernetes.io/master=:NoSchedule \\\n      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bde9fb486f1e8369b465a8c0aff7152c2a1f5a326385ee492140592b506638d6 \\\n      --system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY},ephemeral-storage=${SYSTEM_RESERVED_ES} \\\n      --v=${KUBELET_LOG_LEVEL}\n\nRestart=always\nRestartSec=10\n\n[Install]\nWantedBy=multi-user.target\n",
---
>                         "contents": "[Unit]\nDescription=Kubernetes Kubelet\nWants=rpc-statd.service network-online.target\nRequires=crio.service kubelet-auto-node-size.service\nAfter=network-online.target crio.service kubelet-auto-node-size.service\nAfter=ostree-finalize-staged.service\n\n[Service]\nType=notify\nExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests\nExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state\nExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state\nEnvironmentFile=/etc/os-release\nEnvironmentFile=-/etc/kubernetes/kubelet-workaround\nEnvironmentFile=-/etc/kubernetes/kubelet-env\nEnvironmentFile=/etc/node-sizing.env\n\nExecStart=/usr/local/bin/kubenswrapper \\\n    /usr/bin/kubelet \\\n      --config=/etc/kubernetes/kubelet.conf \\\n      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \\\n      --kubeconfig=/var/lib/kubelet/kubeconfig \\\n      --container-runtime-endpoint=/var/run/crio/crio.sock \\\n      --runtime-cgroups=/system.slice/crio.service \\\n      --node-labels=node-role.kubernetes.io/control-plane,node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \\\n      --node-ip=${KUBELET_NODE_IP} \\\n      --minimum-container-ttl-duration=6m0s \\\n      --cloud-provider= \\\n      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \\\n       \\\n      --hostname-override=${KUBELET_NODE_NAME} \\\n      --provider-id=${KUBELET_PROVIDERID} \\\n      --register-with-taints=node-role.kubernetes.io/master=:NoSchedule \\\n      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bde9fb486f1e8369b465a8c0aff7152c2a1f5a326385ee492140592b506638d6 \\\n      --system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY},ephemeral-storage=${SYSTEM_RESERVED_ES} \\\n      --v=${KUBELET_LOG_LEVEL}\n\nRestart=always\nRestartSec=10\n\n[Install]\nWantedBy=multi-user.target\n",


The difference is in the flags passed to the kubelet: --cloud-provider=external versus --cloud-provider=.
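
Because the two unit files are long single-line JSON strings, the drift is easier to spot by comparing the parsed flags. The helper below is a hypothetical sketch, not part of the MCO; it only assumes that kubelet flags appear as `--name=value` tokens, as they do in the diff above.

```python
import re

# Matches --flag=value tokens; the value may be empty (e.g. "--cloud-provider=").
FLAG_RE = re.compile(r"--([A-Za-z0-9-]+)=(\S*)")


def kubelet_flags(unit_text):
    """Return a {flag: value} dict for every --flag=value token in the unit."""
    return dict(FLAG_RE.findall(unit_text))


def flag_drift(on_disk, rendered):
    """Return {flag: (on_disk_value, rendered_value)} for flags that differ."""
    a, b = kubelet_flags(on_disk), kubelet_flags(rendered)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Shortened, hypothetical unit snippets in the same style as the diff.
on_disk = "ExecStart=/usr/bin/kubelet \\\n  --cloud-provider=external \\\n  --v=2\n"
rendered = "ExecStart=/usr/bin/kubelet \\\n  --cloud-provider= \\\n  --v=2\n"
print(flag_drift(on_disk, rendered))
```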


We also observe the following log in the MCC:
W0629 09:57:44.583046       1 warnings.go:70] unknown field "spec.infra.status.platformStatus.external.cloudControllerManager"


"spec.infra.status.platformStatus.external.cloudControllerManager" is basically the flag in the Infrastructure object that enables the external platform.

Version-Release number of selected component (if applicable):

4.14 nightly

How reproducible:

Always when platform is external

Steps to Reproduce:

1. Deploy a cluster with the external platform enabled. The featureSet TechPreviewNoUpgrade should be set, and the Infrastructure object should look like:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-06-28T10:37:12Z"
  generation: 1
  name: cluster
  resourceVersion: "538"
  uid: 57e09773-0eca-4767-95ce-8ec7d0f2cdae
spec:
  cloudConfig:
    name: ""
  platformSpec:
    external:
      platformName: oci
    type: External
status:
  apiServerInternalURI: https://api-int.test-infra-cluster-3cd17632.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  apiServerURL: https://api.test-infra-cluster-3cd17632.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: test-infra-cluster-3c-pqqqm
  infrastructureTopology: HighlyAvailable
  platform: External
  platformStatus:
    external:
      cloudControllerManager:
        state: External
    type: External
2. Observe the drift with: oc get mcp

Actual results:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       3              0                   0                     3                      138m
worker   rendered-worker-d48036fe2b657e6c71d5d1275675fefc   True      False      False      3              3                   3                     0                      138m

Expected results:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-2ff4e25f807ef3b20b7c6e0c6526f05d   True      False      False      3              3                   3                     0                      33m
worker   rendered-worker-48b7f39d78e3b1d94a0aba1ef4358d01   True      False      False      3              3                   3                     0                      33m

Additional info:

https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1688035248716119

The TestMetrics e2e test is not correctly cleaning up the MachineConfigs and MachineConfigPools it creates. This means that other e2e tests which run after this e2e test can falsely fail or become flaky.

What's happening is this:

  1. The target node is removed from the ephemeral MachineConfigPool by unlabelling it.
  2. A race condition occurs when we call WaitForPoolComplete because technically, the pool is updated at this point since it has not yet picked up the unlabelling event from the target node.
  3. We delete the ephemeral MachineConfigPool, which deletes the rendered MachineConfigs that belong to it.
  4. The node starts the update process, but cannot find the rendered MachineConfigs for the ephemeral pool since they were deleted. The MCD degrades at this point and blocks the worker MachineConfigPool.

 

The cleanup flow should look like this:

  1. The target node is removed from the ephemeral MachineConfigPool by unlabeling it.
  2. Wait until the target node completes the switch back to the worker pool.
  3. Delete the ephemeral MachineConfigPool that was created for the test.
  4. Delete any MachineConfigs assigned to that ephemeral MachineConfigPool.
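
The intended ordering can be sketched as below. The `client` object and its four methods are hypothetical stand-ins for the real cluster calls made by the e2e helpers; the sketch only illustrates that the pool is deleted after the node has actually returned to the worker pool.

```python
import time


def cleanup_ephemeral_pool(client, node, pool, timeout=600.0, interval=1.0,
                           sleep=time.sleep):
    """Run the four cleanup steps in order.

    `client` is a hypothetical object exposing unlabel_node, node_pool,
    delete_pool and delete_pool_configs.
    """
    # Step 1: remove the node from the ephemeral pool by unlabeling it.
    client.unlabel_node(node, pool)

    # Step 2: wait on the node itself, not on the pool status, so the race
    # described above cannot report success too early.
    waited = 0.0
    while client.node_pool(node) != "worker":
        if waited >= timeout:
            raise TimeoutError(f"{node} did not return to the worker pool")
        sleep(interval)
        waited += interval

    # Step 3: delete the ephemeral pool created for the test.
    client.delete_pool(pool)

    # Step 4: delete the MachineConfigs assigned to that pool.
    client.delete_pool_configs(pool)
```

Polling the node's own state in step 2 is the design choice that matters here; the timeout and interval values are arbitrary placeholders.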

 

Description of problem:

A cluster update request with empty strings for api_vip and ingress_vip does not remove the cluster VIPs.
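
One way to express the expected behavior is to distinguish a field that was omitted from a field that was sent as an empty string. The sketch below is an assumption about the desired semantics inferred from this report, not the service's actual update code; the function name and the dict-based cluster record are hypothetical.

```python
_UNSET = object()  # sentinel: the field was not present in the update request


def apply_vip_update(cluster, api_vip=_UNSET, ingress_vip=_UNSET):
    """Return a copy of `cluster` with the requested VIP changes applied.

    An omitted field leaves the stored value alone; an empty string clears it.
    """
    updated = dict(cluster)
    for key, value in (("api_vip", api_vip), ("ingress_vip", ingress_vip)):
        if value is _UNSET:
            continue
        updated[key] = value  # "" clears the VIP, anything else replaces it
    return updated
```

For example, calling `apply_vip_update(cluster, api_vip="", ingress_vip="")` returns a record with both VIPs cleared, while `apply_vip_update(cluster)` returns the record unchanged.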

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. See the following test: https://gist.github.com/nmagnezi/4a3dad01ee197d3984fa7a0604b62cc0
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

https://issues.redhat.com//browse/OCPBUGS-5287 disabled the test due to https://issues.redhat.com/browse/THREESCALE-9015. Once https://issues.redhat.com/browse/THREESCALE-9015 is resolved, the test needs to be re-enabled.

Description of problem:

After an upgrade from 4.9 to 4.10, the collect+ process causes CPU bursts of 5-6 seconds every 15 minutes. During each burst, collect+ consumes 100% CPU.

Top Command Dump Sample:
top - 07:00:04 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.3 us,  4.5 sy,  0.0 ni, 80.8 id,  7.4 wa,  0.8 hi,  0.3 si,  0.0 st
MiB Mem :  32151.9 total,  22601.4 free,   2182.1 used,   7368.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29420.7 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2009 root      20   0 3741252 172136  71396 S  12.9   0.5  36:42.79 kubelet
   1954 root      20   0 2663680 130928  46156 S   7.9   0.4   6:50.44 crio
   9440 root      20   0 1633728 546036  60836 S   7.9   1.7  21:06.80 fluentd
      1 root      20   0  238416  15412   8968 S   5.9   0.0   1:56.73 systemd
   1353 800       10 -10  796808 165380  40916 S   5.0   0.5   2:32.11 ovs-vsw+
   5454 root      20   0 1729112  73680  37404 S   2.0   0.2   3:52.21 coredns
1061248 1000360+  20   0 1113524  24304  17776 S   2.0   0.1   0:00.03 collect+
    306 root       0 -20       0      0      0 I   1.0   0.0   0:00.37 kworker+
    957 root      20   0  264076 126280 119596 S   1.0   0.4   0:06.80 systemd+
   1114 dbus      20   0   83188   6224   5140 S   1.0   0.0   0:04.30 dbus-da+
   5710 root      20   0  406004  31384  15068 S   1.0   0.1   0:04.11 tuned
   6198 nobody    20   0 1632272  46588  20516 S   1.0   0.1   0:17.60 network+
1061291 1000650+  20   0   11896   2748   2496 S   1.0   0.0   0:00.01 bash
1061355 1000650+  20   0   11896   2868   2616 S   1.0   0.0   0:00.01 bash

top - 07:00:05 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 248 total,   2 running, 245 sleeping,   0 stopped,   1 zombie
%Cpu(s): 11.4 us,  2.0 sy,  0.0 ni, 81.5 id,  4.2 wa,  0.6 hi,  0.2 si,  0.0 st
MiB Mem :  32151.9 total,  22601.4 free,   2182.1 used,   7368.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29420.7 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1061248 1000360+  20   0 1484936  36464  21300 S  74.3   0.1   0:00.78 collect+
   9440 root      20   0 1633728 545412  60900 S  11.9   1.7  21:06.92 fluentd
   2009 root      20   0 3741252 172396  71396 S   4.0   0.5  36:42.83 kubelet
      1 root      20   0  238416  15412   8968 S   1.0   0.0   1:56.74 systemd
    300 root       0 -20       0      0      0 I   1.0   0.0   0:00.46 kworker+
   1427 root      20   0   19656   2204   2064 S   1.0   0.0   0:01.55 agetty
   2419 root      20   0 1714748  38812  22884 S   1.0   0.1   0:24.42 coredns+
   2528 root      20   0 1634680  36464  20628 S   1.0   0.1   0:22.01 dynkeep+
1009372 root      20   0       0      0      0 I   1.0   0.0   0:00.42 kworker+
1053353 root      20   0   50200   4012   3292 R   1.0   0.0   0:01.56 top

top - 07:00:06 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.3 us,  1.5 sy,  0.0 ni, 82.7 id,  0.1 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  32151.9 total,  22595.9 free,   2185.7 used,   7370.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29416.7 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1061248 1000360+  20   0 1484936  35740  21428 S  99.0   0.1   0:01.78 collect+
   2009 root      20   0 3741252 172396  71396 S   3.0   0.5  36:42.86 kubelet
   9440 root      20   0 1633728 545076  60900 S   2.0   1.7  21:06.94 fluentd
   1353 800       10 -10  796808 165380  40916 S   1.0   0.5   2:32.12 ovs-vsw+
   1954 root      20   0 2663680 131452  46156 S   1.0   0.4   6:50.45 crio

top - 07:00:07 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.7 us,  1.1 sy,  0.0 ni, 83.6 id,  0.1 wa,  0.4 hi,  0.1 si,  0.0 st
MiB Mem :  32151.9 total,  22595.9 free,   2185.7 used,   7370.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29416.7 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1061248 1000360+  20   0 1484936  35236  21492 S 102.0   0.1   0:02.80 collect+
   2009 root      20   0 3741252 172660  71396 S   7.0   0.5  36:42.93 kubelet
   3288 nobody    20   0  718964  30648  11680 S   3.0   0.1   3:36.84 node_ex+
      1 root      20   0  238416  15412   8968 S   1.0   0.0   1:56.75 systemd
   1353 800       10 -10  796808 165380  40916 S   1.0   0.5   2:32.13 ovs-vsw+
   1954 root      20   0 2663680 131452  46156 S   1.0   0.4   6:50.46 crio
   5454 root      20   0 1729112  73680  37404 S   1.0   0.2   3:52.22 coredns
   9440 root      20   0 1633728 545080  60900 S   1.0   1.7  21:06.95 fluentd
1053353 root      20   0   50200   4012   3292 R   1.0   0.0   0:01.57 top

top - 07:00:08 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 247 total,   2 running, 245 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.2 us,  0.9 sy,  0.0 ni, 84.5 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  32151.9 total,  22595.9 free,   2185.7 used,   7370.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29416.7 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1061248 1000360+  20   0 1484936  35164  21492 S 100.0   0.1   0:03.81 collect+
   2009 root      20   0 3741252 172660  71396 S   3.0   0.5  36:42.96 kubelet
1061543 1000650+  20   0   34564   9804   5772 R   3.0   0.0   0:00.03 python
   9440 root      20   0 1633728 543952  60900 S   2.0   1.7  21:06.97 fluentd
1053353 root      20   0   50200   4012   3292 R   2.0   0.0   0:01.59 top
   2330 root      20   0 1654612  61260  34720 S   1.0   0.2   0:55.81 coredns
   8023 root      20   0   12056   3044   2580 S   1.0   0.0   0:24.59 install+

top - 07:00:09 up 10:10,  0 users,  load average: 0.34, 0.27, 0.28
Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.9 us,  3.2 sy,  0.0 ni, 85.6 id,  1.5 wa,  0.5 hi,  0.2 si,  0.0 st
MiB Mem :  32151.9 total,  22621.0 free,   2160.5 used,   7370.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29441.9 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2009 root      20   0 3741252 172660  71396 S   5.0   0.5  36:43.01 kubelet
   9440 root      20   0 1633728 542684  60900 S   4.0   1.6  21:07.01 fluentd
   1353 800       10 -10  796808 165380  40916 S   2.0   0.5   2:32.15 ovs-vsw+
      1 root      20   0  238416  15412   8968 S   1.0   0.0   1:56.76 systemd
   1954 root      20   0 2663680 131452  46156 S   1.0   0.4   6:50.47 crio
   5454 root      20   0 1729112  73680  37404 S   1.0   0.2   3:52.23 coredns
   6198 nobody    20   0 1632272  45936  20516 S   1.0   0.1   0:17.61 network+
   7016 root      20   0   12052   3204   2736 S   1.0   0.0   0:24.19 install+

Version-Release number of selected component (if applicable):

 

How reproducible:

Lab environment does not present same behavior.

Steps to Reproduce:

1.
2.
3.

Actual results:

Regular high CPU spikes

Expected results:

No CPU spikes

Additional info:

Provided logs:
1-) top command dump uploaded to SF case 03317387
2-) must-gather uploaded to SF case 03317387

 

When we update a Secret referenced in the BareMetalHost, an immediate reconcile of the corresponding BMH is not triggered. In most states we requeue each CR after a timeout, so we should eventually see the changes.

In the case of BMC Secrets, this has been broken since the fix for OCPBUGS-1080 in 4.12.
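
To trigger an immediate reconcile, a Secret change has to be mapped back to the hosts that reference it. The sketch below shows that mapping over plain dictionaries; it is an illustration only, not the operator's watch code. The helper names are hypothetical, and the set of spec fields checked (bmc.credentialsName, userData.name, networkData.name) is an assumption about which references matter.

```python
def referenced_secrets(host):
    """Return the Secret names a BareMetalHost-like dict refers to.

    Assumption: only bmc.credentialsName, userData.name and
    networkData.name are considered.
    """
    spec = host.get("spec", {})
    names = set()
    bmc = spec.get("bmc") or {}
    if bmc.get("credentialsName"):
        names.add(bmc["credentialsName"])
    for field in ("userData", "networkData"):
        ref = spec.get(field) or {}
        if ref.get("name"):
            names.add(ref["name"])
    return names


def hosts_to_reconcile(hosts, namespace, secret_name):
    """Return names of hosts in `namespace` that reference the changed Secret."""
    return [
        h["metadata"]["name"]
        for h in hosts
        if h["metadata"].get("namespace") == namespace
        and secret_name in referenced_secrets(h)
    ]
```

Each returned host name would then be enqueued for reconciliation instead of waiting for the periodic requeue.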

Description of problem:

The PipelineRun list has a Duration column, but the TaskRun list inside a PipelineRun does not.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Have OpenShift Pipeline with 2+ tasks configured and invoked

Steps to Reproduce:

1. Once the PipelineRun is invoked, navigate to the invoked TaskRuns.
2. Columns such as Status and Started are shown, but there is no Duration column.

Actual results:

 

Expected results:

 

Additional info:

I'll add screenshots for PipelineRuns and TaskRuns

Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/47

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

After all cluster operators have reconciled after the password rotation, we can still see authentication failures in keystone (attached screenshot of splunk query)

Version-Release number of selected component (if applicable):

Environment:
- OpenShift 4.12.10 on OpenStack 16
- The cluster is managed via RHACM, but password rotation shall be done via "regular"  OpenShift means.

How reproducible:

Rotated the OpenStack credentials according to the documentation [1]

[1] https://docs.openshift.com/container-platform/4.12/authentication/managing_cloud_provider_credentials/cco-mode-passthrough.html#manually-rotating-cloud-creds_cco-mode-passthrough 

Additional info:

- We can't trace where these authentication failures come from. They disappear after a cluster upgrade (when nodes are rebooted and all pods are restarted), which indicates that a component is still using the old credentials.
- The relevant technical integration points _seem_ to be working though (LBaaS, CSI, Machine API, Swift)

What is the business impact? Please also provide timeframe information.

- We cannot rely on Splunk monitoring for authentication issues since it's currently constantly showing authentication errors
- We cannot be entirely sure that everything works as expected since we don't know the component that doesn't seem to use the new credentials

 

Description of problem:

The E2E test suite fails with the following error:

Falling back to built-in suite, failed reading external test suites: unable to extract k8s-tests binary: failed extracting "/usr/bin/k8s-tests" from "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f98d9998691052cb8049f806f8c1dc9a6bac189c10c33af9addd631eedfb5528": exit status 1
No manifest filename passed

Version-Release number of selected component (if applicable):

4.14

How reproducible:

So far with 4.14 clusters on Power

Steps to Reproduce:

1. Deploy 4.14 cluster on Power
2. Run e2e test suite from - https://github.com/openshift/origin
3. Monitor e2e

Actual results:

E2E test failed

Expected results:

E2E should pass

Additional info:

./openshift-tests run -f ./test-suite.txt -o /tmp/conformance-parallel-out.txt
warning: KUBE_TEST_REPO_LIST may not be set when using openshift-tests and will be ignored
openshift-tests version: v4.1.0-6960-gd9cf51f
  Aug  9 00:48:21.959: INFO: Enabling in-tree volume drivers
Attempting to pull tests from external binary...
Falling back to built-in suite, failed reading external test suites: unable to extract k8s-tests binary: failed extracting "/usr/bin/k8s-tests" from "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f98d9998691052cb8049f806f8c1dc9a6bac189c10c33af9addd631eedfb5528": exit status 1
creating a TCP service service-test with type=LoadBalancer in namespace e2e-service-lb-test-bvmbl
  Aug  9 00:48:35.424: INFO: Waiting up to 15m0s for service "service-test" to have a LoadBalancer
  Aug  9 00:48:36.272: INFO: ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/new started responding to GET requests over new connections
  Aug  9 00:48:36.272: INFO: ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/reused started responding to GET requests over reused connections
  Aug  9 00:48:36.310: INFO: ns/openshift-console route/console disruption/ingress-to-console connection/new started responding to GET requests over new connections
  Aug  9 00:48:36.310: INFO: ns/openshift-console route/console disruption/ingress-to-console connection/reused started responding to GET requests over reused connections
  Aug  9 01:04:07.507: INFO: disruption/ci-cluster-network-liveness connection/reused started responding to GET requests over reused connections
  Aug  9 01:04:07.507: INFO: disruption/ci-cluster-network-liveness connection/new started responding to GET requests over new connections
Starting SimultaneousPodIPController
  I0809 01:04:37.551879  134117 shared_informer.go:311] Waiting for caches to sync for SimultaneousPodIPController
  Aug  9 01:04:37.558: INFO: ns/openshift-image-registry route/test-disruption-reused disruption/image-registry connection/reused started responding to GET requests over reused connections
  Aug  9 01:04:37.624: INFO: disruption/cache-kube-api connection/new started responding to GET requests over new connections
  E0809 01:04:37.719406  134117 shared_informer.go:314] unable to sync caches for SimultaneousPodIPController
Suite run returned error: error waiting for load balancer: timed out waiting for service "service-test" to have a load balancer: timed out waiting for the condition
disruption/kube-api connection/new producer sampler context is done
disruption/cache-kube-api connection/reused producer sampler context is done
disruption/oauth-api connection/new producer sampler context is done
disruption/oauth-api connection/reused producer sampler context is done
ERRO[0975] disruption sample failed: context canceled    auditID=464fb276-71b0-48bf-8fb4-3099ae37cedf backend=oauth-api type=reused
disruption/cache-kube-api connection/new producer sampler context is done
disruption/openshift-api connection/reused producer sampler context is done
disruption/cache-openshift-api connection/reused producer sampler context is done
ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/new producer sampler context is done
ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/reused producer sampler context is done
ns/openshift-console route/console disruption/ingress-to-console connection/new producer sampler context is done
disruption/ci-cluster-network-liveness connection/reused producer sampler context is done
disruption/ci-cluster-network-liveness connection/new producer sampler context is done
ns/openshift-image-registry route/test-disruption-new disruption/image-registry connection/new producer sampler context is done
ns/openshift-image-registry route/test-disruption-reused disruption/image-registry connection/reused producer sampler context is done
ns/openshift-console route/console disruption/ingress-to-console connection/reused producer sampler context is done
disruption/kube-api connection/reused producer sampler context is done
disruption/openshift-api connection/new producer sampler context is done
disruption/cache-openshift-api connection/new producer sampler context is done
disruption/cache-oauth-api connection/reused producer sampler context is done

disruption/cache-oauth-api connection/new producer sampler context is done
Shutting down SimultaneousPodIPController
SimultaneousPodIPController shut down
No manifest filename passed
error running options: error waiting for load balancer: timed out waiting for service "service-test" to have a load balancer: timed out waiting for the condition
error: error waiting for load balancer: timed out waiting for service "service-test" to have a load balancer: timed out waiting for the condition

This is a clone of issue OCPBUGS-11286. The following is the description of the original issue:

Description of problem:


Version-Release number of selected component (if applicable):

OCP 4.13.0-0.nightly-2023-03-23-204038
ODF 4.13.0-121.stable

How reproducible:


Steps to Reproduce:

1. Installed ODF over OCP, everything was fine on the Installed Operators page.
2. Later, when the Installed Operators page was checked, it crashed with an "Oh no! Something went wrong" error.
3.

Actual results:

 Installed Operators page crashes with "Oh no! Something went wrong." error

Expected results:

 Installed Operators page shouldn't crash

Component and stack trace logs from the console page: http://pastebin.test.redhat.com/1096522

Additional info:


Description of problem:

Customer has noticed that object count quotas ("count/*") do not work for certain objects in ClusterResourceQuotas. For example, the following ResourceQuota works as expected:

~~~
apiVersion: v1
kind: ResourceQuota
metadata:
[..]
spec:
  hard:
    count/routes.route.openshift.io: "900"
    count/servicemonitors.monitoring.coreos.com: "100"
    pods: "100"
status:
  hard:
    count/routes.route.openshift.io: "900"
    count/servicemonitors.monitoring.coreos.com: "100"
    pods: "100"
  used:
    count/routes.route.openshift.io: "0"
    count/servicemonitors.monitoring.coreos.com: "1"
    pods: "4"
~~~

However when using "count/servicemonitors.monitoring.coreos.com" in ClusterResourceQuotas, this does not work (note the missing "used"):

~~~
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
[..]
spec:
  quota:
    hard:
      count/routes.route.openshift.io: "900"
      count/servicemonitors.monitoring.coreos.com: "100"
      count/simon.krenger.ch: "100"
      pods: "100"
  selector:
    annotations:
      openshift.io/requester: kube:admin
status:
  namespaces:
[..]
  total:
    hard:
      count/routes.route.openshift.io: "900"
      count/servicemonitors.monitoring.coreos.com: "100"
      count/simon.krenger.ch: "100"
      pods: "100"
    used:
      count/routes.route.openshift.io: "0"
      pods: "4"
~~~
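
The symptom above (keys present under "hard" but absent under "used") can be checked mechanically. The following is a minimal sketch, assuming the two maps have already been extracted from the quota status; the helper name `missingUsage` is hypothetical and not part of any OpenShift tooling.

```go
package main

import (
	"fmt"
	"sort"
)

// missingUsage returns the quota keys that appear under "hard" but have no
// matching entry under "used", sorted for stable output.
// Hypothetical helper for illustration only.
func missingUsage(hard, used map[string]string) []string {
	var missing []string
	for key := range hard {
		if _, ok := used[key]; !ok {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Values copied from the ClusterResourceQuota status shown above.
	hard := map[string]string{
		"count/routes.route.openshift.io":             "900",
		"count/servicemonitors.monitoring.coreos.com": "100",
		"count/simon.krenger.ch":                      "100",
		"pods":                                        "100",
	}
	used := map[string]string{
		"count/routes.route.openshift.io": "0",
		"pods":                            "4",
	}
	for _, key := range missingUsage(hard, used) {
		fmt.Println("no usage reported for", key)
	}
}
```

A key can also be missing from "used" simply because the resource does not exist on the cluster, so a hit from this check is a hint to investigate rather than proof of the bug.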

This behaviour does not only apply to "servicemonitors.monitoring.coreos.com" objects, but also to other objects, such as:

- count/kafkas.kafka.strimzi.io: '0'
- count/prometheusrules.monitoring.coreos.com: '100'
- count/servicemonitors.monitoring.coreos.com: '100'

The debug output for kube-controller-manager shows the following entries, which may or may not be related:

~~~
$ oc logs kube-controller-manager-ip-10-0-132-228.eu-west-1.compute.internal | grep "servicemonitor"
I0511 15:07:17.297620 1 patch_informers_openshift.go:90] Couldn't find informer for monitoring.coreos.com/v1, Resource=servicemonitors
I0511 15:07:17.297630 1 resource_quota_monitor.go:181] QuotaMonitor using a shared informer for resource "monitoring.coreos.com/v1, Resource=servicemonitors"
I0511 15:07:17.297642 1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for servicemonitors.monitoring.coreos.com
[..]
I0511 15:07:17.486279 1 patch_informers_openshift.go:90] Couldn't find informer for monitoring.coreos.com/v1, Resource=servicemonitors
I0511 15:07:17.486297 1 graph_builder.go:176] using a shared informer for resource "monitoring.coreos.com/v1, Resource=servicemonitors", kind "monitoring.coreos.com/v1, Kind=ServiceMonitor"
~~~

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.15

How reproducible:

Always

Steps to Reproduce:

1. On an OCP 4.12 cluster, create the following ClusterResourceQuota:

~~~
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: case-03509174
spec:
  quota: 
    hard:
      count/servicemonitors.monitoring.coreos.com: "100"
      pods: "100"
  selector:
    annotations: 
      openshift.io/requester: "kube:admin"
~~~

2. As "kubeadmin", create a new project and deploy one new ServiceMonitor, for example: 

~~~
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: simon-servicemon-2
  namespace: simon-1
spec:
  endpoints:
    - path: /metrics
      port: http
      scheme: http
  jobLabel: component
  selector:
    matchLabels:
      deployment: echoenv-1
~~~

Actual results:

The "used" field in the ClusterResourceQuota is not populated for certain objects, such as ServiceMonitors. It is unclear whether these quotas are enforced.

Expected results:

ClusterResourceQuota for ServiceMonitors is updated and enforced

Additional info:

* Must-gather for a cluster showing this behaviour (added debug for kube-controller-manager) is available here: https://drive.google.com/file/d/1ioEEHZQVHG46vIzDdNm6pwiTjkL9QQRE/view?usp=share_link
* Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1683876047243989

Description of problem:

Files generated by oc adm inspect sometimes have the leading "---" and sometimes do not, depending on the order in which objects are collected. By itself this is not an issue.

However, it becomes an issue when oc adm inspect is invoked multiple times and collects data into the same directory, as must-gather does.

If an object is collected multiple times, the second invocation of oc might overwrite the original file improperly and leave 4 bytes of the original content behind.

This happens when the second invocation does not write the "---\n": the new content is 4 bytes shorter, so the trailing 4 bytes of the original file are left intact.

This garbage confuses YAML parsers.

Version-Release number of selected component (if applicable):

4.14 nightly as of Jul 25 and before

How reproducible:

Always

Steps to Reproduce:

Run oc adm inspect twice with different order of objects:

[msivak@x openshift-must-gather]$ oc adm inspect performanceprofile,machineconfigs,nodes --dest-dir=inspect.dual --all-namespaces
[msivak@x openshift-must-gather]$ oc adm inspect nodes --dest-dir=inspect.dual --all-namespaces


And then check the alphabetically first node yaml file - it will have garbage at the end of the file.

Actual results:

Garbage at the end of the file.

Expected results:

No garbage.

Additional info:

I believe this is caused by the lack of Truncate mode here https://github.com/openshift/oc/blob/master/pkg/cli/admin/inspect/writer.go#L54
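
The suspected mechanism can be reproduced in isolation. The sketch below is not the code from writer.go; it only assumes that a file is opened for writing with and without `os.O_TRUNC`, and the helper names `writeFile` and `overwriteDemo` are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFile writes data to path. With truncate=false the file is opened
// without os.O_TRUNC, so bytes beyond len(data) left by an earlier, longer
// write stay in place.
func writeFile(path string, data []byte, truncate bool) error {
	flags := os.O_WRONLY | os.O_CREATE
	if truncate {
		flags |= os.O_TRUNC
	}
	f, err := os.OpenFile(path, flags, 0o644)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// overwriteDemo writes a document with a leading "---\n", then overwrites it
// with the same document without that prefix, and returns the final content.
func overwriteDemo(truncate bool) (string, error) {
	dir, err := os.MkdirTemp("", "inspect-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "node.yaml")
	if err := writeFile(path, []byte("---\nkind: Node\nname: a\n"), truncate); err != nil {
		return "", err
	}
	if err := writeFile(path, []byte("kind: Node\nname: a\n"), truncate); err != nil {
		return "", err
	}
	out, err := os.ReadFile(path)
	return string(out), err
}

func main() {
	for _, truncate := range []bool{false, true} {
		content, err := overwriteDemo(truncate)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("truncate=%v -> %q\n", truncate, content)
	}
}
```

Without truncation the second, shorter write leaves the last 4 bytes of the first write (`: a\n` in this sketch) at the end of the file, which matches the garbage described in the report.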


Collecting data multiple times cannot be easily avoided when multiple collect scripts are combined with relatedObjects requested by operators.

Description of problem:

The CVO panics and throws the following error:

Interface conversion: cache.DeletedFinalStateUnknown is not v1.Object: missing method GetAnnotations

Linking the job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1687876857824808960 

Observed on other jobs https://search.ci.openshift.org/?search=cache.DeletedFinalStateUnknown+is+not+v1.Object&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job 
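
The panic message indicates that an object was asserted to v1.Object while it was actually a cache.DeletedFinalStateUnknown tombstone, which informers can hand to delete handlers. A common defensive pattern is to unwrap the tombstone before the assertion. The sketch below illustrates that pattern with local stand-in types so that it runs without client-go; `annotated`, `deletedFinalStateUnknown`, `fakeObject` and `annotationsOnDelete` are assumptions for illustration and not the CVO's code.

```go
package main

import "fmt"

// annotated is a local stand-in for the part of metav1.Object used here.
type annotated interface {
	GetAnnotations() map[string]string
}

// deletedFinalStateUnknown mirrors the shape (Key, Obj) of client-go's
// cache.DeletedFinalStateUnknown. It is a stand-in, not the real type.
type deletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// fakeObject is a minimal object carrying annotations.
type fakeObject struct {
	annotations map[string]string
}

func (f *fakeObject) GetAnnotations() map[string]string { return f.annotations }

// annotationsOnDelete unwraps a tombstone if one is delivered, and returns an
// error instead of panicking when the object has an unexpected type.
func annotationsOnDelete(obj interface{}) (map[string]string, error) {
	if tombstone, ok := obj.(deletedFinalStateUnknown); ok {
		obj = tombstone.Obj
	}
	o, ok := obj.(annotated)
	if !ok {
		return nil, fmt.Errorf("unexpected object type %T in delete handler", obj)
	}
	return o.GetAnnotations(), nil
}

func main() {
	obj := &fakeObject{annotations: map[string]string{"example": "value"}}
	tombstone := deletedFinalStateUnknown{Key: "ns/name", Obj: obj}

	annotations, err := annotationsOnDelete(tombstone)
	fmt.Println(annotations, err)

	_, err = annotationsOnDelete("not an object")
	fmt.Println(err)
}
```

Using the two-value form of the type assertion is what turns the interface-conversion panic into an error that the handler can log and skip.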


Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/531

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Sanitize OWNERS/OWNER_ALIASES:

1) OWNERS must have:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of Storage team.

Description of the problem:

We turn on the feature-usage flag for custom manifests whenever we create a new custom cluster manifest. When we delete that manifest, the flag stays on.

 

Expected results:

The flag should be turned off when the custom manifest is deleted.

Description of problem:

The current openshift_sdn_pod_operations_latency metric is broken: it does not calculate the actual duration of setup/teardown.
We also need additional metrics that measure pod latency end to end, giving an overall summary of the total processing time spent by the CNI server.

 


This is a clone of issue OCPBUGS-18841. The following is the description of the original issue:

Description of problem:

The automated case OCP-57089 failed on a 4.14 Azure cluster. A manual check showed that the created load-balancer service could not get an external IP address.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-09-164123

How reproducible:

100% on the cluster

Steps to Reproduce:

1. Add a wait in the auto script, then run the case
      g.By("check if the lb services have obtained the EXTERNAL-IPs")
      regExp := "([0-9]+.[0-9]+.[0-9]+.[0-9]+)"
      time.Sleep(3600 * time.Second) 
% ./bin/extended-platform-tests run all --dry-run | grep 57089 | ./bin/extended-platform-tests run -f -

2.
% oc get ns | grep e2e-test-router
e2e-test-router-ingressclass-n2z2c                 Active   2m51s 

3. The EXTERNAL-IP column stayed at <pending> for the internal-lb-57089 service
% oc -n e2e-test-router-ingressclass-n2z2c get svc
NAME                TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
external-lb-57089   LoadBalancer   172.30.198.7    20.42.34.61   28443:30193/TCP   3m6s
internal-lb-57089   LoadBalancer   172.30.214.30   <pending>     29443:31507/TCP   3m6s
service-secure      ClusterIP      172.30.47.70    <none>        27443/TCP         3m13s
service-unsecure    ClusterIP      172.30.175.59   <none>        27017/TCP         3m13s
% 

4.
% oc -n e2e-test-router-ingressclass-n2z2c get svc internal-lb-57089 -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  creationTimestamp: "2023-09-12T07:56:42Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  name: internal-lb-57089
  namespace: e2e-test-router-ingressclass-n2z2c
  resourceVersion: "209376"
  uid: b163bc03-b1c6-4e7b-b4e1-c996e9d135f4
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.30.214.30
  clusterIPs:
  - 172.30.214.30
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    nodePort: 31507
    port: 29443
    protocol: TCP
    targetPort: 8443
  selector:
    name: web-server-rc
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer: {}
%

Actual results:

internal-lb-57089 service couldn't get an external-IP address

Expected results:

internal-lb-57089 service can get an external-IP address
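
The manual check in step 3 can be automated by scanning the tabular `oc get svc` output for LoadBalancer services whose EXTERNAL-IP is still `<pending>`. This is a minimal sketch that assumes the default column layout shown in step 3; `pendingLoadBalancers` is a hypothetical helper and not part of the test suite.

```go
package main

import (
	"fmt"
	"strings"
)

// pendingLoadBalancers scans `oc get svc` table output and returns the names
// of LoadBalancer services whose EXTERNAL-IP column is "<pending>".
// Assumed column order: NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE.
func pendingLoadBalancers(table string) []string {
	var pending []string
	for _, line := range strings.Split(table, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 4 || fields[0] == "NAME" {
			continue // skip the header and blank or short lines
		}
		if fields[1] == "LoadBalancer" && fields[3] == "<pending>" {
			pending = append(pending, fields[0])
		}
	}
	return pending
}

func main() {
	// Output copied from step 3 above.
	table := `NAME                TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
external-lb-57089   LoadBalancer   172.30.198.7    20.42.34.61   28443:30193/TCP   3m6s
internal-lb-57089   LoadBalancer   172.30.214.30   <pending>     29443:31507/TCP   3m6s
service-secure      ClusterIP      172.30.47.70    <none>        27443/TCP         3m13s
service-unsecure    ClusterIP      172.30.175.59   <none>        27017/TCP         3m13s`

	fmt.Println(pendingLoadBalancers(table))
}
```

Matching the literal `<pending>` value avoids relying on a pattern such as `[0-9]+.[0-9]+.[0-9]+.[0-9]+`, in which the unescaped dots match any character.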

Additional info:

 

Please review the following PR: https://github.com/openshift/router/pull/453

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-19918. The following is the description of the original issue:

Description of problem:

This issue was found when analyzing bug https://issues.redhat.com/browse/OCPBUGS-19817

Version-Release number of selected component (if applicable):

4.15.0-0.ci-2023-09-25-165744

How reproducible:

every time

Steps to Reproduce:

The cluster is an IPsec cluster with the NS extension and the IPsec service enabled.
1.  Enable east-west IPsec and wait for the cluster to settle
2.  Disable IPsec and wait for the cluster to settle

You'll observe that the IPsec pods are deleted

Actual results:

no pods

Expected results:

pods should stay
see https://github.com/openshift/cluster-network-operator/blob/master/pkg/network/ovn_kubernetes.go#L314
	// If IPsec is enabled for the first time, we start the daemonset. If it is
	// disabled after that, we do not stop the daemonset but only stop IPsec.
	//
	// TODO: We need to do this as, by default, we maintain IPsec state on the
	// node in order to maintain encrypted connectivity in the case of upgrades.
	// If we only unrender the IPsec daemonset, we will be unable to cleanup
	// the IPsec state on the node and the traffic will continue to be
	// encrypted.
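
The behaviour described in the quoted comment boils down to a small decision rule: render the daemonset when IPsec is enabled, and keep rendering it once it exists. The sketch below expresses only that rule; `shouldRenderIPsecDaemonSet` is a hypothetical helper and not the function used in ovn_kubernetes.go.

```go
package main

import "fmt"

// shouldRenderIPsecDaemonSet is a hypothetical helper, not the operator's
// actual code. The daemonset is rendered when IPsec is enabled, and it stays
// rendered once it exists so that IPsec state on the node can still be
// cleaned up after IPsec is disabled.
func shouldRenderIPsecDaemonSet(ipsecEnabled, daemonSetExists bool) bool {
	return ipsecEnabled || daemonSetExists
}

func main() {
	exists := false
	// Enable IPsec, then disable it again, as in the steps to reproduce.
	for _, enabled := range []bool{true, false} {
		exists = shouldRenderIPsecDaemonSet(enabled, exists)
		fmt.Printf("ipsecEnabled=%v daemonSetRendered=%v\n", enabled, exists)
	}
}
```

Under this rule the pods survive the disable step, which is the expected result stated above.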

Additional info:


Description of problem:

agent-gather script does not collect agent-tui logs

Version-Release number of selected component (if applicable):

 

How reproducible:

Log in to a node (before bootstrap is completed) and run the agent-gather script

Steps to Reproduce:

1. ssh into one of the nodes
2. run agent-gather
3. Check the content of the produced tar artifacts

Actual results:

The agent-gather-*.tar.xz does not contain agent-tui logs

Expected results:

The agent-gather-*.tar.xz must contain /var/log/agent/agent-tui.log

Additional info:

agent-tui logs are essential for troubleshooting any issue affecting the agent-tui console that might occur during bootstrap.

Description of problem:

When deploying 4.12 spoke clusters (using rhcos-412.86.202306132230-0-live.x86_64.iso) or 4.10 spoke clusters from a 4.14.0-ec.4 hub, the BMH gets stuck in the provisioning state with the error: Failed to update hostname: Command '['chroot', '/mnt/coreos', 'hostnamectl', 'hostname']' returned non-zero exit status 1. Running `hostnamectl hostname` returns `Unknown operation hostname`. It looks like older versions of hostnamectl do not support the hostname option.
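
One way to tolerate both old and new hostnamectl versions is to try the newer invocation first and fall back to one that older versions understand. The sketch below assumes that `hostnamectl --static` is available as the fallback and takes the command runner as a parameter so that it can be exercised without a real hostnamectl. All names in it (`runner`, `currentHostname` and the two fake runners) are hypothetical, and this is not the fix that was actually applied.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// runner executes hostnamectl with the given arguments and returns stdout.
type runner func(args ...string) (string, error)

// currentHostname tries `hostnamectl hostname` first and, if that fails,
// falls back to `hostnamectl --static` (assumed to work on older versions).
func currentHostname(run runner) (string, error) {
	out, err := run("hostname")
	if err != nil {
		out, err = run("--static")
	}
	if err != nil {
		return "", fmt.Errorf("hostnamectl failed: %w", err)
	}
	return strings.TrimSpace(out), nil
}

// fakeOldHostnamectl imitates a version that rejects the "hostname" verb.
func fakeOldHostnamectl(args ...string) (string, error) {
	if len(args) == 1 && args[0] == "hostname" {
		return "", errors.New("Unknown operation hostname")
	}
	return "spoke-node-0\n", nil
}

// fakeNewHostnamectl imitates a version that accepts both invocations.
func fakeNewHostnamectl(args ...string) (string, error) {
	return "spoke-node-0\n", nil
}

func main() {
	for _, run := range []runner{fakeOldHostnamectl, fakeNewHostnamectl} {
		name, err := currentHostname(run)
		fmt.Println(name, err)
	}
}
```

In a real agent the runner would wrap the `chroot /mnt/coreos hostnamectl ...` call from the error message.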

Version-Release number of selected component (if applicable):

4.14.0-ec.4

How reproducible:

100%

Steps to Reproduce:

1. From a 4.14.0-ec.4 hub cluster deploy a 4.12 spoke cluster using rhcos-412.86.202306132230-0-live.x86_64.iso via ZTP procedure

Actual results:

BMH stuck in provisioning state

Expected results:

BMH gets provisioned

Additional info:

I also tried using a 4.14 iso image to deploy the 4.12 payload but then kubelet would fail with err="failed to parse kubelet flag: unknown flag: --container-runtime"

MGMT-7549 added a change to use openshift-install instead of openshift-baremetal-install for platform:none clusters. This was to work around a problem where the baremetal binary was not available for an ARM target cluster, and at the time only none platform was supported on ARM. This problem was resolved by MGMT-9206, so we no longer need the workaround.

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/230

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

oc login --token=$token --server=https://api.dalh-dev-hs-2.05zb.p3.openshiftapps.com:443 --certificate-authority=ca.crt
The server uses a certificate signed by an unknown authority.
You can bypass the certificate check, but any data you send to the server could be intercepted by others.

The referenced "ca.crt" comes from the Secret created when a Service Account is created.

Version-Release number of selected component (if applicable): 4.12.12

How reproducible: Always

Description of problem:

etcd pods running in a HyperShift control plane use an exec probe to check cluster health, with a very small timeout (1s). We should use the same probe as standalone etcd, with a 30s timeout.

Version-Release number of selected component (if applicable):

All

How reproducible:

Always

Steps to Reproduce:

1. Create a hypershift hosted cluster
2. Examine etcd pod(s) yaml

Actual results:

Probe is of type exec and has a timeout of 1s

Expected results:

Probe is of type http and has a timeout of 30s

Additional info:

 

Description of problem:
The customer wants to restrict access to the vCenter API, so originating traffic needs to use a configured EgressIP. This works for the machine API, but the vSphere CSI driver controller uses the host network, so the configured EgressIP isn't used.

Is it possible to disable the use of the host network for the CSI controller?

slack thread: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1683135077822559

Description of problem:

The APIServer endpoint isn't healthy after a PublicAndPrivate cluster is created. PROGRESS of the cluster is Completed and PROGRESSING is False, the nodes are ready, and the cluster operators on the guest cluster are Available. The only issue is that the condition of type Available is False because the APIServer endpoint is not healthy.

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster -n clusters
NAME      VERSION                              KUBECONFIG                 PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
jz-test   4.14.0-0.nightly-2023-04-30-235516   jz-test-admin-kubeconfig   Completed   False       False         APIServer endpoint a23663b1e738a4d6783f6256da73fe76-2649b36a23f49ed7.elb.us-east-2.amazonaws.com is not healthy

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jz-test -n clusters -ojsonpath='{.spec.platform.aws.endpointAccess}{"\n"}'
PublicAndPrivate

jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jz-test
NAME                                                  READY   STATUS    RESTARTS   AGE
aws-cloud-controller-manager-666559d4f-rdsw4          2/2     Running   0          149m
aws-ebs-csi-driver-controller-79fdfb6c76-vb7wr        7/7     Running   0          148m
aws-ebs-csi-driver-operator-7dbd789984-mb9rp          1/1     Running   0          148m
capi-provider-5b7847db9-nlrvz                         2/2     Running   0          151m
catalog-operator-7ccb468d86-7c5j6                     2/2     Running   0          149m
certified-operators-catalog-895787778-5rjb6           1/1     Running   0          149m
cloud-network-config-controller-86698fd7dd-kgzhv      3/3     Running   0          148m
cluster-api-6fd4f86878-hjw59                          1/1     Running   0          151m
cluster-autoscaler-bdd688949-f9xmk                    1/1     Running   0          150m
cluster-image-registry-operator-6f5cb67d88-8svd6      3/3     Running   0          149m
cluster-network-operator-7bc69f75f4-npjfs             1/1     Running   0          149m
cluster-node-tuning-operator-5855b6576b-rckhh         1/1     Running   0          149m
cluster-policy-controller-56d4d6b57c-glx4w            1/1     Running   0          149m
cluster-storage-operator-7cc56c68bb-jd4d2             1/1     Running   0          149m
cluster-version-operator-bd969b677-bh4w4              1/1     Running   0          149m
community-operators-catalog-5c545484d7-hbzb4          1/1     Running   0          149m
control-plane-operator-fc49dcbb4-5ncvf                2/2     Running   0          151m
csi-snapshot-controller-85f7cc9945-n5vgq              1/1     Running   0          149m
csi-snapshot-controller-operator-6597b45897-hqf5p     1/1     Running   0          149m
csi-snapshot-webhook-644d765546-lk9hj                 1/1     Running   0          149m
dns-operator-5b5577d6c7-8dh8d                         1/1     Running   0          149m
etcd-0                                                2/2     Running   0          150m
hosted-cluster-config-operator-5b75ccf55d-6rzch       1/1     Running   0          149m
ignition-server-596fc9d9fb-sb94h                      1/1     Running   0          150m
ingress-operator-6497d476bc-whssz                     3/3     Running   0          149m
konnectivity-agent-6656d8dfd6-h5tcs                   1/1     Running   0          150m
konnectivity-server-5ff9d4b47-stb2m                   1/1     Running   0          150m
kube-apiserver-596fc4bb8b-7kfd8                       3/3     Running   0          150m
kube-controller-manager-6f86bb7fbd-4wtxk              1/1     Running   0          138m
kube-scheduler-bf5876b4b-flk96                        1/1     Running   0          149m
machine-approver-574585d8dd-h5ffh                     1/1     Running   0          150m
multus-admission-controller-67b6f85fbf-bfg4x          2/2     Running   0          148m
oauth-openshift-6b6bfd55fb-8sdq7                      2/2     Running   0          148m
olm-operator-5d97fb977c-sbf6w                         2/2     Running   0          149m
openshift-apiserver-5bb9f99974-2lfp4                  3/3     Running   0          138m
openshift-controller-manager-65666bdf79-g8cf5         1/1     Running   0          149m
openshift-oauth-apiserver-56c8565bb6-6b5cv            2/2     Running   0          149m
openshift-route-controller-manager-775f844dfc-jj2ft   1/1     Running   0          149m
ovnkube-master-0                                      7/7     Running   0          148m
packageserver-6587d9674b-6jwpv                        2/2     Running   0          149m
redhat-marketplace-catalog-5f6d45b457-hdn77           1/1     Running   0          149m
redhat-operators-catalog-7958c4449b-l4hbx             1/1     Running   0          12m
router-5b7899cc97-chs6t                               1/1     Running   0          150m

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
NAME                                        STATUS   ROLES    AGE    VERSION
ip-10-0-137-99.us-east-2.compute.internal   Ready    worker   131m   v1.26.2+d2e245f
ip-10-0-140-85.us-east-2.compute.internal   Ready    worker   132m   v1.26.2+d2e245f
ip-10-0-141-46.us-east-2.compute.internal   Ready    worker   131m   v1.26.2+d2e245f
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get co --kubeconfig=hostedcluster.kubeconfig 
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      126m    
csi-snapshot-controller                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
dns                                        4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
image-registry                             4.14.0-0.nightly-2023-04-30-235516   True        False         False      128m    
ingress                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
insights                                   4.14.0-0.nightly-2023-04-30-235516   True        False         False      130m    
kube-apiserver                             4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
kube-controller-manager                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
kube-scheduler                             4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
kube-storage-version-migrator              4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
monitoring                                 4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
network                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
node-tuning                                4.14.0-0.nightly-2023-04-30-235516   True        False         False      131m    
openshift-apiserver                        4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
openshift-controller-manager               4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
openshift-samples                          4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
operator-lifecycle-manager                 4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
operator-lifecycle-manager-catalog         4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
operator-lifecycle-manager-packageserver   4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
service-ca                                 4.14.0-0.nightly-2023-04-30-235516   True        False         False      130m    
storage                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      131m    
jiezhao-mac:hypershift jiezhao$ 

HC conditions:
==============
  status:
    conditions:
    - lastTransitionTime: "2023-05-01T19:45:49Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidAWSIdentityProvider
    - lastTransitionTime: "2023-05-01T20:00:18Z"
      message: Cluster version is 4.14.0-0.nightly-2023-04-30-235516
      observedGeneration: 3
      reason: FromClusterVersion
      status: "False"
      type: ClusterVersionProgressing
    - lastTransitionTime: "2023-05-01T19:46:22Z"
      message: Payload loaded version="4.14.0-0.nightly-2023-04-30-235516" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-04-30-235516"
        architecture="amd64"
      observedGeneration: 3
      reason: PayloadLoaded
      status: "True"
      type: ClusterVersionReleaseAccepted
    - lastTransitionTime: "2023-05-01T20:03:14Z"
      message: Condition not found in the CVO.
      observedGeneration: 3
      reason: StatusUnknown
      status: Unknown
      type: ClusterVersionUpgradeable
    - lastTransitionTime: "2023-05-01T20:00:18Z"
      message: Done applying 4.14.0-0.nightly-2023-04-30-235516
      observedGeneration: 3
      reason: FromClusterVersion
      status: "True"
      type: ClusterVersionAvailable
    - lastTransitionTime: "2023-05-01T20:00:18Z"
      message: ""
      observedGeneration: 3
      reason: FromClusterVersion
      status: "True"
      type: ClusterVersionSucceeding
    - lastTransitionTime: "2023-05-01T19:47:51Z"
      message: The hosted cluster is not degraded
      observedGeneration: 3
      reason: AsExpected
      status: "False"
      type: Degraded
    - lastTransitionTime: "2023-05-01T19:45:01Z"
      message: ""
      observedGeneration: 3
      reason: QuorumAvailable
      status: "True"
      type: EtcdAvailable
    - lastTransitionTime: "2023-05-01T19:45:38Z"
      message: Kube APIServer deployment is available
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: KubeAPIServerAvailable
    - lastTransitionTime: "2023-05-01T19:44:27Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: InfrastructureReady
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: External DNS is not configured
      observedGeneration: 3
      reason: StatusUnknown
      status: Unknown
      type: ExternalDNSReachable
    - lastTransitionTime: "2023-05-01T19:44:19Z"
      message: Configuration passes validation
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidHostedControlPlaneConfiguration
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: AWS KMS is not configured
      observedGeneration: 3
      reason: StatusUnknown
      status: Unknown
      type: ValidAWSKMSConfig
    - lastTransitionTime: "2023-05-01T19:44:37Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidReleaseInfo
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: APIServer endpoint a23663b1e738a4d6783f6256da73fe76-2649b36a23f49ed7.elb.us-east-2.amazonaws.com
        is not healthy
      observedGeneration: 3
      reason: waitingForAvailable
      status: "False"
      type: Available
    - lastTransitionTime: "2023-05-01T19:47:18Z"
      message: All is well
      reason: AWSSuccess
      status: "True"
      type: AWSEndpointAvailable
    - lastTransitionTime: "2023-05-01T19:47:18Z"
      message: All is well
      reason: AWSSuccess
      status: "True"
      type: AWSEndpointServiceAvailable
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: Configuration passes validation
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidConfiguration
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: HostedCluster is supported by operator configuration
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: SupportedHostedCluster
    - lastTransitionTime: "2023-05-01T19:45:39Z"
      message: Ignition server deployment is available
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: IgnitionEndpointAvailable
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: Reconciliation active on resource
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ReconciliationActive
    - lastTransitionTime: "2023-05-01T19:44:12Z"
      message: Release image is valid
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidReleaseImage
    - lastTransitionTime: "2023-05-01T19:44:12Z"
      message: HostedCluster is at expected version
      observedGeneration: 3
      reason: AsExpected
      status: "False"
      type: Progressing
    - lastTransitionTime: "2023-05-01T19:44:13Z"
      message: OIDC configuration is valid
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidOIDCConfiguration
    - lastTransitionTime: "2023-05-01T19:44:13Z"
      message: Reconciliation completed succesfully
      observedGeneration: 3
      reason: ReconciliatonSucceeded
      status: "True"
      type: ReconciliationSucceeded
    - lastTransitionTime: "2023-05-01T19:45:52Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: AWSDefaultSecurityGroupCreated

kube-apiserver log:
==================
E0501 19:45:07.024278       7 memcache.go:238] couldn't get current server API group list: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_authorization-openshift_01_rolebindingrestriction.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_config-operator_01_proxy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_quota-openshift_01_clusterresourcequota.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_security-openshift_01_scc.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_securityinternal-openshift_02_rangeallocation.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_apiserver-Default.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_authentication.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_build.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_console.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_dns.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_featuregate.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_image.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagecontentpolicy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagecontentsourcepolicy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagedigestmirrorset.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagetagmirrorset.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_infrastructure-Default.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_ingress.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_network.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_node.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_oauth.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_project.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_scheduler.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Create a PublicAndPrivate cluster

Actual results:

APIServer endpoint is not healthy, and HC condition Type 'Available' is False

Expected results:

APIServer endpoint should be healthy, and Type 'Available' should be True

Additional info:

 

Description of problem:

The console panics when spec.plugins contains a duplicate entry.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2022-12-19-122634

How reproducible:

Always

Steps to Reproduce:

1. Create console-demo-plugin manifests
$ oc apply -f dynamic-demo-plugin/oc-manifest.yaml 
namespace/console-demo-plugin created
deployment.apps/console-demo-plugin created
service/console-demo-plugin created
consoleplugin.console.openshift.io/console-demo-plugin created 
2. Enable console-demo-plugin
$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-demo-plugin"] } }' --type=merge 
console.operator.openshift.io/cluster patched
3. Add a duplicate entry in spec.plugins in consoles.operator/cluster 
$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-demo-plugin", "console-demo-plugin"] } }' --type=merge
console.operator.openshift.io/cluster patched
$ oc get consoles.operator cluster -o json | jq .spec.plugins
[
  "console-demo-plugin",
  "console-demo-plugin"
]
4. Check the status of the console pods
$ oc get pods -n openshift-console                        
NAME                         READY   STATUS             RESTARTS      AGE
console-6bcc87c7b4-6g2cf     0/1     CrashLoopBackOff   1 (21s ago)   50s
console-6bcc87c7b4-9g6kk     0/1     CrashLoopBackOff   3 (3s ago)    50s
console-7dc78ffd78-sxvcv     1/1     Running            0             2m58s
downloads-758fc74758-9k426   1/1     Running            0             3h18m
downloads-758fc74758-k4q72   1/1     Running            0             3h21m

Actual results:

After step 3, the new console pods go into CrashLoopBackOff status:
$ oc logs console-6bcc87c7b4-9g6kk -n openshift-console
W1220 06:48:37.279871       1 main.go:228] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
I1220 06:48:37.279889       1 main.go:238] The following console plugins are enabled:
I1220 06:48:37.279898       1 main.go:240]  - console-demo-plugin
I1220 06:48:37.279911       1 main.go:354] cookies are secure!
I1220 06:48:37.331802       1 server.go:607] The following console endpoints are now proxied to these services:
I1220 06:48:37.331843       1 server.go:610]  - /api/proxy/plugin/console-demo-plugin/thanos-querier/ -> https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
I1220 06:48:37.331884       1 server.go:610]  - /api/proxy/plugin/console-demo-plugin/thanos-querier/ -> https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
panic: http: multiple registrations for /api/proxy/plugin/console-demo-plugin/thanos-querier/

goroutine 1 [running]:
net/http.(*ServeMux).Handle(0xc0005b6600, {0xc0005d9a40, 0x35}, {0x35aaf60?, 0xc000735260})
    /usr/lib/golang/src/net/http/server.go:2503 +0x239
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func1({0xc0005d9940?, 0x35?}, {0x35aaf60, 0xc000735260})
    /go/src/github.com/openshift/console/pkg/server/server.go:245 +0x149
github.com/openshift/console/pkg/server.(*Server).HTTPHandler(0xc000056c00)
    /go/src/github.com/openshift/console/pkg/server/server.go:621 +0x330b
main.main()
    /go/src/github.com/openshift/console/cmd/bridge/main.go:785 +0x5ff5

Expected results:

After step 3, the console pods should keep running without errors.
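
The panic above comes from registering the same proxy path twice, once per duplicate entry. A minimal sketch of one way to avoid it is to de-duplicate the plugin list before building the proxy paths. This is an illustration in Python, not the console's actual Go code or the fix that was merged; the helper names and the fixed "thanos-querier" alias are assumptions taken from the log output.

```python
def unique_plugins(plugins):
    """Return plugin names with duplicates removed, keeping first-seen order."""
    # dict preserves insertion order, so fromkeys() drops only later duplicates.
    return list(dict.fromkeys(plugins))


def proxy_paths(plugins, proxy_alias="thanos-querier"):
    """Build one proxy path per distinct plugin (hypothetical helper)."""
    paths = []
    for name in unique_plugins(plugins):
        # Each path is produced once, so a mux would see a single registration.
        paths.append(f"/api/proxy/plugin/{name}/{proxy_alias}/")
    return paths
```

With the duplicated spec.plugins value from step 3, proxy_paths yields a single path, so a handler would be registered only once.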

Additional info:

 

 

 

 

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/221

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

If both of the following annotations are set on an operator CSV, the uninstall instructions do not appear in the UI.
- console.openshift.io/disable-operand-delete: "true"
- operator.openshift.io/uninstall-message: "some message"

Version-Release number of selected component (if applicable):

➜  $> oc version
Client Version: 4.12.0
Kustomize Version: v4.5.7
Server Version: 4.13.0-rc.5
Kubernetes Version: v1.26.3+379cd9f

➜  $> oc get co | grep console
console                                    4.13.0-rc.5   True        False         False      4h49m

How reproducible:

Always

Steps to Reproduce:

1. Add both of the annotations mentioned above to an operator CSV.
2. Make sure "console.openshift.io/disable-operand-delete" is set to "true".
3. Click "Uninstall operator" and observe the pop-up.

Actual results:

The uninstall pop-up doesn't have the "Message from Operator developer" section.

Expected results:

The uninstall instructions should show up under "Message from Operator developer".

Additional info:

The two annotations appear to be linked in the following code: https://github.com/openshift/console/blob/3e0bb0928ce09030bc3340c9639b2a1df9e0a007/frontend/packages/operator-lifecycle-manager/src/components/modals/uninstall-operator-modal.tsx#LL395C10-L395C26

Description of problem

When the ingress operator creates or updates a router deployment that specifies spec.template.spec.hostNetwork: true, the operator does not set spec.template.spec.containers[*].ports[*].hostPort. As a result, the API sets each port's hostPort field to the port's containerPort field value. The operator detects this as an external update and attempts to revert it. The operator should not update the deployment in response to API defaulting.

Version-Release number of selected component (if applicable)

I observed this in CI for OCP 4.14 and was able to reproduce the issue on OCP 4.11.37. The problematic code was added in https://github.com/openshift/cluster-ingress-operator/pull/694/commits/af653f9fa7368cf124e11b7ea4666bc40e601165 in OCP 4.11 to implement NE-674.

How reproducible

Easily.

Steps to Reproduce

1. Create an IngressController that specifies the "HostNetwork" endpoint publishing strategy type:

oc create -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: example-hostnetwork
  namespace: openshift-ingress-operator
spec:
  domain: example.xyz
  endpointPublishingStrategy:
    type: HostNetwork
EOF

2. Check the ingress operator's logs:

oc -n openshift-ingress-operator logs -c ingress-operator deployments/ingress-operator

Actual results

The ingress operator logs "updated router deployment" multiple times for the "example-hostnetwork" IngressController, such as the following:

2023-06-15T02:11:47.229Z        INFO    operator.ingress_controller     ingress/deployment.go:131       updated router deployment       {"namespace": "openshift-ingress", "name": "router-example-hostnetwork", "diff": "  &v1.Deployment{\n  \tTypeMeta:   {},\n  \tObjectMeta: {Name: \"router-example-hostnetwork\", Namespace: \"openshift-ingress\", UID: \"d7c51022-460e-4962-8521-e00255f649c3\", ResourceVersion: \"3356177\", ...},\n  \tSpec: v1.DeploymentSpec{\n  \t\tReplicas: &2,\n  \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"example-hostnetwork\"}},\n  \t\tTemplate: v1.PodTemplateSpec{\n  \t\t\tObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"example-hostnetwork\", \"ingresscontroller.operator.openshift.io/hash\": \"b7c697fd\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`, \"unsupported.do-not-use.openshift.io/override-liveness-grace-period-seconds\": \"10\"}},\n  \t\t\tSpec: v1.PodSpec{\n  \t\t\t\tVolumes: []v1.Volume{\n  \t\t\t\t\t{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"router-certs-example-hostnetwork\", DefaultMode: &420}}},\n  \t\t\t\t\t{\n  \t\t\t\t\t\tName: \"service-ca-bundle\",\n  \t\t\t\t\t\tVolumeSource: v1.VolumeSource{\n  \t\t\t\t\t\t\t... // 16 identical fields\n  \t\t\t\t\t\t\tFC:        nil,\n  \t\t\t\t\t\t\tAzureFile: nil,\n  \t\t\t\t\t\t\tConfigMap: &v1.ConfigMapVolumeSource{\n  \t\t\t\t\t\t\t\tLocalObjectReference: {Name: \"service-ca-bundle\"},\n  \t\t\t\t\t\t\t\tItems:                {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}},\n- \t\t\t\t\t\t\t\tDefaultMode:          &420,\n+ \t\t\t\t\t\t\t\tDefaultMode:          nil,\n  \t\t\t\t\t\t\t\tOptional:             &false,\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\tVsphereVolume: nil,\n  \t\t\t\t\t\t\tQuobyte:       nil,\n  \t\t\t\t\t\t\t... 
// 8 identical fields\n  \t\t\t\t\t\t},\n  \t\t\t\t\t},\n  \t\t\t\t\t{\n  \t\t\t\t\t\tName: \"stats-auth\",\n  \t\t\t\t\t\tVolumeSource: v1.VolumeSource{\n  \t\t\t\t\t\t\t... // 3 identical fields\n  \t\t\t\t\t\t\tAWSElasticBlockStore: nil,\n  \t\t\t\t\t\t\tGitRepo:              nil,\n  \t\t\t\t\t\t\tSecret: &v1.SecretVolumeSource{\n  \t\t\t\t\t\t\t\tSecretName:  \"router-stats-example-hostnetwork\",\n  \t\t\t\t\t\t\t\tItems:       nil,\n- \t\t\t\t\t\t\t\tDefaultMode: &420,\n+ \t\t\t\t\t\t\t\tDefaultMode: nil,\n  \t\t\t\t\t\t\t\tOptional:    nil,\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\tNFS:   nil,\n  \t\t\t\t\t\t\tISCSI: nil,\n  \t\t\t\t\t\t\t... // 21 identical fields\n  \t\t\t\t\t\t},\n  \t\t\t\t\t},\n  \t\t\t\t\t{\n  \t\t\t\t\t\tName: \"metrics-certs\",\n  \t\t\t\t\t\tVolumeSource: v1.VolumeSource{\n  \t\t\t\t\t\t\t... // 3 identical fields\n  \t\t\t\t\t\t\tAWSElasticBlockStore: nil,\n  \t\t\t\t\t\t\tGitRepo:              nil,\n  \t\t\t\t\t\t\tSecret: &v1.SecretVolumeSource{\n  \t\t\t\t\t\t\t\tSecretName:  \"router-metrics-certs-example-hostnetwork\",\n  \t\t\t\t\t\t\t\tItems:       nil,\n- \t\t\t\t\t\t\t\tDefaultMode: &420,\n+ \t\t\t\t\t\t\t\tDefaultMode: nil,\n  \t\t\t\t\t\t\t\tOptional:    nil,\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\tNFS:   nil,\n  \t\t\t\t\t\t\tISCSI: nil,\n  \t\t\t\t\t\t\t... // 21 identical fields\n  \t\t\t\t\t\t},\n  \t\t\t\t\t},\n  \t\t\t\t},\n  \t\t\t\tInitContainers: nil,\n  \t\t\t\tContainers: []v1.Container{\n  \t\t\t\t\t{\n  \t\t\t\t\t\t... 
// 3 identical fields\n  \t\t\t\t\t\tArgs:       nil,\n  \t\t\t\t\t\tWorkingDir: \"\",\n  \t\t\t\t\t\tPorts: []v1.ContainerPort{\n  \t\t\t\t\t\t\t{\n  \t\t\t\t\t\t\t\tName:          \"http\",\n- \t\t\t\t\t\t\t\tHostPort:      80,\n+ \t\t\t\t\t\t\t\tHostPort:      0,\n  \t\t\t\t\t\t\t\tContainerPort: 80,\n  \t\t\t\t\t\t\t\tProtocol:      \"TCP\",\n  \t\t\t\t\t\t\t\tHostIP:        \"\",\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\t{\n  \t\t\t\t\t\t\t\tName:          \"https\",\n- \t\t\t\t\t\t\t\tHostPort:      443,\n+ \t\t\t\t\t\t\t\tHostPort:      0,\n  \t\t\t\t\t\t\t\tContainerPort: 443,\n  \t\t\t\t\t\t\t\tProtocol:      \"TCP\",\n  \t\t\t\t\t\t\t\tHostIP:        \"\",\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\t{\n  \t\t\t\t\t\t\t\tName:          \"metrics\",\n- \t\t\t\t\t\t\t\tHostPort:      1936,\n+ \t\t\t\t\t\t\t\tHostPort:      0,\n  \t\t\t\t\t\t\t\tContainerPort: 1936,\n  \t\t\t\t\t\t\t\tProtocol:      \"TCP\",\n  \t\t\t\t\t\t\t\tHostIP:        \"\",\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t},\n  \t\t\t\t\t\tEnvFrom:       nil,\n  \t\t\t\t\t\tEnv:           {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...},\n  \t\t\t\t\t\tResources:     {Requests: {s\"cpu\": {i: {...}, s: \"100m\", Format: \"DecimalSI\"}, s\"memory\": {i: {...}, Format: \"BinarySI\"}}},\n  \t\t\t\t\t\tVolumeMounts:  {{Name: \"default-certificate\", ReadOnly: true, MountPath: \"/etc/pki/tls/private\"}, {Name: \"service-ca-bundle\", ReadOnly: true, MountPath: \"/var/run/configmaps/service-ca\"}, {Name: \"stats-auth\", ReadOnly: true, MountPath: \"/var/lib/haproxy/conf/metrics-auth\"}, {Name: \"metrics-certs\", ReadOnly: true, MountPath: \"/etc/pki/tls/metrics-certs\"}},\n  \t\t\t\t\t\tVolumeDevices: nil,\n  \t\t\t\t\t\tLivenessProbe: &v1.Probe{\n  \t\t\t\t\t\t\tProbeHandler: 
v1.ProbeHandler{\n  \t\t\t\t\t\t\t\tExec: nil,\n  \t\t\t\t\t\t\t\tHTTPGet: &v1.HTTPGetAction{\n  \t\t\t\t\t\t\t\t\tPath:        \"/healthz\",\n  \t\t\t\t\t\t\t\t\tPort:        {IntVal: 1936},\n  \t\t\t\t\t\t\t\t\tHost:        \"localhost\",\n- \t\t\t\t\t\t\t\t\tScheme:      \"HTTP\",\n+ \t\t\t\t\t\t\t\t\tScheme:      \"\",\n  \t\t\t\t\t\t\t\t\tHTTPHeaders: nil,\n  \t\t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\t\tTCPSocket: nil,\n  \t\t\t\t\t\t\t\tGRPC:      nil,\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\tInitialDelaySeconds:           0,\n  \t\t\t\t\t\t\tTimeoutSeconds:                1,\n- \t\t\t\t\t\t\tPeriodSeconds:                 10,\n+ \t\t\t\t\t\t\tPeriodSeconds:                 0,\n- \t\t\t\t\t\t\tSuccessThreshold:              1,\n+ \t\t\t\t\t\t\tSuccessThreshold:              0,\n- \t\t\t\t\t\t\tFailureThreshold:              3,\n+ \t\t\t\t\t\t\tFailureThreshold:              0,\n  \t\t\t\t\t\t\tTerminationGracePeriodSeconds: nil,\n  \t\t\t\t\t\t},\n  \t\t\t\t\t\tReadinessProbe: &v1.Probe{\n  \t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n  \t\t\t\t\t\t\t\tExec: nil,\n  \t\t\t\t\t\t\t\tHTTPGet: &v1.HTTPGetAction{\n  \t\t\t\t\t\t\t\t\tPath:        \"/healthz/ready\",\n  \t\t\t\t\t\t\t\t\tPort:        {IntVal: 1936},\n  \t\t\t\t\t\t\t\t\tHost:        \"localhost\",\n- \t\t\t\t\t\t\t\t\tScheme:      \"HTTP\",\n+ \t\t\t\t\t\t\t\t\tScheme:      \"\",\n  \t\t\t\t\t\t\t\t\tHTTPHeaders: nil,\n  \t\t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\t\tTCPSocket: nil,\n  \t\t\t\t\t\t\t\tGRPC:      nil,\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\tInitialDelaySeconds:           0,\n  \t\t\t\t\t\t\tTimeoutSeconds:                1,\n- \t\t\t\t\t\t\tPeriodSeconds:                 10,\n+ \t\t\t\t\t\t\tPeriodSeconds:                 0,\n- \t\t\t\t\t\t\tSuccessThreshold:              1,\n+ \t\t\t\t\t\t\tSuccessThreshold:       
      0,\n- \t\t\t\t\t\t\tFailureThreshold:              3,\n+ \t\t\t\t\t\t\tFailureThreshold:              0,\n  \t\t\t\t\t\t\tTerminationGracePeriodSeconds: nil,\n  \t\t\t\t\t\t},\n  \t\t\t\t\t\tStartupProbe: &v1.Probe{\n  \t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n  \t\t\t\t\t\t\t\tExec: nil,\n  \t\t\t\t\t\t\t\tHTTPGet: &v1.HTTPGetAction{\n  \t\t\t\t\t\t\t\t\tPath:        \"/healthz/ready\",\n  \t\t\t\t\t\t\t\t\tPort:        {IntVal: 1936},\n  \t\t\t\t\t\t\t\t\tHost:        \"localhost\",\n- \t\t\t\t\t\t\t\t\tScheme:      \"HTTP\",\n+ \t\t\t\t\t\t\t\t\tScheme:      \"\",\n  \t\t\t\t\t\t\t\t\tHTTPHeaders: nil,\n  \t\t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\t\tTCPSocket: nil,\n  \t\t\t\t\t\t\t\tGRPC:      nil,\n  \t\t\t\t\t\t\t},\n  \t\t\t\t\t\t\tInitialDelaySeconds:           0,\n  \t\t\t\t\t\t\tTimeoutSeconds:                1,\n  \t\t\t\t\t\t\tPeriodSeconds:                 1,\n- \t\t\t\t\t\t\tSuccessThreshold:              1,\n+ \t\t\t\t\t\t\tSuccessThreshold:              0,\n  \t\t\t\t\t\t\tFailureThreshold:              120,\n  \t\t\t\t\t\t\tTerminationGracePeriodSeconds: nil,\n  \t\t\t\t\t\t},\n  \t\t\t\t\t\tLifecycle:              nil,\n  \t\t\t\t\t\tTerminationMessagePath: \"/dev/termination-log\",\n  \t\t\t\t\t\t... // 6 identical fields\n  \t\t\t\t\t},\n  \t\t\t\t},\n  \t\t\t\tEphemeralContainers: nil,\n  \t\t\t\tRestartPolicy:       \"Always\",\n  \t\t\t\t... // 31 identical fields\n  \t\t\t},\n  \t\t},\n  \t\tStrategy:        {Type: \"RollingUpdate\", RollingUpdate: &{MaxUnavailable: &{Type: 1, StrVal: \"25%\"}, MaxSurge: &{}}},\n  \t\tMinReadySeconds: 30,\n  \t\t... // 3 identical fields\n  \t},\n  \tStatus: {ObservedGeneration: 1, Replicas: 2, UpdatedReplicas:
2, UnavailableReplicas: 2, ...},\n  }\n"}

Note the following in the diff:

  Ports: []v1.ContainerPort{
          {
                  Name:          \"http\",
-                 HostPort:      80,
+                 HostPort:      0,
                  ContainerPort: 80,
                  Protocol:      \"TCP\",
                  HostIP:        \"\",
          },
          {
                  Name:          \"https\",
-                 HostPort:      443,
+                 HostPort:      0,
                  ContainerPort: 443,
                  Protocol:      \"TCP\",
                  HostIP:        \"\",
          },
          {
                  Name:          \"metrics\",
-                 HostPort:      1936,
+                 HostPort:      0,
                  ContainerPort: 1936,
                  Protocol:      \"TCP\",
                  HostIP:        \"\",
          },
  },

Expected results

The operator should ignore updates by the API that only set default values. The operator should not perform these unnecessary updates to the router deployment.
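
One way to ignore API defaulting is to apply the same defaults to both the current and the desired object before comparing them. The following is a minimal sketch of that idea in Python, using plain dictionaries in place of the Deployment types. It is not the ingress operator's code, the function names are hypothetical, and it models only the hostPort defaulting described above.

```python
import copy


def apply_api_defaults(pod_spec):
    """Return a copy of pod_spec with the hostPort defaulting applied."""
    spec = copy.deepcopy(pod_spec)
    if spec.get("hostNetwork"):
        for container in spec.get("containers", []):
            for port in container.get("ports", []):
                # An unset (or zero) hostPort is defaulted to containerPort.
                if not port.get("hostPort"):
                    port["hostPort"] = port["containerPort"]
    return spec


def needs_update(current, desired):
    """Report a change only if the specs differ after defaulting both."""
    return apply_api_defaults(current) != apply_api_defaults(desired)
```

With this comparison, a current spec that carries hostPort: 80 and a desired spec that leaves hostPort unset are treated as equal, so no update would be issued.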

Description of problem:

oc-mirror fails on the arm64 platform with the following error:
Rendering catalog image "ec2-18-224-73-36.us-east-2.compute.amazonaws.com:5000/arm/home/ec2-user/ocmtest/oci-multi-index:1fb06f" with file-based catalog 
Rendering catalog image "ec2-18-224-73-36.us-east-2.compute.amazonaws.com:5000/arm/redhat/community-operator-index:v4.13" with file-based catalog 
error: error rebuilding catalog images from file-based catalogs: error regenerating the cache for ec2-18-224-73-36.us-east-2.compute.amazonaws.com:5000/arm/redhat/community-operator-index:v4.13: fork/exec /home/ec2-user/ocmtest/oc-mirror-workspace/src/catalogs/registry.redhat.io/redhat/community-operator-index/v4.13/bin/usr/bin/registry/opm: exec format error

Version-Release number of selected component (if applicable):


How reproducible:

always 

Steps to Reproduce:

1. Clone the repo on an arm64 cluster and build oc-mirror.
2. Copy the catalog index to the local host:
`skopeo copy --all  --format oci  docker://registry.redhat.io/redhat/redhat-operator-index:v4.13 oci:///home/ec2-user/ocmtest/oci-multi-index  --remove-signatures`
3. Run the oc-mirror command with the following ImageSetConfiguration:
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
archiveSize: 16
mirror:
  operators:
  - catalog: oci:///home/ec2-user/ocmtest/oci-multi-index
    full: false # only mirror the latest versions
    packages:
    - name: cluster-logging
  - catalog: registry.redhat.io/redhat/community-operator-index:v4.13
    full: false # only mirror the latest versions
    packages:
    - name: namespace-configuration-operator
`oc-mirror --config config-413.yaml docker://xxxx:5000/arm --dest-skip-tls` 

Expected results:

oc-mirror completes successfully with no errors.

After installation with the assisted installer, the cluster contains BareMetalHost CRs (in the 'unmanaged' state) generated by assisted. These CRs include HardwareDetails data captured from the assisted-installer-agent.
Likely due to misleading documentation in Metal³ (since fixed by https://github.com/metal3-io/baremetal-operator/pull/657), the name field of storage devices is set to a name like sda instead of what Metal³'s own inspection would set it to, which is /dev/sda. This field is meant to be round-trippable to the rootDeviceHints, and as things stand it is not.
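
The mismatch can be illustrated with a small normalization step that turns a bare device name into the /dev path form. This is only a sketch in Python with a hypothetical helper name; it assumes every name is either a bare kernel name such as sda or already a /dev path, and it is not the code used by assisted or Metal³.

```python
def normalize_device_name(name):
    """Return the /dev path form of a storage device name."""
    # Names that are already full paths are returned unchanged.
    if name.startswith("/dev/"):
        return name
    return "/dev/" + name
```

A name normalized this way has the same form that Metal³'s own inspection reports, which is the form the paragraph above expects for round-tripping to rootDeviceHints.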

Description of problem:

Due to https://github.com/openshift/cluster-monitoring-operator/pull/1986, the prometheus-operator was instructed to inject the app.kubernetes.io/part-of: openshift-monitoring label (via its --labels option) to resources it creates.

The label is also 

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Upgrade to a 4.14 version that includes the change from https://github.com/openshift/cluster-monitoring-operator/pull/1986

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

We should avoid recreating the statefulset as this leads to downtime (for Prometheus, both Pods are recreated)

Additional info:

 
  • See why prometheus-operator doesn't use cascade=orphan for deletion (to keep the Pods around and avoid downtime)
  • Maybe other statefulsets are recreated as well (alertmanager etc.), maybe removing the --labels option will fix it for all of them (they are all created by the operator)
  • See if we touched the matchLabels of other statefulsets outside the control of the prom operator
  • See if we can add an origin test to make sure StatefulSets (and maybe other resources as well) are not recreated. Can we really live with that? What if we really want to change an immutable field? Maybe in origin we can specify upgrade versions?

When we set the  k8s.ovn.org/node-primary-ifaddr annotation on the node, we simply take the first valid IP address we find on the node gateway. We exclude link-local addresses and those in internally reserved subnets (https://github.com/openshift/ovn-kubernetes/pull/1386). 

Now, we might have more than one "valid" IP address on the gateway, as observed in:
 https://bugzilla.redhat.com/show_bug.cgi?id=2081390#c11 , https://bugzilla.redhat.com/show_bug.cgi?id=2081390#c14

For instance, taken from a different cluster than in the linked BZ:

sh-4.4# ip a show br-ex
7: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 00:52:12:af:f3:53 brd ff:ff:ff:ff:ff:ff
inet6 fd69::2/125 scope global dadfailed tentative <---- masquerade IP, excluded
valid_lft forever preferred_lft forever
inet6 fd2e:6f44:5dd8:c956::4/128 scope global nodad deprecated <--- real node IP, included
valid_lft forever preferred_lft 0sec
inet6 fd2e:6f44:5dd8:c956::17/128 scope global dynamic noprefixroute <---added by keepalive, INCLUDED!!
valid_lft 3017sec preferred_lft 3017sec
inet6 fe80::252:12ff:feaf:f353/64 scope link noprefixroute <--- link local, excluded
valid_lft forever preferred_lft forever

Above we have fd2e:6f44:5dd8:c956::17/128, which is the ingress LB VIP added by keepalived.

We don't currently distinguish in the code between the node IP as in node.spec.IP and other IPs that might be added to br-ex by other components. 

Would it be a good idea to just set the node primary address annotation to match node.spec.IP?
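
The difference between the two approaches can be sketched as follows. This is a Python illustration using the standard ipaddress module, not the ovn-kubernetes code. The function names and the node_ip parameter are assumptions, and the excluded subnet fd69::/125 is taken from the masquerade address in the output above.

```python
import ipaddress


def valid_addresses(cidrs, excluded_subnets):
    """Filter out link-local addresses and those in excluded subnets."""
    nets = [ipaddress.ip_network(s) for s in excluded_subnets]
    result = []
    for cidr in cidrs:
        ip = ipaddress.ip_interface(cidr).ip
        if ip.is_link_local:
            continue
        if any(ip.version == n.version and ip in n for n in nets):
            continue
        result.append(ip)
    return result


def primary_address(cidrs, excluded_subnets, node_ip=None):
    """Prefer the known node IP; otherwise fall back to the first valid one."""
    candidates = valid_addresses(cidrs, excluded_subnets)
    if node_ip is not None:
        wanted = ipaddress.ip_address(node_ip)
        if wanted in candidates:
            return wanted
    return candidates[0] if candidates else None
```

Without node_ip, the result depends on the order in which addresses are listed on the gateway, so a VIP listed first would be chosen. Passing the node's own address makes the choice independent of that order.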

Description of problem:

If you check the Ironic API logs from a bootstrap VM, you'll see that terraform is making several GET requests per second. This is far too frequent: bare metal machine states do not change that fast, not even under virtual emulation.

2023-03-01 12:37:38.234 1 INFO eventlet.wsgi.server [None req-c5628ecb-c94c-4b7c-95b3-2ee933ba850b - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200  len: 3659 time: 0.0060174
2023-03-01 12:37:38.240 1 INFO eventlet.wsgi.server [None req-275e077e-8ec7-43a9-8948-e1d39b46b331 - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200  len: 3659 time: 0.0056679
2023-03-01 12:37:38.246 1 INFO eventlet.wsgi.server [None req-0d867822-fcff-4ba0-8773-37415b3f532f - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200  len: 3659 time: 0.0056052
2023-03-01 12:37:38.252 1 INFO eventlet.wsgi.server [None req-7e64cb21-869e-4a98-ad18-54adb6e5dec5 - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200  len: 3659 time: 0.0055907
2023-03-01 12:37:38.258 1 INFO eventlet.wsgi.server [None req-de9995a8-9201-47b0-aa40-505e39b48279 - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200  len: 3659 time: 0.0055318
2023-03-01 12:37:38.265 1 INFO eventlet.wsgi.server [None req-9e969582-0388-4e47-ad5b-966e1fd2a6da - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200  len: 3659 time: 0.0059781
2023-03-01 12:37:38.354 1 INFO eventlet.wsgi.server [None req-84fad0b8-2a28-476e-90c9-ebb6a9cda833 - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200  len: 3659 time: 0.0884116
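
For comparison, a polling loop that waits between requests and backs off could look like the following minimal sketch. It is written in Python for illustration and is not the terraform provider's code; the function name, the interval values and the injected sleep and clock parameters are all assumptions.

```python
import time


def wait_for_state(get_state, target, interval=5.0, max_interval=60.0,
                   timeout=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns target, sleeping between attempts."""
    deadline = clock() + timeout
    delay = interval
    while True:
        state = get_state()
        if state == target:
            return state
        if clock() + delay > deadline:
            raise TimeoutError(f"state {state!r} did not reach {target!r}")
        sleep(delay)
        # Double the wait after each miss, up to max_interval.
        delay = min(delay * 2, max_interval)
```

With these example defaults, a node that takes several minutes to change state would be queried a handful of times rather than several times per second.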

Description of problem:

Currently, the Knative Routes Details page does not show the URL of the Route.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Install Knative Serving (Serverless Operator)
2. Create a SF from the Add Page.
3. Navigate to the Knative Routes Details page

Actual results:

No URL is shown

Expected results:

URL should be shown

Additional info:

Images: https://drive.google.com/drive/folders/13Ya0mFhDrgFIrVcq6DaLyOxZbatz82Al?usp=share_link

Description of problem:

When using the agent-based installer to provision OCP, validation failed with the following message:
"id": "sufficient-installation-disk-speed"
"status": "failure"
"message": "While preparing the previous installation the installation disk speed measurement failed or was found to be insufficient"


Version-Release number of selected component (if applicable):

4.13.0
{
  "versions": {
    "assisted-installer": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3a8b33263729ab42c0ff29b9d5e8b767b7b1a9b31240c592fa8d173463fb04d1",
    "assisted-installer-controller": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ce3e2e4aac617077ac98b82d9849659595d85cd31f17b3213da37bc5802b78e1",
    "assisted-installer-service": "Unknown",
    "discovery-agent": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70397ac41dffaa5f3333c00ac0c431eff7debad9177457a038b6e8c77dc4501a"
  }
}

How reproducible:

100%

Steps to Reproduce:

1. Use the agent-based installer to provision a Dell 16G server
2. 
3.

Actual results:

Validation failed with "sufficient-installation-disk-speed"

Expected results:

Validation passes

Additional info:

[root@c2-esx02 bin]# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0 125.7G  0 loop /var/lib/containers/storage/overlay
                                      /var
                                      /etc
                                      /run/ephemeral
loop1         7:1    0   934M  0 loop /usr
                                      /boot
                                      /
                                      /sysroot
nvme1n1     259:0    0   1.5T  0 disk
nvme0n1     259:2    0 894.2G  0 disk
├─nvme0n1p1 259:6    0     2M  0 part
├─nvme0n1p2 259:7    0    20M  0 part
├─nvme0n1p3 259:8    0  93.1G  0 part
├─nvme0n1p4 259:9    0 701.9G  0 part
└─nvme0n1p5 259:10   0  99.2G  0 part
nvme2n1     259:3    0   1.5T  0 disk
nvme4n1     259:4    0   1.5T  0 disk
nvme3n1     259:5    0   1.5T  0 disk
[root@c2-esx02 bin]# ls -lh /dev |grep nvme
crw-------.   1 root root    239,     0 Jun 12 06:01 nvme0
-rw-r--r--.   1 root root          4.0M Jun 12 06:04 nvme0c0n1
brw-rw----.   1 root disk    259,     2 Jun 12 06:01 nvme0n1
brw-rw----.   1 root disk    259,     6 Jun 12 06:01 nvme0n1p1
brw-rw----.   1 root disk    259,     7 Jun 12 06:01 nvme0n1p2
brw-rw----.   1 root disk    259,     8 Jun 12 06:01 nvme0n1p3
brw-rw----.   1 root disk    259,     9 Jun 12 06:01 nvme0n1p4
brw-rw----.   1 root disk    259,    10 Jun 12 06:01 nvme0n1p5
crw-------.   1 root root    239,     1 Jun 12 06:01 nvme1
brw-rw----.   1 root disk    259,     0 Jun 12 06:01 nvme1n1
crw-------.   1 root root    239,     2 Jun 12 06:01 nvme2
brw-rw----.   1 root disk    259,     3 Jun 12 06:01 nvme2n1
crw-------.   1 root root    239,     3 Jun 12 06:01 nvme3
brw-rw----.   1 root disk    259,     5 Jun 12 06:01 nvme3n1
crw-------.   1 root root    239,     4 Jun 12 06:01 nvme4
brw-rw----.   1 root disk    259,     4 Jun 12 06:01 nvme4n1
[root@c2-esx02 bin]# lsblk -f nvme0c0n1
lsblk: nvme0c0n1: not a block device
[root@c2-esx02 bin]# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-CN0WW56VFCP0033900HU -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part4 -> ../../nvme0n1p4
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part5 -> ../../nvme0n1p5
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-PHAB112600291P9SGN -> ../../nvme3n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-PHAB115400P81P9SGN -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-PHAB120401CP1P9SGN -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-PHAB124501MF1P9SGN -> ../../nvme4n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part4 -> ../../nvme0n1p4
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part5 -> ../../nvme0n1p5
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_Ent_NVMe_P5600_MU_U.2_1.6TB_PHAB112600291P9SGN -> ../../nvme3n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_Ent_NVMe_P5600_MU_U.2_1.6TB_PHAB115400P81P9SGN -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_Ent_NVMe_P5600_MU_U.2_1.6TB_PHAB120401CP1P9SGN -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_Ent_NVMe_P5600_MU_U.2_1.6TB_PHAB124501MF1P9SGN -> ../../nvme4n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.0050434209000001 -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part4 -> ../../nvme0n1p4
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part5 -> ../../nvme0n1p5
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.01000000000000005cd2e44e7a445351 -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.01000000000000005cd2e48f14515351 -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.01000000000000005cd2e49d3e605351 -> ../../nvme4n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.01000000000000005cd2e4fd973e5351 -> ../../nvme3n1
[root@c2-esx02 bin]# ls -l /dev/disk/by-path
total 0
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:01:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part4 -> ../../nvme0n1p4
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part5 -> ../../nvme0n1p5
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:c3:00.0-nvme-1 -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:c4:00.0-nvme-1 -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:c5:00.0-nvme-1 -> ../../nvme3n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:c6:00.0-nvme-1 -> ../../nvme4n1

Description of problem:

The timeout for calls to the CSI driver from both the external csi-provisioner and csi-attacher is 15 seconds by default. However, hotplugging a volume into the virtual machine can take up to a minute (sometimes more). This causes the context timeout to expire, which in some cases corrupts the bookkeeping of which volumes are attached, so detaching the volumes is not always handled properly afterwards.

Version-Release number of selected component (if applicable):


How reproducible:

Run the standard CSI conformance tests against the CSI driver. In most runs, this issue appears as one or two random failed tests. The tests fail because the deletion of the persistent volume never happens.

Because of this, we cannot get a good signal on the state of the CSI driver.

Steps to Reproduce:

1.
2.
3.

Actual results:

Random failed tests of the csi conformance suite.

Expected results:

csi conformance suite passes

Additional info:

Fixed in upstream by increasing the timeouts to 3 minutes instead of 15 seconds.
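
The failure mode described above can be illustrated with a small simulation. This is only a sketch, not the CSI sidecar code: `slow_attach` and `call_with_deadline` are hypothetical names, and the durations are scaled down from seconds and minutes to fractions of a second. It shows that when the caller's deadline is shorter than the operation, the caller records a timeout while the operation itself still completes, which is how the two sides can end up disagreeing about what is attached.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_attach(op_seconds):
    """Hypothetical stand-in for a hotplug that takes op_seconds to finish."""
    time.sleep(op_seconds)
    return "attached"


def call_with_deadline(op_seconds, timeout_seconds):
    """Return (what the caller observed, what actually happened)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_attach, op_seconds)
    try:
        caller_view = future.result(timeout=timeout_seconds)
    except FutureTimeout:
        caller_view = "deadline exceeded"
    # The caller's timeout does not cancel the operation; it still completes.
    actual = future.result()
    pool.shutdown()
    return caller_view, actual


if __name__ == "__main__":
    # Deadline shorter than the operation: the two views disagree.
    print(call_with_deadline(op_seconds=0.3, timeout_seconds=0.05))
    # Deadline longer than the operation: the two views agree.
    print(call_with_deadline(op_seconds=0.05, timeout_seconds=1.0))
```

Raising the deadline above the worst-case operation time, as the upstream fix does, removes the window in which the two views can diverge.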

Description of problem:

After adding a FailureDomain topology as a day-2 operation, provisioning fails with ProvisioningFailed due to: error generating accessibility requirements: no topology key found on CSINode ocp-storage-fxsc6-worker-0-fb977

Version-Release number of selected component (if applicable):

pre-merge payload with opt-in CSIMigration PRs

How reproducible:

2/2

Steps to Reproduce:

1. Install the cluster without specifying failureDomains (so one is generated by the installer).
2. Add a new failureDomain to test topology, and make sure all related resources (datacenter and ClusterComputeResource) are tagged in vSphere.
3. Create a PVC; provisioning fails:
Warning ProvisioningFailed 80m (x14 over 103m) csi.vsphere.vmware.com_ocp-storage-fxsc6-master-0_a18e2651-6455-42b2-abc2-b3b3d197da56 failed to provision volume with StorageClass "thin-csi": error generating accessibility requirements: no topology key found on CSINode ocp-storage-fxsc6-worker-0-fb977

4. Here is the node label and csinode info 
$ oc get node ocp-storage-fxsc6-worker-0-b246w --show-labels 
NAME STATUS ROLES AGE VERSION LABELS 
ocp-storage-fxsc6-worker-0-b246w Ready worker 8h v1.26.3+2727aff beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-storage-fxsc6-worker-0-b246w,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos 
$ oc get csinode ocp-storage-fxsc6-worker-0-b246w -ojson | jq .spec.drivers[].topologyKeys 
null 

5. Other logs:
The only relevant entry I found is in csi-driver-controller-8597f567f8-4f8z6:
{"level":"info","time":"2023-04-17T10:30:13.352999527Z","caller":"k8sorchestrator/topology.go:326","msg":"failed to retrieve tags for category \"cns.vmware.topology-preferred-datastores\". Reason: GET https://ocp-storage.vmc.qe.devcluster.openshift.com:443/rest/com/vmware/cis/tagging/category/id:cns.vmware.topology-preferred-datastores: 404 Not Found","TraceId":"573c3fc8-e6cf-4594-8154-07bd514fcb46"}

In the vpd pod, the tag check passed:
I0417 11:05:02.711093 1 util.go:110] Looking for CC: workloads-02
I0417 11:05:02.766516 1 zones.go:168] ClusterComputeResource: ClusterComputeResource:domain-c5265 @ /OCP-DC/host/workloads-02
I0417 11:05:02.766622 1 zones.go:64] Validating tags for ClusterComputeResource:domain-c5265.
I0417 11:05:02.813568 1 zones.go:81] Processing attached tags
I0417 11:05:02.813678 1 zones.go:90] Found Region: region-A
I0417 11:05:02.813721 1 zones.go:96] Found Zone: zone-B
I0417 11:05:02.834718 1 util.go:110] Looking for CC: qe-cluster/workloads-03
I0417 11:05:02.844475 1 reflector.go:559] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Watch close - *v1.ConfigMap total 7 items received
I0417 11:05:02.890279 1 zones.go:168] ClusterComputeResource: ClusterComputeResource:domain-c9002 @ /OCP-DC/host/qe-cluster/workloads-03
I0417 11:05:02.890406 1 zones.go:64] Validating tags for ClusterComputeResource:domain-c9002.
I0417 11:05:02.946720 1 zones.go:81] Processing attached tags
I0417 11:05:02.946871 1 zones.go:96] Found Zone: zone-C
I0417 11:05:02.946917 1 zones.go:90] Found Region: region-A
I0417 11:05:02.946965 1 vsphere_check.go:242] CheckZoneTags passed
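
The `jq` check in step 4 can also be run over the full CSINode JSON to list every driver that reports no topology keys. The following is a minimal sketch under stated assumptions: `drivers_missing_topology` is a hypothetical helper, the sample object is hand-written to mirror the `null` result above, and the `nodeID` value and the `example.com/zone` key are made up for illustration.

```python
import json


def drivers_missing_topology(csinode):
    """Return the names of drivers in a CSINode object with no topologyKeys."""
    drivers = csinode.get("spec", {}).get("drivers") or []
    return [d.get("name", "<unnamed>") for d in drivers if not d.get("topologyKeys")]


# Hand-written sample mirroring the `null` seen in step 4 (nodeID is made up).
sample = json.loads("""
{
  "kind": "CSINode",
  "metadata": {"name": "ocp-storage-fxsc6-worker-0-b246w"},
  "spec": {
    "drivers": [
      {"name": "csi.vsphere.vmware.com", "nodeID": "hypothetical-id", "topologyKeys": null}
    ]
  }
}
""")

if __name__ == "__main__":
    print(drivers_missing_topology(sample))
```

In practice the input would come from `oc get csinode <node> -ojson`; an empty list would indicate that every driver on the node has registered topology keys.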

Actual results:

Provisioning failed.

Expected results:

Provisioning should succeed.

Additional info:

 

Please review the following PR: https://github.com/openshift/bond-cni/pull/52

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

If a custom API server certificate is added as per documentation[1], but the secret name is wrong and points to a non-existing secret, the following happens:
- The kube-apiserver config is rendered with some of the namedCertificates pointing to /etc/kubernetes/static-pod-certs/secrets/user-serving-cert-000/
- As the secret in apiserver/cluster object is wrong, no user-serving-cert-000 secret is generated, so the /etc/kubernetes/static-pod-certs/secrets/user-serving-cert-000/ does not exist (and may be automatically removed if manually created).
- The combination of the 2 points above causes kube-apiserver to start crash-looping because its config points to non-existent certificates.

This is a cluster-kube-apiserver-operator bug, because the operator should validate that the specified secret exists and, if it doesn't, report Degraded and do nothing, rather than render inconsistent configuration.

Version-Release number of selected component (if applicable):

First found in 4.11.13, but also reproduced in the latest nightly build.

How reproducible:

Always

Steps to Reproduce:

1. Set up a named certificate pointing to a secret that doesn't exist.
2.
3.

Actual results:

Inconsistent configuration that points to a non-existent secret. The kube-apiserver pod crash-loops.

Expected results:

The Cluster Kube API Server Operator detects that the secret is wrong, does nothing, and only reports itself as degraded with a meaningful message so the user can fix it. No kube-apiserver pod crash-looping.
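
The expected behavior can be sketched as a validate-before-render check. This is an illustration only, not the operator's actual code: `plan_named_certificates`, its input shape, and the returned fields are all hypothetical.

```python
def plan_named_certificates(named_certs, existing_secrets):
    """Decide whether named certificates may be rendered into the config.

    named_certs: list of dicts, each with a "secret_name" key (hypothetical shape).
    existing_secrets: collection of secret names that currently exist.
    """
    referenced = {cert["secret_name"] for cert in named_certs}
    missing = sorted(referenced - set(existing_secrets))
    if missing:
        # Do not touch the rendered config; only surface a Degraded condition.
        return {
            "degraded": True,
            "render": False,
            "message": "referenced serving-cert secret(s) not found: " + ", ".join(missing),
        }
    return {"degraded": False, "render": True, "message": ""}


if __name__ == "__main__":
    certs = [{"names": ["api.example.com"], "secret_name": "custom-api-cert"}]
    print(plan_named_certificates(certs, existing_secrets=[]))
    print(plan_named_certificates(certs, existing_secrets=["custom-api-cert"]))
```

The point of the design is ordering: the existence check happens before any configuration is rendered, so a wrong secret name never reaches the kube-apiserver static pod.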

Additional info:

Once the kube-apiserver is broken, even if the apiserver/cluster object is fixed, a manual workaround usually has to be applied on the crash-looping master. An example of a workaround that works is [2], even though that KB article was written for another bug with a different root cause.

References:

[1] - https://docs.openshift.com/container-platform/4.11/security/certificates/api-server.html#api-server-certificates
[2] - https://access.redhat.com/solutions/4893641

The ability to schedule workloads on master nodes is currently exposed via the REST API as a boolean Cluster property "schedulable_masters". For the Kubernetes API, we should align with other OpenShift APIs and have a boolean property in the ACM Spec called mastersSchedulable.

Description of problem:

[performance] Checking IRQBalance settings Verify GloballyDisableIrqLoadBalancing Spec field [test_id:36150] Verify that IRQ load balancing is enabled/disabled correctly

[rfe_id:27368][performance] Pre boot tuning adjusted by tuned [test_id:35363][crit:high][vendor:cnf-qe@redhat.com][level:acceptance] stalld daemon is running on the host

[rfe_id:27363][performance] CPU Management Verification of cpu manager functionality Verify CPU usage by stress PODs [test_id:27492] Guaranteed POD should work on isolated cpu

These tests often fail in 4.13 and 4.14 upstream CI jobs:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-telco5g-cnftests/1669344976506458112/artifacts/e2e-telco5g-cnftests/telco5g-cnf-tests/artifacts/test_results.html


Version-Release number of selected component (if applicable):

4.14 4.13

How reproducible:

CI job

Steps to Reproduce:

CI job

Actual results:

failures

Expected results:

pass

Additional info:

https://snapshots.raintank.io/dashboard/snapshot/6sZ1uBR5P1O1gknyxebPQPtEo7RVEu0C
history and pass/fail ratio

Description of problem:

Update the VScode extension link to https://marketplace.visualstudio.com/items?itemName=redhat.vscode-openshift-connector

 

And change the description to 

The OpenShift Serverless Functions support in the VSCode IDE extension enables developers to effortlessly create, build, run, invoke and deploy serverless functions on OpenShift, providing a seamless development experience within the familiar VSCode environment.

This is a clone of issue OCPBUGS-19019. The following is the description of the original issue:

When using metal-ipi with okd-scos, Ironic fails to provision nodes.

Description of problem:

I have completed an OCP installation with 3 masters and 2 workers.
However, after running the command below, I could not find the mastersSchedulable parameter in any file in the manifest directory.
$ openshift-install agent create cluster-manifests  --log-level debug --dir kni

I used this installer:
https://github.com/openshift/installer/releases/tag/agent-installer-v4.11.0-dev-preview-2

Version-Release number of selected component (if applicable):

 

How reproducible:

By executing the installer

Steps to Reproduce:

1. Download the installer
2. Run: openshift-install agent create cluster-manifests  --log-level debug --dir kni

Actual results:

There is no mastersSchedulable parameter

Expected results:

Some file (like cluster-scheduler-02-config.yml) has the mastersSchedulable parameter
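
The check described in this report (searching every generated file for the parameter) can be sketched as follows. `files_mentioning` is a hypothetical helper, and the demo runs against a throwaway directory with made-up file contents rather than real installer output.

```python
import tempfile
from pathlib import Path


def files_mentioning(directory, needle="mastersSchedulable"):
    """Return relative paths of files under directory whose text contains needle."""
    root = Path(directory)
    hits = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and needle in path.read_text(errors="ignore"):
            hits.append(path.relative_to(root).as_posix())
    return hits


# Demo on a temporary directory; the file contents here are invented.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cluster-scheduler-02-config.yml").write_text(
        "spec:\n  mastersSchedulable: true\n"
    )
    (Path(tmp) / "other.yaml").write_text("kind: Example\n")
    demo_hits = files_mentioning(tmp)
    print(demo_hits)
```

Pointing the helper at the `--dir` used in step 2 (for example `kni`) would return an empty list in the failing case reported here.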

Additional info:

 

Description of the problem:

In BE 2.16.0, try to install a new cluster with ignore-validation enabled ({"host-validation-ids": "[\"all\"]", "cluster-validation-ids": "[\"all\"]"}) and one host with insufficient disk space (18GB). Installation starts, but after waiting 20 minutes, the cluster goes back to draft status without any event.

How reproducible:

100%

Steps to reproduce:

1. Create a new multi cluster and configure one of the hosts to have an 18GB disk (the minimum requirement is 20GB)

2. Enable ignore-validations by:

curl -X 'PUT' \
  'http://api.openshift.com/api/assisted-install/v2/clusters/eaffbd37-2a0b-42b2-a706-ad5b23ff17a3/ignored-validations' \
  --header "Authorization: Bearer $(ocm token)" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "ignored_host_validations": "[\"all\"]",
  "ignored_cluster_validations": "[\"all\"]"
}'
 

3. Start the installation. The cluster is stuck on prepare-for-installation for 20 minutes and then moves to draft with no event explaining the reason.
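
Note that the request body in step 2 carries each list as a JSON-encoded string inside the outer JSON object, which is why the quotes are escaped. A minimal sketch of building that body follows; `ignored_validations_body` is a hypothetical helper name, and only the two field names shown in the curl command are used.

```python
import json


def ignored_validations_body(host_ids, cluster_ids):
    """Build the request body with each ID list encoded as a JSON string."""
    return json.dumps(
        {
            "ignored_host_validations": json.dumps(host_ids),
            "ignored_cluster_validations": json.dumps(cluster_ids),
        }
    )


if __name__ == "__main__":
    # Prints the same double-encoded shape as the curl -d payload in step 2.
    print(ignored_validations_body(["all"], ["all"]))
```

Encoding the inner lists with `json.dumps` avoids hand-escaping the quotes, which is easy to get wrong on the command line.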

Actual results:

 

Expected results:

This issue is valid for UI and API.
For UI
If a new cluster is being created and s390x is selected as the architecture, an error message pops up when the Next button is pressed (all other necessary values are filled in correctly):

"cannot use Minimal ISO because it's not compatible with the s390x architecture on version 4.13.0-rc.3-multi of OpenShift"

There is no workaround because the matching selection (full-iso or iPXE) can only be made in the addHosts dialog.

For API
The infra env object cannot be created if the type is not set. The following error message is returned:
"cannot use Minimal ISO because it's not compatible with the s390x architecture on version 4.13.0-rc.3-multi of OpenShift"

The workaround is to set image_type to "full-iso" during infra env creation.

For the s390x architecture, the default should always be full-iso.
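
The requested defaulting rule can be sketched as below. This is an illustration, not the service's implementation: `default_image_type` is a hypothetical function, and treating "minimal-iso" as the default for other architectures is an assumption inferred from the error message above.

```python
def default_image_type(cpu_architecture, requested=None):
    """Pick the image type for an infra env.

    An explicitly requested type always wins; otherwise s390x gets "full-iso".
    The "minimal-iso" fallback for other architectures is an assumption.
    """
    if requested is not None:
        return requested
    if cpu_architecture == "s390x":
        return "full-iso"
    return "minimal-iso"


if __name__ == "__main__":
    print(default_image_type("s390x"))
    print(default_image_type("x86_64"))
```

With a rule like this, creating an infra env for s390x without setting image_type would no longer hit the Minimal ISO compatibility error.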

Please review the following PR: https://github.com/openshift/cloud-provider-nutanix/pull/12

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

An error occurs when creating the image:
FATAL failed to fetch Agent Installer ISO: failed to generate asset "Agent Installer ISO": stat /home/core/.cache/agent/files_cache/libnmstate.so.1.3.3: no such file or directory

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-04-06-060829

How reproducible:

always

Steps to Reproduce:

1. Prepare the agent-config.yaml and install-config.yaml files

2. Run 'bin/openshift-install agent create image --log-level debug'

3. The following output with errors is shown:
DEBUG extracting /usr/bin/agent-tui to /home/core/.cache/agent/files_cache, oc image extract --path /usr/bin/agent-tui:/home/core/.cache/agent/files_cache --confirm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c11d31d47db4afb03e4a4c8c40e7933981a2e3a7ef9805a1413c441f492b869b 
DEBUG Fetching image from OCP release (oc adm release info --image-for=agent-installer-node-agent --insecure=true registry.ci.openshift.org/ocp/release@sha256:83caa0a8f2633f6f724c4feb517576181d3f76b8b76438ff752204e8c7152bac) 
DEBUG extracting /usr/lib64/libnmstate.so.1.3.3 to /home/core/.cache/agent/files_cache, oc image extract --path /usr/lib64/libnmstate.so.1.3.3:/home/core/.cache/agent/files_cache --confirm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c11d31d47db4afb03e4a4c8c40e7933981a2e3a7ef9805a1413c441f492b869b 
DEBUG File /usr/lib64/libnmstate.so.1.3.3 was not found, err stat /home/core/.cache/agent/files_cache/libnmstate.so.1.3.3: no such file or directory 
ERROR failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors 
FATAL failed to fetch Agent Installer ISO: failed to generate asset "Agent Installer ISO": stat /home/core/.cache/agent/files_cache/libnmstate.so.1.3.3: no such file or directory  

Actual results:

Image generation fails.

Expected results:

The image should be generated successfully.

Additional info:

 

Description of problem:

When typing into the filter input field on the Quick Starts page, the console crashes.

Version-Release number of selected component (if applicable):

4.13.0-rc.7

How reproducible:

Always

Steps to Reproduce:

1. Go to the Quick Starts page 
2. Type something into the filter input field
3.

Actual results:

Console will crash:


TypeError
Description:
t.toLowerCase is not a function
Component trace:
at Sn (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:168364)
    at t.default (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:874032)
    at t.default (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/quick-start-chunk-274c58e3845ea0aa718b.min.js:1:202)
    at s (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:241397)
    at s (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:241397)
    at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:67583)
    at T
    at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:69628)
    at Suspense
    at i (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:450974)
    at section
    at m (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:720272)
    at div
    at div
    at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1528877)
    at div
    at div
    at c (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:545409)
    at d (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:774923)
    at div
    at d (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:458124)
    at l (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1170951)
    at https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:457833
    at S (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:98:86864)
    at main
    at div
    at v (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:264066)
    at div
    at div
    at c (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:62024)
    at div
    at div
    at c (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:545409)
    at d (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:774923)
    at div
    at d (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:458124)
    at Un (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:183620)
    at t.default (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:874032)
    at t.default (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/quick-start-chunk-274c58e3845ea0aa718b.min.js:1:1261)
    at s (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:241397)
    at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1605535)
    at ee (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623254)
    at _t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:142374)
    at ee (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623254)
    at ee (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623254)
    at ee (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623254)
    at i (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:829516)
    at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1599727)
    at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1599916)
    at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1597332)
    at te (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623385)
    at https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1626517
    at r (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:121910)
    at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:67583)
    at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:69628)
    at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:64188)
    at re (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1626828)
    at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:803496)
    at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1074899)
    at s (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:652518)
    at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:150:190871)
    at Suspense
Stack trace:
TypeError: t.toLowerCase is not a function
    at pt (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:136019)
    at Sn (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:168723)
    at na (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:58879)
    at za (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:68397)
    at Hs (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:112289)
    at xl (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:98327)
    at Cl (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:98255)
    at _l (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:98118)
    at pl (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:95105)
    at https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:44774

Expected results:

Console should work

Additional info:

 

Description of problem:

The console-operator's config file gets updated every couple of seconds, with only the `resourceVersion` field changing.
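
One way to confirm this observation is to compare two snapshots of the object while ignoring `metadata.resourceVersion`. The sketch below assumes the snapshots are already loaded as dictionaries (for example from `oc get ... -o json`); `differs_only_in_resource_version` is a hypothetical helper and the sample objects are made up.

```python
import copy


def differs_only_in_resource_version(before, after):
    """True if the two objects differ and the only difference is resourceVersion."""

    def strip(obj):
        stripped = copy.deepcopy(obj)
        stripped.get("metadata", {}).pop("resourceVersion", None)
        return stripped

    return before != after and strip(before) == strip(after)


if __name__ == "__main__":
    # Invented snapshots taken a few seconds apart.
    first = {"metadata": {"name": "cluster", "resourceVersion": "100"}, "spec": {"a": 1}}
    second = {"metadata": {"name": "cluster", "resourceVersion": "101"}, "spec": {"a": 1}}
    print(differs_only_in_resource_version(first, second))
```

A `True` result for consecutive snapshots would match the behavior reported here: writes that change nothing of substance.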

Version-Release number of selected component (if applicable):

4.14-ec-2

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The following error occurs while creating a cluster in the mon01 zone.
Job link: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-ovn-ppc64le-powervs/1680759459892170752
Error:
level=info msg=Cluster operator insights SCAAvailable is False with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {"code":"ACCT-MGMT-11","href":"/api/accounts_mgmt/v1/errors/11","id":"11","kind":"Error","operation_id":"c3773b1e-8818-4bfc-9605-dbd9dbc0c03f","reason":"Account with ID 2DUeKzzTD9ngfsQ6YgkzdJn1jA4 denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates"}
level=info msg=Cluster operator network ManagementStateDegraded is False with : 
level=error msg=Cluster operator storage Degraded is True with PowerVSBlockCSIDriverOperatorCR_PowerVSBlockCSIDriverStaticResourcesController_SyncError: PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/main_attacher_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-main-attacher-role" not found
level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/main_provisioner_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-main-provisioner-role" not found
level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/volumesnapshot_reader_provisioner_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-provisioner-volumesnapshot-reader-role" not found
level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/main_resizer_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-main-resizer-role" not found
level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/storageclass_reader_resizer_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-resizer-storageclass-reader-role" not found
level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: 

Version-Release number of selected component (if applicable):

4.14

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Expected results:

cluster creation should be successful

Additional info:

 

The cluster-kube-apiserver-operator CI, and more specifically the e2e-gcp-operator job, has been constantly failing for the past week because the test cluster ends up in a state where many requests start failing with "Unauthorized" errors.

This caused multiple operators to become degraded and tests to fail.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/1450/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-gcp-operator/1631333936435040256

Looking at the failures and a must-gather we were able to capture inside of a test cluster, it turned out that the service account issuer could be the culprit here. Because of that we opened https://issues.redhat.com/browse/API-1549.

However, it turned out that disabling TestServiceAccountIssuer didn't resolve the issue and the cluster was still too unstable for the tests to pass.

In a separate attempt we also tried disabling TestBoundTokenSignerController and this time the tests were passing. However, the cluster was still very unstable during the e2e run and the kube-apiserver-operator went degraded a couple of times: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/1455/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-gcp-operator/1632871645171421184/artifacts/e2e-gcp-operator/gather-extra/artifacts/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-5cf9d4569-m2spq_kube-apiserver-operator.log.

On top of that, instead of seeing Unauthorized errors, we are now seeing a lot of "connection refused" errors.

Description of problem:

The description for the BuildAdapter SDK extension is wrong.

Actual results:

BuildAdapter contributes an adapter to adapt element to data that can be used by Pod component

Expected results:

BuildAdapter contributes an adapter to adapt element to data that can be used by Build component

Additional info:

 

Description of problem:

Version-Release number of selected component (if applicable):
All versions?
At least on 4.12+

How reproducible:
Always

Steps to Reproduce:

  1. Open the console and click on the + sign in the top right navigation header.

This JSON works fine:

{
  "apiVersion": "v1",
  "kind": "ConfigMap",
  "metadata": {
    "generateName": "a-configmap-"
  }
}

But an array cannot be used to import multiple resources:

[
  {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {
      "generateName": "a-configmap-"
    }
  },
  {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {
      "generateName": "a-configmap-"
    }
  }
]

Fails with error: No "apiVersion" field found in YAML.

Nor can a Kubernetes List "resource" be used:

{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {
      "apiVersion": "v1",
      "kind": "ConfigMap",
      "metadata": {
        "generateName": "a-configmap-"
      }
    },
    {
      "apiVersion": "v1",
      "kind": "ConfigMap",
      "metadata": {
        "generateName": "a-configmap-"
      }
    }
  ]
}

Fails with error: The server doesn't have a resource type "kind: List, apiVersion: v1".

Actual results:
Both JSON structures could not be imported.

Expected results:
Both JSON structures work and create multiple resources.

If the JSON array contains just one item, the resource details page should be opened; otherwise, the import result page should be shown, as when the user imports a YAML with multiple resources.
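
The expected behavior can be sketched roughly as follows. This is only an illustration in Python, not the console's actual implementation; the function names `extract_resources` and `opens_details_page` are hypothetical. It accepts a single object, a bare array, or a `kind: List` wrapper and returns the individual resources.

```python
import json
from typing import Any, Dict, List


def extract_resources(text: str) -> List[Dict[str, Any]]:
    """Return the individual resources contained in a JSON import payload."""
    data = json.loads(text)
    if isinstance(data, list):
        # A bare JSON array: every element is expected to be a resource.
        items = data
    elif (isinstance(data, dict) and data.get("kind") == "List"
          and isinstance(data.get("items"), list)):
        # A Kubernetes List wrapper: unwrap its items.
        items = data["items"]
    elif isinstance(data, dict):
        # A single resource, which already works today.
        items = [data]
    else:
        raise ValueError("expected a JSON object or array")
    for item in items:
        if not isinstance(item, dict) or "apiVersion" not in item or "kind" not in item:
            raise ValueError('every resource needs "apiVersion" and "kind"')
    return items


def opens_details_page(resources: List[Dict[str, Any]]) -> bool:
    """One resource leads to its details page; several lead to the import result page."""
    return len(resources) == 1
```

With this shape, the two failing payloads above would each yield two ConfigMap resources, and a one-item array would behave like the single-object case.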

Additional info:
Found this JSON structure for example in issue OCPBUGS-4646

Description of problem:

DNS local endpoint preference is not working for TCP DNS requests in OpenShift SDN.

Reference code: https://github.com/openshift/sdn/blob/b58a257b896d774e0a092612be250fb9414af5ca/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L999-L1012

This is where the DNS request is short-circuited to the local DNS endpoint if it exists. This is important because DNS local preference protects against another outstanding bug, in which daemonset pods go stale for a few seconds upon node shutdown (see https://issues.redhat.com/browse/OCPNODE-549 for the fix for graceful node shutdown). This appears to be contributing to DNS issues in our internal CI clusters. https://lookerstudio.google.com/reporting/3a9d4e62-620a-47b9-a724-a5ebefc06658/page/MQwFD?s=kPTlddLa2AQ shows large amounts of "dns_tcp_lookup" failures, which I attribute to this bug.

UDP DNS local preference is working fine in OpenShift SDN. Both UDP and TCP local preference work fine in OVN. It's just TCP DNS local preference that is not working in OpenShift SDN.

Version-Release number of selected component (if applicable):

4.13, 4.12, 4.11

How reproducible:

100%

Steps to Reproduce:

1. oc debug -n openshift-dns
2. dig +short +tcp +vc +noall +answer CH TXT hostname.bind
# Retry multiple times, and you should always get the same local DNS pod.

Actual results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-dnbsp"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"

Expected results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8" 

Additional info:

https://issues.redhat.com/browse/OCPBUGS-488 is the previous bug I opened for UDP DNS local preference not working.

iptables-save from a 4.13 vanilla cluster bot AWS,SDN: https://drive.google.com/file/d/1jY8_f64nDWi5SYT45lFMthE0vhioYIfe/view?usp=sharing 

As a user of the HyperShift CLI, I would like to be able to set the NodePool UpgradeType through a flag when either creating a new cluster or creating a new NodePool.


DoD:

  • A flag has been added to the create new cluster command allowing the NodePool UpgradeType to be set to either Replace or InPlace
  • A flag has been added to the create new NodePool command allowing the NodePool UpgradeType to be set to either Replace or InPlace
  • If either flag is not set, the default will be Replace as that is the current default
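
As a rough illustration of the flag semantics in the DoD, here is a minimal Python sketch. The HyperShift CLI itself is not written in Python, and the flag name `--node-upgrade-type` and the program name are assumptions made only for this example.

```python
import argparse

# Allowed values and default taken from the DoD above.
UPGRADE_TYPES = ("Replace", "InPlace")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical NodePool upgrade type flag."""
    parser = argparse.ArgumentParser(prog="create-nodepool-sketch")
    parser.add_argument(
        "--node-upgrade-type",  # hypothetical flag name
        choices=UPGRADE_TYPES,
        default="Replace",  # Replace stays the default when the flag is omitted
        help="NodePool UpgradeType: Replace or InPlace",
    )
    return parser
```

Parsing an empty argument list yields `Replace`, and any value other than `Replace` or `InPlace` is rejected by the parser.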

Description of problem:

We need to update the govc version to support PR: https://github.com/openshift/release/pull/42334.
The command "govc vm.network.change -dc xxx  -vm -net xxxxx " is only supported after govc version v0.30.4; otherwise the VM cannot fetch an IP correctly.

Version-Release number of selected component (if applicable):

ocp 4.14

How reproducible:

 

Steps to Reproduce:

 

1.

 

2.

 

3.

 

Actual results:

"govc: path 'ci-segment-151'" resolves to multiple networks
If -net is specified with the network path, the result is "govc: network '/IBMCloud/host/vcs-mdcnc-workload-1/ci-segment-151' not found"

Expected results:

After the govc version update, govc vm.network.change can be used to select the unique network.

Additional info:

 

OCP 4.14.0-rc.0
advanced-cluster-management.v2.9.0-130
multicluster-engine.v2.4.0-154

After encountering https://issues.redhat.com/browse/OCPBUGS-18959

Attempted to forcefully delete the BMH by removing the finalizer.
Then deleted all the metal3 pods.

Attempted to re-create the bmh.

Result:
the bmh is stuck in

oc get bmh
NAME                                           STATE         CONSUMER   ONLINE   ERROR   AGE
hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com   registering              true             15m

seeing this entry in the BMO log:

{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"controllers.BareMetalHost","msg":"start","baremetalhost":{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"}}
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"controllers.BareMetalHost","msg":"hardwareData is ready to be deleted","baremetalhost":{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"}}
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"controllers.BareMetalHost","msg":"host ready to be powered off","baremetalhost":{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"},"provisioningState":"powering off before delete"}

{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: hard)","host":"kni-qe-65~hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com"}

{"level":"error","ts":"2023-09-13T16:15:57Z","msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","BareMetalHost":

{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"}

,"namespace":"kni-qe-65","name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","reconcileID":"167061cc-7ab4-4c4a-ae45-8c19dfc3ac22","error":"action \"powering off before delete\" failed: failed to power off before deleting node: Host not registered","errorVerbose":"Host not registered\nfailed to power off before deleting node\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionPowerOffBeforeDeleting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:493\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handlePoweringOffBeforeDelete\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:585\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:202\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg
/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598\naction \"powering off before delete\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:229\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}

Description of problem:

ovn-ipsec pods crash when the IPSec NS extension/svc is enabled on any $ROLE nodes

IPSec ext and svc were enabled for 2 workers only, and their corresponding ovn-ipsec pods are in CrashLoopBackOff (CLBO)


[root@dell-per740-36 ipsec]# oc get pods 
NAME                                       READY   STATUS             RESTARTS         AGE
dell-per740-14rhtsengpek2redhatcom-debug   1/1     Running            0                3m37s
ovn-ipsec-bptr6                            0/1     CrashLoopBackOff   26 (3m58s ago)   130m
ovn-ipsec-bv88z                            1/1     Running            0                3h5m
ovn-ipsec-pre414-6pb25                     1/1     Running            0                3h5m
ovn-ipsec-pre414-b6vzh                     1/1     Running            0                3h5m
ovn-ipsec-pre414-jzwcm                     1/1     Running            0                3h5m
ovn-ipsec-pre414-vgwqx                     1/1     Running            3                132m
ovn-ipsec-pre414-xl4hb                     1/1     Running            3                130m
ovn-ipsec-qb2bj                            1/1     Running            0                3h5m
ovn-ipsec-r4dfw                            1/1     Running            0                3h5m
ovn-ipsec-xhdpw                            0/1     CrashLoopBackOff   28 (116s ago)    132m
ovnkube-control-plane-698c9845b8-4v58f     2/2     Running            0                3h5m
ovnkube-control-plane-698c9845b8-nlgs8     2/2     Running            0                3h5m
ovnkube-control-plane-698c9845b8-wfkd4     2/2     Running            0                3h5m
ovnkube-node-l6sr5                         8/8     Running            27 (66m ago)     130m
ovnkube-node-mj8bs                         8/8     Running            27 (75m ago)     132m
ovnkube-node-p24x8                         8/8     Running            0                178m
ovnkube-node-rlpbh                         8/8     Running            0                178m
ovnkube-node-wdxbg                         8/8     Running            0                178m
[root@dell-per740-36 ipsec]# 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-12-024050

How reproducible:

Always

Steps to Reproduce:

1. Install OVN IPSec cluster (East-West)
2. Enable IPSec OS extension for North-South
3. Enable IPSec service for North-South

Actual results:

ovn-ipsec pods in CLBO state

Expected results:

All pods under ovn-kubernetes ns should be Running fine

Additional info:

Logs from one of the ovn-ipsec pods in CLBO:

# oc logs ovn-ipsec-bptr6
Defaulted container "ovn-ipsec" out of: ovn-ipsec, ovn-keys (init)
+ rpm --dbpath=/usr/share/rpm -q libreswan
libreswan-4.9-4.el9_2.x86_64
+ counter=0
+ '[' -f /etc/cni/net.d/10-ovn-kubernetes.conf ']'
+ echo 'ovnkube-node has configured node.'
ovnkube-node has configured node.
+ ip x s flush
+ ip x p flush
+ ulimit -n 1024
+ /usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig
+ /usr/libexec/ipsec/_stackmanager start
+ /usr/sbin/ipsec --checknss
+ /usr/libexec/ipsec/pluto --leak-detective --config /etc/ipsec.conf --logfile /var/log/openvswitch/libreswan.log
FATAL ERROR: /usr/libexec/ipsec/pluto: lock file "/run/pluto/pluto.pid" already exists
leak: string logger, item size: 48
leak: string logger prefix, item size: 27
leak detective found 2 leaks, total size 75

journalctl -u ipsec here: https://privatebin.corp.redhat.com/?216142833d016b3c#2Es8ACSyM3VWvwi85vTaYtSx8X3952ahxCvSHeY61UtT

The issue:

An interesting issue came up on #forum-ui-extensibility. There was an attempt to use extensions to nest a details page under a details page that contained a horizontal nav. This caused an issue with rendering the page content when a sub link was clicked – which caused confusion.

The why:

The reason this happened was the resource details page had a tab that contained a resource list page. This resource list page showed a number of items of CRs that when clicked would try to append their name onto the URL. This confused the navigation, thinking that this path must be another tab, so no tabs were selected and no content was visible. The goal was to reuse this longer path name as a details page of its own with its own horizontal nav. This issue is a conceptual misunderstanding of the way our list & details pages work in OpenShift Console.

List Pages are sometimes found via direct navigation links. List pages are almost all shown on the Search page, allowing a user to navigate to both existing nav items and other non-primary resources.

Details Pages are individual items found in the List Pages (a row). These are stand alone pages that show details of a singular CR and optionally can have tabs that list other resources – but they always transition to a fresh Details page instead of compounding on the currently visible one.

The ask:

If we could document this in a fashion that can help Plugin developers share the same UX that the rest of the Console does then we will have a more unified approach to UX within the Console and through any installed Plugins.

==> Description of problem:

"Import from git" functionality with a local Bitbucket instance does not work, due to repository validation that requires the repository to be hosted on Bitbucket Cloud. [1][2]

[1] https://github.com/openshift/console/blob/release-4.10/frontend/packages/git-service/src/services/bitbucket-service.ts#L63

[2] https://github.com/openshift/console/blob/release-4.10/frontend/packages/git-service/src/services/bitbucket-service.ts#L18

==> Version-Release number of selected component (if applicable):

Tested in OCP 4.10

==> How reproducible: 100%

==> Steps to Reproduce:
1. Go to: Developer View > Add+ > From Git
2. Fill the "Git Repo URL" field with the BitBucket repo URL (i.e. http://<bitbucket_url>/scm/<project>/<repository>.git)
3. Select BitBucket from the "Git type" dropdown

==> Actual results:
"URL is valid but cannot be reached. If this is a private repository, enter a source Secret in advanced Git options"

==> Expected results:

This functionality should also work with hosted Bitbucket

==> Additional info:

To retrieve slug information from hosted BitBucket we can query: http://<bitbucket_url>/rest/api/1.0/projects/<project>/repos/<repository>

An example:

~~~
curl -ks http://bitbucket-server-bitbucket.apps.gmeghnag.lab.cluster/rest/api/1.0/projects/test/repos/test-repo | jq
{
  "slug": "test-repo",
  "id": 1,
  "name": "test-repo",
  "hierarchyId": "28fc5c8782050b43e223",
  "scmId": "git",
  "state": "AVAILABLE",
  "statusMessage": "Available",
  "forkable": true,
  "project": {
    "key": "TEST",
    "id": 1,
    "name": "test",
    "public": false,
    "type": "NORMAL",
    "links": {
      "self": [
        { "href": "http://bitbucket-server-bitbucket.apps.gmeghnag.lab.cluster/projects/TEST" }
      ]
    }
  },
  "public": true,
  "archived": false,
  "links": {
    "clone": [
      { "href": "http://bitbucket-server-bitbucket.apps.gmeghnag.lab.cluster/scm/test/test-repo.git", "name": "http" },
      { "href": "ssh://git@bitbucket-server-bitbucket.apps.gmeghnag.lab.cluster:7999/test/test-repo.git", "name": "ssh" }
    ],
    "self": [
      { "href": "http://bitbucket-server-bitbucket.apps.gmeghnag.lab.cluster/projects/TEST/repos/test-repo/browse" }
    ]
  }
}
~~~
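
The mapping from the "Git Repo URL" form to the REST endpoint queried above could be sketched like this. It is a minimal Python illustration, not the console's git-service code; the helper name `bitbucket_server_repo_api_url` is hypothetical, and it assumes the clone URL always has the `/scm/<project>/<repository>.git` shape shown in the steps to reproduce.

```python
from urllib.parse import urlsplit


def bitbucket_server_repo_api_url(clone_url: str) -> str:
    """Map a hosted Bitbucket clone URL to its REST API 1.0 repository URL."""
    parts = urlsplit(clone_url)
    segments = [segment for segment in parts.path.split("/") if segment]
    # Assumed path layout: /scm/<project>/<repository>.git
    if len(segments) != 3 or segments[0] != "scm":
        raise ValueError("unexpected clone URL path: " + parts.path)
    project, repository = segments[1], segments[2]
    if repository.endswith(".git"):
        repository = repository[: -len(".git")]
    return "{}://{}/rest/api/1.0/projects/{}/repos/{}".format(
        parts.scheme, parts.netloc, project, repository
    )
```

For the clone URL in the response above, this produces the same URL that the `curl` example queries.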

Description of problem:

The must gather should contain additional debug information such as the current configuration and firmware settings of any Bluefields / Mellanox device when using SRIOV

Description of problem:

When the management cluster has ICSP resources, the pull reference of the Kube APIServer is replaced with a pull ref from the management cluster ICSPs resulting in a pull failure.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster with release registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-08-28-154013 on a management cluster that has ICSPs
2. Watch the kube-apiserver pods.

Actual results:

kube-apiserver pods are initially deployed with a pull ref from the release payload and they start, but then the deployment is updated with a pull ref from an ICSP mapping and the deployment fails to roll out.

Expected results:

kube-apiserver pods roll out successfully.

Additional info:

 

Description of problem:

The network-tools image stream is missing in the cluster samples. It is needed for CI tests.

When creating a deployment with `oc new-app` and using `--import-mode=PreserveOriginal`, container ports that are present in the Dockerfile do not get propagated to the deployment `spec.containers[i].ports[i].containerPort`.

On further inspection, this is because the config object that gets passed from the image to the deployment does not contain these details. The image reference in this case is a manifest-listed image, which does not contain the Docker metadata. Instead, these need to be derived from the child manifest.
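
To make the derivation concrete, here is a small Python sketch of turning a child manifest's image config into container ports. It is not the `oc` implementation; the function name `container_ports` is hypothetical, and it assumes the child image config carries the usual `config.ExposedPorts` map with keys such as `8080/tcp`.

```python
from typing import Any, Dict, List


def container_ports(image_config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Derive containerPort entries from an image config's ExposedPorts map."""
    exposed = (image_config.get("config") or {}).get("ExposedPorts") or {}
    ports = []
    for key in sorted(exposed):
        # Keys look like "8080/tcp"; the protocol part is optional.
        number, _, protocol = key.partition("/")
        ports.append({
            "containerPort": int(number),
            "protocol": (protocol or "tcp").upper(),
        })
    return ports
```

A manifest list has no such config of its own, so in this sketch the caller would first pick a child manifest and pass that child's config.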

test=[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]

Appears to be perma-failing on gcp serial jobs.

We're at the edge of our visible data, but it looks like this may have happened around July 7

Sample failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview-serial/1681814026218115072

Description of problem:

Revert "force cert rotation every couple days for development" in 4.13

Below are the steps to verify this bug:

# oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-06-25-081133|grep -i cluster-kube-apiserver-operator
  cluster-kube-apiserver-operator                https://github.com/openshift/cluster-kube-apiserver-operator                7764681777edfa3126981a0a1d390a6060a840a3

# git log --date local --pretty="%h %an %cd - %s" 776468 |grep -i "#1307"
08973b820 openshift-ci[bot] Thu Jun 23 22:40:08 2022 - Merge pull request #1307 from tkashem/revert-cert-rotation

# oc get clusterversions.config.openshift.io 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         64m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133

$ cat scripts/check_secret_expiry.sh
FILE="$1"
# Require an input file listing "<namespace> <secret>" pairs, one per line.
if [ ! -f "$FILE" ]; then
  echo "must provide \$1" && exit 1
fi
export IFS=$'\n'
for i in $(cat "$FILE")
do
  # Skip comment lines.
  if echo "$i" | grep -q "^#"; then
    continue
  fi
  NS=$(echo "$i" | cut -d ' ' -f 1)
  SECRET=$(echo "$i" | cut -d ' ' -f 2)
  # Extract tls.crt from the secret into the current directory.
  rm -f tls.crt; oc extract "secret/$SECRET" -n "$NS" --confirm > /dev/null
  echo "Check cert dates of $SECRET in project $NS:"
  openssl x509 -noout -dates -in tls.crt; echo
done

$ cat certs.txt
openshift-kube-controller-manager-operator csr-signer-signer
openshift-kube-controller-manager-operator csr-signer
openshift-kube-controller-manager kube-controller-manager-client-cert-key
openshift-kube-apiserver-operator aggregator-client-signer
openshift-kube-apiserver aggregator-client
openshift-kube-apiserver external-loadbalancer-serving-certkey
openshift-kube-apiserver internal-loadbalancer-serving-certkey
openshift-kube-apiserver service-network-serving-certkey
openshift-config-managed kube-controller-manager-client-cert-key
openshift-config-managed kube-scheduler-client-cert-key
openshift-kube-scheduler kube-scheduler-client-cert-key

Checking the certs, they have one-day expiry times, which is as expected.
# ./check_secret_expiry.sh certs.txt
Check cert dates of csr-signer-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:41:38 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT

Check cert dates of csr-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:52:21 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT

Check cert dates of kube-controller-manager-client-cert-key in project openshift-kube-controller-manager:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of aggregator-client-signer in project openshift-kube-apiserver-operator:
notBefore=Jun 27 04:41:37 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT

Check cert dates of aggregator-client in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT

Check cert dates of external-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of internal-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:49 2022 GMT
notAfter=Jul 27 04:52:50 2022 GMT

Check cert dates of service-network-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:28 2022 GMT
notAfter=Jul 27 04:52:29 2022 GMT

Check cert dates of kube-controller-manager-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of kube-scheduler-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT

Check cert dates of kube-scheduler-client-cert-key in project openshift-kube-scheduler:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT
# 

# cat check_secret_expiry_within.sh
#!/usr/bin/env bash
# usage: ./check_secret_expiry_within.sh 1day # or 15min, 2days, 2day, 2month, 1year
WITHIN=${1:-24hours}
echo "Checking validity within $WITHIN ..."
oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+$WITHIN" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before")  \(.metadata.annotations."auth.openshift.io/certificate-not-after")  \(.metadata.namespace)\t\(.metadata.name)"'

# ./check_secret_expiry_within.sh 1day
Checking validity within 1day ...
2022-06-27T04:41:37Z  2022-06-28T04:41:37Z  openshift-kube-apiserver-operator	aggregator-client-signer
2022-06-27T04:52:26Z  2022-06-28T04:41:37Z  openshift-kube-apiserver	aggregator-client
2022-06-27T04:52:21Z  2022-06-28T04:41:38Z  openshift-kube-controller-manager-operator	csr-signer
2022-06-27T04:41:38Z  2022-06-28T04:41:38Z  openshift-kube-controller-manager-operator	csr-signer-signer
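
The selection that the jq filter performs can also be sketched in Python, which may be easier to adapt. This is only an illustration under stated assumptions: the function name `expiring_within` is hypothetical, and the input is assumed to be the `items` list from `oc get secret -A -o json`, already loaded into dictionaries.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Tuple

# Annotation read by the jq filter above.
NOT_AFTER = "auth.openshift.io/certificate-not-after"


def expiring_within(secrets: List[Dict[str, Any]], window: timedelta,
                    now: datetime) -> List[Tuple[str, str]]:
    """Return (namespace, name) for secrets whose cert expires by now + window."""
    cutoff = now + window
    matches = []
    for secret in secrets:
        metadata = secret.get("metadata", {})
        not_after = (metadata.get("annotations") or {}).get(NOT_AFTER)
        if not_after is None:
            continue  # secret carries no certificate expiry annotation
        expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if expiry <= cutoff:
            matches.append((metadata.get("namespace"), metadata.get("name")))
    return matches
```

Like the jq version, secrets without the annotation are skipped rather than reported.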

Description of the problem:

In RHEL 8, the arping command (from iputils-s20180629) only returns 1 when used for duplicate address detection. In all other modes it returns 0 on success; 2 or -1 on error.

In RHEL 9, the arping command (from iputils 20210202) also returns 1 in other modes, essentially at random. (There is some kind of theory behind it, but even after multiple fixes to the logic it does not remotely work in any consistent way.)

How reproducible:

60-100% for individual arping commands

100% installation failure

Steps to reproduce:

  1. Build the agent container using RHEL 9 as the base image
  2. arping -c 10 -w 5 -I enp2s0 192.168.111.1; echo $?

Actual results:

arping returns 1

journal on the discovery ISO shows:

Jul 19 04:35:38 master-0 next_step_runne[3624]: time="19-07-2023 04:35:38" level=error msg="Error while processing 'arping' command" file="ipv4_arping_checker.go:28" error="exit status 1"

all hosts are marked invalid and install fails.

Expected results:

Ideally arping returns 0.

Failing that, we should treat both 0 and 1 as success, as previous versions of arping effectively did.
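
The fallback described above could look roughly like this. It is a Python sketch rather than the agent's actual Go checker; the names `ACCEPTED_EXIT_CODES`, `arping_exit_ok`, and `run_arping` are hypothetical, and the command line mirrors the one in the steps to reproduce.

```python
import subprocess

# Assumption from the report: exit codes 0 and 1 both count as success,
# anything else (for example 2) is an error.
ACCEPTED_EXIT_CODES = frozenset({0, 1})


def arping_exit_ok(returncode: int) -> bool:
    """Decide whether an arping exit status should be treated as success."""
    return returncode in ACCEPTED_EXIT_CODES


def run_arping(interface: str, address: str, count: int = 10, deadline: int = 5) -> bool:
    """Run arping as in the reproduction steps and interpret its exit status."""
    result = subprocess.run(
        ["arping", "-c", str(count), "-w", str(deadline), "-I", interface, address],
        capture_output=True,
        check=False,  # exit status is interpreted by arping_exit_ok instead
    )
    return arping_exit_ok(result.returncode)
```

Keeping the exit-code decision in its own function makes it testable without a network interface.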

Sanitize OWNERS/OWNER_ALIASES:

1) OWNERS must have:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of the Storage team.

Refer to the CIS Red Hat OpenShift Container Platform Benchmark PDF: https://drive.google.com/file/d/12o6O-M2lqz__BgmtBrfeJu1GA2SJ352c/view
1.1.7 Ensure that the etcd pod specification file permissions are set to 600 or more restrictive (Manual)
======================================================================================================
As per the CIS v1.3 PDF, permissions should be 600, with the following statement:
"The pod specification file is created on control plane nodes at /etc/kubernetes/manifests/etcd-member.yaml with permissions 644. Verify that the permissions are 600 or more restrictive."
But when I ran the following command, it showed 644 permissions:

for i in $(oc get pods -n openshift-etcd -l app=etcd -o name | grep etcd )
do
echo "check pod $i"
oc rsh -n openshift-etcd $i \
stat -c %a /etc/kubernetes/manifests/etcd-pod.yaml
done
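
The "600 or more restrictive" wording can be checked mechanically. The sketch below is only an illustration of that comparison, assuming that "more restrictive" means no permission bit outside the limit is set; the function names are hypothetical.

```python
import os
import stat


def is_at_most(mode: int, limit: int = 0o600) -> bool:
    """Return True when no permission bit outside `limit` is set in `mode`."""
    return (stat.S_IMODE(mode) & ~limit) == 0


def file_is_restricted(path: str, limit: int = 0o600) -> bool:
    """Apply the same check to a file on disk, e.g. a static pod manifest."""
    return is_at_most(os.stat(path).st_mode, limit)
```

Under this definition 644 fails the check, while 600 and 400 pass.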

Context:

We currently convey cloud creds issues in ValidOIDCConfiguration and ValidAWSIdentityProvider conditions.

The HO relies on those https://github.com/openshift/hypershift/blob/9e4127055dd7be9cfe4fc8427c39cee27a86efcd/hypershift-operator/controllers/hostedcluster/internal/platform/aws/aws.go#L293

to decide whether forced deletion should be applied, potentially leaving resources behind in the cloud intentionally (e.g. use case: OIDC creds were broken out of band).

The CPO relies on those to wait for deletion of guest cluster resources https://github.com/openshift/hypershift/blob/8596f7f131169a19c6a67dc6ce078c50467de648/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L284-L299

DoD:

When any of the cases above results in the "move kube deletion forward skipping cloud resource deletion" path we should send a metric so consumers / SREs have a sense and can use it to notify customers in conjunction with https://issues.redhat.com/browse/SDA-8613

 

Description of the problem:

No limitation for Additional certificates UI field

 

How reproducible:

100%

 

Steps to reproduce:

1. create a cluster  

2. On add host select 'Configure cluster-wide trusted certificates'

3. On Additional certificates, paste a big string 

4. Generate Discovery ISO

 

Actual results:

The UI sends it to the BE

 

Expected results:

There should be a limitation on certificate field

Description of problem:

I created a cluster with _workerLatencyProfile: LowUpdateSlowReaction_, then edited the latencyProfile to MediumUpdateAverageReaction, following the documentation and the test case document linked below. After switching, I waited for KubeControllerManager and KubeAPIServer to finish progressing and noticed that nodeStatusUpdateFrequency in /etc/kubernetes/kubelet.conf did not change as expected.

https://docs.google.com/document/d/19dPIE4WFxVc3ldu-hNoXiOkjBCQrHC6I7wfyaUyTDqw/edit#heading=h.kf4qxogy9r6
Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-31-181848

How reproducible:

100% 

Steps to Reproduce:

1. Create cluster with LowUpdateSlowReaction manifest: Example: https://docs.google.com/document/d/19dPIE4WFxVc3ldu-hNoXiOkjBCQrHC6I7wfyaUyTDqw/edit#heading=h.22najgyaj9lh
2. Validate values of low update profile components 

$ oc debug node/<worker-node-name>
$ chroot /host 
$ sh-4.4# cat /etc/kubernetes/kubelet.conf | grep nodeStatusUpdateFrequency 
  "nodeStatusUpdateFrequency": "1m0s",
$ oc get KubeControllerManager -o yaml | grep -A 1 node-monitor
        node-monitor-grace-period:
        - 5m0s
$ oc get KubeAPIServer -o yaml | grep -A 1 default-
        default-not-ready-toleration-seconds:
        - "60"
        default-unreachable-toleration-seconds:
        - "60"
3. *oc edit nodes.config/cluster*
spec: 
  workerLatencyProfile: MediumUpdateAverageReaction
4. Wait for components to complete using 

oc get KubeControllerManager -o yaml | grep -i workerlatency -A 5 -B 5
and 
oc get KubeAPIServer -o yaml | grep -i workerlatency -A 5 -B 5

5. Validate medium component values, hitting error here


Actual results:

% oc get KubeControllerManager -o yaml | grep -A 1 node-monitor
        node-monitor-grace-period:
        - 2m0s
prubenda@prubenda1-mac lrc % oc get KubeAPIServer -o yaml | grep -A 1 default-
        default-not-ready-toleration-seconds:
        - "60"
        default-unreachable-toleration-seconds:
        - "60"
sh-5.1# cat /etc/kubernetes/kubelet.conf | grep nodeStatusUpdateFrequency 
  "nodeStatusUpdateFrequency": "1m0s",

Expected results:

$ oc debug node/<worker-node-name>
$ chroot /host 
$ sh-4.4# cat /etc/kubernetes/kubelet.conf | grep nodeStatusUpdateFrequency 
  "nodeStatusUpdateFrequency": "20s",
$ oc get KubeControllerManager -o yaml | grep -A 1 node-monitor
        node-monitor-grace-period:
        - 2m0s
$ oc get KubeAPIServer -o yaml | grep -A 1 default-
        default-not-ready-toleration-seconds:
        - "60"
        default-unreachable-toleration-seconds:
        - "60"
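
The comparison between actual and expected values can be expressed as a small lookup. This is a sketch only: the table holds just the values quoted in this report for the two profiles, and the names `EXPECTED_BY_PROFILE` and `profile_mismatches` are hypothetical.

```python
# Expected values per profile, copied from the outputs quoted in this report.
EXPECTED_BY_PROFILE = {
    "LowUpdateSlowReaction": {
        "nodeStatusUpdateFrequency": "1m0s",
        "node-monitor-grace-period": "5m0s",
        "default-not-ready-toleration-seconds": "60",
        "default-unreachable-toleration-seconds": "60",
    },
    "MediumUpdateAverageReaction": {
        "nodeStatusUpdateFrequency": "20s",
        "node-monitor-grace-period": "2m0s",
        "default-not-ready-toleration-seconds": "60",
        "default-unreachable-toleration-seconds": "60",
    },
}


def profile_mismatches(profile: str, observed: dict) -> dict:
    """Map each mismatching setting to an (expected, observed) pair."""
    expected = EXPECTED_BY_PROFILE[profile]
    return {
        key: (want, observed.get(key))
        for key, want in expected.items()
        if observed.get(key) != want
    }
```

Fed with the actual results above, only nodeStatusUpdateFrequency is reported as a mismatch for MediumUpdateAverageReaction.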

Additional info:

In the documentation it states that workers will go disabled while the change is being applied and I never saw that occur

Description of problem:

Due to an rpm-ostree regression (OKD-63), MCO was copying /var/lib/kubelet/config.json into /run/ostree/auth.json on FCOS and SCOS. This breaks the Assisted Installer flow, which starts with a Live ISO and doesn't have /var/lib/kubelet/config.json.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Context:

As an SRE / cluster service / dev, I'd like to be able to identify trends in the duration of granular components that belong to HC/NodePools and that might affect our SLOs, e.g. etcd, infra, ignition, nodes.

DoD:

Add metrics to visualise components duration of transitions.

Start with a few and agree on the approach.

Follow up.

Add a page to our documentation to describe what information needs to be gathered in the case of a failure/bug.

Document how to use the `hypershift dump cluster` command.

  • Support impersonate flag to make it easier to run against prod envs.

 We are investigating issues with storage usage in production. Reverting until we have a root cause

Description of problem:

In an install where users bring their own networks, they also bring their own NSGs. However, the installer still creates an NSG. In Azure environments that enforce the rule [1] below, users are prohibited from installing a cluster, because the apiserver_in rule has its source set to 0.0.0.0 [2]. Letting users define this rule before install would allow them to set up this connectivity without the open inbound access.



[1] - Rule: Network Security Groups shall not allow rule with 0.0.0.0/Any Source/Destination IP Addresses - Custom Deny

[2] - https://github.com/openshift/installer/blob/master/data/data/azure/vnet/nsg.tf#L31

Description of problem:

Pipelines as Code has been GA for some time, so we should remove the Tech Preview badge from the PAC pages.

Version-Release number of selected component (if applicable):

4.13

Description of problem:

No timezone info in installer logs

Version-Release number of selected component (if applicable):

4.x

How reproducible:

100%

Steps to Reproduce:

1. openshift-install wait-for install-complete --dir=./foo
2.
3.

Actual results:

INFO Waiting up to 1h0m0s (until 4:52PM) for the cluster at https://api.ocp.example.local:6443 to initialize...

Expected results:

INFO Waiting up to 1h0m0s (until 4:52PM UTC) for the cluster at https://api.ocp.example.local:6443 to initialize...
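
As an illustration of the expected message, the sketch below formats a deadline with its zone abbreviation. It is not the installer's code; `format_deadline` is a hypothetical helper and the input is assumed to be a timezone-aware timestamp.

```python
from datetime import datetime, timedelta, timezone


def format_deadline(now: datetime, timeout: timedelta) -> str:
    """Format `now + timeout` as a 12-hour clock time followed by the zone name."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware so the zone can be printed")
    deadline = now + timeout
    clock = deadline.strftime("%I:%M%p").lstrip("0")  # "04:52PM" -> "4:52PM"
    return f"{clock} {deadline.strftime('%Z')}"


start = datetime(2023, 7, 19, 15, 52, tzinfo=timezone.utc)
message = f"Waiting up to 1h0m0s (until {format_deadline(start, timedelta(hours=1))})"
```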

Additional info:

 

Description of problem:

We should disable the netlink mode of the netclass collector in Node Exporter. The netlink mode of the netclass collector was introduced into Node Exporter in 4.13. When the netlink mode is used, several metrics become unavailable. Disabling it avoids confusing users who upgrade their OCP cluster to a new version and find several NIC metrics missing.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Using default config of CMO, Node Exporter's netclass collector is running in netlink mode.
The argument `--collector.netclass.netlink` is present in the `node-exporter` container in `node-exporter` daemonset.

Expected results:

Using default config of CMO, Node Exporter's netclass collector is running in classic mode. 
The argument `--collector.netclass.netlink` is absent in the `node-exporter` container in `node-exporter` daemonset.

Additional info:

 

Description of problem:

I have to create this OCPBUG in order to backport a test to the 4.14 branch.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/222

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The following test is permafailing in Prow CI:
[tuningcni] sysctl allowlist update [It] should start a pod with custom sysctl only after adding sysctl to allowlist

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn-periodic/1640987392103944192


[tuningcni]
/go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:26
  sysctl allowlist update
  /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:141
    should start a pod with custom sysctl only after adding sysctl to allowlist
    /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:156
  > Enter [BeforeEach] [tuningcni] - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/pkg/execute/ginkgo.go:9 @ 03/29/23 10:08:49.855
  < Exit [BeforeEach] [tuningcni] - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/pkg/execute/ginkgo.go:9 @ 03/29/23 10:08:49.855 (0s)
  > Enter [BeforeEach] sysctl allowlist update - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:144 @ 03/29/23 10:08:49.855
  < Exit [BeforeEach] sysctl allowlist update - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:144 @ 03/29/23 10:08:49.896 (41ms)
  > Enter [It] should start a pod with custom sysctl only after adding sysctl to allowlist - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:156 @ 03/29/23 10:08:49.896
  [FAILED] Unexpected error:
      <*errors.errorString | 0xc00044eec0>: {
          s: "timed out waiting for the condition",
      }
      timed out waiting for the condition
  occurred
  In [It] at: /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:186 @ 03/29/23 10:09:53.377

Version-Release number of selected component (if applicable):

master (4.14)

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Test fails

Expected results:

Test passes

Additional info:

PR https://github.com/openshift-kni/cnf-features-deploy/pull/1445 adds some useful information to the reported archive.

The installer offers a graph command to output its internal dependency graph. It could be useful to have a similar command, i.e. `agent graph`, to output the agent-specific dependency graph.

Description of problem:
When importing a Serverless Service from a git repository, the topology shows an Open URL decorator even when the "Add Route" checkbox (which is selected by default) was unselected.

The created kn Route makes the Service available within the cluster and the created URL looks like this: http://nodeinfo-private.serverless-test.svc.cluster.local

So the Service is NOT accidentally exposed. It's "just" that we link an internal route that will not be accessible to the user.

This might also happen in the Serverless functions import flow and the container image import flow.

Version-Release number of selected component (if applicable):
Tested older versions and could see this at least on 4.10+

How reproducible:
Always

Steps to Reproduce:

  1. Install the OpenShift Serverless operator and create the required kn Serving resource.
  2. Navigate to the Developer perspective > Add > Import from Git
  3. Enter a git repository (like https://gitlab.com/jerolimov/nodeinfo)
  4. Unselect "Add Route" and press Create

Actual results:
The topology shows the new kn Service with an Open URL decorator in the top right corner.

The button is clickable but the target page could not be opened (as expected).

Expected results:
The topology should not show an Open URL decorator for "private" kn Routes.

The topology sidebar shows similar information; maybe we should replace the link there as well, with a Text+Copy button?

A fix should also be tested with Serverless functions and with container images!

Additional info:
When the user unselects the "Add route" option an additional label is added to the kn Service. This label could also be added and removed later. When this label is specified the Open URL decorator should not be shown:

metadata:
  labels:
    networking.knative.dev/visibility: cluster-local

See also:

https://github.com/openshift/console/blob/1f6e238b924f4a4337ef917a0eba8aadae161e9c/frontend/packages/knative-plugin/src/utils/create-knative-utils.ts#L108

https://github.com/openshift/console/blob/1f6e238b924f4a4337ef917a0eba8aadae161e9c/frontend/packages/knative-plugin/src/topology/components/decorators/getServiceRouteDecorator.tsx#L15-L21
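
The label check described under "Additional info" could look roughly like the following. This is a sketch, not the console implementation (which is TypeScript); `should_show_open_url` is a hypothetical name and the kn Service is assumed to be available as a plain dictionary.

```python
# Label and value quoted in this report.
VISIBILITY_LABEL = "networking.knative.dev/visibility"
CLUSTER_LOCAL = "cluster-local"


def should_show_open_url(service: dict) -> bool:
    """Hide the Open URL decorator for kn Services marked as cluster-local."""
    labels = (service.get("metadata") or {}).get("labels") or {}
    return labels.get(VISIBILITY_LABEL) != CLUSTER_LOCAL
```

Because the label can be added or removed later, the check is evaluated against the current labels each time rather than cached from the import flow.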

Description of problem:

SSH keys not configured on the worker nodes

Version-Release number of selected component (if applicable):

4.14.0-0.ci-2023-07-14-014011

How reproducible:

so far 100%

Steps to Reproduce:

1. Deploy baremetal cluster using IPI flow
2.
3.

Actual results:

Deployment succeeds but SSH keys are not configured on the worker nodes

Expected results:

SSH keys configured on the worker nodes

Additional info:

SSH keys are configured on the control-plane nodes:
ssh core@master-0-0 'cat .ssh/authorized_keys.d/ignition'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDm9hb6iTZJypEmzg4IZ767ze60UGhBWnjPXhovWVB7uKputdLzZhmlo36ifkXr/DTk8NGm47r6kXmz9NAF0pDHa5jX6yJFnhS4z5NY/mzsUX41gwiqBKYHgdp/KE1ylE8mbNon5ZpaaGvb876myjjPjPwWsD8hvXZirA5Q8TfDb/Pvgy1dhVH/uN05Ip1vVsp+bFGMPUJVWVUy/Eby5xW6OJv+FBOQq4nu6tslDZlHYXX2TSGrlW4x0i/oQMpKu/Y8ygAdjWqmAy6UBcho1nNWy15cp0jI5Fhjze171vSWZLAqJY+eFcL2kt/09RnY+MXyY/tIf+qNMyBE2Qltigah
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: "2023-07-14T12:13:00Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-ssh
  resourceVersion: "2242"
  uid: 0ef02005-509e-4fc9-91ee-fc0afe27d5e6
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - |
          ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDm9hb6iTZJypEmzg4IZ767ze60UGhBWnjPXhovWVB7uKputdLzZhmlo36ifkXr/DTk8NGm47r6kXmz9NAF0pDHa5jX6yJFnhS4z5NY/mzsUX41gwiqBKYHgdp/KE1ylE8mbNon5ZpaaGvb876myjjPjPwWsD8hvXZirA5Q8TfDb/Pvgy1dhVH/uN05Ip1vVsp+bFGMPUJVWVUy/Eby5xW6OJv+FBOQq4nu6tslDZlHYXX2TSGrlW4x0i/oQMpKu/Y8ygAdjWqmAy6UBcho1nNWy15cp0jI5Fhjze171vSWZLAqJY+eFcL2kt/09RnY+MXyY/tIf+qNMyBE2Qltigah
  extensions: null
  fips: false
  kernelArguments: null
  kernelType: ""
  osImageURL: ""

Description of problem:

After further discussion about https://issues.redhat.com/browse/RFE-3383 we have concluded that it needs to be addressed in 4.12 since OVNK will be default there. I'm opening this so we can backport the fix.

The fix for this is simply to alter the logic around enabling nodeip-configuration to handle the vSphere-unique case where platform type == "vsphere" and the VIP field is not populated.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

Description of the problem:

 4.14 jobs relying on LSO are failing because we should use the version N-1 for LSO.
Something similar to https://github.com/openshift/assisted-service/pull/4753 should be merged.

Actual results:

Jobs fail with:

 ++ make deploy_assisted_operator test_kube_api_parallel
Error from server (NotFound): namespaces "assisted-spoke-cluster" not found
error: the server doesn't have a resource type "clusterimageset"
namespace "assisted-installer" deleted
error: the server doesn't have a resource type "agentserviceconfigs"
error: the server doesn't have a resource type "localvolume"
Error from server (NotFound): catalogsources.operators.coreos.com "assisted-service-catalog" not found 

 https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-assisted-test-infra-master-e2e-metal-assisted-kube-api-late-binding-single-node

Expected results:

Job should be a success

Description of problem:

Changes to platform fields, e.g. the AWS instance type, don't trigger a rolling upgrade

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Create a hostedCluster with nodepool on AWS
2. Change the instance type field on the nodepool spec.platform.aws

Actual results:

Machines are not restarted and the instance type didn't change

Expected results:

Machines are recreated with the new instance type

Additional info:

This is a result of the recent changes to CAPI, which introduced in-place propagation of labels and annotations.
Solution:
The MachineTemplate name should not be constant and should change with each spec change, so that spec.infraRef in the MachineDeployment is updated and a rolling upgrade is triggered.
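
One way to make the name follow the spec is to append a short hash of the platform spec. The sketch below only illustrates that idea; it is not the HyperShift implementation, and `machine_template_name`, the hash length and the example field values are assumptions.

```python
import hashlib
import json


def machine_template_name(nodepool_name: str, platform_spec: dict) -> str:
    """Derive a template name that changes whenever the platform spec changes."""
    # Canonical JSON so that key order does not affect the hash.
    canonical = json.dumps(platform_spec, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    return f"{nodepool_name}-{digest}"


before = machine_template_name("example-nodepool", {"instanceType": "m5.large"})
after = machine_template_name("example-nodepool", {"instanceType": "m5.xlarge"})
```

Here `before` and `after` differ, so a reference to the template by name would change together with the instance type.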

In order to avoid possible issues with SDN during migration from SDN to OVNK, do not use port 9106 for ovnkube-control-plane metrics, since it's already used by SDN. Use a port that is not used by SDN, such as 9108.

Description of the problem:
Creating a cluster with ingress VIPs and user managed network will return an error

 
{
  "lastProbeTime": "2023-03-01T18:50:41Z",
  "lastTransitionTime": "2023-03-01T18:50:41Z",
  "message": "The Spec could not be synced due to an input error: API VIP cannot be set with User Managed Networking",
  "reason": "InputError",
  "status": "False",
  "type": "SpecSynced"
}

However, setting ingress VIPs with user managed networking set to false, and then editing only user managed networking, does not result in any error. Will the cluster be using user managed networking in this case?

How reproducible:

 

Steps to reproduce:

1. apply

apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  name: acimulinode
  namespace: mfilanov
spec:
  apiVIP: 1.2.3.8
  apiVIPs:
   - 1.2.3.8
  clusterDeploymentRef:
    name: multinode
  imageSetRef:
    name: img4.12.5-x86-64-appsub
  ingressVIP: 1.2.3.10
  platformType: BareMetal
  networking:
    clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    serviceNetwork:
    - 172.30.0.0/16
    userManagedNetworking: false
  provisionRequirements:
    controlPlaneAgents: 3
  compute:
  - hyperthreading: Enabled
    name: worker
  controlPlane:
    hyperthreading: Enabled
    name: master

2. check conditions

kubectl get aci -n mfilanov -o json | jq .items[].status.conditions[]
{
  "lastProbeTime": "2023-03-01T18:52:08Z",
  "lastTransitionTime": "2023-03-01T18:52:08Z",
  "message": "SyncOK",
  "reason": "SyncOK",
  "status": "True",
  "type": "SpecSynced"
}

3. edit user managed network and apply again

apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  name: acimulinode
  namespace: mfilanov
spec:
  apiVIP: 1.2.3.8
  apiVIPs:
   - 1.2.3.8
  clusterDeploymentRef:
    name: multinode
  imageSetRef:
    name: img4.12.5-x86-64-appsub
  ingressVIP: 1.2.3.10
  platformType: BareMetal
  networking:
    clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    serviceNetwork:
    - 172.30.0.0/16
    userManagedNetworking: true
  provisionRequirements:
    controlPlaneAgents: 3
  compute:
  - hyperthreading: Enabled
    name: worker
  controlPlane:
    hyperthreading: Enabled
    name: master

Actual results:

kubectl get aci -n mfilanov -o json | jq .items[].status.conditions[]
{
  "lastProbeTime": "2023-03-01T18:52:08Z",
  "lastTransitionTime": "2023-03-01T18:52:08Z",
  "message": "SyncOK",
  "reason": "SyncOK",
  "status": "True",
  "type": "SpecSynced"
}

 

Expected results:
An error should probably be returned, because the ingress VIPs are already set.
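
A sketch of a validation that gives the same answer on create and on update, because it looks only at the resulting spec. The function name is hypothetical, the `ingressVIPs` field and the ingress error message are assumptions, and the API VIP message is the one quoted in the condition above.

```python
from typing import List


def validate_vips_with_umn(spec: dict) -> List[str]:
    """Return input errors for VIPs combined with user managed networking."""
    networking = spec.get("networking") or {}
    if not networking.get("userManagedNetworking"):
        return []
    errors = []
    if spec.get("apiVIP") or spec.get("apiVIPs"):
        errors.append("API VIP cannot be set with User Managed Networking")
    # Assumption: ingress VIPs are rejected the same way as API VIPs.
    if spec.get("ingressVIP") or spec.get("ingressVIPs"):
        errors.append("Ingress VIP cannot be set with User Managed Networking")
    return errors
```

Running the check on the full spec after every edit would flag step 3 of the reproduction, where only userManagedNetworking changes.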

Description of problem:

While trying to update build01 from 4.13.rc2->4.13.rc3, the MCO degraded upon trying to upgrade the first master node. The error being:

E0414 15:42:29.597388 2323546 writer.go:200] Marking Degraded due to: exit status 1

Which I mapped to this line:
https://github.com/openshift/machine-config-operator/blob/release-4.13/pkg/daemon/update.go#L1551

I think this error can be improved since it is a bit confusing, but that's not the main problem.

We noticed that the actual issue was an existing "/home/core/.ssh" directory that belonged to the root user and seemed to have been created by 4.13.rc2 (but could have been earlier). As a result, when we attempted to create the folder by hand via runuser core, it failed with permission denied (and since we return the exec status, I think it just returned status 1 and not this error message).

I am currently not sure if we introduced something that caused this issue. There was an ssh (only on master pool) in that build01 cluster for 600 days already, so it must have worked in the past?

Workaround is to delete the .ssh folder and let the MCD recreate it

Version-Release number of selected component (if applicable):

4.13.rc3

How reproducible:

uncertain, but it shouldn't be very high, otherwise I think we would have run into this in CI much more often

Steps to Reproduce:

1. create some 4.12 cluster with sshkey
2. upgrade to 4.13.rc2
3. upgrade to 4.13.rc3

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/oc/pull/1408

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/235

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When the user's pull secret contains a JSON null in the "auth" or "email" keys, assisted service crashes when we attempt to create the cluster:

May 31 21:06:27 example.dev.local service[3389]: time="2023-05-31T09:06:27Z" level=error msg="Failed to registered cluster example with id 3648b06e-4745-4542-9421-78ae2e249c0d" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal.func1" file="/src/internal/bminventory/inventory.go:448" cluster_id=3648b06e-4745-4542-9421-78ae2e249c0d go-id=162 pkg=Inventory request_id=1252f666-cf5c-4aae-9be7-7b7a579b5bf6
May 31 21:06:27 example.dev.local service[3389]: 2023/05/31 09:06:27 http: panic serving 10.116.24.118:46262: interface conversion: interface {} is nil, not string
May 31 21:06:27 example.dev.local service[3389]: goroutine 162 [running]:
May 31 21:06:27 example.dev.local service[3389]: net/http.(*conn).serve.func1()
May 31 21:06:27 example.dev.local service[3389]:         /usr/lib/golang/src/net/http/server.go:1850 +0xbf
May 31 21:06:27 example.dev.local service[3389]: panic({0x25d0000, 0xc00148d7d0})
May 31 21:06:27 example.dev.local service[3389]:         /usr/lib/golang/src/runtime/panic.go:890 +0x262
May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/cluster/validations.ParsePullSecret({0xc001ed0780, 0x1c6})
May 31 21:06:27 example.dev.local service[3389]:         /src/internal/cluster/validations/validations.go:106 +0x718
May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/cluster/validations.(*registryPullSecretValidator).ValidatePullSecret(0xc0005880c0, {0xc001ed0780?, 0x7?}, {0x29916da, 0x5})
May 31 21:06:27 example.dev.local service[3389]:         /src/internal/cluster/validations/validations.go:160 +0x54
May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).ValidatePullSecret(...)
May 31 21:06:27 example.dev.local service[3389]:         /src/internal/bminventory/inventory.go:279
May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal(0xc00112f880, {0x2fd3e20, 0xc00148cd50}, 0x0, {0xc0007c0400, 0xc0008d69a0})
May 31 21:06:27 example.dev.local service[3389]:         /src/internal/bminventory/inventory.go:564 +0x16d0
May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).V2RegisterCluster(0x2fd3e20?, {0x2fd3e20?, 0xc00148cd50?}, {0xc0007c0400?, 0xc0008d69a0?})
May 31 21:06:27 example.dev.local service[3389]:         /src/internal/bminventory/inventory_v2_handlers.go:42 +0x39
May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/restapi.HandlerAPI.func59({0xc0007c0400?, 0xc0008d69a0?}, {0x2390b20?, 0xc0014e0240?})
May 31 21:06:27 example.dev.local service[3389]:         /src/restapi/configure_assisted_install.go:639 +0xaf
May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/restapi/operations/installer.V2RegisterClusterHandlerFunc.Handle(0xc000a9d068?, {0xc0007c0400?, 0xc0008d69a0?}, {0x2390b20?, 0xc0014e0240?})
May 31 21:06:27 example.dev.local service[3389]:         /src/restapi/operations/installer/v2_register_cluster.go:19 +0x3d
May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/restapi/operations/installer.(*V2RegisterCluster).ServeHTTP(0xc000571470, {0x2fc7140, 0xc00034c040}, 0xc0007c0400)
May 31 21:06:27 example.dev.local service[3389]:         /src/restapi/operations/installer/v2_register_cluster.go:66 +0x298
May 31 21:06:27 example.dev.local service[3389]: github.com/go-openapi/runtime/middleware.NewOperationExecutor.func1({0x2fc7140, 0xc00034c040}, 0xc0007c0400)
May 31 21:06:27 example.dev.local service[3389]:         /src/vendor/github.com/go-openapi/runtime/middleware/operation.go:28 +0x59

Version-Release number of selected component (if applicable):

4.12.17

How reproducible:

Probably 100%

Steps to Reproduce:

1. Add to the pull secret in install-config.yaml an auth like:

        "example.com": {
          "auth": null,
          "email": null
        }

2. Generate the agent ISO as usual using "openshift-install agent create image"
3. Boot the ISO on the cluster hosts.

Actual results:

The create-cluster-and-infraenv.service fails to complete. In its log it reports:

    Failed to register cluster with assisted-service: Post \"http://10.1.1.2:8090/api/assisted-install/v2/clusters\": EOF

Expected results:

Cluster is installed.

Additional info:

This is particularly difficult to debug because users don't generally give us their pull secrets. The pull secret file in the agent-gather bundle has individual fields redacted, so it is a better guide than the install-config where the whole thing may be redacted.
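
The panic comes from converting a JSON null to a string. A defensive parse could look like the sketch below, which assumes the usual pull secret layout with a top-level "auths" object. The names are hypothetical and this is not the assisted-service code (which is Go). Whether a null should be tolerated or rejected is a policy choice; here a null email is tolerated and a null auth becomes a validation error instead of a crash.

```python
import json


class PullSecretError(ValueError):
    """Raised for a malformed pull secret instead of crashing."""


def parse_pull_secret(raw: str) -> dict:
    """Parse a pull secret, checking field types before using them as strings."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise PullSecretError(f"pull secret is not valid JSON: {err}") from err
    auths = data.get("auths") if isinstance(data, dict) else None
    if not isinstance(auths, dict) or not auths:
        raise PullSecretError("pull secret must contain a non-empty 'auths' object")
    parsed = {}
    for registry, entry in auths.items():
        if not isinstance(entry, dict):
            raise PullSecretError(f"registry {registry!r}: entry must be an object")
        auth = entry.get("auth")
        if not isinstance(auth, str) or not auth:
            raise PullSecretError(f"registry {registry!r}: 'auth' must be a non-empty string")
        email = entry.get("email")
        if email is not None and not isinstance(email, str):
            raise PullSecretError(f"registry {registry!r}: 'email' must be a string or null")
        parsed[registry] = {"auth": auth, "email": email or ""}
    return parsed
```

An error that names the offending registry would also help with debugging when the pull secret itself cannot be shared.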

DoD:

Let the HO export a metric with its own version, so that as an SRE I can easily understand which version is running where by looking at a Grafana dashboard.

Context:

As we start receiving metrics consistently in OCM environments, and as we create SLO dashboards that can consume data from any data source (prod/stage/CI), we also want to revisit how we are sending metrics and make sure we are doing it in the most effective way. We have some wonky data coming through in prod atm.

DoD:

Atm we have a high-frequency reconciliation loop where we constantly review the overall state of the world by looping over all clusters.

We should review this approach and, where possible for each specific metric, record metrics/events directly in the controllers/reconcile loop as they happen, only once and not repeatedly in a loop.

Description of problem:

While mirroring to the filesystem, if a 429 error is received from the registry, the layer is incorrectly flagged as having been mirrored and is therefore not picked up by subsequent mirror re-runs. This gives the impression that the second mirror-to-filesystem attempt succeeded. However, it causes failures when mirroring from the filesystem to the target registry (due to missing files).

Version-Release number of selected component (if applicable):

oc version
Client Version: 4.8.42
Server Version: 4.8.14
Kubernetes Version: v1.21.1+a620f50

How reproducible:

Whenever a 429 occurs while mirroring to the filesystem

Steps to Reproduce:

1. Run mirror to filesystem command : oc image mirror -f mirror-to-filesystem.txt --filter-by-os '.*' -a $REGISTRY_AUTH_FILE --insecure --skip-multiple-scopes --max-per-registry=1 --continue-on-error=true --dir "$LOCAL_DIR_PATH"  

Output: 
info: Mirroring completed in 2h19m24.14s (25.75MB/s)
error: one or more errors occurred 
E.g
error: unable to push <registry>/namespace/<image-name>: failed to retrieve blob <image-digest>: error parsing HTTP 429 response body: unexpected end of JSON input: ""


2. Re Run mirror to filesystem command : oc image mirror -f mirror-to-filesystem.txt --filter-by-os '.*' -a $REGISTRY_AUTH_FILE --insecure --skip-multiple-scopes --max-per-registry=1 --continue-on-error=true --dir "$LOCAL_DIR_PATH"

Output:
info: Mirroring completed in 480ms (0B/s)


3. Run mirror from filesystem command : oc image mirror -f mirror-from-filesystem.txt -a $REGISTRY_AUTH_FILE --from-dir "$LOCAL_DIR_PATH" --filter-by-os '.*' --insecure --skip-multiple-scopes --max-per-registry=1 --continue-on-error=true

Output: 
info: Mirroring completed in 53m5.21s (67.61MB/s)
error: one or more errors occurred
E.g
error: unable to push file://local/namespace/<image-name>: failed to retrieve blob <image-digest>: open /root/local/namespace/<image-name>/blobs/<image-digest>: no such file or directory

 

Actual results:

1) mirror to filesystem first attempt: 

info: Mirroring completed in 2h19m24.14s (25.75MB/s) 
error: one or more errors occurred 
E.g
error: unable to push <registry>/namespace/<image-name>: failed to retrieve blob <image-digest>: error parsing HTTP 429 response body: unexpected end of JSON input: ""

2) mirror to filesystem second attempt: 

info: Mirroring completed in 480ms (0B/s)

 
3) mirror from filesystem to target registry:  

info: Mirroring completed in 53m5.21s (67.61MB/s) 
error: one or more errors occurred 
E.g 
error: unable to push file://local/namespace/<image-name>: failed to retrieve blob <image-digest>: open /root/local/namespace/<image-name>/blobs/<image-digest>: no such file or directory

Expected results:

Mirroring from the source image to the filesystem, and from the filesystem to the target registry, should complete successfully
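
The expected behaviour amounts to recording a layer as mirrored only after it has actually been written. The sketch below illustrates that ordering with a simple retry on rate limiting. It is not the `oc image mirror` implementation; `RateLimited`, `mirror_blob` and `make_flaky_fetch` are hypothetical names, with the last one serving only as a stand-in registry for demonstration.

```python
import time
from typing import Callable, Set


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response from the source registry."""


def mirror_blob(
    digest: str,
    fetch: Callable[[str], bytes],
    store: Callable[[str, bytes], None],
    mirrored: Set[str],
    retries: int = 3,
    backoff_seconds: float = 0.0,
) -> bool:
    """Copy one blob; mark it as mirrored only after the write succeeded."""
    if digest in mirrored:
        return True
    for attempt in range(retries):
        try:
            data = fetch(digest)
        except RateLimited:
            time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
            continue
        store(digest, data)
        mirrored.add(digest)  # recorded last, so a failed fetch is retried on re-run
        return True
    return False


def make_flaky_fetch(failures: int, payload: bytes) -> Callable[[str], bytes]:
    """Demo helper: a fetch that raises RateLimited `failures` times, then succeeds."""
    remaining = {"count": failures}

    def fetch(digest: str) -> bytes:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RateLimited(digest)
        return payload

    return fetch
```

A blob that never gets past the rate limit stays out of the `mirrored` set, so a re-run picks it up again instead of reporting an instant success.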

Additional info:

 

Description of the problem:

Currently the `pre-network-manager-config.service` that we use to create static network configurations from the non minimal discovery ISO may run after NetworkManager, and therefore the configurations that it generates may be ignored.

How reproducible:

Not always reproducible; it is timing-sensitive. It has been observed when there is a large number of static network configurations. See OCPBUGS-16219 for details and steps to reproduce.

Please review the following PR: https://github.com/openshift/console-operator/pull/737

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

4.14 indexes have been bootstrapped and published on the registry. I was told they have to be added to https://github.com/operator-framework/operator-marketplace/blob/master/defaults/03_community_operators.yaml before they can be used in OCP clusters.

Version-Release number of selected component (if applicable):

OCP 4.14

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

4.14 indexes were bootstrapped in CLOUDDST-17591

Description of problem:

 

Observation from CISv1.4 pdf:
1.1.9 Ensure that the Container Network Interface file permissions are set to 600 or more restrictive
“Container Network Interface provides various networking options for overlay networking.
You should consult their documentation and restrict their respective file permissions to maintain the integrity of those files. Those files should be writable by only the administrators on the system.”
 
To conform with the CIS benchmark, the permissions of the /var/run/multus/cni/net.d/*.conf files on nodes should be changed to 600.

$ for i in $(oc get pods -n openshift-multus -l app=multus -oname); do oc exec -n openshift-multus $i -- /bin/bash -c "stat -c \"%a %n\" /host/var/run/multus/cni/net.d/*.conf"; done
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-20-215234

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

The file permissions of /var/run/multus/cni/net.d/*.conf on nodes are 644.

Expected results:

The file permissions of /var/run/multus/cni/net.d/*.conf on nodes should be updated to 600
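
The CIS wording "600 or more restrictive" can be expressed as a bit test: no permission bit outside owner read/write may be set. The sketch below illustrates that check only. The helper name `is_600_or_stricter` is an assumption made for this example and is not part of multus or any compliance tool.

```python
import stat


def is_600_or_stricter(mode: int) -> bool:
    """True when no permission bit beyond owner read/write is set."""
    return (stat.S_IMODE(mode) & ~0o600) == 0


if __name__ == "__main__":
    # 0o644 is the mode reported above; 0o600 is the expected mode.
    for mode in (0o644, 0o600, 0o400):
        print(oct(mode), is_600_or_stricter(mode))
```

For a real file, pass `os.stat(path).st_mode` as the argument.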

Additional info:

 

Description of problem:

OCM-o does not support obtaining verbosity through OpenShiftControllerManager.operatorLogLevel object

Version-Release number of selected component (if applicable):

 

How reproducible:

Modify OpenShiftControllerManager.operatorLogLevel; the OCM-o operator does not display the corresponding logs.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/91

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

As a cluster-admin, users can see the Pipelines section while using the `import from git` feature in the Developer perspective of the web console.

However, if users are logged in as a normal user or a project admin, they are not able to see the Pipelines section.

Version-Release number of selected component (if applicable):

Tested in OCP v4.12.18 and v4.12.20 

How reproducible:

Always

Steps to Reproduce:

Prerequisite: Install the Red Hat OpenShift Pipelines operator
1. Log in as a kube-admin user from the web console
2. Go to Developer View
3. Click on +Add
4. Under Git Repository, open page -> Import from git
5. Enter Git Repo URL (example git url- https://github.com/spring-projects/spring-petclinic)
6. Check that there are 3 sections: General, Pipelines, Advanced options
7. Then log in as a project admin user
8. Repeat steps 2 to 6

Actual results:

The Pipelines section is not visible when logged in as a project admin. Only the General and Advanced options sections are visible in import from git.
However, the Pipelines section is visible as a cluster-admin.

Expected results:

The Pipelines section should be visible when logged in as a project admin, along with the General and Advanced options sections in import from git.

Additional info:

I tested this by creating a separate rolebinding and clusterrolebinding to grant access to pipeline resources, as follows:
~~~
$ oc create clusterrole pipelinerole1 --verb=create,get,list,patch,delete --resource=tektonpipelines,openshiftpipelinesascodes
$ oc create clusterrole pipelinerole2 --verb=create,get,list,patch,delete --resource=repositories,pipelineruns,pipelines
$ oc adm policy add-cluster-role-to-user pipelinerole1 user1
$ oc adm policy add-role-to-user pipelinerole2 user1
~~~
However, even after assigning these rolebindings/clusterrolebindings to the users, they are not able to see the Pipelines section.

Description of problem:

oc explain tests have to be enabled to ensure openapi/v3 is working properly

The tests have been temporarily disabled in order to unblock the oc kube bump (https://github.com/openshift/oc/pull/1420). 

The following efforts need to be done/merged to make openapi/v3 work:

- [DONE] oauth-apiserver kube bump: https://github.com/openshift/oauth-apiserver/pull/89
- [DONE] merge kubectl fix backport https://github.com/kubernetes/kubernetes/pull/118930 and bump kube dependency in oc to include this fix (https://github.com/openshift/oc/pull/1515)
- [DONE] merge https://github.com/kubernetes/kubernetes/pull/118881 and carry this PR in our kube-apiserver to stop oc explain being flaky (https://github.com/openshift/kubernetes/pull/1629)
- [DONE] merge https://github.com/kubernetes/kubernetes/pull/118879 and carry this PR in our kube-apiserver to enable apiservices (https://github.com/openshift/kubernetes/pull/1630)
- [DONE] make openapi/v3 work for our special groups https://github.com/openshift/kubernetes/pull/1654 (https://github.com/openshift/kubernetes/pull/1617#issuecomment-1609864043, slack discussion: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1687882255536949?thread_ts=1687822265.954799&cid=CC3CZCQHM)
- [DONE] enable back oc explain tests: https://github.com/openshift/origin/pull/28155 and bring in new tests: https://github.com/openshift/origin/pull/28129
- [OPTIONAL] bring in additional upstream kubectl/oc explain tests: https://github.com/kubernetes/kubernetes/pull/118885
- [OPTIONAL] backport https://github.com/kubernetes/kubernetes/pull/119839 and https://github.com/kubernetes/kubernetes/pull/119841 (backport of https://github.com/kubernetes/kubernetes/pull/118881 and https://github.com/kubernetes/kubernetes/pull/118879)

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 


Description of problem:

The OCP upgrade is blocked because the csi-snapshot-controller cluster operator fails to start its deployment, with a fatal read-only filesystem message.

Version-Release number of selected component (if applicable):

Red Hat OpenShift 4.11
rhacs-operator.v3.72.1

How reproducible:

At least once in user's cluster while upgrading 

Steps to Reproduce:

1. Have an OCP 4.11 cluster installed
2. Install ACS on top of the OCP cluster
3. Upgrade OCP to the next z-stream version

Actual results:

Upgrade gets blocked: waiting on csi-snapshot-controller

Expected results:

Upgrade should succeed

Additional info:

The stackrox SCCs (stackrox-admission-control, stackrox-collector and stackrox-sensor) set `readOnlyRootFilesystem` to `true`. If an SCC is not explicitly defined or requested, other Pods might receive one of these SCCs, which causes the deployment to fail with a `read-only filesystem` message.

Description of problem:

CCPMSO uses a copy of the manifests from openshift/api. However, these appear out-of-sync with respect to the vendored version of openshift/api

Description of problem:

The cluster-api pod can't create events due to RBAC restrictions, so some useful events may be missed.
E0503 07:20:44.925786       1 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ad1-workers-f5f568855-vnzmn.175b911e43aa3f41", GenerateName:"", Namespace:"ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Machine", Namespace:"ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1", Name:"ad1-workers-f5f568855-vnzmn", UID:"2b40a694-d36d-4b13-9afc-0b5daeecc509", APIVersion:"cluster.x-k8s.io/v1beta1", ResourceVersion:"144260357", FieldPath:""}, Reason:"DetectedUnhealthy", Message:"Machine ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1/ad1-workers/ad1-workers-f5f568855-vnzmn/ has unhealthy node ", Source:v1.EventSource{Component:"machinehealthcheck-controller", Host:""}, FirstTimestamp:time.Date(2023, time.May, 3, 7, 20, 44, 923289409, time.Local), LastTimestamp:time.Date(2023, time.May, 3, 7, 20, 44, 923289409, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1:cluster-api" cannot create resource "events" in API group "" in the namespace "ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1"' (will not retry!)

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Create a hosted cluster
2. Check cluster-api pod for some kind of error (e.g. slow node startup)
3.

Actual results:

Error

Expected results:

Event generated

Additional info:
ClusterRole hypershift-cluster-api is created here https://github.com/openshift/hypershift/blob/e7eb32f259b2a01e5bbdddf2fe963b82b331180f/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go#L2720

We should add create/patch/update for events there
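
To illustrate the suggested change, the sketch below models RBAC policy rules as plain dictionaries and checks whether a verb on a resource is allowed. It is a simplified illustration rather than the hypershift code: the `allows` helper is hypothetical, wildcard (`*`) matching is deliberately left out, and the rule shown only mirrors the "create/patch/update for events" suggestion.

```python
from typing import Dict, List

Rule = Dict[str, List[str]]


def allows(rules: List[Rule], api_group: str, resource: str, verb: str) -> bool:
    """Return True if any rule grants `verb` on `resource` in `api_group`."""
    for rule in rules:
        if (api_group in rule.get("apiGroups", [])
                and resource in rule.get("resources", [])
                and verb in rule.get("verbs", [])):
            return True
    return False


# Rule corresponding to the proposal: events live in the core ("") API group.
events_rule: Rule = {
    "apiGroups": [""],
    "resources": ["events"],
    "verbs": ["create", "patch", "update"],
}

if __name__ == "__main__":
    print(allows([], "", "events", "create"))
    print(allows([events_rule], "", "events", "create"))
```

Without the rule, the check fails in the same way the API server rejects the event in the log above.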

Description of problem:

IPI installation fails on AWS because CreateVpcEndpoint is not supported in the C2S region.

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

IPI installation in AWS

1. terraform apply
2. When using an aws_vpc_endpoint resource with aws terraform provider >= 2.53.0 in the C2S regions (us-iso*), an error is thrown stating UnsupportedOperation.
3.

Actual results:

Unable to install OCP 4.X in AWS C2S(top-secret) region

Expected results:

IPI installation in AWS C2S region

Additional info:

Upstream bug:

[Bug]: C2S CreateVpcEndpoint UnsupportedOperation: The operation is not supported in this region! · Issue #27048 · hashicorp/terraform-provider-aws · GitHub
https://github.com/hashicorp/terraform-provider-aws/issues/27048

Description of problem:

After adding additional CPU and memory to the OpenShift Container Platform 4 control-plane nodes, a new MachineConfig was rolled out, causing all OpenShift Container Platform 4 nodes to reboot unexpectedly.

Interestingly, no new MachineConfig was rendered; instead, a slightly older MachineConfig was picked and applied to all nodes after the change on the control-plane nodes was performed.

The only visible change found in the MachineConfig was that nodeStatusUpdateFrequency was updated from 10s to 0s, even though nodeStatusUpdateFrequency is not specified or configured in any MachineConfig or KubeletConfig.

https://issues.redhat.com/browse/OCPBUGS-6723 was found, but given that the affected cluster is running 4.11.35, it is difficult to understand what happened, as this problem was generally believed to be solved.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.11.35

How reproducible:

Unknown

Steps to Reproduce:

1. OpenShift Container Platform 4 on AWS
2. Updating OpenShift Container Platform 4 - Control-Plane Node(s) to add more CPU and Memory 
3. Check whether a potential MachineConfig update is being applied

Actual results:

A MachineConfig update is rolled out to all OpenShift Container Platform 4 nodes after adding CPU and memory to the control-plane nodes because nodeStatusUpdateFrequency is updated, which is unexpected, and it is not clear why it happens.

Expected results:

Either no new MachineConfig to rollout after such a change or else to have a newly rendered MachineConfig that is being rolled out with information of what changed and why this change was applied
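
The "information of what changed" requested here could be as simple as a field-by-field diff of the two kubelet configurations. The sketch below shows the idea on flat dictionaries. `changed_fields` is a hypothetical helper, the `maxPods` entry is an illustrative placeholder rather than data from the affected cluster, and nested structures are not handled.

```python
from typing import Any, Dict, Tuple


def changed_fields(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each key whose value differs to its (old, new) pair."""
    keys = sorted(old.keys() | new.keys())
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


if __name__ == "__main__":
    # nodeStatusUpdateFrequency values are the ones reported above;
    # maxPods is an illustrative placeholder.
    before = {"nodeStatusUpdateFrequency": "10s", "maxPods": 250}
    after = {"nodeStatusUpdateFrequency": "0s", "maxPods": 250}
    for field, (was, now) in changed_fields(before, after).items():
        print(f"{field}: {was} -> {now}")
```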

Additional info:


This is a clone of issue OCPBUGS-18832. The following is the description of the original issue:

Description of problem:

The console does not allow customizing the abbreviation that appears on the resource icon badge. This causes an issue for the FAR operator: for the CRD FenceAgentRemediationTemplate, the badge icon shows FART. The CRD includes a custom short name, but the console ignores it.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. create the CRD (included link to github)
2. navigate to Home -> search
3. Enter far into the Resources filter

Actual results:

The badge FART shows in the dropdown

Expected results:

The badge should show fartemplate - the content of the short name
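
A small sketch of the expected behavior: prefer the short name declared by the CRD and only fall back to an abbreviation derived from the kind. The fallback shown here (capital letters of the kind, truncated to four) is an assumption that happens to reproduce the FART badge; the report does not show the console's actual derivation. `badge_abbreviation` is a hypothetical name.

```python
from typing import Optional


def badge_abbreviation(kind: str, short_name: Optional[str] = None) -> str:
    """Prefer the CRD short name; otherwise abbreviate the kind."""
    if short_name:
        return short_name
    # Assumed fallback: capital letters of the kind, at most four.
    capitals = "".join(ch for ch in kind if ch.isupper())
    return (capitals or kind.upper())[:4]


if __name__ == "__main__":
    kind = "FenceAgentRemediationTemplate"
    print(badge_abbreviation(kind))
    print(badge_abbreviation(kind, short_name="fartemplate"))
```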

Additional info:

 

 

Description of problem:

After installing a disconnected private cluster, SSH to the master/bootstrap nodes from the bastion on the VPC fails.

Version-Release number of selected component (if applicable):

Pre-merge build https://github.com/openshift/installer/pull/6836
registry.build05.ci.openshift.org/ci-ln-5g4sj02/release:latest
Tag: 4.13.0-0.ci.test-2023-02-27-033047-ci-ln-5g4sj02-latest

How reproducible:

always

Steps to Reproduce:

1. Create bastion instance maxu-ibmj-p1-int-svc
2. Create a VPC on the bastion host
3. Install a private disconnected cluster on the bastion host with a mirror registry
4. SSH to the bastion
5. SSH to the master/bootstrap nodes from the bastion

Actual results:

[core@maxu-ibmj-p1-int-svc ~]$ ssh -i ~/openshift-qe.pem core@10.241.0.5 -v
OpenSSH_8.8p1, OpenSSL 3.0.5 5 Jul 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 10.241.0.5 [10.241.0.5] port 22.
debug1: connect to address 10.241.0.5 port 22: Connection timed out
ssh: connect to host 10.241.0.5 port 22: Connection timed out

Expected results:

SSH succeeds.

Additional info:

$ibmcloud is sg-rules r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 --vpc maxu-ibmj-p1-vpc
Listing rules of security group r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 under account OpenShift-QE as user ServiceId-dff277a9-b608-410a-ad24-c544e59e3778...
ID                                          Direction   IP version   Protocol                      Remote   
r014-6739d68f-6827-41f4-b51a-5da742c353b2   outbound    ipv4         all                           0.0.0.0/0   
r014-06d44c15-d3fd-4a14-96c4-13e96aa6769c   inbound     ipv4         all                           shakiness-perfectly-rundown-take   
r014-25b86956-5370-4925-adaf-89dfca9fb44b   inbound     ipv4         tcp Ports:Min=22,Max=22       0.0.0.0/0   
r014-e18f0f5e-c4e5-44a5-b180-7a84aa59fa97   inbound     ipv4         tcp Ports:Min=3128,Max=3129   0.0.0.0/0   
r014-7e79c4b7-d0bb-4fab-9f5d-d03f6b427d89   inbound     ipv4         icmp Type=8,Code=0            0.0.0.0/0   
r014-03f23b04-c67a-463d-9754-895b8e474e75   inbound     ipv4         tcp Ports:Min=5000,Max=5000   0.0.0.0/0   
r014-8febe8c8-c937-42b6-b352-8ae471749321   inbound     ipv4         tcp Ports:Min=6001,Max=6002   0.0.0.0/0   

Because the upstream node-logs viewer feature is being enabled, we have to temporarily disable this test. The plan to switch to the upstream version requires the following steps, in order:
1. Modify current patches to match upstream change (being done as part of 1.27 bump)
2. Modify oc to work with both old and new API (being done in parallel with 1.27 bump, will be linked below).
3. Land k8s 1.27.
4. Modify machine-config-operator to enable enableSystemLogQuery config option (can land only after k8s 1.27, will be linked below).
5. Bring the test back.

Our telemetry test using remote write is increasingly flaky. The recurring error is:

TestTelemeterRemoteWrite
    telemeter_test.go:103: timed out waiting for the condition: error validating response body "{\"status\":\"success\",\"data\":{\"resultType\":\"vector\",\"result\":[{\"metric\":{\"container\":\"kube-rbac-proxy\",\"endpoint\":\"metrics\",\"job\":\"prometheus-k8s\",\"namespace\":\"openshift-monitoring\",\"remote_name\":\"2bdd72\",\"service\":\"prometheus-k8s\",\"url\":\"https://infogw.api.openshift.com/metrics/v1/receive\"},\"value\":[1684889572.197,\"20.125925925925927\"]}]}}" for query "max without(pod,instance) (rate(prometheus_remote_storage_samples_failed_total{job=\"prometheus-k8s\",url=~\"https://infogw.api.openshift.com.+\"}[5m]))": expecting Prometheus remote write to see no failed samples but got 20.125926

Any failed samples will cause this test to fail. This requirement is perhaps too strict. We could consider it good enough if some samples are sent successfully. The current version tests telemeter behavior on top of CMO behavior.
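
The difference between the current check and the relaxed one proposed here can be sketched as two predicates. This is an illustration only, not the CMO test code: both function names are hypothetical, and the relaxed variant assumes a second query that returns the rate of successfully sent samples, which the report does not describe.

```python
def strict_check(failed_rate: float) -> bool:
    """Current behavior: any failed sample fails the test."""
    return failed_rate == 0


def relaxed_check(succeeded_rate: float) -> bool:
    """Proposed behavior: pass as long as some samples are sent."""
    return succeeded_rate > 0


if __name__ == "__main__":
    failed = 20.125926   # failure rate from the error above
    succeeded = 100.0    # hypothetical success rate, for illustration
    print("strict:", strict_check(failed))
    print("relaxed:", relaxed_check(succeeded))
```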

Description of problem:

When running the installer on OSP with:

[...]
controlPlane:
  name: master
  platform: {}
  replicas: 3
[...]

in the install-config.yaml, it panics:

DEBUG OpenShift Installer 4.14.0-0.nightly-2023-07-20-215234
DEBUG Built from commit 1e9209ac80ed2cb4ba5663f519e51161a1d8858a
DEBUG Fetching Metadata...
DEBUG Loading Metadata...
DEBUG   Loading Cluster ID...
DEBUG     Loading Install Config...
DEBUG       Loading SSH Key...
DEBUG       Loading Base Domain...
DEBUG         Loading Platform...
DEBUG       Loading Cluster Name...
DEBUG         Loading Base Domain...
DEBUG         Loading Platform...
DEBUG       Loading Networking...
DEBUG         Loading Platform...
DEBUG       Loading Pull Secret...
DEBUG       Loading Platform...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x3956f6d]

goroutine 1 [running]:
github.com/openshift/installer/pkg/types/conversion.convertOpenStack(0xc000464dc0)
        /go/src/github.com/openshift/installer/pkg/types/conversion/installconfig.go:172 +0x1cd
github.com/openshift/installer/pkg/types/conversion.ConvertInstallConfig(0xc000464dc0)
        /go/src/github.com/openshift/installer/pkg/types/conversion/installconfig.go:47 +0x2af
github.com/openshift/installer/pkg/asset/installconfig.(*AssetBase).LoadFromFile(0xc000a18180, {0x20f8c650?, 0xc000696b40?})                                                                                                                 
        /go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfigbase.go:64 +0x32b
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).Load(0xc000a18180, {0x20f8c650?, 0xc000696b40?})                                                                                                                     
        /go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:118 +0x2e
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc0008f3f20, {0x20f95950, 0xc0002f9a40}, {0xc000af060c, 0x4})                                                                                                              
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:263 +0x35f
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc0008f3f20, {0x20f95920, 0xc00040cf60}, {0x819d89a, 0x2})                                                                                                                 
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:246 +0x256
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc0008f3f20, {0x7fed58b9ec98, 0x25ba8530}, {0x0, 0x0})                                                                                                                     
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:246 +0x256
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc0008f3f20, {0x7fed58b9ec98, 0x25ba8530}, {0x0, 0x0})                                                                                                                    
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:200 +0x1a9
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffd6b4992ff?, {0x7fed58b9ec98, 0x25ba8530}, {0x25b8ea80, 0x8, 0x8})                                                                                                     
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:76 +0x48
main.runTargetCmd.func1({0x7ffd6b4992ff, 0x6})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:260 +0x126
main.runTargetCmd.func2(0x25b96920?, {0xc0002f8100?, 0x4?, 0x4?})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:290 +0xe7
github.com/spf13/cobra.(*Command).execute(0x25b96920, {0xc0002f80c0, 0x4, 0x4})
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:920 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0xc000a0c000)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1040 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:968
main.installerMain()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-20-215234

How reproducible:

Always

Steps to Reproduce:

1. Create the install-config.yaml with an empty controlPlane.platform
2. Run the installer

Actual results:

Panic

Expected results:

Controlled error message if the platform is strictly necessary, otherwise a successful installation.
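
The installer itself is written in Go, and the report does not show the code at installconfig.go:172. The sketch below therefore only illustrates the kind of guard that would avoid the crash, under the assumption that the panic comes from reading an OpenStack machine-pool section that is absent when `controlPlane.platform` is empty. `openstack_pool_or_none` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def openstack_pool_or_none(install_config: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return controlPlane.platform.openstack, or None when any level is unset."""
    control_plane = install_config.get("controlPlane") or {}
    platform = control_plane.get("platform") or {}
    return platform.get("openstack")


if __name__ == "__main__":
    # Same shape as the install-config.yaml fragment above.
    config = {"controlPlane": {"name": "master", "platform": {}, "replicas": 3}}
    pool = openstack_pool_or_none(config)
    if pool is None:
        print("no OpenStack machine pool set; skipping conversion")
    else:
        print("converting", pool)
```

Checking for the missing section before use lets the caller either skip the conversion or return a controlled error, which matches the expected results above.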

Additional info:

 

Description of problem:

When running the command `oc-mirror --config config-oci-target.yaml  docker://localhost:5000  --use-oci-feature  --dest-use-http  --dest-skip-tls`, the command exits with code 0 but prints a log such as: unable to parse reference oci://mno/redhat-operator-index:v4.12: lstat /mno: no such file or directory.

Version-Release number of selected component (if applicable):

oc-mirror version 
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.13.0-202303011628.p0.g2e3885b.assembly.stream-2e3885b", GitCommit:"2e3885b469ee7d895f25833b04fd609955a2a9f6", GitTreeState:"clean", BuildDate:"2023-03-01T16:49:12Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1. Use an ImageSetConfiguration like the following and run the command shown after it:
cat config-oci-target.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /home/ocmirrortest/0302/60597
mirror:
  operators:
  - catalog: oci:///home/ocmirrortest/noo/redhat-operator-index
    targetCatalog: mno/redhat-operator-index
    targetTag: v4.12
    packages:
    - name: aws-load-balancer-operator
`oc-mirror --config config-oci-target.yaml  docker://localhost:5000  --use-oci-feature  --dest-use-http  --dest-skip-tls`


Actual results:

1. The command exits with code 0 but prints unexpected logs such as:
sha256:95c45fae0ca9e9bee0fa2c13652634e726d8133e4e3009b363fcae6814b3461d localhost:5000/albo/aws-load-balancer-rhel8-operator:95c45f
sha256:ab38b37c14f7f0897e09a18eca4a232a6c102b76e9283e401baed832852290b5 localhost:5000/albo/aws-load-balancer-rhel8-operator:ab38b3
info: Mirroring completed in 43.87s (28.5MB/s)
Rendering catalog image "localhost:5000/mno/redhat-operator-index:v4.12" with file-based catalog 
Writing image mapping to oc-mirror-workspace/results-1677743154/mapping.txt
Writing CatalogSource manifests to oc-mirror-workspace/results-1677743154
Writing ICSP manifests to oc-mirror-workspace/results-1677743154
unable to parse reference oci://mno/redhat-operator-index:v4.12: lstat /mno: no such file or directory

Expected results:

No such log should be printed.

 

Description of problem:

While troubleshooting a problem, oc incorrectly recommended the deprecated command "oc adm registry" in its output text.

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.12.0-202302280915.p0.gb05f7d4.assembly.stream-b05f7d4
Kustomize Version: v4.5.7
Server Version: 4.12.6
Kubernetes Version: v1.25.4+18eadca

Though this is likely broken in all previous versions of OpenShift 4.

How reproducible:

Only during error conditions where this error message is printed.

Steps to Reproduce:

1. Have a cluster without proper storage configured for the registry.
2. Try to build something.
3. "oc status --suggest" prints a message with the deprecated "oc adm registry" command.

Actual results:

$ oc status --suggest
In project pvctest on server https://api.pelauter-bm01.lab.home.arpa:6443

https://my-test-pvctest.apps.pelauter-bm01.lab.home.arpa (redirects) to pod port 8080-tcp (svc/my-test)
  deployment/my-test deploys istag/my-test:latest <-
    bc/my-test source builds https://github.com/sclorg/django-ex.git on openshift/python:3.9-ubi8
      build #1 new for 3 hours (can't push to image)
    deployment #1 running for 3 hours - 0/1 pods

Errors:
  * bc/my-test is pushing to istag/my-test:latest, but the administrator has not configured the integrated container image registry.

    try: oc adm registry -h
^ oc adm registry is deprecated in OpenShift 4; this should guide the user to the registry operator.

Expected results:

A pointer to the proper feature to manage the registry, like the openshift registry operator.

Additional info:

I know my cluster is not set up correctly, but oc should still not give me incorrect information.
If this version of oc is expected to also work against ocp3 clusters, the fix should take this into account, where that command is still valid.

Description of problem:

CCO watches too many things.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Run CCO in a cluster with a large amount of data in ConfigMaps or Secrets or Namespaces.
2. Watch memory usage scale linearly with the total size of those objects.
3.

Actual results:

Memory usage scales linearly with the size of all ConfigMaps, Secrets and Namespaces on the cluster.

Expected results:

Memory usage scales linearly with the data CCO actually needs to function.

Additional info:

 

Description of problem:

The external link icon in the `resource added` toast notification is not linked and cannot be clicked to open the app URL.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1. Use the +Add page and import from git
2. After creating the app, a toast notification appears
3. Click the external link icon

Actual results:

External link icon is not part of the link but has a pointer cursor and a hover effect. Clicking this icon does nothing.

Expected results:

External link icon should be part of the link and clickable.

Additional info:

 

We set image links in CMO's jsonnet code, as these can sometimes be used to populate labels and are generally considered good documentation practice.

In a cluster these links are replaced by CVO.

prometheus-adapter is now a k8s project and has moved locations accordingly from directxman12/k8s-prometheus-adapter to kubernetes-sigs/prometheus-adapter. This should be reflected in our image links, set at https://github.com/openshift/cluster-monitoring-operator/blob/35a063722c7e3b68d57aed18dc81f0dbdfbfc004/jsonnet/main.jsonnet#L66.

Description of the problem:

In Staging with BE 2.20.1, turning the "Integrate with platform" switch on fails with:

Failed to update the cluster
only x86-64 CPU architecture is supported on Nutanix clusters 

How reproducible:

100%

Steps to reproduce:

1. Create new cluster with OCP multi version

2. Discover NTNX hosts and turn integrate with platform on

3.

Actual results:

 

Expected results:

Description of problem:

Reported in https://github.com/openshift/cluster-ingress-operator/issues/911

When you open a new issue, it still directs you to Bugzilla, and then doesn't work.

It can be changed in https://github.com/openshift/cluster-ingress-operator/blob/master/.github/ISSUE_TEMPLATE/config.yml, but to what?

The correct Jira link is
https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332330&issuetype=1&components=12367900&priority=10300&customfield_12316142=26752

But can the public use this mechanism? Yes - https://redhat-internal.slack.com/archives/CB90SDCAK/p1682527645965899 

Version-Release number of selected component (if applicable):

n/a

How reproducible:

May be in other repos too.

Steps to Reproduce:

1. Open Issue in the repo - click on New Issue
2. Follow directions and click on link to open Bugzilla
3. Get message that this doesn't work anymore

Actual results:

You get instructions that don't work to open a bug from an Issue.

Expected results:

You get instructions to just open an Issue, or get correct instructions on how to open a bug using Jira.

Additional info:

 

Description of problem:

After creating a private Shared VPC cluster on AWS, the Ingress operator is degraded due to the following error:

2023-06-14T09:55:50.240Z	INFO	operator.dns_controller	controller/controller.go:118	reconciling	{"request": {"name":"default-wildcard","namespace":"openshift-ingress-operator"}}
2023-06-14T09:55:50.363Z	ERROR	operator.dns_controller	dns/controller.go:354	failed to publish DNS record to zone	{"record": {"dnsName":"*.apps.ci-op-2x6lics3-849ce.qe.devcluster.openshift.com.","targets":["internal-ac656ce4d29f64da289152053f50c908-1642793317.us-east-1.elb.amazonaws.com"],"recordType":"CNAME","recordTTL":30,"dnsManagementPolicy":"Managed"}, "dnszone": {"id":"Z0698684SM2RRJSYHP43"}, "error": "failed to get hosted zone for load balancer target \"internal-ac656ce4d29f64da289152053f50c908-1642793317.us-east-1.elb.amazonaws.com\": couldn't find hosted zone ID of ELB internal-ac656ce4d29f64da289152053f50c908-1642793317.us-east-1.elb.amazonaws.com"}


ingress operator:
ingress                                                                         False       True          True       37m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DNSReady=False (FailedZones: The record failed to provision in some zones: [{Z0698684SM2RRJSYHP43 map[]}])

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-13-223353 

How reproducible:

always

Steps to Reproduce:

1. Create a private Shared VPC cluster on AWS using STS

Actual results:

ingress operator degraded

Expected results:

cluster is healthy

Additional info:

Public clusters do not have this issue.

Description of problem:
Older images are pulled even when using minVersion in ImageSetConfiguration.

Version-Release number of selected component (if applicable):
oc mirror version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.11.0-202208031306.p0.g3c1c80c.assembly.stream-3c1c80c", GitCommit:"3c1c80ca6a5a22b5826c88897e7a9e5acd7c1a96", GitTreeState:"clean", BuildDate:"2022-08-03T14:23:35Z", GoVersion:"go1.18.4", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:
Always

Steps to Reproduce:
1. get attached ImageSetConfiguration
2. run 'oc mirror --config=./image-set.yaml docker://<yourRegistry> --continue-on-error'

Actual results:
The output contains many 'unable to retrieve source image' errors for images that are older than the version defined in minVersion. Those images are known to be missing; the goal was to use minVersion to filter them out and get rid of the errors, but the filter is not working.

Expected results:
Those older images should not be included
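The expected filtering can be sketched as follows. This is an illustration only, not oc-mirror's implementation: the `parse_version` and `filter_min_version` helpers are hypothetical, and the parser assumes simple `MAJOR.MINOR.PATCH[-BUILD]` strings with a numeric build suffix rather than full semantic-version rules.

```python
def parse_version(version: str) -> tuple:
    """Turn 'MAJOR.MINOR.PATCH[-BUILD]' into a comparable tuple of ints.

    Hypothetical helper: a missing build suffix is treated as 0.
    """
    core, _, build = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch, int(build) if build else 0)


def filter_min_version(versions, min_version):
    """Keep only the versions that are at or above min_version."""
    floor = parse_version(min_version)
    return [v for v in versions if parse_version(v) >= floor]


# 1.0.11 is the version reported for the failing pilot-rhel8 image;
# 2.3.0-0 is a made-up newer version used only for illustration.
print(filter_min_version(["1.0.11", "2.2.1-0", "2.3.0-0"], "2.2.1-0"))
```

With a floor of '2.2.1-0', the 1.0.11 entry is dropped, which is the behavior the reporter expected from minVersion.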

Additional info:
image-set.yaml is attached
Full output of 'oc mirror' attached
There are more images failing but as an example:

error: unable to retrieve source image registry.redhat.io/openshift-service-mesh/pilot-rhel8 manifest sha256:f7c468b5a35bfce54e53b4d8d00438f33a0861549697d14445eae52d8ead9a68: for image pulls. Use the equivalent V2 schema 2 manifest digest instead. For more information see https://access.redhat.com/articles/6138332

This image is from version 1.0.11 but minVersion: '2.2.1-0' so it should not be included.
Here is how I checked that image:

podman inspect registry-proxy.engineering.redhat.com/rh-osbs/openshift-service-mesh-pilot-rhel8@sha256:f7c468b5a35bfce54e53b4d8d00438f33a0861549697d14445eae52d8ead9a68 | grep version
                "istio_version": "1.1.17",
                "version": "1.0.11"
            "istio_version": "1.1.17",
            "version": "1.0.11"

This is a clone of issue OCPBUGS-8512. The following is the description of the original issue:

Description of problem:

WebhookConfiguration caBundle injection is incorrect when some webhooks are already configured with caBundle.

The behavior seems to be that the first n webhooks in the `.webhooks` array have caBundle injected, where n is the number of webhooks that do not have caBundle set.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a validatingwebhookconfigurations or mutatingwebhookconfigurations with `service.beta.openshift.io/inject-cabundle: "true"` annotation.

2. oc edit validatingwebhookconfigurations (or oc edit mutatingwebhookconfigurations)

3. Add a new webhook to the end of the list `.webhooks`. It will not have caBundle set manually as service-ca should inject it. 

4. Observe new webhook does not get caBundle injected.

Note: it is important in step 3 that the new webhook is added to the end of the list.

 

Actual results:

Only the first n webhooks have caBundle injected where n is the number of webhooks without caBundle set.

Expected results:

All webhooks have caBundle injected when they do not have it set.

Additional info:

Open PR here: https://github.com/openshift/service-ca-operator/pull/207

The issue seems to be a mistake with the Go for-range syntax, where "i" is the position of the desired index in the list rather than the index to update.

TL;DR: the code should use the value of the int in the array, not the index of the int in the array.
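A minimal sketch of the suspected mistake, written in Python rather than Go and not taken from service-ca-operator: in Go, `for i := range indices` yields the positions 0..n-1, which `range(len(needing))` mirrors below. The function names and the dictionary shape of a webhook are assumptions made for illustration.

```python
import copy


def webhooks_needing_update(webhooks):
    """Return the indices of webhooks that have no caBundle set."""
    return [i for i, hook in enumerate(webhooks) if not hook.get("caBundle")]


def inject_buggy(webhooks, ca_bundle):
    """Iterates over positions in the index list, so only the first n webhooks are touched."""
    result = copy.deepcopy(webhooks)
    needing = webhooks_needing_update(result)
    for i in range(len(needing)):  # position in 'needing', not the stored index
        result[i]["caBundle"] = ca_bundle
    return result


def inject_fixed(webhooks, ca_bundle):
    """Iterates over the stored indices, so every webhook lacking a caBundle is updated."""
    result = copy.deepcopy(webhooks)
    for i in webhooks_needing_update(result):
        result[i]["caBundle"] = ca_bundle
    return result


hooks = [{"name": "a", "caBundle": "CA"}, {"name": "b", "caBundle": "CA"}, {"name": "c"}]
print(inject_buggy(hooks, "CA")[2])
print(inject_fixed(hooks, "CA")[2])
```

With one webhook appended at the end without a caBundle, the buggy variant rewrites the first webhook and leaves the new one untouched, which matches the behavior described in the steps to reproduce.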

Description of problem: 

 

monitoringPlugin tolerations not working

 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

apply monitoringPlugin tolerations to cm `cluster-monitoring-config`
example:
...  
    monitoringPlugin:
      tolerations:
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoSchedule"

Actual results:

The ConfigMap is applied, but the tolerations do not take effect on the deployment.

Expected results:

The tolerations are applied to the deployment/pod.

Additional info:

The same problem applies to NodeSelector and TopologySpreadConstraints.

Description of problem:

The prometheus-operator pod has the "app.kubernetes.io/version: 0.63.0" annotation while it's based on 0.65.1. 

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Check app.kubernetes.io/version annotations for prometheus-operator pod.
2.
3.

Actual results:

0.63.0

Expected results:

0.65.1

Additional info:

 

This is a clone of issue OCPBUGS-19715. The following is the description of the original issue:

Due to the EOL of RHV in OCP, we'll need to disable oVirt as an installation option in the installer.
Note: The first step is disabling it. Removing all related code from the installer will be done in a later release.

Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/898

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The dev workflow for OCP operators wanting to use feature gates is

1) change openshift/api
2) bump openshift/api in cluster-config-operator (CCO)
3) bump openshift/api in your operator and add logic for the feature gate

Currently, HyperShift requires its own bump of openshift/api in order to set the proper feature gates, which is not preferred. Ideally, cluster-config-operator is the single place where an API bump is required.

HyperShift should use the CCO `render` command to generate the FeatureGate CR.

Description of problem:

If we add a configmap to a BuildConfig as build input, the configmap data is not present at the destinationDir on the build pod.

Version-Release number of selected component (if applicable):

 

How reproducible:

Follow below steps to reproduce.

Steps to Reproduce:

1. Create a configmap to pass as build input

apiVersion: v1
data:
  settings.xml: |+
    xxx
    yyy
kind: ConfigMap
metadata:
  name: build-test
  namespace: test

2. Create a BuildConfig like the one below

apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  labels:
    app: custom-build
  name: custom-build
spec:
  source:
    configMaps:
    - configMap:
        name: build-test
      destinationDir: /tmp
    type: None
  output:
    to:
      kind: ImageStreamTag
      name: custom-build:latest
  postCommit: {}
  runPolicy: Serial
  strategy:
    customStrategy:
      from:
        kind: "DockerImage"
        name: "registry.redhat.io/rhel8/s2i-base"

 3. start a new build

    oc start-build custom-build

 4. As per the documentation [a], the configmap data should be present on the build pod at "/var/run/secrets/openshift.io/build" if "destinationDir" is not explicitly set. In the example above, "destinationDir" is set to "/tmp", so the "settings.xml" file from the configmap should be present in the "/tmp" directory of the build pod.
 
[a] https://docs.openshift.com/container-platform/4.12/cicd/builds/creating-build-inputs.html#builds-custom-strategy_creating-build-inputs
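The expectation in step 4 can be written down as a small path calculation. This is a sketch of the reporter's reading of the documentation, not the builder's code: the `expected_paths` helper is hypothetical, and the assumption that files land directly under the chosen directory is not confirmed by the report.

```python
import posixpath

# Default location named in the report for configmap build inputs.
DEFAULT_DIR = "/var/run/secrets/openshift.io/build"


def expected_paths(config_map_keys, destination_dir=None):
    """Return where each configmap key is expected to appear on the build pod.

    Assumption: each key becomes a file directly under destination_dir,
    or under DEFAULT_DIR when no destination_dir is given.
    """
    base = destination_dir or DEFAULT_DIR
    return [posixpath.join(base, key) for key in config_map_keys]


print(expected_paths(["settings.xml"], "/tmp"))
print(expected_paths(["settings.xml"]))
```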

Actual results:

Configmap data is not present in the "destinationDir" or in the default location "/var/run/secrets/openshift.io/build"

Expected results:

Configmap data should be present on the destinationDir of the builder pod.

Additional info:

 

Description of problem:

As a user, when I select the All projects option from the Projects dropdown on the Pipelines pages in the Dev perspective, the selected option is displayed as undefined.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1. Navigate to Pipelines page in the Dev perspective
2. Select the All projects option from the Projects dropdown

Actual results:

The selected option shows as undefined and the All projects list is not shown

Expected results:

The selected option should be All projects, and the All projects list page should open

Additional info:

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

Follow up for https://issues.redhat.com/browse/HOSTEDCP-975

  • Explore and discuss granular metrics to track NodePool lifecycle bottlenecks: infra, ignition, node networking, available. Consolidate that with hostedClusterTransitionSeconds metrics and dashboard panels.
  • Explore and discuss metrics for upgrade duration SLO for NodePool.

Description of problem:

The IBM VPC CSI Driver fails to provision volumes in a proxy cluster. If I understand correctly, the proxy is not injected because our definition (https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/controller.yaml) injects the proxy into csi-driver:
    config.openshift.io/inject-proxy: csi-driver
    config.openshift.io/inject-proxy-cabundle: csi-driver
but the container name is iks-vpc-block-driver in https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/controller.yaml#L153

I checked, and the proxy is not defined in the controller pod or the driver container ENV.
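The mismatch described above can be checked mechanically. The sketch below is an illustration under assumptions, not the operator's or the injector's code: the `unmatched_injection_targets` helper is hypothetical, and it assumes each annotation value names exactly one container.

```python
INJECT_KEYS = (
    "config.openshift.io/inject-proxy",
    "config.openshift.io/inject-proxy-cabundle",
)


def unmatched_injection_targets(annotations, container_names):
    """Return the injection annotations whose value names no existing container."""
    return {
        key: annotations[key]
        for key in INJECT_KEYS
        if key in annotations and annotations[key] not in container_names
    }


# Values taken from the report: the annotations point at 'csi-driver',
# but the container is named 'iks-vpc-block-driver'.
annotations = {
    "config.openshift.io/inject-proxy": "csi-driver",
    "config.openshift.io/inject-proxy-cabundle": "csi-driver",
}
print(unmatched_injection_targets(annotations, ["iks-vpc-block-driver"]))
```

Both annotations are reported as unmatched here, while either renaming the target to `iks-vpc-block-driver` or renaming the container would make the result empty.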

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-08-11-055332

How reproducible:

Always

Steps to Reproduce:

1. Create IBM cluster with proxy setting
2. Create a PVC/pod with the IBM VPC CSI Driver

Actual results:

Volume provisioning fails

Expected results:

Volume provisioning works on a proxy cluster

Additional info:

 

Description of problem:

When running the command `oc-mirror list operators --catalog=registry.redhat.io/redhat/certified-operator-index:v4.12 -v 9`, the response code is 200 OK at the beginning; the command then hangs for a while and eventually gets response code 401.

Version-Release number of selected component (if applicable):


How reproducible:

sometimes

Steps to Reproduce:

Using the advanced cluster management package as an example.

1. oc-mirror list operators --catalog=registry.redhat.io/redhat/certified-operator-index:v4.12 -v 9


Actual results: After hanging for a while, the command gets a 401 code. It seems that when oc-mirror retries after a timeout, it forgets to read the credentials.

level=debug msg=fetch response received digest=sha256:a67257cfe913ad09242bf98c44f2330ec7e8261ca3a8db3431cb88158c3d4837 mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.accept-ranges=bytes response.header.age=714959 response.header.connection=keep-alive response.header.content-length=80847073 response.header.content-type=binary/octet-stream response.header.date=Mon, 06 Feb 2023 06:52:06 GMT response.header.etag="a428fafd37ee58f4bdeae1a7ff7235b5-1" response.header.last-modified=Fri, 16 Sep 2022 17:54:09 GMT response.header.server=AmazonS3 response.header.via=1.1 010c0731b9775a983eceaec0f5fa6a2e.cloudfront.net (CloudFront) response.header.x-amz-cf-id=rEfKWnJdasWIKnjWhYyqFn9eHY8v_3Y9WwSRnnkMTkPayHlBxWX1EQ== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-storage-class=INTELLIGENT_TIERING response.header.x-amz-version-id=GfqTTjWbdqB0sreyjv3fyo1k6LQ9kZKC response.header.x-cache=Hit from cloudfront response.status=200 OK size=80847073 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:a67257cfe913ad09242bf98c44f2330ec7e8261ca3a8db3431cb88158c3d4837
level=debug msg=fetch response received digest=sha256:d242c7b4380d3c9db3ac75680c35f5c23639a388ad9313f263d13af39a9c8b8b mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.accept-ranges=bytes response.header.age=595868 response.header.connection=keep-alive response.header.content-length=98028196 response.header.content-type=binary/octet-stream response.header.date=Tue, 07 Feb 2023 15:56:56 GMT response.header.etag="f702c84459b479088565e4048a890617-1" response.header.last-modified=Wed, 18 Jan 2023 06:55:12 GMT response.header.server=AmazonS3 response.header.via=1.1 7f5e0d3b9ea85d0d75063a66c0ebc840.cloudfront.net (CloudFront) response.header.x-amz-cf-id=Tw9cjJjYCy8idBiQ1PvljDkhAoEDEzuDCNnX6xJub4hGeh8V0CIP_A== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-storage-class=INTELLIGENT_TIERING response.header.x-amz-version-id=nt7yY.YmjWF0pfAhzh_fH2xI_563GnPz response.header.x-cache=Hit from cloudfront response.status=200 OK size=98028196 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:d242c7b4380d3c9db3ac75680c35f5c23639a388ad9313f263d13af39a9c8b8b
level=debug msg=fetch response received digest=sha256:664a8226a152ea0f1078a417f2ec72d3a8f9971e8a374859b486b60049af9f18 mediatype=application/vnd.docker.container.image.v1+json response.header.accept-ranges=bytes response.header.age=17430 response.header.connection=keep-alive response.header.content-length=24828 response.header.content-type=binary/octet-stream response.header.date=Tue, 14 Feb 2023 08:37:35 GMT response.header.etag="57eb6fdca8ce82a837bdc2cebadc3c7b-1" response.header.last-modified=Mon, 13 Feb 2023 16:11:57 GMT response.header.server=AmazonS3 response.header.via=1.1 0c96ded7ff282d2dbcf47c918b6bb500.cloudfront.net (CloudFront) response.header.x-amz-cf-id=w9zLDWvPJ__xbTpI8ba5r9DRsFXbvZ9rSx5iksG7lFAjWIthuokOsA== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-version-id=Enw8mLebn4.ShSajtLqdo4riTDHnVEFZ response.header.x-cache=Hit from cloudfront response.status=200 OK size=24828 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:664a8226a152ea0f1078a417f2ec72d3a8f9971e8a374859b486b60049af9f18
level=debug msg=fetch response received digest=sha256:130c9d0ca92e54f59b68c4debc5b463674ff9555be1f319f81ca2f23e22de16f mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.accept-ranges=bytes response.header.age=829779 response.header.connection=keep-alive response.header.content-length=26039246 response.header.content-type=binary/octet-stream response.header.date=Sat, 04 Feb 2023 22:58:25 GMT response.header.etag="a08688b701b31515c6861c69e4d87ebd-1" response.header.last-modified=Tue, 06 Dec 2022 20:50:51 GMT response.header.server=AmazonS3 response.header.via=1.1 000f4a2f631bace380a0afa747a82482.cloudfront.net (CloudFront) response.header.x-amz-cf-id=S-h31zheAEOhOs6uH52Rpq0ZnoRRdd5VfaqVbZWXzAX-Zym-0XtuKA== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-storage-class=INTELLIGENT_TIERING response.header.x-amz-version-id=BQOjon.COXTTON_j20wZbWWoDEmGy1__ response.header.x-cache=Hit from cloudfront response.status=200 OK size=26039246 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:130c9d0ca92e54f59b68c4debc5b463674ff9555be1f319f81ca2f23e22de16f




level=debug msg=do request digest=sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9 mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip request.header.accept=application/vnd.docker.image.rootfs.diff.tar.gzip, */* request.header.range=bytes=13417268- request.header.user-agent=opm/alpha request.method=GET size=91700480 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9
level=debug msg=fetch response received digest=sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9 mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.cache-control=max-age=0, no-cache, no-store response.header.connection=keep-alive response.header.content-length=99 response.header.content-type=application/json response.header.date=Tue, 14 Feb 2023 13:34:06 GMT response.header.docker-distribution-api-version=registry/2.0 response.header.expires=Tue, 14 Feb 2023 13:34:06 GMT response.header.pragma=no-cache response.header.registry-proxy-request-id=0d7ea55f-e96d-4311-885a-125b32c8e965 response.header.www-authenticate=Bearer realm="https://registry.redhat.io/auth/realms/rhcc/protocol/redhat-docker-v2/auth",service="docker-registry",scope="repository:redhat/certified-operator-index:pull" response.status=401 Unauthorized size=91700480 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9.

Expected results:

The command should always read the credentials.

 

Description of problem:

Using openshift-install v4.13.0, no issue messages are displayed on the console.
Looking at /etc/issue.d/, the issue files are created, but agetty does not display them.
# cat /etc/issue.d/70_agent-services.issue
\e{cyan}Waiting for services:\e{reset}
[\e{cyan}start\e{reset}] Service that starts cluster installation

Version-Release number of selected component (if applicable):

4.13

How reproducible:

100%

Steps to Reproduce:

1. Build agent image using openshift-install v4.13.0
2. Mount the ISO and boot a machine
3. Wait for a while until issues are created in /etc/issue.d/ 

Actual results:

No messages are displayed to console

Expected results:

All messages should be displayed

Additional info:

https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1686646256441329

 

User Story:

When changing platform fields, e.g. the AWS instance type, we trigger a rolling upgrade. However, nothing is signalled in the NodePool state, which results in bad UX.

NodePools should signal rolling upgrade because of platform changes.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Description of problem:

The agent installer integration test fails because the base ISO's kargs.json changed and now uses fedora-coreos instead of rhcos. Because the integration test uses strict checks with the `cmp` function, the test fails due to the absence of "coreos.liveiso=fedora-coreos-38.20230609.3.0" in the expected result of the integration test.
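To see why a strict comparison trips on this change, the sketch below compares the two kernel lines that appear in the diff of this report's test output. It is a Python illustration, not the testscript `cmp` implementation, and the helper names are hypothetical.

```python
# Kernel line generated by the installer (from the test output in this report).
ACTUAL = (
    "kernel http://user-specified-pxe-infra.com/agent.x86_64-vmlinuz initrd=initrd "
    "coreos.live.rootfs_url=http://user-specified-pxe-infra.com/agent.x86_64-rootfs.img "
    "coreos.liveiso=fedora-coreos-38.20230609.3.0 ignition.firstboot ignition.platform.id=metal"
)
# Kernel line stored in the expected iPXE script.
EXPECTED = (
    "kernel http://user-specified-pxe-infra.com/agent.x86_64-vmlinuz initrd=initrd "
    "coreos.live.rootfs_url=http://user-specified-pxe-infra.com/agent.x86_64-rootfs.img "
    "ignition.firstboot ignition.platform.id=metal"
)


def strict_equal(actual: str, expected: str) -> bool:
    """Exact comparison: any extra argument makes the check fail."""
    return actual == expected


def unexpected_args(actual: str, expected: str) -> list:
    """List the whitespace-separated arguments present only in the actual line."""
    expected_tokens = set(expected.split())
    return [token for token in actual.split() if token not in expected_tokens]


print(strict_equal(ACTUAL, EXPECTED))
print(unexpected_args(ACTUAL, EXPECTED))
```

The only token that differs is the `coreos.liveiso` argument, so the expected file has to change whenever the base ISO's kernel arguments do.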

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Get latest code from master branch
2. Run ./hack/go-integration-test.sh

Actual results:

INFO[2023-09-01T02:23:01Z] --- FAIL: TestAgentIntegration (369.83s)
    --- FAIL: TestAgentIntegration/agent_pxe_configurations (0.00s)
        --- FAIL: TestAgentIntegration/agent_pxe_configurations/sno (49.93s)
            testscript.go:520: # Verify a default configuration for the SNO topology (49.805s)
                > exec openshift-install agent create pxe-files --dir $WORK
                [stderr]
                level=warning msg=CPUPartitioning:  is ignored
                level=info msg=Configuration has 1 master replicas and 0 worker replicas
                level=info msg=The rendezvous host IP (node0 IP) is 192.168.111.20
                level=info msg=Extracting base ISO from release payload
                level=info msg=Verifying cached file
                level=info msg=Using cached Base ISO /.cache/agent/image_cache/coreos-x86_64.iso
                level=info msg=Consuming Install Config from target directory
                level=info msg=Consuming Agent Config from target directory
                level=info msg=Created iPXE script agent.x86_64.ipxe in $WORK/pxe directory
                level=info msg=PXE-files created in: $WORK/pxe
                level=info msg=Kernel parameters for PXE boot: coreos.liveiso=fedora-coreos-38.20230609.3.0 ignition.firstboot ignition.platform.id=metal
                > stderr 'Created iPXE script agent.x86_64.ipxe'
                > exists $WORK/pxe/agent.x86_64-initrd.img
                > exists $WORK/pxe/agent.x86_64-rootfs.img
                > exists $WORK/pxe/agent.x86_64-vmlinuz
                > exists $WORK/auth/kubeconfig
                > exists $WORK/auth/kubeadmin-password
                > cmp $WORK/pxe/agent.x86_64.ipxe $WORK/expected/agent.x86_64.ipxe
                diff $WORK/pxe/agent.x86_64.ipxe $WORK/expected/agent.x86_64.ipxe
                --- $WORK/pxe/agent.x86_64.ipxe
                +++ $WORK/expected/agent.x86_64.ipxe
                @@ -1,4 +1,4 @@
                 #!ipxe
                 initrd --name initrd http://user-specified-pxe-infra.com/agent.x86_64-initrd.img
                -kernel http://user-specified-pxe-infra.com/agent.x86_64-vmlinuz initrd=initrd coreos.live.rootfs_url=http://user-specified-pxe-infra.com/agent.x86_64-rootfs.img coreos.liveiso=fedora-coreos-38.20230609.3.0 ignition.firstboot ignition.platform.id=metal
                +kernel http://user-specified-pxe-infra.com/agent.x86_64-vmlinuz initrd=initrd coreos.live.rootfs_url=http://user-specified-pxe-infra.com/agent.x86_64-rootfs.img ignition.firstboot ignition.platform.id=metal
                 boot

                FAIL: testdata/agent/pxe/configurations/sno.txt:13: $WORK/pxe/agent.x86_64.ipxe and $WORK/expected/agent.x86_64.ipxe differ

Expected results:

Test should always pass

Additional info:

 

Description of problem:

The accessTokenInactivityTimeout configured under tokenConfig in HostedCluster has no effect.
1. The value is not updated in the oauth-openshift configmap.
2. HostedCluster allows the user to set an accessTokenInactivityTimeout value < 300s, whereas in a master cluster the value should be > 300s.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Install a fresh 4.13 hypershift cluster  
2. Configure accessTokenInactivityTimeout as below:
$ oc edit hc -n clusters
...
  spec:
    configuration:
      oauth:
        identityProviders:
        ...
        tokenConfig:          
          accessTokenInactivityTimeout: 100s
...
3. Check the hcp:
$ oc get hcp -oyaml
...
        tokenConfig:           
          accessTokenInactivityTimeout: 1m40s
...

4. Login to guest cluster with testuser-1 and get the token
$ oc login https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443 -u testuser-1 -p xxxxxxx
$ TOKEN=`oc whoami -t`
$ oc login --token="$TOKEN"
WARNING: Using insecure TLS client config. Setting this option is not supported!
Logged into "https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443" as "testuser-1" using the token provided.
You don't have any projects. You can try to create a new project, by running
    oc new-project <projectname>

Actual results:

1. HostedCluster allows the user to set a value < 300s for accessTokenInactivityTimeout, which is not possible on a master cluster.

2. The value is not updated in oauth-openshift configmap:
$ oc get cm oauth-openshift -oyaml -n clusters-hypershift-ci-25785 
...
      tokenConfig:
        accessTokenMaxAgeSeconds: 86400
        authorizeTokenMaxAgeSeconds: 300
...

3. Login doesn't fail even if the user is not active for more than the set accessTokenInactivityTimeout seconds.

Expected results:

Login fails if the user is not active within the accessTokenInactivityTimeout seconds.
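The missing validation can be sketched as follows. This is not HyperShift code: the helpers are hypothetical, the parser handles only the `h`, `m` and `s` units seen in this report, and the 300-second floor is taken from the report, with the inclusive bound being an assumption.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
MIN_INACTIVITY_SECONDS = 300  # floor named in the report; inclusive bound is assumed


def parse_duration(text: str) -> int:
    """Convert a duration such as '100s' or '1m40s' to whole seconds."""
    parts = re.findall(r"(\d+)([hms])", text)
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def is_allowed_timeout(text: str) -> bool:
    """Return True when the timeout is at least the assumed minimum."""
    return parse_duration(text) >= MIN_INACTIVITY_SECONDS


# '100s' in the HostedCluster and '1m40s' in the HostedControlPlane are the same value.
print(parse_duration("100s"), parse_duration("1m40s"))
print(is_allowed_timeout("100s"), is_allowed_timeout("5m"))
```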

Kube 1.26 introduced the warning level TopologyAwareHintsDisabled event. TopologyAwareHintsDisabled is fired by the EndpointSliceController whenever reconciling a service that has activated topology aware hints via the service.kubernetes.io/topology-aware-hints annotation, but there is not enough information in the existing cluster resources (typically nodes) to apply the topology aware hints.

When rebasing OpenShift onto Kube 1.26, our CI builds are failing (except on AWS) because these events are firing "pathologically", for example:

: [sig-arch] events should not repeat pathologically
  events happened too frequently event happened 83 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4 result=reject 

AWS nodes seem to have the proper values in the nodes. GCP has the values also, but they are not "right" for the purposes of the EndpointSliceController:

event happened 38 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 result=reject }

https://github.com/openshift/origin/pull/27666 will mask this problem (make it stop erroring in CI) but changes still need to be made in the product so end users are not subjected to these events.

Now links to:
test=[sig-arch] events should not repeat pathologically for namespace openshift-dns

 

Description of problem:

The DNS egress router must run as privileged. With it being just an haproxy, it doesn't make much sense.

If I am not wrong, the biggest reason to need privileged mode is the `chroot` option inherited from the default file (https://github.com/openshift/images/blob/master/egress/dns-proxy/egress-dns-proxy.sh#L44). That option doesn't make much sense when we are already inside a container (which is why ingress controllers don't use it, for example).

So it may be worth exploring whether this option can be removed so that the DNS egress router can run without privileged mode, perhaps with just CAP_NET_BIND_SERVICE.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Always

Steps to Reproduce:

1. Forget to set privileged mode in the container
2.
3.

Actual results:

Pod cannot start due to chroot setting. I need to run the container as privileged, which lowers security too much.

Expected results:

Run the container without being privileged, maybe adding CAP_NET_BIND_SERVICE.

Additional info:


Description of problem:

The migrator pod in the `openshift-kube-storage-version-migrator` project is stuck in the Pending state

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Add a default cluster-wide node selector with a label that does not match any node label:
   $ oc edit scheduler cluster
   apiVersion: config.openshift.io/v1
   kind: Scheduler
   metadata:
     name: cluster
   ...
   spec:
     defaultNodeSelector: node-role.kubernetes.io/role=app
     mastersSchedulable: false

2. Delete the migrator pod running in the `openshift-kube-storage-version-migrator`
   $ oc delete pod migrator-6b78665974-zqd47 -n openshift-kube-storage-version-migrator

3. Check if the migrator pod comes up in running state or not.
   $ oc get pods -n openshift-kube-storage-version-migrator
   NAME                        READY   STATUS    RESTARTS   AGE
   migrator-6b78665974-j4jwp   0/1     Pending   0          2m41s

Actual results:

The pod goes into the pending state because it tries to get scheduled on the node having label `node-role.kubernetes.io/role=app`.

Expected results:

The pod should come up in the running state; it should not be affected by the cluster-wide node selector.

Additional info:

Setting the annotation `openshift.io/node-selector=` into the `openshift-kube-storage-version-migrator` project and then deleting the pending migrator pod helps in bringing the pod up.

The expectation with this bug is that the project `openshift-kube-storage-version-migrator` should have the annotation `openshift.io/node-selector=`, so that the pod running in this project is not affected by a wrong cluster-wide node selector configuration.
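The workaround relies on the project annotation taking precedence over the cluster-wide default, even when the annotation is empty. A simplified model of that precedence, with a hypothetical helper and not the scheduler's or admission plugin's code, could look like this:

```python
NODE_SELECTOR_KEY = "openshift.io/node-selector"


def effective_node_selector(namespace_annotations: dict, cluster_default: str) -> str:
    """Return the node selector applied to pods in a project.

    Assumption: an annotation that is present, even with an empty value,
    overrides the cluster-wide defaultNodeSelector.
    """
    if NODE_SELECTOR_KEY in namespace_annotations:
        return namespace_annotations[NODE_SELECTOR_KEY]
    return cluster_default


default = "node-role.kubernetes.io/role=app"
print(repr(effective_node_selector({}, default)))
print(repr(effective_node_selector({NODE_SELECTOR_KEY: ""}, default)))
```

Without the annotation the migrator pod inherits the unmatched default selector and stays Pending; with the empty annotation no selector is applied.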

Description of problem:

Various jobs are failing in e2e-gcp-operator due to the LoadBalancer-Type Service not going "ready", which means it is most likely not getting an IP address.

Tests so far affected are:
- TestUnmanagedDNSToManagedDNSInternalIngressController
- TestScopeChange
- TestInternalLoadBalancerGlobalAccessGCP
- TestInternalLoadBalancer
- TestAllowedSourceRanges

For example, in TestInternalLoadBalancer, the Load Balancer never comes back ready:

operator_test.go:1454: Expected conditions: map[Admitted:True Available:True DNSManaged:True DNSReady:True LoadBalancerManaged:True LoadBalancerReady:True]
         Current conditions: map[Admitted:True Available:False DNSManaged:True DNSReady:False Degraded:True DeploymentAvailable:True DeploymentReplicasAllAvailable:True DeploymentReplicasMinAvailable:True DeploymentRollingOut:False EvaluationConditionsDetected:False LoadBalancerManaged:True LoadBalancerProgressing:False LoadBalancerReady:False Progressing:False Upgradeable:True]

Where DNSReady:False and LoadBalancerReady:False.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

10% of the time

Steps to Reproduce:

1. Run e2e-gcp-operator many times until you see one of these failures

Actual results:

Test Failure

Expected results:

No failure

Additional info:

Search.CI Links:
TestScopeChange
TestInternalLoadBalancerGlobalAccessGCP & TestInternalLoadBalancer 

This does not seem related to https://issues.redhat.com/browse/OCPBUGS-6013. The DNS E2E tests actually pass this same condition check.

Description of problem:


When we merged https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/229, it changed the way failure domains were injected for Azure so that additional fields could be accounted for. However, the CPMS failure domains have Azure zones as a string (which they should be) and the machine v1beta1 spec has them as a string pointer.

This means that the CPMS now detects a difference between a nil zone and an empty string, even though every other piece of code in OpenShift treats them the same.

We should update the machine v1beta1 type to remove the pointer. This will be a no-op in terms of the data stored in etcd since the type is unstructured anyway.

It will then require updates to the MAPZ, CPMS, MAO and installer repositories to update their generation.
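The nil versus empty-string distinction can be modeled with an optional string. The sketch below is in Python rather than Go and is not CPMS code; `None` stands in for a nil string pointer, and both helper names are hypothetical.

```python
from typing import Optional


def zones_equal_strict(a: Optional[str], b: Optional[str]) -> bool:
    """Pointer-style comparison: a missing zone differs from an empty one."""
    return a == b


def zones_equal_normalized(a: Optional[str], b: Optional[str]) -> bool:
    """Treat a missing zone and an empty zone as the same value."""
    return (a or "") == (b or "")


print(zones_equal_strict(None, ""))
print(zones_equal_normalized(None, ""))
```

Removing the pointer from the machine type has the same effect as the normalized comparison: there is no longer a separate "missing" state to compare against.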

Version-Release number of selected component (if applicable):

4.14 nightlies from the merge of 229 onwards

How reproducible:

This is only affecting regions in Azure where there are no zones, currently in CI it's affecting about 20% of events.

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

The node debug console is not available on all nodes when deploying HyperShift on KubeVirt using the 'hypershift create cluster kubevirt' default root-volume-size (16 GB).

Version-Release number of selected component (if applicable):

(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc version
Client Version: 4.12.0-0.nightly-2023-04-01-095001
Kustomize Version: v4.5.7
Server Version: 4.12.8
Kubernetes Version: v1.25.7+eab9cc9

How reproducible:

happens all the time.

Steps to Reproduce:

  1. Deploy a hub cluster with 3 masters and 3 workers, each with a 100G disk, then deploy a hosted cluster on it with 2 workers using the default 16G disk.

Actual results:

(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc debug node/hyper-1-kd7sm
Temporary namespace openshift-debug-5cctb is created for debugging node...
Starting pod/hyper-1-kd7sm-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
Temporary namespace openshift-debug-5cctb was removed.
Error from server (BadRequest): container "container-00" in pod "hyper-1-kd7sm-debug" is not available
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]

Expected results:

(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc debug node/hyper-1-rkkkm
Temporary namespace openshift-debug-v6xr8 is created for debugging node...
Starting pod/hyper-1-rkkkm-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.128.2.76
If you don't see a command prompt, try pressing enter.
sh-4.4# 

Additional info:

1. in the output of :

(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc describe node hyper-1-kd7sm 

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sun, 23 Apr 2023 17:27:02 +0300   Sun, 02 Apr 2023 19:45:20 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     True    Sun, 23 Apr 2023 17:27:02 +0300   Sat, 15 Apr 2023 00:10:46 +0300   KubeletHasDiskPressure       kubelet has disk pressure
  PIDPressure      False   Sun, 23 Apr 2023 17:27:02 +0300   Sun, 02 Apr 2023 19:45:20 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 23 Apr 2023 17:27:02 +0300   Sun, 02 Apr 2023 19:47:53 +0300   KubeletReady                 kubelet is posting ready status

 

2. deploying with a non-default value for --root-volume-size=64 works fine.

3. [root@ocp-edge44 ~]# oc get catalogsource -n openshift-marketplace
NAME                  DISPLAY                                 TYPE   PUBLISHER   AGE
certified-operators   Certified Operators                     grpc   Red Hat     27h
community-operators   Community Operators                     grpc   Red Hat     27h
mce-custom-registry   2.2.4-DOWNANDBACK-2023-04-20-19-04-35   grpc   Red Hat     26h
redhat-marketplace    Red Hat Marketplace                     grpc   Red Hat     27h
redhat-operators      Red Hat Operators                       grpc   Red Hat     27h

 

User Story:

As IBM running HCs, I want to upgrade an existing 4.12 HC affected by https://issues.redhat.com/browse/OCPBUGS-13639 to 4.13 and have the private link endpoint use the right security group.

Acceptance Criteria:

There are automated or documented steps for the HC to end up with the endpoint pointing to the right SG.

A possible semi-automated path would be to manually delete and detach the endpoint from the service, so that the next reconciliation loop resets the status https://github.com/openshift/hypershift/blob/7d24b30c6f79be052404bf23ede7783342f0d0e5/control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go#L410-L444

And the next one would recreate the new endpoint with the right security group https://github.com/openshift/hypershift/blob/7d24b30c6f79be052404bf23ede7783342f0d0e5/control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go#L470-L525

Note that this would cause connectivity downtime while reconciliation happens.

Alternatively we could codify a path to update the endpoint SG when we detect a discrepancy with the hypershift SG.
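
The second option amounts to diffing the security groups currently attached to the endpoint against the desired HyperShift SG. The following is a minimal sketch of that diff only, under stated assumptions: the function name `plan_security_group_update` and the group IDs are hypothetical, and no AWS call is made. The resulting add and remove lists have the shape that an endpoint modification request would need.

```python
def plan_security_group_update(current, desired):
    """Return which security group IDs to add to and remove from an endpoint.

    `current` is the list of SG IDs attached to the endpoint today,
    `desired` is the list of SG IDs it should have after reconciliation.
    """
    current_set, desired_set = set(current), set(desired)
    return {
        "add": sorted(desired_set - current_set),
        "remove": sorted(current_set - desired_set),
    }


def has_discrepancy(plan):
    """True when the endpoint needs to be updated."""
    return bool(plan["add"] or plan["remove"])


if __name__ == "__main__":
    # Hypothetical IDs: the endpoint still points at the wrong (default) SG.
    plan = plan_security_group_update(["sg-default"], ["sg-hypershift"])
    print(plan, has_discrepancy(plan))
```

Updating the endpoint in place this way would avoid deleting and recreating it, which is what causes the downtime in the semi-automated path.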

 

Description of problem:

The Samples tab is not visible when the Sample Deployment is created, whereas the Snippets tab is visible when `snippet: true` is added to the Sample Deployment.
Check attached file for exact details.

Version-Release number of selected component (if applicable):

4.11.x

How reproducible:

Always

Steps to Reproduce:

1. On CLI, create the Sample Deployment
2. On Web console, create a Deployment
3. Deployment will be created with details mentioned in Sample Deployment.
4. Samples tab must be visible in YAML view on web console
5. Screenshots are attached for reference.

Actual results:

When a Sample Deployment is created with `kind: ConsoleYAMLSample` and `snippet: true`, the Snippets tab shows up. When a Sample Deployment is created with the same details but without `snippet: true`, the "Samples" tab does not show up.

Expected results:

When a Sample Deployment is created with the `kind: ConsoleYAMLSample` and NO `snippet:true`, the "Samples" tab must show up.

Additional info:

When a Sample Deployment is created with `kind: ConsoleYAMLSample`, the "Samples" tab shows up in OCP cluster version 4.10.x. However, it doesn't show up in OCP cluster version 4.11.x.

NOTE: The attached file has all the required details.

 

 

 

Description of problem:

OLMv0 over-uses listers and consumes too much memory. Also, $GOMEMLIMIT is not used and the runtime overcommits on RSS. See the following doc for more detail:

https://docs.google.com/document/d/11J7lv1HtEq_c3l6fLTWfsom8v1-7guuG4DziNQDU6cY/edit#heading=h.ttj9tfltxgzt


Currently the 'dump cluster' command requires public access to the guest cluster to dump its contents. It should be possible for it to access the guest cluster via the kube-apiserver service on the mgmt cluster. This would enable it for private clusters as well.

Currently we save to the filesystem every installer binary we have ever needed. When users use many different versions, the pod reaches its storage limit, as each binary is ~500 MB.

We should add a TTL to the installer cache and remove binaries that are no longer used.
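
A TTL-based cleanup could look roughly like the sketch below. It is not the service's implementation: the function name `evict_stale_binaries` is hypothetical, and it assumes the cache refreshes each file's modification time whenever a binary is used, so that the modification time can stand in for "last used".

```python
import os
import time


def evict_stale_binaries(cache_dir, ttl_seconds, now=None):
    """Delete cached files that have not been used within `ttl_seconds`.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        if not os.path.isfile(path):
            continue
        # Assumption: mtime is refreshed on every use of the binary.
        last_used = os.stat(path).st_mtime
        if now - last_used > ttl_seconds:
            os.remove(path)
            removed.append(entry)
    return removed
```

Running a function like this periodically would bound the cache size by the number of versions used within the TTL window rather than by every version ever requested.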

We need to validate that we are able to recover a hosted cluster's etcd (backed by storage such as LVM or HPP) when an underlying management cluster node disappears.

In this scenario, we need to understand what happens when an etcd instance fails and the underlying PVC is permanently gone. Will the etcd operator be able to detect this and recover, or will the etcd cluster in question remain in a degraded state indefinitely? The answers to these questions will help guide the next steps for supporting local storage for etcd.

In the interest of shipping 4.13, we landed a snapshot of nmstate code with some logic for NIC name pinning.

 

In https://github.com/nmstate/nmstate/commit/03c7b03bd4c9b0067d3811dbbf72635201519356 a few changes were made.

 

TODO elaborate in this issue what bugs are fixed

 

This issue is tracking the merge of https://github.com/openshift/machine-config-operator/pull/3685 which was also aiming to ensure 4.14 is compatible.

I recently noticed that the cluster-autoscaler pod in the hosted control plane namespace is restarting continuously. On investigation, I found that the liveness and readiness probes are failing on this pod.

Checking the pod's logs further points to missing RBAC for the cluster-autoscaler pod. See the log trace below for reference.

E0215 14:52:59.936182 1 reflector.go:140] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: agentmachinetemplates.capi-provider.agent-install.openshift.io is forbidden: User "system:serviceaccount:clusters-hcp01:cluster-autoscaler" cannot list resource "agentmachinetemplates" in API group "capi-provider.agent-install.openshift.io" in the namespace "clusters-hcp01"

Description of problem:

Business Automation Operands fail to load in the uninstall operator modal, which shows the alert message "Cannot load Operands. There was an error loading operands for this operator. Operands will need to be deleted manually...".

"Delete all operand instances for this operator__checkbox" is not shown so the test fails. 

https://search.ci.openshift.org/?search=Testing+uninstall+of+Business+Automation+Operator&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


Need to follow up HOSTEDCP-1065 with an e2e to test ControlPlaneRelease functionality:

Test should:

  • Set the ControlPlaneRelease for the HC
  • Wait for HC rollout
  • Ensure no HCP pod container is using the Release payload (except CVO prepare-payload)
  • Ensure no guest cluster pod container is using the ControlPlaneRelease
  • Ensure no node rollouts occur
  • Ensure ClusterVersion in the guest cluster reflects Release version
  • Ensure all ClusterOperators in the guest cluster reflect Release version
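
The two container checks in the list above share one shape: walk every pod container and flag any that runs an image from a given payload, apart from an exemption list. The following is a minimal sketch of that check, not the e2e test itself. The pod dictionaries, image names and the function name `containers_using_payload` are hypothetical; a real test would read pods from the cluster API.

```python
def containers_using_payload(pods, payload_images, exempt_containers=frozenset()):
    """Return (pod name, container name) pairs whose image is in `payload_images`.

    Containers whose name is in `exempt_containers` are skipped, for example
    the CVO prepare-payload container.
    """
    offenders = []
    for pod in pods:
        for container in pod.get("containers", []):
            if container["name"] in exempt_containers:
                continue
            if container["image"] in payload_images:
                offenders.append((pod["name"], container["name"]))
    return offenders


if __name__ == "__main__":
    # Hypothetical data: one allowed use and one unexpected use of the payload.
    pods = [
        {"name": "cluster-version-operator", "containers": [
            {"name": "prepare-payload", "image": "release-image"},
        ]},
        {"name": "kube-apiserver", "containers": [
            {"name": "kube-apiserver", "image": "release-image"},
        ]},
    ]
    print(containers_using_payload(pods, {"release-image"}, {"prepare-payload"}))
```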

`ec2:ReleaseAddress` is documented as a required permission for the NodePool management policy: https://github.com/openshift/hypershift/blob/main/api/v1beta1/hostedcluster_types.go#L1285

 

This is too permissive and the permission will at least need a condition to scope it. However, it may not be used by the NodePool controller at all. In that case, this permission should be removed.

 

Done Criteria:

  • Determine if ec2:ReleaseAddress is required for NodePool management in Hypershift
  • If not required, remove the permission from documentation

DoD:

Either enforce immutability in the API via CEL or add first-class support for mutability, i.e. enable a node rollout when the value changes.

This is a clone of issue OCPBUGS-19052. The following is the description of the original issue:

Description of problem:

With OCPBUGS-18274 we had to update the etcdctl binary. Unfortunately the script does not attempt to update the binary if it's found in the path already:

https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/etcd-common-tools#L16-L24

This causes confusion as the binary might not be the latest that we're shipping with etcd.

Pulling the binary shouldn't be a big deal: etcd is running locally anyway, and the local image should already be cached. We should always replace the binary.
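
The difference between the current and the proposed behaviour can be sketched as follows. This is an illustration in Python rather than the actual shell script; the function name `install_binary` and its `always_replace` flag are hypothetical, and copying a local file stands in for extracting the binary from the etcd image.

```python
import os
import shutil


def install_binary(source_path, bin_dir, always_replace=True):
    """Copy a binary into `bin_dir`; return True if it was (re)installed.

    With always_replace=False this mimics the behaviour described above:
    an existing binary is kept even if it is stale.
    """
    target = os.path.join(bin_dir, os.path.basename(source_path))
    if os.path.exists(target) and not always_replace:
        return False  # "already installed": the stale copy stays in place
    os.makedirs(bin_dir, exist_ok=True)
    shutil.copy2(source_path, target)
    return True
```

With the default `always_replace=True`, a second run after an image update overwrites the old binary instead of reporting that it is already installed.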

Version-Release number of selected component (if applicable):

any currently supported release

How reproducible:

always

Steps to Reproduce:

1. run cluster-backup.sh to download the binary
2. update the etcd image (take a different version or so)
3. run cluster-backup.sh again 

Actual results:

cluster-backup.sh will simply print "etcdctl is already installed"

Expected results:

etcdctl should always be pulled

Additional info:

 

I have a console extension (https://github.com/gnunn1/dev-console-plugin) that simply adds the Topology and Add+ views to the Admin perspective but otherwise should expose no modules. However, if I try to build this extension without an exposedModules entry, the webpack assembly fails with the stack trace below.

As a workaround, I'm leaving in the example module from the template and just not adding it to the OpenShift menu.

$ yarn run build
yarn run v1.22.19
$ yarn clean && NODE_ENV=production yarn ts-node node_modules/.bin/webpack
$ rm -rf dist
$ ts-node -O '{"module":"commonjs"}' node_modules/.bin/webpack
[webpack-cli] HookWebpackError: Called Compilation.updateAsset for not existing filename plugin-entry.js
    at makeWebpackError (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/HookWebpackError.js:48:9)
    at /home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:3058:12
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:41:1)
    at fn (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:479:17)
    at _next0 (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:39:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:52:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:13:1)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
-- inner error --
Error: Called Compilation.updateAsset for not existing filename plugin-entry.js
    at Compilation.updateAsset (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:4298:10)
    at /home/gnunn/Development/openshift/dev-console-plugin/node_modules/src/webpack/ConsoleAssetPlugin.ts:82:23
    at fn (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:477:10)
    at _next0 (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:39:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:52:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:13:1)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
caused by plugins in Compilation.hooks.processAssets
Error: Called Compilation.updateAsset for not existing filename plugin-entry.js
    at Compilation.updateAsset (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:4298:10)
    at /home/gnunn/Development/openshift/dev-console-plugin/node_modules/src/webpack/ConsoleAssetPlugin.ts:82:23
    at fn (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:477:10)
    at _next0 (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:39:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:52:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:13:1)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
error Command failed with exit code 2.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed with exit code 2.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

 

User Story:

We enabled balancing of similar node groups via https://issues.redhat.com/browse/OCPBUGS-15769

We should include a validation for this behaviour in our e2e autoscaler testing.

We can probably reuse what we do in the Machine API test https://github.com/openshift/cluster-api-actuator-pkg/blob/77764237f2e6160d95990dc905b8e87662bc4d16/pkg/autoscaler/autoscaler.go#L437
